
Statistical Solutions For and From

Signal Processing
by

Luke Bornn

B.Sc., University of the Fraser Valley, 2003

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF
THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

The Faculty of Graduate Studies

(Statistics)

The University of British Columbia
(Vancouver)
August, 2008

© Luke Bornn 2008


Abstract
With the wide range of fields engaging in signal processing research, many
methods do not receive adequate dissemination across disciplines due to
differences in jargon, notation, and level of rigor. In this thesis, I attempt to
bridge this gap by applying two statistical techniques originating in signal
processing to fields for which they were not originally intended. Firstly,
I employ particle filters, a tool used for state estimation in the physics
signal processing world, for the task of prior sensitivity analysis and cross-validation in Bayesian statistics. Secondly, I demonstrate the application of
support vector forecasters, a tool used for forecasting in the machine learning
signal processing world, to the field of structural health monitoring.

Table of Contents

Abstract ii

Table of Contents iii

List of Figures v

Acknowledgements vi

Statement of Co-Authorship vii

1 Introduction 1
1.1 Particle Filtering 1
1.2 Support Vector Forecasters 4
1.3 References 5

2 An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation 6
2.1 Introduction and Motivation 6
2.2 Sequential Monte Carlo Algorithms 8
2.2.1 An Efficient SMC Algorithm 10
2.3 Prior Sensitivity 12
2.3.1 Regularization Path Plots 13
2.3.2 Variable Selection Using g-Priors 15
2.4 Cross-Validation 18
2.4.1 Application to Bayesian Penalized Regression 23
2.5 Extensions and Conclusions 25
2.6 References 26

3 Structural Health Monitoring with Autoregressive Support Vector Machines 29
3.1 Introduction 29
3.2 SVM-based SHM 31


3.2.1 Example: Simulated Damage 37


3.3 Joint Online SHM 41
3.3.1 Example: Experimental Data 42
3.4 Conclusion 46
3.5 References 48

4 Conclusion 50

5 Technical Appendix 52

List of Figures

2.1 Regularization Path Plots: The gold standard 14


2.2 Regularization Path Plots: Plots using MCMC and SMC for
fixed computational time of 5 minutes 16
2.3 Exact Marginal and Model probabilities for variable selection
using g-Priors as a function of log(g) 19
2.4 Approximate Marginal and Model probabilities for variable
selection using g-Priors as a function of log(g) 20
2.5 Diagram of cross-validation process 21
2.6 Plots of cross-validation error as a function of log(λ) 24

3.1 Illustration of linear support vector regression fit 34


3.2 Illustration of mapping to an alternate space to induce linearity 35
3.3 Raw simulated data with highlighted artificial damage 38
3.4 Autocorrelation plot of simulated data 38
3.5 SVM (top) and linear AR models (bottom) fit to subset of data 39
3.6 Residuals from SVM (top) and linear AR models (bottom)
applied to simulated data 40
3.7 Diagram of experimental structure 42
3.8 Q-Q Plot of residuals from SVM model 44
3.9 Residuals from 4 sensors for t = 7000, ... , 9384 45
3.10 Density estimate of combined residual (black) vs. chi-squared distribution (red) 45
3.11 Combined residuals from all 4 sensors 46

Acknowledgements
I am indebted to my mentors, Dr. Arnaud Doucet and Dr. Raphael Gottardo, for the wealth of opportunities and immense support they have provided me. I am also grateful to the many people I have had a chance to interact with at both the University of British Columbia and Los Alamos National Labs. In particular I thank Dr. Dave Higdon and Dr. Todd Graves for guiding me down a fruitful path.

Statement of Co-Authorship
Excluding chapters 2 and 3, all material in this thesis was solely authored.
While I identified and carried out the research in chapter 2, Arnaud Doucet and Raphael Gottardo provided much guidance, criticism, and feedback.
The data in chapter 3 was produced by Gyuhae Park and Kevin Farinholt. Chuck Farrar assisted in writing some of the details pertaining to the structural health monitoring field. In addition, his guidance and feedback contributed immensely to the development of the work.

Chapter 1

Introduction
The research area of signal processing is concerned with analyzing signals
including sound, video, and radar. There are many components to this task,
including storage and compression, removing noise, and extracting features
of interest. As an example, we might have a noisy recording of a telephone
conversation for which we want to store the signal, remove the noise, and
identify the speakers. These signals can take many forms, either digital or
analog. We focus on statistical signal processing, which is concerned with studying signals based on their statistical properties. We begin by describing two statistical methods employed for signal processing, the first being particle filtering and the second being support vector forecasters. Because chapter 3 contains a detailed development of support vector forecasters, we forego these details here. Later chapters then extend these methodologies to fields for which they were not intended, specifically prior sensitivity analysis and cross-validation as well as structural health monitoring.

1.1 Particle Filtering


One of the crucial areas of study in signal processing is filtering, which is
concerned with estimating a dynamic system’s true state from a series of
noisy measurements. Specifically, we assume that the system dynamics are
known up to some parameter(s). The underlying state-space model may be
written as
$$x_t \mid x_{t-1} \sim p_{x,t}(x \mid x_{t-1}), \qquad y_t \mid x_t \sim p_{y,t}(y \mid x_t)$$

where $x_t$ and $y_t$ denote the unobserved state and observation at time t, respectively; $p_{x,t}$ and $p_{y,t}$ are the state transition and measurement models, respectively. Also, we assume a prior distribution $p(x_0)$ on $x_0$. In the case
of linearly additive noise, we may write this state-space model as
$$x_t = f(x_{t-1} \mid \theta) + \nu_t$$
$$y_t = h(x_t) + \epsilon_t. \qquad (1.1)$$


Here both the stochastic noise $\nu_t$ and the measurement noise $\epsilon_t$ are mutually independent and identically distributed sequences with known density functions. In addition, $f(x_{t-1} \mid \theta)$ and $h(x_t)$ are known functions up to some parameters $\theta$.
In order to build the framework on which to describe the filtering methodologies, we first frame the above state-space model as a recursive Bayesian estimation problem. Specifically, we are interested in obtaining the posterior distribution

$$p(x_{0:t} \mid y_{1:t}) \qquad (1.2)$$

where $x_{0:t} = \{x_0, x_1, \ldots, x_t\}$ and $y_{1:t} = \{y_1, y_2, \ldots, y_t\}$. Often we don't require the entire posterior distribution, but merely one of its marginals. For instance, we are often interested in the estimate of the state given all observations up to that point; we call this distribution the filtering density and denote it as

$$p(x_t \mid y_{1:t}). \qquad (1.3)$$
By knowing this density, we are able to make estimates about the system’s
state, including measures of uncertainty such as confidence intervals.
If the functions $f$ and $h$ are linear and both $\nu_t$ and $\epsilon_t$ are Gaussian, the Kalman filter is able to obtain the filtering distribution in analytic form. In fact it can be seen that all of the distributions of interest are Gaussian with means and covariances that can be simply calculated. However, when the dynamics are non-linear or the noise non-Gaussian, alternative methods must be used.
In the case of non-linear dynamics with Gaussian noise, the standard methodology is the extended Kalman filter, which may be considered a nonlinear Kalman filter that linearizes around the current mean and covariance. However, as a result of this linearization, the filter may diverge if the initial state estimate is wrong or the process is incorrectly modeled. In addition, the calculation of the Jacobian in the extended Kalman filter can become tedious in high-dimensional problems. One attempted solution to this problem has been the unscented Kalman filter (Wan and van der Merwe, 2001), which approximates the nonlinearity by transforming a random variable instead of through a Taylor expansion, as the extended Kalman filter does. By employing a deterministic sampling technique known as the unscented transform (Julier and Uhlmann, 1997), the UKF selects a minimal set of sample points around the mean which are then propagated through the non-linear functions while recovering the covariance matrix.
When either the stochastic or measurement noise is non-Gaussian, Monte Carlo methods must be employed, in particular particle filters. This Monte Carlo based filtering method relies on a large set of samples, called particles, which are evolved through the system dynamics with potentially non-Gaussian noise using importance sampling and bootstrap techniques. At each time step the empirical distribution of these particles is used to approximate the distribution of interest and its associated features. By sampling from some proposal distribution $q(x_{0:t} \mid y_{1:t})$ in order to approximate (1.2), we may use importance sampling with corresponding unnormalized weights

$$w_t = \frac{p(y_{1:t} \mid x_{0:t})\, p(x_{0:t})}{q(x_{0:t} \mid y_{1:t})}.$$

However, we typically wish to perform this estimation sequentially, and hence we can take advantage of the Markov nature of the state and measurement process along with proposal distributions of the form $q(x_{0:t} \mid y_{1:t}) = q(x_{0:t-1} \mid y_{1:t-1})\, q(x_t \mid x_{0:t-1}, y_{1:t})$. From this we obtain the recursive weight formula

$$w_t = w_{t-1}\, \frac{p(y_t \mid x_t)\, p(x_t \mid x_{t-1})}{q(x_t \mid x_{0:t-1}, y_{1:t})}.$$
This equation allows for the sequential updating of importance weights given an appropriate choice of proposal distribution $q(x_t \mid x_{0:t-1}, y_{1:t})$, as well as simple calculation of the filtering density (1.3). Since we can sample from this proposal distribution and evaluate the likelihood and transition probabilities, the particle filter simply involves generating a prior set of samples, evolving these samples forward with the proposal distribution, and subsequently calculating the importance weights. In addition, to prevent particle degeneracy, we employ a resampling step to remove particles with low weight and multiply those with high weight (Douc et al., 2005).
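To make the recursion concrete, the following is a minimal Python sketch of a bootstrap particle filter for the model (1.1); the scalar state, Gaussian noises, vectorized f and h, and the prior proposal are simplifying assumptions, and all names are illustrative rather than a definitive implementation.

```python
import numpy as np

def bootstrap_particle_filter(y, f, h, sigma_nu, sigma_eps, n_particles=1000, seed=0):
    """Sketch of a bootstrap particle filter for model (1.1): scalar state,
    Gaussian noises, proposal q(x_t | x_{0:t-1}, y_{1:t}) = p(x_t | x_{t-1}),
    so the incremental weight reduces to the likelihood p(y_t | x_t)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, n_particles)          # draws from a prior p(x_0)
    means = []
    for y_t in y:
        x = f(x) + rng.normal(0.0, sigma_nu, n_particles)   # evolve particles
        log_w = -0.5 * ((y_t - h(x)) / sigma_eps) ** 2      # log-likelihood weights
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        means.append(np.sum(w * x))                # estimate of E[x_t | y_{1:t}]
        x = rng.choice(x, size=n_particles, p=w)   # multinomial resampling
    return np.array(means)
```

For example, with f = lambda x: 0.9 * x and h = lambda x: x this tracks a noisy AR(1) signal; a more informative proposal would replace the evolution step and change the weight computation accordingly.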
The choice of proposal distribution has a significant effect on the rate of degeneracy. The standard (and simplest) choice is the prior distribution $q(x_t \mid x_{0:t-1}, y_{1:t}) = p(x_t \mid x_{t-1})$, since the weights then simplify to a calculation of the likelihood. However, if the likelihood is not near the prior, this choice will lead to large variance in the importance weights, and hence we would like to employ a proposal distribution which uses the data to provide a better estimate of the posterior distribution. One such possibility is to use a Gaussian approximation of the posterior as the proposal distribution (van der Merwe et al., 2001).
Often the problem of filtering isn't restricted to the estimation of the state, but is also concerned with estimating some parameters $\theta$ of the dynamic model $f(x \mid \theta)$. Further complicating matters, the only information we have about the state and the model parameters is the noisy measurements $\{y_t\}_{t \geq 1}$. While there are several approaches for solving this problem, we focus on dual estimation, namely the use of parallel filters to estimate the state and model parameters (Wan et al., 2000). Specifically, a state-space representation is used for both the state and parameter estimation problems. While the state-space representation for the state is given in equation (1.1), the representation of the model parameters is given by

$$\theta_t = \theta_{t-1} + \vartheta_t$$
$$y_t = f(x_{t-1} \mid \theta_t) + \nu_t + \epsilon_t$$

Here $\nu_t$ and $\epsilon_t$ are as in (1.1), while $\vartheta_t$ is an additional iid noise term. Thus
we can run two parallel filters for both state and parameters. At each time
step the current state estimate is used in the parameter filter and the current
parameter estimate is used in the state filter. The situation is complicated
in the particle filter setting, due to the well-known problem of degenerate weights (Casarin and Marin, 2008).
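A minimal sketch of this dual scheme is given below, assuming a directly observed scalar state (so $h(x) = x$), a random-walk parameter evolution, and an f that broadcasts over NumPy arrays (e.g. f = lambda x, th: th * x); the interface is illustrative only.

```python
import numpy as np

def dual_filter(y, f, sigma_nu, sigma_eps, sigma_theta, n=1000, seed=1):
    """Sketch of dual estimation: parallel particle filters for the state x_t
    and parameter theta_t, each conditioning on the other's latest estimate."""
    rng = np.random.default_rng(seed)
    x, theta = rng.normal(0.0, 1.0, n), rng.normal(0.0, 1.0, n)
    x_hat = x.mean()
    trace = []
    for y_t in y:
        # Parameter filter: random-walk evolution, weighted given x_hat
        theta = theta + rng.normal(0.0, sigma_theta, n)
        log_v = -0.5 * ((y_t - f(x_hat, theta)) / sigma_eps) ** 2
        v = np.exp(log_v - log_v.max()); v /= v.sum()
        theta_hat = np.sum(v * theta)
        theta = rng.choice(theta, size=n, p=v)
        # State filter: evolve with the current parameter estimate theta_hat
        x = f(x, theta_hat) + rng.normal(0.0, sigma_nu, n)
        log_w = -0.5 * ((y_t - x) / sigma_eps) ** 2
        w = np.exp(log_w - log_w.max()); w /= w.sum()
        x_hat = np.sum(w * x)
        x = rng.choice(x, size=n, p=w)
        trace.append((x_hat, theta_hat))
    return trace
```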
Through this filtering methodology we are able to estimate the state
of a dynamic system from noisy measurements, as well as the associated
uncertainty of these estimates. In addition, the dual framework provides
a mechanism for estimating model parameters along with the state. These
filtering tools approximate a sequence of distributions of increasing dimension. In later chapters, we show how the particle filtering methodology may be adapted to situations involving distributions of equal dimension, and
subsequently build an algorithm for efficiently performing prior sensitivity
analysis and cross-validation.

1.2 Support Vector Forecasters


While particle filtering and other filtering methods rely on knowledge of the underlying process to de-noise the signal, support vector regression forecasters have a slightly different purpose. Specifically, they use a training data set to build a model of the signal, which is then used to predict subsequent pieces of the signal. The previous section contains a thorough description of filtering since the associated manuscript of chapter 2 bypasses this development. However, chapter 3 provides a thorough development of support vector forecasters, and hence we forego this development here, instead simply providing some useful references. The recent work of Steinwart and Christmann (2008) provides thorough details on both theoretical and applied aspects of support vector machines, while Schölkopf and Smola (2001) contains details on kernel-based learning.


1.3 References
Casarin, R. and Marin, J.-M. (2008). "Online data processing: comparison of Bayesian regularized particle filters." arXiv:0806.4242v1.
Douc, R., Cappé, O., Moulines, E. (2005). "Comparison of resampling schemes for particle filtering." Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis. pp. 64-69.
Julier, S. and Uhlmann, J. (1997). "A new extension of the Kalman filter to nonlinear systems." Proceedings of AeroSense: The 11th Symposium on Aerospace/Defence Sensing, Simulation and Controls.
Schölkopf, B. and Smola, A.J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge.
Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer, New York.
van der Merwe, R., Doucet, A., de Freitas, N., Wan, E. (2001). "The unscented particle filter." Advances in Neural Information Processing Systems. 13:584-590.
Wan, E. and van der Merwe, R. (2001). "The unscented Kalman filter." In Kalman Filtering and Neural Networks, Ch. 7, Ed. S. Haykin.
Wan, E., van der Merwe, R., Nelson, A. (2000). "Dual estimation and the unscented transformation." Advances in Neural Information Processing Systems. 12:666-672.

Chapter 2

An Efficient Computational
Approach for Prior
Sensitivity Analysis and
Cross-Validation
2.1 Introduction and Motivation
¹An important step in any Bayesian analysis is to assess the prior distribution's influence on the final inference. In order to check prior sensitivity, the posterior distribution must be studied using a variety of prior distributions. If these posteriors are not available analytically, they are usually approximated using Markov chain Monte Carlo (MCMC). Since obtaining the posterior distribution for one given prior can be very expensive computationally, repeating the process for a large range of prior distributions is often prohibitive. Importance sampling has been implemented as an attempted solution (Besag et al., 1995), but the potential of infinite variance importance weights makes this technique useless if the posterior distribution changes more than a trivial amount as the prior is altered. Additionally, this importance weight degeneracy increases with the dimension of the parameter space.
One such prior sensitivity problem is the creation of regularization path plots, a commonly used tool when performing penalized regression. In these situations there is typically a tuning parameter which controls the amount of shrinkage on the regression coefficients; regularization path plots graphically display this shrinkage as a function of the tuning parameter. For instance, in the LASSO shrinkage and variable selection method of Tibshirani (1996), the LARS algorithm (Efron et al., 2004) may be employed to quickly produce these plots.

¹A version of this chapter has been submitted for publication. Bornn, L., Doucet, A., Gottardo, R. "An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation."


In the Bayesian version (Vidakovic, 1998; Park and Casella, 2008), however, we may want to plot the posterior means of the regression coefficients $\beta \in \mathbb{R}^p$ for a range of the tuning (or penalty) parameter λ. The corresponding posterior distributions are proportional to

$$\exp\left(-\frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta) - \lambda\lVert\beta\rVert_1\right) \qquad (2.1)$$

where the response y is assumed to come from a normal distribution with mean $X\beta$ and variance $\sigma^2$ for a model matrix X. Since approximating (2.1) using MCMC at one level of λ can take upwards of an hour depending on the precision required, producing this plot by repeating MCMC hundreds of times for different λ is impractical.
Another tool requiring repeated posterior approximations is cross-validation, which has two primary statistical purposes. The first is finding the value of a given parameter (for instance, the penalty parameter in penalized regression) which minimizes prediction error. The second is comparing different models' or methodologies' prediction performance. In both situations the data is split into a training set, which is used to fit the model, and a testing set, which is used to gauge the prediction performance of the trained model. A typical example would involve fitting a model on the training set for a range of values of some model parameter, then setting this parameter to the value that results in the lowest prediction error rate on the testing set. For example, we might wish to select the value of λ in (2.1) to minimize prediction error. From a computational standpoint, cross-validation is similar to prior sensitivity in both structure and complexity. Further, the entire process is usually repeated for a variety of different training and testing sets and the results are then combined. Although importance sampling has been applied to cross-validation (for example, Alqallaf and Gustafson, 2001), the problem of infinite variance importance weights remains (Peruggia, 1997).
In this paper, we begin by motivating and developing sequential Monte Carlo (SMC) methods, then subsequently apply them to prior sensitivity analysis and cross-validation. In Section 2 we develop an efficient algorithm for sampling from a sequence of potentially quite similar probability distributions defined on a common space. Section 3 demonstrates the algorithm in a prior sensitivity setting and applies it to the creation of regularization path plots and to the sensitivity of the tuning parameters when performing variable selection using g-Priors. Cross-validation with application to Bayesian penalized regression is developed in Section 4. We close with extensions and concluding remarks in Section 5.


2.2 Sequential Monte Carlo Algorithms


SMC methods are often used in the analysis of dynamic systems where we are interested in approximating a sequence of probability distributions $\pi_t(\theta_t)$, where $t = 1, 2, 3, \ldots, T$. The variable $\theta_t$ can be of evolving or static dimension as t changes; note that t is simply an index variable and need not be real time. Most work in the SMC literature is interested in the evolving dimension case, with applications to state-space models (Doucet et al., 2000) and target tracking (Liu and Chen, 1998) among others. The static case, where each $\pi_t$ lies in a common space, has received less attention (Chopin, 2002; Del Moral et al., 2006). The goal of SMC methods is to sample from the distributions $\{\pi_t\}$ sequentially, i.e. first from $\pi_1$, then $\pi_2$, up to $\pi_T$. In some situations we are concerned with each intermediate distribution, whereas in others only the final distribution $\pi_T$ is of interest (for example, Neal, 2001). For further reading, the edited volume of Doucet et al. (2001) covers a range of developments in SMC theory and application.
The situation where the sequence of distributions lies in a common space arises in several applications. For instance, the number of observations in some experiments can make MCMC prohibitive. In this case $\pi_t$ might be the posterior distribution of a parameter given observations 1 through t. Moving through the data with a sequential strategy in this way may decrease computational complexity. Another application is transitioning from a simple distribution $\pi_1$ to a more complex distribution of interest $\pi_T$. Alternatively we could consider situations analogous to simulated annealing (Kirkpatrick et al., 1983), where $\pi_t(\theta) \propto [\pi(\theta)]^{\phi_t}$ for an increasing sequence $\{\phi_t\}$, $t = 1, 2, 3, \ldots, T$. In all of these examples the bridging distributions $\{\pi_t\}$ are only used to reach the final distribution of interest $\pi_T$.
When we are interested in a certain feature of each $\pi_t$, SMC will typically be computationally cheaper than MCMC even if we can successfully sample from each $\pi_t$ using MCMC. This is because SMC borrows information from adjacent distributions, using the samples from earlier distributions to help in approximating later distributions. Often the difficulty in using SMC is constructing this sequence of distributions; both prior sensitivity and cross-validation are situations where there exists a natural sequence upon which SMC may be applied. From here forward we assume the distributions have a common support.
For all times t, we seek to obtain a collection of N weighted samples (called particles) $\{W_t^{(i)}, \theta_t^{(i)}\}$, $i = 1, \ldots, N$, approximating $\pi_t$, where the weights are positive and normalized to sum to 1. We may estimate expected values with these particles using $E_{\pi_t}(g(\theta)) \approx \sum_{i=1}^N W_t^{(i)} g(\theta_t^{(i)})$. One technique
used in SMC is importance sampling, where particles $\{W_{t-1}^{(i)}, \theta_{t-1}^{(i)}\}$ distributed as $\pi_{t-1}$ may be reused, reweighting them (before normalization) according to

$$W_t^{(i)} \propto W_{t-1}^{(i)} \frac{\pi_t(\theta_{t-1}^{(i)})}{\pi_{t-1}(\theta_{t-1}^{(i)})} \qquad (2.2)$$

in order to obtain an approximation of $\pi_t$. Thus we obtain the current weights by multiplying the previous weights by an incremental weight.
In an attempt to prevent these weights from becoming overly non-uniform, we may move each particle $\theta_{t-1}^{(i)}$ (currently distributed according to $\pi_{t-1}$) with a Markov kernel $K_t(\theta, \theta')$ to a new position $\theta_t^{(i)}$, then subsequently reweight the moved particles to be distributed according to $\pi_t$. Although the kernel $K_t(\theta, \theta') = \pi_t(\theta')$ minimizes the variance of the importance weights, it is typically impossible to sample from; thus it has been proposed to use Markov kernels with invariant distribution $\pi_t$ (Gilks and Berzuini, 2001). A direct application of this strategy suffers from a major flaw, however, as the importance distribution given by

$$\int \pi_1(\theta_1) \prod_{k=2}^{t} K_k(\theta_{k-1}, \theta_k)\, d\theta_{1:t-1}$$

is usually impossible to compute and therefore we cannot calculate the necessary importance weights. Additionally, this assumes we are able to sample from $\pi_1(\theta_1)$, which is not always the case. Alternatives attempt to approximate $\pi_t$ pointwise when possible, but the computation of these algorithms is in $O(N^2)$ (Del Moral et al., 2006).
The central idea of SMC samplers (Del Moral et al., 2006) is to employ an auxiliary backward kernel with density $L_{t-1}(\theta_t, \theta_{t-1})$ to get around this intractable integral. This backward kernel relates to a time-reversed SMC sampler giving the same marginal distribution as the forward SMC sampler induced by $K_t$. The backward kernel is essentially arbitrary, but should be optimized to minimize the variance of the importance weights. Del Moral et al. (2006) prove that the sequence of backward kernels minimizing the variance of the importance weights is, for any t, $L_{t-1}^{opt}(\theta_t, \theta_{t-1}) = \pi_{t-1}(\theta_{t-1}) K_t(\theta_{t-1}, \theta_t)/\pi_t(\theta_t)$. However, it is typically impossible to use this optimal kernel since it relies on intractable marginals. Thus, we should select a backward kernel that approximates this optimal kernel. Del Moral et al. (2006) give two suboptimal backward kernels to approximate $L_{t-1}^{opt}$


which result in incremental weights

$$w_t(\theta_{t-1}, \theta_t) = \frac{\pi_t(\theta_t)}{\int \pi_{t-1}(\theta_{t-1}) K_t(\theta_{t-1}, \theta_t)\, d\theta_{t-1}} \qquad (2.3a)$$
$$w_t(\theta_{t-1}, \theta_t) = \frac{\pi_t(\theta_{t-1})}{\pi_{t-1}(\theta_{t-1})} \qquad (2.3b)$$

These incremental weights are then multiplied by the weights at the previous time and normalized to sum to 1. We note that the suboptimal kernel resulting in (2.3b) is actually an approximation of that resulting in (2.3a), and coincidentally has the same form as (2.2), the reweighting mechanism for importance sampling. In this manner the first kernel should perform better, particularly when successive distributions are considerably different (Del Moral et al., 2006). Although the weights (2.3a) are a better approximation of the optimal backward kernel weights, the second kernel is convenient since the resulting incremental weights (2.3b) do not depend on the position of the moved particles $\theta_t$, and hence we are able to reweight the particles prior to moving them. We include the incremental weight (2.3a) because, when $K_t$ is a Gibbs kernel moving one component at a time, it simplifies to $\pi_t(\theta_{t-1,-k})/\pi_{t-1}(\theta_{t-1,-k})$, where k is the index of the component being moved by the Gibbs sampler and $\theta_{t-1,-k}$ is the particle excluding the kth component. By a simple Rao-Blackwell argument it can be seen that this choice, by conditioning on the variable being moved, results in reduced variance of the importance weights compared to (2.3b).

2.2.1 An Efficient SMC Algorithm


Now that we have described some components of SMC methodology, we proceed to develop an efficient algorithm for performing prior sensitivity analysis and cross-validation. The basic idea of our algorithm is to first reweight the particles $\{W_{t-1}^{(i)}, \theta_{t-1}^{(i)}\}$, $i = 1, \ldots, N$, such that they are approximately distributed as $\pi_t$. If the variance of the weights is large, we then resample the particles with probabilities proportional to their weights, giving us a set of N equally weighted particles (including some duplicates). After resampling we move the particles with a kernel of invariant distribution $\pi_t$, which creates diversity in the particles. Our algorithm relates closely to resample-move algorithms (Gilks and Berzuini, 2001; Chopin, 2002), although our formulation is more general and allows for the use of a variety of suboptimal backward kernels and corresponding weights.
Moving the particles at each time step is not particularly efficient. For example, if two successive distributions in the sequence are identical, we are wasting our time by moving the particles. If successive distributions are similar but not necessarily identical, to save computational time we can simply copy forward the particles from time t − 1 and reweight them with the importance sampling weights (2.2). Deciding when to move particles may be done dynamically or deterministically. A dynamic scheme would move the particles whenever the variance of the weights becomes too large (usually measured by the effective sample size (ESS), $(\sum_{i=1}^N (W_t^{(i)})^2)^{-1}$), whereas a deterministic scheme would move the particles every kth time step for some integer k. Since the sequence of distributions will likely not change at a constant rate, it is better to use a dynamic scheme, as this allows for little particle movement during parts of the sequence with little change and more movement in parts where successive distributions vary more.
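For reference, the ESS criterion is a one-liner; this sketch assumes the weights are already normalized:

```python
import numpy as np

def effective_sample_size(w):
    """ESS = (sum_i W_i^2)^{-1} for normalized weights W_i: equals N for
    uniform weights and approaches 1 as a single particle dominates."""
    w = np.asarray(w, dtype=float)
    return 1.0 / np.sum(w ** 2)
```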
When the ESS drops below a specified threshold, we reweight the particles at time t − 1 to be approximately distributed as $\pi_t$ prior to moving them. The weights (2.3b) only depend on the particles at time t − 1, so we can easily do this. In the case of a one-at-a-time Gibbs sampler, we can also use the weights (2.3a). Because the unweighted particles at time t are not distributed according to $\pi_t$, we cannot simply move the particles without first taking their weights into consideration. Thus prior to moving the particles we must resample them such that $W_t^{(i)} = 1/N$ for $i = 1, \ldots, N$ and the particles' unweighted distribution is $\pi_t$. Resampling methods duplicate particles with large weights and remove particles with low weights. Specifically, we copy the particle $\theta_t^{(i)}$ $N_t^{(i)}$ times such that $\sum_{i=1}^N N_t^{(i)} = N$ and $E(N_t^{(i)}) = N W_t^{(i)}$, where $W_t^{(i)}$ are the normalized importance weights. Lastly, all of the resampled particles are assigned equal weight. The simplest unbiased resampling method consists of sampling the $N_t^{(i)}$ from a multinomial distribution with parameters $(N, \{W_t^{(i)}\})$. It should be noted that more sophisticated resampling schemes exist, such as residual resampling (Liu, 2001) and stratified resampling (Kitagawa, 1996), resulting in reduced variance of $N_t^{(i)}$ relative to multinomial resampling. After the particles are resampled, we can move them with the kernel $K_t$.
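As an illustration of one such lower-variance scheme, here is a sketch of stratified resampling; the interface is an assumption, and the weights are taken to be normalized.

```python
import numpy as np

def stratified_resample(w, rng):
    """Stratified resampling (Kitagawa, 1996): one uniform draw per stratum
    [i/N, (i+1)/N), inverted through the CDF of the normalized weights.
    Returns the indices of the particles to duplicate."""
    n = len(w)
    u = (np.arange(n) + rng.uniform(size=n)) / n
    cdf = np.cumsum(w)
    cdf[-1] = 1.0                     # guard against floating-point round-off
    return np.searchsorted(cdf, u)
```

The returned indices determine how many times each particle is copied, after which all weights are reset to 1/N.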
An efficient SMC algorithm which may be used to perform prior sensitivity analysis and cross-validation is therefore:


for t = 1 do
    Obtain N weighted samples $\{W_1^{(i)}, \theta_1^{(i)}\}$ from $\pi_1$ (directly, MCMC, etc.)
end
for t = 2, ..., T do
    Copy $\theta_{t-1}^{(i)}$ to $\theta_t^{(i)}$ and calculate weights $W_t^{(i)}$ according to (2.2)
    if ESS$(\{W_t^{(i)}\}) > c$ then
        Keep $(\theta_t^{(i)}, W_t^{(i)})$ as the particle approximation of $\pi_t$
    else
        Reweight: calculate weights according to $W_t^{(i)} \propto W_{t-1}^{(i)} \times w_t(\theta_{t-1}, \theta_t)$, where $w_t(\theta_{t-1}, \theta_t)$ is given by either (2.3a) or (2.3b)
        Resample: resample particles according to the above weights; set all weights to 1/N
        Move: move particles with a Markov kernel of invariant distribution $\pi_t$
    end
end

note 1: If a backward kernel is chosen such that the incremental weights depend on the position of the moved particle $\theta_t$, the reweight step comes after the move step and resampling is performed with the weights $W_{t-1}^{(i)}$.
note 2: c is a user-specified threshold on the effective sample size.
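In Python, the loop can be sketched as follows, assuming the weights (2.3b), multinomial resampling, NumPy-array particles, and user-supplied functions log_pi (the log-density of $\pi_t$, vectorized over particles) and move (an MCMC kernel of invariant distribution $\pi_t$); all names are illustrative rather than a definitive implementation.

```python
import numpy as np

def smc_sampler(particles, log_pi, move, T, c, seed=0):
    """Sketch of the SMC algorithm above: reweight via (2.2)/(2.3b) between
    successive targets, and resample-move only when the ESS drops below c."""
    rng = np.random.default_rng(seed)
    n = len(particles)
    log_w = np.full(n, -np.log(n))               # particles assumed drawn from pi_1
    results = []
    for t in range(2, T + 1):
        # Reweight: incremental weight pi_t / pi_{t-1} at the old positions
        log_w += log_pi(t, particles) - log_pi(t - 1, particles)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        if 1.0 / np.sum(w ** 2) < c:
            idx = rng.choice(n, size=n, p=w)     # resample
            particles = particles[idx]
            particles = move(t, particles, rng)  # move with a pi_t-invariant kernel
            log_w = np.full(n, -np.log(n))       # reset to uniform weights
            w = np.full(n, 1.0 / n)
        results.append((particles.copy(), w))
    return results
```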

2.3 Prior Sensitivity


In the case of prior sensitivity we are interested in approximating the posterior distribution of some variable(s) θ given the data D, symbolically notated as $\pi(\theta \mid D) \propto f(D \mid \theta) \cdot \nu(\theta)$, where $f(D \mid \theta)$ and $\nu(\theta)$ are the likelihood and the prior distribution of θ, respectively. Here the notation $\nu(\theta)$ is used to differentiate the prior from the posterior distribution $\pi_t(\theta)$, allowing for the omission of dependencies. This prior sensitivity framework has been studied in a closed-form setting (Gustafson and Wasserman, 1995; Gustafson, 1996), but situations requiring Monte Carlo methods have received less attention. It is worth noting that only the prior distribution changes between successive distributions (the likelihood remains the same). Thus when we reweight particles to approximate the sequence of posterior distributions for
θ, the weights (2.2) depend solely on the prior distribution:

$$W_t^{(i)} \propto W_{t-1}^{(i)} \times \frac{f(D \mid \theta_{t-1}^{(i)})\, \nu_t(\theta_{t-1}^{(i)})}{f(D \mid \theta_{t-1}^{(i)})\, \nu_{t-1}(\theta_{t-1}^{(i)})} \propto W_{t-1}^{(i)} \times \frac{\nu_t(\theta_{t-1}^{(i)})}{\nu_{t-1}(\theta_{t-1}^{(i)})} \qquad (2.4)$$

where $\theta_{t-1}^{(i)}$ is the ith particle sampled at time t − 1 and $\nu_t$ is the tth prior distribution evaluated at that point. If the ESS falls below a given threshold at time t (notated as c in the algorithm pseudocode), we resample and move; otherwise we simply reweight. Conveniently, resampling and moving using (2.3b) and reweighting using (2.2) both result in the same weight mechanism (2.4). In a later example we will also employ the weights (2.3a), which have reduced variance relative to (2.3b).
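For the Bayesian Lasso posterior (2.1) used in the next subsection, for example, the update (2.4) amounts to a prior ratio. The sketch below assumes the double exponential prior with per-coordinate density $(\lambda/2)e^{-\lambda|\beta_j|}$, whose normalizing constant must be retained since λ changes; the names are illustrative.

```python
import numpy as np

def lasso_prior_update(log_w, betas, lam_old, lam_new):
    """Update log-weights via (2.4): the likelihood cancels, leaving the
    ratio of double exponential priors nu_new / nu_old at each particle."""
    p = betas.shape[1]                        # betas: (n_particles, p) array
    l1 = np.abs(betas).sum(axis=1)            # ||beta||_1 for each particle
    log_w = log_w + p * np.log(lam_new / lam_old) - (lam_new - lam_old) * l1
    w = np.exp(log_w - log_w.max())
    return log_w, w / w.sum()
```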

2.3.1 Regularization Path Plots


Consider the regression model with response vector $y = (y_1, \ldots, y_n)^T$ and model matrix $X = (x_1, \ldots, x_p)$, where $x_j = (x_{1j}, \ldots, x_{nj})^T$, $j = 1, \ldots, p$, are the column vectors of predictors (including the unit intercept vector). For clarity of presentation we present the model with a continuous response; however, it is simple to extend to binary responses (Albert and Chib, 1993). We use the prostate data of Stamey et al. (1989), which has eight predictors and a response (logarithm of prostate-specific antigen), with likelihood

$$y \mid X, \beta, \sigma^2 \sim N_n(X\beta, \sigma^2 I_n). \qquad (2.5)$$

Using a double exponential prior distribution with parameter λ on the regression coefficients $\beta = (\beta_1, \ldots, \beta_p)$, the corresponding posterior distribution is proportional to (2.1). We see from the form of this posterior distribution that if λ = 0 the MAP estimate of β will correspond to the least squares solution. However, as λ increases there will be shrinkage on β, which may be displayed using a regularization path plot. Because the shrinkage as λ varies is nonlinear, we set a schedule $\lambda = e^{t/20}$, $t = 1, \ldots, 100$. We create a "gold standard" Bayesian Lasso regularization path plot for this data by running MCMC with a Markov chain of length 50,000 at each level of λ and plotting the posterior mean of the resulting regression coefficients (Figure 2.1). It should be noted that the creation of this plot took over 5 hours.


Figure 2.1: Regularization Path Plots: The gold standard. The plot is of standardized coefficients $\beta_j$ vs. $\lVert\beta\rVert_1/\max(\lVert\beta\rVert_1)$.

Since the idea is to create these plots quickly for exploratory analysis, we will compare our SMC-based method to MCMC, with both constrained to work in 5 minutes (+/− 5 seconds) and both using the same Markov kernel. In order to perform MCMC in our time frame of 5 minutes, the Markov chain had a length of 1200 for each of the 100 levels of λ. The mean of each resulting posterior distribution was used to plot the regularization path plots in Figure 2.2(a). In comparison, to run in 5 minutes our SMC algorithm used N = 4200 particles and resampled and moved the particles when the ESS dropped below c = 2N/3 = 2800 (Figure 2.2(b)). For the sake of time comparisons, all computations here and later were performed on a Power Mac G5 with dual 2.7 GHz PowerPC processors. We see from these plots that both methods capture the key features of the regularization path plot as shown by the gold standard: every one of the variables has the correct path. The methods vary, however, in the amount of noise. We see much more variability using MCMC compared to SMC. This is due to SMC being able to use many more particles, since it is able to save time by borrowing information from previous distributions. To be specific, the SMC algorithm in this context had to resample and move the particles only 25 times in the entire sequence of 100 distributions. The remainder of the time our algorithm
simply reweighted the particles, which is computationally inexpensive. It is worth noting that, because of this, adding more incremental distributions to the sequence will have little effect on the computational time of the SMC algorithm, unlike MCMC-based strategies, which would approximate each new distribution with a new Markov chain. In addition, we attempted to make these plots using importance sampling, reweighting (and not moving) particles from $\pi_1$ to approximate later distributions. However, the weights became degenerate, with all of the weight eventually focussing on one particle: all but one of the weights had values near zero, and the one particle with positive weight had standardized L1 norm of 0.8. Thus importance sampling was only able to create roughly 1/5 of the entire plot, and hence is clearly not a candidate methodology for creating these plots. We will see later that in many such situations importance sampling fails, even with large numbers of particles.

2.3.2 Variable Selection Using g-Priors


Consider the normal likelihood set-up (2.5). Now, however, with an eye towards model selection, we introduce the binary indicator variable $\gamma \in \{0, 1\}^p$, where $\gamma_j = 1$ means the variable $x_j$ is included in the model. Thus γ can describe all of the $2^p$ possible models. Following the notation of Marin and Robert (2007), we use $q_\gamma = \mathbf{1}^T\gamma$ as a counter for the number of variables in the model. If $X_\gamma$ is the model matrix which excludes all $x_j$'s for which $\gamma_j = 0$, we can employ the following prior distributions for $\beta_\gamma$ and $\sigma^2$ (Zellner, 1986; Marin and Robert, 2007):

$$\pi(\beta_\gamma, \sigma^2 \mid X_\gamma) \propto (g\sigma^2)^{-(q_\gamma+1)/2}\, \sigma^{-2} \exp\left(-\frac{1}{2g\sigma^2}\beta_\gamma^T X_\gamma^T X_\gamma \beta_\gamma\right)$$

From this it is straightforward to show that the posterior density for γ is thus

$$\pi(\gamma \mid y, X) \propto (g+1)^{-(q_\gamma+1)/2}\left[y^T y - \frac{g}{g+1}\, y^T X_\gamma (X_\gamma^T X_\gamma)^{-1} X_\gamma^T y\right]^{-n/2} \qquad (2.6)$$
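As an aside, the log of (2.6) can be evaluated directly for any model indicator; in the sketch below the interface and the least-squares solve are our own illustrative choices.

```python
import numpy as np

def log_post_gamma(gamma, y, X, g):
    """Unnormalized log posterior (2.6) for a model indicator gamma
    (a boolean vector over the p predictor columns of X; the intercept
    is always included). A sketch -- names are illustrative."""
    n = len(y)
    q = int(gamma.sum())                          # q_gamma, number of included variables
    Xg = np.column_stack([np.ones(n), X[:, gamma]])
    # y^T X_g (X_g^T X_g)^{-1} X_g^T y via a least-squares solve
    coef, *_ = np.linalg.lstsq(Xg, y, rcond=None)
    quad = y @ (Xg @ coef)
    return -(q + 1) / 2 * np.log(g + 1) - n / 2 * np.log(y @ y - g / (g + 1) * quad)
```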
We perform model selection on the pollution data set of McDonald and Schwing (1973), in which mortality rate is compared against 15 pollution-related variables in 60 metropolitan areas. The 15 independent variables include, for instance, mean annual precipitation, population per household, and average annual relative humidity. The response variable y is the age-adjusted mortality rate in the given metropolitan area. We seek to perform

Figure 2.2: Regularization Path Plots: Plots using MCMC and SMC for fixed computational time of 5 minutes. (a) MCMC with 1200 samples (5 minutes); (b) SMC with 4200 samples (5 minutes). The plots are of standardized coefficients $\beta_j$ vs. $\lVert\beta\rVert_1/\max(\lVert\beta\rVert_1)$.

variable selection to narrow down the number of independent variables which best predict the response. With 15 variables, calculating the posterior probabilities of the over 30,000 models exactly is possible but time-consuming. We have chosen this size of data set to allow for a benchmark against which we can compare MCMC and SMC.

Our goal is to see how the selected explanatory variables change as we vary the prior distribution parameter g; in other words, we are interested in seeing how robust the variable selection method is to changes in the setting of g. We perform the variable selection for 100 levels of g with schedule $g = e^{t/10}$, $t = 1, \ldots, 100$. We use a Gibbs sampler strategy to compare the SMC-based algorithm to brute-force MCMC, benchmarked against the exact solution obtained from (2.6), in which $\beta_\gamma$ and $\sigma^2$ are integrated out. Specifically, we update γ one component at a time. The incremental weight ratio (2.3b) will be the ratio of the posterior distribution (2.6) evaluated on the complete data at successive levels of g. In addition, we are able to use the weights (2.3a), which correspond to the ratio of the posterior distribution (2.6) evaluated on all of the data, excluding the variable that is being moved by the Gibbs sampler.
In order to see our desired result, we use (2.6) to plot the exact marginal probabilities as well as some sample model probabilities for various levels of g (Figures 2.3(a) and 2.3(b)). This process took slightly over 8 hours, and hence we would like to find a faster method. We constrain both stochastic algorithms to run in 30 minutes (+/− 1 minute). As a result the MCMC algorithm uses a Markov chain of length 10,000 and the SMC algorithm uses 18,000 particles. We plot the resulting posterior marginal probabilities for each algorithm in Figures 2.4(a) and 2.4(b), respectively. First impression shows that the plot created using MCMC has much more variability. However, the smoothness of the SMC result is not evidence of perfect accuracy of the method, but rather only of the smoothness of the reweighting mechanism (2.2). Because of this, if the SMC does poorly during times of particle movement, the subsequent reweighted approximations will also be inaccurate. To ensure this is not the case and verify that SMC is indeed outperforming MCMC, we look at the average absolute error of the marginal probabilities (at 100 levels of g and for 15 variables). We find the average absolute error in the marginal probabilities using MCMC is 0.0292, whereas with SMC it is only 0.0187. Their respective maximum absolute errors were 0.24 and 0.08. In fact 30 runs of the algorithms gave similar results, with SMC consistently outperforming MCMC. From this we see that SMC is indeed providing a better approximation of the true marginal probabilities.


What then may be taken from these marginal probability plots? When performing simple forward selection regression, the variables 1, 2, 6, 9, and 14 are chosen. Slightly different results come from doing backward selection; in particular variables 1 and 14 are replaced by variables 12 and 13. The LASSO solution (using 5-fold cross-validation) is the same as the forward solution with the additional variables 7 and 8. In addition, the LASSO solution applies some shrinkage to the regression coefficients (see Section 2.3.1). Using g-Priors, the variables that clearly stand out (see Figure 2.3(a)) are 1, 2, 6, 9, and 14. Thus the g-Prior solution taken from the plot corresponds to the forward selection model. Also, for a given g, the plot obtained with SMC shows the correct top 4 variables for inclusion, whereas the variability of the MCMC-based plot makes it impossible to do so.

2.4 Cross-Validation

We focus on leave-s-out cross-validation, the case in which the testing set consists of s observations. Continuing in the linear regression framework, let $X_{\setminus S}$ and $y_{\setminus S}$ be the model matrix and response vector excluding the subset S of observations (of size s). We are interested in a collection of T model parameter (typically prior distribution parameter) settings resulting in posterior densities $\pi_t(\theta \mid X_{\setminus S}, y_{\setminus S})$ for $t = 1, \ldots, T$. Once we have approximations of all T posterior densities, we select the model parameter settings which result in the best prediction of $y_S$ using $X_S$. To find the sequence of distributions $\pi_t(\theta \mid X_{\setminus S}, y_{\setminus S})$, $t = 1, \ldots, T$, the same SMC-based algorithm proposed for prior sensitivity is applicable. Specifically, once we have obtained a Monte Carlo approximation of $\pi_1(\theta \mid X_{\setminus S}, y_{\setminus S})$ we can transition to the remainder of the distributions $\pi_t(\theta \mid X_{\setminus S}, y_{\setminus S})$, $t = 2, \ldots, T$, using SMC.
In addition to quickly evaluating the model for a variety of settings on the training set, SMC also provides a tool for switching the training/testing set without fully re-approximating the posterior densities. Specifically, suppose we have a testing set $S_1$, and using SMC we find approximations of $\pi_t(\theta \mid X_{\setminus S_1}, y_{\setminus S_1})$, $t = 1, \ldots, T$, each of which is tested for prediction performance on the subset $S_1$. However, typically we are interested in performing cross-validation for a variety of different splits of the data into training and testing sets. Thus, we will now want a new testing set $S_2$ and approximations of $\pi_t(\theta \mid X_{\setminus S_2}, y_{\setminus S_2})$, $t = 1, \ldots, T$. The obvious way to accomplish this is to start fresh by approximating $\pi_1(\theta \mid X_{\setminus S_2}, y_{\setminus S_2})$ with MCMC and proceeding to approximate the remainder of the distributions using SMC. However,


Figure 2.3: Exact Marginal and Model probabilities for variable selection using g-Priors as a function of log(g). (a) Posterior Marginal Probabilities: Green=X1, Blue=X2, Yellow=X6, Red=X9, Purple=X14. (b) Posterior Model Probabilities: Purple={1,2,4,5,9}, Red={1,2,4,5}, Blue={1,9}, Green={9}, Yellow={6}. Plot (a) highlights several variables (X1, X2, X6, X9, X14) which show high marginal probabilities of inclusion. Plot (b) shows the posterior probabilities of 5 models chosen to highlight the effect of g on model size.

Figure 2.4: Approximate Marginal and Model probabilities for variable selection using g-Priors as a function of log(g). (a) Posterior Marginal Probabilities: MCMC with 10,000 samples (30 minutes). (b) Posterior Marginal Probabilities: SMC with 18,000 samples (30 minutes). Plots (a) and (b) compare MCMC's performance to SMC's.


Figure 2.5: Diagram of cross-validation process. From the full-data posterior $\pi_1(\theta \mid X, y)$ we transition to each case-deletion posterior $\pi_1(\theta \mid X_{\setminus S_j}, y_{\setminus S_j})$, $j = 1, \ldots, max$, and from each of these down the sequence $\pi_2(\theta \mid X_{\setminus S_j}, y_{\setminus S_j}), \ldots, \pi_T(\theta \mid X_{\setminus S_j}, y_{\setminus S_j})$. Each arrow represents transitioning using SMC.

we can be a bit more clever than this, recognizing that $\pi_1(\theta \mid X_{\setminus S_1}, y_{\setminus S_1})$ and $\pi_1(\theta \mid X_{\setminus S_2}, y_{\setminus S_2})$ are related (Alqallaf and Gustafson, 2001; Bhattacharya and Haslett, 2007).
Successive splits of the data into training and testing sets should give similar model settings. Therefore, we first build the model for a given parameter setting on the full data set using SMC, resulting in an approximation of $\pi_1(\theta \mid X, y)$. Then instead of using MCMC to get approximations of $\pi_1(\theta \mid X_{\setminus S}, y_{\setminus S})$ for different $S \in \{S_1, \ldots, S_{max}\}$, we can build a sequence of distributions $(\pi_1(\theta \mid X, y))^{1-\gamma}(\pi_1(\theta \mid X_{\setminus S}, y_{\setminus S}))^{\gamma}$ for an increasing temperature $\gamma = 0, \epsilon, 2\epsilon, \ldots, 1-\epsilon, 1$, which will allow us to transition to the case-deletion posteriors. The process is illustrated in Figure 2.5. The case of $\gamma = 0, 1$ with no movement step corresponds to basic case-deletion importance sampling as developed in Peruggia (1997). Although case-deletion importance sampling has been demonstrated to achieve up to 90% cost savings in some circumstances (Alqallaf and Gustafson, 2001), the problem of degeneracy still makes importance sampling fail in many situations (Peruggia, 1997; Epifani et al., 2005).
Let $\theta = (\beta, \sigma^2)$. The posterior distribution $\pi(\theta)$ of θ is proportional to $q(\theta) = f(y \mid \beta, \sigma^2) \times \pi(\beta) \times \pi(\sigma^2)$. Assume we collect samples from the distribution $\pi(\theta)$. We are interested in reweighting these samples such that they come from the distribution attained by removing the set S. The modified
likelihood and posterior for this case-deletion scenario are, respectively,

$$f_{\setminus S}(y \mid \beta, \sigma^2) = (2\pi\sigma^2)^{-(n-s)/2} \exp\left\{-\frac{1}{2\sigma^2}\left[(y - X\beta)^T(y - X\beta) - (y_S - X_S\beta)^T(y_S - X_S\beta)\right]\right\},$$
$$q_{\setminus S}(\theta) = f_{\setminus S}(y \mid \beta, \sigma^2) \times \pi(\beta) \times \pi(\sigma^2).$$

We assume that the prior distributions for β and σ² are proper and independent. Epifani et al. (2005) show that if the weights $w_{\setminus S}(\theta) \propto q_{\setminus S}(\theta)/q(\theta)$ are used to move to the case-deletion posterior directly, then the rth moment of these weights is finite if and only if all of the following conditions hold:

a) $\lambda_H < 1/r$
b) $n - rs > 1$
c) $RSS^*_{\setminus S}(r) > 0$

where $\lambda_H$ is the largest eigenvalue of the matrix $H_S = X_S(X^TX)^{-1}X_S^T$ and $RSS^*_{\setminus S}(r) = RSS - r\, e_S^T(I - rH_S)^{-1}e_S$, where $e_S = y_S - X_S(X^TX)^{-1}X^Ty$ and RSS denotes the residual sum of squares of the least squares fit of the full data set. This result should not be taken lightly: as Geweke (1989) points out, if the 2nd moment does not exist, the importance sampling estimator will follow neither an $n^{1/2}$ asymptotic nor a central limit theorem. Condition (a) states that if the leverage of the deleted observations is too large, then the importance weights will have infinite variance. Condition (b) relates the sample size n to the allowable test set size s. Condition (c) says that if the influence of the deleted observations is large relative to RSS, then the importance weights will have infinite variance. We show here how using a sequence of artificial intermediate distributions with SMC can help to mitigate this problem.
We introduce a sequence of distributions

$$q_\gamma(\theta) \propto (q(\theta))^{1-\gamma}(q_{\setminus S}(\theta))^{\gamma}$$

where $\gamma = 0, \epsilon, 2\epsilon, \ldots, 1-\epsilon, 1$, to move from $q(\theta) = q_0(\theta)$ to $q_{\setminus S}(\theta) = q_1(\theta)$. At a given step $\gamma = \gamma^*$ in the sequence, the successive importance weights appearing in the SMC algorithm to move to the next step $\gamma^* + \epsilon$ are

$$w_{\setminus S}^{\gamma^*}(\theta) = \frac{(q(\theta))^{1-\gamma^*-\epsilon}(q_{\setminus S}(\theta))^{\gamma^*+\epsilon}}{(q(\theta))^{1-\gamma^*}(q_{\setminus S}(\theta))^{\gamma^*}} = \left(\frac{q_{\setminus S}(\theta)}{q(\theta)}\right)^{\epsilon}.$$

Theorem 1. Provided that $RSS^*_{\setminus S}(1) > 0$ and the prior distributions for β and σ² are proper and independent, a sequence of distributions proportional to $\{(q(\theta))^{1-\gamma}(q_{\setminus S}(\theta))^{\gamma};\ \gamma = 0, \epsilon, 2\epsilon, \ldots, 1-\epsilon, 1\}$ may be constructed to move from $q(\theta)$ to $q_{\setminus S}(\theta)$ such that the importance weights $w_{\setminus S}^{\gamma}(\theta)$ for each successive step have a finite rth moment under $q_\gamma(\theta)$ provided

$$\epsilon \leq \frac{\bar{c} - 1}{r - 1} \qquad (2.7)$$

where $\bar{c} > 1$ is chosen to satisfy

$$\lambda_H < 1/\bar{c} \qquad (2.8a)$$
$$n - \bar{c}s > 2 \qquad (2.8b)$$
$$RSS^*_{\setminus S}(\bar{c}) > 0 \qquad (2.8c)$$

The proof may be found in the appendix. The provision that $RSS^*_{\setminus S}(1) > 0$ is very reasonable, and states that the least squares fit of the full data must not fit the training set perfectly. Note also that we find $\bar{c}$ for each subset S. Thus we may use the largest allowable step size ε in (2.7) for each subset S, maximizing the algorithm's efficiency. While this result is not sufficient to establish that the variance of SMC estimates is finite for a finite number N of particles, it can be used to upper bound the asymptotic variance of SMC estimates under additional mild regularity and mixing conditions on the MCMC kernels; see Chopin (2004) and Jasra and Doucet (2008) for similar ideas.
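A sketch of the resulting tempering step is given below; the step-size rule follows the bound (2.7) as reconstructed above, and the incremental weight uses the identity $w_{\setminus S}^{\gamma^*}(\theta) = (q_{\setminus S}(\theta)/q(\theta))^{\epsilon}$, which does not depend on γ. The names are illustrative.

```python
import numpy as np

def max_step(c_bar, r=2):
    """Largest step size allowed by (2.7), capped at 1: when c_bar >= r a
    single step, i.e. plain importance sampling, already suffices."""
    return min(1.0, (c_bar - 1.0) / (r - 1.0))

def tempering_increment(log_q_full, log_q_del, eps):
    """Incremental log-weights for one step gamma -> gamma + eps between q
    and the case-deletion target: eps * (log q_del - log q_full), evaluated
    at the current particles."""
    return eps * (np.asarray(log_q_del) - np.asarray(log_q_full))
```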

2.4.1 Application to Bayesian Penalized Regression


To demonstrate the strength of SMC applied to cross-validation, we use it to select the parameter λ of the Bayesian Lasso (2.1). For brevity, we reuse the pollution data set (McDonald and Schwing, 1973) of Section 2.3.2, selecting the parameter λ using leave-one-out cross-validation. Firstly, it is worth pointing out that importance sampling will fail in this situation, as $\lambda_H > 1/2$ on 6 of the 60 observations in this data set, and hence the sufficient conditions to ensure finite variance are not satisfied. Using a sequence of intermediate distributions, we find that the largest $\bar{c}$ satisfying (2.8) equals 1.103, or, to ensure a finite second moment, $\epsilon \leq (\bar{c}-1)/(r-1) = 0.103$. Thus, the longest sequence of distributions was of length 10. For most case deletions $\bar{c} > 2$, which for r = 2 is equivalent to importance sampling. Thus SMC does not waste time transitioning to case-deleted posteriors if importance sampling will suffice.

We use a Gibbs sampler to approximate the posterior distribution of $(\beta, \sigma^2)$ for $\lambda = e^{-5}$ on the full data set and then use SMC to move to


Figure 2.6: Plots of cross-validation error as a function of log(λ). (a) Cross-validation error using MCMC with 60 samples (10 minutes). (b) Cross-validation error using SMC with 100 samples (10 minutes).


the case-deletion posterior distributions by creating a sequence of auxiliary distributions as described above. For each different case-deletion we then use SMC to find approximations of the posterior for the schedule $\lambda = e^{t/5}$, $t = -25, \ldots, 25$. Plotting the cross-validation errors as a function of λ using MCMC with a Markov chain of length 10,000 (not shown), we observe that the average squared loss $\sum_k (y_k - X_k\beta)^2$ is a smooth function of λ with a small bump at $\lambda = e^2$ and minimum near $e^{3/2}$. Thus to minimize prediction error (at least in terms of the squared loss) we should set $\lambda = e^{3/2}$. To perform this task in a time-restricted manner we constrained both MCMC and SMC algorithms to work in 10 minutes (+/− 30 seconds). Figures 2.6(a) and 2.6(b) are the resulting plots. The reduced variability of the SMC-based plot allows us to make more accurate conclusions. For instance, it is clear in the plot obtained with SMC (Figure 2.6(b)) that the minimum error lies somewhere around $\lambda = e^{3/2}$, whereas from the MCMC plot (Figure 2.6(a)) it could be anywhere between 1 and $e^2$.

2.5 Extensions and Conclusions


In our presentation of the algorithm, a fixed sequence of distributions $\pi_t(\theta)$, $t = 1, 2, 3, \ldots, T$, is used. However, it is also possible to determine the sequence of distributions automatically such that successive distributions are a fixed distance apart, as measured by ESS. For instance, assume we are interested in $\pi_t(\theta) = \pi(\theta \mid \lambda_t)$ where λ is a scalar parameter and we have a Monte Carlo approximation of $\pi(\theta \mid \lambda_{t-1})$ for an arbitrary t, namely $\{W_{t-1}^{(i)}, \theta_{t-1}^{(i)}\}$, $i = 1, \ldots, N$. We may set $\lambda_t$ to ensure that ESS = c for a constant c by solving

$$\left(\sum_{i=1}^N \left(W_t^{(i)}\right)^2\right)^{-1} = c$$

where $W_t^{(i)}$ is given by (2.2). This may be solved numerically or in closed form, if possible. This technique would be beneficial in situations where little or nothing is known about the sequence of distributions, and hence it would be nice to automatically create the sequence.
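For instance, the numerical solve can be sketched with bisection, assuming the ESS decreases monotonically in λ beyond $\lambda_{t-1}$ and falls below c by some lam_max; the interface is illustrative only.

```python
import numpy as np
from scipy.optimize import brentq

def next_lambda(particles, log_w, log_target, lam_prev, c, lam_max):
    """Choose the next lambda so the ESS after reweighting by (2.2) equals c.
    log_target(lam, particles) returns log pi(theta | lam) at each particle;
    assumes the ESS falls below c somewhere in (lam_prev, lam_max)."""
    def ess_minus_c(lam):
        lw = log_w + log_target(lam, particles) - log_target(lam_prev, particles)
        w = np.exp(lw - lw.max()); w /= w.sum()
        return 1.0 / np.sum(w ** 2) - c
    return brentq(ess_minus_c, lam_prev, lam_max)   # root of ESS(lam) = c
```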
All our examples have considered a sequence of distributions parameterized by a scalar parameter, for which the definition of the sequence of target distributions is very intuitive. If we are interested in dealing with multivariate parameters then the algorithm may be adapted by, for instance, creating a grid (or hyper-grid) of distributions. SMC may be used to work across
each dimension in succession. It is worth noting that the complexity of the algorithm scales exponentially with dimension, although MCMC does as well.
While we have given two choices of incremental weights, (2.3a) and (2.3b), many other choices are available (Del Moral et al., 2006). In situations where the weights depend on the position of the moved particle, as with (2.3a), auxiliary particle techniques may be used (Pitt and Shephard, 1999; Johansen and Whiteley, 2008). Specifically, we reweight the particles with an approximation of the weight of interest (for instance, (2.3a)) which is only dependent on the particles at time t − 1, using $W_{temp}^{(i)} \propto W_{t-1}^{(i)} \times \hat{w}_t^{(i)}$, where $\hat{w}_t^{(i)}$ is the approximation of the incremental weight. After we have resampled and moved the particles we then compensate for this approximation using $W_t^{(i)} \propto \frac{w_t^{(i)}}{\hat{w}_t^{(i)}} \times W_{temp}^{(i)}$.
We have seen that by adapting importance sampling to move particles between successive distributions, SMC drastically limits the problem of importance sampling degeneracy. By using a resample-move type algorithm, we are able to perform prior sensitivity analysis and cross-validation in a computationally feasible manner while avoiding the aforementioned pitfalls of importance sampling. We have shown the SMC algorithm to be considerably more efficient than existing methods based on reiterative MCMC approximations. In this way regularization path plots and other sensitivity analysis problems can be studied in the context of the full posterior distribution instead of a few summary statistics. In addition, SMC provides a tool for naturally performing cross-validation, and in fact guarantees finite case-deletion weights under much less stringent conditions than importance sampling. Finally, through the importance weights, SMC provides a measure of the distance between distributions, and hence gives a way to select a subset of distributions of interest for exploratory or other purposes.

2.6 References
Albert, J.H. and Chib, S. (1993). "Bayesian Analysis of Binary and Polychotomous Response Data." Journal of the American Statistical Association. 88:669-679.

Alqallaf, F. and Gustafson, P. (2001). "On Cross-validation of Bayesian Models." Canadian Journal of Statistics. 29:333-340.

Besag, J., Green, P., Higdon, D., Mengersen, K. (1995). "Bayesian Computation and Stochastic Systems (with discussion)." Statistical Science. 10:3-66.

Bhattacharya, S. and Haslett, J. (2007). "Importance Re-sampling MCMC for Cross-Validation in Inverse Problems." Bayesian Analysis. 2:385-408.

Chopin, N. (2002). "A Sequential Particle Filter Method for Static Models." Biometrika. 89:539-552.

Chopin, N. (2004). "Central Limit Theorem for Sequential Monte Carlo Methods and Its Application to Bayesian Inference." Annals of Statistics. 32:2385-2411.

Del Moral, P., Doucet, A., Jasra, A. (2006). "Sequential Monte Carlo Samplers." Journal of the Royal Statistical Society: Series B. 68:411-436.

Doucet, A., Godsill, S., Andrieu, C. (2000). "On Sequential Monte Carlo Sampling Methods for Bayesian Filtering." Statistics and Computing. 10:197-208.

Doucet, A., de Freitas, N., Gordon, N.J., eds. (2001). Sequential Monte Carlo Methods in Practice. New York: Springer.

Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004). "Least Angle Regression." Annals of Statistics. 32:407-499.

Epifani, I., MacEachern, S., Peruggia, M. (2005). "Case-Deletion Importance Sampling Estimators: Central Limit Theorems and Related Results." Technical Report No. 720, Department of Statistics, Ohio State University.

Geweke, J. (1989). "Bayesian Inference in Econometric Models Using Monte Carlo Integration." Econometrica. 57:1317-1339.

Gilks, W.R. and Berzuini, C. (2001). "Following a Moving Target: Monte Carlo Inference for Dynamic Bayesian Models." Journal of the Royal Statistical Society: Series B. 63:127-146.

Gustafson, P. and Wasserman, L. (1995). "Local Sensitivity Diagnostics for Bayesian Inference." Annals of Statistics. 23:2153-2167.

Gustafson, P. (1996). "Local Sensitivity of Inferences to Prior Marginals." Journal of the American Statistical Association. 91:774-781.

Jasra, A. and Doucet, A. "Stability of Sequential Monte Carlo Samplers via the Foster-Lyapunov Condition." Statistics and Probability Letters. To appear.

Johansen, A. and Whiteley, N. (2008). "A Modern Perspective on Auxiliary Particle Filters." Proceedings of the Workshop on Inference and Estimation in Probabilistic Time Series Models. Isaac Newton Institute, June 2008.

Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P. (1983). "Optimization by Simulated Annealing." Science. 220:671-680.

Kitagawa, G. (1996). "Monte Carlo Filter and Smoother for Non-Gaussian, Non-linear State Space Models." Journal of Computational and Graphical Statistics. 5:1-25.

Liu, J.S. and Chen, R. (1998). "Sequential Monte Carlo Methods for Dynamic Systems." Journal of the American Statistical Association. 93:1032-1044.

Liu, J.S. (2001). Monte Carlo Strategies in Scientific Computing (2nd ed.). New York: Springer.

McDonald, G. and Schwing, R. (1973). "Instabilities of Regression Estimates Relating Air Pollution to Mortality." Technometrics. 15:463-481.

Neal, R. (2001). "Annealed Importance Sampling." Statistics and Computing. 11:125-139.

Park, T. and Casella, G. (2008). "The Bayesian Lasso." Journal of the American Statistical Association. 103:681-686.

Peruggia, M. (1997). "On the Variability of Case-Deletion Importance Sampling Weights in the Bayesian Linear Model." Journal of the American Statistical Association. 92:199-207.

Pitt, M.K. and Shephard, N. (1999). "Filtering via Simulation: Auxiliary Particle Filters." Journal of the American Statistical Association. 94:590-599.

Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society: Series B. 58:267-288.

Vidakovic, B. (1998). "Wavelet-Based Nonparametric Bayes Methods." In Practical Nonparametric and Semiparametric Bayesian Statistics, D. Dey, P. Müller, and D. Sinha (eds.). New York: Springer. 133-155.

Zellner, A. (1986). "On Assessing Prior Distributions and Bayesian Regression Analysis with g-Prior Distributions." In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. 233-243.

Chapter 3

Structural Health Monitoring with Autoregressive Support Vector Machines
3.1 Introduction
The extensive literature on structural health monitoring (SHM) has documented the critical importance of detecting damage in aerospace, civil, and
mechanical engineering systems at the earliest possible time. For instance,
airlines may be interested in maximizing the lifespan and reliability of their
jet engines or governmental authorities might like to monitor the condition
of bridges and other civil infrastructure in an effort to develop cost-effective
lifecycle maintenance strategies. These examples indicate that the ability to
efficiently and accurately monitor all types of structural systems is crucial
for both economic and life-safety issues. One such monitoring technique
is vibration-based damage detection, which is based on the principal that
damage in a structure, such as a loosened connection or crack, will alter the
dynamic response of that structure. There has been much recent work in this
area; in particular, Doebling et al. (1998) and Sohn et al. (2004) present de
tailed reviews of vibration-based SHM. Because of random and systematic
variability in experimentally measured dynamic response data, statistical
approaches are necessary to ensure that changes in a structures measured
dynamic response are a result of damage and not caused by operational and
environmental variability. Although much of the vibration-based SHM liter
ature focuses on deterministic methods for identifying damage from changes
in dynamic system response, we will focus on approaches that follow a sta
A version of this chapter has been accepted for publication. Bornn, L., Farrar, C.R.,
2
Park, G., Farinholt, K. (2008). “Structural Health Monitoring with Autoregressive Sup
port Vector Machines.” Journal of Vibration and Acoustics.

29
Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines

tistical pattern recognition paradigm for SHM (Farrar and Worden, 2008).
This paradigm consists of the four steps of 1. Operational evaluation, 2.
Data acquisition, 3. Feature extraction, and 4. Statistical classification of
features. The work presented herein focus on steps 3 and 4 of this paradigm.
One approach for performing SHM is to fit a time series predictive model
such as an autoregressive (AR) model to each sensor output using data
known to be acquired from the structure in its undamaged state. These
models are then used to predict subsequent measured data, and the residuals
(the difference between the model’s prediction and the observed value) are
the damage-sensitive feature that is used to check for anomalies. This pro
cess provides many estimates (one at each time step) of a single-dimension
feature, which is advantageous for subsequent statistical classification. The
logic behind this approach is that if the model fit to the undamaged sen
sor data no longer predicts the data subsequently obtained from the system
(and hence the residuals are large and/or correlated), there has been some
sort of change in the process underlying the generation of the data. This
change is assumed to be caused by damage to the system. Such linear time series models have been used in this damage detection process across a wide range of structures and associated damage scenarios, including cracking in concrete columns (Fugate et al., 2001; Sohn et al., 2000), loose connections in a bolted metallic frame structure (Allen et al., 2002), and damage to insulation on wiring (Clark, 2008). However,
the linear nature of this modeling approach limits the scope of application
and the ability to accurately assess the condition of systems that exhibit
nonlinearity in their undamaged state. In this paper, we demonstrate how
support vector machines (SVM) may be used to create a non-linear time
series model that provides an alternative to these linear AR models.
Once a model has been chosen and the predictions from this model have
been compared to actual sensor data, there are several statistical methods
for analyzing the resulting residuals. Sequential hypothesis tests, such as
the sequential probability ratio test (Allen et al., 2002), may be used to
test for changes in the residuals. Alternatively, statistical process control
procedures, typically in the form of control charts, may be used to indicate
abnormalities in the residuals (Fugate et al., 2001). In addition, sliding
window approaches look at the features of successive subsets of data to detect
anomalies (e.g. Clark, 2008). For example, the sliding window approach of
Ma and Perkins (2003) looks at thresholds for the residuals such that the
probability of an undamaged residual exceeding this threshold is 5%. A
subset of n consecutive data points are then checked, and large values of the
number g of points exceeding the threshold indicate damage, where g has a

30
Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines

binomial distribution (i.e. g Bin(n, .05)).
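The following is a minimal sketch of such a windowed indicator; the 95% threshold and the binomial critical count are illustrative assumptions rather than the exact settings of Ma and Perkins (2003).

```python
import numpy as np
from scipy.stats import binom

def sliding_window_indicator(resid_train, resid_test, n=6, alpha=0.01):
    """Flag windows whose exceedance count is improbable under Bin(n, 0.05)."""
    thresh = np.quantile(np.abs(resid_train), 0.95)   # 5% exceedance if undamaged
    exceed = np.abs(resid_test) > thresh
    g_crit = binom.ppf(1 - alpha, n, 0.05)            # critical count per window
    flags = np.zeros(len(resid_test), dtype=bool)
    for t in range(n - 1, len(resid_test)):
        flags[t] = exceed[t - n + 1 : t + 1].sum() > g_crit
    return flags
```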


To date, most of these time series modeling approaches analyze data from
one sensor at a time, and typically some sort of scheme is used to determine
how many sensors need to indicate damage in order to trigger a system
check (e.g. Herzog et al., 2005). As an alternative, in this paper we look at
a statistically based method for combining multiple sensor output. From this
combined output analysis, we can establish the existence of damage and also
determine which sensors are contributing to the anomalous readings in an
effort to locate the damage within the sensor network’s spatial distribution.
Previously Sohn et al. (2000) have used principal component analysis to
combine data from an array of sensors, but this study only examined these
combined data in an effort to establish the existence of damage.
We first present a summary of the SVM approach to nonlinear time series modeling. This procedure is illustrated on numerically generated data with artificial anomalies added to the baseline signal in an effort to simulate damage. This time series modeling approach is then compared to linear AR models. Next the SVM method is coupled with a statistical analysis procedure that combines modeling results from multiple sensors in an effort to establish both the existence and the location of the damage. This procedure is applied to data from a laboratory test structure with damage that results in local nonlinear system response.

3.2 SVM-based SHM


Existing methods for performing damage detection extract damage-sensitive
features from data acquired on the undamaged system, and then use changes
in those features as an indicator of damage. An AR model can be fit to the
undamaged sensor output and the residuals from predictions of subsequent
data using this baseline model are then monitored for statistically significant
changes that are assumed to be caused by damage. Specifically, an AR model
with $p$ autoregressive terms, AR($p$), applied to sensor $k$ may be written as

$$x_t^k = \sum_{j=1}^{p} \phi_j^k x_{t-j}^k + e_t^k \qquad (3.1)$$

where $x_t^k$ is the measured signal at discrete time $t$ from the $k$th sensor, the $\phi_j^k$ are the AR coefficients or model parameters, and $e_t^k$ is an unobservable noise term. Thus an AR model works by fitting a simple linear model to each point, with the previous $p$ observed points as independent variables. Note that an $n$-point time series will yield $n - p$ equations that can be used to generate a least squares estimate of the AR coefficients; alternatively, the Yule-Walker method can be used to solve for the coefficients (Brockwell and Davis, 1991). A minimal sketch of such a fit follows this paragraph. Autoregressive models work particularly well when modeling the response of linear, time-invariant systems. If the undamaged system is nonlinear, the AR process gives the best linear fit to the measured response, but there is no guarantee that this model will accurately predict responses obtained when the system is subjected to other inputs.
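As a concrete illustration of the paragraph above, a Yule-Walker fit and its one-step-ahead residuals might look as follows; this is a sketch of the standard estimator, not the exact implementation used later in the chapter.

```python
import numpy as np
from scipy.linalg import toeplitz

def fit_ar_yule_walker(x, p):
    """Estimate AR(p) coefficients by solving the Yule-Walker equations."""
    xc = np.asarray(x, dtype=float) - np.mean(x)
    n = len(xc)
    acov = np.array([xc[: n - h] @ xc[h:] / n for h in range(p + 1)])
    return np.linalg.solve(toeplitz(acov[:p]), acov[1 : p + 1])

def ar_residuals(x, phi):
    """One-step-ahead prediction residuals of the fitted AR model."""
    p = len(phi)
    xc = np.asarray(x, dtype=float) - np.mean(x)      # center as in the fit
    preds = np.array([phi @ xc[t - p : t][::-1] for t in range(p, len(xc))])
    return xc[p:] - preds
```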
Because of the broad array of structural health monitoring problems, employing a linear model confines the scope of problems for which the AR methodology is appropriate. We thus seek to extend the fidelity of this general damage detection approach by employing a nonlinear AR-type model based upon SVMs, which have seen widespread use in the machine learning and statistical classification fields. To simplify future development, we denote the vector $(x_{t-p}^k, \ldots, x_{t-1}^k)$ by $x_{t-p:t-1}^k$. SVMs have many features that make them an appropriate choice for SHM based on time series analysis: with the right settings and appropriate training they are able to model any nonlinear relationship between the current time point $x_t^k$ and the $p$ previous time points $x_{t-p:t-1}^k$; they are well suited for high-dimensional problems; and the methodology is easily generalized and highly adaptable. Although SVMs have been used for SHM before (e.g., Worden and Manson, 2007; Shimada et al., 2006; Bulut et al., 2005; Worden and Lane, 2001; Chattopadhyay et al., 2007), these approaches predominantly focus on one- and two-class SVMs, which are used for outlier detection and group classification, respectively. Our approach is unique in its combination of support vector regression, autoregressive techniques, and residual error analysis. Thus while earlier approaches classify sections of the time series response as damaged or undamaged directly (the dependent variable being a binary indicator), our methodology works by using support vector regression to model the raw time series data, then subsequently predicting damage by monitoring the residuals of the model. We follow the development of SVMs for regression of Smola and Schölkopf (2004) and Ma and Perkins (2003). First, assume we have data from a set of $K$ sensors and that we have measurements without damage for time $t = 1, \ldots, t_0$ (i.e., if there is damage, it occurs after time $t_0$).

after time to). Next we must decide the order p of our model. There are
many methods for selecting p, such as partial autocorrelation or the Akaike
Information Criterion (AIC), which are discussed in more detail in Fugate
et al. (2001). In general, we seek the lowest order model that captures the
underlying physical process and hence will generalize to other data sets. As
with linear AR modeling, we create the training set on which to build our


SVM-based model by using each observation as the dependent variable and the previous $p$ observations as independent variables. Our training samples are thus $\{(x_{t-p:t-1}^k, x_t^k),\ t = p+1, \ldots, t_0\}$.

Ideally we would like to find a function $f$ such that $f(x_{t-p:t-1}^k) = x_t^k$ for all $k$ and $t \le t_0$. However, the form of $f$ is often restricted to the class of linear functions (as is the case for AR models),

$$f(x_{t-p:t-1}) = \langle w, x_{t-p:t-1} \rangle \qquad (3.2)$$

where $\langle \cdot, \cdot \rangle$ denotes the dot (or inner) product and $w$ is a vector of model parameters. This restricted form makes a perfect fit of the data impossible in most scenarios. As a result, we allow prediction using $f$ to have an error bounded by $\epsilon$, and find $w$ under this constraint. With the recent advances in penalized regression methods such as ridge regression and the lasso, the improved prediction performance of shrunken (or smoothed) models is now well understood (Copas, 1997; Fu, 1998). Thus in order to provide a model that maximizes prediction performance, we seek to incorporate shrinkage on the model parameters $w$. Such shrunken $w$ may be found by minimizing the Euclidean norm subject to the error constraint, namely

$$\begin{aligned} \text{minimize } \; & \tfrac{1}{2}\|w\|^2 \\ \text{subject to } \; & \left| x_t^k - \langle w, x_{t-p:t-1}^k \rangle \right| \le \epsilon \end{aligned} \qquad (3.3)$$

This model relies on the assumption that a linear model is able to fit the data to within precision $\epsilon$. However, typically such a linear model does not exist, even for moderate settings of $\epsilon$. As such, we introduce the slack variables $\xi_t, \xi_t^*$ to allow for deviations beyond $\epsilon$. The resulting formulation is

$$\begin{aligned} \text{minimize } \; & \tfrac{1}{2}\|w\|^2 + C \sum_{t=p+1}^{t_0} (\xi_t + \xi_t^*) \\ \text{subject to } \; & x_t^k - \langle w, x_{t-p:t-1}^k \rangle \le \epsilon + \xi_t \\ & \langle w, x_{t-p:t-1}^k \rangle - x_t^k \le \epsilon + \xi_t^* \\ & \xi_t, \xi_t^* \ge 0 \end{aligned} \qquad (3.4)$$

The constant $C$ controls the tradeoff between keeping $w$ small and penalizing deviations larger than $\epsilon$. In this form we see that only points that lie outside of the bound $\epsilon$ have an effect on $w$. Figure 3.1 illustrates the process graphically.


Figure 3.1: Illustration of linear support vector regression fit
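For illustration, the linear $\epsilon$-insensitive fit of (3.4) could be carried out on lagged data roughly as below; scikit-learn's `LinearSVR` solves a closely related form of the same objective, and the sinusoidal series is a stand-in for real sensor output.

```python
import numpy as np
from sklearn.svm import LinearSVR

def lagged_design(x, p):
    """Arrange a series as (previous p points -> current point) pairs."""
    X = np.column_stack([x[j : len(x) - p + j] for j in range(p)])
    return X, x[p:]

x_train = np.sin(np.arange(500) / 10.0) + 0.1 * np.random.randn(500)
X, y = lagged_design(x_train, p=5)
model = LinearSVR(C=1.0, epsilon=0.1).fit(X, y)   # the C and epsilon of (3.4)
resid = y - model.predict(X)                      # damage-sensitive feature
```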

Although this optimization problem is straightforward to carry out, the extension to nonlinearity is revealed by the dual formulation. We thus proceed by constructing a Lagrange function of the above by introducing a set of dual variables:

$$\begin{aligned} L := \; & \tfrac{1}{2}\|w\|^2 + C \sum_{t=p+1}^{t_0} (\xi_t + \xi_t^*) - \sum_{t=p+1}^{t_0} \alpha_t \left( \epsilon + \xi_t - x_t + \langle w, x_{t-p:t-1} \rangle \right) \\ & - \sum_{t=p+1}^{t_0} \alpha_t^* \left( \epsilon + \xi_t^* + x_t - \langle w, x_{t-p:t-1} \rangle \right) - \sum_{t=p+1}^{t_0} (\eta_t \xi_t + \eta_t^* \xi_t^*) \end{aligned} \qquad (3.5)$$

where the dual variables $\alpha_t, \alpha_t^*, \eta_t, \eta_t^*$ are understood to be non-negative. It can be shown that this function has a saddle point at the optimal solution, and hence

$$\begin{aligned} \partial_w L &= w - \sum_{t=p+1}^{t_0} (\alpha_t - \alpha_t^*)\, x_{t-p:t-1} = 0 \\ \partial_{\xi_t} L &= C - \alpha_t - \eta_t = 0, \qquad \partial_{\xi_t^*} L = C - \alpha_t^* - \eta_t^* = 0 \end{aligned} \qquad (3.6)$$

Figure 3.2: Illustration of mapping to an alternate space to induce linearity

Plugging these saddle point constraints into $L$ yields the following dual optimization problem:

$$\begin{aligned} \text{maximize } \; & -\tfrac{1}{2} \sum_{t,t'=p+1}^{t_0} (\alpha_t - \alpha_t^*)(\alpha_{t'} - \alpha_{t'}^*) \langle x_{t-p:t-1}, x_{t'-p:t'-1} \rangle \\ & - \epsilon \sum_{t=p+1}^{t_0} (\alpha_t + \alpha_t^*) + \sum_{t=p+1}^{t_0} x_t (\alpha_t - \alpha_t^*) \\ \text{subject to } \; & \sum_{t=p+1}^{t_0} (\alpha_t - \alpha_t^*) = 0 \quad \text{and} \quad \alpha_t, \alpha_t^* \in [0, C] \end{aligned} \qquad (3.7)$$

Notice that by the saddle point constraint $w = \sum_{t=p+1}^{t_0} (\alpha_t - \alpha_t^*)\, x_{t-p:t-1}$ we may write $f$ as

$$f(x_{t-p:t-1}) = \sum_{t'=p+1}^{t_0} (\alpha_{t'} - \alpha_{t'}^*) \langle x_{t'-p:t'-1}, x_{t-p:t-1} \rangle \qquad (3.8)$$

In this way $w$ may be viewed as a linear combination of the training points $x_{t-p:t-1}$. Note also that in this formulation both $f$ and the corresponding optimization can be described in terms of dot products between the data. We can therefore transform the data using a function $\Phi : \mathbb{R}^p \to \mathcal{F}$ and compute the dot products in the transformed space. Such mappings allow us to extend beyond the linear framework presented above. Specifically, the mapping allows us to fit linear functions in $\mathcal{F}$ which, when converted back to $\mathbb{R}^p$, are nonlinear. A toy example of this process is illustrated in Figure 3.2 for a mapping $\Phi : \mathbb{R}^2 \to \mathbb{R}^3$; there the data are generated using the relationship $y = x^2$, which becomes linear in the transformed space. To make use of this transformed space, we replace the dot product term with

$$\langle \Phi(x_{t'-p:t'-1}), \Phi(x_{t-p:t-1}) \rangle \qquad (3.9)$$

If $\mathcal{F}$ is of high dimension, then the above dot product will be extremely expensive to compute. In some cases, however, there is a corresponding kernel that is simple to compute. For example, the kernel $k(x, y) = (x \cdot y)^d$ corresponds to a map $\Phi$ into the space spanned by all products of exactly $d$ dimensions of $\mathbb{R}^p$. When $d = p = 2$, for instance, we have

$$(x \cdot y)^2 = \left( (x_1, x_2) \cdot (y_1, y_2) \right)^2 = \left\langle (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2),\ (y_1^2, \sqrt{2}\, y_1 y_2, y_2^2) \right\rangle = \langle \Phi(x), \Phi(y) \rangle \qquad (3.10)$$

defining $\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$. More generally, it has been shown that every kernel giving rise to a positive semi-definite matrix $(k(x_i, x_j))_{ij}$ has a corresponding map $\Phi(x)$ (Smola and Schölkopf, 2004). One such family of kernels we focus on is the radial basis function (RBF) kernels, which have the form

$$k(x, y) = \exp\left( -\|x - y\|^2 / (2\sigma^2) \right) \qquad (3.11)$$

where $\sigma^2$ is the kernel variance. This parameter controls fit, with large values leading to smoother functions and small values leading to better fit. In practice moderate values are preferred as a trade-off between model fit and prediction performance.
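The two kernels just described are easy to verify numerically; the short check below confirms the identity (3.10) and evaluates the RBF kernel of (3.11). The test vectors are arbitrary.

```python
import numpy as np

def rbf_kernel(x, y, sigma2=1.0):
    """Radial basis function kernel of (3.11)."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma2))

# Check (3.10): the degree-2 polynomial kernel equals a dot product in the
# mapped space Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
phi = lambda v: np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])
x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose((x @ y) ** 2, phi(x) @ phi(y))
print(rbf_kernel(x, y))
```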
Whereas a traditional AR($p$) model employs a linear model that is a function of the previous $p$ time points, the SVM model compares the previous $p$ time points to all groups of $p$ successive data points from the training sample. Specifically, the model has the form

$$f(x_{t-p:t-1}^k) = \sum_{j=p+1}^{t_0} \beta_j^k\, k\!\left( x_{j-p:j-1}^k,\ x_{t-p:t-1}^k \right) \qquad (3.12)$$

Typically only a small fraction of the coefficients $\beta_j^k$ are non-zero. The corresponding samples $x_{j-p:j-1}^k$ are called support vectors of the regression function because only these select samples are used in the formulation of the model. Once we have trained our model above, we use it to predict each future observation. We then take the residuals and use them as an indicator of structural change. For our purposes we employ a control chart to monitor whether the system generating the data has changed. In this discussion the control chart is created by constructing 99% control lines that correspond to 99% confidence intervals for the residuals of the model fit to the undamaged data, assuming the residuals are normally distributed. This normality assumption is further discussed in the experimental results below. These control lines are then extended through the remaining (potentially damaged) data, and damage is indicated when a statistically significant number of residuals, in this case more than 1%, lie outside these lines. Note that damage can also be indicated when the residuals no longer have a random distribution even though they may not lie outside the control lines.
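Putting the pieces together, an SVM-based monitor of this kind might be sketched as follows, here using scikit-learn's `SVR` (a wrapper around libSVM); the Gaussian 99% control lines follow the description above, while the helper and parameter names are our own.

```python
import numpy as np
from scipy.stats import norm
from sklearn.svm import SVR

def svm_ar_monitor(x_train, x_test, p=5, C=1.0, epsilon=0.1, sigma2=1.0):
    """Fit the autoregressive SVM of (3.12) to undamaged data and return
    test residuals with 99% control lines from the training residuals."""
    def lagged(x):
        X = np.column_stack([x[j : len(x) - p + j] for j in range(p)])
        return X, x[p:]

    X_tr, y_tr = lagged(x_train)
    # sklearn's RBF is exp(-gamma ||x - y||^2), so gamma = 1/(2 sigma^2)
    model = SVR(kernel="rbf", C=C, epsilon=epsilon, gamma=1.0 / (2.0 * sigma2))
    model.fit(X_tr, y_tr)
    r_tr = y_tr - model.predict(X_tr)
    X_te, y_te = lagged(x_test)
    r_te = y_te - model.predict(X_te)
    z = norm.ppf(0.995)                       # 99% Gaussian control lines
    lines = (r_tr.mean() - z * r_tr.std(), r_tr.mean() + z * r_tr.std())
    return r_te, lines
```

Damage would then be flagged when a statistically significant fraction (more than 1%) of the test residuals falls outside `lines`, mirroring the control chart logic above.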
RBF neural networks, which have the same form as Equation (3.12), have previously been used to perform SHM (e.g., Rytter and Kirkegaard, 1997). However, fitting these networks requires much more user input, such as selecting which $\beta_j$ are non-zero as well as selecting the corresponding training points. In addition, the fitting of the neural network model is a rather complicated nonlinear optimization process relative to the simple quadratic optimization used in the support vector framework. Although the SVM models are more easily developed, Schölkopf et al. (1997) have demonstrated that SVMs still more accurately predict the data than RBF neural networks despite their simplicity.

3.2.1 Example: Simulated Damage


We now compare the performance of the SVM-based damage detection method to a traditional AR model with coefficients estimated by the Yule-Walker method (see, for example, Brockwell and Davis (1991)). The data are generated as follows for discrete time points $t = 1, \ldots, 1200$:

$$\begin{aligned} x_t = \; & \sin^3(400\pi t/1200) + 2\sin(400\pi t/1200) + \sin(200\pi t/1200) \qquad & (3.13) \\ & + \sin(100\pi t/1200) + \Psi_t + \epsilon_t \qquad & (3.14) \end{aligned}$$

where $\epsilon_t$ is Gaussian random noise with mean 0 and standard deviation 0.1, and $\Psi_t$ is a damage term. Three different damage cases are added to this time series at various times, as defined by

$$\Psi_t = \begin{cases} \epsilon_{1,t} & \text{for } t = 600, \ldots, 650 \\ \sin(1000\pi t/1200) & \text{for } t = 800, \ldots, 850 \\ \epsilon_{2,t} & \text{for } t = 1000, \ldots, 1050 \\ 0 & \text{otherwise} \end{cases}$$

where $\epsilon_1$ and $\epsilon_2$ are Gaussian random noise with means 0 and 1, and standard deviations 0.5 and 0.2, respectively. Through the use of $\Psi_t$ we attempt to simulate several different types of damage and compare the models' performance in handling each. This raw signal is plotted in Figure 3.3, where it can be seen that the changes caused by the damage are somewhat subtle.
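The signal of (3.13)-(3.14) can be regenerated along the following lines; the placement of the cubed sine term reflects our reading of the original equation, and the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1, 1201)
x = (np.sin(400 * np.pi * t / 1200) ** 3 + 2 * np.sin(400 * np.pi * t / 1200)
     + np.sin(200 * np.pi * t / 1200) + np.sin(100 * np.pi * t / 1200)
     + rng.normal(0.0, 0.1, size=t.size))            # baseline plus noise

psi = np.zeros(t.size)                               # damage term Psi_t
burst = (t >= 600) & (t <= 650)
psi[burst] = rng.normal(0.0, 0.5, size=burst.sum())  # noise burst
tone = (t >= 800) & (t <= 850)
psi[tone] = np.sin(1000 * np.pi * t[tone] / 1200)    # high-frequency tone
shift = (t >= 1000) & (t <= 1050)
psi[shift] = rng.normal(1.0, 0.2, size=shift.sum())  # mean shift
x += psi
```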

Figure 3.3: Raw simulated data with highlighted artificial damage

Figure 3.4: Autocorrelation plot of simulated data

The order $p$ for both models was set at 5, as determined from the autocorrelation plot in Figure 3.4. This plot measures the correlation between time points at a given lag. We see from the plot that after a lag of 5 the correlation is quite small, and hence little information is gained by including a longer past history $p$. This is a standard method for determining model order for traditional AR models, and as such should maximize that method's performance, ensuring the SVM-based model isn't afforded an unfair advantage.
The results of applying both the SVM model and a traditional AR model
to the undamaged portion of the signal between time points 400 and 600 are
shown in Figure 3.5 where the signals predicted by these models are overlaid
on the actual signal. A qualitative visual assessment of Figure 3.5 shows that
the SVM more accurately predicts this signal. A quantitative assessment
is made by examining the distribution of the residual errors obtained with
each model. The standard deviation of the residual errors from the SVM
model is 0.26 while for the traditional AR it is 0.71, again indicating that
the SVM is more accurately predicting the undamaged portion of this time
series.
Figure 3.5: SVM (top) and linear AR models (bottom) fit to subset of data

In order for a model to excel at detecting damage, it must fit the undamaged data well (i.e., small and randomly distributed residual errors) while fitting the damaged data poorly, as identified by increased residual errors
with possibly non-random distributions. In other words, the model must be


sensitive to distributional changes in the data that result from damage. To
quantify such changes a control chart is developed based on the undamaged
portion of the time series to establish statistically based thresholds for the
damage detection process. As mentioned earlier, this control chart is calcu
lated based on the fit to the undamaged data, specifically 99% confidence
lines are drawn based on the undamaged residual error data and carried
forward for comparison on the potentially damaged data. It is in this part
of the process that the SVM’s ability to more accurately represent the data
enhances the damage detection process. The 99% confidence lines for the
SVM are much closer to the mean value of the residual errors and, hence,
will more readily identify small perturbations to the underlying system that
produce changes in the residual error distribution. In addition, the tradi
tional AR model shows a trend in the residuals, indicating lack of model
fit, even in the undamaged case. We see that during the times of damage
the residuals for the SVM-based model exceed the control limits more than
occurs with the residuals from the traditional AR model. In fact, the latter
method would likely miss the damage between time points 1000 and 1050,
where only one point exceeds the threshold versus over 10 for the SVM-based
model. This result can be seen in Figure 3.6.
Figure 3.6: Residuals from SVM (top) and linear AR models (bottom) applied to simulated data. The 99% control lines, based on the residuals from the undamaged portion of the signal, are shown in red.

Since each method performs differently for different sources of damage, it is of interest to determine when each method will be successful in indicating damage. Since the traditional AR model fits a single model to the entire
data, model fit will be very poor if the data is non-stationary (for instance
if the excitation is in the form of hammer impacts). Additionally, since the
traditional AR model as presented above does not contain a moving average
term, it will continue to fit when damage is in the form of a shift up or down
in the raw time series (as demonstrated by the third damage scenario above).
Conversely, the SVM-based method works by comparing each length-$p$ segment of data to all corresponding segments in the training set. Thus, if a similar sequence
exists in the training set, we can expect the fit to be quite good. We see two
scenarios in which the SVM-based method will perform poorly. Firstly, if some damage is present in the nominally undamaged training data, and similar damage occurs in the testing set, the model will likely fit this portion quite well. Secondly, if
damage manifests itself in such a way that the time-series data is extremely
similar to the undamaged time-series, the SVM methodology will be unable
to detect it. However, we should emphasize that other methods, including
the AR model, will suffer in such scenarios as well. As an attempted solution
when the sensitivity of the method to a given type of damage is unknown
and simulation tests are impossible, various damage detection methods could
potentially be combined to boost detection power.


3.3 Joint Online SHM


In the undamaged state, a Gaussian distribution can often approximate the residuals from fitted models, or control charts can be developed that invoke the central limit theorem and force some function of the residual errors to have a Gaussian distribution, as with an x-bar control chart (Fugate et al., 2001). If we have $K$ sensors, each of whose residuals are Gaussian distributed, we would like a way of combining these residuals to come up with a damage detection method that examines all $K$ sensors. Noticing that the sum of $K$ squared standard Gaussian random variables is distributed as a chi-squared random variable with $K$ degrees of freedom, we square the residuals from each sensor (after they are normalized to have mean 0 and variance 1 based on the undamaged data) and add them together to create a new combined residual. These new combined residuals follow a chi-squared distribution, and hence we can make probabilistic statements about the residuals being typical or not (indicative of damage). Specifically, consider the combined residuals at some time point $t$:

$$\sum_{k=1}^{K} \left( r_t^k \right)^2 \qquad (3.15)$$

where $r_t^k$ is the normalized residual at time $t$ for sensor $k$. Assuming the original residuals are Gaussian distributed, this random variable will have a chi-squared distribution with $K$ degrees of freedom. Note that even when the original residuals are not approximately Gaussian, we may still employ a control chart on the combined residuals to give probabilistic statements regarding damage. For instance, when the residual errors from the fitted model have thicker tails than Gaussian, control charts must be employed to make probabilistic statements about the combined residual. However, as we'll see in the following example, the residual errors are often very close to Gaussian.
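A sketch of this combination step is given below, with a chi-squared control line added for the Gaussian case; the array layout and function names are our own choices.

```python
import numpy as np
from scipy.stats import chi2

def combined_residuals(resid, train_idx, alpha=0.01):
    """Combine K sensor residual streams (rows of resid) via (3.15)."""
    mu = resid[:, train_idx].mean(axis=1, keepdims=True)
    sd = resid[:, train_idx].std(axis=1, keepdims=True)
    z2 = ((resid - mu) / sd) ** 2              # per-sensor terms (r_t^k)^2
    R = z2.sum(axis=0)                         # chi-squared with K dof if Gaussian
    limit = chi2.ppf(1 - alpha, df=resid.shape[0])   # 99% control line
    return R, z2, limit
```

When `R` exceeds `limit` over some range, averaging the rows of `z2` over that range indicates which sensors contribute most to the anomaly, anticipating the localization idea developed next.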
In addition to allowing us to make statements regarding damage from multiple sensor outputs, these combined residuals also provide a mechanism for determining which sensors are most influenced by the damage. This latter property is of particular importance for damage location. If the combined residual is large, and hence we determine that there is damage, we can look at the values $(r_t^k)^2$ for each sensor and from their magnitudes determine which sensors contributed the most to the large combined residual. If we detect damage over a range of time points, we may average $(r_t^k)^2$ over this range for each sensor to determine how much each sensor is contributing to the anomalous reading.

Figure 3.7: Diagram of experimental structure

3.3.1 Example: Experimental Data


We examine joint online SHM using SVMs on experimental data from a structure designed to produce nonlinear response when it is damaged. The structure is a three-story building (Figure 3.7) consisting of aluminum columns and plates with bolted joints, and a rigid base that is constrained to slide horizontally on two rails when excited by an electro-dynamic shaker. Each floor is a 30.5 x 30.5 x 2.5 cm plate and is separated from adjacent floors by four 17.7 x 2.5 x 0.6 cm columns. To induce nonlinear behavior, a 15.0 x 2.5 x 2.5 cm column is suspended from the top floor and a bumper is placed on the second floor. The contact of this suspended column with the bumper results in nonlinear effects. The initial gap between the suspended column and the bumper is adjusted to simulate different levels of damage. In our test data we employ the case where the column is set 0.05 mm away from the bumper. The undamaged data are obtained when the bumper and suspended column do not contact each other. The structure is subjected to a random base excitation from the shaker in both its undamaged and damaged conditions. Accelerometers mounted on each floor record the response of the structure to these base excitations. A more detailed description of the test structure and the data obtained can be found at www.lanl.gov/projects/ei.
We first concatenate the undamaged data with the damaged data to
demonstrate that the proposed methodology adequately detects the damage.


The SVM time series models are developed for each of the accelerometer
measurements from the undamaged data as follows:

1. Select the number of time lags that will be used in the time series models. In this case eight time lags were used based on the AIC. Note the number of time lags is analogous to the order of an AR model.

2. Select the parameters of the SVM model, including the kernel type and corresponding parameters as well as $C$ and $\epsilon$, which control model fit as described earlier. In our case we used a Gaussian kernel with variance 1 and set $C = 1$ and $\epsilon = 0.1$. We have found the methodology to be robust to choices of variance ranging over an order of magnitude. In addition, $C$ could be increased to force fitting of extreme values, and $\epsilon$ could be lowered to enforce a closer fit to the training data.

3. Pass the data (arranged with each observation as the dependent variable and the previous $p$ points as independent variables) to the optimization described by Equation (3.7). In this case we use the first 6000 undamaged points as training data. This step is handled by the wide variety of support vector machine software available covering multiple computing environments including MATLAB and R. In particular, we employ the libSVM library with accompanying MATLAB interface (Chang and Lin, 2001).

4. Once the SVM model is trained (i.e., the $\beta_j$ in Equation (3.12) are selected) in step 3, make predictions based on the new test data from the structure in its undamaged or damaged condition. Next, calculate the residual between the measured data and the output of the time series prediction.

5. Square and add the residuals from each sensor as described by Equation (3.15). Build a control chart for these combined residuals to detect damage (perhaps in conjunction with statistical tests such as a sliding window approach).

Note that steps 1 through 4 of this process are applied to each time series recorded by the four accelerometers shown in Figure 3.7. A sketch tying these steps together follows.
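Steps 1 through 5 can be strung together roughly as in the sketch below, which substitutes scikit-learn's `SVR` (itself built on libSVM) for the MATLAB interface used in the original analysis and assumes the sensor histories arrive as a K x T array.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.svm import SVR

def monitor_sensors(train, test, p=8, C=1.0, epsilon=0.1):
    """Fit one RBF SVR per sensor on undamaged data, normalize each residual
    stream, and combine across sensors via (3.15)."""
    def lagged(x):
        X = np.column_stack([x[j : len(x) - p + j] for j in range(p)])
        return X, x[p:]

    combined = 0.0
    for k in range(train.shape[0]):
        X_tr, y_tr = lagged(train[k])
        m = SVR(kernel="rbf", gamma=0.5, C=C, epsilon=epsilon)  # sigma^2 = 1
        m.fit(X_tr, y_tr)
        r_tr = y_tr - m.predict(X_tr)
        X_te, y_te = lagged(test[k])
        r_te = (y_te - m.predict(X_te) - r_tr.mean()) / r_tr.std()
        combined = combined + r_te ** 2                         # equation (3.15)
    return combined, chi2.ppf(0.99, df=train.shape[0])          # 99% control line
```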
First we revisit the normality assumption that was made in constructing the control chart. Figure 3.8 shows the resulting Q-Q plot for the residuals from the SVM model fit to sensor 4 data obtained with the structure in its undamaged state. The Q-Q plot compares the sample quantiles of the residuals to theoretical quantiles of a Gaussian distribution. We see in this figure that the sample quantiles fall very close to the theoretical line, and hence our residuals are approximately Gaussian.

Figure 3.8: Q-Q plot of residuals from SVM model
Figure 3.9 shows the residual errors from the SVM fit to each of the accelerometer readings and the corresponding 99% control limits, which are based on the first 6000 points from the undamaged portion of each signal. There are 8192 undamaged points and 8192 damaged ones; thus when we concatenate the data the damage occurs at time point 8193 of 16384. Figure 3.10 shows the density of the normalized residual errors from all the sensors combined according to Equation (3.15). We see that the distribution is very nearly chi-squared. In situations where the original residuals aren't normal, this result won't hold, and hence probabilistic statements regarding the presence of damage must be made based on control charts.

Figure 3.9: Residuals from 4 sensors for t = 7000, . . . , 9384. The 99% control lines are shown in red.

Figure 3.10: Density estimate of combined residual (black) vs. chi-squared distribution (red).

Figure 3.11 shows the combined residuals as a function of time. The blue points in the plot show damage indication using the sliding window approach of Ma and Perkins (2003), as described in the introduction, based on the 99% control lines. Specifically, we use a window size of 6 which, when combined with the 99% control limit, detects damage whenever 1 or more of the 6 points in the window exceeds the control line (equivalent to a binomial probability of 0.05).

Figure 3.11: Combined residuals from all 4 sensors. The 99% control line is shown in red; the sliding window damage indicator is shown in blue.

We see from Figure 3.9 that sensors 3 and 4 are most influenced by damage. This result is expected, as the bumper is mounted between these two sensors. In fact, if we look at the average values of $(r_t^k)^2$ (the individual squared residuals for sensor $k$) over the damaged section for each sensor, we see that the first two sensors have values 0.96 and 1.24, whereas the second two sensors have values 59.80 and 38.2, respectively. Thus from this numerical rating we can see that sensors 3 and 4 are most influenced by the damage, which agrees with the result shown in Figure 3.9.
From this analysis it is evident that we can use the combined residuals to
establish the presence of damage in a statistically rigorous manner and then
examine the individual sensor residuals in an effort to locate the sensors
most influenced by the damage. This latter information can be used to
help locate the damage assuming that the damage is confined to a discrete
location such as the formation of a crack in a welded connection. Further
investigation is needed to assess how this procedure could be used to locate
damage for more distributed damage such as that associated with corrosion.

3.4 Conclusion
Although the application of statistical techniques to structural health mon
itoring has been investigated in the past, these techniques have predomi
nantly been limited to identifying damage-sensitive features derived from
linear models fit to the output from individual sensors. As such, they are
typically limited to identifying only that damage has occurred. In general,


these methods are not able to identify which sensors are associated with the
damage in an effort to locate the damage within the resolution of the sensor
array. To improve upon this approach to damage detection, we have applied
support vector machines to model sensor output time histories and have
shown that such nonlinear regression models more accurately predict the
time series when compared to linear autoregressive models. Here the metric
for this comparison is the residual errors between the measured response
data and predictions of the time series model.
The support vector machine autoregressive method is superior to the traditional linear AR approach in both its ability to handle nonlinear dynamics and the structure of its model. Specifically, the support vector approach
compares each new testing point to the entire training set whereas the tra
ditional AR model finds a simple linear relationship to best describe the
entire training set, which is then used on the testing data. For example,
when dealing with transient impact data, the AR model will fail in try
ing to fit the entire time domain with a simple linear model. Whereas in
the past RBF neural networks have been used to tackle this problem, these
networks require significant user input and complex methods for fitting the
model to the training data, and hence the simple support vector framework
is preferred.
Furthermore, we have also shown how the residuals from the SVM prediction of each sensor time history may be combined in a statistically rigorous manner to provide probabilistic statements regarding the presence of damage as assessed from the amalgamation of all available sensors. In addition, this methodology allows us to pinpoint the sensors that are contributing most to the anomalous readings and therefore locate the damage within the sensor network's spatial resolution. The process was demonstrated on a
test structure where damage was simulated by introducing an impact type
of nonlinearity between the measured degrees of freedom. The authors ac
knowledge that the approach has only been demonstrated on a structure
that was tested in a well-controlled laboratory setting. This approach will
have to be extended to structures subjected to real-world operational and
environmental variability before it can be used in practice. However, the
approach has the ability to adapt to such changes through the analysis of
appropriate training data that span these conditions. Therefore, follow-on
studies will focus on applying this approach to systems with operational and
environmental variability as well as systems that exhibit nonlinear response
in their undamaged state.


3.5 References
Allen, D., Sohn, H., Worden, K., and Farrar, C. (2002). "Utilizing the Sequential Probability Ratio Test for Building Joint Monitoring." Proceedings of the SPIE Smart Structures Conference. San Diego, March 2002.

Bulut, A., Singh, A.K., Shin, P., Fountain, T., Jasso, H., Yan, L., and Elgamal, A. (2005). "Real-time Nondestructive Structural Health Monitoring Using Support Vector Machines and Wavelets." Proc. SPIE. 5770:180-189.

Brockwell, P. and Davis, R. (1991). Time Series: Theory and Methods. New York: Springer.

Chattopadhyay, A., Das, S., and Coelho, C.K. (2007). "Damage Diagnosis Using a Kernel-based Method." Insight - Non-Destructive Testing and Condition Monitoring. 49:451-458.

Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: A Library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Clark, G. (2008). "Cable Damage Detection Using Time Domain Reflectometry and Model-Based Algorithms." Lawrence Livermore National Laboratory document LLNL-CONF-402567.

Copas, J.B. (1997). "Using Regression Models for Prediction: Shrinkage and Regression to the Mean." Statistical Methods in Medical Research. 6:167-183.

Doebling, S., Farrar, C., Prime, M., Shevitz, D. (1998). "A Review of Damage Identification Methods that Examine Changes in Dynamic Properties." Shock and Vibration Digest. 30:91-105.

Farrar, C.R. and Worden, K. (2007). "An Introduction to Structural Health Monitoring." Philosophical Transactions of the Royal Society A. 365:303-315.

Fu, W.J. (1998). "Penalized Regressions: The Bridge Versus the Lasso." Journal of Computational and Graphical Statistics. 7:397-416.

Fugate, M., Sohn, H., and Farrar, C.R. (2001). "Vibration-Based Damage Detection Using Statistical Process Control." Mechanical Systems and Signal Processing. 15:707-721.

Herzog, J., Hanlin, J., Wegerich, S., Wilks, A. (2005). "High Performance Condition Monitoring of Aircraft Engines." Proceedings of GT2005, ASME Turbo Expo. June 6-9, 2005.

Ma, J. and Perkins, S. (2003). "Online Novelty Detection on Temporal Sequences." Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 613-618.

Rytter, A. and Kirkegaard, P. (1997). "Vibration Based Inspection Using Neural Networks." Structural Damage Assessment Using Advanced Signal Processing Procedures, Proceedings of DAMAS 97. University of Sheffield, UK. pp. 97-108.

Schölkopf, B., Sung, K.K., Burges, C.J.C., Girosi, F., Niyogi, P., Poggio, T., and Vapnik, V. (1997). "Comparing Support Vector Machines with Gaussian Kernels to Radial Basis Function Classifiers." IEEE Transactions on Signal Processing. 45:2758-2765.

Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., and Williamson, R. (2001). "Estimating the Support of a High-Dimensional Distribution." Neural Computation. 13:1443-1471.

Shimada, M., Mita, A., and Feng, M.Q. (2006). "Damage Detection of Structures Using Support Vector Machines Under Various Boundary Conditions." Proc. SPIE. 6174.

Smola, A.J. and Schölkopf, B. (2004). "A Tutorial on Support Vector Regression." Statistics and Computing. 14:199-222.

Sohn, H., Czarnecki, J., and Farrar, C.R. (2000). "Structural Health Monitoring Using Statistical Process Control." Journal of Structural Engineering. 126:1356-1363.

Sohn, H., Farrar, C.R., Hemez, F.M., Shunk, D.S., Stinemates, D.W., Nadler, B.R., and Czarnecki, J.J. (2004). "A Review of Structural Health Monitoring Literature from 1996-2001." Los Alamos National Laboratory report LA-13976-MS.

Worden, K. and Lane, A.J. (2001). "Damage Identification Using Support Vector Machines." Smart Materials and Structures. 10:540-547.

Worden, K. and Manson, G. (2007). "The Application of Machine Learning to Structural Health Monitoring." Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 365:515-537.

Chapter 4

Conclusion
Because research in signal processing is being undertaken by physicists, com
puter scientists, statisticians, and engineers among others, many tools de
veloped by one group aren’t fully adopted by others. This is partially due
to differences in jargon, but also because of each group’s different focus and
goals. However, this thesis shows that methods developed by one group for
a given purpose may often be employed quite successfully by another group
for an entirely different problem.
With the state of the art in particle filtering focusing on limiting degeneracy of the algorithm, it is likely that future research in the area can be applied to the material in Chapter 2 to extend its scope of application. In addition, the development of support vector machines is moving toward implementing the method quickly and online, while minimizing storage requirements. These advances might extend the structural health monitoring approach discussed in Chapter 3 to long time series for which storage and computation become difficult.
While this thesis successfully implements two separate statistical methods, each is developed in a fairly specific setting, when in fact the scope of application is much more general and may extend to problems not covered in this work. As future research, prior sensitivity analysis and cross-validation need to be studied with the goal of easing implementation for multi-dimensional parameters. Since existing methods, including the one presented, have computational complexity which scales exponentially with dimension, alternative methods must be found. In regards to structural health monitoring, more attention must be paid to jointly modeling all sensors simultaneously, taking their correlation into account. In addition, more studies must be undertaken to understand the effect of varying environmental conditions, as well as the case where the initial system is slightly damaged, and hence nonlinear. Whether the solutions to these problems come from the world of signal processing remains to be seen.
Both of the ideas presented in this thesis have been greeted with enthusiasm by researchers at Los Alamos National Laboratory, who daily analyze complex and computationally expensive systems. In particular, the use of sequential Monte Carlo for prior sensitivity analysis and cross-validation has the potential to reduce the computational cost of building models for understanding complex systems such as those present in biological and weapons research. In addition, the power gained from using SVMs for structural health monitoring will allow for earlier detection of damage, and hence help ensure a structure's economic viability as well as the safety of its operators.

Chapter 5

Technical Appendix
Proof of Theorem 1. (Following along the lines of Peruggia (1997) and Epifani et al. (2005).) We seek to show that the $r$th moment of the successive importance weights is finite. We thus need to find the conditions under which $\int \zeta(\theta)\, d\theta$ is finite, where $\zeta(\theta) = (q(\theta))^{1-\gamma} (q^{\setminus S}(\theta))^{\gamma} \times (w^{\setminus S}(\theta))^r$. We expand and simplify $\zeta(\theta)$ to obtain

$$\begin{aligned} \zeta(\theta) &= f(y|\beta,\sigma^2)^{1-\gamma} \times f(y^{\setminus S}|\beta,\sigma^2)^{\gamma} \times \pi(\beta) \times \pi(\sigma^2) \times (w^{\setminus S}(\theta))^r \\ &\propto (\sigma^2)^{-\frac{n-s(\gamma+r\epsilon)}{2}} \times \pi(\beta) \times \pi(\sigma^2) \\ &\quad\times \exp\left\{-\frac{1}{2\sigma^2}\left[(y-X\beta)^T(y-X\beta) - (\gamma+r\epsilon)(y_S-X_S\beta)^T(y_S-X_S\beta)\right]\right\} \\ &= q_1(\theta) \times q_2(\theta) \end{aligned}$$

where

$$q_1(\theta) = \pi(\beta)\,\pi(\sigma^2) \times \exp\left\{-\frac{1}{2\sigma^2}(\beta-\tilde\beta)^T\left[X^TX - (\gamma+r\epsilon)X_S^TX_S\right](\beta-\tilde\beta)\right\},$$

$$q_2(\theta) = (\sigma^2)^{-\frac{n-s(\gamma+r\epsilon)}{2}} \times \exp\left\{-\frac{1}{2\sigma^2}\left[y^Ty - (\gamma+r\epsilon)y_S^Ty_S - \tilde\beta^T\left[X^TX - (\gamma+r\epsilon)X_S^TX_S\right]\tilde\beta\right]\right\},$$

and $\tilde\beta = \left[X^TX - (\gamma+r\epsilon)X_S^TX_S\right]^{-1}\left[X^Ty - (\gamma+r\epsilon)X_S^Ty_S\right]$. We will show momentarily that $X^TX - (\gamma+r\epsilon)X_S^TX_S$ is positive definite, and hence invertible. Note that $q_1(\theta)$ is proportional to a proper density for $\beta$ when $X^TX - (\gamma+r\epsilon)X_S^TX_S$ is positive definite, in which case $\int q_1(\theta)\, d\beta$ is upper bounded. Now $q_2(\theta)$ is proportional to an inverse gamma density in $\sigma^2$ provided that both

$$n - s(\gamma + r\epsilon) > 2$$

and

$$y^Ty - (\gamma+r\epsilon)y_S^Ty_S - \tilde\beta^T\left[X^TX - (\gamma+r\epsilon)X_S^TX_S\right]\tilde\beta > 0.$$

Thus, aside from showing conditions under which $X^TX - (\gamma+r\epsilon)X_S^TX_S$ is positive definite, we also need to find conditions guaranteeing the above two inequalities. We first show that $X^TX - (\gamma+r\epsilon)X_S^TX_S$ is positive definite. Using the Woodbury matrix identity, we see that $\left[X^TX - (\gamma+r\epsilon)X_S^TX_S\right]^{-1}$ may be written as

$$(X^TX)^{-1} + (X^TX)^{-1}(\gamma+r\epsilon)X_S^T\left(I - (\gamma+r\epsilon)X_S(X^TX)^{-1}X_S^T\right)^{-1}X_S(X^TX)^{-1}.$$

Now if $I - (\gamma+r\epsilon)X_S(X^TX)^{-1}X_S^T$ is positive definite, the second term in the above sum is positive semi-definite. This is the case when all the eigenvalues of $X_S(X^TX)^{-1}X_S^T$ are less than $1/(\gamma+r\epsilon)$. In that case $\left[X^TX - (\gamma+r\epsilon)X_S^TX_S\right]^{-1}$ may be written as the sum of a positive definite and a positive semi-definite matrix, and hence $X^TX - (\gamma+r\epsilon)X_S^TX_S$ is positive definite.

Now we proceed to find conditions ensuring

$$y^Ty - (\gamma+r\epsilon)y_S^Ty_S - \tilde\beta^T\left[X^TX - (\gamma+r\epsilon)X_S^TX_S\right]\tilde\beta > 0.$$

Simple but tedious algebra gives the following expression:

$$y^Ty - (\gamma+r\epsilon)y_S^Ty_S - \tilde\beta^T\left[X^TX - (\gamma+r\epsilon)X_S^TX_S\right]\tilde\beta = \mathrm{RSS} - (\gamma+r\epsilon)\,e_S^T\left(I - (\gamma+r\epsilon)H_S\right)^{-1}e_S = \mathrm{RSS}^{*\setminus S}(\gamma+r\epsilon)$$

which, by the theorem's conditions, is greater than 0 for argument value 1, and since $\mathrm{RSS}^{*\setminus S}$ is a smoothly decreasing function of its argument, it is also positive for some positive argument value less than 1. Now, we choose $\epsilon < (c-\gamma)/r$, which implies $c > \gamma + r\epsilon$. With $c$ satisfying (2.8), the conditions outlined in the proof hold. Namely, a) the eigenvalues of $H_S$ are less than $1/(\gamma+r\epsilon)$, since these eigenvalues are upper bounded by 1; b) $n - s(\gamma+r\epsilon) > 2$; and c) $\mathrm{RSS}^{*\setminus S}(\gamma+r\epsilon) > 0$. □
