
ONLINE EXPECTATION-MAXIMIZATION TYPE ALGORITHMS FOR PARAMETER ESTIMATION IN GENERAL STATE SPACE MODELS

Christophe Andrieu - Arnaud Doucet

Department of Mathematics, Bristol University, Bristol, BS8 1TW, UK.
Department of Engineering, Cambridge University, Cambridge, CB2 1PZ, UK.
Email: [email protected] - [email protected]
ABSTRACT
In this paper we present new online algorithms to estimate static parameters in nonlinear non-Gaussian state space models. These algorithms rely on online Expectation-Maximization (EM) type algorithms. Contrary to standard Sequential Monte Carlo (SMC) methods recently proposed in the literature, these algorithms do not degenerate over time.
1 Introduction
1.1 State-Space Models and Problem Statement
Let $\{X_n\}_{n \geq 0}$ and $\{Y_n\}_{n \geq 1}$ be $\mathbb{R}^p$- and $\mathbb{R}^q$-valued stochastic processes defined on a measurable space $(\Omega, \mathcal{F})$, and let $\theta \in \Theta$, where $\Theta$ is an open subset of $\mathbb{R}^k$. The process $\{X_n\}_{n \geq 0}$ is an unobserved (hidden) Markov process of initial density $\mu$, i.e. $X_0 \sim \mu$, and Markov transition density $f_\theta(x, x')$, i.e.
$$X_{n+1} \mid X_n = x \sim f_\theta(x, \cdot). \qquad (1)$$
One observes the process $\{Y_n\}_{n \geq 1}$. It is assumed that the observations are conditionally independent given $\{X_n\}_{n \geq 0}$, of marginal density $g_\theta(x, y)$, i.e.
$$Y_n \mid X_n = x \sim g_\theta(x, \cdot). \qquad (2)$$
This class of models includes many nonlinear and non-Gaussian time series models such as
$$X_{n+1} = \varphi(X_n, V_{n+1}), \qquad Y_n = \psi(X_n, W_n),$$
where $\{V_n\}_{n \geq 1}$ and $\{W_n\}_{n \geq 1}$ are independent sequences.
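As a concrete illustration (not from the paper), the following minimal Python sketch simulates such a model for user-supplied maps $\varphi$ and $\psi$ and standard Gaussian noise; the function and variable names are ours.

```python
import numpy as np

# Minimal simulation sketch, assuming standard Gaussian noise V_n, W_n and
# user-supplied maps phi and psi; names are illustrative only.
def simulate(phi, psi, x0, n_steps, seed=0):
    rng = np.random.default_rng(seed)
    xs, ys, x = [], [], x0
    for _ in range(n_steps):
        x = phi(x, rng.normal())          # X_n = phi(X_{n-1}, V_n)
        ys.append(psi(x, rng.normal()))   # Y_n = psi(X_n, W_n)
        xs.append(x)
    return np.array(xs), np.array(ys)

# The stochastic volatility model of Section 4 fits this template, e.g.
# phi = lambda x, v: 0.9 * x + np.sqrt(0.1) * v and psi = lambda x, w: np.exp(x / 2) * w.
```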
We assume here that the model depends on some unknown parameter denoted $\theta$. The true value of $\theta$ is $\theta^*$. We are interested in deriving recursive algorithms to estimate $\theta^*$ for the class of stationary state-space models, i.e. when the Markov chain $\{X_n\}_{n \geq 0}$ admits a limiting distribution. This problem has numerous applications in electrical engineering, econometrics, statistics, etc. It is extremely complex: even if $\theta^*$ were known, the simpler problem of optimal filtering, i.e. estimating the posterior distribution of $X_n$ given $(Y_1, \ldots, Y_n)$, does not usually admit a closed-form solution.
Further on, for any sequence $z_k$ / random process $Z_k$, we will write $z_{i:j} = (z_i, z_{i+1}, \ldots, z_j)$ and $Z_{i:j} = (Z_i, Z_{i+1}, \ldots, Z_j)$.
1.2 A Brief Literature Review
1.2.1 Filtering methods
A standard approach followed in the literature consists of setting a prior distribution on the unknown parameter and then considering the extended state $S_n \triangleq (X_n, \theta)$. This converts parameter estimation into an optimal filtering problem. One can then apply, at least theoretically, standard particle filtering techniques [5] to estimate $p(x_n, \theta \mid Y_{1:n})$ and thus $p(\theta \mid Y_{1:n})$. In this approach, the parameter space is only explored at the initialization of the algorithm. Consequently the algorithm is inefficient; after a few iterations the marginal posterior distribution of the parameter is approximated by a single Dirac delta function. To limit this problem, several authors have proposed to use kernel density estimation methods [13]. However, this has the effect of transforming the fixed parameter into a slowly time-varying one. A pragmatic approach consists of explicitly introducing an artificial dynamic on the parameter of interest; see [10], [11]. To avoid the introduction of an artificial dynamic model, an approach proposed in [8] consists of adding Markov chain Monte Carlo (MCMC) steps so as to add diversity among the particles. However, this approach does not solve the fixed-parameter estimation problem. More precisely, the addition of MCMC steps does not make the dynamic model ergodic. Thus, there is an accumulation of errors over time and the algorithm can diverge, as observed in [1]. Similar problems arise with the recent method proposed in [16].
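The degeneracy of the state-augmentation approach is easy to reproduce. The toy sketch below (our illustration, on a hypothetical linear-Gaussian model, not an example from the paper) runs a bootstrap particle filter on the extended state and counts the surviving distinct parameter values: since the $\theta$-particles are only copied at resampling and never rejuvenated, their diversity collapses.

```python
import numpy as np

# Toy illustration of the degeneracy of the extended-state particle filter:
# theta particles are drawn once at initialization and only copied afterwards.
rng = np.random.default_rng(0)
N, T, theta_true = 1000, 200, 0.8

# simulate a toy model X_{n+1} = theta*X_n + V_{n+1}, Y_n = X_n + W_n
x, ys = 0.0, []
for _ in range(T):
    x = theta_true * x + rng.normal()
    ys.append(x + rng.normal())

theta = rng.uniform(-1.0, 1.0, N)    # parameter particles (static component of S_n)
xp = rng.normal(size=N)              # state particles
for y in ys:
    xp = theta * xp + rng.normal(size=N)        # propagate the dynamic component
    logw = -0.5 * (y - xp) ** 2                 # Gaussian observation log-weight
    w = np.exp(logw - logw.max()); w /= w.sum()
    idx = rng.choice(N, size=N, p=w)            # multinomial resampling
    xp, theta = xp[idx], theta[idx]             # theta is only duplicated, never moved
print("distinct theta particles after", T, "steps:", np.unique(theta).size)
```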
1.2.2 Recursive maximum likelihood
Consider the log-likelihood function
$$l_\theta(Y_{1:n}) = \sum_{k=1}^{n} \log \left( \int g_\theta(x_k, Y_k)\, p_\theta(x_k \mid Y_{1:k-1})\, dx_k \right), \qquad (3)$$
where $p_\theta(x_k \mid Y_{1:k-1})$ is the posterior density of the state $X_k$ given the observations $Y_{1:k-1}$ and $\theta$. Under regularity assumptions including the stationarity of the state-space model, one has
$$\frac{1}{n}\, l_\theta(Y_{1:n}) \to l(\theta) \qquad (4)$$
where
$$l(\theta) \triangleq \int_{\mathbb{R}^q \times \mathcal{P}(\mathbb{R}^p)} \log \left( \int g_\theta(x, y)\, \mu(x)\, dx \right) \lambda_{\theta, \theta^*}(dy, d\mu),$$
$\mathcal{P}(\mathbb{R}^p)$ being the space of probability distributions on $\mathbb{R}^p$ and $\lambda_{\theta, \theta^*}(dy, d\mu)$ the joint invariant distribution of the couple $(Y_k, p_\theta(x_k \mid Y_{1:k-1}))$. It depends on both $\theta$ and the true parameter $\theta^*$. Maximizing $l(\theta)$ corresponds to minimizing the following Kullback-Leibler information measure
$$K(\theta, \theta^*) \triangleq l(\theta^*) - l(\theta) \geq 0. \qquad (5)$$
To optimize this cost function, Recursive Maximum Likelihood (RML) is based on a stochastic gradient algorithm
$$\theta_{n+1} = \theta_n + \gamma_n \nabla_\theta \log \left( \int g_{\theta_n}(x_n, Y_n)\, p_{\theta_{1:n}}(x_n \mid Y_{1:n-1})\, dx_n \right),$$
where $p_{\theta_{1:n}}(x_n \mid Y_{1:n-1})$ denotes the predictive distribution computed using the parameter $\theta_k$ at time $k$. The recursion requires the computation of this predictive distribution and of its derivatives with respect to $\theta$. This is the approach followed in [12] for finite state-space HMMs and in [6] for general state-space models. In the general state-space case, the filter and its derivatives are approximated numerically. This method does not suffer from the problems mentioned above. However, it is sensitive to initialization and in practice it can be difficult to scale the components of the gradient properly.
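For orientation only, here is a schematic Python sketch of the RML recursion above. It is not the implementation of [6] or [12]: the score of the predictive likelihood is left as a user-supplied callable (a hypothetical interface), since computing it requires propagating the filter and its derivatives.

```python
import numpy as np

# Schematic RML sketch; grad_predictive_loglik is a hypothetical user-supplied callable
# returning an estimate of d/dtheta log p_theta(y_n | y_{1:n-1}) together with whatever
# filter statistics it needs to carry to the next time step.
def rml(y_seq, theta0, grad_predictive_loglik, step=lambda n: (n + 1) ** -0.7):
    theta, carry = np.asarray(theta0, dtype=float), None
    for n, y in enumerate(y_seq):
        grad, carry = grad_predictive_loglik(theta, y, carry)
        theta = theta + step(n) * np.asarray(grad)   # stochastic gradient ascent step
    return theta
```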
1.2.3 Online EM algorithm
Another stochastic gradient type algorithm is based on an online version of the EM. This approach is better from a practical point of view as it is easy to implement and numerically well-behaved. One can actually show that online EM corresponds to a Newton-Raphson like algorithm, the difference being that the gradient is scaled not by the information matrix but by the complete information matrix. Online EM algorithms have been proposed for finite state-space HMMs and linear Gaussian state-space models; e.g. [7]. It is formally possible to come up with a similar algorithm for general state-space models. However, it requires one to compute quantities such as
$$Q_n(\theta \mid \theta') = \mathbb{E}_{\theta'}\!\left( \log p_\theta(x_{0:n}, Y_{1:n}) \mid Y_{1:n} \right)$$
$$= \mathbb{E}_{\theta'}\!\left( \log \mu(x_0) \mid Y_{1:n} \right) + \sum_{k=1}^{n} \mathbb{E}_{\theta'}\!\left( \log f_\theta(x_{k-1}, x_k) \mid Y_{1:n} \right) + \sum_{k=1}^{n} \mathbb{E}_{\theta'}\!\left( \log g_\theta(x_k, Y_k) \mid Y_{1:n} \right).$$
It is trivially possible to estimate $Q_n(\theta \mid \theta')$ using particle methods, as one gets an estimate of the joint posterior density $p_{\theta'}(x_{1:n} \mid Y_{1:n})$. However, as $n$ increases, only the approximation of the marginal densities $p_{\theta'}(x_{n-L:n} \mid Y_{1:n})$, for a reasonable value of $L$, say $L$ around 5, is good for a reasonable number of particles [5]. It thus seems impossible to obtain a good approximation of $Q_n(\theta \mid \theta')$ with an online algorithm as $n$ increases, and it is necessary to come up with an alternative method.
1.3 Contributions
We propose here three new algorithms to address the problem of recursive parameter estimation in general state-space models. These algorithms rely on online EM type algorithms, namely online EM, Stochastic EM (SEM) and Data Augmentation (DA). To prevent the degeneracy inherent to all previous approaches (except [6]), the key point of our paper is to modify the contrast function to optimize. Instead of considering the likelihood function, which leads to (5), we consider here the so-called split-data likelihood (SDL) as originally proposed in [14], [15] for finite state-space HMMs. In this approach, the data set is divided into blocks of, say, $L$ data and one maximizes the average of the resulting log-SDL. This leads to an alternative Kullback-Leibler contrast function. It can be shown under regularity assumptions that the set of parameters optimizing this contrast function includes the true parameter. One approach would consist of using the Fisher identity to obtain a gradient estimate. We will not discuss this approach here; instead we maximize the average log-SDL using online EM type algorithms. As we work on data blocks of fixed dimension, the crucial point is that there is no longer any degeneracy problem. Moreover, contrary to [6], [12], [15], these algorithms are numerically well-behaved. An additional annealing schedule can also be added to the DA algorithm to make it more robust to initialization.
The rest of this paper is organized as follows. In Section 2, we introduce the average log-SDL. In Section 3, we present three recursive algorithms to optimize the resulting Kullback-Leibler contrast function and discuss implementation issues. Finally, in Section 4, we present an application to stochastic volatility.
2 Split-data likelihood
The standard likelihood function of $Y_{1:nL}$ is defined by
$$L_\theta(Y_{1:nL}) = \int \mu(x_0) \prod_{k=1}^{nL} f_\theta(x_{k-1}, x_k)\, g_\theta(x_k, Y_k)\, dx_{0:nL}.$$
The SDL consists of dividing the data $Y_{1:nL}$ into $n$ blocks of $L$ data. For each data block, the pseudo-likelihood $\bar{L}_\theta\!\left(Y_{(i-1)L+1:iL}\right)$ is given by
$$\int p_\theta\!\left(x_{(i-1)L+1:iL}, Y_{(i-1)L+1:iL}\right) dx_{(i-1)L+1:iL}, \qquad (6)$$
with $p_\theta\!\left(x_{(i-1)L+1:iL}, Y_{(i-1)L+1:iL}\right)$ equal to
$$\pi_\theta\!\left(x_{(i-1)L+1}\right) g_\theta\!\left(x_{(i-1)L+1}, Y_{(i-1)L+1}\right) \prod_{k=(i-1)L+2}^{iL} f_\theta(x_{k-1}, x_k)\, g_\theta(x_k, Y_k), \qquad (7)$$
where $\pi_\theta$ corresponds to the invariant density of the latent Markov process. The SDL is the product of the $n$ pseudo-likelihoods (6). It is important to remark that our algorithms require knowledge of the analytical expression of this invariant density up to a normalizing constant. This is a restriction. However, this density is known in many important applications; it is the case, for instance, for all the examples addressed in [16].
We propose to maximize the average log-SDL
$$\frac{1}{n} \sum_{i=1}^{n} \bar{l}_\theta\!\left(Y_{(i-1)L+1:iL}\right) \to \bar{l}(\theta), \qquad (8)$$
where $\bar{l}_\theta\!\left(Y_{(i-1)L+1:iL}\right) \triangleq \log \bar{L}_\theta\!\left(Y_{(i-1)L+1:iL}\right)$ and
$$\bar{l}(\theta) = \int \bar{l}_\theta(Y_{1:L})\, p_{\theta^*}(Y_{1:L})\, dY_{1:L},$$
$p_{\theta^*}(Y_{1:L})$ corresponding to the invariant distribution of the observations under the true parameter $\theta^*$; i.e. this is the marginal of (7) for $i = 1$, $\theta = \theta^*$. Note the difference with the standard RML approach, see (4). Maximizing $\bar{l}(\theta)$ is equivalent to minimizing
$$\bar{K}(\theta, \theta^*) \triangleq \bar{l}(\theta^*) - \bar{l}(\theta),$$
which satisfies $\bar{K}(\theta^*, \theta^*) = 0$ under regularity assumptions [14]. There is a trade-off associated with the choice of $L$. If $L$ is small, the algorithm is typically easier to implement but its convergence might be slow. If $L$ is large, the algorithm converges faster, as one mimics the convergence properties of the RML estimate, but it becomes more complex.
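To make (6)-(7) concrete, the sketch below (our illustration, under the assumptions stated in the comments) estimates the log pseudo-likelihood of a single block for the stochastic volatility model of Section 4, for which $\pi_\theta$ is Gaussian. Drawing latent paths from $\pi_\theta$ and $f_\theta$ turns (6) into a plain Monte Carlo average of the observation densities; this crude estimator is only workable for small $L$.

```python
import numpy as np

# Naive Monte Carlo sketch of the block log pseudo-likelihood (6)-(7) for the
# stochastic volatility model X_{n+1} = phi*X_n + sigma*V_{n+1},
# Y_n = beta*exp(X_n/2)*W_n, whose invariant density is N(0, sigma^2/(1-phi^2)).
def log_block_pseudolik(y_block, phi, beta, sigma, n_mc=5000, seed=1):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma / np.sqrt(1.0 - phi ** 2), size=n_mc)    # x_1 ~ pi_theta
    # log g_theta(x_1, y_1): Y | x ~ N(0, beta^2 * exp(x))
    log_g = -0.5 * (np.log(2 * np.pi * beta ** 2) + x + (y_block[0] / beta) ** 2 * np.exp(-x))
    for y in y_block[1:]:
        x = phi * x + sigma * rng.normal(size=n_mc)                    # x_k | x_{k-1} ~ f_theta
        log_g += -0.5 * (np.log(2 * np.pi * beta ** 2) + x + (y / beta) ** 2 * np.exp(-x))
    m = log_g.max()
    return m + np.log(np.mean(np.exp(log_g - m)))                      # log of the Monte Carlo average
```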
3 Recursive Algorithms and Implementation
We briefly describe the algorithms in this section; a detailed example is given in the next section. In practice we consider only models for which the joint density $p_\theta(x_{0:n}, y_{1:n})$ belongs to the exponential family, so that one only needs to propagate a set of sufficient statistics. We will denote by $\Sigma$ the set of (typically multivariate) sufficient statistics. All algorithms rely on a non-increasing positive stepsize sequence $\{\gamma_i\}_{i \geq 0}$ satisfying $\sum_i \gamma_i = \infty$ and $\sum_i \gamma_i^2 < \infty$; one usually selects $\gamma_i = i^{-\alpha}$ with $\alpha \in \left(\tfrac{1}{2}, 1\right]$.
3.1 Algorithms
The online EM algorithm proceeds as follows.
Online EM
Initialization: $i = 0$, $\Sigma^{(0)} = 0$ and $\theta^{(0)} \in \Theta$.
Iteration $i$, $i \geq 1$:
$$\Sigma^{(i)} = (1 - \gamma_i)\, \Sigma^{(i-1)} + \gamma_i\, \mathbb{E}_{\theta^{(i-1)}}\!\left( \Sigma\!\left(X_{(i-1)L+1:iL}, Y_{(i-1)L+1:iL}\right) \,\middle|\, Y_{(i-1)L+1:iL} \right),$$
$$\theta^{(i)} = \Lambda\!\left(\Sigma^{(i)}\right),$$
where $\mathbb{E}_{\theta^{(i-1)}}\!\left( \Sigma\!\left(X_{(i-1)L+1:iL}, Y_{(i-1)L+1:iL}\right) \,\middle|\, Y_{(i-1)L+1:iL} \right)$ denotes the expected sufficient statistics associated with the data block $Y_{(i-1)L+1:iL}$ and $\Lambda$ is the mapping between the set of sufficient statistics and the parameter space. The symbol $\mathbb{E}_{\theta^{(i-1)}}\!\left( \cdot \mid Y_{(i-1)L+1:iL} \right)$ denotes the expectation with respect to $p_{\theta^{(i-1)}}\!\left(x_{(i-1)L+1:iL} \mid Y_{(i-1)L+1:iL}\right) \propto p_{\theta^{(i-1)}}\!\left(x_{(i-1)L+1:iL}, Y_{(i-1)L+1:iL}\right)$.
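A compact Python skeleton of this recursion is given below (a sketch under our own interface assumptions, not the authors' code): e_step is assumed to return the expected block sufficient statistics as a NumPy array, and lam implements the mapping $\Lambda$.

```python
import numpy as np

# Skeleton of the online EM recursion; e_step(theta, block) is assumed to return
# E_theta[ Sigma(X_block, Y_block) | Y_block ] as an array, and lam implements the
# mapping Lambda from averaged sufficient statistics to the parameter space.
def online_em(y, L, theta0, e_step, lam, alpha=0.7):
    Sigma, theta, history = 0.0, theta0, []
    for i in range(1, len(y) // L + 1):
        gamma = i ** -alpha                          # stepsize gamma_i = i^{-alpha}, alpha in (1/2, 1]
        block = y[(i - 1) * L: i * L]
        Sigma = (1 - gamma) * Sigma + gamma * np.asarray(e_step(theta, block))
        theta = lam(Sigma)                           # theta^(i) = Lambda(Sigma^(i))
        history.append(theta)
    return history
```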
The online SEM algorithm is a simple variation of online EM which proceeds as follows.
Online SEM
Initialization: $i = 0$, $\Sigma^{(0)} = 0$ and $\theta^{(0)} \in \Theta$.
Iteration $i$, $i \geq 1$:
Sample $X_{(i-1)L+1:iL} \sim p_{\theta^{(i-1)}}\!\left(x_{(i-1)L+1:iL} \mid Y_{(i-1)L+1:iL}\right)$.
$$\Sigma^{(i)} = (1 - \gamma_i)\, \Sigma^{(i-1)} + \gamma_i\, \Sigma\!\left(X_{(i-1)L+1:iL}, Y_{(i-1)L+1:iL}\right),$$
$$\theta^{(i)} = \Lambda\!\left(\Sigma^{(i)}\right).$$
In the SEM algorithm, we replace the expectation term by an unbiased estimate. This has the effect of adding noise to the algorithm and can allow it to escape from a local maximum.
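In code, the SEM variant only changes one line of the skeleton above: the conditional expectation is replaced by the statistics evaluated at a draw from the block smoothing distribution (sample_block and stats are again hypothetical user-supplied callables, with stats returning a NumPy array).

```python
# Online SEM sketch: same recursion as online_em, with the E-step replaced by a draw
# X_block ~ p_theta(. | Y_block); sample_block and stats are user-supplied callables.
def online_sem(y, L, theta0, sample_block, stats, lam, alpha=0.7):
    Sigma, theta, history = 0.0, theta0, []
    for i in range(1, len(y) // L + 1):
        gamma, block = i ** -alpha, y[(i - 1) * L: i * L]
        x_block = sample_block(theta, block)          # unbiased replacement of the E-step
        Sigma = (1 - gamma) * Sigma + gamma * stats(x_block, block)
        theta = lam(Sigma)
        history.append(theta)
    return history
```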
The online DA algorithm is a more recent variant introduced by the authors in [2]. In this case, one sets a prior density $p(\theta)$ on the unknown parameter and defines the following artificial conditional density for the parameter
$$p(\theta \mid x_{1:nL}, Y_{1:nL}) \propto \left[ l(\theta; x_{1:nL}, Y_{1:nL}) \right]^{\lambda_n} p(\theta),$$
where $\{\lambda_n\}_{n \geq 0}$ is the inverse of the temperature and $l(\theta; x_{1:nL}, Y_{1:nL})$ is defined recursively as follows:
$$l(\theta; x_{1:L}, Y_{1:L}) = p_\theta(x_{1:L}, Y_{1:L}),$$
$$l(\theta; x_{1:iL}, Y_{1:iL}) = \left[ l\!\left(\theta; x_{1:(i-1)L}, Y_{1:(i-1)L}\right) \right]^{1-\gamma_i} \left[ p_\theta\!\left(x_{(i-1)L+1:iL}, Y_{(i-1)L+1:iL}\right) \right]^{\gamma_i}.$$
Though it is not entirely obvious, one can show that this algorithm corresponds to a noisy online EM algorithm in the case where $\lambda_n = n$. The additional parameter $\lambda_n$ corresponds to an annealing schedule that can be used to slow down the concentration of $p(\theta \mid x_{1:nL}, Y_{1:nL})$; typically one will choose $\lambda_n = A n^{\eta}$ ($\eta > 0$). For models of interest, $p(\theta \mid x_{1:nL}, Y_{1:nL})$ only depends on a set of sufficient statistics (similar to those of the EM and SEM) and the online DA algorithm proceeds as follows.
Online DA
Initialization: $i = 0$, $\Sigma^{(0)} = 0$ and $\theta^{(0)} \in \Theta$.
Iteration $i$, $i \geq 1$:
Sample $X_{(i-1)L+1:iL} \sim p_{\theta^{(i-1)}}\!\left(x_{(i-1)L+1:iL} \mid Y_{(i-1)L+1:iL}\right)$.
$$\Sigma^{(i)} = (1 - \gamma_i)\, \Sigma^{(i-1)} + \gamma_i\, \Sigma\!\left(X_{(i-1)L+1:iL}, Y_{(i-1)L+1:iL}\right).$$
Sample $\theta^{(i)} \sim p\!\left(\theta \mid \Sigma^{(i)}\right)$.
Remark. The different components of the vector $\theta$ might be updated one at a time using a Gibbs sampling strategy.
Remark. Note that instead of using non-overlapping data blocks $\{Y_{(i-1)L+1:iL}\}_{i \geq 1}$, it is possible to use a sliding window. This enables one, for example, to update the parameter estimate at the data rate [14].
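The online DA recursion admits the same kind of skeleton as SEM; the only change is that the parameter is drawn from the artificial conditional rather than set deterministically. The sketch below is ours, with sample_block, stats and sample_posterior as hypothetical user-supplied callables and $\lambda_i = i^{0.5}$ as a default annealing schedule borrowed from the experiment of Section 4.

```python
# Online DA sketch: identical to online_sem except that theta is sampled from the
# artificial conditional p(theta | Sigma) with inverse temperature lambda_i.
def online_da(y, L, theta0, sample_block, stats, sample_posterior, alpha=0.7):
    Sigma, theta, history = 0.0, theta0, []
    for i in range(1, len(y) // L + 1):
        gamma, block = i ** -alpha, y[(i - 1) * L: i * L]
        x_block = sample_block(theta, block)            # X_block ~ p_theta(. | Y_block)
        Sigma = (1 - gamma) * Sigma + gamma * stats(x_block, block)
        theta = sample_posterior(Sigma, lam=i ** 0.5)   # theta^(i) ~ p(theta | Sigma^(i))
        history.append(theta)
    return history
```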
3.2 Implementation Issues
The algorithms presented above assume that we know how to integrate with respect to $p_\theta\!\left(x_{(i-1)L+1:iL} \mid Y_{(i-1)L+1:iL}\right)$ or to sample from this density. This is typically impossible, but one can perform these integration/sampling steps exactly or approximately using modern simulation techniques, i.e. Markov chain Monte Carlo (MCMC) and SMC. When one uses SMC to estimate this density, the approximation one gets will be reasonable only if $L$ is not too large. Otherwise, one can use the forward filtering backward sampling algorithm proposed in [9] to sample from the joint density based on the particle approximations of the marginal filtering densities $p_\theta\!\left(x_k \mid Y_{(i-1)L+1:k}\right)$. One should keep in mind that, even if one uses this method, the algorithm is still an online algorithm.
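For small $L$, a bootstrap particle filter that stores whole particle paths is already a workable approximation of the block smoothing distribution. The sketch below (our illustration for the stochastic volatility model of Section 4, not the FFBS scheme of [9]) returns one approximate draw from $p_\theta\!\left(x_{(i-1)L+1:iL} \mid Y_{(i-1)L+1:iL}\right)$ and could serve as the sample_block callable in the SEM/DA skeletons above.

```python
import numpy as np

# Bootstrap particle filter sketch for the stochastic volatility model; whole particle
# paths are resampled and stored, which is only sensible for short blocks (small L).
def sample_block_sv(theta, y_block, n_part=500, seed=2):
    phi, beta, sigma = theta
    rng = np.random.default_rng(seed)
    L = len(y_block)
    paths = np.empty((n_part, L))
    x = rng.normal(0.0, sigma / np.sqrt(1.0 - phi ** 2), n_part)   # x ~ invariant density pi_theta
    for k in range(L):
        if k > 0:
            x = phi * x + sigma * rng.normal(size=n_part)          # propagate with f_theta
        paths[:, k] = x
        logw = -0.5 * (x + (y_block[k] / beta) ** 2 * np.exp(-x))  # log g_theta(x, y_k) up to a constant
        w = np.exp(logw - logw.max()); w /= w.sum()
        idx = rng.choice(n_part, size=n_part, p=w)                 # resample whole paths
        paths, x = paths[idx], x[idx]
    return paths[rng.integers(n_part)]                             # one approximate draw from the block smoother
```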
4 Application
Let us consider the following stochastic volatility model arising in finance:
$$X_{n+1} = \phi X_n + \sigma V_{n+1}, \qquad Y_n = \beta \exp\!\left(X_n / 2\right) W_n,$$
where $V_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ and $W_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ are two mutually independent sequences, independent of the initial state $X_0$. We are interested in estimating the parameter $\theta \triangleq (\phi, \beta, \sigma)$. In this case, the stationary distribution of the hidden process is $\mathcal{N}\!\left(0, \tfrac{\sigma^2}{1 - \phi^2}\right)$. One has
$$\log p_\theta(x_{1:L}, Y_{1:L}) = \text{cste} + \frac{1}{2}\log\!\left(1 - \phi^2\right) - L \log \sigma - L \log \beta - \frac{\left(1 - \phi^2\right)}{2\sigma^2}\, x_1^2 - \frac{1}{2\beta^2} \sum_{k=1}^{L} Y_k^2 \exp(-x_k) - \frac{1}{2\sigma^2} \sum_{k=2}^{L} \left(x_k - \phi x_{k-1}\right)^2,$$
so clearly $\Sigma(x_{1:L}, Y_{1:L})$ is given by
$$\left( x_1^2,\; \sum_{k=1}^{L} Y_k^2 \exp(-x_k),\; \sum_{k=1}^{L-1} x_k^2,\; \sum_{k=2}^{L} x_k^2,\; \sum_{k=2}^{L} x_k x_{k-1} \right).$$
If one writes $\Sigma = (\Sigma_1, \ldots, \Sigma_5)$, then the mapping between the sufficient statistics and the parameter space needed by the EM and SEM algorithms is defined as follows: $\beta = \sqrt{\tfrac{1}{L}\Sigma_2}$ and
$$\sigma^2 = \frac{1}{L}\left[ \left(1 - \phi^2\right) \Sigma_1 + \phi^2 \Sigma_3 - 2\phi\, \Sigma_5 + \Sigma_4 \right],$$
$$-\frac{\sigma^2 \phi}{1 - \phi^2} + \left(\Sigma_1 - \Sigma_3\right) \phi + \Sigma_5 = 0.$$
Plugging the expression of $\sigma^2$ into the last equation gives an equation in $\phi$ that can be solved by a simple dichotomic search. For the DA algorithm, we used a rejection technique to sample from $p(\theta \mid \Sigma)$. We present here a numerical experiment on synthetic data. The true parameter values are $\phi = 0.9$, $\beta = 1.0$, $\sigma^2 = 0.1$, and the online DA algorithm was used with $L = 2$ and a temperature $\lambda_n = n^{0.5}$.
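A possible implementation of the mapping $\Lambda$ for this example is sketched below. It is our own reconstruction under the notation above: $\beta$ and $\sigma^2$ follow the closed-form expressions, and the dichotomic search for $\phi$ is carried out with a standard bracketing root-finder on $(-1, 1)$ (the paper does not specify the root-finder).

```python
import numpy as np
from scipy.optimize import brentq

# Sketch of the mapping Lambda for the stochastic volatility example; Sigma holds the
# running averages (Sigma_1, ..., Sigma_5) of the block sufficient statistics.
def lam(Sigma, L):
    S1, S2, S3, S4, S5 = Sigma
    beta = np.sqrt(S2 / L)                                 # closed-form update for beta

    def sigma2(phi):                                       # sigma^2 as a function of phi
        return ((1.0 - phi ** 2) * S1 + phi ** 2 * S3 - 2.0 * phi * S5 + S4) / L

    def score(phi):                                        # first-order condition in phi with sigma^2 plugged in
        return -phi * sigma2(phi) / (1.0 - phi ** 2) + phi * (S1 - S3) + S5

    phi = brentq(score, -0.999, 0.999)                     # dichotomic (bracketing) search on (-1, 1)
    return phi, beta, np.sqrt(sigma2(phi))
```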
Figure 1: Convergence of the estimate of $\phi$.
Figure 2: Convergence of the estimate of $\beta$.
Figure 3: Convergence of the estimate of $\sigma^2$.
5 Discussion
In this paper, we have proposed three original algorithms to perform online parameter estimation in general state-space models. These algorithms are online EM type algorithms. Provided the invariant distribution of the hidden Markov process is known, these algorithms are simple and we have demonstrated their efficiency in practice. An alternative and computationally very efficient strategy which does not require state estimation has recently been proposed in [3].
6 Acknowledgments
This work was initiated while the second author was visiting the Department of Mathematics of Bristol University thanks to an EPSRC visiting fellowship.
7 REFERENCES
[1] Andrieu, C., de Freitas, J.F.G. and Doucet, A. (1999). Sequential MCMC for Bayesian model selection. Proc. IEEE Workshop HOS.
[2] Andrieu, C. and Doucet, A. (2000). Stochastic algorithms for marginal MAP retrieval of sinusoids in non-Gaussian noise. Proc. IEEE Workshop SSAP.
[3] Andrieu, C., Doucet, A. and Tadić, V.B. (2003). Online sampling for parameter estimation in general state space models. Proc. Workshop SYSID, Rotterdam.
[4] Doucet, A., Godsill, S.J. and Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statist. Comput., 10, 197-208.
[5] Doucet, A., de Freitas, J.F.G. and Gordon, N.J. (eds.) (2001). Sequential Monte Carlo Methods in Practice. New York: Springer-Verlag.
[6] Doucet, A. and Tadić, V.B. (2003). Parameter estimation in general state-space models using particle methods. Ann. Inst. Stat. Math., to appear.
[7] Elliott, R.J. and Krishnamurthy, V. (1999). New finite dimensional filters for estimation of linear Gauss-Markov models. IEEE Trans. Auto. Control, 44(5), 938-951.
[8] Gilks, W.R. and Berzuini, C. (2001). Following a moving target - Monte Carlo inference for dynamic Bayesian models. J. R. Statist. Soc. B, 63, 127-146.
[9] Godsill, S.J., Doucet, A. and West, M. (2000). Monte Carlo smoothing for nonlinear time series. Technical Report TR00-01, Institute of Statistics and Decision Sciences, Duke University.
[10] Higuchi, T. (1997). Monte Carlo filter using the genetic algorithm operators. J. Statist. Comp. Simul., 59, 1-23.
[11] Kitagawa, G. (1998). A self-organizing state-space model. J. Am. Statist. Ass., 93, 1203-1215.
[12] LeGland, F. and Mevel, L. (1997). Recursive identification in hidden Markov models. Proc. 36th IEEE Conf. Decision and Control, 3468-3473.
[13] Liu, J. and West, M. (2001). Combined parameter and state estimation in simulation-based filtering. In Sequential Monte Carlo Methods in Practice (Doucet, A., de Freitas, J.F.G. and Gordon, N.J., eds.). New York: Springer-Verlag.
[14] Rydén, T. (1994). Consistent and asymptotically normal parameter estimates for hidden Markov models. Annals of Statistics, 22, 1884-1895.
[15] Rydén, T. (1997). On recursive estimation for hidden Markov models. Stochastic Processes and their Applications, 66, 79-96.
[16] Storvik, G. (2002). Particle filters in state space models with the presence of unknown static parameters. IEEE Trans. Signal Processing, 50, 281-289.