
Lecture 8: Bayesian Estimation of Parameters in State Space Models

Simo Särkkä

March 30, 2016



Contents

1 Bayesian estimation of parameters in state space models

2 Computational methods for parameter estimation

3 Practical parameter estimation in state space models

4 Summary



Batch Bayesian estimation of parameters

State space model with unknown parameters $\theta \in \mathbb{R}^d$:

$$\begin{aligned}
\theta &\sim p(\theta) \\
x_0 &\sim p(x_0 \mid \theta) \\
x_k &\sim p(x_k \mid x_{k-1}, \theta) \\
y_k &\sim p(y_k \mid x_k, \theta).
\end{aligned}$$

The full posterior, in principle, can be computed as

$$ p(x_{0:T}, \theta \mid y_{1:T}) = \frac{p(y_{1:T} \mid x_{0:T}, \theta)\, p(x_{0:T} \mid \theta)\, p(\theta)}{p(y_{1:T})}. $$

The marginal posterior of parameters is then


$$ p(\theta \mid y_{1:T}) = \int p(x_{0:T}, \theta \mid y_{1:T}) \, dx_{0:T}. $$



Batch Bayesian estimation of parameters (cont.)

Advantages:
A simple static Bayesian model.
Any numerical method (e.g., MCMC) can be used to attack the model.
Disadvantages:
We are not utilizing the Markov structure of the model.
The dimensionality is huge, which makes the computation very challenging.
It is hard to utilize the approximations already developed for filters and smoothers.
It requires computing a high-dimensional integral over the state trajectories.
For computational reasons, we will instead take a route based on filtering and smoothing.



Filtering-based Bayesian estimation of parameters [1/3]

Directly approximate the marginal posterior distribution:

$$ p(\theta \mid y_{1:T}) \propto p(y_{1:T} \mid \theta)\, p(\theta). $$

The key is the prediction error decomposition:


$$ p(y_{1:T} \mid \theta) = \prod_{k=1}^{T} p(y_k \mid y_{1:k-1}, \theta). $$

Luckily, the Bayesian filtering equations allow us to compute p(y_k | y_{1:k-1}, θ) efficiently.



Filtering-based Bayesian estimation of parameters [2/3]
Recall that the prediction step of the Bayesian filtering equations computes

$$ p(x_k \mid y_{1:k-1}, \theta). $$

Using the conditional independence of the measurements we get:

$$\begin{aligned}
p(y_k, x_k \mid y_{1:k-1}, \theta) &= p(y_k \mid x_k, y_{1:k-1}, \theta)\, p(x_k \mid y_{1:k-1}, \theta) \\
&= p(y_k \mid x_k, \theta)\, p(x_k \mid y_{1:k-1}, \theta).
\end{aligned}$$

Integration over x_k thus gives

$$ p(y_k \mid y_{1:k-1}, \theta) = \int p(y_k \mid x_k, \theta)\, p(x_k \mid y_{1:k-1}, \theta)\, dx_k. $$



Filtering-based Bayesian estimation of parameters [3/3]
Recursion for marginal likelihood of parameters
The marginal likelihood of parameters is given by
$$ p(y_{1:T} \mid \theta) = \prod_{k=1}^{T} p(y_k \mid y_{1:k-1}, \theta), $$

where the terms can be solved via the recursion


$$\begin{aligned}
p(x_k \mid y_{1:k-1}, \theta) &= \int p(x_k \mid x_{k-1}, \theta)\, p(x_{k-1} \mid y_{1:k-1}, \theta)\, dx_{k-1} \\
p(y_k \mid y_{1:k-1}, \theta) &= \int p(y_k \mid x_k, \theta)\, p(x_k \mid y_{1:k-1}, \theta)\, dx_k \\
p(x_k \mid y_{1:k}, \theta) &= \frac{p(y_k \mid x_k, \theta)\, p(x_k \mid y_{1:k-1}, \theta)}{p(y_k \mid y_{1:k-1}, \theta)}.
\end{aligned}$$
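As an illustration, for a finite state space the integrals above become sums and the recursion can be run directly. The following is a minimal sketch, assuming a hypothetical discrete-state model whose transition matrix and emission likelihoods have already been built from θ:

```python
import numpy as np

def log_marginal_likelihood(prior, trans, emit_lik):
    """Run the filtering recursion for a finite-state model and accumulate
    log p(y_{1:T} | theta) via the prediction error decomposition.

    prior    : (S,)   p(x_0 | theta)
    trans    : (S, S) trans[i, j] = p(x_k = j | x_{k-1} = i, theta)
    emit_lik : (T, S) emit_lik[k, j] = p(y_{k+1} | x_{k+1} = j, theta)
    """
    filt = prior
    log_lik = 0.0
    for k in range(emit_lik.shape[0]):
        pred = filt @ trans                # p(x_k | y_{1:k-1}, theta)
        joint = emit_lik[k] * pred         # p(y_k, x_k | y_{1:k-1}, theta)
        pred_lik = joint.sum()             # p(y_k | y_{1:k-1}, theta)
        log_lik += np.log(pred_lik)
        filt = joint / pred_lik            # p(x_k | y_{1:k}, theta)
    return log_lik
```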



Energy function

Once we have the likelihood p(y_{1:T} | θ), we can compute the posterior via

$$ p(\theta \mid y_{1:T}) = \frac{p(y_{1:T} \mid \theta)\, p(\theta)}{\int p(y_{1:T} \mid \theta)\, p(\theta)\, d\theta}. $$

The normalization constant in the denominator is irrelevant, and it is often more convenient to work with

$$ \tilde{p}(\theta \mid y_{1:T}) = p(y_{1:T} \mid \theta)\, p(\theta). $$

For numerical reasons it is better to work with the logarithm of the above unnormalized distribution.
The negative logarithm is the energy function:

$$ \varphi_T(\theta) = -\log p(y_{1:T} \mid \theta) - \log p(\theta). $$



Energy function (cont.)
The posterior distribution can be recovered via

$$ p(\theta \mid y_{1:T}) \propto \exp(-\varphi_T(\theta)). $$

ϕ_T(θ) is called the energy function because, in physics, the expression above corresponds to the probability density of a system with energy ϕ_T(θ).
The energy function can be evaluated recursively as follows:
Start from ϕ_0(θ) = −log p(θ).
At each step k = 1, 2, . . . , T compute

$$ \varphi_k(\theta) = \varphi_{k-1}(\theta) - \log p(y_k \mid y_{1:k-1}, \theta). $$

For linear models, we can evaluate the energy function exactly with the help of the Kalman filter.
In non-linear models we can use Gaussian filters or particle filters to approximate the energy function.
Maximum a posteriori approximations

The maximum a posteriori (MAP) estimate:

$$ \hat{\theta}^{\text{MAP}} = \arg\max_{\theta} \, p(\theta \mid y_{1:T}). $$

Can be equivalently computed as

$$ \hat{\theta}^{\text{MAP}} = \arg\min_{\theta} \, \varphi_T(\theta). $$

The maximum likelihood (ML) estimate of the parameter is a MAP estimate with a formally uniform prior p(θ) ∝ 1.
The minimum (or maximum) can be found by using various gradient-free or gradient-based optimization methods.
Gradients can be computed by recursive equations called sensitivity equations, or sometimes by using Fisher's identity.
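For instance, given an energy-function evaluator (such as the Kalman-filter-based one later in this lecture), a MAP estimate can be searched for with a generic optimizer. A minimal sketch using SciPy, where `energy` is an assumed user-supplied function of the parameter vector:

```python
import numpy as np
from scipy.optimize import minimize

def map_estimate(energy, theta0):
    """Find a (local) MAP estimate by minimizing the energy function
    phi_T(theta) = -log p(y_{1:T} | theta) - log p(theta)."""
    res = minimize(energy, theta0, method="Nelder-Mead")  # gradient-free optimizer
    return res.x

# usage (illustrative): theta_map = map_estimate(energy, np.array([1.0, 0.5]))
```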



Laplace approximations

The MAP estimate corresponds to a Dirac delta function approximation to the posterior distribution:

$$ p(\theta \mid y_{1:T}) \approx \delta(\theta - \hat{\theta}^{\text{MAP}}), $$

which ignores the spread of the distribution completely.


The idea of the Laplace approximation is to form a Gaussian approximation to the posterior distribution:

$$ p(\theta \mid y_{1:T}) \approx \mathrm{N}\big(\theta \mid \hat{\theta}^{\text{MAP}},\, [H(\hat{\theta}^{\text{MAP}})]^{-1}\big), $$

where H(θ̂^MAP) is the Hessian matrix of the energy function evaluated at the MAP estimate.
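A minimal sketch of forming the Laplace approximation numerically, assuming an `energy` function and a MAP estimate `theta_map` as in the previous sketch; the Hessian is approximated with central finite differences:

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-5):
    """Central finite-difference approximation of the Hessian of f at x."""
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

# Laplace approximation: p(theta | y_{1:T}) ~ N(theta_map, inv(H))
# H = numerical_hessian(energy, theta_map)
# cov = np.linalg.inv(H)
```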



Markov chain Monte Carlo (MCMC)

Markov chain Monte Carlo (MCMC) methods are algorithms for drawing samples from p(θ | y_{1:T}).
They are based on simulating a Markov chain which has the distribution p(θ | y_{1:T}) as its stationary distribution.
The Metropolis–Hastings (MH) algorithm uses a proposal density q(θ^(i) | θ^(i−1)) for suggesting new samples θ^(i) given the previous ones θ^(i−1).
Gibbs sampling samples the components of the parameter vector one at a time from their conditional distributions given the other parameters.
Adaptive MCMC methods adapt the proposal density q(θ^(i) | θ^(i−1)) based on past samples.
The Hamiltonian Monte Carlo (HMC), or hybrid Monte Carlo, method simulates a physical system to construct an efficient proposal distribution.
Metropolis–Hastings

Metropolis–Hastings
Draw the starting point θ^(0) from an arbitrary initial distribution.
For i = 1, 2, . . . , N do:
1 Sample a candidate point θ* ∼ q(θ* | θ^(i−1)).
2 Evaluate the acceptance probability

$$ \alpha_i = \min\left\{ 1,\; \exp\!\big(\varphi_T(\theta^{(i-1)}) - \varphi_T(\theta^{*})\big)\, \frac{q(\theta^{(i-1)} \mid \theta^{*})}{q(\theta^{*} \mid \theta^{(i-1)})} \right\}. $$

3 Generate a uniform random variable u ∼ U(0, 1) and set

$$ \theta^{(i)} = \begin{cases} \theta^{*}, & \text{if } u \le \alpha_i \\ \theta^{(i-1)}, & \text{otherwise.} \end{cases} $$
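A minimal random-walk Metropolis sketch built on the energy function; with a symmetric Gaussian proposal the ratio q(θ^(i−1) | θ*)/q(θ* | θ^(i−1)) cancels. The step size and the `energy` function are assumptions supplied by the user:

```python
import numpy as np

def random_walk_metropolis(energy, theta0, n_samples=5000, step=0.1, rng=None):
    """Random-walk Metropolis sampling of p(theta | y_{1:T}) ∝ exp(-energy(theta))."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    phi = energy(theta)
    chain = []
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.shape)
        phi_prop = energy(proposal)
        # Symmetric proposal: acceptance probability is min(1, exp(phi_old - phi_new))
        if rng.random() < np.exp(min(0.0, phi - phi_prop)):
            theta, phi = proposal, phi_prop
        chain.append(theta.copy())
    return np.array(chain)
```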



Expectation–maximization (EM) algorithm [1/5]

Expectation–maximization (EM) is an algorithm for computing ML and MAP estimates of parameters when direct optimization is not feasible.
Let q(x_{0:T}) be an arbitrary probability density over the states; then we have the inequality

$$ \log p(y_{1:T} \mid \theta) \ge \mathcal{F}[q(x_{0:T}), \theta], $$

where the functional F is defined as

$$ \mathcal{F}[q(x_{0:T}), \theta] = \int q(x_{0:T}) \log \frac{p(x_{0:T}, y_{1:T} \mid \theta)}{q(x_{0:T})} \, dx_{0:T}. $$

Idea of EM: we can maximize the likelihood by iteratively maximizing the lower bound F[q(x_{0:T}), θ].



Expectation–maximization (EM) algorithm [2/5]

Abstract EM
The maximization of the lower bound can be done by coordinate ascent as follows:
1 Start from initial guesses q^(0), θ^(0).
2 For n = 0, 1, 2, . . . do the following steps:
1 E-step: find q^(n+1) = arg max_q F[q, θ^(n)].
2 M-step: find θ^(n+1) = arg max_θ F[q^(n+1), θ].

To implement the EM algorithm we need to be able to do the maximizations in practice.
Fortunately, it can be shown that

$$ q^{(n+1)}(x_{0:T}) = p(x_{0:T} \mid y_{1:T}, \theta^{(n)}). $$



Expectation–maximization (EM) algorithm [3/5]

We now get

$$\begin{aligned}
\mathcal{F}[q^{(n+1)}(x_{0:T}), \theta]
&= \int p(x_{0:T} \mid y_{1:T}, \theta^{(n)}) \log p(x_{0:T}, y_{1:T} \mid \theta) \, dx_{0:T} \\
&\quad - \int p(x_{0:T} \mid y_{1:T}, \theta^{(n)}) \log p(x_{0:T} \mid y_{1:T}, \theta^{(n)}) \, dx_{0:T}.
\end{aligned}$$

Because the latter term does not depend on θ, maximizing F[q^(n+1), θ] is equivalent to maximizing

$$ \mathcal{Q}(\theta, \theta^{(n)}) = \int p(x_{0:T} \mid y_{1:T}, \theta^{(n)}) \log p(x_{0:T}, y_{1:T} \mid \theta) \, dx_{0:T}. $$



Expectation–maximization (EM) algorithm [4/5]

EM algorithm
The EM algorithm consists of the following steps:
1 Start from an initial guess θ^(0).
2 For n = 0, 1, 2, . . . do the following steps:
1 E-step: compute Q(θ, θ^(n)).
2 M-step: compute θ^(n+1) = arg max_θ Q(θ, θ^(n)).

In state space models we have

$$ \log p(x_{0:T}, y_{1:T} \mid \theta) = \log p(x_0 \mid \theta) + \sum_{k=1}^{T} \log p(x_k \mid x_{k-1}, \theta) + \sum_{k=1}^{T} \log p(y_k \mid x_k, \theta). $$



Expectation–maximization (EM) algorithm [5/5]
Thus on E-step we compute

$$\begin{aligned}
\mathcal{Q}(\theta, \theta^{(n)}) &= \int p(x_0 \mid y_{1:T}, \theta^{(n)}) \log p(x_0 \mid \theta) \, dx_0 \\
&\quad + \sum_{k=1}^{T} \int p(x_k, x_{k-1} \mid y_{1:T}, \theta^{(n)}) \log p(x_k \mid x_{k-1}, \theta) \, dx_k \, dx_{k-1} \\
&\quad + \sum_{k=1}^{T} \int p(x_k \mid y_{1:T}, \theta^{(n)}) \log p(y_k \mid x_k, \theta) \, dx_k.
\end{aligned}$$

In linear models, these terms can be computed from the RTS smoother results.
In non-Gaussian models we can approximate them using Gaussian RTS smoothers or particle smoothers.
On the M-step we maximize Q(θ, θ^(n)) with respect to θ.
State augmentation
Consider a model of the form

$$\begin{aligned}
x_k &= f(x_{k-1}, \theta) + q_{k-1} \\
y_k &= h(x_k, \theta) + r_k
\end{aligned}$$

We can now rewrite the model as

$$\begin{aligned}
\theta_k &= \theta_{k-1} \\
x_k &= f(x_{k-1}, \theta_{k-1}) + q_{k-1} \\
y_k &= h(x_k, \theta_k) + r_k
\end{aligned}$$

Redefining the state as x̃_k = (x_k, θ_k) leads to the augmented model without unknown parameters:

$$\begin{aligned}
\tilde{x}_k &= \tilde{f}(\tilde{x}_{k-1}) + \tilde{q}_{k-1} \\
y_k &= h(\tilde{x}_k) + r_k
\end{aligned}$$

This is called the state augmentation approach.
The disadvantage is the severe non-linearity and singularity of the augmented model.
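A minimal sketch of the augmented dynamics as it might be wired into a filter implementation, assuming the augmented state is stacked as x̃ = (x, θ) with a known state dimension `dim_x`:

```python
import numpy as np

def f_aug(x_aug, f, dim_x):
    """Augmented dynamics: the state part evolves through f, the parameters stay constant."""
    x, theta = x_aug[:dim_x], x_aug[dim_x:]
    return np.concatenate([f(x, theta), theta])

def h_aug(x_aug, h, dim_x):
    """Augmented measurement model."""
    x, theta = x_aug[:dim_x], x_aug[dim_x:]
    return h(x, theta)
```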
Energy function for linear Gaussian models [1/3]
Consider the following linear Gaussian model with unknown parameters θ:

$$\begin{aligned}
x_k &= A(\theta)\, x_{k-1} + q_{k-1} \\
y_k &= H(\theta)\, x_k + r_k
\end{aligned}$$

Recall that the Kalman filter gives us the Gaussian predictive distribution

$$ p(x_k \mid y_{1:k-1}, \theta) = \mathrm{N}(x_k \mid m_k^{-}(\theta), P_k^{-}(\theta)). $$

Thus we get

$$\begin{aligned}
p(y_k \mid y_{1:k-1}, \theta)
&= \int \mathrm{N}(y_k \mid H(\theta)\, x_k, R(\theta))\, \mathrm{N}(x_k \mid m_k^{-}(\theta), P_k^{-}(\theta)) \, dx_k \\
&= \mathrm{N}\big(y_k \mid H(\theta)\, m_k^{-}(\theta),\; H(\theta)\, P_k^{-}(\theta)\, H^{\mathsf{T}}(\theta) + R(\theta)\big).
\end{aligned}$$



Energy function for linear Gaussian models [2/3]

Energy function for linear Gaussian model


The recursion for the energy function is given as

$$ \varphi_k(\theta) = \varphi_{k-1}(\theta) + \tfrac{1}{2} \log |2\pi\, S_k(\theta)| + \tfrac{1}{2}\, v_k^{\mathsf{T}}(\theta)\, S_k^{-1}(\theta)\, v_k(\theta), $$

where the terms v_k(θ) and S_k(θ) are given by the Kalman filter with the parameters fixed to θ:
Prediction:

$$\begin{aligned}
m_k^{-}(\theta) &= A(\theta)\, m_{k-1}(\theta) \\
P_k^{-}(\theta) &= A(\theta)\, P_{k-1}(\theta)\, A^{\mathsf{T}}(\theta) + Q(\theta).
\end{aligned}$$

(continues . . . )



Energy function for linear Gaussian models [3/3]

Energy function for linear Gaussian model (cont.)


(. . . continues)
Update:

$$\begin{aligned}
v_k(\theta) &= y_k - H(\theta)\, m_k^{-}(\theta) \\
S_k(\theta) &= H(\theta)\, P_k^{-}(\theta)\, H^{\mathsf{T}}(\theta) + R(\theta) \\
K_k(\theta) &= P_k^{-}(\theta)\, H^{\mathsf{T}}(\theta)\, S_k^{-1}(\theta) \\
m_k(\theta) &= m_k^{-}(\theta) + K_k(\theta)\, v_k(\theta) \\
P_k(\theta) &= P_k^{-}(\theta) - K_k(\theta)\, S_k(\theta)\, K_k^{\mathsf{T}}(\theta).
\end{aligned}$$



EM algorithm for linear Gaussian models

The expression for Q for the linear Gaussian models can be written as

$$\begin{aligned}
\mathcal{Q}(\theta, \theta^{(n)})
&= -\tfrac{1}{2} \log |2\pi\, P_0(\theta)| - \tfrac{T}{2} \log |2\pi\, Q(\theta)| - \tfrac{T}{2} \log |2\pi\, R(\theta)| \\
&\quad - \tfrac{1}{2} \operatorname{tr}\!\Big\{ P_0^{-1}(\theta)\, \big[ P_0^{s} + (m_0^{s} - m_0(\theta))\,(m_0^{s} - m_0(\theta))^{\mathsf{T}} \big] \Big\} \\
&\quad - \tfrac{T}{2} \operatorname{tr}\!\Big\{ Q^{-1}(\theta)\, \big[ \Sigma - C\, A^{\mathsf{T}}(\theta) - A(\theta)\, C^{\mathsf{T}} + A(\theta)\, \Phi\, A^{\mathsf{T}}(\theta) \big] \Big\} \\
&\quad - \tfrac{T}{2} \operatorname{tr}\!\Big\{ R^{-1}(\theta)\, \big[ D - B\, H^{\mathsf{T}}(\theta) - H(\theta)\, B^{\mathsf{T}} + H(\theta)\, \Sigma\, H^{\mathsf{T}}(\theta) \big] \Big\},
\end{aligned}$$

where the quantities computed from the RTS smoother results are

$$\begin{aligned}
\Sigma &= \frac{1}{T} \sum_{k=1}^{T} \big[ P_k^{s} + m_k^{s}\, [m_k^{s}]^{\mathsf{T}} \big], &
\Phi &= \frac{1}{T} \sum_{k=1}^{T} \big[ P_{k-1}^{s} + m_{k-1}^{s}\, [m_{k-1}^{s}]^{\mathsf{T}} \big], \\
B &= \frac{1}{T} \sum_{k=1}^{T} y_k\, [m_k^{s}]^{\mathsf{T}}, &
C &= \frac{1}{T} \sum_{k=1}^{T} \big[ P_k^{s}\, G_{k-1}^{\mathsf{T}} + m_k^{s}\, [m_{k-1}^{s}]^{\mathsf{T}} \big], \\
D &= \frac{1}{T} \sum_{k=1}^{T} y_k\, y_k^{\mathsf{T}}.
\end{aligned}$$



EM algorithm for linear Gaussian models (cont.)

If θ ∈ {A, H, Q, R, P_0, m_0}, we can maximize Q analytically by setting the derivatives to zero.
This leads to an iterative algorithm: run the RTS smoother, recompute the estimates, run the RTS smoother again, recompute the estimates, and so on.
The parameters to be estimated should be identifiable for the ML/MAP estimate to make sense: for example, we cannot hope to blindly estimate all the model matrices.
EM is only an algorithm for computing ML (or MAP) estimates.
Direct energy function optimization often converges faster than EM and should be preferred in that sense.
If an RTS smoother implementation is available, EM is sometimes easier to implement.
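A minimal sketch of computing the smoother-based quantities Σ, Φ, B, C, D above from RTS smoother output (the array layout is an assumption), followed by the closed-form A and Q updates one obtains by zeroing the corresponding derivatives of Q:

```python
import numpy as np

def em_sufficient_stats(ms, Ps, G, y):
    """Sufficient statistics for the linear-Gaussian EM Q-function.

    ms : (T+1, n)    smoothed means m_0^s ... m_T^s
    Ps : (T+1, n, n) smoothed covariances P_0^s ... P_T^s
    G  : (T,   n, n) RTS smoother gains G_0 ... G_{T-1}
    y  : (T,   d)    measurements y_1 ... y_T
    """
    T = y.shape[0]
    Sigma = sum(Ps[k] + np.outer(ms[k], ms[k]) for k in range(1, T + 1)) / T
    Phi = sum(Ps[k - 1] + np.outer(ms[k - 1], ms[k - 1]) for k in range(1, T + 1)) / T
    B = sum(np.outer(y[k - 1], ms[k]) for k in range(1, T + 1)) / T
    C = sum(Ps[k] @ G[k - 1].T + np.outer(ms[k], ms[k - 1]) for k in range(1, T + 1)) / T
    D = sum(np.outer(y[k - 1], y[k - 1]) for k in range(1, T + 1)) / T
    return Sigma, Phi, B, C, D

# Sketch of the M-step when A and Q are the free parameters
# (obtained by setting the gradients of Q to zero):
# A_new = C @ np.linalg.inv(Phi)
# Q_new = Sigma - C @ A_new.T - A_new @ C.T + A_new @ Phi @ A_new.T
```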



Gaussian filtering based energy function approximation

Let's consider parameter estimation in non-linear models of the form

$$\begin{aligned}
x_k &= f(x_{k-1}, \theta) + q_{k-1} \\
y_k &= h(x_k, \theta) + r_k
\end{aligned}$$

We can now approximate the energy function by replacing the Kalman filter with a Gaussian filter.
The approximate energy function recursion becomes

$$ \varphi_k(\theta) \approx \varphi_{k-1}(\theta) + \tfrac{1}{2} \log |2\pi\, S_k(\theta)| + \tfrac{1}{2}\, v_k^{\mathsf{T}}(\theta)\, S_k^{-1}(\theta)\, v_k(\theta), $$

where the terms v_k(θ) and S_k(θ) are given by a Gaussian filter with the parameters fixed to θ.
Gaussian smoothing based EM algorithm
The approximation to the Q function can now be written as

$$\begin{aligned}
\mathcal{Q}(\theta, \theta^{(n)})
&\approx -\tfrac{1}{2} \log |2\pi\, P_0(\theta)| - \tfrac{T}{2} \log |2\pi\, Q(\theta)| - \tfrac{T}{2} \log |2\pi\, R(\theta)| \\
&\quad - \tfrac{1}{2} \operatorname{tr}\!\Big\{ P_0^{-1}(\theta)\, \big[ P_0^{s} + (m_0^{s} - m_0(\theta))\,(m_0^{s} - m_0(\theta))^{\mathsf{T}} \big] \Big\} \\
&\quad - \tfrac{1}{2} \sum_{k=1}^{T} \operatorname{tr}\!\Big\{ Q^{-1}(\theta)\, \mathrm{E}\big[ (x_k - f(x_{k-1}, \theta))\,(x_k - f(x_{k-1}, \theta))^{\mathsf{T}} \mid y_{1:T} \big] \Big\} \\
&\quad - \tfrac{1}{2} \sum_{k=1}^{T} \operatorname{tr}\!\Big\{ R^{-1}(\theta)\, \mathrm{E}\big[ (y_k - h(x_k, \theta))\,(y_k - h(x_k, \theta))^{\mathsf{T}} \mid y_{1:T} \big] \Big\},
\end{aligned}$$

where the expectations can be computed using the Gaussian RTS smoother results.
Particle filtering approximation of energy function [1/3]
In the particle filtering approach we can consider generic models of the form

$$\begin{aligned}
\theta &\sim p(\theta) \\
x_0 &\sim p(x_0 \mid \theta) \\
x_k &\sim p(x_k \mid x_{k-1}, \theta) \\
y_k &\sim p(y_k \mid x_k, \theta).
\end{aligned}$$

Using particle filter results, we can form an importance sampling approximation as follows:

$$ p(y_k \mid y_{1:k-1}, \theta) \approx \sum_i w_{k-1}^{(i)}\, v_k^{(i)}, $$

where

$$ v_k^{(i)} = \frac{p(y_k \mid x_k^{(i)}, \theta)\, p(x_k^{(i)} \mid x_{k-1}^{(i)}, \theta)}{\pi(x_k^{(i)} \mid x_{k-1}^{(i)}, y_{1:k})} $$

and w_{k-1}^{(i)} are the previous particle filter weights.
Particle filtering approximation of energy function [2/3]

SIR based energy function approximation


1 Draw samples x_k^(i) from the importance distributions

$$ x_k^{(i)} \sim \pi(x_k \mid x_{k-1}^{(i)}, y_{1:k}), \quad i = 1, \ldots, N. $$

2 Compute the following weights

$$ v_k^{(i)} = \frac{p(y_k \mid x_k^{(i)}, \theta)\, p(x_k^{(i)} \mid x_{k-1}^{(i)}, \theta)}{\pi(x_k^{(i)} \mid x_{k-1}^{(i)}, y_{1:k})} $$

and compute the estimate of p(y_k | y_{1:k-1}, θ) as

$$ \hat{p}(y_k \mid y_{1:k-1}, \theta) = \sum_i w_{k-1}^{(i)}\, v_k^{(i)}. $$
i



Particle filtering approximation of energy function [3/3]
SIR based energy function approximation (cont.)
3 Compute the normalized weights as

$$ w_k^{(i)} \propto w_{k-1}^{(i)}\, v_k^{(i)}. $$

4 If the effective number of particles is too low, perform resampling.

The approximation of the marginal likelihood of the parameters is

$$ p(y_{1:T} \mid \theta) \approx \prod_k \hat{p}(y_k \mid y_{1:k-1}, \theta), $$

and the corresponding energy function approximation is

$$ \varphi_T(\theta) \approx -\log p(\theta) - \sum_{k=1}^{T} \log \hat{p}(y_k \mid y_{1:k-1}, \theta). $$
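A minimal bootstrap particle filter sketch of this likelihood approximation, where the dynamic model itself is used as the importance distribution (so v_k^(i) reduces to the measurement likelihood); `sample_prior`, `sample_dynamics`, and `meas_loglik` are assumed user-supplied functions of θ:

```python
import numpy as np

def pf_log_likelihood(theta, y, sample_prior, sample_dynamics, meas_loglik, N=500, rng=None):
    """Bootstrap particle filter estimate of log p(y_{1:T} | theta)."""
    rng = np.random.default_rng() if rng is None else rng
    x = sample_prior(theta, N, rng)            # (N, n) particles from p(x_0 | theta)
    log_lik = 0.0
    for yk in y:
        x = sample_dynamics(x, theta, rng)     # propagate through p(x_k | x_{k-1}, theta)
        logw = meas_loglik(yk, x, theta)       # (N,) log p(y_k | x_k^{(i)}, theta)
        c = logw.max()
        w = np.exp(logw - c)
        log_lik += c + np.log(w.mean())        # log hat{p}(y_k | y_{1:k-1}, theta)
        # Multinomial resampling keeps the weights uniform for the next step
        idx = rng.choice(N, size=N, p=w / w.sum())
        x = x[idx]
    return log_lik
```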



Particle Markov chain Monte Carlo (PMCMC)

The particle filter based energy function approximation can now be used in a Metropolis–Hastings based MCMC algorithm.
With finite N, the likelihood is only an approximation, and thus we would expect the algorithm to be only approximate.
Surprisingly, it turns out that this algorithm is an exact MCMC algorithm even with finite N.
The resulting algorithm is called the particle Markov chain Monte Carlo (PMCMC) method.
Computing ML and MAP estimates via the particle filter approximation is problematic, because resampling causes discontinuities in the likelihood approximation.



Particle smoothing based EM algorithm
Recall that on the E-step of the EM algorithm we need to compute

$$ \mathcal{Q}(\theta, \theta^{(n)}) = I_1(\theta, \theta^{(n)}) + I_2(\theta, \theta^{(n)}) + I_3(\theta, \theta^{(n)}), $$

where

$$\begin{aligned}
I_1(\theta, \theta^{(n)}) &= \int p(x_0 \mid y_{1:T}, \theta^{(n)}) \log p(x_0 \mid \theta) \, dx_0 \\
I_2(\theta, \theta^{(n)}) &= \sum_{k=1}^{T} \int p(x_k, x_{k-1} \mid y_{1:T}, \theta^{(n)}) \log p(x_k \mid x_{k-1}, \theta) \, dx_k \, dx_{k-1} \\
I_3(\theta, \theta^{(n)}) &= \sum_{k=1}^{T} \int p(x_k \mid y_{1:T}, \theta^{(n)}) \log p(y_k \mid x_k, \theta) \, dx_k.
\end{aligned}$$

It is also possible to use particle smoothers to approximate the required expectations.
Particle smoothing based EM algorithm (cont.)

For example, by using the backward simulation smoother, we can approximate the expectations as

$$\begin{aligned}
I_1(\theta, \theta^{(n)}) &\approx \frac{1}{S} \sum_{i=1}^{S} \log p(\tilde{x}_0^{(i)} \mid \theta) \\
I_2(\theta, \theta^{(n)}) &\approx \sum_{k=0}^{T-1} \frac{1}{S} \sum_{i=1}^{S} \log p(\tilde{x}_{k+1}^{(i)} \mid \tilde{x}_k^{(i)}, \theta) \\
I_3(\theta, \theta^{(n)}) &\approx \sum_{k=1}^{T} \frac{1}{S} \sum_{i=1}^{S} \log p(y_k \mid \tilde{x}_k^{(i)}, \theta).
\end{aligned}$$
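A minimal Monte Carlo sketch of these three terms, assuming S backward-simulated trajectories stacked into an array and hypothetical log-density callbacks for the model:

```python
import numpy as np

def em_terms_from_backward_simulation(trajs, y, log_p_x0, log_p_trans, log_p_meas, theta):
    """Approximate I1, I2, I3 from backward-simulation trajectories.

    trajs : (S, T+1, n) array, trajs[i, k] is the sample tilde{x}_k^{(i)}
    y     : (T, d) measurements y_1 ... y_T
    """
    S, T_plus_1, _ = trajs.shape
    T = T_plus_1 - 1
    I1 = np.mean([log_p_x0(trajs[i, 0], theta) for i in range(S)])
    I2 = sum(np.mean([log_p_trans(trajs[i, k + 1], trajs[i, k], theta) for i in range(S)])
             for k in range(T))
    I3 = sum(np.mean([log_p_meas(y[k - 1], trajs[i, k], theta) for i in range(S)])
             for k in range(1, T + 1))
    return I1, I2, I3
```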



Summary
The marginal posterior distribution of the parameters can be computed from the results of the Bayesian filter.
Given the marginal posterior, we can, e.g., use optimization methods to compute MAP estimates or sample from the posterior using MCMC methods.
The expectation–maximization (EM) algorithm can also be used for iterative computation of ML or MAP estimates using Bayesian smoother results.
The parameter posterior for linear Gaussian models can be evaluated with the Kalman filter.
The expectations required for implementing the EM algorithm for linear Gaussian models can be evaluated with the RTS smoother.
For non-linear/non-Gaussian models, the parameter posterior and the EM algorithm can be approximated with Gaussian filters/smoothers and particle filters/smoothers.
