Introduction to State Space Models and Sequential Bayesian Inference
Prof. Nicholas Zabaras
Email: [email protected]
URL: https://www.zabaras.com/
INTRODUCING THE STATE SPACE MODEL: Discrete-Time Markov Models, the Tracking
Problem, Speech Enhancement, the Dynamics of an Asset, the State Space Model with
Observations, the Linear Gaussian SSM (LG-SSM), Stochastic Volatility, Bearings-Only
Tracking, Probabilistic Programming and SSMs, Bayesian Inference Tasks for the SSM
J.S. Liu, Monte Carlo Strategies in Scientific Computing, Chapter 3, Springer-Verlag, New York
A. Doucet, N. de Freitas and N. Gordon (eds.), Sequential Monte Carlo Methods in Practice, Springer-Verlag, 2001
A. Doucet, N. de Freitas and N.J. Gordon, An Introduction to Sequential Monte Carlo Methods, in Sequential Monte Carlo Methods in Practice, 2001
J.S. Liu and R. Chen, Sequential Monte Carlo Methods for Dynamic Systems, JASA, 1998
P. Del Moral, Feynman-Kac Models and Interacting Particle Systems (SMC resources)
N. de Freitas and A. Doucet, Sequential Monte Carlo Methods, Video Lectures, 2010
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 4
References
M.K. Pitt and N. Shephard, Filtering via Simulation: Auxiliary Particle Filters, JASA, 1999
A. Doucet, S.J. Godsill and C. Andrieu, On Sequential Monte Carlo Sampling Methods for Bayesian Filtering, Statistics and Computing, 2000
J. Carpenter, P. Clifford and P. Fearnhead, An Improved Particle Filter for Non-linear Problems, IEE Proceedings, 1999
A. Kong, J.S. Liu and W.H. Wong, Sequential Imputations and Bayesian Missing Data Problems, JASA, 1994
O. Cappe, E. Moulines and T. Ryden, Inference in Hidden Markov Models, Springer-Verlag, 2005
W. Gilks and C. Berzuini, Following a Moving Target: Monte Carlo Inference for Dynamic Bayesian Models, JRSS B, 2001
G. Poyadjis, A. Doucet and S.S. Singh, Maximum Likelihood Parameter Estimation using Particle Methods, Joint Statistical Meeting, 2005
N. Gordon, D.J. Salmond and A.F.M. Smith, Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation, IEE Proceedings-F, 1993
R. Chen and J.S. Liu, Predictive Updating Methods with Application to Bayesian Classification, JRSS B, 1996
N. Kantas, A. Doucet, S.S. Singh and J.M. Maciejowski, An Overview of Sequential Monte Carlo Methods for Parameter Estimation in General State-Space Models, Proceedings IFAC System Identification (SySid) Meeting, 2009
C. Andrieu, A. Doucet and R. Holenstein, Particle Markov Chain Monte Carlo Methods (with discussion), JRSS B, 2010
C. Andrieu, N. de Freitas and A. Doucet, Sequential MCMC for Bayesian Model Selection, Proc. IEEE Workshop HOS, 1999
G. Storvik, Particle Filters for State-Space Models in the Presence of Unknown Static Parameters, IEEE Trans. Signal Processing, 2002
G. Poyadjis, A. Doucet and S.S. Singh, Particle Approximations of the Score and Observed Information Matrix in State-Space Models with Application to Parameter Estimation, Biometrika, 2011
C. Caron, R. Gottardo and A. Doucet, On-line Changepoint Detection and Parameter Estimation for Genome Wide Transcript Analysis, Statistics and Computing, 2011
R. Martinez-Cantin, J. Castellanos and N. de Freitas, Analysis of Particle Methods for Simultaneous Robot Localization and Mapping and a New Algorithm: Marginal-SLAM, International Conference on Robotics and Automation
A. Doucet, Sequential Monte Carlo Methods and Particle Filters: List of Papers, Codes, and Video Lectures on SMC and Particle Filters
P. Del Moral, A. Doucet and A. Jasra, Sequential Monte Carlo for Bayesian Computation, Bayesian Statistics, 2006
P. Del Moral, A. Doucet and S.S. Singh, Forward Smoothing using Sequential Monte Carlo, Technical Report, Cambridge University, 2009
A. Doucet and A. Johansen, Particle Filtering and Smoothing: Fifteen Years Later, in Handbook of Nonlinear Filtering (eds. D. Crisan and B. Rozovsky), Oxford University Press, 2011
A. Johansen and A. Doucet, A Note on Auxiliary Particle Filters, Statistics and Probability Letters, 2008
A. Doucet, M. Briers and S. Senecal, Efficient Block Sampling Strategies for Sequential Monte Carlo, JCGS, 2006
F. Lindsten, M.I. Jordan and T.B. Schön, Particle Gibbs with Ancestor Sampling, JMLR, 2014
$$X_n \mid (X_{n-1} = x) \sim f(\cdot \mid x)$$
$$p(\boldsymbol{x}_{1:n}) = p(x_1, \ldots, x_n) = \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1})$$
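As a concrete sketch of drawing a path $x_{1:n}$ from this prior, the snippet below assumes the linear-Gaussian choices $\mu = \mathcal{N}(0, \sigma^2/(1-\phi^2))$ and $f(x_k \mid x_{k-1}) = \mathcal{N}(\phi x_{k-1}, \sigma^2)$; these particular densities (and the helper name) are illustrative assumptions, not part of the general definition.

```python
import numpy as np

def sample_markov_path(n, phi=0.9, sigma=1.0, rng=None):
    """Draw x_{1:n} with x_1 ~ mu and x_k | x_{k-1} ~ f(. | x_{k-1}).

    Illustrative choices: mu = N(0, sigma^2/(1-phi^2)) (the stationary law)
    and f(x_k | x_{k-1}) = N(phi * x_{k-1}, sigma^2).
    """
    rng = np.random.default_rng(rng)
    x = np.empty(n)
    x[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - phi**2))  # x_1 ~ mu
    for k in range(1, n):
        x[k] = phi * x[k - 1] + sigma * rng.normal()       # Markov propagation
    return x

path = sample_markov_path(100, rng=0)
```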
$$A = \begin{pmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_d \\ 1 & & & 0 \\ & \ddots & & \vdots \\ & & 1 & 0 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$
The transition density is now:
$$f_U(u_k \mid u_{k-1}) = \mathcal{N}\!\left(u_{k,1};\, A_k u_{k-1},\, \sigma_s^2\right)\, \delta_{(u_{k-1})_{1:d-1}}\!\left(u_{k,2:d}\right)$$
$$f_\alpha(\alpha_k \mid \alpha_{k-1}) = \mathcal{N}(\alpha_k;\, \alpha_{k-1},\, \sigma^2 I_d)$$
The process $X_k = (\alpha_k, U_k)$ is Markov with transition density
$$f(x_k \mid x_{k-1}) = \mathcal{N}(\alpha_k;\, \alpha_{k-1},\, \sigma^2 I_d)\; \mathcal{N}\!\left(u_{k,1};\, A_k u_{k-1},\, \sigma_s^2\right)\, \delta_{(u_{k-1})_{1:d-1}}\!\left(u_{k,2:d}\right)$$
with
$$A_k u_{k-1} = (\alpha_{k,1}, \ldots, \alpha_{k,d}) \begin{pmatrix} s_{k-1} \\ \vdots \\ s_{k-d} \end{pmatrix}$$
Econometrics
The Heston model (1993) describes the dynamics of an asset price 𝑆𝑡 through a
stochastic volatility model for the log-price 𝑋𝑡 = log(𝑆𝑡).
$$Y_n \mid (X_n = x_n) \sim g(\cdot \mid x_n)$$
The observations $\{y_n\}$ are conditionally independent given the Markov states
$\{x_n\}$. Thus the likelihood is
$$p(y_1, \ldots, y_n \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} g(y_i \mid x_i)$$
$$X_n = A X_{n-1} + B u_n + V_n$$
$$Y_n = C X_n + D u_n + E_n$$
$$\begin{pmatrix} X_0 \\ V_n \\ E_n \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix} \mu \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} P_0 & 0 & 0 \\ 0 & Q & S \\ 0 & S^T & R \end{pmatrix}\right)$$
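A minimal simulation sketch of such an LG-SSM, assuming for simplicity that $S = 0$ (uncorrelated state and observation noise); the function name and the scalar example values are illustrative assumptions.

```python
import numpy as np

def simulate_lgssm(n, A, B, C, D, mu0, P0, Q, R, u=None, rng=None):
    """Simulate X_t = A X_{t-1} + B u_t + V_t, Y_t = C X_t + D u_t + E_t
    with X_0 ~ N(mu0, P0), V_t ~ N(0, Q), E_t ~ N(0, R) (here S = 0)."""
    rng = np.random.default_rng(rng)
    dx, dy = A.shape[0], C.shape[0]
    if u is None:
        u = np.zeros((n, B.shape[1]))         # no exogenous input by default
    X, Y = np.empty((n, dx)), np.empty((n, dy))
    x = rng.multivariate_normal(mu0, P0)      # X_0
    for t in range(n):
        x = A @ x + B @ u[t] + rng.multivariate_normal(np.zeros(dx), Q)
        Y[t] = C @ x + D @ u[t] + rng.multivariate_normal(np.zeros(dy), R)
        X[t] = x
    return X, Y

# Scalar example: a stable AR(1) state observed in Gaussian noise.
A = np.array([[0.95]]); B = np.zeros((1, 1))
C = np.array([[1.0]]);  D = np.zeros((1, 1))
X, Y = simulate_lgssm(50, A, B, C, D, mu0=np.zeros(1), P0=np.eye(1),
                      Q=0.1 * np.eye(1), R=0.5 * np.eye(1), rng=0)
```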
A Stochastic Volatility Model
$$X_1 \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{1-\alpha^2}\right) \quad \text{and} \quad X_n = \alpha X_{n-1} + V_n$$
$$Y_n = \beta \exp\left(X_n/2\right) W_n, \quad \text{where} \quad |\alpha| < 1, \quad V_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2) \quad \text{and} \quad W_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$$
so that
$$f(x_n \mid x_{n-1}) = \mathcal{N}(x_n;\, \alpha x_{n-1},\, \sigma^2), \qquad g(y_n \mid x_n) = \mathcal{N}(y_n;\, 0,\, \beta^2 \exp(x_n))$$
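A short simulation sketch of this stochastic volatility model; the parameter values and the helper name `simulate_sv` are illustrative assumptions.

```python
import numpy as np

def simulate_sv(n, alpha=0.91, sigma=1.0, beta=0.5, rng=None):
    """Simulate X_1 ~ N(0, sigma^2/(1-alpha^2)), X_k = alpha X_{k-1} + V_k,
    Y_k = beta exp(X_k / 2) W_k with V_k ~ N(0, sigma^2), W_k ~ N(0, 1)."""
    rng = np.random.default_rng(rng)
    x = np.empty(n)
    x[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - alpha**2))  # stationary start
    for k in range(1, n):
        x[k] = alpha * x[k - 1] + sigma * rng.normal()       # latent log-volatility
    y = beta * np.exp(x / 2.0) * rng.normal(size=n)          # observation equation
    return x, y

x, y = simulate_sv(200, rng=1)
```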
Probabilistic programming creates a clear separation between the model and the inference
methods and opens up the automation of inference.
L. Murray and T.B. Schön, Automated learning with a probabilistic programming language: Birch, Annual Reviews in Control, 46:29-43, 2018
Z. Ghahramani, Probabilistic machine learning and artificial intelligence, Nature, 521:452-459, 2015
N.D. Goodman and A. Stuhlmüller, The design and implementation of probabilistic programming languages, retrieved 2019-8-29 from http://dippl.org
J.-W. van de Meent et al., An introduction to probabilistic programming, arXiv preprint arXiv:1809.10756, 2018
Bayesian Inference for the SSM
At time 𝑛, we have a total of 𝑛 observations and the target distribution to be
estimated is the posterior 𝑝 𝒙1:𝑛 |𝒚1:𝑛 .
The target distribution is “time-varying”: the posterior must be updated as new
observations arrive, so we need to estimate a sequence of distributions indexed by
time.
$$y_1, y_2, \ldots, y_n \quad \text{(observations)}$$
$$x_1, x_2, \ldots, x_n \quad \text{(Markov chain)}$$
$$\pi_n(\boldsymbol{x}_{1:n}) \equiv p(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n}) = \frac{\gamma_n(\boldsymbol{x}_{1:n})}{Z_n}, \qquad \gamma_n(\boldsymbol{x}_{1:n}) = p(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n}), \qquad Z_n = p(\boldsymbol{y}_{1:n})$$
The posterior and marginal likelihood do not admit closed forms unless the state space
is finite or the model is linear Gaussian.
The posterior mean (minimum mean square estimate) can also be estimated as:
$$\mathbb{E}\left[X_k \mid \boldsymbol{y}_{1:n}\right] = \int x_k\, p(x_k \mid \boldsymbol{y}_{1:n})\, dx_k$$
Let 𝑇 be the time at which the particle is killed. We want to compute the
probability Pr(𝑇 > 𝑛).
At time 1, the distribution of the particle conditioned on survival is
$$\pi_1(x_1) = \frac{\mu(x_1)\, g(x_1)}{\int \mu(x_1)\, g(x_1)\, dx_1}$$
By integration over the state variables $x_k$, we obtain the probability for the
particle to survive up to time $t = n$:
$$\Pr(T > n) = \int \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1}) \prod_{k=1}^{n} g(x_k)\, d\boldsymbol{x}_{1:n}$$
To place this calculation in our SMC framework, we define the following:
$$\gamma_n(\boldsymbol{x}_{1:n}) = \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1}) \prod_{k=1}^{n} g(x_k)$$
Then the integration needed to compute the required probability is just the
normalization constant of 𝛾𝑛 𝒙1:𝑛 , i.e.
$$Z_n = \int \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1}) \prod_{k=1}^{n} g(x_k)\, d\boldsymbol{x}_{1:n}$$
and
$$\pi_n(\boldsymbol{x}_{1:n}) = \frac{\gamma_n(\boldsymbol{x}_{1:n})}{Z_n}, \qquad Z_n = \Pr(T > n)$$
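Before introducing SMC, $Z_n = \Pr(T > n)$ can be estimated by naive Monte Carlo: simulate paths of the chain and average the product of the per-step survival probabilities $g(x_k)$. The sketch below assumes a Gaussian random-walk transition and the potential $g(x) = \exp(-x^2/2)$; both are illustrative choices, as the slides leave $\mu$, $f$ and $g$ generic.

```python
import numpy as np

def survival_prob_naive(n, num_paths=100_000, sigma=1.0, rng=None):
    """Naive Monte Carlo estimate of Z_n = Pr(T > n) for a Gaussian random
    walk killed at each step with probability 1 - g(x_k), using the
    illustrative potential g(x) = exp(-x^2 / 2)."""
    rng = np.random.default_rng(rng)
    steps = rng.normal(0.0, sigma, size=(num_paths, n))
    x = np.cumsum(steps, axis=1)          # x_k: random-walk positions
    g = np.exp(-0.5 * x**2)               # per-step survival probabilities
    return np.prod(g, axis=1).mean()      # E[ prod_k g(x_k) ] = Pr(T > n)

z5 = survival_prob_naive(5, rng=0)
```

Note how the estimate decays quickly with $n$: this inefficiency of naive sampling is exactly what the SMC framework below addresses.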
Closed Form Inference in HMM
We have closed-form solutions for finite state-space HMMs, as all integrals become
finite sums.
For linear Gaussian models (LG-SSM), all the posterior distributions are
Gaussian (Kalman filter).
We assume that $\pi(\boldsymbol{x}) = \dfrac{\gamma(\boldsymbol{x})}{Z}$, where $Z = \int \gamma(\boldsymbol{x})\, d\boldsymbol{x}$ is unknown and $\gamma(\boldsymbol{x})$ is known
pointwise.
The basic idea in Monte Carlo methods is to sample $N$ i.i.d. random variables
$X^{(i)} \overset{\text{i.i.d.}}{\sim} \pi(\cdot)$ and build the empirical measure
$$\hat{\pi}(\boldsymbol{x})\, d\boldsymbol{x} = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}}(d\boldsymbol{x})$$
Using this: $\mathbb{E}_{\hat{\pi}}[f(\boldsymbol{x})] = \frac{1}{N}\sum_{i=1}^{N} f(X^{(i)})$, where $X^{(i)} \overset{\text{i.i.d.}}{\sim} \pi(\cdot)$.
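A one-line numerical check of this empirical-measure approximation, taking $\pi = \mathcal{N}(0,1)$ and $f(x) = x^2$ (so the exact value is $\mathbb{E}_\pi[f] = 1$); both choices are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
X = rng.normal(size=N)        # i.i.d. draws from pi = N(0, 1)
f_hat = np.mean(X**2)         # empirical-measure estimate of E_pi[x^2] = 1
```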
J.S. Liu, Monte Carlo Strategies in Scientific Computing, Chapter 3, Springer-Verlag, New York.
Similarly, we can approximate the normalization factor of our target
distribution as follows:
$$\hat{Z} = \int \frac{\gamma(\boldsymbol{x})}{q(\boldsymbol{x})}\, \hat{q}(\boldsymbol{x})\, d\boldsymbol{x} = \int w(\boldsymbol{x})\, \hat{q}(\boldsymbol{x})\, d\boldsymbol{x} = \frac{1}{N} \sum_{i=1}^{N} w(X^{(i)}) = \frac{1}{N} \sum_{i=1}^{N} \frac{\gamma(X^{(i)})}{q(X^{(i)})}$$
$$\hat{\pi}(\boldsymbol{x})\, d\boldsymbol{x} = \sum_{i=1}^{N} W^{(i)}\, \delta_{X^{(i)}}(d\boldsymbol{x}), \quad \text{where} \quad W^{(i)} \propto w(X^{(i)}) \quad \text{and} \quad \sum_{i=1}^{N} W^{(i)} = 1$$
$$\mathbb{E}_{\hat{\pi}}[f(\boldsymbol{x})] = \int f(\boldsymbol{x})\, \hat{\pi}(\boldsymbol{x})\, d\boldsymbol{x} = \sum_{i=1}^{N} f(X^{(i)})\, W^{(i)}$$
$$w(\boldsymbol{x}) = \frac{\gamma(\boldsymbol{x})}{q(\boldsymbol{x})}$$
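A self-contained sketch of self-normalized importance sampling with these quantities, using the unnormalized target $\gamma(x) = \exp(-x^2/2)$ (so $Z = \sqrt{2\pi}$ and $\pi = \mathcal{N}(0,1)$) and the heavier-tailed proposal $q = \mathcal{N}(0, 2^2)$; both densities are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

gamma = lambda x: np.exp(-0.5 * x**2)        # unnormalized target; Z = sqrt(2*pi)

s = 2.0                                      # proposal q = N(0, s^2): heavier tails
X = rng.normal(0.0, s, size=N)
q = np.exp(-0.5 * (X / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

w = gamma(X) / q                             # unnormalized weights w(x) = gamma(x)/q(x)
Z_hat = w.mean()                             # estimates Z = sqrt(2*pi) ~ 2.5066
W = w / w.sum()                              # normalized weights W^(i)
mean_hat = np.sum(W * X)                     # estimates E_pi[x] = 0
```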
This is equivalent to saying that $q(\boldsymbol{x})$ should have heavier tails than $\pi(\boldsymbol{x})$.
The estimate is unbiased and its variance gives the following convergence
properties:
$$\mathrm{Var}_{X^{(i)}_{1:n}}\!\left[\mathbb{E}_{\hat{p}_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi]\right] = \frac{1}{N}\, \mathrm{Var}_{p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi]$$
$$\sqrt{N}\left(\mathbb{E}_{\hat{p}_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi] - \mathbb{E}_{p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi]\right) \xrightarrow{d} \mathcal{N}\!\left(0,\, \mathrm{Var}_{p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi]\right)$$
The rate of convergence is independent of 𝑛. This does not imply that Monte
Carlo beats the curse of dimensionality of 𝒳 𝑛, since it is possible that
𝑉𝑎𝑟𝑝 𝒙1:𝑛 |𝒚1:𝑛 𝜑 increases with time 𝑛.
The marginal posterior $p(x_k \mid \boldsymbol{y}_{1:n})$ is approximated by integrating the empirical
measure over the remaining states:
$$\hat{p}_N(x_k \mid \boldsymbol{y}_{1:n}) = \int_{\mathcal{X}^{n-1}} \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{1:n}}(\boldsymbol{x}_{1:n})\, d\boldsymbol{x}_{1:k-1}\, d\boldsymbol{x}_{k+1:n} = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{k}}(x_k)$$
Note that the marginal likelihood $p(\boldsymbol{y}_{1:n})$ cannot be estimated as easily using
$X^{(i)}_{1:n} \sim p(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})$.
MCMC methods are not useful in this context (they are not recursive).
SMC methods partially solve both problems by breaking the sampling from 𝑝 𝒙1:𝑛 |𝒚1:𝑛
into a collection of simpler subproblems: first approximate 𝑝 𝑥1 |𝑦1 and 𝑝 𝑦1
at time 1, then 𝑝 𝒙1:2 |𝒚1:2 and 𝑝 𝒚1:2 at time 2, and so on.
The support of $q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})$ includes the support of $p(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})$, i.e.
$$p(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n}) > 0 \;\Rightarrow\; q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n}) > 0$$
We use the following identity:
$$p(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n}) = \frac{p(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n})}{\int p(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n})\, d\boldsymbol{x}_{1:n}} = \frac{\dfrac{p(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n})}{q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})}\, q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})}{\int \dfrac{p(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n})}{q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})}\, q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})\, d\boldsymbol{x}_{1:n}} = \frac{w(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n})\, q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})}{\int w(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n})\, q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})\, d\boldsymbol{x}_{1:n}}$$
Importance Sampling for the State Space Model
Let us draw $N$ samples from our importance distribution:
$$X^{(i)}_{1:n} \sim q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n}), \qquad \hat{q}_N(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{1:n}}(\boldsymbol{x}_{1:n})$$
Then, using the identity on the earlier slide, we obtain the following approximation of our target
distribution:
$$\hat{p}_N(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n}) = \frac{w(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n})\, \hat{q}_N(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})}{\int w(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n})\, \hat{q}_N(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})\, d\boldsymbol{x}_{1:n}} = \frac{w(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n})\, \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{1:n}}(\boldsymbol{x}_{1:n})}{\int w(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n})\, \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{1:n}}(\boldsymbol{x}_{1:n})\, d\boldsymbol{x}_{1:n}}$$
$$= \sum_{i=1}^{N} W^{(i)}_n\, \delta_{X^{(i)}_{1:n}}(\boldsymbol{x}_{1:n}), \qquad W^{(i)}_n = \frac{w(X^{(i)}_{1:n}, \boldsymbol{y}_{1:n})}{\sum_{j=1}^{N} w(X^{(j)}_{1:n}, \boldsymbol{y}_{1:n})}$$
Note that:
$$\hat{Z}_n \equiv \hat{p}_N(\boldsymbol{y}_{1:n}) = \int w(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n})\, \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{1:n}}(\boldsymbol{x}_{1:n})\, d\boldsymbol{x}_{1:n} = \frac{1}{N} \sum_{i=1}^{N} w(X^{(i)}_{1:n}, \boldsymbol{y}_{1:n})$$
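A sketch of these formulas on the stochastic volatility model from earlier, taking the prior $p(\boldsymbol{x}_{1:n})$ as the proposal $q$ so that $w(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n}) = \prod_k g(y_k \mid x_k)$. Parameter values and the toy observation sequence are illustrative assumptions, and the weights are computed in log-space for numerical stability.

```python
import numpy as np

def is_ssm_prior(y, num_particles=5000, alpha=0.91, sigma=1.0, beta=0.5, rng=None):
    """Importance sampling for the stochastic volatility SSM using the prior
    p(x_{1:n}) as proposal, so w(x_{1:n}, y_{1:n}) = prod_k g(y_k | x_k)."""
    rng = np.random.default_rng(rng)
    n = len(y)
    # Sample N paths X_{1:n}^(i) from the prior (the proposal q) ...
    x = np.empty((num_particles, n))
    x[:, 0] = rng.normal(0.0, sigma / np.sqrt(1.0 - alpha**2), size=num_particles)
    for k in range(1, n):
        x[:, k] = alpha * x[:, k - 1] + sigma * rng.normal(size=num_particles)
    # ... and weight each whole path by its likelihood, g(y_k|x_k) = N(y_k; 0, beta^2 e^{x_k}).
    var = beta**2 * np.exp(x)
    logw = np.sum(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * y**2 / var, axis=1)
    logw -= logw.max()                 # guard against underflow before exponentiating
    w = np.exp(logw)
    W = w / w.sum()                    # normalized weights W_n^(i)
    return x, W

y = np.array([0.3, -0.5, 0.1, 0.8, -0.2])   # toy observations, purely illustrative
x, W = is_ssm_prior(y, rng=0)
```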
Normalized Weights in Importance Sampling
The unnormalized weights were defined as follows:
$$w(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n}) = \frac{p(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n})}{q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})} = p(\boldsymbol{y}_{1:n})\, \underbrace{\frac{p(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})}{q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})}}_{\substack{\text{discrepancy between target}\\ \text{and importance distribution}}}$$
The relative variance of $\hat{Z}_n$ is
$$\frac{\mathrm{Var}[\hat{Z}_n]}{Z_n^2} = \frac{1}{N}\left[\int \frac{p^2(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})}{q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})}\, d\boldsymbol{x}_{1:n} - 1\right]$$
You can bring this variance to zero (variance of the unnormalized weights equal to zero) with
the selection $q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n}) = p(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})$.
Of course this is what we wanted to avoid (we want to sample from an easier distribution).
However, this result points to the fact that the choice of 𝑞 needs to be as close as possible to
the target distribution.
This is a biased estimate for finite $N$, and we have shown that for importance sampling:
$$\lim_{N \to \infty} N\left(\mathbb{E}_{\hat{p}_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi] - \mathbb{E}_{p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi]\right) = -\int \frac{p^2(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})}{q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})} \left(\varphi(\boldsymbol{x}_{1:n}) - \mathbb{E}_{p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi]\right) d\boldsymbol{x}_{1:n}$$
$$\sqrt{N}\left(\mathbb{E}_{\hat{p}_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi] - \mathbb{E}_{p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi]\right) \xrightarrow{d} \mathcal{N}\!\left(0,\; \int \frac{p^2(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})}{q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})} \left(\varphi(\boldsymbol{x}_{1:n}) - \mathbb{E}_{p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi]\right)^2 d\boldsymbol{x}_{1:n}\right)$$
The asymptotic bias is of order $1/N$ (negligible), so the MSE, being the squared bias plus the
variance, is dominated by the $O(1/N)$ variance term.
The optimal distribution for estimating 𝜑 𝒙1:𝑛−1 will almost certainly not even be
similar to the marginal distribution of 𝒙1:𝑛−1 under the optimal distribution for
estimating 𝜑 𝒙1:𝑛 , and this will prove to be problematic.
Selection of Importance Sampling Distribution
A more appropriate approach in this context is to attempt to select the 𝑞 𝒙1:𝑛 |𝒚1:𝑛 which
minimizes the variance of the importance weights (or, equivalently, the variance of 𝑍መ𝑛 ).
Clearly, this variance is minimized for 𝑞 𝒙1:𝑛 |𝒚1:𝑛 = 𝑝 𝒙1:𝑛 |𝒚1:𝑛 , in which case all the
unnormalized importance weights are equal and their variance is zero. We cannot do this, as
avoiding direct sampling from the posterior is the reason we used IS in the first place.
However, this simple result indicates that we should aim at selecting an IS distribution as
close as possible to the target.
As discussed before, the importance sampling distribution should be selected so that the
weights are bounded, or equivalently 𝑞 𝒙1:𝑛 |𝒚1:𝑛 has heavier tails than 𝑝 𝒙1:𝑛 |𝒚1:𝑛 .
Note that the selection of the importance sampling distribution needs to be not only such that it
covers the support of the target, but also a clever one for the particular problem of interest.
To assess the quality of the importance sampling approximation, note that for flat functions,
$$\frac{\text{Variance of IS estimate}}{\text{Variance of standard MC estimate}} \approx 1 + \mathrm{Var}_{q(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}\!\left[W(\boldsymbol{X}_{1:n} \mid \boldsymbol{y}_{1:n})\right]$$
This is often interpreted as the effective sample size ($N$ weighted samples from $q(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})$
are approximately equivalent to $M$ unweighted samples from $p(\boldsymbol{x}_{1:n} \mid \boldsymbol{y}_{1:n})$):
$$M = \frac{N}{1 + \mathrm{Var}_{q(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}\!\left[W(\boldsymbol{X}_{1:n} \mid \boldsymbol{y}_{1:n})\right]}$$
In practice the variance of the weights is estimated from the samples:
$$\mathrm{Var}_{q(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}\!\left[W(X^{(i)}_{1:n} \mid \boldsymbol{y}_{1:n})\right] \approx N \sum_{i=1}^{N} W^2(X^{(i)}_{1:n} \mid \boldsymbol{y}_{1:n}) - 1$$
We can clearly see from
$$ESS = \frac{N}{1 + \mathrm{Var}_{q(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}\!\left[W(\boldsymbol{X}_{1:n} \mid \boldsymbol{y}_{1:n})\right]}$$
that
$$1 \le ESS = \left(\sum_{i=1}^{N} \left(W_n^{(i)}\right)^2\right)^{-1} \le N$$
We can thus have anything from $ESS = 1$ (one weight equal to one and all others zero,
very inefficient) to $ESS = N$ (all weights equal to $1/N$, as in i.i.d. sampling from the target).
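The empirical ESS is a one-liner on normalized weights (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def ess(W):
    """Effective sample size ESS = 1 / sum_i W_i^2 for normalized weights W,
    ranging from 1 (all mass on one sample) to N (uniform weights)."""
    W = np.asarray(W, dtype=float)
    return 1.0 / np.sum(W**2)

print(ess([0.25, 0.25, 0.25, 0.25]))  # uniform weights: ESS = N = 4.0
print(ess([1.0, 0.0, 0.0, 0.0]))      # degenerate weights: ESS = 1.0
```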