
Introduction to State Space Models
and
Sequential Bayesian Inference

Prof. Nicholas Zabaras

Email: [email protected]
URL: https://www.zabaras.com/

November 10, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


Contents

• References and SMC Resources, Applications of SMC

• INTRODUCING THE STATE SPACE MODEL: Discrete-time Markov Models, Tracking
  Problem, Speech Enhancement, the Dynamics of an Asset, The state space model with
  observations, Linear Gaussian LG-SSM, Stochastic Volatility, Bearings-only tracking,
  Probabilistic Programming and SSM, Bayesian Inference tasks for the SSM

• BAYESIAN INFERENCE IN STATE SPACE MODELS: Target distribution, Particle
  motion in a random medium – marginal likelihood, Closed form inference for the HMM

• MONTE CARLO and SEQUENTIAL IMPORTANCE SAMPLING: Review of MC, Review
  of IS, Estimating the normalizing constant, Variance of the weights, Monte Carlo for the
  State Space Model, IS for the State Space Model, Bias and Variance of IS estimates,
  Selection of Importance Density, Effective Sample Size


Goals

• The goals for today's lecture include the following:

  • Learn about the state space model through various examples

  • Learn to perform Bayesian inference in state space models

  • Learn how to perform importance sampling for state space models


References

• C.P. Robert & G. Casella, Monte Carlo Statistical Methods, Chapter 11
• J.S. Liu, Monte Carlo Strategies in Scientific Computing, Chapter 3, Springer-Verlag, New York
• A. Doucet, N. de Freitas & N. Gordon (eds), Sequential Monte Carlo Methods in Practice, Springer-Verlag, 2001
• A. Doucet, N. de Freitas & N.J. Gordon, An Introduction to Sequential Monte Carlo, in SMC Methods in Practice, 2001
• D. Wilkinson, Stochastic Modelling for Systems Biology, Second Edition, 2006
• E. Ionides, Inference for Nonlinear Dynamical Systems, PNAS, 2006
• J.S. Liu and R. Chen, Sequential Monte Carlo methods for dynamic systems, JASA, 1998
• A. Doucet, Sequential Monte Carlo Methods, Short Course at SAMSI
• A. Doucet, Sequential Monte Carlo Methods & Particle Filters Resources
• P. Del Moral, Feynman-Kac models and interacting particle systems (SMC resources)
• A. Doucet, Sequential Monte Carlo Methods, Video Lectures, 2007
• N. de Freitas and A. Doucet, Sequential MC Methods, Video Lectures, 2010
References

• M.K. Pitt and N. Shephard, Filtering via Simulation: Auxiliary Particle Filters, JASA, 1999
• A. Doucet, S.J. Godsill and C. Andrieu, On sequential Monte Carlo sampling methods for Bayesian filtering, Statistics and Computing, 2000
• J. Carpenter, P. Clifford and P. Fearnhead, An Improved Particle Filter for Non-linear Problems, IEE, 1999
• A. Kong, J.S. Liu & W.H. Wong, Sequential Imputations and Bayesian Missing Data Problems, JASA, 1994
• O. Cappé, E. Moulines & T. Rydén, Inference in Hidden Markov Models, Springer-Verlag, 2005
• W. Gilks and C. Berzuini, Following a moving target: MC inference for dynamic Bayesian models, JRSS B, 2001
• G. Poyiadjis, A. Doucet and S.S. Singh, Maximum Likelihood Parameter Estimation using Particle Methods, Joint Statistical Meeting, 2005
• N. Gordon, D.J. Salmond and A.F.M. Smith, Novel approach to nonlinear/non-Gaussian Bayesian state estimation, IEE, 1993
• S. Godsill, Particle Filters, 2009 (Video Lectures)
• R. Chen and J.S. Liu, Predictive Updating Methods with Application to Bayesian Classification, JRSS B, 1996


References

• C. Andrieu and A. Doucet, Particle Filtering for Partially Observed Gaussian State Space Models, JRSS B, 2002
• R. Chen and J.S. Liu, Mixture Kalman Filters, JRSS B, 2000
• A. Doucet, S.J. Godsill and C. Andrieu, On SMC sampling methods for Bayesian filtering, Statistics and Computing, 2000
• N. Kantas, A. Doucet, S.S. Singh and J.M. Maciejowski, An overview of sequential Monte Carlo methods for parameter estimation in general state-space models, in Proceedings IFAC System Identification (SysId) Meeting, 2009
• C. Andrieu, A. Doucet & R. Holenstein, Particle Markov chain Monte Carlo methods, JRSS B, 2010
• C. Andrieu, N. de Freitas and A. Doucet, Sequential MCMC for Bayesian Model Selection, Proc. IEEE Workshop HOS, 1999
• P. Fearnhead, MCMC, sufficient statistics and particle filters, JCGS, 2002
• G. Storvik, Particle filters for state-space models with the presence of unknown static parameters, IEEE Trans. Signal Processing, 2002


References

• C. Andrieu, A. Doucet and V.B. Tadic, Online EM for parameter estimation in nonlinear/non-Gaussian state-space models, Proc. IEEE CDC, 2005
• G. Poyiadjis, A. Doucet and S.S. Singh, Particle Approximations of the Score and Observed Information Matrix in State-Space Models with Application to Parameter Estimation, Biometrika, 2011
• F. Caron, R. Gottardo and A. Doucet, On-line Changepoint Detection and Parameter Estimation for Genome Wide Transcript Analysis, Technical report, 2008
• R. Martinez-Cantin, J. Castellanos and N. de Freitas, Analysis of Particle Methods for Simultaneous Robot Localization and Mapping and a New Algorithm: Marginal-SLAM, International Conference on Robotics and Automation
• C. Andrieu, A. Doucet & R. Holenstein, Particle Markov chain Monte Carlo methods (with discussion), JRSS B, 2010
• A. Doucet, Sequential Monte Carlo Methods and Particle Filters: List of Papers, Codes, and Video Lectures on SMC and particle filters
• P. Del Moral, Feynman-Kac models and interacting particle systems


References

• P. Del Moral, A. Doucet and A. Jasra, Sequential Monte Carlo samplers, JRSS B, 2006
• P. Del Moral, A. Doucet and A. Jasra, Sequential Monte Carlo for Bayesian Computation, Bayesian Statistics, 2006
• P. Del Moral, A. Doucet & S.S. Singh, Forward Smoothing using Sequential Monte Carlo, technical report, Cambridge University, 2009
• P. Del Moral, Feynman-Kac Formulae, Springer-Verlag, 2004
• M. Davy, Sequential MC Methods, 2007
• A. Doucet and A. Johansen, Particle Filtering and Smoothing: Fifteen years later, in Handbook of Nonlinear Filtering (eds D. Crisan and B. Rozovsky), Oxford Univ. Press, 2011
• A. Johansen and A. Doucet, A Note on Auxiliary Particle Filters, Statistics & Probability Letters, 2008
• A. Doucet, M. Briers & S. Senecal, Efficient Block Sampling Strategies for Sequential Monte Carlo, JCGS, 2006
• F. Caron, R. Gottardo and A. Doucet, On-line Changepoint Detection and Parameter Estimation for Genome Wide Transcript Analysis, Stat. Comput., 2011
• F. Lindsten, M.I. Jordan and T.B. Schön, Particle Gibbs with Ancestor Sampling, JMLR, 2014


Sequential Monte Carlo (SMC) Methods

• SMC is a powerful alternative/complementary approach to MCMC for addressing
  general Bayesian computational problems.

• Both MCMC and SMC are asymptotically bias-free but computationally
  expensive.

• Variational and Expectation-Propagation (EP) methods are computationally
  inexpensive but perform functional approximations of the posteriors of
  interest.


SMC Applications

• Sequential Monte Carlo (SMC) methods are used to approximate any
  sequence of probability distributions.
• They are often used in engineering, physics/chemistry, biology, etc.:
  • Terrain navigation, trajectory planning
  • Solving differential/integral equations, computing eigenvalues of positive operators
  • Simulating polymer chains, computing free energies
  • Econometrics
  • Speech enhancement
  • Epidemiological modelling
  • ...
• State space models (SSM) are used in most tutorials for introducing SMC,
  but SMC is a method for a much bigger class of problems.
• SMC methods are often known as Particle Filtering or Smoothing Methods.
Introducing the State Space Model


Discrete-Time Markov Model

• Consider a discrete-time Markov process $\{X_n\}, n \geq 1$.

• It is defined by an initial density $X_1 \sim \mu(\cdot)$ and a transition density:

  $X_n \mid (X_{n-1} = x) \sim f(\cdot \mid x)$

• We can then write the prior distribution of the states:

  $p(x_{1:n}) = p(x_1, \ldots, x_n) = \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1})$

  [Graphical model: $x_1 \to x_2 \to \cdots \to x_n$, a Markov chain]
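The prior above can be sampled and evaluated sequentially. A minimal sketch in Python, assuming an illustrative Gaussian AR(1) choice $\mu = \mathcal{N}(0,1)$ and $f(x_k \mid x_{k-1}) = \mathcal{N}(a x_{k-1}, q)$ (the values of `a` and `q` are hypothetical, not from the slides):

```python
import numpy as np

# Hypothetical example: a Gaussian AR(1) chain as the Markov prior.
# mu(.) = N(0, 1), f(x_k | x_{k-1}) = N(a * x_{k-1}, q) -- illustrative choices.
def sample_prior(n, a=0.9, q=0.5, rng=None):
    """Draw x_{1:n} from p(x_{1:n}) = mu(x_1) prod_k f(x_k | x_{k-1})."""
    rng = np.random.default_rng(rng)
    x = np.empty(n)
    x[0] = rng.normal(0.0, 1.0)                           # x_1 ~ mu
    for k in range(1, n):
        x[k] = a * x[k - 1] + rng.normal(0.0, np.sqrt(q)) # x_k ~ f(.|x_{k-1})
    return x

def log_prior(x, a=0.9, q=0.5):
    """Evaluate log p(x_{1:n}) for the same model."""
    def log_norm(v, mean, var):
        return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)
    lp = log_norm(x[0], 0.0, 1.0)                 # log mu(x_1)
    lp += np.sum(log_norm(x[1:], a * x[:-1], q))  # sum_k log f(x_k | x_{k-1})
    return lp
```

The same factorized structure (initial density times a product of transition densities) carries over to every model on the following slides.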


Tracking Example

• Consider tracking a target in the $XY$ plane (location/speed in $x$ and $y$):

  $X_k = (X_{k,1}, V_{k,1}, X_{k,2}, V_{k,2})^T$

• We consider the constant velocity model:

  $X_k = A X_{k-1} + W_k, \quad W_k \overset{i.i.d.}{\sim} \mathcal{N}(0, \Sigma)$

  $A = \begin{pmatrix} A_{CV} & 0 \\ 0 & A_{CV} \end{pmatrix}, \quad A_{CV} = \begin{pmatrix} 1 & T \\ 0 & 1 \end{pmatrix}$

  $\Sigma = \begin{pmatrix} \Sigma_{CV} & 0 \\ 0 & \Sigma_{CV} \end{pmatrix}, \quad \Sigma_{CV} = \sigma^2 \begin{pmatrix} T^3/3 & T^2/2 \\ T^2/2 & T \end{pmatrix}$

• The transition density for this model is then:

  $f(x_k \mid x_{k-1}) = \mathcal{N}(x_k; A x_{k-1}, \Sigma)$
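The constant velocity model above can be simulated directly; a short sketch, with illustrative values of $T$ and $\sigma^2$ (these are assumptions, not values from the slides):

```python
import numpy as np

# Sketch of the constant velocity model; T and sigma2 are illustrative.
T, sigma2 = 1.0, 0.1
A_cv = np.array([[1.0, T], [0.0, 1.0]])
S_cv = sigma2 * np.array([[T**3 / 3, T**2 / 2], [T**2 / 2, T]])
A = np.kron(np.eye(2), A_cv)      # block-diagonal [[A_cv, 0], [0, A_cv]]
Sigma = np.kron(np.eye(2), S_cv)  # block-diagonal process-noise covariance

def simulate_track(n, rng=None):
    """Simulate x_{1:n} with x_k = A x_{k-1} + w_k, w_k ~ N(0, Sigma)."""
    rng = np.random.default_rng(rng)
    x = np.zeros((n, 4))  # start from the origin at rest (an assumption)
    for k in range(1, n):
        x[k] = A @ x[k - 1] + rng.multivariate_normal(np.zeros(4), Sigma)
    return x
```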
Speech Enhancement

• We model speech signals as an autoregressive (AR) process, i.e.

  $S_k = \sum_{i=1}^{d} a_i S_{k-i} + V_k, \quad V_k \sim \mathcal{N}(0, \sigma_s^2)$

• We can write this in matrix (companion) form as follows:

  $U_k = A U_{k-1} + B V_k, \quad U_k = (S_k, \ldots, S_{k-d+1})^T$

  $A = \begin{pmatrix} a_1 & a_2 & \cdots & a_d \\ 1 & & & \\ & \ddots & & \\ & & 1 & 0 \end{pmatrix}, \quad B = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$

• The transition density is now:

  $f_U(u_k \mid u_{k-1}) = \mathcal{N}\big((u_k)_1; \, A_{(1)} u_{k-1}, \, \sigma_s^2\big) \, \delta_{(u_{k-1})_{1:d-1}}\big((u_k)_{2:d}\big)$

  where $A_{(1)}$ denotes the first row of $A$: the first component is random, while the
  remaining components are a deterministic shift of the previous state.
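The companion-form rewrite can be checked numerically: the first component of $A u_{k-1} + B V_k$ is the AR recursion, and the rest is a shift of the old state. A small sketch (the AR coefficients below are illustrative):

```python
import numpy as np

# Check that the companion-form matrix A reproduces the AR(d) recursion.
d = 3
a = np.array([0.5, -0.2, 0.1])   # illustrative AR coefficients a_1, ..., a_d
A = np.zeros((d, d))
A[0, :] = a                      # first row: (a_1, ..., a_d)
A[1:, :-1] = np.eye(d - 1)       # subdiagonal of ones (shift register)
B = np.zeros(d)
B[0] = 1.0                       # noise enters the first component only

rng = np.random.default_rng(1)
u = rng.normal(size=d)           # U_{k-1} = (S_{k-1}, ..., S_{k-d})^T
v = rng.normal()                 # V_k
u_next = A @ u + B * v           # U_k = A U_{k-1} + B V_k
# u_next[0] is the AR recursion; u_next[1:] is a shift of the old state.
```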


Speech Enhancement

• We can also consider the AR coefficients to be time-dependent:

  $\alpha_k = \alpha_{k-1} + W_k, \quad W_k \sim \mathcal{N}(0, \sigma_\alpha^2 I_d), \quad \text{where } \alpha_k = (\alpha_{k,1}, \ldots, \alpha_{k,d})^T$

• Thus, for non-stationary speech signals, we can write:

  $f(\alpha_k \mid \alpha_{k-1}) = \mathcal{N}(\alpha_k; \alpha_{k-1}, \sigma_\alpha^2 I_d)$

• The process $X_k = (\alpha_k, U_k)$ is Markov with transition density

  $f(x_k \mid x_{k-1}) = \mathcal{N}(\alpha_k; \alpha_{k-1}, \sigma_\alpha^2 I_d) \, \mathcal{N}\big((u_k)_1; \, (A_k u_{k-1})_1, \, \sigma_s^2\big) \, \delta_{(u_{k-1})_{1:d-1}}\big((u_k)_{2:d}\big)$

  with

  $(A_k u_{k-1})_1 = (\alpha_{k,1}, \ldots, \alpha_{k,d}) \, (S_{k-1}, \ldots, S_{k-d})^T$
Econometrics

• The Heston model (1993) describes the dynamics of an asset price $S_t$ with the
  following model for $X_t = \log(S_t)$:

  $dX_t = \mu \, dt + dW_t + dZ_t$

  where $Z_t$ is a jump process and $W_t$ is a Brownian motion.

• We approximate this (time integration) by a discrete-time Markov process:

  $X_{t+\delta} = X_t + \delta\mu + W_{t+\delta,t} + Z_{t+\delta,t}$

• The same class of models is used for biochemical networks, disease and population
  dynamics, etc.

  • D. Wilkinson, Stochastic Modelling for Systems Biology, Second Edition, 2006
  • E. Ionides, Inference for Nonlinear Dynamical Systems, PNAS, 2006
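The discrete-time approximation above can be sketched with an Euler-type step; here the Brownian increment is $\mathcal{N}(0, \delta)$ and the jump part is modeled as a compound Poisson sum. All parameter values and the specific jump law are assumptions for illustration, not part of the slides:

```python
import numpy as np

# Illustrative Euler-type discretization of dX = mu dt + dW + dZ.
def simulate_log_price(n, delta=0.01, mu=0.05, jump_rate=0.5,
                       jump_scale=0.1, x0=0.0, rng=None):
    rng = np.random.default_rng(rng)
    x = np.empty(n)
    x[0] = x0
    for t in range(1, n):
        dw = rng.normal(0.0, np.sqrt(delta))            # Brownian increment
        njumps = rng.poisson(jump_rate * delta)         # jumps in (t, t+delta]
        dz = rng.normal(0.0, jump_scale, njumps).sum()  # jump increment
        x[t] = x[t - 1] + delta * mu + dw + dz
    return x
```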


The State Space Model

• Let us discuss in some detail a very popular dynamic system, the state space
  model, now including observations.
• A state space model is an extension of a Markov chain which is able to
  capture the sequential relations among hidden variables.
• It is a dynamic system comprising two major parts:

  [Graphical model: observations $y_1, y_2, \ldots, y_n$, each conditioned on the
  corresponding state of the hidden Markov chain $x_1 \to x_2 \to \cdots \to x_n$]

• The graphical model represents the probabilistic model. Arrows indicate
  conditioning dependencies.

  • C.A. Naesseth, F. Lindsten and T.B. Schön, Sequential Monte Carlo methods for graphical models, Advances in Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, December 2014.
  • L. Murray and T.B. Schön, Automated learning with a probabilistic programming language: Birch, Annual Reviews in Control, 46:29-43, 2018.
  • Z. Ghahramani, Probabilistic machine learning and artificial intelligence, Nature 521:452-459, 2015.
The State Space Model

• The two parts can be expressed by equations:

  • State equation: $\{X_n\}, n \geq 1$ is a latent/hidden Markov process with

    $X_1 \sim \mu(\cdot) \quad \text{and} \quad X_n \mid (X_{n-1} = x_{n-1}) \sim f(\cdot \mid x_{n-1})$

  • Observation equation: $\{Y_n\}, n \geq 1$ is an observation process with the observations
    being conditionally independent given $\{X_n\}, n \geq 1$:

    $Y_n \mid (X_n = x_n) \sim g(\cdot \mid x_n)$

• Since the observations $\{y_n\}$ are conditionally independent given the Markov states
  $\{x_n\}$, the likelihood is

  $p(y_1, \ldots, y_n \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} g(y_i \mid x_i)$

• Our aim is to recover $\{X_n\}, n \geq 1$ given $\{Y_n\}, n \geq 1$.


Linear Gaussian State Space Model (LG-SSM)

• A Linear Gaussian State Space Model (LG-SSM):

  $X_n = A X_{n-1} + B u_n + V_n$

  $Y_n = C X_n + D u_n + E_n$

• Here $X_n \in \mathbb{R}^{n_x}$ denotes the state, $u_n \in \mathbb{R}^{n_u}$ denotes an explanatory variable
  (known signal) and $Y_n \in \mathbb{R}^{n_y}$ denotes the measurements.

• The initial state and noise variables are defined as:

  $\begin{pmatrix} X_0 \\ V_n \\ E_n \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} P_0 & 0 & 0 \\ 0 & Q & S \\ 0 & S^T & R \end{pmatrix} \right)$
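A minimal scalar instance of the LG-SSM above can be simulated in a few lines. This sketch assumes no input $u_n$ and $S = 0$ (uncorrelated process/measurement noise); all parameter values are illustrative:

```python
import numpy as np

# Minimal scalar LG-SSM simulator (illustrative parameters, u_n = 0, S = 0).
def simulate_lgssm(n, A=0.9, C=1.0, Q=0.1, R=0.5, mu0=0.0, P0=1.0, rng=None):
    rng = np.random.default_rng(rng)
    x = np.empty(n)
    y = np.empty(n)
    x[0] = rng.normal(mu0, np.sqrt(P0))                    # X_0 ~ N(mu, P0)
    for k in range(n):
        if k > 0:
            x[k] = A * x[k - 1] + rng.normal(0.0, np.sqrt(Q))  # state equation
        y[k] = C * x[k] + rng.normal(0.0, np.sqrt(R))          # observation equation
    return x, y
```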
Stochastic Volatility Model

• A Stochastic Volatility Model:

  $X_1 \sim \mathcal{N}\left(0, \frac{\sigma^2}{1-\alpha^2}\right) \quad \text{and} \quad X_n = \alpha X_{n-1} + V_n$

  $Y_n = \beta \exp(X_n/2) \, W_n, \quad \text{where}$

  $|\alpha| < 1, \quad V_n \overset{i.i.d.}{\sim} \mathcal{N}(0, \sigma^2) \quad \text{and} \quad W_n \overset{i.i.d.}{\sim} \mathcal{N}(0, 1)$

• Equivalently:

  $x_n \sim \mathcal{N}(\alpha x_{n-1}, \sigma^2), \quad g(y_n \mid x_n) = \mathcal{N}\big(y_n; 0, \beta^2 \exp(x_n)\big)$
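The stochastic volatility model above is easy to simulate; note the initialization from the stationary distribution $\mathcal{N}(0, \sigma^2/(1-\alpha^2))$. The parameter values are illustrative:

```python
import numpy as np

# Sketch of the stochastic volatility model (alpha, sigma, beta illustrative).
def simulate_sv(n, alpha=0.91, sigma=1.0, beta=0.5, rng=None):
    rng = np.random.default_rng(rng)
    x = np.empty(n)
    x[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - alpha**2))  # stationary init
    for k in range(1, n):
        x[k] = alpha * x[k - 1] + rng.normal(0.0, sigma)     # latent log-volatility
    y = beta * np.exp(x / 2.0) * rng.normal(0.0, 1.0, n)     # observations
    return x, y
```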


Bearings-Only Tracking

• The simplest linear observation model is of the form:

  $Y_k = C X_k + E_k, \quad E_k \overset{i.i.d.}{\sim} \mathcal{N}(0, \Sigma_e) \;\Rightarrow\; g(y_k \mid x_k) = \mathcal{N}(y_k; C x_k, \Sigma_e)$

• The non-linear version (bearings-only tracking) is more popular:

  $Y_k = \tan^{-1}\!\left(\frac{X_{k,2}}{X_{k,1}}\right) + E_k, \quad E_k \overset{i.i.d.}{\sim} \mathcal{N}(0, \sigma^2) \;\Rightarrow\; g(y_k \mid x_k) = \mathcal{N}\!\left(y_k; \tan^{-1}\!\left(\frac{x_{k,2}}{x_{k,1}}\right), \sigma^2\right)$

• Note that the mean of the Gaussian is a highly non-linear function of the state.


Probabilistic Programming and SMC

• A probabilistic program encodes a probabilistic model according to the
  semantics of a particular probabilistic programming language, giving rise to a
  programmatic model.

• The memory state of a running probabilistic program evolves dynamically and
  stochastically in time, and so is a stochastic process.

• SMC is a common inference method for programmatic models.

• This creates a clear separation between the model and the inference methods,
  and opens up for the automation of inference!

  • L. Murray and T.B. Schön, Automated learning with a probabilistic programming language: Birch, Annual Reviews in Control, 46:29-43, 2018.
  • Z. Ghahramani, Probabilistic machine learning and artificial intelligence, Nature 521:452-459, 2015.
  • N.D. Goodman and A. Stuhlmüller, The design and implementation of probabilistic programming languages, retrieved 2019-8-29 from http://dippl.org
  • J.-W. van de Meent et al., An introduction to probabilistic programming, arXiv preprint arXiv:1809.10756, 2018.
Bayesian Inference for the SSM

• At time $n$, we have a total of $n$ observations, and the target distribution to be
  estimated is the posterior $p(x_{1:n} \mid y_{1:n})$.
• The target distribution is "time-varying": the posterior should be updated each
  time new observations arrive. Thus we need to estimate a sequence of
  distributions indexed by time:

  $p(x_1 \mid y_1), \; p(x_{1:2} \mid y_{1:2}), \; \ldots, \; p(x_{1:n} \mid y_{1:n}) \quad \text{(target distributions)}$

  $\text{Likelihood:} \quad p(y_1, \ldots, y_n \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} g(y_i \mid x_i)$

  $\text{Prior:} \quad p(x_{1:n}) = \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1})$


Bayesian Inference for the SSM

• While our overall estimation problem is to compute the joint filtering
  distribution $p(x_{1:n} \mid y_{1:n})$, the following inference problems are also of interest:

  • Filtering: compute $p(x_n \mid y_{1:n})$
  • Prediction: compute $p(x_{n+1} \mid y_{1:n})$
  • Joint smoothing: compute $p(x_{1:T} \mid y_{1:T})$
  • Marginal smoothing: compute $p(x_n \mid y_{1:T}), \; n \leq T$

• Note: the Kalman filter provides an analytical solution to the filtering problem
  for a LG-SSM.
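The Kalman filter mentioned in the note can be sketched in a few lines for a scalar LG-SSM. This assumes no input $u_n$ and $S = 0$; parameter values are illustrative:

```python
import numpy as np

# Minimal scalar Kalman filter for X_n = A X_{n-1} + V_n, Y_n = C X_n + E_n,
# with V_n ~ N(0, Q), E_n ~ N(0, R) (no input u_n, S = 0; values illustrative).
def kalman_filter(y, A=0.9, C=1.0, Q=0.1, R=0.5, mu0=0.0, P0=1.0):
    """Return the means and variances of the filtering densities p(x_n | y_{1:n})."""
    means, variances = [], []
    m, P = mu0, P0
    for k, yk in enumerate(y):
        if k > 0:                    # predict: p(x_k | y_{1:k-1})
            m, P = A * m, A * P * A + Q
        S_k = C * P * C + R          # innovation variance
        K = P * C / S_k              # Kalman gain
        m = m + K * (yk - C * m)     # update: p(x_k | y_{1:k})
        P = (1.0 - K * C) * P
        means.append(m)
        variances.append(P)
    return np.array(means), np.array(variances)
```

For a single observation $y_1 = 1$ with the default parameters, the update reduces to the conjugate Gaussian posterior with mean $K y_1 = P_0 C / (C^2 P_0 + R) = 2/3$.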


Bayesian Inference in State Space Models


Target Distribution

• In Bayesian estimation, the target distribution (posterior) for the SSM is
  $p(x_{1:n} \mid y_{1:n})$.

• The state equation for the Markov process defines a prior:

  $p(x_{1:n}) = \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1})$

• The observation equation defines the likelihood:

  $p(y_{1:n} \mid x_{1:n}) = \prod_{k=1}^{n} g(y_k \mid x_k)$

• The posterior distribution is known up to a normalizing constant:

  $p(x_{1:n} \mid y_{1:n}) = \frac{p(x_{1:n}, y_{1:n})}{p(y_{1:n})} \propto p(x_{1:n}, y_{1:n}) = \underbrace{p(x_{1:n})}_{\text{prior}} \, \underbrace{p(y_{1:n} \mid x_{1:n})}_{\text{likelihood}} = \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1}) \prod_{k=1}^{n} g(y_k \mid x_k)$

  where

  $p(y_{1:n}) = \int \cdots \int p(x_{1:n}) \, p(y_{1:n} \mid x_{1:n}) \, dx_{1:n}$


Target Distribution

• In this lecture, our target distribution is as follows:

  $\pi_n(x_{1:n}) = \frac{\gamma_n(x_{1:n})}{Z_n} = p(x_{1:n} \mid y_{1:n}), \quad \gamma_n(x_{1:n}) = p(x_{1:n}, y_{1:n}), \quad Z_n = p(y_{1:n})$

• The posterior and marginal likelihood do not admit closed forms unless

  • $\{X_n\}$ and $\{Y_n\}$ follow linear Gaussian equations, or
  • $\{X_n\}$ takes values in a finite state space $\mathcal{X}$ (finite-state-space HMM).


Point Estimates and Posterior Marginals

• From the posterior distribution, one can compute useful point estimates such as the joint MAP estimate:

  $\arg\max_{x_{1:n}} p(x_{1:n} \mid y_{1:n})$

• One can also compute the MAP estimate for components of the marginals:

  $\arg\max_{x_k} p(x_k \mid y_{1:n}), \quad p(x_k \mid y_{1:n}) = \int \cdots \int p(x_{1:n} \mid y_{1:n}) \, dx_{1:k-1} \, dx_{k+1:n}$

• The posterior mean (minimum mean square error estimate) can also be computed:

  $\mathbb{E}[X_k \mid y_{1:n}] = \int x_k \, p(x_k \mid y_{1:n}) \, dx_k$


Particle Motion in a Random Medium

• Consider a Markovian particle $\{X_n\}, n \geq 1$ evolving in a random medium as
  follows:

  $X_1 \sim \mu(\cdot) \quad \text{and} \quad X_{n+1} \mid (X_n = x) \sim f(\cdot \mid x)$

• At time $n$, the probability for the particle to be killed is $1 - g(X_n)$,
  where $0 \leq g(x) \leq 1$ for any $x \in E$.

• Let $T$ be the time at which the particle is killed. We want to compute the
  probability $\Pr(T > n)$.


Particle Motion in a Random Medium

• Starting from $t = 1$, given the current state $x_1$, the probability for the particle to
  survive is $g(x_1)$.

• Thus, the joint probability (particle at state $x_1$ and particle survives) is

  $\mu(x_1) \, g(x_1)$

• By integrating over $x_1$, the probability that the particle survives at time $t = 1$ is

  $\int \mu(x_1) \, g(x_1) \, dx_1$


Particle Motion in a Random Medium

• At $t = 2$, given the state $x_1$, the current state $x_2$ is determined by the transition
  density $f(x_2 \mid x_1)$.
• The probability for such a particle to survive at time $t = 2$ is also determined by
  the current state $x_2$, i.e. the probability is $g(x_2)$.
• If the particle survives through time $t = 2$, it means:
  1. at time 1, the particle survives with probability $g(x_1)$;
  2. state $x_1$ determines the current state $x_2$ with probability $f(x_2 \mid x_1)$;
  3. the probability to survive at time $t = 2$ is $g(x_2)$.

• The joint probability for the three events is

  $\mu(x_1) \, f(x_2 \mid x_1) \, g(x_1) \, g(x_2)$

  where $\mu(x_1) f(x_2 \mid x_1)$ determines the random states and
  $g(x_1) g(x_2)$ determines the probability to survive at each time.


Particle Motion in a Random Medium

• This can be considered as a typical hidden Markov model:

  Markov chain (state equation): $x_k \sim f(x_k \mid x_{k-1})$

  Survival (observation equation): survive at time $k$ with probability $g(x_k)$

• The joint density for the particle states and survival up to time $t = n$ is

  $\mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1}) \prod_{k=1}^{n} g(x_k)$

• By integrating over the state variables $x_k$, we obtain the probability for the
  particle to survive past time $t = n$:

  $\Pr(T > n) = \mathbb{E}_\mu\big[\text{probability of surviving at time } n\big] = \int \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1}) \prod_{k=1}^{n} g(x_k) \, dx_{1:n}$


Particle Motion in a Random Medium

$\Pr(T > n) = \mathbb{E}_\mu\big[\text{probability of surviving at time } n\big] = \int \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1}) \prod_{k=1}^{n} g(x_k) \, dx_{1:n}$

• To place this calculation in our SMC framework, we define the following:

  $\gamma_n(x_{1:n}) = \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1}) \prod_{k=1}^{n} g(x_k)$

• Then the integral needed to compute the required probability is just the
  normalization constant of $\gamma_n(x_{1:n})$, i.e.

  $Z_n = \int \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1}) \prod_{k=1}^{n} g(x_k) \, dx_{1:n}$

  $\pi_n(x_{1:n}) = \frac{\gamma_n(x_{1:n})}{Z_n} \quad \text{and} \quad Z_n = \Pr(T > n)$
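The identity $Z_n = \Pr(T > n) = \mathbb{E}_\mu\big[\prod_{k=1}^n g(X_k)\big]$ can be checked by plain Monte Carlo on a toy model. Here the particle follows a standard Gaussian random walk and $g(x) = \exp(-x^2/2)$; both choices are illustrative, not from the slides. For $n = 1$ the answer is exact: $\mathbb{E}[\exp(-X^2/2)] = 1/\sqrt{2}$ for $X \sim \mathcal{N}(0,1)$.

```python
import numpy as np

# Toy check of Z_n = Pr(T > n): Gaussian random-walk particle with
# survival probability g(x) = exp(-x^2/2) (illustrative model choices).
def survival_prob_mc(n, n_samples=100_000, rng=None):
    """Estimate Pr(T > n) = E[prod_k g(X_k)] by plain Monte Carlo."""
    rng = np.random.default_rng(rng)
    x = rng.normal(0.0, 1.0, n_samples)          # X_1 ~ mu = N(0, 1)
    weights = np.exp(-x**2 / 2.0)                # g(X_1)
    for _ in range(1, n):
        x = x + rng.normal(0.0, 1.0, n_samples)  # X_k ~ f(.|x) = N(x, 1)
        weights *= np.exp(-x**2 / 2.0)           # accumulate prod_k g(X_k)
    return weights.mean()
```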
Closed Form Inference in HMM
• We have closed-form solutions for finite state-space HMMs, as all integrals
  become finite sums.

• For linear Gaussian models (LG-SSM), all the posterior distributions are
  Gaussian (Kalman filter).

• In most cases of interest, it is not possible to compute the solution in closed
  form and we need numerical approximations.

• This is the case for all non-linear, non-Gaussian models.

• SMC methods for such problems are in some sense asymptotically
  consistent.


Closed Form Inference in HMM
• Gaussian approximations: Extended Kalman filter, Unscented Kalman filter.
• Gaussian sum approximations (a weighted sum of Gaussian PDFs can be
  used to approximate another density function arbitrarily closely).
• Projection filters (similar to variational methods in machine learning).
• Simple discretization of the state space.
  • Analytical methods work in simple cases but are not reliable, and it is
    difficult to diagnose when they fail.
  • Standard discretization of the space is expensive and difficult to implement
    in high-dimensional scenarios.
• We need numerical approximations.


Monte Carlo Methods, Importance Sampling and Sequential Importance Sampling


Monte Carlo Methods

• Our goal is to compute an expectation of the form

  $\mathbb{E}_\pi[f(x)] = \int_A f(x) \, \pi(x) \, dx$

  where $\pi(x)$ is a probability distribution (posterior inference in Bayesian
  models, Bayesian model validation, etc.).

• We assume that $\pi(x) = \frac{\gamma(x)}{Z}$, where $Z = \int \gamma(x) \, dx$ is unknown and $\gamma$ is known
  pointwise.

• The basic idea in Monte Carlo methods is to sample $N$ i.i.d. random variables
  $X^{(i)} \overset{i.i.d.}{\sim} \pi(\cdot)$ and build the empirical measure

  $\hat{\pi}(dx) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}}(dx)$

• Using this:

  $\mathbb{E}_{\hat{\pi}}[f(x)] = \frac{1}{N} \sum_{i=1}^{N} f(X^{(i)}), \quad \text{where } X^{(i)} \overset{i.i.d.}{\sim} \pi(\cdot)$

  • J.S. Liu, Monte Carlo Strategies in Scientific Computing, Chapter 3, Springer-Verlag, New York.


Monte Carlo Methods

• Using the approximation of $\pi$:

  $\mathbb{E}_{\hat{\pi}}[f(x)] = \frac{1}{N} \sum_{i=1}^{N} f(X^{(i)}), \quad \text{where } X^{(i)} \overset{i.i.d.}{\sim} \pi(\cdot)$

• The following hold:

  $\mathbb{E}\big[\mathbb{E}_{\hat{\pi}}[f]\big] = \mathbb{E}_\pi[f], \quad \mathbb{V}\big[\mathbb{E}_{\hat{\pi}}[f]\big] = \frac{1}{N} \mathbb{E}_\pi\big[(f - \mathbb{E}_\pi[f])^2\big], \quad \sqrt{N}\big(\mathbb{E}_{\hat{\pi}}[f] - \mathbb{E}_\pi[f]\big) \sim \mathcal{N}\big(0, \, \mathbb{E}_\pi[(f - \mathbb{E}_\pi[f])^2]\big)$

• Similarly, marginalization is also simple:

  $\hat{\pi}(x_p) \, dx_p = \int \hat{\pi}(x_1, x_2, \ldots, x_n) \, dx_{1:p-1} \, dx_{p+1:n} = \frac{1}{N} \sum_{i=1}^{N} \delta_{X_p^{(i)}}(dx_p)$

• In MC, the samples automatically concentrate in regions of high probability
  mass regardless of the dimension of the space.

• However, it is not always easy or effective to sample from the original
  probability distribution $\pi(x)$. A more effective strategy is to focus on the
  regions of "importance" in $\pi(x)$ so as to save computational resources.

  • J.S. Liu, Monte Carlo Strategies in Scientific Computing, Chapter 3, Springer-Verlag, New York.
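The unbiasedness and $O(1/\sqrt{N})$ error above can be illustrated with a one-line estimator. This sketch takes $\pi = \mathcal{N}(0,1)$ and $f(x) = x^2$ (both illustrative choices), for which the exact answer is $\mathbb{E}_\pi[f] = 1$:

```python
import numpy as np

# Plain Monte Carlo: estimate E_pi[f(x)] for pi = N(0, 1), f(x) = x^2.
rng = np.random.default_rng(0)
N = 100_000
x = rng.normal(0.0, 1.0, N)            # X^(i) ~ pi, i.i.d.
est = np.mean(x**2)                    # (1/N) sum_i f(X^(i))
std_err = np.std(x**2) / np.sqrt(N)    # CLT-based standard error, O(1/sqrt(N))
```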
Importance Sampling

• We assume that $\pi(x)$ is only known up to a normalizing constant:

  $\pi(x) = \frac{\gamma(x)}{Z}$

• For any distribution $q(x)$ such that $\pi(x) > 0 \Rightarrow q(x) > 0$, we can write:

  $\pi(x) = \frac{w(x) \, q(x)}{\int w(x) \, q(x) \, dx} = \frac{w(x) \, q(x)}{Z}, \quad \text{where } w(x) = \frac{\gamma(x)}{q(x)}$

• The proposal distribution $q(x)$ is known as the "importance density"; $w(x)$ is
  called the importance weight.

• The importance density can be chosen arbitrarily as any proposal that is easy to
  sample from:

  $X^{(i)} \overset{i.i.d.}{\sim} q(x) \;\Rightarrow\; \hat{q}(dx) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}}(dx)$
Importance Sampling

• Substituting $\hat{q}(dx) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}}(dx)$ into the importance sampling identity
  gives:

  $\hat{\pi}(dx) = \frac{w(x) \, \hat{q}(dx)}{\int w(x) \, \hat{q}(dx)} = \frac{\frac{1}{N}\sum_{i=1}^{N} w(X^{(i)}) \, \delta_{X^{(i)}}(dx)}{\frac{1}{N}\sum_{i=1}^{N} w(X^{(i)})} = \sum_{i=1}^{N} W^{(i)} \delta_{X^{(i)}}(dx),$

  $\text{where } W^{(i)} \propto w(X^{(i)}) \text{ and } \sum_{i=1}^{N} W^{(i)} = 1$

• Similarly, we can approximate the normalization constant of our target
  distribution as follows:

  $\hat{Z} = \int \frac{\gamma(x)}{q(x)} \, \hat{q}(dx) = \int w(x) \, \hat{q}(dx) = \frac{1}{N} \sum_{i=1}^{N} w(X^{(i)}) = \frac{1}{N} \sum_{i=1}^{N} \frac{\gamma(X^{(i)})}{q(X^{(i)})}$
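The two estimators above (the self-normalized weights $W^{(i)}$ and $\hat{Z}$) can be checked on a toy problem where $Z$ is known. This sketch takes $\gamma(x) = \exp(-x^2/2)$ (an unnormalized $\mathcal{N}(0,1)$, so $Z = \sqrt{2\pi}$) and proposal $q = \mathcal{N}(0, 2^2)$; all concrete choices are illustrative:

```python
import numpy as np

# Self-normalized importance sampling sketch.
# Target: gamma(x) = exp(-x^2/2), so Z = sqrt(2*pi). Proposal: q = N(0, 4).
rng = np.random.default_rng(0)
N = 200_000
x = rng.normal(0.0, 2.0, N)                 # X^(i) ~ q
log_gamma = -x**2 / 2.0
log_q = -x**2 / 8.0 - np.log(2.0 * np.sqrt(2.0 * np.pi))
w = np.exp(log_gamma - log_q)               # w(x) = gamma(x) / q(x)
W = w / w.sum()                             # normalized weights W^(i)
Z_hat = w.mean()                            # estimate of Z = sqrt(2*pi)
mean_hat = np.sum(W * x)                    # E_pi[x] (exact: 0)
second_hat = np.sum(W * x**2)               # E_pi[x^2] (exact: 1)
```

Computing the weights on the log scale, as above, is the standard way to avoid underflow when $\gamma$ and $q$ differ by many orders of magnitude.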


Importance Sampling

$\hat{\pi}(dx) = \sum_{i=1}^{N} W^{(i)} \delta_{X^{(i)}}(dx), \quad \text{where } W^{(i)} \propto w(X^{(i)}) \text{ and } \sum_{i=1}^{N} W^{(i)} = 1$

• The distribution $\pi(x)$ is now approximated by a weighted sum of delta
  masses, where the weights compensate for the discrepancy between
  $\pi(x)$ and $q(x)$.


Importance Sampling

• Similarly, computing $\mathbb{E}_\pi[f(x)]$ using importance sampling gives:

  $\mathbb{E}_{\hat{\pi}}[f(x)] = \int_A f(x) \, \hat{\pi}(dx) = \sum_{i=1}^{N} f(X^{(i)}) \, W^{(i)}$

• The statistics of this estimate are given for $N \gg 1$ as follows:

  $\mathbb{E}\big[\mathbb{E}_{\hat{\pi}}[f(x)]\big] = \mathbb{E}_\pi[f(x)] - \frac{1}{N} \mathbb{E}_\pi\big[W(X)\big(f(X) - \mathbb{E}_\pi[f(x)]\big)\big]$

  $\mathbb{V}\big[\mathbb{E}_{\hat{\pi}}[f(x)]\big] = \frac{1}{N} \mathbb{E}_\pi\big[W(X)\big(f(X) - \mathbb{E}_\pi[f(x)]\big)^2\big]$

  where the bias term

  $-\frac{1}{N} \mathbb{E}_\pi\big[W(X)\big(f(X) - \mathbb{E}_\pi[f(x)]\big)\big]$

  is negligible (of order $1/N$).


Estimating the Normalization Constant

• We can similarly compute the statistics of the normalization constant estimate:

  $\hat{Z} = \int \frac{\gamma(x)}{q(x)} \, \hat{q}(dx) = \frac{1}{N} \sum_{i=1}^{N} \frac{\gamma(X^{(i)})}{q(X^{(i)})} = \frac{1}{N} \sum_{i=1}^{N} w(X^{(i)})$

• They are given as:

  $\mathbb{E}[\hat{Z}] = Z, \quad \text{and} \quad \mathbb{V}[\hat{Z}] = \frac{1}{N} \left( \mathbb{E}_q\left[\left(\frac{\gamma(x)}{q(x)}\right)^2\right] - Z^2 \right)$


Variance of the Weights

• We select $q(x)$ as close as possible to $\pi(x)$.

• The variance of the weights is bounded iff

  $\int \frac{\gamma^2(x)}{q(x)} \, dx < \infty$

• In practice, it is sufficient to ensure that the weights are bounded:

  $w(x) = \frac{\gamma(x)}{q(x)} \leq C < \infty$

• This is equivalent to saying that $q(x)$ should have heavier tails than $\pi(x)$.


Monte Carlo for the State Space Model

• We are interested in estimating $p(x_{1:n} \mid y_{1:n}) = \frac{p(x_{1:n}, y_{1:n})}{p(y_{1:n})} \propto p(x_{1:n}, y_{1:n})$.

• For now, let us start with a fixed $n$.

• A Monte Carlo approximation (empirical measure) of our target distribution is
  of the form:

  $\hat{p}^N(x_{1:n} \mid y_{1:n}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X_{1:n}^{(i)}}(x_{1:n}), \quad \text{where } X_{1:n}^{(i)} \sim p(x_{1:n} \mid y_{1:n})$

• For any function $\varphi(x_{1:n}): \mathcal{X}^n \to \mathbb{R}$, we can use a Monte Carlo approximation of
  its expectation:

  $\mathbb{E}_{\hat{p}^N(x_{1:n} \mid y_{1:n})}[\varphi] = \int_{\mathcal{X}^n} \varphi(x_{1:n}) \, \hat{p}^N(x_{1:n} \mid y_{1:n}) \, dx_{1:n} = \int_{\mathcal{X}^n} \varphi(x_{1:n}) \frac{1}{N} \sum_{i=1}^{N} \delta_{X_{1:n}^{(i)}}(x_{1:n}) \, dx_{1:n} = \frac{1}{N} \sum_{i=1}^{N} \varphi(X_{1:n}^{(i)})$


Monte Carlo for the State Space Model

• This estimate is asymptotically consistent (it converges towards
  $\mathbb{E}_{p(x_{1:n} \mid y_{1:n})}[\varphi]$).

• The estimate is unbiased, and its variance gives the following convergence
  properties:

  $\mathrm{Var}_{X_{1:n}^{(i)}}\big[\mathbb{E}_{\hat{p}^N(x_{1:n} \mid y_{1:n})}[\varphi]\big] = \frac{1}{N} \mathrm{Var}_{p(x_{1:n} \mid y_{1:n})}[\varphi]$

  $\sqrt{N}\big(\mathbb{E}_{\hat{p}^N(x_{1:n} \mid y_{1:n})}[\varphi] - \mathbb{E}_{p(x_{1:n} \mid y_{1:n})}[\varphi]\big) \overset{d}{\to} \mathcal{N}\big(0, \, \mathrm{Var}_{p(x_{1:n} \mid y_{1:n})}[\varphi]\big)$

• The rate of convergence is independent of $n$. This does not imply that Monte
  Carlo beats the curse of dimensionality of $\mathcal{X}^n$, since it is possible that
  $\mathrm{Var}_{p(x_{1:n} \mid y_{1:n})}[\varphi]$ increases with $n$.


Monte Carlo for the State Space Model

• The Monte Carlo approximation can easily be used to compute any marginal
  distribution, e.g. $p(x_k \mid y_{1:n})$:

  $\hat{p}^N(x_k \mid y_{1:n}) = \int_{\mathcal{X}^{n-1}} \hat{p}^N(x_{1:n} \mid y_{1:n}) \, dx_{1:k-1} \, dx_{k+1:n} = \int_{\mathcal{X}^{n-1}} \frac{1}{N} \sum_{i=1}^{N} \delta_{X_{1:n}^{(i)}}(x_{1:n}) \, dx_{1:k-1} \, dx_{k+1:n} = \frac{1}{N} \sum_{i=1}^{N} \delta_{X_k^{(i)}}(x_k)$

• Note that the marginal likelihood $p(y_{1:n})$ cannot be estimated as easily using
  $X_{1:n}^{(i)} \sim p(x_{1:n} \mid y_{1:n})$.


Difficulties with Standard Monte Carlo Sampling

• Problem 1: It is difficult to generate exact samples $X_{1:n}^{(i)} \sim p(x_{1:n} \mid y_{1:n})$ from our
  high-dimensional target distribution.

  • MCMC methods are not useful in this context (they are not recursive).

• Problem 2: Even if we can address Problem 1, algorithms to generate
  samples from $p(x_{1:n} \mid y_{1:n})$ will have at least complexity $\mathcal{O}(n)$ (increasing
  linearly with $n$).

  • As $n$ increases, we would like to be able to sample from $p(x_{1:n} \mid y_{1:n})$ with
    an algorithm that keeps the computational cost fixed at each time step $n$.


Difficulties with Standard Monte Carlo Sampling

• Problem 1: It is difficult to generate exact samples $X_{1:n}^{(i)} \sim p(x_{1:n} \mid y_{1:n})$ from our
  high-dimensional target distribution.

• Problem 2: Even if we can address Problem 1, algorithms to generate
  samples from $p(x_{1:n} \mid y_{1:n})$ will have at least complexity $\mathcal{O}(n)$ (increasing
  linearly with $n$).

• SMC partially solves both problems by breaking the sampling from $p(x_{1:n} \mid y_{1:n})$
  into a collection of simpler subproblems: first approximate $p(x_1 \mid y_1)$ and $p(y_1)$
  at time 1, then $p(x_{1:2} \mid y_{1:2})$ and $p(y_{1:2})$ at time 2, and so on.

• Each target distribution is approximated by a cloud of random samples
  (particles) evolving according to importance sampling and resampling steps.


Importance Sampling for the State Space Model
 Rather than sampling directly from our target distribution $p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})$, we sample from an importance distribution $q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})$.

 Note that in the notation here for $q$, $\mathbf{y}_{1:n}$ is used as a parameter – not to indicate any posterior distribution.

 The importance distribution needs to satisfy the following properties:

 The support of $q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})$ includes the support of $p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})$, i.e.
$$
p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n}) > 0 \;\Rightarrow\; q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n}) > 0
$$

 It is easy to sample from $q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})$.

 We use the following identity, with $w(\mathbf{x}_{1:n}, \mathbf{y}_{1:n}) \equiv \dfrac{p(\mathbf{x}_{1:n}, \mathbf{y}_{1:n})}{q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}$:
$$
p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})
= \frac{p(\mathbf{x}_{1:n}, \mathbf{y}_{1:n})}{\int p(\mathbf{x}_{1:n}, \mathbf{y}_{1:n})\, d\mathbf{x}_{1:n}}
= \frac{\frac{p(\mathbf{x}_{1:n}, \mathbf{y}_{1:n})}{q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}\, q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}{\int \frac{p(\mathbf{x}_{1:n}, \mathbf{y}_{1:n})}{q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}\, q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})\, d\mathbf{x}_{1:n}}
= \frac{w(\mathbf{x}_{1:n}, \mathbf{y}_{1:n})\, q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}{\int w(\mathbf{x}_{1:n}, \mathbf{y}_{1:n})\, q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})\, d\mathbf{x}_{1:n}}
$$
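 The identity can be verified numerically on a toy 1-D problem (an illustration, not from the slides; the unnormalized $N(0,1)$ target and the $N(0,2^2)$ proposal are assumptions): the self-normalized estimate recovers $\mathbb{E}_p[x^2] = 1$, and the mean weight recovers the normalizer $\sqrt{2\pi}$.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

# Toy (assumed) setup: target p(x) ∝ exp(-x^2/2), i.e. N(0,1) with unknown
# normalizer sqrt(2*pi); proposal q = N(0, 2^2), wider than p so that
# supp(p) ⊆ supp(q) as required above.
x = rng.normal(0.0, 2.0, N)                         # X^(i) ~ q

p_unnorm = np.exp(-0.5 * x**2)                      # unnormalized target density
q_pdf = np.exp(-0.125 * x**2) / (2.0 * np.sqrt(2.0 * np.pi))
w = p_unnorm / q_pdf                                # unnormalized weights w = p/q

# Self-normalized IS estimate of E_p[x^2] (exact value 1 for N(0,1))
est_x2 = np.sum(w * x**2) / np.sum(w)
# Mean weight estimates the normalizer: sqrt(2*pi) ≈ 2.5066
Z_hat = w.mean()
```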
Importance Sampling for the State Space Model
 Let us draw $N$ samples from our importance distribution:
$$
\mathbf{X}_{1:n}^{(i)} \sim q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n}), \qquad
\hat{q}_N(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\mathbf{X}_{1:n}^{(i)}}(\mathbf{x}_{1:n})
$$

 Then using the identity in the earlier slide, we obtain the following approximation of our target distribution:
$$
\hat{p}_N(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})
= \frac{w(\mathbf{x}_{1:n}, \mathbf{y}_{1:n})\, \hat{q}_N(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}{\int w(\mathbf{x}_{1:n}, \mathbf{y}_{1:n})\, \hat{q}_N(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})\, d\mathbf{x}_{1:n}}
= \frac{w(\mathbf{x}_{1:n}, \mathbf{y}_{1:n})\, \frac{1}{N} \sum_{i=1}^{N} \delta_{\mathbf{X}_{1:n}^{(i)}}(\mathbf{x}_{1:n})}{\int w(\mathbf{x}_{1:n}, \mathbf{y}_{1:n})\, \frac{1}{N} \sum_{i=1}^{N} \delta_{\mathbf{X}_{1:n}^{(i)}}(\mathbf{x}_{1:n})\, d\mathbf{x}_{1:n}}
= \sum_{i=1}^{N} W_n^{(i)} \delta_{\mathbf{X}_{1:n}^{(i)}}(\mathbf{x}_{1:n}),
\qquad
W_n^{(i)} = \frac{w(\mathbf{X}_{1:n}^{(i)}, \mathbf{y}_{1:n})}{\sum_{i=1}^{N} w(\mathbf{X}_{1:n}^{(i)}, \mathbf{y}_{1:n})}
$$

 Note that:
$$
\hat{Z}_n \equiv \hat{p}_N(\mathbf{y}_{1:n})
= \int w(\mathbf{x}_{1:n}, \mathbf{y}_{1:n})\, \frac{1}{N} \sum_{i=1}^{N} \delta_{\mathbf{X}_{1:n}^{(i)}}(\mathbf{x}_{1:n})\, d\mathbf{x}_{1:n}
= \frac{1}{N} \sum_{i=1}^{N} w(\mathbf{X}_{1:n}^{(i)}, \mathbf{y}_{1:n})
$$
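 As a sketch of how these formulas apply in a state space model (an illustration of my own; the random-walk model, the choice $q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n}) = p(\mathbf{x}_{1:n})$, and all variable names are assumptions, not from the slides), the estimate $\hat{Z}_n$ can be compared against the exact evidence, which is available in closed form for a linear-Gaussian model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 3, 200_000
y = np.array([0.5, -0.2, 0.8])        # fixed made-up observations

# Assumed linear-Gaussian SSM: x_1 ~ N(0,1), x_k = x_{k-1} + N(0,1),
# y_k = x_k + N(0,1).  Take q(x_{1:n}|y_{1:n}) = p(x_{1:n}) (the state prior),
# so w = p(x_{1:n}, y_{1:n}) / q(x_{1:n}) = prod_k N(y_k; x_k, 1).
x = np.cumsum(rng.standard_normal((N, n)), axis=1)   # N prior trajectories

logw = -0.5 * np.sum((y - x) ** 2, axis=1) - 0.5 * n * np.log(2 * np.pi)
w = np.exp(logw)                      # unnormalized weights
W = w / w.sum()                       # normalized weights W_n^(i)
Z_hat = w.mean()                      # (1/N) sum_i w(X_{1:n}^(i), y_{1:n})

# Exact evidence for this model: y ~ N(0, Sigma), Sigma_ij = min(i,j) + delta_ij
idx = np.arange(1, n + 1)
Sigma = np.minimum.outer(idx, idx) + np.eye(n)
Z_exact = np.exp(-0.5 * y @ np.linalg.solve(Sigma, y)) / np.sqrt(
    (2 * np.pi) ** n * np.linalg.det(Sigma))
```

With the prior as proposal, the weights reduce to the product of observation likelihoods, which is the standard bootstrap-style choice.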
Normalized Weights in Importance Sampling
 The unnormalized weights were defined as follows:
$$
w(\mathbf{x}_{1:n}, \mathbf{y}_{1:n}) = \frac{p(\mathbf{x}_{1:n}, \mathbf{y}_{1:n})}{q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}
= p(\mathbf{y}_{1:n}) \underbrace{\frac{p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}{q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}}_{\substack{\text{discrepancy between target} \\ \text{and importance distribution}}}
$$

 The normalized weights are then given as:
$$
W_n^{(i)} = \frac{w(\mathbf{X}_{1:n}^{(i)}, \mathbf{y}_{1:n})}{\sum_{i=1}^{N} w(\mathbf{X}_{1:n}^{(i)}, \mathbf{y}_{1:n})}
$$


Optimal Importance Sampling Distribution
 $\hat{p}_N(\mathbf{y}_{1:n})$ is an unbiased estimate of $Z_n \equiv p(\mathbf{y}_{1:n})$ with variance:
$$
\frac{Z_n^2}{N} \left( \int \frac{p^2(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}{q^2(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}\, q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})\, d\mathbf{x}_{1:n} - 1 \right)
$$

 You can bring this variance to zero (variance of the unnormalized weights equal to zero) with the selection
$$
q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n}) = p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})
$$
Of course, this is what we wanted to avoid (we want to sample from an easier distribution).

 However, this result points to the fact that the choice of $q$ needs to be as close as possible to the target distribution.
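 This can be checked numerically on a toy 1-D target (an illustration, not from the slides; the $N(0,1)$ target and Gaussian proposals are assumptions): the empirical variance of the unnormalized weights is exactly zero when $q = p$ and inflates as $q$ moves away from the target.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50_000

def target_pdf(z):
    """Toy target p = N(0,1) (stands in for p(x_{1:n}|y_{1:n}))."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def weight_variance(sigma_q):
    """Empirical variance of w = p/q for the proposal q = N(0, sigma_q^2)."""
    x = rng.normal(0.0, sigma_q, N)
    q = np.exp(-0.5 * (x / sigma_q) ** 2) / (sigma_q * np.sqrt(2.0 * np.pi))
    return (target_pdf(x) / q).var()

v_match = weight_variance(1.0)   # q = p: every weight equals 1, variance 0
v_wide = weight_variance(3.0)    # q far from p: weight variance inflates
```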



Bias & Variance of Importance Sampling Estimates
 We are interested in an importance sampling approximation of $\mathbb{E}_{p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}[\varphi]$:
$$
I_n^{IS}(\varphi) \equiv \mathbb{E}_{\hat{p}_N(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}[\varphi] = \sum_{i=1}^{N} W_n^{(i)} \varphi(\mathbf{X}_{1:n}^{(i)})
$$

 This is a biased estimate for a finite $N$, and we have shown that for importance sampling:
$$
\lim_{N \to \infty} N \left( \mathbb{E}_{\hat{p}_N(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}[\varphi] - \mathbb{E}_{p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}[\varphi] \right)
= - \int \frac{p^2(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}{q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})} \left( \varphi(\mathbf{x}_{1:n}) - \mathbb{E}_{p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}[\varphi] \right) d\mathbf{x}_{1:n}
$$
$$
\sqrt{N} \left( \mathbb{E}_{\hat{p}_N(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}[\varphi] - \mathbb{E}_{p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}[\varphi] \right)
\xrightarrow{d} \mathcal{N}\!\left( 0,\; \int \frac{p^2(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}{q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})} \left( \varphi(\mathbf{x}_{1:n}) - \mathbb{E}_{p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}[\varphi] \right)^2 d\mathbf{x}_{1:n} \right)
$$

 The asymptotic bias is of the order $1/N$ (negligible) and the MSE error is:
$$
\text{MSE} = \underbrace{\text{bias}^2}_{\mathcal{O}(N^{-2})} + \underbrace{\text{variance}}_{\mathcal{O}(N^{-1})}
$$



Selection of Importance Sampling Distribution
 For a given test function $\varphi(\mathbf{x}_{1:n})$, it is easy to establish the importance distribution minimizing the asymptotic variance of $I_n^{IS}(\varphi)$.

 However, such a result is of minimal interest in a filtering context, as this distribution depends on $\varphi(\mathbf{x}_{1:n})$ and we are typically interested in the expectations of several test functions.

 Moreover, even if we were interested in a single test function, say $\varphi(\mathbf{x}_{1:n}) = x_n$, then selecting the optimal importance distribution at time $n$ would have detrimental effects when we try to obtain a sequential version of the algorithms.

 The optimal distribution for estimating $\varphi(\mathbf{x}_{1:n-1})$ will almost certainly not even be similar to the marginal distribution of $\mathbf{x}_{1:n-1}$ in the optimal distribution for estimating $\varphi(\mathbf{x}_{1:n})$, and this will prove to be problematic.
Selection of Importance Sampling Distribution
 A more appropriate approach in this context is to select the $q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})$ which minimizes the variance of the importance weights (or, equivalently, the variance of $\hat{Z}_n$).

 Clearly, this variance is minimized for $q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n}) = p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})$. We cannot do this, as it is the reason we used IS in the first place. However, this simple result indicates that we should aim at selecting an IS distribution as close as possible to the target.

 As discussed before, the importance sampling distribution should be selected so that the weights are bounded, or equivalently so that $q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})$ has heavier tails than $p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})$:
$$
w(\mathbf{x}_{1:n}, \mathbf{y}_{1:n}) \leq C \quad \forall \mathbf{x}_{1:n} \in \mathcal{X}^n
$$

 Note that the importance sampling distribution needs not only to cover the support of the target but also to be a clever choice for the particular problem of interest.
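 The heavier-tails requirement is easy to visualize on a toy example (mine, not from the slides): for a $N(0,1)$ target, a $N(0,2^2)$ proposal gives weights bounded by $C = 2$, while a $N(0,0.5^2)$ proposal gives weights that blow up in the tails.

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 1001)                  # evaluation grid
p = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)    # target N(0,1)

def weight_fn(sigma_q):
    """w(x) = p(x)/q(x) for the Gaussian proposal q = N(0, sigma_q^2)."""
    q = np.exp(-0.5 * (x / sigma_q) ** 2) / (sigma_q * np.sqrt(2.0 * np.pi))
    return p / q

w_heavy = weight_fn(2.0)   # heavier tails than p: w(x) = 2 exp(-3x^2/8) <= 2
w_light = weight_fn(0.5)   # lighter tails than p: w(x) = exp(3x^2/2)/2 -> inf
```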



Effective Sample Size
 In our importance sampling approximation of the target $p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})$ using the importance distribution $q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})$ (for a fixed $n$), we would ideally like to have $q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n}) = p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})$.

 In this case, all the unnormalized importance weights would be equal and their variance equal to zero.

 To assess the quality of the importance sampling approximation, note that for flat functions,
$$
\frac{\text{Variance of IS estimate}}{\text{Variance of standard MC estimate}}
\approx 1 + \text{Var}_{q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}\!\left[ W(\mathbf{X}_{1:n} \mid \mathbf{y}_{1:n}) \right]
$$

 This is often interpreted as the effective sample size ($N$ weighted samples from $q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})$ are approximately equivalent to $M$ unweighted samples from $p(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})$):
$$
M = \frac{N}{1 + \text{Var}_{q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}\!\left[ W(\mathbf{X}_{1:n} \mid \mathbf{y}_{1:n}) \right]} \leq N
$$



Effective Sample Size
 We often approximate the effective sample size $M$ as follows:
$$
ESS = \left( \sum_{i=1}^{N} \left( W_n^{(i)} \right)^2 \right)^{-1}
$$
since
$$
\text{Var}_{q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}\!\left[ W(\mathbf{X}_{1:n}^{(i)} \mid \mathbf{y}_{1:n}) \right]
\approx N \sum_{i=1}^{N} W^2(\mathbf{X}_{1:n}^{(i)} \mid \mathbf{y}_{1:n}) - 1
$$

 From $ESS = \dfrac{N}{1 + \text{Var}_{q(\mathbf{x}_{1:n} \mid \mathbf{y}_{1:n})}[W(\mathbf{X}_{1:n} \mid \mathbf{y}_{1:n})]}$ we can clearly see that
$$
1 \leq ESS = \left( \sum_{i=1}^{N} \left( W_n^{(i)} \right)^2 \right)^{-1} \leq N
$$

 We can thus have anything from

 $ESS = 1$ (one of the weights equal to 1, all others zero; very inefficient) to

 $ESS = N$ (all weights equal to $1/N$; excellent sampling).
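 The two extreme cases can be checked directly with a short sketch (a toy illustration; the helper name `ess` is mine, not from the slides):

```python
import numpy as np

def ess(unnormalized_weights):
    """Effective sample size ESS = 1 / sum_i (W_n^(i))^2."""
    w = np.asarray(unnormalized_weights, dtype=float)
    W = w / w.sum()                      # normalized weights
    return 1.0 / np.sum(W**2)

N = 1000
ess_uniform = ess(np.ones(N))            # all W = 1/N  -> ESS = N

w_degen = np.zeros(N)
w_degen[0] = 1.0
ess_degen = ess(w_degen)                 # one weight 1, rest 0 -> ESS = 1
```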


