
Sequential Importance Sampling

Prof. Nicholas Zabaras

Email: [email protected]
URL: https://www.zabaras.com/

November 12, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


Contents
• Sequential Bayesian Inference
• Importance sampling for the state space model
• SEQUENTIAL IMPORTANCE SAMPLING: Sequential IS, Factorization of the importance density, Variance of the IS Estimates, IS in High-Dimensions, The Bootstrap Particle Filter, Resampling
• BAYESIAN RECURSION FORMULAS: Filtering and marginal likelihood, The Bootstrap Filter implementing the prediction/update recursions, The Kalman filter updates for a LG-SSM, Forward-Filtering Backward-Smoothing relations, Forward-Backward two-filter smoother
• ONLINE BAYESIAN PARAMETER ESTIMATION: Introduction, MLE Solution and Fisher's identity, Expectation-Maximization approach, Gaussian Process SSM


Goals
• The goals for today's lecture include the following:
  • Learn about sequential importance sampling for state space models
  • Understand online Bayesian parameter estimation in state space models


Bayesian Inference for the SSM
• While our overall estimation problem is to compute the joint filtering distribution $p(x_{1:n} \mid y_{1:n})$, the following inference problems are also of interest:
  • Filtering: compute $p(x_n \mid y_{1:n})$
  • Prediction: compute $p(x_{n+1} \mid y_{1:n})$
  • Joint smoothing: $p(x_{1:T} \mid y_{1:T})$
  • Marginal smoothing: $p(x_n \mid y_{1:T})$, $n \le T$
• The SSM is specified by the likelihood and the prior over state paths:
$$\text{Likelihood: } p(y_1,\dots,y_n \mid x_1,\dots,x_n) = \prod_{i=1}^{n} g(y_i \mid x_i), \qquad \text{Prior: } p(x_{1:n}) = \mu(x_1)\prod_{k=2}^{n} f(x_k \mid x_{k-1})$$
• Note: the Kalman filter provides an analytical solution to the filtering problem for a LG-SSM.
[Graphical model: each observation $y_1, y_2, \dots, y_n$ depends on the corresponding hidden state $x_1, x_2, \dots, x_n$, which form a Markov chain; the targets $p(x_1 \mid y_1), p(x_{1:2} \mid y_{1:2}), \dots, p(x_{1:n} \mid y_{1:n})$ are built up sequentially.]
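
To have a concrete model for the code sketches used later in these notes, the snippet below simulates data from a scalar linear Gaussian SSM. It is only an illustration: the parameter values and the name simulate_lgssm are assumptions, not part of the lecture.

```python
# Minimal sketch (assumed example, not from the slides): simulate a scalar LG-SSM with
#   mu = N(0, 1),  f(x_n | x_{n-1}) = N(a*x_{n-1}, q),  g(y_n | x_n) = N(c*x_n, r).
import numpy as np

def simulate_lgssm(T, a=0.9, c=1.0, q=1.0, r=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(T)
    y = np.zeros(T)
    x[0] = rng.normal(0.0, 1.0)                      # x_1 ~ mu
    y[0] = rng.normal(c * x[0], np.sqrt(r))          # y_1 ~ g(. | x_1)
    for n in range(1, T):
        x[n] = rng.normal(a * x[n - 1], np.sqrt(q))  # x_n ~ f(. | x_{n-1})
        y[n] = rng.normal(c * x[n], np.sqrt(r))      # y_n ~ g(. | x_n)
    return x, y

x_true, y_obs = simulate_lgssm(T=100)
```
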


Importance Sampling for the State Space Model
• Let us draw $N$ samples from our importance distribution:
$$X_{1:n}^{(i)} \sim q(x_{1:n} \mid y_{1:n}), \qquad \hat q_N(x_{1:n} \mid y_{1:n}) = \frac{1}{N}\sum_{i=1}^{N} \delta_{X_{1:n}^{(i)}}(x_{1:n})$$
• Then, using the identity in the earlier slide, we obtain the following approximation of our target distribution:
$$\hat p_N(x_{1:n} \mid y_{1:n}) = \frac{w(x_{1:n}, y_{1:n})\,\hat q_N(x_{1:n} \mid y_{1:n})}{\int w(x_{1:n}, y_{1:n})\,\hat q_N(x_{1:n} \mid y_{1:n})\,dx_{1:n}}
= \frac{w(x_{1:n}, y_{1:n})\,\frac{1}{N}\sum_{i=1}^{N}\delta_{X_{1:n}^{(i)}}(x_{1:n})}{\int w(x_{1:n}, y_{1:n})\,\frac{1}{N}\sum_{i=1}^{N}\delta_{X_{1:n}^{(i)}}(x_{1:n})\,dx_{1:n}}
= \sum_{i=1}^{N} W_n^{(i)}\,\delta_{X_{1:n}^{(i)}}(x_{1:n}), \qquad W_n^{(i)} = \frac{w(X_{1:n}^{(i)}, y_{1:n})}{\sum_{j=1}^{N} w(X_{1:n}^{(j)}, y_{1:n})}$$
• Note that:
$$\hat Z_n \equiv \hat p_N(y_{1:n}) = \int w(x_{1:n}, y_{1:n})\,\frac{1}{N}\sum_{i=1}^{N}\delta_{X_{1:n}^{(i)}}(x_{1:n})\,dx_{1:n} = \frac{1}{N}\sum_{i=1}^{N} w(X_{1:n}^{(i)}, y_{1:n})$$
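
As a small practical note on the self-normalized estimator above, the sketch below computes the normalized weights $W_n^{(i)}$ and $\hat Z_n$ from unnormalized weights; working with log-weights is an implementation choice (an assumption, not something the slides prescribe) that avoids numerical overflow.

```python
# Minimal sketch: normalized IS weights W_n^(i) and Z_hat_n = (1/N) * sum_i w^(i),
# computed from log-weights for numerical stability.
import numpy as np

def normalize_weights(log_w):
    m = np.max(log_w)
    w = np.exp(log_w - m)                # unnormalized weights, up to the constant exp(m)
    W = w / np.sum(w)                    # normalized weights W_n^(i)
    log_Z_hat = m + np.log(np.mean(w))   # log of (1/N) * sum_i w(X_{1:n}^(i), y_{1:n})
    return W, log_Z_hat

W, log_Z_hat = normalize_weights(np.array([-1.2, -0.3, -2.5, -0.7]))
```
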
Bias & Variance of Importance Sampling Estimates
• We are interested in an importance sampling approximation of $\mathbb{E}_{p(x_{1:n} \mid y_{1:n})}[\varphi]$:
$$I_n^{IS}(\varphi) \equiv \mathbb{E}_{\hat p_N(x_{1:n} \mid y_{1:n})}[\varphi] = \sum_{i=1}^{N} W_n^{(i)}\,\varphi\big(X_{1:n}^{(i)}\big)$$
• This is a biased estimate for finite $N$, and we have shown in our earlier lecture on importance sampling that:
$$\lim_{N\to\infty} N\left(\mathbb{E}_{\hat p_N(x_{1:n} \mid y_{1:n})}[\varphi] - \mathbb{E}_{p(x_{1:n} \mid y_{1:n})}[\varphi]\right) = -\int \frac{p^2(x_{1:n} \mid y_{1:n})}{q(x_{1:n} \mid y_{1:n})}\left(\varphi(x_{1:n}) - \mathbb{E}_{p(x_{1:n} \mid y_{1:n})}[\varphi]\right)dx_{1:n}$$
$$\sqrt{N}\left(\mathbb{E}_{\hat p_N(x_{1:n} \mid y_{1:n})}[\varphi] - \mathbb{E}_{p(x_{1:n} \mid y_{1:n})}[\varphi]\right) \xrightarrow{d} \mathcal{N}\!\left(0,\ \int \frac{p^2(x_{1:n} \mid y_{1:n})}{q(x_{1:n} \mid y_{1:n})}\left(\varphi(x_{1:n}) - \mathbb{E}_{p(x_{1:n} \mid y_{1:n})}[\varphi]\right)^2 dx_{1:n}\right)$$
• The asymptotic bias is of order $1/N$ (negligible) and the MSE is:
$$\text{MSE} = \underbrace{\text{bias}^2}_{O(N^{-2})} + \underbrace{\text{variance}}_{O(N^{-1})}$$


Sequential Importance Sampling
for the SSM



Sequential Importance Sampling
• Let us return to our state space model and consider a sequential Monte Carlo approximation of $p(x_{1:n} \mid y_{1:n}) \propto p(x_{1:n}, y_{1:n})$.
• The distributions $\pi_n = p(x_{1:n} \mid y_{1:n})$ are known up to a normalizing constant:
$$\pi_n(x_{1:n}) = \frac{\gamma_n(x_{1:n})}{Z_n} = \frac{p(x_{1:n}, y_{1:n})}{Z_n}$$
• We want to estimate expectations of functions $\varphi_n : \mathcal{X}^n \to \mathbb{R}$,
$$\mathbb{E}_{\pi_n}[\varphi_n] = \int \varphi_n(x_{1:n})\,\pi_n(x_{1:n})\,dx_{1:n},$$
and/or the normalizing constants $Z_n$.
• One can use MCMC to sample from $\pi_n$, $n = 1, 2, \dots$, but this calculation will be slow and cannot provide the normalizing constants $Z_n$, $n = 1, 2, \dots$


Sequential Importance Sampling
• We want to do these calculations sequentially, starting with $\pi_1$ and $Z_1$ at step 1 (time 1), then proceeding to $\pi_2$ and $Z_2$, etc.
• Sequential Monte Carlo (SMC) provides the means to do so, as an alternative algorithm to MCMC.
• The key idea is that if $\pi_{n-1}$ does not differ much from $\pi_n$, we should be able to reuse our estimate of $\pi_{n-1}$ to approximate $\pi_n$.
• In sequential importance sampling, the proposal distribution is defined sequentially and the weights are evaluated sequentially.


Sequential Importance Sampling
• We want to design a sequential importance sampling method to approximate $\{\pi_n\}_{n\ge 1}$ and $\{Z_n\}_{n\ge 1}$.
• Assume that at time 1 we have approximations $\hat\pi_1(x_1) = \hat p_N(x_1 \mid y_1)$ and $\hat Z_1$ using an importance density $q_1(x_1 \mid y_1)$:
$$X_1^{(i)} \sim q_1(x_1 \mid y_1), \quad i = 1, 2, \dots, N$$
$$\hat p_N(x_1 \mid y_1)\,dx_1 = \sum_{i=1}^{N} W_1^{(i)}\,\delta_{X_1^{(i)}}(dx_1), \qquad \text{where } W_1^{(i)} = \frac{w_1(X_1^{(i)}, y_1)}{\sum_{j=1}^{N} w_1(X_1^{(j)}, y_1)}$$
$$\hat Z_1 = \frac{1}{N}\sum_{i=1}^{N} w_1(X_1^{(i)}, y_1), \qquad \text{with } w_1(x_1, y_1) = \frac{\gamma_1(x_1)}{q_1(x_1 \mid y_1)} = \frac{p(x_1, y_1)}{q_1(x_1 \mid y_1)}$$


Sequential Importance Sampling
• At time 2, we want to approximate $\hat\pi_2(x_{1:2}) = \hat p_N(x_{1:2} \mid y_{1:2})$ and $\hat Z_2$ using an importance density $q_2(x_{1:2} \mid y_{1:2})$.
• We want to reuse the samples $X_1^{(i)}$ and $q_1(x_1 \mid y_1)$ in building the importance sampling approximation of $\pi_2(x_{1:2})$ and $Z_2$. Let us select a proposal distribution that factorizes as:
$$q_2(x_{1:2} \mid y_{1:2}) = q_1(x_1 \mid y_1)\,q_2(x_2 \mid y_{1:2}, x_1)$$
• To obtain $X_{1:2}^{(i)} \sim q_2(x_{1:2} \mid y_{1:2})$, we need to sample as follows:
$$X_2^{(i)} \mid X_1^{(i)} \sim q_2(x_2 \mid y_{1:2}, X_1^{(i)})$$
• The importance sampling weight for this step is then:
$$w_2(x_{1:2}, y_{1:2}) = \frac{\gamma_2(x_{1:2})}{q_2(x_{1:2} \mid y_{1:2})} = \frac{p(x_{1:2}, y_{1:2})}{q_1(x_1 \mid y_1)\,q_2(x_2 \mid y_{1:2}, x_1)}
= \underbrace{\frac{p(x_1, y_1)}{q_1(x_1 \mid y_1)}}_{\text{weight from step 1}}\;\underbrace{\frac{p(x_{1:2}, y_{1:2})}{p(x_1, y_1)\,q_2(x_2 \mid y_{1:2}, x_1)}}_{\text{incremental weight}}
= w_1(x_1, y_1)\,\frac{p(x_{1:2}, y_{1:2})}{p(x_1, y_1)\,q_2(x_2 \mid y_{1:2}, x_1)}$$


Sequential Importance Sampling
• The normalized weights for step 2 are then given as:
$$W_2^{(i)} \propto w_2(x_{1:2}, y_{1:2}) = \underbrace{w_1(x_1, y_1)}_{\text{weight from step 1}}\;\underbrace{\frac{p(x_{1:2}, y_{1:2})}{p(x_1, y_1)\,q_2(x_2 \mid y_{1:2}, x_1)}}_{\text{incremental weight}}$$
• Generalizing to step $n$, we can write:
$$q_n(x_{1:n} \mid y_{1:n}) = q_{n-1}(x_{1:n-1} \mid y_{1:n-1})\,q_n(x_n \mid y_{1:n}, x_{1:n-1}) = q_1(x_1 \mid y_1)\prod_{k=2}^{n} q_k(x_k \mid y_{1:k}, x_{1:k-1})$$
• Thus if
$$X_{1:n-1}^{(i)} \sim q_{n-1}(x_{1:n-1} \mid y_{1:n-1}),$$
we sample $X_n^{(i)}$ from
$$X_n^{(i)} \mid X_{1:n-1}^{(i)} \sim q_n(x_n \mid y_{1:n}, X_{1:n-1}^{(i)})$$


Sequential Importance Sampling
• The weights for step $n$ are then given as:
$$w_n(X_{1:n}^{(i)}, y_{1:n}) = \frac{p(X_{1:n}^{(i)}, y_{1:n})}{q_n(X_{1:n}^{(i)} \mid y_{1:n})}
= \underbrace{\frac{p(X_{1:n-1}^{(i)}, y_{1:n-1})}{q_{n-1}(X_{1:n-1}^{(i)} \mid y_{1:n-1})}}_{w_{n-1}(X_{1:n-1}^{(i)},\, y_{1:n-1})}\;\frac{p(X_{1:n}^{(i)}, y_{1:n})}{p(X_{1:n-1}^{(i)}, y_{1:n-1})\,q_n(X_n^{(i)} \mid X_{1:n-1}^{(i)}, y_{1:n})}
= w_{n-1}(X_{1:n-1}^{(i)}, y_{1:n-1})\,\frac{p(X_{1:n}^{(i)}, y_{1:n})}{p(X_{1:n-1}^{(i)}, y_{1:n-1})\,q_n(X_n^{(i)} \mid X_{1:n-1}^{(i)}, y_{1:n})}$$
• Similarly, the normalized weights are:
$$W_n^{(i)} = W_n(X_{1:n}^{(i)}, y_{1:n}) \propto w_n(X_{1:n}^{(i)}, y_{1:n})$$
• For our state space model, the above update formula takes the form:
$$w_n(X_{1:n}^{(i)}, y_{1:n}) = w_{n-1}(X_{1:n-1}^{(i)}, y_{1:n-1})\,\frac{f(X_n^{(i)} \mid X_{n-1}^{(i)})\,g(y_n \mid X_n^{(i)})}{q_n(X_n^{(i)} \mid y_{1:n}, X_{1:n-1}^{(i)})}$$
• At each time we have $\hat p_N(x_{1:n} \mid y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{X_{1:n}^{(i)}}(x_{1:n})$ and $\hat Z_n = \frac{1}{N}\sum_{i=1}^{N} w_n^{(i)}$.
• In general, we may need to store all the paths $X_{1:n}^{(i)}$ even if our interest is only to compute $\pi_n(x_n) = p(x_n \mid y_{1:n})$.

Variance of the IS Estimates
• In this sequential framework, it would seem that the only freedom the user has at time $n$ is the choice of $q_n(x_n \mid x_{1:n-1}, y_{1:n})$.
• A sensible strategy consists of selecting it so as to minimize the variance of $w_n(x_{1:n})$. It is straightforward to check that this is achieved by selecting
$$q_n^{opt}(x_n \mid y_{1:n}, x_{1:n-1}) = p(x_n \mid y_{1:n}, x_{1:n-1}),$$
as in this case the variance of $w_n(x_{1:n})$ conditional upon $x_{1:n-1}$ is zero, and the associated incremental weight is given by
$$\alpha_n^{opt}(x_{1:n}) = \frac{p(x_{1:n}, y_{1:n})}{p(x_{1:n-1}, y_{1:n-1})\,q_n(x_n \mid x_{1:n-1}, y_{1:n})}
= \frac{p(x_{1:n-1}, y_{1:n})}{p(x_{1:n-1}, y_{1:n-1})}\,\frac{p(x_n \mid x_{1:n-1}, y_{1:n})}{q_n(x_n \mid x_{1:n-1}, y_{1:n})}
= \frac{\gamma_n(x_{1:n-1})}{\gamma_{n-1}(x_{1:n-1})} = \frac{\int \gamma_n(x_{1:n})\,dx_n}{\gamma_{n-1}(x_{1:n-1})}$$
• It is not always possible to sample from $p(x_n \mid y_{1:n}, x_{1:n-1})$ nor to compute $\alpha_n^{opt}(x_{1:n})$. In these cases, one should employ an approximation of $q_n^{opt}(x_n \mid y_{1:n}, x_{1:n-1})$ for $q_n(x_n \mid y_{1:n}, x_{1:n-1})$.

Variance of the IS Estimates
• In those scenarios in which the time required to sample from $q_n(x_n \mid y_{1:n}, x_{1:n-1})$ and to compute $\alpha_n(x_{1:n})$ is independent of $n$ (and this is indeed the case if $q_n$ is chosen sensibly and one is concerned with a problem such as filtering), it appears that we have provided a solution for Problem 2, the computational complexity being $\mathcal{O}(n)$.
• However, the methodology presented here suffers from severe drawbacks.
• Even for standard IS, the variance of the resulting estimates increases exponentially with $n$.
• The variance of the weights grows unboundedly (weight degeneracy: after some time, only one weight has a non-negligible value).
• As SIS is nothing but a special version of IS in which the importance distribution is of the form
$$q_n(x_{1:n} \mid y_{1:n}) = q_1(x_1 \mid y_1)\prod_{k=2}^{n} q_k(x_k \mid y_{1:k}, x_{1:k-1}),$$
it suffers from the same problems.
• We demonstrate this using a very simple toy example.

Variance of the IS Estimates
• Consider the case $\mathcal{X} = \mathbb{R}$ with $\pi_n(x_{1:n}) = \prod_{k=1}^{n}\mathcal{N}(x_k; 0, 1)$, i.e. $\gamma_n(x_{1:n}) = \prod_{k=1}^{n}\exp\!\big(-\tfrac{x_k^2}{2}\big)$ and $Z_n = (2\pi)^{n/2}$.
• Select the following reasonable importance sampling distribution (with $\sigma^2 > \tfrac12$, $\sigma^2 \neq 1$):
$$q_n(x_{1:n}) = \prod_{k=1}^{n} q_k(x_k) = \prod_{k=1}^{n} \mathcal{N}(x_k; 0, \sigma^2)$$
• Note that
$$w_\sigma(x_{1:n}) = \frac{\gamma_n(x_{1:n})}{q_n(x_{1:n})} = (2\pi)^{n/2}\,\sigma^n \exp\!\left(-\frac{1}{2}\Big(1 - \frac{1}{\sigma^2}\Big)\sum_{i=1}^{n} x_i^2\right) \;\le\; (2\pi)^{n/2}\,\sigma^n \quad \forall x \ \ (\text{when } \sigma^2 \ge 1),$$
and that
$$\mathrm{Var}_{q_\sigma}\!\big[w_\sigma(x_{1:n})\big] = \int \frac{\gamma_n^2(x_{1:n})}{q_\sigma(x_{1:n})}\,dx_{1:n} - \left(\int \gamma_n(x_{1:n})\,dx_{1:n}\right)^2
= (2\pi)^n\left[\int \frac{\sigma^n}{(2\pi)^{n/2}}\exp\!\left(-\Big(1 - \frac{1}{2\sigma^2}\Big)\sum_{i=1}^{n} x_i^2\right)dx_{1:n} - 1\right]
= (2\pi)^n\left[\left(\frac{\sigma^4}{2\sigma^2 - 1}\right)^{n/2} - 1\right]$$
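
The weight degeneracy discussed on the previous slide can be observed directly in this toy example; the short sketch below accumulates the incremental log-weights $\log[\mathcal{N}(x_k; 0, 1)/\mathcal{N}(x_k; 0, \sigma^2)]$ and prints the largest normalized weight, which approaches 1 as $n$ grows (the values of $N$ and $\sigma^2$ used here are assumptions for illustration).

```python
# Minimal sketch: weight degeneracy for the Gaussian toy example above
# (target prod_k N(x_k; 0, 1), proposal prod_k N(x_k; 0, sigma^2)).
import numpy as np

rng = np.random.default_rng(0)
N, sigma2 = 1000, 1.44
log_w = np.zeros(N)
for n in range(1, 101):
    x = rng.normal(0.0, np.sqrt(sigma2), size=N)                    # one more component per particle
    log_w += 0.5 * np.log(sigma2) - 0.5 * (1 - 1 / sigma2) * x**2   # log[N(x;0,1) / N(x;0,sigma^2)]
    W = np.exp(log_w - log_w.max())
    W /= W.sum()
    if n % 25 == 0:
        print(n, W.max())   # the largest normalized weight tends towards 1
```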


Importance Sampling in High-Dimensions
$$\mathrm{Var}_{q_\sigma}\!\big[w_\sigma(x_{1:n})\big] = (2\pi)^n\left[\left(\frac{\sigma^4}{2\sigma^2 - 1}\right)^{n/2} - 1\right]
\;\Rightarrow\;
\frac{\mathrm{Var}\big[\hat Z_n\big]}{Z_n^2} = \frac{1}{N}\left[\left(\frac{\sigma^4}{2\sigma^2 - 1}\right)^{n/2} - 1\right]$$
• It is easy to see that $\sigma^4 > 2\sigma^2 - 1 \Leftrightarrow (\sigma^2 - 1)^2 > 0$ for $\sigma^2 \neq 1$, $\sigma^2 > \tfrac12$. Therefore:
$$\mathrm{Var}_{q_\sigma}\!\big[w_\sigma(x_{1:n})\big] \to \infty \quad \text{as } n \to \infty$$
• The variance of the weights increases exponentially fast with dimensionality. This is despite the good choice of $q_n(x_{1:n})$.
• For example, if we select $\sigma = 1.2$ (i.e. $\sigma^2 = 1.44$), we have a reasonably good importance distribution, as $q_k(x_k) \approx \pi_n(x_k)$, but $N\,\mathrm{Var}[\hat Z_n]/Z_n^2 \approx (1.103)^{n/2}$, which is approximately equal to $1.9 \times 10^{21}$ for $n = 1000$! We would need $N \approx 2 \times 10^{23}$ particles to obtain a relative variance $\mathrm{Var}[\hat Z_n]/Z_n^2 \approx 0.01$. This is impractical.
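
The closed-form relative variance above is easy to verify numerically; the sketch below compares the empirical relative variance of $\hat Z_n$ over repeated runs with $\frac{1}{N}\big[(\sigma^4/(2\sigma^2-1))^{n/2} - 1\big]$ (the number of repetitions, $N$ and $\sigma^2$ used here are assumptions for illustration).

```python
# Minimal sketch: empirical vs. theoretical relative variance of Z_hat_n for the toy example,
# target gamma_n(x_{1:n}) = prod_k exp(-x_k^2/2) with Z_n = (2*pi)^(n/2), proposal prod_k N(0, sigma^2).
import numpy as np

rng = np.random.default_rng(1)
sigma2, N, n_reps = 1.44, 1000, 500
for n in [2, 5, 10, 20]:
    Z_n = (2 * np.pi) ** (n / 2)
    Z_hats = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.normal(0.0, np.sqrt(sigma2), size=(N, n))
        # log w_sigma(x_{1:n}) = (n/2) log(2*pi*sigma^2) - 0.5 * (1 - 1/sigma^2) * sum_k x_k^2
        log_w = (n / 2) * np.log(2 * np.pi * sigma2) - 0.5 * (1 - 1 / sigma2) * np.sum(x**2, axis=1)
        Z_hats[r] = np.mean(np.exp(log_w))          # Z_hat_n = (1/N) sum_i w_sigma(X^(i))
    rel_var_emp = np.var(Z_hats) / Z_n**2
    rel_var_theory = ((sigma2**2 / (2 * sigma2 - 1)) ** (n / 2) - 1) / N
    print(n, rel_var_emp, rel_var_theory)
```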


Proposal Distribution Factorization
• From a practical perspective, we use proposal distributions of the form:
$$q_n(x_n \mid y_{1:n}, x_{1:n-1}) = q_n(x_n \mid y_n, x_{n-1})$$
• Given $x_{n-1}$ and $y_n$, the earlier observations $y_{1:n-1}$ and states $x_{1:n-2}$ do not bring any new information about $X_n$.
• Our sequential importance sampling update now looks as follows:
$$\underbrace{q_n(x_{1:n} \mid y_{1:n})}_{\text{importance sampling at } n} = \underbrace{q_{n-1}(x_{1:n-1} \mid y_{1:n-1})}_{\text{distribution of the paths } X_{1:n-1}^{(i)}}\;\underbrace{q_n(x_n \mid y_n, x_{n-1})}_{\text{conditional distribution of } X_n^{(i)}} = q_1(x_1 \mid y_1)\prod_{k=2}^{n} q_k(x_k \mid y_k, x_{k-1})$$
• Thus we assume that at $n-1$ we have sampled $X_{1:n-1}^{(i)} \sim q_{n-1}(x_{1:n-1} \mid y_{1:n-1})$; to obtain $X_{1:n}^{(i)} \sim q(x_{1:n} \mid y_{1:n})$, we sample $X_n^{(i)} \sim q_n(x_n \mid y_n, X_{n-1}^{(i)})$ and then set
$$X_{1:n}^{(i)} = \Big(\underbrace{X_{1:n-1}^{(i)}}_{\text{previously sampled path}},\ \underbrace{X_n^{(i)}}_{\text{component sampled at time } n}\Big)$$


Sequential Importance Sampling
• We now need to show that we can recursively compute estimates of our target distribution $p(x_{1:n} \mid y_{1:n})$ as well as of $p(y_{1:n})$.
• From our earlier importance sampling approximations:
$$\hat p_N(x_{1:n} \mid y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{X_{1:n}^{(i)}}(x_{1:n}), \qquad W_n^{(i)} = \frac{w(X_{1:n}^{(i)}, y_{1:n})}{\sum_{j=1}^{N} w(X_{1:n}^{(j)}, y_{1:n})}, \qquad \hat p_N(y_{1:n}) = \frac{1}{N}\sum_{i=1}^{N} w(X_{1:n}^{(i)}, y_{1:n})$$
• We can show the following recursion for the calculation of these weights:
$$w(x_{1:n}, y_{1:n}) = \frac{p(x_{1:n}, y_{1:n})}{q(x_{1:n} \mid y_{1:n})} = \underbrace{\frac{p(x_{1:n-1}, y_{1:n-1})}{q(x_{1:n-1} \mid y_{1:n-1})}}_{w(x_{1:n-1},\, y_{1:n-1})}\;\underbrace{\frac{f(x_n \mid x_{n-1})\,g(y_n \mid x_n)}{q(x_n \mid y_n, x_{n-1})}}_{\text{incremental weight}}$$
• This suggests the following sequential importance sampling algorithm.


Sequential Importance Sampling
At step n = 1:
• Sample $X_1^{(i)} \sim q(x_1 \mid y_1)$, $i = 1, \dots, N$, and then approximate:
$$\hat p_N(x_1 \mid y_1) = \sum_{i=1}^{N} W_1^{(i)}\,\delta_{X_1^{(i)}}(x_1), \qquad W_1^{(i)} \propto w(X_1^{(i)}, y_1) = \frac{\mu(X_1^{(i)})\,g(y_1 \mid X_1^{(i)})}{q(X_1^{(i)} \mid y_1)}$$
At step n ≥ 2:
• Sample $X_n^{(i)} \sim q(x_n \mid y_n, X_{n-1}^{(i)})$, $i = 1, \dots, N$, and compute:
$$\hat p_N(x_{1:n} \mid y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{X_{1:n}^{(i)}}(x_{1:n}), \qquad W_n^{(i)} \propto w(X_{1:n}^{(i)}, y_{1:n}) = w(X_{1:n-1}^{(i)}, y_{1:n-1})\,\frac{f(X_n^{(i)} \mid X_{n-1}^{(i)})\,g(y_n \mid X_n^{(i)})}{q(X_n^{(i)} \mid y_n, X_{n-1}^{(i)})}$$
• The algorithm has computational complexity $\mathcal{O}(N)$ per time step, independent of $n$.
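
A minimal sketch of the SIS recursion above for the scalar linear Gaussian SSM assumed in the earlier simulation sketch. The transition density $f$ is used as the proposal $q$ (an assumption to keep the code short), so the incremental weight reduces to $g(y_n \mid X_n^{(i)})$; any proposal with a computable density could be substituted.

```python
# Minimal sketch: sequential importance sampling (no resampling) with the prior transition as proposal,
# for the scalar LG-SSM mu = N(0,1), f = N(a*x_{n-1}, q), g = N(c*x_n, r).
import numpy as np
from scipy.stats import norm

def sis_filter(y, N=1000, a=0.9, c=1.0, q=1.0, r=0.5, seed=1):
    rng = np.random.default_rng(seed)
    T = len(y)
    X = np.zeros((T, N))        # particles X_n^(i)
    W = np.zeros((T, N))        # normalized weights W_n^(i)
    log_w = np.zeros(N)         # unnormalized log-weights, accumulated over time
    for n in range(T):
        if n == 0:
            X[0] = rng.normal(0.0, 1.0, size=N)            # X_1^(i) ~ mu = q
        else:
            X[n] = rng.normal(a * X[n - 1], np.sqrt(q))    # X_n^(i) ~ q(.|y_n, X_{n-1}^(i)) = f
        log_w += norm.logpdf(y[n], c * X[n], np.sqrt(r))   # incremental weight g(y_n | X_n^(i))
        W[n] = np.exp(log_w - log_w.max())
        W[n] /= W[n].sum()
    return X, W

# Posterior-mean estimate of x_n given y_{1:n}: np.sum(W[n] * X[n])
```
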
Sequential Importance Sampling
• Note that the complexity of the algorithm does not increase with $n$.
• The algorithm is fully parallelizable.
• Also note that if our interest is in computing the marginal posterior $\hat p_N(x_n \mid y_{1:n})$ (the filtering density), then we only need to store $X_{n-1:n}^{(i)}$ rather than all the paths $X_{1:n}^{(i)}$:
$$\hat p_N(x_n \mid y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{X_n^{(i)}}(x_n), \qquad W_n^{(i)} \propto w(X_{1:n}^{(i)}, y_{1:n}) = w(X_{1:n-1}^{(i)}, y_{1:n-1})\,\frac{f(X_n^{(i)} \mid X_{n-1}^{(i)})\,g(y_n \mid X_n^{(i)})}{q(X_n^{(i)} \mid y_n, X_{n-1}^{(i)})}$$
• One can show that this approaches the true posterior as $N \to \infty$.
• Crisan, D., P. Del Moral, and T. Lyons (1999). Discrete filtering using branching and interacting particle systems. Markov Processes and Related Fields, 5(3), 293-318.

The Bootstrap Particle Filter
• A simple choice of importance sampling distribution $q_n(x_{1:n} \mid y_{1:n})$ is the prior:
$$q_n(x_{1:n} \mid y_{1:n}) = p(x_{1:n}),$$
that is, $q_1(x_1 \mid y_1) = \mu(x_1)$ and
$$q_n(x_n \mid x_{1:n-1}, y_{1:n}) = \frac{q_n(x_{1:n} \mid y_{1:n})}{q_{n-1}(x_{1:n-1} \mid y_{1:n-1})} = \frac{p(x_{1:n})}{p(x_{1:n-1})} = f(x_n \mid x_{n-1})$$
• We also have:
$$w_n(x_{1:n}, y_{1:n}) = \frac{p(x_{1:n}, y_{1:n})}{q_n(x_{1:n} \mid y_{1:n})} = \frac{p(x_{1:n-1}, y_{1:n-1})}{q_{n-1}(x_{1:n-1} \mid y_{1:n-1})}\,\frac{f(x_n \mid x_{n-1})\,g(y_n \mid x_n)}{q_n(x_n \mid x_{1:n-1}, y_{1:n})} = w_{n-1}(x_{1:n-1}, y_{1:n-1})\,g(y_n \mid x_n)$$
• This choice is extremely poor if the data are very informative (peaky likelihood), since the proposal distribution does not include any information from the data $y_{1:n}$:
$$w_n(x_{1:n}, y_{1:n}) = w_{n-1}(x_{1:n-1}, y_{1:n-1})\,g(y_n \mid x_n) = \prod_{k=1}^{n} g(y_k \mid x_k)$$
• In the bootstrap particle filter, the particles are simulated according to the dynamical model and the weights are assigned according to the likelihood.

Bootstrap Particle Filter
• One selects
$$q_1(x_1) = \mu(x_1) \quad \text{and} \quad q_n(x_n \mid x_{1:n-1}) = q_n(x_n \mid x_{n-1}) = f(x_n \mid x_{n-1})$$
• At time $n = 1$, we sample $X_1^{(i)} \sim \mu(\cdot)$ and set $w_1^{(i)} = g(y_1 \mid X_1^{(i)})$.
• At time $n$ ($n > 1$):
  • sample $X_n^{(i)} \sim f(\cdot \mid X_{n-1}^{(i)})$ and set $X_{1:n}^{(i)} = \big(X_{1:n-1}^{(i)}, X_n^{(i)}\big)$
  • evaluate the importance weights
$$w_n^{(i)} = w_{n-1}^{(i)}\,g(y_n \mid X_n^{(i)}), \qquad \text{or normalized:}\quad W_n^{(i)} = \frac{w_n^{(i)}}{\sum_{j=1}^{N} w_n^{(j)}}$$
• At any time $n$ we have:
$$X_{1:n}^{(i)} \sim \mu(x_1)\prod_{k=2}^{n} f(x_k \mid x_{k-1}), \qquad w_n\big(X_{1:n}^{(i)}\big) = g(y_1 \mid X_1^{(i)})\prod_{k=2}^{n} g(y_k \mid X_k^{(i)}) = \prod_{k=1}^{n} g\big(y_k \mid X_k^{(i)}\big)$$


Resampling
• As $n$ increases, the mass of our approximation to the target distribution concentrates on a few particles (degeneracy problem):
$$\hat p_N(x_{1:n} \mid y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{X_{1:n}^{(i)}}(x_{1:n}) \approx \delta_{X_{1:n}^{(i_0)}}(x_{1:n})$$
• Here a single delta mass remains (weight approximately 1 for some particle $i_0$).
• When the variance of the weights $W_n^{(i)}$ is high, the resampling idea is essentially to kill the samples with low weights $W_n^{(i)}$ (relative to $1/N$) and to multiply the particles with higher weights.
• Of course, the assumption here is that particles with low weights (relative to $1/N$) at step $n$ will have even lower weights at later steps.
• Resampling techniques are a key ingredient of SMC methods and can partially address the degeneracy problem of the SIS algorithm.
• Ref: J. S. Liu and R. Chen (1995). Blind deconvolution via sequential imputations. Journal of the American Statistical Association, 90:567.

Resampling
• Let us assume that at time $n$ the following approximation holds:
$$\hat p_N(x_{1:n} \mid y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{X_{1:n}^{(i)}}(x_{1:n})$$
• With resampling (sampling with replacement from the categorical distribution $\mathcal{C}\big(\{W_n^{(j)}\}_{j=1}^{N}\big)$, i.e. in proportion to the weights $W_n^{(i)}$), we sample $N$ times from the above distribution,
$$\widetilde X_{1:n}^{(i)} \sim \hat p_N(x_{1:n} \mid y_{1:n}), \quad i = 1, \dots, N,$$
to build a new approximation:
$$\tilde p_N(x_{1:n} \mid y_{1:n}) = \frac{1}{N}\sum_{i=1}^{N} \delta_{\widetilde X_{1:n}^{(i)}}(x_{1:n})$$
• Note that the resampled particles $\widetilde X_{1:n}^{(i)}$ are approximately distributed according to $p(x_{1:n} \mid y_{1:n})$, but they are statistically dependent (so the standard CLT approximations no longer hold directly).

Bayesian Recursion Formulas for the
State Space Model



Filtering and Marginal Likelihood
• Let us return to the SSM, where the objective is to compute $p(x_{1:n} \mid y_{1:n})$. We want to calculate this sequentially.
• We can write the following recursion equation:
$$p(x_{1:n} \mid y_{1:n}) = \frac{p(x_{1:n}, y_{1:n})}{p(x_{1:n-1}, y_{1:n-1})}\,\frac{p(y_{1:n-1})}{p(y_{1:n})}\,p(x_{1:n-1} \mid y_{1:n-1})
= \frac{g(y_n \mid x_n)}{p(y_n \mid y_{1:n-1})}\,\underbrace{f(x_n \mid x_{n-1})\,p(x_{1:n-1} \mid y_{1:n-1})}_{\text{predictive: } p(x_{1:n} \mid y_{1:n-1})},$$
where the prediction of $y_n$ given $y_{1:n-1}$ is:
$$p(y_n \mid y_{1:n-1}) = \int p(y_n, x_n \mid y_{1:n-1})\,dx_n = \int g(y_n \mid x_n)\,p(x_n \mid y_{1:n-1})\,dx_n
= \int g(y_n \mid x_n)\,p(x_n, x_{n-1} \mid y_{1:n-1})\,dx_{n-1:n} = \int g(y_n \mid x_n)\,f(x_n \mid x_{n-1})\,p(x_{n-1} \mid y_{1:n-1})\,dx_{n-1:n}$$
• We can write our update equation above in two recursive steps:
$$\text{Step I (prediction):}\quad p(x_{1:n} \mid y_{1:n-1}) = f(x_n \mid x_{n-1})\,p(x_{1:n-1} \mid y_{1:n-1})$$
$$\text{Step II (update):}\quad p(x_{1:n} \mid y_{1:n}) = \frac{g(y_n \mid x_n)\,p(x_{1:n} \mid y_{1:n-1})}{p(y_n \mid y_{1:n-1})} \propto g(y_n \mid x_n)\,p(x_{1:n} \mid y_{1:n-1})$$

Filtering and Marginal Likelihood
• A two-step prediction/update recursion for the marginal (filtering) distributions $p(x_n \mid y_{1:n})$ can also be easily derived:
$$\text{Step I (prediction):}\quad p(x_n \mid y_{1:n-1}) = \int p(x_{n-1:n} \mid y_{1:n-1})\,dx_{n-1} = \int p(x_n \mid x_{n-1}, y_{1:n-1})\,p(x_{n-1} \mid y_{1:n-1})\,dx_{n-1} = \int f(x_n \mid x_{n-1})\,p(x_{n-1} \mid y_{1:n-1})\,dx_{n-1}$$
$$\text{Step II (update):}\quad p(x_n \mid y_{1:n}) = p(x_n \mid y_n, y_{1:n-1}) = \frac{g(y_n \mid x_n)\,p(x_n \mid y_{1:n-1})}{p(y_n \mid y_{1:n-1})},$$
where $p(y_n \mid y_{1:n-1}) = \int g(y_n \mid x_n)\,p(x_n \mid y_{1:n-1})\,dx_n$.
• Our key emphasis remains on the calculation of $p(x_{1:n} \mid y_{1:n})$, even if our interest is in computing $p(x_n \mid y_{1:n})$.
• This recursion leads to the Kalman filter for the LG-SSM.
• SMC is a simple simulation-based implementation of this recursion.

Filtering and Marginal Likelihood
• To compute the normalizing factor $p(y_{1:n})$, one can use a recursive calculation that avoids high-dimensional integration:
$$p(y_{1:n}) = p(y_1)\prod_{k=2}^{n} p(y_k \mid y_{1:k-1})$$
• To compute $p(y_k \mid y_{1:k-1})$, we use the recursion derived earlier:
$$p(y_k \mid y_{1:k-1}) = \int p(y_k, x_k \mid y_{1:k-1})\,dx_k = \int g(y_k \mid x_k)\,p(x_k \mid y_{1:k-1})\,dx_k
= \int g(y_k \mid x_k)\,p(x_k, x_{k-1} \mid y_{1:k-1})\,dx_{k-1:k} = \int g(y_k \mid x_k)\,f(x_k \mid x_{k-1})\,p(x_{k-1} \mid y_{1:k-1})\,dx_{k-1:k}$$
• The calculation of $p(y_{1:n})$ is thus a product of lower-dimensional integrals.


MC Implementation of the Prediction Step
• The bootstrap particle filter (Gordon et al. 1993) considered earlier can be seen as a natural Monte Carlo, simulation-based implementation of the prediction and update recursive relations.
• Assume that at time $n-1$ you have
$$\tilde p_N(x_{1:n-1} \mid y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{\widetilde X_{1:n-1}^{(i)}}(x_{1:n-1}), \qquad \text{where } \widetilde X_{1:n-1}^{(i)} \sim p(x_{1:n-1} \mid y_{1:n-1})$$
• By sampling $X_n^{(i)} \sim f(x_n \mid \widetilde X_{n-1}^{(i)})$, setting $X_{1:n}^{(i)} = \big(\widetilde X_{1:n-1}^{(i)}, X_n^{(i)}\big)$ and using $p(x_{1:n} \mid y_{1:n-1}) = f(x_n \mid x_{n-1})\,p(x_{1:n-1} \mid y_{1:n-1})$, we obtain:
$$\hat p_N(x_{1:n} \mid y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{X_{1:n}^{(i)}}(x_{1:n})$$
• Sampling from $f(x_n \mid x_{n-1})$ is straightforward.


Importance Sampling Implementation of the Updating Step
• Our target at time $n$ is $p(x_{1:n} \mid y_{1:n}) = \dfrac{g(y_n \mid x_n)\,p(x_{1:n} \mid y_{1:n-1})}{p(y_n \mid y_{1:n-1})}$, where $p(y_n \mid y_{1:n-1}) = \int g(y_n \mid x_n)\,p(x_n \mid y_{1:n-1})\,dx_n$.
• Substitute $\hat p_N(x_{1:n} \mid y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{X_{1:n}^{(i)}}(x_{1:n})$ for $p(x_{1:n} \mid y_{1:n-1})$ and note that:
$$\hat p_N(y_n \mid y_{1:n-1}) = \int g(y_n \mid x_n)\,\hat p_N(x_{1:n} \mid y_{1:n-1})\,dx_{1:n} = \int g(y_n \mid x_n)\,\frac{1}{N}\sum_{i=1}^{N}\delta_{X_{1:n}^{(i)}}(x_{1:n})\,dx_{1:n} = \frac{1}{N}\sum_{i=1}^{N} g(y_n \mid X_n^{(i)})$$
• Finally, $p(x_{1:n} \mid y_{1:n})$ becomes:
$$\hat p_N(x_{1:n} \mid y_{1:n}) = \frac{g(y_n \mid x_n)\,\hat p_N(x_{1:n} \mid y_{1:n-1})}{\hat p_N(y_n \mid y_{1:n-1})} = \frac{\sum_{i=1}^{N} g(y_n \mid X_n^{(i)})\,\delta_{X_{1:n}^{(i)}}(x_{1:n})}{\sum_{i=1}^{N} g(y_n \mid X_n^{(i)})} = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{X_{1:n}^{(i)}}(x_{1:n}),$$
where the normalized weights are defined by $W_n^{(i)} \propto g(y_n \mid X_n^{(i)})$, $\sum_{i=1}^{N} W_n^{(i)} = 1$.

Multinomial Resampling
• We have a weighted approximation $\hat p_N(x_{1:n} \mid y_{1:n})$ of $p(x_{1:n} \mid y_{1:n})$:
$$\hat p_N(x_{1:n} \mid y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{X_{1:n}^{(i)}}(x_{1:n})$$
• To obtain $N$ samples $\widetilde X_{1:n}^{(i)}$ approximately distributed according to $p(x_{1:n} \mid y_{1:n})$, resample $N$ times with replacement according to the weights $W_n^{(i)}$,
$$\widetilde X_{1:n}^{(i)} \sim \hat p_N(x_{1:n} \mid y_{1:n}), \quad i = 1, \dots, N,$$
to build a new approximation:
$$\tilde p_N(x_{1:n} \mid y_{1:n}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{\widetilde X_{1:n}^{(i)}}(x_{1:n}) = \sum_{i=1}^{N}\frac{N_n^{(i)}}{N}\,\delta_{X_{1:n}^{(i)}}(x_{1:n})$$
• Here the counts $N_n^{(i)}$ follow a multinomial distribution with $\mathbb{E}\big[N_n^{(i)}\big] = N W_n^{(i)}$ and $\mathrm{Var}\big[N_n^{(i)}\big] = N W_n^{(i)}\big(1 - W_n^{(i)}\big)$.
• The computational cost is $\mathcal{O}(N)$.
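
A minimal sketch of multinomial resampling as described above: draw $N$ indices with replacement from the categorical distribution defined by the weights and duplicate or discard particles accordingly (the function name multinomial_resample is an assumption).

```python
# Minimal sketch: multinomial resampling. The returned indices are i.i.d. draws from
# C({W^(1), ..., W^(N)}), so the count of copies of particle i has mean N*W^(i).
import numpy as np

def multinomial_resample(W, rng):
    N = len(W)
    return rng.choice(N, size=N, replace=True, p=W)   # ancestor indices a^i ~ C(W)

rng = np.random.default_rng(0)
W = np.array([0.7, 0.1, 0.1, 0.1])
X = np.array([-1.0, 0.5, 2.0, 3.5])
idx = multinomial_resample(W, rng)
X_tilde = X[idx]   # resampled particles, each now carrying equal weight 1/N
```
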
Multinomial Resampling
• The resampling algorithm is based on the following two steps. For $i = 1, \dots, N$:
  • Select one of the components: $a_n^i \sim \mathcal{C}\big(\{W_{n-1}^{(j)}\}_{j=1}^{N}\big)$ (categorical distribution).
  • Generate a sample from the selected component: $X_n^{(i)} \sim f(x_n \mid X_{n-1}^{(a_n^i)})$.
• The particle $X_{n-1}^{(a_n^i)}$ is referred to as the ancestor of $X_n^{(i)}$, since $X_n^{(i)}$ is generated conditionally on $X_{n-1}^{(a_n^i)}$.
• The variable $a_n^i \in \{1, 2, \dots, N\}$ is referred to as the ancestor index, since it indexes the ancestor of particle $X_n^{(i)}$ at time $n-1$.
• The ancestor indices are essentially random variables that are used to make the stochasticity of the resampling step explicit by keeping track of which particles get resampled.

Vanilla SMC: Bootstrap Filter (Gordon et al.)
For $i = 1, \dots, N$:
• Time 1:
  • Sample $N$ particles $X_1^{(i)} \sim \mu(x_1)$ and compute:
$$\hat p_N(x_1 \mid y_1) = \sum_{i=1}^{N} W_1^{(i)}\,\delta_{X_1^{(i)}}(x_1), \qquad W_1^{(i)} \propto g(y_1 \mid X_1^{(i)})$$
  • Resample $\widetilde X_1^{(i)} \sim \hat p_N(x_1 \mid y_1)$ to obtain $\tilde p_N(x_1 \mid y_1) = \frac{1}{N}\sum_{i=1}^{N}\delta_{\widetilde X_1^{(i)}}(x_1)$.
• Time $n$, $n \ge 2$. Given $\hat p_N(x_{1:n-1} \mid y_{1:n-1}) = \sum_{i=1}^{N} W_{n-1}^{(i)}\,\delta_{X_{1:n-1}^{(i)}}(x_{1:n-1})$:
  • Resample: sample $a_n^i \sim \mathcal{C}\big(\{W_{n-1}^{(j)}\}_{j=1}^{N}\big)$ to obtain $\tilde p_N(x_{1:n-1} \mid y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{\widetilde X_{1:n-1}^{(i)}}(x_{1:n-1})$.
  • Propagate: sample $X_n^{(i)} \sim f(x_n \mid \widetilde X_{n-1}^{(i)})$ and set $X_{1:n}^{(i)} = \big(\widetilde X_{1:n-1}^{(i)}, X_n^{(i)}\big)$ to obtain $\hat p_N(x_{1:n} \mid y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{X_{1:n}^{(i)}}(x_{1:n})$.
  • Weight: compute $W_n^{(i)} \propto g(y_n \mid X_n^{(i)})$ and normalize to obtain $\hat p_N(x_{1:n} \mid y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{X_{1:n}^{(i)}}(x_{1:n})$.
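
A minimal sketch of the resample/propagate/weight loop above for the scalar linear Gaussian SSM assumed in the earlier sketches; it also accumulates the log of the marginal likelihood estimate $\hat p_N(y_{1:n})$ discussed later on the SMC Output slide. The parameter values are assumptions for illustration.

```python
# Minimal sketch: bootstrap particle filter (resample -> propagate -> weight) for the scalar LG-SSM
# mu = N(0,1), f = N(a*x_{n-1}, q), g = N(c*x_n, r), with multinomial resampling at every step.
import numpy as np
from scipy.stats import norm

def bootstrap_pf(y, N=1000, a=0.9, c=1.0, q=1.0, r=0.5, seed=2):
    rng = np.random.default_rng(seed)
    T = len(y)
    filt_mean = np.zeros(T)
    log_evidence = 0.0
    for n in range(T):
        if n == 0:
            x = rng.normal(0.0, 1.0, size=N)                   # X_1^(i) ~ mu
        else:
            anc = rng.choice(N, size=N, replace=True, p=W)     # resample: a_n^i ~ C({W_{n-1}^(j)})
            x = rng.normal(a * x[anc], np.sqrt(q))             # propagate: X_n^(i) ~ f(. | ancestor)
        log_w = norm.logpdf(y[n], c * x, np.sqrt(r))           # weight: W_n^(i) propto g(y_n | X_n^(i))
        m = log_w.max()
        w = np.exp(log_w - m)
        W = w / w.sum()                                        # normalized weights
        log_evidence += m + np.log(w.mean())                   # log p_hat(y_n | y_{1:n-1}) = log[(1/N) sum_i g]
        filt_mean[n] = np.sum(W * x)                           # estimate of E[x_n | y_{1:n}]
    return filt_mean, log_evidence
```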


Ancestral Path
• Consider the case with $N = 3$ and $n = 3$.
[Figure: particles $x_n^i$, $i = 1, 2, 3$, at times $n = 1, 2, 3$, with arrows indicating which particles are selected in the resampling steps.]
• Assume the resampling shown, e.g. particle 2 at $n = 1$ is resampled twice and particle 3 at $n = 2$ is resampled twice.
• The ancestral paths were defined earlier in the form $X_{1:n}^{(i)} = \big(\widetilde X_{1:n-1}^{(i)}, X_n^{(i)}\big)$, where $\widetilde X_{1:n-1}^{(i)}$ refers to the resampled path at $n-1$; e.g. the ancestral path of particle 1 at time 3 is:
$$\big(\widetilde X_{1:2}^{(1)}, X_3^1\big) = \big(\widetilde X_1^{(1)}, \widetilde X_2^{(1)}, X_3^1\big) = \big(\widetilde X_1^{(2)}, X_2^2, X_3^1\big) = \big(X_1^2, X_2^2, X_3^1\big)$$
• To make the notation for the ancestral paths explicit, one can represent them in the form
$$X_{1:n}^{(i)} = \big(X_{1:n-1}^{(a_n^i)}, X_n^{(i)}\big),$$
where $X_{1:n-1}^{(a_n^i)}$ is the path that terminates at the ancestor of $X_n^{(i)}$, i.e. $X_n^{(i)} \sim f(x_n \mid X_{n-1}^{(a_n^i)})$.
• For example, the ancestral path for particle 1 at time 3 can be written as:
$$\big(X_1^{a_2^{a_3^1}}, X_2^{a_3^1}, X_3^1\big) = \big(X_1^2, X_2^2, X_3^1\big)$$

Vanilla SMC: Bootstrap Filter (Gordon et al.)
• At time $n$, $n \ge 2$, given $\hat p_N(x_{1:n-1} \mid y_{1:n-1}) = \sum_{i=1}^{N} W_{n-1}^{(i)}\,\delta_{X_{1:n-1}^{(i)}}(x_{1:n-1})$:
  • After resampling: it produces $\tilde p_N(x_{1:n-1} \mid y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{\widetilde X_{1:n-1}^{(i)}}(x_{1:n-1})$.
  • After propagation: it produces $\hat p_N(x_{1:n} \mid y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{X_{1:n}^{(i)}}(x_{1:n})$, where $X_{1:n}^{(i)} = \big(\widetilde X_{1:n-1}^{(i)}, X_n^{(i)}\big)$.
  • After weighting: it produces $\hat p_N(x_{1:n} \mid y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{X_{1:n}^{(i)}}(x_{1:n})$.
• In the original bootstrap particle filter of Gordon et al. (particles simulated with the dynamical model and weights assigned according to the likelihood), the focus was on computing an approximation $\hat p_N(x_n \mid y_{1:n})$ of the filtering marginal.

SMC Output
• At time $n$ we have:
$$\hat p_N(x_{1:n} \mid y_{1:n-1}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{X_{1:n}^{(i)}}(x_{1:n}), \qquad X_{1:n}^{(i)} = \big(\widetilde X_{1:n-1}^{(i)}, X_n^{(i)}\big)$$
$$\hat p_N(x_{1:n} \mid y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{X_{1:n}^{(i)}}(x_{1:n}), \qquad \tilde p_N(x_{1:n} \mid y_{1:n}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{\widetilde X_{1:n}^{(i)}}(x_{1:n})$$
• With $\hat p_N(y_k \mid y_{1:k-1}) = \int g(y_k \mid x_k)\,\hat p_N(x_{1:k} \mid y_{1:k-1})\,dx_{1:k} = \frac{1}{N}\sum_{i=1}^{N} g(y_k \mid X_k^{(i)})$, the marginal likelihood estimate is:
$$\hat p_N(y_{1:n}) = \hat p_N(y_1)\prod_{k=2}^{n}\hat p_N(y_k \mid y_{1:k-1}) = \prod_{k=1}^{n}\left[\frac{1}{N}\sum_{i=1}^{N} g(y_k \mid X_k^{(i)})\right]$$
• Computational complexity is $\mathcal{O}(N)$ at each time step, and the memory requirements are $\mathcal{O}(nN)$.
• If we are only interested in $p(x_n \mid y_{1:n})$, or in $p(s_n(x_{1:n}) \mid y_{1:n})$ where $s_n(x_{1:n}) = \Psi_n\big(x_n, s_{n-1}(x_{1:n-1})\big)$ is of fixed dimension (e.g. $s_n(x_{1:n}) = \sum_{k=1}^{n} x_k^2$), then the memory requirements are $\mathcal{O}(N)$.

Kalman Filtering Solution for LG-SSM
• Recall the LG-SSM:
$$X_n = A X_{n-1} + B u_n + V_n, \qquad Y_n = C X_n + D u_n + E_n,$$
$$\begin{pmatrix} X_0 \\ V_n \\ E_n \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix}\mu \\ 0 \\ 0\end{pmatrix},\ \begin{pmatrix} P_0 & 0 & 0 \\ 0 & Q & S \\ 0 & S^T & R \end{pmatrix}\right)$$
• Measurement update:
$$p(x_n \mid y_{1:n}) = \mathcal{N}\big(x_n \mid \hat x_{n|n}, P_{n|n}\big)$$
$$\hat x_{n|n} = \hat x_{n|n-1} + K_n\big(y_n - C\hat x_{n|n-1} - D u_n\big), \qquad P_{n|n} = (I - K_n C)\,P_{n|n-1}, \qquad K_n = P_{n|n-1} C^T\big(C P_{n|n-1} C^T + R\big)^{-1}$$
• Prediction:
$$p(x_{n+1} \mid y_{1:n}) = \mathcal{N}\big(x_{n+1} \mid \hat x_{n+1|n}, P_{n+1|n}\big)$$
$$\hat x_{n+1|n} = A\hat x_{n|n} + B u_n, \qquad P_{n+1|n} = A P_{n|n} A^T + Q$$
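
A minimal sketch of the Kalman recursions above, written for the case with no control input and no cross-covariance (B = D = 0 and S = 0 are assumptions made to keep the sketch short).

```python
# Minimal sketch: Kalman filter for x_n = A x_{n-1} + v_n, y_n = C x_n + e_n,
# with v_n ~ N(0, Q), e_n ~ N(0, R) and initial prediction p(x_1) = N(mu0, P0) (an assumption here).
import numpy as np

def kalman_filter(y, A, C, Q, R, mu0, P0):
    T = y.shape[0]
    dx = A.shape[0]
    x_filt = np.zeros((T, dx))
    P_filt = np.zeros((T, dx, dx))
    x_pred, P_pred = mu0, P0
    for n in range(T):
        # measurement update
        S_n = C @ P_pred @ C.T + R
        K = P_pred @ C.T @ np.linalg.inv(S_n)           # Kalman gain K_n
        x_filt[n] = x_pred + K @ (y[n] - C @ x_pred)
        P_filt[n] = (np.eye(dx) - K @ C) @ P_pred
        # prediction of the next state
        x_pred = A @ x_filt[n]
        P_pred = A @ P_filt[n] @ A.T + Q
    return x_filt, P_filt
```
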
Forward-Filtering Backward-Smoothing
• One can also estimate the marginal smoothing distribution $p(x_n \mid y_{1:T})$, $n = 1, \dots, T$ (an offline estimate once all measurements $y_{1:T}$ are collected):
• Step I (forward pass): compute and store $p(x_n \mid y_{1:n})$ and $p(x_{n+1} \mid y_{1:n})$ for $n = 1, 2, \dots, T$, using the update and prediction recursions derived earlier.
• Step II (backward pass, $n = T-1, T-2, \dots, 1$):
$$p(x_n \mid y_{1:T}) = p(x_n \mid y_{1:n})\int \frac{f(x_{n+1} \mid x_n)}{p(x_{n+1} \mid y_{1:n})}\,p(x_{n+1} \mid y_{1:T})\,dx_{n+1}$$
• Indeed, one can show:
$$p(x_n \mid y_{1:T}) = \int p(x_n, x_{n+1} \mid y_{1:T})\,dx_{n+1} = \int p(x_n \mid x_{n+1}, y_{1:T})\,p(x_{n+1} \mid y_{1:T})\,dx_{n+1}
= \int p(x_n \mid x_{n+1}, y_{1:n})\,p(x_{n+1} \mid y_{1:T})\,dx_{n+1} = p(x_n \mid y_{1:n})\int \frac{f(x_{n+1} \mid x_n)}{p(x_{n+1} \mid y_{1:n})}\,p(x_{n+1} \mid y_{1:T})\,dx_{n+1}$$
• Here we used (see the next slide) $p(x_n \mid x_{n+1}, y_{1:T}) = p(x_n \mid x_{n+1}, y_{1:n})$.
• Fredrik Lindsten and Thomas B. Schön. Backward simulation methods for Monte Carlo statistical inference. Foundations and Trends in Machine Learning, 6(1):1-143, 2013.
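
Since the forward and backward relations above involve only one- and two-dimensional integrals in x, they can be checked on a finite (discretized) state space, where the integrals become sums. The sketch below is such a finite-state illustration; the discretization itself is an assumption for illustration, not part of the lecture.

```python
# Minimal sketch: forward-filtering backward-smoothing on a finite state space.
# F[i, j] = f(x_j | x_i) (row-stochastic transition matrix), G[n, i] = g(y_n | x_i), mu[i] = mu(x_i).
# Assumes the predictive probabilities are strictly positive.
import numpy as np

def ffbs_discrete(mu, F, G):
    T, K = G.shape
    pred = np.zeros((T, K))      # p(x_n | y_{1:n-1})
    filt = np.zeros((T, K))      # p(x_n | y_{1:n})
    # forward pass: prediction / update recursions
    pred[0] = mu
    for n in range(T):
        if n > 0:
            pred[n] = filt[n - 1] @ F                # sum_{x_{n-1}} f(x_n|x_{n-1}) p(x_{n-1}|y_{1:n-1})
        filt[n] = pred[n] * G[n]
        filt[n] /= filt[n].sum()                     # divide by p(y_n | y_{1:n-1})
    # backward pass: p(x_n|y_{1:T}) = p(x_n|y_{1:n}) * sum_{x_{n+1}} f(x_{n+1}|x_n) p(x_{n+1}|y_{1:T}) / p(x_{n+1}|y_{1:n})
    smooth = np.zeros((T, K))
    smooth[T - 1] = filt[T - 1]
    for n in range(T - 2, -1, -1):
        smooth[n] = filt[n] * (F @ (smooth[n + 1] / pred[n + 1]))
    return filt, smooth
```
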
Forward-Filtering Backward-Smoothing
• Here we highlight the proof of the relation used in the earlier slide:
$$p(x_n \mid x_{n+1}, y_{1:T}) = p(x_n \mid x_{n+1}, y_{1:n})$$
• Note that:
$$p(x_n \mid x_{n+1}, y_{1:T}) = p(x_n \mid x_{n+1}, y_{1:n}, y_{n+1:T}) = \frac{p(y_{n+1:T} \mid x_n, x_{n+1}, y_{1:n})\,p(x_n \mid x_{n+1}, y_{1:n})}{p(y_{n+1:T} \mid x_{n+1}, y_{1:n})}
= \frac{p(y_{n+1:T} \mid x_{n+1})\,p(x_n \mid x_{n+1}, y_{1:n})}{p(y_{n+1:T} \mid x_{n+1})} = p(x_n \mid x_{n+1}, y_{1:n})$$


Forward-Backward (Two-Filter) Smoother
• One can also estimate the marginal smoothing distribution $p(x_n \mid y_{1:T})$, $n = 1, \dots, T$, as follows:
$$\text{Step I (backward information filter):}\quad p(y_{n:T} \mid x_n) = \int p(y_{n:T}, x_{n+1} \mid x_n)\,dx_{n+1} = \int p(y_{n:T} \mid x_{n+1}, x_n)\,f(x_{n+1} \mid x_n)\,dx_{n+1}$$
$$= \int p(y_n, y_{n+1:T} \mid x_{n+1}, x_n)\,f(x_{n+1} \mid x_n)\,dx_{n+1} = g(y_n \mid x_n)\int p(y_{n+1:T} \mid x_{n+1})\,f(x_{n+1} \mid x_n)\,dx_{n+1}$$
$$\text{Step II (update):}\quad p(x_n \mid y_{1:T}) = p(x_n \mid y_{1:n-1}, y_{n:T}) = \frac{p(x_n \mid y_{1:n-1})\,p(y_{n:T} \mid x_n)}{p(y_{n:T} \mid y_{1:n-1})}$$
• Note that we can have $\int p(y_{n:T} \mid x_n)\,dx_n = \infty$. This precludes the direct use of SMC algorithms.
• To address this, a generalized version was proposed using a set of artificial distributions $\tilde p_n(x_n)$.
• Briers, M., Doucet, A. and Maskell, S. (2010). Smoothing algorithms for state-space models. Annals of the Institute of Statistical Mathematics, 62:61-89.

Online Bayesian Parameter
Estimation



Online Bayesian Parameter Estimation
• Let the SSM be defined with some unknown static parameter $\theta$ with prior $p(\theta)$:*
$$X_1 \sim \mu_\theta(\cdot), \qquad X_n \mid (X_{n-1} = x_{n-1}) \sim f_\theta(x_n \mid x_{n-1}), \qquad Y_n \mid (X_n = x_n) \sim g_\theta(y_n \mid x_n)$$
• Given data $y_{1:n}$, inference now is based on:
$$p(\theta, x_{1:n} \mid y_{1:n}) = p(\theta \mid y_{1:n})\,p_\theta(x_{1:n} \mid y_{1:n}), \qquad \text{where } p(\theta \mid y_{1:n}) \propto p_\theta(y_{1:n})\,p(\theta)$$
• We need to learn both $x_{1:n}$ and $\theta$ from the observations $y_{1:n}$. We can use standard SMC but on the extended space $Z_n = (X_n, \theta_n)$:
$$f(z_n \mid z_{n-1}) = \delta_{\theta_{n-1}}(\theta_n)\,f_{\theta_{n-1}}(x_n \mid x_{n-1}), \qquad g(y_n \mid z_n) = g_{\theta_n}(y_n \mid x_n)$$
• Note that $\theta$ is a static parameter: it does not evolve with $n$.
• *M. Kok, J. D. Hol and T. B. Schön. Using inertial sensors for position and orientation estimation. Foundations and Trends in Signal Processing, 11(1-2):1-153, 2017. (In motion capture, $X_n$ can represent the position/orientation of different body segments of a person, $\theta$ the body shape, and $Y_n$ the measurements from sensors.)
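
A minimal sketch of running the bootstrap filter on the extended state $Z_n = (X_n, \theta)$: each particle carries its own $\theta$, drawn once from the prior and then copied unchanged by the transition $\delta_{\theta_{n-1}}(\theta_n)$. The concrete scalar model, in which $\theta$ plays the role of the transition coefficient, and the $\mathcal{N}(0, 1)$ prior on $\theta$ are assumptions for illustration.

```python
# Minimal sketch: bootstrap filter on the extended state z_n = (x_n, theta), where theta is the
# (static) transition coefficient of the scalar model x_n = theta*x_{n-1} + v_n, y_n = x_n + e_n
# (this concrete model and the N(0, 1) prior on theta are assumptions for illustration).
import numpy as np
from scipy.stats import norm

def bootstrap_pf_static_param(y, N=2000, q=1.0, r=0.5, seed=3):
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 1.0, size=N)        # theta^(i) ~ p(theta), sampled once (static component)
    x = rng.normal(0.0, 1.0, size=N)            # X_1^(i) ~ mu
    for n in range(len(y)):
        if n > 0:
            anc = rng.choice(N, size=N, replace=True, p=W)   # resample the whole pair (x, theta)
            x, theta = x[anc], theta[anc]
            x = rng.normal(theta * x, np.sqrt(q))            # propagate x; theta is copied unchanged
        log_w = norm.logpdf(y[n], x, np.sqrt(r))             # weight with g(y_n | x_n)
        w = np.exp(log_w - log_w.max())
        W = w / w.sum()
    return np.sum(W * x), np.sum(W * theta)     # estimates of E[x_T | y_{1:T}] and E[theta | y_{1:T}]
```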


Maximum Likelihood Parameter Estimation
• Standard approaches to parameter estimation consist of computing the maximum likelihood (ML) estimate
$$\theta_{ML} = \arg\max_\theta\, \log p_\theta(y_{1:n})$$
• The likelihood function can be multimodal and there is no guarantee of finding its global optimum.
• Standard (stochastic) gradient algorithms can be used (e.g. based on Fisher's identity) to find a local optimum:
$$\nabla_\theta \log p_\theta(y_{1:n}) = \int \nabla_\theta \log p_\theta(x_{1:n}, y_{1:n})\,p_\theta(x_{1:n} \mid y_{1:n})\,dx_{1:n}$$
• These algorithms can work decently, but it can be difficult to scale the components of the gradients.
• Note that these algorithms involve computing $p_\theta(x_{1:n} \mid y_{1:n})$, which is the key result of the SMC algorithm.

Expectation/Maximization for HMM
• One can also use the EM algorithm:
$$\theta^{(i)} = \arg\max_\theta\, Q\big(\theta^{(i-1)}, \theta\big)$$
$$Q\big(\theta^{(i-1)}, \theta\big) = \int \log p_\theta(x_{1:n}, y_{1:n})\,p_{\theta^{(i-1)}}(x_{1:n} \mid y_{1:n})\,dx_{1:n}
= \int \log\big[\mu_\theta(x_1)\,g_\theta(y_1 \mid x_1)\big]\,p_{\theta^{(i-1)}}(x_1 \mid y_{1:n})\,dx_1
+ \sum_{k=2}^{n}\int \log\big[f_\theta(x_k \mid x_{k-1})\,g_\theta(y_k \mid x_k)\big]\,p_{\theta^{(i-1)}}(x_{k-1:k} \mid y_{1:n})\,dx_{k-1:k}$$
• Above we used:
$$p_\theta(x_{1:n}, y_{1:n}) = \mu_\theta(x_1)\prod_{k=2}^{n} f_\theta(x_k \mid x_{k-1})\prod_{k=1}^{n} g_\theta(y_k \mid x_k)$$
• Implementation of the EM algorithm requires computing expectations with respect to the smoothing distributions $p_{\theta^{(i-1)}}(x_{k-1:k} \mid y_{1:n})$.


Gaussian Process SSM
• The Gaussian process (GP) is a non-parametric probabilistic model for nonlinear functions.
• Consider an SSM of the form:
$$X_n = f(X_{n-1}) + V_n, \quad \text{with } f(x) \sim \mathcal{GP}\big(0, k_{\eta,f}(x, x')\big),$$
$$Y_n = g(X_n) + E_n, \quad \text{with } g(x) \sim \mathcal{GP}\big(0, k_{\eta,g}(x, x')\big)$$
• The model functions $f$ and $g$ are assumed to be realizations from Gaussian process priors, and $V_n \sim \mathcal{N}(0, Q)$, $E_n \sim \mathcal{N}(0, R)$.
• The inference task becomes the calculation of the joint posterior $p(f, g, Q, R, \eta, x_{1:n} \mid y_{1:n})$.
• Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. NIPS, 2013.
• Andreas Svensson and Thomas B. Schön. A flexible state space model for learning nonlinear dynamical systems. Automatica, 80:189-199, June 2017.
