Lec 35: Sequential Importance Sampling
Email: [email protected]
URL: https://www.zabaras.com/
Note that:
$$\hat{Z}_n \equiv \hat{p}_N(\boldsymbol{y}_{1:n}) = \int w(\boldsymbol{x}_{1:n},\boldsymbol{y}_{1:n})\,\frac{1}{N}\sum_{i=1}^{N}\delta_{\boldsymbol{X}_{1:n}^{(i)}}(\boldsymbol{x}_{1:n})\,d\boldsymbol{x}_{1:n} = \frac{1}{N}\sum_{i=1}^{N} w\!\left(\boldsymbol{X}_{1:n}^{(i)},\boldsymbol{y}_{1:n}\right)$$
Bias & Variance of Importance Sampling Estimates
We are interested in an importance sampling approximation of $\mathbb{E}_{p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi]$:
$$I_n^{IS}(\varphi) \equiv \mathbb{E}_{\hat{p}_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi] = \sum_{i=1}^{N} W_n^{(i)}\,\varphi\!\left(\boldsymbol{X}_{1:n}^{(i)}\right)$$
This is a biased estimate for finite $N$, and we have shown in our earlier lecture on Importance Sampling that:
$$\lim_{N\to\infty} N\left(\mathbb{E}_{\hat{p}_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi] - \mathbb{E}_{p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi]\right) = -\int \frac{p^2(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}{q(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}\left(\varphi(\boldsymbol{x}_{1:n}) - \mathbb{E}_{p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi]\right)d\boldsymbol{x}_{1:n}$$
$$\sqrt{N}\left(\mathbb{E}_{\hat{p}_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi] - \mathbb{E}_{p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi]\right) \xrightarrow{d} \mathcal{N}\!\left(0,\ \int \frac{p^2(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}{q(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}\left(\varphi(\boldsymbol{x}_{1:n}) - \mathbb{E}_{p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}[\varphi]\right)^2 d\boldsymbol{x}_{1:n}\right)$$
The asymptotic bias is of order $1/N$ (negligible) and the MSE is:
$$\text{MSE} = \underbrace{\text{bias}^2}_{\mathcal{O}(N^{-2})} + \underbrace{\text{variance}}_{\mathcal{O}(N^{-1})}$$
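As an illustration of these estimators (not from the slides), here is a minimal numpy sketch of the self-normalized IS estimate for a hypothetical one-dimensional problem; its empirical MSE shrinks at roughly the $\mathcal{O}(1/N)$ rate, dominated by the variance term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: unnormalized target gamma(x) ~ N(x; 1, 0.5^2),
# proposal q(x) = N(x; 0, 1), test function phi(x) = x, so E_p[phi] = 1.
def log_gamma(x):
    return -0.5 * ((x - 1.0) / 0.5) ** 2      # log of the unnormalized target

def log_q(x):
    return -0.5 * x ** 2                      # log of the (unnormalized) proposal density

def snis_estimate(N):
    x = rng.normal(0.0, 1.0, size=N)          # X^(i) ~ q
    logw = log_gamma(x) - log_q(x)            # unnormalized log-weights w(X^(i))
    W = np.exp(logw - logw.max())
    W /= W.sum()                              # normalized weights W^(i)
    return np.sum(W * x)                      # I_N^IS(phi) = sum_i W^(i) phi(X^(i))

for N in (100, 1_000, 10_000):
    estimates = np.array([snis_estimate(N) for _ in range(200)])
    print(N, "empirical MSE:", np.mean((estimates - 1.0) ** 2))
```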
One can use MCMC to sample from $\pi_n$, $n = 1, 2, \ldots$, but this is slow and does not provide estimates of $Z_n$, $n = 1, 2, \ldots$
The key idea is that if $\pi_{n-1}$ does not differ much from $\pi_n$, we should be able to reuse our approximation of $\pi_{n-1}$ to approximate $\pi_n$.
Assume that at time 1 we have the approximations $\hat\pi_1(x_1) = \hat p_N(x_1|y_1)$ and $\hat Z_1$, obtained using an importance density $q_1(x_1|y_1)$:
$$X_1^{(i)} \sim q_1(x_1|y_1), \quad i = 1, \ldots, N$$
$$\hat p_N(x_1|y_1)\,dx_1 = \sum_{i=1}^{N} W_1^{(i)}\,\delta_{X_1^{(i)}}(dx_1), \quad \text{where} \quad W_1^{(i)} = \frac{w_1\!\left(X_1^{(i)}, y_1\right)}{\sum_{j=1}^{N} w_1\!\left(X_1^{(j)}, y_1\right)}$$
$$\hat Z_1 = \frac{1}{N}\sum_{i=1}^{N} w_1\!\left(X_1^{(i)}, y_1\right) \quad \text{with} \quad w_1(x_1, y_1) = \frac{\gamma_1(x_1)}{q_1(x_1|y_1)} = \frac{p(x_1, y_1)}{q_1(x_1|y_1)}$$
We want to reuse the samples $X_1^{(i)}$ and $q_1(x_1|y_1)$ in building the importance sampling approximation of $\pi_2(\boldsymbol{x}_{1:2})$ and $Z_2$. Let us select a proposal distribution that factorizes as
$$q_2(\boldsymbol{x}_{1:2}|\boldsymbol{y}_{1:2}) = q_1(x_1|y_1)\,q_2(x_2|\boldsymbol{y}_{1:2}, x_1)$$
To obtain $\boldsymbol{X}_{1:2}^{(i)} \sim q_2(\boldsymbol{x}_{1:2}|\boldsymbol{y}_{1:2})$, we only need to sample
$$X_2^{(i)} \mid X_1^{(i)} \sim q_2\!\left(x_2|\boldsymbol{y}_{1:2}, X_1^{(i)}\right)$$
The importance sampling weight follows the same recursive pattern. Thus, if
$$\boldsymbol{X}_{1:n-1}^{(i)} \sim q_{n-1}(\boldsymbol{x}_{1:n-1}|\boldsymbol{y}_{1:n-1}),$$
we sample $X_n^{(i)}$ from $q_n\!\left(x_n|\boldsymbol{y}_{1:n}, \boldsymbol{X}_{1:n-1}^{(i)}\right)$ and update the weight as
$$w_n\!\left(\boldsymbol{X}_{1:n}^{(i)}, \boldsymbol{y}_{1:n}\right) = w_{n-1}\!\left(\boldsymbol{X}_{1:n-1}^{(i)}, \boldsymbol{y}_{1:n-1}\right)\,\frac{p\!\left(\boldsymbol{X}_{1:n}^{(i)}, \boldsymbol{y}_{1:n}\right)}{p\!\left(\boldsymbol{X}_{1:n-1}^{(i)}, \boldsymbol{y}_{1:n-1}\right)\,q_n\!\left(X_n^{(i)}|\boldsymbol{X}_{1:n-1}^{(i)}, \boldsymbol{y}_{1:n}\right)}$$
Similarly, the normalized weights are
$$W_n^{(i)} = W_n\!\left(\boldsymbol{X}_{1:n}^{(i)}, \boldsymbol{y}_{1:n}\right) \propto w_n\!\left(\boldsymbol{X}_{1:n}^{(i)}, \boldsymbol{y}_{1:n}\right)$$
For our state space model, the above update formula takes the form
$$w_n\!\left(\boldsymbol{X}_{1:n}^{(i)}, \boldsymbol{y}_{1:n}\right) = w_{n-1}\!\left(\boldsymbol{X}_{1:n-1}^{(i)}, \boldsymbol{y}_{1:n-1}\right)\,\frac{f\!\left(X_n^{(i)}|X_{n-1}^{(i)}\right) g\!\left(y_n|X_n^{(i)}\right)}{q_n\!\left(X_n^{(i)}|\boldsymbol{y}_{1:n}, \boldsymbol{X}_{1:n-1}^{(i)}\right)}$$
At each time we have
$$\hat p_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{\boldsymbol{X}_{1:n}^{(i)}}(\boldsymbol{x}_{1:n}) \quad \text{and} \quad \hat Z_n = \hat p_N(\boldsymbol{y}_{1:n}) = \frac{1}{N}\sum_{i=1}^{N} w_n\!\left(\boldsymbol{X}_{1:n}^{(i)}, \boldsymbol{y}_{1:n}\right).$$
In general, we may need to store all the paths $\boldsymbol{X}_{1:n}^{(i)}$ even if our interest is only in computing $\pi_n(x_n) = p(x_n|\boldsymbol{y}_{1:n})$.
Variance of the IS Estimates
In this sequential framework, it would seem that the only freedom the user has at time $n$ is the choice of $q_n(x_n|\boldsymbol{x}_{1:n-1}, \boldsymbol{y}_{1:n})$.
Even for standard IS, the variance of the resulting estimates increases exponentially with $n$.
The variance of the weights grows unboundedly (weight degeneracy: after some time, only one weight has a non-zero value).
As SIS is nothing but a special version of IS, consider as an illustration the target $\gamma_n(\boldsymbol{x}_{1:n}) = \prod_{k=1}^{n} \exp\!\left(-x_k^2/2\right)$ (so that $\pi_n$ is a product of standard normals) with an importance distribution of the form
$$q_n(\boldsymbol{x}_{1:n}) = \prod_{k=1}^{n} q_k(x_k) = \prod_{k=1}^{n} \mathcal{N}(x_k; 0, \sigma^2)$$
Note that:
$$w_\sigma(\boldsymbol{x}_{1:n}) = \frac{\gamma_n(\boldsymbol{x}_{1:n})}{q_n(\boldsymbol{x}_{1:n})} = (2\pi)^{n/2}\,\sigma^n\, \exp\!\left(-\frac{1}{2}\left(1 - \frac{1}{\sigma^2}\right)\sum_{i=1}^{n} x_i^2\right) \le (2\pi)^{n/2}\,\sigma^n \quad \forall\, \boldsymbol{x}_{1:n} \ (\text{for } \sigma^2 \ge 1)$$
$$\mathrm{Var}_{q_\sigma}\!\left[w_\sigma(\boldsymbol{x}_{1:n})\right] = \int \frac{\gamma_n^2(\boldsymbol{x}_{1:n})}{q_\sigma(\boldsymbol{x}_{1:n})}\,d\boldsymbol{x}_{1:n} - \left(\int \gamma_n(\boldsymbol{x}_{1:n})\,d\boldsymbol{x}_{1:n}\right)^2$$
$$= (2\pi)^{n/2}\sigma^n \int \exp\!\left(-\frac{1}{2}\left(2 - \frac{1}{\sigma^2}\right)\sum_{i=1}^{n} x_i^2\right) d\boldsymbol{x}_{1:n} - (2\pi)^n = (2\pi)^n\left[\frac{\sigma^{2n}}{(2\sigma^2 - 1)^{n/2}} - 1\right] = (2\pi)^n\left[\left(\frac{\sigma^4}{2\sigma^2 - 1}\right)^{n/2} - 1\right]$$
It is easy to see that $\sigma^4 > 2\sigma^2 - 1 \Leftrightarrow (\sigma^2 - 1)^2 > 0$, which holds whenever $\sigma^2 > \tfrac{1}{2}$ and $\sigma^2 \neq 1$. Therefore
$$\mathrm{Var}_{q_\sigma}\!\left[w_\sigma(\boldsymbol{x}_{1:n})\right] \to \infty \quad \text{as } n \to \infty$$
The variance of the weights increases exponentially fast with the dimensionality $n$. This is despite the good choice of $q_n(\boldsymbol{x}_{1:n})$.
For example, if we select $\sigma^2 = 1.2$ then we have a reasonably good importance distribution, as $q_k(x_k) \approx \pi_n(x_k)$, but
$$N\,\frac{\mathrm{Var}\,\hat Z_n}{Z_n^2} \approx (1.103)^{n/2},$$
which is approximately $1.9 \times 10^{21}$ for $n = 1000$! We would need $N \approx 2 \times 10^{23}$ particles to obtain a relative variance $\mathrm{Var}\,\hat Z_n / Z_n^2 \approx 0.01$. This is impractical.
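A minimal numerical sketch of this degeneracy (assuming, as above, the target $\gamma_n(\boldsymbol{x}_{1:n}) = \prod_k e^{-x_k^2/2}$ and the proposal $\prod_k \mathcal{N}(x_k; 0, \sigma^2)$ with the slide's $\sigma^2 = 1.2$; the particle count $N = 10{,}000$ is an arbitrary choice): the effective sample size $1/\sum_i (W_n^{(i)})^2$ of the importance weights collapses as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma2 = 10_000, 1.2                 # number of samples, proposal variance
sigma = np.sqrt(sigma2)

logw = np.zeros(N)                      # running log-weights of the N paths
for n in range(1, 501):
    x = rng.normal(0.0, sigma, size=N)  # x_n^(i) ~ N(0, sigma^2)
    # per-coordinate log-increment of gamma_n / q_n
    # (additive constants omitted; they cancel in the normalized weights):
    logw += (-0.5 * x**2) - (-0.5 * (x / sigma)**2 - np.log(sigma))
    if n % 100 == 0:
        W = np.exp(logw - logw.max())
        W /= W.sum()
        ess = 1.0 / np.sum(W**2)        # effective sample size
        print(f"n = {n:4d}   ESS = {ess:9.1f}  out of N = {N}")
```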
In practice one typically chooses a proposal with the Markov structure
$$q_n(x_n|\boldsymbol{y}_{1:n}, \boldsymbol{x}_{1:n-1}) = q_n(x_n|y_n, x_{n-1})$$
Given $x_{n-1}$ and $y_n$, the earlier observations $\boldsymbol{y}_{1:n-1}$ and states $\boldsymbol{x}_{1:n-2}$ do not bring any new information about $X_n$.
Our sequential importance sampling proposal now factorizes as
$$q_n(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n}) = \underbrace{q_{n-1}(\boldsymbol{x}_{1:n-1}|\boldsymbol{y}_{1:n-1})}_{\text{distribution of the paths } \boldsymbol{X}_{1:n-1}^{(i)}}\;\underbrace{q_n(x_n|y_n, x_{n-1})}_{\text{conditional distribution of } X_n^{(i)}} = q_1(x_1)\prod_{k=2}^{n} q_k(x_k|y_k, x_{k-1})$$
Thus we assume that at time $n-1$ we have sampled $\boldsymbol{X}_{1:n-1}^{(i)} \sim q_{n-1}(\boldsymbol{x}_{1:n-1}|\boldsymbol{y}_{1:n-1})$; to obtain $\boldsymbol{X}_{1:n}^{(i)} \sim q_n(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})$, we sample $X_n^{(i)} \sim q_n(x_n|y_n, X_{n-1}^{(i)})$ and then set
$$\boldsymbol{X}_{1:n}^{(i)} = \bigl(\underbrace{\boldsymbol{X}_{1:n-1}^{(i)}}_{\text{previously sampled path}},\; \underbrace{X_n^{(i)}}_{\text{single component sampled at time } n}\bigr)$$
At step $n \ge 2$: Sample $X_n^{(i)} \sim q(x_n|y_n, X_{n-1}^{(i)})$, $i = 1, \ldots, N$, and compute:
$$\hat p_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{\boldsymbol{X}_{1:n}^{(i)}}(\boldsymbol{x}_{1:n}), \qquad W_n^{(i)} \propto w\!\left(\boldsymbol{X}_{1:n}^{(i)}, \boldsymbol{y}_{1:n}\right) = w\!\left(\boldsymbol{X}_{1:n-1}^{(i)}, \boldsymbol{y}_{1:n-1}\right)\frac{f\!\left(X_n^{(i)}|X_{n-1}^{(i)}\right) g\!\left(y_n|X_n^{(i)}\right)}{q\!\left(X_n^{(i)}|y_n, X_{n-1}^{(i)}\right)}$$
Also note that if our interest is in computing the marginal posterior $\hat p_N(x_n|\boldsymbol{y}_{1:n})$ (the filtered density), then we only need to store $\boldsymbol{X}_{n-1:n}^{(i)}$ rather than the full paths $\boldsymbol{X}_{1:n}^{(i)}$:
$$\hat p_N(x_n|\boldsymbol{y}_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{X_n^{(i)}}(x_n), \qquad W_n^{(i)} \propto w\!\left(\boldsymbol{X}_{1:n}^{(i)}, \boldsymbol{y}_{1:n}\right) = w\!\left(\boldsymbol{X}_{1:n-1}^{(i)}, \boldsymbol{y}_{1:n-1}\right)\frac{f\!\left(X_n^{(i)}|X_{n-1}^{(i)}\right) g\!\left(y_n|X_n^{(i)}\right)}{q\!\left(X_n^{(i)}|y_n, X_{n-1}^{(i)}\right)}$$
Crisan, D., P. Del Moral, and T. Lyons (1999). Discrete filtering using branching and interacting particle systems. Markov Processes and Related Fields 5(3), 293–318.
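A minimal sketch of the SIS recursion above; all `sample_*`/`log*` arguments are hypothetical user-supplied callables for the proposal $q$, the prior $\mu$, the transition $f$ and the likelihood $g$ of a particular (scalar-state) model.

```python
import numpy as np

def sis_filter(y, N, rng, sample_q1, logq1, sample_q, logq, logmu, logf, logg):
    """Sequential importance sampling (no resampling) for a generic state-space model.
    Returns the particle history X[n, i] and the normalized weights W[n, i]."""
    T = len(y)
    X = np.empty((T, N))
    W = np.empty((T, N))

    X[0] = sample_q1(y[0], N, rng)                                # X_1^(i) ~ q_1(.|y_1)
    logw = logmu(X[0]) + logg(y[0], X[0]) - logq1(X[0], y[0])     # log w_1 = log[mu * g / q_1]
    W[0] = np.exp(logw - logw.max()); W[0] /= W[0].sum()

    for n in range(1, T):
        X[n] = sample_q(X[n - 1], y[n], rng)                      # X_n^(i) ~ q(.|y_n, X_{n-1}^(i))
        # incremental weight f(X_n|X_{n-1}) g(y_n|X_n) / q(X_n|y_n, X_{n-1}):
        logw += logf(X[n], X[n - 1]) + logg(y[n], X[n]) - logq(X[n], X[n - 1], y[n])
        W[n] = np.exp(logw - logw.max()); W[n] /= W[n].sum()
    return X, W
```

With the bootstrap choice $q = f$ the incremental weight reduces to $g(y_n|X_n^{(i)})$, which is the special case discussed next.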
The Bootstrap Particle Filter
A simple choice of an importance sampling distribution $q_n(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})$ is derived based on the following:
$$q_n(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n}) = p(\boldsymbol{x}_{1:n}),$$
that is, $q_1(x_1|y_1) = \mu(x_1)$ and
$$q_n(x_n|\boldsymbol{x}_{1:n-1}, \boldsymbol{y}_{1:n}) = \frac{q_n(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})}{q_{n-1}(\boldsymbol{x}_{1:n-1}|\boldsymbol{y}_{1:n-1})} = \frac{p(\boldsymbol{x}_{1:n})}{p(\boldsymbol{x}_{1:n-1})} = f(x_n|x_{n-1})$$
We also have:
$$w_n(\boldsymbol{x}_{1:n},\boldsymbol{y}_{1:n}) = \frac{p(\boldsymbol{x}_{1:n},\boldsymbol{y}_{1:n})}{q_n(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})} = \frac{p(\boldsymbol{x}_{1:n-1},\boldsymbol{y}_{1:n-1})}{q_{n-1}(\boldsymbol{x}_{1:n-1}|\boldsymbol{y}_{1:n-1})}\,\frac{f(x_n|x_{n-1})\,g(y_n|x_n)}{q_n(x_n|\boldsymbol{x}_{1:n-1},\boldsymbol{y}_{1:n})} = w_{n-1}(\boldsymbol{x}_{1:n-1},\boldsymbol{y}_{1:n-1})\,g(y_n|x_n)$$
This choice is extremely poor if the data are very informative (peaky likelihood), since the proposal distribution does not use any information from the data $\boldsymbol{y}_{1:n}$:
$$w_n(\boldsymbol{x}_{1:n},\boldsymbol{y}_{1:n}) = w_{n-1}(\boldsymbol{x}_{1:n-1},\boldsymbol{y}_{1:n-1})\,g(y_n|x_n) = \prod_{k=1}^{n} g(y_k|x_k)$$
In the bootstrap particle filter the particles are simulated according to the dynamical model and the weights are assigned according to the likelihood.
Bootstrap Particle Filter
One selects
$$q_1(x_1) = \mu(x_1) \quad \text{and} \quad q_n(x_n|\boldsymbol{x}_{1:n-1}) = q_n(x_n|x_{n-1}) = f(x_n|x_{n-1})$$
At time $n = 1$: sample $X_1^{(i)} \sim \mu(\cdot)$ and set $w_1^{(i)} = g\!\left(y_1|X_1^{(i)}\right)$.
At time $n$ ($n > 1$): sample $X_n^{(i)} \sim f\!\left(\cdot|X_{n-1}^{(i)}\right)$ and set $\boldsymbol{X}_{1:n}^{(i)} = \left(\boldsymbol{X}_{1:n-1}^{(i)}, X_n^{(i)}\right)$, so that
$$\boldsymbol{X}_{1:n}^{(i)} \sim \mu(x_1)\prod_{k=2}^{n} f(x_k|x_{k-1}), \qquad w_n\!\left(\boldsymbol{X}_{1:n}^{(i)}\right) = g\!\left(y_1|X_1^{(i)}\right)\prod_{k=2}^{n} g\!\left(y_k|X_k^{(i)}\right) = \prod_{k=1}^{n} g\!\left(y_k|X_k^{(i)}\right)$$
Ref: J. S. Liu and R. Chen (1995). Blind deconvolution via sequential imputations. Journal of the American Statistical Association 90, 567.
Resampling
Let us assume that at time $n$ the following approximation holds:
$$\hat p_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{\boldsymbol{X}_{1:n}^{(i)}}(\boldsymbol{x}_{1:n})$$
With resampling (sampling with replacement from the categorical distribution $\mathcal{C}\bigl(\{W_n^{(j)}\}_{j=1}^{N}\bigr)$, i.e., in proportion to the weights $W_n^{(i)}$), we sample $N$ times from the above distribution:
$$\widetilde{\boldsymbol{X}}_{1:n}^{(i)} \sim \hat p_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n}), \quad i = 1, \ldots, N$$
Note that the resampled particles $\widetilde{\boldsymbol{X}}_{1:n}^{(i)}$ are approximately distributed according to $p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})$, but they are statistically dependent (so standard CLT arguments no longer apply directly).
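A minimal sketch of the resampling step, assuming a normalized weight vector `W`; multinomial resampling is what is described above, and systematic resampling (not on this slide) is included as a common lower-variance alternative.

```python
import numpy as np

def multinomial_resample(W, rng):
    """Draw N ancestor indices a^(i) with P(a^(i) = j) = W^(j)."""
    N = len(W)
    return rng.choice(N, size=N, p=W)

def systematic_resample(W, rng):
    """Systematic resampling: same expected offspring counts, lower variance."""
    N = len(W)
    positions = (rng.uniform() + np.arange(N)) / N
    idx = np.searchsorted(np.cumsum(W), positions)
    return np.minimum(idx, N - 1)                 # guard against floating-point round-off

# usage: ancestors = multinomial_resample(W_n, rng); X_tilde = X[:, ancestors]
```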
Bayesian Recursion Formulas for the State Space Model
$$p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n}) = \frac{g(y_n|x_n)\,f(x_n|x_{n-1})}{p(y_n|\boldsymbol{y}_{1:n-1})}\,p(\boldsymbol{x}_{1:n-1}|\boldsymbol{y}_{1:n-1})$$
where the prediction of $y_n$ given $\boldsymbol{y}_{1:n-1}$ is
$$p(y_n|\boldsymbol{y}_{1:n-1}) = \int p(y_n, x_n|\boldsymbol{y}_{1:n-1})\,dx_n = \int g(y_n|x_n)\,p(x_n|\boldsymbol{y}_{1:n-1})\,dx_n$$
For the marginal (filtering) distribution, the recursion has two steps.
Step I – Prediction:
$$p(x_n|\boldsymbol{y}_{1:n-1}) = \int f(x_n|x_{n-1})\,p(x_{n-1}|\boldsymbol{y}_{1:n-1})\,dx_{n-1}$$
Step II – Update:
$$p(x_n|\boldsymbol{y}_{1:n}) = p(x_n|y_n, \boldsymbol{y}_{1:n-1}) = \frac{g(y_n|x_n)\,p(x_n|\boldsymbol{y}_{1:n-1})}{p(y_n|\boldsymbol{y}_{1:n-1})}$$
where $p(y_n|\boldsymbol{y}_{1:n-1}) = \int g(y_n|x_n)\,p(x_n|\boldsymbol{y}_{1:n-1})\,dx_n$.
Our key emphasis remains on the calculation of $p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})$ even if our interest is in computing $p(x_n|\boldsymbol{y}_{1:n})$.
This recursion leads to the Kalman filter for LG-SSM.
SMC is a simple simulation-based implementation of this recursion.
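For a finite-state HMM the prediction and update integrals become sums, so the recursion can be written in a few lines; a minimal sketch with a hypothetical 2-state example (the transition matrix `F` and likelihood vector `g_y` are placeholders):

```python
import numpy as np

# Hypothetical finite-state HMM: transition matrix F[j, k] = f(x_n = k | x_{n-1} = j)
# and emission likelihood column g_y[k] = g(y_n | x_n = k).
def predict_update(p_prev, F, g_y):
    """One step of the Bayesian filtering recursion for a discrete-state HMM."""
    p_pred = p_prev @ F                    # Step I  (prediction): p(x_n | y_{1:n-1})
    unnorm = g_y * p_pred                  # g(y_n | x_n) p(x_n | y_{1:n-1})
    p_y = unnorm.sum()                     # p(y_n | y_{1:n-1})
    return unnorm / p_y, p_y               # Step II (update): p(x_n | y_{1:n}), evidence term

F   = np.array([[0.9, 0.1], [0.2, 0.8]])
g_y = np.array([0.5, 0.05])               # likelihood of the observed y_n in each state
p_filt, p_y = predict_update(np.array([0.5, 0.5]), F, g_y)
```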
Filtering and Marginal Likelihood
To compute the normalizing factor $p(\boldsymbol{y}_{1:n})$, one can use a recursive calculation that avoids high-dimensional integration:
$$p(\boldsymbol{y}_{1:n}) = p(y_1)\prod_{k=2}^{n} p(y_k|\boldsymbol{y}_{1:k-1}),$$
$$p(y_k|\boldsymbol{y}_{1:k-1}) = \int g(y_k|x_k)\,p(x_k, x_{k-1}|\boldsymbol{y}_{1:k-1})\,d\boldsymbol{x}_{k-1:k} = \int g(y_k|x_k)\,f(x_k|x_{k-1})\,p(x_{k-1}|\boldsymbol{y}_{1:k-1})\,d\boldsymbol{x}_{k-1:k}$$
By sampling $X_n^{(i)} \sim f\bigl(x_n|\widetilde X_{n-1}^{(i)}\bigr)$, setting $\boldsymbol{X}_{1:n}^{(i)} = \bigl(\widetilde{\boldsymbol{X}}_{1:n-1}^{(i)}, X_n^{(i)}\bigr)$ and using $p(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n-1}) = f(x_n|x_{n-1})\,p(\boldsymbol{x}_{1:n-1}|\boldsymbol{y}_{1:n-1})$, we obtain:
$$\hat p_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n-1}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{\boldsymbol{X}_{1:n}^{(i)}}(\boldsymbol{x}_{1:n})$$
$$\hat p_N(y_n|\boldsymbol{y}_{1:n-1}) = \int g(y_n|x_n)\,\hat p_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n-1})\,d\boldsymbol{x}_{1:n} = \int g(y_n|x_n)\,\frac{1}{N}\sum_{i=1}^{N}\delta_{\boldsymbol{X}_{1:n}^{(i)}}(\boldsymbol{x}_{1:n})\,d\boldsymbol{x}_{1:n} = \frac{1}{N}\sum_{i=1}^{N} g\bigl(y_n|X_n^{(i)}\bigr)$$
Here the numbers of offspring $N_n^{(i)}$ of the particles under multinomial resampling follow a multinomial distribution with $\mathbb{E}\bigl[N_n^{(i)}\bigr] = N W_n^{(i)}$ and $\mathrm{Var}\bigl[N_n^{(i)}\bigr] = N W_n^{(i)}\bigl(1 - W_n^{(i)}\bigr)$.
Time 1:
Sample $N$ particles $X_1^{(i)} \sim \mu(x_1)$ and compute
$$\hat p_N(x_1|y_1) = \sum_{i=1}^{N} W_1^{(i)}\,\delta_{X_1^{(i)}}(x_1), \qquad W_1^{(i)} \propto g\bigl(y_1|X_1^{(i)}\bigr)$$
Resample $\widetilde X_1^{(i)} \sim \hat p_N(x_1|y_1)$ to obtain $p_N(x_1|y_1) = \frac{1}{N}\sum_{i=1}^{N}\delta_{\widetilde X_1^{(i)}}(x_1)$.
Time $n$, $n \ge 2$: Given $\hat p_N(\boldsymbol{x}_{1:n-1}|\boldsymbol{y}_{1:n-1}) = \sum_{i=1}^{N} W_{n-1}^{(i)}\,\delta_{\boldsymbol{X}_{1:n-1}^{(i)}}(\boldsymbol{x}_{1:n-1})$:
Resample: sample ancestor indices $a_n^i \sim \mathcal{C}\bigl(\{W_{n-1}^{(j)}\}_{j=1}^{N}\bigr)$ and set $\widetilde{\boldsymbol{X}}_{1:n-1}^{(i)} = \boldsymbol{X}_{1:n-1}^{(a_n^i)}$, giving $p_N(\boldsymbol{x}_{1:n-1}|\boldsymbol{y}_{1:n-1}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{\widetilde{\boldsymbol{X}}_{1:n-1}^{(i)}}(\boldsymbol{x}_{1:n-1})$.
Propagate: sample $X_n^{(i)} \sim f\bigl(x_n|\widetilde X_{n-1}^{(i)}\bigr)$ and set $\boldsymbol{X}_{1:n}^{(i)} = \bigl(\widetilde{\boldsymbol{X}}_{1:n-1}^{(i)}, X_n^{(i)}\bigr)$, giving $\hat p_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n-1}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{\boldsymbol{X}_{1:n}^{(i)}}(\boldsymbol{x}_{1:n})$.
Weight: compute $W_n^{(i)} \propto g\bigl(y_n|X_n^{(i)}\bigr)$ and normalize, giving $\hat p_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{\boldsymbol{X}_{1:n}^{(i)}}(\boldsymbol{x}_{1:n})$.
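A minimal sketch of this resample/propagate/weight loop for a hypothetical scalar linear-Gaussian toy model ($X_1 \sim \mathcal{N}(0,1)$, $X_n = a X_{n-1} + \sigma_x V_n$, $Y_n = X_n + \sigma_y E_n$; the parameter values are placeholders). It also accumulates the log of the marginal likelihood estimate $\hat p_N(\boldsymbol{y}_{1:n})$ introduced earlier.

```python
import numpy as np

def bootstrap_pf(y, N, rng, a=0.9, sigma_x=1.0, sigma_y=0.5):
    """Bootstrap particle filter for the hypothetical toy model above."""
    T = len(y)
    x = rng.normal(0.0, 1.0, size=N)                     # X_1^(i) ~ mu = N(0, 1)
    W = np.full(N, 1.0 / N)
    log_lik = 0.0
    filt_mean = np.empty(T)
    for n in range(T):
        if n > 0:
            anc = rng.choice(N, size=N, p=W)                       # resample ancestors a_n^i
            x = a * x[anc] + sigma_x * rng.normal(size=N)          # propagate with f(.|x_{n-1})
        # weight with the likelihood g(y_n | x_n) (Gaussian log-density):
        logw = -0.5 * ((y[n] - x) / sigma_y) ** 2 - np.log(sigma_y) - 0.5 * np.log(2 * np.pi)
        m = logw.max()
        log_lik += m + np.log(np.mean(np.exp(logw - m)))           # log p_hat(y_n | y_{1:n-1})
        W = np.exp(logw - m); W /= W.sum()
        filt_mean[n] = np.sum(W * x)                               # estimate of E[X_n | y_{1:n}]
    return filt_mean, log_lik

# usage on placeholder data:
rng = np.random.default_rng(2)
y = rng.normal(size=50)
means, ll = bootstrap_pf(y, N=1000, rng=rng)
```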
To make the notation for the ancestral paths explicit, one can represent them in the form
$$\boldsymbol{X}_{1:n}^{(i)} = \bigl(\boldsymbol{X}_{1:n-1}^{(a_n^i)}, X_n^{(i)}\bigr),$$
where $\boldsymbol{X}_{1:n-1}^{(a_n^i)}$ is the path that terminates at the ancestor of $X_n^{(i)}$, i.e. $X_n^{(i)} \sim f\bigl(x_n|X_{n-1}^{(a_n^i)}\bigr)$.
For example, the ancestral path of particle 1 at time 3 can be written as
$$\bigl(X_1^{a_2^{a_3^1}},\, X_2^{a_3^1},\, X_3^{1}\bigr) = \bigl(X_1^{2},\, X_2^{2},\, X_3^{1}\bigr)$$
(here with $a_3^1 = 2$ and $a_2^2 = 2$).
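If the ancestor indices $a_n^i$ are stored during filtering, any surviving path can be reconstructed afterwards; a minimal sketch (the arrays `A[n, i]` for $a_n^i$ and `X[n, i]` for the particle values are hypothetical bookkeeping of the filter above):

```python
import numpy as np

def trace_ancestry(X, A, i):
    """Reconstruct the ancestral path ending at particle i at the final time.
    X[n, j] holds the particle value X_n^(j); A[n, j] = a_n^j is the ancestor index
    chosen at the resampling step of time n (A[0] is unused)."""
    T = X.shape[0]
    path = np.empty(T)
    j = i
    for n in range(T - 1, -1, -1):
        path[n] = X[n, j]
        if n > 0:
            j = A[n, j]               # move to the ancestor at time n-1
    return path
```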
Vanilla SMC: Bootstrap Filter (Gordon et al.)
At time $n$, $n \ge 2$: given $\hat p_N(\boldsymbol{x}_{1:n-1}|\boldsymbol{y}_{1:n-1}) = \sum_{i=1}^{N} W_{n-1}^{(i)}\,\delta_{\boldsymbol{X}_{1:n-1}^{(i)}}(\boldsymbol{x}_{1:n-1})$:
After resampling: it produces $p_N(\boldsymbol{x}_{1:n-1}|\boldsymbol{y}_{1:n-1}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{\widetilde{\boldsymbol{X}}_{1:n-1}^{(i)}}(\boldsymbol{x}_{1:n-1})$.
After propagation: it produces $\hat p_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n-1}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{\boldsymbol{X}_{1:n}^{(i)}}(\boldsymbol{x}_{1:n})$, where $\boldsymbol{X}_{1:n}^{(i)} = \bigl(\widetilde{\boldsymbol{X}}_{1:n-1}^{(i)}, X_n^{(i)}\bigr)$.
After weighting: it produces $\hat p_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{\boldsymbol{X}_{1:n}^{(i)}}(\boldsymbol{x}_{1:n})$.
In the original bootstrap particle filter of Gordon et al. (particles are simulated from the dynamical model and weights are assigned according to the likelihood), the focus was on computing an approximation $\hat p_N(x_n|\boldsymbol{y}_{1:n})$ of the filtering marginal.
SMC Output
At time $n$ we have:
$$\hat p_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n-1}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{\boldsymbol{X}_{1:n}^{(i)}}(\boldsymbol{x}_{1:n}), \qquad \boldsymbol{X}_{1:n}^{(i)} = \bigl(\widetilde{\boldsymbol{X}}_{1:n-1}^{(i)}, X_n^{(i)}\bigr)$$
$$\hat p_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\delta_{\boldsymbol{X}_{1:n}^{(i)}}(\boldsymbol{x}_{1:n}), \qquad p_N(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n}) = \frac{1}{N}\sum_{i=1}^{N}\delta_{\widetilde{\boldsymbol{X}}_{1:n}^{(i)}}(\boldsymbol{x}_{1:n})$$
The marginal likelihood estimate, with $\hat p_N(y_k|\boldsymbol{y}_{1:k-1}) = \int g(y_k|x_k)\,\hat p_N(\boldsymbol{x}_{1:k}|\boldsymbol{y}_{1:k-1})\,d\boldsymbol{x}_{1:k} = \frac{1}{N}\sum_{i=1}^{N} g\bigl(y_k|X_k^{(i)}\bigr)$, is:
$$\hat p_N(\boldsymbol{y}_{1:n}) = \hat p_N(y_1)\prod_{k=2}^{n} \hat p_N(y_k|\boldsymbol{y}_{1:k-1}) = \prod_{k=1}^{n}\left[\frac{1}{N}\sum_{i=1}^{N} g\bigl(y_k|X_k^{(i)}\bigr)\right]$$
Computational complexity is 𝒪 (𝑁) at each time step and memory requirements 𝒪 (𝑛𝑁).
For the linear Gaussian state space model
$$X_{n+1} = A X_n + B u_n + V_n, \qquad Y_n = C X_n + D u_n + E_n,$$
$$\begin{pmatrix} X_0 \\ V_n \\ E_n \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix} \mu \\ 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} P_0 & 0 & 0 \\ 0 & Q & S \\ 0 & S^T & R \end{pmatrix}\right),$$
the Kalman filter recursions are:
Measurement update:
$$p(x_n|\boldsymbol{y}_{1:n}) = \mathcal{N}\bigl(x_n|\hat x_{n|n}, P_{n|n}\bigr)$$
$$\hat x_{n|n} = \hat x_{n|n-1} + K_n\bigl(y_n - C\hat x_{n|n-1} - D u_n\bigr), \qquad P_{n|n} = (I - K_n C)\,P_{n|n-1}, \qquad K_n = P_{n|n-1} C^T\bigl(C P_{n|n-1} C^T + R\bigr)^{-1}$$
Prediction:
$$p(x_{n+1}|\boldsymbol{y}_{1:n}) = \mathcal{N}\bigl(x_{n+1}|\hat x_{n+1|n}, P_{n+1|n}\bigr)$$
$$\hat x_{n+1|n} = A\hat x_{n|n} + B u_n, \qquad P_{n+1|n} = A P_{n|n} A^T + Q$$
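A minimal sketch implementing one measurement-update plus prediction step exactly as written above (uncorrelated noise case, $S = 0$, which is what these update equations correspond to; the control terms are optional):

```python
import numpy as np

def kalman_step(x_pred, P_pred, y, A, C, Q, R, u=None, B=None, D=None):
    """One measurement update + prediction of the Kalman filter (LG-SSM).
    x_pred, P_pred are x_hat_{n|n-1}, P_{n|n-1}; returns filtered and next predicted moments."""
    innov = y - C @ x_pred - (D @ u if D is not None else 0.0)        # y_n - C x_hat - D u_n
    S_innov = C @ P_pred @ C.T + R
    K = P_pred @ C.T @ np.linalg.inv(S_innov)                         # Kalman gain K_n
    x_filt = x_pred + K @ innov                                       # x_hat_{n|n}
    P_filt = (np.eye(len(x_pred)) - K @ C) @ P_pred                   # P_{n|n}
    x_next = A @ x_filt + (B @ u if B is not None else 0.0)           # x_hat_{n+1|n}
    P_next = A @ P_filt @ A.T + Q                                     # P_{n+1|n}
    return x_filt, P_filt, x_next, P_next
```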
Forward-Filtering Backward-Smoothing
One can also estimate the marginal smoothing distributions $p(x_n|\boldsymbol{y}_{1:T})$, $n = 1, \ldots, T$ (an offline estimate once all measurements $\boldsymbol{y}_{1:T}$ have been collected).
I – Forward pass: compute and store $p(x_n|\boldsymbol{y}_{1:n})$ and $p(x_{n+1}|\boldsymbol{y}_{1:n})$, $n = 1, 2, \ldots, T$, using the update and prediction recursions derived earlier.
II – Backward pass ($n = T-1, T-2, \ldots, 1$):
$$p(x_n|\boldsymbol{y}_{1:T}) = p(x_n|\boldsymbol{y}_{1:n}) \int \frac{f(x_{n+1}|x_n)}{p(x_{n+1}|\boldsymbol{y}_{1:n})}\,p(x_{n+1}|\boldsymbol{y}_{1:T})\,dx_{n+1}$$
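Continuing the hypothetical finite-state example used earlier, a minimal sketch of this backward pass, reusing the filtered distributions $p(x_n|\boldsymbol{y}_{1:n})$ and one-step predictions $p(x_{n+1}|\boldsymbol{y}_{1:n})$ stored during the forward pass (the array names are placeholders):

```python
import numpy as np

def backward_smooth(filt, pred, F):
    """Forward-filtering backward-smoothing for a finite-state HMM.
    filt[n] = p(x_n | y_{1:n})            (from the forward pass)
    pred[n] = p(x_{n+1} | y_{1:n}),  n = 0, ..., T-2
    F[j, k] = f(x_{n+1} = k | x_n = j)."""
    T = filt.shape[0]
    smooth = np.empty_like(filt)
    smooth[-1] = filt[-1]
    for n in range(T - 2, -1, -1):
        # sum_k f(k | j) * p(x_{n+1}=k | y_{1:T}) / p(x_{n+1}=k | y_{1:n})
        smooth[n] = filt[n] * (F @ (smooth[n + 1] / pred[n]))
    return smooth
```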
Note that $p(x_n|x_{n+1}, \boldsymbol{y}_{1:T}) = p(x_n|x_{n+1}, \boldsymbol{y}_{1:n})$, since
$$p(x_n|x_{n+1}, \boldsymbol{y}_{1:T}) = p(x_n|x_{n+1}, \boldsymbol{y}_{1:n}, \boldsymbol{y}_{n+1:T}) = \frac{p(\boldsymbol{y}_{n+1:T}|x_n, x_{n+1}, \boldsymbol{y}_{1:n})\,p(x_n|x_{n+1}, \boldsymbol{y}_{1:n})}{p(\boldsymbol{y}_{n+1:T}|x_{n+1}, \boldsymbol{y}_{1:n})} = \frac{p(\boldsymbol{y}_{n+1:T}|x_{n+1})\,p(x_n|x_{n+1}, \boldsymbol{y}_{1:n})}{p(\boldsymbol{y}_{n+1:T}|x_{n+1})} = p(x_n|x_{n+1}, \boldsymbol{y}_{1:n})$$
Alternatively, one can combine a forward prediction with a backward recursion for $p(\boldsymbol{y}_{n:T}|x_n)$:
$$p(\boldsymbol{y}_{n:T}|x_n) = g(y_n|x_n)\int p(\boldsymbol{y}_{n+1:T}|x_{n+1})\,f(x_{n+1}|x_n)\,dx_{n+1}$$
Step II – Update:
$$p(x_n|\boldsymbol{y}_{1:T}) = p(x_n|\boldsymbol{y}_{1:n-1}, \boldsymbol{y}_{n:T}) = \frac{p(x_n|\boldsymbol{y}_{1:n-1})\,p(\boldsymbol{y}_{n:T}|x_n)}{p(\boldsymbol{y}_{n:T}|\boldsymbol{y}_{1:n-1})}$$
Note that we can have $\int p(\boldsymbol{y}_{n:T}|x_n)\,dx_n = \infty$; since $p(\boldsymbol{y}_{n:T}|x_n)$ need not be integrable in $x_n$, this precludes approximating it directly with SMC algorithms.
*M. Kok, J. D. Hol and T. B. Schön. Using inertial sensors for position and orientation estimation. Foundations and Trends of Signal Processing, 11(1-2):1-153,
2017. (In motion capture, 𝑋𝑛 can represent the position/orientation of different body segments of a person, 𝜃 the body shape and 𝑌𝑛 measurements from sensors).
Standard (stochastic) gradient algorithms can be used to find a local optimum of the likelihood, e.g. based on Fisher's identity:
$$\nabla_\theta \log p_\theta(\boldsymbol{y}_{1:n}) = \int \nabla_\theta \log p_\theta(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n})\; p_\theta(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})\,d\boldsymbol{x}_{1:n}$$
These algorithms can work decently but it can be difficult to scale the components of the
gradients.
Note that these algorithms involve computing 𝑝𝜃 𝒙1:𝑛 |𝒚1:𝑛 which is the key result of the SMC
algorithm.
Expectation/Maximization for HMM
One can also use the EM algorithm:
$$\theta^{(i)} = \arg\max_\theta Q\bigl(\theta^{(i-1)}, \theta\bigr), \qquad Q\bigl(\theta^{(i-1)}, \theta\bigr) = \int \log p_\theta(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n})\; p_{\theta^{(i-1)}}(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})\,d\boldsymbol{x}_{1:n}$$
Above we used:
$$p_\theta(\boldsymbol{x}_{1:n}, \boldsymbol{y}_{1:n}) = \mu_\theta(x_1)\prod_{k=2}^{n} f_\theta(x_k|x_{k-1})\prod_{k=1}^{n} g_\theta(y_k|x_k)$$
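Substituting this factorization into $Q$ splits it into initial, transition and emission terms, each an expectation under the smoothing distribution at $\theta^{(i-1)}$ (so only the marginal smoothing distributions of the pairs $(x_{k-1}, x_k)$ are actually needed):
$$Q\bigl(\theta^{(i-1)}, \theta\bigr) = \mathbb{E}_{\theta^{(i-1)}}\!\bigl[\log \mu_\theta(X_1)\bigr] + \sum_{k=2}^{n} \mathbb{E}_{\theta^{(i-1)}}\!\bigl[\log f_\theta(X_k|X_{k-1})\bigr] + \sum_{k=1}^{n} \mathbb{E}_{\theta^{(i-1)}}\!\bigl[\log g_\theta(y_k|X_k)\bigr],$$
where the expectations are taken with respect to $p_{\theta^{(i-1)}}(\boldsymbol{x}_{1:n}|\boldsymbol{y}_{1:n})$.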