Temporal Parallelization of Bayesian Smoothers
Abstract— This paper presents algorithms for temporal parallelization of Bayesian smoothers. We define the elements and the operators to pose these problems as the solutions to all-prefix-sums operations for which efficient parallel scan-algorithms are available. We present the temporal parallelization of the general Bayesian filtering and smoothing equations and specialize them to linear/Gaussian models. The advantage of the proposed algorithms is that they reduce the linear complexity of standard smoothing algorithms with respect to time to logarithmic.

... but there is a lack of algorithms that are specifically designed for solving state-estimation problems in parallel architectures. There are, however, some existing approaches for parallelizing Kalman-type filters as well as particle filters. One approach, studied in [11] and [12], is to parallelize the corresponding batch formulation, which leads to sub-linear computational methods, because the matrix computations can be parallelized. If the state-space of the Kalman ...
where
\[
f_{ij}(x \mid z) = \frac{\int g_j(y)\, f_j(x \mid y)\, f_i(y \mid z)\, dy}{\int g_j(y)\, f_i(y \mid z)\, dy},
\qquad
g_{ij}(z) = g_i(z) \int g_j(y)\, f_i(y \mid z)\, dy.
\]
The proof that ⊗ has the associative property is given in Appendix I.

Theorem 3: Given the elements a_k = (f_k, g_k) ∈ F, where
\[
f_k(x_k \mid x_{k-1}) = p(x_k \mid y_k, x_{k-1}),
\qquad
g_k(x_{k-1}) = p(y_k \mid x_{k-1}),
\]
with p(x_1 | y_1, x_0) = p(x_1 | y_1) and p(y_1 | x_0) = p(y_1), the k-th prefix sum is
\[
a_1 \otimes a_2 \otimes \cdots \otimes a_k = \big( p(x_k \mid y_{1:k}),\; p(y_{1:k}) \big).
\]
Theorem 3 is proved in Appendix I. Theorem 3 implies that we can parallelize the computation of all filtering distributions p(x_k | y_{1:k}) and the marginal likelihoods p(y_{1:k}), of which the latter can be used for parameter estimation [6].
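To make Definition 2 and Theorem 3 concrete, the following is a minimal sketch of the element combination for a discrete state space, where the integrals above reduce to sums (the framework covers this case with the counting measure as the reference measure, as noted in the concluding remarks). The name combine_filter and the matrix/vector encoding of (f_k, g_k) are illustration conventions of ours, not notation from the paper:

```python
import numpy as np

def combine_filter(a_i, a_j):
    """Combination (f_ij, g_ij) of Definition 2 for a discrete state
    space with n_x states, where integrals reduce to sums. An element
    (f, g) is encoded as an (n_x, n_x) matrix f with columns
    f[:, z] = p(x_k | y_k, x_{k-1} = z) and an (n_x,) vector
    g[z] = p(y_k | x_{k-1} = z)."""
    f_i, g_i = a_i
    f_j, g_j = a_j
    # Denominator of f_ij, reused as the correction factor in g_ij:
    # sum_y g_j(y) f_i(y | z).
    den = g_j @ f_i
    # Numerator of f_ij: sum_y g_j(y) f_j(x | y) f_i(y | z).
    f_ij = (f_j * g_j[None, :]) @ f_i / den[None, :]
    g_ij = g_i * den
    return f_ij, g_ij
```

Since ⊗ is associative, any parallel scan (e.g., [18]–[20]) applied with this combination yields all the prefix sums of Theorem 3.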
Remark 4: If we only know p(y_k | x_{k-1}) up to a proportionality constant, which means that g_k(x_{k-1}) ∝ p(y_k | x_{k-1}), we can still recover the filtering density p(x_k | y_{1:k}) by the above operations. However, we will not be able to recover the marginal likelihoods p(y_{1:k}). We can nevertheless recover p(y_{1:k}) by an additional parallel pass, as will be explained in Section III-C.
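Continuing the discrete-state sketch above, the elements a_k of Theorem 3 can be built from a transition matrix and an observation likelihood vector; a sequential reduce stands in for the parallel scan here, combine_filter is the sketch defined earlier, and all other names (make_element, first_element) are ours:

```python
import numpy as np
from functools import reduce

def make_element(T, o):
    """Element a_k = (f_k, g_k) of Theorem 3 for a discrete model with
    T[x, z] = p(x_k = x | x_{k-1} = z) and o[x] = p(y_k | x_k = x)."""
    g = o @ T                           # g_k(z) = sum_x o(x) T(x, z)
    f = (o[:, None] * T) / g[None, :]   # f_k(x | z); columns sum to 1
    return f, g

def first_element(prior, o):
    """a_1 uses the conventions p(x_1 | y_1, x_0) = p(x_1 | y_1) and
    p(y_1 | x_0) = p(y_1): its f is constant in the conditioning
    argument and its g is the constant p(y_1)."""
    g1 = o @ prior
    f1 = np.tile((o * prior / g1)[:, None], (1, len(prior)))
    return f1, np.full(len(prior), g1)

# With elements = [first_element(prior, o_1), make_element(T, o_2), ...],
# reduce(combine_filter, elements[:k]) returns p(x_k | y_{1:k}) in every
# column of its f part and p(y_{1:k}) in every entry of its g part.
```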
B. Bayesian smoothing
The Bayesian smoothing pass requires that the filtering densities have been obtained beforehand. In smoothing, we consider a different type of element a and binary operator ⊗ than those used in filtering. As we will see in this section, an element a is a function a : R^{n_x} × R^{n_x} → [0, ∞) that belongs to the set
\[
S = \left\{ a : \int a(x \mid z)\, dx = 1 \right\}.
\]
Definition 5: Given two elements a_i ∈ S and a_j ∈ S, the binary associative operator ⊗ for Bayesian smoothing is a_i ⊗ a_j = a_{ij}, where
\[
a_{ij}(x \mid z) = \int a_i(x \mid y)\, a_j(y \mid z)\, dy.
\]
The proof that ⊗ has the associative property is included in Appendix II.

Theorem 6: Given the elements a_k = p(x_k | y_{1:k}, x_{k+1}) ∈ S, with a_n = p(x_n | y_{1:n}), we have that
\[
a_k \otimes a_{k+1} \otimes \cdots \otimes a_n = p(x_k \mid y_{1:n}).
\]
Theorem 6 is proved in Appendix II. Theorem 6 implies that we can compute all smoothing distributions in parallel form. However, it should be noted that we should apply the parallel scan algorithm with elements in reverse order, that is, with elements b_k = a_{n-k+1}, so that the prefix sums b_1 ⊗ · · · ⊗ b_k recover the smoothing densities.
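As an illustration, for a discrete state space the smoothing operator of Definition 5 is simply a matrix product. The following hedged sketch (all names are ours) also shows the reverse-ordered prefix computation required by Theorem 6, with a sequential reduce standing in for the O(log n) parallel scan:

```python
import numpy as np
from functools import reduce

def combine_smoother(a_i, a_j):
    """Definition 5 with a sum in place of the integral: for a discrete
    state space, a[x, z] = p(x_k = x | y_{1:k}, x_{k+1} = z) and
    a_ij(x | z) = sum_y a_i(x | y) a_j(y | z) is a matrix product."""
    return a_i @ a_j

def smoothing_densities(elements):
    """All smoothing densities from the elements of Theorem 6.
    elements[k] holds a_{k+1}; the last one, a_n = p(x_n | y_{1:n}),
    is stored as a matrix with identical columns, so every prefix
    a_k (x) ... (x) a_n is constant in its conditioning argument.
    A sequential reduce stands in for the parallel scan here."""
    return [reduce(combine_smoother, elements[k:])[:, 0]
            for k in range(len(elements))]
```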
C. Additional aspects

We proceed to discuss additional aspects of the previous formulation of filtering and smoothing. In Section III-A, it was indicated that the marginal likelihood p(y_{1:n}) is directly available from the parallel scan algorithm if g_k(x_{k-1}) = p(y_k | x_{k-1}). However, sometimes we only know p(y_k | x_{k-1}) up to a proportionality constant so g_k(x_{k-1}) ∝ p(y_k | x_{k-1}), as will happen in Section IV. Although in this case the parallel scan Bayesian filtering algorithm provides us with the filtering densities but not the marginal likelihood p(y_{1:n}), we can still recover the marginal likelihoods as follows. We first run the parallel filtering algorithm to recover all filtering distributions p(x_k | y_{1:k}) for k = 1 to n and then we perform the following decomposition for p(y_{1:n}):
\[
p(y_{1:n}) = \prod_{k=1}^{n} p(y_k \mid y_{1:k-1}),
\]
where
\[
p(y_k \mid y_{1:k-1}) = \int p(y_k \mid x_k)\, p(x_k \mid y_{1:k-1})\, dx_k.
\]
Each factor p(y_k | y_{1:k-1}) can be computed in parallel using the predictive density p(x_k | y_{1:k-1}) and the likelihood p(y_k | x_k). We can then recover all p(y_{1:k}) by O(log n) parallel recursive pairwise multiplications of the adjacent terms.
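As a sketch of this last step, a Hillis–Steele-style inclusive scan (cf. [18]–[20]) combines adjacent terms pairwise in O(log n) parallel steps. Working with the logarithms of the factors is a numerical-stability choice of ours, not something prescribed above:

```python
import numpy as np

def prefix_sums_inclusive(v):
    """Inclusive scan by recursive pairwise combination: O(log n)
    parallel steps when each entry has its own processor. Applied to
    v[k] = log p(y_{k+1} | y_{1:k}), it returns log p(y_{1:k+1}) for
    every k."""
    v = np.asarray(v, dtype=float).copy()
    shift = 1
    while shift < len(v):
        # All entries are combined simultaneously in one parallel step;
        # NumPy evaluates the right-hand side before assigning.
        v[shift:] = v[shift:] + v[:-shift]
        shift *= 2
    return v
```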
It is also possible to perform the parallelization at block level instead of at individual element level. When using the parallel scan algorithm, we do not need to assign each single-measurement element to a single computational node, but instead we can perform initial computations in blocks such that a single node processes a block of measurements before combining the results with other blocks. The results of the blocks can then be used as the elements in the parallel-scan algorithm. This kind of procedure corresponds to selecting the elements for the scan algorithm to consist of blocks of length l:
\[
a_k = \big( p(x_{lk} \mid y_{l(k-1)+1:kl}, x_{l(k-1)}),\; p(y_{l(k-1)+1:kl} \mid x_{l(k-1)}) \big)
\]
in filtering and
\[
a_k = p(x_{lk} \mid y_{1:l(k+1)-1}, x_{l(k+1)}) \quad (7)
\]
in smoothing instead of the corresponding terms with l = 1. A practical advantage of this is that we can more easily distribute the computations to a limited number of computational nodes while still getting the optimal speedup from parallelization.
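Since ⊗ is associative, each block element can be formed by combining its l single-measurement elements sequentially inside one node before the scan. A minimal sketch follows (the helper name is ours, and n is assumed to be a multiple of l):

```python
from functools import reduce

def blocked_elements(elements, l, combine):
    """Combine each run of l consecutive single-measurement elements
    into one block element using the associative operator `combine`
    (the filtering or the smoothing combination). The returned list
    is then fed to the parallel scan in place of the original
    elements."""
    return [reduce(combine, elements[i:i + l])
            for i in range(0, len(elements), l)]
```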
IV. PARALLEL LINEAR/GAUSSIAN FILTER AND SMOOTHER

The parallel linear/Gaussian filter and smoother are obtained by particularising the element a and binary operator ⊗ for Bayesian filtering and smoothing explained in the previous section to linear/Gaussian systems. The sequential versions of these algorithms correspond to the Kalman filter and the RTS smoother.

We consider the linear/Gaussian state space model
\[
x_k = F_{k-1} x_{k-1} + u_{k-1} + q_{k-1},
\qquad
y_k = H_k x_k + d_k + r_k,
\]
where F_{k-1} ∈ R^{n_x×n_x} and H_k ∈ R^{n_y×n_x} are known matrices, u_{k-1} ∈ R^{n_x} and d_k ∈ R^{n_y} are known vectors, and q_{k-1} and r_k are zero-mean, independent Gaussian noises with covariance matrices Q_{k-1} ∈ R^{n_x×n_x} and R_k ∈ R^{n_y×n_y}. The initial distribution is given as x_0 ∼ N(m_0, P_0). With this model, we have that
\[
p(x_k \mid x_{k-1}) = \mathrm{N}(x_k;\, F_{k-1} x_{k-1} + u_{k-1},\, Q_{k-1}), \quad (8)
\]
\[
p(y_k \mid x_k) = \mathrm{N}(y_k;\, H_k x_k + d_k,\, R_k). \quad (9)
\]

In this section, we use the notation N_I(·; η, J) to denote a Gaussian density parameterised in information form so that η is the information vector and J is the information matrix. If a Gaussian distribution has mean x and covariance matrix P, its parameterisation in information form is η = P^{-1}x and J = P^{-1}. This parametrization corresponds to the so-called information form of the Kalman filter [23]. We also use I_{n_x} to denote an identity matrix of size n_x.
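For reference, here is a sketch that draws a trajectory and measurements from the model of Eqs. (8) and (9); for brevity it assumes time-invariant F, Q, H, R, u and d, which is an assumption of this sketch rather than of the paper:

```python
import numpy as np

def simulate_lgssm(F, Q, H, R, m0, P0, u, d, n, rng=None):
    """Sample x_{1:n} and y_{1:n} from the linear/Gaussian state-space
    model x_k = F x_{k-1} + u + q_{k-1}, y_k = H x_k + d + r_k, with
    q ~ N(0, Q), r ~ N(0, R), and x_0 ~ N(m0, P0)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.multivariate_normal(m0, P0)
    xs, ys = [], []
    for _ in range(n):
        x = F @ x + u + rng.multivariate_normal(np.zeros(len(m0)), Q)  # Eq. (8)
        y = H @ x + d + rng.multivariate_normal(np.zeros(len(d)), R)   # Eq. (9)
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)
```

A sequential Kalman filter run on the simulated measurements then gives a reference solution against which a parallel implementation can be validated.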
[Fig. 2. Simulated trajectory from the linear tracking model in Eqs. (15) and (16) along with the Kalman filter (KF) and RTS smoother results. Axes: u (horizontal), v (vertical); legend: True Trajectory, Measurements.]

[Fig. 3. The flops and the span flops for the sequential Kalman filter (KF) and the parallel Kalman filter (PKF). Horizontal axis: time series length; legend: KF flops, PKF span flops.]
[Figure residue (caption not in this excerpt): RTS flops and PRTS span flops versus time series length.]

... position (u, v), we aim to solve the smoothing problem in order to determine the whole trajectory of the target.

[Fig. 5. Ratio of work flops for the parallel and sequential Kalman filter and the parallel and sequential RTS smoother. Horizontal axis: time series length; legend: KF work ratio, RTS work ratio.]
... smoothing solutions for state space models. The framework is based on formulating the computations in terms of associative operations between suitably defined elements such that the all-prefix-sums operation computed by a parallel-scan algorithm exactly produces the Bayesian filtering and smoothing solutions. The advantage of the framework is that the parallelization allows for performing the computations in O(log n) span complexity, where n is the number of data points, while sequential filtering and smoothing algorithms have an O(n) complexity. Parallel versions of Kalman filters and Rauch–Tung–Striebel smoothers were derived as special cases of the framework. The computational advantages of the framework were illustrated in a numerical simulation.

A disadvantage of the proposed methodology is that although the wall-clock time of execution is significantly reduced, the total amount of computations (and hence required energy) is larger than with conventional sequential algorithms. Although the total amount of computations is only increased by a constant factor, in some systems, such as small-scale mobile systems, even if parallelization would be possible, it can be beneficial to use the classic algorithms. However, the speedup gain of the proposed approach is beneficial in applications such as data-assimilation-based weather forecasting [24] and other spatio-temporal systems appearing, for example, in tomographic reconstruction [8] or machine learning [25], where the computations take a significant amount of time. In these systems, it is possible to dedicate the required amount of extra computational resources to gain the significant speedup provided by parallelization.

Although we have restricted our consideration to specific types of parallel-scan algorithms, it is also possible to use other kinds of algorithms for computing the prefix sums corresponding to the Bayesian filtering and smoothing solutions. We could also select algorithms for given computer or network architectures, for minimizing the communication between the nodes, or for minimizing the energy consumption [26], [27]. The present formulation of the computations in terms of local associative operations is likely to have other applications beyond parallelization. For example, in decentralized systems, it is advantageous to be able to first perform operations locally and then combine them to produce the full state-estimation solution.

The proposed framework is also valid for discrete state spaces as well as for other state spaces provided that we consider the elements with the appropriate domain and replace the Lebesgue integrals by integrals with respect to the corresponding reference measure, e.g., the counting measure in the case of discrete-state models.

The framework could be extended to non-linear and non-Gaussian models by replacing the exact Kalman filters and smoothers with iterated extended Kalman filters and smoothers [28], [29] or their sigma-point/numerical-integration versions such as posterior linearization filters and smoothers [30]–[32]. Possible future work also includes developing particle filter and smoother methods (see, e.g., [6]) for the present framework along with various other Bayesian filter and smoother approximations proposed in the literature.
APPENDIX I

In this appendix, we prove the required results for Bayesian filtering: the associative property of the operator in Definition 2 and Theorem 3.

A. Associative property

In order to prove the associative property of ⊗ for filtering, we need to prove that for three elements (f_i, g_i), (f_j, g_j), (f_k, g_k) ∈ F, the following relation holds:
\[
\big( (f_i, g_i) \otimes (f_j, g_j) \big) \otimes (f_k, g_k)
= (f_i, g_i) \otimes \big( (f_j, g_j) \otimes (f_k, g_k) \big). \quad (18)
\]
We proceed to perform the calculations on both sides of the equation to check that they yield the same result.

1) Left-hand side: We use Definition 2 in the left-hand side of (18) and obtain
\[
(f_{ij}, g_{ij}) \otimes (f_k, g_k) = (f_{ijk}, g_{ijk}),
\]
where
\[
f_{ijk}(x \mid z)
= \frac{\iint g_k(y)\, f_k(x \mid y)\, g_j(y')\, f_j(y \mid y')\, f_i(y' \mid z)\, dy'\, dy}
       {\iint g_k(y)\, g_j(y')\, f_j(y \mid y')\, f_i(y' \mid z)\, dy'\, dy}, \quad (19)
\]
and
\[
\begin{aligned}
g_{ijk}(z) &= g_{ij}(z) \int g_k(y)\, f_{ij}(y \mid z)\, dy \\
&= \left[ g_i(z) \int g_j(y)\, f_i(y \mid z)\, dy \right]
   \left[ \int g_k(y)\, \frac{\int g_j(y')\, f_j(y \mid y')\, f_i(y' \mid z)\, dy'}{\int g_j(y')\, f_i(y' \mid z)\, dy'}\, dy \right] \\
&= g_i(z) \iint g_k(y)\, g_j(y')\, f_j(y \mid y')\, f_i(y' \mid z)\, dy'\, dy. \quad (20)
\end{aligned}
\]

2) Right-hand side: We first apply the operator ⊗ to the elements with indices j and k in the right-hand side of (18), see Definition 2:
\[
(f_j, g_j) \otimes (f_k, g_k) = (f_{jk}, g_{jk}),
\]
where
\[
f_{jk}(x \mid z) = \frac{\int g_k(y)\, f_k(x \mid y)\, f_j(y \mid z)\, dy}{\int g_k(y)\, f_j(y \mid z)\, dy},
\qquad
g_{jk}(z) = g_j(z) \int g_k(y)\, f_j(y \mid z)\, dy.
\]
Then, the right-hand side of (18) becomes
\[
(f_i, g_i) \otimes (f_{jk}, g_{jk}) = (f'_{ijk}, g'_{ijk}),
\]
where
\[
\begin{aligned}
f'_{ijk}(x \mid z) &= \frac{\int g_{jk}(y)\, f_{jk}(x \mid y)\, f_i(y \mid z)\, dy}{\int g_{jk}(y)\, f_i(y \mid z)\, dy} \\
&= \frac{\int \left[ \int g_j(y)\, g_k(y')\, f_k(x \mid y')\, f_j(y' \mid y)\, dy' \right] f_i(y \mid z)\, dy}
        {\int \left[ \int g_j(y)\, g_k(y')\, f_j(y' \mid y)\, dy' \right] f_i(y \mid z)\, dy} \\
&= \frac{\iint g_j(y)\, g_k(y')\, f_k(x \mid y')\, f_j(y' \mid y)\, f_i(y \mid z)\, dy'\, dy}
        {\iint g_j(y)\, g_k(y')\, f_j(y' \mid y)\, f_i(y \mid z)\, dy'\, dy}. \quad (21)
\end{aligned}
\]
...
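As an independent sanity check of the associativity shown above, here is a small numerical verification on a discrete state space (sums in place of the integrals); the element encoding matches the sketch in Section III-A and all names are ours:

```python
import numpy as np

def combine_filter(a_i, a_j):
    # Definition 2 with sums in place of integrals (discrete state space).
    (f_i, g_i), (f_j, g_j) = a_i, a_j
    den = g_j @ f_i
    return (f_j * g_j[None, :]) @ f_i / den[None, :], g_i * den

rng = np.random.default_rng(0)

def random_element(nx=4):
    # A random element of F: columns of f are conditional distributions.
    f = rng.random((nx, nx))
    return f / f.sum(axis=0, keepdims=True), rng.random(nx)

a_i, a_j, a_k = random_element(), random_element(), random_element()
lhs = combine_filter(combine_filter(a_i, a_j), a_k)   # ((a_i ⊗ a_j) ⊗ a_k)
rhs = combine_filter(a_i, combine_filter(a_j, a_k))   # (a_i ⊗ (a_j ⊗ a_k))
assert np.allclose(lhs[0], rhs[0]) and np.allclose(lhs[1], rhs[1])
```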
...

We use a_{k+l} in Theorem 6 to calculate
\[
\begin{aligned}
[a_k \otimes \cdots \otimes a_{k+l-1}] \otimes a_{k+l}
&= \int p(x_k \mid y_{1:k+l-1}, x_{k+l})\, p(x_{k+l} \mid y_{1:k+l}, x_{k+l+1})\, dx_{k+l} \\
&= \int p(x_k \mid y_{1:k+l}, x_{k+l}, x_{k+l+1})\, p(x_{k+l} \mid y_{1:k+l}, x_{k+l+1})\, dx_{k+l} \\
&= \int p(x_k, x_{k+l} \mid y_{1:k+l}, x_{k+l+1})\, dx_{k+l} \\
&= p(x_k \mid y_{1:k+l}, x_{k+l+1}).
\end{aligned}
\]
This proves (31).

If l = n − k − 1 and a_n is as in Theorem 6, we have
\[
\begin{aligned}
[a_k \otimes \cdots \otimes a_{n-1}] \otimes a_n
&= \int p(x_k \mid y_{1:n-1}, x_n)\, p(x_n \mid y_{1:n})\, dx_n \\
&= \int p(x_k \mid y_{1:n}, x_n)\, p(x_n \mid y_{1:n})\, dx_n \\
&= p(x_k \mid y_{1:n}).
\end{aligned}
\]
This result finishes the proof of Theorem 6.
APPENDIX III

In this appendix, we prove Lemma 8. We have the following easily verifiable identities:
\[
\mathrm{N}_I(y; \eta, J)\, \mathrm{N}(y; m, C)
\propto \mathrm{N}\!\left(y;\, [J + C^{-1}]^{-1}[\eta + C^{-1} m],\, [J + C^{-1}]^{-1}\right)
\]
and
\[
\mathrm{N}_I(y; \eta, J)\, \mathrm{N}_I(y; \eta', J') \propto \mathrm{N}_I(y; \eta + \eta', J + J').
\]
We also have
\[
\int \mathrm{N}_I(y; \eta, J)\, \mathrm{N}(y; Az + b, C)\, dy
\propto \mathrm{N}_I\!\left(z;\, A^{\top}[I + JC]^{-1}(\eta - Jb),\, A^{\top}[I + JC]^{-1} J A\right).
\]
By using Definition 2 for f_{ij} and g_{ij} together with the parameterizations in Lemma 8, elementary computations lead to (13) and (14).
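The first identity can also be checked numerically. The sketch below assumes J is nonsingular so that N_I(y; η, J) can be evaluated as the ordinary Gaussian N(y; J^{-1}η, J^{-1}); since the identity only claims proportionality, it verifies that the log-densities differ by a constant:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n)); J = A @ A.T + n * np.eye(n)   # SPD J
B = rng.standard_normal((n, n)); C = B @ B.T + n * np.eye(n)   # SPD C
eta, m = rng.standard_normal(n), rng.standard_normal(n)

def log_lhs(y):
    # N_I(y; eta, J) N(y; m, C), with N_I evaluated as N(J^{-1} eta, J^{-1}).
    return (mvn.logpdf(y, np.linalg.solve(J, eta), np.linalg.inv(J))
            + mvn.logpdf(y, m, C))

# Right-hand side: N(y; [J + C^{-1}]^{-1}[eta + C^{-1} m], [J + C^{-1}]^{-1}).
P = np.linalg.inv(J + np.linalg.inv(C))
mu = P @ (eta + np.linalg.solve(C, m))

ys = rng.standard_normal((5, n))
diff = [log_lhs(y) - mvn.logpdf(y, mu, P) for y in ys]
assert np.allclose(diff, diff[0])   # equal up to a constant factor
```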
ACKNOWLEDGMENT

The authors would like to thank the Academy of Finland for financial support.

REFERENCES

[1] T. Rauber and G. Rünger, Parallel Programming: For Multicore and Cluster Systems, 2nd ed. Springer, 2013.
[2] S. Cook, CUDA Programming: A Developer's Guide to Parallel Computing with GPUs. Morgan Kaufmann, 2013.
[3] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, 3rd ed. MIT Press, 2009.
[4] A. H. Jazwinski, Stochastic Processes and Filtering Theory. Academic Press, New York, 1970.
[5] Y. Bar-Shalom, X.-R. Li, and T. Kirubarajan, Estimation with Applications to Tracking and Navigation. Wiley, New York, 2001.
[6] S. Särkkä, Bayesian Filtering and Smoothing. Cambridge University Press, 2013.
[7] O. Cappé, E. Moulines, and T. Rydén, Inference in Hidden Markov Models, ser. Springer Series in Statistics. New York, NY: Springer-Verlag, 2005.
[8] J. Kaipio and E. Somersalo, Statistical and Computational Inverse Problems. Springer, 2005.
[9] S. Särkkä, M. A. Álvarez, and N. D. Lawrence, "Gaussian process latent force models for learning and stochastic control of physical systems," IEEE Transactions on Automatic Control, 2019.
[10] Y. C. Ho and R. C. K. Lee, "A Bayesian approach to problems in stochastic estimation and control," IEEE Transactions on Automatic Control, vol. 9, no. 4, pp. 333–339, 1964.
[11] T. D. Barfoot, C. H. Tong, and S. Särkkä, "Batch continuous-time trajectory estimation as exactly sparse Gaussian process regression," in Proceedings of Robotics: Science and Systems (RSS), 2014.
[12] A. Grigorievskiy, N. Lawrence, and S. Särkkä, "Parallelizable sparse inverse formulation Gaussian processes (SpInGP)," in Proceedings of IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2017.
[13] P. M. Lyster, S. E. Cohn, R. Ménard, L. P. Chang, S. J. Lin, and R. G. Olsen, "Parallel implementation of a Kalman filter for constituent data assimilation," Monthly Weather Review, vol. 125, no. 7, pp. 1674–1686, 1997.
[14] G. Evensen, "The ensemble Kalman filter: Theoretical formulation and practical implementation," Ocean Dynamics, vol. 53, no. 4, pp. 343–367, 2003.
[15] A. Lee, C. Yau, M. B. Giles, A. Doucet, and C. C. Holmes, "On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods," Journal of Computational and Graphical Statistics, vol. 19, no. 4, pp. 769–789, 2010.
[16] O. Rosen and A. Medvedev, "Efficient parallel implementation of state estimation algorithms on multicore platforms," IEEE Transactions on Control Systems Technology, vol. 21, no. 1, pp. 107–120, 2013.
[17] M. E. Liggins, C.-Y. Chong, I. Kadar, M. G. Alford, V. Vannicola, and S. Thomopoulos, "Distributed fusion architectures and algorithms for target tracking," Proceedings of the IEEE, vol. 85, no. 1, pp. 95–107, 1997.
[18] R. E. Ladner and M. J. Fischer, "Parallel prefix computation," Journal of the ACM, vol. 27, no. 4, pp. 831–838, 1980.
[19] G. E. Blelloch, "Scans as primitive parallel operations," IEEE Transactions on Computers, vol. 38, no. 11, pp. 1526–1538, 1989.
[20] G. E. Blelloch, "Prefix sums and their applications," School of Computer Science, Carnegie Mellon University, Tech. Rep. CMU-CS-90-190, 1990.
[21] R. E. Kalman, "A new approach to linear filtering and prediction problems," Transactions of the ASME, Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
[22] H. E. Rauch, F. Tung, and C. T. Striebel, "Maximum likelihood estimates of linear dynamic systems," AIAA Journal, vol. 3, no. 8, pp. 1445–1450, 1965.
[23] B. D. O. Anderson and J. B. Moore, Optimal Filtering. Prentice-Hall, 1979.
[24] N. Cressie and C. K. Wikle, Statistics for Spatio-Temporal Data. John Wiley & Sons, 2011.
[25] S. Särkkä, A. Solin, and J. Hartikainen, "Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing," IEEE Signal Processing Magazine, vol. 30, no. 4, pp. 51–61, 2013.
[26] A. Grama, V. Kumar, A. Gupta, and G. Karypis, Introduction to Parallel Computing, 2nd ed. Pearson Education, 2003.
[27] P. Sanders and J. L. Träff, "Parallel prefix (scan) algorithms for MPI," in Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2006. Lecture Notes in Computer Science, B. Mohr, J. Träff, J. Worringen, and J. Dongarra, Eds. Springer, 2006, vol. 4192.
[28] B. M. Bell and F. W. Cathey, "The iterated Kalman filter update as a Gauss–Newton method," IEEE Transactions on Automatic Control, vol. 38, no. 2, pp. 294–297, 1993.
[29] B. M. Bell, "The iterated Kalman smoother as a Gauss–Newton method," SIAM Journal on Optimization, vol. 4, no. 3, pp. 626–636, 1994.
[30] A. F. García-Fernández, L. Svensson, M. R. Morelande, and S. Särkkä, "Posterior linearisation filter: principles and implementation using sigma points," IEEE Transactions on Signal Processing, vol. 63, no. 20, pp. 5561–5573, 2015.
[31] A. F. García-Fernández, L. Svensson, and S. Särkkä, "Iterated posterior linearization smoother," IEEE Transactions on Automatic Control, vol. 62, no. 4, pp. 2056–2063, 2017.
[32] F. Tronarp, A. F. García-Fernández, and S. Särkkä, "Iterative filtering and smoothing in non-linear and non-Gaussian systems using conditional moments," IEEE Signal Processing Letters, vol. 25, no. 3, pp. 408–412, 2018.