

Temporal Parallelization of Bayesian Smoothers


Simo Särkkä, Senior Member, IEEE, Ángel F. García-Fernández

Abstract— This paper presents algorithms for temporal parallelization of Bayesian smoothers. We define the elements and the operators to pose these problems as the solutions to all-prefix-sums operations for which efficient parallel scan-algorithms are available. We present the temporal parallelization of the general Bayesian filtering and smoothing equations and specialize them to linear/Gaussian models. The advantage of the proposed algorithms is that they reduce the linear complexity of standard smoothing algorithms with respect to time to logarithmic.

Index Terms— Bayesian smoothing, Kalman filtering and smoothing, parallel computing, parallel scan, prefix sums

S. Särkkä is with the Department of Electrical Engineering and Automation, Aalto University, 02150 Espoo, Finland (email: [email protected]).
Ángel F. García-Fernández is with the Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool L69 3GJ, United Kingdom (email: [email protected]).

I. INTRODUCTION

Parallel computing is rapidly transforming from a scientists' computational tool into a general-purpose computational paradigm. The availability of affordable massively-parallel graphics processing units (GPUs) as well as widely-available parallel grid and cloud computing systems [1]–[3] drives this transformation by bringing parallel computing technology to everyday use. This creates a demand for parallel algorithms that can harness the full power of the parallel computing hardware.

Stochastic state-space models allow for modeling of the time-behaviour and uncertainties of dynamic systems, and they have long been used in various tracking, automation, communications, and imaging applications [4]–[8]. More recently, they have also been used as representations of prior information in machine learning settings (see, e.g., [9]). In all of these applications, the main problem can be mathematically formulated as a state-estimation problem on the stochastic model, where we estimate the unknown phenomenon from a set of noisy measurement data. Given the mathematical problem, the remaining task is to design efficient computational methods for solving the inference problems on large data sets such that they utilize the computational hardware as effectively as possible.

Bayesian filtering and smoothing methods [6] provide the classical [10] solutions to state-estimation problems which are computationally optimal in the sense that their computational complexities are linear with respect to the number of data points. Although these solutions are optimal for single central processing unit (CPU) systems, due to the sequential nature of the algorithms, their complexity remains linear also in parallel multi-CPU systems. However, genuine parallel algorithms are often capable of performing operations in a logarithmic number of steps by massive parallelization of the operations. More precisely, their span-complexity [3], that is, the number of computational steps as measured by a wall-clock, is often logarithmic with respect to the number of data points. However, their total number of operations, the work-complexity, is still linear as all the data points need to be processed.

Despite the long history of state-estimation methods, the existing parallelization methods have concentrated on parallelizing the subproblems arising in Bayesian filtering and smoothing methods, but there is a lack of algorithms that are specifically designed for solving state-estimation problems in parallel architectures. There are, however, some existing approaches for parallelizing Kalman-type filters as well as particle filters. One approach, studied in [11] and [12], is to parallelize the corresponding batch formulation, which leads to sub-linear computational methods, because the matrix computations can be parallelized. If the state-space of the Kalman filter is large, it is then possible to speed up the matrix computations via parallelization [13], [14]. Particle filters can also be parallelized over the particles [15], [16], the bottleneck being the resampling step. In some specific cases, such as in multiple target tracking [17], it is possible to develop parallelized algorithms by using the structure of the specific problem.

The contribution of this article is to propose a novel general algorithmic framework for parallel computation of Bayesian smoothing solutions for state-space models. We also present algorithms for parallelizing the Bayesian filtering solutions, but our focus is on smoothing, because the parallel computation is done off-line in the sense that all the data needs to be available during the parallel computations and it cannot arrive sequentially. Our approach to parallelization differs from the aforementioned approaches in the aspect that we replace the whole Bayesian filtering and smoothing formalism with another, parallelizable formalism. We replace the Bayesian filtering and smoothing equations [4], [6] with another set of equations that can be combined with the so-called scan or prefix-sums algorithm [2], [18]–[20], which is one of the fundamental algorithmic frameworks in parallel computing. The advantage of this is that it allows for reduction of the linear O(n) complexity of batch filtering and smoothing algorithms to logarithmic O(log n) span-complexity in the number n of data points. Based on the novel formulation we develop parallel algorithms for computing the filtering and smoothing solutions to linear Gaussian systems with logarithmic span-complexity. As this parallelization is done in the temporal direction, the individual steps of the resulting algorithm could further be parallelized in the same way as Kalman filters and particle filters have previously been parallelized [13]–[17].

The organization of the article is the following. In Section II we review the classical Bayesian filtering and smoothing methodology as well as the parallel scan algorithm for computing prefix sums. In Section III we present the general framework for parallelizing Bayesian filtering and smoothing methods. Section IV is concerned with specializing the general framework to linear Gaussian systems. A numerical example with a linear/Gaussian system is given in Section V, and finally Section VI concludes the article along with a discussion on various aspects of the methodology.

II. BACKGROUND

A. Bayesian filtering and smoothing

Bayesian filtering and smoothing methods [6] are algorithms for statistical inference in probabilistic state-space models of the form

  x_k ~ p(x_k | x_{k-1}),
  y_k ~ p(y_k | x_k),        (1)

with x_0 ~ p(x_0). Above, the state x_k ∈ R^{n_x} at time step k evolves as a Markov process with transition density p(x_k | x_{k-1}). State x_k is observed by the measurement y_k ∈ R^{n_y} whose density is p(y_k | x_k).
The objective of Bayesian filtering is to compute the posterior density p(x_k | y_{1:k}) of the state x_k given the measurements y_{1:k} = (y_1, ..., y_k) up to time step k. Given the measurements up to a time step n, the objective of smoothing is to compute the density p(x_k | y_{1:n}) of the state x_k for k < n.

The key insight of Bayesian filters and smoothers is that the computation of the required densities can be done in a linear O(n) number of computational steps by using recursive (forward) filtering and (backward) smoothing algorithms. This is significant, because a naive computation of the posterior distribution would typically take at least O(n^3) computational steps.

The Bayesian filter is a sequential algorithm, which iterates the following prediction and update steps:

  p(x_k | y_{1:k-1}) = ∫ p(x_k | x_{k-1}) p(x_{k-1} | y_{1:k-1}) dx_{k-1},        (2)

  p(x_k | y_{1:k}) = p(y_k | x_k) p(x_k | y_{1:k-1}) / ∫ p(y_k | x_k) p(x_k | y_{1:k-1}) dx_k.        (3)

Given the filtering outputs for k = 1, ..., n, the Bayesian forward-backward smoother consists of the following backward iteration for k = n − 1, ..., 1:

  p(x_k | y_{1:n}) = p(x_k | y_{1:k}) ∫ [ p(x_{k+1} | x_k) p(x_{k+1} | y_{1:n}) / p(x_{k+1} | y_{1:k}) ] dx_{k+1}.        (4)

When applied to a batch of data of size n, the computational complexity of the filter and smoother is O(n) as they perform n sequential steps in the forward and backward directions when looping over the data. The Kalman filter and Rauch–Tung–Striebel (RTS) smoother [21], [22] are the solutions to these recursions when the transition densities are linear and Gaussian. The filtering and smoothing equations can also be analogously solved in closed form for discrete-state models [7]. In this paper, we show how to parallelize the previous recursions using the parallel scan-algorithm, which is reviewed next.
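As a concrete reference point for the sequential recursions (2)–(4), consider the discrete-state case mentioned above, where the integrals reduce to finite sums. The following is a minimal NumPy sketch of the forward and backward passes; it is an illustration rather than code from the paper, and all names (discrete_filter_smoother, trans, emit_lik) are invented for this example.

```python
import numpy as np

def discrete_filter_smoother(prior, trans, emit_lik):
    """Sequential Bayesian filter (2)-(3) and smoother (4) for a finite-state model.

    prior    : (S,) initial distribution p(x_0)
    trans    : (S, S) transition matrix, trans[i, j] = p(x_k = j | x_{k-1} = i)
    emit_lik : (n, S) likelihoods, emit_lik[k, j] = p(y_{k+1} | x_{k+1} = j)
    """
    n, S = emit_lik.shape
    filt = np.zeros((n, S))
    f = prior
    for k in range(n):                      # forward pass, n sequential steps
        pred = f @ trans                    # prediction step (2)
        upd = emit_lik[k] * pred
        f = upd / upd.sum()                 # update step (3)
        filt[k] = f
    smth = np.zeros((n, S))
    smth[-1] = filt[-1]
    for k in range(n - 2, -1, -1):          # backward pass (4)
        pred = filt[k] @ trans              # p(x_{k+1} | y_{1:k})
        smth[k] = filt[k] * (trans @ (smth[k + 1] / pred))
    return filt, smth
```

Both loops above are inherently sequential, which is exactly the O(n) behaviour that the rest of the paper replaces with an O(log n)-span parallel scan.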
B. Parallel scan-algorithm

The parallel-scan algorithm [20] is a general parallel computing framework that can be used to convert sequential O(n) algorithms with a certain associative property into O(log n) parallel algorithms. The algorithm was originally developed for computing prefix sums [18], where it uses the associative property of summation. The algorithm has since been generalized to arbitrary associative binary operators, and it is used as the basis of a multitude of parallel algorithms including sorting, linear programming, and graph algorithms. These kinds of algorithms are especially useful in GPU-based computing systems, and they are likely to be fundamental algorithms in many future parallel computing systems.

The problem that the parallel-scan algorithm [20] solves is the all-prefix-sums operation, which is defined next.

Definition 1: Given a sequence of elements (a_1, a_2, ..., a_n), where a_i belongs to a certain set, along with an associative binary operator ⊗ on this set, the all-prefix-sums operation computes the sequence

  (a_1, a_1 ⊗ a_2, ..., a_1 ⊗ ··· ⊗ a_n).        (5)

For example, if we have n = 4, a_i = i, and ⊗ denotes ordinary summation, the all-prefix-sums are (1, 3, 6, 10). If ⊗ denotes subtraction, the all-prefix-sums are (1, −1, −4, −8). It should be noted that the operator is not necessarily commutative, so we use a product symbol, as matrix products are not commutative, instead of a summation symbol.

The all-prefix-sums operation can be computed sequentially by processing one element after the other. However, this direct sequential iteration inherently takes O(n) time. We can now see the analogy of the iteration to the Bayesian filter discussed in the previous section – both of the algorithms have linear O(n) complexity, because they need to loop over all the elements in the forward direction. A similar argument applies to the Bayesian smoothing pass.

Fortunately, the all-prefix-sums operation can be computed in parallel in O(log n) span-time by using the up-sweep and down-sweep algorithms [20] shown in Fig. 1. These algorithms correspond to up and down traversals in a binary tree which are used for computing partial (generalized) sums of the elements. A final pass is then used to construct the final result. The algorithms can be used for computing the all-prefix-sums (5) for an arbitrary associative operator ⊗.

  // Save the input:
  for i ← 1 to n do                              ▷ Compute in parallel
      b_i ← a_i
  end for
  // Up sweep:
  for d ← 0 to log2 n − 1 do
      for i ← 0 to n − 1 by 2^{d+1} do           ▷ Compute in parallel
          j ← i + 2^d
          k ← i + 2^{d+1}
          a_k ← a_j ⊗ a_k
      end for
  end for
  // Down sweep:
  for d ← log2 n − 1 to 0 do
      for i ← 0 to n − 1 by 2^{d+1} do           ▷ Compute in parallel
          j ← i + 2^d
          k ← i + 2^{d+1}
          t ← a_j
          a_j ← a_k
          a_k ← a_k ⊗ t
      end for
  end for
  // Final pass:
  for i ← 1 to n do                              ▷ Compute in parallel
      a_i ← a_i ⊗ b_i
  end for

Fig. 1. Parallel scan algorithm for in-place transformation of the sequence (a_i) into its all-prefix-sums in O(log n) span-complexity. Note that the algorithm in this form assumes that n is a power of 2, but it can easily be generalized to an arbitrary n.
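For readers who prefer runnable code, the following is a minimal Python sketch of a Blelloch-style scan in the spirit of Fig. 1; it is an illustration rather than the paper's exact pseudocode. It assumes n is a power of two and that an identity element of ⊗ is available (0 for ordinary sums); the identity is used to reset the last element before the down sweep, as in Blelloch's exclusive-scan formulation, and the saved inputs then turn the exclusive result into the inclusive all-prefix-sums (5). The inner loops are written sequentially here, but each one is independent over i and is what a parallel implementation would distribute, giving the O(log n) span.

```python
import math

def blelloch_scan(a, op, identity):
    """In-place inclusive scan of list a with associative operator op.

    a        : list of elements, len(a) must be a power of two
    op       : associative binary operator op(x, y); need not be commutative
    identity : identity element of op (e.g. 0 for addition)
    """
    n = len(a)
    b = list(a)                                   # save the input
    # Up sweep: build partial "sums" along a binary tree.
    for d in range(int(math.log2(n))):
        for i in range(0, n, 2 ** (d + 1)):       # parallel over i
            j = i + 2 ** d - 1
            k = i + 2 ** (d + 1) - 1
            a[k] = op(a[j], a[k])
    # Down sweep: propagate prefixes back down the tree (exclusive scan).
    a[n - 1] = identity
    for d in reversed(range(int(math.log2(n)))):
        for i in range(0, n, 2 ** (d + 1)):       # parallel over i
            j = i + 2 ** d - 1
            k = i + 2 ** (d + 1) - 1
            t = a[j]
            a[j] = a[k]
            a[k] = op(a[k], t)
    # Final pass: combine with the saved inputs to obtain the inclusive scan.
    for i in range(n):                            # parallel over i
        a[i] = op(a[i], b[i])
    return a

# Example from Definition 1: a_i = i, op = +, result (1, 3, 6, 10).
print(blelloch_scan([1, 2, 3, 4], lambda x, y: x + y, 0))
```

With the elements and operators defined in Sections III–IV substituted for op, the same structure yields the parallel filters and smoothers discussed in the remainder of the paper.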
III. PARALLEL BAYESIAN FILTERING AND SMOOTHING

In this section, we explain how to define the elements and the binary operators to be able to perform Bayesian filtering and smoothing using parallel scan algorithms.

A. Bayesian filtering

In order to perform parallel Bayesian filtering, we need to find the suitable element a_k and the binary associative operator ⊗. As we will see in this section, an element a consists of a pair (f, g) ∈ F, where F is

  F = { (f, g) : ∫ f(y | z) dy = 1 },        (6)

and f : R^{n_x} × R^{n_x} → [0, ∞) represents a conditional density, and g : R^{n_x} → [0, ∞) represents a likelihood.

Definition 2: Given two elements (f_i, g_i) ∈ F and (f_j, g_j) ∈ F, the binary associative operator ⊗ for Bayesian filtering is

  (f_i, g_i) ⊗ (f_j, g_j) = (f_ij, g_ij),
where

  f_ij(x | z) = ∫ g_j(y) f_j(x | y) f_i(y | z) dy / ∫ g_j(y) f_i(y | z) dy,
  g_ij(z) = g_i(z) ∫ g_j(y) f_i(y | z) dy.

The proof that ⊗ has the associative property is given in Appendix I.

Theorem 3: Given the element a_k = (f_k, g_k) ∈ F, where

  f_k(x_k | x_{k-1}) = p(x_k | y_k, x_{k-1}),
  g_k(x_{k-1}) = p(y_k | x_{k-1}),

p(x_1 | y_1, x_0) = p(x_1 | y_1), and p(y_1 | x_0) = p(y_1), the k-th prefix sum is

  a_1 ⊗ a_2 ⊗ ··· ⊗ a_k = ( p(x_k | y_{1:k}), p(y_{1:k}) ).

Theorem 3 is proved in Appendix I. Theorem 3 implies that we can parallelise the computation of all filtering distributions p(x_k | y_{1:k}) and the marginal likelihoods p(y_{1:k}), of which the latter ones can be used for parameter estimation [6].

Remark 4: If we only know p(y_k | x_{k-1}) up to a proportionality constant, which means that g_k(x_{k-1}) ∝ p(y_k | x_{k-1}), we can still recover the filtering density p(x_k | y_{1:k}) by the above operations. However, we will not be able to recover the marginal likelihoods p(y_{1:k}). We can nevertheless recover p(y_{1:k}) by an additional parallel pass, as will be explained in Section III-C.

B. Bayesian smoothing

The Bayesian smoothing pass requires that the filtering densities have been obtained beforehand. In smoothing, we consider a different type of element a and binary operator ⊗ than those used in filtering. As we will see in this section, an element a is a function a : R^{n_x} × R^{n_x} → [0, ∞) that belongs to the set

  S = { a : ∫ a(x | z) dx = 1 }.

Definition 5: Given two elements a_i ∈ S and a_j ∈ S, the binary associative operator ⊗ for Bayesian smoothing is

  a_i ⊗ a_j = a_ij,

where

  a_ij(x | z) = ∫ a_i(x | y) a_j(y | z) dy.

The proof that ⊗ has the associative property is included in Appendix II.

Theorem 6: Given the element a_k = p(x_k | y_{1:k}, x_{k+1}) ∈ S, with a_n = p(x_n | y_{1:n}), we have that

  a_k ⊗ a_{k+1} ⊗ ··· ⊗ a_n = p(x_k | y_{1:n}).

Theorem 6 is proved in Appendix II. Theorem 6 implies that we can compute all smoothing distributions in parallel form. However, it should be noted that we should apply the parallel scan algorithm with the elements in reverse order, that is, with elements b_k = a_{n−k+1}, so that the prefix-sums b_1 ⊗ ··· ⊗ b_k recover the smoothing densities.

C. Additional aspects

We proceed to discuss additional aspects of the previous formulation of filtering and smoothing. In Section III-A, it was indicated that the marginal likelihood p(y_{1:n}) is directly available from the parallel scan algorithm if g_k(x_{k-1}) = p(y_k | x_{k-1}). However, sometimes we only know p(y_k | x_{k-1}) up to a proportionality constant so that g_k(x_{k-1}) ∝ p(y_k | x_{k-1}), as will happen in Section IV. Although in this case the parallel scan Bayesian filtering algorithm provides us with the filtering densities but not the marginal likelihood p(y_{1:n}), we can still recover the marginal likelihoods as follows. We first run the parallel filtering algorithm to recover all filtering distributions p(x_k | y_{1:k}) for k = 1 to n, and then we perform the following decomposition for p(y_{1:n}):

  p(y_{1:n}) = ∏_{k=1}^{n} p(y_k | y_{1:k-1}),

where

  p(y_k | y_{1:k-1}) = ∫ p(y_k | x_k) p(x_k | y_{1:k-1}) dx_k.

Each factor p(y_k | y_{1:k-1}) can be computed in parallel using the predictive density p(x_k | y_{1:k-1}) and the likelihood p(y_k | x_k). We can then recover all p(y_{1:k}) by O(log n) parallel recursive pairwise multiplications of the adjacent terms.
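The recursive pairwise multiplications described above are themselves an all-prefix-sums operation (a cumulative product, or a cumulative sum in the log domain), so the same scan machinery can be reused for this additional pass. A small illustrative sketch, assuming the per-step log-likelihoods log p(y_k | y_{1:k-1}) have already been computed from the predictive densities; the function name is made up for the example:

```python
import numpy as np

def cumulative_log_marginals(step_loglik):
    """Recover log p(y_{1:k}) for all k from log p(y_k | y_{1:k-1}).

    step_loglik : (n,) array of per-step log-likelihoods.
    np.cumsum stands in for the O(log n)-span parallel scan of Fig. 1
    with ordinary addition as the associative operator.
    """
    return np.cumsum(step_loglik)
```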
It is also possible to perform the parallelization at block level instead of at the individual element level. When using the parallel scan algorithm, we do not need to assign each single-measurement element to a single computational node, but instead we can perform initial computations in blocks such that a single node processes a block of measurements before combining the results with other blocks. The results of the blocks can then be used as the elements in the parallel-scan algorithm. This kind of procedure corresponds to selecting the elements for the scan algorithm to consist of blocks of length l:

  a_k = ( p(x_{lk} | y_{l(k-1)+1:kl}, x_{l(k-1)}), p(y_{l(k-1)+1:kl} | x_{l(k-1)}) )

in filtering and

  a_k = p(x_{lk} | y_{1:l(k+1)-1}, x_{l(k+1)})        (7)

in smoothing, instead of the corresponding terms with l = 1. A practical advantage of this is that we can more easily distribute the computations to a limited number of computational nodes while still getting the optimal speedup from parallelization.

IV. PARALLEL LINEAR/GAUSSIAN FILTER AND SMOOTHER

The parallel linear/Gaussian filter and smoother are obtained by particularising the element a and binary operator ⊗ for Bayesian filtering and smoothing explained in the previous section to linear/Gaussian systems. The sequential versions of these algorithms correspond to the Kalman filter and the RTS smoother.

We consider the linear/Gaussian state-space model

  x_k = F_{k-1} x_{k-1} + u_{k-1} + q_{k-1},
  y_k = H_k x_k + d_k + r_k,

where F_{k-1} ∈ R^{n_x × n_x} and H_k ∈ R^{n_y × n_x} are known matrices, u_{k-1} ∈ R^{n_x} and d_k ∈ R^{n_y} are known vectors, and q_{k-1} and r_k are zero-mean, independent Gaussian noises with covariance matrices Q_{k-1} ∈ R^{n_x × n_x} and R_k ∈ R^{n_y × n_y}. The initial distribution is given as x_0 ~ N(m_0, P_0). With this model, we have that

  p(x_k | x_{k-1}) = N(x_k; F_{k-1} x_{k-1} + u_{k-1}, Q_{k-1}),        (8)
  p(y_k | x_k) = N(y_k; H_k x_k + d_k, R_k).        (9)

In this section, we use the notation N_I(·; η, J) to denote a Gaussian density parameterised in information form, so that η is the information vector and J is the information matrix. If a Gaussian distribution has mean x and covariance matrix P, its parameterisation in information form is η = P^{-1} x and J = P^{-1}. This parametrization corresponds to the so-called information form of the Kalman filter [23]. We also use I_{n_x} to denote an identity matrix of size n_x.
A. Linear/Gaussian filtering

We first describe the representation of an element a_k ∈ F for filtering in linear and Gaussian systems by the following lemma.

Lemma 7: For linear/Gaussian systems, the element a_k ∈ F for filtering becomes

  f_k(x_k | x_{k-1}) = p(x_k | y_k, x_{k-1}) = N(x_k; A_k x_{k-1} + b_k, C_k),
  g_k(x_{k-1}) = p(y_k | x_{k-1}) ∝ N_I(x_{k-1}; η_k, J_k),

where the parameters of the first term are given for k > 1 as

  A_k = (I_{n_x} − K_k H_k) F_{k-1},
  b_k = u_{k-1} + K_k (y_k − H_k u_{k-1} − d_k),
  C_k = (I_{n_x} − K_k H_k) Q_{k-1},        (10)
  K_k = Q_{k-1} H_k^T S_k^{-1},
  S_k = H_k Q_{k-1} H_k^T + R_k,

and for k = 1 as

  m_1^- = F_0 m_0 + u_0,
  P_1^- = F_0 P_0 F_0^T + Q_0,
  S_1 = H_1 P_1^- H_1^T + R_1,
  K_1 = P_1^- H_1^T S_1^{-1},        (11)
  A_1 = 0,
  b_1 = m_1^- + K_1 [y_1 − H_1 m_1^- − d_1],
  C_1 = P_1^- − K_1 S_1 K_1^T.

The parameters of the second term are given as

  η_k = F_{k-1}^T H_k^T S_k^{-1} (y_k − H_k u_{k-1} − d_k),
  J_k = F_{k-1}^T H_k^T S_k^{-1} H_k F_{k-1},        (12)

for k = 1, ..., n.

In Lemma 7, the densities p(x_k | y_k, x_{k-1}) and p(y_k | x_{k-1}) are obtained by applying the Kalman filter update, with measurement y_k distributed according to (9), to the density p(x_k | x_{k-1}) in (8) and matching the terms. For the first step we have applied the Kalman filter prediction and update steps starting from x_0 ~ N(m_0, P_0) and matched the terms.

Therefore, an element a_k can be parameterised by (A_k, b_k, C_k, η_k, J_k), which can be computed for each element in parallel. Also, it is relevant to notice that if the system parameters (F_k, u_k, Q_k, H_k, d_k, R_k) do not depend on the time step k, the only parameters of a_k that depend on k are b_k and η_k, as they depend on the measurement y_k.
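The following NumPy sketch shows how the parameters of a generic element (Eqs. (10) and (12)) could be assembled from the model matrices; the first element follows Eq. (11) instead and is omitted here. This is an illustrative sketch, not code from the paper, and the function and argument names are made up. Since each element depends only on its own time step, all elements can be built in parallel.

```python
import numpy as np

def filtering_element(F, Q, H, R, u, d, y):
    """Parameters (A_k, b_k, C_k, eta_k, J_k) of a filtering element for k > 1.

    F, Q : transition matrix F_{k-1} and process noise covariance Q_{k-1}
    H, R : measurement matrix H_k and measurement noise covariance R_k
    u, d : known inputs u_{k-1} and d_k
    y    : measurement y_k
    """
    nx = F.shape[0]
    S = H @ Q @ H.T + R                    # S_k = H_k Q_{k-1} H_k^T + R_k
    K = Q @ H.T @ np.linalg.inv(S)         # K_k = Q_{k-1} H_k^T S_k^{-1}
    A = (np.eye(nx) - K @ H) @ F           # A_k, Eq. (10)
    b = u + K @ (y - H @ u - d)            # b_k, Eq. (10)
    C = (np.eye(nx) - K @ H) @ Q           # C_k, Eq. (10)
    HS = H.T @ np.linalg.inv(S)
    eta = F.T @ HS @ (y - H @ u - d)       # eta_k, Eq. (12)
    J = F.T @ HS @ H @ F                   # J_k, Eq. (12)
    return A, b, C, eta, J
```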
Lemma 8: Given two elements (f_i, g_i) ∈ F and (f_j, g_j) ∈ F, with parameterisations

  f_i(y | z) = N(y; A_i z + b_i, C_i),
  g_i(z) ∝ N_I(z; η_i, J_i),
  f_j(y | z) = N(y; A_j z + b_j, C_j),
  g_j(z) ∝ N_I(z; η_j, J_j),

the binary operator ⊗ for filtering becomes

  (f_i, g_i) ⊗ (f_j, g_j) = (f_ij, g_ij),

where

  f_ij(x | z) = N(x; A_ij z + b_ij, C_ij),        (13)
  g_ij(z) ∝ N_I(z; η_ij, J_ij),        (14)

with

  A_ij = A_j (I_{n_x} + C_i J_j)^{-1} A_i,
  b_ij = A_j (I_{n_x} + C_i J_j)^{-1} (b_i + C_i η_j) + b_j,
  C_ij = A_j (I_{n_x} + C_i J_j)^{-1} C_i A_j^T + C_j,
  η_ij = A_i^T (I_{n_x} + J_j C_i)^{-1} (η_j − J_j b_i) + η_i,
  J_ij = A_i^T (I_{n_x} + J_j C_i)^{-1} J_j A_i + J_i.

The proof is provided in Appendix III.
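A direct NumPy transcription of the combination rule of Lemma 8 might look as follows; again this is an illustrative sketch with made-up names, not the authors' implementation. With these tuples as the elements a_k and this function as ⊗, the all-prefix-sums (5) computed by the parallel scan give all filtering solutions at once (Theorem 3); since A_1 = 0, the k-th prefix sum has A = 0 and its (b, C) are then the Kalman filter mean and covariance at step k.

```python
import numpy as np

def combine_filtering(elem_i, elem_j):
    """Binary operator (f_i, g_i) ⊗ (f_j, g_j) of Lemma 8 on parameter tuples
    (A, b, C, eta, J); element i precedes element j in time."""
    Ai, bi, Ci, etai, Ji = elem_i
    Aj, bj, Cj, etaj, Jj = elem_j
    nx = Ai.shape[0]
    M = np.linalg.inv(np.eye(nx) + Ci @ Jj)        # (I + C_i J_j)^{-1}
    N = np.linalg.inv(np.eye(nx) + Jj @ Ci)        # (I + J_j C_i)^{-1}
    A = Aj @ M @ Ai
    b = Aj @ M @ (bi + Ci @ etaj) + bj
    C = Aj @ M @ Ci @ Aj.T + Cj
    eta = Ai.T @ N @ (etaj - Jj @ bi) + etai
    J = Ai.T @ N @ Jj @ Ai + Ji
    return A, b, C, eta, J
```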
B. Linear/Gaussian smoothing

We first describe the representation of an element a_k ∈ S for smoothing in linear and Gaussian systems by the following lemma.

Lemma 9: For linear/Gaussian systems, the element a_k ∈ S for smoothing becomes

  a_k(x_k | x_{k+1}) = p(x_k | y_{1:k}, x_{k+1}) = N(x_k; E_k x_{k+1} + g_k, L_k),

where for k < n

  E_k = P_k F_k^T (F_k P_k F_k^T + Q_k)^{-1},
  g_k = x_k − E_k (F_k x_k + u_k),
  L_k = P_k − E_k F_k P_k,

and for k = n we have

  E_n = 0,
  g_n = x_n,
  L_n = P_n.

Above, x_k and P_k are the filtering mean and covariance matrix at time step k, such that p(x_k | y_{1:k}) = N(x_k; x_k, P_k).

Lemma 9 is obtained by performing a Kalman filter update on the density p(x_k | y_{1:k}) with an observation x_{k+1}, whose distribution is given by (8). The element a_k for smoothing with linear/Gaussian systems can thus be parameterised as a_k = (E_k, g_k, L_k).

Lemma 10: Given two elements a_i ∈ S and a_j ∈ S with parameterisations

  a_i(y | z) = N(y; E_i z + g_i, L_i),
  a_j(y | z) = N(y; E_j z + g_j, L_j),

the binary operator ⊗ for smoothing becomes

  a_i ⊗ a_j = a_ij,

where

  a_ij(x | z) = ∫ a_i(x | y) a_j(y | z) dy
              = ∫ N(x; E_i y + g_i, L_i) N(y; E_j z + g_j, L_j) dy
              = N(x; E_ij z + g_ij, L_ij),

and

  E_ij = E_i E_j,
  g_ij = E_i g_j + g_i,
  L_ij = E_i L_j E_i^T + L_i.
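The smoothing element of Lemma 9 and the combination rule of Lemma 10 are equally compact in code. The sketch below is illustrative only (names invented); by Theorem 6 the products a_k ⊗ ··· ⊗ a_n give p(x_k | y_{1:n}), and, as noted in Section III-B, the scan is in practice run over the reversed sequence b_k = a_{n−k+1} so that the resulting (g, L) are the smoothing mean and covariance (E becomes zero once the last element is included).

```python
import numpy as np

def smoothing_element(F, Q, u, xf, Pf):
    """Parameters (E_k, g_k, L_k) of a smoothing element for k < n (Lemma 9);
    the last element is simply (0, x_n, P_n).

    F, Q   : transition matrix F_k and process noise covariance Q_k
    u      : input u_k
    xf, Pf : filtering mean and covariance at step k
    """
    E = Pf @ F.T @ np.linalg.inv(F @ Pf @ F.T + Q)
    g = xf - E @ (F @ xf + u)
    L = Pf - E @ F @ Pf
    return E, g, L

def combine_smoothing(elem_i, elem_j):
    """Binary operator a_i ⊗ a_j of Lemma 10 on parameter triples (E, g, L)."""
    Ei, gi, Li = elem_i
    Ej, gj, Lj = elem_j
    return Ei @ Ej, Ei @ gj + gi, Ei @ Lj @ Ei.T + Li
```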

V. NUMERICAL EXPERIMENT

In order to illustrate the benefit of parallelization we consider a simple tracking model (see, e.g., [5], [6]) with the state x = [u v u̇ v̇]^T, where (u, v) is the 2D position and (u̇, v̇) is the 2D velocity of the tracked object. From noisy measurements of the position (u, v), we aim to solve the smoothing problem in order to determine the whole trajectory of the target.

The model has the form

  x_k = F x_{k-1} + q_{k-1},
  y_k = H x_k + r_k,        (15)

where q_k ~ N(0, Q), r_k ~ N(0, R), and

  F = [ 1 0 ∆t 0
        0 1 0  ∆t
        0 0 1  0
        0 0 0  1 ],

  Q = q [ ∆t^3/3  0       ∆t^2/2  0
          0       ∆t^3/3  0       ∆t^2/2
          ∆t^2/2  0       ∆t      0
          0       ∆t^2/2  0       ∆t ],        (16)

along with

  H = [ 1 0 0 0
        0 1 0 0 ],

  R = [ σ^2  0
        0    σ^2 ].        (17)

In our simulations we used the parameters σ = 0.5, ∆t = 0.1, q = 1, and started the trajectory from a random Gaussian initial condition with mean m_0 = [0 0 1 −1]^T and covariance P_0 = I_4.
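For reference, the model matrices (16)–(17) with the stated parameter values can be written down directly; this is a sketch of the model setup only, not the authors' simulation code.

```python
import numpy as np

dt, q, sigma = 0.1, 1.0, 0.5

F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)

Q = q * np.array([[dt**3 / 3, 0,         dt**2 / 2, 0        ],
                  [0,         dt**3 / 3, 0,         dt**2 / 2],
                  [dt**2 / 2, 0,         dt,        0        ],
                  [0,         dt**2 / 2, 0,         dt       ]])

H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)

R = sigma**2 * np.eye(2)

m0 = np.array([0.0, 0.0, 1.0, -1.0])     # initial mean
P0 = np.eye(4)                           # initial covariance
```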
Fig. 2 shows a typical trajectory and measurements from the model defined by Eqs. (15) and (16) along with the Kalman filter and RTS smoother solutions. As the parallel algorithms produce exactly the same filter and smoothing solutions as the classic sequential algorithms, this result also illustrates the typical result produced by the proposed algorithms.

Fig. 2. Simulated trajectory from the linear tracking model in Eqs. (15) and (16) along with the Kalman filter (KF) and RTS smoother results.

We now aim to evaluate the required number of floating point operations (flops) for generating the smoothing solution for this model. In order to do that, we run the sequential filter and smoothing methods (KF and RTS) as well as the proposed parallel algorithms (PKF and PRTS) over simulated data sets of different sizes and evaluate their span and work flops. The span flops here refers to the minimum number of floating point steps when the parallelizable operations in the algorithm are done in parallel – this corresponds to the actual execution time required to do the computations in a parallel computer. The work flops refers to the total number of operations that the parallel computer needs to perform – it measures the total energy required for the computations or, equivalently, the time required by the algorithm in a single-core computer. As the classic sequential KF and RTS algorithms are not parallelizable, their span and work flops are equal. The flops have been computed by estimating how many flops each of the matrix operations takes (multiplication, summation, LU-factorization) and incrementing the flops counter after every operation in the code.

Fig. 3 shows the flops required by the sequential KF along with the span flops and work flops required by the parallel Kalman filter algorithm. As expected, with small data set sizes the number of span flops required by the parallel KF is larger than that of the sequential KF, but already starting from a time step count of around 20, the span flops is lower for the parallel KF. The logarithmic growth of the span flops in the parallel algorithm can be clearly seen, while the number of flops for the sequential KF grows linearly. However, the work flops required by the parallel KF is approximately 8 times the flops of the sequential KF. This means that although the execution time for the parallel algorithms is smaller than for the sequential algorithms, they need to perform more floating point operations in total.

Fig. 3. The flops and the span and work flops for the sequential Kalman filter (KF) and the parallel Kalman filter (PKF).

The flops required by the sequential RTS smoother along with the span flops and work flops required by the parallel RTS smoother are shown in Fig. 4. In this case, the parallel algorithm reaches the sequential algorithm speed already with a data set of size less than 10. Furthermore, the total number of floating point operations required by the parallel algorithm is approximately 4 times the operations required by the sequential algorithm. The ratios of these total (work) operations for both the filter and smoother are shown in Fig. 5.

Fig. 4. The flops and the span and work flops for the (sequential) RTS smoother and the parallel RTS (PRTS) smoother.

VI. CONCLUSION AND DISCUSSION
In this article we have proposed a novel general algorithmic framework for parallel computation of batch Bayesian filtering and smoothing solutions for state-space models. The framework is based on formulating the computations in terms of associative operations between suitably defined elements such that the all-prefix-sums operation computed by a parallel-scan algorithm exactly produces the Bayesian filtering and smoothing solutions. The advantage of the framework is that the parallelization allows for performing the computations in O(log n) span complexity, where n is the number of data points, while sequential filtering and smoothing algorithms have an O(n) complexity. Parallel versions of Kalman filters and Rauch–Tung–Striebel smoothers were derived as special cases of the framework. The computational advantages of the framework were illustrated in a numerical simulation.

Fig. 5. Ratio of work flops for the parallel and sequential Kalman filter and the parallel and sequential RTS smoother.

A disadvantage of the proposed methodology is that although the wall-clock time of execution is significantly reduced, the total amount of computations (and hence required energy) is larger than with conventional sequential algorithms. Although the total amount of computations is only increased by a constant factor, in some systems, such as small-scale mobile systems, even if parallelization would be possible, it can be beneficial to use the classic algorithms. However, the speedup gain of the proposed approach is beneficial in applications such as data-assimilation based weather forecasting [24] and other spatio-temporal systems appearing, for example, in tomographic reconstruction [8] or machine learning [25], where the computations take a significant amount of time. In these systems, it is possible to dedicate the required amount of extra computational resources to gain the significant speedup provided by parallelization.

Although we have restricted our consideration to specific types of parallel-scan algorithms, it is also possible to use other kinds of algorithms for computing the prefix sums corresponding to the Bayesian filtering and smoothing solutions. We could also select algorithms for given computer or network architectures, for minimizing the communication between the nodes, or for minimizing the energy consumption [26], [27]. The present formulation of the computations in terms of local associative operations is likely to have other applications beyond parallelization. For example, in decentralized systems, it is advantageous to be able to first perform operations locally and then combine them to produce the full state-estimation solution.

The proposed framework is also valid for discrete state spaces as well as for other state spaces, provided that we consider the elements with the appropriate domain and replace the Lebesgue integrals by integrals with respect to the corresponding reference measure, e.g., the counting measure in the case of discrete-state models.

The framework could be extended to non-linear and non-Gaussian models by replacing the exact Kalman filters and smoothers with iterated extended Kalman filters and smoothers [28], [29] or their sigma-point/numerical-integration versions such as posterior linearization filters and smoothers [30]–[32]. Possible future work also includes developing particle filter and smoother methods (see, e.g., [6]) for the present framework along with various other Bayesian filter and smoother approximations proposed in the literature.

APPENDIX I

In this appendix, we prove the required results for Bayesian filtering: the associative property of the operator in Definition 2 and Theorem 3.

A. Associative property

In order to prove the associative property of ⊗ for filtering, we need to prove that, for three elements (f_i, g_i), (f_j, g_j), (f_k, g_k) ∈ F, the following relation holds:

  ((f_i, g_i) ⊗ (f_j, g_j)) ⊗ (f_k, g_k) = (f_i, g_i) ⊗ ((f_j, g_j) ⊗ (f_k, g_k)).        (18)

We proceed to perform the calculations on both sides of the equation to check that they yield the same result.

1) Left-hand side: We use Definition 2 in the left-hand side of (18) and obtain

  (f_ij, g_ij) ⊗ (f_k, g_k) = (f_ijk, g_ijk),

where

  f_ijk(x | z) = ∫∫ g_k(y) f_k(x | y) g_j(y') f_j(y | y') f_i(y' | z) dy' dy
                 / ∫∫ g_k(y) g_j(y') f_j(y | y') f_i(y' | z) dy' dy,        (19)

and

  g_ijk(z) = g_ij(z) ∫ g_k(y) f_ij(y | z) dy
           = [ g_i(z) ∫ g_j(y) f_i(y | z) dy ]
             × ∫ g_k(y) [ ∫ g_j(y') f_j(y | y') f_i(y' | z) dy' / ∫ g_j(y') f_i(y' | z) dy' ] dy
           = g_i(z) ∫∫ g_k(y) g_j(y') f_j(y | y') f_i(y' | z) dy' dy.        (20)

2) Right-hand side: We first apply the operator ⊗ to the elements with indices j and k in the right-hand side of (18), see Definition 2,

  (f_j, g_j) ⊗ (f_k, g_k) = (f_jk, g_jk),

where

  f_jk(x | z) = ∫ g_k(y) f_k(x | y) f_j(y | z) dy / ∫ g_k(y) f_j(y | z) dy,
  g_jk(z) = g_j(z) ∫ g_k(y) f_j(y | z) dy.

Then, the right-hand side of (18) becomes

  (f_i, g_i) ⊗ (f_jk, g_jk) = (f'_ijk, g'_ijk),

where

  f'_ijk(x | z) = ∫ g_jk(y) f_jk(x | y) f_i(y | z) dy / ∫ g_jk(y) f_i(y | z) dy
                = ∫ g_j(y) [ ∫ g_k(y') f_k(x | y') f_j(y' | y) dy' ] f_i(y | z) dy
                  / ∫ g_j(y) [ ∫ g_k(y') f_j(y' | y) dy' ] f_i(y | z) dy
                = ∫∫ g_j(y) g_k(y') f_k(x | y') f_j(y' | y) f_i(y | z) dy' dy
                  / ∫∫ g_j(y) g_k(y') f_j(y' | y) f_i(y | z) dy' dy,        (21)
and

  g'_ijk(z) = g_i(z) ∫ g_jk(y) f_i(y | z) dy
            = g_i(z) ∫ g_j(y) [ ∫ g_k(y') f_j(y' | y) dy' ] f_i(y | z) dy
            = g_i(z) ∫∫ g_j(y) g_k(y') f_j(y' | y) f_i(y | z) dy' dy.        (22)

Comparing (19)–(22), we see that f_ijk(x | z) = f'_ijk(x | z) and g_ijk(z) = g'_ijk(z), which proves the associative property of ⊗ in Definition 2.

B. Proof of Theorem 3

In this appendix, we prove Theorem 3. We first prove by induction that

  a_{k-l} ⊗ ··· ⊗ a_{k-1} ⊗ a_k = ( p(x_k | y_{k-l:k}, x_{k-l-1}), p(y_{k-l:k} | x_{k-l-1}) ),        (23)

for l < k + 1. Relation (23) holds for l = 0 by the definition of a_k. Then, assuming that

  a_{k-l+1} ⊗ ··· ⊗ a_{k-1} ⊗ a_k = ( p(x_k | y_{k-l+1:k}, x_{k-l}), p(y_{k-l+1:k} | x_{k-l}) )        (24)

holds, we need to prove that (23) holds.

We calculate the first element of a_{k-l} ⊗ b_{k-l+1}, denoted by f_ab, where b_{k-l+1} = a_{k-l+1} ⊗ ··· ⊗ a_{k-1} ⊗ a_k. We have

  f_ab(x_k | x_{k-l})
    = ∫ p(y_{k-l+1:k}, x_k | x_{k-l}) p(x_{k-l} | y_{k-l}, x_{k-l-1}) dx_{k-l} / p(y_{k-l+1:k} | y_{k-l}, x_{k-l-1})
    = p(y_{k-l+1:k}, x_k | y_{k-l}, x_{k-l-1}) / p(y_{k-l+1:k} | y_{k-l}, x_{k-l-1})
    = p(x_k | y_{k-l:k}, x_{k-l-1}).

Function f_ab corresponds to the first element of (23), as required. We further get

  g_ab(x_{k-l-1}) = p(y_{k-l} | x_{k-l-1}) ∫ p(y_{k-l+1:k} | x_{k-l}) p(x_{k-l} | y_{k-l}, x_{k-l-1}) dx_{k-l}
                  = p(y_{k-l} | x_{k-l-1}) p(y_{k-l+1:k} | y_{k-l}, x_{k-l-1})
                  = p(y_{k-l:k} | x_{k-l-1}).

Function g_ab corresponds to the second element of (23), as required.

Substituting l = k − 2 into (23), we obtain

  a_2 ⊗ ··· ⊗ a_k = ( p(x_k | y_{2:k}, x_1), p(y_{2:k} | x_1) ).        (25)

We now calculate the first element of a_1 ⊗ [a_2 ⊗ ··· ⊗ a_k], denoted as f_1k, where a_1 is given in Theorem 3:

  f_1k(x_k | x_0)
    = ∫ p(y_{2:k} | x_1) p(x_k | y_{2:k}, x_1) p(x_1 | y_1) dx_1 / ∫ p(y_{2:k} | x_1) p(x_1 | y_1) dx_1
    = ∫ p(y_{2:k}, x_k | x_1, y_1) p(x_1 | y_1) dx_1 / p(y_{2:k} | y_1)
    = p(y_{2:k}, x_k | y_1) / p(y_{2:k} | y_1)
    = p(x_k | y_{1:k}).        (26)

The second element of a_1 ⊗ [a_2 ⊗ ··· ⊗ a_k], denoted as g_1k, is

  g_1k(x_0) = p(y_1) ∫ p(y_{2:k} | x_1) p(x_1 | y_1) dx_1
            = p(y_{1:k}).        (27)

Results (26) and (27) finish the proof of Theorem 3.

APPENDIX II

In this appendix, we prove the required results for Bayesian smoothing: the associative property of the operator in Definition 5 and Theorem 6.

A. Associative property

In order to prove the associative property of ⊗ for smoothing, we need to prove that, for three elements a_i, a_j, a_k ∈ S, the following relation holds:

  (a_i ⊗ a_j) ⊗ a_k = a_i ⊗ (a_j ⊗ a_k).        (28)

We proceed to perform the calculations on both sides of the equation to check that they yield the same result.

1) Left-hand side: We apply the operator in Definition 5 on the left-hand side of (28) to obtain

  a_ij ⊗ a_k = a_ijk,

where

  a_ijk(x | z) = ∫ a_ij(x | y) a_k(y | z) dy
               = ∫∫ a_i(x | y') a_j(y' | y) a_k(y | z) dy dy'.        (29)

2) Right-hand side: We first calculate a_j ⊗ a_k using the operator in Definition 5, which gives

  a_jk(x | z) = ∫ a_j(x | y) a_k(y | z) dy.

Then, calculating the right-hand side of (28), we have

  a_i ⊗ a_jk = a'_ijk,

where

  a'_ijk(x | z) = ∫ a_i(x | y) a_jk(y | z) dy
                = ∫∫ a_i(x | y) a_j(y | y') a_k(y' | z) dy' dy.        (30)

We can see that a_ijk in (29) is equal to a'_ijk in (30), which proves the associative property of the operator in Definition 5.
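The associativity just proved can also be checked numerically in the linear/Gaussian parameterisation of Lemma 10. The following small sketch (illustrative only, with invented names) verifies (28) for random parameter triples (E, g, L):

```python
import numpy as np

def combine(a, b):
    # Lemma 10: combination of smoothing elements in (E, g, L) form.
    E1, g1, L1 = a
    E2, g2, L2 = b
    return E1 @ E2, E1 @ g2 + g1, E1 @ L2 @ E1.T + L1

rng = np.random.default_rng(0)

def random_elem(nx=3):
    E = rng.standard_normal((nx, nx))
    g = rng.standard_normal(nx)
    A = rng.standard_normal((nx, nx))
    return E, g, A @ A.T                    # L symmetric positive semi-definite

ai, aj, ak = random_elem(), random_elem(), random_elem()
left = combine(combine(ai, aj), ak)         # (a_i ⊗ a_j) ⊗ a_k
right = combine(ai, combine(aj, ak))        # a_i ⊗ (a_j ⊗ a_k)
assert all(np.allclose(l, r) for l, r in zip(left, right))
```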
B. Proof of Theorem 6

In this appendix, we prove Theorem 6. We first prove by induction that

  a_k ⊗ ··· ⊗ a_{k+l} = p(x_k | y_{1:k+l}, x_{k+l+1}),        (31)

for l < n − k. Relation (31) holds for l = 0 by the definition of a_k. Then, assuming that

  a_k ⊗ ··· ⊗ a_{k+l-1} = p(x_k | y_{1:k+l-1}, x_{k+l})        (32)

holds, we need to prove that (31) holds.
We use a_{k+l} in Theorem 6 to calculate

  [a_k ⊗ ··· ⊗ a_{k+l-1}] ⊗ a_{k+l}
    = ∫ p(x_k | y_{1:k+l-1}, x_{k+l}) p(x_{k+l} | y_{1:k+l}, x_{k+l+1}) dx_{k+l}
    = ∫ p(x_k | y_{1:k+l}, x_{k+l}, x_{k+l+1}) p(x_{k+l} | y_{1:k+l}, x_{k+l+1}) dx_{k+l}
    = ∫ p(x_k, x_{k+l} | y_{1:k+l}, x_{k+l+1}) dx_{k+l}
    = p(x_k | y_{1:k+l}, x_{k+l+1}).

This proves (31).

If l = n − k − 1 and a_n is as in Theorem 6, we have

  [a_k ⊗ ··· ⊗ a_{n-1}] ⊗ a_n
    = ∫ p(x_k | y_{1:n-1}, x_n) p(x_n | y_{1:n}) dx_n
    = ∫ p(x_k | y_{1:n}, x_n) p(x_n | y_{1:n}) dx_n
    = p(x_k | y_{1:n}).

This result finishes the proof of Theorem 6.

APPENDIX III

In this appendix, we prove Lemma 8. We have the following easily verifiable identities:

  N_I(y; η, J) N(y; m, C) ∝ N(y; [J + C^{-1}]^{-1} [η + C^{-1} m], [J + C^{-1}]^{-1})

and

  N_I(y; η, J) N_I(y; η', J') ∝ N_I(y; η + η', J + J').

We also have

  ∫ N_I(y; η, J) N(y; Az + b, C) dy ∝ N_I(z; A^T [I + JC]^{-1} (η − Jb), A^T [I + JC]^{-1} J A).

By using Definition 2 for f_ij and g_ij together with the parameterizations in Lemma 8, elementary computations lead to (13) and (14).
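These identities can also be sanity-checked numerically. The scalar sketch below (illustrative only, with arbitrary parameter values) verifies the third identity by checking that the ratio of the two sides is constant in z, i.e., that the integral on the left is indeed proportional to the stated information-form Gaussian in z:

```python
import numpy as np

# Scalar check: as a function of z,
#   \int N_I(y; eta, J) N(y; A z + b, C) dy
# should be proportional to N_I(z; A (1 + J C)^{-1} (eta - J b), A (1 + J C)^{-1} J A).
A, b, C, eta, J = 0.7, -0.3, 0.5, 0.4, 1.2        # arbitrary scalar parameters
y = np.linspace(-30.0, 30.0, 200001)
dy = y[1] - y[0]

def lhs(z):
    infoform = np.exp(eta * y - 0.5 * J * y**2)                     # unnormalised N_I(y; eta, J)
    gauss = np.exp(-0.5 * (y - (A * z + b))**2 / C) / np.sqrt(2 * np.pi * C)
    return np.sum(infoform * gauss) * dy                            # grid approximation of the integral

eta_z = A * (eta - J * b) / (1 + J * C)
J_z = A * J * A / (1 + J * C)

def rhs(z):
    return np.exp(eta_z * z - 0.5 * J_z * z**2)                     # unnormalised N_I(z; eta_z, J_z)

zs = np.array([-1.0, 0.0, 0.5, 2.0])
ratios = np.array([lhs(z) / rhs(z) for z in zs])
assert np.allclose(ratios, ratios[0], rtol=1e-5)                    # constant ratio => proportionality
```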
ACKNOWLEDGMENT

The authors would like to thank the Academy of Finland for financial support.

REFERENCES

[1] T. Rauber and G. Rünger, Parallel Programming: For Multicore and Cluster Systems, 2nd ed. Springer, 2013.
[2] S. Cook, CUDA Programming: A Developer's Guide to Parallel Computing with GPUs. Morgan Kaufmann, 2013.
[3] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, 3rd ed. MIT Press, 2009.
[4] A. H. Jazwinski, Stochastic Processes and Filtering Theory. Academic Press, New York, 1970.
[5] Y. Bar-Shalom, X.-R. Li, and T. Kirubarajan, Estimation with Applications to Tracking and Navigation. Wiley, New York, 2001.
[6] S. Särkkä, Bayesian Filtering and Smoothing. Cambridge University Press, 2013.
[7] O. Cappé, E. Moulines, and T. Rydén, Inference in Hidden Markov Models, ser. Springer Series in Statistics. New York, NY: Springer-Verlag, 2005.
[8] J. Kaipio and E. Somersalo, Statistical and Computational Inverse Problems. Springer, 2005.
[9] S. Särkkä, M. A. Álvarez, and N. D. Lawrence, "Gaussian process latent force models for learning and stochastic control of physical systems," IEEE Transactions on Automatic Control, 2019.
[10] Y. C. Ho and R. C. K. Lee, "A Bayesian approach to problems in stochastic estimation and control," IEEE Transactions on Automatic Control, vol. 9, no. 4, pp. 333–339, 1964.
[11] T. D. Barfoot, C. H. Tong, and S. Särkkä, "Batch continuous-time trajectory estimation as exactly sparse Gaussian process regression," in Proceedings of Robotics: Science and Systems (RSS), 2014.
[12] A. Grigorievskiy, N. Lawrence, and S. Särkkä, "Parallelizable sparse inverse formulation Gaussian processes (SpInGP)," in Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2017.
[13] P. M. Lyster, S. E. Cohn, R. Ménard, L. P. Chang, S. J. Lin, and R. G. Olsen, "Parallel implementation of a Kalman filter for constituent data assimilation," Monthly Weather Review, vol. 125, no. 7, pp. 1674–1686, 1997.
[14] G. Evensen, "The ensemble Kalman filter: Theoretical formulation and practical implementation," Ocean Dynamics, vol. 53, no. 4, pp. 343–367, 2003.
[15] A. Lee, C. Yau, M. B. Giles, A. Doucet, and C. C. Holmes, "On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods," Journal of Computational and Graphical Statistics, vol. 19, no. 4, pp. 769–789, 2010.
[16] O. Rosen and A. Medvedev, "Efficient parallel implementation of state estimation algorithms on multicore platforms," IEEE Transactions on Control Systems Technology, vol. 21, no. 1, pp. 107–120, 2013.
[17] M. E. Liggins, C.-Y. Chong, I. Kadar, M. G. Alford, V. Vannicola, and S. Thomopoulos, "Distributed fusion architectures and algorithms for target tracking," Proceedings of the IEEE, vol. 85, no. 1, pp. 95–107, 1997.
[18] R. E. Ladner and M. J. Fischer, "Parallel prefix computation," Journal of the ACM, vol. 27, no. 4, pp. 831–838, 1980.
[19] G. E. Blelloch, "Scans as primitive parallel operations," IEEE Transactions on Computers, vol. 38, no. 11, pp. 1526–1538, 1989.
[20] G. E. Blelloch, "Prefix sums and their applications," School of Computer Science, Carnegie Mellon University, Tech. Rep. CMU-CS-90-190, 1990.
[21] R. E. Kalman, "A new approach to linear filtering and prediction problems," Transactions of the ASME, Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
[22] H. E. Rauch, F. Tung, and C. T. Striebel, "Maximum likelihood estimates of linear dynamic systems," AIAA Journal, vol. 3, no. 8, pp. 1445–1450, 1965.
[23] B. D. O. Anderson and J. B. Moore, Optimal Filtering. Prentice-Hall, 1979.
[24] N. Cressie and C. K. Wikle, Statistics for Spatio-Temporal Data. John Wiley & Sons, 2011.
[25] S. Särkkä, A. Solin, and J. Hartikainen, "Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing," IEEE Signal Processing Magazine, vol. 30, no. 4, pp. 51–61, 2013.
[26] A. Grama, V. Kumar, A. Gupta, and G. Karypis, Introduction to Parallel Computing, 2nd ed. Pearson Education, 2003.
[27] P. Sanders and J. L. Träff, "Parallel prefix (scan) algorithms for MPI," in Recent Advances in Parallel Virtual Machine and Message Passing Interface (EuroPVM/MPI 2006), Lecture Notes in Computer Science, vol. 4192, B. Mohr, J. Träff, J. Worringen, and J. Dongarra, Eds. Springer, 2006.
[28] B. M. Bell and F. W. Cathey, "The iterated Kalman filter update as a Gauss–Newton method," IEEE Transactions on Automatic Control, vol. 38, no. 2, pp. 294–297, 1993.
[29] B. M. Bell, "The iterated Kalman smoother as a Gauss–Newton method," SIAM Journal on Optimization, vol. 4, no. 3, pp. 626–636, 1994.
[30] A. F. García-Fernández, L. Svensson, M. R. Morelande, and S. Särkkä, "Posterior linearisation filter: principles and implementation using sigma points," IEEE Transactions on Signal Processing, vol. 63, no. 20, pp. 5561–5573, 2015.
[31] A. F. García-Fernández, L. Svensson, and S. Särkkä, "Iterated posterior linearization smoother," IEEE Transactions on Automatic Control, vol. 62, no. 4, pp. 2056–2063, 2017.
[32] F. Tronarp, A. F. García-Fernández, and S. Särkkä, "Iterative filtering and smoothing in non-linear and non-Gaussian systems using conditional moments," IEEE Signal Processing Letters, vol. 25, no. 3, pp. 408–412, 2018.
