arXiv:1212.1639v1 [stat.CO] 7 Dec 2012
Fully Parallel Particle Learning
for GPGPUs and Other Parallel Devices
Kenichiro McAlinn
Hiroaki Katsura
Teruo Nakatsuma
Graduate School of Economics, Keio University, 2-15-45 Mita, Minato-ku, Tokyo, Japan
Faculty of Economics, Keio University, 2-15-45 Mita, Minato-ku, Tokyo, Japan
1 Introduction
The state space model (SSM) has been one of the indispensable tools for time series analysis and optimal control for decades. Although the archetypal SSM is linear and Gaussian, the literature on more general non-linear and non-Gaussian SSMs has grown rapidly in the last two decades. For lack of an analytically tractable way to estimate the general SSM, numerous approximation methods have been proposed. Among them, arguably the most widely applied method is particle filtering (Gordon et al. (1993), Kitagawa (1996)). Particle filtering is a type of sequential Monte Carlo method in which the integrals we need to evaluate for filtering are approximated by Monte Carlo integration. To improve the numerical accuracy and stability of the particle filtering algorithm, various extensions such as the auxiliary particle filter (Pitt and Shephard (1999)) have been proposed, and they are still actively studied by many researchers. For SSMs with unknown parameters, Kitagawa (1998) proposed a self-organizing state space modeling approach in which the unknown parameters are regarded as a subset of the state variables and the joint posterior distribution of the parameters and the state variables is evaluated with a particle filtering algorithm. Other particle filtering methods that can simultaneously estimate parameters have been proposed by Liu and West (2001), Storvik (2002), Fearnhead (2002), Polson et al. (2008), Johannes and Polson (2008), Johannes et al. (2008), and Carvalho et al. (2010), to name a few. Particle filtering methods that estimate state variables and parameters simultaneously are often called particle learning methods in the literature. Although the effectiveness of particle filtering has been demonstrated in many different applications (see Zou and Chakrabarty (2007), Mihaylova et al. (2007), Chai and Yang (2007), Montemerlo et al. (2003), Dukic et al. (2009), and Lopes and Tsay (2011), among others), this advantage is offset by the fact that it is a time-consuming technique, and some practitioners still shy away from using it in their applications for that reason.
This attitude toward particle filtering may well be changed by the latest technology: parallel computing. As we will discuss in Section 2, some parts of the particle filtering procedure are ready to be executed simultaneously on many processors in a parallel computing environment. In light of inexpensive parallel processing devices such as GPGPUs^1
(general purpose graphics processing units) available to the general public, more and more researchers have started to jump on the bandwagon of parallel computing. Lee et al. (2010) reviewed general attempts at parallelization of Bayesian estimation methods. Durham and Geweke (2011) developed a sequential Monte Carlo method designed for GPU computing and applied it to complex non-linear dynamic models, which are numerically intractable even for the Markov chain Monte Carlo method. As for parallelization of particle filtering, a few studies (see Montemayor et al. (2004), Maskell et al. (2006), and Hendeby et al. (2007), for example) have been reported, though the field is still at a very early stage. To the best of the authors' knowledge, Hendeby et al. (2010) is the only successful implementation of the particle filtering algorithm on the GPU. Their implementation, however, depends on device-specific functionalities, and its resampling algorithm is not an exact one.

In developing parallel algorithms designed for the GPU, there are a few bottlenecks one should avoid. First, processing sequential algorithms on the GPU can be inefficient because of the GPU's device memory architecture. Roughly speaking, a GPU has two types of memory: memory assigned to each core and memory shared by all cores. Access to the core-linked memory is fast, while access to the shared memory takes more time. Ideally, one should try as much as possible to keep all calculations on each core without any communication among the cores. The second bottleneck is that it is time-consuming to transfer data between the host memory, which the CPU uses, and the device memory, which the GPU uses. In other words, the bandwidth between the GPU's device memory and the CPU's host memory is very narrow. Thus an ideal parallel algorithm for the GPU would calculate everything within the GPU, preferably within each core (without inter-core communications).
With these bottlenecks in mind, we have developed a new parallel algorithm that computes the full cycle of the particle filtering algorithm in a massively and fully parallel manner, from computing the likelihood for each particle and constructing the cumulative distribution function (CDF) for resampling, to resampling the particles with the CDF and propagating new particles for the next cycle. By keeping all of our computations within the GPU, we avoid all memory transfer between the GPU and the CPU during the execution of the particle filtering algorithm. In this way, we exploit the great benefits of parallel computing on the GPU while avoiding its shortcomings. Additionally, since our parallel algorithm does not utilize any device-specific functionalities, it can be easily implemented on other parallel computing devices, including cloud computing systems.

^1 A high-performance GPU (graphics processing unit) was originally developed for displaying the high-resolution 2D/3D graphics necessary in video games and computer-aided design. Because a GPU is designed with a massive number of processor cores to conduct single-instruction multiple-data (SIMD) processing, it has been regarded as an attractive platform for parallel computing, and researchers started to use it for high-performance computing. As GPU manufacturers try to take advantage of this opportunity, the GPU has evolved into a more computation-oriented device called the GPGPU. Nowadays almost all GPUs have at least some capability for parallel computing, so the distinction between GPUs and GPGPUs is blurred.
In order to compare our new parallel algorithm with conventional sequential algorithms, we conducted a Monte Carlo experiment in which we applied the competing particle learning algorithms to a simple state space model (a stochastic trend with noise model) and recorded the execution time of each algorithm. The results show that our parallel algorithm on the GPU is faster than the conventional sequential algorithm on the CPU by a factor of roughly 30 to 200.

The organization of this paper is as follows. In Section 2, we briefly review state space models, particle filtering, and particle learning. In Section 3, we describe how to implement a fully parallelized particle filtering algorithm, in particular how to parallelize the CDF construction step and the resampling step. In Section 4, we report the results of our Monte Carlo experiment and discuss their implications. In Section 5, we state our conclusion.
2 State Space Models and Particle Filtering
A general form of the SSM is given by

    y_t ~ p(y_t | x_t),
    x_t ~ p(x_t | x_{t-1}),                                              (1)

where p(y_t | x_t) stands for the conditional distribution or density of the observation y_t given the unobservable x_t, and p(x_t | x_{t-1}) stands for the conditional distribution or density of x_t given x_{t-1}, which is the previous realization of x_t itself. In the literature on SSMs, the unobservable x_t, which dictates the stochastic process of y_t, is called the state variable.
Time series data analysis with the SSM is centered on how to dig up hidden structures of the state variable out of the observations {y_t}_{t=1}^T. In particular, the key questions in applications of the SSM are (i) how to estimate the current unobservable x_t, (ii) how to predict the future state variables, and (iii) how to infer the past state variables with the data currently available. These aspects of state space modeling are called filtering, prediction, and smoothing, respectively.
The filtering procedure, which is the main concern in our study, is given by the sequential Bayes filter:

    p(x_t | y_{1:t-1}) = ∫ p(x_t | x_{t-1}) p(x_{t-1} | y_{1:t-1}) dx_{t-1},                          (2)

    p(x_t | y_{1:t}) = p(y_t | x_t) p(x_t | y_{1:t-1}) / ∫ p(y_t | x_t) p(x_t | y_{1:t-1}) dx_t,      (3)
where y_{1:t} = {y_1, ..., y_t} (t = 1, ..., T) and p(x_t | y_{1:t}) is the conditional density of the state variable x_t given y_{1:t}. In essence, equation (3) is the well-known Bayes rule for updating the conditional density of x_t, while equation (2) gives the one-period-ahead predictive density of x_t given the past observations y_{1:t-1}. By applying (2) and (3) repeatedly, one keeps the conditional density p(x_t | y_{1:t}) updated as each new observation comes in.
In general, a closed form of neither (2) nor (3) is available, except for the linear Gaussian case where we can use the Kalman filter (Kalman (1960)). See West and Harrison (1997) for a detailed account of the linear Gaussian SSM. To deal with this difficulty, we apply particle filtering, in which we approximate the integrals in (2) and (3) with particles, a random sample of the state variables generated from either the conditional density p(x_t | y_{1:t}) or the predictive density p(x_t | y_{1:t-1}). Let {x_t^{(i)}}_{i=1}^N denote N particles generated from p(x_t | y_{1:t-1}). We can approximate p(x_t | y_{1:t-1}) by

    p(x_t | y_{1:t-1}) ≈ (1/N) Σ_{i=1}^N δ(x_t − x_t^{(i)}),                                          (4)

where δ(·) is the Dirac delta. Then the filtering equation (3) is approximated by
    p(x_t | y_{1:t}) ≈ [ p(y_t | x_t) (1/N) Σ_{i=1}^N δ(x_t − x_t^{(i)}) ] / [ ∫ p(y_t | x_t) (1/N) Σ_{i=1}^N δ(x_t − x_t^{(i)}) dx_t ]
                     = Σ_{i=1}^N w_t^{(i)} δ(x_t − x_t^{(i)}),                                        (5)

    w_t^{(i)} = p(y_t | x_t^{(i)}) / Σ_{j=1}^N p(y_t | x_t^{(j)}).
Equation (5) implies that the conditional density p(x_t | y_{1:t}) is discretized on the particles {x_t^{(i)}}_{i=1}^N with probabilities {w_t^{(i)}}_{i=1}^N. Therefore, when the approximation (5) is sufficiently accurate, we can obtain a sample of x_t, {x̃_t^{(i)}}_{i=1}^N, from p(x_t | y_{1:t}) by drawing each x̃_t^{(i)} out of {x_t^{(i)}}_{i=1}^N with probabilities {w_t^{(i)}}_{i=1}^N. This procedure is called resampling. In reverse, if we have N particles {x̃_{t-1}^{(i)}}_{i=1}^N generated from p(x_{t-1} | y_{1:t-1}), we can approximate p(x_t | y_{1:t-1}) by
    p(x_t | y_{1:t-1}) ≈ ∫ p(x_t | x_{t-1}) (1/N) Σ_{i=1}^N δ(x_{t-1} − x̃_{t-1}^{(i)}) dx_{t-1}
                       = (1/N) Σ_{i=1}^N p(x_t | x̃_{t-1}^{(i)}).                                     (6)
Then (6) implies that we can obtain a sample of x_{t+1}, {x_{t+1}^{(i)}}_{i=1}^N, from p(x_{t+1} | y_{1:t}) by generating each x_{t+1}^{(i)} from p(x_{t+1} | x̃_t^{(i)}), which is called propagation. Hence we can mimic the sequential Bayes filter by repeating the propagation equation (6) and the resampling equation (5) for t = 1, 2, 3, .... This is the basic principle of particle filtering. The formal representation of the particle filtering algorithm is given as follows.
ALGORITHM: PARTICLE FILTERING

Step 0: Set the starting values of the N particles {x̃_0^{(i)}}_{i=1}^N.
Step 1: Propagate x_t^{(i)} from p(x_t | x̃_{t-1}^{(i)}), (i = 1, ..., N).
Step 2: Compute the weights w_t^{(i)} ∝ p(y_t | x_t^{(i)}).
Step 3: Resample {x̃_t^{(i)}}_{i=1}^N from {x_t^{(i)}}_{i=1}^N with the weights {w_t^{(i)}}_{i=1}^N.
When a state space model depends on unknown parameters θ,

    y_t ~ p(y_t | x_t, θ),
    x_t ~ p(x_t | x_{t-1}, θ),                                           (7)
we need to evaluate the posterior distribution p(θ | y_{1:t}) given the observations y_{1:t}. In the framework of particle filtering, p(θ | y_{1:t}) is sequentially updated as each new observation arrives, which is called particle learning. The particle learning algorithm is defined as follows. Let {z_t^{(i)} = (x_t^{(i)}, θ_t^{(i)})}_{i=1}^N and {z̃_t^{(i)} = (x̃_t^{(i)}, θ̃_t^{(i)})}_{i=1}^N denote particles jointly generated from p(x_t, θ | y_{1:t-1}) and p(x_t, θ | y_{1:t}), respectively. Then the particle approximation of the Bayesian
learning process (Kitagawa (1998)) is given by

    p(z_t | y_{1:t-1}) ≈ (1/N) Σ_{i=1}^N p(z_t | z̃_{t-1}^{(i)}),                                     (8)

    p(z_t | y_{1:t}) ≈ Σ_{i=1}^N w_t^{(i)} δ(z_t − z_t^{(i)}),   w_t^{(i)} = p(y_t | z_t^{(i)}) / Σ_{j=1}^N p(y_t | z_t^{(j)}).   (9)
This is a rather straightforward generalization of particle filtering.

ALGORITHM: PARTICLE LEARNING

Step 0: Set the starting values of the N particles {z_0^{(i)}}_{i=1}^N.
Step 1: Propagate z_t^{(i)} from p(z_t | z̃_{t-1}^{(i)}), (i = 1, ..., N).
Step 2: Compute the weights w_t^{(i)} ∝ p(y_t | z_t^{(i)}).
Step 3: Resample {z̃_t^{(i)}}_{i=1}^N from {z_t^{(i)}}_{i=1}^N with the weights {w_t^{(i)}}_{i=1}^N.
Once we generate {θ_t^{(i)}} by particle learning, we can treat them as a Monte Carlo sample of θ drawn from the posterior density p(θ | y_{1:t}). Thus we can calculate posterior statistics of θ with {θ_t^{(i)}} in the same manner as with the traditional Monte Carlo method or the state-of-the-art Markov chain Monte Carlo (MCMC) method.

The computational burden of particle filtering becomes prohibitively taxing as the number of particles N increases. The number of likelihood evaluations p(y_t | x_t^{(i)}), the number of operations for constructing the CDF of the particles, and the number of particles to be generated in propagation all increase as O(N). The number of operations for resampling with the CDF increases as O(N^2) when we use a naive resampling algorithm, but it can be reduced to O(N log N) with more efficient algorithms, which we will discuss in the next section. Thus sequential particle-by-particle execution of each step of the particle filtering (and learning) algorithm is inefficient when N is large, even though the particle filtering method by construction requires a large number of particles to guarantee precision of the estimates.
To reduce the computation time, we propose to parallelize all steps of particle filtering so that the parallelized particle filtering algorithm can be executed completely inside the GPU. The key to constructing an efficient parallel algorithm is asynchronous, out-of-order execution of the jobs assigned to each processor. We need to keep the massive number of processors in the GPU as busy as possible to fully exploit the potential computational power of the GPU. Therefore each processor should waste no milliseconds waiting for other processors to complete their jobs. If the order of execution does not affect the end results, asynchronous out-of-order execution is readily implemented, and parallelization is rather straightforward. In the particle filtering method, this is the case for the likelihood computation and the propagation step, and these steps can be computed in parallel without any modifications. For constructing the CDF and resampling the particles, on the other hand, the conventional algorithms do not allow asynchronous out-of-order execution, and parallelization of these steps can be tricky. In order to devise a fully parallelized particle filtering method, we need to develop new parallel algorithms for these steps. In the next section, we describe how to implement the CDF construction and the resampling in a parallel computing environment.
3 Full Parallelization of Particle Filtering
3.1 Fully parallelized CDF construction
The goal of resampling is to generate N random integers, which are the indices of particles, from the discrete distribution on {1, ..., N} with the cumulative distribution function

    q(i) ∝ Σ_{j=1}^i p(y_t | z_t^{(j)}),   (i = 1, ..., N).                                           (10)

Therefore we need to construct the CDF (10) before we perform resampling.
Hendeby et al. (2007) developed an algorithm designed for parallel execution of the CDF construction. The process consists of two procedures: the forward adder and the backward adder. These are illustrated in Table 1. Suppose that we have four particles whose weights are given by {2, 4, 3, 1}. What we want to compute is the cumulative sum, {2, 6, 9, 10}. First we apply the forward adder, as described in Panel (a) of Table 1. In the initial step of the forward adder, each weight is assigned to a node (nodes correspond to threads on the GPU). Let w_j^{(i)} denote the sum of weights of Node i in Step j. Thus the initial states of the nodes are w_1^{(1)} = 2, w_1^{(2)} = 4, w_1^{(3)} = 3, and w_1^{(4)} = 1. Then, in each step of the forward adder, a neighboring pair of nodes is combined to form a new node, which inherits the sum of the weights of the two combined nodes as its new weight. For example, in Step 2 of Table 1(a), the pair of nodes w_1^{(1)} = 2 and w_1^{(2)} = 4 is combined into a new node w_2^{(1)} = 2 + 4 = 6. We repeat this procedure until all nodes are collapsed into a single node whose weight is the sum of all the weights.
Then we apply the backward adder, as described in Panel (b) of Table 1. Note that the numbering of the steps in Table 1(b) is reversed; this is intentional. In the initial step of the backward adder, we start at the last node of the forward adder. Let s_j^{(i)} denote the sum of weights up to Particle i in Step j. In Panel (b) of Table 1, we already have s_3^{(4)} = 2 + 4 + 3 + 1 = 10 from the forward adder. As the procedure moves backward, the current node branches into two nodes in each step. The new right node inherits the value of the current node, while the partial sum of the new left node equals the value of the current node minus the weight of the right node in the corresponding step of the forward adder. For example, in Step 1 of Table 1(b), the node with s_2^{(2)} = 6 branches into s_1^{(1)} = s_2^{(2)} − w_1^{(2)} = 6 − 4 = 2 and s_1^{(2)} = s_2^{(2)} = 6. Note that the number of particles should be an integer power of 2 in order to apply this algorithm.
Table 1: Parallel computation of the cumulative sum

(a) Forward adder

  Step 1   w_1^{(1)} = 2           w_1^{(2)} = 4    w_1^{(3)} = 3             w_1^{(4)} = 1
  Step 2   w_2^{(1)} = 2 + 4 = 6                    w_2^{(2)} = 3 + 1 = 4
  Step 3   w_3^{(1)} = 6 + 4 = 10

(b) Backward adder

  Step 3                                                                      s_3^{(4)} = 10
  Step 2                           s_2^{(2)} = 10 − 4 = 6                     s_2^{(4)} = 10
  Step 1   s_1^{(1)} = 6 − 4 = 2   s_1^{(2)} = 6    s_1^{(3)} = 10 − 1 = 9    s_1^{(4)} = 10
3.2 A Review of the Conventional Resampling Algorithms
Before we proceed to describe our parallel resampling algorithm, we briefly review several conventional sequential resampling algorithms. In its most naive form, the resampling algorithm is expressed as

    # naive multinomial resampling: search the whole CDF for every particle
    for (i in 1:N) {
      u <- runif(1)
      for (j in 1:N) {
        if (u < q[j]) break
      }
      sampled[i] <- j
    }
Although this is the most basic and exact resampling procedure, it is an extremely time-consuming O(N^2) operation, as we need to search the whole CDF for each particle one by one. One simple way to improve it is to sort the uniform variates before the search process starts:

    # exact resampling with sorted uniform variates: each search resumes
    # where the previous one left off
    u <- sort(runif(N))
    j <- 1
    for (i in 1:N) {
      while (u[i] >= q[j]) {
        j <- j + 1
      }
      sampled[i] <- j
    }
Compared to the naive algorithm, this algorithm starts off by sampling all the necessary random numbers from the uniform distribution, sorts them in ascending order, and then sequentially resamples particles with them. Because the uniform variates are sorted, each search for a particle can start where the previous search left off. The drawback of this algorithm is that, although the resampling procedure becomes a more efficient O(N log N) operation, sorting the uniform variates can be computationally strenuous as the number of particles increases, depending on the sorting algorithm used.

Alternative resampling algorithms have been introduced in order to circumvent the computational strain found in exact resampling algorithms like those above. The stratified resampling algorithm,
    # stratified resampling: one uniform draw in each interval [(i-1)/N, i/N]
    j <- 1
    for (i in 1:N) {
      u <- (i - 1 + runif(1)) / N
      while (u >= q[j]) {
        j <- j + 1
      }
      sampled[i] <- j
    }
conducts the resampling procedure by generating uniform variates on the N equally spaced intervals [(i − 1)/N, i/N] (i = 1, ..., N). Since exactly one particle is always picked for each interval, it does not exactly generate random integers from the CDF {q(1), ..., q(N)}. The systematic resampling algorithm,
    # systematic resampling: the same offset u0 is reused in every interval
    u0 <- runif(1)
    j <- 1
    for (i in 1:N) {
      u <- (i - 1 + u0) / N
      while (u >= q[j]) {
        j <- j + 1
      }
      sampled[i] <- j
    }
is similar to stratified resampling, but it always chooses the same relative point in [(i − 1)/N, i/N] for all i. In essence, these two alternative resampling algorithms are similar to the aforementioned resampling with sorted uniform variates, but without the time-consuming sorting procedure. However, neither stratified resampling nor systematic resampling is exact, so they may produce less accurate results than exact resampling algorithms. In particular, when estimating unknown parameters in particle learning, they may be more problematic because they cannot jointly resample the state variables and the parameters. Thus, despite their computational superiority in a sequential framework, we need to be careful about using them in practice. Additionally, as these methods benefit only from the sequential framework, it is not obvious how to parallelize them.
3.3 Fully parallelized resampling

To parallelize the resampling procedure while maintaining its exactness, we have developed a parallel resampling algorithm^2 based on the cut-point method of Chen and Asau (1974).

A cut-point I_j, for a given j = 1, ..., N, is the smallest index i such that the corresponding probability q(I_j) is greater than (j − 1)/N. In other words,

    q(I_j) = min_{1≤i≤N} q(i)   subject to   q(i) > (j − 1)/N,   (j = 1, ..., N).                     (11)

Given the cut-points {I_1, ..., I_N}, random integers between 1 and N are generated from the CDF {q(1), ..., q(N)} by the following procedure.

ALGORITHM: CUT-POINT METHOD

Step 0: Let j = 1.
Step 1: Generate u from the uniform distribution on the interval (0, 1).
Step 2: Let k = I_{⌈Nu⌉}, where ⌈x⌉ stands for the smallest integer greater than or equal to x.
Step 3: If u > q(k), let k ← k + 1 and repeat Step 3; otherwise, go to Step 4.
Step 4: Store k as the index of the particle.
Step 5: If j < N, let j ← j + 1 and go back to Step 1; otherwise, exit the loop.
Once all the cut-points {I_1, ..., I_N} are given, parallel execution of the cut-point method is straightforward, because the execution of Steps 1-3 does not depend on the index j. The fully parallel resampling algorithm, distributed over N threads, is given as follows.

^2 Hendeby et al. (2007) developed a parallel resampling algorithm for particle filtering which is specifically designed for GPUs. Their method, however, depends on a device-specific functionality (rasterization), and its efficiency and scalability are limited by the GPU architecture. Our parallel algorithm, on the other hand, is more versatile and scalable because it requires only basic thread coordination mechanisms, such as shared memory and thread synchronization, which are provided by most parallel computing systems.

ALGORITHM: PARALLELIZED CUT-POINT METHOD

Step 0: Initiate the j-th thread.
Step 1: Generate u from the uniform distribution on the interval (0, 1).
Step 2: Let k = I_{⌈Nu⌉}, where ⌈x⌉ stands for the smallest integer greater than or equal to x.
Step 3: If u > q(k), let k ← k + 1 and repeat Step 3; otherwise, go to Step 4.
Step 4: Store k as the index of the particle.
Step 5: Wait until all threads complete their jobs; then exit.
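In CUDA, this resampling step can be written with one thread per output index; the following kernel is a sketch with variable names of our own (it is not the authors' code), and it assumes that the normalized CDF q[0..N-1] and the cut-point table cut[0..N-1] (0-based, with cut[j-1] playing the role of I_j) already reside on the device, with one curand state per thread.

    #include <curand_kernel.h>

    // Parallelized cut-point method: one thread resamples one particle index.
    __global__ void resample_cutpoint(const float* q, const int* cut, int* idx,
                                      curandState_t* states, int N)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j >= N) return;

        float u = curand_uniform(&states[j]);      // Step 1: u in (0, 1]
        int k = cut[(int)ceilf(u * N) - 1];        // Step 2: jump to the cut-point
        while (k < N - 1 && u > q[k]) ++k;         // Step 3: short local search
        idx[j] = k;                                // Step 4: store the index
    }

    // One-time initialization of the per-thread RNG states (seed is arbitrary).
    __global__ void init_rng(curandState_t* states, unsigned long long seed, int N)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j < N) curand_init(seed, j, 0, &states[j]);
    }

A typical launch would be resample_cutpoint<<<(N + 255) / 256, 256>>>(q, cut, idx, states, N), after init_rng has been called once with the same grid.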
However, the conventional algorithm for computing the cut-points (see Fishman (1996, p.158), for example) is not friendly to parallel execution. Thus, we have developed an efficient algorithm for a parallel search of all the cut-points. To devise such a search algorithm, let us introduce

    L_j = ⌈N q(j)⌉,   (j = 1, ..., N),

and L_0 = 0 by convention. Due to the monotonicity of the cumulative distribution function, we observe:

1. 0 = L_0 < L_1 ≤ ··· ≤ L_N = N.
2. If L_{j-1} < L_j, the cut-points such that

       q(I_k) = min_{1≤i≤N} q(i)   subject to   N q(i) > k − 1,   (k = L_{j-1} + 1, ..., L_j)

   are given by I_k = j.
3. If L_{j-1} = L_j, then j does not correspond to any cut-point.
4. L_1 = I_1 always holds.

The above properties give us a convenient criterion for checking whether a particular L_j corresponds to a cut-point, and they lead to the following multi-thread parallel algorithm for finding all the cut-points.
ALGORITHM: PARALLELIZED CUT-POINT SEARCH

Step 0: Initiate the j-th thread.
Step 1: Compute L_j = ⌈N q(j)⌉.
Step 2: Let k = L_j.
Step 3: If k > L_{j-1}, let I_k = j; otherwise, end the thread.
Step 4: Let k ← k − 1 and go to Step 3.
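A CUDA sketch of this search assigns one thread to each index j; the thread recomputes L_{j-1} and L_j directly from the CDF and fills every cut-point slot it is responsible for (the arrays are 0-based and the names are our own assumptions, not the authors' code):

    // Parallelized cut-point search: thread j fills cut[k-1] = j - 1 (0-based)
    // for k = L_{j-1} + 1, ..., L_j, where L_j = ceil(N * q(j)) and L_0 = 0.
    __global__ void cutpoint_search(const float* q, int* cut, int N)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x + 1;   // 1-based index j
        if (j > N) return;

        int Lj   = (int)ceilf(N * q[j - 1]);                 // L_j
        int Ljm1 = (j == 1) ? 0 : (int)ceilf(N * q[j - 2]);  // L_{j-1}

        for (int k = Lj; k > Ljm1; --k)                      // Steps 2-4
            cut[k - 1] = j - 1;                              // I_k = j, stored 0-based
    }

Because every thread evaluates the same expression for each L value, no two threads write to the same slot, so no synchronization is needed within the kernel.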
With the fully parallel resampling algorithm, particle filtering can be executed in a fully parallel manner without any compromise. Additionally, since the particle filtering algorithm (and the particle learning algorithm) is conducted completely on the GPU and each particle goes through the algorithm on its designated core without syncing, the advantage of parallel computing is exploited to the fullest while its shortcomings are kept to a minimum: data transfer between the GPU's device memory and the CPU's host memory occurs only at the beginning and the end of the computation.
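To make this claim concrete, a host-side skeleton of one run might look as follows; it strings together the kernels sketched above plus a few model-specific kernels (initialize_kernel, propagate_kernel, weights_kernel, normalize_kernel, gather_kernel), all of which are placeholder names of our own rather than the authors' code, and it copies data to the device once before the loop and copies results back once after it.

    #include <cuda_runtime.h>
    #include <curand_kernel.h>
    #include <vector>

    // Host-side skeleton of the fully parallel particle filter.
    void run_particle_filter(const float* y_host, int T, int N)
    {
        float *d_y, *d_x, *d_xnew, *d_q;
        int *d_cut, *d_idx;
        curandState_t* d_states;
        cudaMalloc(&d_y, T * sizeof(float));
        cudaMalloc(&d_x, N * sizeof(float));
        cudaMalloc(&d_xnew, N * sizeof(float));
        cudaMalloc(&d_q, N * sizeof(float));
        cudaMalloc(&d_cut, N * sizeof(int));
        cudaMalloc(&d_idx, N * sizeof(int));
        cudaMalloc(&d_states, N * sizeof(curandState_t));

        // The only host-to-device transfer: the observations.
        cudaMemcpy(d_y, y_host, T * sizeof(float), cudaMemcpyHostToDevice);

        int blocks = (N + 255) / 256;
        init_rng<<<blocks, 256>>>(d_states, 12345ULL, N);
        initialize_kernel<<<blocks, 256>>>(d_x, d_states, N);                    // x_0 draws (placeholder)

        for (int t = 0; t < T; ++t) {
            propagate_kernel<<<blocks, 256>>>(d_x, d_xnew, d_states, N);         // Step 1 (placeholder)
            weights_kernel<<<blocks, 256>>>(d_y, t, d_xnew, d_q, N);             // Step 2 (placeholder)
            inclusive_scan<<<1, N, N * sizeof(float)>>>(d_q, d_q, N);            // CDF (single-block sketch)
            normalize_kernel<<<blocks, 256>>>(d_q, N);                           // divide by q[N-1] (placeholder)
            cutpoint_search<<<blocks, 256>>>(d_q, d_cut, N);
            resample_cutpoint<<<blocks, 256>>>(d_q, d_cut, d_idx, d_states, N);  // Step 3
            gather_kernel<<<blocks, 256>>>(d_xnew, d_idx, d_x, N);               // x[i] = xnew[idx[i]] (placeholder)
        }

        // The only device-to-host transfer: the final particles (or summaries).
        std::vector<float> x_final(N);
        cudaMemcpy(x_final.data(), d_x, N * sizeof(float), cudaMemcpyDeviceToHost);
        // ... use x_final, then release the buffers with cudaFree ...
    }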
4 Numerical Experiment
In our experiment, we use a stochastic trend with noise model,

    y_t = x_t + ε_t,        ε_t ~ N(0, σ²),
    x_t = x_{t-1} + η_t,    η_t ~ N(0, τ²),                               (12)

as the benchmark model for the performance comparison. In (12), we set x_0 = 0, σ² = 1, and τ² = 0.1, and generate {y_1, ..., y_100}. Then we treat σ² and τ² as unknown parameters and apply the particle learning algorithm of Carvalho et al. (2010) to (12). The prior distributions are

    x_0 ~ N(0, 10),   σ² ~ IG(5, 4),   τ² ~ IG(5, 0.4).
To demonstrate the effectiveness of our new parallel algorithm, we compare the following algorithms:

Sequential algorithms on the CPU
  CPU(n): naive resampling, single precision
  CPU(s): resampling with sorted uniform variates, single precision

Parallel algorithms on the GPU
  GPU(sp): parallel resampling by the cut-point method, single precision
  GPU(dp): parallel resampling by the cut-point method, double precision

The first two are conventional sequential algorithms for resampling. The code for the parallel algorithm is written in CUDA, while that for the sequential algorithms is written in C. Both are compiled and executed on the same Linux PC, with the specifications shown in Table 2. Alternative resampling algorithms, such as stratified and systematic resampling, are not considered here, as they are not exact resampling schemes. However, if we exclude the time consumed by the sorting procedure from the resampling time of CPU(s), we get a very good estimate of how long they would take.
Table 2: Hardware specifications

                      CPU (Intel Core i7-2700K)    GPU (NVIDIA GTX 580)
  Core clock rate     3.50 GHz                     772 MHz
  Number of cores     4                            512
  Memory              8 GB                         3 GB
For each algorithm, we executed the particle learning ten times with the same generated path, {y_1, ..., y_100}, and recorded the execution time of each trial. To avoid the influence of possible outliers, we took the average of the middle five of them. The results are listed in Table 3, and the total execution times are plotted against the number of particles in Figure 1.

The results clearly show that our new parallel algorithm, which runs completely in parallel and keeps all executions within the GPU, is extremely effective compared with conventional sequential algorithms. As the number of particles increases (and the precision of the estimates improves), GPU(sp) outperforms CPU(n) by more than 200x in the case of 1,048,576 (= 2^20) particles and by 1,671x in the case of 8,388,608 (= 2^23) particles. Even compared with the exact resampling with sorted uniform variates (CPU(s)), GPU(sp) is consistently faster by more than 20x when the number of particles is 131,072 or more. In the comparison between GPU(sp) and GPU(dp), the difference is roughly a factor of two, which is consistent with intuition. Interestingly, the computation on the GPU in double precision is still a good 5-10x faster than that on the CPU in single precision, which demonstrates the sheer power of parallel processing on the GPU.
Table 3: Comparison in execution time

  Number of particles
  Time            1,024    4,096   16,384   131,072   1,048,576    8,388,608
  (i)   CPU(n)       30       80      390     7,050     307,600   19,855,870
  (ii)  CPU(s)       48      188      754     6,088      49,266      330,742
  (iii) GPU(sp)      17       21       44       215       1,526       11,881
  (iv)  GPU(dp)      20       25      129       436          --           --

  Ratio           1,024    4,096   16,384   131,072   1,048,576    8,388,608
  (i) / (iii)       1.7      3.8      7.7      32.7       201.5       1671.3
  (ii) / (iii)      2.8      9.0     17.3      28.3        32.3         27.8
  (ii) / (iv)       2.4      7.4      5.8      14.0          --           --
  (iv) / (iii)      1.1      1.2      3.0       2.0          --           --

  Note: the values of execution time are in milliseconds.
[Figure 1: Plots of execution time (milliseconds) against the number of particles, on log-log axes, for CPU(n), CPU(s), GPU(sp), and GPU(dp).]
Due to memory failure, GPU(dp) failed when the number of particles exceeded one million, though this could be remedied by upgrading to a GPU with more memory or by using multiple GPUs.
To see which part of the particle learning contributes to the reduction in execution time, we divide the cycle of particle learning into the following steps:

  Initialize: set the starting values of the particles;
  CDF: compute the likelihood and construct the CDF;
  Resample: resample the particles with the CDF;
  Propagate: propagate a new set of particles;
  Store: store the generated particles in the CPU's host memory (GPU only);
  Other: keep the results and proceed with the particle learning.

The results in the case of 131,072 particles are listed in Table 4. The tendency we observe in Table 4 is similar in the other cases.
Breaking down the execution time gives us deeper insight into how the GPU architecture works and into its strong and weak points. Examining the results for CPU(s), we first notice that the CDF step and the Propagate step together occupy the bulk of the total execution time, while the Resample step accounts for less than ten percent of the total, much of that coming from the sorting step. Looking closely at the gain from parallelization in each step, the largest comes from the CDF step with a gain of 248.0x, followed by the Propagate step with a gain of 45.3x and the Resample step with a gain of 11.9x. Although the gain in the Resample step has less of an impact than the overwhelming gains in the CDF and Propagate steps, it is worth noting that the resampling procedure itself still gained 2.7x in single precision even when we ignore the time spent sorting the uniform random variates. This implies that our parallel resampling on the GPU can beat stratified resampling on the CPU, since stratified resampling is roughly equivalent, in terms of computational complexity, to resampling with sorted uniform variates without the sorting. As for the Other step, CPU(s) and GPU(sp) are identical because, for both algorithms, all executions of this step are conducted only on the CPU; thus we observe no difference.
Table 4: Breakdown of execution time

  Cycle of particle learning
  Time            Initialize     CDF   Resample   Propagate   Store   Other   Total
  (i)   CPU(s)           26     2652        454        2900      --      56    6088
                                           (102)
  (ii)  GPU(sp)        0.72       11         38          64      46      56     215
  (iii) GPU(dp)        1.51       19         61         174      89      92     436

  Ratio           Initialize     CDF   Resample   Propagate   Store   Other   Total
  (i) / (ii)           35.9    248.0       11.9        45.3      --     1.0    28.3
                                           (2.7)
  (iii) / (ii)          2.1      1.8        1.6         2.7     1.9     1.7     2.0

  Notes: (a) the number of particles is 131,072;
         (b) the values of execution time are in milliseconds;
         (c) the numbers in parentheses correspond to the time excluding the sorting step.
Finally, we observe a good amount of reduction in the time for initializing the particle learning algorithm with our parallel algorithm; however, the time spent on initialization is quite trivial, in particular when the number of sample periods T is large (T = 100 in our experiment).

Although it is clear that our parallel algorithm is superior to the conventional sequential algorithm in every step, Table 4 indicates that there is one drawback of using the GPU: memory transfer. The Store step measures the time it takes to transfer the generated particles from the GPU's device memory to the CPU's host memory. Table 4 shows that it takes up roughly 15-20% of the execution time. Note that, for fairness of the experiment, the GPU returns all of the particles it generates to the CPU's host memory. If we were to return only the mean, the variance, and other statistics of the state variables and parameters, the time for the Store step could be cut down significantly.
5 Conclusion
In this study, we have developed a new algorithm to perform particle filtering and learning in a parallel computing environment, in particular on GPGPUs. Our new algorithm has several advantages. First, it enables us to keep all executions of the particle filtering (and learning) algorithm within the GPU, so that data transfer between the GPU's device memory and the CPU's host memory is minimized. Second, unlike stratified sampling or systematic sampling, our parallel sampling algorithm based on the cut-point method can resample particles exactly from their CDF. Lastly, since our algorithm does not utilize any device-specific functionalities, it is straightforward to apply it to a multiple-GPU system or a large grid computing system.

We then conducted a Monte Carlo experiment in order to compare our parallel algorithm with conventional sequential algorithms. In the experiment, our algorithm implemented on the GPU yields results far better than the conventional sequential algorithms on the CPU. Although we kept the SSM as simple as possible in the experiment, our parallel algorithm can also be applied to more complex models without any fundamental modifications to the programming code, and this small investment will return a significant gain in execution time.

Our fully parallelized particle filtering algorithm is beneficial for various applications that require estimating powerful but complex models in a shorter span of time, ranging from motion tracking technology to high-frequency trading. We even envision that one could perform real-time filtering of the state variables and the unknown parameters in a high-dimensional nonlinear non-Gaussian SSM on an affordable parallel computing system in a completely parallel manner. That would pave the way for a new era of computationally intensive data analysis.
References

[1] Carvalho, C. M., Johannes, A. M., Lopes, H. F., and Polson, N. G. (2010), "Particle Learning and Smoothing," Statistical Science, 25, 88-106.

[2] Chai, X. and Yang, Q. (2007), "Reducing the calibration effort for probabilistic indoor location estimation," IEEE Transactions on Mobile Computing, 6, 649-662.

[3] Chen, H. C. and Asau, Y. (1974), "On generating random variates from an empirical distribution," American Institute of Industrial Engineers (AIIE) Transactions, 6, 163-166.

[4] Douc, R., Cappe, O., and Moulines, E. (2005), "Comparison of Resampling Schemes for Particle Filtering," Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis.

[5] Doucet, A. and Johansen, A. M. (2011), "A tutorial on particle filtering and smoothing: fifteen years later," The Oxford Handbook of Nonlinear Filtering, Oxford University Press.

[6] Dukic, V. M., Lopes, H. F., and Polson, N. G. (2009), "Tracking flu epidemics using Google flu trends and particle learning," Working paper, Univ. of Chicago Booth School of Business.

[7] Durham, G. B. and Geweke, J. (2011), "Massively Parallel Sequential Monte Carlo for Bayesian Inference," Working paper.

[8] Fearnhead, P. (2002), "Markov chain Monte Carlo, sufficient statistics, and particle filters," Journal of Computational and Graphical Statistics, 11, 848-862.

[9] Fishman, G. S. (1996), Monte Carlo: Concepts, Algorithms, and Applications, Springer.

[10] Gordon, N. J., Salmond, D. J., and Smith, A. F. M. (1993), "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proceedings-F, 140, 107-113.

[11] Harvey, A. C., Ruiz, E., and Shephard, N. (1994), "Multivariate stochastic variance models," Review of Economic Studies, 61, 247-264.

[12] Hendeby, G., Karlsson, R., and Gustafsson, F. (2010), "Particle filtering: the need for speed," EURASIP Journal on Advances in Signal Processing, 22.

[13] Hendeby, G., Hol, J. D., Karlsson, R., and Gustafsson, F. (2007), "A graphics processing unit implementation of the particle filter," Proceedings of the 15th European Signal Processing Conference (EUSIPCO 07), 1639-1643, Poznan, Poland.

[14] Johannes, M. and Polson, N. G. (2008), "Exact particle filtering and learning," Working paper, Univ. of Chicago Booth School of Business.

[15] Johannes, M., Polson, N. G., and Yae, S. M. (2008), "Non-linear filtering and learning," Working paper, Univ. of Chicago Booth School of Business.

[16] Kalman, R. E. (1960), "A new approach to linear filtering and prediction problems," Transactions of the ASME, Journal of Basic Engineering, 82, 35-45.

[17] Kitagawa, G. (1996), "Monte Carlo filter and smoother for non-Gaussian nonlinear state space models," Journal of Computational and Graphical Statistics, 5, 1-25.

[18] Kitagawa, G. (1998), "A self-organizing state-space model," Journal of the American Statistical Association, 93, 1203-1215.

[19] Lee, A., Yau, C., Giles, M. B., Doucet, A., and Holmes, C. C. (2010), "On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods," Journal of Computational and Graphical Statistics, 19, 769-789.

[20] Liu, J. and West, M. (2001), "Combined parameter and state estimation in simulation-based filtering," in Sequential Monte Carlo Methods in Practice (A. Doucet, N. de Freitas, and N. Gordon, eds.), New York: Springer-Verlag, 197-223.

[21] Lopes, H. F. and Tsay, R. S. (2011), "Particle filters and Bayesian inference in financial econometrics," Journal of Forecasting, 30, 168-209.

[22] Maskell, S., Alun-Jones, B., and Macleod, M. (2006), "A single instruction multiple data particle filter," Proceedings of the Nonlinear Statistical Signal Processing Workshop (NSSPW 06), Cambridge, UK.

[23] Mihaylova, L., Angelova, D., Honary, S., Bull, D. R., Canagarajah, C. N., and Ristic, B. (2007), "Mobility tracking in cellular networks using particle filtering," IEEE Transactions on Wireless Communications, 6, 3589-3599.

[24] Montemayor, A. S., Pantrigo, J. J., Sanchez, A., and Fernandez, F. (2004), "Particle filter on GPUs for real time tracking," Proceedings of the International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 04), 94, Los Angeles, CA, USA.

[25] Montemerlo, M., Thrun, S., Koller, D., and Wegbreit, B. (2003), "FastSLAM 2.0: an improved particle filtering algorithm for simultaneous localization and mapping that provably converges," Proceedings of the 18th International Joint Conference on Artificial Intelligence, 1151-1157, Acapulco, Mexico.

[26] Pitt, M. and Shephard, N. (1999), "Filtering via simulation: Auxiliary particle filters," Journal of the American Statistical Association, 94, 590-599.

[27] Polson, N. G., Stroud, J., and Muller, P. (2008), "Practical filtering with sequential parameter learning," Journal of the Royal Statistical Society, Series B, 70, 413-428.

[28] Storvik, G. (2002), "Particle filters in state space models with the presence of unknown static parameters," IEEE Transactions on Signal Processing, 50, 281-289.

[29] West, M. and Harrison, J. (1997), Bayesian Forecasting and Dynamic Models, 2nd ed., Springer.

[30] Zou, Y. and Chakrabarty, K. (2007), "Distributed mobility management for target tracking in mobile sensor networks," IEEE Transactions on Mobile Computing, 6, 872-887.