Krishna 2021
ABSTRACT Particle filtering is very reliable in modelling non-Gaussian and non-linear elements of physical
systems, which makes it ideal for tracking and localization applications. However, a major drawback of
particle filters is their computational complexity, which inhibits their use in real-time applications with
conventional CPU or DSP based implementation schemes. The re-sampling step in the particle filters
creates a computational bottleneck since it is inherently sequential and cannot be parallelized. This paper
proposes a modification to the existing particle filter algorithm, which enables parallel re-sampling and
reduces the effect of the re-sampling bottleneck. We then present a high-speed and dedicated hardware
architecture incorporating pipelining and parallelization design strategies to supplement the modified
algorithm and lower the execution time considerably. From an application standpoint, we propose a novel
source localization model to estimate the position of a source in a noisy environment using the particle filter
algorithm implemented on hardware. The design has been prototyped on an Artix-7 field-programmable
gate array (FPGA), and resource utilization for the proposed system is presented. Further, we show the
execution time and estimation accuracy of the high-speed architecture and observe a significant reduction
in computational time. Our implementation of particle filters on FPGA is scalable and modular, with a low
execution time of about 5.62 µs for processing 1024 particles (compared to 64 ms on Intel Core i7-7700 CPU
with eight cores clocking at 3.60 GHz) and can be deployed for real-time applications.
INDEX TERMS Particle filters, field programmable gate array, bearings-only tracking, Bayesian filtering,
unmanned ground vehicle, hardware architectures, real-time processing.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 9, 2021 98185
A. Krishna et al.: FPGA Implementation of PFs for Robotic Source Localization
the state [5]. In most real-time scenarios, these models are non-linear and non-Gaussian. Traditional filters like Kalman filters prove to be less reliable for such applications, and it has been shown that PFs outperform conventional filters in such scenarios [6].

PFs are inherently Bayesian in nature, intending to construct a posterior density of the state (e.g., the location of a target or source) from observed noisy measurements. In PFs, the posterior of the state is represented by a set of weighted random samples known as particles. A weighted average of the samples gives the state estimate (location of the source). PFs use three major steps: Sampling, Importance, and Re-sampling for state estimation, thus deriving the name SIR filter. In the sampling step, particles from the prior distribution are drawn. The importance step is used to update the weights of particles based on input measurements. The re-sampling step prevents any weight degeneracy by discarding particles with lower weights and replicating particles having higher weights. Since PFs apply a recursive Bayesian calculation, all particles must be processed for sampling, importance, and re-sampling steps. Then, the process is repeated for the next input measurement, resulting in enormous computational complexity. Further, the execution time of PFs is proportional to the number of particles, which inhibits the use of PFs in various real-time applications wherein a large number of particles need to be processed to obtain a good performance. Several implementation strategies have been proposed in the literature to address this issue and make PFs feasible in real-time applications; these are discussed in Section II.

A. OUR CONTRIBUTIONS
The contributions of this paper are on algorithmic and hardware fronts:

1) ALGORITHMIC CONTRIBUTION
We propose a novel source localization model employing a light source as the target/source to be localized and a UGV carrying an array of photodiodes to sense and localize the source. Photodiode measurements and the UGV position are processed to estimate the bearing of the light source relative to the UGV. Based on the bearing of the light source, we try to localize the source using the PF algorithm. Reflective objects and other stray light sources are also picked up by the sensor (photodiodes), leading to false detections. In this study, we have successfully demonstrated that our PF system is robust to noise and can localize the source even when the environment is noisy. We introduce two parameters α and β to model the sensor imperfections and background noise activity, respectively. However, the PFs are computationally very expensive, and the execution time often becomes unrealistic using a traditional CPU-based platform. The primary issue faced during the design of high-speed PF architecture is the parallelization of the re-sampling step. The re-sampling step is inherently not parallelizable as it needs the information of all particles. We propose a modification to the standard SIR filter (cf. Algorithm 1) to address this problem and make a parallel and high-speed implementation possible. The modified algorithm proposed (cf. Algorithm 2) uses a network of smaller filters termed sub-filters, each processing independently and concurrently. The processing of total N particles is partitioned into K sub-filters so that at most N/K particles are processed within a sub-filter. This method reduces the overall computation time by a factor of K. The modified algorithm also introduces an additional particle routing step (cf. Algorithm 2), which distributes the particles among the sub-filters and makes the parallel implementation of re-sampling possible. The particle routing step is integrated with the sampling step in the architecture proposed and does not require any additional time for computation. We also compare the estimation accuracy of the standard algorithm with the modified algorithm in Section VIII-C and infer that the modified and the standard approaches do not vary significantly in terms of estimation error. Additionally, the modified algorithm achieves a very low execution time of about 5.62 µs when implemented on FPGA, compared to 64 ms on Intel Core i7-7700 CPU with eight cores clocking at 3.60 GHz for processing 1024 particles and outperforms other state-of-the-art FPGA implementation techniques.

2) HARDWARE CONTRIBUTION
We implemented the modified SIR algorithm on an FPGA and key features of the proposed architecture are:
• Modularity: We divide the overall computation into multiple sub-filters, which process a fixed number of particles in parallel, and the processing of the particles is local to the sub-filter. This modular approach makes the design adaptable and straightforward as it allows us to customize the number of sub-filters in the design depending on the sampling rate of input measurement and the amount of parallelism needed.
• Scalability: Our architecture can be scaled easily to process a large number of particles without increasing the execution time by using additional sub-filters.
• Design complexity: The proposed architecture relies on the exchange of particles between sub-filters. However, communication and design complexity increase proportionally with the number of sub-filters used in the design. In our architecture, we employ a simple ring topology to exchange particles between sub-filters to reduce complexity and design time.
• Memory utilization: The sampling step uses particles from the previous time instant to estimate the particles of the present time instant. This requires the sampled and re-sampled particles to be cached in two separate memories. The straightforward implementation of the modified SIR algorithm needs 2 × K memory elements each of depth M for storing the sampled and re-sampled particles for K sub-filters. Here, M is the number of particles in a sub-filter (M = N/K). However, applications involving non-linear models require a large number of particles [7]. This would make the
total memory requirement 2 × K significant for large K or M. The proposed architecture reduces this memory requirement to K memory elements each of depth M using a dual-port RAM, as explained in Section VII-A1. Therefore, the proposed architecture lowers memory utilization, and reduced memory access makes it more energy-efficient.
• Real-time: Since all sub-filters operate in parallel, the execution time is significantly reduced compared to that of other traditional implementation schemes that use just one filter block [8]. Our implementation has a very low execution time of about 5.62 µs (i.e., a sampling rate of 178 kHz) for processing 1024 particles and outperforms most state-of-the-art implementations, allowing real-time deployment.
• Flexibility: The proposed architecture is not limited to a single application, and the design can be easily modified by making slight changes to the architecture for other PF applications.

The architecture was successfully implemented on the Artix-7 FPGA and the experimental results show its efficacy in source localization.

The rest of the paper is organized as follows: We provide the theory behind Bayesian filtering and PFs in Sections III and IV, respectively. An experimental setup for the proposed source localization model using a Bearings-only tracking (BOT) framework is presented in Section V. In this framework, input to the filter is a time-varying angle (bearing) of the source, and each input is processed by the PF algorithm implemented on hardware to estimate the source location. Further, in Section VI, we propose algorithmic modifications to the existing PF algorithm that make the high-speed implementation possible. The architecture for implementing PFs on hardware is provided in Section VII. Evaluation of resource utilization on the Artix-7 FPGA, performance analysis in terms of execution time, estimation accuracy, and the experimental results are provided in Section VIII.

II. STATE-OF-THE-ART
The first hardware prototype for PFs was proposed by Athalye et al. [8] by implementing a standard SIR filter on an FPGA. They provided a generic hardware framework for realizing SIR filters and implemented traditional PFs without parallelization on FPGA. As an extension to [8], Bolić et al. [9] suggested a theoretical framework for parallelizing the re-sampling step by proposing distributed algorithms called Re-sampling with Proportional Allocation (RPA) and Re-sampling with Non-proportional Allocation (RNA) of particles to minimize execution time. The design complexity of RPA is significantly higher than that of RNA due to non-deterministic routing and complex routing protocol. Though the RNA solution is preferred over RPA for high-speed implementations with low design time, the RNA algorithm trades performance for speed improvement. Agrawal et al. [10] proposed an FPGA implementation of a PF algorithm for object tracking in video. Ye and Zhang [11] implemented a SIR filter on the Xilinx Virtex-5 FPGA for bearings-only tracking applications. Sileshi et al. [12]–[14] suggested two methods for implementation of PFs on an FPGA: the first method is a hardware/software co-design approach for implementing PFs using MicroBlaze soft-core processor, and the second approach is a full hardware design to reduce execution time. Velmurugan [15] proposed an FPGA implementation of a PF algorithm for tracking applications without any parallelization using the Xilinx system generator tool. Schwiegelshohn et al. [16] proposed the FPGA optimized re-sampling (FO-resampling) to parallelize the re-sampling step by introducing virtual particles. A fixed number of virtual particles are generated around every real particle, and if the importance factor (weight) of the real particle is less than the virtual particle, then it gets replaced. Otherwise, the same real particles are propagated in the next iteration. However, the resource utilization of their architecture is substantially higher compared to the conventional PF algorithms. Mountney et al. [17] proposed a modular PF architecture for Brain Machine Interfaces (BMI). Their architecture introduces multiple particle processors to parallelize the state vector and likelihood estimations. Although the state vector estimation and likelihood computations are parallelized, the re-sampling step is done sequentially, which is the major drawback of the architecture. Recently, Alam and Gustafsson [18] proposed an improved re-sampling architecture by introducing a weight pre-fetch mechanism to reduce the latency of the re-sampling step. In this technique, new particle weights are pre-fetched along with the random values concurrently, which helps in reducing the total number of cycles for re-sampling. Pre-fetching parameters, on the other hand, necessitates the use of additional buffers to store the pre-fetched data, resulting in increased area and power consumption. Miao et al. [19] proposed a parallel implementation scheme for PFs using multiple processing elements (PEs) and a central unit (CU) to reduce the execution time. PE performs sampling and weight update, while CU performs re-sampling. The communication overhead between the PE and the CU, on the other hand, grows linearly with the number of PEs, rendering the design unscalable for large-scale particle processing. In other work, Velmurugan et al. [20] took an analog approach to implement PF with low-power consumption. Their implementation utilizes a minimum number of data converters to reduce both area and power. However, owing to the analog mixed mode implementation, their architecture is not scalable, and verification of the design is difficult compared to the digital counterparts due to lack of standard design and test flows in large analog implementation.

Further, several real-time software-based implementation schemes have been proposed with the intent to reduce computational time. Hendeby et al. [21], [22] proposed the first Graphical Processing Unit (GPU) based PFs, demonstrating that the GPU-based architecture outperforms the CPU-based implementation in terms of processing speed. Murray et al. [27] provided an analysis of two alternative
schemes for the re-sampling step based on Metropolis and Rejection resamplers to reduce the overall execution time. They compared it with standard Systematic resamplers [28] over GPU and CPU platforms. Chitchian et al. [23] devised an algorithm for implementing a distributed computation PF on GPU for fast real-time control applications. Zhang et al. [29] suggested an architecture for efficiently implementing PFs on a DSP for wireless network tracking applications. Gong et al. [24] present a shared-memory systematic re-sampling (SMSR) algorithm to parallelize the re-sampling step on a GPU. Their algorithm is very challenging to implement on an FPGA due to the use of shared memory architecture and parallel scan step to obtain the prefix sum. Furthermore, they do not present any architecture for implementing the algorithm on hardware. Choppala et al. [30] introduced a random network as a fixed re-sampling unit in PF. This network assigns each particle a predetermined set of other particles with which it will interact, and the re-sampler randomly selects one particle from the set. However, they do not show the hardware feasibility of the proposed network on FPGA. Par and Tosun [25] present a parallel implementation of PF algorithm based on both multi-core processors and on a GPU using Compute Unified Device Architecture (CUDA). Their performance analysis
shows that up to 75x speedup can be achieved on a 512-core GPU over sequential implementation. Kim et al. [26] implemented PF on a GPU for target position estimation and parallelized the calculation process utilizing multiple GPU cores. The proposed algorithm was simulated on a CPU in MATLAB and then verified on GPU, resulting in a 55% reduction in execution time. However, they do not show the hardware feasibility. In addition, these software-based methods have their own drawbacks when it comes to hardware implementation owing to their high computational complexity. Therefore, it is essential to develop a high-speed and dedicated hardware design with the capacity to process a large number of particles in specified time to meet the speed demands of real-time applications. This paper addresses this issue by proposing a high-speed architecture that is massively parallel and easily scalable to handle a large number of particles. The benefits of the proposed architecture are summarized in Section I-A2.

III. BAYESIAN FRAMEWORK
The evolution of the state sequence $x_t$ in a dynamic state space model is characterised by:

$$x_t = f_t(x_{t-1}, w_t) \tag{1}$$

where $f_t$ is a nonlinear function of the state $x_{t-1}$, and $w_t$ represents the process noise. The objective is to recursively estimate the state $x_t$ based on a measurement defined by:

Constructing the posterior based on Bayes rule is a conceptual solution and is analytically estimated using traditional Kalman filters. However, in a non-Gaussian and non-linear setting, the analytic solution is intractable, and approximation-based methods such as PFs are employed to find an approximate Bayesian solution. A detailed illustration of the Bayesian framework and its implementation for estimating the state of a system is provided by [31], [32].

IV. PARTICLE FILTERS BACKGROUND
The core principle behind PFs is to represent the required posterior density with a collection of random samples called particles, each with its own weight, and then calculate the state estimate using these particles and weights. The particles and their weights are represented by $\{x_t^i, w_t^i\}_{i=1}^{N}$, where $N$ is the total number of particles. $x_t^i$ denotes the $i$th particle at time instant $t$. $w_t^i$ represents the weight corresponding to the particle $x_t^i$. The variant of PF called sampling, importance, and re-sampling filter (SIRF) is presented in Algorithm 1.

Algorithm 1 SIR Algorithm
Initialization: Set the particle weights of the previous time step to 1/N, $\{w_{t-1}^i\}_{i=1}^{N} = 1/N$.
Input: Particles from previous time step $\{x_{t-1}^i\}_{i=1}^{N}$ and measurement $z_t$.
Output: Particles of current time step $\{\hat{x}_t^i\}_{i=1}^{N}$.
Method:
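The three steps of a generic SIR iteration (sampling, importance weighting, re-sampling) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the systematic re-sampler, the `propagate` and `likelihood` callables, and all function names are hypothetical choices made for illustration.

```python
import math
import random


def systematic_resample(weights, rng):
    """Return particle indices drawn by systematic re-sampling.

    `weights` must be normalized (sum to 1). The choice of a systematic
    re-sampler is an assumption of this sketch.
    """
    n = len(weights)
    start = rng.random() / n          # single random offset in [0, 1/n)
    indices, cumulative, j = [], weights[0], 0
    for i in range(n):
        u = start + i / n             # evenly spaced pointers
        while u > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j]
        indices.append(j)
    return indices


def sir_step(particles, z, propagate, likelihood, rng):
    """One SIR iteration; returns the re-sampled particles."""
    n = len(particles)
    # Sampling: draw each new particle from the state transition model.
    sampled = [propagate(x, rng) for x in particles]
    # Importance: previous weights are 1/N, so w = p(z | x) / N.
    weights = [likelihood(z, x) / n for x in sampled]
    total = sum(weights)
    if total <= 0.0:                  # degenerate case: fall back to uniform
        weights = [1.0 / n] * n
    else:
        weights = [w / total for w in weights]
    # Re-sampling: replicate heavy particles, discard light ones.
    return [sampled[j] for j in systematic_resample(weights, rng)]


if __name__ == "__main__":
    rng = random.Random(1)
    cloud = [0.0] * 200               # toy 1-D state, hypothetical values
    cloud = sir_step(
        cloud, 1.0,
        lambda x, r: x + r.gauss(0.0, 1.0),
        lambda z, x: math.exp(-0.5 * (z - x) ** 2),
        rng,
    )
    print(sum(cloud) / len(cloud))    # sample mean moves towards the measurement
```

Because the re-sampler needs the cumulative weights of all N particles, this step is the part that resists parallelization in the standard algorithm.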
FIGURE 1. UGV Design. (a) Schematic of the UGV with a photodiode housing mounted on top. (b) The region around the
UGV is divided into 8 sectors with 45◦ angular separation.
weights are eliminated, and particles with higher weights are replicated to compensate for the discarded particles depending on the weight $w_t^i$ associated with the particle $x_t^i$. The re-sampled set of particles is denoted by $\{\hat{x}_t^i\}_{i=1}^{N}$.

V. SOURCE LOCALIZATION MODEL
This section gives an overview of the experimental setup and measurement model relevant to the source localization.

A. OVERVIEW OF THE EXPERIMENTAL SETUP
In our source localization model, an omnidirectional light source serves as a source to be localized. A photodiode housing mounted on top of the UGV (cf. Fig. 1(a)) constitutes a sensor to measure the relative intensity of light in a horizontal plane. The space around the UGV is divided into 8 sectors with 45° angular separation, as shown in Fig. 1(b), and an array of 8 photodiodes is placed inside the circular housing to sense the light source in all directions. The housing confines the angle of exposure of the photodiode to 45°. Depending on the light incident on each photodiode, we consider the output of the photodiode to be either 0 or 1.

The PF algorithm applied to the BOT model requires dynamic motion between the sensor and source [33]. In our experimental configuration, we have a stationary source and a moving sensor mounted on the UGV. The UGV is made to traverse in the direction of the source and eventually converges at the source location. Reflective sources and other stray light sources are potential sources of noise picked up by the sensor, producing false detections. A target-originated measurement, along with noise, is sensed by the photodiodes and processed in addition to the UGV position data to measure the light source's bearing with respect to the UGV. Based on the bearing of the light source, we try to estimate its position using the PF algorithm.

B. MEASUREMENT MODEL
The position of UGV ($x_t^{\mathrm{UGV}}$) at time instant $t$ is defined by the Cartesian co-ordinate system:

$$x_t^{\mathrm{UGV}} = [X_t^{\mathrm{UGV}}, Y_t^{\mathrm{UGV}}]$$

The orientation of the longitudinal axis of UGV is represented by $\phi_t^{\mathrm{UGV}}$, which gives its true bearing.

The source is considered to be stationary, and its co-ordinates in the 2-dimensional setting are given by:

$$x_t = [X_t, Y_t] \tag{7}$$

At time instant $t$, a set of 8 photodiode measurements are captured, $z_t = \{z_t^1, z_t^2, \cdots, z_t^8\}$, which comprise the target-associated measurement and clutter noise. Then, based on the measurement model (2), the source-associated measurement can be modelled as:

$$z_t = g(x_t) + v_t \tag{8}$$

Since the measurement gives the bearing information of the source, we have:

$$g(x_t) = \tan^{-1}\left(\frac{Y_t - Y_t^{\mathrm{UGV}}}{X_t - X_t^{\mathrm{UGV}}}\right) \tag{9}$$

The four-quadrant inverse tangent function evaluated from $[0, 2\pi)$ gives the true bearing of the source.

The relevant probabilities needed to model the sensor imperfections and clutter noise are as follows:
(i) The probability of clutter noise ($n_t$) produced by a stray or reflective light source is: $p(n_t) = \beta$.
(ii) The probability of the $j$th photodiode output being 1, i.e., ($z_t^j = 1$), either due to the light source or clutter noise is: $p(z_t^j \mid x_t, n_t) = \alpha$.
(iii) If there is a light source in the sector $j$, then the $j$th photodiode output will be 1 with a probability of $\alpha$ irrespective of noise. The likelihood of photodiode output being 1 or 0 in the presence of the source is:

$$p(z_t^j \mid x_t) = \begin{cases} \alpha, & \text{for } z_t^j = 1 \\ 1 - \alpha, & \text{for } z_t^j = 0 \end{cases} \tag{10}$$

(iv) If there is no source in sector $j$, then there is a noise source with probability $\beta$. The likelihood of photodiode output being 1 or 0 in the absence of the source is:

$$p(z_t^j \mid \tilde{x}_t) = \begin{cases} \alpha\beta, & \text{for } z_t^j = 1 \\ 1 - \alpha\beta, & \text{for } z_t^j = 0 \end{cases} \tag{11}$$
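As an illustration of Eqs. (9) to (11), the following Python sketch computes the bearing, maps it to one of the eight sectors, and evaluates the photodiode likelihoods. Several details are assumptions of this sketch rather than statements from the paper: sectors are numbered 0 to 7 counter-clockwise starting at the UGV heading, the eight photodiode outputs are treated as independent so that their likelihoods multiply, and the values of α and β in the usage example are arbitrary.

```python
import math

NUM_SECTORS = 8
SECTOR_WIDTH = 2 * math.pi / NUM_SECTORS  # 45 degrees per sector


def bearing(source, ugv):
    """Eq. (9): four-quadrant bearing of `source` seen from `ugv`, in [0, 2*pi)."""
    return math.atan2(source[1] - ugv[1], source[0] - ugv[0]) % (2 * math.pi)


def sector_index(source, ugv, heading=0.0):
    """Sector (0..7) containing the source, measured from the UGV heading.

    The numbering convention is an assumption of this sketch.
    """
    relative = (bearing(source, ugv) - heading) % (2 * math.pi)
    return int(relative // SECTOR_WIDTH) % NUM_SECTORS


def diode_likelihood(z_j, source_in_sector, alpha, beta):
    """Eqs. (10) and (11): likelihood of one binary photodiode output."""
    p_one = alpha if source_in_sector else alpha * beta
    return p_one if z_j == 1 else 1.0 - p_one


def measurement_likelihood(z, source, ugv, heading, alpha, beta):
    """Joint likelihood of all eight outputs, assuming they are independent."""
    j_src = sector_index(source, ugv, heading)
    result = 1.0
    for j, z_j in enumerate(z):
        result *= diode_likelihood(z_j, j == j_src, alpha, beta)
    return result


if __name__ == "__main__":
    z = [0, 1, 0, 0, 0, 0, 0, 0]      # only photodiode 1 fires
    # Hypothetical parameter values: alpha = 0.9, beta = 0.2.
    print(measurement_likelihood(z, (1.0, 2.0), (0.0, 0.0), 0.0, 0.9, 0.2))
```

A candidate position whose sector matches the firing photodiode receives a much larger likelihood than one in a silent sector, which is what lets the filter concentrate particles around the source even when clutter triggers other photodiodes.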
These two likelihoods are used in our system to model the sensor imperfections and noise, and even with high noise probability β, the PF algorithm is robust enough to localize the source.

VI. ALGORITHMIC MODIFICATION OF SIRF FOR REALIZING HIGH-SPEED ARCHITECTURE
In this section, we suggest modifications to the standard SIR algorithm to make it parallelizable. The key idea of high-speed architecture is to utilize multiple parallel filters, termed sub-filters, working simultaneously and performing sampling, importance, and re-sampling operations independently on particles. The architecture utilizes K sub-filters in parallel to process a total of N particles. Thus, the number of particles processed within each sub-filter is M = N/K. In comparison to traditional filters, the number of particles processed inside each sub-filter is reduced by a factor of K.

The sampling and importance steps are inherently parallelizable since there is no data dependency for the particle generation and weight calculation. However, the re-sampling step cannot be parallelized as it needs to have the information of all particles. This creates a major bottleneck in the parallel implementation scheme. Thus, in addition to the SIR stage, we introduce a particle routing step, as shown in Algorithm 2, to route particles between sub-filters. Our empirical analysis shows that the particle routing step enables the distribution of particles among sub-filters, and the re-sampling step can be effectively parallelized. Section VIII-C shows that there is no substantial variation in the estimation error between the proposed modified SIR algorithm and the conventional algorithm. An algorithmic flowchart is shown in Fig. 2.

Algorithm 2 High-Level Description of Each Sub-Filter k Performing SIR and Particle Routing Operations
Initialization: Set the particle weights of previous time step to 1/M, $\{w_{t-1}^{(k,i)}\}_{i=1}^{M} = 1/M$.
Input: Particles from previous time step $\{x_{t-1}^{(k,i)}\}_{i=1}^{M}$ and measurement $z_t$
Output: Particles of current time step $\{\hat{x}_t^{(k,i)}\}_{i=1}^{M}$
Method:
1: Particle Routing: Exchange Q particles with neighbouring sub-filters.
2: $\{x_{t-1}^{(k,q)}\}_{q=1}^{Q} \leftarrow \{x_{t-1}^{(k-1,q)}\}_{q=1}^{Q}$ for $k = 2, \cdots, K$, and
3: $\{x_{t-1}^{(k,q)}\}_{q=1}^{Q} \leftarrow \{x_{t-1}^{(K,q)}\}_{q=1}^{Q}$ for $k = 1$.
4: Sampling and Importance:
5: for i = 1 to M do
6: Sample $x_t^{(k,i)} \sim p(x_t \mid x_{t-1}^{(k,i)})$
7: Calculate $w_t^{(k,i)} = w_{t-1}^{(k,i)} \, p(z_t \mid x_t^{(k,i)})$
8: end for
9: Re-sampling: Compute the re-sampled particles $\{\hat{x}_t^{(k,i)}\}_{i=1}^{M}$ from $\{x_t^{(k,i)}, w_t^{(k,i)}\}_{i=1}^{M}$.

The particles and their associated weights in sub-filter k at time step t are represented by $\{x_t^{(k,i)}, w_t^{(k,i)}\}_{i=1}^{M}$, for $k = 1, \cdots, K$. The particle $x_t^{(k,i)}$ represents the position in the Cartesian co-ordinate system.

VII. ARCHITECTURE OVERVIEW
In this section, we present a high-speed architecture for PFs, based on the modified SIR algorithm presented in Section VI.

The top-level architecture shown in Fig. 3 utilizes a filter bank consisting of K sub-filters working in parallel. Sampling, importance, and re-sampling operations are carried out within a sub-filter. In addition to the SIR step, a fixed number of particles are routed between sub-filters after the completion of every iteration as part of the particle routing operation. The sub-filters are connected in a ring topology inside the filter bank. M particles are time-multiplexed and processed within each sub-filter, and Q = M/2 particles are exchanged with neighbouring sub-filters. Since the number of particles exchanged and the routing topology are fixed, the proposed architecture has very low design complexity. The design can be easily scaled up to process a large number of particles (N) by replicating sub-filters. The binary measurements of the eight photodiodes ($z_t$) are fed as an input to the filter bank along with the true bearing ($\phi_t^{\mathrm{UGV}}$) and the position of the UGV ($x_t^{\mathrm{UGV}}$). Random number generation needed for the sampling and re-sampling steps is provided by a random number generator block. We use a parallel multiple output LFSR architecture presented by Milovanović et al. [34] for random number generation. A 16-bit LFSR is used since our internal variables are 16 bits wide. Further, a detailed description of the sub-filter architecture is provided in Section VII-A. The sector check block, described in Section VII-B, computes the particle population in each of the eight sectors and outputs the index of the sector that has the maximum particle population. This information is used by the UGV to traverse in the direction of the source. The mean computation block, used to calculate the global mean of all N particles from K sub-filters to estimate the source location ($pos_t$), is explained in Section VII-C.

A. SUB-FILTER ARCHITECTURE
The sub-filter is the main computational block responsible for particle generation, processing, and filtering. It consists of three main sub-modules, namely, sampling, importance and re-sampling, as shown in Fig. 4. The sampling and importance blocks are pipelined in operation. The re-sampling step cannot be pipelined with the former steps as it requires weight information of all particles. Thus, it is started after the completion of the importance step. Since sampling and importance stages are pipelined, together they take M clock cycles to iterate for M particles, as shown in Algorithm 2 from line 5 to line 8. The particle routing between the sub-filters is done along with the sampling step and does not require any additional cycles. The re-sampling step takes 3M clock cycles, as discussed in Section VII-A3.
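The sub-filter organisation of Algorithm 2 can be mimicked in software as a sanity check. The Python sketch below is a behavioural model under stated assumptions, not the hardware design: the K sub-filters are iterated in a plain loop rather than run concurrently, the local re-sampler is a multinomial draw chosen for brevity, and `propagate`, `likelihood` and the function names are hypothetical.

```python
import random


def route_ring(banks, q):
    """Lines 1-3 of Algorithm 2: sub-filter k receives the first q particles
    of sub-filter k-1; the first sub-filter receives from the last one."""
    heads = [bank[:q] for bank in banks]      # snapshot before overwriting
    for k in range(len(banks)):
        banks[k][:q] = heads[k - 1]           # index -1 closes the ring
    return banks


def resample_local(particles, weights, rng):
    """Re-sample M particles inside one sub-filter.

    A multinomial draw is assumed here for brevity.
    """
    if sum(weights) <= 0.0:                   # degenerate weights: keep as-is
        return list(particles)
    return rng.choices(particles, weights=weights, k=len(particles))


def subfilter_bank_step(banks, z, propagate, likelihood, rng):
    """One iteration over all K sub-filters of M particles each.

    Each pass of the loop touches only its own bank, so the K passes are
    independent and could execute in parallel.
    """
    m = len(banks[0])
    route_ring(banks, m // 2)                 # Q = M/2 particles exchanged
    for k, bank in enumerate(banks):
        sampled = [propagate(x, rng) for x in bank]           # line 6
        weights = [likelihood(z, x) / m for x in sampled]     # line 7, w = 1/M prior
        banks[k] = resample_local(sampled, weights, rng)      # line 9
    return banks


if __name__ == "__main__":
    rng = random.Random(0)
    banks = [[float(k)] * 8 for k in range(4)]                # K = 4, M = 8
    subfilter_bank_step(
        banks, 2.0,
        lambda x, r: x + r.gauss(0.0, 0.5),
        lambda z, x: 1.0 / (1.0 + (z - x) ** 2),
        rng,
    )
    print([round(sum(b) / len(b), 2) for b in banks])
```

With the cycle counts quoted above (M cycles for the pipelined sampling and importance stages, 3M for re-sampling), one iteration of a sub-filter would take roughly 4M clock cycles if pipeline latency is ignored, a figure that depends on M = N/K rather than on N alone.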
FIGURE 2. Flowchart illustrating the sequence of operations carried out incorporating the modified SIR
algorithm. T represents the total time steps for localizing the source.
1) SAMPLING AND ROUTING
The sampling step involves generating new sampled particles $\{x_t^{(k,i)}\}_{i=1}^{M}$ by propagating re-sampled particles $\{\hat{x}_{t-1}^{(k,i)}\}_{i=1}^{M}$ from the previous time step using the dynamic state space model:

$$x_t^{(k,i)} \sim p(x_t \mid \hat{x}_{t-1}^{(k,i)}) \tag{12}$$

Conventionally, particles $\{x_t^{(k,i)}\}_{i=1}^{M}$ are used to generate the weights $\{w_t^{(k,i)}\}_{i=1}^{M}$ in the importance unit, and using these weights we determine the re-sampled particles $\{\hat{x}_t^{(k,i)}\}_{i=1}^{M}$. Further, $\{\hat{x}_t^{(k,i)}\}_{i=1}^{M}$ is utilized to obtain particles $\{x_{t+1}^{(k,i)}\}_{i=1}^{M}$ of the next time step. Thus, with the straightforward approach, we would need two memories each of depth M to store $\{x_t^{(k,i)}\}_{i=1}^{M}$ and $\{\hat{x}_t^{(k,i)}\}_{i=1}^{M}$ within a sub-filter. Similarly, for K sub-filters we would require 2 × K memory elements, each of depth M. This increases memory usage for higher K or M. In this work, we suggest a novel scheme to store the particles using a single dual-port memory instead of two memory blocks, which brings down the total memory requirement for storing particles to K memory elements, each of depth M.
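The single-memory idea can be explored with a small behavioural model in Python. It is only a sketch under stated assumptions: the re-sampler is taken to return index lists instead of particles (replicated indices as read addresses, discarded indices as free slots), repeated read addresses are assumed to be adjacent, slots are 0-based, and a one-entry holding register stands in for whatever buffering the hardware uses. The `propagate` callable and both function names are hypothetical.

```python
def propagate_naive(memory, ind_r, ind_d, propagate):
    """Read and write the same memory with no protection.

    A slot that was already overwritten is read again when its index
    repeats, so the repeated reads see the wrong (already propagated) value.
    """
    discarded = iter(ind_d)
    prev = None
    for r in ind_r:
        w = next(discarded) if r == prev else r   # repeats go to free slots
        memory[w] = propagate(memory[r])
        prev = r
    return memory


def propagate_in_place(memory, ind_r, ind_d, propagate):
    """Same traversal, but the first read of each index is latched in a
    register and repeated indices are served from that register."""
    discarded = iter(ind_d)
    register = None
    prev = None
    for r in ind_r:
        if r == prev:                 # repeated read address
            source = register         # use the latched original particle
            w = next(discarded)       # write into a discarded slot
        else:
            source = memory[r]        # first read comes from memory
            register = source
            w = r                     # write back to the same slot
        memory[w] = propagate(source)
        prev = r
    return memory


if __name__ == "__main__":
    # Six particles; slot 1 is replicated four times and slot 4 twice,
    # slots 0, 2, 3 and 5 are discarded (0-based indices).
    particles = [10, 20, 30, 40, 50, 60]
    step = lambda x: x + 1            # stand-in for the sampling block
    print(propagate_naive(list(particles), [1, 1, 1, 1, 4, 4], [0, 2, 3, 5], step))
    print(propagate_in_place(list(particles), [1, 1, 1, 1, 4, 4], [0, 2, 3, 5], step))
```

In the naive variant the copies of a replicated particle are propagated from an already updated value, while the register variant propagates every copy from the original; this is the reason a plain shared memory is not sufficient on its own.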
In this scheme, since the re-sampled particles are actually the subset of sampled particles (i.e., $\{\hat{x}_t^{(k,i)}\}_{i=1}^{M} \subset \{x_t^{(k,i)}\}_{i=1}^{M}$), instead of storing $\{\hat{x}_t^{(k,i)}\}_{i=1}^{M}$ in a different memory, we can use the same memory as $\{x_t^{(k,i)}\}_{i=1}^{M}$ and use suitable pointers or indices to read $\{\hat{x}_t^{(k,i)}\}_{i=1}^{M}$.

The re-sampling unit in our case is modified such that instead of returning re-sampled particles $\hat{x}_{t-1}^{(k,i)}$, it returns the indices of replicated ($Ind_R^{(k,i)}$) and discarded ($Ind_D^{(k,i)}$) particles (cf. Fig. 4). $Ind_R^{(k,i)}$ is used as a read address of the dual-port particle memory shown in Fig. 5 to point to the re-sampled particles $\hat{x}_{t-1}^{(k,i)}$. The dual-port memory enables us to perform read and write operations simultaneously; however, this might result in data overwriting. For example, consider six particles, after re-sampling particle 2 ($x_{t-1}^{(k,2)}$) is replicated four times; particle 5 ($x_{t-1}^{(k,5)}$) is replicated two times and particles 1, 3, 4 & 6 are discarded. The re-sampling unit returns $Ind_R$ = (2, 2, 2, 2, 5, 5) and $Ind_D$ = (1, 3, 4, 6). The read sequence of the dual-port memory is (2, 2, 2, 2, 5 & 5) and the write sequence is (2, 1, 3, 4, 5 & 6). Initially, particle 2 ($x_{t-1}^{(k,2)}$) is read from the dual-port memory and after propagation in the sampling block, the sampled particle $x_t^{(k,i)}$ is written back to the memory location 2. Next, particle 2 is read again from memory location 2. However, this time the content of the location is changed, and it no longer holds the original particle $x_{t-1}^{(k,2)}$, which causes an error while reading. In order to avoid this scenario, we introduce a sub-block (a) (cf. Fig. 5), wherein when we read the particle from the memory for the first time, it is temporarily stored in a register. Hence, whenever there is a replication in $Ind_R$ or read address, we read the particle from the register instead of memory. The Rep signal is generated by comparing $Ind_R$
with its previous value and if both are the same, Rep will be made high.

Further, we introduce a sub-block (b) (cf. Fig. 5), which is responsible for routing the particles between neighbouring sub-filters. Out of M particles read from the particle memory of sub-filter k, the first M/2 particles, i.e., $\{x_{t-1}^{(k,q)}\}_{q=1}^{M/2}$, are sent to sub-filter k + 1, and simultaneously the first M/2 particles, i.e., $\{x_{t-1}^{(k-1,q)}\}_{q=1}^{M/2}$, of sub-filter k − 1 are read and fed to the sampling block of sub-filter k. The sampling block propagates the particles from time step t − 1 to time step t. The routed particles from sub-filter k − 1, $\{x_{t-1}^{(k-1,q)}\}_{q=1}^{M/2}$, and the last M/2 local particles $\{x_{t-1}^{(k,q)}\}_{q=M/2+1}^{M}$ read from the particle memory of sub-filter k are propagated by the sampling block and written back to the memory. The inputs to the sampling block are the particles of time step t − 1 ($x_{t-1}^{(k,i)}$) and the outputs are the particles of the current time step t ($x_t^{(k,i)}$). The sampling block pseudocode is provided by Algorithm 3. The random number $PRN^{(k)}$ needed for random sampling of particles, as shown in lines 2 and 3 of Algorithm 3, is provided by a random number generator block (cf. Fig. 3). The Sel Route signal is used to control the switching between the local and routed particles by making it low for the first M/2 cycles and then making it high for the next M/2 cycles. Further, at time instant 0, we feed the UGV position $x_0^{UGV}$ as a prior to the sampling block to distribute the particles around the UGV. The Sel Int control signal is made low in the first iteration, i.e., at time instant 0, and then made high for the subsequent iterations.

Algorithm 3 Sampling Block Pseudocode
Input: Particles from previous time step $x_{t-1}^{(k,i)} = [X_{t-1}^{(k,i)}, Y_{t-1}^{(k,i)}]$ and random number $PRN^{(k)} = [PRN_x^{(k)}, PRN_y^{(k)}]$.
Output: Particles of current time step $x_t^{(k,i)}$.
Method:
1: for i = 1 to M do
2:   $X_t^{(k,i)} = X_{t-1}^{(k,i)} + PRN_x^{(k)} * std$
3:   $Y_t^{(k,i)} = Y_{t-1}^{(k,i)} + PRN_y^{(k)} * std$   ▷ std is the standard deviation.
4:   $x_t^{(k,i)} = [X_t^{(k,i)}, Y_t^{(k,i)}]$
5: end for

2) IMPORTANCE
The importance unit computes the weights of the particles based on the photodiode measurements $z_t$, given by:

$$w_t^{(k,i)} = w_{t-1}^{(k,i)} \, p(z_t \mid x_t^{(k,i)}) \tag{13}$$

$w_{t-1}^{(k,i)}$ is initialized to 1/M. Estimation of $p(z_t \mid x_t^{(k,i)})$ involves determining the angle of each particle ($\theta_t^{(k,i)}$), which is computed using an inverse tangent function based on the position of the UGV ($x_t^{UGV}$) and the position of the particle ($x_t^{(k,i)}$), as follows:

$$\theta_t^{(k,i)} = \tan^{-1}\left(\frac{Y_t^{(k,i)} - Y_t^{UGV}}{X_t^{(k,i)} - X_t^{UGV}}\right)$$

where $X_t^{(k,i)}$ and $Y_t^{(k,i)}$ represent the co-ordinates of the particle $x_t^{(k,i)}$ in a two-dimensional Cartesian co-ordinate system.

The inverse tangent function is implemented using a CORDIC IP block provided by Xilinx [35]. The architecture of the importance unit is shown in Fig. 6. The index generator block estimates the angle of the particles with respect to the longitudinal axis of the UGV based on the bearing of the UGV ($\phi_t^{UGV}$). In addition to this, the index generator block is used for determining the sector indices (Ind $\theta_t$) of the particles based on the angle information. The sector index of a particle can be defined as follows:

$$\mathrm{Ind}\,\theta_t^{(k,i)} = \left\lceil \frac{4}{\pi} \left(\theta_t^{(k,i)} - \phi_t^{UGV}\right) \right\rceil$$
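The routing, sampling (Algorithm 3), and sector-index steps described above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the hardware design: the function names are hypothetical, the random term is assumed to be a zero-mean, unit-variance draw per axis (the distribution of $PRN^{(k)}$ is not specified in this section), `math.atan2` stands in for the CORDIC arctangent so that all four quadrants are resolved, and relative angles are assumed to be wrapped into [0, 2π) and clamped so that indices fall in 1..8.

```python
import math
import random


def sample_with_routing(local, routed_in, std, rng):
    """Sampling block of sub-filter k (Algorithm 3 plus particle routing).

    local     -- the M particles (X, Y) read from this sub-filter's memory
    routed_in -- the first M/2 particles received from sub-filter k-1
    The first M/2 local particles are taken to have been sent to sub-filter k+1.
    """
    half = len(local) // 2
    selected = routed_in[:half] + local[half:]
    # Assumption: a standard-normal draw per axis stands in for PRN_x, PRN_y.
    return [(x + rng.gauss(0.0, 1.0) * std, y + rng.gauss(0.0, 1.0) * std)
            for x, y in selected]


def sector_index(particle, ugv_pos, ugv_bearing):
    """Sector index (1..8) of a particle relative to the UGV's longitudinal axis."""
    theta = math.atan2(particle[1] - ugv_pos[1], particle[0] - ugv_pos[0])
    relative = (theta - ugv_bearing) % (2.0 * math.pi)  # wrap into [0, 2*pi)
    index = math.ceil(4.0 / math.pi * relative)
    return min(8, max(1, index))  # assumption: boundary angles clamp into 1..8


def count_sectors(indices):
    """Particle population of the eight sectors for one sub-filter."""
    counts = [0] * 8
    for idx in indices:
        counts[idx - 1] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    local = [(5.0, 5.0)] * 4
    routed = [(6.0, 4.0)] * 2
    particles = sample_with_routing(local, routed, std=0.5, rng=rng)
    indices = [sector_index(p, (0.0, 0.0), 0.0) for p in particles]
    print(count_sectors(indices))
```

With 45° sectors, a particle about 6° counter-clockwise of the UGV axis falls in sector 1, and one about 6° clockwise falls in sector 8, which is the behaviour the wrap-around assumption is meant to give.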
$z_t$ is 8-bit-wide data consisting of 8 binary photodiode measurements $\{z_t^1, z_t^2, \cdots, z_t^8\}$. Based on the measurement $z_t$ and the sector indices of particles, weights are generated by the weight computation block. These weights are stored in the weight memory using the address provided by the sampling unit, to store weights in the same order as the sampled particles $x_t^{(k,i)}$. The sum of all the weights required by the re-sampling unit is obtained by an accumulator. The particle population block is used to estimate the number of particles present in each of the eight sectors, using the sector indices of particles for a given sub-filter. The particle count in each of the eight sectors of sub-filter k is concatenated and given as the output Count Ind $\theta^{(k)}$. For example, if sector 1 has 15 particles, sector 3 has 14 particles, and sector 5 has 3 particles, then Count Ind $\theta^{(k)}$ = {15, 0, 14, 0, 3, 0, 0, 0}.

3) RE-SAMPLING
Particles with higher weights are replicated, while particles with lower weights are discarded during the re-sampling process. This is accomplished by utilizing the systematic re-sampling algorithm shown in Algorithm 4. A detailed description of the systematic re-sampling algorithm is provided in [8], [28]. The weights and the sum of all weights are obtained from the importance unit. The random number ($U_0$) needed to compute the parameter U_scale in line 2 of Algorithm 4 is provided by the random number generator block shown in Fig. 3. The algorithm presented works with un-normalized weights, which avoids the M division operations that normalizing all particles would require. The division required to compute Aw in line 1 of Algorithm 4 is implemented using the right shift operation, since M is a power of 2. This approach consumes fewer resources and area on hardware. The replicated and discarded indices generated by the systematic re-sampling block are stored in their respective memories, as shown in Fig. 7. In the worst-case scenario, the inner loop of Algorithm 4 takes 2M cycles for execution in hardware as it involves fetching M weights from weight memory and doing M comparison operations. Further, line 13 and line 14 take M cycles to obtain M replicated indices. Thus, in total, the execution of the re-sampling step requires 2M + M = 3M cycles.

Algorithm 4 Systematic Re-Sampling
Input: Un-normalized weights ($\{w_t^{(k,i)}\}_{i=1}^{M}$) of M particles, summation of all the weights in a sub-filter (Sum w) and the uniform random number ($U_0$) between [0, 1]
Output: Replicated index (Ind R) and Discarded index (Ind D).
Method:
1: Compute Aw = Sum w / M
2: Initialize: U_scale = $U_0$ × Aw
3: s = 0, p = 0, m = 0
4: for i = 1 to M do
5:   while s < U_scale do
6:     p = p + 1
7:     s = s + $w^{(k,p)}$
8:     if s < U_scale then
9:       m = m + 1
10:      Ind $D^{(k,m)}$ = p
11:    end if
12:  end while
13:  U_scale = U_scale + Aw
14:  Ind $R^{(k,i)}$ = p
15: end for

B. SECTOR CHECK BLOCK
The direction/orientation of the UGV is decided by the population of particles in different sectors and is used to move
towards the source. This is achieved by the sector check block, which estimates the particle population in each of the eight sectors and gives the sector index with the maximum particle count. The block diagram shown in Fig. 8 utilizes eight parallel adders to count the number of particles in each sector. The particle count in a given sector of K sub-filters is fed as an input to the adder. Count Ind $\theta_n^{(k)}$ in Fig. 8 denotes the particle count in sector n of sub-filter k. The output of an adder gives the total particle population in a particular sector. Furthermore, the sector index (Ind θ) having the maximum particle count is estimated using a max computation block. The UGV uses this information to traverse in the direction of the source.

FIGURE 8. Sector check block architecture.

C. MEAN ESTIMATION BLOCK
The mean of the total N particle positions is estimated using the mean estimation block. Particle positions from K sub-filters are fed in parallel and accumulated over M cycles to generate the sum, which is further divided by N, by right shifting $\log_2(N)$ times, to get the mean. In our implementation, we consider N as a power of 2. The mean gives an estimate of the position of the source $pos_t$.

VIII. RESULTS
In this section, we present the resource utilization of the proposed design on an FPGA. We also evaluate the execution time of the proposed architecture as a function of the number of sub-filters and inspect the estimation accuracy by scaling the number of particles. We then compare the proposed design with state-of-the-art implementations.

A. RESOURCE UTILIZATION
The architecture presented was implemented on an Artix-7 FPGA. Resource utilization of the implemented design for different numbers of sub-filters is summarized in Table 2. The number of particles per sub-filter (M) was fixed to 32 for synthesizing the design. All memory modules shown in the architecture for storing particles, weights, replicated, and discarded indices are translated into embedded 18 Kb block random access memory (BRAM) available on the FPGA, using a block memory generator (BMG) IP [36] provided by Xilinx. The number of 18 Kb BRAM blocks needed on the FPGA is indicated in the Block RAM column of Table 2. It can be seen that the resource utilization increases proportionally with the number of sub-filters. For 64 sub-filters, 64% of the slice LUTs (lookup tables) are used, and a maximum of approximately 90 sub-filters can fit onto a single Artix-7 (xc7a200tfbg484-1) FPGA platform.

B. EXECUTION TIME
The proposed design utilizes K parallel sub-filters, thus bringing down the number of particles processed within a sub-filter to N/K. Since the sampling and importance blocks are pipelined, these steps take $N/K + \tau_s + \tau_i$ clock cycles and the re-sampling step takes $3N/K + \tau_r$ cycles to process N/K particles, where $\tau_s$, $\tau_i$ and $\tau_r$ represent the start-up latency of the sampling, importance and re-sampling units, respectively. Since all the K sub-filters are parallelized, the time taken to process a total of N particles for SIR operation is:

$$T_{SIR} = (4N/K + \tau)\,T_{clk}$$

where $\tau = \tau_s + \tau_i + \tau_r$ and $T_{clk}$ is the clock period of the design.

Fig. 9 gives the timing diagram for completion of SIR operations using the proposed architecture for N particles, for a single iteration. Furthermore, since particle routing is incorporated within the sampling step, the transfer of particles between the sub-filters does not take any additional
cycles. This makes the design scalable for a large number of sub-filters, as the routing operation requires no extra time.

FIGURE 9. Timing diagram for SIR operations of the proposed design.

In Fig. 10(a), we show the execution time of the proposed architecture as a function of the number of sub-filters (K) for different N. As expected, the execution time increases with the number of particles (N). In many applications, for example, in biomedical signal processing, the state space dimension is very high [7]. Consequently, a large number of particles are needed to obtain satisfactory performance. In such cases, the computation time often becomes unrealistic. Introducing parallelization in the design by using more sub-filters (K) brings down the execution time significantly, as shown in Fig. 10(a). However, the reduction in execution time by increasing K comes at the cost of added hardware, which can be inferred from Fig. 10(b). Thus, there is a trade-off between the speed and the hardware utilized. For instance, using a single sub-filter and no parallelization uses a mere 1.4k (1%) LUTs to process 256 particles, and the time taken for SIR operations is around 1075 clock cycles. On the other hand, an 8 sub-filter design takes only 178 clock cycles for SIR operations, but utilizes 11k (8%) LUTs. The given FPGA resources limit the total number of sub-filters that can be accommodated on an FPGA, thus limiting the maximum achievable speed.

C. ESTIMATION ACCURACY
We analyzed the estimation accuracy for the 2D source localization problem as a function of the number of particles (N) for the standard and the modified SIR algorithm. The estimation error is the distance between the actual source location and the estimated source location, given by:

$$\mathrm{Error} = \sqrt{(pos_x - x)^2 + (pos_y - y)^2} \tag{14}$$

where $pos_x$ and $pos_y$ denote the estimated position of the source obtained from the PF algorithm, in the 2D Cartesian co-ordinate system. x and y denote the true position of the source in the 2D arena.

The algorithm for the standard SIR filter is presented in Section IV and has no parallelization incorporated. The modified SIR algorithm, introduced in Section VI, implements parallelization by utilizing K sub-filters working concurrently to reduce the execution time. The estimation errors presented in Fig. 10(c) are the average errors in 1000 runs over 250 time-steps. It is inferred that there is no significant difference in the estimation error between the standard and the modified SIR algorithm. Additionally, the modified algorithm achieves lower execution time and allows the parallel computation of PFs. Further, it is noted that by scaling the number of particles, the estimation accuracy improves as the error decreases.

D. CHOICE OF THE NUMBER OF SUB-FILTERS K
The choice of the number of sub-filters (K) used in the design depends on several factors, such as the number of particles (N), the clock frequency of the design ($f_{clk}$), and the observation sampling rate ($f_s$) of the measurement samples. The sampling rate gives the rate at which new input measurements can be processed. N is chosen depending on the application for which the particle filter is applied. $f_{clk}$ is selected based on the maximum frequency supported by the design. The relationship between the sampling rate and the execution time ($T_{SIR}$) of the filter is given by:

$$f_s = \frac{1}{T_{SIR}} = \frac{f_{clk}}{4N/K + \tau}$$
VOLUME 9, 2021 98197
A. Krishna et al.: FPGA Implementation of PFs for Robotic Source Localization
FIGURE 10. Performance analysis of the proposed design. (a) Execution time of the proposed design as a function of the number of sub-filters (K), for different numbers of particles (N). (b) Resource utilization in terms of the number of slice LUTs used as a function of the number of sub-filters (K). (c) Estimation error as a function of the number of particles (N) for the standard SIR filter without any parallelization using Algorithm 1 and the modified SIR filter with parallelization using Algorithm 2.
where $f_{clk} = 1/T_{clk}$. Thus, for a specified measurement sampling rate ($f_s$), the clock frequency of the design ($f_{clk}$), and the number of particles (N), we can determine the number of sub-filters (K) needed from the above equation. For instance, in our application, we use 256 particles because the error curve levels off at N = 256 (cf. Fig. 10(c)), and there is no improvement in the estimation error by further increasing N. Thus, to achieve a sampling rate of $f_s$ = 562 kHz, with 256 particles and clock frequency $f_{clk}$ = 100 MHz, we utilize K = 8 sub-filters. The maximum number of sub-filters that can be used in the design depends on the resources of the given FPGA.

E. COMPARISON WITH STATE-OF-THE-ART IMPLEMENTATIONS
A comparison of our design with state-of-the-art implementations is provided in Table 3. To obtain a valid assessment with other works, we have used N = 1024 particles (although 256 particles are sufficient for our application, as the error curve levels off at N = 256 (cf. Fig. 10(c))) and K = 8 sub-filters for comparison. The majority of current implementation schemes use the standard SIR algorithm (cf. Algorithm 1), which does not support parallelization. Moreover, their architectures are not scalable to process a large number of particles at a high sampling rate, as the execution time is proportional to the number of particles. Also, the re-sampling step is a major computational bottleneck, as it is inherently not parallelizable. In this work, we propose a modification to the existing algorithm that overcomes this computational bottleneck of the PF algorithm and makes high-speed implementation possible. We introduce an additional particle routing step (cf. Algorithm 2) allowing for parallel re-sampling. We develop a PF architecture based on the modified algorithm, incorporating parallelization and pipelining design strategies to reduce the execution time. Since the particle routing step is coupled with the sampling step and the routing is constrained between the two neighboring sub-filters, our implementation is highly scalable and has low complexity. In comparison, other parallel implementations suffer from scalability issues due to the high communication overhead between the concurrent processing elements.

Despite the difficulty of directly comparing the proposed architecture to other implementations owing to variation in model, application, device, and particle count (N), our design achieves high input sampling rates, even for a large number of particles, by scaling the number of sub-filters K. The first
TABLE 3. Performance summary and comparison with state-of-the-art particle filter implementation schemes.
hardware architecture for implementing PFs on an FPGA was provided by Athalye et al. [8], applied to a tracking problem. Their architecture is generic and does not incorporate any parallelization in the design. Thus, their architecture suffers from a low sampling rate of about 16 kHz for 2048 particles, which is approximately 11 times lower than the sampling rate of our design. However, owing to the non-parallel architecture, the resource consumption of their design (4.4k registers and 3.8k LUTs) is relatively low. Agrawal et al. [10] proposed a PF architecture for object tracking in video with 59k LUTs and a sampling rate of around 42 kHz. Another state-of-the-art system was presented in [11]. The authors implemented the SIR filter on the Xilinx Virtex-5 FPGA platform for a bearings-only tracking application and achieved a sampling rate of 46 kHz for 1024 particles. Regarding its hardware utilization, it uses 13.6k registers and 7.3k LUTs, which are comparable to those of our design; however, their sampling rate is four times lower than that of our system. Sileshi et al. [12] proposed two methods for implementing PFs on hardware. The first method was a hardware/software (HW/SW) co-design framework, where the software components were implemented using an embedded MicroBlaze processor. A PF hardware acceleration module on an FPGA was used for the hardware portion. This HW/SW co-design approach has a low sampling rate of about 1 kHz due to communication overhead between the MicroBlaze soft processor and the hardware acceleration module. Furthermore, using a large number of parallel particle processors to speed up the design is constrained by the number of bus interfaces available in the soft-core processor (MicroBlaze). Thus, to improve the sampling rate, they proposed a second approach which is entirely a hardware design. However, their architecture does not support parallel processing and achieves a low sampling rate of about 18 kHz, whereas our system can sample at 178 kHz for processing the same 1024 particles. Their full hardware system utilizes 1.4k registers and 19k LUTs. Velmurugan [15] proposed a fully digital PF FPGA implementation for a tracking application, without any parallelization in the design. They used a high-level Xilinx system generator tool to generate the VHDL code for deployment on a Xilinx FPGA from Simulink models or MATLAB code. Their design is not optimized in terms of hardware utilization as they use a high-level abstraction tool and lack flexibility to fine-tune the design. On the other hand, our design is completely hand-coded in Verilog and provides granular control to tweak the design parameters, and ensures that the design can be easily integrated into a multitude of PF applications. They achieve a sampling rate of about 30 kHz for 1000 particles, which is six times lower than that of our design. Further, their resource consumption is relatively high (17.4k registers and 30.9k LUTs) as they use high-level abstraction tools for implementation. In other work, Schwiegelshohn et al. [16]
FIGURE 11. 2D source localization experimental result. The source is positioned at [6, 22] marked by a ’red’ circular dot. At the start, the UGV is
positioned at [38, −4]. The model is run over 250 time-steps for 256 particles, and the UGV traverses towards the source based on sensor
measurements. The final source estimate (post ) obtained by the PF algorithm is marked by a ’yellow’ circular dot and has an estimation error of 0.5.
The probabilities α and β are set as 0.8 and 0.6, respectively.
FIGURE 13. 3D source localization experimental result. The source is positioned at [40, 5, 25], and the initial position of the UAV is [10, 30, 0]. The model is run over 350 time-steps for 512 particles. Here, the UAV traverses in three dimensions to move towards the source. The error between the source and the estimated location is 0.83. The probabilities α and β are set as 0.8 and 0.4, respectively.
of 0.6. However, with an increase in noise probability (β), the number of time-steps or iterations required to localize the source also increases, as shown in Fig. 12. The source is considered to be localized if the estimation error is less than the predetermined threshold, which is 2.5 in our case. The time-steps shown in Fig. 12 are the average time required to localize the source over 500 runs. The entire design was coded in Verilog HDL and implemented on the FPGA. All variables were translated from the floating-point to the fixed-point representation for the implementation on FPGA. We have used a 16-bit fixed-point representation for particles and their associated weights. All bearing-related information, such as the angle of the UGV and the angle of particles used in the importance block, is represented by a 12-bit fixed-point representation. Further, the indices of the replicated and the discarded particles are integers and are represented using $\log_2(M) = 5$ bits. The output estimate of the source location ($pos_t$) is represented using a 16-bit representation. N = 256 particles were used for processing. K = 8 sub-filters were used in the design with M = 32 particles processed within each sub-filter. M/2 = 16 particles were exchanged between the sub-filters after the completion of every iteration as part of the particle routing operation. The time taken to complete SIR operations for N = 256 and K = 8 is 178 clock cycles. With a clock frequency of 100 MHz, the speed at which we can process new samples is around 562 kHz, and the execution time for SIR operation is 1.78 µs. This high sampling rate enables us to use the proposed hardware architecture in various real-time applications.

Further, we show that the 2D source localization problem can be extended to 3D, and we have modelled it in software using MATLAB. This 3D model incorporates position along the x, y, and z directions. Here, an Unmanned Aerial Vehicle (UAV) can be utilized to localize the source. As compared to the 8 sensors used in 2D localization, here we utilize 16 sensors for scanning the entire 3D space. We consider α = 0.8 and β = 0.4, and the model was run over 350 time-steps for 512 particles to localize the source. The result is presented in Fig. 13. The estimation error between the actual source location and the estimated source location in the 3D arena is given by:

$$\mathrm{Error} = \sqrt{(pos_x - x)^2 + (pos_y - y)^2 + (pos_z - z)^2} \tag{15}$$

where $pos_x$, $pos_y$ and $pos_z$ denote the estimated position of the source obtained from the PF algorithm, in the 3D Cartesian co-ordinate system. x, y and z represent the true position of the source in the 3D arena.

IX. CONCLUSION AND OUTLOOK
In this paper, we presented an architecture for the hardware realization of PFs, particularly sampling, importance, and re-sampling (SIR) filters, on an FPGA. PFs perform better than traditional Kalman filters in non-linear and non-Gaussian settings. Interesting insights into the advantages of PFs, performance comparison, and trade-offs of PFs over other non-PF solutions are provided by [38], [39]. However, PFs are computationally very demanding and take a significant amount of time to process a large number of particles; hence, PFs are seldom used for real-time applications. In our architecture, we try to address this issue by exploiting parallelization and pipelining design techniques to reduce the overall execution time, thus making the real-time implementation of PFs feasible. However, a major bottleneck in high-speed parallel implementation of the SIR filter is the re-sampling step, as it is inherently not parallelizable and cannot be pipelined with other operations. In this regard, we modified the standard SIR filter to make it parallelizable. The modified
algorithm has an additional particle routing step and utilizes several sub-filters working concurrently and performing SIR operations independently on particles to reduce the overall execution time. Our implementation is highly scalable and has low complexity since the particle routing step is integrated with the sampling step, and the routing is confined between the two adjacent sub-filters. On the other hand, other parallel architectures have scalability issues due to the high communication overhead between the concurrent processing elements.

A performance assessment in terms of the resource utilized on an FPGA, execution time, and estimation accuracy is presented. We also compared the estimation error of the modified SIR algorithm with that of the standard SIR algorithm and noted that there is no significant difference in the estimation error. The proposed architecture has a total execution time of about 5.62 µs (i.e., a sampling rate of 178 kHz) by utilizing 8 sub-filters for processing N = 1024 particles. We compared our design with state-of-the-art FPGA implementation schemes and found that our design outperforms other implementation schemes in terms of execution time. The low execution time (i.e., high input sampling rate) makes our architecture ideal for real-time applications.

The proposed PF architecture is not limited to a particular application and can be used for other applications by modifying the importance block of the sub-filter. The sampling and re-sampling block designs are generic and can be used for any application.

We also present a novel source localization model to estimate the position of a source based on received sensor measurements. Our PF implementation is robust to noise and can predict the source position even with a high noise probability. Experimental results show the estimated source location with respect to the actual location for 2D and 3D settings and demonstrate the effectiveness of the proposed algorithm.

In recent times, there has been an increase in the utilization of UGVs in several instances, such as disaster relief and military applications, due to reduced human involvement and the ability to carry out the task remotely. The proposed source localization model using PFs can autonomously navigate and localize the source of interest without any human intervention, which would be very helpful in missions wherein there is an imminent threat involved, such as locating chemical, biological or radiative sources in an unknown environment. Further, the proposed PF framework and its hardware realization would be useful for the signal processing community for solving various state estimation problems such as tracking, navigation, and positioning in real-time.

REFERENCES
[1] N. J. Gordon, D. J. Salmond, and A. F. M. Smith, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proc. F-Radar Signal Process., vol. 140, no. 2, pp. 107–113, Apr. 1993.
[2] M. Tian, Y. Bo, Z. Chen, P. Wu, and C. Yue, "Multi-target tracking method based on improved firefly algorithm optimized particle filter," Neurocomputing, vol. 359, pp. 438–448, Sep. 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231219308240
[3] N. Merlinge, K. Dahia, and H. Piet-Lahanier, "A box regularized particle filter for terrain navigation with highly non-linear measurements," IFAC-PapersOnLine, vol. 49, no. 17, pp. 361–366, Sep. 2016.
[4] Z. Zhang and J. Chen, "Fault detection and diagnosis based on particle filters combined with interactive multiple-model estimation in dynamic process systems," ISA Trans., vol. 85, pp. 247–261, Feb. 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S001905781830394X
[5] A. Doucet, S. Godsill, and C. Andrieu, "On sequential Monte Carlo sampling methods for Bayesian filtering," Statist. Comput., vol. 10, no. 3, pp. 197–208, Jul. 2000, doi: 10.1023/A:1008935410038.
[6] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Trans. Signal Process., vol. 50, no. 2, pp. 174–188, Feb. 2002.
[7] L. Miao, J. J. Zhang, C. Chakrabarti, and A. Papandreou-Suppappola, "Multiple sensor sequential tracking of neural activity: Algorithm and FPGA implementation," in Proc. 44th Asilomar Conf. Signals, Syst. Comput., Nov. 2010, pp. 369–373.
[8] A. Athalye, M. Bolić, S. Hong, and P. M. Djurić, "Generic hardware architectures for sampling and resampling in particle filters," EURASIP J. Adv. Signal Process., vol. 2005, no. 17, Oct. 2005, Art. no. 476167, doi: 10.1155/ASP.2005.2888.
[9] M. Bolić, P. M. Djurić, and S. Hong, "Resampling algorithms and architectures for distributed particle filters," IEEE Trans. Signal Process., vol. 53, no. 7, pp. 2442–2450, Jul. 2005.
[10] S. Agrawal, P. Engineer, R. Velmurugan, and S. Patkar, "FPGA implementation of particle filter based object tracking in video," in Proc. Int. Symp. Electron. Syst. Design (ISED), Dec. 2012, pp. 82–86.
[11] B. Ye and Y. Zhang, "Improved FPGA implementation of particle filter for radar tracking applications," in Proc. 2nd Asian–Pacific Conf. Synth. Aperture Radar, Oct. 2009, pp. 943–946.
[12] B. G. Sileshi, J. Oliver, and C. Ferrer, "Accelerating particle filter on FPGA," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), Jul. 2016, pp. 591–594.
[13] B. Sileshi, C. Ferrer, and J. Oliver, "Accelerating techniques for particle filter implementations on FPGA," in Emerging Trends in Computational Biology, Bioinformatics, and Systems Biology. Amsterdam, The Netherlands: Elsevier, 2015, ch. 2, pp. 19–37. [Online]. Available: https://www.sciencedirect.com/science/article/pii/B9780128025086000028?via%3Dihub
[14] B. G. Sileshi, J. Oliver, R. Toledo, J. Gonçalves, and P. Costa, "On the behaviour of low cost laser scanners in HW/SW particle filter SLAM applications," Robot. Auton. Syst., vol. 80, pp. 11–23, Jun. 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0921889015303201
[15] R. Velmurugan, "Implementation strategies for particle filter based target tracking," Ph.D. dissertation, School Electr. Comput. Eng., Georgia Inst. Technol., Atlanta, GA, USA, May 2007.
[16] F. Schwiegelshohn, E. Ossovski, and M. Hübner, "A fully parallel particle filter architecture for FPGAs," in Applied Reconfigurable Computing, K. Sano, D. Soudris, M. Hübner, and P. C. Diniz, Eds. Cham, Switzerland: Springer, 2015, pp. 91–102.
[17] J. Mountney, I. Obeid, and D. Silage, "Modular particle filtering FPGA hardware architecture for brain machine interfaces," in Proc. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., Aug. 2011, pp. 4617–4620.
[18] S. A. Alam and O. Gustafsson, "Improved particle filter resampling architectures," J. Signal Process. Syst., vol. 92, no. 6, pp. 555–568, Jun. 2020, doi: 10.1007/s11265-019-01489-y.
[19] L. Miao, J. J. Zhang, C. Chakrabarti, and A. Papandreou-Suppappola, "Efficient Bayesian tracking of multiple sources of neural activity: Algorithms and real-time FPGA implementation," IEEE Trans. Signal Process., vol. 61, no. 3, pp. 633–647, Feb. 2013.
[20] R. Velmurugan, S. Subramanian, V. Cevher, D. Abramson, K. M. Odame, J. D. Gray, H.-J. Lo, J. H. McClellan, and D. V. Anderson, "On low-power analog implementation of particle filters for target tracking," in Proc. 14th Eur. Signal Process. Conf., 2006, pp. 1–5.
[21] G. Hendeby, J. D. Hol, R. Karlsson, and F. Gustafsson, "A graphics processing unit implementation of the particle filter," in Proc. 15th Eur. Signal Process. Conf., Sep. 2007, pp. 1639–1643.
[22] G. Hendeby, R. Karlsson, and F. Gustafsson, "Particle filtering: The need for speed," EURASIP J. Adv. Signal Process., vol. 2010, no. 1, Jun. 2010, Art. no. 181403, doi: 10.1155/2010/181403.
ANDRÉ VAN SCHAIK (Fellow, IEEE) received the M.Sc. degree in electrical engineering from the University of Twente, Enschede, The Netherlands, in 1990, and the Ph.D. degree in neuromorphic engineering from the Swiss Federal Institute of Technology (EPFL) Lausanne, Switzerland, in 1998. From 1991 to 1994, he was a Researcher at the Swiss Centre for Electronics and Microtechnology (CSEM), where he developed the first commercial neuromorphic chip, the optical motion detector used in Logitech trackballs. In 1998, he was a Postdoctoral Research Fellow with the Department of Physiology, The University of Sydney, and he became a Senior Lecturer at their School of Electrical and Information Engineering, in 1999, and a Reader, in 2004. In 2011, he became a Full Professor at Western Sydney University. He is a Pioneer in neuromorphic engineering and the Director of the International Centre for Neuromorphic Systems at Western Sydney University. He has authored more than 200 articles. He is an inventor of more than 35 patents. In addition, he has founded three technology start-ups. His research interests include neuromorphic engineering, encompassing neurophysiology, computational neuroscience, software and algorithm development, and electronic hardware design. He is a fellow of the IEEE for contributions to neuromorphic circuits and systems.

[23] M. Chitchian, A. Simonetto, A. S. van Amesfoort, and T. Keviczky, "Distributed computation particle filters on GPU architectures for real-time control applications," IEEE Trans. Control Syst. Technol., vol. 21, no. 6, pp. 2224–2238, Nov. 2013.
[24] P. Gong, Y. O. Basciftci, and F. Ozguner, "A parallel resampling algorithm for particle filtering on shared-memory architectures," in Proc. IEEE 26th Int. Parallel Distrib. Process. Symp. Workshops PhD Forum, May 2012, pp. 1477–1483.
[25] K. Par and O. Tosun, "Parallelization of particle filter based localization and map matching algorithms on multicore/manycore architectures," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2011, pp. 820–826.
[26] S. Kim, J. Cho, and D. Park, "Moving-target position estimation using GPU-based particle filter for IoT sensing applications," Appl. Sci., vol. 7, no. 11, p. 1152, Nov. 2017. [Online]. Available: https://www.mdpi.com/2076-3417/7/11/1152
[27] L. M. Murray, A. Lee, and P. E. Jacob, "Parallel resampling in the particle filter," J. Comput. Graph. Statist., vol. 25, no. 3, pp. 789–805, Jul. 2016, doi: 10.1080/10618600.2015.1062015.
[28] M. Bolić, A. Athalye, P. M. Djurić, and S. Hong, "Algorithmic modification of particle filters for hardware implementation," in Proc. 12th Eur. Signal Process. Conf., Sep. 2004, pp. 1641–1644.
[29] Y. Q. Zhang, T. Sathyan, M. Hedley, P. H. W. Leong, and A. Pasha, "Hardware efficient parallel particle filter for tracking in wireless networks," in Proc. IEEE 23rd Int. Symp. Pers., Indoor Mobile Radio Commun. (PIMRC), Sep. 2012, pp. 1734–1739.
[30] P. B. Choppala, P. D. Teal, and M. R. Frean, ‘‘Particle filter parallelisation
using random network based resampling,’’ in Proc. 17th Int. Conf. Inf.
Fusion (FUSION), 2014, pp. 1–8.
[31] C. S. Thakur, S. Afshar, R. M. Wang, T. J. Hamilton, J. Tapson, and
A. van Schaik, ‘‘Bayesian estimation and inference using stochastic elec-
tronics,’’ Frontiers Neurosci., vol. 10, p. 104, Mar. 2016. [Online]. Avail-
able: https://www.frontiersin.org/article/10.3389/fnins.2016.00104
[32] A. Krishna and C. S. Thakur, ‘‘Bayesian source localization using stochas-
tic computation,’’ in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS),
May 2021, p. i-5.
[33] J. L. Palmer, R. Cannizzaro, B. Ristic, T. Cheah, C. V. D. Nagahawatte,
J. L. Gilbert, and S. Arulampalam, ‘‘Source localisation with a Bernoulli
particle-filter-based bearings-only tracking algorithm,’’ in Proc. Australas.
Conf. Robot. Automat. (ACRA), Canberra, ACT, Australia, 2015. [Online].
Available: https://www.araa.asn.au/conference/acra-2015-2/
[34] E. Milovanović, M. Stojcev, I. Milovanović, T. Nikolic, and
Z. Stamenkovic, ‘‘Concurrent generation of pseudo random numbers
with LFSR of Fibonacci and Galois type,’’ Comput. Informat., vol. 34,
pp. 941–958, Aug. 2015.
[35] CORDIC V6.0 LogiCORE IP Product Guide, Xilinx, San Jose, CA, USA,
2017, pp. 1–66.
[36] Block Memory Generator V8.4, Xilinx, San Jose, CA, USA, 2017,
pp. 1–129.
[37] Opal Kelly XEM7310. Accessed: May 10, 2021. [Online]. Available:
https://opalkelly.com/products/xem7310/ CHETAN SINGH THAKUR (Senior Member,
[38] F. Ababsa, M. Mallem, and D. Roussel, ‘‘Comparison between particle IEEE) received the Ph.D. degree in neuromorphic
filter approach and Kalman filter-based technique for head tracking in engineering from The MARCS Research Insti-
augmented reality systems,’’ in Proc. IEEE Int. Conf. Robot. Automat.
tute, Western Sydney University (WSU), in 2016.
(ICRA), vol. 1, Apr. 2004, pp. 1021–1026.
He was an Adjunct Faculty appointment at the
[39] N. Y. Ko and T. G. Kim, ‘‘Comparison of Kalman filter and particle filter
used for localization of an underwater vehicle,’’ in Proc. 9th Int. Conf. International Center for Neuromorphic Systems,
Ubiquitous Robots Ambient Intell. (URAI), Nov. 2012, pp. 350–352. WSU, Australia. He then worked as a Research
Fellow at Johns Hopkins University. He worked
for six years with Texas Instruments, Singapore,
ADITHYA KRISHNA received the B.E. degree as a Senior Integrated Circuit Design Engineer,
from the Department of Electronics and Commu- designing IPs for mobile processors. He is currently an Assistant Professor
nication Engineering, PES Institute of Technology, at the Indian Institute of Science (IISc), Bengaluru. His research expertise
Bengaluru, India, in 2017. He has been working as include in neuromorphic computing, mixed-signal VLSI systems, computa-
a Research Assistant at the NeuRonICS Labora- tional neuroscience, probabilistic signal processing, and machine learning.
tory, Department of Electronic Systems Engineer- His research interest includes understanding the signal processing aspects
ing, Indian Institute of Science (IISc), since 2018. of the brain and apply those to build novel intelligent systems. He was
His research interests include VLSI architecture a recipient of several awards, such as the Young Investigator Award from
design, embedded system design, neuromorphic Pratiksha Trust, the Early Career Research Award by Science and Engineer-
computing, and machine learning. He was a recip- ing Research Board, India, and the Inspire Faculty Award by the Department
ient of the Summer Research Fellowship 2016 under the Indian Academy of of Science and Technology, India.
Sciences, Bengaluru.