KalmanNet

arXiv:2107.10043v2 [eess.SP] 24 Jan 2022

Abstract—State estimation of dynamical systems in real-time is a fundamental task in signal processing. For systems that are well-represented by a fully known linear Gaussian state space (SS) model, the celebrated Kalman filter (KF) is a low complexity …

… the unscented Kalman filter (UKF) [10]. Methods based on sequential Monte-Carlo (MC) sampling, such as the family of particle filters (PFs) [11]–[13], were introduced for state estimation in non-linear, non-Gaussian SS models. To date, …
We build upon the success of our previous work in MB deep learning for signal processing and digital communication applications [24]–[27] to propose a hybrid MB/DD online recursive filter, coined KalmanNet. In particular, we focus on real-time state estimation for continuous-value SS models for which the KF and its variants are designed. We assume that the noise statistics are unknown and the underlying SS model is partially known or approximated from a physical model of the system dynamics. To design KalmanNet, we identify the Kalman gain (KG) computation of the KF as a critical component encapsulating the dependency on the noise statistics and domain knowledge, and replace it by a compact RNN of limited complexity, which is integrated into the KF flow. The resulting system uses labeled data to learn to carry out Kalman filtering in a supervised manner.

Our main contributions are summarized as follows:
1) We design KalmanNet, which is an interpretable, low complexity, and data-efficient DNN-aided real-time state estimator. KalmanNet builds upon the flow and theoretical principles of the KF, incorporating partial domain knowledge of the underlying SS model in its operation.
2) By learning the KG, KalmanNet circumvents the dependency of the KF on knowledge of the underlying noise statistics, thus bypassing numerically problematic matrix inversions involved in the KF equations and overcoming the need for tailored solutions for non-linear systems, e.g., approximations to handle non-linearities as in the EKF.
3) We show that KalmanNet learns to carry out Kalman filtering from data in a manner that is invariant to the sequence length. Specifically, we present an efficient supervised training scheme that enables KalmanNet to operate with arbitrarily long trajectories while only training using short trajectories.
4) We evaluate KalmanNet in various SS models. The experimental scenarios include synthetic setups, tracking the chaotic Lorenz system, and localization using the Michigan NCLT data set [28]. KalmanNet is shown to converge much faster compared with purely DD systems, while outperforming the MB EKF, UKF, and PF when facing model mismatch and dominant non-linearities.

The proposed KalmanNet leverages data and partial domain knowledge to learn the filtering operation, rather than using data to explicitly estimate the missing SS model parameters. Although there is a large body of work that combines SS models with DNNs, e.g., [29]–[35], these approaches are sometimes designed for different SS related tasks (e.g., smoothing, imputation); with a different focus, e.g., incorporating high-dimensional visual observations into a KF; or under different assumptions, as we discuss in detail below.

The rest of this paper is organized as follows: Section II reviews the SS model and its associated tasks, and discusses related works. Section III details the proposed KalmanNet. Section IV presents the numerical study. Section V provides concluding remarks and future work.

Throughout the paper, we use boldface lower-case letters for vectors, and boldface upper-case letters for matrices. The transpose, ℓ2 norm, and stochastic expectation are denoted by {·}⊤, ‖·‖, and E[·], respectively. The Gaussian distribution with mean µ and covariance Σ is denoted by N(µ, Σ). Finally, R and Z are the sets of real and integer numbers, respectively.

II. SYSTEM MODEL AND PRELIMINARIES

A. State Space Model

We consider dynamical systems characterized by a SS model in discrete-time [36]. We focus on (possibly) non-linear, Gaussian, and continuous SS models, which for each t ∈ Z are represented via

xt = f(xt−1) + et,   et ∼ N(0, Q),   xt ∈ R^m,   (1a)
yt = h(xt) + vt,   vt ∼ N(0, R),   yt ∈ R^n.   (1b)

In (1a), xt is the latent state vector of the system at time t, which evolves from the previous state xt−1 by a (possibly) non-linear state-evolution function f(·) and by an AWGN et with covariance matrix Q. In (1b), yt is the vector of observations at time t, which is generated from the current latent state vector by a (possibly) non-linear observation (emission) mapping h(·) corrupted by AWGN vt with covariance R. For the special case where the evolution or the observation transformations are linear, there exist matrices F, H such that

f(xt−1) = F · xt−1,   h(xt) = H · xt.   (2)

In practice, the state-evolution model (1a) is determined by the complex dynamics of the underlying system, while the observation model (1b) is dictated by the type and quality of the observations. For instance, xt can determine the location, velocity, and acceleration of a vehicle, while yt are measurements obtained from several sensors. The parameters of these models may be unknown and often require the introduction of dedicated mechanisms for their estimation in real-time [37], [38]. In some scenarios, one is likely to have access to an approximated or mismatched characterization of the underlying dynamics.

SS models are studied in the context of several different tasks; these tasks are different in their nature, and can be roughly classified into two main categories: observation approximation and hidden state recovery. The first category deals with approximating parts of the observed signal yt. This can correspond to, e.g., the prediction of future observations given past observations; the generation of missing observations in a given block via imputation; and the denoising of the observations. The second category considers the recovery of a hidden state vector xt. This family of state recovery tasks includes offline recovery, also referred to as smoothing, where one must recover a block of hidden state vectors given a block of observations, e.g., [35]. The focus of this paper is filtering; i.e., online recovery of xt from past and current noisy observations {yτ}, τ = 1, . . . , t. For a given x0, filtering involves the design of a mapping from yt to x̂t for every t ∈ {1, 2, . . . , T} ≜ T, where T is the time horizon.
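To make the model in (1)–(2) concrete, the following sketch generates a synthetic trajectory from a linear Gaussian SS model; the specific F, H, Q, R values and the 2-dimensional state are illustrative assumptions, not parameters taken from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2x2 linear Gaussian SS model as in (1)-(2); the numbers are arbitrary.
F = np.array([[1.0, 1.0],
              [0.0, 1.0]])   # state-evolution matrix
H = np.eye(2)                 # observation matrix
Q = 0.1 * np.eye(2)           # state-noise covariance
R = 0.5 * np.eye(2)           # observation-noise covariance

def simulate(T, x0):
    """Generate {x_t, y_t} for t = 1..T following (1a)-(1b)."""
    xs, ys = [], []
    x = x0
    for _ in range(T):
        x = F @ x + rng.multivariate_normal(np.zeros(2), Q)   # (1a)
        y = H @ x + rng.multivariate_normal(np.zeros(2), R)   # (1b)
        xs.append(x)
        ys.append(y)
    return np.stack(xs), np.stack(ys)

X, Y = simulate(T=100, x0=np.zeros(2))
print(X.shape, Y.shape)  # (100, 2) (100, 2)
```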
B. Data-Aided Filtering Problem Formulation

The filtering problem is at the core of real-time tracking. Here, one must provide an instantaneous estimate of the state xt based on each incoming observation yt in an online manner. Our main focus is on scenarios where one has partial knowledge of the SS model that describes the underlying dynamics. Namely, we know (or have an approximation of) the state-evolution (transition) function f(·) and the state-observation (emission) function h(·). For real world applications, this knowledge is derived from our understanding of the system dynamics, its physical design, and the model of the sensors. As opposed to the classical assumptions in KF, the noise statistics Q and R are not known. More specifically, we assume:
• Knowledge of the distribution of the noise signals et and vt is not available.
• The functions f(·) and h(·) may constitute an approximation of the true underlying dynamics. Such approximations can correspond to, for instance, the representation of continuous time dynamics in discrete time, acquisition using misaligned sensors, and other forms of mismatches.

While we focus on filtering in partially-known SS models, we assume that we have access to a labeled data set containing a sequence of observations and their corresponding ground truth states. In various scenarios of interest, one can assume access to some ground truth measurements in the design stage. For example, in field experiments it is possible to add extra sensors, both internally and externally, to collect the ground truth needed for training. It is also possible to compute the ground truth data using offline and more computationally intensive algorithms. Finally, the inference complexity of the learned filter should be of the same order as (and preferably smaller than) that of MB filters, such as the EKF.

C. Related Work

A key ingredient in recursive Bayesian filtering is the update operation; namely, the need to update the prior estimate using new observed information. For linear Gaussian SS models using the KF, this boils down to computing the KG. While the KF assumes linear SS models, many problems encountered in practice are governed by non-linear dynamics, for which one should resort to approximations. Several extensions of the KF were proposed to deal with non-linearities. The EKF [7], [8] is a quasi-linear algorithm based on an analytical linearization of the SS model. More recent non-linear variations are based on numerical integration: the UKF [10], the Gauss-Hermite quadrature filter [39], and the cubature KF [40]. For more complex SS models, and when the noise cannot be modeled as Gaussian, multiple variants of the PF were proposed that are based on sequential MC [11]–[13], [41]–[45]. These MC algorithms are considered to be asymptotically exact but relatively computationally heavy when compared to Kalman-based algorithms. These MB algorithms require accurate knowledge of the SS model, and their performance is typically degraded in the presence of model mismatch.

The combination of machine learning and SS models, and specifically Kalman-based algorithms, is the focus of growing research attention. To frame the current work in the context of existing literature, we focus on the approaches that preserve the general structure of the SS model. The conventional approach to deal with partially known SS models is to impose a parametric model and then estimate its parameters. This can be achieved by jointly learning the parameters and state sequence using expectation maximization [46]–[48] and Bayesian probabilistic algorithms [37], [38], or by selecting from a set of a priori known models [49]. When training data is available, it is commonly used to tune the missing parameters in advance, in a supervised or an unsupervised manner, as done in [50]–[52]. The main drawback of these strategies is that they are restricted to an imposed parametric model on the underlying dynamics (e.g., Gaussian noises).

When one can bound the uncertainty in the SS model in advance, an alternative approach to learning is to minimize the worst-case estimation error among all expected SS models. Such robust variations were proposed for various state estimation algorithms, including Kalman variants [15]–[17], [53] and particle filters [54], [55]. The fact that these approaches aim to design the filter to be suitable for multiple different SS models typically results in degraded performance compared to operating with known dynamics.

When the underlying system's dynamics are complex and only partially known, or the emission model is intractable and cannot be captured in a closed form—e.g., visual observations as in a computer vision task [56]—one can resort to approximations and to the use of DNNs. Variational inference [57]–[59] is commonly used in connection with SS models, as in [29]–[31], [33], [34], by casting the Bayesian inference task as optimization of a parameterized posterior and maximizing an objective. Such approaches cannot typically be applied directly to state recovery in real-time, as we consider here, and the learning procedure tends to be complex and prone to approximation errors.

A common strategy when using DNNs is to encode the observations into some latent space that is assumed to obey a simple SS model, typically a linear Gaussian one, and track the state in the latent domain as in [56], [60], [61], or to use DNNs to estimate the parameters of the SS model as in [62], [63]. Tracking in the latent space can also be extended by applying a DNN decoder to the estimated state to return to the observations domain, while training the overall system end-to-end [31], [64]. The latter allows designing trainable systems for recovering missing observations and predicting future ones by assuming that the temporal relationship can be captured as an SS model in the latent space. This form of DNN-aided systems is typically designed for unknown or highly complex SS models, while we focus in this work on setups with partial domain knowledge, as detailed in Subsection II-B. Another line of work combines RNNs [65] or variational inference [32], [66] with MC based sampling. Also related is the work [35], which used learned models in parallel with MB algorithms operating with full knowledge of the SS model, applying a graph neural network in parallel to the Kalman smoother to improve its accuracy via neural augmentation. Estimation was performed by an iterative message passing over the entire time horizon. This approach is suitable for the smoothing task and is computationally intensive, and therefore may not be suitable for real-time filtering [67].
D. Model-Based Kalman Filtering

Our proposed KalmanNet, detailed in the following section, is based on the MB KF, which is a linear recursive estimator. In every time step t, the KF produces a new estimate x̂t using only the previous estimate x̂t−1 as a sufficient statistic and the new observation yt. As a result, the computational complexity of the KF does not grow in time. We first describe the original algorithm for linear SS models, as in (2), and then discuss how it is extended into the EKF for non-linear SS models.

The KF can be described by a two-step procedure: prediction and update, where in each time step t ∈ T it computes the first- and second-order statistical moments.
1) The first step predicts the current a priori statistical moments based on the previous a posteriori estimates. Specifically, the moments of x are computed using the knowledge of the evolution matrix F as
x̂t|t−1 = F · x̂t−1|t−1,   (3a)
Σt|t−1 = F · Σt−1|t−1 · F⊤ + Q,   (3b)
and the moments of the observations y are computed based on the knowledge of the observation matrix H as
ŷt|t−1 = H · x̂t|t−1,   (4a)
St|t−1 = H · Σt|t−1 · H⊤ + R.   (4b)
2) In the update step, the a posteriori state moments are computed based on the a priori moments as
x̂t|t = x̂t|t−1 + Kt · ∆yt,   (5a)
Σt|t = Σt|t−1 − Kt · St|t−1 · Kt⊤.   (5b)
Here, Kt is the KG, and it is given by
Kt = Σt|t−1 · H⊤ · St|t−1⁻¹.   (6)
The term ∆yt is the innovation; i.e., the difference between the predicted observation and the observed value, and it is the only term that depends on the observed data:
∆yt = yt − ŷt|t−1.   (7)

The EKF extends the KF for non-linear f(·) and/or h(·), as in (1). Here, the first-order statistical moments (3a) and (4a) are replaced with
x̂t|t−1 = f(x̂t−1),   (8a)
ŷt|t−1 = h(x̂t|t−1),   (8b)
respectively. The second-order moments, though, cannot be propagated through the non-linearity, and must thus be approximated. The EKF linearizes the differentiable f(·) and h(·) in a time-dependent manner using their partial derivative matrices, also known as Jacobians, evaluated at x̂t−1|t−1 and x̂t|t−1. Namely,
F̂t = Jf(x̂t−1|t−1),   (9a)
Ĥt = Jh(x̂t|t−1),   (9b)
where F̂t is plugged into (3b) and Ĥt is used in (4b) and (6). When the SS model is linear, the EKF coincides with the KF, which achieves the MMSE for linear Gaussian SS models.

Fig. 1: EKF block diagram. Here, Z⁻¹ is the unit delay.

An illustration of the EKF is depicted in Fig. 1. The resulting filter admits an efficient linear recursive structure. However, it requires full knowledge of the underlying model and notably degrades in the presence of model mismatch. When the model is highly non-linear, the local linearity approximation may not hold, and the EKF can result in degraded performance. This motivates the augmentation of the EKF into the deep learning-aided KalmanNet, detailed next.
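As a reference for the recursion in (3)–(9), here is a minimal sketch of a single EKF time step. The finite-difference Jacobian is an assumed stand-in for the analytical Jacobians Jf and Jh in (9); any known closed-form Jacobians can be substituted.

```python
import numpy as np

def numerical_jacobian(g, x, eps=1e-6):
    """Finite-difference Jacobian of g at x (stand-in for Jf, Jh in (9))."""
    m = g(x).shape[0]
    J = np.zeros((m, x.shape[0]))
    for i in range(x.shape[0]):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (g(x + dx) - g(x - dx)) / (2 * eps)
    return J

def ekf_step(x_post, Sigma_post, y, f, h, Q, R):
    """One EKF iteration: predict via (8), linearize via (9), update via (5)-(7)."""
    # Prediction of first-order moments, (8a)-(8b)
    x_prior = f(x_post)
    y_prior = h(x_prior)
    # Linearization (9a)-(9b)
    F_t = numerical_jacobian(f, x_post)
    H_t = numerical_jacobian(h, x_prior)
    # Second-order moments (3b), (4b)
    Sigma_prior = F_t @ Sigma_post @ F_t.T + Q
    S = H_t @ Sigma_prior @ H_t.T + R
    # Kalman gain (6), innovation (7), and update (5a)-(5b)
    K = Sigma_prior @ H_t.T @ np.linalg.inv(S)
    dy = y - y_prior
    x_post_new = x_prior + K @ dy
    Sigma_post_new = Sigma_prior - K @ S @ K.T
    return x_post_new, Sigma_post_new
```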
III. KALMANNET

Here we present KalmanNet, a hybrid, interpretable, data-efficient architecture for real-time state estimation in non-linear dynamical systems with partial domain knowledge. KalmanNet combines MB Kalman filtering with an RNN to cope with model mismatch and non-linearities. To introduce KalmanNet, we begin by explaining its high level operation in Subsection III-A. Then we present the features processed by its internal RNN and the specific architectures considered for implementing and training KalmanNet in Subsections III-B–III-D. Finally, we provide a discussion in Subsection III-E.

A. High Level Architecture

We formulate KalmanNet by identifying the specific computations of the EKF that are based on unavailable knowledge. As detailed in Subsection II-B, the functions f(·) and h(·) are known (though perhaps inaccurately); yet the covariance matrices Q and R are unavailable. These missing statistical moments are used in MB Kalman filtering only for computing the KG (see Fig. 1). Thus, we design KalmanNet to learn the KG from data, and combine the learned KG in the overall KF flow. This high level architecture is illustrated in Fig. 2.

Fig. 2: KalmanNet block diagram.

In each time instance t ∈ T, similarly to the EKF, KalmanNet estimates x̂t in two steps: prediction and update.
1) The prediction step is the same as in the MB EKF, except that only the first-order statistical moments are predicted. In particular, a prior estimate for the current state x̂t|t−1 is computed from the previous posterior x̂t−1 via (8a). Then, a prior estimate for the current observation ŷt|t−1 is computed from x̂t|t−1 via (8b). As opposed to its MB counterparts, KalmanNet does not rely on knowledge of the noise distribution, and does not maintain an explicit estimate of the second-order statistical moments.
2) In the update step, KalmanNet uses the new observation yt to compute the current state posterior x̂t from the previously computed prior x̂t|t−1 in a similar manner to the MB KF as in (5a), i.e., using the innovation term ∆yt computed via (7) and the KG Kt. As opposed to the MB EKF, here the computation of the KG is not given explicitly; rather, it is learned from data using an RNN, as illustrated in Fig. 2. The inherent memory of RNNs allows the second-order statistical moments to be tracked implicitly, without requiring knowledge of the underlying noise statistics.

Designing an RNN to learn how to compute the KG as part of an overall KF flow requires answers to three key questions:
1) From which input features (signals) will the network learn the KG?
2) What should be the architecture of the internal RNN?
3) How will this network be trained from data?
In the following sections we address these questions.
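The following sketch summarizes one KalmanNet time step as described above: the prediction uses the (possibly approximate) f and h, and the gain is produced by a learned recurrent module instead of (6). The callable kg_rnn and its state are hypothetical placeholders for the RNN specified in Subsections III-B and III-C.

```python
import numpy as np

def kalmannet_step(x_post_prev, y, f, h, kg_rnn, rnn_state):
    """One KalmanNet time step: model-based prediction, learned-gain update.

    kg_rnn(features, state) -> (K_t, new_state) is assumed to be a trained
    recurrent module returning an m x n gain; no second-order moments are kept.
    """
    # Prediction step: first-order moments only, via (8a)-(8b)
    x_prior = f(x_post_prev)
    y_prior = h(x_prior)
    # Innovation (7); further input features are discussed in Subsection III-B
    dy = y - y_prior
    # Learned Kalman gain replaces the explicit computation in (6)
    K, rnn_state = kg_rnn(dy, rnn_state)
    # Update step (5a)
    x_post = x_prior + K @ dy
    return x_post, rnn_state
```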
B. Input Features

The MB KF and its variants compute the KG from knowledge of the underlying statistics. To implement such a computation in a learned fashion, the neural network must be provided with inputs (features) that capture the knowledge needed to evaluate the KG. The dependence of Kt on the statistics of the observations and the state process indicates that in order to track it, in every time step t ∈ T the RNN should be provided with input containing statistical information on the observations yt and the state estimate x̂t−1. Therefore, the following quantities, which are related to the unknown statistical relationship of the SS model, can be used as input features to the RNN:
F1 The observation difference ∆ỹt = yt − yt−1.
F2 The innovation difference ∆yt = yt − ŷt|t−1.
F3 The forward evolution difference ∆x̃t = x̂t|t − x̂t−1|t−1. This quantity represents the difference between two consecutive posterior state estimates, where in time instance t the available feature is ∆x̃t−1.
F4 The forward update difference ∆x̂t = x̂t|t − x̂t|t−1, i.e., the difference between the posterior state estimate and the prior state estimate, where again in time instance t we use ∆x̂t−1.
Features F1 and F3 encapsulate information about the state-evolution process, while features F2 and F4 encapsulate the uncertainty of our state estimate. The difference operation removes the predictable components, and thus the time series of differences is mostly affected by the noise statistics that we wish to learn. The RNN described in Fig. 2 can use all the features, although extensive empirical evaluation suggests that the specific choice of combination of features depends on the problem at hand. Our empirical observations indicate that good combinations are {F1, F2, F4} and {F1, F3, F4}.

C. Neural Network Architecture

The internal DNN of KalmanNet uses the features discussed in the previous section to compute the KG. It follows from (6) that computing the KG Kt involves tracking the second-order statistical moments Σt. The recursive nature of the KG computation indicates that its learned module should involve an internal memory element, such as an RNN, to track it.

Fig. 3: KalmanNet RNN block diagram (architecture #1). The architecture is comprised of a fully connected input layer, followed by a GRU layer (whose internal division into gates is illustrated [21]), and an output fully connected layer. Here, the input features are F2 and F4.

We consider two architectures for the KG computing RNN. The first, illustrated in Fig. 3, aims at using the internal memory of RNNs to jointly track the underlying second-order statistical moments required for computing the KG in an implicit manner. To that aim, we use GRU cells [21] whose hidden state size is an integer multiple of m² + n², which is the joint dimensionality of the tracked moments Σ̂t|t−1 in (3b) and Ŝt in (4b). In particular, we first use a fully connected (FC) input layer whose output is the input to the GRU. The GRU state vector ht is mapped into the estimated KG Kt ∈ R^{m×n} using an output FC layer with m · n neurons. While the illustration in Fig. 3 uses a single GRU layer, one can also utilize multiple layers to increase the capacity and abstractness of the network, as we do in the numerical study reported in Subsection IV-E. The proposed architecture does not directly design the hidden state of the GRU to correspond to the unknown second-order statistical moments that are tracked by the MB KF. As such, it uses a relatively large number of state variables that are expected to provide the required tracking capacity. For example, in the numerical study in Section IV we set the dimensionality of ht to be 10 · (m² + n²). This often results in substantial over-parameterization, as the number of GRU parameters grows quadratically with the number of state variables [68].

Fig. 4: KalmanNet RNN block diagram (architecture #2). The input features are used to update three GRUs with dedicated FC layers, and the overall interconnection between the blocks is based on the flow of the KG computation in the MB KF.

The second architecture uses separate GRU cells for each tracked second-order statistical moment. The division of the architecture into separate GRU cells and FC layers and their interconnection is illustrated in Fig. 4. As shown in the figure, the network is composed of three GRU layers, connected in a cascade with dedicated input and output FC layers. The first GRU layer tracks the unknown state noise covariance Q, thus tracking m² variables. Similarly, the second and third GRUs track the predicted moments Σ̂t|t−1 (3b) and Ŝt (4b), thus having m² and n² hidden state variables, respectively. The GRUs are interconnected such that the learned Q is used to compute Σ̂t|t−1, which in turn is used to obtain Ŝt, while both Σ̂t|t−1 and Ŝt are involved in producing Kt (6). This architecture, which is composed of a non-standard interconnection between GRUs and FC layers, is more directly tailored towards the formulation of the SS model and the operation of the MB KF compared with the simpler first architecture. As such, it provides less abstraction; i.e., it is expected to be more constrained in the family of mappings it can learn compared with the first architecture, while as a result also requiring fewer trainable parameters. For instance, in the numerical study reported in Subsection IV-D, utilizing the first architecture requires on the order of 5 · 10^5 trainable parameters, while the second architecture utilizes merely 2.5 · 10^4 parameters.
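A PyTorch sketch of the first architecture (Fig. 3) follows, under the stated choices: an FC input layer, a GRU whose hidden size is 10 · (m² + n²), and an FC output layer with m · n neurons fed by features F2 and F4. The activation choice and the single-sample interface are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class KalmanGainRNN(nn.Module):
    """Architecture #1 sketch: FC -> GRU -> FC, producing the m x n Kalman gain."""

    def __init__(self, m, n, hidden_mult=10):
        super().__init__()
        hidden = hidden_mult * (m ** 2 + n ** 2)   # ~ joint size of Sigma and S
        in_dim = m + n                              # features F4 (R^m) and F2 (R^n)
        self.m, self.n = m, n
        self.fc_in = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.gru = nn.GRUCell(hidden, hidden)
        self.fc_out = nn.Linear(hidden, m * n)      # m*n output neurons -> gain entries

    def forward(self, delta_x, delta_y, h_prev):
        """delta_x: F4 feature (m,), delta_y: F2 feature (n,), h_prev: GRU state (1, hidden)."""
        feats = torch.cat([delta_x, delta_y]).unsqueeze(0)
        h = self.gru(self.fc_in(feats), h_prev)
        K = self.fc_out(h).reshape(self.m, self.n)
        return K, h

# Usage sketch
m, n = 2, 2
net = KalmanGainRNN(m, n)
h0 = torch.zeros(1, 10 * (m ** 2 + n ** 2))
K, h1 = net(torch.zeros(m), torch.zeros(n), h0)
print(K.shape)  # torch.Size([2, 2])
```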
D. Training Algorithm

KalmanNet is trained using the available labeled data set in a supervised manner. While we use a neural network for computing the KG rather than for directly producing the estimate x̂t|t, we train KalmanNet end-to-end. Namely, we compute the loss function L based on the state estimate x̂t, which is not the output of the internal RNN. Since this vector takes values in a continuous set R^m, we use the squared-error loss

L = ‖xt − x̂t|t‖²,   (10)

which is also used to evaluate the MB KF. By doing so, we build upon the ability to backpropagate the loss to the computation of the KG. One can obtain the loss gradient with respect to the KG from the output of KalmanNet since

∂L/∂Kt = ∂‖Kt · ∆yt − ∆xt‖² / ∂Kt = 2 · (Kt · ∆yt − ∆xt) · ∆yt⊤,   (11)

where ∆xt ≜ xt − x̂t|t−1. The gradient computation in (11) indicates that one can learn the computation of the KG by training KalmanNet end-to-end using the squared-error loss. In particular, this allows the overall filtering system to be trained without having to externally provide ground truth values of the KG for training purposes.

The data set used for training comprises N trajectories that can be of varying lengths. Namely, by letting Ti be the length of the ith training trajectory, the data set is given by D = {(Yi, Xi)}_{i=1}^{N}, where

Yi = [y1^(i), . . . , yTi^(i)],   Xi = [x0^(i), x1^(i), . . . , xTi^(i)].   (12)

By letting Θ denote the trainable parameters of the RNN, and γ be a regularization coefficient, we then construct an ℓ2-regularized mean-squared error (MSE) loss measure

ℓi(Θ) = (1/Ti) Σ_{t=1}^{Ti} ‖x̂t(yt^(i); Θ) − xt^(i)‖² + γ · ‖Θ‖².   (13)

To optimize Θ, we use a variant of mini-batch stochastic gradient descent in which for every batch indexed by k, we choose M < N trajectories indexed by i1^k, . . . , iM^k, computing the mini-batch loss as

Lk(Θ) = (1/M) Σ_{j=1}^{M} ℓ_{ij^k}(Θ).   (14)

Since KalmanNet is a recursive architecture with both an external recurrence and an internal RNN, we use the backpropagation through time (BPTT) algorithm [69] to train it. Specifically, we unfold KalmanNet across time with shared network parameters, and then compute a forward and backward gradient estimation pass through the network. We consider three different variations of applying the BPTT algorithm for training KalmanNet:
V1 Direct application of BPTT, where for each training iteration the gradients are computed over the entire trajectory.
V2 An application of the truncated BPTT algorithm [70]. Here, given a data set of long trajectories (e.g., T = 3000 time steps), each long trajectory is divided into multiple short trajectories (e.g., T = 100 time steps), which are shuffled and used during training.
V3 An alternative application of truncated BPTT, where we truncate each trajectory to a fixed (and relatively short) length, and train using these short trajectories.

Overall, directly applying BPTT via V1 may be computationally expensive and unstable. Therefore, a favorable approach is to first use the truncated BPTT as in V2 as a warm-up phase (train first on short trajectories) in order to stabilize the learning process, after which KalmanNet is tuned using V1. The procedure in V3 is most suitable for systems that are likely to quickly converge to a steady state (e.g., linear SS models). In our numerical study reported in Section IV we utilize all three approaches.

E. Discussion

KalmanNet is designed to operate in a hybrid DD/MB manner, combining deep learning with the classical EKF procedure. By identifying the specific noise-model-dependent computations of the EKF and replacing them with a dedicated RNN integrated in the EKF flow, KalmanNet benefits from the individual strengths of both DD and MB approaches. The augmentation of the EKF with dedicated deep learning modules results in several core differences between KalmanNet and its MB counterpart. Unlike the MB EKF, KalmanNet does not attempt to linearize the SS model, and does not impose a statistical model on the noise signals. In addition, KalmanNet filters in a non-linear manner, as its KG matrix depends on the input yt. Due to these differences, compared to MB Kalman filtering, KalmanNet is more robust to model mismatch and can infer more efficiently, as demonstrated in Section IV. In particular, the MB EKF is sensitive to inaccuracies in the underlying SS model, e.g., in f(·) and h(·), while KalmanNet can overcome such uncertainty by learning an alternative KG that yields accurate estimation.

Furthermore, KalmanNet is derived for SS models whose noise statistics are not specified explicitly. An MB approach to tackle this without relying on data employs the robust Kalman filter [15]–[17], which designs the filter to minimize the maximal MSE within some range of assumed SS models, at the cost of performance loss compared to knowing the true model. When one has access to data, the direct strategy to implement the EKF in such setups is to use the data to estimate Q and R, either directly from the data or by backpropagating through the operation of the EKF as in [51], and utilize these estimates to compute the KG. As covariance estimation can be a challenging task when dealing with high-dimensional signals, KalmanNet bypasses this need by directly learning the KG, and by doing so approaches the MSE of MB Kalman filtering with full knowledge of the SS model, as demonstrated in Section IV. Finally, the computation complexity for each time step t ∈ T is linear in the RNN dimensions and does not involve matrix inversion. This implies that KalmanNet is a good candidate for high dimensional SS models and for computationally limited devices.

Compared to purely DD state estimation, KalmanNet benefits from its model-awareness and the fact that its operation follows the flow of MB Kalman filtering rather than being utilized as a black box. As numerically observed in Section IV, KalmanNet achieves improved MSE compared to utilizing RNNs for end-to-end state estimation, and also approaches the MMSE performance achieved by the MB KF in linear Gaussian SS models. Furthermore, the fact that KalmanNet preserves the flow of the EKF implies that the intermediate features exchanged between its modules have a specific operational meaning, providing interpretability that is often scarce in end-to-end deep learning systems. Finally, the fact that KalmanNet learns to compute the KG indicates the possibility of providing not only estimates of the state xt, but also a measure of confidence in this estimate, as the KG can be related to the covariance of the estimate, as initially explored in [71].

These combined gains of KalmanNet over purely MB and DD approaches were recently observed in [72], which utilized an early version of KalmanNet for real-time velocity estimation in an autonomous racing car. In such a setup, a non-linear MB mixed KF was traditionally used, and suffered from performance degradation due to inherent mismatches in the formulation of the SS model describing the problem. Nonetheless, previously proposed DD techniques relying on RNNs for end-to-end state estimation were not operable at the desired frequencies on the hardware-limited vehicle control unit. It was shown in [72] that the application of KalmanNet allowed improved real-time velocity tracking compared to MB techniques while being deployed on the control unit of the vehicle.

Our design of KalmanNet gives rise to many interesting future extensions. Since we focus here on SS models where the mappings f(·) and h(·) are known up to some approximation errors, a natural extension of KalmanNet is to use the data to pre-estimate them, as demonstrated briefly in the numerical study. Another alternative to cope with these approximation errors is to utilize dedicated neural networks to learn these mappings while training the entire model in an end-to-end fashion. Doing so is expected to allow KalmanNet to be utilized in scenarios with analytically intractable SS models, as often arises when tracking based on unstructured observations, e.g., visual observations as in [56].

While we train KalmanNet in a supervised manner using labeled data, the fact that it preserves the operation of the MB EKF, which produces a prediction of the next observation ŷt|t−1 at each time instance, indicates the possibility of using this intermediate feature for unsupervised training. One can thus envision KalmanNet being trained offline in a supervised manner, while tracking variations in the underlying SS model at run-time by online self supervision, following a similar rationale to that used in [24], [25] for deep symbol detection in time-varying communication channels.

Finally, we note that while we focus here on filtering tasks, SS models are used to represent additional related problems such as smoothing and prediction, as discussed in Subsection II-A. The fact that KalmanNet does not explicitly estimate the SS model implies that it cannot simply substitute these parameters into an alternative algorithm capable of carrying out tasks other than filtering. Nonetheless, one can still design DNN-aided algorithms for these tasks operating with partially known SS models as extensions of KalmanNet, in the same manner as many MB algorithms build upon the KF. For instance, as the MB KF constitutes the first part of the Rauch-Tung-Striebel smoother [73], one can extend KalmanNet to implement high-performance smoothing in partially known SS models, as we have recently begun investigating in [67]. Nonetheless, we leave the exploration of extensions of KalmanNet to alternative tasks associated with SS models for future work.
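To make the supervised procedure of Subsection III-D concrete, a schematic training loop is sketched below; kalman_net and train_trajectories are hypothetical placeholders, the weight decay term plays the role of γ‖Θ‖² in (13), and iterating over short trajectories corresponds to the truncated-BPTT variants V2/V3.

```python
import torch

def train_kalmannet(kalman_net, train_trajectories, epochs=50, lr=1e-3, gamma=1e-4):
    """Supervised training sketch (Subsection III-D).

    train_trajectories: iterable of (Y, X) pairs, Y of shape (T, n), X of shape (T, m);
    kalman_net(Y) is assumed to run the recursion over the whole trajectory and
    return the state estimates of shape (T, m). Both are placeholders.
    """
    opt = torch.optim.Adam(kalman_net.parameters(), lr=lr, weight_decay=gamma)
    mse = torch.nn.MSELoss()
    for _ in range(epochs):
        for Y, X in train_trajectories:          # short trajectories, one at a time
            opt.zero_grad()
            X_hat = kalman_net(Y)                # unfold the filter over time (BPTT graph)
            loss = mse(X_hat, X)                 # squared-error loss as in (10), (13)
            loss.backward()                      # backpropagation through time
            opt.step()
    return kalman_net
```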
IV. EXPERIMENTS AND RESULTS

In this section we present an extensive numerical study of KalmanNet, evaluating its performance in multiple setups and comparing it to various benchmark algorithms:
(a) In our first experimental study we consider multiple linear SS models, and compare KalmanNet to the MB KF, which is known to minimize the MSE in such a setup. We also confirm our design and architectural choices by comparing KalmanNet with alternative RNN based end-to-end state estimators.
(b) We next consider two non-linear SS models, a sinusoidal model and the chaotic Lorenz attractor. We compare KalmanNet with the common non-linear MB benchmarks; namely, the EKF, UKF, and PF.
(c) In our last study we consider a localization use case based on the Michigan NCLT data set [28]. Here, we compare KalmanNet with an MB KF that assumes a linear Wiener kinematic model [36] and with a vanilla RNN based end-to-end state estimator, and demonstrate the ability of KalmanNet to track real world dynamics that were not synthetically generated from an underlying SS model.

… standard deviation, where we denote these measures by µ̂ and σ̂, respectively.

1) KalmanNet Setting: In Section III we present several architectures and training mechanisms that can be used when implementing KalmanNet. In our experimental study we consider four different configurations of KalmanNet:
C1 KalmanNet architecture #1 with input features {F2, F4} and with training algorithm V3.
C2 KalmanNet architecture #1 with input features {F2, F4} and with training algorithm V1.
C3 KalmanNet architecture #1 with input features {F1, F3, F4} and with training algorithm V2.
C4 KalmanNet architecture #2 with all input features and with training algorithm V1.
In all our experiments KalmanNet was trained using the Adam optimizer [74].

2) Model-Based Filters: In the following experimental study we compare KalmanNet with several MB filters. For the UKF we used the software package [75], while the PF is implemented based on [76] using 100 particles and without parallelization. During our numerical study, when model uncertainty was introduced, we optimized the performance of the MB algorithms by carefully tuning the covariance matrices, usually via a grid search. For long trajectories (e.g., T > 1500) it was sometimes necessary to tune these matrices, even in the case of full information, to compensate for inaccurate uncertainty propagation due to non-linear approximations and to avoid divergence.
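For reference, a UKF baseline along the lines of [75] (FilterPy) can be set up roughly as follows; the model functions, noise covariances, and sigma-point parameters here are illustrative assumptions rather than the exact configurations used in the reported experiments.

```python
import numpy as np
from filterpy.kalman import UnscentedKalmanFilter, MerweScaledSigmaPoints

# Illustrative 2D setup; fx/hx stand in for the SS model functions f, h.
def fx(x, dt):
    return np.array([x[0] + dt * x[1], x[1]])

def hx(x):
    return x.copy()

points = MerweScaledSigmaPoints(n=2, alpha=0.1, beta=2.0, kappa=0.0)
ukf = UnscentedKalmanFilter(dim_x=2, dim_z=2, dt=1.0, fx=fx, hx=hx, points=points)
ukf.x = np.zeros(2)
ukf.P = np.eye(2)
ukf.Q = 0.1 * np.eye(2)   # process-noise covariance (tuned by grid search in practice)
ukf.R = 0.5 * np.eye(2)   # observation-noise covariance

for y in np.random.default_rng(0).normal(size=(100, 2)):   # dummy observation stream
    ukf.predict()
    ukf.update(y)
```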
Fig. 5: (a) KalmanNet converges to MMSE. (b) Learning curves for DD state estimation.

• Vanilla RNN directly maps the observed yt to an estimate of the state x̂t.
• MB RNN imitates the Kalman filtering operation by first recovering x̂t|t−1 using domain knowledge, i.e., via (3a), and then uses the RNN to estimate an increment ∆x̂t from the prior to the posterior.

All RNNs utilize the same architecture as in KalmanNet with a single GRU layer and the same learning hyperparameters. In this experiment we test the trained models on trajectories with the same length as they were trained on, namely T = 20. We can clearly observe how each of the key design considerations of KalmanNet affects the learning curves depicted in Fig. 5b:
• The incorporation of the known SS model allows the MB RNN to outperform the vanilla RNN, although both converge slowly and fail to achieve the MMSE.
• Using the sequences of differences as input notably improves the convergence rate of the MB RNN, indicating the benefits of using the differences as features, as discussed in Subsection III-B.
• Learning is further improved by using the RNN for recovering the KG as part of the KF flow, as done by KalmanNet, rather than for directly estimating xt.

TABLE I: Test MSE in [dB] when trained using T = 20.
Test T | Vanilla RNN | MB RNN | MB RNN, diff. | KalmanNet | KF
20     | -20.98      | -21.53 | -21.92        | -21.92    | -21.97
200    |  58.14      |  36.8  | -21.88        | -21.90    | -21.91

To further evaluate the gains of KalmanNet over end-to-end RNNs, we compare the pre-trained models using trajectories with different initial conditions and a longer time horizon (T = 200) than the one on which they were trained (T = 20). The results, summarized in Table I, show that KalmanNet still achieves the MMSE, as already observed in Fig. 5a. The MB RNN and vanilla RNN are more than 50 [dB] from the MMSE, implying that their learning is not transferable and that they do not learn to implement Kalman filtering. However, when provided with the difference features as we proposed in Subsection III-B, the DD systems are shown to be applicable to longer trajectories, with KalmanNet achieving MSE within a minor gap of that achieved by the MB KF. The results of this study validate the considerations used in designing KalmanNet for the DD filtering problem discussed in Subsection II-B.

3) Partial Information: To conclude our study on linear models, we next evaluate the robustness of KalmanNet to model mismatch as a result of partial model information. We simulate a 2×2 SS model with mismatches in either the state-evolution model (F) or in the state-observation model (H).

State-Evolution Mismatch: Here, we set T = 20 and ν = 0 [dB], and use a rotated evolution matrix Fα◦, α ∈ {10◦, 20◦}, for data generation. The state-evolution matrix available to the filters, denoted F0, is again set to take the controllable canonical form. The mismatched design matrix F0 is related to the true Fα◦ via

Fα◦ = R^{xy}_{α◦} · F0,   R^{xy}_{α◦} = [ cos α  −sin α ; sin α  cos α ].   (16)

Such scenarios represent a setup in which the analytical approximation of the SS model differs from the true generative model. The resulting MSE curves depicted in Fig. 6a demonstrate that KalmanNet (with setting C2) achieves a 3 [dB] gain over the MB KF. In particular, despite the fact that KalmanNet implements the KF with an inaccurate state-evolution model, it learns to apply an alternative KG, resulting in MSE within a minor gap from the MMSE; i.e., from the KF with the true Fα◦ plugged in.

State-Observation Mismatch: Next, we simulate a setup with state-observation mismatch while setting T = 100 and ν = −20 [dB]. The model mismatch is achieved by using a rotated observation matrix Hα=10◦ for data generation, while using H = I as the observation design matrix. Such scenarios represent a setup in which a slight misalignment (≈ 5%) of the sensors exists. The resulting achieved MSE depicted in Fig. 6b demonstrates that KalmanNet (with setting C2) converges to within a minor gap from the MMSE. Here, we also conducted an additional experiment, where we first estimated the observation matrix from data, and then had KalmanNet use the estimated matrix, denoted Ĥα. In this case it is observed in Fig. 6b that KalmanNet achieves the MMSE lower bound. These results imply that KalmanNet converges also in distribution to the KF.
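The rotated evolution matrix in (16) can be generated with a few lines; the entries of F0 below are placeholders standing in for the controllable canonical form used in the experiment, which is not restated here.

```python
import numpy as np

def rotated_F(F0, alpha_deg):
    """Return F_alpha = R_xy(alpha) @ F0 as in (16)."""
    a = np.deg2rad(alpha_deg)
    R_xy = np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])
    return R_xy @ F0

# Illustrative controllable canonical form (coefficients are placeholders).
F0 = np.array([[0.0, 1.0],
               [-0.5, 1.0]])
F10 = rotated_F(F0, 10.0)   # data-generating matrix under 10-degree mismatch
```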
For the synthetic non-linear SS model, the observation mapping takes the form

h(x) = a · (b · x + c)²,   y ∈ R.   (17b)
TABLE IV: MSE [dB] - Synthetic non-linear SS model; partial information.
1/r² [dB]    | −12.04 | −6.02  | 0      | 20     | 40
EKF µ̂        | -2.99  | -5.07  | -7.57  | -22.67 | -36.55
EKF σ̂        | ±0.63  | ±0.89  | ±0.45  | ±0.42  | ±0.3
UKF µ̂        | -0.91  | -1.54  | -5.18  | -24.06 | -37.96
UKF σ̂        | ±0.60  | ±0.23  | ±0.29  | ±0.43  | ±2.21
PF µ̂         | -2.32  | -3.29  | -4.83  | -23.66 | -33.13
PF σ̂         | ±0.89  | ±0.53  | ±0.64  | ±0.48  | ±0.45
KalmanNet µ̂  | -6.62  | -11.60 | -15.83 | -34.23 | -45.29
KalmanNet σ̂  | ±0.46  | ±0.45  | ±0.44  | ±0.58  | ±0.64

TABLE V: MSE [dB] - Lorenz attractor with noisy state observations.
1/r² [dB]  | 0      | 10     | 20     | 30     | 40
EKF        | -10.45 | -20.37 | -30.40 | -40.39 | -49.89
UKF        | -5.62  | -12.04 | -20.45 | -30.05 | -40.00
PF         | -9.78  | -18.13 | -23.54 | -30.16 | -33.95
KalmanNet  | -9.79  | -19.75 | -29.37 | -39.68 | -48.99

TABLE VI: MSE [dB] - Lorenz attractor with non-linear observations.
1/r² [dB]  | −10   | 0     | 10    | 20     | 30
EKF        | 26.38 | 21.78 | 14.50 | 4.84   | -4.02
UKF        | nan   | nan   | nan   | nan    | nan
PF         | 24.85 | 20.91 | 14.23 | 11.93  | 4.35
KalmanNet  | 14.55 | 6.77  | -1.77 | -10.57 | -15.24

In particular, the noiseless state-evolution of the continuous-time process xτ with τ ∈ R+ is given by

∂xτ/∂τ = A(xτ) · xτ,   A(xτ) = [ −10   10     0
                                  28   −1    −x1,τ
                                   0   x1,τ  −8/3 ].   (18)

To get a discrete-time state-evolution model, we repeat the steps used in [35]. First, we sample the noiseless process with sampling interval ∆τ and assume that A(xτ) can be kept constant in a small neighborhood of xτ; i.e.,

A(xτ) ≈ A(xτ+∆τ).

Then, the continuous-time solution of the differential system (18), which is valid in the neighborhood of xτ for a short time interval ∆τ, is

xτ+∆τ = exp(A(xτ) · ∆τ) · xτ.   (19)

Finally, we take the Taylor series expansion of (19) and a finite series approximation (with J coefficients), which results in

F(xτ) ≜ exp(A(xτ) · ∆τ) ≈ I + Σ_{j=1}^{J} (A(xτ) · ∆τ)^j / j!.   (20)

The resulting discrete-time evolution process is given by

xt+1 = f(xt) = F(xt) · xt.   (21)

The discrete-time state-evolution model in (21), with additional process noise, is used for generating the simulated Lorenz attractor data. Unless stated otherwise, the data was generated with Taylor order J = 5 and sampling interval ∆τ = 0.02. In the following experiments, KalmanNet is consistently invariant to the distribution of the noise signals, with the models it uses for f(·) and h(·) varying between the different studies, as discussed in the sequel.

1) Full Information: We first compare KalmanNet to the MB filters when using the state-evolution matrix F computed via (20) with J = 5.

Noisy state observations: Here, we set h(·) to be the identity transformation, such that the observations are noisy versions of the true state. Further, we set ν = −20 [dB] and T = 2000. As observed in Fig. 8a, despite being trained on short trajectories of T = 100, KalmanNet (with setting C3) achieves excellent MSE performance—namely, comparable to the EKF—and outperforms the UKF and PF. The full details of the experiment are given in Table V. All the MB algorithms were optimized for performance; e.g., applying the EKF with full model information achieves an unstable state tracking performance, with MSE values surpassing 30 [dB]. To stabilize the EKF, we had to perform a grid search using the available data set to optimize the process noise Q used by the filter.

Noisy non-linear observations: Next, we consider the case where the observations are given by a non-linear function of the current state, setting h to take the form of a transformation from a cartesian coordinate system to spherical coordinates. We further set T = 20 and ν = 0 [dB]. From the results depicted in Fig. 8b and reported in Table VI we observe that in such non-linear setups, the sub-optimal MB approaches operating with full information of the SS model are substantially outperformed by KalmanNet (with setting C4).

2) Partial Information: We proceed to evaluate KalmanNet and compare it to its MB counterparts under partial model information. We consider three possible sources of model mismatch arising in the Lorenz attractor setup:
• State-evolution mismatch due to the use of a Taylor series approximation of insufficient order.
• State-observation mismatch as a result of misalignment due to rotation.
• State-observation mismatch as a result of sampling from continuous-time to discrete-time.
Since the EKF produced the best results in the full information case among all non-linear MB filtering algorithms, we use it as a baseline for the MSE lower bound.

State-evolution mismatch: In this study, both KalmanNet and the MB algorithms operate with a crude approximation of the evolution dynamics obtained by computing (20) with J = 2, while the data is generated with an order J = 5 Taylor series expansion. We again set h to be the identity mapping, T = 2000, and ν = −20 [dB]. The results, depicted in Fig. 9a and reported in Table VII, demonstrate that KalmanNet (with setting C4) learns to partially overcome this model mismatch, outperforming its MB counterparts operating with the same level of partial information.

State-observation rotation mismatch: Here, the presence of mismatch in the observations model is simulated by using data generated by an identity matrix rotated by merely θ = 1◦. This rotation is equivalent to a sensor misalignment of ≈ 0.55%. The results depicted in Fig. 9b and reported in Table VIII clearly demonstrate that this allegedly minor rotation can cause …
Fig. 8: (a) T = 2000, ν = −20 [dB], h(·) = I. (b) T = 20, ν = 0 [dB], h(·) non-linear.
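The discretization in (18)–(21) translates directly into code; the sketch below uses the default values stated above (J = 5, ∆τ = 0.02) and omits the process noise added during data generation.

```python
import numpy as np
from math import factorial

def A_lorenz(x):
    """Continuous-time system matrix A(x) of the Lorenz attractor, as in (18)."""
    return np.array([[-10.0, 10.0, 0.0],
                     [28.0, -1.0, -x[0]],
                     [0.0, x[0], -8.0 / 3.0]])

def F_taylor(x, dtau=0.02, J=5):
    """Finite Taylor-series approximation of exp(A(x) * dtau), as in (20)."""
    Ad = A_lorenz(x) * dtau
    F = np.eye(3)
    term = np.eye(3)
    for j in range(1, J + 1):
        term = term @ Ad
        F = F + term / factorial(j)
    return F

def f_lorenz(x, dtau=0.02, J=5):
    """Discrete-time noiseless evolution x_{t+1} = F(x_t) x_t, as in (21)."""
    return F_taylor(x, dtau, J) @ x

x = np.array([1.0, 1.0, 1.0])
for _ in range(10):
    x = f_lorenz(x)
```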
… readings (e.g., GPS and odometer) and the ground truth locations of a moving Segway robot. Given these noisy readings, the goal of the tracking algorithm is to localize the Segway from the raw measurements at any given time.

TABLE X: Numerical MSE in [dB] for the NCLT experiment.
Baseline | EKF    | KalmanNet | Vanilla RNN
25.47    | 25.385 | 22.2      | 40.21
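For context on the MB baseline mentioned above, a common constant-velocity (Wiener-type) kinematic model with position-only observations can be written as follows; the state ordering, sampling interval, and matrices are illustrative assumptions, not the exact model used in the NCLT experiment.

```python
import numpy as np

dt = 1.0  # sampling interval (placeholder value)

# 2D constant-velocity (white-noise acceleration) model: state = [px, py, vx, vy]
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)

# Position-only observations, e.g., GPS readings
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
```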
[33] E. Archer, I. M. Park, L. Buesing, J. Cunningham, and L. Paninski, "Black box variational inference for state space models," arXiv preprint arXiv:1511.07367, 2015.
[34] R. Krishnan, U. Shalit, and D. Sontag, "Structured inference networks for nonlinear state space models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
[35] V. G. Satorras, Z. Akata, and M. Welling, "Combining generative and discriminative models for hybrid inference," in Advances in Neural Information Processing Systems, 2019, pp. 13802–13812.
[36] Y. Bar-Shalom, X. R. Li, and T. Kirubarajan, Estimation with Applications to Tracking and Navigation: Theory, Algorithms and Software. John Wiley & Sons, 2004.
[37] K.-V. Yuen and S.-C. Kuok, "Online updating and uncertainty quantification using nonstationary output-only measurement," Mechanical Systems and Signal Processing, vol. 66, pp. 62–77, 2016.
[38] H.-Q. Mu, S.-C. Kuok, and K.-V. Yuen, "Stable robust extended Kalman filter," Journal of Aerospace Engineering, vol. 30, no. 2, p. B4016010, 2017.
[39] I. Arasaratnam, S. Haykin, and R. J. Elliott, "Discrete-time nonlinear filtering algorithms using Gauss–Hermite quadrature," Proc. IEEE, vol. 95, no. 5, pp. 953–977, 2007.
[40] I. Arasaratnam and S. Haykin, "Cubature Kalman filters," IEEE Trans. Autom. Control, vol. 54, no. 6, pp. 1254–1269, 2009.
[41] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Trans. Signal Process., vol. 50, no. 2, pp. 174–188, 2002.
[42] N. Chopin, P. E. Jacob, and O. Papaspiliopoulos, "SMC2: An efficient algorithm for sequential analysis of state space models," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 75, no. 3, pp. 397–426, 2013.
[43] L. Martino, V. Elvira, and G. Camps-Valls, "Distributed particle Metropolis-Hastings schemes," in IEEE Statistical Signal Processing Workshop (SSP), 2018, pp. 553–557.
[44] C. Andrieu, A. Doucet, and R. Holenstein, "Particle Markov chain Monte Carlo methods," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 3, pp. 269–342, 2010.
[45] J. Elfring, E. Torta, and R. van de Molengraft, "Particle filters: A hands-on tutorial," Sensors, vol. 21, no. 2, p. 438, 2021.
[46] R. H. Shumway and D. S. Stoffer, "An approach to time series smoothing and forecasting using the EM algorithm," Journal of Time Series Analysis, vol. 3, no. 4, pp. 253–264, 1982.
[47] Z. Ghahramani and G. E. Hinton, "Parameter estimation for linear dynamical systems," 1996.
[48] J. Dauwels, A. Eckford, S. Korl, and H.-A. Loeliger, "Expectation maximization as message passing—Part I: Principles and Gaussian messages," arXiv preprint arXiv:0910.2832, 2009.
[49] L. Martino, J. Read, V. Elvira, and F. Louzada, "Cooperative parallel particle filters for online model selection and applications to urban mobility," Digital Signal Processing, vol. 60, pp. 172–185, 2017.
[50] P. Abbeel, A. Coates, M. Montemerlo, A. Y. Ng, and S. Thrun, "Discriminative training of Kalman filters," in Robotics: Science and Systems, vol. 2, 2005, p. 1.
[51] L. Xu and R. Niu, "EKFNet: Learning system noise statistics from measurement data," in Proc. IEEE ICASSP, 2021, pp. 4560–4564.
[52] S. T. Barratt and S. P. Boyd, "Fitting a Kalman smoother to data," in 2020 American Control Conference (ACC). IEEE, 2020, pp. 1526–1531.
[53] L. Xie, Y. C. Soh, and C. E. De Souza, "Robust Kalman filtering for uncertain discrete-time systems," IEEE Transactions on Automatic Control, vol. 39, no. 6, pp. 1310–1314, 1994.
[54] C. M. Carvalho, M. S. Johannes, H. F. Lopes, and N. G. Polson, "Particle learning and smoothing," Statistical Science, vol. 25, no. 1, pp. 88–106, 2010.
[55] I. Urteaga, M. F. Bugallo, and P. M. Djurić, "Sequential Monte Carlo methods under model uncertainty," in 2016 IEEE Statistical Signal Processing Workshop (SSP). IEEE, 2016, pp. 1–5.
[56] L. Zhou, Z. Luo, T. Shen, J. Zhang, M. Zhen, Y. Yao, T. Fang, and L. Quan, "KFNet: Learning temporal camera relocalization using Kalman filtering," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4919–4928.
[57] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," preprint arXiv:1312.6114, 2013.
[58] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," in International Conference on Machine Learning. PMLR, 2014, pp. 1278–1286.
[59] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, "Variational inference: A review for statisticians," Journal of the American Statistical Association, vol. 112, no. 518, pp. 859–877, 2017.
[60] T. Haarnoja, A. Ajay, S. Levine, and P. Abbeel, "Backprop KF: Learning discriminative deterministic state estimators," in Advances in Neural Information Processing Systems, 2016, pp. 4376–4384.
[61] B. Laufer-Goldshtein, R. Talmon, and S. Gannot, "A hybrid approach for speaker tracking based on TDOA and data-driven models," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 4, pp. 725–735, 2018.
[62] H. Coskun, F. Achilles, R. DiPietro, N. Navab, and F. Tombari, "Long short-term memory Kalman filters: Recurrent neural estimators for pose regularization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5524–5532.
[63] S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski, "Deep state space models for time series forecasting," in Advances in Neural Information Processing Systems, 2018, pp. 7785–7794.
[64] P. Becker, H. Pandya, G. Gebhardt, C. Zhao, C. J. Taylor, and G. Neumann, "Recurrent Kalman networks: Factorized inference in high-dimensional deep feature spaces," in International Conference on Machine Learning. PMLR, 2019, pp. 544–552.
[65] X. Zheng, M. Zaheer, A. Ahmed, Y. Wang, E. P. Xing, and A. J. Smola, "State space LSTM models with particle MCMC inference," preprint arXiv:1711.11179, 2017.
[66] T. Salimans, D. Kingma, and M. Welling, "Markov chain Monte Carlo and variational inference: Bridging the gap," in International Conference on Machine Learning. PMLR, 2015, pp. 1218–1226.
[67] X. Ni, G. Revach, N. Shlezinger, R. J. van Sloun, and Y. C. Eldar, "RTSNet: Deep learning aided Kalman smoothing," in Proc. IEEE ICASSP, 2022.
[68] R. Dey and F. M. Salem, "Gate-variants of gated recurrent unit (GRU) neural networks," in Proc. IEEE MWSCAS, 2017, pp. 1597–1600.
[69] P. J. Werbos, "Backpropagation through time: What it does and how to do it," Proc. IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[70] I. Sutskever, Training Recurrent Neural Networks. University of Toronto, Toronto, Canada, 2013.
[71] I. Klein, G. Revach, N. Shlezinger, J. E. Mehr, R. J. van Sloun, Y. Eldar et al., "Uncertainty in data-driven Kalman filtering for partially known state-space models," in Proc. IEEE ICASSP, 2022.
[72] A. López Escoriza, G. Revach, N. Shlezinger, and R. J. G. van Sloun, "Data-driven Kalman-based velocity estimation for autonomous racing," in Proc. IEEE ICAS, 2021.
[73] H. E. Rauch, F. Tung, and C. T. Striebel, "Maximum likelihood estimates of linear dynamic systems," AIAA Journal, vol. 3, no. 8, pp. 1445–1450, 1965.
[74] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," preprint arXiv:1412.6980, 2014.
[75] R. Labbe, FilterPy - Kalman and Bayesian Filters in Python, 2020. [Online]. Available: https://filterpy.readthedocs.io/en/latest/
[76] J. Nordh, pyParticleEst - Particle based methods in Python, 2015. [Online]. Available: https://pyparticleest.readthedocs.io/en/latest/index.html
[77] W. Gilpin, "Chaos as an interpretable benchmark for forecasting and data-driven modelling," arXiv preprint arXiv:2110.05266, 2021.