Bellman Filtering for State-Space Models
Rutger-Jan Lange¹
¹ Econometric Institute, Erasmus School of Economics, Rotterdam, Netherlands
Contact: [email protected]
January 8, 2021
Abstract
This article presents a filter for state-space models based on Bellman's dynamic programming principle applied to the mode estimator. The proposed Bellman filter (BF) generalises the Kalman filter (KF), including its extended and iterated versions, while remaining equally inexpensive computationally. Unlike the KF, the BF is robust under heavy-tailed observation noise and applicable to a wider range of (nonlinear and non-Gaussian) models, involving e.g. counts, intensities, durations, volatilities and dependence. (Hyper)parameters are estimated by numerically maximising a BF-implied log-likelihood decomposition, which is an alternative to the classic prediction-error decomposition for linear Gaussian models. Simulation studies reveal that the BF performs on par with (or even outperforms) state-of-the-art importance-sampling techniques, while requiring a fraction of the computational cost, being straightforward to implement and offering full scalability to higher-dimensional state spaces.
1 Introduction
1.1 State-space models
State-space models allow observations to be affected by a hidden state that changes stochastically over time. For discrete times t = 1, 2, . . . , n, the observation y_t ∈ ℝ^l is drawn from a conditional distribution, p(y_t | α_t), while the latent state α_t ∈ ℝ^m follows a first-order Markov process with a known state-transition density, p(α_t | α_{t−1}), and some given initial condition, p(α_1), i.e.

y_t ∼ p(y_t | α_t),   α_t ∼ p(α_t | α_{t−1}),   α_1 ∼ p(α_1).   (1)
Here p(·|·) and p(·) denote generic conditional and marginal densities; i.e. any two p’s need not denote
the same probability density function (e.g. Polson et al., 2008, p. 415). For a given model, the functional
form of all p’s is considered known. These densities may further depend on a fixed (hyper)parameter ψ,
which for notational simplicity is suppressed. Both the observation and state-transition densities may
∗ I thank Maksim Anisimov, Francisco Blasques, Dick van Dijk, Dennis Fok, Maria Grith, Andrew Harvey, Siem Jan
Koopman, Erik Kole, Rutger Lit, Rasmus Lonn, André Lucas, Robin Lumsdaine, Andrea Naghi, Jochem Oorschot,
Richard Paap, Andreas Pick, Krzysztof Postek, Rogier Quaedvlieg, Daniel Ralph, Marcel Scharth, Annika Schnücker,
Stephen Thiele, Nando Vermeer, Sebastiaan Vermeulen, Michel van der Wel, Martina Zaharieva, Mikhail Zhelonkin and
Chen Zhou for their valuable comments.
For simplicity, we assume the mode exists (and is unique), and we take its optimality properties for granted. To address the first drawback of the mode estimator — computational complexity — we employ Bellman's (1957) principle of dynamic programming, thus avoiding the need to solve optimisation problems involving an ever-increasing number of states. Instead, Bellman's equation involves the maximisation over a single state vector of length m for each time step. The required computing cost per time step now remains constant over time, reducing the cumulative computational burden from O(t⁴) to linear in time.
If the state takes a finite number of (discrete) values, Bellman’s equation can be solved exactly
for all time steps, yielding Viterbi’s (1967) algorithm. This algorithm can be used to track, in real
time, the most likely state of a finite set of states (see Table 1) and has proven successful in digital
signal processing applications (Forney, 1973; Viterbi, 2006). Exact solubility of Bellman’s equation
is lost, however, when the states take continuous values in Rm , as in this article. To address this
problem, several authors have considered discretising the state space such that Viterbi’s algorithm
can be used (e.g. Künsch, 2001, p. 125, Godsill et al., 2001). This approach can lead to prohibitive
computational requirements (Künsch, 2001, p. 125): the discretised solution of Bellman’s equation
requires the computation and storage of N^m values for each time step, where N is the number of grid
points in each of m spatial directions (e.g. billions of values if N = 100 and m = 5). This is known
as the ‘curse of dimensionality’ (Bellman, 1957, p. ix). Another difficulty, more so in statistics than in
engineering, is that Viterbi’s algorithm leaves the estimation problem unaddressed.
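To make the discrete-state case concrete, the following sketch (illustrative, not from the original paper; the function and variable names are ours) implements the max-product version of Bellman's equation that underlies Viterbi's algorithm, tracking the real-time most likely state:

```python
import numpy as np

def viterbi_filter(log_obs, log_trans, log_init):
    """Track the real-time most likely state over a finite state space.

    log_obs[t, j]   : log p(y_t | state j)
    log_trans[i, j] : log p(state j at time t | state i at time t-1)
    log_init[j]     : log p(state j at time t = 1)
    """
    n, N = log_obs.shape
    delta = log_init + log_obs[0]          # value function at t = 1
    states = [int(np.argmax(delta))]
    for t in range(1, n):
        # Bellman's equation: maximise over the previous state only,
        # at a cost of O(N^2) per step rather than over all state paths
        delta = log_obs[t] + np.max(delta[:, None] + log_trans, axis=0)
        states.append(int(np.argmax(delta)))
    return np.array(states)
```

Each pass stores only the N current values of the value function, which is precisely what becomes infeasible (N^m values) once the state space is a discretised continuum.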
When the latent states take values in a continuum, the solution to Bellman’s equation is a function,
known as the value function, which maps the (continuous) state space Rm to values in R. While the
value function cannot generally be found exactly, there is one exception to this rule. If model (1)
is linear and Gaussian, Bellman’s equation can be solved in closed form for all time steps, yielding
Kalman’s (1960) filter. In this case, the value function turns out to be multivariate quadratic with a
unique global maximum for every time step; the argmax equals the Kalman-filtered state. That the
Kalman filter corresponds to an exact function-space solution to Bellman’s equation does not appear
to be widely known.
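This claim can be checked numerically. The sketch below (illustrative; the model sizes, prior and variable names are assumptions of ours) runs a scalar Kalman filter on a simulated local-level model and compares its final filtered state with the last element of the joint mode, obtained by direct numerical optimisation; for a linear Gaussian model the two coincide up to optimiser tolerance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Local-level model: y_t = alpha_t + eps_t, alpha_t = alpha_{t-1} + eta_t
sigma_eps, sigma_eta, n, kappa = 1.0, 0.5, 50, 1e6
alpha = np.cumsum(sigma_eta * rng.standard_normal(n))
y = alpha + sigma_eps * rng.standard_normal(n)

# Kalman filter with prior alpha_1 ~ N(0, kappa)
a, P = 0.0, kappa
for t in range(n):
    if t > 0:
        P += sigma_eta ** 2                 # predict
    K = P / (P + sigma_eps ** 2)            # update
    a, P = a + K * (y[t] - a), (1 - K) * P

# Mode of the joint log likelihood over all n states
def neg_joint_loglik(states):
    ll = -0.5 * states[0] ** 2 / kappa
    ll -= 0.5 * np.sum(np.diff(states) ** 2) / sigma_eta ** 2
    ll -= 0.5 * np.sum((y - states) ** 2) / sigma_eps ** 2
    return -ll

mode = minimize(neg_joint_loglik, y.copy(), method="BFGS").x
print(a, mode[-1])   # the two agree up to optimiser tolerance
```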
This exact solution corresponding to Kalman’s filter suggests that using quadratic approximations
— which is different in principle from using discretisation methods — may be accurate in a broader
context. In the early days of dynamic programming, Bellman considered polynomial approximations
of value functions with the specific aim of avoiding 'dimensionality difficulties' associated with discretisation schemes (Bellman and Dreyfus, 1959, p. 247). Value functions in the context of filtering
applications are special in that they possess a global maximum for each time step; the argmax deter-
mines the filtered state. For this reason, we consider a particularly simple polynomial: the multivariate
quadratic function. While generally inexact, quadratic functions can accurately approximate smooth value functions locally around their maxima.
The joint log likelihood ℓ(a_{1:t}, y_{1:t}) is, a priori, a random function of the observations y_{1:t}, even though the data are considered known and fixed ex post. Next, the mode is defined as the sequence of states that maximises the joint log likelihood (2):
ã_{1:t|t} := (ã_{1|t}, ã_{2|t}, . . . , ã_{t|t}) = arg max_{a_{1:t} ∈ ℝ^{m×t}} ℓ(a_{1:t}, y_{1:t}),   t ≤ n.   (3)
Elements of the mode at time step t are denoted by ã_{i|t} for i ≤ t, where i denotes the state that is estimated, while t denotes the information set used. The entire solution is denoted ã_{1:t|t} ∈ ℝ^{m×t}, which is a collection of t vectors. Iterative solution methods were proposed in Shephard and Pitt (1997), Durbin (1997) and Durbin and Koopman (2000), who use Newton's method, and So (2003), who uses quadratic hill climbing.
Computing times for solving optimisation problem (3) typically grow as O(t³). This is unfortunate because, for the purposes of online filtering, we are predominantly interested in the last column of ã_{1:t|t}, i.e. ã_{t|t}, but for all times t ≤ n. To obtain the desired sequence of real-time filtered states {ã_{t|t}}_{t=1,...,n}, we must compute the mode ã_{1:t|t} for all time steps, and then, for each time step, extract the right-most column as the filtered state ã_{t|t}:
ã_{t₀|t₀}, . . . , ã_{t|t}, . . . , ã_{n|n}.   (4)
Estimator (4) is generally infeasible as it is computed using the true (hyper)parameter ψ. Further, estimator (4) is computationally intensive as each filtered state ã_{t|t} requires the (increasingly large) optimisation problem (3) to be solved, involving m × t first-order conditions. This observation raises the question whether it is possible to proceed in real time without computing a large and increasing
section, this is indeed possible when we make use of Bellman’s dynamic programming principle.
That is, in transitioning from time t − 1 to time t, two terms are added: one representing the state-transition density, ℓ(a_t | a_{t−1}); the other representing the observation density, ℓ(y_t | a_t). Next, we define the value function by maximising ℓ(a_{1:t}, y_{1:t}) with respect to all states apart from the most recent state a_t ∈ ℝ^m:

V_t(a_t) := max_{a_{1:t−1} ∈ ℝ^{m×(t−1)}} ℓ(a_{1:t}, y_{1:t}),   a_t ∈ ℝ^m.   (6)
The value function Vt (at ) depends on the data y1:t , as indicated by the subscript t, which are considered
fixed, and on its argument at , which is a continuous variable in Rm . Recursion (5) implies that the
value function (6) satisfies Bellman’s equation, as stated below.
Proposition 1 (Bellman's equation) Suppose ã_{t|t} exists for all t ≥ t₀, where 1 ≤ t₀ ≤ n. The value function (6) satisfies Bellman's equation:

V_t(a_t) = ℓ(y_t | a_t) + max_{a_{t−1} ∈ ℝ^m} { ℓ(a_t | a_{t−1}) + V_{t−1}(a_{t−1}) },   a_t ∈ ℝ^m.   (7)
Bellman's equation (7) recursively relates the value function V_t(a_t) to the (previous) value function V_{t−1}(a_{t−1}) by adding one term reflecting the state transition, ℓ(a_t | a_{t−1}); one term reflecting the observation density, ℓ(y_t | a_t); and a subsequent maximisation over a single state variable, a_{t−1} ∈ ℝ^m. The value function V_t(a_t) still depends on the data y_{1:t−1}, but only indirectly, i.e. through the previous value function V_{t−1}(a_{t−1}). Apart from assuming the existence of the mode, no (additional) assumptions are required.
The aim is to approximate, in function space and for all time steps, the solution to Bellman's equation (7). We write the model with a linear Gaussian state equation as in Koopman et al. (2015, 2016):

y_t ∼ p(y_t | α_t),   α_t = c + T α_{t−1} + η_t,   η_t ∼ i.i.d. N(0, Q),   (9)
where t = 1, . . . , n. The system vector c and system matrix T are assumed to be of appropriate
dimensions. The covariance matrix Q is assumed symmetric and positive semi-definite. The observation density p(y_t | α_t) may be non-Gaussian and involve nonlinearity. We may employ exponential
link functions to ensure that variables such as intensity or volatility remain positive. In our notation,
the link function is left implicit; the observation density p(yt |αt ) may contain any desired (nonlinear)
dependence on the state αt .
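As a concrete example, consider the Poisson count model that appears in Table 3 below, with link λ_t = exp(α_t). A minimal sketch (the function names are ours) of the corresponding observation log density, score and realised information, all taken with respect to the state α_t:

```python
import numpy as np
from scipy.special import gammaln

# Poisson count model with exponential link: lambda_t = exp(alpha_t)

def loglik(y, alpha):
    """Observation log density l(y_t | alpha_t)."""
    lam = np.exp(alpha)
    return y * alpha - lam - gammaln(y + 1.0)

def score(y, alpha):
    """First derivative d l(y_t | alpha_t) / d alpha_t."""
    return y - np.exp(alpha)

def realised_information(y, alpha):
    """Negative second derivative -d^2 l(y_t | alpha_t) / d alpha_t^2;
    for the Poisson model it is nonnegative, so Newton's method below
    applies directly."""
    return np.exp(alpha)
```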
The solution a*_{t−1} depends linearly on a_t; in dynamic programming terms, the 'policy function' is linear in the state. Substituting argmax (13) back into equation (12), which was to be optimised, and performing some algebra (for details, see Appendix B), equation (12) becomes
V_t(a_t) = ℓ(y_t | a_t) − ½ (a_t − a_{t|t−1})′ I_{t|t−1} (a_t − a_{t|t−1}) + constants,   a_t ∈ ℝ^m,   (14)
where a_t remains as the only variable on the right-hand side, and we have defined the predicted state a_{t|t−1} and predicted precision matrix I_{t|t−1} as follows:

a_{t|t−1} = c + T a_{t−1|t−1},   (15)
I_{t|t−1} = Q⁻¹ − Q⁻¹ T (I_{t−1|t−1} + T′ Q⁻¹ T)⁻¹ T′ Q⁻¹.   (16)

The filtered state a_{t|t} and filtered precision matrix I_{t|t} are defined as

a_{t|t} = arg max_{a ∈ ℝ^m} V_t(a),   I_{t|t} = − d²V_t(a)/(da da′) |_{a = a_{t|t}}.   (17)
The argmax determines our filtered state estimate, while the computation of second derivatives at the
peak facilitates the proposed recursive approach, where each value function is approximated quadratically around its peak. The expression for I_{t|t} is 'local' in the sense that it utilises second derivatives at a single point; global fitting methods could also be used.
For the value function V_t(a) in equation (14) to possess a unique global optimum, it is sufficient that the matrix of negative second derivatives, i.e. I_{t|t−1} − d²ℓ(y_t|a)/(da da′), is positive definite for all a ∈ ℝ^m, where −d²ℓ(y_t|a)/(da da′) is the realised information. Even if the existence of a global maximum is guaranteed, however, the potentially complicated functional form of ℓ(y_t|a_t) implies that the maximisation over a_t in equation (14) cannot, in general, be performed analytically. Nonetheless it is straightforward to write down analytically the steps of e.g. Newton's optimisation method (e.g. Nocedal and Wright, 2006). Indeed, a plain-vanilla application of Newton's method to maximising V_t(a) with respect to its argument reads
a^{(i+1)}_{t|t} = a^{(i)}_{t|t} + [ − d²V_t(a)/(da da′) ]⁻¹ dV_t(a)/da |_{a = a^{(i)}_{t|t}},   (18)
where elements of the resulting sequence are denoted as a^{(i)}_{t|t} for i = 0, 1, . . .. As indicated in Table 2 under the step 'Start', Newton's method (18) requires an initialisation to be specified, e.g. a^{(0)}_{t|t} = a_{t|t−1}, such that the starting point for the optimisation, at every time step, is equal to the prediction made at the previous time step.
Recalling value function (14), the gradient and negative Hessian can be approximated in closed form as follows:

dV_t(a)/da = dℓ(y_t|a)/da − I_{t|t−1} (a − a_{t|t−1}),   a ∈ ℝ^m,   (19)
− d²V_t(a)/(da da′) = I_{t|t−1} − d²ℓ(y_t|a)/(da da′),   a ∈ ℝ^m.   (20)
As the observation y_t is fixed, the score in equation (19) and the realised information in equation (20) are viewed as functions of the state variable a ∈ ℝ^m. Simply put, Newton's version of the Bellman filter iterates step (18), using the closed-form derivatives (19) and (20), starting from the prediction a_{t|t−1}.
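For illustration, a minimal univariate sketch of one recursion of this filter (our own illustrative code, not the paper's Table 2; it assumes a scalar state, linear Gaussian state dynamics and the observation-density building blocks sketched above):

```python
def bellman_filter_step(y, a_prev, I_prev, c, T, Q,
                        score, realised_information,
                        tol=1e-4, i_max=40):
    """One Bellman-filter recursion: predict, then Newton update.

    score(y, a) and realised_information(y, a) refer to l(y_t | a_t);
    the realised information is assumed to keep the Hessian positive,
    as it does for e.g. the Poisson model above.
    """
    # Prediction step, cf. equations (15)-(16) in the scalar case
    a_pred = c + T * a_prev
    I_pred = 1.0 / (Q + T ** 2 / I_prev)
    # Newton optimisation of V_t(a), cf. equations (18)-(20),
    # started at the prediction
    a = a_pred
    for _ in range(i_max):
        grad = score(y, a) - I_pred * (a - a_pred)        # equation (19)
        neg_hess = I_pred + realised_information(y, a)    # equation (20)
        a_new = a + grad / neg_hess                       # equation (18)
        if abs(a_new - a) < tol:
            a = a_new
            break
        a = a_new
    I_filt = I_pred + realised_information(y, a)          # equations (17)/(20)
    return a, I_filt, a_pred, I_pred
```

Running this step once per observation yields the filtered states {a_{t|t}} and precisions {I_{t|t}} at a constant cost per time step.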
5 Estimation method
This section considers the estimation problem, as distinct from the filtering problem, in that we aim
to estimate both the time-varying states α1:t and the constant (hyper)parameter ψ. As before, we
take model (9) with linear Gaussian state dynamics, and we continue to assume the existence of the
mode. To estimate the constant parameter ψ, computationally intensive methods have been considered
by many authors (see section 1). We deviate from this strand of literature by decomposing the log
likelihood in terms of the ‘fit’ generated by the Bellman filter, penalised by the realised Kullback-Leibler
(KL, see Kullback and Leibler, 1951) divergence between filtered and predicted states. Intuitively, we
wish to maximise the congruence of the Bellman-filtered states and the data, while also minimising
the distance between filtered and predicted states to prevent over-fitting. The proposed decomposition
has the advantage that all terms can be evaluated or approximated using the output of the Bellman
filter in Table 2; no sampling techniques or numerical integration methods are required. The resulting
estimation method is as straightforward and computationally inexpensive as ordinary estimation of
the Kalman filter using maximum likelihood.
To introduce the proposed decomposition, we focus on the log-likelihood contribution of a single
observation, ℓ(y_t|F_{t−1}) := log p(y_t|F_{t−1}). The next computation is straightforward and holds for all y_t ∈ ℝ^l and all α_t ∈ ℝ^m:

ℓ(y_t|F_{t−1}) = ℓ(y_t, α_t|F_{t−1}) − ℓ(α_t|y_t, F_{t−1}) = ℓ(y_t|α_t) + ℓ(α_t|F_{t−1}) − ℓ(α_t|F_t).   (21)
While the above decomposition is valid for any αt ∈ Rm , the resulting expression is not a computable
quantity, as αt remains unknown. It is practical to evaluate the expression at the Bellman-filtered
state estimate at|t , such that, by swapping the order of the last two terms, we obtain
ℓ(y_t|F_{t−1}) = ℓ(y_t|α_t) |_{α_t = a_{t|t}} − { ℓ(α_t|F_t) − ℓ(α_t|F_{t−1}) } |_{α_t = a_{t|t}}.   (22)
The first term on the right-hand side, ℓ(y_t|α_t) evaluated at α_t = a_{t|t}, quantifies the congruence (or 'fit') between the Bellman-filtered state a_{t|t} and the observation y_t, which we wish to maximise.
We simultaneously want to minimise the realised KL divergence between predictions and updates,
as determined by the difference between the two terms in curly brackets. The trade-off between
maximising the first term and minimising the second, which appears with a minus sign, gives rise to a
meaningful optimisation problem.
While the decomposition above is itself exact, we do not generally have an exact expression for
the KL divergence. To ensure that the log-likelihood contribution (22) is computable, we now turn
to approximating the KL divergence term. In deriving the Bellman filter for model (9), we presumed
that the researcher’s knowledge, as measured in log-likelihood space for each time step, could be
approximated by a multivariate quadratic function. Extending this line of reasoning, we consider the
following approximations of the two terms that compose the realised KL divergence:
ℓ(α_t|F_t) ≈ ½ log det{ I_{t|t} / (2π) } − ½ (α_t − a_{t|t})′ I_{t|t} (α_t − a_{t|t}),   (23)
ℓ(α_t|F_{t−1}) ≈ ½ log det{ I_{t|t−1} / (2π) } − ½ (α_t − a_{t|t−1})′ I_{t|t−1} (α_t − a_{t|t−1}).   (24)
Here the state α_t is understood as a variable in ℝ^m, while a_{t|t−1}, a_{t|t}, I_{t|t−1} and I_{t|t} are known quantities determined by the Bellman filter in Table 2. If the model is linear and Gaussian, then the approximations (23) and (24) hold exactly.
Summing the approximate log-likelihood contributions (22) over time suggests the estimator

ψ̂ = arg max_ψ  Σ_{t=t₀+1}^{n} { ℓ(y_t|α_t) − ℓ(α_t|F_t) + ℓ(α_t|F_{t−1}) } |_{α_t = a_{t|t}},   (25)

where all terms on the right-hand side implicitly or explicitly depend on the (hyper)parameter ψ.
Time t0 ≥ 0 is large enough to ensure the mode exists at time t0 . If model (9) is stationary and α0
is drawn from the unconditional distribution, as in our simulation studies in section 6, then t0 = 0.
The case t0 > 0 is analogous to that for the Kalman filter when the first t0 observations are used to
construct a ‘proper’ prior (see Harvey, 1990, p. 123). The first term inside curly brackets, involving the
observation density, is given by model (9). The remaining terms can be computed based on the output
of the Bellman filter in Table 2. Expression (25) can be viewed as an alternative to the prediction-error
decomposition for linear Gaussian state-space models (see e.g. Harvey, 1990, p. 126), the advantage
being that estimator (25) is applicable more generally.
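As an illustration, a univariate sketch (the function names are ours) of a single log-likelihood contribution, combining (22) with the approximations (23) and (24); `loglik` refers to the observation log density ℓ(y_t|α_t):

```python
import numpy as np

def loglik_contribution(y, a_filt, I_filt, a_pred, I_pred, loglik):
    """Approximate l(y_t | F_{t-1}) via decomposition (22) with the
    Gaussian approximations (23)-(24), evaluated at alpha_t = a_{t|t}."""
    fit = loglik(y, a_filt)                         # l(y_t | alpha_t) at a_{t|t}
    l_update = 0.5 * np.log(I_filt / (2 * np.pi))   # (23): quadratic term vanishes
    l_predict = (0.5 * np.log(I_pred / (2 * np.pi))
                 - 0.5 * I_pred * (a_filt - a_pred) ** 2)   # (24)
    return fit - l_update + l_predict
```

Summing these contributions over t and maximising over ψ with a generic numerical optimiser then implements estimator (25).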
Corollary 2 Take the linear Gaussian state-space model specified in Corollary 1. Assume that the
Kalman-filtered covariance matrices {Pt|t } are positive definite. Estimator (25) then equals the MLE.
Estimator (25) is only slightly more computationally demanding than standard maximum likelihood
estimation of the Kalman filter. The sole source of additional computational complexity derives from
the fact that the Bellman filter in Table 2 may perform several optimisation steps for each time step,
while the Kalman filter performs only one. However, because each optimisation step is straightforward
and few steps are typically required, the additional computational burden is negligible. Models of
type (9) can now be approximately estimated with the same ease as a linear Gaussian state-space
model.
6 Simulation studies
6.1 Design
We conduct a Monte Carlo study to investigate the performance of the Bellman filter for a range of
data-generating processes (DGPs). We consider 10 DGPs with linear Gaussian state dynamics (9) and
observation densities in Table 3, which also includes link functions, scores and information quantities.
To avoid selection bias on our part, Table 3 has been adapted with minor modifications from Koopman
et al. (2016). In taking the DGPs chosen by these authors, we essentially test the performance of the
Bellman filter on an ‘exogenous’ set of models. While the numerically accelerated importance sampling
(NAIS) method in Koopman et al. (2015, 2016) has been shown to produce highly accurate results,
the Bellman filter turns out to be equally (if not more) accurate at a fraction of the computational
cost.
We add one DGP to the nine considered in Koopman et al. (2016): a local-level model with heavy-tailed observation noise. While a local-level model with Gaussian observation noise would be solved
exactly by the Kalman filter, the latter does not adjust for heavy-tailed observation noise. Although the
Kalman filter remains the best linear unbiased estimator, the results below show that the (nonlinear)
Bellman filter fares better.
For each DGP in Table 3, we simulate 1,000 time series of length 5,000, where constant (hyper)parameters for the first nine DGPs are taken from Koopman et al. (2016, Table 3).⁴ We use the following four methods:
1. For the generally infeasible mode estimator (4), we use the true parameters and a moving window of 250 observations, so that 250 first-order conditions are solved for each time step (larger windows result in excessive computational times). For simplicity,⁵ one-step-ahead predictions of quantities of interest are obtained by applying link functions, e.g. λ̃_{t|t−1} = exp(c + T ã_{t−1|t−1}).
2. For the Bellman filter, the algorithm in Table 2 is initialised using the unconditional distribution. Each optimisation procedure takes as its starting point the most recent prediction. The stopping criterion is either |a^{(i)}_{t|t} − a^{(i−1)}_{t|t}| < 0.0001 or i_max = 40 iterations, whichever occurs first (on average, ∼5 iterations are needed). Newton's updating step is used if the realised information in Table 3 is nonnegative; otherwise, a weighted average of Newton's and Fisher's updating steps is used, where the weights are chosen to guarantee I_{t|t} ≥ I_{t|t−1}.⁶ Predictions are made using (a) the true parameters, (b) in-sample estimated parameters, and (c) out-of-sample estimated parameters. Parameter estimation is based on estimator (25) using the first (or last) 2,500 observations for out-of-sample (or in-sample) estimation. Bellman-predicted states a_{t|t−1} are transformed using link functions to obtain e.g. λ_{t|t−1} = exp(a_{t|t−1}).
3. For the numerically accelerated importance sampling (NAIS) method, we follow Koopman et al.
(2016). We deviate in computing, for each time step, not only the weighted mean but also the
weighted median of the (simulated) predictions, where the weights are as in Koopman et al.
(2016). We refer to these methods as NAIS-mean and NAIS-median, respectively.
4. The Kalman filter is used to estimate both stochastic volatility (SV) models and the local-level
model. For both SV models, we follow the common practice of squaring the observations and
taking logarithms to obtain a linear state-space model, albeit with biased and non-Gaussian
observation noise (for details, see Ruiz, 1994 or Harvey et al., 1994). Predicted states can now
be obtained via quasi maximum likelihood estimation (QMLE) of the Kalman filter. For the
local-level model with heavy-tailed observation noise, the Kalman filter is applied directly, i.e.
without adjustments, and estimated by QMLE.
⁴ The state-transition equation has parameters c = 0, T = 0.98, Q = 0.025, except for both dependence models, in which case c = 0.02, T = 0.98, Q = 0.01. In the observation equation, Student's t distributions have ten degrees of freedom, i.e. ν = 10, except for the local-level model, in which case ν = 3. The remaining shape parameters are k = 4 for the negative binomial distribution, k = 1.5 for the Gamma distribution, k = 1.2 for the Weibull distribution, and σ = 0.45 for the local-level model.
⁵ The transformation of predictions using (monotone) link functions is exact only if the (untransformed) predictions are based on the median, not the mode, but for simplicity we ignore this difference.
⁶ For the dependence model with the Gaussian distribution, the weight placed on Fisher's updating step should weakly exceed 1/2. For the Student's t distribution, this generalises to 1/2 × (ν + 4)/(ν + 3). For the local-level model with heavy-tailed noise, the weight given to Fisher's updating step should weakly exceed (1 + ν/3)/(1 + 3ν).
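To fix ideas, the following sketch (illustrative, not the study's actual code) simulates one series from the Poisson DGP with the parameters in footnote 4, runs the Bellman filter sketched in section 4 with the true parameters, and computes an out-of-sample MAE of the one-step-ahead predictions of λ_t:

```python
import numpy as np

rng = np.random.default_rng(1)
c, T, Q, n = 0.0, 0.98, 0.025, 5000      # state dynamics used for most DGPs

# Simulate one series from the Poisson DGP
alpha = np.empty(n)
alpha[0] = rng.normal(c / (1 - T), np.sqrt(Q / (1 - T ** 2)))
for t in range(1, n):
    alpha[t] = c + T * alpha[t - 1] + np.sqrt(Q) * rng.standard_normal()
y = rng.poisson(np.exp(alpha))

# Run the Bellman filter with the true parameters and evaluate
a, I = c / (1 - T), (1 - T ** 2) / Q     # unconditional mean and precision
mae, half = 0.0, n // 2
for t in range(n):
    a, I, a_pred, I_pred = bellman_filter_step(y[t], a, I, c, T, Q,
                                               score, realised_information)
    if t >= half:                        # evaluate on the second half only
        mae += abs(np.exp(a_pred) - np.exp(alpha[t])) / (n - half)
print(mae)
```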
Table 3: Observation densities, link functions, scores and information quantities for the ten DGPs.

Count (Poisson). Link: λ_t = exp(α_t). Density: λ_t^{y_t} exp(−λ_t)/y_t!. Score: y_t − λ_t. Realised information: λ_t. Expected information: λ_t.

Count (Negative binomial). Link: λ_t = exp(α_t). Density: [Γ(y_t + k)/{Γ(k) y_t!}] {k/(k + λ_t)}^k {λ_t/(k + λ_t)}^{y_t}. Score: k(y_t − λ_t)/(k + λ_t). Realised information: k λ_t (y_t + k)/(k + λ_t)². Expected information: k λ_t/(k + λ_t).

Intensity (Exponential). Link: λ_t = exp(α_t). Density: λ_t exp(−λ_t y_t). Score: 1 − λ_t y_t. Realised information: λ_t y_t. Expected information: 1.

Duration (Gamma). Link: β_t = exp(α_t). Density: y_t^{k−1} exp(−y_t/β_t)/{Γ(k) β_t^k}. Score: y_t/β_t − k. Realised information: y_t/β_t. Expected information: k.

Duration (Weibull). Link: β_t = exp(α_t). Density: (k/β_t)(y_t/β_t)^{k−1} exp{−(y_t/β_t)^k}. Score: k(y_t/β_t)^k − k. Realised information: k²(y_t/β_t)^k. Expected information: k².

Volatility (Gaussian). Link: σ_t² = exp(α_t). Density: {2πσ_t²}^{−1/2} exp{−y_t²/(2σ_t²)}. Score: y_t²/(2σ_t²) − 1/2. Realised information: y_t²/(2σ_t²). Expected information: 1/2.

Volatility (Student's t). Link: σ_t² = exp(α_t). Density: Γ{(ν+1)/2} [1 + y_t²/{(ν−2)σ_t²}]^{−(ν+1)/2} / [√{(ν−2)π} Γ(ν/2) σ_t], where ω_t := (ν+1)/(ν−2+y_t²/σ_t²). Score: ω_t y_t²/(2σ_t²) − 1/2. Realised information: {(ν−2)/(ν+1)} ω_t² y_t²/(2σ_t²). Expected information: ν/(2ν+6).

Dependence (Gaussian). Link: ρ_t = {1−exp(−α_t)}/{1+exp(−α_t)}. Density: exp{−(y_{1t}²+y_{2t}²−2ρ_t y_{1t} y_{2t})/(2(1−ρ_t²))} / {2π√(1−ρ_t²)}, where z_{1t} := y_{1t} − ρ_t y_{2t} and z_{2t} := y_{2t} − ρ_t y_{1t}. Score: ρ_t/2 + z_{1t} z_{2t}/{2(1−ρ_t²)}. Realised information: (z_{1t}² + z_{2t}²)/{4(1−ρ_t²)} − (1−ρ_t²)/4. Expected information: (1+ρ_t²)/4.

Dependence (Student's t). Link: ρ_t = {1−exp(−α_t)}/{1+exp(−α_t)}. Density: ν [1 + (y_{1t}²+y_{2t}²−2ρ_t y_{1t} y_{2t})/{(ν−2)(1−ρ_t²)}]^{−(ν+2)/2} / {2π(ν−2)√(1−ρ_t²)}, where z_{1t} := y_{1t} − ρ_t y_{2t}, z_{2t} := y_{2t} − ρ_t y_{1t} and ω_t := (ν+2)/{ν−2+(y_{1t}²+y_{2t}²−2ρ_t y_{1t} y_{2t})/(1−ρ_t²)}. Score: ρ_t/2 + ω_t z_{1t} z_{2t}/{2(1−ρ_t²)}. Realised information: ω_t (z_{1t}² + z_{2t}²)/{4(1−ρ_t²)} − (1−ρ_t²)/4 − ω_t² z_{1t}² z_{2t}²/{2(ν+2)(1−ρ_t²)²}. Expected information: {2+ν(1+ρ_t²)}/{4(ν+4)}.

Local level (Student's t). Link: μ_t = α_t. Density: Γ{(ν+1)/2} [1 + (y_t−μ_t)²/{(ν−2)σ²}]^{−(ν+1)/2} / [√{(ν−2)π} Γ(ν/2) σ], where e_t := (y_t−μ_t)/σ. Score: (ν+1) e_t/{σ(ν−2+e_t²)}. Realised information: (ν+1)(ν−2−e_t²)/{σ²(ν−2+e_t²)²}. Expected information: ν(ν+1)/{σ²(ν−2)(ν+3)}.
Note: The table contains ten data-generating processes (DGPs), the first nine of which are adapted with minor modifications from Koopman et al. (2016). For each model, the DGP is given by the linear Gaussian state equation (9) in combination with the observation density and link function indicated in the table. The table further displays scores, realised information quantities and expected information quantities. The realised information quantities are nonnegative, except for the bottom three models. We deviate from Koopman et al. (2016) by computing scores and information quantities with respect to the state α_t, which is subject to linear Gaussian dynamics, rather than with respect to its transformation given by λ_t, β_t, σ_t² or ρ_t.
Table 4: Mean absolute errors (MAEs) of one-step-ahead predictions in simulation studies.
6.2 Results
Table 4 contains MAEs of one-step-ahead predictions (RMSEs are shown in Appendix E). When
reporting RMSEs and MAEs, we display the losses obtained from the NAIS-mean and NAIS-median
methods, respectively, which are optimal for these loss functions (the Bellman filter, being based on
the mode, is technically suboptimal for both loss functions).
We focus on three findings. First, comparison of the performance of the Bellman filter against that
of the (generally infeasible) mode estimator (4) reveals that the MAE of the Bellman filter using true
parameters is at most ∼0.3% higher for all DGPs considered. When the parameters are estimated in an
in-sample setting, the Bellman filter slightly outperforms the infeasible estimator. In an out-of-sample
setting, the MAE of the Bellman filter remains within ∼1.7% of that of the infeasible mode estimator
for nine out of ten DGPs, while exceeding the MAE of the mode estimator by at most ∼2% (for the
dependence model with the Student’s t distribution). Only a small fraction of the additional MAE is
caused by approximate filtering and estimation. Rather, most of the additional MAE is caused by the
design choice that the parameter estimation uses only the first half of the data, whereas the evaluation
of MAEs pertains to the second half.
Second, although the Kalman and Bellman filters are usually in close agreement, the robustness of
the Bellman filter means that it compares favourably with the Kalman filter for the SV and local-level
models. Focusing on the local-level model, the performance of the Bellman filter is within ∼0.3%
of the infeasible estimator, even at parameters estimated out-of-sample. In contrast, the Kalman
filter, confronted with heavy-tailed observation noise, lags ∼8% behind the infeasible estimator. This
difference is not due to the choice of loss function; the relative performance of the Kalman filter
deteriorates further if we report RMSEs (see Appendix E). Moreover, the maximum absolute error
in the out-of-sample period, averaged across 1,000 samples, is 1.74 for the Kalman filter; considerably
higher than that for the Bellman filter (0.97). This shows that the Bellman filter is more robust in the face of heavy-tailed observation noise, while having only a single additional parameter to estimate (the degrees of freedom of the observation noise, ν). While several robust filters have been constructed ad hoc (e.g. Harvey and Luati, 2014 and Calvet et al., 2015), in our case robustness follows automatically from Bellman's equation (14) along with the fact that the location score for a Student's t distribution is bounded.

Figure 1: Mean absolute error (MAE) of the Bellman filter (horizontal axis) against the NAIS-median method (vertical axis), by DGP: (a) Count (Poisson); (b) Count (Negative Binomial); (c) Intensity (Exponential); (d) Duration (Gamma); (e) Duration (Weibull); (f) Volatility (Gaussian); (g) Volatility (Student's t); (h) Dependence (Gaussian); (i) Dependence (Student's t). [Scatter plots not reproduced here.]
Note: Each plot contains one dot for each of the 1,000 simulations. The coordinates of the dots are determined by the mean absolute error (MAE) of the Bellman filter, horizontally, and the numerically accelerated importance sampling (NAIS) median method in Koopman et al. (2016), vertically. Each dot involves an average over 2,500 out-of-sample predictions by both methods. Forty-five degree lines are also shown. For the simulation setting, see the note to Table 4.
Third, the Bellman filter performs approximately on par with the NAIS-median method, despite
the latter being more computationally intensive and, in fact, theoretically optimal for the absolute loss
function. Table 4 shows that the NAIS-median method outperforms the Bellman filter by a maximum
of 0.17% in terms of MAE (for the dependence model with a Gaussian distribution). For six out of nine
DGPs, the roles are reversed: the Bellman filter marginally outperforms the NAIS-median method.
This may be due to the fact that the NAIS method also contains an approximation; for each time step,
only a finite number of predictions are simulated on which the median (or mean) is based.
Zooming in on the performance of the Bellman filter and the NAIS-median methods for individual
samples, Figure 1 demonstrates that both methods perform almost identically for all 1,000 samples for
each DGP; in each sub-figure, the dots are highly concentrated around the 45-degree lines. Clearly, the
sample drawn is far more influential in determining the MAE than the choice between both filtering
methods. Digging down even deeper, to the individual predictions, we use the 2.5 million predictions
made by the Bellman filter for each DGP as an ‘exogenous’ variable to ‘explain’ the corresponding
2.5 million predictions made by the NAIS-median method. The resulting coefficients of determination
(essentially R2 values without fitting a model) exceed 99% across all DGPs, meaning that the individual
predictions, too, are near identical.
Table 5 shows that the Bellman filter and the NAIS-median method differ in their computation
times. In solving the estimation problem, the Bellman filter is faster by a factor 1.1 (for the Poisson
distribution) to a factor ∼6 (for the Weibull distribution). In solving the filtering problem, the Bellman
filter is faster by a factor between ∼400 (for the Poisson and exponential distributions) and ∼1,000
(for the Weibull distribution).
Finally, Appendix F demonstrates that predicted confidence intervals implied by the Bellman filter, i.e. with endpoints given by a_{t|t−1} ± 2/√I_{t|t−1} for each time step, tend to be fairly accurate, containing the true states 93% to 96% of the time across all DGPs.
References
Anderson, B. D. and Moore, J. B. (2012) Optimal Filtering. Courier Corporation.
Andrieu, C., Doucet, A. and Holenstein, R. (2010) Particle Markov chain Monte Carlo methods. Journal of the
Royal Statistical Society: Series B (Statistical Methodology), 72, 269–342.
Barra, I., Hoogerheide, L., Koopman, S. J. and Lucas, A. (2017) Joint Bayesian analysis of parameters and
states in nonlinear non-Gaussian state space models. Journal of Applied Econometrics, 32, 1003–1026.
Bauwens, L. and Hautsch, N. (2006) Stochastic conditional intensity processes. Journal of Financial Econometrics, 4, 450–493.
Bauwens, L. and Veredas, D. (2004) The stochastic conditional duration model: A latent variable model for the
analysis of financial durations. Journal of Econometrics, 119, 381–412.
Bellman, R. and Dreyfus, S. (1959) Functional approximations and dynamic programming. Mathematical Tables
and Other Aids to Computation, 247–251.
Bellman, R. E. (1957) Dynamic Programming. Courier Dover Publications.
Bunch, P. and Godsill, S. (2016) Approximations of the optimal importance density using Gaussian particle flow
importance sampling. Journal of the American Statistical Association, 111, 748–762.
Calvet, L. E., Czellar, V. and Ronchetti, E. (2015) Robust filtering. Journal of the American Statistical Association, 110, 1591–1606.
De Valpine, P. (2004) Monte Carlo state-space likelihoods by weighted posterior kernel density estimation.
Journal of the American Statistical Association, 99, 523–536.
Durbin, J. (1997) Optimal estimating equations for state vectors in non-Gaussian and nonlinear state space time
series models. Lecture Notes-Monograph Series, 285–291.
Durbin, J. and Koopman, S. J. (1997) Monte Carlo maximum likelihood estimation for non-Gaussian state space
models. Biometrika, 84, 669–684.
— (2000) Time series analysis of non-Gaussian observations based on state space models from both classical and
Bayesian perspectives. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62, 3–56.
A Proof of Proposition 1
Standard dynamic-programming arguments imply
where ã_{t|t} was defined in equation (3).
In the last line, we have used the fact that all three terms with curly brackets equal Q⁻¹T[I_{t|t} + T′Q⁻¹T]⁻¹T′Q⁻¹, such that two terms with curly brackets and opposite signs cancel, leaving only one term with a negative sign, which confirms prediction step (16).
which is exactly the Kalman filter level update written in information form. To see the equivalence with the covariance form of the Kalman filter, suppose that P_{t|t−1} := I_{t|t−1}⁻¹ exists. Then, using the Woodbury matrix-inversion formula (see e.g. Henderson and Searle, 1981, eq. 1), the expression above is equivalent to
which is exactly the Kalman filter updating step (see e.g. Harvey, 1990, p. 106). For the information matrix update we
have
I_{t|t} = I_{t|t−1} − d²ℓ(y_t|a)/(da da′) |_{a = a_{t|t}} = I_{t|t−1} + Z′H⁻¹Z.   (C.5)
If the inverses P_{t|t−1} := I_{t|t−1}⁻¹ and P_{t|t} := I_{t|t}⁻¹ exist, then, again using Henderson and Searle (1981, eq. 1), we find

P_{t|t} = I_{t|t}⁻¹ = (I_{t|t−1} + Z′H⁻¹Z)⁻¹ = P_{t|t−1} − P_{t|t−1}Z′(ZP_{t|t−1}Z′ + H)⁻¹ZP_{t|t−1},   (C.6)
which is exactly the Kalman filter covariance matrix updating step (again, see Harvey, 1990, p. 106).
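The equivalence of the information-form and covariance-form updates is easy to verify numerically; a small sketch (the dimensions and matrices are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)

m, l = 3, 2
Z = rng.standard_normal((l, m))
H = np.eye(l) * 0.5                    # observation noise covariance
A = rng.standard_normal((m, m))
P_pred = A @ A.T + np.eye(m)           # P_{t|t-1}, positive definite
I_pred = np.linalg.inv(P_pred)         # I_{t|t-1}

# Information-form update (C.5), inverted directly ...
P_info = np.linalg.inv(I_pred + Z.T @ np.linalg.inv(H) @ Z)
# ... equals the covariance-form Kalman update (C.6) via Woodbury
K = P_pred @ Z.T @ np.linalg.inv(Z @ P_pred @ Z.T + H)
P_cov = P_pred - K @ Z @ P_pred

print(np.allclose(P_info, P_cov))      # True
```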
The score and marginal information are similar to those in Appendix C, as long as Z there is replaced by the Jacobian of the transformation from α_t to Z_t, i.e. dZ(a_t)/da_t′. Hence

dℓ(y_t|a_t)/da_t = (dZ′/da_t) H⁻¹ (y_t − d − Z(a_t)),   (D.2)
d²ℓ(y_t|a_t)/(da_t da_t′) = − (dZ′/da_t) H⁻¹ (dZ/da_t′) + second-order derivatives.   (D.3)
The iterated extended Kalman filter (IEKF) is obtained from the Bellman filter by choosing Newton’s method and by
making one further simplifying approximation: namely that all second-order derivatives of elements of Zt with respect to
the elements of αt are zero. It is not obvious under what circumstances this approximation is justified, but here we are
interested only in showing that the IEKF is a special case of the Bellman filter. Higher-order IEKFs may be obtained by
retaining the second-order derivatives. If the observation noise εt is heavy tailed, however, the Bellman filter in Table 2
suggests a ‘robustified’ version of the Kalman filter and its extensions, in which case the tail behaviour of p(yt |at ) is
accounted for in the optimisation step by using the score d`(yt |at )/dat .
Table E.1: Root mean squared errors (RMSEs) of one-step-ahead predictions in simulation studies.