10 Optimal Estimation
1 State estimation
Knowing the current state of the environment is crucial for artificial agents
to reliably perform real-time decision-making. In state estimation, we
aim to infer the latent state of the system at the current point in time by
continuously combining measurements from different sensor sources (multi-
modal perception).
We first examine discrete-time dynamical systems from a proba-
bilistic perspective. Figure 1 visualizes a partially observable Markov
decision process (POMDP) in the form of a graphical model.
Observations derived from the state could be camera images or Lidar measurements of the
environment that depend on the robot position and movement.
The measurements and control inputs are known, but the hidden state
history x0:t is unknown. We assume that the system evolves in a stochastic
manner and that observations are stochastic. Therefore, we model the state
Xt and observations Zt as random variables that can take on xt and zt as
possible values. To simplify notation, we write p(Xt = xt ) as p(xt ) and
similarly for Zt .
The goal of a state estimator is to compute or approximate the posterior
probability distribution of the state xt given the available data (histories of
measurements and control inputs) and known models for state transitions
and observations. Specifically, we want to know the posterior distribution of
the state p(xt |z1:t , u1:t ), which assigns a probability to every possible value
that the state can take on given the sequence of measurements and control
inputs. The posterior is also called the belief about the state’s value at time
t, represented by bel(t). The probabilistic formulation does not give an exact
value for the state, but enables us to quantify our uncertainty about what
the state may be.
By representing a dynamical system using the graphical model in Figure 1,
we assume that the state is complete. This assumption leads to two key
properties. First, we assume that the system is Markovian, i.e. the current
state xt only depends on the previous state xt−1 and the control input ut,
as opposed to the entire history. This can be expressed as

$$
p(x_t \mid x_{0:t-1}, z_{1:t-1}, u_{1:t}) = p(x_t \mid x_{t-1}, u_t) \tag{1}
$$

Second, we assume that measurements are conditionally independent given the
state, i.e. the current measurement zt only depends on the current state xt:

$$
p(z_t \mid x_{0:t}, z_{1:t-1}, u_{1:t}) = p(z_t \mid x_t) \tag{2}
$$
2 Bayes filter
A recursive filter continuously ingests new measurements to estimate the
state posterior, as illustrated in Figure 2. At each time step t, we compute
our new posterior p(xt | z1:t, u1:t) using only our old posterior
p(xt−1 | z1:t−1, u1:t−1), the new control input ut, and the new measurement zt.
Thus, the complexity of a recursive filter is constant with respect to time;
it does not depend on the size of the history, which makes it suitable for
real-time inference.
2.2 Derivation
We now derive the Bayes filter, the simplest recursive filter. Our goal is
to derive a recursive expression for the state posterior that depends only
on the previous posterior, the current measurement zt, the current control
input ut, and the transition and measurement models. The models
p(xt | xt−1, ut) and p(zt | xt) are assumed to be known and tractable to
evaluate. Bayes' rule and the conditional independence of measurements give us:
$$
p(x_t \mid z_{1:t}, u_{1:t}) = \frac{p(z_t \mid x_t, z_{1:t-1}, u_{1:t})\, p(x_t \mid z_{1:t-1}, u_{1:t})}{p(z_t \mid z_{1:t-1}, u_{1:t})} \tag{3}
$$

$$
= \frac{p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1}, u_{1:t})}{p(z_t \mid z_{1:t-1}, u_{1:t})} \tag{4}
$$
We then simplify the two terms on the right hand side that still depend
on the entire history of observations. Let us first look at the denominator.
Marginalizing over xt and using the conditional independence of measure-
ments gives:
$$
\begin{aligned}
p(z_t \mid z_{1:t-1}, u_{1:t}) &= \int p(z_t, x_t \mid z_{1:t-1}, u_{1:t})\, dx_t \\
&= \int p(z_t \mid x_t, z_{1:t-1}, u_{1:t})\, p(x_t \mid z_{1:t-1}, u_{1:t})\, dx_t \\
&= \int p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1}, u_{1:t})\, dx_t
\end{aligned}
$$
Next, we simplify p(xt | z1:t−1, u1:t), which appears twice in Equation 4
(once in the numerator and once inside the denominator integral).
Marginalizing over xt−1 and applying the Markov assumption gives:
$$
\begin{aligned}
p(x_t \mid z_{1:t-1}, u_{1:t}) &= \int p(x_t, x_{t-1} \mid z_{1:t-1}, u_{1:t})\, dx_{t-1} \\
&= \int p(x_t \mid x_{t-1}, z_{1:t-1}, u_{1:t})\, p(x_{t-1} \mid z_{1:t-1}, u_{1:t})\, dx_{t-1} \\
&= \int p(x_t \mid x_{t-1}, u_t)\, p(x_{t-1} \mid z_{1:t-1}, u_{1:t})\, dx_{t-1}
\end{aligned}
$$
We can now put together the full form of the Bayes filter, illustrated in
Figures 3 and 4.
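Substituting the simplified denominator and prediction back into Equation 4 yields the complete recursion. As a compact sketch assembled from the steps above (writing η for the reciprocal of the denominator, and noting that p(xt−1 | z1:t−1, u1:t) = p(xt−1 | z1:t−1, u1:t−1), since xt−1 cannot depend on the future input ut):

$$
\underbrace{p(x_t \mid z_{1:t}, u_{1:t})}_{bel(t)} = \eta\, p(z_t \mid x_t) \int p(x_t \mid x_{t-1}, u_t)\, \underbrace{p(x_{t-1} \mid z_{1:t-1}, u_{1:t-1})}_{bel(t-1)}\, dx_{t-1}
$$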
Figure 3: Diagram of the Bayes filter.
There are two steps. In the prediction step, we use our transition model
p(xt | xt−1, ut) and previous posterior to predict the state posterior at the
current time step, p(xt | z1:t−1, u1:t). The prediction is represented as $\hat{bel}(t-1)$,
and it is our best guess for the distribution of xt without any new information.
When t = 1, the previous posterior does not exist yet, so we use a prior p(x0 )
that represents our initial belief about how x is distributed at the start t = 0.
In the update step, we refine the predicted belief by incorporating the new
measurement zt . The measurement likelihood, p(zt |xt ), indicates how
likely it would be to observe zt given the predicted xt . Thus, the numerator
in the update step assigns a probability to a specified xt based on both 1) how
well xt explains our measurement zt and 2) how likely our dynamical system
transitions from the previous posterior estimate xt−1 to xt . The denominator
is simply a normalization constant that ensures our updated posterior
distribution integrates (or sums) to 1. Once we predict and update our belief, we can repeat
the process at t + 1 as long as we have a new measurement zt+1 .
The Bayes filter is simple and generalizes to any probability distribution.
In practice, however, it is limited since it is often intractable to compute the
integral for the normalization constant. We need to know the entire prob-
ability distribution (e.g., in a tabular form for every value on the domain),
which is often only possible when dealing with discrete random variables or
when assuming a particular functional form of the posterior distribution, e.g.
Gaussian. Thus, it is more useful to think of the Bayes filter as a “proba-
bilistic template” that we can adapt to different sets of systems with specific
needs and assumptions.
For a discrete state space where Xt can take one of N values, the beliefs
are N-dimensional probability vectors and the prediction step becomes a
matrix-vector product:

$$
\hat{bel}(t-1) = T(u_t)\, bel(t-1) \tag{6}
$$

$$
= \begin{bmatrix} p(X_t = 1 \mid z_{1:t-1}, u_{1:t}) \\ p(X_t = 2 \mid z_{1:t-1}, u_{1:t}) \\ \vdots \\ p(X_t = N \mid z_{1:t-1}, u_{1:t}) \end{bmatrix} \tag{7}
$$

with T(ut) representing the transition probabilities:

$$
T(u_t) = \begin{bmatrix} p(X_t = 1 \mid X_{t-1} = 1, u_t) & \cdots & p(X_t = 1 \mid X_{t-1} = N, u_t) \\ \vdots & \ddots & \vdots \\ p(X_t = N \mid X_{t-1} = 1, u_t) & \cdots & p(X_t = N \mid X_{t-1} = N, u_t) \end{bmatrix} \tag{8}
$$
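As a concrete illustration, here is a minimal sketch of one discrete Bayes filter cycle in Python with NumPy. The particular transition matrix, measurement likelihoods, and prior below are hypothetical stand-ins for a toy three-state system, not values taken from these notes.

```python
import numpy as np

def bayes_filter_step(bel, T_u, likelihood_z):
    """One predict-update cycle of the discrete Bayes filter.

    bel          : (N,) previous posterior bel(t-1)
    T_u          : (N, N) transition matrix T(u_t), as in Equation (8)
    likelihood_z : (N,) measurement likelihood p(z_t | X_t = i) for each state i
    """
    # Predict step: bel_hat = T(u_t) @ bel(t-1), Equations (6)-(7)
    bel_hat = T_u @ bel
    # Update step: weight by the measurement likelihood, then normalize
    unnormalized = likelihood_z * bel_hat
    return unnormalized / unnormalized.sum()  # denominator p(z_t | z_{1:t-1}, u_{1:t})

# Hypothetical toy example with N = 3 states (columns of T_u sum to 1).
prior = np.array([1/3, 1/3, 1/3])            # uniform prior p(x_0)
T_u = np.array([[0.8, 0.1, 0.0],             # T_u[i, j] = p(X_t = i | X_{t-1} = j, u_t)
                [0.2, 0.8, 0.2],
                [0.0, 0.1, 0.8]])
likelihood = np.array([0.1, 0.7, 0.2])       # p(z_t | X_t = i) for the observed z_t
posterior = bayes_filter_step(prior, T_u, likelihood)
print(posterior)                             # a valid distribution: sums to 1
```

Note how tractability hinges on N being finite: the normalization is a plain sum over the N entries rather than an integral.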
3 Kalman filter
Earlier, we saw the state transition model p(xt |xt−1 , ut ) and measurement
model p(zt |xt ) represented as probability distributions. This is the Bayesian
perspective. The transition and measurement models can also be viewed
from a dynamical systems perspective, where Equation (11) represents the
dynamics model and Equation (12) the measurement model:

$$
x_{t+1} = f(x_t, u_t) + w_t \tag{11}
$$

$$
z_t = h(x_t) + v_t \tag{12}
$$
f (·) is the dynamics of the system that encapsulates how the state
evolves over time from xt to xt+1 . Similarly, h(·) is a function that maps the
state xt to the corresponding observation. wt and vt are the process noise
(random disturbances in the system) and measurement noise (within the
sensor), respectively, that arise naturally in dynamical systems. We assume
that the dynamics and measurement model as well as the noise statistics
Dw , Dv (such that w ∼ Dw , v ∼ Dv ) are known. The dynamical system
equations are consistent with the graphical model in Figure 1 and our as-
sumption that the state is complete: xt only depends on xt−1 and ut , and zt
only depends on xt .
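To make the dynamical systems view concrete, the following sketch simulates a hypothetical one-dimensional system of this form. The particular f, h, and Gaussian noise statistics are our illustrative choices, not anything prescribed by the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, u):
    # Hypothetical nonlinear dynamics: how the state evolves from x_t to x_{t+1}
    return 0.9 * x + 0.5 * u + 0.1 * np.sin(x)

def h(x):
    # Hypothetical measurement function mapping the state to an observation
    return x ** 2

Q, R = 0.01, 0.04   # variances of the (assumed Gaussian) process and measurement noise
x = 1.0             # initial state

for t in range(5):
    u = 0.2                                  # a known control input
    w = rng.normal(0.0, np.sqrt(Q))          # process noise w_t ~ D_w
    v = rng.normal(0.0, np.sqrt(R))          # measurement noise v_t ~ D_v
    x = f(x, u) + w                          # Equation (11): state transition
    z = h(x) + v                             # Equation (12): noisy observation
    print(f"t={t+1}: x={x:.3f}, z={z:.3f}")
```

A state estimator only ever sees the sequence of u and z printed here; recovering x from them is exactly the problem the filter solves.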
Figure 5: The multivariate Gaussian shape on the left [2], and the contour
ellipses on the right.
For more background on multivariate Gaussians, please refer to the CS 229 notes [2, 1]
and the Berkeley CS 189 notes [3].
The Kalman filter is the adaptation of the Bayes filter to linear-
Gaussian systems. As the name implies, linear-Gaussian systems make
two key assumptions. First, we assume that all random variables involved
(e.g. state, measurements, posteriors, noise) are multivariate Gaussians. Sec-
ond, we assume that all variables are linear in their parent variables; f (·) and
h(·) are linear (in the form y = Ax). Equations (14) and (15) illustrate the
linear-Gaussian system model.
$$
x_t = A_t x_{t-1} + B_t u_t + w_t \tag{14}
$$

$$
z_t = C_t x_t + v_t \tag{15}
$$
We assume the process noise and measurement noise are Gaussian white
noise with zero mean, that is wt ∼ N (0, Qt ) and vt ∼ N (0, Rt ), and for
times t, τ such that t ̸= τ , Cov(wt , wτ ) = 0 and Cov(vt , vτ ) = 0. Our initial
condition, or the prior, is represented as x0 ∼ N (µ0|0 , Σ0|0 ). We further
assume that Cov(x0 , vt ) = 0, Cov(x0 , wt ) = 0 for all t, and Cov(wt , vτ ) = 0
for all t, τ .
While the Kalman filter is limited to linear-Gaussian systems, it enables
us to efficiently deal with continuous random variables and vectors (infinitely
many outcomes). This is because as long as we have µ and Σ, we have the
entire probability distribution. Thus, instead of predicting and updating the
entire distribution p(x) (for all possible x) as we did for the Bayes filter, we
simply need to predict and update our mean vector and covariance matrix at
each time step, which we can do with closed-form equations. Here, we only
lay out the Kalman filter equations and interpret them to explain the underlying
intuition. There are numerous derivations online, from the probabilistic
perspective by applying the expectation definitions for mean and covariance,
or from an optimization perspective as the best linear unbiased estimator in
terms of mean-squared error.
Similar to the Bayes filter, the Kalman filter consists of a predict and
update step.
In the predict step, we compute the mean and covariance of the predicted
posterior p(xt | z1:t−1, u1:t) = N (µt|t−1, Σt|t−1) from our previous posterior and
knowledge of the system’s process model:
$$
\mu_{t|t-1} = A_t \mu_{t-1|t-1} + B_t u_t \tag{16}
$$

$$
\Sigma_{t|t-1} = A_t \Sigma_{t-1|t-1} A_t^T + Q_t \tag{17}
$$
The previous posterior is p(xt−1 | z1:t−1, u1:t−1) = N (µt−1|t−1, Σt−1|t−1). To
predict the mean µt|t−1 , we plug µt−1|t−1 into Equation 14 to get Equation 16.
The noise term wt in Equation 14 has zero mean so it does not affect µt|t−1 .
To derive the covariance prediction in Equation 17, we first make use of the
fact that given a random variable x with covariance Σ, Cov(Ax) = AΣAT .
Then, to account for any possible disturbances as the system evolves over
time, we add in the covariance Q of the process noise. Since covariance
matrices are positive semidefinite, adding Q can only increase the predicted
covariance, representing increased uncertainty. Because we are making a
prediction about the state at time t with data only up to t − 1, we are now
less confident about the state distribution.
In the update step, we refine the predicted mean and covariance with the
new measurement zt. The mean and covariance update equations are:

$$
K_t = \Sigma_{t|t-1} C_t^T \left( C_t \Sigma_{t|t-1} C_t^T + R_t \right)^{-1} \tag{18}
$$

$$
\mu_{t|t} = \mu_{t|t-1} + K_t \left( z_t - C_t \mu_{t|t-1} \right) \tag{19}
$$

$$
\Sigma_{t|t} = \Sigma_{t|t-1} - K_t C_t \Sigma_{t|t-1} \tag{20}
$$

where Kt is the Kalman gain. The gain can be rewritten as

$$
K_t = C_t^{-1}\, \frac{C_t \Sigma_{t|t-1} C_t^T}{C_t \Sigma_{t|t-1} C_t^T + R_t} \tag{21}
$$
Let us analyze the fraction on the right hand side for two extreme cases.
First, the case that Rt approaches zero in the limit means that we believe
there is little measurement noise. The fraction approaches 1 and K is simply
Ct−1 , and therefore µt|t = Ct−1 zt . No noise means we are confident that our
measurement zt is highly accurate, and we can simply invert our measurement
model to obtain the true state. Furthermore, the covariance is updated to
be Σt|t = Σt|t−1 − Σt|t−1 = 0, indicating our high confidence about the value
of the state. The second case is that Σt|t−1 approaches zero in the limit, which
means that there is little process noise. The fraction approaches 0, K = 0,
the updated mean is µt|t = µt|t−1 and the updated covariance is Σt|t = Σt|t−1 .
This means that because there are no disturbances in the system, the process
model alone is sufficient to propagate the state across time so we do not
need to incorporate the measurement at all (assuming we initialize the filter
with a good prior). In practice, neither of these two cases ever occur; the
filter attempts to update the mean and covariance to use both the actual
measurement and the predicted measurement, as illustrated in Figure 6 for
the case of a 2D state.
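A quick scalar sanity check, using our own illustrative numbers with Ct = 1:

$$
\Sigma_{t|t-1} = 4,\; R_t = 1:\quad K_t = \frac{4}{4+1} = 0.8, \qquad \Sigma_{t|t} = (1 - 0.8)\cdot 4 = 0.8
$$

$$
\Sigma_{t|t-1} = 4,\; R_t = 16:\quad K_t = \frac{4}{4+16} = 0.2, \qquad \Sigma_{t|t} = (1 - 0.2)\cdot 4 = 3.2
$$

A trustworthy sensor (small Rt) pushes the gain toward 1 and collapses the variance; a noisy one pushes the gain toward 0 and leaves the prediction nearly unchanged.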
The axes in Figure 6 are the components of the state; we are looking at the
distributions from top-down. The ellipse in orange represents the distribution
for our predicted measurement, given by N (Ct µt|t−1 , Ct Σt|t−1 CtT ). The ellipse
in blue represents the distribution for the actual measurement, given by
N (zt , Rt ). The Kalman filter multiplies both of these distributions to find
the overlapping region under which xt is likely under both distributions (recall
that for independent random variables, p(x1, x2) = p(x1)p(x2)). This turns out to be our updated distribution, which
is another Gaussian represented by the green ellipse, given by N (µt|t , Σt|t ).
Since we’re taking the overlap, notice how the updated distribution is smaller
compared to the two parent distributions. While the predict step increases
the covariance (uncertainty), the update step uses the new measurement zt
to reduce the covariance. Thus, the covariance alternates between increasing
and decreasing after the predict and update step. Overall, the covariance
decreases until it converges.
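Putting the predict and update steps together, here is a minimal NumPy sketch of the full Kalman filter loop for the linear-Gaussian model of Equations (14) and (15). The specific A, B, C, Q, R, and prior below are hypothetical values for a toy 2D constant-velocity example, not parameters from these notes.

```python
import numpy as np

def kf_predict(mu, Sigma, A, B, Q, u):
    # Equations (16)-(17): propagate the mean and covariance through the process model
    mu_pred = A @ mu + B @ u
    Sigma_pred = A @ Sigma @ A.T + Q
    return mu_pred, Sigma_pred

def kf_update(mu_pred, Sigma_pred, C, R, z):
    # Equations (18)-(20): Kalman gain, mean update, covariance update
    S = C @ Sigma_pred @ C.T + R                      # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)           # Kalman gain
    mu = mu_pred + K @ (z - C @ mu_pred)              # correct with the measurement
    Sigma = Sigma_pred - K @ C @ Sigma_pred           # covariance shrinks in the update
    return mu, Sigma

# Hypothetical toy system: state [position, velocity], position-only measurements.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)                 # process noise covariance Q_t
R = np.array([[0.25]])               # measurement noise covariance R_t
mu, Sigma = np.zeros(2), np.eye(2)   # prior N(mu_{0|0}, Sigma_{0|0})

rng = np.random.default_rng(1)
x = np.array([0.0, 1.0])             # true hidden state, used only to simulate data
for t in range(10):
    u = np.array([0.5])
    x = A @ x + B @ u + rng.multivariate_normal(np.zeros(2), Q)   # Equation (14)
    z = C @ x + rng.multivariate_normal(np.zeros(1), R)           # Equation (15)
    mu, Sigma = kf_predict(mu, Sigma, A, B, Q, u)
    mu, Sigma = kf_update(mu, Sigma, C, R, z)
print(mu, np.diag(Sigma))
```

Running this, the diagonal of Sigma exhibits exactly the behavior described above: it grows in each predict step, shrinks in each update step, and settles toward a steady value.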
Figure 8: Pushing a Gaussian distribution through a non-linear mapping.
5 Acknowledgements
References
[1] Chuong B. Do. More on multivariate Gaussians. http://cs229.stanford.edu/section/more_on_gaussians.pdf [Online], 2008.