Recursive Least Squares*

Yan-Bin Jia

Dec 3, 2019

*The material is adapted from Sections 3.1–3.3 in Dan Simon's book Optimal State Estimation [1].
1 Estimation of a Constant
We start with estimation of a constant based on several noisy measurements. Suppose we have a
resistor but do not know its resistance. So we measure it several times using a cheap (and noisy)
multimeter. How do we come up with a good estimate of the resistance based on these noisy
measurements?
More formally, suppose x = (x1 , x2 , . . . , xn )T is a constant but unknown vector, and y =
(y1 , y2 , . . . , yl )T is an l-element noisy measurement vector. Our task is to find the “best” estimate
x̃ of x. Here we look at perhaps the simplest case where each yi is a linear combination of xj ,
1 ≤ j ≤ n, with addition of some measurement noise νi . Thus, we are working with the following
linear system,
y = Hx + ν,
where ν = (ν1 , ν2 , . . . , νl )T , and H is an l × n matrix; or with all terms listed,
⎛ y1 ⎞   ⎛ H11 · · · H1n ⎞ ⎛ x1 ⎞   ⎛ ν1 ⎞
⎜  ⋮ ⎟ = ⎜  ⋮    ⋱    ⋮  ⎟ ⎜  ⋮ ⎟ + ⎜  ⋮ ⎟ .
⎝ yl ⎠   ⎝ Hl1 · · · Hln ⎠ ⎝ xn ⎠   ⎝ νl ⎠
Given an estimate x̃, we consider the difference between the noisy measurements and the pro-
jected values H x̃:
ǫ = y − H x̃.
Under the least squares principle, we will try to find the value of x̃ that minimizes the cost function
J(x̃) = ǫ^T ǫ
     = (y − H x̃)^T (y − H x̃)
     = y^T y − x̃^T H^T y − y^T H x̃ + x̃^T H^T H x̃.
The necessary condition for the minimum is the vanishing of the partial derivative of J with
respect to x̃, that is,
∂J/∂x̃ = −2y^T H + 2x̃^T H^T H = 0.
We solve the equation, obtaining
x̃ = (H T H)−1 H T y. (1)
The inverse (H^T H)^{-1} exists if rank(H) = n (which implies l ≥ n); in other words, the number of measurements must be no fewer than the number of unknowns, and the columns of H must be linearly independent.
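For concreteness, here is a minimal NumPy sketch of the batch estimate (1); the matrix sizes and values are arbitrary illustrative choices, and np.linalg.lstsq is used only as a numerically preferable way of solving the same problem as forming (H^T H)^{-1} explicitly.

```python
import numpy as np

# A minimal sketch of the batch least squares estimate (1) on made-up data.
rng = np.random.default_rng(0)
l, n = 20, 3                                    # number of measurements and unknowns
x_true = np.array([2.0, -1.0, 0.5])             # the constant vector to be estimated
H = rng.standard_normal((l, n))                 # a full-rank measurement matrix
y = H @ x_true + 0.1 * rng.standard_normal(l)   # y = Hx + noise

# Equation (1): x~ = (H^T H)^{-1} H^T y.
x_hat = np.linalg.inv(H.T @ H) @ H.T @ y
# Equivalent, numerically preferable solution of the same least squares problem.
x_hat_lstsq, *_ = np.linalg.lstsq(H, y, rcond=None)
print(x_hat, x_hat_lstsq)                       # both should be close to x_true
```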
Example 1. Suppose we are trying to estimate the resistance x of an unmarked resistor based on l noisy
measurements using a multimeter. In this case,
y = Hx + ν, (2)
where
H = (1, · · · , 1)T . (3)
Substitution of the above into equation (1) gives us the optimal estimate of x as
x̃ = (H^T H)^{-1} H^T y
  = (1/l) H^T y
  = (y1 + · · · + yl)/l. (4)
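The reduction of (1) to the sample average when H is a column of ones can be checked directly; the readings below are made-up values.

```python
import numpy as np

# Example 1 in code: with H = (1, ..., 1)^T, the estimate (1) is the average of the y_i.
y = np.array([99.8, 100.4, 100.1, 99.7, 100.0])   # made-up resistance readings
H = np.ones((len(y), 1))
x_hat = np.linalg.inv(H.T @ H) @ H.T @ y           # equation (1)
print(x_hat[0], y.mean())                          # identical values
```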
2 Weighted Least Squares

Suppose now that we have different degrees of confidence in the individual measurements, quantified by the noise variances

E(νi^2) = σi^2, 1 ≤ i ≤ l.
We also assume that the noise for each measurement has zero mean and is independent. The
covariance matrix for all measurement noise is
R = E(νν^T) = diag(σ1^2, . . . , σl^2).
If a measurement yi is noisy (as indicated by a large standard deviation σi), we care less about the discrepancy between it and the ith element of H x̃ because we do not have much confidence in this measurement. Accordingly, we weight each squared discrepancy by the inverse of the corresponding variance and minimize the cost function

J(x̃) = ǫ^T R^{-1} ǫ = (y − H x̃)^T R^{-1} (y − H x̃).

Setting ∂J/∂x̃ = 0 as in Section 1 yields the weighted least squares estimate

x̃ = (H^T R^{-1} H)^{-1} H^T R^{-1} y. (5)
Note that the measurement noise matrix R must be non-singular for a solution to exist. In other
words, each measurement yi must be corrupted by some noise for the estimation method to work.
Example 2. We get back to the problem in Example 1 of resistance estimation, for which the equations are
given in (2) and (3). Suppose each of the l noisy measurements has variance
E(νi^2) = σi^2, 1 ≤ i ≤ l,

so that R = diag(σ1^2, . . . , σl^2). Substituting H = (1, . . . , 1)^T and this R into (5) gives

x̃ = (y1/σ1^2 + · · · + yl/σl^2) / (1/σ1^2 + · · · + 1/σl^2),

a weighted average that trusts the less noisy measurements more. It is easy to verify that the above estimate simplifies to (4) when all measurements have the same standard deviation σ.
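Here is a short sketch of the weighted estimate (5) for this resistor example, using made-up readings and standard deviations; it also checks the reduction to the plain average (4) for equal variances.

```python
import numpy as np

# Weighted least squares estimate (5) for the resistor example, on made-up data.
y = np.array([99.8, 100.4, 100.1, 99.7])
sigma = np.array([0.1, 0.5, 0.2, 0.1])                     # per-measurement noise std deviations
H = np.ones((len(y), 1))
R_inv = np.diag(1.0 / sigma**2)

x_w = np.linalg.inv(H.T @ R_inv @ H) @ H.T @ R_inv @ y     # equation (5)
print(x_w[0])
print(np.sum(y / sigma**2) / np.sum(1.0 / sigma**2))       # same weighted average

# With equal variances, (5) reduces to the plain average (4).
R_inv_eq = np.eye(len(y)) / 0.2**2
x_eq = np.linalg.inv(H.T @ R_inv_eq @ H) @ H.T @ R_inv_eq @ y
print(x_eq[0], y.mean())
```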
3 Recursive Least Squares

Now suppose the measurements arrive one at a time, and we would like to update our estimate after each new measurement without recomputing the batch solution from scratch. Let the measurement obtained at time step k be

yk = Hk x + νk.

A linear recursive estimator then takes the following form:

x̃k = x̃k−1 + Kk (yk − Hk x̃k−1). (6)
Here Hk is an m × n matrix, and Kk is an n × m matrix referred to as the estimator gain. We refer to yk − Hk x̃k−1 as the correction term: the new estimate x̃k modifies the previous estimate x̃k−1 by a correction weighted by the gain matrix. The measurement noise has zero mean, i.e.,
E(ν k ) = 0.
The current estimation error is
ǫk = x − x̃k
= x − x̃k−1 − Kk (y k − Hk x̃k−1 )
= ǫk−1 − Kk (Hk x + ν k − Hk x̃k−1 )
= ǫk−1 − Kk Hk (x − x̃k−1 ) − Kk ν k
= (I − Kk Hk) ǫk−1 − Kk νk. (7)
If E(νk) = 0 and E(ǫk−1) = 0, then E(ǫk) = 0. So if the measurement noise νk has zero mean for all k, and the initial estimate of x is set equal to its expected value, then E(x̃k) = E(x) for all k. With this property, the estimator (6) is called unbiased. The property holds regardless of the value of the gain matrix Kk; it says that on average the estimate x̃k will be equal to the true value x.
The key is to determine the optimal value of the gain matrix Kk. The optimality criterion we use is to minimize the aggregated variance of the estimation errors at time k:

Jk = E(‖x − x̃k‖^2)
   = E(ǫk^T ǫk)
   = E(Tr(ǫk ǫk^T))
   = Tr(Pk), (8)
where Tr is the trace operator, and the n × n matrix Pk = E(ǫk ǫk^T) is the estimation-error covariance. Next, we obtain Pk via a substitution of (7):
Pk = E[ ((I − Kk Hk) ǫk−1 − Kk νk) ((I − Kk Hk) ǫk−1 − Kk νk)^T ].
The estimation error ǫk−1 at time k − 1 is independent of the measurement noise νk at time k, which implies that E(ǫk−1 νk^T) = 0. The cross terms therefore vanish, leaving

Pk = (I − Kk Hk) E(ǫk−1 ǫk−1^T) (I − Kk Hk)^T + Kk E(νk νk^T) Kk^T.
Given the definition of the m × m matrix Rk = E(ν k ν Tk ) as covariance of ν k , the expression of Pk
becomes
Pk = (I − Kk Hk )Pk−1 (I − Kk Hk )T + Kk Rk KkT . (9)
Equation (9) is the recurrence for the covariance of the least squares estimation error. It is
consistent with the intuition that as the measurement noise (Rk ) increases, the uncertainty (Pk )
increases. Note that Pk as a covariance matrix is positive definite.
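As a sanity check, (9) can be verified empirically: it describes the spread of the estimation error for any choice of gain, not only the optimal one. The following Monte Carlo sketch uses arbitrary made-up values for P0, H1, R1, and the gain K1.

```python
import numpy as np

# Monte Carlo check of the covariance recurrence (9); it holds for any gain,
# not only the optimal one.  All numbers below are made-up illustrative values.
rng = np.random.default_rng(0)
n, m, trials = 2, 3, 200_000

x = np.array([10.0, 5.0])                       # true constant parameter vector
P0 = np.diag([4.0, 1.0])                        # covariance of the initial error eps_0
H1 = rng.standard_normal((m, n))                # measurement matrix H_1
R1 = np.diag([0.3, 0.2, 0.5])                   # measurement noise covariance R_1
K1 = 0.1 * rng.standard_normal((n, m))          # an arbitrary (suboptimal) gain

eps0 = rng.multivariate_normal(np.zeros(n), P0, size=trials)   # eps_0 = x - x~_0
x0_hat = x - eps0
nu1 = rng.multivariate_normal(np.zeros(m), R1, size=trials)    # nu_1
y1 = x @ H1.T + nu1                                            # y_1 = H_1 x + nu_1

x1_hat = x0_hat + (y1 - x0_hat @ H1.T) @ K1.T                  # update (6)
eps1 = x - x1_hat                                              # eps_1 = x - x~_1

P1_empirical = np.cov(eps1, rowvar=False)
A = np.eye(n) - K1 @ H1
P1_formula = A @ P0 @ A.T + K1 @ R1 @ K1.T                     # recurrence (9)
print(np.round(P1_empirical, 3))
print(np.round(P1_formula, 3))                                 # the two agree closely
```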
What remains is to find the value of the gain Kk that minimizes the cost function given by (8). We already know that the mean of the estimation error is zero regardless of Kk; choosing Kk to minimize Jk therefore keeps the estimation error consistently small. To do so we differentiate Jk with respect to Kk. (The derivative of a function f with respect to a matrix A = (aij) is the matrix ∂f/∂A = (∂f/∂aij).)
Theorem 1 Let X be an r × s matrix, and let C be a matrix that does not depend on X, of dimension r × s in (10) and s × s in (11) so that the products below are defined. Then the following holds:

∂Tr(CX^T)/∂X = C, (10)
∂Tr(XCX^T)/∂X = XC + XC^T. (11)
A proof of the theorem is given in Appendix A. In the case that C is symmetric, ∂Tr(XCX^T)/∂X = 2XC. With these facts in mind, we first substitute (9) into (8) and differentiate the resulting expression with respect to Kk:
∂Jk/∂Kk = ∂/∂Kk Tr(Pk−1 − Kk Hk Pk−1 − Pk−1 Hk^T Kk^T + Kk (Hk Pk−1 Hk^T) Kk^T) + ∂/∂Kk Tr(Kk Rk Kk^T)
        = −2 ∂/∂Kk Tr(Pk−1 Hk^T Kk^T) + 2Kk (Hk Pk−1 Hk^T) + 2Kk Rk        (by (11))
        = −2Pk−1 Hk^T + 2Kk Hk Pk−1 Hk^T + 2Kk Rk        (by (10))
        = −2Pk−1 Hk^T + 2Kk (Hk Pk−1 Hk^T + Rk).
In the second equation above, we also used that Pk−1 is independent of Kk and that Kk Hk Pk−1 and
Pk−1 HkT KkT are transposes of each other (since Pk−1 is symmetric) so they have the same trace.
Setting the partial derivative to zero, we solve for Kk:

Kk = Pk−1 Hk^T (Hk Pk−1 Hk^T + Rk)^{-1} (12)
   = Pk−1 Hk^T Sk^{-1}, (13)

where Sk = Hk Pk−1 Hk^T + Rk. Substituting the above expression for Kk into equation (9) and expanding leads to a few steps of manipulation:
Pk = (I − Pk−1 Hk^T Sk^{-1} Hk) Pk−1 (I − Pk−1 Hk^T Sk^{-1} Hk)^T + Pk−1 Hk^T Sk^{-1} Rk Sk^{-1} Hk Pk−1
   = Pk−1 − Pk−1 Hk^T Sk^{-1} Hk Pk−1 − Pk−1 Hk^T Sk^{-1} Hk Pk−1
       + Pk−1 Hk^T Sk^{-1} Hk Pk−1 Hk^T Sk^{-1} Hk Pk−1 + Pk−1 Hk^T Sk^{-1} Rk Sk^{-1} Hk Pk−1
   = Pk−1 − Pk−1 Hk^T Sk^{-1} Hk Pk−1 − Pk−1 Hk^T Sk^{-1} Hk Pk−1 + Pk−1 Hk^T Sk^{-1} Sk Sk^{-1} Hk Pk−1
       (after merging the last two terms, using Hk Pk−1 Hk^T + Rk = Sk)
   = Pk−1 − 2Pk−1 Hk^T Sk^{-1} Hk Pk−1 + Pk−1 Hk^T Sk^{-1} Hk Pk−1
   = Pk−1 − Pk−1 Hk^T Sk^{-1} Hk Pk−1 (14)
   = Pk−1 − Kk Hk Pk−1 (by (13))
   = (I − Kk Hk) Pk−1. (15)
Note that in the above Pk is symmetric as a covariance matrix, and so is Sk .
We take the inverses of both sides of equation (14), with Sk written out; by the matrix inversion lemma,

Pk^{-1} = (Pk−1 − Pk−1 Hk^T (Hk Pk−1 Hk^T + Rk)^{-1} Hk Pk−1)^{-1}
        = Pk−1^{-1} + Hk^T Rk^{-1} Hk. (16)
The above yields an alternative expression for the covariance matrix:

Pk = (Pk−1^{-1} + Hk^T Rk^{-1} Hk)^{-1}. (17)
This expression is more complicated than (15) since it requires three matrix inversions. Neverthe-
less, it has computational advantages in certain situations in practice [1, pp.156–158].
We can also derive an alternative form for the gain Kk as follows. Start by multiplying the right side of (12) on the left with Pk Pk^{-1} (which is the identity). Then, substitute (16) for Pk^{-1} in the resulting expression. Multiply the Pk−1 Hk^T factor into the parenthesized factor on its left, and extract Hk^T Rk^{-1} out of the parentheses. The last two parenthesized factors cancel each other, yielding

Kk = Pk Hk^T Rk^{-1}. (18)
The recursive least squares algorithm is summarized below.

1. Initialize the estimate and the covariance of the estimation error:

   x̃0 = E(x),
   P0 = E((x − x̃0)(x − x̃0)^T).

2. Iterate the following two steps.
(a) Obtain a new measurement y k , assuming that it is given by the equation
y k = Hk x + ν k ,
where the noise νk has zero mean and covariance Rk. The measurement noise at different time steps is independent, so

E(νi νj^T) = 0,  if i ≠ j,
           = Rj, if i = j.
Essentially, we assume white measurement noise.
(b) Update the estimate x̃ and the covariance of the estimation error according to (12), (15), and (6), which are re-listed below:
Kk = Pk−1 Hk^T (Hk Pk−1 Hk^T + Rk)^{-1}, (19)
Pk = (I − Kk Hk) Pk−1, (20)
x̃k = x̃k−1 + Kk (yk − Hk x̃k−1), (21)

or according to (17), (18), and (21):

Pk = (Pk−1^{-1} + Hk^T Rk^{-1} Hk)^{-1},
Kk = Pk Hk^T Rk^{-1},
x̃k = x̃k−1 + Kk (yk − Hk x̃k−1).
Note that (21) and (20) can switch their order in one round of update.
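For illustration, here is a minimal NumPy sketch of one round of the update in both forms; the function names and the test values at the bottom are my own additions, not part of the original algorithm statement.

```python
import numpy as np

def rls_update(x_hat, P, H, y, R):
    """One recursive least squares update, following (19)-(21)."""
    S = H @ P @ H.T + R                        # innovation covariance H_k P_{k-1} H_k^T + R_k
    K = P @ H.T @ np.linalg.inv(S)             # gain (19)
    P_new = (np.eye(len(x_hat)) - K @ H) @ P   # covariance update (20)
    x_new = x_hat + K @ (y - H @ x_hat)        # estimate update (21)
    return x_new, P_new

def rls_update_information_form(x_hat, P, H, y, R):
    """The same update via (17), (18), and (21)."""
    P_new = np.linalg.inv(np.linalg.inv(P) + H.T @ np.linalg.inv(R) @ H)   # (17)
    K = P_new @ H.T @ np.linalg.inv(R)                                     # (18)
    x_new = x_hat + K @ (y - H @ x_hat)                                    # (21)
    return x_new, P_new

# The two forms agree, as the derivation shows; check on made-up values.
rng = np.random.default_rng(1)
x_hat, P = np.array([8.0, 7.0]), np.eye(2)
H, R = rng.standard_normal((3, 2)), 0.01 * np.eye(3)
y = rng.standard_normal(3)
xa, Pa = rls_update(x_hat, P, H, y, R)
xb, Pb = rls_update_information_form(x_hat, P, H, y, R)
assert np.allclose(xa, xb) and np.allclose(Pa, Pb)
```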
Example 3. We revisit the resistance estimation problem presented in Examples 1 and 2. Now, we want
to iteratively improve our estimate of the resistance x. At the kth sampling, our measurement is
yk = Hk x + νk = x + νk,  with  Rk = E(νk^2).

Here the measurement matrix Hk is the scalar 1. Furthermore, we suppose that each measurement has the same variance, so that Rk is a constant written as R.
Before the first measurement, we have some idea about the resistance x. This becomes our initial
estimate. Also, we have some uncertainty about this initial estimate, which becomes our initial covariance.
Together we have
x̃0 = E(x),
P0 = E((x − x̃0 )2 ).
If we have no idea about the resistance, set P0 = ∞. If we are certain about the resistance value, set P0 = 0.
(Of course, then there would be no need to take measurements.)
After the first measurement (k = 1), we update the estimate and the error covariance according to equations (19)–(21) as follows:
K1 = P0/(P0 + R),
x̃1 = x̃0 + P0/(P0 + R) (y1 − x̃0),
P1 = (1 − P0/(P0 + R)) P0 = P0 R/(P0 + R).
After the second measurement, the estimates become
K2 = P1/(P1 + R) = P0/(2P0 + R),
x̃2 = x̃1 + P1/(P1 + R) (y2 − x̃1)
   = (P0 + R)/(2P0 + R) x̃1 + P0/(2P0 + R) y2,
P2 = P1 R/(P1 + R) = P0 R/(2P0 + R).
By induction, we can show that
Kk = P0/(kP0 + R),
x̃k = ((k − 1)P0 + R)/(kP0 + R) x̃k−1 + P0/(kP0 + R) yk,
Pk = P0 R/(kP0 + R).
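These closed forms are easy to confirm against the recursion (19)–(20) in the scalar case; the values of P0 and R below are made up.

```python
import numpy as np

# Check the closed-form expressions against the scalar recursion with H_k = 1.
P0, R = 2.0, 0.5          # made-up prior variance and measurement noise variance
P = P0
for k in range(1, 11):
    K = P / (P + R)                               # gain (19) with H_k = 1
    P = (1 - K) * P                               # covariance update (20)
    assert np.isclose(K, P0 / (k * P0 + R))       # K_k = P0 / (k P0 + R)
    assert np.isclose(P, P0 * R / (k * P0 + R))   # P_k = P0 R / (k P0 + R)
```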
Note that if x is known perfectly a priori, then P0 = 0, which implies that Kk = 0 and x̃k = x̃0 , for all
k. The optimal estimate of x is independent of any measurements that are obtained. At the opposite end of
the spectrum, if x is completely unknown a priori, then P0 = ∞. The above equation for x̃k becomes,
x̃k = lim_{P0→∞} [ ((k − 1)P0 + R)/(kP0 + R) x̃k−1 + P0/(kP0 + R) yk ]
   = ((k − 1)/k) x̃k−1 + (1/k) yk
   = (1/k) ((k − 1) x̃k−1 + yk).
The right-hand side of the last equation above is just the running average ȳk = (1/k) Σ_{j=1}^k yj of the measurements. To see this, we first have
Σ_{j=1}^k yj = Σ_{j=1}^{k−1} yj + yk
            = (k − 1) · (1/(k − 1)) Σ_{j=1}^{k−1} yj + yk
            = (k − 1) ȳk−1 + yk.
Since x̃1 = ȳ1 , the recurrences for x̃k and ȳk are the same. Hence x̃k = ȳk for all k.
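A scalar sketch of this example follows; the true resistance, noise variance, and the large value used to approximate P0 = ∞ are made-up choices.

```python
import numpy as np

# Recursive estimation of a resistance with H_k = 1 and constant noise variance R.
rng = np.random.default_rng(2)
x_true, R = 100.0, 4.0
P = 1e9                                   # a huge P0 approximates "completely unknown a priori"
x_hat = 0.0

measurements = x_true + np.sqrt(R) * rng.standard_normal(50)
for y in measurements:
    K = P / (P + R)                       # gain K_k = P_{k-1} / (P_{k-1} + R)
    x_hat = x_hat + K * (y - x_hat)       # estimate update (21)
    P = (1 - K) * P                       # covariance update (20)

# With P0 essentially infinite, the recursive estimate reduces to the running average.
print(x_hat, measurements.mean())         # the two values are nearly identical
```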
Example 4. Finally, consider estimating a two-element vector x = (x1, x2)^T from the scalar measurements

yk = x1 + 0.99^{k−1} x2 + νk,

where Hk = (1, 0.99^{k−1}), and νk is a random variable with zero mean and variance R = 0.01.

Let the true values be (x1, x2)^T = (10, 5)^T. Suppose the initial estimate is x̃0 = (8, 7)^T with P0 equal to the identity matrix. We apply the recursive least squares algorithm. Figure 3.1 on p. 92 of [1] shows the evolution of the two component estimates, along with the variances of the estimation errors. After a couple dozen measurements, the estimates get very close to the true values 10 and 5, and the variances of the estimation errors asymptotically approach zero. This means that we have increasingly more confidence in the estimates as more measurements are obtained.
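The following sketch reproduces this experiment with the update (19)–(21); only the quantities stated above (true values, initial estimate, P0 = I, R = 0.01) are used, and the random seed is arbitrary.

```python
import numpy as np

# Recursive least squares on y_k = x1 + 0.99**(k-1) * x2 + v_k.
rng = np.random.default_rng(3)
x_true = np.array([10.0, 5.0])
x_hat, P, R = np.array([8.0, 7.0]), np.eye(2), 0.01

for k in range(1, 31):
    H = np.array([[1.0, 0.99 ** (k - 1)]])          # 1 x 2 measurement matrix H_k
    y = H @ x_true + np.sqrt(R) * rng.standard_normal(1)
    S = H @ P @ H.T + R                             # innovation covariance (a 1 x 1 matrix)
    K = P @ H.T @ np.linalg.inv(S)                  # gain (19)
    x_hat = x_hat + K @ (y - H @ x_hat)             # estimate update (21)
    P = (np.eye(2) - K @ H) @ P                     # covariance update (20)

print(np.round(x_hat, 2))    # close to [10, 5]
print(np.diag(P))            # error variances shrink as measurements accumulate
```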
A Proof of Theorem 1
Proof
Denote C = (cij ), X = (xij ), and CX T = (dij ). The trace of CX T is
Tr(CX^T) = Σ_{t=1}^r d_tt
         = Σ_{t=1}^r Σ_{k=1}^s c_tk x_tk.
From the above, we easily obtain the partial derivatives with respect to the entries of X:

∂Tr(CX^T)/∂x_ij = c_ij,

which is exactly (10). For (11), we apply the product rule to the two occurrences of X:
∂/∂X Tr(XCX^T) = [∂/∂X Tr(XCY^T)]_{Y=X} + [∂/∂X Tr(Y CX^T)]_{Y=X}
              = [∂/∂X Tr(Y C^T X^T)]_{Y=X} + [Y C]_{Y=X}        (by (10))
              = [Y C^T]_{Y=X} + XC
              = XC^T + XC.
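As a quick numerical sanity check of (10) and (11), here is a finite-difference sketch; the dimensions and the random matrices are arbitrary.

```python
import numpy as np

# Finite-difference check of the trace-derivative identities (10) and (11).
rng = np.random.default_rng(4)
r, s = 2, 3
X = rng.standard_normal((r, s))
C10 = rng.standard_normal((r, s))       # C in identity (10), same shape as X
C11 = rng.standard_normal((s, s))       # C in identity (11), square

def num_grad(f, X, h=1e-6):
    """Numerical gradient df/dX by central differences, entry by entry."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return G

g10 = num_grad(lambda Z: np.trace(C10 @ Z.T), X)
g11 = num_grad(lambda Z: np.trace(Z @ C11 @ Z.T), X)
assert np.allclose(g10, C10, atol=1e-5)                   # matches (10)
assert np.allclose(g11, X @ C11 + X @ C11.T, atol=1e-5)   # matches (11)
```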
References
[1] D. Simon. Optimal State Estimation. John Wiley & Sons, Inc., Hoboken, New Jersey, 2006.