
T. Rothenberg, Fall 2007

State Space Models and the Kalman Filter


1 Introduction
Many time-series models used in econometrics are special cases of the class of linear state
space models developed by engineers to describe physical systems. The Kalman filter, an
efficient recursive method for computing optimal linear forecasts in such models, can be
exploited to compute the exact Gaussian likelihood function.
The linear state-space model postulates that an observed time series is a linear function of
a (generally unobserved) state vector and that the law of motion for the state vector is a first-order
vector autoregression. More precisely, let $y_t$ be the observed variable at time $t$ and let $\alpha_t$
denote the values taken at time $t$ by a vector of $p$ state variables. Let $A$ and $b$ be $p \times p$ and
$p \times 1$ matrices of constants. We assume that $\{y_t\}$ is generated by

$$y_t = b'\alpha_t + u_t, \tag{1}$$
$$\alpha_t = A\alpha_{t-1} + v_t \tag{2}$$

where the scalar $u_t$ and the vector $v_t$ are mean-zero, white-noise processes, independent of
each other and of the initial value $\alpha_0$. We denote $\sigma^2 = E(u_t^2)$ and $\Omega = E(v_t v_t')$. Equation (1) is
sometimes called the "measurement" equation while (2) is called the "transition" equation.
The assumption that the autoregression is first-order is not restrictive, since higher-order
systems can be handled by adding additional state variables.
In most engineering (and some economic) applications, the $\alpha$'s represent meaningful but
imperfectly measured physical variables. Models based on the "permanent" income hypothesis
are classic examples. But sometimes state-space models are used simply to exploit the
fact that rather complicated dynamics in an observable variable can result from adding noise
to a linear combination of autoregressive variables. For example, all ARMA models for $y_t$ can
be put in state-space form even though the state variables $\alpha_t$ have no particular economic
meaning. An even richer class of (possibly nonstationary) state-space models can be produced
by introducing an observed exogenous forcing variable $x_t$ into the measurement equation, by
letting $b$, $A$, $\sigma^2$, and $\Omega$ depend on $t$, and by letting $y_t$ be a vector. Since these generalizations
complicate the notation but do not affect the basic theory, they will be ignored in these notes.
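To make the setup concrete, here is a minimal simulation sketch of the model in (1)-(2), written in Python. It is not part of the original notes; the function name and the use of NumPy are illustrative, and the only inputs assumed are the arrays A, b, sigma2, and Omega in the notation above.

import numpy as np

def simulate_state_space(A, b, sigma2, Omega, T, rng=None):
    """Simulate y_1,...,y_T from the measurement/transition equations (1)-(2)."""
    rng = np.random.default_rng() if rng is None else rng
    p = A.shape[0]
    alpha = np.zeros(p)                               # state started at its zero mean
    y = np.empty(T)
    for t in range(T):
        v = rng.multivariate_normal(np.zeros(p), Omega)   # v_t with E(v_t v_t') = Omega
        alpha = A @ alpha + v                             # transition equation (2)
        u = np.sqrt(sigma2) * rng.standard_normal()       # u_t with variance sigma^2
        y[t] = b @ alpha + u                              # measurement equation (1)
    return y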

2 ARMA Models in State Space Form


Consider the ARMA(1,1) model

$$y_t = \varphi y_{t-1} + \varepsilon_t + \theta\varepsilon_{t-1}.$$

Defining $\alpha_t = (\alpha_{1t}, \alpha_{2t})' = (y_t, \theta\varepsilon_t)'$, we can write $y_t = b'\alpha_t$ where $b = (1, 0)'$ and

$$\begin{pmatrix} \alpha_{1t} \\ \alpha_{2t} \end{pmatrix} = \begin{pmatrix} \varphi & 1 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} \alpha_{1,t-1} \\ \alpha_{2,t-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_t \\ \theta\varepsilon_t \end{pmatrix}.$$

Thus the ARMA(1,1) model has a state-space representation with $u_t = 0$.


More generally, suppose $\{y_t\}$ is a mean-zero ARMA(p,q) process. Let $m = \max(p, q+1)$.
Then we can write

$$y_t = \varphi_1 y_{t-1} + \cdots + \varphi_m y_{t-m} + \varepsilon_t + \theta_1\varepsilon_{t-1} + \cdots + \theta_{m-1}\varepsilon_{t-m+1}$$

with the redundant coefficients set to zero. Define the column vectors

$$b = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}\ (m \times 1), \qquad c = \begin{pmatrix} \varphi_1 \\ \varphi_2 \\ \vdots \\ \varphi_{m-1} \end{pmatrix}\ ((m-1) \times 1), \qquad d = \begin{pmatrix} 1 \\ \theta_1 \\ \vdots \\ \theta_{m-1} \end{pmatrix}\ (m \times 1).$$

By successive substitution, one can verify that $y_t$ has the state-space representation

$$y_t = b'\alpha_t, \qquad \alpha_t = A\alpha_{t-1} + v_t$$

where $\alpha_t$ is an m-dimensional state vector, $u_t = 0$, $v_t = d\varepsilon_t$, and

$$A = \begin{pmatrix} c & I_{m-1} \\ \varphi_m & 0' \end{pmatrix}.$$
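As a concrete illustration (not in the original notes), the following sketch builds the arrays b, d, and A above from given ARMA coefficients; the function name and the zero-padding convention are my own, assuming the $\varphi$'s and $\theta$'s are supplied as ordinary sequences.

import numpy as np

def arma_state_space(phi, theta):
    """Return (b, A, d) for y_t = b'alpha_t, alpha_t = A alpha_{t-1} + d*eps_t."""
    p, q = len(phi), len(theta)
    m = max(p, q + 1)
    phi = np.concatenate([phi, np.zeros(m - p)])          # pad redundant AR coefficients with zeros
    theta = np.concatenate([theta, np.zeros(m - 1 - q)])  # pad redundant MA coefficients with zeros
    b = np.zeros(m); b[0] = 1.0                           # b = (1, 0, ..., 0)'
    d = np.concatenate([[1.0], theta])                    # d = (1, theta_1, ..., theta_{m-1})'
    A = np.zeros((m, m))
    A[:m - 1, 0] = phi[:m - 1]                            # first column of the top block is c
    A[:m - 1, 1:] = np.eye(m - 1)                         # identity block I_{m-1}
    A[m - 1, 0] = phi[m - 1]                              # bottom-left entry is phi_m
    return b, A, d

With $u_t = 0$ and transition-error variance $\Omega = \sigma_\varepsilon^2 dd'$, these arrays feed directly into the filter of the next section.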

3 The Kalman Filter


Denote the vector $(y_1, \ldots, y_t)$ by $Y_t$. The Kalman filter is a recursive algorithm for producing
optimal linear forecasts of $\alpha_{t+1}$ and $y_{t+1}$ from the past history $Y_t$, assuming that $A$, $b$, $\sigma^2$,
and $\Omega$ are known. Define

$$a_t = E(\alpha_t|Y_{t-1}) \quad \text{and} \quad V_t = \mathrm{var}(\alpha_t|Y_{t-1}). \tag{3}$$

If the u's and v's are normally distributed, the minimum MSE forecast of $y_{t+1}$ at time $t$
is $b'a_{t+1}$. The key fact (which we shall derive below) is that, under normality, $a_{t+1}$ can be
calculated recursively by

$$a_{t+1} = Aa_t + AV_t b\,\frac{y_t - b'a_t}{b'V_t b + \sigma^2}, \qquad V_{t+1} = AV_t A' + \Omega - \frac{AV_t bb'V_t A'}{b'V_t b + \sigma^2} \tag{4}$$

starting with the appropriate initial values $(a_1, V_1)$. To form the forecast $b'a_{t+1}$ of $y_{t+1}$ at time $t$, one
needs only the current $y_t$ and the previous forecast of $\alpha_t$ and its variance. Previous values
$y_1, \ldots, y_{t-1}$ enter only through $a_t$. Note that $y_t$ enters linearly into the calculation of $a_{t+1}$ and
does not enter at all into the calculation of $V_{t+1}$. The forecast of $y_t$ is a linear filter of previous
y's. If the errors are not normal, the forecasts produced from iterating (4) are still of interest;
they are best linear predictors.
The appropriate starting values $a_1$ and $V_1$ depend on the assumption made on $\alpha_0$. If the
$\{\alpha_t\}$ are covariance stationary, then each $\alpha_t$ must have zero mean and constant variance. In
that case, $a_1 = E[\alpha_1] = 0$ and $V_1 = \mathrm{var}[\alpha_1]$ must satisfy $V_1 = AV_1A' + \Omega$. This implies

$$\mathrm{vec}(V_1) = [I - (A \otimes A)]^{-1}\mathrm{vec}(\Omega). \tag{5}$$

In practice, one often uses mathematically convenient initial conditions and relies on the fact
that, for weakly dependent processes, initial conditions do not matter very much. For more
details, see A. Harvey, Forecasting, Structural Time Series Models and the Kalman Filter
(1989), Chapter 3.
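A compact Python sketch of recursion (4), together with the stationary initialization (5), might look as follows. This is my own illustration, not part of the notes; the name kalman_forecasts is invented, and the arrays A, b, sigma2, Omega follow the notation above.

import numpy as np

def kalman_forecasts(y, A, b, sigma2, Omega):
    """Iterate (4): return the one-step forecasts b'a_t and variances b'V_t b + sigma^2."""
    p, T = A.shape[0], len(y)
    # stationary initial condition (5): vec(V_1) = [I - (A kron A)]^{-1} vec(Omega)
    V = np.linalg.solve(np.eye(p * p) - np.kron(A, A), Omega.reshape(-1)).reshape(p, p)
    a = np.zeros(p)                                    # a_1 = E[alpha_1] = 0
    yhat, fvar = np.empty(T), np.empty(T)
    for t in range(T):
        yhat[t] = b @ a                                # forecast of y_t given Y_{t-1}
        fvar[t] = b @ V @ b + sigma2                   # its variance b'V_t b + sigma^2
        gain = A @ V @ b / fvar[t]
        a = A @ a + gain * (y[t] - yhat[t])            # a_{t+1} from (4)
        V = A @ V @ A.T + Omega - np.outer(gain, b @ V @ A.T)   # V_{t+1} from (4)
    return yhat, fvar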

4 Using the Kalman Filter to Compute ML Estimates


Suppose we wish to estimate the unknown parameters of a given state-space model from
the observations $y_1, \ldots, y_T$. Let $f(y_t|Y_{t-1})$ represent the conditional density of $y_t$, given the
previous y's. The joint density function for the y's can always be factored as

$$f(y_1)f(y_2|Y_1)f(y_3|Y_2)\cdots f(y_T|Y_{T-1}).$$
If the y's are normal, it follows from equations (1) and (2) that $f(y_t|Y_{t-1})$ is also normal
with mean $b'a_t$ and variance $\sigma^2 + b'V_tb$. Hence, the log-likelihood function is (apart from a
constant)

$$-\frac{1}{2}\sum_{t=1}^{T}\left[\ln(b'V_tb + \sigma^2) + \frac{(y_t - b'a_t)^2}{b'V_tb + \sigma^2}\right] \tag{6}$$

and can be computed from the output of the Kalman filter. Of course, an alternative expression
for the normal log-likelihood is

$$-\frac{1}{2}\left[\ln|\Sigma| + y'\Sigma^{-1}y\right]$$

where $y = (y_1, \ldots, y_T)'$ and $\Sigma = E(yy')$. Thus, the Kalman filter can be viewed as a recursive
algorithm for computing $\Sigma^{-1}$ and $|\Sigma|$. After evaluating the normal likelihood (for any given
values of the parameters), quasi maximum likelihood estimates can be obtained by grid search
or by iterative methods such as the Newton-Raphson algorithm.
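As an illustration (mine, not the notes'), the log-likelihood (6) is a one-line computation given the filter output; this sketch reuses the hypothetical kalman_forecasts function above.

import numpy as np

def gaussian_loglik(y, A, b, sigma2, Omega):
    """Evaluate (6), the exact Gaussian log-likelihood up to a constant."""
    yhat, fvar = kalman_forecasts(y, A, b, sigma2, Omega)   # one-step forecasts and variances
    return -0.5 * np.sum(np.log(fvar) + (y - yhat) ** 2 / fvar)

Maximizing this over the unknown parameters with a generic optimizer (for example, scipy.optimize.minimize applied to the negative log-likelihood) yields the quasi maximum likelihood estimates mentioned above.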
The Kalman filter can also be used to compute GLS regression estimates. As an example,
consider the regression model $y_t = \beta'x_t + u_t$, where $x_t$ is a vector of K exogenous variables
and $u_t$ is a stationary normal ARMA(p,q) process with known parameters. Direct use of GLS
requires finding the inverse of the variance matrix for the u's. This can be achieved more easily
using the Kalman filter. If $u_t$ were observable, one could put the model for $u_t$ in state-space
form and compute via the Kalman filter the best linear predictor of $u_t$ given its past history,
say $E(u_t|\text{past u's}) = b'a_t$, and the prediction error variance $\mathrm{Var}(u_t|\text{past u's}) = b'V_tb + \sigma^2$.
Note that the T random variables

$$u_t^* = \frac{u_t - E(u_t|\text{past u's})}{\sqrt{\mathrm{Var}(u_t|\text{past u's})}}, \qquad t = 1, \ldots, T$$

are uncorrelated with unit variance. Since $E(u_t|\text{past u's})$ is linear in past u's and $\mathrm{Var}(u_t|\text{past u's})$ does not depend on the u's at all, we can write in vector notation $u^* = Ru$ where R is a
nonrandom triangular matrix. Of course, we do not observe the u's. But we can apply this
filter to the y and X data, constructing K + 1 new time series $y^* = Ry$ and $X^* = RX$. Note
that $y^* = X^*\beta + u^*$. If we regress $y^*$ on $X^*$, the resulting coefficient vector is the GLS estimate of $\beta$,
since by construction $u^*$ is white noise.
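A sketch of this whitening idea, again using the hypothetical kalman_forecasts helper above; the function names are my own, and the state-space arrays are assumed to describe the ARMA model for $u_t$ (for a pure ARMA error one would pass sigma2 = 0 and Omega = sig_eps2 * np.outer(d, d) with (b, A, d) from arma_state_space).

import numpy as np

def whiten(z, A, b, sigma2, Omega):
    """Apply the triangular transform R implied by the u_t model to an arbitrary series z."""
    zhat, fvar = kalman_forecasts(z, A, b, sigma2, Omega)
    return (z - zhat) / np.sqrt(fvar)

def gls_via_kalman(y, X, A, b, sigma2, Omega):
    """GLS estimate of beta in y_t = beta'x_t + u_t with ARMA errors in state-space form."""
    y_star = whiten(y, A, b, sigma2, Omega)                        # y* = Ry
    X_star = np.column_stack([whiten(X[:, k], A, b, sigma2, Omega)
                              for k in range(X.shape[1])])         # X* = RX, column by column
    beta_hat, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)     # OLS of y* on X*
    return beta_hat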

5 Derivation of the Recursion Equations


Recall that, if a scalar random variable Z and a random vector X are jointly normal, then

$$E(X|Z) = E(X) + \frac{\mathrm{cov}(X,Z)}{\mathrm{var}(Z)}(Z - EZ), \qquad \mathrm{Var}(X|Z) = \mathrm{Var}(X) - \frac{\mathrm{cov}(X,Z)\,\mathrm{cov}(X,Z)'}{\mathrm{var}(Z)}. \tag{7}$$

Define the random variables $a_t^* = E(\alpha_t|Y_t)$ and $V_t^* = \mathrm{var}(\alpha_t|Y_t)$. Note that $a_t^*$ and $a_t$ are
both expectations of the same random variable $\alpha_t$, the former conditioning on $y_t$ and the
latter not. Likewise $V_t^*$ and $V_t$ are both variances of $\alpha_t$, the former conditioning on $y_t$ and
the latter not. Since, conditional on $Y_{t-1}$, the vector $\alpha_t$ and the scalar $y_t$ are jointly normal,
we can use (7) to calculate a relationship between $a_t^*$ and $a_t$ and between $V_t^*$ and $V_t$. From
(1) and (2) we have

$$\mathrm{Cov}(\alpha_t, y_t|Y_{t-1}) = \mathrm{Cov}(\alpha_t, \alpha_t'b|Y_{t-1}) = V_tb$$
$$\mathrm{Var}(y_t|Y_{t-1}) = \mathrm{Var}(b'\alpha_t + u_t|Y_{t-1}) = b'V_tb + \sigma^2$$
$$E(\alpha_t|Y_{t-1}) = a_t, \qquad E(y_t|Y_{t-1}) = b'a_t.$$
Thus, letting $\alpha_t$ play the role of X and $y_t$ the role of Z, we have from (7)

$$a_t^* = a_t + V_tb\,\frac{y_t - b'a_t}{b'V_tb + \sigma^2} \qquad \text{and} \qquad V_t^* = V_t - \frac{V_tbb'V_t}{b'V_tb + \sigma^2}. \tag{8}$$

From (2), it follows that

$$a_{t+1} = Aa_t^* \qquad \text{and} \qquad V_{t+1} = AV_t^*A' + \Omega. \tag{9}$$

The "updating" equations (8) describe how the forecast of the state vector at time t is changed
when $y_t$ is observed. Together with the "prediction" equations (9), they imply the recursion
(4).
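A quick numerical check (my own, with arbitrary made-up values) that composing the updating step (8) with the prediction step (9) reproduces one iteration of (4):

import numpy as np

rng = np.random.default_rng(0)
p = 3
A = 0.5 * rng.standard_normal((p, p))
b = rng.standard_normal(p)
Omega, sigma2 = np.eye(p), 1.0
V, a, y = np.eye(p), rng.standard_normal(p), 0.7     # stand-ins for V_t, a_t, y_t

f = b @ V @ b + sigma2
a_star = a + V @ b * (y - b @ a) / f                 # updating equation (8)
V_star = V - np.outer(V @ b, b @ V) / f
a_next = A @ a_star                                  # prediction equation (9)
V_next = A @ V_star @ A.T + Omega

a_direct = A @ a + A @ V @ b * (y - b @ a) / f       # recursion (4) applied directly
V_direct = A @ V @ A.T + Omega - np.outer(A @ V @ b, b @ V @ A.T) / f
assert np.allclose(a_next, a_direct) and np.allclose(V_next, V_direct)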
In models where the state variables have an economic interpretation, it is sometimes
desirable to estimate $\alpha_t$ using all the available data. Starting with $a_T$ and $V_T$ computed with
the Kalman filter, one can iterate backwards to compute $E(\alpha_t|Y_T)$. The relevant recursion,
called the "smoothing" algorithm, is derived and discussed in Harvey's book.

6 Matrices that Diagonalize the Covariance Matrix for y


Again, let $Y_t$ denote the vector $(y_1, \ldots, y_t)$. Note that $a_t$ is a linear function of the data in $Y_{t-1}$
and hence the prediction error $e_t = y_t - b'a_t$ is a linear function of the data in $Y_t$. If $t > s$,

$$E(e_te_s) = E\left[e_s\,E(b'\alpha_t - b'a_t + u_t|Y_{t-1})\right] = 0.$$

If $t = s$,

$$Ee_t^2 = E[\mathrm{var}(b'\alpha_t + u_t|Y_{t-1})] = \sigma^2 + b'V_tb.$$

Thus, the $\{e_t\}$ are a set of uncorrelated, but heteroskedastic random variables. Denoting
the vector of the y's by y and the vector of the e's by e, we have $e = Gy$, where G is
a nonrandom triangular matrix such that $Eee' = G\,E(yy')\,G'$ is diagonal. Thus, as noted
in Section 4, the Kalman filter can be viewed as an algorithm for exactly diagonalizing the
covariance matrix of y.
For ARMA models, an alternative to calculating the exact Gaussian likelihood is to
approximate the likelihood by conditioning on the first few y's and $\varepsilon$'s. After conditioning,
the remaining y's can be written as an invertible linear function of a finite number of current
and lagged innovations. Thus, approximating the likelihood by conditioning is equivalent
to finding a triangular linear transform of the data having a scalar covariance matrix and is
closely related to the linear transform employed by the Kalman filter. More precisely, suppose
one used as the initial variance matrix $V_1$ not the stationary variance given in equation (5),
but instead some variance satisfying

$$V_1 = AV_1A' + \Omega - \frac{AV_1bb'V_1A'}{b'V_1b + \sigma^2}.$$

Then the iteration scheme (4) produces a constant matrix $V_t$ and the term $b'V_tb + \sigma^2$ appearing
in the likelihood (6) does not depend on t. If that term does not depend on the unknown
ARMA coefficients either, the Gaussian maximum likelihood estimator minimizes the sum of
squared innovations $\sum_t (y_t - b'a_t)^2$. Thus, using the Kalman filter after setting initial conditions
to produce a constant $V_t$ matrix is equivalent to conditioning on initial values and computing
nonlinear least squares estimates.
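One way to obtain such a $V_1$ in practice (my illustration, under the assumption that the variance recursion in (4) converges) is simply to iterate that recursion to its fixed point:

import numpy as np

def steady_state_variance(A, b, sigma2, Omega, tol=1e-12, max_iter=10000):
    """Iterate the V-recursion in (4) to its fixed point, the constant V_t described above."""
    V = Omega.copy()
    for _ in range(max_iter):
        f = b @ V @ b + sigma2
        V_new = A @ V @ A.T + Omega - np.outer(A @ V @ b, b @ V @ A.T) / f
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

Starting the filter at this $V_1$ makes $b'V_tb + \sigma^2$ constant over t, so maximizing (6) reduces to minimizing the sum of squared innovations, as stated above.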
There is still one more statistical procedure that involves a linear transformation approximately
diagonalizing the covariance matrix of y. If the y's are a stationary stochastic process,
the $T \times T$ Fourier matrix F, with elements $f_{kt} = e^{2\pi ikt/T}$, not depending on any unknown parameters,
approximately diagonalizes any stationary covariance matrix. The variable $z = Fy$
is the Fourier transform of y and is the starting point for spectral analysis of time series data.
Whereas the variances of the e's are interpreted as forecast error variances (and are constant
under the conditioning approach), the variances of the z's (often called the spectrum) are
measures of the relative importance of the various cyclical components of the time series.
Although spectral (or frequency domain) analysis can be viewed as a computational device
for simplifying the calculation of the parametric Gaussian likelihood function, it is more
commonly viewed as a nonparametric approach to studying time series data. It is usually
used when the sample size is very large and little structure is imposed except stationarity.
Indeed, studying the spectrum using smoothed periodogram values is essentially equivalent
to studying the autocorrelation function without assuming a parametric model. In contrast,
state-space models (e.g., ARMA) impose considerable structure and typically have only a
small number of unknown parameters. In addition, stationarity is not necessary. Perhaps
because data are so limited and stationarity often implausible, economists seem to prefer
the state-space approach to modelling. The Kalman filter is then available as a convenient
computational tool.

7 Nonlinear State-Space Models


If we drop the assumption that $u_t$ and $v_t$ are normal, best one-step-ahead predictors are
no longer linear in the y's. Maximizing the normal likelihood using the linear Kalman filter
yields consistent estimates, but at the cost of some efficiency loss. Exact maximum likelihood
using a nonlinear filter is computationally feasible in low-dimensional problems even if the
$\{\alpha_t\}$ process is not autoregressive, as long as it is Markovian; that is, as long as the conditional
density of $\alpha_t$ given all past $\alpha$'s depends only on $\alpha_{t-1}$.
Consider the state-space model with measurement equation

$$y_t = b'\alpha_t + u_t$$

where the $u_t$ are i.i.d. with marginal density function $f(\cdot)$. The p-dimensional state vectors
$\{\alpha_t\}$ are a Markov process, independent of the process $\{u_t\}$, with joint conditional density

$$\Pr[x \le \alpha_t \le x + dx \,|\, \text{all past } \alpha\text{'s}] = h(x|\alpha_{t-1})\,dx.$$

Again, let $Y_t$ denote the vector $(y_1, \ldots, y_t)$. The independence and Markovian assumptions
imply that the conditional density of $y_t$, given $Y_{t-1}$ and $\alpha_t$, is given by $f(y_t - b'\alpha_t)$ and that
the conditional density of $\alpha_t$, given $Y_{t-1}$ and $\alpha_{t-1}$, is $h(\alpha_t|\alpha_{t-1})$; that is, they do not depend
on past y's.

The likelihood function is the product of the conditional densities $p(y_t|Y_{t-1})$ for $t = 1, \ldots, T$.
If $g(\alpha_t|Y_{t-1})$ is the conditional density of $\alpha_t$ given $Y_{t-1}$, we have

$$p(y_t|Y_{t-1}) = \int f(y_t - b'\alpha_t)\,g(\alpha_t|Y_{t-1})\,d\alpha_t. \tag{10}$$

Using Bayes' rule for manipulating conditional probabilities, we find

$$g(\alpha_t|Y_{t-1}) = \int h(\alpha_t|\alpha_{t-1})\,g(\alpha_{t-1}|Y_{t-1})\,d\alpha_{t-1} = \int h(\alpha_t|\alpha_{t-1})\,g(\alpha_{t-1}|y_{t-1}, Y_{t-2})\,d\alpha_{t-1}$$
$$= \int h(\alpha_t|\alpha_{t-1})\,\frac{f(y_{t-1} - b'\alpha_{t-1})\,g(\alpha_{t-1}|Y_{t-2})}{\int f(y_{t-1} - b'\alpha_{t-1})\,g(\alpha_{t-1}|Y_{t-2})\,d\alpha_{t-1}}\,d\alpha_{t-1}. \tag{11}$$

If f and h are known functions and we have an initial density $g(\alpha_1)$, equation (11) is a
recursive relation defining g for period t in terms of its value in period t-1. If f and h are
normal densities, the integrals are easily evaluated and we find the usual Kalman updating
formula. Otherwise, numerical integration usually is required.
If $\alpha_t$ takes on only a finite number of discrete values, g is a mass function and the integration
is replaced by summation. The calculations then simplify. Suppose $\alpha_t$ is a scalar
random variable taking on K different values $r_1, \ldots, r_K$. Let $g_t$ be the K-dimensional vector
whose k'th element is $g(r_k|Y_{t-1}) \equiv \Pr[\alpha_t = r_k|Y_{t-1}]$. Let $H_t$ be the $K \times K$ Markov matrix
whose ij element is $\Pr[\alpha_t = r_i|\alpha_{t-1} = r_j]$. Let $f_t$ be the K-dimensional vector whose k'th
element is $f(y_t - br_k)$ and let $z_t$ be the K-dimensional vector whose k'th element is $f_{tk}g_{tk}$.
The likelihood function is

$$\prod_{t=1}^{T} p(y_t|Y_{t-1}) = \prod_{t=1}^{T} f_t'g_t$$

where, from (11), the g's can be computed from the recursion

$$g_t = \frac{H_tz_{t-1}}{f_{t-1}'g_{t-1}}.$$
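A direct implementation of this discrete recursion (my own sketch; the function name and the return of a log-likelihood are illustrative) assumes only the objects defined above: the support $r_1, \ldots, r_K$, a time-invariant Markov matrix H, the measurement density f, and the initial probabilities $g_1$.

import numpy as np

def discrete_state_loglik(y, r, H, f_density, b, g1):
    """Accumulate sum_t log(f_t'g_t) using g_{t+1} = H z_t / (f_t'g_t); b and alpha_t are scalars."""
    g = np.asarray(g1, dtype=float)                 # g_1: Pr[alpha_1 = r_k] before seeing y_1
    loglik = 0.0
    for t in range(len(y)):
        f_t = f_density(y[t] - b * np.asarray(r))   # k'th element is f(y_t - b r_k)
        denom = f_t @ g                             # p(y_t | Y_{t-1}) = f_t'g_t
        loglik += np.log(denom)
        z = f_t * g                                 # z_t has elements f_{tk} g_{tk}
        g = H @ z / denom                           # next period's g from the recursion
    return loglik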

A simple example is Hamilton's Markov switching model. We assume

$$y_t = \beta'x_t + \delta'x_t\alpha_t + u_t$$

where $\alpha_t$ is a binary zero-one Markovian random variable such that $\Pr[\alpha_t = 1|\alpha_{t-1} = 1] = p$
and $\Pr[\alpha_t = 0|\alpha_{t-1} = 0] = q$. Thus, with probability $1 - q$ we switch from a regime where
$E[y_t] = \beta'x_t$ to a regime where $E[y_t] = (\beta + \delta)'x_t$; we switch back with probability $1 - p$.
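The discrete recursion specialized to this two-regime model might look as follows (again my own sketch, under the illustrative assumption that $u_t$ is normal with variance sigma2 and that beta, delta, p, q are given):

import numpy as np
from scipy.stats import norm

def switching_loglik(y, X, beta, delta, sigma2, p, q, g1=(0.5, 0.5)):
    """Log-likelihood of Hamilton's two-regime model via the discrete filtering recursion."""
    H = np.array([[q, 1 - p],          # column j gives Pr[alpha_t = 0 or 1 | alpha_{t-1} = j]
                  [1 - q, p]])
    g = np.asarray(g1, dtype=float)
    loglik = 0.0
    for t in range(len(y)):
        means = np.array([X[t] @ beta, X[t] @ (beta + delta)])    # regime 0 and regime 1 means
        f_t = norm.pdf(y[t], loc=means, scale=np.sqrt(sigma2))
        denom = f_t @ g
        loglik += np.log(denom)
        g = H @ (f_t * g) / denom
    return loglik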
