of X_n and X_k at unequal times n ≠ k.
In future weeks we will consider spaces of paths that depend on continu-
ous time t rather than discrete time n. The corresponding path spaces, and
probability distributions on them, are one of the main subjects of this course.
3 Multivariate normals
Most of the material in this section should be review for most of you. The
multivariate Gaussian, or normal, probability distribution is important for so
many reasons that it would be dull to list them all here. That activity might
help you later as you review for the final exam. The important takeaway is
linear algebra as a way to deal with multivariate normals.
3.1 Linear transformations of random variables
Let X ∈ R^d be a multivariate random variable. We write X ∼ u(x) to indicate
that u is the probability density of X. Let A be a square d × d matrix that
describes an invertible linear transformation of random variables X = AY,
Y = A^{-1} X. Let v(y) be the probability density of Y. The relation between u
and v is

    v(y) = |det(A)| u(Ay) .   (3)
This is equivalent to

    u(x) = (1/|det(A)|) v(A^{-1} x) .   (4)

We will prove it in the form (3) and use it in the form (4).
The determinants may be the most complicated things in the formulas (3)
and (4). They may be the least important. It is common that probability
densities are known only up to a constant. That is, we know u(x) = cf(x) with
a formula for f, but we do not know c. Even if there is a formula for c, the
formula may be more helpful without it.
(For example, the Student t-density with n degrees of freedom is

    u(x) = c ( 1 + x²/n )^{-(n+1)/2} ,

with

    c = Γ((n+1)/2) / ( √(nπ) Γ(n/2) ) ,

in terms of the Euler Gamma function Γ(a) = ∫_0^∞ t^{a-1} e^{-t} dt. The important
features of the t-distribution are easier to see without the formula for c: the fact
that it is approximately normal for large n, that it is symmetric and smooth,
and that u(x) ∼ x^{-p} for large x with exponent p = n + 1 (power law tails).)
Here is an informal way to understand the transformation rule (3) and get
the determinant prefactor in the right place. Consider a very small region, B_y,
in R^d about the point y. This region could be a small ball or box, say. Call
the volume of B_y, informally, |dy|. Under the transformation y → x = Ay, say
that B_y is transformed to a small region B_x about x. Let |dx| be the volume
of B_x. Since B_y → B_x under A (this means that the transformation A takes B_y to B_x),
the ratio of the volumes is given by the determinant:

    |dx| = |det(A)| |dy| .

This formula is exactly true even if |dy| is not small.
But when B_y is small, we have the approximate formula

    Pr(B_y) = ∫_{B_y} v(y') dy' ≈ v(y) |dy| ,

and similarly Pr(B_x) ≈ u(x) |dx| = u(Ay) |det(A)| |dy|. Since X ∈ B_x exactly
when Y ∈ B_y, these two probabilities are equal, and comparing the approximations
gives the transformation rule (3).

Matrix multiplication is linear, so if u_1, ..., u_n are scalars and B_1, ..., B_n are
matrices of compatible sizes, then

    A ( Σ_{k=1}^n u_k B_k ) C = Σ_{k=1}^n u_k A B_k C .
It even works with integrals. If B(x) is a matrix function of x ∈ R^d and u(x) is
a probability density function, then

    ∫ ( A B(x) C ) u(x) dx = A ( ∫ B(x) u(x) dx ) C .

This may be said in a more abstract way. If B is a random matrix and A and
C are fixed, not random, then

    E[ A B C ] = A E[B] C .   (6)

Matrix multiplication is associative and linear even when some of the matrices
are row vectors or column vectors. These can be treated as 1 × d and d × 1
matrices respectively.
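A quick numerical sanity check can make the linearity rule (6) concrete. The following sketch in Python/numpy is only an illustration; the matrix sizes and the sample count are arbitrary choices, not part of the notes. It averages A B C over random matrices B and compares the result with A E[B] C.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((2, 3))   # fixed, not random
    C = rng.standard_normal((4, 2))   # fixed, not random

    # Random 3 x 4 matrices B, shifted so that E[B] is not zero.
    B_samples = 1.0 + rng.standard_normal((10000, 3, 4))

    lhs = np.mean(A @ B_samples @ C, axis=0)   # sample average of A B C
    rhs = A @ np.mean(B_samples, axis=0) @ C   # A times the sample average of B times C

    print(np.max(np.abs(lhs - rhs)))           # agrees up to floating point round-off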
Of course, matrix multiplication is not commutative: AB ≠ BA in general.
The matrix transpose reverses the order of matrix multiplication: (AB)^t = B^t A^t.
Matrix inverse does the same if A and B are square matrices: (AB)^{-1} = B^{-1} A^{-1}.
If A and B are not square, it is possible that AB is invertible even
though A and B are not.
We illustrate matrix algebra in probability by finding transformation rules
for the mean and covariance of a multivariate random variable. Suppose Y ∈ R^d
is a d component random variable, and X = AY. It is not necessary here for
A to be invertible or square, as it was in subsection 3.1. The mean of Y is
the d component vector given either in matrix/vector form as μ_Y = E[Y], or in
component form as μ_{Y,j} = E[Y_j]. The expected value of X is

    μ_X = E[X] = E[AY] = A E[Y] = A μ_Y .

We may take A out of the expectation because of the linearity of matrix multi-
plication, and the fact that Y may be treated as a d × 1 matrix.
Slightly less trivial is the transformation formula for the covariance matrix.
The covariance matrix C_Y is the d × d symmetric matrix whose entries are

    C_{Y,jk} = E[ (Y_j - μ_{Y,j}) (Y_k - μ_{Y,k}) ] .

The diagonal entries of C_Y are the variances of the components of Y:

    C_{Y,jj} = E[ (Y_j - μ_{Y,j})² ] = σ²_{Y_j} .
Now consider the d × d matrix B(Y) = (Y - μ_Y)(Y - μ_Y)^t. The (j, k) entry
of B is just (Y_j - μ_{Y,j})(Y_k - μ_{Y,k}). Therefore the covariance matrix may be
expressed as

    C_Y = E[ (Y - μ_Y)(Y - μ_Y)^t ] .   (7)
The linearity formula (6), and associativity, give the transformation law for
covariances:

    C_X = E[ (X - μ_X)(X - μ_X)^t ]
        = E[ (AY - Aμ_Y)(AY - Aμ_Y)^t ]
        = E[ {A(Y - μ_Y)} {A(Y - μ_Y)}^t ]
        = E[ {A(Y - μ_Y)} {(Y - μ_Y)^t A^t} ]
        = E[ A {(Y - μ_Y)(Y - μ_Y)^t} A^t ]
        = A E[ (Y - μ_Y)(Y - μ_Y)^t ] A^t

    C_X = A C_Y A^t .   (8)

The second to the third line uses distributivity. The third to the fourth uses
the property of matrix transpose. The fourth to the fifth is distributivity again.
The fifth to the sixth is linearity.
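Here is a small numpy sketch of the covariance transformation law (8); the particular A, C_Y, and sample count are arbitrary choices made only for illustration. It draws samples of Y, forms X = AY, and compares the sample covariance of X with A C_Y A^t.

    import numpy as np

    rng = np.random.default_rng(1)

    d = 3
    M = rng.standard_normal((d, d))
    C_Y = M @ M.T + np.eye(d)           # an arbitrary SPD covariance for Y
    A = rng.standard_normal((2, d))     # A need not be square or invertible here

    # Sample Y with covariance C_Y (a multivariate normal, just for convenience).
    Y = rng.multivariate_normal(mean=np.zeros(d), cov=C_Y, size=200000)
    X = Y @ A.T                         # each row is x = A y

    C_X_sample = np.cov(X, rowvar=False)    # sample covariance of X
    C_X_exact = A @ C_Y @ A.T               # the transformation law (8)

    print(np.max(np.abs(C_X_sample - C_X_exact)))   # small Monte Carlo error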
3.3 Gaussian probability density
This subsection and the next one use the multivariate normal probability den-
sity. The aim is not to use the formula but to find ways to avoid using it. We
use the general probability density formula to prove the important facts about
Gaussians. Working with these general properties is simpler than working with
probability density formulas. These properties are
• Linear functions of Gaussians are Gaussian. If X is Gaussian and Y = AX, then Y is Gaussian.

• Uncorrelated Gaussians are independent. If X_1 and X_2 are two components of a multivariate normal and if cov(X_1, X_2) = 0, then X_1 and X_2 are independent.

• Conditioned Gaussians are Gaussian. If X_1 and X_2 are two components of a multivariate normal, then the distribution of X_1, conditioned on knowing the value X_2 = x_2, is Gaussian.
In this subsection and the next, we use the formula for the Gaussian probability
density to prove these three properties.
Let H be a d × d matrix that is SPD (symmetric and positive definite). Let
μ = (μ_1, ..., μ_d)^t ∈ R^d be a vector. If X has the probability density

    u(x) = ( √det(H) / (2π)^{d/2} ) e^{-(x-μ)^t H (x-μ)/2} ,   (9)

then we say that X is a multivariate Gaussian, or normal, with parameters μ
and H. The probability density on the right is denoted by N(μ, H^{-1}). It will
be clear soon why it is convenient to use H^{-1} instead of H. We say that X is
centered if μ = 0. In that case the density is symmetric, u(x) = u(-x).
It is usually more convenient to write the density formula as

    u(x) = c e^{-(x-μ)^t H (x-μ)/2} .

The value of the prefactor,

    c = √det(H) / (2π)^{d/2} ,

often is not important. A probability density is Gaussian if it is the exponential
of a quadratic function of x.
We give some examples before explaining (9) in general. A univariate normal
has d = 1. In that case we drop the vector and matrix notation because μ and
H are just numbers. The simplest univariate normal is the univariate standard
normal, with μ = 0 and H = 1. We often use Z to denote a standard normal,
so

    Z ∼ (1/√(2π)) e^{-z²/2} = N(0, 1) .   (10)
The cumulative distribution function, or CDF, of the univariate standard normal
is

    N(x) = Pr(Z < x) = (1/√(2π)) ∫_{-∞}^x e^{-z²/2} dz .

There is no explicit formula for N(x), but it is easy to calculate numerically. Most
numerical software packages include procedures that compute N(x) accurately.
The general univariate density may be written without matrix/vector nota-
tion:

    u(x) = √(h/(2π)) e^{-(x-μ)² h/2} .   (11)

Simple calculations (explained below) show that the mean is E[X] = μ, and the
variance is

    σ²_X = var(X) = E[ (X - μ)² ] = 1/h .
In view of this, the probability density (11) is also written

    u(x) = (1/√(2πσ²)) e^{-(x-μ)²/(2σ²)} .   (12)

If X has this density, we write X ∼ N(μ, σ²). It would have been more accurate
to write σ²_X and μ_X, but we make a habit of dropping subscripts that are easy
to guess from context.
It is useful to express a general univariate normal in terms of a standard
univariate normal. If we want X ∼ N(μ, σ²), we can take Z ∼ N(0, 1) and take

    X = μ + σZ .   (13)

It is clear that E[X] = μ. Also,

    var(X) = E[ (X - μ)² ] = σ² E[Z²] = σ² .
A calculation with probability densities (see below) shows that if (10) is the
density of Z and X is given by (13), then X has the density (12). This is handy
in calculations, such as

    Pr[X < a] = Pr[μ + σZ < a] = Pr[Z < (a - μ)/σ] = N( (a - μ)/σ ) .

This says that the probability of X < a depends on how many standard devia-
tions a is away from the mean, which is the argument of N on the right.
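Since N(x) has no elementary closed form, in practice one calls a library routine. A minimal Python sketch (the numbers μ = 1, σ = 2, a = 0 are made up purely for illustration) evaluates N through the error function and uses it for Pr[X < a] = N((a - μ)/σ).

    import math

    def std_normal_cdf(x):
        """N(x) = Pr(Z < x) for Z ~ N(0,1), written in terms of the error function."""
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    mu, sigma = 1.0, 2.0       # X ~ N(mu, sigma^2); illustrative values only
    a = 0.0

    prob = std_normal_cdf((a - mu) / sigma)   # Pr[X < a] = N((a - mu)/sigma)
    print(prob)                               # about 0.3085: a is half a standard deviation below the mean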
We move to a multivariate normal and return to matrix and vector notation.
The standard multivariate normal has μ = 0 and H = I_{d×d}. In this case,
the exponent in (9) involves just z^t H z = z^t z = z_1² + ··· + z_d², and det(H) = 1.
Therefore the N(0, I) probability density (9) is

    Z ∼ (2π)^{-d/2} e^{-z^t z/2}   (14)
      = ( 1/√(2π) )^d e^{-( z_1² + ··· + z_d² )/2}
      = ( (1/√(2π)) e^{-z_1²/2} ) ( (1/√(2π)) e^{-z_2²/2} ) ··· ( (1/√(2π)) e^{-z_d²/2} ) .   (15)
The last line writes the probability density of Z as a product of one dimen-
sional N(0, 1) densities for the components Z_1, ..., Z_d. This implies that the
components Z_j are independent univariate standard normals. The elements of
the covariance matrix are

    C_{Z,jj} = var(Z_j) = E[Z_j²] = 1 ,
    C_{Z,jk} = cov(Z_j, Z_k) = E[Z_j Z_k] = 0 if j ≠ k .   (16)
In matrix terms, this is just C_Z = I. The covariance matrix of the multivari-
ate standard normal is the identity matrix. In this case at least, uncorrelated
Gaussian components Z_j are also independent.
The themes of this section so far are the general transformation law (4), the
covariance transformation formula (8), and the Gaussian density formula (9).
We are ready to combine them to see how multivariate normals transform under
linear transformations. Suppose Y has probability density (assuming for now
that μ = 0)

    v(y) = c e^{-y^t H y/2} ,

and X = AY, with an invertible A so Y = A^{-1} X. We use (4), and calculate
the exponent

    y^t H y = (A^{-1} x)^t H (A^{-1} x) = x^t ( (A^{-1})^t H A^{-1} ) x = x^t H̃ x ,

with H̃ = (A^{-1})^t H A^{-1}. (Note that (A^{-1})^t = (A^t)^{-1}. We denote these by
A^{-t}, and write H̃ = A^{-t} H A^{-1}.) The formula for H̃ is not important here.
What is important is that u(x) = c e^{-x^t H̃ x/2}, which is Gaussian. This proves
that a linear transformation of a multivariate normal is a multivariate normal,
at least if the linear transformation is invertible.
We come to the relationship between H, the SPD matrix in (9), and C, the
covariance matrix of X. In one dimension the relation is σ² = C = h^{-1}. We
now show that for d ≥ 1 the relationship is

    C = H^{-1} .   (17)

The multivariate normal, therefore, is N(μ, H^{-1}) = N(μ, C). This is consistent
with the one variable notation N(μ, σ²). The relation (17) allows us to rewrite
the probability density (9) in its more familiar form

    u(x) = c e^{-(x-μ)^t C^{-1} (x-μ)/2} .   (18)
The prefactor is (presuming C = H^{-1})

    c = 1/( (2π)^{d/2} √det(C) ) = √det(H) / (2π)^{d/2} .   (19)
The proof of (17) uses an idea that is important for computation. A natural
multivariate version of (13) is

    X = μ + AZ ,   (20)

where Z ∼ N(0, I). We choose A so that X ∼ N(μ, C). Then we use the
transformation formula to find the density formula for X. The desired (17) will
fall out. The whole thing is an exercise in linear algebra.

The mean property is clear, so we continue to take μ = 0. The covariance
transformation formula (8), with C_Z = I in place of C_Y, implies that C_X = AA^t.
We can create a multivariate normal with covariance C if we can find an A with

    AA^t = C .   (21)

You can think of A as the square root of C, just as σ is the square root of σ²
in the one dimensional version (13).
There are different ways to find a suitable A. One is the Cholesky factor-
ization C = LL^t, where L is a lower triangular matrix. This is described in
any good linear algebra book (Strang, Lax, not Halmos). This is convenient
for computation because numerical software packages usually include a routine
that computes the Cholesky factorization.
This is the algorithm for creating X. We now find the probability density
of X using the transformation formula (4). We write c for the prefactor in any
probability density formula. The value of c can be different in different formulas.
We saw that v(z) = c e^{-z^t z/2}. With z = A^{-1} x, we get

    u(x) = c v(A^{-1} x) = c e^{-(A^{-1}x)^t (A^{-1}x)/2} .
The exponent is, as we just saw, -x^t H x/2 with

    H = A^{-t} A^{-1} .

But if we take the inverse of both sides of (21) we find C^{-1} = A^{-t} A^{-1}. This
proves (17), as the expressions for C^{-1} and H are the same.
The prefactor works out too. The covariance equation (21) implies that

    det(C) = det(A) det(A^t) = [ det(A) ]² .

Using (14) and (3) together gives

    c = 1/( (2π)^{d/2} |det(A)| ) ,

which is the prefactor formula (19).
Two important properties of the multivariate normal, for your review list, are
that multivariate normals exist for any C and are easy to generate. The covariance square root
equation (21) has a solution for any SPD matrix C. If C is a desired covariance
matrix, the mapping (20) produces a multivariate normal X with covariance C.
Standard software packages include random number generators that produce
independent univariate standard normals Z_k. If you have C and you want a
million independent random vectors X, you first compute the Cholesky
factor, L. Then a million times you use the standard normal random number
generator to produce a d component standard normal Z and do the matrix
calculation X = LZ.
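The recipe in this paragraph translates directly into a few lines of numpy. The sketch below is only an illustration; the particular C and the number of samples are arbitrary. It computes the Cholesky factor once, generates many standard normal vectors Z, maps them through X = LZ, and checks the sample covariance against C.

    import numpy as np

    rng = np.random.default_rng(2)

    C = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])      # an arbitrary SPD covariance matrix

    L = np.linalg.cholesky(C)            # lower triangular, L @ L.T == C
    n_samples = 10**6

    Z = rng.standard_normal((n_samples, 3))   # rows are independent N(0, I) vectors
    X = Z @ L.T                               # each row is x = L z, so cov(X) = L L^t = C

    print(np.allclose(L @ L.T, C))
    print(np.cov(X, rowvar=False))            # close to C, up to Monte Carlo error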
This makes the multivariate normal family different from other multivariate
families. Sampling a general multivariate random variable can be challenging for
d larger than about 5. Practitioners resort to heavy handed and slow methods
such as Markov chain Monte Carlo. Moreover, there is a modeling question
that may be hard to answer for general random variables. Suppose you want
univariate random variables X_1, ..., X_d each to have density f(x) and you want
them to be correlated. If f(x) is a univariate normal, you can make the X_k
components of a multivariate normal with desired variances and correlations.
If the X_k are not normal, the copula transformation maps the situation to a
multivariate normal. Warning: the copula transformation has been blamed for
the 2007 financial meltdown, seriously.
3.4 Conditional and marginal distributions
When you talk about conditional and marginal distributions, you have to say
which variables are fixed or known, and which are variable or unknown. We
write the multivariate random variable as (X, Y) with X ∈ R^{d_X} and Y ∈ R^{d_Y}.
The number of random variables in total is still d = d_X + d_Y. We study the
conditional distribution of X, conditioned on knowing Y = y. We also study
the marginal distribution of Y.
The math here is linear algebra with block vectors and matrices. The total
random variable is partitioned into its X and Y parts as

    ( X^t , Y^t )^t = ( X_1, ..., X_{d_X}, Y_1, ..., Y_{d_Y} )^t ∈ R^d .
The Gaussian joint distribution of X and Y has in its exponent

    ( x^t, y^t ) ( H_XX  H_XY ) ( x )
                 ( H_YX  H_YY ) ( y )  =  x^t H_XX x + 2 x^t H_XY y + y^t H_YY y .
(This relies on x^t H_XY y = y^t H_YX x, which is true because H_YX = H_XY^t, which
is true because H is symmetric.) The joint distribution of X and Y is

    u(x, y) = c e^{-( x^t H_XX x + 2 x^t H_XY y + y^t H_YY y )/2} .
If u(x, y) is any joint distribution, then the conditional density of X, condi-
tioned on Y = y, is u(x | Y = y) = u(x | y) = c(y) u(x, y). Here we think of x
as the variable and y as a parameter. The normalization constant here depends
on the parameter. For the Gaussian, we have

    u(x | Y = y) = c(y) e^{-( x^t H_XX x + 2 x^t H_XY y )/2} .   (22)

The factor e^{-y^t H_YY y/2} has been absorbed into c(y).
We see from (22) that the conditional distribution of X is the exponential
of a quadratic function of x, which is to say, Gaussian. The algebraic trick
of completing the square identifies the conditional mean. We seek to write the
exponent of (22) in the form of the exponent of (9):

    x^t H_XX x + 2 x^t H_XY y = (x - μ_X(y))^t H_XX (x - μ_X(y)) + m(y) .

The m(y) will eventually be absorbed into the y dependent prefactor. Some
algebra shows that this works provided

    2 x^t H_XY y = -2 x^t H_XX μ_X(y) .
This will hold for all x if H_XY y = -H_XX μ_X(y), which gives the formula for
the conditional mean:

    μ_X(y) = -H_XX^{-1} H_XY y .   (23)

The conditional mean μ_X(y) is in some sense the best prediction of the unknown
X given the known Y = y.
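The formula (23) can be cross-checked numerically against the covariance-block form of the conditional mean, E[X | Y = y] = C_XY C_YY^{-1} y for centered jointly Gaussian variables; this is a standard identity that is not derived in this section. The sketch below, with made-up block sizes, builds an arbitrary SPD precision matrix H, inverts it to get C, and verifies that the two expressions agree.

    import numpy as np

    rng = np.random.default_rng(3)

    d_x, d_y = 2, 3
    d = d_x + d_y
    M = rng.standard_normal((d, d))
    H = M @ M.T + d * np.eye(d)            # an arbitrary SPD precision matrix

    H_xx = H[:d_x, :d_x]
    H_xy = H[:d_x, d_x:]

    C = np.linalg.inv(H)                   # covariance matrix C = H^{-1}
    C_xy = C[:d_x, d_x:]
    C_yy = C[d_x:, d_x:]

    y = rng.standard_normal(d_y)           # an arbitrary observed value of Y

    mean_precision = -np.linalg.solve(H_xx, H_xy @ y)   # formula (23): -H_XX^{-1} H_XY y
    mean_covariance = C_xy @ np.linalg.solve(C_yy, y)   # covariance-block regression formula

    print(np.allclose(mean_precision, mean_covariance))  # True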
3.5 Generating a multivariate normal, interpreting covariance
If we have M with MM^t = C, we can think of M as a kind of square root of
C. It is possible to find a real d × d matrix M as long as C is symmetric and
positive definite. We will see two distinct ways to do this that give two different
M matrices.

The Cholesky factorization is one of these ways. The Cholesky factorization
of C is a lower triangular matrix L with LL^t = C. Lower triangular means that
all non-zero entries of L are on or below the diagonal:

    L = ( l_11    0     ···    0
          l_21   l_22   ···    0
           ...    ...   ...   ...
          l_d1   l_d2   ···   l_dd ) .
Any good linear algebra book explains the basic facts of Cholesky factorization.
These are: such an L exists as long as C is SPD. There is a unique lower triangular
L with positive diagonal entries: l_jj > 0. There is a straightforward algorithm
that calculates L from C using approximately d³/6 multiplications (and the
same number of additions).
If you want to generate X ∼ N(μ, C), you compute the Cholesky factoriza-
tion of C. Any good package of linear algebra software can do this, including
the downloadable software LAPACK for C or C++ or FORTRAN programming,
and the built-in linear algebra facilities in Python, R, and Matlab. To make an
X, you need d independent standard normals Z_1, ..., Z_d. Most packages that
generate pseudo-random numbers have a procedure to generate such standard
normals. This includes Python, R, and Matlab. To do it in C, C++, or FOR-
TRAN, you can use a uniform pseudo-random number generator and then use
the Box Muller formula to get Gaussians. You assemble the Z_j into a vector
Z = (Z_1, ..., Z_d)^t and take X = LZ + μ.
Consider as an example the two dimensional case with μ = 0. Here, we
want X_1 and X_2 that are jointly normal. It is common to specify var(X_1) = σ_1²,
var(X_2) = σ_2², and the correlation coefficient

    ρ_12 = corr(X_1, X_2) = cov(X_1, X_2)/(σ_1 σ_2) = E[X_1 X_2]/(σ_1 σ_2) .
In this case, the Cholesky factor is

    L = ( σ_1                  0
          ρ_12 σ_2     σ_2 √(1 - ρ_12²) ) .   (24)
The general formula X = LZ becomes

    X_1 = σ_1 Z_1   (25)
    X_2 = ρ_12 σ_2 Z_1 + σ_2 √(1 - ρ_12²) Z_2 .   (26)
It is easy to calculate E[X_1²] = σ_1², which is the desired value. Similarly, because
Z_1 and Z_2 are independent, we have

    var(X_2) = E[X_2²] = ρ_12² σ_2² + (1 - ρ_12²) σ_2² = σ_2² ,

which is the desired answer, too. The correlation coefficient is also correct:

    corr(X_1, X_2) = E[X_1 X_2]/(σ_1 σ_2)
                   = E[ σ_1 Z_1 ( ρ_12 σ_2 Z_1 + σ_2 √(1 - ρ_12²) Z_2 ) ]/(σ_1 σ_2)
                   = ρ_12 E[Z_1²] = ρ_12 .
You can, and should, verify by matrix multiplication that

    L L^t = ( σ_1²             ρ_12 σ_1 σ_2
              ρ_12 σ_1 σ_2     σ_2²         ) ,

which is the desired covariance matrix of (X_1, X_2)^t.
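As a numerical complement to this verification, here is a short numpy sketch with made-up values σ_1 = 2, σ_2 = 3, ρ_12 = 0.6 (chosen only for illustration) that builds L from (24) and confirms that LL^t reproduces the covariance matrix.

    import numpy as np

    sigma1, sigma2, rho = 2.0, 3.0, 0.6      # illustrative values

    L = np.array([[sigma1,       0.0                            ],
                  [rho * sigma2, sigma2 * np.sqrt(1.0 - rho**2) ]])

    C = np.array([[sigma1**2,             rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2            ]])

    print(np.allclose(L @ L.T, C))                 # True: L L^t is the desired covariance
    print(np.allclose(np.linalg.cholesky(C), L))   # and L matches numpy's Cholesky factor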
We could have turned the formulas (25) and (26) around as

    X_1 = σ_1 √(1 - ρ_12²) Z_1 + ρ_12 σ_1 Z_2
    X_2 = σ_2 Z_2 .
In this version, it looks like X_2 is primary and X_1 gets some of its value from
X_2. In (25) and (26), it looks like X_1 is primary and X_2 gets some of its value
from X_1. These two models are equally valid in the sense that they produce
the same observed (X_1, X_2) distribution. It is a good idea to keep this in mind
when interpreting regression studies involving X_1 and X_2.
4 Linear Gaussian recurrences
Linear Gaussian recurrence relations (1) illustrate the ideas in the previous
section. We now know that if V_n is a multivariate normal with mean zero, there
is a matrix B so that V_n = B Z_n, where Z_n ∼ N(0, I) is a standard multivariate
normal. Therefore, we rewrite (1) as

    X_{n+1} = A X_n + B Z_n .   (27)

Since the X_n are Gaussian, we need only describe their means and covariances.
This section shows that the means and covariances satisfy recurrence relations
derived from (1). The next section explores the distributions of paths. This
determines, for example, the joint distribution of X_n and X_m for n ≠ m. These
and more general path spaces and path distributions are important throughout
the course.
4.1 Probability distribution dynamics
As long as Z_n is independent of X_n, we can calculate recurrence relations for
μ_n = E[X_n] and C_n = cov[X_n]. For the mean, we have (you may want to glance
back to subsection 3.2)

    μ_{n+1} = E[A X_n + B Z_n]
            = A E[X_n] + B E[Z_n]
    μ_{n+1} = A μ_n .   (28)
For the covariance, it is convenient to combine (
lrz
27) and (
mur
28) into
X
n+1
n+1
= A(X
n
n
) + BZ
n
.
The covariance calculation starts with

    C_{n+1} = E[ (X_{n+1} - μ_{n+1})(X_{n+1} - μ_{n+1})^t ]
            = E[ (A(X_n - μ_n) + B Z_n)(A(X_n - μ_n) + B Z_n)^t ] .
We expand the last into a sum of four terms. Two of these are zero, one being

    E[ (A(X_n - μ_n)) (B Z_n)^t ] = 0 ,

because Z_n has mean zero and is independent of X_n. We keep the non-zero
terms:

    C_{n+1} = E[ (A(X_n - μ_n))(A(X_n - μ_n))^t ] + E[ (B Z_n)(B Z_n)^t ]
            = E[ A { (X_n - μ_n)(X_n - μ_n)^t } A^t ] + E[ B { Z_n Z_n^t } B^t ]
            = A E[ (X_n - μ_n)(X_n - μ_n)^t ] A^t + B E[ Z_n Z_n^t ] B^t
    C_{n+1} = A C_n A^t + B B^t .   (29)
The recurrence relations (28) and (29) determine the distribution of X_{n+1} in
terms of the distribution of X_n. As such, they are the first example in this class
of a forward equation.
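To see the forward equations in action, the sketch below (with an arbitrarily chosen stable A, B, and initial condition, none of them from the notes) propagates μ_n and C_n by (28) and (29) and compares the result at a fixed n with the sample mean and covariance of many simulated trajectories of (27).

    import numpy as np

    rng = np.random.default_rng(4)

    A = np.array([[0.7, 0.2],
                  [-0.1, 0.5]])          # chosen so the recurrence is stable
    B = np.array([[1.0, 0.0],
                  [0.3, 0.5]])

    mu0 = np.array([1.0, -2.0])          # arbitrary start: X_0 = mu0 exactly, so C_0 = 0
    n_steps, n_paths = 20, 100000

    # Exact propagation of the mean and covariance, equations (28) and (29).
    mu, C = mu0.copy(), np.zeros((2, 2))
    for _ in range(n_steps):
        mu = A @ mu
        C = A @ C @ A.T + B @ B.T

    # Monte Carlo: simulate many independent trajectories of X_{n+1} = A X_n + B Z_n.
    X = np.tile(mu0, (n_paths, 1))
    for _ in range(n_steps):
        Z = rng.standard_normal((n_paths, 2))
        X = X @ A.T + Z @ B.T

    print(mu, X.mean(axis=0))                            # close
    print(np.max(np.abs(C - np.cov(X, rowvar=False))))   # small Monte Carlo error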
We will see in subsection 4.2 that there are natural examples where the
dimension of the noise vector Z_n is less than d, and the noise matrix B is not
square. When that happens, we let m denote the number of components of Z_n,
which is the number of sources of noise. The noise matrix B is d × m; it has d
rows and m columns. The case m > d is not important for applications. The
matrices in (29) all are d × d, including BB^t. If you wonder whether it might
be B^t B instead, note that B^t B is m × m, which might be the wrong size.
4.2 Higher order recurrence relations, the Markov property
It is common to consider recurrence relations with more than one lag. For
example, a k lag relation might take the form

    X_{n+1} = A_0 X_n + A_1 X_{n-1} + ··· + A_{k-1} X_{n-k+1} + B Z_n .   (30)

From the point of view of X_{n+1}, the k lagged states are X_n (one lag), up to
X_{n-k+1} (k lags). It is natural to consider models with multiple lags if X_n
represents observable aspects of a large and largely unobservable system. For
example, the components of X_n could be public financial data at time n. There
is much unavailable private financial data. The lagged values X_{n-j} might give
more insight into the complete state at time n than just X_n.
We do not need a new theory of lag k systems. State space expansion puts a
multi-lag system of the form (30) into the form of a two term recurrence relation
(27). This formulation uses expanded vectors

    X̃_n = ( X_n, X_{n-1}, ..., X_{n-k+1} )^t .
If the original states X_n have d components, then the expanded states X̃_n have
kd components. The noise vector Z_n does not need expanding because noise
vectors have no memory. All the memory in the system is contained in X̃_n. The
recurrence relations in the expanded state formulation are

    X̃_{n+1} = Ã X̃_n + B̃ Z_n .
In more detail, this is

    ( X_{n+1}   )   ( A_0   A_1   ···   A_{k-1} ) ( X_n       )   ( B )
    ( X_n       )   (  I     0    ···     0     ) ( X_{n-1}   )   ( 0 )
    (   ...     ) = (       ...   ...           ) (   ...     ) + ( ...) Z_n .   (31)
    ( X_{n-k+2} )   (  0    ···    I      0     ) ( X_{n-k+1} )   ( 0 )

The matrix Ã is the companion matrix of the recurrence relation (30).
We will see in subsection 4.3 that the stability of a recurrence relation (27)
is determined by the eigenvalues of A. For the case d = 1, you might know that
the stability of the recurrence relation (30) is determined by the roots of the
characteristic polynomial p(z) = z^k - A_0 z^{k-1} - ··· - A_{k-1}. These statements are
consistent because the roots of the characteristic polynomial are the eigenvalues
of the companion matrix.
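For the scalar case d = 1 this consistency is easy to check numerically. The sketch below, with made-up lag coefficients, compares the roots of p(z) = z^k - A_0 z^{k-1} - ··· - A_{k-1} with the eigenvalues of the companion matrix built as in (31).

    import numpy as np

    a = [0.5, -0.3, 0.1]          # illustrative scalar lag coefficients A_0, A_1, A_2 (k = 3)
    k = len(a)

    # Companion matrix: first row holds the coefficients, subdiagonal is the identity.
    A_tilde = np.zeros((k, k))
    A_tilde[0, :] = a
    A_tilde[1:, :-1] = np.eye(k - 1)

    # Characteristic polynomial p(z) = z^k - A_0 z^{k-1} - ... - A_{k-1}.
    poly = np.concatenate(([1.0], -np.asarray(a)))

    eigs = np.sort_complex(np.linalg.eigvals(A_tilde))
    roots = np.sort_complex(np.roots(poly))
    print(np.allclose(eigs, roots))       # True: the same set of numbers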
If X_n satisfies a k lag recurrence (30), then the covariance matrix, C̃_n =
cov(X̃_n), satisfies C̃_{n+1} = Ã C̃_n Ã^t + B̃ B̃^t. The simplest way to find the d × d
covariance matrix C_n is to find the kd × kd covariance matrix C̃_n and look at
the top left d × d block.
The Markov property will be important throughout the course. If the X_n
satisfy the one lag recurrence relation (27), then they have the Markov property.
In this case the X_n form a Markov chain. If they satisfy the k lag recurrence
relation with k > 1 (in a non-trivial way) then the stochastic process X_n does not
have the Markov property. The informal definition is as follows. The process
has the Markov property if X_n is all the information about the past that is
relevant for predicting the future. Said more formally, the distribution of X_{n+1}
conditional on X_n, ..., X_0 is the same as the distribution of X_{n+1} conditional
on X_n alone.
If a random process does not have the Markov property, you can blame that
on the state space being too small, so that X_n does not have as much information
about the state of the system as it should. In many such cases, a version of state
space expansion can create a more complete collection of information at time n.
Genuine state space expansion, with k > 1, always gives a noise matrix B̃
with fewer sources of noise than state variables. The number of state variables
is kd and the number of noise variables is m ≤ d.
4.3 Large time behavior and stability
Large time behavior is the behavior of X_n as n → ∞. The stochastic process
(27) is stable if it settles into a stochastic steady state for large n. The states X_n
can not have a limit, because of the constant influence of random noise. But the
probability distributions, u_n(x), with X_n ∼ u_n(x), can have limits. The limit
u(x) = lim_{n→∞} u_n(x) is a statistical steady state. The finite time distributions
u_n are Gaussian: u_n = N(μ_n, C_n), with μ_n and C_n satisfying the recurrences
(28) and (29). The limiting distribution depends on the following limits:

    μ = lim_{n→∞} μ_n   (32)
    C = lim_{n→∞} C_n   (33)

If these limits exist, then u = N(μ, C). (Some readers will worry that this
statement is not proven with mathematical rigor. It can be, but we are avoiding
that kind of technical discussion.)
In the following discussion we first ignore several subtleties in linear algebra
for the sake of simplicity. Conclusions are correct as initially stated if m = d, B
is non-singular, and there are no Jordan blocks in the eigenvalue decomposition
of A. We will then re-examine the reasoning to figure out what can happen in
exceptional degenerate cases.
The limit (32) depends on the eigenvalues of A. Denote the eigenvalues
by λ_j and the corresponding right eigenvectors by r_j, so that A r_j = λ_j r_j for
j = 1, ..., d. The eigenvalues and eigenvectors do not have to be real even
when A is real. The eigenvectors form a basis of C^d, so the means μ_n have
unique representations μ_n = Σ_{j=1}^d m_{n,j} r_j. The dynamics (28) implies that
m_{n+1,j} = λ_j m_{n,j}. This implies that

    m_{n,j} = λ_j^n m_{0,j} .   (34)
The matrix A is strongly stable if |λ_j| < 1 for j = 1, ..., d. In this case
m_{n,j} → 0 as n → ∞ for each j. In fact, the convergence is exponential. We
see that if A is strongly stable, then μ_n → 0 as n → ∞ independent of the
initial mean μ_0. The opposite case is that |λ_j| > 1 for some j. Such an A is
strongly unstable. It usually happens that |μ_n| → ∞ as n → ∞ for a strongly
unstable A. The limiting distribution u does not exist for strongly unstable A.
The borderline case is |λ_j| ≤ 1 for all j and there is at least one j with |λ_j| = 1.
This may be called either weakly stable or weakly unstable.
If A is strongly stable, then the limit (33) exists. We do not expect C_n → 0
because the uncertainty in X_n is continually replenished by noise. We start with
a direct but possibly unsatisfying proof. A second and more complicated proof
follows. The first proof just uses the fact that if A is strongly stable, then

    ‖A^n‖ ≤ c a^n ,   (35)

for some constant c and positive a < 1. The value of c depends on the matrix
norm and is not important for the proof.
We prove that the limit (33) exists by writing C as a convergent infinite
sum. To simplify notation, write R for BB^t. Suppose C_0 is given; then (29)
gives C_1 = A C_0 A^t + R. Using (29) again gives

    C_2 = A C_1 A^t + R
        = A ( A C_0 A^t + R ) A^t + R
        = A² C_0 (A^t)² + A R A^t + R
        = A² C_0 (A²)^t + A R A^t + R .
We can continue in this way to see (by induction) that

    C_n = A^n C_0 (A^n)^t + A^{n-1} R (A^{n-1})^t + ··· + R .

This is written more succinctly as

    C_n = A^n C_0 (A^n)^t + Σ_{k=0}^{n-1} A^k R (A^k)^t .   (36)
The limit of the C_n exists because the first term on the right goes to zero as
n → ∞ and the second term converges to the infinite sum

    C = Σ_{k=0}^∞ A^k R (A^k)^t .   (37)
For the first term, note that (35) and properties of matrix norms imply that

    ‖ A^n C_0 (A^n)^t ‖ ≤ (c a^n) ‖C_0‖ (c a^n) = c a^{2n} ‖C_0‖ .

(Part of this expression is similar to the design on Courant Institute tee shirts.)
We write c instead of c² at the end because c is a generic constant whose value
does not matter. The right side goes to zero as n → ∞ because a < 1. For the
second term, recall that an infinite sum is the limit of its partial sums if the
infinite sum converges absolutely. Absolute convergence is the convergence of
the sum of the absolute values, or the norms in case of vectors and matrices.
Here the sum of norms is

    Σ_{k=0}^∞ ‖ A^k R (A^k)^t ‖ .
Properties of norms bound this by a geometric series:

    ‖ A^k R (A^k)^t ‖ ≤ c a^{2k} ‖R‖ .
You can find C without summing the infinite series (37). Since the limit
(33) exists, you can take the limit on both sides of (29), which gives

    C - A C A^t = B B^t .   (38)

Subsection 4.4 explains that this is a system of linear equations for the entries
of C. The system is solvable and the solution is positive definite if A is strongly
stable. As a warning, (38) is solvable in most cases even when A is strongly
unstable. But in those cases the C you get is not positive definite and therefore
is not the covariance matrix of anything. The dynamical equation (29) and the
steady state equation (38) are examples of Liapounov equations.
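The two routes to C are easy to compare numerically. The sketch below, with an arbitrarily chosen strongly stable A and noise matrix B (not taken from the notes), iterates the dynamical equation (29) until it stops changing and then checks that the limit satisfies the steady state Liapounov equation (38).

    import numpy as np

    A = np.array([[0.6, 0.2],
                  [0.1, 0.4]])          # spectral radius < 1, so A is strongly stable
    B = np.array([[1.0, 0.0],
                  [0.5, 0.3]])
    R = B @ B.T

    print(np.max(np.abs(np.linalg.eigvals(A))))   # confirm the spectral radius is below 1

    # Iterate C_{n+1} = A C_n A^t + B B^t from C_0 = 0 until the updates stop changing.
    C = np.zeros((2, 2))
    for _ in range(2000):
        C_next = A @ C @ A.T + R
        if np.max(np.abs(C_next - C)) < 1e-14:
            break
        C = C_next

    # The limit satisfies the steady state Liapounov equation (38): C - A C A^t = B B^t.
    print(np.max(np.abs(C - A @ C @ A.T - R)))    # essentially round-off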
Here are the conclusions: if A is strongly stable, then u_n, the distribution of
X_n, has u_n → u as n → ∞, with a Gaussian limit u = N(0, C), and C is given
by (37), or by solving (38). If A is not strongly stable, then it is unlikely that
the u_n have a limit as n → ∞. It is not altogether impossible in degenerate
situations described below. If A is strongly unstable, then it is most likely that
|μ_n| → ∞ as n → ∞. If A is weakly unstable, then probably C_n → ∞ as
n → ∞ because the sum (37) diverges.
4.4 Linear algebra and the limiting covariance
This subsection is a little esoteric. It is (to the author) interesting mathematics
that is not strictly necessary to understand the material for this week. Here we
find eigenvalues and eigen-matrices for the recurrence relation (29). These are
related to the eigenvalues and eigenvectors of A.
The covariance recurrence relation (29) has the same stability/instability di-
chotomy. We explain this by reformulating it as more standard linear algebra.
Consider first the part that does not involve B, which is

    C_{n+1} = A C_n A^t .   (39)
Here, the entries of C_{n+1} are linear functions of the entries of C_n. We describe
this more explicitly by collecting all the distinct entries of C_n into a vector c_n.
There are D = (d + 1)d/2 entries in c_n because the elements of C_n below the
diagonal are equal to the entries above. For example, for d = 3 there are D = 6
distinct entries in C_n, which are C_{n,11}, C_{n,12}, C_{n,13}, C_{n,22}, C_{n,23}, and C_{n,33},
which makes c_n = (C_{n,11}, C_{n,12}, C_{n,13}, C_{n,22}, C_{n,23}, C_{n,33})^t ∈ R^D (= R^6). There
is a D × D matrix, L, so that c_{n+1} = L c_n. In the case d = 2 and

    A = ( α  β
          γ  δ ) ,

the C_n recurrence relation, or dynamical Liapounov equation without BB^t, (29)
is

    ( C_{n+1,11}  C_{n+1,12} )   ( α  β ) ( C_{n,11}  C_{n,12} ) ( α  γ )
    ( C_{n+1,12}  C_{n+1,22} ) = ( γ  δ ) ( C_{n,12}  C_{n,22} ) ( β  δ ) .

This is equivalent to D = 3 and

    ( C_{n+1,11} )   ( α²     2αβ        β²  ) ( C_{n,11} )
    ( C_{n+1,12} ) = ( αγ    αδ + βγ     βδ  ) ( C_{n,12} ) .
    ( C_{n+1,22} )   ( γ²     2γδ        δ²  ) ( C_{n,22} )

And that identifies L as

    L = ( α²     2αβ        β²
          αγ    αδ + βγ     βδ
          γ²     2γδ        δ² ) .
This formulation is not so useful for practical calculations. Its only purpose is
to show that (39) is related to a D × D matrix L.
The limiting behavior of C_n depends on the eigenvalues of L. It turns out
that these are determined by the eigenvalues of A in a simple way. For each pair
(j, k) there is an eigenvalue of L, which we call λ_{jk}, that is equal to λ_j λ_k. To
understand this, note that an eigenvector, s, of L, with Ls = λs, corresponds
to a symmetric d × d eigen-matrix, S, with

    A S A^t = λ S .
It happens that S_{jk} = r_j r_k^t + r_k r_j^t is the eigen-matrix corresponding to eigenvalue
λ_{jk} = λ_j λ_k. (To be clear, S_{jk} is a d × d matrix, not the (j, k) entry of a matrix
S.) For one thing, it is symmetric (S_{jk}^t = S_{jk}). For another thing:
    A S_{jk} A^t = A ( r_j r_k^t + r_k r_j^t ) A^t
                 = A ( r_j r_k^t ) A^t + A ( r_k r_j^t ) A^t
                 = (A r_j)(A r_k)^t + (A r_k)(A r_j)^t
                 = (λ_j r_j)(λ_k r_k)^t + (λ_k r_k)(λ_j r_j)^t
                 = λ_j λ_k ( r_j r_k^t + r_k r_j^t )
                 = λ_{jk} S_{jk} .
A counting argument shows that all the eigenvalues and eigen-matrices of L
take the form of S_{jk} for some j ≥ k. The number of such pairs is the same D,
which is the number of independent entries in a general symmetric matrix. We
do not count S_{jk} with j < k because S_{jk} = S_{kj} with k > j.
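One way to check the λ_j λ_k claim numerically, without building the reduced D × D matrix L explicitly, is to use the full d² × d² matrix of the map C → A C A^t acting on vec(C), which is the Kronecker product A ⊗ A (a standard linear algebra identity, not used elsewhere in these notes). Its eigenvalues are all the products λ_j λ_k, and the D of them with j ≥ k are the eigenvalues of L. A small sketch with an arbitrary A:

    import numpy as np

    rng = np.random.default_rng(5)

    d = 3
    A = rng.standard_normal((d, d))

    lam = np.linalg.eigvals(A)

    # The map C -> A C A^t, acting on vec(C), is the Kronecker product A ⊗ A,
    # so its d^2 eigenvalues are all products lam[j] * lam[k].
    eig_map = np.sort_complex(np.linalg.eigvals(np.kron(A, A)))
    products = np.sort_complex(np.array([lj * lk for lj in lam for lk in lam]))

    print(np.max(np.abs(eig_map - products)))   # near machine precision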
Now suppose A is strongly stable. Then the Liapounov dynamical equation
(29) is equivalent to

    c_{n+1} = L c_n + r .

Since all the eigenvalues of L are less than one in magnitude, a little reasoning
with linear algebra shows that c_n → c as n → ∞, and that c - Lc = (I - L)c = r.
The matrix I - L is invertible because L has no eigenvalues equal to 1. This
is a different proof that the steady state Liapounov equation (38) has a unique
solution. It is likely that L has no eigenvalue equal to 1 even if A is not strongly
stable. In this case (38) has a solution, which is a symmetric matrix C. But
there is no guarantee that this C is positive definite, so it does not represent a
covariance matrix.
4.5 Degenerate cases
The simple conclusions of subsections 4.3 and 4.4 do not hold in every case.
The reasoning there assumed things about the matrices A and B that you might
think are true in almost every interesting case. But it is important to understand
how things might be more complicated in borderline and degenerate cases. For one
thing, many important special cases are such borderline cases. Many more
systems have behavior that is strongly influenced by near degeneracy. A
process that is weakly but not strongly unstable is the simple Gaussian random
walk, which is a model of Brownian motion. A covariance that is nearly singular
is the covariance matrix of asset returns of the S&P 500 stocks. This is a matrix
of rank 500 that is pretty well approximated for many purposes by a matrix of
rank 10.
4.5.1 Rank of B
The matrix B need not be square or have rank d.
5 Paths and path space
There are questions about the process (1) that depend on X_n for more than one n. For ex-
ample, what is Pr(X_n ≥ 1 for 1 ≤ n ≤ 10)? The probability distribution on
path space answers such questions. For linear Gaussian processes, the distri-
bution in path space is Gaussian. This is not surprising. This subsection goes
through the elementary mechanics of Gaussian path space. We also describe
more general path space terminology that carries over to other kinds of
Markov processes.
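Path-space probabilities like this rarely have closed forms, but they are easy to estimate by simulating many paths. A minimal sketch for a scalar example (the recurrence X_{n+1} = 0.8 X_n + Z_n with X_0 = 2 is invented purely for illustration) estimates Pr(X_n ≥ 1 for 1 ≤ n ≤ 10) by Monte Carlo.

    import numpy as np

    rng = np.random.default_rng(6)

    a, x0 = 0.8, 2.0              # illustrative scalar recurrence X_{n+1} = a X_n + Z_n
    n_steps, n_paths = 10, 200000

    X = np.full(n_paths, x0)
    stayed_above = np.ones(n_paths, dtype=bool)   # has the path satisfied X_n >= 1 so far?

    for n in range(1, n_steps + 1):
        X = a * X + rng.standard_normal(n_paths)  # advance every path one step
        stayed_above &= (X >= 1.0)                # the event depends on the whole path

    print(stayed_above.mean())    # Monte Carlo estimate of Pr(X_n >= 1 for 1 <= n <= 10)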
Two relevant probability spaces are the state space and the path space. We
let S denote the state space. This is the set of all possible values of the state
at time n. This week, the state is a d component vector and S = R^d. The path
space is called Ω. This week, Ω is sequences of states with a given starting and
ending time. That is, X_{[n1:n2]} is a sequence (X_{n1}, X_{n1+1}, ..., X_{n2}). There
are n_2 - n_1 + 1 states in the sequence, so Ω = R^{(n2-n1+1)d}. Even if the state
space is not R^d, still a path is a sequence of states. We express this by writing
Ω = S^{n2-n1+1}. The path space depends on n_1 and n_2 (only the difference,
really), but we leave that out of the notation because it is usually clear from
the discussion.
6 Exercises
1. This exercise works through conditional distributions of multivariate nor-
mals in a sequence of steps. The themes (for the list of facts about Gaus-
sians) are the role of linear algebra and the relation to linear regression.
Suppose X and Y have d_X and d_Y components respectively. Let u(x, y)
be the joint density. Then the conditional distribution of Y conditional
on X = x is u( y | X = x) = c(x) u(x, y). This says that the conditional
distribution is the same, up to a normalization constant, as the joint dis-
tribution once you fix the variable whose value is known (x in this case).
The normalization constant is determined by the requirement that the
conditional distribution has total probability equal to 1:

    c(x) = 1 / ∫ u(x, y) dy .

For Gaussian random variables, finding c(x) is usually both easy and unnec-
essary.
(a) This part works out the simplest case. Take d = 2, and X =
(X_1, X_2)^t. Suppose X ∼ N(0, H^{-1}). Fix the value of X_1 = x_1 and
calculate the distribution of the one dimensional random variable X_2.
If H is

    H = ( h_11  h_12
          h_12  h_22 ) ,
then the joint density is

    u(x_1, x_2) = c exp( -( h_11 x_1² + 2 h_12 x_1 x_2 + h_22 x_2² )/2 ) .
The conditional density looks almost the same:

    u( x_2 | x_1 ) = c(x_1) exp( -( 2 h_12 x_1 x_2 + h_22 x_2² )/2 ) .
Why is it allowed to leave the term h_11 x_1² out of the exponent? Com-
plete the square to write this in the form

    u( x_2 | x_1 ) = c(x_1) exp( -( x_2 - μ(x_1) )² / (2 σ_2²) ) .

Find formulas for the conditional mean, μ(x_1), and the conditional
variance, σ_2².