of X_n and X_k at unequal times n ≠ k.
In future weeks we will consider spaces of paths that depend on continu-
ous time t rather than discrete time n. The corresponding path spaces, and
probability distributions on them, are one of the main subjects of this course.
3 Multivariate normals
Most of the material in this section should be review for most of you. The
multivariate Gaussian, or normal, probability distribution is important for so
many reasons that it would be dull to list them all here. That activity might
help you later as you review for the final exam. The important takeaway is
linear algebra as a way to deal with multivariate normals.
3.1 Linear transformations of random variables
Let X ∈ R^d be a multivariate random variable. We write X ∼ u(x) to indicate
that u is the probability density of X. Let A be a square d × d matrix that
describes an invertible linear transformation of random variables X = AY,
Y = A^{-1} X. Let v(y) be the probability density of Y. The relation between u
and v is

    v(y) = |det(A)| u(Ay) .   (3)
This is equivalent to

    u(x) = (1/|det(A)|) v(A^{-1} x) .   (4)

We will prove it in the form (3) and use it in the form (4).
The determinants may be the most complicated things in the formulas (3)
and (4). They may be the least important. It is common that probability
densities are known only up to a constant. That is, we know u(x) = cf(x) with
a formula for f, but we do not know c. Even if there is a formula for c, the
formula may be more helpful without it.
(For example, the Student t-density with n degrees of freedom is

    u(x) = c ( 1 + x²/n )^{-(n+1)/2} ,

with

    c = Γ((n+1)/2) / ( √(nπ) Γ(n/2) ) ,

in terms of the Euler Gamma function Γ(a) = ∫_0^∞ t^{a-1} e^{-t} dt. The important
features of the t-distribution are easier to see without the formula for c: the fact
that it is approximately normal for large n, that it is symmetric and smooth,
and that u(x) ∼ x^{-p} for large x with exponent p = n + 1 (power law tails).)
Here is an informal way to understand the transformation rule (3) and get
the determinant prefactor in the right place. Consider a very small region, B_y,
in R^d about the point y. This region could be a small ball or box, say. Call
the volume of B_y, informally, |dy|. Under the transformation y → x = Ay, say
that B_y is transformed to a small region B_x about x. Let |dx| be the volume
of B_x. Since B_y → B_x under A (this means that the transformation A takes B_y to B_x),
the ratio of the volumes is given by the determinant:

    |dx| = |det(A)| |dy| .

This formula is exactly true even if |dy| is not small.
But when B_y is small, we have the approximate formula

    Pr(B_y) = ∫_{B_y} v(y') dy' ≈ v(y) |dy| ,

and similarly Pr(B_x) ≈ u(x) |dx| = u(Ay) |det(A)| |dy|. Since X ∈ B_x exactly
when Y ∈ B_y, these two probabilities are equal, and comparing the approximations
gives the transformation rule (3).

Matrix multiplication is linear, so if u_1, ..., u_n are scalars and B_1, ..., B_n are
matrices of compatible sizes, then

    A ( Σ_{k=1}^n u_k B_k ) C = Σ_{k=1}^n u_k A B_k C .
It even works with integrals. If B(x) is a matrix function of x ∈ R^d and u(x) is
a probability density function, then

    ∫ ( A B(x) C ) u(x) dx = A ( ∫ B(x) u(x) dx ) C .

This may be said in a more abstract way. If B is a random matrix and A and
C are fixed, not random, then

    E[ A B C ] = A E[B] C .   (6)

Matrix multiplication is associative and linear even when some of the matrices
are row vectors or column vectors. These can be treated as 1 × d and d × 1
matrices respectively.
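A quick numerical sanity check can make the linearity rule (6) concrete. The following sketch in Python/numpy is only an illustration; the matrix sizes and the sample count are arbitrary choices, not part of the notes. It averages A B C over random matrices B and compares the result with A E[B] C.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((2, 3))   # fixed, not random
    C = rng.standard_normal((4, 2))   # fixed, not random

    # Random 3 x 4 matrices B, shifted so that E[B] is not zero.
    B_samples = 1.0 + rng.standard_normal((10000, 3, 4))

    lhs = np.mean(A @ B_samples @ C, axis=0)   # sample average of A B C
    rhs = A @ np.mean(B_samples, axis=0) @ C   # A times the sample average of B times C

    print(np.max(np.abs(lhs - rhs)))           # agrees up to floating point round-off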
Of course, matrix multiplication is not commutative: AB ≠ BA in general.
The matrix transpose reverses the order of matrix multiplication: (AB)^t = B^t A^t.
Matrix inverse does the same if A and B are square matrices: (AB)^{-1} = B^{-1} A^{-1}.
If A and B are not square, it is possible that AB is invertible even
though A and B are not.
We illustrate matrix algebra in probability by finding transformation rules
for the mean and covariance of a multivariate random variable. Suppose Y ∈ R^d
is a d component random variable, and X = AY. It is not necessary here for
A to be invertible or square, as it was in subsection 3.1. The mean of Y is
the d component vector given either in matrix/vector form as μ_Y = E[Y], or in
component form as μ_{Y,j} = E[Y_j]. The expected value of X is

    μ_X = E[X] = E[AY] = A E[Y] = A μ_Y .

We may take A out of the expectation because of the linearity of matrix multi-
plication, and the fact that Y may be treated as a d × 1 matrix.
Slightly less trivial is the transformation formula for the covariance matrix.
The covariance matrix C_Y is the d × d symmetric matrix whose entries are

    C_{Y,jk} = E[ (Y_j - μ_{Y,j}) (Y_k - μ_{Y,k}) ] .

The diagonal entries of C_Y are the variances of the components of Y:

    C_{Y,jj} = E[ (Y_j - μ_{Y,j})² ] = σ²_{Y_j} .
Now consider the d × d matrix B(Y) = (Y - μ_Y)(Y - μ_Y)^t. The (j, k) entry
of B is just (Y_j - μ_{Y,j})(Y_k - μ_{Y,k}). Therefore the covariance matrix may be
expressed as

    C_Y = E[ (Y - μ_Y)(Y - μ_Y)^t ] .   (7)
The linearity formula (6), and associativity, give the transformation law for
covariances:

    C_X = E[ (X - μ_X)(X - μ_X)^t ]
        = E[ (AY - Aμ_Y)(AY - Aμ_Y)^t ]
        = E[ {A(Y - μ_Y)} {A(Y - μ_Y)}^t ]
        = E[ {A(Y - μ_Y)} {(Y - μ_Y)^t A^t} ]
        = E[ A {(Y - μ_Y)(Y - μ_Y)^t} A^t ]
        = A E[ (Y - μ_Y)(Y - μ_Y)^t ] A^t

    C_X = A C_Y A^t .   (8)

The second to the third line uses distributivity. The third to the fourth uses
the property of matrix transpose. The fourth to the fifth is distributivity again.
The fifth to the sixth is linearity.
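Here is a small numpy sketch of the covariance transformation law (8); the particular A, C_Y, and sample count are arbitrary choices made only for illustration. It draws samples of Y, forms X = AY, and compares the sample covariance of X with A C_Y A^t.

    import numpy as np

    rng = np.random.default_rng(1)

    d = 3
    M = rng.standard_normal((d, d))
    C_Y = M @ M.T + np.eye(d)           # an arbitrary SPD covariance for Y
    A = rng.standard_normal((2, d))     # A need not be square or invertible here

    # Sample Y with covariance C_Y (a multivariate normal, just for convenience).
    Y = rng.multivariate_normal(mean=np.zeros(d), cov=C_Y, size=200000)
    X = Y @ A.T                         # each row is x = A y

    C_X_sample = np.cov(X, rowvar=False)    # sample covariance of X
    C_X_exact = A @ C_Y @ A.T               # the transformation law (8)

    print(np.max(np.abs(C_X_sample - C_X_exact)))   # small Monte Carlo error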
3.3 Gaussian probability density
This subsection and the next one use the multivariate normal probability den-
sity. The aim is not to use the formula but to find ways to avoid using it. We
use the general probability density formula to prove the important facts about
Gaussians. Working with these general properties is simpler than working with
probability density formulas. These properties are
• Linear functions of Gaussians are Gaussian. If X is Gaussian and Y = AX, then Y is Gaussian.

• Uncorrelated Gaussians are independent. If X_1 and X_2 are two components of a multivariate normal and if cov(X_1, X_2) = 0, then X_1 and X_2 are independent.

• Conditioned Gaussians are Gaussian. If X_1 and X_2 are two components of a multivariate normal, then the distribution of X_1, conditioned on knowing the value X_2 = x_2, is Gaussian.
In this subsection and the next, we use the formula for the Gaussian probability
density to prove these three properties.
Let H be a d × d matrix that is SPD (symmetric and positive definite). Let
μ = (μ_1, ..., μ_d)^t ∈ R^d be a vector. If X has the probability density

    u(x) = ( √det(H) / (2π)^{d/2} ) e^{-(x-μ)^t H (x-μ)/2} ,   (9)

then we say that X is a multivariate Gaussian, or normal, with parameters μ
and H. The probability density on the right is denoted by N(μ, H^{-1}). It will
be clear soon why it is convenient to use H^{-1} instead of H. We say that X is
centered if μ = 0. In that case the density is symmetric, u(x) = u(-x).
It is usually more convenient to write the density formula as

    u(x) = c e^{-(x-μ)^t H (x-μ)/2} .

The value of the prefactor,

    c = √det(H) / (2π)^{d/2} ,

often is not important. A probability density is Gaussian if it is the exponential
of a quadratic function of x.
We give some examples before explaining (9) in general. A univariate normal
has d = 1. In that case we drop the vector and matrix notation because μ and
H are just numbers. The simplest univariate normal is the univariate standard
normal, with μ = 0 and H = 1. We often use Z to denote a standard normal,
so

    Z ∼ (1/√(2π)) e^{-z²/2} = N(0, 1) .   (10)
The cumulative distribution function, or CDF, of the univariate standard normal
is

    N(x) = Pr(Z < x) = (1/√(2π)) ∫_{-∞}^x e^{-z²/2} dz .

There is no explicit formula for N(x), but it is easy to calculate numerically. Most
numerical software packages include procedures that compute N(x) accurately.
The general univariate density may be written without matrix/vector nota-
tion:

    u(x) = √(h/(2π)) e^{-(x-μ)² h/2} .   (11)

Simple calculations (explained below) show that the mean is E[X] = μ, and the
variance is

    σ²_X = var(X) = E[ (X - μ)² ] = 1/h .
In view of this, the probability density (11) is also written

    u(x) = (1/√(2πσ²)) e^{-(x-μ)²/(2σ²)} .   (12)

If X has this density, we write X ∼ N(μ, σ²). It would have been more accurate
to write σ²_X and μ_X, but we make a habit of dropping subscripts that are easy
to guess from context.
It is useful to express a general univariate normal in terms of a standard
univariate normal. If we want X ∼ N(μ, σ²), we can take Z ∼ N(0, 1) and take

    X = μ + σZ .   (13)

It is clear that E[X] = μ. Also,

    var(X) = E[ (X - μ)² ] = σ² E[Z²] = σ² .
A calculation with probability densities (see below) shows that if (10) is the
density of Z and X is given by (13), then X has the density (12). This is handy
in calculations, such as

    Pr[X < a] = Pr[μ + σZ < a] = Pr[Z < (a - μ)/σ] = N( (a - μ)/σ ) .

This says that the probability of X < a depends on how many standard devia-
tions a is away from the mean, which is the argument of N on the right.
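Since N(x) has no elementary closed form, in practice one calls a library routine. A minimal Python sketch (the numbers μ = 1, σ = 2, a = 0 are made up purely for illustration) evaluates N through the error function and uses it for Pr[X < a] = N((a - μ)/σ).

    import math

    def std_normal_cdf(x):
        """N(x) = Pr(Z < x) for Z ~ N(0,1), written in terms of the error function."""
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    mu, sigma = 1.0, 2.0       # X ~ N(mu, sigma^2); illustrative values only
    a = 0.0

    prob = std_normal_cdf((a - mu) / sigma)   # Pr[X < a] = N((a - mu)/sigma)
    print(prob)                               # about 0.3085: a is half a standard deviation below the mean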
We move to a multivariate normal and return to matrix and vector notation.
The standard multivariate normal has μ = 0 and H = I_{d×d}. In this case,
the exponent in (9) involves just z^t H z = z^t z = z_1² + ··· + z_d², and det(H) = 1.
Therefore the N(0, I) probability density (9) is

    Z ∼ (2π)^{-d/2} e^{-z^t z/2}   (14)
      = ( 1/√(2π) )^d e^{-( z_1² + ··· + z_d² )/2}
      = ( (1/√(2π)) e^{-z_1²/2} ) ( (1/√(2π)) e^{-z_2²/2} ) ··· ( (1/√(2π)) e^{-z_d²/2} ) .   (15)
The last line writes the probability density of Z as a product of one dimen-
sional N(0, 1) densities for the components Z_1, ..., Z_d. This implies that the
components Z_j are independent univariate standard normals. The elements of
the covariance matrix are

    C_{Z,jj} = var(Z_j) = E[Z_j²] = 1 ,
    C_{Z,jk} = cov(Z_j, Z_k) = E[Z_j Z_k] = 0 if j ≠ k .   (16)
In matrix terms, this is just C_Z = I. The covariance matrix of the multivari-
ate standard normal is the identity matrix. In this case at least, uncorrelated
Gaussian components Z_j are also independent.
The themes of this section so far are the general transformation law (4), the
covariance transformation formula (8), and the Gaussian density formula (9).
We are ready to combine them to see how multivariate normals transform under
linear transformations. Suppose Y has probability density (assuming for now
that μ = 0)

    v(y) = c e^{-y^t H y/2} ,

and X = AY, with an invertible A so Y = A^{-1} X. We use (4), and calculate
the exponent

    y^t H y = (A^{-1} x)^t H (A^{-1} x) = x^t ( (A^{-1})^t H A^{-1} ) x = x^t H̃ x ,

with H̃ = (A^{-1})^t H A^{-1}. (Note that (A^{-1})^t = (A^t)^{-1}. We denote these by
A^{-t}, and write H̃ = A^{-t} H A^{-1}.) The formula for H̃ is not important here.
What is important is that u(x) = c e^{-x^t H̃ x/2}, which is Gaussian. This proves
that a linear transformation of a multivariate normal is a multivariate normal,
at least if the linear transformation is invertible.
We come to the relationship between H, the SPD matrix in (9), and C, the
covariance matrix of X. In one dimension the relation is σ² = C = h^{-1}. We
now show that for d ≥ 1 the relationship is

    C = H^{-1} .   (17)

The multivariate normal, therefore, is N(μ, H^{-1}) = N(μ, C). This is consistent
with the one variable notation N(μ, σ²). The relation (17) allows us to rewrite
the probability density (9) in its more familiar form

    u(x) = c e^{-(x-μ)^t C^{-1} (x-μ)/2} .   (18)
The prefactor is (presuming C = H^{-1})

    c = 1/( (2π)^{d/2} √det(C) ) = √det(H) / (2π)^{d/2} .   (19)
The proof of (17) uses an idea that is important for computation. A natural
multivariate version of (13) is

    X = μ + AZ ,   (20)

where Z ∼ N(0, I). We choose A so that X ∼ N(μ, C). Then we use the
transformation formula to find the density formula for X. The desired (17) will
fall out. The whole thing is an exercise in linear algebra.

The mean property is clear, so we continue to take μ = 0. The covariance
transformation formula (8), with C_Z = I in place of C_Y, implies that C_X = AA^t.
We can create a multivariate normal with covariance C if we can find an A with

    AA^t = C .   (21)

You can think of A as the square root of C, just as σ is the square root of σ²
in the one dimensional version (13).
There are different ways to find a suitable A. One is the Cholesky factor-
ization C = LL^t, where L is a lower triangular matrix. This is described in
any good linear algebra book (Strang, Lax, not Halmos). This is convenient
for computation because numerical software packages usually include a routine
that computes the Cholesky factorization.
This is the algorithm for creating X. We now find the probability density
of X using the transformation formula (4). We write c for the prefactor in any
probability density formula. The value of c can be different in different formulas.
We saw that v(z) = c e^{-z^t z/2}. With z = A^{-1} x, we get

    u(x) = c v(A^{-1} x) = c e^{-(A^{-1}x)^t (A^{-1}x)/2} .
The exponent is, as we just saw, -x^t H x/2 with

    H = A^{-t} A^{-1} .

But if we take the inverse of both sides of (21) we find C^{-1} = A^{-t} A^{-1}. This
proves (17), as the expressions for C^{-1} and H are the same.
The prefactor works out too. The covariance equation (21) implies that

    det(C) = det(A) det(A^t) = [ det(A) ]² .

Using (14) and (3) together gives

    c = 1/( (2π)^{d/2} |det(A)| ) ,

which is the prefactor formula (19).
Two important properties of the multivariate normal, for your review list, are
that multivariate normals exist for any C and are easy to generate. The covariance square root
equation (21) has a solution for any SPD matrix C. If C is a desired covariance
matrix, the mapping (20) produces a multivariate normal X with covariance C.
Standard software packages include random number generators that produce
independent univariate standard normals Z_k. If you have C and you want a
million independent random vectors X, you first compute the Cholesky
factor, L. Then a million times you use the standard normal random number
generator to produce a d component standard normal Z and do the matrix
calculation X = LZ.
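The recipe in this paragraph translates directly into a few lines of numpy. The sketch below is only an illustration; the particular C and the number of samples are arbitrary. It computes the Cholesky factor once, generates many standard normal vectors Z, maps them through X = LZ, and checks the sample covariance against C.

    import numpy as np

    rng = np.random.default_rng(2)

    C = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])      # an arbitrary SPD covariance matrix

    L = np.linalg.cholesky(C)            # lower triangular, L @ L.T == C
    n_samples = 10**6

    Z = rng.standard_normal((n_samples, 3))   # rows are independent N(0, I) vectors
    X = Z @ L.T                               # each row is x = L z, so cov(X) = L L^t = C

    print(np.allclose(L @ L.T, C))
    print(np.cov(X, rowvar=False))            # close to C, up to Monte Carlo error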
This makes the multivariate normal family different from other multivariate
families. Sampling a general multivariate random variable can be challenging for
d larger than about 5. Practitioners resort to heavy handed and slow methods
such as Markov chain Monte Carlo. Moreover, there is a modeling question
that may be hard to answer for general random variables. Suppose you want
univariate random variables X_1, ..., X_d each to have density f(x) and you want
them to be correlated. If f(x) is a univariate normal, you can make the X_k
components of a multivariate normal with desired variances and correlations.
If the X_k are not normal, the copula transformation maps the situation to a
multivariate normal. Warning: the copula transformation has been blamed for
the 2007 financial meltdown, seriously.
3.4 Conditional and marginal distributions
When you talk about conditional and marginal distributions, you have to say
which variables are fixed or known, and which are variable or unknown. We
write the multivariate random variable as (X, Y) with X ∈ R^{d_X} and Y ∈ R^{d_Y}.
The number of random variables in total is still d = d_X + d_Y. We study the
conditional distribution of X, conditioned on knowing Y = y. We also study
the marginal distribution of Y.
The math here is linear algebra with block vectors and matrices. The total
random variable is partitioned into its X and Y parts as

    ( X^t , Y^t )^t = ( X_1, ..., X_{d_X}, Y_1, ..., Y_{d_Y} )^t ∈ R^d .
The Gaussian joint distribution of X and Y has in its exponent

    ( x^t, y^t ) ( H_XX  H_XY ) ( x )
                 ( H_YX  H_YY ) ( y )  =  x^t H_XX x + 2 x^t H_XY y + y^t H_YY y .
(This relies on x^t H_XY y = y^t H_YX x, which is true because H_YX = H_XY^t, which
is true because H is symmetric.) The joint distribution of X and Y is

    u(x, y) = c e^{-( x^t H_XX x + 2 x^t H_XY y + y^t H_YY y )/2} .
If u(x, y) is any joint distribution, then the conditional density of X, condi-
tioned on Y = y, is u(x | Y = y) = u(x | y) = c(y) u(x, y). Here we think of x
as the variable and y as a parameter. The normalization constant here depends
on the parameter. For the Gaussian, we have

    u(x | Y = y) = c(y) e^{-( x^t H_XX x + 2 x^t H_XY y )/2} .   (22)

The factor e^{-y^t H_YY y/2} has been absorbed into c(y).
We see from (22) that the conditional distribution of X is the exponential
of a quadratic function of x, which is to say, Gaussian. The algebraic trick
of completing the square identifies the conditional mean. We seek to write the
exponent of (22) in the form of the exponent of (9):

    x^t H_XX x + 2 x^t H_XY y = (x - μ_X(y))^t H_XX (x - μ_X(y)) + m(y) .

The m(y) will eventually be absorbed into the y dependent prefactor. Some
algebra shows that this works provided

    2 x^t H_XY y = -2 x^t H_XX μ_X(y) .
This will hold for all x if H_XY y = -H_XX μ_X(y), which gives the formula for
the conditional mean:

    μ_X(y) = -H_XX^{-1} H_XY y .   (23)

The conditional mean μ_X(y) is in some sense the best prediction of the unknown
X given the known Y = y.
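The formula (23) can be cross-checked numerically against the covariance-block form of the conditional mean, E[X | Y = y] = C_XY C_YY^{-1} y for centered jointly Gaussian variables; this is a standard identity that is not derived in this section. The sketch below, with made-up block sizes, builds an arbitrary SPD precision matrix H, inverts it to get C, and verifies that the two expressions agree.

    import numpy as np

    rng = np.random.default_rng(3)

    d_x, d_y = 2, 3
    d = d_x + d_y
    M = rng.standard_normal((d, d))
    H = M @ M.T + d * np.eye(d)            # an arbitrary SPD precision matrix

    H_xx = H[:d_x, :d_x]
    H_xy = H[:d_x, d_x:]

    C = np.linalg.inv(H)                   # covariance matrix C = H^{-1}
    C_xy = C[:d_x, d_x:]
    C_yy = C[d_x:, d_x:]

    y = rng.standard_normal(d_y)           # an arbitrary observed value of Y

    mean_precision = -np.linalg.solve(H_xx, H_xy @ y)   # formula (23): -H_XX^{-1} H_XY y
    mean_covariance = C_xy @ np.linalg.solve(C_yy, y)   # covariance-block regression formula

    print(np.allclose(mean_precision, mean_covariance))  # True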
3.5 Generating a multivariate normal, interpreting covariance
If we have M with MM^t = C, we can think of M as a kind of square root of
C. It is possible to find a real d × d matrix M as long as C is symmetric and
positive definite. We will see two distinct ways to do this that give two different
M matrices.

The Cholesky factorization is one of these ways. The Cholesky factorization
of C is a lower triangular matrix L with LL^t = C. Lower triangular means that
all non-zero entries of L are on or below the diagonal:

    L = ( l_11    0     ···    0
          l_21   l_22   ···    0
           ...    ...   ...   ...
          l_d1   l_d2   ···   l_dd ) .
Any good linear algebra book explains the basic facts of Cholesky factorization.
These are: such an L exists as long as C is SPD. There is a unique lower triangular
L with positive diagonal entries: l_jj > 0. There is a straightforward algorithm
that calculates L from C using approximately d³/6 multiplications (and the
same number of additions).
If you want to generate X ∼ N(μ, C), you compute the Cholesky factoriza-
tion of C. Any good package of linear algebra software can do this, including
the downloadable software LAPACK for C or C++ or FORTRAN programming,
and the built-in linear algebra facilities in Python, R, and Matlab. To make an
X, you need d independent standard normals Z_1, ..., Z_d. Most packages that
generate pseudo-random numbers have a procedure to generate such standard
normals. This includes Python, R, and Matlab. To do it in C, C++, or FOR-
TRAN, you can use a uniform pseudo-random number generator and then use
the Box Muller formula to get Gaussians. You assemble the Z_j into a vector
Z = (Z_1, ..., Z_d)^t and take X = LZ + μ.
Consider as an example the two dimensional case with μ = 0. Here, we
want X_1 and X_2 that are jointly normal. It is common to specify var(X_1) = σ_1²,
var(X_2) = σ_2², and the correlation coefficient

    ρ_12 = corr(X_1, X_2) = cov(X_1, X_2)/(σ_1 σ_2) = E[X_1 X_2]/(σ_1 σ_2) .
In this case, the Cholesky factor is

    L = ( σ_1                  0
          ρ_12 σ_2     σ_2 √(1 - ρ_12²) ) .   (24)
The general formula X = LZ becomes

    X_1 = σ_1 Z_1   (25)
    X_2 = ρ_12 σ_2 Z_1 + σ_2 √(1 - ρ_12²) Z_2 .   (26)
It is easy to calculate E[X_1²] = σ_1², which is the desired value. Similarly, because
Z_1 and Z_2 are independent, we have

    var(X_2) = E[X_2²] = ρ_12² σ_2² + (1 - ρ_12²) σ_2² = σ_2² ,

which is the desired answer, too. The correlation coefficient is also correct:

    corr(X_1, X_2) = E[X_1 X_2]/(σ_1 σ_2)
                   = E[ σ_1 Z_1 ( ρ_12 σ_2 Z_1 + σ_2 √(1 - ρ_12²) Z_2 ) ]/(σ_1 σ_2)
                   = ρ_12 E[Z_1²] = ρ_12 .
You can, and should, verify by matrix multiplication that

    L L^t = ( σ_1²             ρ_12 σ_1 σ_2
              ρ_12 σ_1 σ_2     σ_2²         ) ,

which is the desired covariance matrix of (X_1, X_2)^t.
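As a numerical complement to this verification, here is a short numpy sketch with made-up values σ_1 = 2, σ_2 = 3, ρ_12 = 0.6 (chosen only for illustration) that builds L from (24) and confirms that LL^t reproduces the covariance matrix.

    import numpy as np

    sigma1, sigma2, rho = 2.0, 3.0, 0.6      # illustrative values

    L = np.array([[sigma1,       0.0                            ],
                  [rho * sigma2, sigma2 * np.sqrt(1.0 - rho**2) ]])

    C = np.array([[sigma1**2,             rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2            ]])

    print(np.allclose(L @ L.T, C))                 # True: L L^t is the desired covariance
    print(np.allclose(np.linalg.cholesky(C), L))   # and L matches numpy's Cholesky factor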
We could have turned the formulas (25) and (26) around as

    X_1 = σ_1 √(1 - ρ_12²) Z_1 + ρ_12 σ_1 Z_2
    X_2 = σ_2 Z_2 .
In this version, it looks like X_2 is primary and X_1 gets some of its value from
X_2. In (25) and (26), it looks like X_1 is primary and X_2 gets some of its value
from X_1. These two models are equally valid in the sense that they produce
the same observed (X_1, X_2) distribution. It is a good idea to keep this in mind
when interpreting regression studies involving X_1 and X_2.
4 Linear Gaussian recurrences
Linear Gaussian recurrence relations (1) illustrate the ideas in the previous
section. We now know that if V_n is a multivariate normal with mean zero, there
is a matrix B so that V_n = B Z_n, where Z_n ∼ N(0, I) is a standard multivariate
normal. Therefore, we rewrite (1) as

    X_{n+1} = A X_n + B Z_n .   (27)

Since the X_n are Gaussian, we need only describe their means and covariances.
This section shows that the means and covariances satisfy recurrence relations
derived from (1). The next section explores the distributions of paths. This
determines, for example, the joint distribution of X_n and X_m for n ≠ m. These
and more general path spaces and path distributions are important throughout
the course.
4.1 Probability distribution dynamics
As long as Z_n is independent of X_n, we can calculate recurrence relations for
μ_n = E[X_n] and C_n = cov[X_n]. For the mean, we have (you may want to glance
back to subsection 3.2)

    μ_{n+1} = E[A X_n + B Z_n]
            = A E[X_n] + B E[Z_n]
    μ_{n+1} = A μ_n .   (28)
For the covariance, it is convenient to combine (
lrz
27) and (
mur
28) into
X
n+1
n+1
= A(X
n
n
) + BZ
n
.
The covariance calculation starts with

    C_{n+1} = E[ (X_{n+1} - μ_{n+1})(X_{n+1} - μ_{n+1})^t ]
            = E[ (A(X_n - μ_n) + B Z_n)(A(X_n - μ_n) + B Z_n)^t ] .
We expand the last into a sum of four terms. Two of these are zero, one being

    E[ (A(X_n - μ_n)) (B Z_n)^t ] = 0 ,

because Z_n has mean zero and is independent of X_n. We keep the non-zero
terms:

    C_{n+1} = E[ (A(X_n - μ_n))(A(X_n - μ_n))^t ] + E[ (B Z_n)(B Z_n)^t ]
            = E[ A { (X_n - μ_n)(X_n - μ_n)^t } A^t ] + E[ B { Z_n Z_n^t } B^t ]
            = A E[ (X_n - μ_n)(X_n - μ_n)^t ] A^t + B E[ Z_n Z_n^t ] B^t
    C_{n+1} = A C_n A^t + B B^t .   (29)
The recurrence relations (28) and (29) determine the distribution of X_{n+1} in
terms of the distribution of X_n. As such, they are the first example in this class
of a forward equation.
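To see the forward equations in action, the sketch below (with an arbitrarily chosen stable A, B, and initial condition, none of them from the notes) propagates μ_n and C_n by (28) and (29) and compares the result at a fixed n with the sample mean and covariance of many simulated trajectories of (27).

    import numpy as np

    rng = np.random.default_rng(4)

    A = np.array([[0.7, 0.2],
                  [-0.1, 0.5]])          # chosen so the recurrence is stable
    B = np.array([[1.0, 0.0],
                  [0.3, 0.5]])

    mu0 = np.array([1.0, -2.0])          # arbitrary start: X_0 = mu0 exactly, so C_0 = 0
    n_steps, n_paths = 20, 100000

    # Exact propagation of the mean and covariance, equations (28) and (29).
    mu, C = mu0.copy(), np.zeros((2, 2))
    for _ in range(n_steps):
        mu = A @ mu
        C = A @ C @ A.T + B @ B.T

    # Monte Carlo: simulate many independent trajectories of X_{n+1} = A X_n + B Z_n.
    X = np.tile(mu0, (n_paths, 1))
    for _ in range(n_steps):
        Z = rng.standard_normal((n_paths, 2))
        X = X @ A.T + Z @ B.T

    print(mu, X.mean(axis=0))                            # close
    print(np.max(np.abs(C - np.cov(X, rowvar=False))))   # small Monte Carlo error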
We will see in subsection 4.2 that there are natural examples where the
dimension of the noise vector Z_n is less than d, and the noise matrix B is not
square. When that happens, we let m denote the number of components of Z_n,
which is the number of sources of noise. The noise matrix B is d × m; it has d
rows and m columns. The case m > d is not important for applications. The
matrices in (29) all are d × d, including BB^t. If you wonder whether it might
be B^t B instead, note that B^t B is m × m, which might be the wrong size.
4.2 Higher order recurrence relations, the Markov property
It is common to consider recurrence relations with more than one lag. For
example, a k lag relation might take the form

    X_{n+1} = A_0 X_n + A_1 X_{n-1} + ··· + A_{k-1} X_{n-k+1} + B Z_n .   (30)

From the point of view of X_{n+1}, the k lagged states are X_n (one lag), up to
X_{n-k+1} (k lags). It is natural to consider models with multiple lags if X_n
represents observable aspects of a large and largely unobservable system. For
example, the components of X_n could be public financial data at time n. There
is much unavailable private financial data. The lagged values X_{n-j} might give
more insight into the complete state at time n than just X_n.
We do not need a new theory of lag k systems. State space expansion puts a
multi-lag system of the form (30) into the form of a two term recurrence relation
(27). This formulation uses expanded vectors

    X̃_n = ( X_n, X_{n-1}, ..., X_{n-k+1} )^t .
If the original states X_n have d components, then the expanded states X̃_n have
kd components. The noise vector Z_n does not need expanding because noise
vectors have no memory. All the memory in the system is contained in X̃_n. The
recurrence relations in the expanded state formulation are

    X̃_{n+1} = Ã X̃_n + B̃ Z_n .
In more detail, this is

    ( X_{n+1}   )   ( A_0   A_1   ···   A_{k-1} ) ( X_n       )   ( B )
    ( X_n       )   (  I     0    ···     0     ) ( X_{n-1}   )   ( 0 )
    (   ...     ) = (       ...   ...           ) (   ...     ) + ( ...) Z_n .   (31)
    ( X_{n-k+2} )   (  0    ···    I      0     ) ( X_{n-k+1} )   ( 0 )

The matrix Ã is the companion matrix of the recurrence relation (30).
We will see in subsection 4.3 that the stability of a recurrence relation (27)
is determined by the eigenvalues of A. For the case d = 1, you might know that
the stability of the recurrence relation (30) is determined by the roots of the
characteristic polynomial p(z) = z^k - A_0 z^{k-1} - ··· - A_{k-1}. These statements are
consistent because the roots of the characteristic polynomial are the eigenvalues
of the companion matrix.
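For the scalar case d = 1 this consistency is easy to check numerically. The sketch below, with made-up lag coefficients, compares the roots of p(z) = z^k - A_0 z^{k-1} - ··· - A_{k-1} with the eigenvalues of the companion matrix built as in (31).

    import numpy as np

    a = [0.5, -0.3, 0.1]          # illustrative scalar lag coefficients A_0, A_1, A_2 (k = 3)
    k = len(a)

    # Companion matrix: first row holds the coefficients, subdiagonal is the identity.
    A_tilde = np.zeros((k, k))
    A_tilde[0, :] = a
    A_tilde[1:, :-1] = np.eye(k - 1)

    # Characteristic polynomial p(z) = z^k - A_0 z^{k-1} - ... - A_{k-1}.
    poly = np.concatenate(([1.0], -np.asarray(a)))

    eigs = np.sort_complex(np.linalg.eigvals(A_tilde))
    roots = np.sort_complex(np.roots(poly))
    print(np.allclose(eigs, roots))       # True: the same set of numbers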
If X_n satisfies a k lag recurrence (30), then the covariance matrix, C̃_n =
cov(X̃_n), satisfies C̃_{n+1} = Ã C̃_n Ã^t + B̃ B̃^t. The simplest way to find the d × d
covariance matrix C_n is to find the kd × kd covariance matrix C̃_n and look at
the top left d × d block.
The Markov property will be important throughout the course. If the X_n
satisfy the one lag recurrence relation (27), then they have the Markov property.
In this case the X_n form a Markov chain. If they satisfy the k lag recurrence
relation with k > 1 (in a non-trivial way) then the stochastic process X_n does not
have the Markov property. The informal definition is as follows. The process
has the Markov property if X_n is all the information about the past that is
relevant for predicting the future. Said more formally, the distribution of X_{n+1}
conditional on X_n, ..., X_0 is the same as the distribution of X_{n+1} conditional
on X_n alone.
If a random process does not have the Markov property, you can blame that
on the state space being too small, so that X_n does not have as much information
about the state of the system as it should. In many such cases, a version of state
space expansion can create a more complete collection of information at time n.
Genuine state space expansion, with k > 1, always gives a noise matrix B̃
with fewer sources of noise than state variables. The number of state variables
is kd and the number of noise variables is m ≤ d.
4.3 Large time behavior and stability
Large time behavior is the behavior of X_n as n → ∞. The stochastic process
(27) is stable if it settles into a stochastic steady state for large n. The states X_n
can not have a limit, because of the constant influence of random noise. But the
probability distributions, u_n(x), with X_n ∼ u_n(x), can have limits. The limit
u(x) = lim_{n→∞} u_n(x) is a statistical steady state. The finite time distributions
u_n are Gaussian: u_n = N(μ_n, C_n), with μ_n and C_n satisfying the recurrences
(28) and (29). The limiting distribution depends on the following limits:

    μ = lim_{n→∞} μ_n   (32)
    C = lim_{n→∞} C_n   (33)

If these limits exist, then u = N(μ, C). (Some readers will worry that this
statement is not proven with mathematical rigor. It can be, but we are avoiding
that kind of technical discussion.)
In the following discussion we first ignore several subtleties in linear algebra
for the sake of simplicity. Conclusions are correct as initially stated if m = d, B
is non-singular, and there are no Jordan blocks in the eigenvalue decomposition
of A. We will then re-examine the reasoning to figure out what can happen in
exceptional degenerate cases.
The limit (32) depends on the eigenvalues of A. Denote the eigenvalues
by λ_j and the corresponding right eigenvectors by r_j, so that A r_j = λ_j r_j for
j = 1, ..., d. The eigenvalues and eigenvectors do not have to be real even
when A is real. The eigenvectors form a basis of C^d, so the means μ_n have
unique representations μ_n = Σ_{j=1}^d m_{n,j} r_j. The dynamics (28) implies that
m_{n+1,j} = λ_j m_{n,j}. This implies that

    m_{n,j} = λ_j^n m_{0,j} .   (34)
The matrix A is strongly stable if |λ_j| < 1 for j = 1, ..., d. In this case
m_{n,j} → 0 as n → ∞ for each j. In fact, the convergence is exponential. We
see that if A is strongly stable, then μ_n → 0 as n → ∞ independent of the
initial mean μ_0. The opposite case is that |λ_j| > 1 for some j. Such an A is
strongly unstable. It usually happens that |μ_n| → ∞ as n → ∞ for a strongly
unstable A. The limiting distribution u does not exist for strongly unstable A.
The borderline case is |λ_j| ≤ 1 for all j and there is at least one j with |λ_j| = 1.
This may be called either weakly stable or weakly unstable.
If A is strongly stable, then the limit (33) exists. We do not expect C_n → 0
because the uncertainty in X_n is continually replenished by noise. We start with
a direct but possibly unsatisfying proof. A second and more complicated proof
follows. The first proof just uses the fact that if A is strongly stable, then

    ‖A^n‖ ≤ c a^n ,   (35)

for some constant c and positive a < 1. The value of c depends on the matrix
norm and is not important for the proof.
We prove that the limit (33) exists by writing C as a convergent infinite
sum. To simplify notation, write R for BB^t. Suppose C_0 is given; then (29)
gives C_1 = A C_0 A^t + R. Using (29) again gives

    C_2 = A C_1 A^t + R
        = A ( A C_0 A^t + R ) A^t + R
        = A² C_0 (A^t)² + A R A^t + R
        = A² C_0 (A²)^t + A R A^t + R .
We can continue in this way to see (by induction) that

    C_n = A^n C_0 (A^n)^t + A^{n-1} R (A^{n-1})^t + ··· + R .

This is written more succinctly as

    C_n = A^n C_0 (A^n)^t + Σ_{k=0}^{n-1} A^k R (A^k)^t .   (36)
The limit of the C_n exists because the first term on the right goes to zero as
n → ∞ and the second term converges to the infinite sum

    C = Σ_{k=0}^∞ A^k R (A^k)^t .   (37)
For the first term, note that (35) and properties of matrix norms imply that

    ‖ A^n C_0 (A^n)^t ‖ ≤ (c a^n) ‖C_0‖ (c a^n) = c a^{2n} ‖C_0‖ .

(Part of this expression is similar to the design on Courant Institute tee shirts.)
We write c instead of c² at the end because c is a generic constant whose value
does not matter. The right side goes to zero as n → ∞ because a < 1. For the
second term, recall that an infinite sum is the limit of its partial sums if the
infinite sum converges absolutely. Absolute convergence is the convergence of
the sum of the absolute values, or the norms in case of vectors and matrices.
Here the sum of norms is

    Σ_{k=0}^∞ ‖ A^k R (A^k)^t ‖ .
Properties of norms bound this by a geometric series:

    ‖ A^k R (A^k)^t ‖ ≤ c a^{2k} ‖R‖ .
You can find C without summing the infinite series (37). Since the limit
(33) exists, you can take the limit on both sides of (29), which gives

    C - A C A^t = B B^t .   (38)

Subsection 4.4 explains that this is a system of linear equations for the entries
of C. The system is solvable and the solution is positive definite if A is strongly
stable. As a warning, (38) is solvable in most cases even when A is strongly
unstable. But in those cases the C you get is not positive definite and therefore
is not the covariance matrix of anything. The dynamical equation (29) and the
steady state equation (38) are examples of Liapounov equations.
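The two routes to C are easy to compare numerically. The sketch below, with an arbitrarily chosen strongly stable A and noise matrix B (not taken from the notes), iterates the dynamical equation (29) until it stops changing and then checks that the limit satisfies the steady state Liapounov equation (38).

    import numpy as np

    A = np.array([[0.6, 0.2],
                  [0.1, 0.4]])          # spectral radius < 1, so A is strongly stable
    B = np.array([[1.0, 0.0],
                  [0.5, 0.3]])
    R = B @ B.T

    print(np.max(np.abs(np.linalg.eigvals(A))))   # confirm the spectral radius is below 1

    # Iterate C_{n+1} = A C_n A^t + B B^t from C_0 = 0 until the updates stop changing.
    C = np.zeros((2, 2))
    for _ in range(2000):
        C_next = A @ C @ A.T + R
        if np.max(np.abs(C_next - C)) < 1e-14:
            break
        C = C_next

    # The limit satisfies the steady state Liapounov equation (38): C - A C A^t = B B^t.
    print(np.max(np.abs(C - A @ C @ A.T - R)))    # essentially round-off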
Here are the conclusions: if A is strongly stable, then u_n, the distribution of
X_n, has u_n → u as n → ∞, with a Gaussian limit u = N(0, C), and C is given
by (37), or by solving (38). If A is not strongly stable, then it is unlikely that
the u_n have a limit as n → ∞. It is not altogether impossible in degenerate
situations described below. If A is strongly unstable, then it is most likely that
|μ_n| → ∞ as n → ∞. If A is weakly unstable, then probably C_n → ∞ as
n → ∞ because the sum (37) diverges.
4.4 Linear algebra and the limiting covariance
This subsection is a little esoteric. It is (to the author) interesting mathematics
that is not strictly necessary to understand the material for this week. Here we
find eigenvalues and eigen-matrices for the recurrence relation (29). These are
related to the eigenvalues and eigenvectors of A.
The covariance recurrence relation (29) has the same stability/instability di-
chotomy. We explain this by reformulating it as more standard linear algebra.
Consider first the part that does not involve B, which is

    C_{n+1} = A C_n A^t .   (39)
Here, the entries of C_{n+1} are linear functions of the entries of C_n. We describe
this more explicitly by collecting all the distinct entries of C_n into a vector c_n.
There are D = (d + 1)d/2 entries in c_n because the elements of C_n below the
diagonal are equal to the entries above. For example, for d = 3 there are D = 6
distinct entries in C_n, which are C_{n,11}, C_{n,12}, C_{n,13}, C_{n,22}, C_{n,23}, and C_{n,33},
which makes c_n = (C_{n,11}, C_{n,12}, C_{n,13}, C_{n,22}, C_{n,23}, C_{n,33})^t ∈ R^D (= R^6). There
is a D × D matrix, L, so that c_{n+1} = L c_n. In the case d = 2 and

    A = ( α  β
          γ  δ ) ,

the C_n recurrence relation, or dynamical Liapounov equation without BB^t, (29)
is

    ( C_{n+1,11}  C_{n+1,12} )   ( α  β ) ( C_{n,11}  C_{n,12} ) ( α  γ )
    ( C_{n+1,12}  C_{n+1,22} ) = ( γ  δ ) ( C_{n,12}  C_{n,22} ) ( β  δ ) .

This is equivalent to D = 3 and

    ( C_{n+1,11} )   ( α²     2αβ        β²  ) ( C_{n,11} )
    ( C_{n+1,12} ) = ( αγ    αδ + βγ     βδ  ) ( C_{n,12} ) .
    ( C_{n+1,22} )   ( γ²     2γδ        δ²  ) ( C_{n,22} )

And that identifies L as

    L = ( α²     2αβ        β²
          αγ    αδ + βγ     βδ
          γ²     2γδ        δ² ) .
This formulation is not so useful for practical calculations. Its only purpose is
to show that (39) is related to a D × D matrix L.
The limiting behavior of C_n depends on the eigenvalues of L. It turns out
that these are determined by the eigenvalues of A in a simple way. For each pair
(j, k) there is an eigenvalue of L, which we call λ_{jk}, that is equal to λ_j λ_k. To
understand this, note that an eigenvector, s, of L, with Ls = λs, corresponds
to a symmetric d × d eigen-matrix, S, with

    A S A^t = λ S .
It happens that S_{jk} = r_j r_k^t + r_k r_j^t is the eigen-matrix corresponding to eigenvalue
λ_{jk} = λ_j λ_k. (To be clear, S_{jk} is a d × d matrix, not the (j, k) entry of a matrix
S.) For one thing, it is symmetric (S_{jk}^t = S_{jk}). For another thing:
    A S_{jk} A^t = A ( r_j r_k^t + r_k r_j^t ) A^t
                 = A ( r_j r_k^t ) A^t + A ( r_k r_j^t ) A^t
                 = (A r_j)(A r_k)^t + (A r_k)(A r_j)^t
                 = (λ_j r_j)(λ_k r_k)^t + (λ_k r_k)(λ_j r_j)^t
                 = λ_j λ_k ( r_j r_k^t + r_k r_j^t )
                 = λ_{jk} S_{jk} .
A counting argument shows that all the eigenvalues and eigen-matrices of L
take the form of S_{jk} for some j ≥ k. The number of such pairs is the same D,
which is the number of independent entries in a general symmetric matrix. We
do not count S_{jk} with j < k because S_{jk} = S_{kj} with k > j.
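One way to check the λ_j λ_k claim numerically, without building the reduced D × D matrix L explicitly, is to use the full d² × d² matrix of the map C → A C A^t acting on vec(C), which is the Kronecker product A ⊗ A (a standard linear algebra identity, not used elsewhere in these notes). Its eigenvalues are all the products λ_j λ_k, and the D of them with j ≥ k are the eigenvalues of L. A small sketch with an arbitrary A:

    import numpy as np

    rng = np.random.default_rng(5)

    d = 3
    A = rng.standard_normal((d, d))

    lam = np.linalg.eigvals(A)

    # The map C -> A C A^t, acting on vec(C), is the Kronecker product A ⊗ A,
    # so its d^2 eigenvalues are all products lam[j] * lam[k].
    eig_map = np.sort_complex(np.linalg.eigvals(np.kron(A, A)))
    products = np.sort_complex(np.array([lj * lk for lj in lam for lk in lam]))

    print(np.max(np.abs(eig_map - products)))   # near machine precision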
Now suppose A is strongly stable. Then the Liapounov dynamical equation
(29) is equivalent to

    c_{n+1} = L c_n + r .

Since all the eigenvalues of L are less than one in magnitude, a little reasoning
with linear algebra shows that c_n → c as n → ∞, and that c - Lc = (I - L)c = r.
The matrix I - L is invertible because L has no eigenvalues equal to 1. This
is a different proof that the steady state Liapounov equation (38) has a unique
solution. It is likely that L has no eigenvalue equal to 1 even if A is not strongly
stable. In this case (38) has a solution, which is a symmetric matrix C. But
there is no guarantee that this C is positive definite, so it does not represent a
covariance matrix.
4.5 Degenerate cases
The simple conclusions of subsections 4.3 and 4.4 do not hold in every case.
The reasoning there assumed things about the matrices A and B that you might
think are true in almost every interesting case. But it is important to understand
how things might be more complicated in borderline and degenerate cases. For one
thing, many important special cases are such borderline cases. Many more
systems have behavior that is strongly influenced by near degeneracy. A
process that is weakly but not strongly unstable is the simple Gaussian random
walk, which is a model of Brownian motion. A covariance that is nearly singular
is the covariance matrix of asset returns of the S&P 500 stocks. This is a matrix
of rank 500 that is pretty well approximated for many purposes by a matrix of
rank 10.
4.5.1 Rank of B
The matrix B need not be square or have rank d.
5 Paths and path space
There are questions about the process (1) that depend on X_n for more than one n. For ex-
ample, what is Pr(X_n ≥ 1 for 1 ≤ n ≤ 10)? The probability distribution on
path space answers such questions. For linear Gaussian processes, the distri-
bution in path space is Gaussian. This is not surprising. This subsection goes
through the elementary mechanics of Gaussian path space. We also describe
more general path space terminology that carries over to other kinds of
Markov processes.
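Path-space probabilities like this rarely have closed forms, but they are easy to estimate by simulating many paths. A minimal sketch for a scalar example (the recurrence X_{n+1} = 0.8 X_n + Z_n with X_0 = 2 is invented purely for illustration) estimates Pr(X_n ≥ 1 for 1 ≤ n ≤ 10) by Monte Carlo.

    import numpy as np

    rng = np.random.default_rng(6)

    a, x0 = 0.8, 2.0              # illustrative scalar recurrence X_{n+1} = a X_n + Z_n
    n_steps, n_paths = 10, 200000

    X = np.full(n_paths, x0)
    stayed_above = np.ones(n_paths, dtype=bool)   # has the path satisfied X_n >= 1 so far?

    for n in range(1, n_steps + 1):
        X = a * X + rng.standard_normal(n_paths)  # advance every path one step
        stayed_above &= (X >= 1.0)                # the event depends on the whole path

    print(stayed_above.mean())    # Monte Carlo estimate of Pr(X_n >= 1 for 1 <= n <= 10)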
Two relevant probability spaces are the state space and the path space. We
let S denote the state space. This is the set of all possible values of the state
at time n. This week, the state is a d component vector and S = R^d. The path
space is called Ω. This week, Ω is sequences of states with a given starting and
ending time. That is, X_{[n1:n2]} is a sequence (X_{n1}, X_{n1+1}, ..., X_{n2}). There
are n_2 - n_1 + 1 states in the sequence, so Ω = R^{(n2-n1+1)d}. Even if the state
space is not R^d, still a path is a sequence of states. We express this by writing
Ω = S^{n2-n1+1}. The path space depends on n_1 and n_2 (only the difference,
really), but we leave that out of the notation because it is usually clear from
the discussion.
6 Exercises
1. This exercise works through conditional distributions of multivariate nor-
mals in a sequence of steps. The themes (for the list of facts about Gaus-
sians) are the role of linear algebra and the relation to linear regression.
Suppose X and Y have d_X and d_Y components respectively. Let u(x, y)
be the joint density. Then the conditional distribution of Y conditional
on X = x is u( y | X = x) = c(x) u(x, y). This says that the conditional
distribution is the same, up to a normalization constant, as the joint dis-
tribution once you fix the variable whose value is known (x in this case).
The normalization constant is determined by the requirement that the
conditional distribution has total probability equal to 1:

    c(x) = 1 / ∫ u(x, y) dy .

For Gaussian random variables, finding c(x) is usually both easy and unnec-
essary.
(a) This part works out the simplest case. Take d = 2, and X =
(X_1, X_2)^t. Suppose X ∼ N(0, H^{-1}). Fix the value of X_1 = x_1 and
calculate the distribution of the one dimensional random variable X_2.
If H is

    H = ( h_11  h_12
          h_12  h_22 ) ,
then the joint density is

    u(x_1, x_2) = c exp( -( h_11 x_1² + 2 h_12 x_1 x_2 + h_22 x_2² )/2 ) .
The conditional density looks almost the same:

    u( x_2 | x_1 ) = c(x_1) exp( -( 2 h_12 x_1 x_2 + h_22 x_2² )/2 ) .
Why is it allowed to leave the term h_11 x_1² out of the exponent? Com-
plete the square to write this in the form

    u( x_2 | x_1 ) = c(x_1) exp( -( x_2 - μ(x_1) )² / (2 σ_2²) ) .

Find formulas for the conditional mean, μ(x_1), and the conditional
variance, σ_2².