
3. Multivariate Normal Distribution


The MVN distribution is a generalization of the univariate normal distribution, which has the density function (p.d.f.)

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}, \qquad -\infty < x < \infty,

where \mu is the mean of the distribution and \sigma^2 the variance. In p dimensions the density becomes

f(x) = (2\pi)^{-p/2}\, |\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\} \qquad (3.1)
Within the mean vector there are p (independent) parameters and within the symmetric covariance matrix there are \frac{1}{2} p (p+1) independent parameters [\frac{1}{2} p (p+3) independent parameters in total]. We use the notation

x \sim N_p(\mu, \Sigma) \qquad (3.2)

to denote a random vector x having the p-variate MVN distribution with

E(x) = \mu, \qquad \mathrm{Cov}(x) = \Sigma.

Note that MVN distributions are entirely characterized by the first and second moments of the distribution.
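As a quick numerical check of (3.1), the sketch below (assuming NumPy and SciPy are available; the mean vector, covariance matrix and evaluation point are arbitrary illustrative values, not from the notes) evaluates the density directly and compares it with scipy.stats.multivariate_normal.

```python
# Evaluate the MVN density (3.1) directly and compare with SciPy.
# mu, Sigma and x are illustrative values only.
import numpy as np
from scipy.stats import multivariate_normal

p = 3
mu = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 0.5]])   # symmetric, diagonally dominant => positive definite
x = np.array([0.8, 0.0, 1.5])

d = x - mu
quad = d @ np.linalg.solve(Sigma, d)                     # (x-mu)^T Sigma^{-1} (x-mu)
dens = (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

print(dens)                                              # direct evaluation of (3.1)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))    # should agree
```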
3.1 Basic properties
If x (p \times 1) is MVN with mean \mu and covariance matrix \Sigma:

- Any linear combination of x is MVN (illustrated numerically in the sketch after this list). Let y = Ax + c with A (q \times p) and c (q \times 1); then y \sim N_q(\mu_y, \Sigma_y), where \mu_y = A\mu + c and \Sigma_y = A \Sigma A^T.
- Any subset of variables in x has a MVN distribution.
- If a set of variables is uncorrelated, then they are independently distributed. In particular,
  i) if \sigma_{ij} = 0 then x_i, x_j are independent;
  ii) if x is MVN with covariance matrix \Sigma, then Ax and Bx are independent if and only if
  \mathrm{Cov}(Ax, Bx) = A \Sigma B^T = 0. \qquad (3.3)
- Conditional distributions are MVN.
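A minimal Monte Carlo sketch of the linear-combination property, with illustrative choices of \mu, \Sigma, A and c (none of them from the notes): the simulated mean and covariance of y = Ax + c should be close to A\mu + c and A \Sigma A^T.

```python
# Monte Carlo check that y = Ax + c has mean A mu + c and covariance A Sigma A^T.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0, -1.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])
A = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 2.0]])          # q x p with q = 2
c = np.array([3.0, -2.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are draws of x
Y = X @ A.T + c                                         # each row is A x_r + c

print(Y.mean(axis=0), A @ mu + c)        # sample mean vs. A mu + c
print(np.cov(Y, rowvar=False))           # sample covariance ...
print(A @ Sigma @ A.T)                   # ... vs. A Sigma A^T
```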
Result
For the MVN distribution, the variables are uncorrelated if and only if they are independent.
Proof
Let x (p \times 1) be partitioned as

x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}

with mean vector

\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}

and covariance matrix

\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.
i) Independent \Rightarrow uncorrelated (always holds).
Suppose x_1, x_2 are independent. Then f(x_1, x_2) = h(x_1)\, g(x_2) is a factorization of the multivariate p.d.f., and

\Sigma_{12} = \mathrm{Cov}(x_1, x_2) = E\left[ (x_1 - \mu_1)(x_2 - \mu_2)^T \right]

factorizes into the product of E[(x_1 - \mu_1)] and E[(x_2 - \mu_2)^T], which are both zero since E(x_1) = \mu_1 and E(x_2) = \mu_2. Hence \Sigma_{12} = 0.
ii) Uncorrelated \Rightarrow independent (for MVN).
This result depends on factorizing the p.d.f. (3.1) when \Sigma_{12} = 0.
In this case (x - \mu)^T \Sigma^{-1} (x - \mu) has the partitioned form

\begin{pmatrix} x_1^T - \mu_1^T, & x_2^T - \mu_2^T \end{pmatrix}
\begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}^{-1}
\begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}
= \begin{pmatrix} x_1^T - \mu_1^T, & x_2^T - \mu_2^T \end{pmatrix}
\begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix}
\begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}
= (x_1 - \mu_1)^T \Sigma_{11}^{-1} (x_1 - \mu_1) + (x_2 - \mu_2)^T \Sigma_{22}^{-1} (x_2 - \mu_2),

so that \exp\left\{ -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\} factorizes into the product of \exp\left\{ -\tfrac{1}{2} (x_1 - \mu_1)^T \Sigma_{11}^{-1} (x_1 - \mu_1) \right\} and \exp\left\{ -\tfrac{1}{2} (x_2 - \mu_2)^T \Sigma_{22}^{-1} (x_2 - \mu_2) \right\}. Since \Sigma is block diagonal we also have |\Sigma| = |\Sigma_{11}|\,|\Sigma_{22}|, so the normalizing constant factorizes as well. Therefore the p.d.f. can be written as

f(x) = g(x_1)\, h(x_2),

proving that x_1 and x_2 are independent.
3.2 Conditional distribution
Let

X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}

be a partitioned MVN random p-vector, with mean

\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}

and covariance matrix

\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.

The conditional distribution of X_2 given X_1 = x_1 is MVN with

E(X_2 \mid X_1 = x_1) = \mu_2 + \Sigma_{21} \Sigma_{11}^{-1} (x_1 - \mu_1) \qquad (3.4a)

\mathrm{Cov}(X_2 \mid X_1 = x_1) = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \qquad (3.4b)

Note: the notation X_1 to denote the random vector and x_1 to denote a specific constant value (a realization of X_1) will be very useful here.
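The formulas (3.4a) and (3.4b) translate directly into a small helper. The sketch below is one possible implementation; the function name mvn_conditional and the numerical values are illustrative, not from the notes.

```python
# Conditional mean and covariance of the MVN, following (3.4a)-(3.4b).
import numpy as np

def mvn_conditional(mu, Sigma, idx1, x1):
    """Return E(X2 | X1 = x1) and Cov(X2 | X1 = x1).

    idx1 gives the positions of the conditioning block X1; the remaining
    positions form X2.
    """
    idx1 = np.asarray(idx1)
    idx2 = np.setdiff1d(np.arange(len(mu)), idx1)
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S21 = Sigma[np.ix_(idx2, idx1)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    W = S21 @ np.linalg.inv(S11)          # Sigma_21 Sigma_11^{-1}
    cond_mean = mu2 + W @ (x1 - mu1)      # (3.4a)
    cond_cov = S22 - W @ S12              # (3.4b)
    return cond_mean, cond_cov

# Illustrative values: condition the last two components on the first.
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[1.0, 0.3, 0.2],
                  [0.3, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])
print(mvn_conditional(mu, Sigma, idx1=[0], x1=np.array([0.5])))
```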
Proof of 3.4a
Define a transformation from (X_1, X_2) to new variables X_1 and X_2^0 = X_2 - \Sigma_{21} \Sigma_{11}^{-1} X_1. This is achieved by the linear transformation

\begin{pmatrix} X_1 \\ X_2^0 \end{pmatrix}
= \begin{pmatrix} I & 0 \\ -\Sigma_{21} \Sigma_{11}^{-1} & I \end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \qquad (3.5a)

= AX, \text{ say.} \qquad (3.5b)

This linear relationship shows that X_1, X_2^0 are jointly MVN (by the first property of MVN stated above).
We now show that X_2^0 and X_1 are independent by proving that X_1 and X_2^0 are uncorrelated.
Approach 1:

\mathrm{Cov}\left( X_1, X_2^0 \right) = \mathrm{Cov}\left( X_1, X_2 - \Sigma_{21} \Sigma_{11}^{-1} X_1 \right)
= \mathrm{Cov}(X_1, X_2) - \mathrm{Cov}(X_1, X_1)\, \Sigma_{11}^{-1} \Sigma_{12}
= \Sigma_{12} - \Sigma_{11} \Sigma_{11}^{-1} \Sigma_{12}
= 0
Approach 2:
In (3.3), write A = \begin{pmatrix} B \\ C \end{pmatrix} where B = \begin{pmatrix} I & 0 \end{pmatrix} and C = \begin{pmatrix} -\Sigma_{21} \Sigma_{11}^{-1} & I \end{pmatrix}. Then

\mathrm{Cov}\left( X_1, X_2^0 \right) = \mathrm{Cov}(BX, CX) = B \Sigma C^T
= \begin{pmatrix} I & 0 \end{pmatrix}
\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}
\begin{pmatrix} -\Sigma_{11}^{-1} \Sigma_{12} \\ I \end{pmatrix}
= \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \end{pmatrix}
\begin{pmatrix} -\Sigma_{11}^{-1} \Sigma_{12} \\ I \end{pmatrix}
= -\Sigma_{11} \Sigma_{11}^{-1} \Sigma_{12} + \Sigma_{12} = 0
Since X_2^0 and X_1 are MVN and uncorrelated, they are independent. Thus

E\left( X_2^0 \mid X_1 = x_1 \right) = E\left( X_2^0 \right) = E\left( X_2 - \Sigma_{21} \Sigma_{11}^{-1} X_1 \right) = \mu_2 - \Sigma_{21} \Sigma_{11}^{-1} \mu_1.

Now, as X_2^0 = X_2 - \Sigma_{21} \Sigma_{11}^{-1} X_1 and X_1 = x_1 is given, we have

E(X_2 \mid X_1 = x_1) = E\left( X_2^0 \mid X_1 = x_1 \right) + \Sigma_{21} \Sigma_{11}^{-1} x_1
= \mu_2 - \Sigma_{21} \Sigma_{11}^{-1} \mu_1 + \Sigma_{21} \Sigma_{11}^{-1} x_1
= \mu_2 + \Sigma_{21} \Sigma_{11}^{-1} (x_1 - \mu_1)

as required.
Proof of 3.4b
Because X_2^0 is independent of X_1,

\mathrm{Cov}\left( X_2^0 \mid X_1 = x_1 \right) = \mathrm{Cov}\left( X_2^0 \right).

The left hand side is

\mathrm{LHS} = \mathrm{Cov}\left( X_2^0 \mid X_1 = x_1 \right) = \mathrm{Cov}\left( X_2 - \Sigma_{21} \Sigma_{11}^{-1} x_1 \mid X_1 = x_1 \right) = \mathrm{Cov}(X_2 \mid X_1 = x_1).

The right hand side is

\mathrm{RHS} = \mathrm{Cov}\left( X_2^0 \right) = \mathrm{Cov}\left( X_2 - \Sigma_{21} \Sigma_{11}^{-1} X_1 \right) = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12},

following from the general expansion

\mathrm{Cov}(X_2 - D X_1) = \mathrm{Cov}(X_2, X_2) - D\, \mathrm{Cov}(X_1, X_2) - \mathrm{Cov}(X_2, X_1)\, D^T + D\, \mathrm{Cov}(X_1, X_1)\, D^T

with D = \Sigma_{21} \Sigma_{11}^{-1}. Therefore

\mathrm{Cov}(X_2 \mid X_1 = x_1) = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}

as required.
Example
Let x have a MVN distribution with covariance matrix

\Sigma = \begin{pmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & 0 \\ \rho^2 & 0 & 1 \end{pmatrix}.

Show that the conditional distribution of (X_1, X_2) given X_3 = x_3 is also MVN with mean

\begin{pmatrix} \mu_1 + \rho^2 (x_3 - \mu_3) \\ \mu_2 \end{pmatrix}

and covariance matrix

\begin{pmatrix} 1 - \rho^4 & \rho \\ \rho & 1 \end{pmatrix}.
Solution
Let Y_1 = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} and Y_2 = (X_3); then

E(Y_1) = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad E(Y_2) = (\mu_3).

We have

\mathrm{Cov}\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}

where

\Sigma_{11} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad
\Sigma_{12} = \begin{pmatrix} \rho^2 \\ 0 \end{pmatrix} = \Sigma_{21}^T, \qquad
\Sigma_{22} = (1).

Hence

E(Y_1 \mid Y_2 = x_3) = E(Y_1) + \Sigma_{12} \Sigma_{22}^{-1} (x_3 - \mu_3)
= \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \rho^2 \\ 0 \end{pmatrix} (x_3 - \mu_3)
= \begin{pmatrix} \mu_1 + \rho^2 (x_3 - \mu_3) \\ \mu_2 \end{pmatrix}

and

\mathrm{Cov}(Y_1 \mid Y_2 = x_3) = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}
= \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} - \begin{pmatrix} \rho^2 \\ 0 \end{pmatrix} \begin{pmatrix} \rho^2 & 0 \end{pmatrix}
= \begin{pmatrix} 1 - \rho^4 & \rho \\ \rho & 1 \end{pmatrix}.
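A symbolic check of this solution, assuming SymPy is available: forming \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} for the partition above should reproduce the stated conditional covariance.

```python
# Symbolic check of the worked example using sympy.
import sympy as sp

rho = sp.symbols('rho')
S11 = sp.Matrix([[1, rho], [rho, 1]])     # Cov of (X1, X2)
S12 = sp.Matrix([[rho**2], [0]])          # Cov of (X1, X2) with X3
S22 = sp.Matrix([[1]])                    # Var of X3

cond_cov = sp.simplify(S11 - S12 * S22.inv() * S12.T)
print(cond_cov)   # expected: Matrix([[1 - rho**4, rho], [rho, 1]])
```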
3.3 Maximum-likelihood estimation
Let X^T = (x_1, \ldots, x_n) contain an independent random sample of size n from N_p(\mu, \Sigma). The maximum likelihood estimates (MLEs) of \mu, \Sigma are the sample mean and covariance matrix (with divisor n):

\hat\mu = \bar{x} \qquad (3.6a)

\hat\Sigma = S \qquad (3.6b)
The likelihood function is a function of the parameters \mu, \Sigma given the data X:

L(\mu, \Sigma \mid X) = \prod_{r=1}^{n} f(x_r \mid \mu, \Sigma) \qquad (3.7)

The RHS is evaluated by substituting the individual data vectors x_1, \ldots, x_n in turn into the p.d.f. of N_p(\mu, \Sigma) and taking the product:

\prod_{r=1}^{n} f(x_r \mid \mu, \Sigma) = (2\pi)^{-np/2}\, |\Sigma|^{-n/2} \exp\left\{ -\tfrac{1}{2} \sum_{r=1}^{n} (x_r - \mu)^T \Sigma^{-1} (x_r - \mu) \right\}
Maximizing L is equivalent to minimizing the "log-likelihood" function

\ell(\mu, \Sigma) = -2 \log L = -2 \sum_{r=1}^{n} \log f(x_r \mid \mu, \Sigma)
= c + n \log|\Sigma| + \sum_{r=1}^{n} (x_r - \mu)^T \Sigma^{-1} (x_r - \mu) \qquad (3.8)

where c is a constant independent of \mu, \Sigma.
Result 3.3

\ell(\mu, \Sigma) = n \left[ \log|\Sigma| + \mathrm{tr}\left\{ \Sigma^{-1} \left( S + d d^T \right) \right\} \right] \qquad (3.9)

up to an additive constant, where d = \bar{x} - \mu.
Proof
Noting that x_r - \mu = (x_r - \bar{x}) + d, the final term in the likelihood expression (3.8) becomes

\sum_{r=1}^{n} (x_r - \mu)^T \Sigma^{-1} (x_r - \mu)
= \sum_{r=1}^{n} (x_r - \bar{x})^T \Sigma^{-1} (x_r - \bar{x}) + n\, d^T \Sigma^{-1} d
= n\, \mathrm{tr}\left( \Sigma^{-1} S \right) + n\, d^T \Sigma^{-1} d
= n\, \mathrm{tr}\left\{ \Sigma^{-1} \left( S + d d^T \right) \right\},

proving the expression (3.9). Note that the cross-product terms have vanished because \sum_{r=1}^{n} x_r = n \bar{x} and therefore

\sum_{r=1}^{n} d^T \Sigma^{-1} (x_r - \bar{x}) = d^T \Sigma^{-1} \sum_{r=1}^{n} (x_r - \bar{x}) = \sum_{r=1}^{n} (x_r - \bar{x})^T \Sigma^{-1} d = 0.
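A quick numerical confirmation of this rearrangement, with illustrative values of \mu and \Sigma and simulated data: both sides of the identity behind (3.9) agree to rounding error.

```python
# Numerical check: sum_r (x_r - mu)^T Sigma^{-1} (x_r - mu) = n tr(Sigma^{-1}(S + d d^T)).
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
mu = np.array([0.5, -1.0, 2.0])            # an arbitrary trial value of mu
Sigma = np.array([[1.0, 0.2, 0.1],
                  [0.2, 1.5, 0.3],
                  [0.1, 0.3, 0.8]])        # an arbitrary positive-definite trial Sigma
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / n          # covariance with divisor n, as in the notes
d = xbar - mu
Sinv = np.linalg.inv(Sigma)

lhs = sum((x - mu) @ Sinv @ (x - mu) for x in X)
rhs = n * np.trace(Sinv @ (S + np.outer(d, d)))
print(lhs, rhs)                            # should agree to rounding error
```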
In (3.9) the dependence on \mu is entirely through d. Now assume that \Sigma is positive definite (p.d.); then so is \Sigma^{-1}, since

\Sigma^{-1} = V \Lambda^{-1} V^T, \quad \text{where } \Sigma = V \Lambda V^T

is the eigenanalysis of \Sigma. Thus for all d \neq 0 we have d^T \Sigma^{-1} d > 0. Hence \ell(\mu, \Sigma) is minimized with respect to \mu, for fixed \Sigma, when d = 0, i.e.

\hat\mu = \bar{x}.
Final part of proof: to minimize the log-likelihood \ell(\hat\mu, \Sigma) w.r.t. \Sigma, let

\ell(\hat\mu, \Sigma) = n \left[ \log|\Sigma| + \mathrm{tr}\left( \Sigma^{-1} S \right) \right] = g(\Sigma). \qquad (3.10)

We show that

g(\Sigma) - g(S) = n \left[ \log|\Sigma| - \log|S| + \mathrm{tr}\left( \Sigma^{-1} S \right) - p \right]
= n \left[ \mathrm{tr}\left( \Sigma^{-1} S \right) - \log|\Sigma^{-1} S| - p \right] \qquad (3.11)
\geq 0.
Lemma 1
\Sigma^{-1} S is positive semi-definite (proved elsewhere). Assuming S is non-singular, the eigenvalues of \Sigma^{-1} S are therefore positive.
Lemma 2
For any set of positive numbers,

\bar{y} \geq \log G + 1,

where \bar{y} and G are the arithmetic and geometric means respectively.

Proof
For all x we have e^x \geq 1 + x (simple exercise). Consider a set of n strictly positive numbers y_i:

y_i \geq 1 + \log y_i
\sum y_i \geq n + \sum \log y_i
\bar{y} \geq 1 + \log\left( \prod y_i \right)^{1/n} = 1 + \log G

as required.
Recall that for any (n \times n) matrix A we have \mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i, the sum of the eigenvalues, and |A| = \prod \lambda_i, the product of the eigenvalues. Let \lambda_i (i = 1, \ldots, p) be the positive eigenvalues of \Sigma^{-1} S and substitute in (3.11):

\log|\Sigma^{-1} S| = \log\left( \prod \lambda_i \right) = p \log G
\mathrm{tr}\left( \Sigma^{-1} S \right) = \sum \lambda_i = p \bar\lambda,

where \bar\lambda and G are the arithmetic and geometric means of the \lambda_i. Hence, by Lemma 2,

g(\Sigma) - g(S) = n p \left[ \bar\lambda - \log G - 1 \right] \geq 0.

This proves that the MLEs are as stated in (3.6).
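A minimal sketch of (3.6) in code, with illustrative true parameters: the sample mean and the divisor-n covariance matrix give the smallest value of \ell among a few arbitrary alternatives.

```python
# The MLEs: sample mean and covariance with divisor n (not n-1), checked against (3.9).
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
mu_true = np.array([1.0, 0.0, -1.0])
Sigma_true = np.array([[1.0, 0.4, 0.0],
                       [0.4, 1.0, 0.2],
                       [0.0, 0.2, 0.5]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=n)

mu_hat = X.mean(axis=0)                          # (3.6a)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / n    # (3.6b): divisor n, not n-1

def ell(mu, Sigma):
    """-2 log-likelihood up to an additive constant, in the form (3.9)."""
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar) / n
    d = xbar - mu
    Sinv = np.linalg.inv(Sigma)
    return n * (np.log(np.linalg.det(Sigma)) + np.trace(Sinv @ (S + np.outer(d, d))))

# The MLE should give the smallest value among a few arbitrary alternatives.
print(ell(mu_hat, Sigma_hat))
print(ell(mu_true, Sigma_true))
print(ell(mu_hat + 0.3, Sigma_hat + 0.2 * np.eye(p)))
```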
3.3 Sampling distribution of \bar{x} and S
The Wishart distribution (Definition)
If M (p \times p) can be written M = X^T X, where X (n \times p) is a data matrix from N_p(0, \Sigma), then M is said to have a Wishart distribution with scale matrix \Sigma and degrees of freedom n. We write

M \sim W_p(\Sigma; n) \qquad (3.12)

When \Sigma = I_p the distribution is said to be in standard form.
Note: the Wishart distribution is the multivariate generalization of the chi-squared (\chi^2) distribution.
Additive property of matrices with a Wishart distribution
Let M_1, M_2 be matrices having the Wishart distributions

M_1 \sim W_p(\Sigma; n_1), \qquad M_2 \sim W_p(\Sigma; n_2),

independently; then

M_1 + M_2 \sim W_p(\Sigma; n_1 + n_2).

This property follows from the definition of the Wishart distribution, because data matrices are additive in the sense that if

X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}

is a combined data matrix consisting of n_1 + n_2 rows, then

X^T X = X_1^T X_1 + X_2^T X_2

is the matrix (known as the "Gram matrix") formed from the combined data matrix X.
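The additivity argument is just the block structure of the Gram matrix, as this short check with arbitrary simulated data matrices illustrates.

```python
# Stacking two data matrices and forming the Gram matrix equals adding the two Gram matrices.
import numpy as np

rng = np.random.default_rng(3)
p = 4
X1 = rng.standard_normal((10, p))     # n1 x p
X2 = rng.standard_normal((15, p))     # n2 x p
X = np.vstack([X1, X2])               # (n1 + n2) x p

print(np.allclose(X.T @ X, X1.T @ X1 + X2.T @ X2))   # True
```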
Case of p = 1
When p = 1 we know, from the definition of \chi^2_m as the distribution of the sum of squares of m independent N(0, 1) variates, that for x_i \sim N(0, \sigma^2) independently,

M = \sum_{i=1}^{m} x_i^2 \sim \sigma^2 \chi^2_m

so that

W_1(\sigma^2, m) = \sigma^2 \chi^2_m.
Sampling distributions
Let x_1, x_2, \ldots, x_n be a random sample of size n from N_p(\mu, \Sigma). Then:

1. The sample mean \bar{x} has the normal distribution \bar{x} \sim N_p\left( \mu, \tfrac{1}{n} \Sigma \right).
2. The (scaled) sample covariance matrix has the Wishart distribution: (n-1) S_u \sim W_p(\Sigma; n-1) (see the simulation sketch below).
3. The distributions of \bar{x} and S_u are independent.
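A rough simulation of result 2, with illustrative n, p and \Sigma: averaging (n-1) S_u over many samples should approach (n-1) \Sigma, consistent with the mean of a W_p(\Sigma, n-1) matrix being (n-1) \Sigma.

```python
# Monte Carlo illustration: the average of (n-1) S_u is close to (n-1) Sigma.
import numpy as np

rng = np.random.default_rng(4)
n, p, reps = 20, 2, 5000
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

M_sum = np.zeros((p, p))
for _ in range(reps):
    X = rng.multivariate_normal(mu, Sigma, size=n)
    S_u = np.cov(X, rowvar=False)        # unbiased covariance (divisor n-1)
    M_sum += (n - 1) * S_u

print(M_sum / reps)          # should be close to (n-1) * Sigma
print((n - 1) * Sigma)
```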
3.4 Estimators for special circumstances
3.4.1 \mu proportional to a given vector
Sometimes \mu is known to be proportional to a given vector, so \mu = k \mu_0 with \mu_0 being a known vector.
For example, if x represents a sample of repeated measurements then \mu = k \mathbf{1}, where \mathbf{1} = (1, 1, \ldots, 1)^T is the p-vector of 1's.
We find the MLE of k for this situation. Suppose \Sigma is known and \mu = k \mu_0. Let d_0 = \bar{x} - k \mu_0.
The log likelihood is

\ell(k) = -2 \log L
= n \left[ \log|\Sigma| + \mathrm{tr}\left\{ \Sigma^{-1} \left( S + d_0 d_0^T \right) \right\} \right]
= n \left[ \log|\Sigma| + \mathrm{tr}\left( \Sigma^{-1} S \right) + (\bar{x} - k \mu_0)^T \Sigma^{-1} (\bar{x} - k \mu_0) \right]
= n \left[ \bar{x}^T \Sigma^{-1} \bar{x} - 2 k\, \mu_0^T \Sigma^{-1} \bar{x} + k^2\, \mu_0^T \Sigma^{-1} \mu_0 \right] + \text{constant terms independent of } k.

Set d\ell/dk = 0 to minimize \ell(k) w.r.t. k:

-2 \mu_0^T \Sigma^{-1} \bar{x} + 2 \left( \mu_0^T \Sigma^{-1} \mu_0 \right) k = 0,

from which

\hat{k} = \frac{\mu_0^T \Sigma^{-1} \bar{x}}{\mu_0^T \Sigma^{-1} \mu_0}. \qquad (3.13)
Properties
We now show that \hat{k} is an unbiased estimator of k and determine the variance of \hat{k}.
In (3.13), \hat{k} takes the form \frac{1}{a} c^T \bar{x} with c^T = \mu_0^T \Sigma^{-1} and a = \mu_0^T \Sigma^{-1} \mu_0, so

E\left( \hat{k} \right) = \frac{c^T E(\bar{x})}{a} = \frac{k\, c^T \mu_0}{a} = \frac{k\, \mu_0^T \Sigma^{-1} \mu_0}{a}

since E(\bar{x}) = k \mu_0. Hence

E\left( \hat{k} \right) = k, \qquad (3.14)

showing that \hat{k} is an unbiased estimator.
Note that \mathrm{Var}(\bar{x}) = \frac{1}{n} \Sigma and therefore \mathrm{Var}\left( c^T \bar{x} \right) = \frac{1}{n} c^T \Sigma c, so we have

\mathrm{Var}\left( \hat{k} \right) = \frac{1}{n a^2}\, c^T \Sigma c
= \frac{1}{n} \frac{\mu_0^T \Sigma^{-1} \Sigma\, \Sigma^{-1} \mu_0}{\left( \mu_0^T \Sigma^{-1} \mu_0 \right)^2}
= \frac{1}{n\, \mu_0^T \Sigma^{-1} \mu_0}. \qquad (3.15)
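A Monte Carlo sketch of (3.13)-(3.15), with an illustrative k, \mu_0 and \Sigma: the simulated mean of \hat{k} should be close to k and its variance close to 1 / (n\, \mu_0^T \Sigma^{-1} \mu_0).

```python
# Check that k_hat is unbiased and has variance 1 / (n mu0^T Sigma^{-1} mu0).
import numpy as np

rng = np.random.default_rng(5)
n, reps = 30, 20000
k_true = 1.7
mu0 = np.array([1.0, 1.0, 1.0])                    # known direction (here a vector of ones)
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
Sinv = np.linalg.inv(Sigma)
denom = mu0 @ Sinv @ mu0                           # mu0^T Sigma^{-1} mu0

k_hats = np.empty(reps)
for r in range(reps):
    X = rng.multivariate_normal(k_true * mu0, Sigma, size=n)
    xbar = X.mean(axis=0)
    k_hats[r] = (mu0 @ Sinv @ xbar) / denom        # (3.13)

print(k_hats.mean(), k_true)                       # approximately unbiased
print(k_hats.var(), 1.0 / (n * denom))             # approximately (3.15)
```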
3.4.2 Linear restriction on \mu
We determine an estimator for \mu to satisfy a linear restriction

A \mu = b,

where A (m \times p) and b (m \times 1) are given constants and \Sigma is assumed to be known.
We write the restriction in vector form g(\mu) = 0 and form the Lagrangean

h(\mu; \lambda) = \ell(\mu) + 2 \lambda^T g(\mu),

where \lambda^T = (\lambda_1, \ldots, \lambda_m) is a vector of Lagrange multipliers (the factor 2 is inserted just for convenience). Then

h(\mu; \lambda) = \ell(\mu) + 2 \lambda^T (A \mu - b)
= n \left[ (\bar{x} - \mu)^T \Sigma^{-1} (\bar{x} - \mu) + 2 \lambda^T (A \mu - b) \right],

ignoring constant terms involving \Sigma.
Set \frac{d}{d\mu} h(\mu; \lambda) = 0, using results from Example Sheet 2:

-2 \Sigma^{-1} (\bar{x} - \mu) + 2 A^T \lambda = 0
\bar{x} - \mu = \Sigma A^T \lambda \qquad (3.16)

We use the constraint A \mu = b to evaluate the Lagrange multipliers \lambda: premultiplying by A,

A \bar{x} - b = A \Sigma A^T \lambda
\lambda = \left( A \Sigma A^T \right)^{-1} (A \bar{x} - b).

Substituting into (3.16),

\hat\mu = \bar{x} - \Sigma A^T \left( A \Sigma A^T \right)^{-1} (A \bar{x} - b). \qquad (3.17)
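A small sketch of the restricted estimator (3.17); the restriction A, b and the data are illustrative. The point to check is that A \hat\mu reproduces b exactly.

```python
# Restricted MLE of mu under A mu = b, following (3.17).
import numpy as np

rng = np.random.default_rng(6)
n = 100
Sigma = np.array([[1.0, 0.2, 0.1],
                  [0.2, 1.0, 0.3],
                  [0.1, 0.3, 1.0]])
A = np.array([[1.0, 1.0, 1.0]])        # one restriction: the means sum to 2
b = np.array([2.0])

X = rng.multivariate_normal(np.array([0.5, 0.5, 1.0]), Sigma, size=n)
xbar = X.mean(axis=0)

adj = Sigma @ A.T @ np.linalg.inv(A @ Sigma @ A.T) @ (A @ xbar - b)
mu_hat = xbar - adj                    # (3.17)
print(A @ mu_hat, b)                   # restriction satisfied (up to rounding)
```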
3.4.3 Covariance matrix proportional to a given matrix
We consider estimating k when \Sigma = k \Sigma_0, where \Sigma_0 is a given constant matrix. The likelihood (3.8) takes the form, when d = 0 (\hat\mu = \bar{x}),

\ell(k) = n \left[ \log|k \Sigma_0| + \mathrm{tr}\left( \tfrac{1}{k} \Sigma_0^{-1} S \right) \right]

plus constant terms (not involving k). Hence

\ell(k) = n \left[ p \log k + \tfrac{1}{k}\, \mathrm{tr}\left( \Sigma_0^{-1} S \right) \right] + \text{constant terms}

\frac{d\ell}{dk} = 0 \;\Longrightarrow\; \frac{p}{k} - \frac{1}{k^2}\, \mathrm{tr}\left( \Sigma_0^{-1} S \right) = 0.

Hence

\hat{k} = \frac{\mathrm{tr}\left( \Sigma_0^{-1} S \right)}{p}. \qquad (3.18)
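A short sketch of (3.18) with an illustrative \Sigma_0 and true k: for a large sample, \mathrm{tr}(\Sigma_0^{-1} S)/p should be close to the true k.

```python
# MLE of the proportionality constant k when Sigma = k Sigma_0, following (3.18).
import numpy as np

rng = np.random.default_rng(7)
n, p = 500, 3
k_true = 2.5
Sigma0 = np.array([[1.0, 0.3, 0.0],
                   [0.3, 1.0, 0.2],
                   [0.0, 0.2, 0.8]])

X = rng.multivariate_normal(np.zeros(p), k_true * Sigma0, size=n)
xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / n                    # MLE covariance (divisor n)

k_hat = np.trace(np.linalg.inv(Sigma0) @ S) / p      # (3.18)
print(k_hat, k_true)                                 # close for large n
```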