
STAT 542 Notes, Winter 2007; MDP

STAT 542: MULTIVARIATE STATISTICAL ANALYSIS

1. Random Vectors and Covariance Matrices.

1.1. Review of vectors and matrices. (The results are stated for
vectors and matrices with real entries but also hold for complex entries.)
An m × n matrix A ≡ {a_{ij}} is an array of mn numbers:

A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}.
This matrix represents the linear mapping (≡ linear transformation)

(1.1)   A : R^n → R^m,   x → Ax,

where x ∈ R^n is written as an n × 1 column vector and

Ax ≡ \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} ≡ \begin{pmatrix} \sum_{j=1}^n a_{1j}x_j \\ \vdots \\ \sum_{j=1}^n a_{mj}x_j \end{pmatrix} ∈ R^m.

The mapping (1.1) clearly satisfies the linearity property:

A(ax + by) = aAx + bAy.
Matrix addition: If A ≡ {aij } and B ≡ {bij } are m × n matrices, then
(A + B)ij = aij + bij .
Matrix multiplication: If A is m × n and B is n × p, then the matrix
product AB is the m × p matrix whose ij-th element is

(1.2)   (AB)_{ij} = \sum_{k=1}^n a_{ik} b_{kj}.

Then AB is the matrix of the composition R^p --B--> R^n --A--> R^m of the two
linear mappings determined by A and B [verify]:

(AB)x = A(Bx)   ∀ x ∈ R^p.
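A quick numerical check of this composition property (a minimal sketch in numpy; the dimensions and random matrices below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 3, 4, 2
A = rng.standard_normal((m, n))   # A : R^n -> R^m
B = rng.standard_normal((n, p))   # B : R^p -> R^n
x = rng.standard_normal(p)

# (AB)x equals A(Bx): the matrix product represents the composition of the maps
assert np.allclose((A @ B) @ x, A @ (B @ x))
```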


Transpose matrix: If A ≡ {a_{ij}} is m × n, its transpose is the n × m
matrix A' (sometimes denoted by A^T) whose ij-th element is a_{ji}. That is,
the m row vectors (n column vectors) of A are the m column vectors (n
row vectors) of A'. Note that [verify]

(1.3)   (A + B)' = A' + B';
(1.4)   (AB)' = B'A'   (A : m × n, B : n × p);
(1.5)   (A^{-1})' = (A')^{-1}   (A : n × n, nonsingular).

Rank of a matrix: The row (column) rank of a matrix A : m × n is the
dimension of the linear space spanned by its rows (columns). The rank of
A is the order r of the largest nonzero minor (= r × r subdeterminant)
of A. Then [verify]

row rank(A) ≤ min(m, n),
column rank(A) ≤ min(m, n),
rank(A) ≤ min(m, n),

row rank(A) = n − dim [row space(A)]^⊥,
column rank(A) = m − dim [column space(A)]^⊥,

row rank(A) = column rank(A)
            = rank(A) = rank(A')
            = rank(AA') = rank(A'A).

Furthermore, for A : m × n and B : n × p,

rank(AB) ≤ min(rank(A), rank(B)).

Inverse matrix: If A : n × n is a square matrix, its inverse A−1 (if it


exists) is the unique matrix that satisfies

AA−1 = A−1 A = I,

where I ≡ In is the n × n identity matrix diag(1, . . . , 1). If A−1 exists then


A is called nonsingular (or regular). The following are equivalent:


(a) A is nonsingular.
(b) The n columns of A are linearly independent (i.e., column rank(A) = n).
Equivalently, Ax ≠ 0 for every nonzero x ∈ R^n.
(c) The n rows of A are linearly independent (i.e., row rank(A) = n).
Equivalently, x'A ≠ 0 for every nonzero x ∈ R^n.
(d) The determinant |A| ≠ 0 (i.e., rank(A) = n). [Define det geometrically.]

Note that if A is nonsingular then A^{-1} is nonsingular and (A^{-1})^{-1} = A.

If A : m × m and C : n × n are nonsingular and B is m × n, then [verify]

rank(AB) = rank(B) = rank(BC).

If A : n × n and B : n × n are nonsingular then so is AB, and [verify]

(1.6)   (AB)^{-1} = B^{-1}A^{-1}.

If A ≡ diag(d_1, . . . , d_n) with all d_i ≠ 0 then A^{-1} = diag(d_1^{-1}, . . . , d_n^{-1}).

Trace: For a square matrix A ≡ {a_{ij}} : n × n, the trace of A is

(1.7)   tr(A) = \sum_{i=1}^n a_{ii},

the sum of the diagonal entries of A. Then

(1.8)   tr(aA + bB) = a tr(A) + b tr(B);
(1.9)   tr(AB) = tr(BA);   (Note: A : m × n, B : n × m)
(1.10)  tr(A') = tr(A).    (A : n × n)

Proof of (1.9):

tr(AB) = \sum_{i=1}^m (AB)_{ii} = \sum_{i=1}^m \sum_{k=1}^n a_{ik} b_{ki}
       = \sum_{k=1}^n \sum_{i=1}^m b_{ki} a_{ik} = \sum_{k=1}^n (BA)_{kk} = tr(BA).


Determinant: For a square matrix A ≡ {a_{ij}} : n × n, its determinant is

|A| = \sum_\pi \epsilon(\pi) \prod_{i=1}^n a_{i\pi(i)}
    = ± Volume(A([0, 1]^n)),

where \pi ranges over all n! permutations of 1, . . . , n and \epsilon(\pi) = ±1 according
to whether \pi is an even or odd permutation. Then

(1.11)  |AB| = |A| · |B|   (A, B : n × n);
(1.12)  |A^{-1}| = |A|^{-1};
(1.13)  |A'| = |A|;
(1.14)  |A| = \prod_{i=1}^n a_{ii}   if A is triangular (or diagonal).

Orthogonal matrix. An n × n matrix Γ is orthogonal if

(1.15)  ΓΓ' = I.

This is equivalent to the fact that the n row vectors of Γ form an orthonor-
mal basis for R^n. Note that (1.15) implies that Γ' = Γ^{-1}, hence also
Γ'Γ = I, which is equivalent to the fact that the n column vectors of Γ also
form an orthonormal basis for R^n.

Note that Γ preserves angles and lengths, i.e., preserves the usual inner
product and norm in R^n: for x, y ∈ R^n,

(Γx, Γy) ≡ (Γx)'(Γy) = x'Γ'Γy = x'y ≡ (x, y),

so

‖Γx‖^2 ≡ (Γx, Γx) = (x, x) ≡ ‖x‖^2.

In fact, any orthogonal transformation is a product of rotations and reflec-
tions. Also, from (1.13) and (1.15), |Γ|^2 = 1, so |Γ| = ±1.
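A small numerical illustration of these facts (a sketch; the orthogonal matrix is obtained from a QR factorization of a random matrix, a convenience not used in the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
# QR factorization of a random square matrix gives an orthogonal Q
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

x, y = rng.standard_normal(n), rng.standard_normal(n)
assert np.allclose(Q @ Q.T, np.eye(n))            # (1.15)
assert np.allclose((Q @ x) @ (Q @ y), x @ y)      # inner products preserved
assert np.isclose(abs(np.linalg.det(Q)), 1.0)     # |Γ| = ±1
```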


Complex numbers and matrices. For any complex number c ≡ a + ib ∈
C, let c̄ ≡ a − ib denote the complex conjugate of c. Note that the conjugate of c̄ is c, and

c c̄ = a^2 + b^2 ≡ |c|^2,
\overline{cd} = c̄ d̄.

For any complex matrix C ≡ {c_{ij}}, let C̄ = {c̄_{ij}} and define C* = C̄'. Note
that

(1.16)  (CD)* = D*C*.

The characteristic roots ≡ eigenvalues of the n × n matrix A are the n roots


l1 , . . . , ln of the polynomial equation

(1.17) |A − l I| = 0.

These roots may be real or complex; the complex roots occur in conjugate
pairs. Note that the eigenvalues of a triangular or diagonal matrix are just
its diagonal elements.
By (b) (for matrices with possibly complex entries), for each eigenvalue
l there exists some nonzero (possibly complex) vector u ∈ Cn s.t.

(A − l I)u = 0,
equivalently,
(1.18) Au = lu.

The vector u is called a characteristic vector ≡ eigenvector for the eigenvalue


l. Since any nonzero multiple cu is also an eigenvector for l, we will usually
normalize u to be a unit vector, i.e., ‖u‖^2 ≡ u*u = 1.
For example, if A is a diagonal matrix, say

A = diag(d_1, . . . , d_n) ≡ \begin{pmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & d_n \end{pmatrix},

then its eigenvalues are just d_1, . . . , d_n, with corresponding eigenvectors
u_1, . . . , u_n, where

(1.19)  u_i ≡ (0, . . . , 0, 1, 0, . . . , 0)'   (the 1 in the i-th position)

is the i-th unit coordinate vector.


Note, however, that in general, eigenvalues need not be distinct and
eigenvectors need not be unique. For example, if A is the identity matrix
I, then its eigenvalues are 1, . . . , 1 and every unit vector u ∈ Rn is an
eigenvector for the eigenvalue 1: Iu = 1 · u.
However, eigenvectors u, v associated with two distinct eigenvalues l,
m cannot be proportional: if u = cv then

lu = Au = cAv = cmv = mu,

which contradicts the assumption that l ≠ m.

Symmetric matrix. An n × n matrix S ≡ {s_{ij}} is symmetric if S = S',
i.e., if s_{ij} = s_{ji} ∀ i, j.

Lemma 1.1. Let S be a real symmetric n × n matrix.

(a) Each eigenvalue l of S is real and has a real eigenvector γ ∈ R^n.
(b) If l ≠ m are distinct eigenvalues of S with corresponding real eigenvec-
tors γ and ψ, then γ ⊥ ψ, i.e., γ'ψ = 0. Thus if all the eigenvalues of S
are distinct, each eigenvalue l has exactly one real unit eigenvector γ (up to sign).

Proof. (a) Let l be an eigenvalue of S with unit eigenvector u. Then

Su = lu  ⇒  u*Su = l u*u = l.

But S is real and symmetric, so S* = S, hence

\overline{u*Su} = (u*Su)* = u*S*u = u*Su.

Thus u*Su is real, hence l is real. Since S − l I is real, the existence of a
real eigenvector γ for l now follows from (b) on p.3.


(b) We have Sγ = lγ and Sψ = mψ, hence

lψ'γ = ψ'Sγ = (ψ'Sγ)' = γ'Sψ = mγ'ψ = mψ'γ,

so γ'ψ = 0 since l ≠ m.
¯

Proposition 1.2. Spectral decomposition of a real symmetric ma-
trix. Let S be a real symmetric n × n matrix with eigenvalues l_1, . . . , l_n
(necessarily real). Then there exists a real orthogonal matrix Γ such that

(1.20)  S = Γ D_l Γ',

where D_l = diag(l_1, . . . , l_n). Since SΓ = ΓD_l, the i-th column vector γ_i of
Γ is a real eigenvector for l_i.

Proof. For simplicity suppose that l_1, . . . , l_n are distinct. Let γ_1, . . . , γ_n be
the corresponding unique real unit eigenvectors (apply Lemma 1.1b). Since
γ_1, . . . , γ_n is an orthonormal basis for R^n, the matrix

(1.21)  Γ ≡ (γ_1, . . . , γ_n) : n × n

satisfies Γ'Γ = I, i.e., Γ is an orthogonal matrix. Since each γ_i is an
eigenvector for l_i, SΓ = ΓD_l [verify], which is equivalent to (1.20).

[The case where the eigenvalues are not distinct can be established by a
"perturbation" argument. Perturb S slightly so that its eigenvalues become
distinct (non-trivial) and apply the first case. Now use a limiting argument
based on the compactness of the set of all n × n orthogonal matrices.]  ¯
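The spectral decomposition (1.20) is easy to verify numerically (a sketch; note that numpy's eigh returns eigenvalues in ascending order, an ordering convention of the library, not of the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
S = B + B.T                      # a real symmetric matrix

l, Gamma = np.linalg.eigh(S)     # columns of Gamma are orthonormal eigenvectors
assert np.allclose(Gamma @ np.diag(l) @ Gamma.T, S)   # S = Γ D_l Γ'
assert np.allclose(Gamma.T @ Gamma, np.eye(4))        # Γ is orthogonal
```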

Lemma 1.3. If S is a real symmetric matrix with eigenvalues l_1, . . . , l_n, then

(1.22)  tr(S) = \sum_{i=1}^n l_i;
(1.23)  |S| = \prod_{i=1}^n l_i.

Proof. This is immediate from the spectral decomposition (1.20) of S.  ¯

Positive definite matrix. An n × n matrix S is positive semi-definite
(psd) (also written as S ≥ 0) if it is symmetric and its quadratic form is
nonnegative:

(1.24)  x'Sx ≥ 0   ∀ x ∈ R^n;

S is positive definite (pd) (also written as S > 0) if it is symmetric and its
quadratic form is positive:

(1.25)  x'Sx > 0   ∀ nonzero x ∈ R^n.

• The identity matrix is pd: x'Ix = ‖x‖^2 > 0 if x ≠ 0.
• A diagonal matrix diag(d_1, . . . , d_n) is psd (pd) iff each d_i ≥ 0 (> 0).
• If S : n × n is psd, then ASA' is psd for any A : m × n.
• If S : n × n is pd, then ASA' is pd for any A : m × n of full rank m ≤ n.
• AA' is psd for any A : m × n.
• AA' is pd for any A : m × n of full rank m ≤ n.
Note: This shows that the proper way to "square" a matrix A is to form
AA' (or A'A), not A^2, which need not even be symmetric.
• S pd ⇒ S has full rank ⇒ S^{-1} exists ⇒ S^{-1} ≡ (S^{-1})S(S^{-1})' is pd.

Lemma 1.4. (a) A symmetric n × n matrix S with eigenvalues l_1, . . . , l_n
is psd (pd) iff each l_i ≥ 0 (> 0). In particular, |S| ≥ 0 (> 0) if S is psd
(pd), so a pd matrix is nonsingular.
(b) Suppose S is pd with distinct eigenvalues l_1 > · · · > l_n > 0 and corre-
sponding unique real unit eigenvectors γ_1, . . . , γ_n. Then the set

(1.26)  E ≡ {x ∈ R^n | x'S^{-1}x = 1}

is the ellipsoid with principal axes √l_1 γ_1, . . . , √l_n γ_n.

Proof. (a) Apply the above results and the spectral decomposition (1.20).
(b) From (1.20), S = ΓD_lΓ' with Γ = (γ_1 · · · γ_n), so S^{-1} = ΓD_l^{-1}Γ' and

E = {x ∈ R^n | (Γ'x)' D_l^{-1} (Γ'x) = 1}
  = Γ{y ∈ R^n | y' D_l^{-1} y = 1}    (y = Γ'x)
  = Γ{ y ≡ (y_1, . . . , y_n)' | y_1^2/l_1 + · · · + y_n^2/l_n = 1 }
  ≡ ΓE_0.

But E_0 is the ellipsoid with principal axes √l_1 u_1, . . . , √l_n u_n (recall (1.19))
and Γu_i = γ_i, so E is the ellipsoid with principal axes √l_1 γ_1, . . . , √l_n γ_n.
¯

Square root of a pd matrix. Let S be an n × n pd matrix. Any n × n
matrix A such that AA' = S is called a square root of S, denoted by S^{1/2}.
From the spectral decomposition S = ΓD_lΓ', one version of S^{1/2} is

(1.27)  S^{1/2} = Γ diag(l_1^{1/2}, . . . , l_n^{1/2}) Γ' ≡ ΓD_l^{1/2}Γ';

this is a symmetric square root of S. Any square root S^{1/2} is nonsingular, for

(1.28)  |S^{1/2}| = |S|^{1/2} > 0.

Partitioned pd matrix. Partition the pd matrix S : n × n as

(1.29)  S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}   (S_{11} : n_1 × n_1, S_{22} : n_2 × n_2),

where n_1 + n_2 = n. Then both S_{11} and S_{22} are symmetric pd [why?],
S_{12} = S_{21}', and [verify!]

(1.30)  \begin{pmatrix} I_{n_1} & -S_{12}S_{22}^{-1} \\ 0 & I_{n_2} \end{pmatrix} \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix} \begin{pmatrix} I_{n_1} & 0 \\ -S_{22}^{-1}S_{21} & I_{n_2} \end{pmatrix} = \begin{pmatrix} S_{11·2} & 0 \\ 0 & S_{22} \end{pmatrix},

where

(1.31)  S_{11·2} ≡ S_{11} − S_{12}S_{22}^{-1}S_{21}

is necessarily pd [why?]. This in turn implies the two fundamental identities

(1.32)  \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix} = \begin{pmatrix} I_{n_1} & S_{12}S_{22}^{-1} \\ 0 & I_{n_2} \end{pmatrix} \begin{pmatrix} S_{11·2} & 0 \\ 0 & S_{22} \end{pmatrix} \begin{pmatrix} I_{n_1} & 0 \\ S_{22}^{-1}S_{21} & I_{n_2} \end{pmatrix},

(1.33)  \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}^{-1} = \begin{pmatrix} I_{n_1} & 0 \\ -S_{22}^{-1}S_{21} & I_{n_2} \end{pmatrix} \begin{pmatrix} S_{11·2}^{-1} & 0 \\ 0 & S_{22}^{-1} \end{pmatrix} \begin{pmatrix} I_{n_1} & -S_{12}S_{22}^{-1} \\ 0 & I_{n_2} \end{pmatrix}.
The following three consequences of (1.32) and (1.33) are immediate:


(1.34)  S is pd ⇐⇒ S_{11·2} and S_{22} are pd ⇐⇒ S_{22·1} and S_{11} are pd.

(1.35)  |S| = |S_{11·2}| · |S_{22}| = |S_{22·1}| · |S_{11}|.

For x ≡ \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} ∈ R^n, the quadratic form x'S^{-1}x can be decomposed as

(1.36)  x'S^{-1}x = (x_1 − S_{12}S_{22}^{-1}x_2)' S_{11·2}^{-1} (x_1 − S_{12}S_{22}^{-1}x_2) + x_2'S_{22}^{-1}x_2.

Exercise 1.5. Cholesky decompositions of a pd matrix. Use (1.32)
and induction on n to obtain an upper triangular square root U of S, i.e.,
S = UU'. Similarly, S has a lower triangular square root L, i.e. S = LL'.
Note: Both U ≡ {u_{ij}} and L ≡ {l_{ij}} are unique if the positivity conditions
u_{ii} > 0 ∀i and l_{ii} > 0 ∀i are imposed on their diagonal elements. To see
this for U, suppose that UU' = VV' where V is also an upper triangular
matrix with each v_{ii} > 0. Then U^{-1}V(U^{-1}V)' = I, so Γ ≡ U^{-1}V is both
upper triangular and orthogonal, hence Γ = diag(±1, . . . , ±1) =: D [why?]
Thus V = UD, and the positivity conditions imply that D = I.
¯

Projection matrix. An n × n matrix P is a projection matrix if it is
symmetric and idempotent: P^2 = P.

Lemma 1.6. P is a projection matrix iff it has the form

(1.37)  P = Γ \begin{pmatrix} I_m & 0 \\ 0 & 0 \end{pmatrix} Γ'

for some orthogonal matrix Γ : n × n and some m ≤ n. In this case,
rank(P) = m = tr(P).

Proof. Since P is symmetric, P = ΓD_lΓ' by its spectral decomposition.
But the idempotence of P implies that each l_i = 0 or 1. (A permutation of
the rows and columns, which is also an orthogonal transformation, may be
necessary to obtain the form (1.37).)
¯

Interpretation of (1.37): Partition Γ as

(1.38)  Γ = (Γ_1  Γ_2),   Γ_1 : n × m,   Γ_2 : n × (n − m),

so (1.37) becomes

(1.39)  P = Γ_1Γ_1'.

But Γ is orthogonal so Γ'Γ = I_n, hence

(1.40)  Γ'Γ ≡ \begin{pmatrix} Γ_1'Γ_1 & Γ_1'Γ_2 \\ Γ_2'Γ_1 & Γ_2'Γ_2 \end{pmatrix} = \begin{pmatrix} I_m & 0 \\ 0 & I_{n−m} \end{pmatrix}.

Thus from (1.39) and (1.40),

PΓ_1 = (Γ_1Γ_1')Γ_1 = Γ_1,
PΓ_2 = (Γ_1Γ_1')Γ_2 = 0.

This shows that P represents the linear transformation that projects R^n
orthogonally onto the column space of Γ_1, which has dimension m = tr(P).
Furthermore, I_n − P is also symmetric and idempotent [verify] with
rank(I_n − P) = n − m. In fact,

I_n − P = ΓΓ' − P = (Γ_1Γ_1' + Γ_2Γ_2') − Γ_1Γ_1' = Γ_2Γ_2',

so I_n − P represents the linear transformation that projects R^n orthogonally
onto the column space of Γ_2, which has dimension n − m = tr(I_n − P).
Note that the column spaces of Γ_1 and Γ_2 are perpendicular, since
Γ_1'Γ_2 = 0. Equivalently, P(I_n − P) = (I_n − P)P = 0, i.e., applying P and
I_n − P successively sends any x ∈ R^n to 0.
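A small numerical sketch of Lemma 1.6 and this interpretation (Γ_1 is taken as the Q factor of a random n × m matrix, an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 6, 2
Gamma1, _ = np.linalg.qr(rng.standard_normal((n, m)))   # n x m, orthonormal columns
P = Gamma1 @ Gamma1.T                                    # projects onto col(Gamma1)

assert np.allclose(P, P.T) and np.allclose(P @ P, P)     # symmetric, idempotent
assert np.isclose(np.trace(P), m)                        # tr(P) = rank(P) = m
assert np.allclose(P @ (np.eye(n) - P), 0)               # P(I - P) = 0
```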


1.2. Matrix exercises.

1. For S : p × p and U : p × q, with S > 0 (positive definite), show that

|S + UU'| = |S| · |I_q + U'S^{-1}U|,

where | · | denotes the determinant and I_q is the q × q identity matrix.

2. For S : p × p and a : p × 1 with S > 0, show that

a'(S + aa')^{-1}a = \frac{a'S^{-1}a}{1 + a'S^{-1}a}.

3. For S : p × p and T : p × p with S > 0 and T ≥ 0, show that

λ_i[T(S + T)^{-1}] = \frac{λ_i(TS^{-1})}{1 + λ_i(TS^{-1})},   i = 1, . . . , p,

where λ_1 ≥ · · · ≥ λ_p denote the ordered eigenvalues.

4. Let A > 0 and B > 0 be p × p matrices with A ≥ B. Partition A as

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}

and let A_{11·2} = A_{11} − A_{12}A_{22}^{-1}A_{21}. Partition B in the same way and similarly
define B_{11·2}. Show:
(i) A_{11} ≥ B_{11}.
(ii) B^{-1} ≥ A^{-1}.
(iii) A_{11·2} ≥ B_{11·2}.

5. For S : p × p with S > 0, partition S and S^{-1} as

S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix},   S^{-1} = \begin{pmatrix} S^{11} & S^{12} \\ S^{21} & S^{22} \end{pmatrix},

respectively. Show that S^{11} ≥ S_{11}^{-1}, and equality holds iff S_{12} = 0, or
equivalently, iff S^{12} = 0.


6. Now partition S and S^{-1} as

S = \begin{pmatrix} S_{11} & S_{12} & S_{13} \\ S_{21} & S_{22} & S_{23} \\ S_{31} & S_{32} & S_{33} \end{pmatrix} ≡ \begin{pmatrix} S_{(12)} & S_{(12)3} \\ S_{3(12)} & S_{33} \end{pmatrix},

S^{-1} = \begin{pmatrix} S^{11} & S^{12} & S^{13} \\ S^{21} & S^{22} & S^{23} \\ S^{31} & S^{32} & S^{33} \end{pmatrix} ≡ \begin{pmatrix} S^{(12)} & S^{(12)3} \\ S^{3(12)} & S^{33} \end{pmatrix}.

Then

S_{(12)·3} ≡ S_{(12)} − S_{(12)3}S_{33}^{-1}S_{3(12)}
          = \begin{pmatrix} S_{11} − S_{13}S_{33}^{-1}S_{31} & S_{12} − S_{13}S_{33}^{-1}S_{32} \\ S_{21} − S_{23}S_{33}^{-1}S_{31} & S_{22} − S_{23}S_{33}^{-1}S_{32} \end{pmatrix}
          ≡ \begin{pmatrix} S_{11·3} & S_{12·3} \\ S_{21·3} & S_{22·3} \end{pmatrix},

with similar relations holding for S^{(12)·3}. Note that

S^{(12)} = (S_{(12)·3})^{-1},   S_{(12)} = (S^{(12)·3})^{-1},

but in general

S^{11} ≠ (S_{11·2})^{-1},   S_{11} ≠ (S^{11·2})^{-1};

instead,

S^{11} = (S_{11·(23)})^{-1},   S_{11} = (S^{11·(23)})^{-1}.

Show:
(i) (S_{(12)·3})_{11·2} = S_{11·(23)}.
(ii) S_{11·2} = (S^{11·3})^{-1}.
(iii) S_{12·3}(S_{22·3})^{-1} = −(S^{11})^{-1}S^{12}.
(iv) S_{11} ≥ S_{11·2} ≥ S_{11·(23)}. When do the inequalities become equalities?
(v) S_{12·3}(S_{22·3})^{-1} = −(S^{11·4})^{-1}S^{12·4}   (for a 4 × 4 partitioning).

1.3. Random vectors and covariance matrices. Let X ≡ (X_1, . . . , X_n)'
be a rvtr in R^n. The expected value of X is the vector

E(X) ≡ \begin{pmatrix} E(X_1) \\ \vdots \\ E(X_n) \end{pmatrix},

which is the center of gravity of the probability distribution of X in R^n.
Note that expectation is linear: for rvtrs X, Y and constant matrices A, B,

(1.41)  E(AX + BY) = A E(X) + B E(Y).

Similarly, if Z ≡ {Z_{ij}} is a random matrix in R^{m×n}, E(Z) is also defined
component-wise:

E(Z) = \begin{pmatrix} E(Z_{11}) & \cdots & E(Z_{1n}) \\ \vdots & & \vdots \\ E(Z_{m1}) & \cdots & E(Z_{mn}) \end{pmatrix}.

Then for constant matrices A : k × m and B : n × p,

(1.42)  E(AZB) = A E(Z) B.

The covariance matrix of X (≡ the variance-covariance matrix) is

Cov(X) = E[(X − EX)(X − EX)']

       = \begin{pmatrix} Var(X_1) & Cov(X_1, X_2) & \cdots & Cov(X_1, X_n) \\ Cov(X_2, X_1) & Var(X_2) & \cdots & Cov(X_2, X_n) \\ \vdots & \vdots & & \vdots \\ Cov(X_n, X_1) & Cov(X_n, X_2) & \cdots & Var(X_n) \end{pmatrix}.


The following formulas are essential: for X : n × 1, A : m × n, a : n × 1,

(1.43)  Cov(X) = E(XX') − (EX)(EX)';
(1.44)  Cov(AX + b) = A Cov(X) A';
(1.45)  Var(a'X + b) = a' Cov(X) a.

Lemma 1.7. Let X ≡ (X_1, . . . , X_n)' be a random vector in R^n.

(a) Cov(X) is psd.
(b) Cov(X) is pd unless ∃ a nonzero a ≡ (a_1, . . . , a_n)' ∈ R^n s.t. the linear
combination
a'X ≡ a_1X_1 + · · · + a_nX_n = constant,
i.e., the support of X is contained in some hyperplane of dimension ≤ n − 1.

Proof. (a) This follows immediately from (1.45).
(b) If Cov(X) is not pd, then ∃ a nonzero a ∈ R^n s.t.

0 = a' Cov(X) a = Var(a'X).

But this implies that a'X = const.
¯

For rvtrs X : m × 1 and Y : n × 1, define

Cov(X, Y) = E[(X − EX)(Y − EY)']

          = \begin{pmatrix} Cov(X_1, Y_1) & Cov(X_1, Y_2) & \cdots & Cov(X_1, Y_n) \\ Cov(X_2, Y_1) & Cov(X_2, Y_2) & \cdots & Cov(X_2, Y_n) \\ \vdots & \vdots & & \vdots \\ Cov(X_m, Y_1) & Cov(X_m, Y_2) & \cdots & Cov(X_m, Y_n) \end{pmatrix}.

Clearly Cov(X, Y) = [Cov(Y, X)]'. Then [verify]

(1.46)  Cov(X ± Y) = Cov(X) + Cov(Y) ± Cov(X, Y) ± Cov(Y, X),

and [verify]

(1.47)  X ⊥⊥ Y ⇒ Cov(X, Y) = 0 ⇒ Cov(X ± Y) = Cov(X) + Cov(Y).


Variance of sample average (sample mean) of rvtrs: Let X_1, . . . , X_n
be i.i.d. rvtrs in R^p, each with mean vector µ and covariance matrix Σ. Set
X̄_n = (1/n)(X_1 + · · · + X_n). Then E(X̄_n) = µ and, by (1.47),

(1.48)  Cov(X̄_n) = (1/n^2) Cov(X_1 + · · · + X_n) = (1/n) Σ.

Exercise 1.8. Verify the Weak Law of Large Numbers (WLLN) for rvtrs:
X̄_n converges to µ in probability (X̄_n →_p µ), that is, for each ε > 0,
P[‖X̄_n − µ‖ ≤ ε] → 1 as n → ∞.

Example 1.9a. Equicorrelated random variables. Let X_1, . . . , X_n
be rvs with common mean µ and common variance σ^2. Suppose they are
equicorrelated, i.e., Cor(X_i, X_j) = ρ ∀ i ≠ j. Let

(1.49)  X̄_n = (1/n)(X_1 + · · · + X_n),   s_n^2 = \frac{1}{n−1}\sum_{i=1}^n (X_i − X̄_n)^2,

the sample mean and sample variance, respectively. Then

(1.50)  E(X̄_n) = µ   (so X̄_n is unbiased for µ);

Var(X̄_n) = (1/n^2) Var(X_1 + · · · + X_n)
         = (1/n^2)[nσ^2 + n(n − 1)ρσ^2]   [why?]
(1.51)   = (σ^2/n)[1 + (n − 1)ρ].

When X_1, . . . , X_n are uncorrelated (ρ = 0), in particular when they are
independent, then (1.51) reduces to σ^2/n, which → 0 as n → ∞. When
ρ ≠ 0, however, Var(X̄_n) → σ^2ρ ≠ 0, so the WLLN fails for equicorrelated
i.d. rvs. Also, (1.51) imposes the constraint

(1.52)  −\frac{1}{n−1} ≤ ρ ≤ 1.

Next, using (1.51),

E(s_n^2) = \frac{1}{n−1} E\Big[\sum_{i=1}^n X_i^2 − n(X̄_n)^2\Big]
         = \frac{1}{n−1}\Big\{ n(σ^2 + µ^2) − n\Big[\frac{σ^2}{n}(1 + (n − 1)ρ) + µ^2\Big]\Big\}
(1.53)   = (1 − ρ)σ^2.


Thus s_n^2 is unbiased for σ^2 if ρ = 0 but not otherwise.
¯

Example 1.9b. We now re-derive (1.51) and (1.53) via covariance matri-
ces, using properties (1.44) and (1.45). Set X = (X_1, . . . , X_n)', so

(1.54)  E(X) = µ e_n,   where e_n = (1, . . . , 1)' : n × 1,

(1.55)  Cov(X) = σ^2 \begin{pmatrix} 1 & ρ & \cdots & ρ \\ ρ & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & ρ \\ ρ & \cdots & ρ & 1 \end{pmatrix} ≡ σ^2[(1 − ρ)I_n + ρ e_ne_n'].

Then X̄_n = (1/n) e_n'X, so by (1.45),

Var(X̄_n) = (σ^2/n^2) e_n'[(1 − ρ)I_n + ρ e_ne_n']e_n
         = (σ^2/n^2)[(1 − ρ)n + ρn^2]    [since e_n'e_n = n]
         = (σ^2/n)[1 + (n − 1)ρ],

which agrees with (1.51).


To find E(s_n^2), write

\sum_{i=1}^n (X_i − X̄_n)^2 = \sum_{i=1}^n X_i^2 − n(X̄_n)^2
                           = X'X − (1/n)(e_n'X)^2
                           = X'X − (1/n)(X'e_n)(e_n'X)
                           ≡ X'(I_n − ẽ_nẽ_n')X
(1.56)                     ≡ X'QX,

where ẽ_n ≡ e_n/√n is a unit vector, P ≡ ẽ_nẽ_n' is the projection matrix of
rank 1 ≡ tr(ẽ_nẽ_n') that projects R^n orthogonally onto the 1-dimensional
subspace spanned by e_n, and Q ≡ I_n − ẽ_nẽ_n' is the projection matrix of
rank n − 1 ≡ tr Q that projects R^n orthogonally onto the (n − 1)-dimensional
subspace e_n^⊥ [draw figure]. Now complete the following exercise:

Exercise 1.10. Prove Lemma 1.11 below, and use it to show that

(1.57)  E(X'QX) = (n − 1)(1 − ρ)σ^2,

which is equivalent to (1.53).
¯

Lemma 1.11. Let X : n × 1 be a rvtr with E(X) = θ and Cov(X) = Σ.
Then for any n × n symmetric matrix A,

(1.58)  E(X'AX) = tr(AΣ) + θ'Aθ.

(This generalizes the relation E(X^2) = Var(X) + (E X)^2.)
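A quick Monte Carlo check of (1.58) (a sketch; the particular θ, Σ, and A below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
theta = np.array([1.0, -2.0, 0.5])
B = rng.standard_normal((n, n))
Sigma = B @ B.T + np.eye(n)          # a pd covariance matrix
A = np.diag([1.0, 2.0, 3.0])         # any symmetric A

X = rng.multivariate_normal(theta, Sigma, size=200_000)
lhs = np.mean(np.einsum('ij,jk,ik->i', X, A, X))   # estimate of E(X'AX)
rhs = np.trace(A @ Sigma) + theta @ A @ theta
print(lhs, rhs)                                    # agree up to simulation error
```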

Example 1.9c. Eqn. (1.53) also can be obtained from the properties of
the projection matrix Q. First note that [verify]

(1.59)  Qe_n = √n Qẽ_n = 0.

Define

(1.60)  Y ≡ (Y_1, . . . , Y_n)' = QX : n × 1,

so

(1.61)  E(Y) = Q E(X) = µ Qe_n = 0,

(1.62)  E(YY') = Cov(Y) = σ^2 Q[(1 − ρ)I_n + ρe_ne_n']Q = σ^2(1 − ρ)Q.

Thus, since Q is idempotent (Q^2 = Q),

E(X'QX) = E(Y'Y) = E[tr(YY')]
        = tr[E(YY')]
        = σ^2(1 − ρ) tr(Q)
        = σ^2(1 − ρ)(n − 1),

which again is equivalent to (1.53).
¯

Exercise 1.12. Show that Cov(X) ≡ σ^2[(1 − ρ)I_n + ρe_ne_n'] in (1.55) has
one eigenvalue σ^2[1 + (n − 1)ρ] with eigenvector e_n, and n − 1 eigenvalues
equal to σ^2(1 − ρ).  ¯
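A numerical check of Exercise 1.12 (a sketch with arbitrary illustrative values of n, σ^2, ρ):

```python
import numpy as np

n, sigma2, rho = 6, 2.0, 0.3
e = np.ones(n)
Cov = sigma2 * ((1 - rho) * np.eye(n) + rho * np.outer(e, e))

eigvals = np.sort(np.linalg.eigvalsh(Cov))[::-1]
# largest eigenvalue: sigma2 * (1 + (n-1)*rho); the other n-1 equal sigma2*(1-rho)
assert np.isclose(eigvals[0], sigma2 * (1 + (n - 1) * rho))
assert np.allclose(eigvals[1:], sigma2 * (1 - rho))
```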

Exercise 1.13. Suppose that Σ = Cov(X) : n × n. Show that the extreme
eigenvalues of Σ satisfy

λ_1(Σ) = \max_{‖a‖=1} Var(a'X),
λ_n(Σ) = \min_{‖a‖=1} Var(a'X).  ¯


2. The Multivariate Normal Distribution (MVND).

2.1. Definition and basic properties.


Consider a random vector X ≡ (X_1, . . . , X_p)' ∈ R^p, where X_1, . . . , X_p are
i.i.d. standard normal random variables, i.e., X_i ∼ N(0, 1), so E(X) = 0
and Cov(X) = I_p. The pdf of X (i.e., the joint pdf of X_1, . . . , X_p) is

(2.1)  f(x) = (2π)^{−p/2} e^{−(x_1^2 + ··· + x_p^2)/2} = (2π)^{−p/2} e^{−x'x/2},   x ∈ R^p.

For any nonsingular matrix A : p × p and any µ : p × 1 ∈ R^p, consider the
random vector Y := AX + µ. Since the Jacobian of this linear (actually,
affine) mapping is |∂Y/∂X| = |A|_+ > 0, the pdf of Y is

f(y) = (2π)^{−p/2} |A|_+^{−1} e^{−\frac{1}{2}(A^{−1}(y−µ))'(A^{−1}(y−µ))}
     = (2π)^{−p/2} |AA'|^{−1/2} e^{−\frac{1}{2}(y−µ)'(AA')^{−1}(y−µ)}
(2.2) = (2π)^{−p/2} |Σ|^{−1/2} e^{−\frac{1}{2}(y−µ)'Σ^{−1}(y−µ)},   y ∈ R^p,

where

E(Y) = A E(X) + µ = µ,
Cov(Y) = A Cov(X) A' = AA' ≡ Σ > 0.

Since the distribution of Y depends only on µ and Σ, we denote this distri-
bution by N_p(µ, Σ), the multivariate normal distribution (MVND) on R^p
with mean vector µ and covariance matrix Σ.
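This construction Y = AX + µ is also how one simulates from N_p(µ, Σ); a sketch that takes A to be the Cholesky factor of Σ (any square root of Σ would do; the particular µ and Σ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

A = np.linalg.cholesky(Sigma)                 # AA' = Sigma
X = rng.standard_normal((2, 100_000))         # i.i.d. N(0,1) entries
Y = (A @ X).T + mu                            # rows are draws from N_2(mu, Sigma)

print(Y.mean(axis=0))                         # ≈ mu
print(np.cov(Y, rowvar=False))                # ≈ Sigma
```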

Exercise 2.1. (a) Show that the moment generating function of X is

(2.3)  m_X(w) ≡ E(e^{w'X}) = e^{w'w/2}.

(b) Let Y = AX + µ where now A : q × p and µ ∈ R^q. Show that the mgf
of Y is

(2.4)  m_Y(w) ≡ E(e^{w'Y}) = e^{w'µ + w'Σw/2},

where Σ ≡ AA' = Cov(Y). Thus the distribution of Y ≡ AX + µ depends
only on µ and Σ even when A is singular and/or a non-square matrix, so
we may again write Y ∼ N_q(µ, Σ).


Lemma 2.1. Affine transformations preserve normality.
If Y ∼ N_q(µ, Σ), then for C : r × q and d : r × 1,

(2.5)  Z ≡ CY + d ∼ N_r(Cµ + d, CΣC').

Proof. Represent Y as AX + µ, so Z = (CA)X + (Cµ + d) is also an affine
transformation of X, hence also has an MVND with E(Z) = Cµ + d and
Cov(Z) = (CA)(CA)' = CΣC'.
¯

Lemma 2.2. Independence ⇐⇒ zero covariance.
Suppose that Y ∼ N_p(µ, Σ) and partition Y, µ, and Σ as

(2.6)  Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix},   µ = \begin{pmatrix} µ_1 \\ µ_2 \end{pmatrix},   Σ = \begin{pmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix}   (Y_i, µ_i : p_i × 1, Σ_{ij} : p_i × p_j),

where p_1 + p_2 = p. Then Y_1 ⊥⊥ Y_2 ⇐⇒ Σ_{12} = 0.

Proof. This follows from the pdf (2.2) or the mgf (2.4).
¯

Proposition 2.3. Marginal & conditional distributions are normal.
If Y ∼ N_p(µ, Σ) and Σ_{22} is pd then

(2.8)  Y_1 | Y_2 ∼ N_{p_1}(µ_1 + Σ_{12}Σ_{22}^{−1}(Y_2 − µ_2), Σ_{11·2}),
(2.9)  Y_2 ∼ N_{p_2}(µ_2, Σ_{22}).

Proof. Method 1: Assume also that Σ is nonsingular. By the quadratic
identity (1.36) applied with µ, y, and Σ partitioned as in (2.6),

(2.10)  (y − µ)'Σ^{−1}(y − µ)
        = [y_1 − µ_1 − Σ_{12}Σ_{22}^{−1}(y_2 − µ_2)]' Σ_{11·2}^{−1} (· · ·) + (y_2 − µ_2)'Σ_{22}^{−1}(· · ·).

Since also |Σ| = |Σ_{11·2}| |Σ_{22}|, the result follows from the pdf (2.2).


Method 2. By Lemma 2.1 and the block-triangular transformation in (1.30),

(2.11)  \begin{pmatrix} Y_1 − Σ_{12}Σ_{22}^{−1}Y_2 \\ Y_2 \end{pmatrix} = \begin{pmatrix} I_{p_1} & −Σ_{12}Σ_{22}^{−1} \\ 0 & I_{p_2} \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
        ∼ N_{p_1+p_2}\left( \begin{pmatrix} µ_1 − Σ_{12}Σ_{22}^{−1}µ_2 \\ µ_2 \end{pmatrix}, \begin{pmatrix} Σ_{11·2} & 0 \\ 0 & Σ_{22} \end{pmatrix} \right).

Thus by Lemma 2.1 for C = (I_{p_1}  0_{p_1×p_2}) and (0_{p_2×p_1}  I_{p_2}), respectively,

Y_1 − Σ_{12}Σ_{22}^{−1}Y_2 ∼ N_{p_1}(µ_1 − Σ_{12}Σ_{22}^{−1}µ_2, Σ_{11·2}),
Y_2 ∼ N_{p_2}(µ_2, Σ_{22}),

which yields (2.9). Also Y_1 − Σ_{12}Σ_{22}^{−1}Y_2 ⊥⊥ Y_2 by (2.11) and Lemma 2.2, so

(2.12)  Y_1 − Σ_{12}Σ_{22}^{−1}Y_2 | Y_2 ∼ N_{p_1}(µ_1 − Σ_{12}Σ_{22}^{−1}µ_2, Σ_{11·2}),

which yields (2.8).
¯
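A simulation sketch of the conditional structure in (2.8) and (2.11) (the particular µ and Σ are arbitrary illustrative choices): the residual Y_1 − Σ_{12}Σ_{22}^{−1}Y_2 should have covariance Σ_{11·2} and be uncorrelated with Y_2.

```python
import numpy as np

rng = np.random.default_rng(7)
mu = np.zeros(3)
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
p1 = 1   # Y1 = first coordinate, Y2 = remaining two

S11, S12 = Sigma[:p1, :p1], Sigma[:p1, p1:]
S21, S22 = Sigma[p1:, :p1], Sigma[p1:, p1:]
B = S12 @ np.linalg.inv(S22)             # regression coefficients Σ12 Σ22^{-1}
S11_2 = S11 - B @ S21                    # conditional covariance Σ11·2

Y = rng.multivariate_normal(mu, Sigma, size=200_000)
resid = Y[:, :p1] - Y[:, p1:] @ B.T      # Y1 - Σ12 Σ22^{-1} Y2
print(np.cov(resid.T), S11_2)            # residual variance ≈ Σ11·2
print(resid.T @ Y[:, p1:] / len(Y))      # ≈ 0: residual uncorrelated with Y2
```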

2.2. The MVND and the chi-square distribution.

The chi-square distribution χ^2_n with n degrees of freedom (df) can be defined
as the distribution of

Z_1^2 + · · · + Z_n^2 ≡ Z'Z ≡ ‖Z‖^2,

where Z ≡ (Z_1, . . . , Z_n)' ∼ N_n(0, I_n). (That is, Z_1, . . . , Z_n are i.i.d. stan-
dard N(0, 1) rvs.) Recall that

(2.13)  χ^2_n ∼ Gamma(α = n/2, λ = 1/2),
(2.14)  E(χ^2_n) = n,
(2.15)  Var(χ^2_n) = 2n.

Now consider X ∼ N_n(µ, Σ) with Σ pd. Then

(2.16)  Z ≡ Σ^{−1/2}(X − µ) ∼ N_n(0, I_n),
(2.17)  Z'Z = (X − µ)'Σ^{−1}(X − µ) ∼ χ^2_n.

Suppose, however, that we omit Σ^{−1} in (2.17) and seek the distribution
of (X − µ)'(X − µ). Then this will not have a chi-square distribution in
general. Instead, by the spectral decomposition Σ = ΓD_λΓ', (2.16) yields

(2.18)  (X − µ)'(X − µ) = Z'ΣZ = (Γ'Z)'D_λ(Γ'Z) ≡ V'D_λV = λ_1V_1^2 + · · · + λ_nV_n^2,

where λ_1, . . . , λ_n are the eigenvalues of Σ and V ≡ Γ'Z ∼ N_n(0, I_n). Thus
the distribution of (X − µ)'(X − µ) is a positive linear combination of in-
dependent χ^2_1 rvs, which is not (proportional to) a χ^2_n rv. [Check via mgfs!]

Lemma 2.5. Quadratic forms and projection matrices.
Let X ∼ N_n(ξ, σ^2I_n) and let P be an n × n projection matrix with rank(P) =
tr(P) ≡ m. Then the quadratic form determined by X − ξ and P satisfies

(2.23)  (X − ξ)'P(X − ξ) ∼ σ^2χ^2_m.

Proof. By Lemma 1.6, there exists an orthogonal matrix Γ : n × n s.t.

P = Γ \begin{pmatrix} I_m & 0 \\ 0 & 0 \end{pmatrix} Γ'.

Then Y ≡ Γ'(X − ξ) ∼ N_n(0, σ^2I_n), so with Y = (Y_1, . . . , Y_n)',

(X − ξ)'P(X − ξ) = Y' \begin{pmatrix} I_m & 0 \\ 0 & 0 \end{pmatrix} Y = Y_1^2 + · · · + Y_m^2 ∼ σ^2χ^2_m.  ¯


2.3. The noncentral chi-square distribution.

Extend the result (2.17) to (2.30) as follows: First let Z ≡ (Z_1, . . . , Z_n)' ∼
N_n(ξ, I_n), where ξ ≡ (ξ_1, . . . , ξ_n)' ∈ R^n. The distribution of

Z_1^2 + · · · + Z_n^2 ≡ Z'Z ≡ ‖Z‖^2

is called the noncentral chi-square distribution with n degrees of freedom
(df) and noncentrality parameter ‖ξ‖^2, denoted by χ^2_n(‖ξ‖^2). Note that
Z_1, . . . , Z_n are independent, each with variance 1, but now E(Z_i) = ξ_i.

To show that the distribution of ‖Z‖^2 depends on ξ only through its
(squared) length ‖ξ‖^2, choose an orthogonal (rotation) matrix Γ : n × n
such that Γξ = (‖ξ‖, 0, . . . , 0)', i.e., Γ rotates ξ into (‖ξ‖, 0, . . . , 0)' (e.g., let
the first row of Γ be ξ'/‖ξ‖ and the remaining n − 1 rows be any orthonormal
basis for (span ξ)^⊥), and set

Y = ΓZ ∼ N_n(Γξ, ΓΓ') = N_n((‖ξ‖, 0, . . . , 0)', I_n).

Then the desired result follows since

‖Z‖^2 = ‖Y‖^2 ≡ Y_1^2 + Y_2^2 + · · · + Y_n^2
      ∼ [N_1(‖ξ‖, 1)]^2 + [N_1(0, 1)]^2 + · · · + [N_1(0, 1)]^2
(2.24) ≡ χ^2_1(‖ξ‖^2) + χ^2_1 + · · · + χ^2_1 ≡ χ^2_1(‖ξ‖^2) + χ^2_{n−1},

where the chi-square variates in each line are mutually independent.



Let V ≡ Y_1^2 ∼ χ^2_1(δ) ∼ [N_1(√δ, 1)]^2, where δ = ‖ξ‖^2. We find the pdf
of V as follows:

f_V(v) = \frac{d}{dv} P[−√v ≤ Y_1 ≤ √v]
       = \frac{d}{dv} \frac{1}{\sqrt{2π}} \int_{−√v}^{√v} e^{−(t−√δ)^2/2}\, dt
       = \frac{1}{\sqrt{2π}} e^{−δ/2} \frac{d}{dv} \int_{−√v}^{√v} e^{t√δ}\, e^{−t^2/2}\, dt
       = \frac{1}{\sqrt{2π}} e^{−δ/2} \frac{d}{dv} \int_{−√v}^{√v} \sum_{k=0}^∞ \frac{(t√δ)^k}{k!}\, e^{−t^2/2}\, dt
       = \frac{1}{\sqrt{2π}} e^{−δ/2} \sum_{k=0}^∞ \frac{δ^{k/2}}{k!} \frac{d}{dv} \int_{−√v}^{√v} t^k e^{−t^2/2}\, dt
       = \frac{1}{\sqrt{2π}} e^{−δ/2} \sum_{k=0}^∞ \frac{δ^k}{(2k)!} \frac{d}{dv} \int_{−√v}^{√v} t^{2k} e^{−t^2/2}\, dt   [why?]
       = \frac{1}{\sqrt{2π}} e^{−δ/2} \sum_{k=0}^∞ \frac{δ^k}{(2k)!}\, v^{k−1/2} e^{−v/2}   [verify]
(2.25) = \sum_{k=0}^∞ \underbrace{e^{−δ/2}\frac{(δ/2)^k}{k!}}_{\text{Poisson}(δ/2)\ \text{weights}} \cdot \underbrace{\frac{v^{\frac{1+2k}{2}−1} e^{−v/2}}{2^{\frac{1+2k}{2}} Γ(\frac{1+2k}{2})}}_{\text{pdf of }χ^2_{1+2k}} \cdot c_k,

where

c_k = \frac{2^k\, k!\, 2^{\frac{1+2k}{2}}\, Γ(\frac{1+2k}{2})}{(2k)!\, \sqrt{2π}} = 1

by the Legendre duplication formula for the Gamma function. Thus we
have represented the pdf of a χ^2_1(δ) rv as a mixture (weighted average) of
central chi-square pdfs with Poisson weights. This can be written as follows:

(2.26)  χ^2_1(δ) | K ∼ χ^2_{1+2K},   where K ∼ Poisson(δ/2).

Thus by (2.24) this implies that Z'Z ≡ ‖Z‖^2 ∼ χ^2_n(δ) satisfies

(2.27)  χ^2_n(δ) | K ∼ χ^2_{n+2K},   where K ∼ Poisson(δ/2).

That is, the pdf of a noncentral chi-square rv χ^2_n(δ) is a Poisson(δ/2)-
mixture of the pdfs of central chi-square rvs with n + 2k df, k = 0, 1, . . ..

The representation (2.27) can be used to obtain the mean and variance
of χ2n (δ):


E[χ^2_n(δ)] = E{E[χ^2_{n+2K} | K]}
            = E(n + 2K)
            = n + 2(δ/2)
(2.28)      = n + δ;

Var[χ^2_n(δ)] = E[Var(χ^2_{n+2K} | K)] + Var[E(χ^2_{n+2K} | K)]
              = E[2(n + 2K)] + Var(n + 2K)
              = [2n + 4(δ/2)] + 4(δ/2)
(2.29)        = 2n + 4δ.
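The mixture representation (2.27) and the moments (2.28)-(2.29) can be checked by simulation (a sketch; numpy's noncentral_chisquare sampler is used only as a reference, and n, δ are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(8)
n, delta, N = 4, 3.0, 500_000

# sample via the Poisson mixture (2.27)
K = rng.poisson(delta / 2, size=N)
mix = rng.chisquare(n + 2 * K)

direct = rng.noncentral_chisquare(n, delta, size=N)
print(mix.mean(), direct.mean(), n + delta)          # (2.28)
print(mix.var(), direct.var(), 2 * n + 4 * delta)    # (2.29)
```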

Exercise 2.6. Show that the noncentral chi-square distribution χ2n (δ) is
stochastically increasing in both n and δ.
¯

Next, consider X ∼ N_n(µ, Σ) with a general pd Σ. Then

(2.30)  X'Σ^{−1}X = (Σ^{−1/2}X)'(Σ^{−1/2}X) ∼ χ^2_n(µ'Σ^{−1}µ),

since

Z ≡ Σ^{−1/2}X ∼ N_n(Σ^{−1/2}µ, I_n)

and

‖Σ^{−1/2}µ‖^2 = µ'Σ^{−1}µ.

Note that by Exercise 2.6, the distribution of X'Σ^{−1}X in (2.30) is stochas-
tically increasing in n and µ'Σ^{−1}µ.

Finally, let Y ∼ N_n(ξ, σ^2I_n) and let P be a projection matrix with
rank(P) = m. Then P = Γ_1Γ_1' where Γ_1'Γ_1 = I_m (cf. (2.20) - (2.22)), so

‖PY‖^2 = ‖Γ_1Γ_1'Y‖^2 = (Γ_1Γ_1'Y)'(Γ_1Γ_1'Y) = Y'Γ_1Γ_1'Y = ‖Γ_1'Y‖^2.

But

Γ_1'Y ∼ N_m(Γ_1'ξ, σ^2Γ_1'Γ_1) = N_m(Γ_1'ξ, σ^2I_m),

so by (2.30) with X = Γ_1'Y, µ = Γ_1'ξ, and Σ = σ^2I_m,

\frac{‖PY‖^2}{σ^2} = \frac{(Γ_1'Y)'(Γ_1'Y)}{σ^2} ∼ χ^2_m\Big(\frac{ξ'Γ_1Γ_1'ξ}{σ^2}\Big) = χ^2_m\Big(\frac{‖Pξ‖^2}{σ^2}\Big).

Thus

(2.31)  ‖PY‖^2 ∼ σ^2χ^2_m\Big(\frac{‖Pξ‖^2}{σ^2}\Big).

2.4. Joint pdf of a random sample from the MVND N_p(µ, Σ).

Let X_1, . . . , X_n be an i.i.d. random sample from N_p(µ, Σ). Assume that Σ
is positive definite (pd) so that each X_i has pdf given by (2.2). Thus the
joint pdf of X_1, . . . , X_n is

f(x_1, . . . , x_n) = \prod_{i=1}^n \frac{1}{(2π)^{p/2}|Σ|^{1/2}}\, e^{−\frac{1}{2}(x_i−µ)'Σ^{−1}(x_i−µ)}
                  = \frac{1}{(2π)^{np/2}|Σ|^{n/2}}\, e^{−\frac{1}{2}\sum_{i=1}^n (x_i−µ)'Σ^{−1}(x_i−µ)}
                  = \frac{1}{(2π)^{np/2}|Σ|^{n/2}}\, e^{−\frac{1}{2}\mathrm{tr}[Σ^{−1}\sum_{i=1}^n (x_i−µ)(x_i−µ)']}
(2.32)            = \frac{1}{(2π)^{np/2}|Σ|^{n/2}}\, e^{−\frac{n}{2}(x̄−µ)'Σ^{−1}(x̄−µ) − \frac{1}{2}\mathrm{tr}(Σ^{−1}s)},

or alternatively,

(2.33)            = \frac{e^{−\frac{n}{2}µ'Σ^{−1}µ}}{(2π)^{np/2}|Σ|^{n/2}}\, e^{n x̄'Σ^{−1}µ − \frac{1}{2}\mathrm{tr}(Σ^{−1}t)},

where

X̄ = \frac{1}{n}\sum_{i=1}^n X_i,   S = \sum_{i=1}^n (X_i − X̄)(X_i − X̄)',   T = \sum_{i=1}^n X_iX_i'.

It follows from (2.32) and (2.33) that (X̄, S) and (X̄, T) are equivalent
representations of the minimal sufficient statistic for (µ, Σ). Also from
(2.33), with no further restrictions on (µ, Σ), this MVN statistical model
constitutes a (p + p(p+1)/2)-dimensional full exponential family with natural
parameter (Σ^{−1}µ, Σ^{−1}).


3. The Wishart Distribution.

3.1. Definition and basic properties.


Let X_1, . . . , X_n be an i.i.d. random sample from N_p(0, Σ) and set

X = (X_1, . . . , X_n) : p × n,
S = XX' = \sum_{i=1}^n X_iX_i' : p × p.

The distribution of S is called the p-variate (central) Wishart distribution
with n degrees of freedom and scale matrix Σ, denoted by W_p(n, Σ).
¯

Clearly S is a random symmetric positive semi-definite matrix with
E(S) = nΣ. When p = 1 and Σ = σ^2, W_1(n, σ^2) = σ^2χ^2_n.
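A simulation sketch of this definition (the Σ below is an arbitrary illustrative choice; the average of many simulated S matrices should be close to nΣ):

```python
import numpy as np

rng = np.random.default_rng(9)
p, n, reps = 2, 10, 20_000
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
A = np.linalg.cholesky(Sigma)

S_sum = np.zeros((p, p))
for _ in range(reps):
    X = A @ rng.standard_normal((p, n))   # columns i.i.d. N_p(0, Sigma)
    S_sum += X @ X.T                      # S = X X' ~ W_p(n, Sigma)

print(S_sum / reps)   # ≈ n * Sigma
print(n * Sigma)
```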

Lemma 3.1. Preservation under linear transformation. For A : q × p,

(3.1)  ASA' ∼ W_q(n, AΣA').

In particular, for a : p × 1,

(3.2)  a'Sa ∼ (a'Σa) · χ^2_n.

Lemma 3.2. Nonsingularity ≡ positive-definiteness of S ∼ W_p(n, Σ).
S is positive definite with probability one ⇐⇒ Σ is pd and n ≥ p.

Proof. (⇒): Recall that S ∼ XX' with X : p × n. If n < p then

rank(S) = rank(X) ≤ min(p, n) = n < p,

so S is singular with probability one, hence not positive definite. If Σ is not
pd then ∃ a : p × 1, a ≠ 0, s.t. a'Σa = 0. Thus by (3.2),

a'Sa ∼ (a'Σa) · χ^2_n = 0,

so S is singular w.pr.1.

(⇐) Method 1 (Stein; Eaton and Perlman (1973) Ann. Statist.) Assume
that Σ is pd and n ≥ p. Since

S = XX' = \sum_{i=1}^p X_iX_i' + \sum_{i=p+1}^n X_iX_i',

it suffices to show that \sum_{i=1}^p X_iX_i' is pd w. pr. 1. Thus we can take n = p,
so X : p × p is a square matrix. Then |S| = |X|^2, so it suffices to show that
X itself is nonsingular w.pr.1. But

{X singular} = \bigcup_{i=1}^p \{X_i ∈ S_i ≡ span\{X_j | j ≠ i\}\},

so

Pr[X singular] ≤ \sum_{i=1}^p Pr[X_i ∈ S_i]
              = \sum_{i=1}^p E\{Pr[X_i ∈ S_i | X_j, j ≠ i]\} = 0,

since dim(S_i) < p and the distribution of X_i ∼ N_p(0, Σ) is absolutely
continuous w.r.to Lebesgue measure on R^p. Thus Pr[X nonsingular] = 1.

(⇐) Method 2 (Okamoto (1973) Ann. Statist.) Apply:

Lemma 3.3 (Okamoto). Let Z ≡ (Z_1, . . . , Z_k)' ∈ R^k be a random vector
with a pdf that is absolutely continuous w.r.to Lebesgue measure on R^k. Let
g(z) ≡ g(z_1, . . . , z_k) be a nontrivial polynomial (i.e., g ≢ 0). Then

(3.3)  Pr[g(Z) = 0] = 0.

Proof. (sketch) Use induction on k. The result is true for k = 1 since
g can have only finitely many roots. Now assume the result is true for
k − 1 and extend to k by Fubini's Theorem (equivalently, by conditioning
on Z_1, . . . , Z_{k−1}).
¯

Proposition 3.4. Let X : p × n be a random matrix with a pdf that is


absolutely continuous w.r.to Lebesgue measure on Rp×n . If n ≥ p then

(3.4) Pr[ rank(X) = p ] = 1,


which implies that

(3.5)  Pr[ S ≡ XX' is positive definite ] = 1.

Proof. Without loss of generality (wlog) assume that p ≤ n and partition


X as (X1 , X2 ) with X1 : p × p. Since rank(X1 ) < p iff |X1 | = 0, and since
the determinant |X1 | ≡ g(X1 ) is a nontrivial polynomial,

Pr[ rank(X1 ) = p ] = 1

by Lemma 3.3. But rank(X1 ) = p ⇒ rank(X) = p, so (3.4) holds.


¯

Okamoto’s Lemma also yields the following important result:

Proposition 3.5. Let l_1(S) ≥ · · · ≥ l_p(S) denote the eigenvalues (neces-
sarily real) of S ≡ XX'. Under the assumptions of Proposition 3.4,

(3.6)  Pr[ l_1(S) > · · · > l_p(S) > 0 ] = 1.

Proof. (sketch) The eigenvalues of S ≡ XX' are the roots of the nontrivial
polynomial h(l) ≡ |XX' − l I_p|. These roots are distinct iff the discriminant
of h does not vanish. Since the discriminant is itself a nontrivial polynomial of
the coefficients of the polynomial h, hence a nontrivial polynomial of the
elements of X, (3.6) follows from Okamoto's Lemma.  ¯

Lemma 3.6. Additivity: If S1 ⊥


⊥ S2 with Si ∼ Wp (ni , Σ), then

(3.7) S1 + S2 ∼ Wp (n1 + n2 , Σ).


3.2. Covariance matrices of Kronecker product form.


If X1 , . . . , Xn are independent rvtrs each with covariance matrix Σ : p × p,
then Cov(X) = Σ ⊗ In , a Kronecker product. We now determine how a
covariance matrix of the general Kronecker product form Cov(X) = Σ ⊗ Λ
transforms under a linear transformation AXB (see Proposition 3.9).

The Kronecker product of the p × q matrix A and the m × n matrix B
is the pm × qn matrix

A ⊗ B := \begin{pmatrix} Ab_{11} & \cdots & Ab_{1n} \\ \vdots & & \vdots \\ Ab_{m1} & \cdots & Ab_{mn} \end{pmatrix}.

(i) A ⊗ B is bilinear:

(α_1A_1 + α_2A_2) ⊗ B = α_1(A_1 ⊗ B) + α_2(A_2 ⊗ B),
A ⊗ (β_1B_1 + β_2B_2) = β_1(A ⊗ B_1) + β_2(A ⊗ B_2).

(ii) (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)   (A : p × q, B : m × n, C : q × r, D : n × s).

(iii) (A ⊗ B)' = A' ⊗ B';
      A = A', B = B' ⟹ A ⊗ B = (A ⊗ B)'.

(iv) If Γ : p × p and Ψ : n × n are orthogonal matrices, then Γ ⊗ Ψ : pn × pn
is orthogonal. [apply (ii) and (iii)]

(v) If A : p × p and B : n × n are real symmetric matrices with eigenvalues
α_1, . . . , α_p and β_1, . . . , β_n, respectively, then A ⊗ B : pn × pn is also real
and symmetric with eigenvalues {α_iβ_j | i = 1, . . . , p, j = 1, . . . , n}.

Proof. Write the spectral decompositions of A and B as

A = ΓD_αΓ',   B = ΨD_βΨ',


respectively, where D_α = diag(α_1, . . . , α_p) and D_β = diag(β_1, . . . , β_n).
Then

A ⊗ B = (ΓD_αΓ') ⊗ (ΨD_βΨ')
(3.8)  = (Γ ⊗ Ψ)(D_α ⊗ D_β)(Γ ⊗ Ψ)'

by (ii) and (iii). Since Γ ⊗ Ψ is orthogonal and D_α ⊗ D_β is diagonal with
diagonal entries {α_iβ_j | i = 1, . . . , p, j = 1, . . . , n}, (3.8) is a spectral
decomposition of the real symmetric matrix A ⊗ B, so the result follows.  ¯

(vi)  A psd, B psd ⟹ A ⊗ B psd;
      A pd, B pd ⟹ A ⊗ B pd.   [apply (3.8)]
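A quick numerical check of property (v) (a sketch; numpy's np.kron places B-scaled blocks in the opposite order from the block convention displayed above, but the set of eigenvalues {α_iβ_j} is the same either way):

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((3, 3)); A = A + A.T     # real symmetric
B = rng.standard_normal((2, 2)); B = B + B.T     # real symmetric

eig_kron = np.sort(np.linalg.eigvalsh(np.kron(A, B)))
eig_prod = np.sort(np.outer(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)).ravel())
assert np.allclose(eig_kron, eig_prod)           # eigenvalues are {alpha_i * beta_j}
```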

Let X ≡ (X_1, . . . , X_n) : p × n be a random matrix. By convention we shall
define the covariance matrix Cov(X) to be the covariance matrix of the
pn × 1 column vector X̃ formed by "stacking" the column vectors of X:

Cov(X) := Cov(X̃) ≡ Cov\begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix} ≡ \begin{pmatrix} Cov(X_1) & \cdots & Cov(X_1, X_n) \\ \vdots & & \vdots \\ Cov(X_n, X_1) & \cdots & Cov(X_n) \end{pmatrix}.

Lemma 3.7. Let X = {X_{ij}}, Σ = {σ_{ii'}}, Λ = {λ_{jj'}}. Then

Cov(X) = Σ ⊗ Λ ⇐⇒ Cov(X_{ij}, X_{i'j'}) = σ_{ii'}λ_{jj'}

for all i, i' = 1, . . . , p and all j, j' = 1, . . . , n. [straightforward - verify]
¯

Lemma 3.8. Cov(X) = Σ ⊗ Λ ⇐⇒ Cov(X') = Λ ⊗ Σ.

Proof. Set U = X', so U_{ij} = X_{ji}. Then

Cov(U_{ij}, U_{i'j'}) = Cov(X_{ji}, X_{j'i'}) = σ_{jj'}λ_{ii'},

hence Cov(X') = Cov(U) = Λ ⊗ Σ by Lemma 3.7.
¯


Proposition 3.9. If Cov(X) = Σ ⊗ Λ then

(3.9)  Cov(AXB) = (AΣA') ⊗ (B'ΛB)   (A : q × p, X : p × n, B : n × m).

Thus if X ∼ N_{p×n}(ζ, Σ ⊗ Λ) then

(3.10)  AXB ∼ N_{q×m}(AζB, (AΣA') ⊗ (B'ΛB)).

Proof. (a) Because AX = (AX_1, . . . , AX_n) it follows that

(AX)~ = (A ⊗ I_n)X̃,

so

Cov(AX) ≡ Cov((AX)~) = (A ⊗ I_n) Cov(X̃) (A ⊗ I_n)'
        = (A ⊗ I_n)(Σ ⊗ Λ)(A ⊗ I_n)'
        = (AΣA') ⊗ Λ   [by (ii)].

(b) Next,

Cov(X') = Λ ⊗ Σ   [Lemma 3.8],

so

Cov(B'X') = (B'ΛB) ⊗ Σ   [by (a)],

hence

Cov(XB) ≡ Cov((B'X')') = Σ ⊗ (B'ΛB)   [Lemma 3.8].

—————————————

Looking ahead: Our goal will be to determine the joint distribution of the
matrices (S_{11·2}, S_{12}, S_{22}) that arise from a partitioned Wishart matrix S.
In §3.4 we will see that the conditional distribution of S_{12} | S_{22} follows a
multivariate normal linear model (MNLM) of the form (3.14) in §3.3, whose
covariance structure has Kronecker product form. Therefore we will first
study this MNLM and determine the joint distribution of its MLEs (β̂, Σ̂)
given by (3.15) and (3.16). This will readily yield the joint distribution of
(S_{11·2}, S_{12}, S_{22}), which in turn will have several interesting consequences,
including the evaluation of E(S^{−1}) and the distribution of Hotelling's T^2
statistic X̄_n'S^{−1}X̄_n.


3.3. The multivariate linear model.

The standard univariate linear model consists of a series X ≡ (X_1, . . . , X_n)
of uncorrelated univariate observations with common variance σ^2 > 0 such
that E(X) lies in a specified linear subspace L ⊂ R^n with dim(L) = q < n.
If Z : q × n is any fixed matrix whose rows span L then

(3.11)  L = {βZ | β : 1 × q ∈ R^q},

so this linear model can be expressed as follows:

(3.12)  E(X) = βZ,   β : 1 × q,
        Cov(X) = σ^2I_n,   σ^2 > 0.

In the standard multivariate linear model, X ≡ (X_1, . . . , X_n) : p × n
is a series of uncorrelated p-variate observations with common covariance
matrix Σ > 0 such that each row of E(X) lies in the specified linear subspace
L ⊂ R^n. This linear model can be expressed as follows:

(3.13)  E(X) = βZ,   β : p × q,
        Cov(X) = Σ ⊗ I_n,   Σ > 0.

If in addition we assume that X_1, . . . , X_n are normally distributed, then
(3.13) can be expressed as the normal multivariate linear model (MNLM)

(3.14)  X ∼ N_{p×n}(βZ, Σ ⊗ I_n),   β : p × q,   Σ > 0.

Often Z is called a design matrix for the linear model. We now assume that
Z is of rank q ≤ n, so ZZ' is nonsingular and β is identifiable:

β = E(X) Z'(ZZ')^{−1}.

The maximum likelihood estimator (β̂, Σ̂). We now show that the
MLE (β̂, Σ̂) exists w. pr. 1 iff n − q ≥ p and is given by

(3.15)  β̂ = XZ'(ZZ')^{−1},
(3.16)  Σ̂ = \frac{1}{n} X[I_n − Z'(ZZ')^{−1}Z]X' ≡ \frac{1}{n} XQX'.


Because the observation vectors X_1, . . . , X_n are independent under the
MNLM (3.14), the joint pdf of X ≡ (X_1, . . . , X_n) is given by

f_{β,Σ}(x) = \frac{c_1}{|Σ|^{n/2}} · e^{−\frac{1}{2}\sum_{i=1}^n (x_i−βZ_i)'Σ^{−1}(x_i−βZ_i)}
           = \frac{c_1}{|Σ|^{n/2}} · e^{−\frac{1}{2}\mathrm{tr}[Σ^{−1}\sum_{i=1}^n (x_i−βZ_i)(x_i−βZ_i)']}
(3.17)     = \frac{c_1}{|Σ|^{n/2}} · e^{−\frac{1}{2}\mathrm{tr}[Σ^{−1}(x−βZ)(x−βZ)']},

where c_1 = (2π)^{−np/2} and Z_1, . . . , Z_n are the columns of Z. To find the MLEs
β̂, Σ̂, first fix Σ and maximize (3.17) w.r.to β. This can be accomplished
by "minimizing" the matrix-valued quadratic form

(3.18)  ∆(β) := (X − βZ)(X − βZ)'

w.r.to the Loewner ordering^2, which a fortiori minimizes tr[Σ^{−1}∆(β)] [ver-
ify]. Since each row of βZ lies in L ≡ row space(Z) ⊂ R^n, this suggests
that the minimizing β̂ be chosen such that each row of β̂Z is the orthogonal
projection of the corresponding row of X onto L. But the matrix of this
orthogonal projection is

P ≡ Z'(ZZ')^{−1}Z : n × n,

so we should choose β̂ such that β̂Z = XZ'(ZZ')^{−1}Z, or equivalently,

(3.19)  β̂ = XZ'(ZZ')^{−1}.

To verify that β̂ minimizes ∆(β), write X − βZ = (X − β̂Z) + (β̂ − β)Z,
so

∆(β) = (X − β̂Z)(X − β̂Z)' + (β̂ − β)ZZ'(β̂ − β)'
       + \underbrace{(X − β̂Z)Z'(β̂ − β)'}_{=0} + \underbrace{(β̂ − β)Z(X − β̂Z)'}_{=0}.

^2 T ≥ S iff T − S is psd.


Since ZZ' is pd, ∆(β) is uniquely minimized w.r. to the Loewner ordering
when β = β̂, so

(3.20)  \min_β ∆(β) = (X − β̂Z)(X − β̂Z)'
                    = X[I_n − Z'(ZZ')^{−1}Z][I_n − Z'(ZZ')^{−1}Z]'X'
                    ≡ X(I_n − P)(I_n − P)'X'
                    ≡ XQQ'X'   [set Q = I_n − P]
                    = XQX'    [Q, like P, is a projection matrix].

Since β̂ does not depend on Σ, this establishes (3.15). Furthermore, it
follows from (3.17) and (3.20) that for fixed Σ > 0,

(3.21)  \max_β f_{β,Σ}(x) = \frac{c_1}{|Σ|^{n/2}} · e^{−\frac{1}{2}\mathrm{tr}(Σ^{−1}xQx')}.

To maximize (3.21) w. r. to Σ we apply the following lemma:

Lemma 3.10. If W is pd then

(3.22)  \max_{Σ>0} \frac{1}{|Σ|^{n/2}} e^{−\frac{1}{2}\mathrm{tr}(Σ^{−1}W)} = \frac{1}{|Σ̂|^{n/2}} · e^{−\frac{np}{2}},

where Σ̂ ≡ \frac{1}{n}W is the unique maximizing value of Σ.

Proof. Since the mappings

Σ → Σ^{−1} =: Λ,
Λ → W^{1/2}ΛW^{1/2} =: Ω

are both bijections of S_p^+ onto itself, the maximum in (3.22) is given by

(3.23)  \max_{Λ>0} |Λ|^{n/2} e^{−\frac{1}{2}\mathrm{tr}(ΛW)} = \frac{1}{|W|^{n/2}} \max_{Ω>0} |Ω|^{n/2} e^{−\frac{1}{2}\mathrm{tr}\,Ω}
                                                          = \frac{1}{|W|^{n/2}} \max_{ω_1≥···≥ω_p>0} \prod_{i=1}^p ω_i^{n/2} e^{−\frac{1}{2}ω_i},

where ω_1, . . . , ω_p are the eigenvalues of Ω. Since n log ω − ω is strictly
concave in ω, its maximum is uniquely attained at ω̂ = n, hence the
maximizing values of ω_1, . . . , ω_p are ω̂_1 = · · · = ω̂_p = n. Thus the unique
maximizing value of Ω is Ω̂ = nI_p, hence Λ̂ = nW^{−1} and Σ̂ = \frac{1}{n}W.
¯

If W is psd but singular, then the maximum in (3.23) is +∞ [verify].
Thus the MLE Σ̂ for the MNLM (3.14) exists and is given by Σ̂ = \frac{1}{n}XQX'
iff XQX' is pd. We now derive the distribution of XQX' and show that

(3.24)  XQX' is pd w. pr. 1 ⇐⇒ n − q ≥ p.

Thus the condition n − q ≥ p is necessary and sufficient for the existence
and uniqueness of the MLE Σ̂ as stated in (3.16).
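A computational sketch of the MLEs (3.15)-(3.16) on simulated data (the dimensions, β, Σ, and design matrix Z below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(11)
p, q, n = 2, 3, 50
beta = rng.standard_normal((p, q))
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
Z = rng.standard_normal((q, n))                    # design matrix, rank q

E = np.linalg.cholesky(Sigma) @ rng.standard_normal((p, n))
X = beta @ Z + E                                   # X ~ N_{p x n}(beta Z, Sigma ⊗ I_n)

ZZt_inv = np.linalg.inv(Z @ Z.T)
beta_hat = X @ Z.T @ ZZt_inv                       # (3.15)
Q = np.eye(n) - Z.T @ ZZt_inv @ Z
Sigma_hat = X @ Q @ X.T / n                        # (3.16); the unbiased version divides by n - q

print(beta_hat, "\n", Sigma_hat)
```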
First we find the joint distn of (β̂, Σ̂). From (3.14) and (3.10),

X(Z'  Q) ∼ N_{p×(q+n)}\Big(βZ(Z'  Q),\ Σ ⊗ \begin{pmatrix} Z \\ Q \end{pmatrix}(Z'  Q)\Big)
         = N_{p×(q+n)}\Big((βZZ'  0),\ Σ ⊗ \begin{pmatrix} ZZ' & 0 \\ 0 & Q \end{pmatrix}\Big)   [ZQ = 0],

from which it follows that

(3.25)  XZ' ∼ N_{p×q}(βZZ', Σ ⊗ (ZZ')),
(3.26)  XQ ∼ N_{p×n}(0, Σ ⊗ Q),
(3.27)  XZ' ⊥⊥ XQ.

Because Q ≡ I_n − Z'(ZZ')^{−1}Z is a projection matrix with [verify]

rank(Q) = tr(Q) = n − q,

its spectral decomposition is (recall (1.37))

(3.28)  Q = Γ \begin{pmatrix} I_{n−q} & 0 \\ 0 & 0 \end{pmatrix} Γ'

for some n × n orthogonal matrix Γ. Set Y = XQΓ, so from (3.26),

Y ∼ N_{p×n}\Big(0,\ Σ ⊗ \begin{pmatrix} I_{n−q} & 0 \\ 0 & 0 \end{pmatrix}\Big).


This shows that [verify]

(3.29)  XQX' ≡ YY' ∼ W_p(n − q, Σ),

hence (3.24) follows from Lemma 3.2. Lastly, by (3.25), (3.29), and (3.27),

(3.30)  β̂ ≡ XZ'(ZZ')^{−1} ∼ N_{p×q}(β, Σ ⊗ (ZZ')^{−1}),
(3.31)  nΣ̂ ≡ XQX' ∼ W_p(n − q, Σ),
(3.32)  β̂ ⊥⊥ Σ̂.

Remark 3.11. From (3.31), the MLE Σ̂ is a biased estimator of Σ:

E(Σ̂) = \Big(1 − \frac{q}{n}\Big) Σ.

Instead, the adjusted MLE Σ̆ := \frac{1}{n−q}XQX' is unbiased.
¯

Special case of the MNLM: a random sample from N_p(µ, Σ).

If X_1, . . . , X_n is an i.i.d. sample from N_p(µ, Σ) then the joint distribution
of X ≡ (X_1, . . . , X_n) is a special case of the MNLM (3.13):

(3.33)  X ∼ N_{p×n}(µe_n', Σ ⊗ I_n),   µ : p × 1,   Σ > 0.

Here q = 1, Z = e_n', and Q = I_n − e_n(e_n'e_n)^{−1}e_n', so from (3.30) - (3.32),

(3.34)  µ̂ = Xe_n(e_n'e_n)^{−1} = \frac{1}{n}\sum_{i=1}^n X_i = X̄_n ∼ N_p(µ, \frac{1}{n}Σ),
(3.35)  nΣ̂ = XQX' = \sum_{i=1}^n (X_i − X̄_n)(X_i − X̄_n)' ∼ W_p(n − 1, Σ),
(3.36)  X̄_n ⊥⊥ Σ̂.


3.4. Distribution of a partitioned Wishart matrix.

Let S_p^+ denote the cone of real positive definite p × p matrices and let
M_{m×n} denote the algebra of all real m × n matrices. Partition the pd
matrix S : p × p ∈ S_p^+ as

(3.37)  S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}   (S_{11} : p_1 × p_1, S_{22} : p_2 × p_2),

where p_1 + p_2 = p. The next result follows directly from (1.34).

Lemma 3.12. The following correspondence is bijective:

(3.38)  S_p^+ ↔ S_{p_1}^+ × M_{p_1×p_2} × S_{p_2}^+
        S ↔ (S_{11·2}, S_{12}, S_{22}).

Note that we cannot replace S_{11·2} by S_{11} in (3.38) because of the
constraints imposed on S itself by the pd condition. That is, the range
of (S_{11}, S_{12}, S_{22}) is not the Cartesian product of the three ranges.

Proposition 3.13.*** Let S ∼ W_p(n, Σ) be partitioned as in (3.37) with
n ≥ p_2 and Σ_{22} > 0. Then the joint distribution of (S_{11·2}, S_{12}, S_{22}) can be
specified as follows:

(3.39)  S_{12} | S_{22} ∼ N_{p_1×p_2}(Σ_{12}Σ_{22}^{−1}S_{22}, Σ_{11·2} ⊗ S_{22}),
(3.40)  S_{22} ∼ W_{p_2}(n, Σ_{22}),
(3.41)  S_{11·2} ∼ W_{p_1}(n − p_2, Σ_{11·2}),
(3.42)  (S_{12}, S_{22}) ⊥⊥ S_{11·2}.

Proof. Represent S as YY' with Y ≡ \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} ∼ N_{p×n}(0, Σ ⊗ I_n), Y_1 : p_1 × n, Y_2 : p_2 × n, so

(3.43)  \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix} = \begin{pmatrix} Y_1Y_1' & Y_1Y_2' \\ Y_2Y_1' & Y_2Y_2' \end{pmatrix}.

By Proposition 3.4, the conditions n ≥ p_2 and Σ_{22} > 0 imply that
rank(Y_2) = p_2 w. pr. 1, hence S_{22} ≡ Y_2Y_2' is pd w. pr. 1. Thus S_{11·2}
is well defined and is given by

(3.44)  S_{11·2} = Y_1[I_n − Y_2'(Y_2Y_2')^{−1}Y_2]Y_1' ≡ Y_1QY_1'.

From (2.8) the conditional distribution of Y_1 | Y_2 is given by

(3.45)  Y_1 | Y_2 ∼ N_{p_1×n}(Σ_{12}Σ_{22}^{−1}Y_2, Σ_{11·2} ⊗ I_n),

which is a MNLM (3.14) with the following correspondences:

X ↔ Y_1,   β ↔ Σ_{12}Σ_{22}^{−1},   p ↔ p_1,
Z ↔ Y_2,   Σ ↔ Σ_{11·2},   q ↔ p_2.

Thus from (3.25), (3.31), (3.32), (3.43), and (3.44), conditionally on Y_2,

(3.46)  S_{12} | Y_2 ∼ N_{p_1×p_2}(Σ_{12}Σ_{22}^{−1}S_{22}, Σ_{11·2} ⊗ S_{22}),
(3.47)  S_{11·2} | Y_2 ∼ W_{p_1}(n − p_2, Σ_{11·2}),
(3.48)  S_{12} ⊥⊥ S_{11·2} | Y_2.

Clearly (3.46) ⇒ (3.39), while (3.40) follows from Lemma 3.1 with
A = (0_{p_2×p_1}  I_{p_2}). Also, (3.47) ⇒ (3.41) and (3.47) ⇒ S_{11·2} ⊥⊥ Y_2, which
combines with (3.48) to yield S_{11·2} ⊥⊥ (S_{12}, Y_2), which implies (3.42).^3  ¯
Note that (3.39) can be restated in two equivalent forms:

(3.49)  S_{12}S_{22}^{−1} | S_{22} ∼ N_{p_1×p_2}(Σ_{12}Σ_{22}^{−1}, Σ_{11·2} ⊗ S_{22}^{−1}),
(3.50)  S_{12}(S_{22}^{−1/2})' | S_{22} ∼ N_{p_1×p_2}(Σ_{12}Σ_{22}^{−1}S_{22}^{1/2}, Σ_{11·2} ⊗ I_{p_2}),

where S_{22}^{1/2} can be any (Borel-measurable) square root of S_{22}. It follows from
(3.50) and (3.42) that

(3.51)  Σ_{12} = 0 ⟹ S_{12}(S_{22}^{−1/2})' ⊥⊥ S_{22} ⊥⊥ S_{11·2}.

We remark that Proposition 3.13 can also be derived directly from the
pdf of the Wishart distribution, the existence of which requires the stronger
conditions n ≥ p and Σ > 0. We shall derive the Wishart pdf in §8.4.

Proposition 3.13 yields many useful results – some examples follow.


^3 Because A ⊥⊥ B | C and B ⊥⊥ C ⇒ B ⊥⊥ (A, C) [verify].


Example 3.14. Distribution of the generalized variance.
If S ∼ W_p(n, Σ) with n ≥ p and Σ > 0 then

(3.52)  |S| ∼ |Σ| · \prod_{i=1}^p χ^2_{n−p+i},

a product of independent chi-square variates.

Proof. Partition S as in (3.37) with p_1 = 1, p_2 = p − 1. Then

|S| = |S_{11·2}| · |S_{22}| ∼ |W_1(n − p + 1, Σ_{11·2})| · |W_{p−1}(n, Σ_{22})|
    ∼ Σ_{11·2}\, χ^2_{n−p+1} · |W_{p−1}(n, Σ_{22})|

with the two factors independent. The result follows by induction on p.
¯

Note that (3.52) implies that although \frac{1}{n}S is an unbiased estimator of
Σ, |\frac{1}{n}S| is a biased estimator of |Σ|:

(3.53)  E\big|\tfrac{1}{n}S\big| = |Σ| · \prod_{i=1}^p \frac{n−p+i}{n} < |Σ|.

Proposition 3.15. Let S ∼ W_p(n, Σ) with n ≥ p and Σ > 0. If A : q × p
has rank q ≤ p then

(3.54)  (AS^{−1}A')^{−1} ∼ W_q(n − p + q, (AΣ^{−1}A')^{−1}).

When A = a' : 1 × p (a : p × 1) this becomes

(3.55)  \frac{1}{a'S^{−1}a} ∼ \frac{1}{a'Σ^{−1}a} · χ^2_{n−p+1}.

Note: Compare (3.54) to (3.1): ASA' ∼ W_q(n, AΣA'), which holds with
no restrictions on n, p, Σ, A, or q.

Our proof of (3.54) requires the singular value decomposition of A:


Lemma 3.16. If A : q × p has rank q ≤ p then there exist an orthogonal
matrix Γ : q × q and a row-orthogonal matrix Ψ_1 : q × p such that

(3.56)  A = ΓD_aΨ_1,

where D_a = diag(a_1, . . . , a_q) and a_1^2 ≥ · · · ≥ a_q^2 > 0 are the ordered eigen-
values of AA'.^4 By extending Ψ_1 to a p × p orthogonal matrix Ψ ≡ \begin{pmatrix} Ψ_1 \\ Ψ_2 \end{pmatrix},
we have the alternative representations

(3.57)  A = Γ(D_a  0_{q×(p−q)})Ψ,
(3.58)    = C(I_q  0_{q×(p−q)})Ψ,

where C ≡ ΓD_a : q × q is nonsingular.

Proof. Let AA' = ΓD_a^2Γ' be the spectral decomposition of the pd q × q
matrix AA'. Thus

D_a^{−1}Γ'AA'ΓD_a^{−1} = I_q,

so Ψ_1 := D_a^{−1}Γ'A : q × p satisfies Ψ_1Ψ_1' = I_q, i.e., the rows of Ψ_1 are
orthonormal. Thus (3.56) holds, then (3.57) and (3.58) are immediate.  ¯

Proof of Proposition 3.15. It follows from (3.58) that [verify]

(AS^{−1}A')^{−1} = C'^{−1}Š_{11·2}C^{−1},
(AΣ^{−1}A')^{−1} = C'^{−1}Σ̌_{11·2}C^{−1},

where Š = ΨSΨ' and Σ̌ = ΨΣΨ' are partitioned as in (3.37) with p_1 = q
and p_2 = p − q. Since Š ∼ W_p(n, Σ̌), it follows from Proposition 3.13 that

Š_{11·2} ∼ W_q(n − (p − q), Σ̌_{11·2}),

so

C'^{−1}Š_{11·2}C^{−1} ∼ W_q(n − (p − q), C'^{−1}Σ̌_{11·2}C^{−1}),

which gives (3.54).
¯

^4 a_1 ≥ · · · ≥ a_q > 0 are called the singular values of A.


Proposition 3.17. Distribution of Hotelling's T^2 statistic.
Let X ∼ N_p(µ, Σ) and S ∼ W_p(n, Σ) be independent, n ≥ p, Σ > 0, and
define

T^2 = X'S^{−1}X.

Then

(3.59)  T^2 ∼ \frac{χ^2_p(µ'Σ^{−1}µ)}{χ^2_{n−p+1}} ≡ F_{p, n−p+1}(µ'Σ^{−1}µ),

a (nonnormalized) noncentral F distribution. (The two chi-square variates
are independent.)

Proof. Decompose T^2 as \frac{X'S^{−1}X}{X'Σ^{−1}X} · X'Σ^{−1}X. By (3.55) and the inde-
pendence of X and S,

X'S^{−1}X | X ∼ X'Σ^{−1}X · \frac{1}{χ^2_{n−p+1}},

so

\frac{X'S^{−1}X}{X'Σ^{−1}X} \Big| X ∼ \frac{1}{χ^2_{n−p+1}},

independent of X. Since X'Σ^{−1}X ∼ χ^2_p(µ'Σ^{−1}µ) by (2.30), (3.59) holds.
¯

For any fixed µ_0 ∈ R^p, replace X and µ in Proposition 3.17 by X − µ_0
and µ − µ_0, respectively, to obtain the following generalization of (3.59):

T^2 ≡ (X − µ_0)'S^{−1}(X − µ_0)
(3.60)  ∼ \frac{χ^2_p((µ − µ_0)'Σ^{−1}(µ − µ_0))}{χ^2_{n−p+1}} ≡ F_{p, n−p+1}((µ − µ_0)'Σ^{−1}(µ − µ_0)).

Note: In Example 6.11 and Exercise 6.12 it will be shown that T^2 is the
UMP invariant test statistic and the LRT statistic for testing µ = µ_0 vs.
µ ≠ µ_0 with Σ unknown. When µ = µ_0,

(3.61)  T^2 ∼ F_{p, n−p+1},

which determines the significance level of the test.
¯
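A simulation sketch of the null case (3.61) (p, n, and Σ are arbitrary illustrative choices; T^2 here is the nonnormalized ratio, so it is compared against a ratio of independent chi-squares rather than a rescaled F variate):

```python
import numpy as np

rng = np.random.default_rng(12)
p, n, N = 3, 20, 50_000
Sigma = np.diag([1.0, 2.0, 0.5])
A = np.linalg.cholesky(Sigma)

T2 = np.empty(N)
for i in range(N):
    X = A @ rng.standard_normal(p)                    # N_p(0, Sigma), i.e. mu = mu0
    Y = A @ rng.standard_normal((p, n))
    S = Y @ Y.T                                       # W_p(n, Sigma), independent of X
    T2[i] = X @ np.linalg.solve(S, X)                 # T^2 = X'S^{-1}X

ref = rng.chisquare(p, N) / rng.chisquare(n - p + 1, N)
print(np.quantile(T2, [0.5, 0.9, 0.95]))
print(np.quantile(ref, [0.5, 0.9, 0.95]))             # the two sets of quantiles should roughly agree
```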


Example 3.18. Expected value of S^{−1}.
Suppose that S ∼ W_p(n, Σ) with n ≥ p and Σ > 0, so S^{−1} exists w. pr. 1.
When does E(S^{−1}) exist, and what is its value? We answer this by
combining Proposition 3.13 with an invariance argument.

First consider the case Σ = I. Partition S and S^{−1} as

S = \begin{pmatrix} s_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix},   S^{−1} = \begin{pmatrix} s^{11} & S^{12} \\ S^{21} & S^{22} \end{pmatrix},

respectively, with p_1 = 1 and p_2 = p − 1. Then by (3.41),

s^{11} = \frac{1}{s_{11·2}} ∼ \frac{1}{χ^2_{n−p+1}},

so

(3.62)  E(s^{11}) = \frac{1}{n−p−1} < ∞   iff n ≥ p + 2.

Similarly for the other diagonal elements of S^{−1}: E(s^{ii}) < ∞ iff n ≥ p + 2.
Because each off-diagonal element s^{ij} of S^{−1} satisfies

|s^{ij}| ≤ \sqrt{s^{ii}s^{jj}} ≤ \tfrac{1}{2}(s^{ii} + s^{jj}),

we see that E(S^{−1}) =: ∆ exists iff n ≥ p + 2. Furthermore, because Σ = I,
S ∼ ΓSΓ' for every p × p orthogonal matrix Γ, hence

Γ∆Γ' = Γ E(S^{−1}) Γ' = E[(ΓSΓ')^{−1}] = E(S^{−1}) = ∆   ∀ Γ.

Exercise 3.19. Show that Γ∆Γ' = ∆ ∀ Γ ⇒ ∆ = δI for some δ > 0.
¯

Thus E(S^{−1}) = δI, and δ = \frac{1}{n−p−1} by (3.62). Therefore when Σ = I,

E(S^{−1}) = \frac{1}{n−p−1} I   (n ≥ p + 2).

Now consider the general case Σ > 0. Since

S ∼ Σ^{1/2}ŠΣ^{1/2}   with Š ∼ W_p(n, I),


we conclude that

E(S^{−1}) = E[(Σ^{1/2}ŠΣ^{1/2})^{−1}]
          = Σ^{−1/2} E(Š^{−1}) Σ^{−1/2}
          = \frac{1}{n−p−1} Σ^{−1/2}Σ^{−1/2}
(3.63)    = \frac{1}{n−p−1} Σ^{−1}   (n ≥ p + 2).
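A Monte Carlo check of (3.63) (a sketch with an arbitrary illustrative Σ; the average of the simulated S^{−1} matrices should be close to Σ^{−1}/(n − p − 1)):

```python
import numpy as np

rng = np.random.default_rng(13)
p, n, reps = 3, 12, 20_000
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.linalg.cholesky(Sigma)

acc = np.zeros((p, p))
for _ in range(reps):
    X = A @ rng.standard_normal((p, n))
    acc += np.linalg.inv(X @ X.T)          # S^{-1} with S ~ W_p(n, Sigma)

print(acc / reps)
print(np.linalg.inv(Sigma) / (n - p - 1))  # (3.63), valid since n >= p + 2
```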

Proposition 3.20. Bartlett's decomposition.
Let S ∼ W_p(n, I) with n ≥ p. Set S = TT' where T ≡ {t_{ij} | 1 ≤ j ≤ i ≤ p}
is the unique lower triangular square root of S with t_{ii} > 0, i = 1, . . . , p (see
Exercise 1.5). Then the {t_{ij}} are mutually independent rvs with

(3.64)  t_{ii}^2 ∼ χ^2_{n−i+1},   i = 1, . . . , p,
        t_{ij} ∼ N_1(0, 1),   1 ≤ j < i ≤ p.

Proof. Use induction on p. The result is obvious for p = 1. Partition S
as in (3.37) with p_1 = p − 1 and p_2 = 1 so by the induction hypothesis,
S_{11} = T_1T_1' for a lower triangular matrix T_1 that satisfies (3.64) with p
replaced by p − 1. Then

S ≡ \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & s_{22} \end{pmatrix} = \begin{pmatrix} T_1 & 0 \\ S_{21}T_1'^{−1} & s_{22·1}^{1/2} \end{pmatrix} \begin{pmatrix} T_1' & T_1^{−1}S_{12} \\ 0 & s_{22·1}^{1/2} \end{pmatrix} ≡ TT',

where T : p × p is lower triangular with t_{ii} > 0, i = 1, . . . , p. Since T_1 = S_{11}^{1/2}
and Σ = I, it follows from (3.51), (3.50), and (3.41) (with the indices "1"
and "2" interchanged) that

S_{21}T_1'^{−1} ⊥⊥ T_1 ⊥⊥ s_{22·1},
S_{21}T_1'^{−1} ∼ N_{1×(p−1)}(0, 1 ⊗ I_{p−1}),
s_{22·1} ∼ χ^2_{n−p+1},

from which the induction step follows.
¯
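Bartlett's decomposition also gives a cheap way to simulate W_p(n, I) (and hence W_p(n, Σ) via Σ^{1/2}SΣ^{1/2}'); a minimal sketch:

```python
import numpy as np

def wishart_bartlett(p, n, rng):
    """Draw S ~ W_p(n, I) via Bartlett's decomposition (3.64)."""
    T = np.zeros((p, p))
    for i in range(p):                            # i is 0-based here
        T[i, i] = np.sqrt(rng.chisquare(n - i))   # t_ii^2 ~ chi^2_{n-i+1} in 1-based indexing
        T[i, :i] = rng.standard_normal(i)         # t_ij ~ N(0,1), j < i
    return T @ T.T

rng = np.random.default_rng(14)
p, n = 3, 10
S_mean = sum(wishart_bartlett(p, n, rng) for _ in range(20_000)) / 20_000
print(S_mean)        # ≈ n * I_p
```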


Example 3.21. Distribution of the sample multiple correlation
coefficient R^2.
Let S ∼ W_p(n, Σ) with n ≥ p and Σ > 0. Partition S and Σ as

(3.65)  S = \begin{pmatrix} s_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix},   Σ = \begin{pmatrix} σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix}   (s_{11}, σ_{11} : 1 × 1),

and define

        R^2 = \frac{S_{12}S_{22}^{−1}S_{21}}{s_{11}},   ρ^2 = \frac{Σ_{12}Σ_{22}^{−1}Σ_{21}}{σ_{11}},

(3.66)  U = \frac{R^2}{1 − R^2} = \frac{S_{12}S_{22}^{−1}S_{21}}{s_{11·2}},   ζ = \frac{ρ^2}{1 − ρ^2} = \frac{Σ_{12}Σ_{22}^{−1}Σ_{21}}{σ_{11·2}},

        V ≡ V(S_{22}, Σ) = \frac{Σ_{12}Σ_{22}^{−1}S_{22}Σ_{22}^{−1}Σ_{21}}{σ_{11·2}}.

From Proposition 3.13 and (3.50) we have

S_{12}(S_{22}^{−1/2})' | S_{22} ∼ N_{1×(p−1)}(Σ_{12}Σ_{22}^{−1}S_{22}^{1/2}, σ_{11·2} ⊗ I_{p−1}),
s_{11·2} ∼ σ_{11·2} · χ^2_{n−p+1},
S_{22} ∼ W_{p−1}(n, Σ_{22}),
s_{11·2} ⊥⊥ (S_{12}, S_{22}),

so [verify]

U | S_{22} ∼ \frac{χ^2_{p−1}(V)}{χ^2_{n−p+1}} \overset{distn}{=} F_{p−1, n−p+1}(V),
V ∼ ζ · χ^2_n.

Therefore the joint distribution of (U, V) ≡ (U, V(S_{22}, Σ)) is given by

(3.67)  U | V ∼ F_{p−1, n−p+1}(V),
        V ∼ ζ · χ^2_n.

Equivalently, if we set Z := V /ζ so Z is ancillary (but unobservable), then



U  Z ∼ Fp−1, n−p+1 (ζZ),
(3.68)
Z ∼ χ2n ,

from which the unconditional distribution of U can be obtained by averaging


over Z (see Exercise 3.22 and Example A.18 in Appendix A).
¯

Exercise 3.22. From (A.7) in Appendix A, the conditional distribution


Fp−1,n−p+1 (ζZ) of U | Z can be represented as a Poisson mixture of central
F distributions:

(3.69) Fp−1, n−p+1 (ζZ)  K ∼ Fp−1+2K, n−p+1 , K ∼ Poisson (ζZ/2) .

Use (3.68), (3.69), and (A.8) to show that the unconditional distribution of
U (resp., R2 ) can be represented as a negative binomial mixture of central
F (resp., Beta) rvs:

(3.70) U  K ∼ Fp−1+2K, n−p+1 ,
U 

(3.71) R ≡
2
 K ∼ B p−1
2 + K, n−p+1
2 ,
U +1
(3.72) K ∼ Negative binomial (ρ2 ),
that is,

Γ n2 + k 2
k
n
Pr[ K = k ] = n
ρ 1 − ρ2 2 , k = 0, 1, . . . .
¯
Γ 2 k!

Note: In Example 6.26 and Exercise 6.27 it will be shown that R2 is the
LRT statistic and the UMP invariant test statistic for testing ρ2 = 0 vs.
ρ2 > 0. When ρ2 = 0 ( ⇐⇒ Σ12 = 0 ⇐⇒ ζ = 0), U ⊥ ⊥ Z by (3.68) and

(3.73) U ∼ Fp−1, n−p+1 ,


(3.74) R2 ∼ B p−1
2 , n−p+1
2 ,

either of which determines the significance level of the test.


¯
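
Under the null hypothesis ρ² = 0, (3.74) is easy to check by simulation. A short sketch (Python/NumPy/SciPy; the choices of p and n are arbitrary):

    import numpy as np
    from scipy.stats import beta, kstest

    rng = np.random.default_rng(2)
    p, n, N = 4, 20, 5000
    R2 = np.empty(N)
    for m in range(N):
        X = rng.standard_normal((p, n))          # Sigma = I, so Sigma_12 = 0 and rho^2 = 0
        S = X @ X.T                              # S ~ W_p(n, I)
        R2[m] = S[0, 1:] @ np.linalg.solve(S[1:, 1:], S[1:, 0]) / S[0, 0]

    # compare with B((p-1)/2, (n-p+1)/2) as in (3.74)
    print(kstest(R2, beta(a=(p - 1) / 2, b=(n - p + 1) / 2).cdf))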


4. The Wishart Density; Jacobians of Matrix Transformations.


We have deduced properties of a Wishart random matrix S ∼ Wp (n, Σ)
by using its representation S = XX  in terms of a multivariate normal
random matrix X ∼ Np×n (0, Σ ⊗ In ). We have not required the density of
the Wishart distribution on Sp+ (the cone of p×p positive definite symmetric
matrices). In this section we derive this density, a multivariate extension of
the (central) chi-square density. Throughout it is assumed that n ≥ p.
Assume first that Σ = I. From Bartlett’s decomposition S = T T' in
Proposition 3.20, the joint pdf of T ≡ {tij} is given by [verify!]

  f(T) = ∏_{1≤j<i≤p} (2π)^{-1/2} e^{−t²ij/2} · ∏_{i=1}^p [ 2^{(n−i−1)/2} Γ((n−i+1)/2) ]^{-1} t_ii^{n−i} e^{−t²ii/2}

(4.1)  = [ 2^{pn/2 − p} π^{p(p−1)/4} ∏_{i=1}^p Γ((n−i+1)/2) ]^{-1} · ∏_{i=1}^p t_ii^{n−i} · exp( −(1/2) ∑_{1≤j≤i≤p} t²ij )

       =: 2^p c_{p,n} · ∏_{i=1}^p t_ii^{n−i} · exp( −(1/2) tr T T' ),

with c_{p,n} as defined in (4.10) below. Since the pdf of S is given by f(S) = f(T) |∂T/∂S|,
we first must find the Jacobian |∂S/∂T| ≡ 1/|∂T/∂S| of the mapping S = T T'.
[This derivation of the Wishart pdf will resume in §4.4.]

4.1. Jacobians of vector/matrix transformations.


Consider a smooth bijective mapping (≡ diffeomorphism)

                A → B
(4.2)
                x ≡ (x1, . . . , xn) → y ≡ (y1, . . . , yn),

where A and B are open subsets of Rⁿ. The Jacobian matrix of this map-
ping is given by

                ∂y     ( ∂y1/∂x1  · · ·  ∂yn/∂x1 )
(4.3)           ──  =  (    ⋮               ⋮    ),
                ∂x     ( ∂y1/∂xn  · · ·  ∂yn/∂xn )

and the Jacobian of the mapping is given by |∂y/∂x| := | det(∂y/∂x) |. Jaco-
bians obey several elementary properties.

Chain rule: Suppose that x → y and y → z are diffeomorphisms. Then
x → z is a diffeomorphism and

(4.4)        |∂z/∂x| = |∂z/∂y|_{y=y(x)} · |∂y/∂x|.

Proof. This follows from the chain rule for partial derivatives:

        ∂zi/∂xj = ∑_k (∂zi/∂yk)(∂yk/∂xj) = ( (∂z/∂y)(∂y/∂x) )_{ij}.

Therefore ∂z/∂x = (∂z/∂y)(∂y/∂x); now take determinants.   ¯

Inverse rule: Suppose that x → y is a diffeomorphism. Then

(4.5)        |∂x/∂y|_{y=y(x)} = |∂y/∂x|^{-1}.

Proof. Apply the chain rule with z = x.   ¯

Combination rule: Suppose that x → u and y → v are (unrelated)
diffeomorphisms. Then

(4.6)        |∂(u, v)/∂(x, y)| = |∂u/∂x| · |∂v/∂y|.

Proof. The Jacobian matrix is given by

        ∂(u, v)/∂(x, y) = ( ∂u/∂x    0    )
                          (   0    ∂v/∂y  ).

Extended combination rule: Suppose that (x, y) → (u, v) is a diffeo-
morphism of the form u = u(x), v = v(x, y). Then (4.6) continues to hold.

Proof. The Jacobian matrix is given by

        ∂(u, v)/∂(x, y) = ( ∂u/∂x  ∂v/∂x )
                          (   0    ∂v/∂y ).


4.2. Jacobians of linear mappings. Let

        A : p × p and B : n × n be nonsingular matrices,
        L : p × p and M : p × p be nonsingular lower triangular matrices,
        U : p × p and V : p × p be nonsingular upper triangular matrices,
        c a nonzero scalar.

(A, B, L, M, U, V, c are non-random.) Then (4.4) – (4.6) imply the following
facts:

(a) vectors. y = cx, x, y : 1 × n:  |∂y/∂x| = |c|^n.  [combination rule]

(b) matrices. Y = cX, X, Y : p × n:  |∂Y/∂X| = |c|^{pn}.  [comb. rule]

(c) symmetric matrices. Y = cX, X, Y : p × p, symmetric:  |∂Y/∂X| = |c|^{p(p+1)/2}.
[comb. rule]

(d) matrices. Y = AX, X, Y : p × n:  |∂Y/∂X| = |A|^n.  [comb. rule]
    Y = XB, X, Y : p × n:  |∂Y/∂X| = |B|^p.  [comb. rule]
    Y = AXB, X, Y : p × n:  |∂Y/∂X| = |A|^n |B|^p.  [chain rule]

(e) symmetric matrices. Y = AXA', X, Y : p × p, symmetric:

        |∂Y/∂X| = |A|^{p+1}.

Proof. Use the fact that A can be written as the product of elementary
matrices of the forms

        Mi(c) := Diag(1, . . . , 1, c, 1, . . . , 1)   (c in the ith position),
        Eij := the identity matrix with an extra 1 in the (i, j)th position (i ≠ j).

Verify the result when A = Mi(c) and A = Eij, then apply the chain rule.
¯


(f) triangular matrices:

• Y = LX, X, Y : p × p lower triangular:

        |∂Y/∂X| = ∏_{i=1}^p |lii|^i.

Proof. Since yij = ∑_{k=j}^i lik xkj (i ≥ j), the Jacobian matrix ∂Y/∂X (with the
free entries of X and Y listed in the order x11, x21, x22, . . . , xpp) is a
p(p+1)/2 × p(p+1)/2 triangular matrix whose diagonal entry for the coordinate
(i, j) is ∂yij/∂xij = lii. Since lii appears once for each j = 1, . . . , i, the
determinant is ∏_{i=1}^p lii^i.⁵   ¯

• Y = U X, X, Y : p × p upper triangular:

        |∂Y/∂X| = ∏_{i=1}^p |uii|^{p−i+1}.

• Y = XL, X, Y : p × p lower triangular:

        |∂Y/∂X| = ∏_{i=1}^p |lii|^{p−i+1}.

Proof. Write Y' = L'X' and apply the preceding case with U = L'.   ¯

⁵ A more revealing proof follows by noting that Y = LX can be written column-by-
column as Y1 = L1 X1, . . . , Yp = Lp Xp, where Xi and Yi are the (p−i+1) × 1 non-
zero parts of the columns of X and Y and where Li is the lower (p−i+1) × (p−i+1)
principal submatrix of L. Since Yi = Li Xi has Jacobian |Li|⁺ = ∏_{j=i}^p |ljj|, the
result follows from the combination rule.


• Y = XU, X, Y : p × p upper triangular:

        |∂Y/∂X| = ∏_{i=1}^p |uii|^i.

Proof. Write Y' = U'X' and apply the first case with L = U'.   ¯

• Y = LXM, X, Y : p × p lower triangular:

        |∂Y/∂X| = ∏_{i=1}^p |lii|^i · ∏_{i=1}^p |mii|^{p−i+1}.

Proof. Apply the chain rule.   ¯

• Y = U XV, X, Y : p × p upper triangular:

        |∂Y/∂X| = ∏_{i=1}^p |uii|^{p−i+1} · ∏_{i=1}^p |vii|^i.

Proof. Write Y' = V'X'U' and apply the last case with L = V' and
M = U'.   ¯

(g) triangular/symmetric matrices:

• Y = X + X', X : p × p lower (or upper) triangular, Y : p × p symmetric:

        |∂Y/∂X| = 2^p.

Proof. Since yii = 2xii, 1 ≤ i ≤ p, while yij = xij, 1 ≤ j < i ≤ p.   ¯

• Y = L'X + X'L, X : p × p lower triangular, Y : p × p symmetric:

        |∂Y/∂X| = 2^p ∏_{i=1}^p |lii|^i.

Proof. Clearly X → Y is a linear mapping. To show that it is 1-1:

        L'X1 + X1'L = L'X2 + X2'L
        ⇒ L'(X1 − X2) = −(X1 − X2)'L
        ⇒ (X1 − X2)L^{-1} = −[(X1 − X2)L^{-1}]'.

Thus (X1 − X2)L^{-1} is both lower triangular and skew-symmetric, hence is
0, so X1 = X2. Next, to find the required Jacobian, apply the chain rule to
the sequence of mappings

        X → XL^{-1} → XL^{-1} + (XL^{-1})' → L'[XL^{-1} + (XL^{-1})']L ≡ L'X + X'L.

Therefore the Jacobian is given by [verify!]

        |∂Y/∂X| = ∏_{i=1}^p |lii|^{−(p−i+1)} · 2^p · ∏_{i=1}^p |lii|^{p+1} = 2^p ∏_{i=1}^p |lii|^i.

• Y = U'X + X'U, X : p × p upper triangular, Y : p × p symmetric:

        |∂Y/∂X| = 2^p ∏_{i=1}^p |uii|^{p−i+1}.

• Y = XL' + LX', X : p × p lower triangular, Y : p × p symmetric:

        |∂Y/∂X| = 2^p ∏_{i=1}^p |lii|^{p−i+1}.

Proof. Apply the preceding case with U = L' and X replaced by X̃ := X'.   ¯

• Y = XU' + U X', X : p × p upper triangular, Y : p × p symmetric:

        |∂Y/∂X| = 2^p ∏_{i=1}^p |uii|^i.

Proof. Apply the first case with L = U' and X replaced by X̃ := X'.
¯


4.3. Jacobians of nonlinear mappings.

(*) The Jacobian of a nonlinear diffeomorphism x → y is the same as the
Jacobian of the linearized differential mapping dx → dy. Here,

        dx := (dx1, . . . , dxn)  and  dy := (dy1, . . . , dyn).

For n = 1, (*) is immediate from the linear relation between dx and dy
given by the formal differential identity dy = (dy/dx) dx, where dy/dx is treated
as a scalar constant c. For n ≥ 2, the equations for total differentials

        dy1 = (∂y1/∂x1) dx1 + · · · + (∂y1/∂xn) dxn,
(4.7)     ⋮
        dyn = (∂yn/∂x1) dx1 + · · · + (∂yn/∂xn) dxn,

can be expressed in vector-matrix notation as the single linear relation

(4.8)        dy = dx (∂y/∂x),

with ∂y/∂x treated as a constant matrix, which again implies (*).

The following elementary rules for matrix differentials will combine
with (*) to allow calculation of Jacobians for apparently complicated non-
linear diffeomorphisms. Here, if X ≡ (xij) is a matrix variable, dX denotes
the matrix of differentials (dxij). If X is a structured matrix (e.g., sym-
metric or triangular) then dX has the same structure.

(1) sum: d(X + Y) = dX + dY.  [verify]

(2) product: d(XY) = (dX)Y + X(dY).  [verify]

(3) inverse: d(X^{-1}) = −X^{-1}(dX)X^{-1}.

Proof. Apply (2) with Y = X^{-1}.
¯


Four examples of nonlinear Jacobians:

(a) matrix inversion: if Y = X^{-1} with X, Y : p × p (unstructured) then

        |∂Y/∂X| = |X|^{−2p}.

Proof. Apply (3) and §4.2(d).   ¯

(b) matrix inversion: if Y = X^{-1} with X, Y : p × p symmetric, then

        |∂Y/∂X| = |X|^{−p−1}.

Proof. Apply (3) and §4.2(e).   ¯

(c) lower triangular decomposition: if S = T T' with S : p × p symmetric
pd and T : p × p lower triangular with t11 > 0, . . . , tpp > 0 (Cholesky), then

        |∂S/∂T| = 2^p ∏_{i=1}^p t_ii^{p−i+1}.

Proof. By (2), dS = (dT)T' + T(dT)'; now apply §4.2(g).
¯
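
A numerical spot-check of (c) is easy (a sketch in Python/NumPy, with function names of my own): the map T ↦ TT' is parametrized by the p(p+1)/2 free entries, and the determinant of a finite-difference Jacobian is compared with 2^p ∏ t_ii^{p−i+1}.

    import numpy as np

    def vech_to_L(v, p):
        """Fill a lower triangular matrix from its p(p+1)/2 free entries (np.tril_indices order)."""
        L = np.zeros((p, p))
        L[np.tril_indices(p)] = v
        return L

    def jacobian_det_TTt(T, eps=1e-6):
        """|dS/dT| for S = T T', approximated by finite differences on the free entries."""
        p = T.shape[0]
        idx = np.tril_indices(p)
        v0 = T[idx]
        m = len(v0)
        J = np.zeros((m, m))
        for k in range(m):
            v = v0.copy()
            v[k] += eps
            Tk = vech_to_L(v, p)
            dS = (Tk @ Tk.T - T @ T.T) / eps
            J[:, k] = dS[idx]
        return abs(np.linalg.det(J))

    rng = np.random.default_rng(12)
    p = 3
    T = np.tril(rng.standard_normal((p, p)))
    T[np.diag_indices(p)] = np.abs(np.diag(T)) + 1          # t_ii > 0
    t = np.diag(T)
    print(jacobian_det_TTt(T))
    print(2**p * np.prod(t ** (p - np.arange(1, p + 1) + 1)))   # 2^p * prod t_ii^{p-i+1}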

(d) upper triangular decomposition: if S = U U' with S : p × p symmetric
pd and U : p × p upper triangular with u11 > 0, . . . , upp > 0 (Cholesky),
then

        |∂S/∂U| = 2^p ∏_{i=1}^p u_ii^i.

Proof. By (2), dS = (dU)U' + U(dU)'; again apply §4.2(g).
¯


4.4. The Wishart density.


We continue the discussion following (4.1). When Σ = Ip and n ≥ p, the
pdf f(T) of T (recall that S = T T' with T lower triangular) is given by
(4.1). Thus by the inverse rule and §4.3(c) the pdf of S is given by

        f(S) = f(T(S)) · |∂S/∂T|^{-1}_{T=T(S)}

             = 2^p c_{p,n} · ∏_{i=1}^p t_ii^{n−i} · exp( −(1/2) tr T T' ) · 2^{−p} ∏_{i=1}^p t_ii^{−(p−i+1)}

(4.9)        = c_{p,n} · ∏_{i=1}^p t_ii^{n−p−1} · exp( −(1/2) tr T T' )

             = c_{p,n} · |S|^{(n−p−1)/2} e^{−(1/2) tr S},    S ∈ Sp+,

where

(4.10)   c_{p,n}^{−1} := 2^{pn/2} π^{p(p−1)/4} ∏_{i=1}^p Γ( (n−i+1)/2 ) =: 2^{pn/2} π^{p(p−1)/4} Γp(n/2).

Finally, for Σ > 0 the Jacobian of the mapping S → Σ^{1/2} S Σ^{1/2} is
|Σ|^{(p+1)/2} (apply §4.2(e)), so the general Wishart pdf for S ∼ Wp(n, Σ) is
given by

(4.11)   ( c_{p,n} / |Σ|^{n/2} ) · |S|^{(n−p−1)/2} e^{−(1/2) tr Σ^{−1} S},    S ∈ Sp+,

a multivariate extension of the density of σ²χ²_n.


¯
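
As a numerical check of (4.11), the sketch below (Python with a recent SciPy assumed; scipy.special.multigammaln supplies the multivariate gamma factor in c_{p,n}, and the particular Σ, n, p are arbitrary) compares the log of (4.11) with scipy.stats.wishart.logpdf at a random argument.

    import numpy as np
    from scipy.stats import wishart
    from scipy.special import multigammaln

    def log_wishart_pdf(S, n, Sigma):
        """Log of the Wishart density (4.11)."""
        p = Sigma.shape[0]
        log_c = -(p * n / 2) * np.log(2) - multigammaln(n / 2, p)    # log c_{p,n}
        _, logdet_Sigma = np.linalg.slogdet(Sigma)
        _, logdet_S = np.linalg.slogdet(S)
        return (log_c - (n / 2) * logdet_Sigma + ((n - p - 1) / 2) * logdet_S
                - 0.5 * np.trace(np.linalg.solve(Sigma, S)))

    rng = np.random.default_rng(3)
    p, n = 4, 9
    A = rng.standard_normal((p, p))
    Sigma = A @ A.T + p * np.eye(p)
    S = wishart(df=n, scale=Sigma).rvs(random_state=rng)
    print(log_wishart_pdf(S, n, Sigma), wishart(df=n, scale=Sigma).logpdf(S))  # should agree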

Exercise 4.1. Moments of the determinant of a Wishart random
matrix. Use (4.11) to show that

(4.12)        E(|S|^k) = |Σ|^k · 2^{pk} · Γp(n/2 + k) / Γp(n/2),    k = 1, 2, . . . .


Exercise 4.2. Matrix-variate Beta distribution.

Let S and T be independent with S ∼ Wp(r, Σ), T ∼ Wp(n, Σ), r ≥ p,
n ≥ p, and Σ > 0, so S > 0 and T > 0 w. pr. 1. Define

(4.13)        U = (S + T)^{−1/2} S (S + T)^{−1/2}',
              V = S + T.

Show that the range of (U, V) is given by {0 < U < I} × {V > 0} and verify
that (4.13) is a bijection. Show that the joint pdf of (U, V) is given by

(4.14)   f(U, V) = ( c_{p,r} c_{p,n} / c_{p,r+n} ) · |U|^{(r−p−1)/2} |I − U|^{(n−p−1)/2}
                     · ( c_{p,r+n} / |Σ|^{(r+n)/2} ) · |V|^{(r+n−p−1)/2} e^{−(1/2) tr Σ^{−1} V},

so U and V are independent and the distribution of U does not depend on
Σ. (Note that the distribution of U is a matrix generalization of the Beta
distribution.) Therefore

(4.15)        E(|S|^k) = E(|U|^k |V|^k) = E(|U|^k) E(|V|^k),

so the moments of |U| can be expressed in terms of the moments of deter-
minants of the two Wishart matrices S and V via (4.12) as follows:

(4.16)   E(|U|^k) = E(|S|^k) / E(|V|^k) = [ Γp((n+r)/2) Γp(r/2 + k) ] / [ Γp(r/2) Γp((n+r)/2 + k) ].

Hint: To find the Jacobian of (4.13), apply the chain rule to the sequence
of mappings
(S, T ) → (S, V ) → (U, V ).
Use the extended combination rule to find the two intermediate Jacobians.


Exercise 4.3. Distribution of the sample correlation matrix when
Σ is diagonal.

Let S ∼ Wp(n, Dσ) (n ≥ p), where Dσ := diag(σ1, . . . , σp) > 0. Define the
sample correlation matrix R ≡ {rij} by

        rij = sii^{−1/2} sij sjj^{−1/2},

where S ≡ {sij}. Find the joint pdf of R, s11, . . . , spp. Show that they are
mutually independent.

Hint: First determine the range of (R, s11, . . . , spp). Next, the joint pdf of
R, s11, . . . , spp is given by

  f(R, s11, . . . , spp) = f(S) · |∂(S)/∂(R, s11, . . . , spp)|
      = f(S) · |∂(s12, . . . , s_{p−1,p}, s11, . . . , spp)/∂(R, s11, . . . , spp)|
      = ( c_{p,n} / |Dσ|^{n/2} ) · |S|^{(n−p−1)/2} e^{−(1/2) tr Dσ^{−1} S} · |∂(s12, . . . , s_{p−1,p})/∂R|
      = ( c_{p,n} / ∏_{i=1}^p σi^{n/2} ) · |R|^{(n−p−1)/2} · ∏_{i=1}^p s_ii^{(n−p−1)/2} e^{−sii/(2σi)} · ∏_{i=1}^p s_ii^{(p−1)/2}
      = c_{p,n} · |R|^{(n−p−1)/2} · ∏_{i=1}^p (1/σi) (sii/σi)^{n/2 − 1} e^{−sii/(2σi)},

where f(S) is given by (4.11) with Σ = Dσ and the Jacobian is calculated
using the extended combination rule and the relation sij = sii^{1/2} rij sjj^{1/2}.
This establishes the mutual independence, and will yield the marginal pdf
of R. (The mutual independence also can be established by means of Basu’s
Lemma.)
¯

Exercise 4.4. Inverse Wishart distribution. Let S ∼ Wp(n, Σ) with
n ≥ p and Σ > 0. Show that the pdf of W ≡ S^{−1} is

(4.17)        c_{p,n} |Ω|^{n/2} |W|^{−(n+p+1)/2} e^{−(1/2) tr Ω W^{−1}},    W ∈ Sp+,

where Ω = Σ^{−1}.
¯


5. Estimating a Covariance Matrix.


Consider the problem of estimating Σ based on a Wishart random matrix
S ∼ Wp (n, Σ) with Σ ∈ Sp+ . Assume that n ≥ p so that S is nonsingular6
w. pr. 1. The loss incurred by an estimate Σ̂ is measured by a loss function
L(Σ̂, Σ) such that L ≥ 0 and L = 0 iff Σ̂ = Σ. An estimator Σ̂ ≡ Σ̂(S) is
evaluated in terms of its risk function ≡ expected loss:

R(Σ̂, Σ) = EΣ [L(Σ̂, Σ)].

We shall consider two specific loss functions:



2
Quadratic loss : L1 (Σ̂, Σ) = tr Σ̂Σ−1 − I ,

Stein  s loss : L2 (Σ̂, Σ) = tr Σ̂Σ−1 − log |Σ̂Σ−1 | − p.

We prefer L2 over L1 because L1 penalizes overestimates more than under-


estimates, unlike L2 :

L1 (Σ̂, I) → p as Σ̂ → 0, L1 (Σ̂, I) → ∞ as Σ̂ → ∞;
L2 (Σ̂, I) → ∞ as Σ̂ → 0 or ∞.

5.1. Equivariant estimators of Σ.


Let G be a subgroup of GL ≡ GL(p), the general linear group of all p × p
nonsingular real matrices. Each A ∈ G acts on Sp+ according to the mapping

Sp+ → Sp+
(5.1)
Σ → AΣA .

A loss function L is G-invariant if

(5.2) L(AΣ̂A , AΣA ) = L(Σ̂, Σ) ∀ A ∈ G.

6
If n < p it would seem impossible to estimate Σ. However several proposals
recently been put forth to address this case, which occurs for example with microarray
data where p ≈ 105 but n ≈ 103 . [References?]


Note that both L1 and L2 are fully invariant, i.e., are GL-invariant. If L
is G-invariant then the risk function of any estimator Σ̂ ≡ Σ̂(S) transforms
as follows: for A ∈ G,
 
        R( A^{−1} Σ̂(ASA') A'^{−1}, Σ ) = EΣ[ L( A^{−1} Σ̂(ASA') A'^{−1}, Σ ) ]
                                        = EΣ[ L( Σ̂(ASA'), AΣA' ) ]
(5.3)                                   = E_{AΣA'}[ L( Σ̂(S), AΣA' ) ]
                                        = R( Σ̂(S), AΣA' ).

An estimator Σ̂ ≡ Σ̂(S) is G-equivariant if

(5.4) Σ̂(ASA ) = A Σ̂(S) A ∀ A ∈ G, ∀ S ∈ Sp+ .

If L is G-invariant and Σ̂ is G-equivariant then by (5.3) the risk function is


also G-invariant:

(5.5) R(Σ̂, Σ) = R(Σ̂, AΣA ) ∀ A ∈ G,

that is, R(Σ̂, Σ) is constant on G-orbits of Sp+ (see Definition 6.1).


We say that G acts transitively on Sp+ if Sp+ has only one G-orbit under
the action of G. Note that G acts transitively on Sp+ iff every Σ ∈ Sp+ has a
square root ΣG ∈ G, i.e., Σ = ΣG ΣG . Thus both GL and GT ≡ GT (p) (the
subgroup of all p × p nonsingular lower triangular matrices) act transitively
on Sp+ .7 If L is G-invariant, Σ̂ is G-equivariant, and G acts transitively on
Sp+ , then the risk function is constant on Sp+ :

(5.6)        R(Σ̂, Σ) = R(Σ̂, I) ∀ Σ ∈ Sp+    [set A = Σ_G^{−1} in (5.5)].

5.2. The best fully equivariant estimator of Σ.


Lemma 5.1. An estimator Σ̂(S) is GL-equivariant iff Σ̂(S) = δ S for some
scalar δ > 0.
7
For the latter, apply the Cholesky decomposition, Exercise 1.5.


Proof. Set G = GL and A = S_{GL}^{−1} in (5.4) to obtain

        Σ̂(I) = S_{GL}^{−1} Σ̂(S) S_{GL}'^{−1},
so
(5.7)        Σ̂(S) = S_{GL} Σ̂(I) S_{GL}'.

Next set A = Γ ∈ O and S = I in (5.4) to obtain

(5.8)        Σ̂(I) = Γ Σ̂(I) Γ'    ∀ Γ ∈ O(p),

where O ≡ O(p) is the subgroup of all p × p orthogonal matrices. By
Exercise 3.19, (5.8) implies that Σ̂(I) = δ I, so Σ̂(S) = δ S by (5.7), as
stated. ¯

We now find the optimal fully equivariant estimators Σ̂(S) ≡ δ̂ S w. r.


to the loss function L1 and L2 , respectively.

Proposition 5.2. (a) The best fully equivariant estimator w. r. to the loss
function L1 is the biased estimator S/(n+p+1).
(b) The best fully equivariant estimator w. r. to the loss function L2 is the
unbiased estimator S/n.
Proof. (a) Let S = {sij | i, j = 1, . . . , p}. Because GL acts transitively on
Sp+ and L1 is GL-invariant, δ S has constant risk given by

        EI[L1(δ S, I)] = EI[tr(δ S − I)²]
                       = δ² EI(tr S²) − 2δ EI(tr S) + tr I²
                       = δ² EI( ∑_{i,j} s²ij ) − 2δ EI( ∑_i sii ) + p
                       = δ² [ EI( ∑_i s²ii ) + EI( ∑_{i≠j} s²ij ) ] − 2δnp + p
(5.9)*                 = δ² [ (2n + n²)p + p(p−1)n ] − 2δnp + p
(5.10)                 = δ² np(n + p + 1) − 2δnp + p.

The quadratic function of δ in (5.10) is minimized by δ̂ = 1/(n+p+1).


*To verify (5.9), first note that when Σ = I, sii ∼ χ²n, so

        EI(s²ii) = Var_I(χ²n) + [EI(χ²n)]² = 2n + n².

Next, sij ∼ s12 for i ≠ j, since

        Π S Π' ∼ Wp(n, Π Π') = Wp(n, I) ∼ S

for any permutation matrix Π. Also s12 s22^{−1/2} ⊥⊥ s22 and s12 s22^{−1/2} ∼ N(0, 1)
by (3.50) and (3.51), so

        EI(s²12) = EI( (s²12/s22) · s22 ) = EI(s²12/s22) · EI(s22) = 1 · n = n.

(b) Because GL acts transitively on Sp+ and L2 is GL-invariant, δS has
constant risk given by

        EI[L2(δS, I)] = EI[tr(δS) − log |δS| − p]
                      = δ EI(tr S) − p log δ − EI(log |S|) − p
(5.11)                = δnp − p log δ − EI(log |S|) − p.

This is minimized by δ̂ = 1/n.
¯
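
A small simulation (a sketch in Python/NumPy; p and n are arbitrary choices) illustrating Proposition 5.2: under L1 the estimator S/(n+p+1) has smaller risk than S/n, while under Stein's loss L2 the ordering reverses.

    import numpy as np

    def L1(Sig_hat, Sig):
        D = Sig_hat @ np.linalg.inv(Sig) - np.eye(Sig.shape[0])
        return np.trace(D @ D)

    def L2(Sig_hat, Sig):
        M = Sig_hat @ np.linalg.inv(Sig)
        return np.trace(M) - np.linalg.slogdet(M)[1] - Sig.shape[0]

    rng = np.random.default_rng(4)
    p, n, N = 3, 10, 20000
    Sigma = np.eye(p)                      # both risks are constant in Sigma, so I suffices
    risks = np.zeros(4)
    for _ in range(N):
        X = rng.standard_normal((p, n))
        S = X @ X.T
        risks += [L1(S / n, Sigma), L1(S / (n + p + 1), Sigma),
                  L2(S / n, Sigma), L2(S / (n + p + 1), Sigma)]
    print(np.round(risks / N, 3))          # L1: second entry smaller; L2: third entry smaller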

5.3. The best GT -equivariant estimator of Σ.


Lemma 5.3. Let ST = S_{GT}. An estimator Σ̂(S) is GT-equivariant iff

(5.12)        Σ̂(S) = ST ∆ ST'

for a fixed diagonal matrix ∆ ≡ diag(δ1, . . . , δp) with each δi > 0.

Proof. Set G = GT and A = ST^{−1} in (5.4) to obtain

        Σ̂(I) = ST^{−1} Σ̂(S) ST'^{−1},
so
(5.13)        Σ̂(S) = ST Σ̂(I) ST'.

Next set A = D± ≡ diag(±1, . . . , ±1) ∈ GT and S = I in (5.4) to obtain

(5.14)        Σ̂(I) = D± Σ̂(I) D±    ∀ D±.


But (5.14) implies that Σ̂(I) = ∆ for some diagonal matrix ∆ ∈ Sp+ , [verify],
hence (5.12) follows from (5.13).
¯

We now present Charles Stein’s derivation of the optimal GT-equivariant
estimator Σ̂T(S) := ST ∆̂T ST' w. r. to the loss function L2. Remarkably,
Σ̂T(S) is not of the form δ S, hence is not GL-equivariant. Because GT is
a proper subgroup of GL, the class of GT-equivariant estimators properly
contains the class of GL-equivariant estimators, hence Σ̂T dominates the
best fully equivariant estimator S/n. Thus the latter, which is also the
best unbiased estimator and the MLE, is neither admissible nor minimax.
(Similar results hold for the quadratic loss function L1.)

Proposition 5.4.⁸ The best GT-equivariant estimator w. r. to the loss
function L2 is

(5.15)        Σ̂T(S) = ST ∆̂T ST',
where
(5.16)        ∆̂T = diag(δ̂T,1, . . . , δ̂T,p)
and
(5.17)        δ̂T,i = 1/(n + p + 1 − 2i).

Proof. Let ST = {tij | 1 ≤ j ≤ i ≤ p}. Because GT acts transitively on
Sp+ and L2 is GT-invariant, each GT-equivariant estimator ST ∆ ST' has
constant risk R2(ST ∆ ST', Σ) given by

  EI[ L2(ST ∆ ST', I) ]
    = EI[ tr(ST ∆ ST') − log |ST ∆ ST'| − p ]
    = EI[ tr(∆ ST' ST) ] − ∑_{i=1}^p log δi − EI[ log |ST ST'| ] − p
    = EI[ ∑_{i=1}^p δi ( t²ii + t²(i+1)i + · · · + t²pi ) ] − ∑_{i=1}^p log δi + const.
    = ∑_{i=1}^p [ δi ( (n − i + 1) + (p − i) ) − log δi ] + const.*
(5.18)
    = ∑_{i=1}^p [ δi (n + p + 1 − 2i) − log δi ] + const.

⁸ James and Stein (1962), Proc. 4th Berkeley Symp. Math. Statist. Prob. V.1.


The ith term in the last sum is minimized by δ̂i = 1/(n + p + 1 − 2i), as asserted.
*This follows from Bartlett’s decomposition (Proposition 3.20).
¯
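
A minimal implementation sketch of (5.15)–(5.17) in Python/NumPy (the function name is mine; np.linalg.cholesky returns exactly the lower triangular square root S_T used here):

    import numpy as np

    def stein_GT_estimator(S, n):
        """Stein's best G_T-equivariant estimator (5.15)-(5.17) under Stein's loss L2."""
        p = S.shape[0]
        S_T = np.linalg.cholesky(S)                          # lower triangular, S = S_T S_T'
        delta = 1.0 / (n + p + 1 - 2 * np.arange(1, p + 1))  # delta_{T,i} = 1/(n+p+1-2i)
        return S_T @ np.diag(delta) @ S_T.T

    # example: compare with the unbiased estimator S/n on one simulated S ~ W_p(n, I)
    rng = np.random.default_rng(5)
    p, n = 4, 12
    X = rng.standard_normal((p, n))
    S = X @ X.T
    print(np.round(stein_GT_estimator(S, n), 3))
    print(np.round(S / n, 3))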

For the loss function L2 , the improvement in risk offered by Stein’s


estimator Σ̂T (S) = ST ∆ ˆ T ST  compared to the unbiased estimator 1 S
n
is ≈ 5-20% for moderate values of p.9 However, this estimator is itself
inadmissible and can be improved upon readily as follows:
Replace the lower triangular group GT with the upper triangular group
GU to obtain the alternative version of Stein’s estimator given by

(5.19)        Σ̂U(S) = SU ∆̂U SU',

where SU ≡ S_{GU} is the unique upper triangular square root of S and
∆̂U = diag(δ̂U,1, . . . , δ̂U,p) with

        δ̂U,i = δ̂T,p−i+1 = 1/(n − p − 1 + 2i).

Because GU also acts transitively on Sp+ , the risk function of Σ̂U is also
constant on Sp+ with the same constant value as the risk function of Σ̂T
[why?10 ] Since L2 (Σ̂, Σ) is strictly convex in Σ̂ [verify!], so is R2 (Σ̂, Σ)
9
S. Lin and M. Perlman (1985). A Monte Carlo comparison of four estimators of a
covariance matrix. In Multivariate Analysis – VI, P. R. Krishnaiah, ed., pp. 411-429.
¹⁰ Use an invariance argument: Let Π denote the p × p permutation matrix corre-
sponding to the permutation (1, . . . , p) → (p, . . . , 1). Then

        S̃ := ΠSΠ' = ΠSU SU'Π' = (ΠSU Π')(ΠSU Π')'

and ΠSU Π' is lower triangular, so ΠSU Π' = S̃T by uniqueness. Also ∆̂U = Π'∆̂T Π,
so from (5.19),

        Σ̂U(S) = (Π'S̃T Π)(Π'∆̂T Π)(Π'S̃T'Π) = Π'(S̃T ∆̂T S̃T')Π
               = Π' Σ̂T(S̃) Π = Π' Σ̂T(ΠSΠ') Π.

Now apply (5.3) with A = Π to obtain

(5.20)   R2( Σ̂U(S), Σ ) ≡ R2( Π'Σ̂T(ΠSΠ')Π, Σ ) = R2( Σ̂T(S), ΠΣΠ' ),

so Σ̂U and Σ̂T must have the same (constant) risk function, as asserted.   ¯


[verify], hence

        R2( ½(Σ̂T + Σ̂U), Σ ) < ½ R2(Σ̂T, Σ) + ½ R2(Σ̂U, Σ) = R2(Σ̂T, Σ).

Therefore the estimator ½(Σ̂T + Σ̂U) strictly dominates Σ̂T (and Σ̂U).
The preceding discussion suggests another estimator that strictly dom-
inates ½(Σ̂T + Σ̂U), namely

(5.21)        Σ̂P(S) := (1/p!) ∑_{Π∈P(p)} Π' Σ̂T(ΠSΠ') Π,

where P ≡ P(p) is the subgroup of all p × p permutation matrices. Again
the strict convexity of L2 implies that Σ̂P dominates Σ̂T, in fact [verify!]

        R2(Σ̂P, Σ) < R2( ½(Σ̂T + Σ̂U), Σ ) < R2(Σ̂T, Σ).

5.4. Orthogonally equivariant estimators of Σ.


The estimator Σ̂P(S) in (5.21) is the average over P of the transformed
estimators Π' Σ̂T(ΠSΠ') Π and is itself permutation-equivariant [verify]:

(5.22)        Σ̂P(ΠSΠ') = Π Σ̂P(S) Π'    ∀ Π ∈ P.

Because P is a proper subgroup of the orthogonal group O, the preceding
discussion suggests the following estimator, obtained by averaging over O
itself:

(5.23)        Σ̂O(S) = ∫_O Γ' Σ̂T(ΓSΓ') Γ dν(Γ),

where ν is the Haar probability measure on O, i.e. the unique (left ≡ right)
orthogonally invariant probability measure on O. Since [verify!]

(5.24)        Σ̂O(S) = ∫_O Γ' Σ̂P(ΓSΓ') Γ dν(Γ),
¯


the strict convexity of L2 implies that Σ̂O in turn dominates Σ̂P [verify!]:

        R2(Σ̂O, Σ) < R2(Σ̂P, Σ).

The estimator Σ̂O, first proposed¹¹ by Akimichi Takemura, is orthogo-
nally equivariant: for any Γ ∈ O,

        Σ̂O(ΓSΓ') = ∫_O Ψ' Σ̂T( Ψ(ΓSΓ')Ψ' ) Ψ dν(Ψ)
                  = Γ ∫_O (ΨΓ)' Σ̂T( (ΨΓ)S(ΨΓ)' ) (ΨΓ) dν(Ψ) Γ'
                  =* Γ [ ∫_O Φ' Σ̂T(ΦSΦ') Φ dν(Φ) ] Γ'
(5.25)            = Γ Σ̂O(S) Γ',

where * follows from the substitution Ψ → Φ ≡ ΨΓ and the orthogonal
invariance of ν: dν(Ψ) = dν(ΨΓ) ≡ dν(Φ). The estimator Σ̂O offers greater
improvement over S/n than does Σ̂T(S), often a reduction in risk of 20-30%.

Clearly the unbiased estimator S/n is orthogonally equivariant [verify].


The class of orthogonally equivariant estimators is characterized as follows:

Lemma 5.3. For any S ∈ Sp+ let S = ΓS D_{l(S)} ΓS' be its spectral decom-
position. Here l(S) = (l1(S), . . . , lp(S)) where l1 ≥ · · · ≥ lp (> 0) are the
ordered eigenvalues of S, the columns of ΓS are the corresponding eigen-
vectors, and D_{l(S)} = diag(l1(S), . . . , lp(S)). An estimator Σ̂ ≡ Σ̂(S) is
O-equivariant iff

(5.26)        Σ̂(S) = ΓS D_{φ(l(S))} ΓS',

where D_{φ(l)} = diag(φ1(l1, . . . , lp), . . . , φp(l1, . . . , lp)) with φ1 ≥ · · · ≥ φp > 0.

Proof. For any Γ ∈ O and S ∈ Sp+,

        Γ S Γ' = (ΓΓS) D_{l(S)} (ΓΓS)',

hence Γ_{ΓSΓ'} = ΓΓS and l(ΓSΓ') = l(S). Thus if Σ̂(S) satisfies (5.26) then

        Σ̂(ΓSΓ') = Γ_{ΓSΓ'} D_{φ(l(ΓSΓ'))} Γ_{ΓSΓ'}' = Γ Σ̂(S) Γ',

so Σ̂ is O-equivariant.
Conversely, if Σ̂ is O-equivariant then

(5.27)        Σ̂(S) = ΓS Σ̂(ΓS' S ΓS) ΓS' = ΓS Σ̂(D_{l(S)}) ΓS'.

But
        Σ̂(D_{l(S)}) = D± Σ̂(D_{l(S)}) D±    ∀ D± ≡ diag(±1, . . . , ±1) ∈ O,

hence (recall (5.14)) Σ̂(D_{l(S)}) must be a diagonal matrix whose entries
depend on S only through l(S). That is,

        Σ̂(D_{l(S)}) = D_{φ(l(S))}

for some φ(l(S)) ≡ (φ1(l(S)), . . . , φp(l(S))), so (5.27) yields (5.26).


¯

By (5.5), the risk function R2 (Σ̂, Σ) of an O-equivariant estimator Σ̂


is constant on O-orbits of Sp+ , hence satisfies

(5.28)        R2(Σ̂, Σ) = R2(Σ̂, D_{λ(Σ)}),

where λ(Σ) ≡ (λ1 (Σ) ≥ . . . ≥ λp (Σ) (> 0)) is the vector of the ordered
eigenvalues of Σ. Thus, by restricting consideration to orthogonally equiv-
ariant estimators, the problem of estimating Σ reduces to that of estimating
the population eigenvalues λ(Σ) based on the sample eigenvalues l(S).

Exercise 5.5. (Takemura). When p = 2, show that Σ̂O(S) has the form
(5.26) with

(5.29)   φ1(l1, l2) = [ (√l1 δ̂T,1)/(√l1 + √l2) + (√l2 δ̂T,2)/(√l1 + √l2) ] l1,

         φ2(l1, l2) = [ (√l2 δ̂T,1)/(√l1 + √l2) + (√l1 δ̂T,2)/(√l1 + √l2) ] l2,


where δ̂T,1 = 1/(n+1) and δ̂T,2 = 1/(n−1) (set p = 2 in (5.17)).
¯

Because 1/(n+1) < 1/n < 1/(n−1) and l1 > l2, Σ̂O “shrinks” the largest
eigenvalue of S/n and “expands” its smallest eigenvalue when p = 2 [verify],
and Takemura showed that this remains true of Σ̂O for all p ≥ 2.
Stein has argued that the shrinkage/expansion should be stronger than
that given by Σ̂O. For example, he suggested that for any p ≥ 2, if consid-
eration is restricted to orthogonally invariant estimators having the simple
form φi(l1, . . . , lp) = ci li for constants ci > 0, then the best choice of ci is
given by (recall (5.17))

(5.30)        ci = δ̂T,i = 1/(n + p + 1 − 2i),    i = 1, . . . , p.

Several reasons why such shrinkage/expansion is a desirable property


for orthogonally equivariant estimators are now presented.
First, the extremal representations

(5.31)        l1(S) = max_{x'x=1} x'Sx,
(5.32)        lp(S) = min_{x'x=1} x'Sx,

show that l1(S) and lp(S) are, respectively, convex and concave functions
of S [verify]. Thus by Jensen’s inequality,

(5.33)        EΣ[l1(S)] ≥ l1[E(S)] = l1(nΣ) ≡ n λ1(Σ),
(5.34)        EΣ[lp(S)] ≤ lp[E(S)] = lp(nΣ) ≡ n λp(Σ).

Thus l1/n tends to overestimate λ1 and should be shrunk, while lp/n tends to
underestimate λp and should be expanded. This holds for the other eigen-
values also: l2/n, l3/n, . . . should be shrunk while lp−1/n, lp−2/n, . . . should be
expanded.
Next from (3.53) and the concavity of log x,

        E[ ∏_{i=1}^p (1/n) li(S) ] = ∏_{i=1}^p λi(Σ) · ∏_{i=1}^p (n−p+i)/n
                                   ≤ ∏_{i=1}^p λi(Σ) · ( 1 − (p−1)/(2n) )^p
(5.35)                             ≤ ∏_{i=1}^p λi(Σ) · e^{−p(p−1)/(2n)}.


Thus ∏_{i=1}^p (1/n) li(S) will tend to underestimate ∏_{i=1}^p λi(Σ) unless n ≫ p²,
which does not usually hold in applications. This suggests that the shrink-
age/expansion of the sample eigenvalues should not be done in a linear
manner: the smaller li(S)/n’s should be expanded proportionately more than
the larger li(S)/n’s should be shrunk.
A more precise justification is based on the celebrated ”semi-circle” law
[draw figure] of the mathematical physicist E. P. Wigner, since extended by
many others. A strong consequence of these results is that when Σ = λ Ip
(equivalently, λ1(Σ) = · · · = λp(Σ) = λ) and both n, p → ∞ while p/n → η
for some fixed η ∈ (0, 1], then

(5.36)        (1/n) l1(S) → λ (1 + √η)²    a.s.,
(5.37)        (1/n) lp(S) → λ (1 − √η)²    a.s.

Thus if it were known that Σ = λ Ip then l1(S)/n should be shrunk by
the factor 1/(1 + √η)² while lp(S)/n should be expanded by the factor
1/(1 − √η)². Furthermore, the expansion is proportionately greater than
the shrinkage since

        1/(1 + √η)² · 1/(1 − √η)² = 1/(1 − η)² > 1.

Note that these two desired shrinkage factors for l1(S)/n and lp(S)/n
are even more extreme than n c1 ≡ n δ̂T,1 and n cp ≡ n δ̂T,p from (5.30):

(5.38)        1 > n δ̂T,1 ≡ n/(n + p − 1) ≈ 1/(1 + η) > 1/(1 + √η)²,
(5.39)        1 < n δ̂T,p ≡ n/(n − p + 1) ≈ 1/(1 − η) < 1/(1 − √η)².

The shrinkage and expansion factors in (5.36) and (5.37) are derived
only for the case Σ = λ Ip (the “worst case” in that the most shrink-
age/expansion is required). In general the appropriate shrinkage/expansion
factors (equivalently, the functions φ1 , . . . , φp in (5.26)) depend on the (un-
known) empirical distribution of λ1 (Σ), . . . , λp (Σ) so must themselves be
estimated adaptively. Stein12 proposed the following adaptive eigenvalue
12
I first learned of this result at Stein’s 1975 IMS Rietz Lecture in Atlanta, which
remains unpublished in English - Stein published his results in a Russian journal in
1977. I have copies of his handwritten lecture notes from his courses at Stanford and U.
of Washington. Similar results were later obtained independently by Len Haff at UCSD.


estimators:

(5.40)        φ*i(l1, . . . , lp) = li / [ n − p + 1 + 2 li ∑_{j≠i} 1/(li − lj) ]⁺.

The term inside the large parentheses can be negative, hence its positive part
is taken. Also the required ordering φ*1 > . . . > φ*p need not hold, in which
case the ordering is achieved by an isotonization algorithm – see Lin and
Perlman (1985) for details. Despite these complications, Stein’s estimator
offers substantial improvement over the other estimators considered thus
far – the reduction in risk can be 70-90% when Σ ≈ λ Ip!
If the population eigenvalues are widely dispersed, i.e.,

(5.41)        λ1(Σ) ≫ · · · ≫ λp(Σ),

then the sample eigenvalues {li} will also be widely dispersed, so

        li ∑_{j≠i} 1/(li − lj) = ∑_{j>i} li/(li − lj) + ∑_{j<i} li/(li − lj) ≈ (p − i) + 0,

in which case (5.40) reduces to [verify]

(5.42)        φi(l1, . . . , lp) = li /(n + p + 1 − 2i) ≡ δ̂T,i li

(recall (5.30)). On the other hand, if two or more λi(Σ)’s are nearly equal
then the same will be true for the corresponding li’s, in which case the
shrinkage/expansion offered by the φ*i’s will be more pronounced than in
(5.42), a desirable feature as indicated by (5.38) and (5.39).
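
A sketch of (5.40) in Python/NumPy (without the isotonization step, which the notes leave to Lin and Perlman (1985); the clipping floor used when the bracketed term is nonpositive is my own ad hoc choice, and the function name is mine):

    import numpy as np

    def stein_eigenvalue_estimates(S, n):
        """Raw Stein adaptive eigenvalue estimates (5.40); no isotonization."""
        l = np.linalg.eigvalsh(S)[::-1]                      # l_1 >= ... >= l_p
        p = len(l)
        phi = np.empty(p)
        for i in range(p):
            corr = sum(1.0 / (l[i] - l[j]) for j in range(p) if j != i)
            denom = n - p + 1 + 2 * l[i] * corr
            phi[i] = l[i] / max(denom, 1e-8)                 # positive part of the bracket
        return phi

    rng = np.random.default_rng(6)
    p, n = 5, 20
    X = rng.standard_normal((p, n))
    S = X @ X.T                                              # true Sigma = I, eigenvalues all 1
    print(np.round(np.linalg.eigvalsh(S)[::-1] / n, 3))      # sample eigenvalues of S/n
    print(np.round(stein_eigenvalue_estimates(S, n), 3))     # pulled back toward each other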

Remark. When p ≥ 3 it is difficult to evaluate the integral for Takemura’s


estimator Σ̂O (S) in (5.23). However, the integral can be approximated by
Monte Carlo simulation from the Haar probability distribution over O. This
can be accomplished as follows:

Lemma 5.6. Let X ∼ Np×p(0, Ip ⊗ Ip). The distribution of the ran-
dom orthogonal matrix Γ ≡ (XX')^{−1/2} X is the Haar measure on O, i.e.,
the unique left ≡ right orthogonally invariant probability measure on the
compact topological group O.

Proof. It suffices to show that the distribution is right orthogonally invari-
ant, i.e., that Γ ∼ Γ Ψ for all Ψ ∈ O. But this holds since

        Γ Ψ = [(XΨ)(XΨ)']^{−1/2} (XΨ) ∼ (XX')^{−1/2} X = Γ.
¯
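
A sketch of Lemma 5.6 together with the Monte Carlo approximation of (5.23) in Python/NumPy (the helper names are mine, and the number of Monte Carlo draws is an arbitrary choice):

    import numpy as np

    def stein_GT_estimator(S, n):
        p = S.shape[0]
        T = np.linalg.cholesky(S)
        d = 1.0 / (n + p + 1 - 2 * np.arange(1, p + 1))
        return T @ np.diag(d) @ T.T

    def haar_orthogonal(p, rng):
        """Gamma = (XX')^{-1/2} X with X ~ N_{pxp}(0, I (x) I), as in Lemma 5.6."""
        X = rng.standard_normal((p, p))
        lam, V = np.linalg.eigh(X @ X.T)
        return V @ np.diag(lam ** -0.5) @ V.T @ X            # symmetric (XX')^{-1/2} times X

    def takemura_estimator(S, n, rng, n_mc=2000):
        """Monte Carlo approximation of the integral (5.23)."""
        p = S.shape[0]
        out = np.zeros((p, p))
        for _ in range(n_mc):
            G = haar_orthogonal(p, rng)
            out += G.T @ stein_GT_estimator(G @ S @ G.T, n) @ G
        return out / n_mc

    rng = np.random.default_rng(7)
    p, n = 3, 12
    X = rng.standard_normal((p, n))
    S = X @ X.T
    print(np.round(takemura_estimator(S, n, rng), 3))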


6. Invariant Tests of Hypotheses. (See Lehmann TSH Ch. 6, 8.)


Motivation for invariant tests (and equivariant estimators):
(a) Respect the symmetries of a statistical problem.
(b) Unbiasedness fails to yield a UMPU test when testing more than one
parameter. Restricting to invariant tests sometimes leads to a UMPI test,
but at least reduces the class of tests to be compared.

6.1. Invariant statistical models and maximal invariant statistics.


A statistical model is a family P of probability distributions defined on a
sample space (X , A), where A is the sigma-field of measurable subsets of
X . Often P has a parametric representation: P = {Pθ | θ ∈ Θ}. (The
parameterization is assumed to be identifiable.)
Let G be a group of measurable mappings of X into itself. Then G
acts on X if
(1) (g1 g2 )x = g1 (g2 x) ∀g1 , g2 ∈ G, ∀x ∈ X .
(2) 1G x = x ∀x ∈ X . (1G denotes the identity element in G.)
Here (1) and (2) imply that the mapping g : X → gX is a bijection ∀ g ∈ G.

Definition 6.1. Suppose that G acts on X . For x ∈ X , the G-orbit of x


is the subset Gx := {gx | g ∈ G} ⊆ X , i.e., the set of all images of x under
the actions in G. The orbit space

X /G := {Gx | x ∈ X }

is the set of all G-orbits. The orbit projection π is the mapping

π : X → X /G
x → Gx.

Trivially, π is a G-invariant function, that is, π is constant on G-orbits:

π(x) = π(gx) ∀x, g.

[Since G itself is invariant under group multiplication: {gg  | g  ∈ G} = G.]


Definition 6.2. A function t : X → T is a maximal invariant statistic


(MIS) if it is equivalent to the orbit projection π, i.e., if t is constant on G-
orbits and distinguishes G-orbits (takes different values on different orbits.)

Lemma 6.3. Suppose that t : X → T satisfies


(3) t is G-invariant;
(4) if u : X → U is G-invariant, i.e., satisfies u(x) = u(gx) ∀x, g, then u
depends on x only through the value of t(x), i.e., u(x) = w(t(x)) for some
function w : T → U.
Then t is a maximal invariant statistic.
Proof. We need only show that t distinguishes G-orbits. This follows from
(4) with u = π.
¯

If G acts on X then G acts on P as follows: gP := P ◦ g −1 , that is,

(gP )(A) := P (g −1 (A)) ∀A ∈ A.

Equivalently, if X ∼ P then gX ∼ gP .

Definition 6.4. The statistical model P is G-invariant if gP ⊆ P ∀g ∈ G.

If P is G-invariant then by (1) and (2),


(5) (g1 g2 )P = g1 (g2 P ) ∀g1 , g2 ∈ G, ∀P ∈ P. [since (g1 g2 )−1 = g2−1 g1−1 ]
(6) 1G P = P ∀P ∈ P.
Then (5) implies that

P = g(g −1 P) ⊆ gP ∀g ∈ G,

so gP = P ∀g and the mapping g : P → gP is a bijection for each g ∈ G.


Furthermore, if P has a parametric representation {Pθ | θ ∈ Θ} then,
equivalently, G acts on Θ according to

Pgθ := gPθ ≡ Pθ ◦ g −1 .

Also equivalently, if X ∼ Pθ then gX ∼ Pgθ . In this case, (5) and (6)


become


(7) (g1 g2 )θ = g1 (g2 θ) ∀g1 , g2 ∈ G, ∀θ ∈ Θ.


(8) 1G θ = θ ∀θ ∈ Θ. (Thus, GΘ = Θ.)
Again, (7) and (8) imply that gΘ = Θ ∀g and the mapping g : Θ → gΘ
is a bijection for each g ∈ G. Note that if dPθ (x) = f (x, θ)dx then the
G-invariance of P is equivalent to
 
(9)   f(x, θ) = f(gx, gθ) |∂(gx)/∂x|    [verify].

Definition 6.5. Assume that P ≡ {Pθ | θ ∈ Θ} is G-invariant. For θ ∈ Θ,


the G-orbit of θ is the subset Gθ := {gθ | g ∈ G} ⊆ Θ. A function τ : Θ → Ξ
is a maximal invariant parameter (MIP) if it is constant on G-orbits and
distinguishes G-orbits.
¯

As in Lemma 6.3, τ is a maximal invariant parameter iff τ is G-invariant


and any G-invariant parameter σ(θ) depends on θ only through the value
of τ ≡ τ (θ).

Lemma 6.6. Assume that u : X → U is G-invariant. Then the distribution


of u depends on θ only through the value of the maximal invariant parameter
τ . (In particular, the distribution of a maximal invariant statistic t depends
only on τ .)
Proof. We need only show that the distribution of u is G-invariant. But
this is immediate, since for any measurable subset B ⊆ U,

Pgθ [ u(X) ∈ B] = Pθ [ u(gX) ∈ B] = Pθ [ u(X) ∈ B].

6.2. Invariant hypothesis testing problems.


Suppose that P ≡ {Pθ | θ ∈ Θ} is G-invariant and we wish to test

(6.1) H0 : θ ∈ Θ0 vs. H : θ ∈ Θ \ Θ0

based on X, where Θ0 is a proper subset of Θ such that P0 ≡ {Pθ | θ ∈ Θ0 }


is also G-invariant. Then (6.1) is called a G-invariant testing problem. A
sensible approach to such a testing problem is to respect the symmetry of
the problem (i.e., its G-invariance) and restrict attention to test statistics
that are G-invariant. Equivalently, this leads us to consider the “invariance-
reduced” problem where we test H0 vs. H based on the value of a MIS


t ≡ t(x) rather than on the value of x itself. In general this may entail
a loss of information, but optimal invariant tests often (but not always)
remain admissible among all possible tests.
Because P0 and P are G-invariant, the invariance-reduced testing prob-
lem can be restated equivalently as that of testing

(6.2) H0 : τ ∈ Ξ0 vs. H : τ ∈ Ξ \ Ξ0

based on a MIS t, for appropriate sets Ξ0 and Ξ in the range of the MIP τ .
Our goal will be to determine the distribution of the MIS t and apply the
principles of hypothesis testing to (6.2). In particular, if a UMP test exists
for (6.2), it is called UMP invariant (UMPI) with respect to G for (6.1).
In cases where the class of invariant tests still so large that no UMPI
test exists, the likelihood ratio test (LRT) for (6.1), which rejects H0 for
large values of the LRT statistic

maxΘ f (x, θ)
Λ(x) := ,
maxΘ0 f (x, θ)

is often a satisfactory G-invariant test.

Lemma 6.7. The LRT statistic is G-invariant:

Λ(gx) = Λ(x) ∀ g ∈ G.

Proof. Apply property (9) in §6.1.


¯

Example 6.8. Testing a mean vector with known covariance ma-


trix: one observation.
Consider the problem of testing

(6.3)        µ = 0 vs. µ ≠ 0 based on X ∼ Np(µ, Ip).

Here X = Θ = Rp and Θ0 = {0}. Let G = Op ≡ the group of all p × p


orthogonal matrices g acting on X and Θ via

X → gX and µ → gµ,


respectively. Because

gX ∼ Np (gµ, gg  ≡ Ip ),

Θ and Θ0 are G-invariant. For X, µ ∈ X ≡ Rp, the G-orbits of X and µ
are the spheres

        {y ∈ Rp : ‖y‖ = ‖X‖}    and    {ν ∈ Rp : ‖ν‖ = ‖µ‖},

respectively, so

        t ≡ t(X) = ‖X‖²    and    τ ≡ τ(µ) = ‖µ‖²

represent the MIS and MIP, resp. The distribution of t is χ²p(τ), the non-
central chi-square distribution with noncentrality parameter τ. Any G-
invariant statistic depends on X only through ‖X‖², and its distribution
depends on µ only through ‖µ‖². The invariance-reduced problem (6.2)
becomes that of testing

(6.4)        τ = 0 vs. τ > 0 based on ‖X‖² ∼ χ²p(τ).

Since χ²p(τ) has monotone likelihood ratio (MLR) in τ (see Appendix A on
MLR, Example A.14), by the Neyman-Pearson (NP) Lemma the uniformly
most powerful (UMP) level α test for (6.4) rejects ‖µ‖² = 0 if

        ‖X‖² > χ²p;α,

the upper α quantile of the χ2p distribution, and is unbiased. Thus this test
is UMPI level α for (6.3) and is unbiased for (6.3).
¯

Exercise 6.9. (a) In Example 6.8 show that the UMP invariant level α
test is the level α LRT based on X for (6.3).
(b) The power function of this LRT is given by

βp (τ ) := Pr τ [ X2 > χ2p; α ] ≡ Pr[ χ2p (τ ) > χ2p; α ].

It follows from the MLR property (or the log concavity of the normal pdf)
that βp (τ ) is increasing in τ , hence this test is unbiased. Show that for fixed
τ , βp (τ ) is decreasing in p. Hint: apply the NP Lemma.


(c) (Kiefer and Schwartz (1965) Ann. Math. Statist.) Show that the LRT
is a proper Bayes test for (6.3), and therefore is admissible among all tests
for (6.3).
Hint: consider the following prior distribution:

        Pr[ µ = 0 ] = γ,    Pr[ µ ≠ 0 ] = 1 − γ,
        µ | µ ≠ 0 ∼ Np(0, λIp),    (0 < γ < 1, λ > 0).

Example 6.10. Testing a mean vector with unknown covariance


matrix: one observation.
Consider the problem of testing

        µ = 0 vs. µ ≠ 0 based on X ∼ Np(µ, Σ)

with Σ > 0 unknown. Here

X = Rp , Θ = Rp × Sp+ , Θ0 = {0} × Sp+ .

Now we may take G = GL(p), the group of all p × p nonsingular matrices


g, acting on X and Θ via

X → gX and (µ, Σ) → (gµ, gΣg  )

respectively. Again Θ and Θ0 are G-invariant. Now there are only two G-
orbits in X : {0} and Rp \{0} [why?], so any G-invariant statistic is constant
on Rp \ {0}, hence its distribution does not depend on µ. Thus there is
no G-invariant test that can distinguish between the hypotheses µ = 0 and
µ ≠ 0 on the basis of a single observation X when Σ is unknown. ¯


Example 6.11. Testing a mean vector with unknown covariance


matrix: n + 1 observations.
Consider the problem of testing

(6.5)   µ = 0 vs. µ ≠ 0 based on (Y, W) ∼ Np(µ, Σ) × Wp(n, Σ)

with Σ > 0 unknown and n ≥ p. Here

X = Θ = Rp × Sp+ , Θ0 = {0} × Sp+ .

Let G = GL act on X and Θ via

(Y, W ) → (gY, gW g  ) and (µ, Σ) → (gµ, gΣg  ),

respectively. Because

(gY, gW g  ) ∼ Np (gµ, gΣg  ) × Wp (n, gΣg  ),

Θ and Θ0 are G-invariant. It follows from Lemma 6.3 that

        t ≡ t(Y, W) := Y'W^{−1}Y    and    τ ≡ τ(µ, Σ) := µ'Σ^{−1}µ

represent the MIS and MIP, respectively [verify!]. We have seen that

        Hotelling's T² ≡ Y'W^{−1}Y ∼ χ²p(τ) / χ²_{n−p+1},

the ratio of two independent chisquare variates, the first noncentral. (This is
the (nonnormalized) noncentral F distribution Fp, n−p+1(τ).) The invariance-
reduced problem (6.2) becomes that of testing

(6.6)        τ = 0 vs. τ > 0 based on T² ∼ Fp, n−p+1(τ).

Because Fp, n−p+1 (τ ) has MLR in τ (see Example A.15), the UMP level α
test for (6.6) rejects τ = 0 if T 2 > Fp, n−p+1; α and is unbiased. Thus this
test is UMPI level α for (6.5), and is unbiased for (6.5).
¯
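
A small sketch of the T² test in this canonical (Y, W) form (Python/SciPy; since here T² is the nonnormalized ratio of chi-squares, it is rescaled by (n−p+1)/p before comparison with the usual F quantile; the function name is mine):

    import numpy as np
    from scipy.stats import f as f_dist

    def t2_test(Y, W, n, alpha=0.05):
        """Canonical-form Hotelling T^2 test of mu = 0 based on (Y, W), Example 6.11."""
        p = len(Y)
        T2 = float(Y @ np.linalg.solve(W, Y))
        stat = T2 * (n - p + 1) / p                    # normalized: ~ F_{p, n-p+1} under H0
        p_value = f_dist.sf(stat, p, n - p + 1)
        return T2, p_value, p_value < alpha

    rng = np.random.default_rng(8)
    p, n = 4, 30
    Y = rng.standard_normal(p)                         # mu = 0, Sigma = I
    Z = rng.standard_normal((p, n))
    W = Z @ Z.T                                        # W ~ W_p(n, I)
    print(t2_test(Y, W, n))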


Exercise 6.12. (a) In Example 6.11, show that the UMP invariant level α
test ( ≡ the T 2 test) is the level α LRT based on (Y, W ) for (6.5).
(b) The power function of this LRT is given by

βp, n−p+1 (τ ) := Pr τ [ T 2 > Fp, n−p+1;α ] ≡ Pr[ Fp, n−p+1 (τ ) > Fp, n−p+1;α ].

It follows from MLR that βp, n−p+1 (τ ) is increasing in τ , hence this test is
unbiased. Show that for fixed τ and p, βp, n−p+1 (τ ) is increasing in n.
(c)* (Kiefer and Schwartz (1965) Ann. Math. Statist.). Show that the
LRT is a proper Bayes test for testing (6.5) based on (Y, W ), and thus is
admissible among all tests for (6.5).
Hint: consider the prior probability distribution on Θ0 ∪ Θ given by

        Pr[Θ0] = γ,    Pr[Θ] = 1 − γ    (0 < γ < 1);
        (µ, Σ) | Θ0 ∼ π0,    (µ, Σ) | Θ ∼ π,

where π0 and π are measures on Θ0 ≡ {0} × Sp+ and Θ ≡ Rp × Sp+ respec-
tively, defined as follows: π0 assigns all its mass to points of the form

        (µ, Σ) = (0, (Ip + ηη')^{−1}),    η ∈ Rp,

where η has pdf proportional to |Ip + ηη'|^{−(n+1)/2}; π assigns all its mass to
points of the form

        (µ, Σ) = ((Ip + ηη')^{−1}η, (Ip + ηη')^{−1}),    η ∈ Rp,

where η has pdf proportional to

        |Ip + ηη'|^{−(n+1)/2} exp( (1/2) η'(Ip + ηη')^{−1}η ).

Verify that π0 and π are proper measures, i.e., verify that the corresponding
pdfs of η have finite total mass. Show that the T 2 test is the Bayes test for
this prior distribution.
¯


Note: An entirely different method for showing the admissibility of the T 2


test among all tests for (6.5) was given by Stein (Ann. Math. Statist. 1956),
based on the exponential structure of the distribution of (Y, W ).

Example 6.13. Testing a mean vector with covariates and un-


known covariance matrix.
Similar to Example 6.11, but with the following changes. Partition Y, W,
µ, and Σ as

        Y = ( Y1 ),   W = ( W11  W12 ),   µ = ( µ1 ),   Σ = ( Σ11  Σ12 ),
            ( Y2 )        ( W21  W22 )        ( µ2 )        ( Σ21  Σ22 )

respectively, where Yi and µi are pi × 1, Wij and Σij are pi × pj, i, j = 1, 2,


where p1 + p2 = p. Suppose it is known that µ2 = 0, that is, the second
group of p2 variables are covariates. Consider the problem of testing

(6.7)   µ1 = 0 vs. µ1 ≠ 0    based on (Y, W) ∼ Np(µ, Σ) × Wp(n, Σ)

with Σ > 0 unknown and n ≥ p. Again X = Rp × Sp+ , but now

Θ = Rp1 × Sp+ , Θ0 = {0} × Sp+ .

Let G1 be the set of all non-singular block-triangular p × p matrices of the
form
        g = ( g11  g12 ),
            (  0   g22 )
so G1 is a subgroup of the invariance group GL in Example 6.11. Here
G1 ≡ {g} acts on X and Θ via the actions

(Y, W ) → (gY, gW g  ) and (µ1 , Σ) → (g11 µ1 , gΣg  ),

respectively. Then Θ and Θ0 are G1 -invariant [verify].


¯

Exercise 6.14. (a) In Example 6.13, apply Lemma 6.3 to show that

        (L, M) ≡ (L(Y, W), M(Y, W))

            := ( (Y1 − W12 W22^{−1} Y2)' W11·2^{−1} (Y1 − W12 W22^{−1} Y2) / (1 + Y2' W22^{−1} Y2),    Y2' W22^{−1} Y2 )


is a (two-dimensional!) MIS, while

        τ1 ≡ τ1(µ1, Σ) := µ1' Σ11·2^{−1} µ1

is a (one-dimensional!) MIP. Thus the invariance-reduced problem (6.2)
becomes that of testing

(6.8)        τ1 = 0 vs. τ1 > 0 based on (L, M).

(b) Show that the joint distribution of (L, M) ≡ (L(Y, W), M(Y, W)) can
be described as follows:

        L | M ∼ χ²p1( τ1/(1+M) ) / χ²_{n−p+1} ≡ F_{p1, n−p+1}( τ1/(1+M) ),
(6.9)
        M ∼ χ²p2 / χ²_{n−p2+1} ≡ F_{p2, n−p2+1}.

−1
Hint: Begin by finding the conditional distribution of Y1 −W12 W22 Y2 given
(Y2 , W22 ).
(c) Show that the level α LRT based on (Y, W ) for (6.7) is the test that
rejects (µ1 , µ2 ) = (0, 0) if

L > Fp1 , n−p+1; α .

This test is the conditionally UMP level α test for (6.8) given the ancillary
statistic M and is conditionally unbiased for (6.8), therefore unconditionally
unbiased for (6.7) ≡ (6.8).
(d)** Show that no UMP size α test exists for (6.8), so no UMPI test exists
for (6.7). Therefore the LRT is not UMPI. (See Remark 6.16).
(e)* In Exercise 6.12b, show βp,m (τ ) is decreasing in p for fixed τ and m.
Hint: Apply the results (6.9) concerning the joint distribution of (L, M )
¯

Remark 6.15. Since T 2 ≡ Y  W −1 Y = L(1 + M ) + M , the overall T 2 test


in Example 6.11 is also G1 -invariant in Example 6.13, so it is of interest to


compare its power function to that of the LRT in Example 6.13. Given M ,
the conditional power function of the LRT is given by
        Prτ[ F_{p1, n−p+1}( τ1/(1+M) ) > F_{p1, n−p+1; α} | M ] ≡ β_{p1, n−p+1}( τ1/(1+M) ),

while the (unconditional) power of the size-α T 2 test is βp, n−p+1 (τ1 ) because
τ = τ1 when µ2 = 0. Since βp,m (δ) is decreasing in p but increasing in δ
(recall Exercises 6.12b, 6.14e), neither power function dominates the other.
Another possible test in Example 6.13 rejects (µ1 , µ2 ) = (0, 0) iff
        T1² := Y1' W11^{−1} Y1 > F_{p1, n−p1+1; α},

a test that ignores the covariate information and is not G1-invariant [verify].
Since
        T1² ∼ F_{p1, n−p1+1}(τ̃1),

where τ̃1 := µ1' Σ11^{−1} µ1, the power function of the level α test based on T1² is
βp1 , n−p1 +1 (τ̃1 ). Because τ̃1 ≤ τ1 but βp,m (δ) is decreasing in p and increas-
ing in m, the power function of T12 neither dominates nor is dominated by
that of the LRT or of T 2 .
¯

Remark 6.16. Despite their apparent similarity, the invariant testing


problems (6.6) and (6.8) are fundamentally different, due to the fact that
in (6.8) the dimensionality of the MIS (L, M ) exceeds that of the MIP τ1 .
Marden and Perlman (1980) (Ann. Statist.) show that in Example 6.13,
no UMP invariant test exists, and the level α LRT is actually inadmissible
for typical (= small) α values, due to the fact that it does not make use of
the information in the ancillary statistic M . Nonetheless, use of the LRT
is recommended on the basis that it is the UMP conditional test given M ,
it is G1 -invariant, its power function compares well numerically to those of
T 2 , T12 , and other competing tests, and it is easy to apply.
¯

Exercise 6.17. Let (Y, W ) be as in Examples 6.11 and 6.13. Consider the
problem of testing µ2 = 0 vs. µ2 ≠ 0 with µ1 and Σ unknown. Find a
natural invariance group G2 such that the test that rejects µ2 = 0 if

        T2² := Y2' W22^{−1} Y2 > F_{p2, n−p2+1; α}

is UMP among all G2 -invariant level α tests.


¯


Example 6.18. Testing a covariance matrix.


Consider the problem of testing

(6.10)   Σ = Ip vs. Σ ≠ Ip based on S ∼ Wp(r, Σ) (r ≥ p).

Here X = Θ = Sp+ and Θ0 = {Ip }. This problem is invariant under the


action of G ≡ Op on Sp+ given by S → gSg  . It follows from Lemma
6.3 and the spectral decomposition of Σ ∈ Sp+ that the MIS and MIP are
represented by, respectively,

l(S) ≡ (l1 (S) ≥ · · · ≥ lp (S)) := the set of (ordered) eigenvalues of S,


λ(Σ) ≡ (λ1 (Σ) ≥ · · · ≥ λp (Σ)) := the set of (ordered) eigenvalues of Σ.

[verify!]. By Lemma 6.6, the distribution of l(S) depends on Σ only through


λ(Σ); this distribution is complicated when Σ is not of the form κIp for some
κ > 0. The invariance-reduced problem is that of testing

(6.11)   λ(Σ) = (1, . . . , 1) vs. λ(Σ) ≠ (1, . . . , 1) based on l(S).

Here, unlike Examples 6.8, 6.11, and 6.13, when p ≥ 2 the alternative
hypothesis remains multi-dimensional even after reduction by invariance,
so it is not to be expected that a UMPI test exists (it does not).
¯

Exercise 6.19a. In Example 6.18 derive the LRT for (6.10). Express the
test statistic in terms of l(S).
Answer: The LRT rejects Σ = Ip for large values of e^{tr S}/|S|, or equivalently,
for large values of

        ∑_{i=1}^p ( li(S) − log li(S) − 1 ).

Exercise 6.19b. Suppose that Σ = Cov(X). Show that

        λ1(Σ) = max_{‖a‖=1} Var(a'X) ≡ max_{‖a‖=1} a'Σa.

The maximizing linear combination a'X is the first principal component of X.


Hint: Apply the spectral decomposition of Σ.
¯


Exercise 6.20. Testing sphericity. Change (6.10) as follows: test

(6.12)   Σ = κ Ip, 0 < κ < ∞  vs.  Σ ≠ κ Ip,  based on S ∼ Wp(r, Σ).

Show that this problem remains invariant under the extended group

        Ḡ := { ḡ = ag | a > 0, g ∈ Op }.

Express a MIS and MIP for this problem in terms of l(S) and λ(Σ) respec-
tively. Find the LRT for this problem and express it in terms of l(S).
(The hypothesis Σ = κIp, 0 < κ < ∞, is called the hypothesis of sphericity.)

Answer: The LRT rejects the sphericity hypothesis for large values of
( (1/p) tr S ) / |S|^{1/p}, or equivalently, for large values of

        [ (1/p) ∑_{i=1}^p li(S) ] / [ ∏_{i=1}^p li(S) ]^{1/p},

the ratio of the arithmetic and geometric means of l1(S), . . . , lp(S).


¯
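
A quick sketch of the sphericity statistic (Python/NumPy; the function name is mine), returning the arithmetic/geometric mean ratio of the sample eigenvalues:

    import numpy as np

    def sphericity_statistic(S):
        """AM/GM ratio of the eigenvalues of S; large values reject Sigma = kappa * I."""
        l = np.linalg.eigvalsh(S)
        return l.mean() / np.exp(np.log(l).mean())

    rng = np.random.default_rng(9)
    p, r = 4, 25
    X = rng.standard_normal((p, r))
    print(sphericity_statistic(X @ X.T))                 # spherical case: ratio near 1
    X[0] *= 3.0                                          # inflate one variance
    print(sphericity_statistic(X @ X.T))                 # noticeably larger ratio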

Exercise 6.21. If the identity matrix Ip is replaced by any fixed matrix
Σ0 ∈ Sp+, show that the results in Exercises 6.19 and 6.20 can be applied af-
ter the linear transformations S → Σ0^{−1/2} S Σ0^{−1/2}' and Σ → Σ0^{−1/2} Σ Σ0^{−1/2}'.

Example 6.22. Testing independence of two sets of variates.


In the setting of Example 6.18, partition S and Σ as

        S = ( S11  S12 ),      Σ = ( Σ11  Σ12 ),
            ( S21  S22 )           ( Σ21  Σ22 )

respectively. Here, Sij and Σij are pi × pj matrices, i, j = 1, 2, where
p1 + p2 = p. Let Θ = Sp+ as before, but now take

        Θ0 = {Σ ∈ Sp+ | Σ12 = 0},

so (6.1) becomes the problem of testing

(6.13)   Σ12 = 0 vs. Σ12 ≠ 0 based on S ∼ Wp(n, Σ) (n ≥ p).


If G is the group of non-singular block-diagonal p × p matrices of the form


 
        g = ( g11   0  )
            (  0   g22 )

(so G = GL(p1) × GL(p2)), then (6.13) is invariant under the action of G on
Sp+ given by S → gSg' [verify]. It follows from Lemma 6.3 and the singular
value decomposition that a MIS is [verify!]

        r(S) ≡ (r1(S) ≥ · · · ≥ rq(S)) := the singular values of S11^{−1/2} S12 S22^{−1/2}',

the canonical correlation coefficients of S, and a MIP is [verify!]

        ρ(Σ) ≡ (ρ1(Σ) ≥ · · · ≥ ρq(Σ)) := the singular values of Σ11^{−1/2} Σ12 Σ22^{−1/2}',

the canonical correlation coefficients of Σ, where q = min{p1 , p2 } (see Ex-


ercise 6.25).
The distribution of r(S) depends on Σ only through ρ(Σ); it is com-
plicated when Σ12 = 0. The invariance-reduced problem is that of testing

(6.14) ρ(Σ) = (0, . . . , 0) vs. ρ(Σ) ≥ (0, . . . , 0) based on r(S).

When p ≥ 2 the alternative hypothesis remains multi-dimensional even after


reduction by invariance, so a UMPI test for (6.13) does not exist. ¯

Remark 6.23. This model and testing problem can be reduced to the
multivariate linear model and MANOVA testing problem (see Remark 8.5)
by conditioning on S22 :

(6.15)   Y := S12 S22^{−1/2}' | S22 ∼ N_{p1×p2}( β S22^{1/2}, Σ11·2 ⊗ Ip2 ),

where β = Σ12 Σ22^{−1}. Since Σ12 = 0 iff β = 0, the present testing problem


is equivalent to that of testing β = 0 vs. β = 0 based on (Y, S11·2 ), a
MANOVA testing problem under the conditional distribution of Y .
¯

Exercise 6.24. In Example 6.22 find the LRT for (6.13). Express the
test statistic in terms of r(S). Show this LRT statistic is equivalent to


the conditional LRT statistic for testing β = 0 vs. β = 0 based on the


−1/2 
conditional distribution of (S12 S22 , S11·2 ) given S22 (see Exercise 6.37a).
Show that when Σ12 = 0, the conditional and unconditional distributions
of the LRT statistic are identical. [This distribution can be expressed in
terms of Wilks’ distribution U (p1 , p2 , n − p2 ) – see Exercises 6.37c, d, e.]
Partial answer: The (unconditional and conditional) LRT rejects Σ12 = 0
for large values of

        |S11| |S22| / |S|,

or equivalently, for small values of

(6.16)        ∏_{i=1}^q ( 1 − r²i(S) ).
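
A sketch computing the canonical correlations and the statistic (6.16) in Python/NumPy (the function name is mine; the identity ∏(1 − r²i) = |S| / (|S11||S22|) provides a cross-check):

    import numpy as np

    def canonical_correlations(S, p1):
        """Singular values of S11^{-1/2} S12 S22^{-1/2} (the r_i(S) of Example 6.22)."""
        S11, S12, S22 = S[:p1, :p1], S[:p1, p1:], S[p1:, p1:]
        A = (np.linalg.inv(np.linalg.cholesky(S11)) @ S12
             @ np.linalg.inv(np.linalg.cholesky(S22)).T)
        return np.linalg.svd(A, compute_uv=False)

    rng = np.random.default_rng(10)
    p, p1, n = 5, 2, 40
    X = rng.standard_normal((p, n))
    S = X @ X.T
    r = canonical_correlations(S, p1)
    print(np.prod(1 - r**2))
    S11, S12, S22 = S[:p1, :p1], S[:p1, p1:], S[p1:, p1:]
    print(np.linalg.det(S) / (np.linalg.det(S11) * np.linalg.det(S22)))   # same value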
 
Exercise 6.25. Suppose that Σ = Cov( (X1', X2')' ). Show that

        ρ1(Σ) = max_{a1≠0, a2≠0} Cor(a1'X1, a2'X2) ≡ max_{a1≠0, a2≠0} a1'Σ12 a2 / √( a1'Σ11 a1 · a2'Σ22 a2 ).

Hint: Apply the Cauchy-Schwartz inequality.


¯

Example 6.26. Testing a multiple correlation coefficient.


In Example 6.22 set p1 = 1, so p2 = p − 1 and q ≡ min(1, p − 1) = 1. Now
the MIS r1 (S) ≥ 0 and the MIP ≡ ρ1 (Σ) ≥ 0 are one-dimensional and can
be expressed explicitly as follows:
        r1²(S) = S12 S22^{−1} S21 / S11 =: R²,        ρ1²(Σ) = Σ12 Σ22^{−1} Σ21 / Σ11 =: ρ².

The invariance-reduced problem (6.14) becomes that of testing

(6.17) ρ2 = 0 vs. ρ2 > 0 based on R2 .

By normality, the hypotheses

(6.18)        Σ12 = 0,    ρ² = 0,    and    X1 ⊥⊥ X2


are mutually equivalent. By (6.16) and (3.74) the size α LRT for testing
Σ12 = 0 vs. Σ12 ≠ 0 rejects Σ12 = 0 if R² > B_{(p−1)/2, (n−p+1)/2; α}.
Note: ρ and R are called the population (resp., sample) multiple correlation
coefficients for the following reason: if
   
        ( Σ11  Σ12 )  =  Cov( X1 ),
        ( Σ21  Σ22 )        ( X2 )

then
        ρ = max_{a2≠0} Cor(X1, a2'X2) = max_{a2≠0} Σ12 a2 / √( Σ11 · a2'Σ22 a2 ),

with equality attained at â2 = Σ22^{−1} Σ21. [Verify; apply the Cauchy-Schwartz
inequality – this implies that Σ12 Σ22^{−1} X2 is the best linear predictor of X1
based on X2 when EX = 0.]
¯

Exercise 6.27. Show that the R2 -test is UMPI and unbiased.


Solution: In Example A.18 of Appendix A it is shown that the pdf of R2 has
MLR in ρ2 . Thus this R2 -test is the UMP size α test for the invariance-
reduced problem (6.17), hence is the UMPI size α test for Σ12 = 0 vs.
Σ12 = 0, and is unbiased. ¯

Remark 6.28. (Kiefer and Schwartz (1965) Ann. Math. Statist.) By an


argument similar to that in Exercise 6.14c, the LRT is a proper Bayes test
for testing Σ12 = 0 vs. Σ12 = 0 based on S, and thus is admissible among
all tests for this problem. ¯

Remark 6.29. When Σ12 = 0, R² ≡ Q/(1 + Q) ∼ B( (p−1)/2, (n−p+1)/2 ) (see (3.74)),
so
        E(R²) = (p − 1)/n > 0 = ρ².

Thus, under the null hypothesis of independence, R² is an overestimate of
ρ1²(Σ) ≡ 0 (unless n ≫ p), hence might naively suggest dependence of X1
on X2.
¯


Example 6.30. Testing independence of k ≥ 3 sets of variates.


In the framework of Example 6.22, partition S and Σ as

        S = ( S11  · · ·  S1k ),      Σ = ( Σ11  · · ·  Σ1k ),
            (  ⋮           ⋮  )           (  ⋮           ⋮  )
            ( Sk1  · · ·  Skk )           ( Σk1  · · ·  Σkk )

respectively, where k ≥ 3. Again Sij and Σij are pi × pj matrices, i, j =
1, . . . , k, where p1 + · · · + pk = p. Take

        Θ0 = {Σ | Σij = 0, i ≠ j},

so (6.1) becomes the problem of testing

(6.19)   Σij = 0 ∀ i ≠ j  vs.  Σij ≠ 0 for some i ≠ j,  based on S ∼ Wp(n, Σ)

with n ≥ p. If G is the set of all non-singular block-diagonal p × p matrices

        g ≡ ( g11  · · ·   0  ),
            (  ⋮    ⋱     ⋮  )
            (  0   · · ·  gkk )

so G = GL(p1) × · · · × GL(pk), then (6.19) is G-invariant. Now
no explicit representation of the MIS and MIP is known (probably none
exists). Again the alternative hypothesis remains multi-dimensional even
after reduction by invariance, so a UMPI test does not exist.
¯

Exercise 6.31. In Example 6.30, derive the LRT for (6.19).


Answer: The LRT rejects Σij = 0, i ≠ j, for large values of ∏_{i=1}^k |Sii| / |S|.
Note: This LRT is proper Bayes and admissible among all tests for (6.19)
(Kiefer and Schwartz (1965) Ann. Math. Statist.) and is unbiased. ¯


Example 6.32. Testing equality of two covariance matrices.


Consider the problem of testing

(6.20)   Σ1 = Σ2  vs.  Σ1 ≠ Σ2,   based on (S1, S2) ∼ Wp(n1, Σ1) × Wp(n2, Σ2)

with n1 , n2 ≥ p. Here

X = Θ = Sp+ × Sp+ , Θ0 = Sp+ .

This problem is invariant under the action of GL on Sp+ × Sp+ given by

(6.21) (S1 , S2 ) → (gS1 g  , gS2 g  )

It follows from Lemma 6.3 and the simultaneous diagonalizability of two
positive definite matrices that the MIS and MIP are represented by

        f(S1, S2) ≡ (f1(S1, S2) ≥ · · · ≥ fp(S1, S2)) := the eigenvalues of S1 S2^{−1},
        φ(Σ1, Σ2) ≡ (φ1(Σ1, Σ2) ≥ · · · ≥ φp(Σ1, Σ2)) := the eigenvalues of Σ1 Σ2^{−1},

respectively [verify!]. By Lemma 6.6 the distribution of f(S1, S2) depends
on (Σ1, Σ2) only through φ(Σ1, Σ2); this distribution is complicated when
Σ1 ≠ κΣ2. The invariance-reduced problem becomes that of testing

(6.22)   φ(Σ1, Σ2) = (1, . . . , 1) vs. φ(Σ1, Σ2) ≠ (1, . . . , 1) based on f(S1, S2).

When p ≥ 2 the alternative hypothesis remains multi-dimensional even after


reduction by invariance, so a UMPI test for (6.20) does not exist. ¯

Exercise 6.33. In Example 6.32, derive the LRT for (6.20) and express the
test statistic in terms of f(S1, S2). Show that the LRT statistic is minimized
when (1/n1) S1 = (1/n2) S2.
Answer: The LRT rejects Σ1 = Σ2 for large values of

        |S1 + S2|^{n1+n2} / ( |S1|^{n1} |S2|^{n2} ),


or equivalently, for large values of

        ∏_{i=1}^p (1 + fi^{−1})^{n1} (1 + fi)^{n2},        (fi ≡ fi(S1, S2)).

The ith term in the product is minimized when fi = n1/n2.


¯

Example 6.34. Testing equality of k ≥ 3 covariance matrices.


Consider the problem of testing

(6.23)   Σ1 = · · · = Σk  vs.  Σi ≠ Σj for some i ≠ j,
         based on (S1, . . . , Sk) ∼ Wp(n1, Σ1) × · · · × Wp(nk, Σk),

with n1 ≥ p, . . . , nk ≥ p. Here

X = Θ = Sp+ × · · · × Sp+ (k times), Θ0 = Sp+ .

This problem is invariant under the action of GL on Sp+ × · · · × Sp+ given


by
(S1 , . . . , Sk ) → (gS1 g  , . . . , gSk g  ).
As in Example 6.30, no explicit representation of the MIS and MIP are
known (probably none exists). The alternative hypothesis is multidimen-
sional after reduction by invariance; no UMPI test for (6.23) exists.
¯

Exercise 6.35. In Example 6.34, derive the LRT for (6.23). Show that the
LRT statistic is minimized when (1/n1) S1 = · · · = (1/nk) Sk.

Answer: The LRT rejects Σ1 = · · · = Σk for large values of

        | ∑_{i=1}^k Si |^{∑ ni} / ∏_{i=1}^k |Si|^{ni}.

To minimize this, apply the case k = 2 repeatedly.


Note: This LRT, also called Bartlett’s test, is unbiased when k ≥ 2. (Perl-
man (1980) Ann. Statist.)
¯

89
STAT 542 Notes, Winter 2007; MDP

Example 6.36. The canonical MANOVA testing problem.


Consider the problem of testing

µ=0 vs. µ = 0 (Σ unknown)


(6.24)
based on (Y, W ) ∼ Np×r (µ, Σ ⊗ Ir ) × Wp (n, Σ)

with Σ > 0 unknown and n ≥ p. (Example 6.11 is the special case where
r = 1.) Here

X = Θ = Rp×r × Sp+ , Θ0 = {0} × Sp+ .

This problem is invariant under the action of the group GL × Or ≡ {(g, γ)}
acting on X and Θ via

(Y, W ) → (gY γ  , gW g  ),
(6.25)
(µ, Σ) → (gµγ  , gΣg  ),

respectively. It follows from Lemma 6.3 and the singular value decomposi-
tion that a MIS is [verify!]

f (Y, W ) ≡ (f1 (Y, W ) ≥ · · · ≥ fq (Y, W ))


:= the nonzero eigenvalues of Y  W −1 Y

where q := min(p, r) (or equivalently, the nonzero eigenvalues of Y Y  W −1 ),


and a MIP is [verify!]

φ(µ, Σ) ≡ (φ1 (µ, Σ) ≥ · · · ≥ φq (µ, Σ))


:= the nonzero eigenvalues of µ Σ−1 µ,

(or equivalently, the nonzero eigenvalues of µµ Σ−1 ). The distribution of


f (Y, W ) depends on (µ, Σ) only through φ(µ, Σ); it is complicated when
µ = 0. The invariance-reduced problem (6.2) becomes that of testing

(6.26) φ(µ, Σ) = (0, . . . , 0) vs. φ(µ, Σ) ≥ (0, . . . , 0) based on f (Y, W ).

Here the MIS and MIP have the same dimension, namely q, and a UMP
invariant test will not exist when q ≡ min(p, r) ≥ 2.

90
STAT 542 Notes, Winter 2007; MDP

Note that f (Y, W ) reduces to the T 2 statistic when r = 1, so in the gen-


eral case the distribution of f (Y, W ) is a generalization of the (central and
noncentral) F distribution. The distribution of (f1 (Y, W ), . . . , fq (Y, W ))
when µ = 0 is given in Exercise 7.2.
(The reduction of the general MANOVA testing problem to this canon-
ical form will be presented in §8.2.) ¯

Exercise 6.37a. In Example 6.36, derive the LRT for testing µ = 0 vs.
µ = 0 based on (Y, W ). Express the test statistic in terms of f (Y, W ). Show
that when µ = 0, W + Y Y  is independent of f (Y, W ), hence is independent
of the LRT statistic.
Partial solution: The LRT rejects µ = 0 for large values of

|W + Y Y  | q
 −1
(6.27) = |Ip + Y W Y | ≡ (1 + fi (Y, W )).
|W | i=1

When µ = 0, W + Y Y  is a complete and sufficient statistic for Σ, and


f (Y, W ) is an ancillary statistic, hence they are independent by Basu’s
Lemma. (Also see §7.1.)
¯

Exercise 6.37b. Let U be the matrix-variate Beta rv (recall Exercise 4.2)


defined as

(6.28) U := (W + Y Y  )−1/2 W (W + Y Y  )−1/2 .

|W |
Derive the moments of |W +Y Y  | ≡ |U | under the null hypothesis µ = 0.

Solution: By independence,

(6.29) E(|W |k ) = E(|U |k |W + Y Y  |k ) = E(|U |k ) E(|W + Y Y  |k ),

so (recall (4.16) in Exercise 4.2)


r+n
n

E(|W |k
) E(|S| k
) Γp Γ p + k
(6.30) E(|U |k ) = 
= = n 2
r+n2
.
¯
E(|W + Y Y | ) k E(|V | )
k Γp 2 Γp 2 + k

91
STAT 542 Notes, Winter 2007; MDP

Exercise 6.37c. Let U (p, r, n) denote the null (µ = 0) distribution of |U |.


(U (p, r, n) is called Wilks’ distribution.) Show that this distribution can
be represented as the product of independent Beta distributions:
r

(6.31) U (p, r, n) ∼ B n−p+i


2 , p
2 ,
i=1

where the Beta variates are mutually independent.


Note: The moments of |U | given in 6.30) or obtained directly from (6.31)
can be used to obtain the Box approximation, a chi-square approximation
to the Wilks’ distribution U (p, r, n). (See T.W.Anderson book, §8.5.) ¯

Exercise 6.37d. In Exercise 6.24 it was found that the LRT for testing

(6.32) Σ12 = 0 vs. Σ12 = 0

(i.e., testing independence of two sets of variates) rejects Σ12 = 0 for small
values of |S11|S|
||S22 | . Show that the null (Σ12 = 0) distribution of this LRT
statistic is U (p1 , p2 , n − p2 ) – see Exercise 6.24.
¯

Exercise 6.37e. Show that U (p, r, n) ∼ U (r, p, r + n − p), hence


p

U (p, r, n) ∼ B n−p+i
2 , r
2 .
¯
i=1

Remark 6.38. Perlman and Olkin (Annals of Statistics 1980) applied the
FKG inequality to show that the LRTs in Exercises 6.24 and 6.37a are
unbiased.
¯

Example 6.39. The canonical GMANOVA Model.


(Example 6.13 is a special case.) [To be completed]
¯

Example 6.40. An inadmissible UMPI test. (C. Stein – see Lehmann


TSH Example 11 p.305 and Example 9 p.522.)
Consider Example 6.32 (testing Σ1 = Σ2 ) with p > 1 but with n1 = n2 = 1,
so S1 and S2 are each singular of rank 1. This problem again remains
invariant under the action of GL on (S1 , S2 ) given by (6.21):

(S1 , S2 ) → (gS1 g  , gS2 g  ).

92
STAT 542 Notes, Winter 2007; MDP

Here, however, GL acts transitively [verify] on this sample space since S1 , S2


each have rank 1, so the MIS is trivial: t(S1 , S2 ) ≡ const. This implies that
the only size α invariant test is φ(S1 , S2 ) ≡ α, so its power is identically α.
However, there exist more powerful non-invariant tests. For any nonzero
a : p × 1, let

a S1 a a S1 a
(6.33) Va ≡  ∼  · F1,1 ≡ δa · F1,1
a S2 a a S2 a

and let φa denote the UMPU size α test for testing δa = 1 vs. δa = 1 based
on Va (cf. TSH Ch.5 §3). Then [verify]: φa is unbiased size α for testing
Σ1 = Σ2 , with power > α when δa = 1, so φa dominates the UMPI test φ.
Note: This failure of invariance to yield a nontrivial UMPI test is usually
attributed to the group GL being “too large”, i.e., not “amenable”.13 How-
ever, this example is somewhat artificial in that the sample sizes are too
small (n1 = n2 = 1) to permit estimation of Σ1 and Σ2 . It would be of
interest to find (if possible?) an example of a trivial UMPI test in a less
contrived model. ¯

Exercise 6.41. Another inadmissible UMPI test. (see Lehmann TSH


Problem 11 p.532.)
Consider Example 6.10 (testing µ = 0 with Σ unknown) with n > 1 obser-
vations but n < p. As in Example 6.40, show that the UMPI GL-invariant
test is trivial but there exists more powerful non-invariant tests. ¯

13
See Bondar and Milnes (1981) Zeit. f. Wahr. 57, pp. 103-128.

93
STAT 542 Notes, Winter 2007; MDP

7. Distribution of Eigenvalues. (See T.W.Anderson book, Ch. 13.)


In the invariant testing problems of Examples 6.22 (testing Σ12 = 0),
6.32 (testing Σ1 = Σ2 ), and 6.36 (the canonical MANOVA testing prob-
lem), the maximal invariant statistic (MIS) was represented as the set of
nontrivial eigenvalues of a matrix of one of the forms

ST −1 or S(S + T )−1 ,

where S and T are independent Wishart matrices.14 Because the LRT


statistic is invariant (Lemma 6.7), it is necessarily a function of these eigen-
values. When the dimensionality of the invariance-reduced alternative hy-
pothesis is ≥ 2,15 however, no single invariant test is UMPI, and other
reasonable invariant test statistics16 have been proposed: for example,
q ri2
r12 (Roy) and (Lawley − Hotelling)
i=1 1 − ri2
−1
in Example 6.22, where we may take (S, T ) = (S12 S22 S21 , S11·2 ), and
q
f1 (Roy) and fi (Lawley − Hotelling)
i=1

in Example 6.36, where (S, T ) = (Y Y  , W ).


Thus, to determine the distribution of such invariant test statistics it
is necessary to determine the distribution of the eigenvalues of ST −1 or
equivalently [why?] of S(S + T )−1 .

7.1. The central distribution of the eigenvalues of S(S + T )−1 .


Let S and T be independent with S ∼ Wp (r, Σ) and T ∼ Wp (n, Σ), Σ > 0.
Assume further that n ≥ p, so T > 0 w. pr. 1. Let

1 ≥ b1 ≥ · · · ≥ bq > 0 and f1 ≥ · · · ≥ fq > 0


14
In Example 6.36, Y Y  (≡ S here) has a noncentral Wishart distribution under the
−1
alternative hypothesis, i.e., E(Y ) = µ = 0. In Example 6.22, S12 S22 S21 (≡ S here)
has a conditional noncentral Wishart distribution under the alternative hypothesis.
15
For example, see (6.14), (6.22), and (6.26).
16
Schwartz (Ann. Math. Statist. (1967) 698-710), presents a sufficient condition and
a (weaker) necessary condition for an invariant test to be admissible among all tests.

94
STAT 542 Notes, Winter 2007; MDP

denote the q ≡ min(p, r) ordered nonzero17 eigenvalues of S(S + T )−1 (the


Beta form) and ST −1 (the F form), respectively. Set

b ≡ (b1 , . . . , bq ) ≡ {bi (S, T )},


(7.1)
f ≡ (f1 , · · · , fq ) ≡ {fi (S, T )}.

First we shall derive the pdf of b, then obtain the pdf of f using the relation

bi
(7.2) fi = .
1 − bi

Because b is GL-invariant, i.e.,

bi (S, T ) = bi (ASA , AT A ) ∀ A ∈ GL,

the distribution of b does not depend on Σ [verify], so we may set Σ = Ip .


Denote this distribution by b(p, n, r) and the corresponding distribution of
f by f (p, n, r).

Exercise 7.1. Show that (compare to Exercise 6.37e)

b(p, n, r) = b(r, n + r − p, p),


(7.3)
f (p, n, r) = f (r, n + r − p, p).

Outline of solution: Let W be a partitioned Wishart random matrix:

p r
 
p W11 W12
W ≡ ∼ Wp+r (m, Ip+r ).
r W21 W22

Assume that m ≥ max(p, r), so W11 > 0 and W22 > 0 w. pr. 1. By the
properties of the distribution of a partitioned Wishart matrix (Proposition
3.13),
−1 −1
(a) the distribution of the nonzero eigenvalues of W12 W22 W21 W11
is b(p, m − r, r) [verify!]
17
If p > r then q = r and p − r of the eigenvalues of S(S + T )−1 are trivially
≡ 1. By Okamoto’s Lemma the nonzero eigenvalues are distinct w. pr. 1.

95
STAT 542 Notes, Winter 2007; MDP

−1 −1
(b) the distribution of the nonzero eigenvalues of W21 W11 W12 W22
is b(r, m − p, p) [verify!].
But these two sets of eigenvalues are identical18 so the result follows by
setting n = m − r.
¯

By Exercise 7.1 it suffices to derive the distribution b(p, n, r) when


r ≥ p, where q = p. Because r ≥ p, also S > 0 w. pr. 1, so by (4.11) the
joint pdf of (S, T ) is
r−p−1 n−p−1
e− 2 tr(S+T ) ,
1
cp,r cp,n · |S| 2 |T | 2 S > 0, T > 0.

Make the transformation

(S, T ) → (S, V ≡ S + T ).

By the extended combination rule, the Jacobian is 1, so the joint pdf of


(S, V ) is
r−p−1 n−p−1
e− 2 trV ,
1
cp,r cp,n · |S| 2 |V − S| 2 V > S > 0.

By E.3, there exists a unique [verify] nonsingular p × p matrix E ≡ {eij }


with e1j > 0, j = 1, . . . , p, such that

S = E Db E  ,
(7.4)
V = E E,

where Db := Diag(b1 , . . . , bp ). Thus the joint pdf of (b, E) is given by


 ) 
r−p−1 n−p−1
 n+r−2p−2
− 12 tr EE   ∂(S, V
f (b, E) = cp,r cp,n ·|Db | 2 |Ip −Db | 2 |EE | 2 e  ,
∂(b, E)

where the range Rb,E is the Cartesian product Rb × RE with

Rb := {b | 1 > b1 > · · · > bp > 0},


RE := {E | e1j > 0, ∞ < eij < ∞, i = 2, . . . , p, j = 1, . . . , p}.
 
 λIp A 
18
Because |λIp − AB| =   = |λIp | · | λ1 (λIr − BA)|.
B Ir

96
STAT 542 Notes, Winter 2007; MDP

We will show that


 ∂(S, V ) 
  p+2
(7.5)   = 2p · |EE  | 2 · (bi − bj ),
∂(b, E) i<j

hence


p
r−p−1
p
n−p−1
f (b, E) =2 cp,r cp,n ·
p
bi 2
(1 − bi ) 2 (bi − bj )
i=1 i=1 1≤i<j≤p
n+r−p 
· |EE  | e− 2 tr EE .
1
2

Because Rb,E = Rb × RE , this implies that b and E are independent with


marginal pdfs given by


p
r−p−1 n−p−1
(7.6) f (b) = cb · bi 2
(1 − bi ) 2 · (bi − bj ), b ∈ Rb (p),
i=1 1≤i<j≤p
n+r−p 
f (E) = cE · |EE  | e− 2 tr EE ,
1
(7.7) 2 E ∈ RE (p),

where
cb cE = 2p cp,r cp,n .
Thus, to determine cb it suffices to determine cE . This is accomplished as
follows:

n+r−p 
−1
|EE  | 2 e− 2 tr EE dE
1
cE =
RE

n+r−p 
−p
|EE  | 2 e− 2 tr EE dE
1
=2 [by symmetry]
Rp2

p
−p p2
 n+r−p 1
√ e− 2 eij deij
1 2
=2 (2π) 2 |EE | 2

Rp2 i,j=1

p2
n+r−p

−p
=2 (2π) 2 · E |Wp (p, Ip )| 2 [why?]

−p p2 cp,p
=2 (2π) 2 · . [by (4.12)]
cp,n+r

97
STAT 542 Notes, Winter 2007; MDP

Therefore (recall (4.10))


cp,p cp,n cp,r
p2
(7.8) cb ≡ cb (p, n, r) = (2π) 2 ·
cp,n+r
p n+r

π 2 Γp
≡ n
r
2 p
.
Γp 2 Γp 2 Γp 2

This completes the derivation of the pdf f (b) in (7.6), hence determines the
distribution b(p, n, r) when r ≥ p. Note that this can be viewed as another
generalization of the Beta distribution.

Verification of the Jacobian (7.5):


By the linearization method (*) in §4.3,
   
 ∂(S, V )   ∂(dS, dV ) 
(7.9)    
 ∂(b, E)  =  ∂(db, dE) .

From (7.4),
dS = (dE)Db E  + E Ddb E  + EDb (dE) ,
dV = (dE)E  + E(dE) ,
hence, defining

dF = E −1 (dE),
dG = E −1 (dS)(E −1 ) ,
dH = E −1 (dV )(E −1 ) ,
we have
(7.10) dG = (dF )Db + Ddb + Db (dF ) ,
(7.11) dH = (dF ) + (dF ) .
 
 ∂(dS,dV ) 
To evaluate  ∂(db,dE) , apply the chain rule to the sequence

(db, dE) → (db, dF ) → (dG, dH) → (dS, dV )

to obtain
 ∂(dS, dV )   ∂(db, dF )   ∂(dG, dH)   ∂(dS, dV ) 
       
 = · · .
∂(db, dE) ∂(db, dE) ∂(db, dF ) ∂(dG, dH)

98
STAT 542 Notes, Winter 2007; MDP

By 4.2(d), 4.2(e), and the combination rule in §4.1,


 ∂(db, dF )   ∂(dF ) 
   
(7.12)  =  = |E|−p ,
∂(db, dE) ∂(dE)
 ∂(dS, dV )   ∂(dS)   ∂(dV ) 
     
(7.13)  = ·  = |E|p+1 |E|p+1 = |E|2(p+1) .
∂(dG, dH) ∂(dG) ∂(dH)
 
 ∂(dG,dH) 
Lastly we evaluate  ∂(db,dF )  =: J. Here dG ≡ {dgij } and dH ≡
{dhij } are p × p symmetric matrices, dF ≡ {dfij } is a p × p unconstrained
matrix, and db is a vector of dimension p. From (7.10) and (7.11),

dgii = 2(dfii )bi + dbi , i = 1, . . . , p,


dhii = 2dfii , i = 1, . . . , p,
dgij = (dfij )bj + bi (dfji ), 1 ≤ i < j ≤ p,
dhij = dfij + dfji , 1 ≤ i < j ≤ p.

Therefore 

 ∂ (dgii ), (dhii ), (dgij ), (dhij ) 
J = 

∂ (dbi ), (dfii ), (dfij ), (dfji )

 
 Ip 0 0 0 
 
 2D 2Ip 0 0 
= b ,
 0 0 D1 Ip(p−1)/2 
 
0 0 D2 Ip(p−1)/2
where

D1 := Diag(b2 , . . . , bp , b3 , . . . , bp , . . . , bp−1 , bp , bp )
D2 := Diag(b1 , . . . , b1 , b2 , . . . , b2 , . . . , bp−2 , bp−2 , bp−1 ),

hence [verify!]

(7.14) J = 2p |D1 − D2 | = 2p (bi − bj ).
1≤i<j≤p

The desired Jacobian (7.5) follows from (7.12), (7.13), and (7.14).
¯

99
STAT 542 Notes, Winter 2007; MDP

Exercise 7.2. Use (7.2) to show that if r ≥ p, the pdf of (f1 , · · · , fp ) is


given by


p
r−p−1 n+r
(7.15) cb (p, r, n) fi 2
(1 + fi )− 2 (fi − fj ),
i=1 1≤i<j≤p

where cb (p, n, r) is given by (7.8). If r < p then the pdf of (f1 , · · · , fr )


follows from f (p, n, r) = f (r, n + r − p, p) in (7.3). ¯

Exercise 7.3. Under the weaker assumption that n + r ≥ p, show that the
distribution of b ≡ {bi (S, T )} does not depend on Σ and that b and V are
independent. (Note that f ≡ {fi (S, T )} is not defined unless n ≥ p.)
Hint: Apply the GL-invariance of {bi (S, T )} and Basu’s Lemma. If n ≥ p
and r ≥ p the result also follows from Exercise 4.2. ¯

7.2. Eigenvalues and eigenvectors of one Wishart matrix.


In the invariant testing problems of Examples 6.18 (testing Σ = Ip ) and
Exercise 6.20 (testing Σ = κIp ), the maximal invariant statistic (MIS) can
be represented in terms of the set of ordered eigenvalues

{l1 ≥ · · · ≥ lp } ≡ {li (S)}

of a single Wishart matrix S ∼ Wp (r, Σ) (r ≥ p, Σ > 0). Again the LRT


statistic is invariant so is necessarily a function of these eigenvalues.
As in §7.1, when the dimensionality of the invariance-reduced alterna-
tive hypothesis is ≥ 2 (e.g. (6.11)), no single invariant test is UMPI – other
reasonable invariant test statistics include
p
2
1 − I{a<lp <l1 <b} (Roy) and 1
l
n i − 1 (Nagao)
i=1

in Example 6.18 and

l1 p p 1
(Roy) and li ·
lp i=1 i=1 li

in Example 6.20. To determine the distribution of such invariant test statis-


tics we need to find the distribution of (l1 , . . . , lp ) when Σ = Ip .

100
STAT 542 Notes, Winter 2007; MDP

Exercise 7.4. Eigenvalues of S ∼ Wp (r, Ip ).


Assume that r ≥ p. Show that the pdf of l ≡ (l1 , · · · , lp ) is
p
π2
p
r−p−1
− 12 li
(7.16) f (l) = pr p
r
li 2
e (li − lj )
2 2 Γp 2 Γp 2 i=1 1≤i<j≤p

on the range
Rl := {l | ∞ > l1 > · · · > lp > 0}.
Outline of solution. Use the limit representation

(7.17) li (S) = limn→∞ fi S, n1 T = limn→∞ nfi (S, T ) .

Let li = nfi , i = 1, . . . , p and derive the pdf of l1 , . . . , lp from the pdf of


(f1 , . . . , fp ) in (7.15). Now let n → ∞ and apply Stirling’s approximation
for the Gamma function.
¯

Alternate derivation of (7.16). Begin with the spectral decomposition

S = Γ Dl Γ ,
(7.18)
Ip = Γ Γ ,

where Dl = Diag(l1 , . . . , lp ). The joint pdf of (l, Γ) is given by


 ∂S 
r−p−1
− 12 tr S  
f (l, Γ) = cp,r · |S| 2 e · 
∂(l, Γ)

p  ∂(dS) 
r−p−1
 
e− 2 li
1
(7.19) = cp,r · li 2
· .
i=1
∂(dl, dΓ)

From (7.18),
dS = (dΓ) Dl Γ + Γ Ddl Γ + Γ Dl (dΓ) ,
0 = (dΓ) Γ + Γ (dΓ) ,
hence, defining dF = Γ−1 (dΓ),

(7.20) dG : = Γ−1 (dS)(Γ−1 ) = (dF )Dl + Ddl + Dl (dF ) ,


(7.21) 0 = E −1 (0)(E −1 ) = (dF ) + (dF ) .

101
STAT 542 Notes, Winter 2007; MDP

Thus dG ≡ {dgij } is symmetric, dF ≡ {dfij } is skew-symmetric, and

(7.22) dG = (dF )Dl + Ddl − Dl (dF ).


 
 ∂(dS) 
To evaluate  ∂(dl,dΓ) , apply the chain rule to the sequence

(dl, dΓ) → (dl, dF ) → dG → dS

to obtain
 ∂(dS)   ∂(dl, dF )   ∂(dG)   ∂(dS) 
       
 = · ·  [verify].
∂(dl, dΓ) ∂(dl, dΓ) ∂(dl, dF ) ∂(dG)
        
=1 ≡J =1

From (7.22),
dgii = dli , i = 1, . . . , p,
dgij = (dfij )(lj − li ), 1 ≤ i < j ≤ p,
(note that dfii = 0 by skew-symmetry), so

  

 ∂ (dgii ), (dgij )   Ip ∗ 
J = 
 =  = |D| = (li − lj ),
∂ (dli ), (dfij ) 0 D
1≤i<j≤p

where
D = Diag(l2 − l1 , . . . , lp − l1 , . . . , lp − lp−1 ).
Therefore from (7.19),


p
r−p−1
− 12 li
f (l, Γ) = cp,r · li 2
e (li − lj ),
i=1 1≤i<j≤p
so 
f (l) = f (l, Γ) dΓ
Op
'  (

p
r−p−1
e− 2 li
1
(7.23) cp,r dΓ · li 2
(li − lj ).
Op i=1 1≤i<j≤p

102
STAT 542 Notes, Winter 2007; MDP

The evaluation of this integral requires the theory of differential forms on


smooth manifolds.19 However, we have already obtained f (l) in (7.16), so
we can equate the constants in (7.16) and (7.23) to obtain
 p
π2
cp,r dΓ = pr p
r
,
Op 2 2 Γp 2 Γp 2
so from (4.10),
 p(p+1)
π 4
(7.24) dΓ =
.
¯
Op Γp n2

It follows from ((7.19) that l ⊥


⊥ Γ and that Γ is uniformly distributed
over Op w. r. to the measure dΓ – however, we have not defined this
measure explicitly. This is accomplished by the following proposition.

Proposition 7.5. Let S = ΓS Dl(S) ΓS be the spectral decomposition of


the Wishart matrix S ∼ Wp (r, I). Then the eigenvectors and eigenvalues
of S are independent, i.e., ΓS ⊥
⊥ l(S), and

ΓS ∼ Haar(Op ),

the unique orthogonally invariant probability distribution on Op .


Proof. It suffices to show that for any measurable sets A ∈ Op and B ∈ Rp ,

(7.25) Pr[ Ψ ΓS ∈ A | l(S) ∈ B] = Pr[ ΓS ∈ A | l(S) ∈ B] ∀ Ψ ∈ Op .

This will imply that the conditional distribution of ΓS is (left) orthogonally


invariant, hence, by the uniqueness of Haar measure,

ΓS  l(S) ∈ B ∼ Haar(Op ) ∀ B ∈ Rp .

This implies that ΓS ⊥


⊥ l(S) and ΓS ∼ Haar(Op ) unconditionally, as as-
serted.
19
This approach is followed in the books by R. J. Muirhead, Aspects of Multivariate
Statistical Theory (1982) and R. H. Farrell, Multivariate Calculation (1985).

103
STAT 542 Notes, Winter 2007; MDP

To establish (7.25), consider

S̃ = Ψ S Ψ ∼ Wp (r, I).
Then
S̃ = (ΨΓS )Dl(S) (ΨΓS ) ,
so
ΓS̃ = Ψ ΓS and l(S̃) = l(S).

Therefore

Pr[ Ψ ΓS ∈ A | l(S) ∈ B] = Pr[ ΓS̃ ∈ A | l(S̃) ∈ B]


= Pr[ ΓS ∈ A | l(S) ∈ B],

since S̃ ∼ S, so (7.25) holds.


¯.

7.3. Stein’s integral representation of the density of a maximal


invariant statistic.
Proposition 7.6. Suppose that the distribution of X is given by a pdf f (x)
w. r. to a measure µ on the sample space X . Assume that µ is invariant
under the action of a compact topological group G acting on X and that µ
is G-invariant, i.e., µ(gB) = µ(B) for all events B ⊆ X and all g ∈ G. If

t:X →T
x → t(x)

is a maximal invariant statistic then the pdf of t w.r. to the induced measure
µ̃ = µ(t−1 ) on T is given by

(7.26) f¯(x) = f (gx) dν(g),
G

where ν is the Haar probability measure on G.


Proof: First we show that f¯(x) is actually a function of the MIS t. The
integral is simply the average of f (·) over all members gx in the G-orbit of
x. By the G-invariance of µ, f¯(·) is also G-invariant:
 
f¯(g1 x) = f (gg1 x) dµ(g) = f (gx) dν(g) = f¯(x) ∀ g1 ∈ G,
G G

104
STAT 542 Notes, Winter 2007; MDP

hence f¯(x) = h(t(x)) for some function h(t).


Next, for any event A ⊆ T and any g ∈ G,

P [ t(X) ∈ A ] = IA (t(x)) f (x) dµ(x)
X
= IA (t(g −1 x)) f (x) dµ(x)
X
= IA (t(y)) f (gx) dµ(y),
X

by the G-invariance of t and µ, so


 
P [ t(X) ∈ A ] = IA (t(y)) f (gy) dµ(y) dν(g)
G X
 
= IA (t(y)) f (gy) dν(g) dµ(y)
X G

= IA (t(y)) h(t(y)) dµ(y)


X

= IA (t) h(t) dµ̃(t).
T

Thus h(t) ≡ f¯(x) is the pdf of t ≡ t(X) w. r. to dµ̃(t).


¯

Example 7.7. Let X = Rp , G = Op , and µ = Lebesgue measure on Rp ,


an Op -invariant measure. Here γ ∈ Op acts on Rp via x → γx. A maximal
invariant statistic is t(x) = x2 . If X has pdf f (x) w. r. to µ then the
integral representation (7.26) states that t(X) ≡ X2 has pdf

(7.27) h(t) = f (γx) dνp (γ)
Op

w. r. to dµ̃(t) on (0, ∞), where νp is the Haar probability measure on Op .


In particular, if f (x) is also Op -invariant, i.e., if

f (x) = k(x2 )

for some k(·) on (0, ∞), then the pdf of t(X) w.r.to dµ̃(t) is simply

(7.28) h(t) = k(t), t ∈ (0, ∞).

105
STAT 542 Notes, Winter 2007; MDP

The induced measure dµ̃(t) can be found by considering a special case:


If X ∼ Np (0, Ip ) then t ≡ X2 ∼ χ2p . Here

e− 2 x ≡ k(x2 )
1 2
1
f (x) = p w.r. to dµ(x),
(2π) 2
so t has pdf
e− 2 t
1 1
h(t) = k(t) = p w.r. to dµ̃(t).
(2π) 2

We also know, however, that t has the χ2p pdf


p
t 2 −1 e− 2 t
1
w(t) ≡ p
1
w.r. to dt (≡ Lebesgue measure).
22 Γ( p
2)

Therefore dµ̃(t) is determined as follows:

w(t) p
p
(7.29) dµ̃(t) = dt = Γ p t 2 −1 dt.
π2
k(t) (2)

Application: We can use Stein’s representation (7.27) to give an alternative


derivation of the noncentral chi-square pdf in (2.25) – (2.27). Suppose that
X ∼ Np (ξ, Ip ) with ξ = 0, so

t ≡ X2 ∼ χ2p (δ) with δ = ξ2 .

Here
e− 2 x−ξ ,
1 2
1
f (x) = p
(2π) 2

so by (7.27), t has pdf w. r. to dµ̃(t) given by



e− 2 x−γξ dνp (γ)
1 2
1
h(t) = p
(2π) 2 Op


− 12 ξ2 − 12 x2
= 1
p e e ex γξ dνp (γ)
(2π) 2 Op
 1 1
(1)
e− 2 e− 2
δ t
= 1
p et 2 δ 2 γ11 dνp (γ) [ verify! ]
(2π) 2 Op

 k 
− δ2 − 2t (tδ) 2
= 1
p e e k
γ11 dνp (γ) [ γ = {γij } ]
(2π) 2 k! Op
k=0

106
STAT 542 Notes, Winter 2007; MDP


 
(2) 1 − δ2 − 2t (tδ)k 2k
= p e e γ11 dνp (γ) [ verify! ]
(2π) 2 (2k)! Op
k=0
∞
(3) (tδ)k %
&k
e− 2 e− 2
δ t
= 1
p E Beta 12 , p−1
2 [ verify! ]
(2π) 2 (2k)!
k=0

p

 (tδ)k Γ 12 + k Γ 2
e− 2 e− 2
δ t
= 1
p

.
(2π) 2 (2k)! Γ p2 + k Γ 12
k=0

(1) This follows from the left and right invariance of the Haar measure νp .
(2) By the invariance of νp the distribution of γ11 is even, i.e., γ11 ∼ −γ11 ,
so its odd moments vanish.
(3) By left invariance, the first column of γ is uniformly

distributed on the
unit sphere in R , hence γ11 ∼ Beta 2 , 2
p 2 1 p−1
[verify!].
Thus from (7.29) and Legendre’s duplication formula, t has pdf w. r. to dt
given by
∞ 1

dµ̃(t)  t
p
2 −1 (tδ)k Γ + k
= 1p e− 2 e− 2
δ t
h(t) p 2
1

dt 22 (2k)! Γ 2 + k Γ 2
k=0
 

( ) δ k p+2k
−1 − t

 t e 2 
2
= e− 2
δ
(7.30) 2 ,
k! 2
p+2k
2 Γ p+2k
k=0
    
2

Poisson( δ2 ) weights pdf of χ2p+2k

as also found in (2.27).


¯

Example 7.9. Extend Example 7.8 as follows. Let

X = Rp×r , G = Op × Or , µ = Lebesgue measure on Rp×r ,

so µ is (Op × Or )-invariant. Here (γ, ψ) ∈ Op × Or acts on Rp×r via

x → γxψ  .

Assume first that r ≥ p. A maximal invariant statistic is [verify!]

(7.31) t(x) = ( l1 (xx ) ≥ · · · ≥ lp (xx ) ) ≡ l(xx ),

107
STAT 542 Notes, Winter 2007; MDP

the ordered nonzero eigenvalues of xx [verify]. If X has pdf f (x) w. r. to


µ then Stein’s integral representation (7.26) states that l ≡ l(XX  ) has pdf
 
(7.32) h(l) = f (γxψ  ) dνp (γ) dνr (ψ)
Op Or

w. r. to dµ̃(l) on Rl . In particular, if f (x) is also (Op × Or )-invariant, i.e.,


if
f (x) = k( l(xx ) )
for some k(·) on Rl , then the pdf of l(XX  ) w. r. to dµ̃(l) is simply

(7.33) h(l) = k(l), l ∈ Rl .

The induced measure dµ̃(l) can be found by considering a special case:

X ∼ Np×r (0, Ip ⊗ Ir ) =⇒ XX  ∼ Wp (r, Ip ).


Here

e− 2 tr xx
1 1
f (x) = pr
(2π) 2
q
− 12 li
= 1
pr e i=1 ≡ k(l) w.r. to dµ(x) on Rp×r ,
(2π) 2
so l has pdf
q
− 12 li
h(l) = k(l) = 1
pr e i=1 w.r. to dµ̃(l) on Rl .
(2π) 2

We also know from (7.16) that l has the pdf


p
π2
p
r−p−1
− 12 li
w(l) ≡ pr p
r
li 2
e (li − lj )
2 2 Γp 2 Γp 2 i=1 1≤i<j≤p

w.r. to dl (≡ Lebesgue measure) on Rl . Therefore dµ̃(l) is determined as


follows:

w(l)
p(r+1)
π 2
p
r−p−1
(7.34) dµ̃(l) = dl = p
r
li 2 (li − lj ) dl.
k(l) Γp 2 Γp 2 i=1 1≤i<j≤p

Finally, the case r < p follows from (7.33) by interchanging p and r, since
XX and X  X have the same nonzero eigenvalues.

108
STAT 542 Notes, Winter 2007; MDP

Application: Stein’s representation (7.32) provides an integral representa-


tion for the pdf of the eigenvalues of a noncentral Wishart matrix. If

X ∼ Np×r (ξ, Ip ⊗ Ir )

with ξ = 0, the distribution of XX  ≡ S depends on ξ only through ξξ  [ver-


ify], hence is designated the noncentral Wishart distribution Wp (r, Ip ; ξξ  ).
Assume first that r ≥ p. The distribution of the ordered eigenvalues

l ≡ l(XX  ) ≡ ( l1 (XX  ) ≥ · · · ≥ lp (XX  ) )

of S depends on ξξ  only through the ordered eigenvalues

λ ≡ λ(ξξ  ) ≡ ( l1 (ξξ  ) ≥ · · · ≥ lp (ξξ  ) )

of ξξ  , hence is designated by l(p, r; λ). Here



e− 2 tr (x−ξ)(x−ξ) ,
1 1
f (x) = pr
(2π) 2

so by (7.32), l has pdf w. r. to dµ̃(l) given by

h(l)
 
  
e− 2 tr (γxψ −ξ)(γxψ −ξ) dνp (γ) dνr (ψ)
1
1
= pr
(2π) 2 Op Or
 
− 12 tr ξξ  − 12 tr xx  
= 1
pr e e etr γxψ ξ dνp (γ) dνr (ψ)
(2π) 2 Op Or
    1 1
(1) 1 − 12 λi − 2 1
li tr γDl2 ψ̃  Dλ2
= pr e e e dνp (γ) dνr (ψ)
(2π) 2 Op Or
    p p 12 12
1 − 12 λi − 12 li l λ γ ψ
= pr e e e i=1 j=1 i j ji jidνp (γ) dνr (ψ)
(2π) 2 Op Or
 ' (2k 
   

p 
∞ 
p 1
(2)
= 1
e− 12 λi − 2
e
1
li  1 k dνp (γ) dνr (ψ).
pr
(2π) 2 (2k)! li λj2 γji ψji
Op Or i=1 k=0 j=1

109
STAT 542 Notes, Winter 2007; MDP

(1) Here Dλ = diag(λ1 , . . . , λp ), Dl = diag(l1 , . . . , lp ), and γ̃ is the leading


p × p submatrix of ψ. The equality follows from the left and right in-
variance of the Haar measures νp and νr and from the singular value
decompositions of ξ and x. The representation (1) is due to A. James
Ann. Math. Statist. (1961, 1964). Note that the double integral in (1)
is a convex and symmetric (≡ permutation-invariant) function of
1 1
l12 , . . . , lp2 on the unordered positive orthant Rp+ [explain].
(2) By the invariance of νp the distribution of γi ≡ (γ1i , . . . , γpi ) , the ith
column of γ, is even, i.e., γi ∼ −γi . Apply this for i = 1, . . . , p, using
the following expansion at each step:

 x2k
1 x −x
(e + e ) = .
2 (2k)!
k=0

Thus from (7.34), l has pdf fλ (l) w. r. to dl given by


dµ̃(l)
fλ (l) = h(l)
dl 
π 2 e− 2 λi
p 1
p
r−p−1
li 2 e− 2 li
1
= pr p
r
(li − lj )
2 2 Γp 2 Γp 2 i=1 1≤i<j≤p
 ' (2k 
 

p 
∞ 
p 1
(7.35) ·  1
l k
λ j γji ψji
2 dνp (γ) dνr (ψ)
(2k)! i
Op Or i=1 k=0 j=1

on the range Rl . The case r < p now follows by interchanging p and r in


(7.35), since XX and X  X have the same nonzero eigenvalues.
¯

Remark 7.10. The integrand in (7.35) is a multiple power series in {li },


and similarly in {λj } – this can be expanded and integrated term-by-term,
leading to an extension of the Poisson mixture representation (7.30) for
the noncentral chi-square pdf. However, important information already
can be obtained from the integral representation (7.35). By comparing the
noncentral pdf fλ (l) in (7.35) to the central pdf f (l) ≡ f0 (l) in (7.16) the
likelihood ratio
fλ (l)   
= e− 2 λi
1
(7.36) [· · ·],
f0 (l) Op Or

110
STAT 542 Notes, Winter 2007; MDP

the double integral in (7.35). From this representation it is immediate


1
that f (l) ≡ f0 (l) is strictly increasing in each li , hence in each li , and 2

as already noted in (1) its extension to the positive orthant Rp+ is convex
1
and symmetric in the {li2 }. Thus the symmetric extension to Rp+ of the
acceptance region A ⊆ Rl of any proper Bayes test for testing λ = 0 vs.
1
λ > 0 based on l must be convex and decreasing in {li2 } [explain and verify!].
Wald’s fundamental theorem of decision theory states that the closure
in the weak* topology of the set of all proper Bayes acceptance regions
determines an essentially complete class of tests. Because convexity and
monotonicity are preserved under weak* limits, this implies that the sym-
metric extension to Rp+ of any admissible acceptance region A ⊆ Rl must
1
be convex and decreasing in {li }. This shows, for example, that the test
2

which rejects λ = 0 for large values of the minimum eigenvalue lp (S) is


inadmissible among invariant tests [verify!], hence among all tests.
Furthermore, Perlman and Olkin (Ann. Statist. (1980) pp.1326-41)
used the monotonicity of the likelihood ratio (7.36) and the FKG inequality
to establish the unbiasedness of all monotone invariant tests, i.e., all tests
with acceptance regions of the form {g(l1 , . . . , gp ) ≤ c} with g nondecreasing
in each li .
¯

Exercise 7.11. Eigenvalues of S ∼ Wp (r, Σ) when Σ = Ip .


(a) Assume that r ≥ p and Σ > 0. Show that the pdf of l ≡ (l1 , · · · , lp ) is

π2
p
r−p−1
p
fλ (l) = pr p

p r li 2 (li − lj )
r
2 2 Γp 2 Γp 2 ( i=1 λi ) i=1 2
1≤i<j≤p

−1 
e− 2 tr Dλ γ Dl γ dνp (γ),
1
(7.37) · l ∈ Rl ,
Op

where l ≡ (l1 , · · · , lp ) and λ ≡ (λ1 , . . . , λp ) are the ordered eigenvalues of S


and Σ, respectively. (Compare to Exercise 7.4).
(b) Consider the problem of testing Σ = Ip vs. Σ ≥ Ip . Show that a
necessary condition for the admissibility of an invariant test is that the
symmetric extension to Rp+ of its acceptance region A ⊆ Rl must be convex
and decreasing in {li }. (Thus the test based on lp (S) is inadmissible.) )
¯

111
STAT 542 Notes, Winter 2007; MDP

Remark 7.12. Stein’s integral formula (7.26) for the pdf of a maximal
invariant statistic under the action of a compact topological group G can
be partially extended to the case where G is locally compact. Important
examples include the general linear group GL and the triangular groups GT
and GU . In this case, however, the integral representation does not provide
the normalizing constant for the pdf of the MIS, but still provides a useful
expression for the likelihood ratio, e.g. (7.36). References include:
S. A. Andersson (1982). Distributions of maximal invariants using quotient
measures. Ann. Statist. 10 955-961.
M. L. Eaton (1989). Group Invariance Applications in Statistics. Regional
Conference Series in Probability and Statistics Vol. 1, Institute of Mathe-
matical Statistics.
R. A. Wijsman (1990). Invariant Measures on Groups and their Use in
Statistics. Lecture Notes – Monograph Series Vol. 14, Institute of Mathe-
matical Statistics.

112
STAT 542 Notes, Winter 2007; MDP

8. The MANOVA Model and Testing Problem. (LehmannTSH Ch.8.)

8.1. Characterization of a MANOVA subspace.


In Section 3.3 the multivariate linear model was defined as follows:
Y1 , . . . , Ym (note m not n) are independent p × 1 vector observations hav-
ing common unknown pd covariance matrix Σ. Let Yj ≡ (Y1j , . . . , Ypj ) ,
j = 1, . . . , m. We assume that each of the p variates satisfies the same
univariate linear model, that is,

(8.1) E(Yi1 , . . . , Yim ) = βi X, i = 1, . . . , p,

where X : l × m is the design matrix, rank(X) = l ≤ m, and βi : 1 × l


is a vector of unknown regression coefficients. Equivalently, (8.1) can be
expressed geometrically as

(8.2) E(Yi1 , . . . , Yim ) ∈ L(X) ≡ row space of X ⊆ Rm , i = 1, . . . , p.

In matrix form, (8.1) and 8.2) can be written as

(8.3) E(Y ) ∈ {βX | β ∈ M(p, l)} =: Lp (X),

where M(a, b) denotes the vector space of all real a × b matrices,

Y ≡ (Y1 , . . . , Ym ) ∈ M(p, m),


 
β1
 
β ≡  ...  .
βp

Note that Lp (X) is a linear subspace of M(p, m) with

dim(Lp (X)) = p · dim(L(X)) = p l,

a multiple of p. Then (8.2) can be expressed equivalently as20


   
 v 1  
p  ..  
(8.4) E(Y ) ∈ ⊕i=1 L(X) ≡  v1 , . . . , vp ∈ L(X) .
 .  
vp
20
(8.3) and (8.4) can also be written as E(Y ) ∈ Rp ⊗ L(X).

113
STAT 542 Notes, Winter 2007; MDP

The forms (8.1) – (8.4) are all extrinsic, in that they require spec-
ification of the design matrix X, which in turn is specified only after a
choice of coordinate system. We seek to express these equivalent forms in
an intrinsic algebraic form that will allow us to determine when a specified
linear subspace L ⊆ M(p, m) can be written as Lp (X) for some X. This
is accomplished by means of an invariant ≡ coordinate-free definition of a
MANOVA subspace.

Definition 8.1. A linear subspace L ⊆ M(p, m) is called a MANOVA


subspace if

(8.5) M(p, p) L ⊆ L.

Because M(p, p) is in fact a matrix algebra (i.e., closed under matrix mul-
tiplication as well as matrix addition) that contains the identity matrix Ip ,
(8.5) is equivalent to the condition M(p, p) L = L.
¯

Proposition 8.2. Suppose that L is a linear subspace of M(p, m). The


following are equivalent:
(a) L is a MANOVA subspace.
(b) L = Lp (X) for some X ∈ M(l, m) of rank l ≤ m (so dim(L) = p l).
(c) There exists an orthogonal matrix Γ : m × m such that

(8.6) LΓ = {(µ, 0p×(m−l) )  µ ∈ M(p, l)} (l ≤ m).

((8.6) is the canonical form of a MANOVA subspace.)


(d) There exists a unique m × m projection matrix P such that

(8.7) L = {x ∈ M(p, m) | x = xP }.

Note: if L = Lp (X) then P = X  (XX  )−1 X and l = tr(P ) [verify]. Also, Γ


is obtained from the spectral decomposition P = Γ diag(Il , 0m−l ) Γ .
Proof. The equivalence of (b), (c), and (d) is proved exactly as for uni-
variate linear models [reference?] It is straightforward to show that (b) ⇒
(a). We now show that (a) ⇒ (d).

114
STAT 542 Notes, Winter 2007; MDP

Let i ≡ (0, . . . , 0, 1, 0, . . . , 0) denote the i-th coordinate vector in Rp ≡


M(1, p) and define Li := i L ⊆ M(1, n), i = 1, . . . , p. Then for every pair
i, j, it follows from (a) that

Lj = j L = i Πij L ⊆ i L = Li ,

where Πij ∈ M(p, p) is the i, j-permutation matrix, so

L1 = · · · = Lp =: L̃ ⊆ M(1, m).

Let P : m × m be the unique projection matrix onto L̃. Then for x ∈ L,


 p
xP = Ip xP = xPi i
i=1
 p  p  p

= i i xP = i i x = i i x = x,
i=1 i=1 i=1

where the third equality holds since i x ∈ Li ≡ L̃, hence

L ⊆ {x ∈ M(p, m) | x = xP }.

Conversely, for x ∈ M(p, m),

xP = x =⇒ i xP = i x, i = 1, . . . , p,
=⇒ i x ∈ L̃ ≡ Li
=⇒ i x = i xi for some xi ∈ L
 p p

=⇒ x ≡ i i x = (i i )xi ∈ L,
i=1 i=1

where the final membership follows from (a) and the assumption that L is
a linear subspace. Thus

L ⊇ {x ∈ M(p, m) | x = xP },

which completes the proof.


¯

115
STAT 542 Notes, Winter 2007; MDP

Remark 8.3. In the statistical literature, multivariate linear models often


occur in the form

(8.8) Lp (X, C) := {βX | β ∈ M(p, l), βC = 0},

where C : l × s (with rank(C) = s ≤ l) determines s linear constraints on


β. To see that Lp (X, C) is in fact a MANOVA subspace and thus can be
re-expressed in the form Lp (X0 ) for some design matrix X0 , by Proposition
8.2 it suffices to verify that

M(p, p) Lp (X, C) ⊆ Lp (X, C),

which is immediately evident.


¯

8.2. Reduction of a MANOVA testing problem to canonical form.


A normal MANOVA model is simply a normal multivariate linear model
(3.14), i.e., one observes

(8.9) Y ≡ (Y1 , . . . , Ym ) ∼ Np×m (η, Σ ⊗ Im ) with η ∈ L ⊆ Rp×m ,

where L is a MANOVA subspace of Rp×m and Σ > 0 is unknown.


The MANOVA testing problem is that of testing

(8.10) η ∈ L0 vs. η∈L based on Y,

for two MANOVA subspaces L0 ⊂ L ⊂ Rp×m with

dim(L0 ) ≡ p l0 < p l ≡ dim(L).


¯

Proposition 8.4. (extension of Proposition 8.2c). Let r = l−l0 , n = m−l.


There exists an m × m orthogonal matrix Γ∗ such that

L Γ∗ = {(ξ, µ, 0p×n ) | ξ ∈ M(p, l0 ), µ ∈ M(p, r) },


(8.11)
L0 Γ∗ = {(ξ, 0p×r , 0p×n ) | ξ ∈ M(p, l0 ) }.

Proof. Again this is proved exactly as for univariate linear subspaces:


From (8.6), choose Γ : n × n orthogonal such that

LΓ = {(ξ, µ, 0p×n ) | ξ ∈ M(p, l0 ), µ ∈ M(p, r)}.

116
STAT 542 Notes, Winter 2007; MDP
 
Il
By (8.5), L0 Γ is a MANOVA subspace of Rpl , so we can find
0n×l
Γ0 : l × l orthogonal so that
 
Il
L0 Γ Γ0 = {(ξ, 0p×r ) | ξ ∈ M(p, l0 )}.
0n×l
 
∗ Γ0 0l×n
Now take Γ = Γ and verify that (8.11) holds.
¯
0n×l In

From (8.11) the MANOVA testing problem (8.10) is transformed to


that of testing

µ = 0 vs. µ = 0 with ξ ∈ M(p, l0 ) and Σ unknown


(8.12)

based on Y ∗ := Γ∗ Y ≡ (U, Y, Z) ∼ Np×m (ξ, µ, 0p×n ), Σ ⊗ Im .

This testing problem is invariant under G∗ := M(p, l0 ) acting as a trans-


lation group on U (and ξ):

(U, Y, Z) → (U + b, Y, Z),
(8.13)
(ξ, µ, Σ) →
 (ξ + b, µ, Σ).

Since M(p, l0 ) acts transitively on itself, the MIS and MIP are (Y, Z) and
(µ, Σ), resp., and the invariance-reduced problem becomes that of testing

µ=0 vs. µ = 0 with Σ unknown


(8.14)

based on (Y, Z) ∼ Np×(r+n) (µ, 0p×n ), Σ ⊗ Ir+n .

For this problem, (Y, W ) := (Y, ZZ  ) is a sufficient statistic [verify], so


(8.14) is reduced by sufficiency to the canonical MANOVA testing problem
(6.24). As in Example 6.36, (6.24) is now reduced by invariance under (6.25)
to the testing problem (6.26) based on the nonzero eigenvalues of Y  W −1 Y .
(The condition n ≥ p, needed for the existence of the MLE Σ̂ in (6.24)
and (8.14), is equivalent to m ≥ l + p in (8.9) and (8.10).)
¯

117
STAT 542 Notes, Winter 2007; MDP

Remark 8.5. By Proposition 8.2b and Remark 8.3, Lp (X) and Lp (X, C)
are MANOVA subspaces of Rp×m such that Lp (X, C) ⊂ Lp (X). Thus the
general MANOVA testing problem (8.10) is often stated as that of testing

(8.15) η ∈ Lp (X, C) vs. η ∈ Lp (X).

[Add Examples]
¯

Exercise 8.6. Derive the LRT for (8.15).


Hint: The LRT already has been derived for the canonical MANOVA testing
problem in Exercise 6.37a. Now express the LRT statistic in terms of the
observation matrix Y, the design matrix X, and the constraint matrix C.
¯

8.3. Related topics.


8.3.1. Seemingly unrelated regressions (SUR).
If the p variates follow different univariate linear models, i.e., if (8.1) is
extended to

(8.16) E(Yi1 , . . . , Yim ) = βi Xi ∈ L(Xi ), i = 1, . . . , p,

where X1 : l1 × m, . . ., Xp : lp × m are design matrices with different row


spaces, the model (8.16) is called a seemingly unrelated regression (SUR)
model. The p univariate models are only “seemingly” unrelated because
they are correlated if Σ is not diagonal. Under the assumption of normality,
explicit likelihood inference (i.e., MLEs and LRTs) is not possible unless the
row spaces L(X1 ), . . . , L(Xp ) are nested. (But see Remark 8.9.)
¯

8.3.2. Invariant formulation of block-triangular matrices.


The invariant algebraic definition of a MANOVA subspace in Definition
8.1 suggests an invariant algebraic definition of generalized block-triangular
matrices. First, for any increasing sequence of integers

0 ≡ p0 < p1 < p2 < · · · < pr < pr+1 ≡ p (1 < r < p)

define the sequence

(8.17) {0} ⊂ V1 ⊂ V2 ⊂ · · · ⊂ Vr ⊂ Rp

118
STAT 542 Notes, Winter 2007; MDP

of proper linear subspaces of Rp as follows:

(8.18) Vi = span{1 , 2 , . . . , pi }, i = 1, . . . , r.

Consider a partitioned matrix

(8.19) A ≡ (Aij | 1 ≤ i, j ≤ r) ∈ M(p, p),

where Aij ∈ M(pi − pi−1 , pj − pj−1 ). Then A is upper block triangular,


i.e., Aij = 0 for 1 ≤ j < i ≤ r, if and only if [verify!]

AVi ⊆ Vi , i = 1, . . . , r

Thus the set of A of upper block-triangular matrices can be defined in the


following algebraic way:

(8.20) A ≡ A(p1 , . . . , pr ) := {A ∈ M(p, p) | AVi ⊆ Vi , i = 1, . . . , r}.

Exercise 8.7. Give an algebraic definition of the set of lower block trian-
gular matrices. ¯

More generally, let

(8.21) {0} ⊂ V1 ⊂ V2 ⊂ · · · ⊂ Vr ⊂ Rp

be a general increasing sequence of proper linear subspaces of Rp and define

(8.22) A ≡ A(V1 , . . . , Vr ) := {A ∈ M(p, p) | AVi ⊆ Vi , i = 1, . . . , p}.

Note that this is a completely invariant ≡ coordinate-free algebraic defi-


nition, and immediately implies that A is a matrix algebra, i.e., is closed
under matrix addition and multiplication [verify], and Ip ∈ A. The algebra
A is called the algebra of block-triangular matrices with respect to V1 , . . . , Vr .
The proper subset A∗ ⊂ A consisting of all nonsingular matrices in A is a
matrix group, i.e., it contains the identity matrix and is closed under matrix
inversion [verify]. Finally, it is readily seen that A(V1 , . . . , Vr ) is isomorphic
to A(p1 , . . . , pr ) under a similarity transformation, where pi := dim(Vi ). ¯

119
STAT 542 Notes, Winter 2007; MDP

Remark 8.8. Suppose that V1 , . . . , Vr is an arbitrary (i.e., non-nested)


finite collection of proper linear supspaces of Rp . Define A ≡ A(V1 , . . . , Vr )
as in (8.22). Then A is a generalized block-triangular matrix algebra [verify!]
and A∗ is a generalized block-triangular matrix group. Note too that

(8.23) A(V1 , . . . , Vr ) = A(L(V1 , . . . , Vr )),

where L(V1 , . . . , Vr ) is the lattice of linear subspaces generated from


(V1 , . . . , Vr ) by all possible finite unions and intersections.
¯

Remark 8.9. The algebra A ≡ A(L(V1 , . . . , Vr )) plays an important role


in the theory of normal lattice conditional independence (LCI) models (An-
dersson and Perlman (1993) Annals of Statistics). A subspace L ⊆ M(p, n)
is called an A-subspace if AL ⊆ L. It is shown by A&P (IMS Lecture Notes
Vol. 24, 1994) that if the linear model subspace L of a normal multivariate
linear model is an A-subspace and if the covariance structure satisfies a
corresponding set of LCI constraints, then the MLE and LRT statistics can
be obtained explicitly. This was extended to ADG covariance models by
A&P (J. Multivariate Analysis 1998), and to SUR models and non-nested
missing data models with conforming LCI covariance structure by Drton,
Andersson, and Perlman (J. Multivariate Analysis 2006). ¯

8.3.3. The GMANOVA model and testing problem.


(Recall Example 6.39.) [To be completed]
¯

120
STAT 542 Notes, Winter 2007; MDP

9. Testing and Estimation with Missing/Incomplete Data.


Let Y1 , . . . , Ym be an i.i.d. random sample from Np (µ, Σ) with µ and Σ
unknown. Partition Yk , µ, and Σ as

p1 p2
     
p1 Y1k µ1 p1 Σ11 Σ12
Yk = , µ= , Σ= .
p2 Y2k µ2 p2 Σ21 Σ22

Consider n additional i.i.d. observations V1 , . . . , Vn from Np2 (µ2 , Σ22 ), in-


dependent of Y1 , . . . , Ym . Here V1 , . . . , Vn can be viewed as incomplete ob-
servations from the original distribution Np (µ, Σ). We shall find the MLEs
µ̂, Σ̂ based on Y1 , . . . , Ym , V1 , . . . , Vn .
Because
Y1k | Y2k ∼ Np1 (α + βY2k , Σ11·2 ),
β = Σ12 Σ−1
22 ,
α = µ1 − βµ2 ,
the likelihood function (LF ≡ joint pdf of Y1 , . . . , Ym , V1 , . . . , Vn ) can be
written in the form

m
(1)

m
(2)

n
(2)
(9.1) fα,β,Σ11·2 (y1k | y2k ) fµ2 ,Σ22 (y2k ) fµ2 ,Σ22 (vk )
k=1 k=1 k=1
−m/2
−1 m ∗2

=c · |Σ11·2 | exp − 1
2 tr Σ11·2
k=1 (y 1k − α − βy2k )
−(m+n)/2
1 −1
%
m
∗2
n
&

· |Σ22 | exp − 2 tr Σ22 (y2k − µ2 ) + (vk − µ2 )∗2 ,


k=1 k=1

where (y)∗2 := yy  and the parameters α, β, Σ11·2 , µ2 , Σ22 vary indepen-


dently over their respective ranges. Thus we see that the LF is the product
of two LFs, the first that of a multivariate normal linear regression model
 
e
Np1 m ((α, β) , Σ11·2 )
Z

with e = (1, . . . , 1) : 1 × m and Z = (Y21 , . . . , Y2m ), and the second that of


m + n i.i.d. observations from Np (µ2 , Σ22 ).

121
STAT 542 Notes, Winter 2007; MDP

The MLEs for these models are given in (3.15), (3.16), (3.34), and
(3.35). To assure the existence of the MLE, the single condition m ≥ p + 1
is necessary and sufficient [verify!]. (This is the same condition required for
existence of the MLE based on the complete observations Y1 , . . . , Ym only.)
If this condition holds, then the MLEs of α, β, Σ11·2 , µ2 , Σ22 are as follows:

mȲ2 + nV̄
α̂ = Ȳ1 − β̂ Ȳ2 , µ̂2 = ,
m+n

−1 ∗2
(9.2) β̂ = S12 S22 , Σ̂22 1
= m+n S22 + T + mn
m+n (Ȳ2 − V̄ ) ,
1
Σ̂11·2 = m S11·2 ,

[verify!], where
m n
S= (Yk − Ȳ )∗2 , T = (Vk − V̄ )∗2 .
k=1 k=1

m+n
Verify that m+n−1 Σ̂22 is the sample covariance matrix based on the com-
bined sample Y21 , . . . , Y2m , V1 , . . . , Vn . Furthermore, the maximum value of
the LF is given by

(9.3) c · |Σ̂11·2 |−m/2 |Σ̂22 |−(m+n)/2 exp − 12 (mp + np2 ) .

Remark 9.1. The pairs (Ȳ , S) and (V̄ , T ) together form a complete and
sufficient statistic for the above incomplete data model.
¯

Remark 9.2. This analysis can be extended to the case of a monotone ≡


nested incomplete data model. The observed data consists of independent
observations of the forms
     
Y1
 Y2   Y2 
(9.4)  . ,  . , ..., 


,
 ..   .. 
Yr Yr Yr

where a complete observation Y ∼ Np (µ, Σ). The MLEs are obtained by


factoring the joint pdf of Y1 , . . . , Yr as

122
STAT 542 Notes, Winter 2007; MDP

(9.5) f (y1 , . . . , yr ) = f (y1 |y2 , . . . , yr )f (y2 |y3 , . . . , yr ) · · · f (yr−1 |yr )f (yr )

and noting that each conditional pdf is the LF of a normal linear regression
model.
¯

Exercise 9.4. Find the LRTs based on Y1 , . . . , Ym , V1 , . . . , Vn for testing


problems (i) and (ii) below. Argue that no explicit expression is available
for the LRT statistic in (iii). (Eaton and Kariya (1983) Ann. Statist.)

(i) H1 : µ2 = 0 vs. H : µ2 = 0 (µ1 and Σ unspecified).


(ii) H2 : µ1 = 0, µ2 = 0 vs.  0, µ2 = 0 ( Σ unspecified).
H : µ1 =
(iii) H3 : µ1 = 0 vs. H : µ1 = 0 (µ2 and Σ unspecified).

Partial solutions: First, for each testing problem, the LF is given by (9.1)
and its maximum under H given by (9.3).
(i) Because α = µ1 when µ2 = 0, it follows from (9.1) that the LF under
H1 is given by


m

−m/2 −1
c ·|Σ11·2 | exp − 1
2 tr Σ11·2 (y1k − µ1 − βy2k )∗2
k=1
(9.6)
−(m+n)/2
−1
%
m
∗2

n
&

· |Σ22 | exp − 1
2 tr Σ22 (y2k ) + (vk )∗2 ,
k=1 k=1

Thus the maximum of the LF under H1 is given by


(9.7) c · |Σ̂11·2 |−m/2 |Σ̃22 |−(m+n)/2 exp − 12 (mp + np2 ) ,

where
1
Σ̃22 := m+n (S̃22 + T̃ )
 m 
n
1 ∗2 ∗2
= m+n (Y2k ) + (Vk )
k=1 k=1

123
STAT 542 Notes, Winter 2007; MDP

[verify!]. Thus, by (9.3) and (9.7) the LRT rejects H2 in favor of H for large
values of [verify!]
 ∗2 
 mȲ2 +nV̄ 
|Σ̃22 | Σ̂22 + m+n 
=
|Σ̂22 | |Σ̂22 |

= 1+ mȲ2 +nV̄
m+n Σ̂−1
22
mȲ2 +nV̄
m+n

≡ 1 + T22 .

Note that T22 is exactly the T 2 statistic for testing µ2 = 0 vs. µ2 = 0 based
on the combined sample Y21 , . . . , Y2m , V1 , . . . , Vn , so the LRT ignores the
observations Y11 , . . . , Y1m .
(ii) The LRT statistic is the product of the LRT statistics for problem (i)
and for the problem of testing µ1 = 0, µ2 = 0 vs. µ1 = 0, µ2 = 0 (see
Exercise 6.14). Both LRTs can be obtained explicitly, but the distribution
of their product is not simple. (See Eaton and Kariya (1983).)
(iii) Under H3 : µ1 = 0, µ2 appears in different forms in the two exponen-
tials on the right-hand side of (9.1), hence maximization over µ2 cannot be
done explicitly.
¯

Exercise 9.5. For simplicity, assume µ is known, say µ = 0. Find the LRT
based on Y1 , . . . , Ym , V1 , . . . , Vn for testing

H0 : Σ12 = 0 vs. H : Σ12 = 0 (Σ11 and Σ22 unspecified).

Solution: The LRT statistic for this problem is the same as if the addi-
tional observations V1 , . . . , Vn were not present (cf. Exercise 6.24), namely
|S11 ||S22 |
|S| . This can be seen by examining the LF factorization in (9.1) when
µ = 0 (so α = 0 and µ2 = 0). The null hypothesis H0 : Σ12 = 0 is equivalent
to β = 0, so the second exponential on the right-hand side of (9.1) is the
same under H0 and H, hence has the same maximum value under H0 and
H. Thus this second factor cancels when forming the LRT statistic, hence
the LRT does not involve V1 , . . . , Vn .
¯

124
STAT 542 Notes, Winter 2007; MDP

9.1. Lattice conditional independence (LCI) models for non-


monotone missing/incomplete data.
If the incomplete data pattern is non-monotone ≡ non-nested, then no
explicit expressions exist for the MLEs. Instead, an iterative procedure
such as the EM algorithm must be used to compute the MLEs. (Caution:
convergence to the MLE is not always guaranteed, and the choice of starting
point may affect the convergence properties.)
An example of a non-monotone incomplete data pattern is
       
Y1 Y1
(9.8)  Y2  ,   ,  Y2  ,   .
Y3 Y3 Y3 Y3

Here no compatible factorization of the joint pdf such as (9.5) is possible.


However, Rubin (Multiple Imputation, 1987) and Andersson and Perlman
(Statist. Prob. Letters, 1991) have pointed out that a compatible factor-
ization is possible if a parsimonious set of lattice conditional independence
(LCI) restrictions determined by the incomplete data pattern is imposed on
the (unknown) covariance matrix Σ. In the present example, these restric-
tions reduce to the single condition Y1 ⊥ ⊥ Y2 | Y3 , in which case the joint
pdf of Y1 , Y2 , Y3 factors as

(9.9) f (y1 , y2 , y3 ) = f (y1 |y3 )f (y2 |y3 )f (y3 ).

Here again each conditional pdf is the LF of a normal linear regression


model, so the MLEs of the corresponding regression parameters can be
obtained explicitly.
Of course, the LCI restriction may not be defensible, but it can be
tested. If it is rejected, at least the MLEs obtained under the LCI restriction
may serve as a reasonable starting value for the EM algorithm. (See L. Wu
and M. D. Perlman (2000) Communications in Statistics - Simulation and
Computation 29 481-509.)

[Add handwritten notes on LCI models.]

125
STAT 542 Notes, Winter 2007; MDP

Appendix A. Monotone Likelihood Ratio and Total Positivity.

In Section 6 we study multivariate hypothesis testing problems which re-


main invariant under a group of symmetry transformations. In order to
respect these symmetries, we shall restrict consideration to test functions
that possess the same invariance properties and seek a uniformly most pow-
erful invariant (UMPI) test. Under multivariate normality, the distribution
of a UMPI test statistic is often a noncentral chi-square or related noncen-
tral distribution. To verify the UMPI property it is necessary to establish
that the noncentral distribution has monotone likelihood ratio (MLR) with
respect to the noncentrality parameter. For this we will rely on the relation
between the MLR property and total positivity of order 2.

Definition A.1. Let f (x, y) ≥ 0 be defined on A × B, a Cartesian product


of intervals in R1 . We say that f is totally positive of order 2 (TP2) if
 
 f (x1 , y1 ) f (x1 , y2 ) 
 
 f (x2 , y1 ) f (x2 , y2 )  ≥ 0 ∀ x1 < x2 , y1 < y2 ,

i .e., if
(A.1) f (x1 , y1 )f (x2 , y2 ) ≥ f (x1 , y2 )f (x2 , y1 ).

If f > 0 on A × B then (5.1) is equivalent to the following condition:

f (x2 , y)
(A.2) is nondecreasing in y ∀ x1 < x2 .
f (x1 , y)

Note that f (x, y) is TP2 on A × B iff f (y, x) is TP2 on B × A.


¯

Fact A.2. If f and g are TP2 on A × B then f · g is TP2 on A × B. In


particular, a(x)b(y)f (x, y) is TP2 for any a(·) ≥ 0 and b(·) ≥ 0.
¯

Fact A.3. If f is TP2 on A × B  and φ : A → A and ψ : B → B  are


both increasing or both decreasing, then f (φ(x), ψ(y)) is TP2 on A × B.
¯

∂ 2 log f
Fact A.4. If f (x, y) > 0 and ∂x∂y ≥ 0 on A × B then f is TP2.
¯

126
STAT 542 Notes, Winter 2007; MDP

Fact A.5. If f (x, y) = g(x − y) and g : R1 → [0, ∞) is log-concave, then


f is TP2 on R2 .
Proof. Let h(x) = log g(x). For x1 < x2 , y1 < y2 set

s = x1 − y1 , u = x1 − y2 ,
t = x2 − y2 , v = x2 − y1 .
Then [verify]
u ≤ min(s, t) ≤ max(s, t) ≤ v,
s + t = u + v,
so, since h is concave,
h(s) + h(t) ≥ h(u) + h(v),

which is equivalent to the TP2 condition (A.1) for f (x, y) ≡ g(x − y).
¯

These Facts yield the following examples of TP2 functions f (x, y):

Example A.6. Exponential kernel: f (x, y) = exy is TP2 on R1 × R1 .

Example A.7. Exponential family: f (x, y) = a(x)b(y)eφ(x)ψ(y) is TP2 on


A × B if a(·) ≥ 0 on A, b(·) ≥ 0 on B, φ(·) is increasing on A, and ψ(·) is
increasing on B. In particular, f (x, y) = xy is TP2 on (0, ∞) × R1 .

Example A.8. Order kernel: f (x, y) = (x − y)α + and f (x, y) = (x − y)−


α

are TP2 on R1 × R1 for α ≥ 0. [I(0,∞) and I(∞,0) are log concave on R1 .]

The following is a celebrated result in the theory of total positivity.


Proposition A.9. Composition Lemma ≡ Karlin’s Lemma (due to
Polya and Szego). If g(x, y) is TP2 on A × B and h(x, y) is TP2 on B × C,
then for any σ-finite measure µ,

(A.3) f (x, z) := g(x, y)h(y, z)dµ(y)
B

is TP2 on A × C.

127
STAT 542 Notes, Winter 2007; MDP

Proof. For x1 ≤ x2 and z1 ≤ z2 ,


f (x1 , z1 )f (x2 , z2 ) − f (x1 , z2 )f (x2 , z1 )

= g(x1 , y)g(x2 , u)[h(y, z1 )h(u, z2 ) − h(y, z2 )h(u, z1 )]dµ(y)dµ(u)
  
= + + .
{y<u} {y>u} {y=u}
  
=0

By interchanging the dummy variables y and u, however, we see that



g(x1 , y)g(x2 , u)[h(y, z1 )h(u, z2 ) − h(y, z2 )h(u, z1 )]dµ(y)dµ(u)
{y>u}

= g(x1 , u)g(x2 , y)[h(u, z1 )h(y, z2 ) − h(u, z2 )h(y, z1 )]dµ(y)dµ(u)
{y<u}
so   
+
{y<u} {y>u}

= [g(x1 , y)g(x2 , u) − g(x1 , u)g(x2 , y)]
{y<u}
· [h(y, z1 )h(u, z2 ) − h(y, z2 )h(u, z1 )]dµ(y)dµ(u) ≥ 0
since g and h are TP2. Thus h is TP2.
¯
∞ k k
Example A.10. Power series: f (x, y) = k=0 ck x y is TP2 on
(0, ∞) × (0, ∞) if ck ≥ 0 ∀k.
Proof. Apply the Composition Lemma with g(x, k) = xk , h(k, y) = y k ,
and µ the measure that assigns mass ck to k = 0, 1, . . ..
¯

Definition A.11. Let {f (x|λ) | λ ∈ Λ} be a 1-parameter family of pdfs


(discrete or continuous) for a real random variable X with range X , where
both X and Λ are intervals in R1 . We say that f (x|λ) has monotone
likelihood ratio (MLR) if f (x|λ) is TP2 on X × Λ. ¯

Proposition A.12. MLR preserves monotonicity. If f (x|λ) has MLR


and g(x) is nondecreasing on X , then

Eλ [g(X)] ≡ g(x) f (x|λ)dν(x)
X

128
STAT 542 Notes, Winter 2007; MDP

is nondecreasing in λ (ν is either counting measure or Lebesgue measure).


Proof. Set h(λ) = Eλ [g(X)]. Then for any λ1 ≤ λ2 in Λ,

h(λ2 ) − h(λ1 )

= g(x)[f (x|λ2 ) − f (x|λ1 )]dν(x)
11
= 12 [g(x) − g(y)] [f (x|λ2 )f (y|λ1 ) − f (y|λ2 )f (x|λ1 )]dν(x)dν(y)
≥ 0,

since the two [· · ·] terms are both ≥ 0 if x ≥ y or both ≤ 0 if x ≤ y.


¯

Remark A.13. If {f (x|λ)} has MLR and X ∼ f (x|λ), then for each a ∈ X ,
% &
Prλ [ X > a] ≡ Eλ I(a,∞)(x)

is nondecreasing in λ, hence X is stochastically increasing in λ.


¯

Example A.14. The noncentral chi-square distribution χ2n (δ) has


MLR w.r.to δ.
From (2.27), a noncentral chi-square rv χ2n (δ) with n df and noncentrality
parameter δ is a Poisson(δ/2)-mixture of central chi-square rvs:

(A.4) χ2n (δ)  K = k ∼ χ2n+2k , K ∼ Poisson(δ/2).

Thus if fn (x|δ) and fn (x) denote the pdfs of χ2n (δ) and χ2n , then


fn (x|δ) = fn+2k (x) Pr[K = k]
k=0

   δ
k 
 x 2 +k−1 e− 2
n x
e− 2 2δ
= n
·
k=0
2 2 +1 Γ n2 + k k!


2 −1 −x − δ2
n
(A.5) ≡x e 2 ·e · ck xk δ k ,
k=0

where ck ≥ 0. Thus by A.2, A.3, and A.10, fn (x|δ) is TP2 in (x, δ).
¯

129
STAT 542 Notes, Winter 2007; MDP

Example A.15. The noncentral F distribution Fm,n (δ) has MLR


w.r.to δ. Let
χ2m (δ)
distn
(A.6) Fm,n (δ) = ,
χ2n
the ratio of two independent chi-square rvs with χ2m (δ) noncentral and χ2n
central. From (A.4), Fm,n (δ) can be represented as a Poisson mixture of
central F distributions:

(A.7) Fm,n (δ)  K = k ∼ Fm+2k,n , K ∼ Poisson (δ/2) ,

so if fm,n (x|δ) and fm,n (x) now denote the pdfs of Fm,n (δ) and Fm,n , then


fm,n (x|δ) = fm, n+2k (x) Pr[K = k]
k=0



  δ
k 
 Γ m+n + k x
m
2 +k−1 e− 2 2δ
= m 2
n
· m+n ·
Γ 2 + k Γ 2 (x + 1) 2 +k−1 k!
k=0

x 2 −1
m
∞ x k
− δ2
(A.8) ≡ m+n ·e · dk δk ,
(x + 1) 2 −1 k=0
x+1

where dk ≥ 0. Thus by A.2 and A.10, fm,n (x|δ) is TP2 in (x, δ).
¯

Question A.16. Does χ2n (δ) have MLR w.r.to n? (δ fixed) Does Fm,n (δ)
have MLR w.r.to m? (n, δ fixed) ¯

Proposition A.17. Scale mixture of a TP2 kernel. Let g(x, y) be


TP2 on R1 × (0, ∞) and let h be a nonnegative function on (0, ∞) such
that h (y/ζ) is TP2 for (y, ζ) ∈ (0, ∞) × (0, ∞). Then
 ∞
(A.9) f (x, ζ) := g(x, ζz)h(z)dz
0

is TP2 on R1 × (0, ∞).


Proof. Set y = ζz, so
 ∞  
y dy
f (x, ζ) = g(x, y) h ,
0 ζ ζ

130
STAT 542 Notes, Winter 2007; MDP

hence the result follows from the Composition Lemma.


¯

Example A.18. The distribution of the multiple correlation coef-


ficient R2 has MLR w.r. to ρ2 .
Let R2 , ρ2 , U , ζ, and Z be as defined in Example 3.21 (also see Example
6.26 and Exercise 6.27). From (3.68),

U  Z ∼ Fp−1, n−p+1 (ζZ),
(A.10)
Z ∼ χ2n ,

so the unconditional pdf of u with parameter ζ is given by


 ∞
f (u|ζ) = fp−1, n−p+1 (u| ζz) fn (z)dz
0

where fp−1, n−p+1 (·| ζz) and fn (·) are the pdfs for Fp−1, n−p+1 (ζz) and χ2n ,
respectively. Then fp−1, n−p+1 (u|y) is TP2 in (u, y) by Example A.15, while
n2 −1 y
fn y
ζ =c· y
ζ e− 2ζ

is TP2 in (y, ζ) by Example A.7, so f (u|ζ) is TP2 in (u, ζ) by Proposition


A.17. Finally, because U and ζ are increasing functions of R2 and ρ2 ,
respectively, it follows by Fact A.3 that the distribution of R2 has MLR
w.r.to ρ2 .
¯

131

You might also like