
STAT 542 Notes, Winter 2007; MDP

STAT 542: MULTIVARIATE STATISTICAL ANALYSIS

1. Random Vectors and Covariance Matrices.

1.1. Review of vectors and matrices. (The results are stated for
vectors and matrices with real entries but also hold for complex entries.)
An m × n matrix A ≡ {a_{ij}} is an array of mn numbers:

A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}.
This matrix represents the linear mapping (≡ linear transformation)

(1.1)   A : R^n → R^m,   x → Ax,

where x ∈ R^n is written as an n × 1 column vector and

Ax ≡ \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} ≡ \begin{pmatrix} \sum_{j=1}^n a_{1j}x_j \\ \vdots \\ \sum_{j=1}^n a_{mj}x_j \end{pmatrix} ∈ R^m.

The mapping (1.1) clearly satisfies the linearity property:

A(ax + by) = aAx + bAy.
Matrix addition: If A ≡ {aij } and B ≡ {bij } are m × n matrices, then
(A + B)ij = aij + bij .
Matrix multiplication: If A is m × n and B is n × p, then the matrix
product AB is the m × p matrix whose ij-th element is

(1.2)   (AB)_{ij} = \sum_{k=1}^n a_{ik} b_{kj}.

Then AB is the matrix of the composition R^p --B--> R^n --A--> R^m of the two
linear mappings determined by A and B [verify]:

(AB)x = A(Bx)   ∀ x ∈ R^p.
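A quick numerical check of this composition property (a minimal sketch in numpy; the dimensions and random matrices below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 3, 4, 2
A = rng.standard_normal((m, n))   # A : R^n -> R^m
B = rng.standard_normal((n, p))   # B : R^p -> R^n
x = rng.standard_normal(p)

# (AB)x equals A(Bx): the matrix product represents the composition of the maps
assert np.allclose((A @ B) @ x, A @ (B @ x))
```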


Transpose matrix: If A ≡ {a_{ij}} is m × n, its transpose is the n × m
matrix A' (sometimes denoted by A^T) whose ij-th element is a_{ji}. That is,
the m row vectors (n column vectors) of A are the m column vectors (n
row vectors) of A'. Note that [verify]

(1.3)   (A + B)' = A' + B';
(1.4)   (AB)' = B'A'   (A : m × n, B : n × p);
(1.5)   (A^{-1})' = (A')^{-1}   (A : n × n, nonsingular).

Rank of a matrix: The row (column) rank of a matrix A : m × n is the
dimension of the linear space spanned by its rows (columns). The rank of
A is the order r of the largest nonzero minor (= r × r subdeterminant)
of A. Then [verify]

row rank(A) ≤ min(m, n),
column rank(A) ≤ min(m, n),
rank(A) ≤ min(m, n),

row rank(A) = n − dim [row space(A)]^⊥,
column rank(A) = m − dim [column space(A)]^⊥,

row rank(A) = column rank(A)
            = rank(A) = rank(A')
            = rank(AA') = rank(A'A).

Furthermore, for A : m × n and B : n × p,

rank(AB) ≤ min(rank(A), rank(B)).

Inverse matrix: If A : n × n is a square matrix, its inverse A−1 (if it


exists) is the unique matrix that satisfies

AA−1 = A−1 A = I,

where I ≡ In is the n × n identity matrix diag(1, . . . , 1). If A−1 exists then


A is called nonsingular (or regular). The following are equivalent:


(a) A is nonsingular.
(b) The n columns of A are linearly independent (i.e., column rank(A) = n).
Equivalently, Ax ≠ 0 for every nonzero x ∈ R^n.
(c) The n rows of A are linearly independent (i.e., row rank(A) = n).
Equivalently, x'A ≠ 0 for every nonzero x ∈ R^n.
(d) The determinant |A| ≠ 0 (i.e., rank(A) = n). [Define det geometrically.]

Note that if A is nonsingular then A^{-1} is nonsingular and (A^{-1})^{-1} = A.

If A : m × m and C : n × n are nonsingular and B is m × n, then [verify]

rank(AB) = rank(B) = rank(BC).

If A : n × n and B : n × n are nonsingular then so is AB, and [verify]

(1.6)   (AB)^{-1} = B^{-1}A^{-1}.

If A ≡ diag(d_1, . . . , d_n) with all d_i ≠ 0 then A^{-1} = diag(d_1^{-1}, . . . , d_n^{-1}).

Trace: For a square matrix A ≡ {a_{ij}} : n × n, the trace of A is

(1.7)   tr(A) = \sum_{i=1}^n a_{ii},

the sum of the diagonal entries of A. Then

(1.8)   tr(aA + bB) = a tr(A) + b tr(B);
(1.9)   tr(AB) = tr(BA);   (Note: A : m × n, B : n × m)
(1.10)  tr(A') = tr(A).    (A : n × n)

Proof of (1.9):

tr(AB) = \sum_{i=1}^m (AB)_{ii} = \sum_{i=1}^m \sum_{k=1}^n a_{ik} b_{ki}
       = \sum_{k=1}^n \sum_{i=1}^m b_{ki} a_{ik} = \sum_{k=1}^n (BA)_{kk} = tr(BA).


Determinant: For a square matrix A ≡ {a_{ij}} : n × n, its determinant is

|A| = \sum_\pi \epsilon(\pi) \prod_{i=1}^n a_{i\pi(i)}
    = ± Volume(A([0, 1]^n)),

where \pi ranges over all n! permutations of 1, . . . , n and \epsilon(\pi) = ±1 according
to whether \pi is an even or odd permutation. Then

(1.11)  |AB| = |A| · |B|   (A, B : n × n);
(1.12)  |A^{-1}| = |A|^{-1};
(1.13)  |A'| = |A|;
(1.14)  |A| = \prod_{i=1}^n a_{ii}   if A is triangular (or diagonal).

Orthogonal matrix. An n × n matrix Γ is orthogonal if

(1.15)  ΓΓ' = I.

This is equivalent to the fact that the n row vectors of Γ form an orthonor-
mal basis for R^n. Note that (1.15) implies that Γ' = Γ^{-1}, hence also
Γ'Γ = I, which is equivalent to the fact that the n column vectors of Γ also
form an orthonormal basis for R^n.

Note that Γ preserves angles and lengths, i.e., preserves the usual inner
product and norm in R^n: for x, y ∈ R^n,

(Γx, Γy) ≡ (Γx)'(Γy) = x'Γ'Γy = x'y ≡ (x, y),

so

‖Γx‖^2 ≡ (Γx, Γx) = (x, x) ≡ ‖x‖^2.

In fact, any orthogonal transformation is a product of rotations and reflec-
tions. Also, from (1.13) and (1.15), |Γ|^2 = 1, so |Γ| = ±1.
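A small numerical illustration of these facts (a sketch; the orthogonal matrix is obtained from a QR factorization of a random matrix, a convenience not used in the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
# QR factorization of a random square matrix gives an orthogonal Q
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

x, y = rng.standard_normal(n), rng.standard_normal(n)
assert np.allclose(Q @ Q.T, np.eye(n))            # (1.15)
assert np.allclose((Q @ x) @ (Q @ y), x @ y)      # inner products preserved
assert np.isclose(abs(np.linalg.det(Q)), 1.0)     # |Γ| = ±1
```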


Complex numbers and matrices. For any complex number c ≡ a + ib ∈
C, let c̄ ≡ a − ib denote the complex conjugate of c. Note that the conjugate of c̄ is c, and

c c̄ = a^2 + b^2 ≡ |c|^2,
\overline{cd} = c̄ d̄.

For any complex matrix C ≡ {c_{ij}}, let C̄ = {c̄_{ij}} and define C* = C̄'. Note
that

(1.16)  (CD)* = D*C*.

The characteristic roots ≡ eigenvalues of the n × n matrix A are the n roots


l1 , . . . , ln of the polynomial equation

(1.17) |A − l I| = 0.

These roots may be real or complex; the complex roots occur in conjugate
pairs. Note that the eigenvalues of a triangular or diagonal matrix are just
its diagonal elements.
By (b) (for matrices with possibly complex entries), for each eigenvalue
l there exists some nonzero (possibly complex) vector u ∈ Cn s.t.

(A − l I)u = 0,
equivalently,
(1.18) Au = lu.

The vector u is called a characteristic vector ≡ eigenvector for the eigenvalue


l. Since any nonzero multiple cu is also an eigenvector for l, we will usually
normalize u to be a unit vector, i.e., ‖u‖^2 ≡ u*u = 1.
For example, if A is a diagonal matrix, say

A = diag(d_1, . . . , d_n) ≡ \begin{pmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & d_n \end{pmatrix},

then its eigenvalues are just d_1, . . . , d_n, with corresponding eigenvectors
u_1, . . . , u_n, where

(1.19)  u_i ≡ (0, . . . , 0, 1, 0, . . . , 0)'   (the 1 in the i-th position)

is the i-th unit coordinate vector.


Note, however, that in general, eigenvalues need not be distinct and
eigenvectors need not be unique. For example, if A is the identity matrix
I, then its eigenvalues are 1, . . . , 1 and every unit vector u ∈ Rn is an
eigenvector for the eigenvalue 1: Iu = 1 · u.
However, eigenvectors u, v associated with two distinct eigenvalues l,
m cannot be proportional: if u = cv then

lu = Au = cAv = cmv = mu,

which contradicts the assumption that l ≠ m.

Symmetric matrix. An n × n matrix S ≡ {s_{ij}} is symmetric if S = S',
i.e., if s_{ij} = s_{ji} ∀ i, j.

Lemma 1.1. Let S be a real symmetric n × n matrix.

(a) Each eigenvalue l of S is real and has a real eigenvector γ ∈ R^n.
(b) If l ≠ m are distinct eigenvalues of S with corresponding real eigenvec-
tors γ and ψ, then γ ⊥ ψ, i.e., γ'ψ = 0. Thus if all the eigenvalues of S
are distinct, each eigenvalue l has exactly one real unit eigenvector γ (up to sign).

Proof. (a) Let l be an eigenvalue of S with unit eigenvector u. Then

Su = lu  ⇒  u*Su = l u*u = l.

But S is real and symmetric, so S* = S, hence

\overline{u*Su} = (u*Su)* = u*S*u = u*Su.

Thus u*Su is real, hence l is real. Since S − l I is real, the existence of a
real eigenvector γ for l now follows from (b) on p.3.


(b) We have Sγ = lγ and Sψ = mψ, hence

lψ'γ = ψ'Sγ = (ψ'Sγ)' = γ'Sψ = mγ'ψ = mψ'γ,

so γ'ψ = 0 since l ≠ m.
¯

Proposition 1.2. Spectral decomposition of a real symmetric ma-
trix. Let S be a real symmetric n × n matrix with eigenvalues l_1, . . . , l_n
(necessarily real). Then there exists a real orthogonal matrix Γ such that

(1.20)  S = Γ D_l Γ',

where D_l = diag(l_1, . . . , l_n). Since SΓ = ΓD_l, the i-th column vector γ_i of
Γ is a real eigenvector for l_i.

Proof. For simplicity suppose that l_1, . . . , l_n are distinct. Let γ_1, . . . , γ_n be
the corresponding unique real unit eigenvectors (apply Lemma 1.1b). Since
γ_1, . . . , γ_n is an orthonormal basis for R^n, the matrix

(1.21)  Γ ≡ (γ_1, . . . , γ_n) : n × n

satisfies Γ'Γ = I, i.e., Γ is an orthogonal matrix. Since each γ_i is an
eigenvector for l_i, SΓ = ΓD_l [verify], which is equivalent to (1.20).

[The case where the eigenvalues are not distinct can be established by a
"perturbation" argument. Perturb S slightly so that its eigenvalues become
distinct (non-trivial) and apply the first case. Now use a limiting argument
based on the compactness of the set of all n × n orthogonal matrices.]  ¯
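The spectral decomposition (1.20) is easy to verify numerically (a sketch; note that numpy's eigh returns eigenvalues in ascending order, an ordering convention of the library, not of the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
S = B + B.T                      # a real symmetric matrix

l, Gamma = np.linalg.eigh(S)     # columns of Gamma are orthonormal eigenvectors
assert np.allclose(Gamma @ np.diag(l) @ Gamma.T, S)   # S = Γ D_l Γ'
assert np.allclose(Gamma.T @ Gamma, np.eye(4))        # Γ is orthogonal
```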

Lemma 1.3. If S is a real symmetric matrix with eigenvalues l_1, . . . , l_n, then

(1.22)  tr(S) = \sum_{i=1}^n l_i;
(1.23)  |S| = \prod_{i=1}^n l_i.

Proof. This is immediate from the spectral decomposition (1.20) of S.  ¯

Positive definite matrix. An n × n matrix S is positive semi-definite
(psd) (also written as S ≥ 0) if it is symmetric and its quadratic form is
nonnegative:

(1.24)  x'Sx ≥ 0   ∀ x ∈ R^n;

S is positive definite (pd) (also written as S > 0) if it is symmetric and its
quadratic form is positive:

(1.25)  x'Sx > 0   ∀ nonzero x ∈ R^n.

• The identity matrix is pd: x'Ix = ‖x‖^2 > 0 if x ≠ 0.
• A diagonal matrix diag(d_1, . . . , d_n) is psd (pd) iff each d_i ≥ 0 (> 0).
• If S : n × n is psd, then ASA' is psd for any A : m × n.
• If S : n × n is pd, then ASA' is pd for any A : m × n of full rank m ≤ n.
• AA' is psd for any A : m × n.
• AA' is pd for any A : m × n of full rank m ≤ n.
Note: This shows that the proper way to "square" a matrix A is to form
AA' (or A'A), not A^2, which need not even be symmetric.
• S pd ⇒ S has full rank ⇒ S^{-1} exists ⇒ S^{-1} ≡ (S^{-1})S(S^{-1})' is pd.

Lemma 1.4. (a) A symmetric n × n matrix S with eigenvalues l_1, . . . , l_n
is psd (pd) iff each l_i ≥ 0 (> 0). In particular, |S| ≥ 0 (> 0) if S is psd
(pd), so a pd matrix is nonsingular.
(b) Suppose S is pd with distinct eigenvalues l_1 > · · · > l_n > 0 and corre-
sponding unique real unit eigenvectors γ_1, . . . , γ_n. Then the set

(1.26)  E ≡ {x ∈ R^n | x'S^{-1}x = 1}

is the ellipsoid with principal axes √l_1 γ_1, . . . , √l_n γ_n.

Proof. (a) Apply the above results and the spectral decomposition (1.20).
(b) From (1.20), S = ΓD_lΓ' with Γ = (γ_1 · · · γ_n), so S^{-1} = ΓD_l^{-1}Γ' and

E = {x ∈ R^n | (Γ'x)' D_l^{-1} (Γ'x) = 1}
  = Γ{y ∈ R^n | y' D_l^{-1} y = 1}    (y = Γ'x)
  = Γ{ y ≡ (y_1, . . . , y_n)' | y_1^2/l_1 + · · · + y_n^2/l_n = 1 }
  ≡ ΓE_0.

But E_0 is the ellipsoid with principal axes √l_1 u_1, . . . , √l_n u_n (recall (1.19))
and Γu_i = γ_i, so E is the ellipsoid with principal axes √l_1 γ_1, . . . , √l_n γ_n.
¯

Square root of a pd matrix. Let S be an n × n pd matrix. Any n × n
matrix A such that AA' = S is called a square root of S, denoted by S^{1/2}.
From the spectral decomposition S = ΓD_lΓ', one version of S^{1/2} is

(1.27)  S^{1/2} = Γ diag(l_1^{1/2}, . . . , l_n^{1/2}) Γ' ≡ ΓD_l^{1/2}Γ';

this is a symmetric square root of S. Any square root S^{1/2} is nonsingular, for

(1.28)  |S^{1/2}| = |S|^{1/2} > 0.

Partitioned pd matrix. Partition the pd matrix S : n × n as

(1.29)  S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}   (S_{11} : n_1 × n_1, S_{22} : n_2 × n_2),

where n_1 + n_2 = n. Then both S_{11} and S_{22} are symmetric pd [why?],
S_{12} = S_{21}', and [verify!]

(1.30)  \begin{pmatrix} I_{n_1} & -S_{12}S_{22}^{-1} \\ 0 & I_{n_2} \end{pmatrix} \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix} \begin{pmatrix} I_{n_1} & 0 \\ -S_{22}^{-1}S_{21} & I_{n_2} \end{pmatrix} = \begin{pmatrix} S_{11·2} & 0 \\ 0 & S_{22} \end{pmatrix},

where

(1.31)  S_{11·2} ≡ S_{11} − S_{12}S_{22}^{-1}S_{21}

is necessarily pd [why?]. This in turn implies the two fundamental identities

(1.32)  \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix} = \begin{pmatrix} I_{n_1} & S_{12}S_{22}^{-1} \\ 0 & I_{n_2} \end{pmatrix} \begin{pmatrix} S_{11·2} & 0 \\ 0 & S_{22} \end{pmatrix} \begin{pmatrix} I_{n_1} & 0 \\ S_{22}^{-1}S_{21} & I_{n_2} \end{pmatrix},

(1.33)  \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}^{-1} = \begin{pmatrix} I_{n_1} & 0 \\ -S_{22}^{-1}S_{21} & I_{n_2} \end{pmatrix} \begin{pmatrix} S_{11·2}^{-1} & 0 \\ 0 & S_{22}^{-1} \end{pmatrix} \begin{pmatrix} I_{n_1} & -S_{12}S_{22}^{-1} \\ 0 & I_{n_2} \end{pmatrix}.
The following three consequences of (1.32) and (1.33) are immediate:


(1.34)  S is pd ⇐⇒ S_{11·2} and S_{22} are pd ⇐⇒ S_{22·1} and S_{11} are pd.

(1.35)  |S| = |S_{11·2}| · |S_{22}| = |S_{22·1}| · |S_{11}|.

For x ≡ \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} ∈ R^n, the quadratic form x'S^{-1}x can be decomposed as

(1.36)  x'S^{-1}x = (x_1 − S_{12}S_{22}^{-1}x_2)' S_{11·2}^{-1} (x_1 − S_{12}S_{22}^{-1}x_2) + x_2'S_{22}^{-1}x_2.

Exercise 1.5. Cholesky decompositions of a pd matrix. Use (1.32)
and induction on n to obtain an upper triangular square root U of S, i.e.,
S = UU'. Similarly, S has a lower triangular square root L, i.e. S = LL'.
Note: Both U ≡ {u_{ij}} and L ≡ {l_{ij}} are unique if the positivity conditions
u_{ii} > 0 ∀i and l_{ii} > 0 ∀i are imposed on their diagonal elements. To see
this for U, suppose that UU' = VV' where V is also an upper triangular
matrix with each v_{ii} > 0. Then U^{-1}V(U^{-1}V)' = I, so Γ ≡ U^{-1}V is both
upper triangular and orthogonal, hence Γ = diag(±1, . . . , ±1) =: D [why?]
Thus V = UD, and the positivity conditions imply that D = I.
¯

Projection matrix. An n × n matrix P is a projection matrix if it is
symmetric and idempotent: P^2 = P.

Lemma 1.6. P is a projection matrix iff it has the form

(1.37)  P = Γ \begin{pmatrix} I_m & 0 \\ 0 & 0 \end{pmatrix} Γ'

for some orthogonal matrix Γ : n × n and some m ≤ n. In this case,
rank(P) = m = tr(P).

Proof. Since P is symmetric, P = ΓD_lΓ' by its spectral decomposition.
But the idempotence of P implies that each l_i = 0 or 1. (A permutation of
the rows and columns, which is also an orthogonal transformation, may be
necessary to obtain the form (1.37).)
¯

Interpretation of (1.37): Partition Γ as

(1.38)  Γ = (Γ_1  Γ_2),   Γ_1 : n × m,   Γ_2 : n × (n − m),

so (1.37) becomes

(1.39)  P = Γ_1Γ_1'.

But Γ is orthogonal so Γ'Γ = I_n, hence

(1.40)  Γ'Γ ≡ \begin{pmatrix} Γ_1'Γ_1 & Γ_1'Γ_2 \\ Γ_2'Γ_1 & Γ_2'Γ_2 \end{pmatrix} = \begin{pmatrix} I_m & 0 \\ 0 & I_{n−m} \end{pmatrix}.

Thus from (1.39) and (1.40),

PΓ_1 = (Γ_1Γ_1')Γ_1 = Γ_1,
PΓ_2 = (Γ_1Γ_1')Γ_2 = 0.

This shows that P represents the linear transformation that projects R^n
orthogonally onto the column space of Γ_1, which has dimension m = tr(P).
Furthermore, I_n − P is also symmetric and idempotent [verify] with
rank(I_n − P) = n − m. In fact,

I_n − P = ΓΓ' − P = (Γ_1Γ_1' + Γ_2Γ_2') − Γ_1Γ_1' = Γ_2Γ_2',

so I_n − P represents the linear transformation that projects R^n orthogonally
onto the column space of Γ_2, which has dimension n − m = tr(I_n − P).
Note that the column spaces of Γ_1 and Γ_2 are perpendicular, since
Γ_1'Γ_2 = 0. Equivalently, P(I_n − P) = (I_n − P)P = 0, i.e., applying P and
I_n − P successively sends any x ∈ R^n to 0.
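A small numerical sketch of Lemma 1.6 and this interpretation (Γ_1 is taken as the Q factor of a random n × m matrix, an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 6, 2
Gamma1, _ = np.linalg.qr(rng.standard_normal((n, m)))   # n x m, orthonormal columns
P = Gamma1 @ Gamma1.T                                    # projects onto col(Gamma1)

assert np.allclose(P, P.T) and np.allclose(P @ P, P)     # symmetric, idempotent
assert np.isclose(np.trace(P), m)                        # tr(P) = rank(P) = m
assert np.allclose(P @ (np.eye(n) - P), 0)               # P(I - P) = 0
```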


1.2. Matrix exercises.

1. For S : p × p and U : p × q, with S > 0 (positive definite), show that

|S + UU'| = |S| · |I_q + U'S^{-1}U|,

where | · | denotes the determinant and I_q is the q × q identity matrix.

2. For S : p × p and a : p × 1 with S > 0, show that

a'(S + aa')^{-1}a = \frac{a'S^{-1}a}{1 + a'S^{-1}a}.

3. For S : p × p and T : p × p with S > 0 and T ≥ 0, show that

λ_i[T(S + T)^{-1}] = \frac{λ_i(TS^{-1})}{1 + λ_i(TS^{-1})},   i = 1, . . . , p,

where λ_1 ≥ · · · ≥ λ_p denote the ordered eigenvalues.

4. Let A > 0 and B > 0 be p × p matrices with A ≥ B. Partition A as

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}

and let A_{11·2} = A_{11} − A_{12}A_{22}^{-1}A_{21}. Partition B in the same way and similarly
define B_{11·2}. Show:
(i) A_{11} ≥ B_{11}.
(ii) B^{-1} ≥ A^{-1}.
(iii) A_{11·2} ≥ B_{11·2}.

5. For S : p × p with S > 0, partition S and S^{-1} as

S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix},   S^{-1} = \begin{pmatrix} S^{11} & S^{12} \\ S^{21} & S^{22} \end{pmatrix},

respectively. Show that S^{11} ≥ S_{11}^{-1}, and equality holds iff S_{12} = 0, or
equivalently, iff S^{12} = 0.


6. Now partition S and S^{-1} as

S = \begin{pmatrix} S_{11} & S_{12} & S_{13} \\ S_{21} & S_{22} & S_{23} \\ S_{31} & S_{32} & S_{33} \end{pmatrix} ≡ \begin{pmatrix} S_{(12)} & S_{(12)3} \\ S_{3(12)} & S_{33} \end{pmatrix},

S^{-1} = \begin{pmatrix} S^{11} & S^{12} & S^{13} \\ S^{21} & S^{22} & S^{23} \\ S^{31} & S^{32} & S^{33} \end{pmatrix} ≡ \begin{pmatrix} S^{(12)} & S^{(12)3} \\ S^{3(12)} & S^{33} \end{pmatrix}.

Then

S_{(12)·3} ≡ S_{(12)} − S_{(12)3}S_{33}^{-1}S_{3(12)}
          = \begin{pmatrix} S_{11} − S_{13}S_{33}^{-1}S_{31} & S_{12} − S_{13}S_{33}^{-1}S_{32} \\ S_{21} − S_{23}S_{33}^{-1}S_{31} & S_{22} − S_{23}S_{33}^{-1}S_{32} \end{pmatrix}
          ≡ \begin{pmatrix} S_{11·3} & S_{12·3} \\ S_{21·3} & S_{22·3} \end{pmatrix},

with similar relations holding for S^{(12)·3}. Note that

S^{(12)} = (S_{(12)·3})^{-1},   S_{(12)} = (S^{(12)·3})^{-1},

but in general

S^{11} ≠ (S_{11·2})^{-1},   S_{11} ≠ (S^{11·2})^{-1};

instead,

S^{11} = (S_{11·(23)})^{-1},   S_{11} = (S^{11·(23)})^{-1}.

Show:
(i) (S_{(12)·3})_{11·2} = S_{11·(23)}.
(ii) S_{11·2} = (S^{11·3})^{-1}.
(iii) S_{12·3}(S_{22·3})^{-1} = −(S^{11})^{-1}S^{12}.
(iv) S_{11} ≥ S_{11·2} ≥ S_{11·(23)}. When do the inequalities become equalities?
(v) S_{12·3}(S_{22·3})^{-1} = −(S^{11·4})^{-1}S^{12·4}   (for a 4 × 4 partitioning).

1.3. Random vectors and covariance matrices. Let X ≡ (X_1, . . . , X_n)'
be a rvtr in R^n. The expected value of X is the vector

E(X) ≡ \begin{pmatrix} E(X_1) \\ \vdots \\ E(X_n) \end{pmatrix},

which is the center of gravity of the probability distribution of X in R^n.
Note that expectation is linear: for rvtrs X, Y and constant matrices A, B,

(1.41)  E(AX + BY) = A E(X) + B E(Y).

Similarly, if Z ≡ {Z_{ij}} is a random matrix in R^{m×n}, E(Z) is also defined
component-wise:

E(Z) = \begin{pmatrix} E(Z_{11}) & \cdots & E(Z_{1n}) \\ \vdots & & \vdots \\ E(Z_{m1}) & \cdots & E(Z_{mn}) \end{pmatrix}.

Then for constant matrices A : k × m and B : n × p,

(1.42)  E(AZB) = A E(Z) B.

The covariance matrix of X (≡ the variance-covariance matrix) is

Cov(X) = E[(X − EX)(X − EX)']

       = \begin{pmatrix} Var(X_1) & Cov(X_1, X_2) & \cdots & Cov(X_1, X_n) \\ Cov(X_2, X_1) & Var(X_2) & \cdots & Cov(X_2, X_n) \\ \vdots & \vdots & & \vdots \\ Cov(X_n, X_1) & Cov(X_n, X_2) & \cdots & Var(X_n) \end{pmatrix}.


The following formulas are essential: for X : n × 1, A : m × n, a : n × 1,

(1.43)  Cov(X) = E(XX') − (EX)(EX)';
(1.44)  Cov(AX + b) = A Cov(X) A';
(1.45)  Var(a'X + b) = a' Cov(X) a.

Lemma 1.7. Let X ≡ (X_1, . . . , X_n)' be a random vector in R^n.

(a) Cov(X) is psd.
(b) Cov(X) is pd unless ∃ a nonzero a ≡ (a_1, . . . , a_n)' ∈ R^n s.t. the linear
combination
a'X ≡ a_1X_1 + · · · + a_nX_n = constant,
i.e., the support of X is contained in some hyperplane of dimension ≤ n − 1.

Proof. (a) This follows immediately from (1.45).
(b) If Cov(X) is not pd, then ∃ a nonzero a ∈ R^n s.t.

0 = a' Cov(X) a = Var(a'X).

But this implies that a'X = const.
¯

For rvtrs X : m × 1 and Y : n × 1, define

Cov(X, Y) = E[(X − EX)(Y − EY)']

          = \begin{pmatrix} Cov(X_1, Y_1) & Cov(X_1, Y_2) & \cdots & Cov(X_1, Y_n) \\ Cov(X_2, Y_1) & Cov(X_2, Y_2) & \cdots & Cov(X_2, Y_n) \\ \vdots & \vdots & & \vdots \\ Cov(X_m, Y_1) & Cov(X_m, Y_2) & \cdots & Cov(X_m, Y_n) \end{pmatrix}.

Clearly Cov(X, Y) = [Cov(Y, X)]'. Then [verify]

(1.46)  Cov(X ± Y) = Cov(X) + Cov(Y) ± Cov(X, Y) ± Cov(Y, X),

and [verify]

(1.47)  X ⊥⊥ Y ⇒ Cov(X, Y) = 0 ⇒ Cov(X ± Y) = Cov(X) + Cov(Y).


Variance of sample average (sample mean) of rvtrs: Let X_1, . . . , X_n
be i.i.d. rvtrs in R^p, each with mean vector µ and covariance matrix Σ. Set
X̄_n = (1/n)(X_1 + · · · + X_n). Then E(X̄_n) = µ and, by (1.47),

(1.48)  Cov(X̄_n) = (1/n^2) Cov(X_1 + · · · + X_n) = (1/n) Σ.

Exercise 1.8. Verify the Weak Law of Large Numbers (WLLN) for rvtrs:
X̄_n converges to µ in probability (X̄_n →_p µ), that is, for each ε > 0,
P[‖X̄_n − µ‖ ≤ ε] → 1 as n → ∞.

Example 1.9a. Equicorrelated random variables. Let X_1, . . . , X_n
be rvs with common mean µ and common variance σ^2. Suppose they are
equicorrelated, i.e., Cor(X_i, X_j) = ρ ∀ i ≠ j. Let

(1.49)  X̄_n = (1/n)(X_1 + · · · + X_n),   s_n^2 = \frac{1}{n−1}\sum_{i=1}^n (X_i − X̄_n)^2,

the sample mean and sample variance, respectively. Then

(1.50)  E(X̄_n) = µ   (so X̄_n is unbiased for µ);

Var(X̄_n) = (1/n^2) Var(X_1 + · · · + X_n)
         = (1/n^2)[nσ^2 + n(n − 1)ρσ^2]   [why?]
(1.51)   = (σ^2/n)[1 + (n − 1)ρ].

When X_1, . . . , X_n are uncorrelated (ρ = 0), in particular when they are
independent, then (1.51) reduces to σ^2/n, which → 0 as n → ∞. When
ρ ≠ 0, however, Var(X̄_n) → σ^2ρ ≠ 0, so the WLLN fails for equicorrelated
i.d. rvs. Also, (1.51) imposes the constraint

(1.52)  −\frac{1}{n−1} ≤ ρ ≤ 1.

Next, using (1.51),

E(s_n^2) = \frac{1}{n−1} E\Big[\sum_{i=1}^n X_i^2 − n(X̄_n)^2\Big]
         = \frac{1}{n−1}\Big\{ n(σ^2 + µ^2) − n\Big[\frac{σ^2}{n}(1 + (n − 1)ρ) + µ^2\Big]\Big\}
(1.53)   = (1 − ρ)σ^2.


Thus s_n^2 is unbiased for σ^2 if ρ = 0 but not otherwise.
¯

Example 1.9b. We now re-derive (1.51) and (1.53) via covariance matri-
ces, using properties (1.44) and (1.45). Set X = (X_1, . . . , X_n)', so

(1.54)  E(X) = µ e_n,   where e_n = (1, . . . , 1)' : n × 1,

(1.55)  Cov(X) = σ^2 \begin{pmatrix} 1 & ρ & \cdots & ρ \\ ρ & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & ρ \\ ρ & \cdots & ρ & 1 \end{pmatrix} ≡ σ^2[(1 − ρ)I_n + ρ e_ne_n'].

Then X̄_n = (1/n) e_n'X, so by (1.45),

Var(X̄_n) = (σ^2/n^2) e_n'[(1 − ρ)I_n + ρ e_ne_n']e_n
         = (σ^2/n^2)[(1 − ρ)n + ρn^2]    [since e_n'e_n = n]
         = (σ^2/n)[1 + (n − 1)ρ],

which agrees with (1.51).


To find E(s_n^2), write

\sum_{i=1}^n (X_i − X̄_n)^2 = \sum_{i=1}^n X_i^2 − n(X̄_n)^2
                           = X'X − (1/n)(e_n'X)^2
                           = X'X − (1/n)(X'e_n)(e_n'X)
                           ≡ X'(I_n − ẽ_nẽ_n')X
(1.56)                     ≡ X'QX,

where ẽ_n ≡ e_n/√n is a unit vector, P ≡ ẽ_nẽ_n' is the projection matrix of
rank 1 ≡ tr(ẽ_nẽ_n') that projects R^n orthogonally onto the 1-dimensional
subspace spanned by e_n, and Q ≡ I_n − ẽ_nẽ_n' is the projection matrix of
rank n − 1 ≡ tr Q that projects R^n orthogonally onto the (n − 1)-dimensional
subspace e_n^⊥ [draw figure]. Now complete the following exercise:

Exercise 1.10. Prove Lemma 1.11 below, and use it to show that

(1.57)  E(X'QX) = (n − 1)(1 − ρ)σ^2,

which is equivalent to (1.53).
¯

Lemma 1.11. Let X : n × 1 be a rvtr with E(X) = θ and Cov(X) = Σ.
Then for any n × n symmetric matrix A,

(1.58)  E(X'AX) = tr(AΣ) + θ'Aθ.

(This generalizes the relation E(X^2) = Var(X) + (E X)^2.)
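A quick Monte Carlo check of (1.58) (a sketch; the particular θ, Σ, and A below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
theta = np.array([1.0, -2.0, 0.5])
B = rng.standard_normal((n, n))
Sigma = B @ B.T + np.eye(n)          # a pd covariance matrix
A = np.diag([1.0, 2.0, 3.0])         # any symmetric A

X = rng.multivariate_normal(theta, Sigma, size=200_000)
lhs = np.mean(np.einsum('ij,jk,ik->i', X, A, X))   # estimate of E(X'AX)
rhs = np.trace(A @ Sigma) + theta @ A @ theta
print(lhs, rhs)                                    # agree up to simulation error
```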

Example 1.9c. Eqn. (1.53) also can be obtained from the properties of
the projection matrix Q. First note that [verify]

(1.59)  Qe_n = √n Qẽ_n = 0.

Define

(1.60)  Y ≡ (Y_1, . . . , Y_n)' = QX : n × 1,

so

(1.61)  E(Y) = Q E(X) = µ Qe_n = 0,

(1.62)  E(YY') = Cov(Y) = σ^2 Q[(1 − ρ)I_n + ρe_ne_n']Q = σ^2(1 − ρ)Q.

Thus, since Q is idempotent (Q^2 = Q),

E(X'QX) = E(Y'Y) = E[tr(YY')]
        = tr[E(YY')]
        = σ^2(1 − ρ) tr(Q)
        = σ^2(1 − ρ)(n − 1),

which again is equivalent to (1.53).
¯

Exercise 1.12. Show that Cov(X) ≡ σ^2[(1 − ρ)I_n + ρe_ne_n'] in (1.55) has
one eigenvalue σ^2[1 + (n − 1)ρ] with eigenvector e_n, and n − 1 eigenvalues
equal to σ^2(1 − ρ).  ¯
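A numerical check of Exercise 1.12 (a sketch with arbitrary illustrative values of n, σ^2, ρ):

```python
import numpy as np

n, sigma2, rho = 6, 2.0, 0.3
e = np.ones(n)
Cov = sigma2 * ((1 - rho) * np.eye(n) + rho * np.outer(e, e))

eigvals = np.sort(np.linalg.eigvalsh(Cov))[::-1]
# largest eigenvalue: sigma2 * (1 + (n-1)*rho); the other n-1 equal sigma2*(1-rho)
assert np.isclose(eigvals[0], sigma2 * (1 + (n - 1) * rho))
assert np.allclose(eigvals[1:], sigma2 * (1 - rho))
```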

Exercise 1.13. Suppose that Σ = Cov(X) : n × n. Show that the extreme
eigenvalues of Σ satisfy

λ_1(Σ) = \max_{‖a‖=1} Var(a'X),
λ_n(Σ) = \min_{‖a‖=1} Var(a'X).  ¯


2. The Multivariate Normal Distribution (MVND).

2.1. Definition and basic properties.


Consider a random vector X ≡ (X_1, . . . , X_p)' ∈ R^p, where X_1, . . . , X_p are
i.i.d. standard normal random variables, i.e., X_i ∼ N(0, 1), so E(X) = 0
and Cov(X) = I_p. The pdf of X (i.e., the joint pdf of X_1, . . . , X_p) is

(2.1)  f(x) = (2π)^{−p/2} e^{−(x_1^2 + ··· + x_p^2)/2} = (2π)^{−p/2} e^{−x'x/2},   x ∈ R^p.

For any nonsingular matrix A : p × p and any µ : p × 1 ∈ R^p, consider the
random vector Y := AX + µ. Since the Jacobian of this linear (actually,
affine) mapping is |∂Y/∂X| = |A|_+ > 0, the pdf of Y is

f(y) = (2π)^{−p/2} |A|_+^{−1} e^{−\frac{1}{2}(A^{−1}(y−µ))'(A^{−1}(y−µ))}
     = (2π)^{−p/2} |AA'|^{−1/2} e^{−\frac{1}{2}(y−µ)'(AA')^{−1}(y−µ)}
(2.2) = (2π)^{−p/2} |Σ|^{−1/2} e^{−\frac{1}{2}(y−µ)'Σ^{−1}(y−µ)},   y ∈ R^p,

where

E(Y) = A E(X) + µ = µ,
Cov(Y) = A Cov(X) A' = AA' ≡ Σ > 0.

Since the distribution of Y depends only on µ and Σ, we denote this distri-
bution by N_p(µ, Σ), the multivariate normal distribution (MVND) on R^p
with mean vector µ and covariance matrix Σ.
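This construction Y = AX + µ is also how one simulates from N_p(µ, Σ); a sketch that takes A to be the Cholesky factor of Σ (any square root of Σ would do; the particular µ and Σ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

A = np.linalg.cholesky(Sigma)                 # AA' = Sigma
X = rng.standard_normal((2, 100_000))         # i.i.d. N(0,1) entries
Y = (A @ X).T + mu                            # rows are draws from N_2(mu, Sigma)

print(Y.mean(axis=0))                         # ≈ mu
print(np.cov(Y, rowvar=False))                # ≈ Sigma
```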

Exercise 2.1. (a) Show that the moment generating function of X is

(2.3)  m_X(w) ≡ E(e^{w'X}) = e^{w'w/2}.

(b) Let Y = AX + µ where now A : q × p and µ ∈ R^q. Show that the mgf
of Y is

(2.4)  m_Y(w) ≡ E(e^{w'Y}) = e^{w'µ + w'Σw/2},

where Σ ≡ AA' = Cov(Y). Thus the distribution of Y ≡ AX + µ depends
only on µ and Σ even when A is singular and/or a non-square matrix, so
we may again write Y ∼ N_q(µ, Σ).


Lemma 2.1. Affine transformations preserve normality.
If Y ∼ N_q(µ, Σ), then for C : r × q and d : r × 1,

(2.5)  Z ≡ CY + d ∼ N_r(Cµ + d, CΣC').

Proof. Represent Y as AX + µ, so Z = (CA)X + (Cµ + d) is also an affine
transformation of X, hence also has an MVND with E(Z) = Cµ + d and
Cov(Z) = (CA)(CA)' = CΣC'.
¯

Lemma 2.2. Independence ⇐⇒ zero covariance.
Suppose that Y ∼ N_p(µ, Σ) and partition Y, µ, and Σ as

(2.6)  Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix},   µ = \begin{pmatrix} µ_1 \\ µ_2 \end{pmatrix},   Σ = \begin{pmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix}   (Y_i, µ_i : p_i × 1, Σ_{ij} : p_i × p_j),

where p_1 + p_2 = p. Then Y_1 ⊥⊥ Y_2 ⇐⇒ Σ_{12} = 0.

Proof. This follows from the pdf (2.2) or the mgf (2.4).
¯

Proposition 2.3. Marginal & conditional distributions are normal.
If Y ∼ N_p(µ, Σ) and Σ_{22} is pd then

(2.8)  Y_1 | Y_2 ∼ N_{p_1}(µ_1 + Σ_{12}Σ_{22}^{−1}(Y_2 − µ_2), Σ_{11·2}),
(2.9)  Y_2 ∼ N_{p_2}(µ_2, Σ_{22}).

Proof. Method 1: Assume also that Σ is nonsingular. By the quadratic
identity (1.36) applied with µ, y, and Σ partitioned as in (2.6),

(2.10)  (y − µ)'Σ^{−1}(y − µ)
        = [y_1 − µ_1 − Σ_{12}Σ_{22}^{−1}(y_2 − µ_2)]' Σ_{11·2}^{−1} (· · ·) + (y_2 − µ_2)'Σ_{22}^{−1}(· · ·).

Since also |Σ| = |Σ_{11·2}| |Σ_{22}|, the result follows from the pdf (2.2).


Method 2. By Lemma 2.1 and the block-triangular transformation in (1.30),

(2.11)  \begin{pmatrix} Y_1 − Σ_{12}Σ_{22}^{−1}Y_2 \\ Y_2 \end{pmatrix} = \begin{pmatrix} I_{p_1} & −Σ_{12}Σ_{22}^{−1} \\ 0 & I_{p_2} \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
        ∼ N_{p_1+p_2}\left( \begin{pmatrix} µ_1 − Σ_{12}Σ_{22}^{−1}µ_2 \\ µ_2 \end{pmatrix}, \begin{pmatrix} Σ_{11·2} & 0 \\ 0 & Σ_{22} \end{pmatrix} \right).

Thus by Lemma 2.1 for C = (I_{p_1}  0_{p_1×p_2}) and (0_{p_2×p_1}  I_{p_2}), respectively,

Y_1 − Σ_{12}Σ_{22}^{−1}Y_2 ∼ N_{p_1}(µ_1 − Σ_{12}Σ_{22}^{−1}µ_2, Σ_{11·2}),
Y_2 ∼ N_{p_2}(µ_2, Σ_{22}),

which yields (2.9). Also Y_1 − Σ_{12}Σ_{22}^{−1}Y_2 ⊥⊥ Y_2 by (2.11) and Lemma 2.2, so

(2.12)  Y_1 − Σ_{12}Σ_{22}^{−1}Y_2 | Y_2 ∼ N_{p_1}(µ_1 − Σ_{12}Σ_{22}^{−1}µ_2, Σ_{11·2}),

which yields (2.8).
¯
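A simulation sketch of the conditional structure in (2.8) and (2.11) (the particular µ and Σ are arbitrary illustrative choices): the residual Y_1 − Σ_{12}Σ_{22}^{−1}Y_2 should have covariance Σ_{11·2} and be uncorrelated with Y_2.

```python
import numpy as np

rng = np.random.default_rng(7)
mu = np.zeros(3)
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
p1 = 1   # Y1 = first coordinate, Y2 = remaining two

S11, S12 = Sigma[:p1, :p1], Sigma[:p1, p1:]
S21, S22 = Sigma[p1:, :p1], Sigma[p1:, p1:]
B = S12 @ np.linalg.inv(S22)             # regression coefficients Σ12 Σ22^{-1}
S11_2 = S11 - B @ S21                    # conditional covariance Σ11·2

Y = rng.multivariate_normal(mu, Sigma, size=200_000)
resid = Y[:, :p1] - Y[:, p1:] @ B.T      # Y1 - Σ12 Σ22^{-1} Y2
print(np.cov(resid.T), S11_2)            # residual variance ≈ Σ11·2
print(resid.T @ Y[:, p1:] / len(Y))      # ≈ 0: residual uncorrelated with Y2
```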

2.2. The MVND and the chi-square distribution.

The chi-square distribution χ^2_n with n degrees of freedom (df) can be defined
as the distribution of

Z_1^2 + · · · + Z_n^2 ≡ Z'Z ≡ ‖Z‖^2,

where Z ≡ (Z_1, . . . , Z_n)' ∼ N_n(0, I_n). (That is, Z_1, . . . , Z_n are i.i.d. stan-
dard N(0, 1) rvs.) Recall that

(2.13)  χ^2_n ∼ Gamma(α = n/2, λ = 1/2),
(2.14)  E(χ^2_n) = n,
(2.15)  Var(χ^2_n) = 2n.

Now consider X ∼ N_n(µ, Σ) with Σ pd. Then

(2.16)  Z ≡ Σ^{−1/2}(X − µ) ∼ N_n(0, I_n),
(2.17)  Z'Z = (X − µ)'Σ^{−1}(X − µ) ∼ χ^2_n.

Suppose, however, that we omit Σ^{−1} in (2.17) and seek the distribution
of (X − µ)'(X − µ). Then this will not have a chi-square distribution in
general. Instead, by the spectral decomposition Σ = ΓD_λΓ', (2.16) yields

(2.18)  (X − µ)'(X − µ) = Z'ΣZ = (Γ'Z)'D_λ(Γ'Z) ≡ V'D_λV = λ_1V_1^2 + · · · + λ_nV_n^2,

where λ_1, . . . , λ_n are the eigenvalues of Σ and V ≡ Γ'Z ∼ N_n(0, I_n). Thus
the distribution of (X − µ)'(X − µ) is a positive linear combination of in-
dependent χ^2_1 rvs, which is not (proportional to) a χ^2_n rv. [Check via mgfs!]

Lemma 2.5. Quadratic forms and projection matrices.
Let X ∼ N_n(ξ, σ^2I_n) and let P be an n × n projection matrix with rank(P) =
tr(P) ≡ m. Then the quadratic form determined by X − ξ and P satisfies

(2.23)  (X − ξ)'P(X − ξ) ∼ σ^2χ^2_m.

Proof. By Lemma 1.6, there exists an orthogonal matrix Γ : n × n s.t.

P = Γ \begin{pmatrix} I_m & 0 \\ 0 & 0 \end{pmatrix} Γ'.

Then Y ≡ Γ'(X − ξ) ∼ N_n(0, σ^2I_n), so with Y = (Y_1, . . . , Y_n)',

(X − ξ)'P(X − ξ) = Y' \begin{pmatrix} I_m & 0 \\ 0 & 0 \end{pmatrix} Y = Y_1^2 + · · · + Y_m^2 ∼ σ^2χ^2_m.  ¯


2.3. The noncentral chi-square distribution.

Extend the result (2.17) to (2.30) as follows: First let Z ≡ (Z_1, . . . , Z_n)' ∼
N_n(ξ, I_n), where ξ ≡ (ξ_1, . . . , ξ_n)' ∈ R^n. The distribution of

Z_1^2 + · · · + Z_n^2 ≡ Z'Z ≡ ‖Z‖^2

is called the noncentral chi-square distribution with n degrees of freedom
(df) and noncentrality parameter ‖ξ‖^2, denoted by χ^2_n(‖ξ‖^2). Note that
Z_1, . . . , Z_n are independent, each with variance 1, but now E(Z_i) = ξ_i.

To show that the distribution of ‖Z‖^2 depends on ξ only through its
(squared) length ‖ξ‖^2, choose an orthogonal (rotation) matrix Γ : n × n
such that Γξ = (‖ξ‖, 0, . . . , 0)', i.e., Γ rotates ξ into (‖ξ‖, 0, . . . , 0)' (e.g., let
the first row of Γ be ξ'/‖ξ‖ and the remaining n − 1 rows be any orthonormal
basis for (span ξ)^⊥), and set

Y = ΓZ ∼ N_n(Γξ, ΓΓ') = N_n((‖ξ‖, 0, . . . , 0)', I_n).

Then the desired result follows since

‖Z‖^2 = ‖Y‖^2 ≡ Y_1^2 + Y_2^2 + · · · + Y_n^2
      ∼ [N_1(‖ξ‖, 1)]^2 + [N_1(0, 1)]^2 + · · · + [N_1(0, 1)]^2
(2.24) ≡ χ^2_1(‖ξ‖^2) + χ^2_1 + · · · + χ^2_1 ≡ χ^2_1(‖ξ‖^2) + χ^2_{n−1},

where the chi-square variates in each line are mutually independent.



Let V ≡ Y_1^2 ∼ χ^2_1(δ) ∼ [N_1(√δ, 1)]^2, where δ = ‖ξ‖^2. We find the pdf
of V as follows:

f_V(v) = \frac{d}{dv} P[−√v ≤ Y_1 ≤ √v]
       = \frac{d}{dv} \frac{1}{\sqrt{2π}} \int_{−√v}^{√v} e^{−(t−√δ)^2/2}\, dt
       = \frac{1}{\sqrt{2π}} e^{−δ/2} \frac{d}{dv} \int_{−√v}^{√v} e^{t√δ}\, e^{−t^2/2}\, dt
       = \frac{1}{\sqrt{2π}} e^{−δ/2} \frac{d}{dv} \int_{−√v}^{√v} \sum_{k=0}^∞ \frac{(t√δ)^k}{k!}\, e^{−t^2/2}\, dt
       = \frac{1}{\sqrt{2π}} e^{−δ/2} \sum_{k=0}^∞ \frac{δ^{k/2}}{k!} \frac{d}{dv} \int_{−√v}^{√v} t^k e^{−t^2/2}\, dt
       = \frac{1}{\sqrt{2π}} e^{−δ/2} \sum_{k=0}^∞ \frac{δ^k}{(2k)!} \frac{d}{dv} \int_{−√v}^{√v} t^{2k} e^{−t^2/2}\, dt   [why?]
       = \frac{1}{\sqrt{2π}} e^{−δ/2} \sum_{k=0}^∞ \frac{δ^k}{(2k)!}\, v^{k−1/2} e^{−v/2}   [verify]
(2.25) = \sum_{k=0}^∞ \underbrace{e^{−δ/2}\frac{(δ/2)^k}{k!}}_{\text{Poisson}(δ/2)\ \text{weights}} \cdot \underbrace{\frac{v^{\frac{1+2k}{2}−1} e^{−v/2}}{2^{\frac{1+2k}{2}} Γ(\frac{1+2k}{2})}}_{\text{pdf of }χ^2_{1+2k}} \cdot c_k,

where

c_k = \frac{2^k\, k!\, 2^{\frac{1+2k}{2}}\, Γ(\frac{1+2k}{2})}{(2k)!\, \sqrt{2π}} = 1

by the Legendre duplication formula for the Gamma function. Thus we
have represented the pdf of a χ^2_1(δ) rv as a mixture (weighted average) of
central chi-square pdfs with Poisson weights. This can be written as follows:

(2.26)  χ^2_1(δ) | K ∼ χ^2_{1+2K},   where K ∼ Poisson(δ/2).

Thus by (2.24) this implies that Z'Z ≡ ‖Z‖^2 ∼ χ^2_n(δ) satisfies

(2.27)  χ^2_n(δ) | K ∼ χ^2_{n+2K},   where K ∼ Poisson(δ/2).

That is, the pdf of a noncentral chi-square rv χ^2_n(δ) is a Poisson(δ/2)-
mixture of the pdfs of central chi-square rvs with n + 2k df, k = 0, 1, . . ..

The representation (2.27) can be used to obtain the mean and variance
of χ2n (δ):


E[χ^2_n(δ)] = E{E[χ^2_{n+2K} | K]}
            = E(n + 2K)
            = n + 2(δ/2)
(2.28)      = n + δ;

Var[χ^2_n(δ)] = E[Var(χ^2_{n+2K} | K)] + Var[E(χ^2_{n+2K} | K)]
              = E[2(n + 2K)] + Var(n + 2K)
              = [2n + 4(δ/2)] + 4(δ/2)
(2.29)        = 2n + 4δ.
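The mixture representation (2.27) and the moments (2.28)-(2.29) can be checked by simulation (a sketch; numpy's noncentral_chisquare sampler is used only as a reference, and n, δ are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(8)
n, delta, N = 4, 3.0, 500_000

# sample via the Poisson mixture (2.27)
K = rng.poisson(delta / 2, size=N)
mix = rng.chisquare(n + 2 * K)

direct = rng.noncentral_chisquare(n, delta, size=N)
print(mix.mean(), direct.mean(), n + delta)          # (2.28)
print(mix.var(), direct.var(), 2 * n + 4 * delta)    # (2.29)
```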

Exercise 2.6. Show that the noncentral chi-square distribution χ2n (δ) is
stochastically increasing in both n and δ.
¯

Next, consider X ∼ N_n(µ, Σ) with a general pd Σ. Then

(2.30)  X'Σ^{−1}X = (Σ^{−1/2}X)'(Σ^{−1/2}X) ∼ χ^2_n(µ'Σ^{−1}µ),

since

Z ≡ Σ^{−1/2}X ∼ N_n(Σ^{−1/2}µ, I_n)

and

‖Σ^{−1/2}µ‖^2 = µ'Σ^{−1}µ.

Note that by Exercise 2.6, the distribution of X'Σ^{−1}X in (2.30) is stochas-
tically increasing in n and µ'Σ^{−1}µ.

Finally, let Y ∼ N_n(ξ, σ^2I_n) and let P be a projection matrix with
rank(P) = m. Then P = Γ_1Γ_1' where Γ_1'Γ_1 = I_m (cf. (2.20) - (2.22)), so

‖PY‖^2 = ‖Γ_1Γ_1'Y‖^2 = (Γ_1Γ_1'Y)'(Γ_1Γ_1'Y) = Y'Γ_1Γ_1'Y = ‖Γ_1'Y‖^2.

But

Γ_1'Y ∼ N_m(Γ_1'ξ, σ^2Γ_1'Γ_1) = N_m(Γ_1'ξ, σ^2I_m),

so by (2.30) with X = Γ_1'Y, µ = Γ_1'ξ, and Σ = σ^2I_m,

\frac{‖PY‖^2}{σ^2} = \frac{(Γ_1'Y)'(Γ_1'Y)}{σ^2} ∼ χ^2_m\Big(\frac{ξ'Γ_1Γ_1'ξ}{σ^2}\Big) = χ^2_m\Big(\frac{‖Pξ‖^2}{σ^2}\Big).

Thus

(2.31)  ‖PY‖^2 ∼ σ^2χ^2_m\Big(\frac{‖Pξ‖^2}{σ^2}\Big).

2.4. Joint pdf of a random sample from the MVND N_p(µ, Σ).

Let X_1, . . . , X_n be an i.i.d. random sample from N_p(µ, Σ). Assume that Σ
is positive definite (pd) so that each X_i has pdf given by (2.2). Thus the
joint pdf of X_1, . . . , X_n is

f(x_1, . . . , x_n) = \prod_{i=1}^n \frac{1}{(2π)^{p/2}|Σ|^{1/2}}\, e^{−\frac{1}{2}(x_i−µ)'Σ^{−1}(x_i−µ)}
                  = \frac{1}{(2π)^{np/2}|Σ|^{n/2}}\, e^{−\frac{1}{2}\sum_{i=1}^n (x_i−µ)'Σ^{−1}(x_i−µ)}
                  = \frac{1}{(2π)^{np/2}|Σ|^{n/2}}\, e^{−\frac{1}{2}\mathrm{tr}[Σ^{−1}\sum_{i=1}^n (x_i−µ)(x_i−µ)']}
(2.32)            = \frac{1}{(2π)^{np/2}|Σ|^{n/2}}\, e^{−\frac{n}{2}(x̄−µ)'Σ^{−1}(x̄−µ) − \frac{1}{2}\mathrm{tr}(Σ^{−1}s)},

or alternatively,

(2.33)            = \frac{e^{−\frac{n}{2}µ'Σ^{−1}µ}}{(2π)^{np/2}|Σ|^{n/2}}\, e^{n x̄'Σ^{−1}µ − \frac{1}{2}\mathrm{tr}(Σ^{−1}t)},

where

X̄ = \frac{1}{n}\sum_{i=1}^n X_i,   S = \sum_{i=1}^n (X_i − X̄)(X_i − X̄)',   T = \sum_{i=1}^n X_iX_i'.

It follows from (2.32) and (2.33) that (X̄, S) and (X̄, T) are equivalent
representations of the minimal sufficient statistic for (µ, Σ). Also from
(2.33), with no further restrictions on (µ, Σ), this MVN statistical model
constitutes a (p + p(p+1)/2)-dimensional full exponential family with natural
parameter (Σ^{−1}µ, Σ^{−1}).


3. The Wishart Distribution.

3.1. Definition and basic properties.


Let X_1, . . . , X_n be an i.i.d. random sample from N_p(0, Σ) and set

X = (X_1, . . . , X_n) : p × n,
S = XX' = \sum_{i=1}^n X_iX_i' : p × p.

The distribution of S is called the p-variate (central) Wishart distribution
with n degrees of freedom and scale matrix Σ, denoted by W_p(n, Σ).
¯

Clearly S is a random symmetric positive semi-definite matrix with
E(S) = nΣ. When p = 1 and Σ = σ^2, W_1(n, σ^2) = σ^2χ^2_n.
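A simulation sketch of this definition (the Σ below is an arbitrary illustrative choice; the average of many simulated S matrices should be close to nΣ):

```python
import numpy as np

rng = np.random.default_rng(9)
p, n, reps = 2, 10, 20_000
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
A = np.linalg.cholesky(Sigma)

S_sum = np.zeros((p, p))
for _ in range(reps):
    X = A @ rng.standard_normal((p, n))   # columns i.i.d. N_p(0, Sigma)
    S_sum += X @ X.T                      # S = X X' ~ W_p(n, Sigma)

print(S_sum / reps)   # ≈ n * Sigma
print(n * Sigma)
```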

Lemma 3.1. Preservation under linear transformation. For A : q × p,

(3.1)  ASA' ∼ W_q(n, AΣA').

In particular, for a : p × 1,

(3.2)  a'Sa ∼ (a'Σa) · χ^2_n.

Lemma 3.2. Nonsingularity ≡ positive-definiteness of S ∼ W_p(n, Σ).
S is positive definite with probability one ⇐⇒ Σ is pd and n ≥ p.

Proof. (⇒): Recall that S ∼ XX' with X : p × n. If n < p then

rank(S) = rank(X) ≤ min(p, n) = n < p,

so S is singular with probability one, hence not positive definite. If Σ is not
pd then ∃ a : p × 1, a ≠ 0, s.t. a'Σa = 0. Thus by (3.2),

a'Sa ∼ (a'Σa) · χ^2_n = 0,

so S is singular w.pr.1.

(⇐) Method 1 (Stein; Eaton and Perlman (1973) Ann. Statist.) Assume
that Σ is pd and n ≥ p. Since

S = XX' = \sum_{i=1}^p X_iX_i' + \sum_{i=p+1}^n X_iX_i',

it suffices to show that \sum_{i=1}^p X_iX_i' is pd w. pr. 1. Thus we can take n = p,
so X : p × p is a square matrix. Then |S| = |X|^2, so it suffices to show that
X itself is nonsingular w.pr.1. But

{X singular} = \bigcup_{i=1}^p \{X_i ∈ S_i ≡ span\{X_j | j ≠ i\}\},

so

Pr[X singular] ≤ \sum_{i=1}^p Pr[X_i ∈ S_i]
              = \sum_{i=1}^p E\{Pr[X_i ∈ S_i | X_j, j ≠ i]\} = 0,

since dim(S_i) < p and the distribution of X_i ∼ N_p(0, Σ) is absolutely
continuous w.r.to Lebesgue measure on R^p. Thus Pr[X nonsingular] = 1.

(⇐) Method 2 (Okamoto (1973) Ann. Statist.) Apply:

Lemma 3.3 (Okamoto). Let Z ≡ (Z_1, . . . , Z_k)' ∈ R^k be a random vector
with a pdf that is absolutely continuous w.r.to Lebesgue measure on R^k. Let
g(z) ≡ g(z_1, . . . , z_k) be a nontrivial polynomial (i.e., g ≢ 0). Then

(3.3)  Pr[g(Z) = 0] = 0.

Proof. (sketch) Use induction on k. The result is true for k = 1 since
g can have only finitely many roots. Now assume the result is true for
k − 1 and extend to k by Fubini's Theorem (equivalently, by conditioning
on Z_1, . . . , Z_{k−1}).
¯

Proposition 3.4. Let X : p × n be a random matrix with a pdf that is


absolutely continuous w.r.to Lebesgue measure on Rp×n . If n ≥ p then

(3.4) Pr[ rank(X) = p ] = 1,


which implies that

(3.5)  Pr[ S ≡ XX' is positive definite ] = 1.

Proof. Without loss of generality (wlog) assume that p ≤ n and partition


X as (X1 , X2 ) with X1 : p × p. Since rank(X1 ) < p iff |X1 | = 0, and since
the determinant |X1 | ≡ g(X1 ) is a nontrivial polynomial,

Pr[ rank(X1 ) = p ] = 1

by Lemma 3.3. But rank(X1 ) = p ⇒ rank(X) = p, so (3.4) holds.


¯

Okamoto’s Lemma also yields the following important result:

Proposition 3.5. Let l_1(S) ≥ · · · ≥ l_p(S) denote the eigenvalues (neces-
sarily real) of S ≡ XX'. Under the assumptions of Proposition 3.4,

(3.6)  Pr[ l_1(S) > · · · > l_p(S) > 0 ] = 1.

Proof. (sketch) The eigenvalues of S ≡ XX' are the roots of the nontrivial
polynomial h(l) ≡ |XX' − l I_p|. These roots are distinct iff the discriminant
of h does not vanish. Since the discriminant is itself a nontrivial polynomial of
the coefficients of the polynomial h, hence a nontrivial polynomial of the
elements of X, (3.6) follows from Okamoto's Lemma.  ¯

Lemma 3.6. Additivity: If S1 ⊥


⊥ S2 with Si ∼ Wp (ni , Σ), then

(3.7) S1 + S2 ∼ Wp (n1 + n2 , Σ).


3.2. Covariance matrices of Kronecker product form.


If X1 , . . . , Xn are independent rvtrs each with covariance matrix Σ : p × p,
then Cov(X) = Σ ⊗ In , a Kronecker product. We now determine how a
covariance matrix of the general Kronecker product form Cov(X) = Σ ⊗ Λ
transforms under a linear transformation AXB (see Proposition 3.9).

The Kronecker product of the p × q matrix A and the m × n matrix B
is the pm × qn matrix

A ⊗ B := \begin{pmatrix} Ab_{11} & \cdots & Ab_{1n} \\ \vdots & & \vdots \\ Ab_{m1} & \cdots & Ab_{mn} \end{pmatrix}.

(i) A ⊗ B is bilinear:

(α_1A_1 + α_2A_2) ⊗ B = α_1(A_1 ⊗ B) + α_2(A_2 ⊗ B),
A ⊗ (β_1B_1 + β_2B_2) = β_1(A ⊗ B_1) + β_2(A ⊗ B_2).

(ii) (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)   (A : p × q, B : m × n, C : q × r, D : n × s).

(iii) (A ⊗ B)' = A' ⊗ B';
      A = A', B = B' ⟹ A ⊗ B = (A ⊗ B)'.

(iv) If Γ : p × p and Ψ : n × n are orthogonal matrices, then Γ ⊗ Ψ : pn × pn
is orthogonal. [apply (ii) and (iii)]

(v) If A : p × p and B : n × n are real symmetric matrices with eigenvalues
α_1, . . . , α_p and β_1, . . . , β_n, respectively, then A ⊗ B : pn × pn is also real
and symmetric with eigenvalues {α_iβ_j | i = 1, . . . , p, j = 1, . . . , n}.

Proof. Write the spectral decompositions of A and B as

A = ΓD_αΓ',   B = ΨD_βΨ',


respectively, where D_α = diag(α_1, . . . , α_p) and D_β = diag(β_1, . . . , β_n).
Then

A ⊗ B = (ΓD_αΓ') ⊗ (ΨD_βΨ')
(3.8)  = (Γ ⊗ Ψ)(D_α ⊗ D_β)(Γ ⊗ Ψ)'

by (ii) and (iii). Since Γ ⊗ Ψ is orthogonal and D_α ⊗ D_β is diagonal with
diagonal entries {α_iβ_j | i = 1, . . . , p, j = 1, . . . , n}, (3.8) is a spectral
decomposition of the real symmetric matrix A ⊗ B, so the result follows.  ¯

(vi)  A psd, B psd ⟹ A ⊗ B psd;
      A pd, B pd ⟹ A ⊗ B pd.   [apply (3.8)]
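A quick numerical check of property (v) (a sketch; numpy's np.kron places B-scaled blocks in the opposite order from the block convention displayed above, but the set of eigenvalues {α_iβ_j} is the same either way):

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((3, 3)); A = A + A.T     # real symmetric
B = rng.standard_normal((2, 2)); B = B + B.T     # real symmetric

eig_kron = np.sort(np.linalg.eigvalsh(np.kron(A, B)))
eig_prod = np.sort(np.outer(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)).ravel())
assert np.allclose(eig_kron, eig_prod)           # eigenvalues are {alpha_i * beta_j}
```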

Let X ≡ (X_1, . . . , X_n) : p × n be a random matrix. By convention we shall
define the covariance matrix Cov(X) to be the covariance matrix of the
pn × 1 column vector X̃ formed by "stacking" the column vectors of X:

Cov(X) := Cov(X̃) ≡ Cov\begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix} ≡ \begin{pmatrix} Cov(X_1) & \cdots & Cov(X_1, X_n) \\ \vdots & & \vdots \\ Cov(X_n, X_1) & \cdots & Cov(X_n) \end{pmatrix}.

Lemma 3.7. Let X = {X_{ij}}, Σ = {σ_{ii'}}, Λ = {λ_{jj'}}. Then

Cov(X) = Σ ⊗ Λ ⇐⇒ Cov(X_{ij}, X_{i'j'}) = σ_{ii'}λ_{jj'}

for all i, i' = 1, . . . , p and all j, j' = 1, . . . , n. [straightforward - verify]
¯

Lemma 3.8. Cov(X) = Σ ⊗ Λ ⇐⇒ Cov(X') = Λ ⊗ Σ.

Proof. Set U = X', so U_{ij} = X_{ji}. Then

Cov(U_{ij}, U_{i'j'}) = Cov(X_{ji}, X_{j'i'}) = σ_{jj'}λ_{ii'},

hence Cov(X') = Cov(U) = Λ ⊗ Σ by Lemma 3.7.
¯


Proposition 3.9. If Cov(X) = Σ ⊗ Λ then

(3.9)  Cov(AXB) = (AΣA') ⊗ (B'ΛB)   (A : q × p, X : p × n, B : n × m).

Thus if X ∼ N_{p×n}(ζ, Σ ⊗ Λ) then

(3.10)  AXB ∼ N_{q×m}(AζB, (AΣA') ⊗ (B'ΛB)).

Proof. (a) Because AX = (AX_1, . . . , AX_n) it follows that

(AX)~ = (A ⊗ I_n)X̃,

so

Cov(AX) ≡ Cov((AX)~) = (A ⊗ I_n) Cov(X̃) (A ⊗ I_n)'
        = (A ⊗ I_n)(Σ ⊗ Λ)(A ⊗ I_n)'
        = (AΣA') ⊗ Λ   [by (ii)].

(b) Next,

Cov(X') = Λ ⊗ Σ   [Lemma 3.8],

so

Cov(B'X') = (B'ΛB) ⊗ Σ   [by (a)],

hence

Cov(XB) ≡ Cov((B'X')') = Σ ⊗ (B'ΛB)   [Lemma 3.8].

—————————————

Looking ahead: Our goal will be to determine the joint distribution of the
matrices (S_{11·2}, S_{12}, S_{22}) that arise from a partitioned Wishart matrix S.
In §3.4 we will see that the conditional distribution of S_{12} | S_{22} follows a
multivariate normal linear model (MNLM) of the form (3.14) in §3.3, whose
covariance structure has Kronecker product form. Therefore we will first
study this MNLM and determine the joint distribution of its MLEs (β̂, Σ̂)
given by (3.15) and (3.16). This will readily yield the joint distribution of
(S_{11·2}, S_{12}, S_{22}), which in turn will have several interesting consequences,
including the evaluation of E(S^{−1}) and the distribution of Hotelling's T^2
statistic X̄_n'S^{−1}X̄_n.


3.3. The multivariate linear model.

The standard univariate linear model consists of a series X ≡ (X_1, . . . , X_n)
of uncorrelated univariate observations with common variance σ^2 > 0 such
that E(X) lies in a specified linear subspace L ⊂ R^n with dim(L) = q < n.
If Z : q × n is any fixed matrix whose rows span L then

(3.11)  L = {βZ | β : 1 × q ∈ R^q},

so this linear model can be expressed as follows:

(3.12)  E(X) = βZ,   β : 1 × q,
        Cov(X) = σ^2I_n,   σ^2 > 0.

In the standard multivariate linear model, X ≡ (X_1, . . . , X_n) : p × n
is a series of uncorrelated p-variate observations with common covariance
matrix Σ > 0 such that each row of E(X) lies in the specified linear subspace
L ⊂ R^n. This linear model can be expressed as follows:

(3.13)  E(X) = βZ,   β : p × q,
        Cov(X) = Σ ⊗ I_n,   Σ > 0.

If in addition we assume that X_1, . . . , X_n are normally distributed, then
(3.13) can be expressed as the normal multivariate linear model (MNLM)

(3.14)  X ∼ N_{p×n}(βZ, Σ ⊗ I_n),   β : p × q,   Σ > 0.

Often Z is called a design matrix for the linear model. We now assume that
Z is of rank q ≤ n, so ZZ' is nonsingular and β is identifiable:

β = E(X) Z'(ZZ')^{−1}.

The maximum likelihood estimator (β̂, Σ̂). We now show that the
MLE (β̂, Σ̂) exists w. pr. 1 iff n − q ≥ p and is given by

(3.15)  β̂ = XZ'(ZZ')^{−1},
(3.16)  Σ̂ = \frac{1}{n} X[I_n − Z'(ZZ')^{−1}Z]X' ≡ \frac{1}{n} XQX'.


Because the observation vectors X_1, . . . , X_n are independent under the
MNLM (3.14), the joint pdf of X ≡ (X_1, . . . , X_n) is given by

f_{β,Σ}(x) = \frac{c_1}{|Σ|^{n/2}} · e^{−\frac{1}{2}\sum_{i=1}^n (x_i−βZ_i)'Σ^{−1}(x_i−βZ_i)}
           = \frac{c_1}{|Σ|^{n/2}} · e^{−\frac{1}{2}\mathrm{tr}[Σ^{−1}\sum_{i=1}^n (x_i−βZ_i)(x_i−βZ_i)']}
(3.17)     = \frac{c_1}{|Σ|^{n/2}} · e^{−\frac{1}{2}\mathrm{tr}[Σ^{−1}(x−βZ)(x−βZ)']},

where c_1 = (2π)^{−np/2} and Z_1, . . . , Z_n are the columns of Z. To find the MLEs
β̂, Σ̂, first fix Σ and maximize (3.17) w.r.to β. This can be accomplished
by "minimizing" the matrix-valued quadratic form

(3.18)  ∆(β) := (X − βZ)(X − βZ)'

w.r.to the Loewner ordering^2, which a fortiori minimizes tr[Σ^{−1}∆(β)] [ver-
ify]. Since each row of βZ lies in L ≡ row space(Z) ⊂ R^n, this suggests
that the minimizing β̂ be chosen such that each row of β̂Z is the orthogonal
projection of the corresponding row of X onto L. But the matrix of this
orthogonal projection is

P ≡ Z'(ZZ')^{−1}Z : n × n,

so we should choose β̂ such that β̂Z = XZ'(ZZ')^{−1}Z, or equivalently,

(3.19)  β̂ = XZ'(ZZ')^{−1}.

To verify that β̂ minimizes ∆(β), write X − βZ = (X − β̂Z) + (β̂ − β)Z,
so

∆(β) = (X − β̂Z)(X − β̂Z)' + (β̂ − β)ZZ'(β̂ − β)'
       + \underbrace{(X − β̂Z)Z'(β̂ − β)'}_{=0} + \underbrace{(β̂ − β)Z(X − β̂Z)'}_{=0}.

^2 T ≥ S iff T − S is psd.


Since ZZ' is pd, ∆(β) is uniquely minimized w.r. to the Loewner ordering
when β = β̂, so

(3.20)  \min_β ∆(β) = (X − β̂Z)(X − β̂Z)'
                    = X[I_n − Z'(ZZ')^{−1}Z][I_n − Z'(ZZ')^{−1}Z]'X'
                    ≡ X(I_n − P)(I_n − P)'X'
                    ≡ XQQ'X'   [set Q = I_n − P]
                    = XQX'    [Q, like P, is a projection matrix].

Since β̂ does not depend on Σ, this establishes (3.15). Furthermore, it
follows from (3.17) and (3.20) that for fixed Σ > 0,

(3.21)  \max_β f_{β,Σ}(x) = \frac{c_1}{|Σ|^{n/2}} · e^{−\frac{1}{2}\mathrm{tr}(Σ^{−1}xQx')}.

To maximize (3.21) w. r. to Σ we apply the following lemma:

Lemma 3.10. If W is pd then

(3.22)  \max_{Σ>0} \frac{1}{|Σ|^{n/2}} e^{−\frac{1}{2}\mathrm{tr}(Σ^{−1}W)} = \frac{1}{|Σ̂|^{n/2}} · e^{−\frac{np}{2}},

where Σ̂ ≡ \frac{1}{n}W is the unique maximizing value of Σ.

Proof. Since the mappings

Σ → Σ^{−1} =: Λ,
Λ → W^{1/2}ΛW^{1/2} =: Ω

are both bijections of S_p^+ onto itself, the maximum in (3.22) is given by

(3.23)  \max_{Λ>0} |Λ|^{n/2} e^{−\frac{1}{2}\mathrm{tr}(ΛW)} = \frac{1}{|W|^{n/2}} \max_{Ω>0} |Ω|^{n/2} e^{−\frac{1}{2}\mathrm{tr}\,Ω}
                                                          = \frac{1}{|W|^{n/2}} \max_{ω_1≥···≥ω_p>0} \prod_{i=1}^p ω_i^{n/2} e^{−\frac{1}{2}ω_i},

where ω_1, . . . , ω_p are the eigenvalues of Ω. Since n log ω − ω is strictly
concave in ω, its maximum is uniquely attained at ω̂ = n, hence the
maximizing values of ω_1, . . . , ω_p are ω̂_1 = · · · = ω̂_p = n. Thus the unique
maximizing value of Ω is Ω̂ = nI_p, hence Λ̂ = nW^{−1} and Σ̂ = \frac{1}{n}W.
¯

If W is psd but singular, then the maximum in (3.23) is +∞ [verify].
Thus the MLE Σ̂ for the MNLM (3.14) exists and is given by Σ̂ = \frac{1}{n}XQX'
iff XQX' is pd. We now derive the distribution of XQX' and show that

(3.24)  XQX' is pd w. pr. 1 ⇐⇒ n − q ≥ p.

Thus the condition n − q ≥ p is necessary and sufficient for the existence
and uniqueness of the MLE Σ̂ as stated in (3.16).
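A computational sketch of the MLEs (3.15)-(3.16) on simulated data (the dimensions, β, Σ, and design matrix Z below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(11)
p, q, n = 2, 3, 50
beta = rng.standard_normal((p, q))
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
Z = rng.standard_normal((q, n))                    # design matrix, rank q

E = np.linalg.cholesky(Sigma) @ rng.standard_normal((p, n))
X = beta @ Z + E                                   # X ~ N_{p x n}(beta Z, Sigma ⊗ I_n)

ZZt_inv = np.linalg.inv(Z @ Z.T)
beta_hat = X @ Z.T @ ZZt_inv                       # (3.15)
Q = np.eye(n) - Z.T @ ZZt_inv @ Z
Sigma_hat = X @ Q @ X.T / n                        # (3.16); the unbiased version divides by n - q

print(beta_hat, "\n", Sigma_hat)
```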
First we find the joint distn of (β̂, Σ̂). From (3.14) and (3.10),

X(Z'  Q) ∼ N_{p×(q+n)}\Big(βZ(Z'  Q),\ Σ ⊗ \begin{pmatrix} Z \\ Q \end{pmatrix}(Z'  Q)\Big)
         = N_{p×(q+n)}\Big((βZZ'  0),\ Σ ⊗ \begin{pmatrix} ZZ' & 0 \\ 0 & Q \end{pmatrix}\Big)   [ZQ = 0],

from which it follows that

(3.25)  XZ' ∼ N_{p×q}(βZZ', Σ ⊗ (ZZ')),
(3.26)  XQ ∼ N_{p×n}(0, Σ ⊗ Q),
(3.27)  XZ' ⊥⊥ XQ.

Because Q ≡ I_n − Z'(ZZ')^{−1}Z is a projection matrix with [verify]

rank(Q) = tr(Q) = n − q,

its spectral decomposition is (recall (1.37))

(3.28)  Q = Γ \begin{pmatrix} I_{n−q} & 0 \\ 0 & 0 \end{pmatrix} Γ'

for some n × n orthogonal matrix Γ. Set Y = XQΓ, so from (3.26),

Y ∼ N_{p×n}\Big(0,\ Σ ⊗ \begin{pmatrix} I_{n−q} & 0 \\ 0 & 0 \end{pmatrix}\Big).


This shows that [verify]

(3.29)  XQX' ≡ YY' ∼ W_p(n − q, Σ),

hence (3.24) follows from Lemma 3.2. Lastly, by (3.25), (3.29), and (3.27),

(3.30)  β̂ ≡ XZ'(ZZ')^{−1} ∼ N_{p×q}(β, Σ ⊗ (ZZ')^{−1}),
(3.31)  nΣ̂ ≡ XQX' ∼ W_p(n − q, Σ),
(3.32)  β̂ ⊥⊥ Σ̂.

Remark 3.11. From (3.31), the MLE Σ̂ is a biased estimator of Σ:

E(Σ̂) = \Big(1 − \frac{q}{n}\Big) Σ.

Instead, the adjusted MLE Σ̆ := \frac{1}{n−q}XQX' is unbiased.
¯

Special case of the MNLM: a random sample from N_p(µ, Σ).

If X_1, . . . , X_n is an i.i.d. sample from N_p(µ, Σ) then the joint distribution
of X ≡ (X_1, . . . , X_n) is a special case of the MNLM (3.13):

(3.33)  X ∼ N_{p×n}(µe_n', Σ ⊗ I_n),   µ : p × 1,   Σ > 0.

Here q = 1, Z = e_n', and Q = I_n − e_n(e_n'e_n)^{−1}e_n', so from (3.30) - (3.32),

(3.34)  µ̂ = Xe_n(e_n'e_n)^{−1} = \frac{1}{n}\sum_{i=1}^n X_i = X̄_n ∼ N_p(µ, \frac{1}{n}Σ),
(3.35)  nΣ̂ = XQX' = \sum_{i=1}^n (X_i − X̄_n)(X_i − X̄_n)' ∼ W_p(n − 1, Σ),
(3.36)  X̄_n ⊥⊥ Σ̂.


3.4. Distribution of a partitioned Wishart matrix.

Let S_p^+ denote the cone of real positive definite p × p matrices and let
M_{m×n} denote the algebra of all real m × n matrices. Partition the pd
matrix S : p × p ∈ S_p^+ as

(3.37)  S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}   (S_{11} : p_1 × p_1, S_{22} : p_2 × p_2),

where p_1 + p_2 = p. The next result follows directly from (1.34).

Lemma 3.12. The following correspondence is bijective:

(3.38)  S_p^+ ↔ S_{p_1}^+ × M_{p_1×p_2} × S_{p_2}^+
        S ↔ (S_{11·2}, S_{12}, S_{22}).

Note that we cannot replace S_{11·2} by S_{11} in (3.38) because of the
constraints imposed on S itself by the pd condition. That is, the range
of (S_{11}, S_{12}, S_{22}) is not the Cartesian product of the three ranges.

Proposition 3.13.*** Let S ∼ W_p(n, Σ) be partitioned as in (3.37) with
n ≥ p_2 and Σ_{22} > 0. Then the joint distribution of (S_{11·2}, S_{12}, S_{22}) can be
specified as follows:

(3.39)  S_{12} | S_{22} ∼ N_{p_1×p_2}(Σ_{12}Σ_{22}^{−1}S_{22}, Σ_{11·2} ⊗ S_{22}),
(3.40)  S_{22} ∼ W_{p_2}(n, Σ_{22}),
(3.41)  S_{11·2} ∼ W_{p_1}(n − p_2, Σ_{11·2}),
(3.42)  (S_{12}, S_{22}) ⊥⊥ S_{11·2}.

Proof. Represent S as YY' with Y ≡ \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} ∼ N_{p×n}(0, Σ ⊗ I_n), Y_1 : p_1 × n, Y_2 : p_2 × n, so

(3.43)  \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix} = \begin{pmatrix} Y_1Y_1' & Y_1Y_2' \\ Y_2Y_1' & Y_2Y_2' \end{pmatrix}.

By Proposition 3.4, the conditions n ≥ p_2 and Σ_{22} > 0 imply that
rank(Y_2) = p_2 w. pr. 1, hence S_{22} ≡ Y_2Y_2' is pd w. pr. 1. Thus S_{11·2}
is well defined and is given by

(3.44)  S_{11·2} = Y_1[I_n − Y_2'(Y_2Y_2')^{−1}Y_2]Y_1' ≡ Y_1QY_1'.

From (2.8) the conditional distribution of Y_1 | Y_2 is given by

(3.45)  Y_1 | Y_2 ∼ N_{p_1×n}(Σ_{12}Σ_{22}^{−1}Y_2, Σ_{11·2} ⊗ I_n),

which is a MNLM (3.14) with the following correspondences:

X ↔ Y_1,   β ↔ Σ_{12}Σ_{22}^{−1},   p ↔ p_1,
Z ↔ Y_2,   Σ ↔ Σ_{11·2},   q ↔ p_2.

Thus from (3.25), (3.31), (3.32), (3.43), and (3.44), conditionally on Y_2,

(3.46)  S_{12} | Y_2 ∼ N_{p_1×p_2}(Σ_{12}Σ_{22}^{−1}S_{22}, Σ_{11·2} ⊗ S_{22}),
(3.47)  S_{11·2} | Y_2 ∼ W_{p_1}(n − p_2, Σ_{11·2}),
(3.48)  S_{12} ⊥⊥ S_{11·2} | Y_2.

Clearly (3.46) ⇒ (3.39), while (3.40) follows from Lemma 3.1 with
A = (0_{p_2×p_1}  I_{p_2}). Also, (3.47) ⇒ (3.41) and (3.47) ⇒ S_{11·2} ⊥⊥ Y_2, which
combines with (3.48) to yield S_{11·2} ⊥⊥ (S_{12}, Y_2), which implies (3.42).^3  ¯
Note that (3.39) can be restated in two equivalent forms:

(3.49)  S_{12}S_{22}^{−1} | S_{22} ∼ N_{p_1×p_2}(Σ_{12}Σ_{22}^{−1}, Σ_{11·2} ⊗ S_{22}^{−1}),
(3.50)  S_{12}(S_{22}^{−1/2})' | S_{22} ∼ N_{p_1×p_2}(Σ_{12}Σ_{22}^{−1}S_{22}^{1/2}, Σ_{11·2} ⊗ I_{p_2}),

where S_{22}^{1/2} can be any (Borel-measurable) square root of S_{22}. It follows from
(3.50) and (3.42) that

(3.51)  Σ_{12} = 0 ⟹ S_{12}(S_{22}^{−1/2})' ⊥⊥ S_{22} ⊥⊥ S_{11·2}.

We remark that Proposition 3.13 can also be derived directly from the
pdf of the Wishart distribution, the existence of which requires the stronger
conditions n ≥ p and Σ > 0. We shall derive the Wishart pdf in §8.4.

Proposition 3.13 yields many useful results – some examples follow.


^3 Because A ⊥⊥ B | C and B ⊥⊥ C ⇒ B ⊥⊥ (A, C) [verify].


Example 3.14. Distribution of the generalized variance.
If S ∼ W_p(n, Σ) with n ≥ p and Σ > 0 then

(3.52)  |S| ∼ |Σ| · \prod_{i=1}^p χ^2_{n−p+i},

a product of independent chi-square variates.

Proof. Partition S as in (3.37) with p_1 = 1, p_2 = p − 1. Then

|S| = |S_{11·2}| · |S_{22}| ∼ |W_1(n − p + 1, Σ_{11·2})| · |W_{p−1}(n, Σ_{22})|
    ∼ Σ_{11·2}\, χ^2_{n−p+1} · |W_{p−1}(n, Σ_{22})|

with the two factors independent. The result follows by induction on p.
¯

Note that (3.52) implies that although \frac{1}{n}S is an unbiased estimator of
Σ, |\frac{1}{n}S| is a biased estimator of |Σ|:

(3.53)  E\big|\tfrac{1}{n}S\big| = |Σ| · \prod_{i=1}^p \frac{n−p+i}{n} < |Σ|.

Proposition 3.15. Let S ∼ W_p(n, Σ) with n ≥ p and Σ > 0. If A : q × p
has rank q ≤ p then

(3.54)  (AS^{−1}A')^{−1} ∼ W_q(n − p + q, (AΣ^{−1}A')^{−1}).

When A = a' : 1 × p (a : p × 1) this becomes

(3.55)  \frac{1}{a'S^{−1}a} ∼ \frac{1}{a'Σ^{−1}a} · χ^2_{n−p+1}.

Note: Compare (3.54) to (3.1): ASA' ∼ W_q(n, AΣA'), which holds with
no restrictions on n, p, Σ, A, or q.

Our proof of (3.54) requires the singular value decomposition of A:


Lemma 3.16. If A : q × p has rank q ≤ p then there exist an orthogonal
matrix Γ : q × q and a row-orthogonal matrix Ψ_1 : q × p such that

(3.56)  A = ΓD_aΨ_1,

where D_a = diag(a_1, . . . , a_q) and a_1^2 ≥ · · · ≥ a_q^2 > 0 are the ordered eigen-
values of AA'.^4 By extending Ψ_1 to a p × p orthogonal matrix Ψ ≡ \begin{pmatrix} Ψ_1 \\ Ψ_2 \end{pmatrix},
we have the alternative representations

(3.57)  A = Γ(D_a  0_{q×(p−q)})Ψ,
(3.58)    = C(I_q  0_{q×(p−q)})Ψ,

where C ≡ ΓD_a : q × q is nonsingular.

Proof. Let AA' = ΓD_a^2Γ' be the spectral decomposition of the pd q × q
matrix AA'. Thus

D_a^{−1}Γ'AA'ΓD_a^{−1} = I_q,

so Ψ_1 := D_a^{−1}Γ'A : q × p satisfies Ψ_1Ψ_1' = I_q, i.e., the rows of Ψ_1 are
orthonormal. Thus (3.56) holds, then (3.57) and (3.58) are immediate.  ¯

Proof of Proposition 3.15. It follows from (3.58) that [verify]

(AS^{−1}A')^{−1} = C'^{−1}Š_{11·2}C^{−1},
(AΣ^{−1}A')^{−1} = C'^{−1}Σ̌_{11·2}C^{−1},

where Š = ΨSΨ' and Σ̌ = ΨΣΨ' are partitioned as in (3.37) with p_1 = q
and p_2 = p − q. Since Š ∼ W_p(n, Σ̌), it follows from Proposition 3.13 that

Š_{11·2} ∼ W_q(n − (p − q), Σ̌_{11·2}),

so

C'^{−1}Š_{11·2}C^{−1} ∼ W_q(n − (p − q), C'^{−1}Σ̌_{11·2}C^{−1}),

which gives (3.54).
¯

^4 a_1 ≥ · · · ≥ a_q > 0 are called the singular values of A.


Proposition 3.17. Distribution of Hotelling's T^2 statistic.
Let X ∼ N_p(µ, Σ) and S ∼ W_p(n, Σ) be independent, n ≥ p, Σ > 0, and
define

T^2 = X'S^{−1}X.

Then

(3.59)  T^2 ∼ \frac{χ^2_p(µ'Σ^{−1}µ)}{χ^2_{n−p+1}} ≡ F_{p, n−p+1}(µ'Σ^{−1}µ),

a (nonnormalized) noncentral F distribution. (The two chi-square variates
are independent.)

Proof. Decompose T^2 as \frac{X'S^{−1}X}{X'Σ^{−1}X} · X'Σ^{−1}X. By (3.55) and the inde-
pendence of X and S,

X'S^{−1}X | X ∼ X'Σ^{−1}X · \frac{1}{χ^2_{n−p+1}},

so

\frac{X'S^{−1}X}{X'Σ^{−1}X} \Big| X ∼ \frac{1}{χ^2_{n−p+1}},

independent of X. Since X'Σ^{−1}X ∼ χ^2_p(µ'Σ^{−1}µ) by (2.30), (3.59) holds.
¯

For any fixed µ_0 ∈ R^p, replace X and µ in Proposition 3.17 by X − µ_0
and µ − µ_0, respectively, to obtain the following generalization of (3.59):

T^2 ≡ (X − µ_0)'S^{−1}(X − µ_0)
(3.60)  ∼ \frac{χ^2_p((µ − µ_0)'Σ^{−1}(µ − µ_0))}{χ^2_{n−p+1}} ≡ F_{p, n−p+1}((µ − µ_0)'Σ^{−1}(µ − µ_0)).

Note: In Example 6.11 and Exercise 6.12 it will be shown that T^2 is the
UMP invariant test statistic and the LRT statistic for testing µ = µ_0 vs.
µ ≠ µ_0 with Σ unknown. When µ = µ_0,

(3.61)  T^2 ∼ F_{p, n−p+1},

which determines the significance level of the test.
¯
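A simulation sketch of the null case (3.61) (p, n, and Σ are arbitrary illustrative choices; T^2 here is the nonnormalized ratio, so it is compared against a ratio of independent chi-squares rather than a rescaled F variate):

```python
import numpy as np

rng = np.random.default_rng(12)
p, n, N = 3, 20, 50_000
Sigma = np.diag([1.0, 2.0, 0.5])
A = np.linalg.cholesky(Sigma)

T2 = np.empty(N)
for i in range(N):
    X = A @ rng.standard_normal(p)                    # N_p(0, Sigma), i.e. mu = mu0
    Y = A @ rng.standard_normal((p, n))
    S = Y @ Y.T                                       # W_p(n, Sigma), independent of X
    T2[i] = X @ np.linalg.solve(S, X)                 # T^2 = X'S^{-1}X

ref = rng.chisquare(p, N) / rng.chisquare(n - p + 1, N)
print(np.quantile(T2, [0.5, 0.9, 0.95]))
print(np.quantile(ref, [0.5, 0.9, 0.95]))             # the two sets of quantiles should roughly agree
```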


Example 3.18. Expected value of S^{−1}.
Suppose that S ∼ W_p(n, Σ) with n ≥ p and Σ > 0, so S^{−1} exists w. pr. 1.
When does E(S^{−1}) exist, and what is its value? We answer this by
combining Proposition 3.13 with an invariance argument.

First consider the case Σ = I. Partition S and S^{−1} as

S = \begin{pmatrix} s_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix},   S^{−1} = \begin{pmatrix} s^{11} & S^{12} \\ S^{21} & S^{22} \end{pmatrix},

respectively, with p_1 = 1 and p_2 = p − 1. Then by (3.41),

s^{11} = \frac{1}{s_{11·2}} ∼ \frac{1}{χ^2_{n−p+1}},

so

(3.62)  E(s^{11}) = \frac{1}{n−p−1} < ∞   iff n ≥ p + 2.

Similarly for the other diagonal elements of S^{−1}: E(s^{ii}) < ∞ iff n ≥ p + 2.
Because each off-diagonal element s^{ij} of S^{−1} satisfies

|s^{ij}| ≤ \sqrt{s^{ii}s^{jj}} ≤ \tfrac{1}{2}(s^{ii} + s^{jj}),

we see that E(S^{−1}) =: ∆ exists iff n ≥ p + 2. Furthermore, because Σ = I,
S ∼ ΓSΓ' for every p × p orthogonal matrix Γ, hence

Γ∆Γ' = Γ E(S^{−1}) Γ' = E[(ΓSΓ')^{−1}] = E(S^{−1}) = ∆   ∀ Γ.

Exercise 3.19. Show that Γ∆Γ' = ∆ ∀ Γ ⇒ ∆ = δI for some δ > 0.
¯

Thus E(S^{−1}) = δI, and δ = \frac{1}{n−p−1} by (3.62). Therefore when Σ = I,

E(S^{−1}) = \frac{1}{n−p−1} I   (n ≥ p + 2).

Now consider the general case Σ > 0. Since

S ∼ Σ^{1/2}ŠΣ^{1/2}   with Š ∼ W_p(n, I),


we conclude that

E(S^{−1}) = E[(Σ^{1/2}ŠΣ^{1/2})^{−1}]
          = Σ^{−1/2} E(Š^{−1}) Σ^{−1/2}
          = \frac{1}{n−p−1} Σ^{−1/2}Σ^{−1/2}
(3.63)    = \frac{1}{n−p−1} Σ^{−1}   (n ≥ p + 2).
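A Monte Carlo check of (3.63) (a sketch with an arbitrary illustrative Σ; the average of the simulated S^{−1} matrices should be close to Σ^{−1}/(n − p − 1)):

```python
import numpy as np

rng = np.random.default_rng(13)
p, n, reps = 3, 12, 20_000
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.linalg.cholesky(Sigma)

acc = np.zeros((p, p))
for _ in range(reps):
    X = A @ rng.standard_normal((p, n))
    acc += np.linalg.inv(X @ X.T)          # S^{-1} with S ~ W_p(n, Sigma)

print(acc / reps)
print(np.linalg.inv(Sigma) / (n - p - 1))  # (3.63), valid since n >= p + 2
```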

Proposition 3.20. Bartlett's decomposition.
Let S ∼ W_p(n, I) with n ≥ p. Set S = TT' where T ≡ {t_{ij} | 1 ≤ j ≤ i ≤ p}
is the unique lower triangular square root of S with t_{ii} > 0, i = 1, . . . , p (see
Exercise 1.5). Then the {t_{ij}} are mutually independent rvs with

(3.64)  t_{ii}^2 ∼ χ^2_{n−i+1},   i = 1, . . . , p,
        t_{ij} ∼ N_1(0, 1),   1 ≤ j < i ≤ p.

Proof. Use induction on p. The result is obvious for p = 1. Partition S
as in (3.37) with p_1 = p − 1 and p_2 = 1 so by the induction hypothesis,
S_{11} = T_1T_1' for a lower triangular matrix T_1 that satisfies (3.64) with p
replaced by p − 1. Then

S ≡ \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & s_{22} \end{pmatrix} = \begin{pmatrix} T_1 & 0 \\ S_{21}T_1'^{−1} & s_{22·1}^{1/2} \end{pmatrix} \begin{pmatrix} T_1' & T_1^{−1}S_{12} \\ 0 & s_{22·1}^{1/2} \end{pmatrix} ≡ TT',

where T : p × p is lower triangular with t_{ii} > 0, i = 1, . . . , p. Since T_1 = S_{11}^{1/2}
and Σ = I, it follows from (3.51), (3.50), and (3.41) (with the indices "1"
and "2" interchanged) that

S_{21}T_1'^{−1} ⊥⊥ T_1 ⊥⊥ s_{22·1},
S_{21}T_1'^{−1} ∼ N_{1×(p−1)}(0, 1 ⊗ I_{p−1}),
s_{22·1} ∼ χ^2_{n−p+1},

from which the induction step follows.
¯
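Bartlett's decomposition also gives a cheap way to simulate W_p(n, I) (and hence W_p(n, Σ) via Σ^{1/2}SΣ^{1/2}'); a minimal sketch:

```python
import numpy as np

def wishart_bartlett(p, n, rng):
    """Draw S ~ W_p(n, I) via Bartlett's decomposition (3.64)."""
    T = np.zeros((p, p))
    for i in range(p):                            # i is 0-based here
        T[i, i] = np.sqrt(rng.chisquare(n - i))   # t_ii^2 ~ chi^2_{n-i+1} in 1-based indexing
        T[i, :i] = rng.standard_normal(i)         # t_ij ~ N(0,1), j < i
    return T @ T.T

rng = np.random.default_rng(14)
p, n = 3, 10
S_mean = sum(wishart_bartlett(p, n, rng) for _ in range(20_000)) / 20_000
print(S_mean)        # ≈ n * I_p
```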


Example 3.21. Distribution of the sample multiple correlation
coefficient R^2.
Let S ∼ W_p(n, Σ) with n ≥ p and Σ > 0. Partition S and Σ as

(3.65)  S = \begin{pmatrix} s_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix},   Σ = \begin{pmatrix} σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix}   (s_{11}, σ_{11} : 1 × 1),

and define

        R^2 = \frac{S_{12}S_{22}^{−1}S_{21}}{s_{11}},   ρ^2 = \frac{Σ_{12}Σ_{22}^{−1}Σ_{21}}{σ_{11}},

(3.66)  U = \frac{R^2}{1 − R^2} = \frac{S_{12}S_{22}^{−1}S_{21}}{s_{11·2}},   ζ = \frac{ρ^2}{1 − ρ^2} = \frac{Σ_{12}Σ_{22}^{−1}Σ_{21}}{σ_{11·2}},

        V ≡ V(S_{22}, Σ) = \frac{Σ_{12}Σ_{22}^{−1}S_{22}Σ_{22}^{−1}Σ_{21}}{σ_{11·2}}.

From Proposition 3.13 and (3.50) we have

S_{12}(S_{22}^{−1/2})' | S_{22} ∼ N_{1×(p−1)}(Σ_{12}Σ_{22}^{−1}S_{22}^{1/2}, σ_{11·2} ⊗ I_{p−1}),
s_{11·2} ∼ σ_{11·2} · χ^2_{n−p+1},
S_{22} ∼ W_{p−1}(n, Σ_{22}),
s_{11·2} ⊥⊥ (S_{12}, S_{22}),

so [verify]

U | S_{22} ∼ \frac{χ^2_{p−1}(V)}{χ^2_{n−p+1}} \overset{distn}{=} F_{p−1, n−p+1}(V),
V ∼ ζ · χ^2_n.

Therefore the joint distribution of (U, V) ≡ (U, V(S_{22}, Σ)) is given by

(3.67)  U | V ∼ F_{p−1, n−p+1}(V),
        V ∼ ζ · χ^2_n.

Equivalently, if we set Z := V /ζ so Z is ancillary (but unobservable), then



U  Z ∼ Fp−1, n−p+1 (ζZ),
(3.68)
Z ∼ χ2n ,

from which the unconditional distribution of U can be obtained by averaging


over Z (see Exercise 3.22 and Example A.18 in Appendix A).
¯

Exercise 3.22. From (A.7) in Appendix A, the conditional distribution


Fp−1,n−p+1 (ζZ) of U | Z can be represented as a Poisson mixture of central
F distributions:

(3.69) Fp−1, n−p+1 (ζZ)  K ∼ Fp−1+2K, n−p+1 , K ∼ Poisson (ζZ/2) .

Use (3.68), (3.69), and (A.8) to show that the unconditional distribution of
U (resp., R2 ) can be represented as a negative binomial mixture of central
F (resp., Beta) rvs:

(3.70) U  K ∼ Fp−1+2K, n−p+1 ,
U 

(3.71) R ≡
2
 K ∼ B p−1
2 + K, n−p+1
2 ,
U +1
(3.72) K ∼ Negative binomial (ρ2 ),
that is,

Γ n2 + k 2
k
n
Pr[ K = k ] = n
ρ 1 − ρ2 2 , k = 0, 1, . . . .
¯
Γ 2 k!

Note: In Example 6.26 and Exercise 6.27 it will be shown that R2 is the
LRT statistic and the UMP invariant test statistic for testing ρ2 = 0 vs.
ρ2 > 0. When ρ2 = 0 ( ⇐⇒ Σ12 = 0 ⇐⇒ ζ = 0), U ⊥ ⊥ Z by (3.68) and

(3.73) U ∼ Fp−1, n−p+1 ,


(3.74) R2 ∼ B p−1
2 , n−p+1
2 ,

either of which determines the significance level of the test.


¯
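
Under the null hypothesis ρ² = 0, (3.74) is easy to check by simulation. A short sketch (Python/NumPy/SciPy; the choices of p and n are arbitrary):

    import numpy as np
    from scipy.stats import beta, kstest

    rng = np.random.default_rng(2)
    p, n, N = 4, 20, 5000
    R2 = np.empty(N)
    for m in range(N):
        X = rng.standard_normal((p, n))          # Sigma = I, so Sigma_12 = 0 and rho^2 = 0
        S = X @ X.T                              # S ~ W_p(n, I)
        R2[m] = S[0, 1:] @ np.linalg.solve(S[1:, 1:], S[1:, 0]) / S[0, 0]

    # compare with B((p-1)/2, (n-p+1)/2) as in (3.74)
    print(kstest(R2, beta(a=(p - 1) / 2, b=(n - p + 1) / 2).cdf))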


4. The Wishart Density; Jacobians of Matrix Transformations.


We have deduced properties of a Wishart random matrix S ∼ Wp (n, Σ)
by using its representation S = XX  in terms of a multivariate normal
random matrix X ∼ Np×n (0, Σ ⊗ In ). We have not required the density of
the Wishart distribution on Sp+ (the cone of p×p positive definite symmetric
matrices). In this section we derive this density, a multivariate extension of
the (central) chi-square density. Throughout it is assumed that n ≥ p.
Assume first that Σ = I. From Bartlett’s decomposition S = T T' in
Proposition 3.20, the joint pdf of T ≡ {tij} is given by [verify!]

  f(T) = ∏_{1≤j<i≤p} (2π)^{-1/2} e^{−t²ij/2} · ∏_{i=1}^p [ 2^{(n−i−1)/2} Γ((n−i+1)/2) ]^{-1} t_ii^{n−i} e^{−t²ii/2}

(4.1)  = [ 2^{pn/2 − p} π^{p(p−1)/4} ∏_{i=1}^p Γ((n−i+1)/2) ]^{-1} · ∏_{i=1}^p t_ii^{n−i} · exp( −(1/2) ∑_{1≤j≤i≤p} t²ij )

       =: 2^p c_{p,n} · ∏_{i=1}^p t_ii^{n−i} · exp( −(1/2) tr T T' ),

with c_{p,n} as defined in (4.10) below. Since the pdf of S is given by f(S) = f(T) |∂T/∂S|,
we first must find the Jacobian |∂S/∂T| ≡ 1/|∂T/∂S| of the mapping S = T T'.
[This derivation of the Wishart pdf will resume in §4.4.]

4.1. Jacobians of vector/matrix transformations.


Consider a smooth bijective mapping (≡ diffeomorphism)

                A → B
(4.2)
                x ≡ (x1, . . . , xn) → y ≡ (y1, . . . , yn),

where A and B are open subsets of Rⁿ. The Jacobian matrix of this map-
ping is given by

                ∂y     ( ∂y1/∂x1  · · ·  ∂yn/∂x1 )
(4.3)           ──  =  (    ⋮               ⋮    ),
                ∂x     ( ∂y1/∂xn  · · ·  ∂yn/∂xn )

and the Jacobian of the mapping is given by |∂y/∂x| := | det(∂y/∂x) |. Jaco-
bians obey several elementary properties.

Chain rule: Suppose that x → y and y → z are diffeomorphisms. Then
x → z is a diffeomorphism and

(4.4)        |∂z/∂x| = |∂z/∂y|_{y=y(x)} · |∂y/∂x|.

Proof. This follows from the chain rule for partial derivatives:

        ∂zi/∂xj = ∑_k (∂zi/∂yk)(∂yk/∂xj) = ( (∂z/∂y)(∂y/∂x) )_{ij}.

Therefore ∂z/∂x = (∂z/∂y)(∂y/∂x); now take determinants.   ¯

Inverse rule: Suppose that x → y is a diffeomorphism. Then

(4.5)        |∂x/∂y|_{y=y(x)} = |∂y/∂x|^{-1}.

Proof. Apply the chain rule with z = x.   ¯

Combination rule: Suppose that x → u and y → v are (unrelated)
diffeomorphisms. Then

(4.6)        |∂(u, v)/∂(x, y)| = |∂u/∂x| · |∂v/∂y|.

Proof. The Jacobian matrix is given by

        ∂(u, v)/∂(x, y) = ( ∂u/∂x    0    )
                          (   0    ∂v/∂y  ).

Extended combination rule: Suppose that (x, y) → (u, v) is a diffeo-
morphism of the form u = u(x), v = v(x, y). Then (4.6) continues to hold.

Proof. The Jacobian matrix is given by

        ∂(u, v)/∂(x, y) = ( ∂u/∂x  ∂v/∂x )
                          (   0    ∂v/∂y ).


4.2. Jacobians of linear mappings. Let

        A : p × p and B : n × n be nonsingular matrices,
        L : p × p and M : p × p be nonsingular lower triangular matrices,
        U : p × p and V : p × p be nonsingular upper triangular matrices,
        c a nonzero scalar.

(A, B, L, M, U, V, c are non-random.) Then (4.4) – (4.6) imply the following
facts:

(a) vectors. y = cx, x, y : 1 × n:  |∂y/∂x| = |c|^n.  [combination rule]

(b) matrices. Y = cX, X, Y : p × n:  |∂Y/∂X| = |c|^{pn}.  [comb. rule]

(c) symmetric matrices. Y = cX, X, Y : p × p, symmetric:  |∂Y/∂X| = |c|^{p(p+1)/2}.
[comb. rule]

(d) matrices. Y = AX, X, Y : p × n:  |∂Y/∂X| = |A|^n.  [comb. rule]
    Y = XB, X, Y : p × n:  |∂Y/∂X| = |B|^p.  [comb. rule]
    Y = AXB, X, Y : p × n:  |∂Y/∂X| = |A|^n |B|^p.  [chain rule]

(e) symmetric matrices. Y = AXA', X, Y : p × p, symmetric:

        |∂Y/∂X| = |A|^{p+1}.

Proof. Use the fact that A can be written as the product of elementary
matrices of the forms

        Mi(c) := Diag(1, . . . , 1, c, 1, . . . , 1)   (c in the ith position),
        Eij := the identity matrix with an extra 1 in the (i, j)th position (i ≠ j).

Verify the result when A = Mi(c) and A = Eij, then apply the chain rule.
¯


(f) triangular matrices:

• Y = LX, X, Y : p × p lower triangular:

        |∂Y/∂X| = ∏_{i=1}^p |lii|^i.

Proof. Since yij = ∑_{k=j}^i lik xkj (i ≥ j), the Jacobian matrix ∂Y/∂X (with the
free entries of X and Y listed in the order x11, x21, x22, . . . , xpp) is a
p(p+1)/2 × p(p+1)/2 triangular matrix whose diagonal entry for the coordinate
(i, j) is ∂yij/∂xij = lii. Since lii appears once for each j = 1, . . . , i, the
determinant is ∏_{i=1}^p lii^i.⁵   ¯

• Y = U X, X, Y : p × p upper triangular:

        |∂Y/∂X| = ∏_{i=1}^p |uii|^{p−i+1}.

• Y = XL, X, Y : p × p lower triangular:

        |∂Y/∂X| = ∏_{i=1}^p |lii|^{p−i+1}.

Proof. Write Y' = L'X' and apply the preceding case with U = L'.   ¯

⁵ A more revealing proof follows by noting that Y = LX can be written column-by-
column as Y1 = L1 X1, . . . , Yp = Lp Xp, where Xi and Yi are the (p−i+1) × 1 non-
zero parts of the columns of X and Y and where Li is the lower (p−i+1) × (p−i+1)
principal submatrix of L. Since Yi = Li Xi has Jacobian |Li|⁺ = ∏_{j=i}^p |ljj|, the
result follows from the combination rule.


• Y = XU, X, Y : p × p upper triangular:

        |∂Y/∂X| = ∏_{i=1}^p |uii|^i.

Proof. Write Y' = U'X' and apply the first case with L = U'.   ¯

• Y = LXM, X, Y : p × p lower triangular:

        |∂Y/∂X| = ∏_{i=1}^p |lii|^i · ∏_{i=1}^p |mii|^{p−i+1}.

Proof. Apply the chain rule.   ¯

• Y = U XV, X, Y : p × p upper triangular:

        |∂Y/∂X| = ∏_{i=1}^p |uii|^{p−i+1} · ∏_{i=1}^p |vii|^i.

Proof. Write Y' = V'X'U' and apply the last case with L = V' and
M = U'.   ¯

(g) triangular/symmetric matrices:

• Y = X + X', X : p × p lower (or upper) triangular, Y : p × p symmetric:

        |∂Y/∂X| = 2^p.

Proof. Since yii = 2xii, 1 ≤ i ≤ p, while yij = xij, 1 ≤ j < i ≤ p.   ¯

• Y = L'X + X'L, X : p × p lower triangular, Y : p × p symmetric:

        |∂Y/∂X| = 2^p ∏_{i=1}^p |lii|^i.

Proof. Clearly X → Y is a linear mapping. To show that it is 1-1:

        L'X1 + X1'L = L'X2 + X2'L
        ⇒ L'(X1 − X2) = −(X1 − X2)'L
        ⇒ (X1 − X2)L^{-1} = −[(X1 − X2)L^{-1}]'.

Thus (X1 − X2)L^{-1} is both lower triangular and skew-symmetric, hence is
0, so X1 = X2. Next, to find the required Jacobian, apply the chain rule to
the sequence of mappings

        X → XL^{-1} → XL^{-1} + (XL^{-1})' → L'[XL^{-1} + (XL^{-1})']L ≡ L'X + X'L.

Therefore the Jacobian is given by [verify!]

        |∂Y/∂X| = ∏_{i=1}^p |lii|^{−(p−i+1)} · 2^p · ∏_{i=1}^p |lii|^{p+1} = 2^p ∏_{i=1}^p |lii|^i.

• Y = U'X + X'U, X : p × p upper triangular, Y : p × p symmetric:

        |∂Y/∂X| = 2^p ∏_{i=1}^p |uii|^{p−i+1}.

• Y = XL' + LX', X : p × p lower triangular, Y : p × p symmetric:

        |∂Y/∂X| = 2^p ∏_{i=1}^p |lii|^{p−i+1}.

Proof. Apply the preceding case with U = L' and X replaced by X̃ := X'.   ¯

• Y = XU' + U X', X : p × p upper triangular, Y : p × p symmetric:

        |∂Y/∂X| = 2^p ∏_{i=1}^p |uii|^i.

Proof. Apply the first case with L = U' and X replaced by X̃ := X'.
¯


4.3. Jacobians of nonlinear mappings.

(*) The Jacobian of a nonlinear diffeomorphism x → y is the same as the
Jacobian of the linearized differential mapping dx → dy. Here,

        dx := (dx1, . . . , dxn)  and  dy := (dy1, . . . , dyn).

For n = 1, (*) is immediate from the linear relation between dx and dy
given by the formal differential identity dy = (dy/dx) dx, where dy/dx is treated
as a scalar constant c. For n ≥ 2, the equations for total differentials

        dy1 = (∂y1/∂x1) dx1 + · · · + (∂y1/∂xn) dxn,
(4.7)     ⋮
        dyn = (∂yn/∂x1) dx1 + · · · + (∂yn/∂xn) dxn,

can be expressed in vector-matrix notation as the single linear relation

(4.8)        dy = dx (∂y/∂x),

with ∂y/∂x treated as a constant matrix, which again implies (*).

The following elementary rules for matrix differentials will combine
with (*) to allow calculation of Jacobians for apparently complicated non-
linear diffeomorphisms. Here, if X ≡ (xij) is a matrix variable, dX denotes
the matrix of differentials (dxij). If X is a structured matrix (e.g., sym-
metric or triangular) then dX has the same structure.

(1) sum: d(X + Y) = dX + dY.  [verify]

(2) product: d(XY) = (dX)Y + X(dY).  [verify]

(3) inverse: d(X^{-1}) = −X^{-1}(dX)X^{-1}.

Proof. Apply (2) with Y = X^{-1}.
¯


Four examples of nonlinear Jacobians:

(a) matrix inversion: if Y = X^{-1} with X, Y : p × p (unstructured) then

        |∂Y/∂X| = |X|^{−2p}.

Proof. Apply (3) and §4.2(d).   ¯

(b) matrix inversion: if Y = X^{-1} with X, Y : p × p symmetric, then

        |∂Y/∂X| = |X|^{−p−1}.

Proof. Apply (3) and §4.2(e).   ¯

(c) lower triangular decomposition: if S = T T' with S : p × p symmetric
pd and T : p × p lower triangular with t11 > 0, . . . , tpp > 0 (Cholesky), then

        |∂S/∂T| = 2^p ∏_{i=1}^p t_ii^{p−i+1}.

Proof. By (2), dS = (dT)T' + T(dT)'; now apply §4.2(g).
¯
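
A numerical spot-check of (c) is easy (a sketch in Python/NumPy, with function names of my own): the map T ↦ TT' is parametrized by the p(p+1)/2 free entries, and the determinant of a finite-difference Jacobian is compared with 2^p ∏ t_ii^{p−i+1}.

    import numpy as np

    def vech_to_L(v, p):
        """Fill a lower triangular matrix from its p(p+1)/2 free entries (np.tril_indices order)."""
        L = np.zeros((p, p))
        L[np.tril_indices(p)] = v
        return L

    def jacobian_det_TTt(T, eps=1e-6):
        """|dS/dT| for S = T T', approximated by finite differences on the free entries."""
        p = T.shape[0]
        idx = np.tril_indices(p)
        v0 = T[idx]
        m = len(v0)
        J = np.zeros((m, m))
        for k in range(m):
            v = v0.copy()
            v[k] += eps
            Tk = vech_to_L(v, p)
            dS = (Tk @ Tk.T - T @ T.T) / eps
            J[:, k] = dS[idx]
        return abs(np.linalg.det(J))

    rng = np.random.default_rng(12)
    p = 3
    T = np.tril(rng.standard_normal((p, p)))
    T[np.diag_indices(p)] = np.abs(np.diag(T)) + 1          # t_ii > 0
    t = np.diag(T)
    print(jacobian_det_TTt(T))
    print(2**p * np.prod(t ** (p - np.arange(1, p + 1) + 1)))   # 2^p * prod t_ii^{p-i+1}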

(d) upper triangular decomposition: if S = U U' with S : p × p symmetric
pd and U : p × p upper triangular with u11 > 0, . . . , upp > 0 (Cholesky),
then

        |∂S/∂U| = 2^p ∏_{i=1}^p u_ii^i.

Proof. By (2), dS = (dU)U' + U(dU)'; again apply §4.2(g).
¯


4.4. The Wishart density.


We continue the discussion following (4.1). When Σ = Ip and n ≥ p, the
pdf f(T) of T (recall that S = T T' with T lower triangular) is given by
(4.1). Thus by the inverse rule and §4.3(c) the pdf of S is given by

        f(S) = f(T(S)) · |∂S/∂T|^{-1}_{T=T(S)}

             = 2^p c_{p,n} · ∏_{i=1}^p t_ii^{n−i} · exp( −(1/2) tr T T' ) · 2^{−p} ∏_{i=1}^p t_ii^{−(p−i+1)}

(4.9)        = c_{p,n} · ∏_{i=1}^p t_ii^{n−p−1} · exp( −(1/2) tr T T' )

             = c_{p,n} · |S|^{(n−p−1)/2} e^{−(1/2) tr S},    S ∈ Sp+,

where

(4.10)   c_{p,n}^{−1} := 2^{pn/2} π^{p(p−1)/4} ∏_{i=1}^p Γ( (n−i+1)/2 ) =: 2^{pn/2} π^{p(p−1)/4} Γp(n/2).

Finally, for Σ > 0 the Jacobian of the mapping S → Σ^{1/2} S Σ^{1/2} is
|Σ|^{(p+1)/2} (apply §4.2(e)), so the general Wishart pdf for S ∼ Wp(n, Σ) is
given by

(4.11)   ( c_{p,n} / |Σ|^{n/2} ) · |S|^{(n−p−1)/2} e^{−(1/2) tr Σ^{−1} S},    S ∈ Sp+,

a multivariate extension of the density of σ²χ²_n.


¯
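
As a numerical check of (4.11), the sketch below (Python with a recent SciPy assumed; scipy.special.multigammaln supplies the multivariate gamma factor in c_{p,n}, and the particular Σ, n, p are arbitrary) compares the log of (4.11) with scipy.stats.wishart.logpdf at a random argument.

    import numpy as np
    from scipy.stats import wishart
    from scipy.special import multigammaln

    def log_wishart_pdf(S, n, Sigma):
        """Log of the Wishart density (4.11)."""
        p = Sigma.shape[0]
        log_c = -(p * n / 2) * np.log(2) - multigammaln(n / 2, p)    # log c_{p,n}
        _, logdet_Sigma = np.linalg.slogdet(Sigma)
        _, logdet_S = np.linalg.slogdet(S)
        return (log_c - (n / 2) * logdet_Sigma + ((n - p - 1) / 2) * logdet_S
                - 0.5 * np.trace(np.linalg.solve(Sigma, S)))

    rng = np.random.default_rng(3)
    p, n = 4, 9
    A = rng.standard_normal((p, p))
    Sigma = A @ A.T + p * np.eye(p)
    S = wishart(df=n, scale=Sigma).rvs(random_state=rng)
    print(log_wishart_pdf(S, n, Sigma), wishart(df=n, scale=Sigma).logpdf(S))  # should agree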

Exercise 4.1. Moments of the determinant of a Wishart random
matrix. Use (4.11) to show that

(4.12)        E(|S|^k) = |Σ|^k · 2^{pk} · Γp(n/2 + k) / Γp(n/2),    k = 1, 2, . . . .


Exercise 4.2. Matrix-variate Beta distribution.

Let S and T be independent with S ∼ Wp(r, Σ), T ∼ Wp(n, Σ), r ≥ p,
n ≥ p, and Σ > 0, so S > 0 and T > 0 w. pr. 1. Define

(4.13)        U = (S + T)^{−1/2} S (S + T)^{−1/2}',
              V = S + T.

Show that the range of (U, V) is given by {0 < U < I} × {V > 0} and verify
that (4.13) is a bijection. Show that the joint pdf of (U, V) is given by

(4.14)   f(U, V) = ( c_{p,r} c_{p,n} / c_{p,r+n} ) · |U|^{(r−p−1)/2} |I − U|^{(n−p−1)/2}
                     · ( c_{p,r+n} / |Σ|^{(r+n)/2} ) · |V|^{(r+n−p−1)/2} e^{−(1/2) tr Σ^{−1} V},

so U and V are independent and the distribution of U does not depend on
Σ. (Note that the distribution of U is a matrix generalization of the Beta
distribution.) Therefore

(4.15)        E(|S|^k) = E(|U|^k |V|^k) = E(|U|^k) E(|V|^k),

so the moments of |U| can be expressed in terms of the moments of deter-
minants of the two Wishart matrices S and V via (4.12) as follows:

(4.16)   E(|U|^k) = E(|S|^k) / E(|V|^k) = [ Γp((n+r)/2) Γp(r/2 + k) ] / [ Γp(r/2) Γp((n+r)/2 + k) ].

Hint: To find the Jacobian of (4.13), apply the chain rule to the sequence
of mappings
(S, T ) → (S, V ) → (U, V ).
Use the extended combination rule to find the two intermediate Jacobians.


Exercise 4.3. Distribution of the sample correlation matrix when
Σ is diagonal.

Let S ∼ Wp(n, Dσ) (n ≥ p), where Dσ := diag(σ1, . . . , σp) > 0. Define the
sample correlation matrix R ≡ {rij} by

        rij = sii^{−1/2} sij sjj^{−1/2},

where S ≡ {sij}. Find the joint pdf of R, s11, . . . , spp. Show that they are
mutually independent.

Hint: First determine the range of (R, s11, . . . , spp). Next, the joint pdf of
R, s11, . . . , spp is given by

  f(R, s11, . . . , spp) = f(S) · |∂(S)/∂(R, s11, . . . , spp)|
      = f(S) · |∂(s12, . . . , s_{p−1,p}, s11, . . . , spp)/∂(R, s11, . . . , spp)|
      = ( c_{p,n} / |Dσ|^{n/2} ) · |S|^{(n−p−1)/2} e^{−(1/2) tr Dσ^{−1} S} · |∂(s12, . . . , s_{p−1,p})/∂R|
      = ( c_{p,n} / ∏_{i=1}^p σi^{n/2} ) · |R|^{(n−p−1)/2} · ∏_{i=1}^p s_ii^{(n−p−1)/2} e^{−sii/(2σi)} · ∏_{i=1}^p s_ii^{(p−1)/2}
      = c_{p,n} · |R|^{(n−p−1)/2} · ∏_{i=1}^p (1/σi) (sii/σi)^{n/2 − 1} e^{−sii/(2σi)},

where f(S) is given by (4.11) with Σ = Dσ and the Jacobian is calculated
using the extended combination rule and the relation sij = sii^{1/2} rij sjj^{1/2}.
This establishes the mutual independence, and will yield the marginal pdf
of R. (The mutual independence also can be established by means of Basu’s
Lemma.)
¯

Exercise 4.4. Inverse Wishart distribution. Let S ∼ Wp(n, Σ) with
n ≥ p and Σ > 0. Show that the pdf of W ≡ S^{−1} is

(4.17)        c_{p,n} |Ω|^{n/2} |W|^{−(n+p+1)/2} e^{−(1/2) tr Ω W^{−1}},    W ∈ Sp+,

where Ω = Σ^{−1}.
¯


5. Estimating a Covariance Matrix.


Consider the problem of estimating Σ based on a Wishart random matrix
S ∼ Wp (n, Σ) with Σ ∈ Sp+ . Assume that n ≥ p so that S is nonsingular6
w. pr. 1. The loss incurred by an estimate Σ̂ is measured by a loss function
L(Σ̂, Σ) such that L ≥ 0 and L = 0 iff Σ̂ = Σ. An estimator Σ̂ ≡ Σ̂(S) is
evaluated in terms of its risk function ≡ expected loss:

R(Σ̂, Σ) = EΣ [L(Σ̂, Σ)].

We shall consider two specific loss functions:



2
Quadratic loss : L1 (Σ̂, Σ) = tr Σ̂Σ−1 − I ,

Stein  s loss : L2 (Σ̂, Σ) = tr Σ̂Σ−1 − log |Σ̂Σ−1 | − p.

We prefer L2 over L1 because L1 penalizes overestimates more than under-


estimates, unlike L2 :

L1 (Σ̂, I) → p as Σ̂ → 0, L1 (Σ̂, I) → ∞ as Σ̂ → ∞;
L2 (Σ̂, I) → ∞ as Σ̂ → 0 or ∞.

5.1. Equivariant estimators of Σ.


Let G be a subgroup of GL ≡ GL(p), the general linear group of all p × p
nonsingular real matrices. Each A ∈ G acts on Sp+ according to the mapping

Sp+ → Sp+
(5.1)
Σ → AΣA .

A loss function L is G-invariant if

(5.2) L(AΣ̂A , AΣA ) = L(Σ̂, Σ) ∀ A ∈ G.

6
If n < p it would seem impossible to estimate Σ. However several proposals
recently been put forth to address this case, which occurs for example with microarray
data where p ≈ 105 but n ≈ 103 . [References?]


Note that both L1 and L2 are fully invariant, i.e., are GL-invariant. If L
is G-invariant then the risk function of any estimator Σ̂ ≡ Σ̂(S) transforms
as follows: for A ∈ G,
 
        R( A^{−1} Σ̂(ASA') A'^{−1}, Σ ) = EΣ[ L( A^{−1} Σ̂(ASA') A'^{−1}, Σ ) ]
                                        = EΣ[ L( Σ̂(ASA'), AΣA' ) ]
(5.3)                                   = E_{AΣA'}[ L( Σ̂(S), AΣA' ) ]
                                        = R( Σ̂(S), AΣA' ).

An estimator Σ̂ ≡ Σ̂(S) is G-equivariant if

(5.4) Σ̂(ASA ) = A Σ̂(S) A ∀ A ∈ G, ∀ S ∈ Sp+ .

If L is G-invariant and Σ̂ is G-equivariant then by (5.3) the risk function is


also G-invariant:

(5.5) R(Σ̂, Σ) = R(Σ̂, AΣA ) ∀ A ∈ G,

that is, R(Σ̂, Σ) is constant on G-orbits of Sp+ (see Definition 6.1).


We say that G acts transitively on Sp+ if Sp+ has only one G-orbit under
the action of G. Note that G acts transitively on Sp+ iff every Σ ∈ Sp+ has a
square root ΣG ∈ G, i.e., Σ = ΣG ΣG . Thus both GL and GT ≡ GT (p) (the
subgroup of all p × p nonsingular lower triangular matrices) act transitively
on Sp+ .7 If L is G-invariant, Σ̂ is G-equivariant, and G acts transitively on
Sp+ , then the risk function is constant on Sp+ :

(5.6)        R(Σ̂, Σ) = R(Σ̂, I) ∀ Σ ∈ Sp+    [set A = Σ_G^{−1} in (5.5)].

5.2. The best fully equivariant estimator of Σ.


Lemma 5.1. An estimator Σ̂(S) is GL-equivariant iff Σ̂(S) = δ S for some
scalar δ > 0.
7
For the latter, apply the Cholesky decomposition, Exercise 1.5.


Proof. Set G = GL and A = S_{GL}^{−1} in (5.4) to obtain

        Σ̂(I) = S_{GL}^{−1} Σ̂(S) S_{GL}'^{−1},
so
(5.7)        Σ̂(S) = S_{GL} Σ̂(I) S_{GL}'.

Next set A = Γ ∈ O and S = I in (5.4) to obtain

(5.8)        Σ̂(I) = Γ Σ̂(I) Γ'    ∀ Γ ∈ O(p),

where O ≡ O(p) is the subgroup of all p × p orthogonal matrices. By
Exercise 3.19, (5.8) implies that Σ̂(I) = δ I, so Σ̂(S) = δ S by (5.7), as
stated. ¯

We now find the optimal fully equivariant estimators Σ̂(S) ≡ δ̂ S w. r.


to the loss function L1 and L2 , respectively.

Proposition 5.2. (a) The best fully equivariant estimator w. r. to the loss
function L1 is the biased estimator S/(n+p+1).
(b) The best fully equivariant estimator w. r. to the loss function L2 is the
unbiased estimator S/n.
Proof. (a) Let S = {sij | i, j = 1, . . . , p}. Because GL acts transitively on
Sp+ and L1 is GL-invariant, δ S has constant risk given by

        EI[L1(δ S, I)] = EI[tr(δ S − I)²]
                       = δ² EI(tr S²) − 2δ EI(tr S) + tr I²
                       = δ² EI( ∑_{i,j} s²ij ) − 2δ EI( ∑_i sii ) + p
                       = δ² [ EI( ∑_i s²ii ) + EI( ∑_{i≠j} s²ij ) ] − 2δnp + p
(5.9)*                 = δ² [ (2n + n²)p + p(p−1)n ] − 2δnp + p
(5.10)                 = δ² np(n + p + 1) − 2δnp + p.

The quadratic function of δ in (5.10) is minimized by δ̂ = 1/(n+p+1).


*To verify (5.9), first note that when Σ = I, sii ∼ χ²n, so

        EI(s²ii) = Var_I(χ²n) + [EI(χ²n)]² = 2n + n².

Next, sij ∼ s12 for i ≠ j, since

        Π S Π' ∼ Wp(n, Π Π') = Wp(n, I) ∼ S

for any permutation matrix Π. Also s12 s22^{−1/2} ⊥⊥ s22 and s12 s22^{−1/2} ∼ N(0, 1)
by (3.50) and (3.51), so

        EI(s²12) = EI( (s²12/s22) · s22 ) = EI(s²12/s22) · EI(s22) = 1 · n = n.

(b) Because GL acts transitively on Sp+ and L2 is GL-invariant, δS has
constant risk given by

        EI[L2(δS, I)] = EI[tr(δS) − log |δS| − p]
                      = δ EI(tr S) − p log δ − EI(log |S|) − p
(5.11)                = δnp − p log δ − EI(log |S|) − p.

This is minimized by δ̂ = 1/n.
¯
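
A small simulation (a sketch in Python/NumPy; p and n are arbitrary choices) illustrating Proposition 5.2: under L1 the estimator S/(n+p+1) has smaller risk than S/n, while under Stein's loss L2 the ordering reverses.

    import numpy as np

    def L1(Sig_hat, Sig):
        D = Sig_hat @ np.linalg.inv(Sig) - np.eye(Sig.shape[0])
        return np.trace(D @ D)

    def L2(Sig_hat, Sig):
        M = Sig_hat @ np.linalg.inv(Sig)
        return np.trace(M) - np.linalg.slogdet(M)[1] - Sig.shape[0]

    rng = np.random.default_rng(4)
    p, n, N = 3, 10, 20000
    Sigma = np.eye(p)                      # both risks are constant in Sigma, so I suffices
    risks = np.zeros(4)
    for _ in range(N):
        X = rng.standard_normal((p, n))
        S = X @ X.T
        risks += [L1(S / n, Sigma), L1(S / (n + p + 1), Sigma),
                  L2(S / n, Sigma), L2(S / (n + p + 1), Sigma)]
    print(np.round(risks / N, 3))          # L1: second entry smaller; L2: third entry smaller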

5.3. The best GT -equivariant estimator of Σ.


Lemma 5.3. Let ST = S_{GT}. An estimator Σ̂(S) is GT-equivariant iff

(5.12)        Σ̂(S) = ST ∆ ST'

for a fixed diagonal matrix ∆ ≡ diag(δ1, . . . , δp) with each δi > 0.

Proof. Set G = GT and A = ST^{−1} in (5.4) to obtain

        Σ̂(I) = ST^{−1} Σ̂(S) ST'^{−1},
so
(5.13)        Σ̂(S) = ST Σ̂(I) ST'.

Next set A = D± ≡ diag(±1, . . . , ±1) ∈ GT and S = I in (5.4) to obtain

(5.14)        Σ̂(I) = D± Σ̂(I) D±    ∀ D±.


But (5.14) implies that Σ̂(I) = ∆ for some diagonal matrix ∆ ∈ Sp+ , [verify],
hence (5.12) follows from (5.13).
¯

We now present Charles Stein’s derivation of the optimal GT-equivariant
estimator Σ̂T(S) := ST ∆̂T ST' w. r. to the loss function L2. Remarkably,
Σ̂T(S) is not of the form δ S, hence is not GL-equivariant. Because GT is
a proper subgroup of GL, the class of GT-equivariant estimators properly
contains the class of GL-equivariant estimators, hence Σ̂T dominates the
best fully equivariant estimator S/n. Thus the latter, which is also the
best unbiased estimator and the MLE, is neither admissible nor minimax.
(Similar results hold for the quadratic loss function L1.)

Proposition 5.4.⁸ The best GT-equivariant estimator w. r. to the loss
function L2 is

(5.15)        Σ̂T(S) = ST ∆̂T ST',
where
(5.16)        ∆̂T = diag(δ̂T,1, . . . , δ̂T,p)
and
(5.17)        δ̂T,i = 1/(n + p + 1 − 2i).

Proof. Let ST = {tij | 1 ≤ j ≤ i ≤ p}. Because GT acts transitively on
Sp+ and L2 is GT-invariant, each GT-equivariant estimator ST ∆ ST' has
constant risk R2(ST ∆ ST', Σ) given by

  EI[ L2(ST ∆ ST', I) ]
    = EI[ tr(ST ∆ ST') − log |ST ∆ ST'| − p ]
    = EI[ tr(∆ ST' ST) ] − ∑_{i=1}^p log δi − EI[ log |ST ST'| ] − p
    = EI[ ∑_{i=1}^p δi ( t²ii + t²(i+1)i + · · · + t²pi ) ] − ∑_{i=1}^p log δi + const.
    = ∑_{i=1}^p [ δi ( (n − i + 1) + (p − i) ) − log δi ] + const.*
(5.18)
    = ∑_{i=1}^p [ δi (n + p + 1 − 2i) − log δi ] + const.

⁸ James and Stein (1962), Proc. 4th Berkeley Symp. Math. Statist. Prob. V.1.


The ith term in the last sum is minimized by δ̂i = 1/(n + p + 1 − 2i), as asserted.
*This follows from Bartlett’s decomposition (Proposition 3.20).
¯
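
A minimal implementation sketch of (5.15)–(5.17) in Python/NumPy (the function name is mine; np.linalg.cholesky returns exactly the lower triangular square root S_T used here):

    import numpy as np

    def stein_GT_estimator(S, n):
        """Stein's best G_T-equivariant estimator (5.15)-(5.17) under Stein's loss L2."""
        p = S.shape[0]
        S_T = np.linalg.cholesky(S)                          # lower triangular, S = S_T S_T'
        delta = 1.0 / (n + p + 1 - 2 * np.arange(1, p + 1))  # delta_{T,i} = 1/(n+p+1-2i)
        return S_T @ np.diag(delta) @ S_T.T

    # example: compare with the unbiased estimator S/n on one simulated S ~ W_p(n, I)
    rng = np.random.default_rng(5)
    p, n = 4, 12
    X = rng.standard_normal((p, n))
    S = X @ X.T
    print(np.round(stein_GT_estimator(S, n), 3))
    print(np.round(S / n, 3))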

For the loss function L2 , the improvement in risk offered by Stein’s


estimator Σ̂T (S) = ST ∆ ˆ T ST  compared to the unbiased estimator 1 S
n
is ≈ 5-20% for moderate values of p.9 However, this estimator is itself
inadmissible and can be improved upon readily as follows:
Replace the lower triangular group GT with the upper triangular group
GU to obtain the alternative version of Stein’s estimator given by

(5.19)        Σ̂U(S) = SU ∆̂U SU',

where SU ≡ S_{GU} is the unique upper triangular square root of S and
∆̂U = diag(δ̂U,1, . . . , δ̂U,p) with

        δ̂U,i = δ̂T,p−i+1 = 1/(n − p − 1 + 2i).

Because GU also acts transitively on Sp+ , the risk function of Σ̂U is also
constant on Sp+ with the same constant value as the risk function of Σ̂T
[why?10 ] Since L2 (Σ̂, Σ) is strictly convex in Σ̂ [verify!], so is R2 (Σ̂, Σ)
9
S. Lin and M. Perlman (1985). A Monte Carlo comparison of four estimators of a
covariance matrix. In Multivariate Analysis – VI, P. R. Krishnaiah, ed., pp. 411-429.
¹⁰ Use an invariance argument: Let Π denote the p × p permutation matrix corre-
sponding to the permutation (1, . . . , p) → (p, . . . , 1). Then

        S̃ := ΠSΠ' = ΠSU SU'Π' = (ΠSU Π')(ΠSU Π')'

and ΠSU Π' is lower triangular, so ΠSU Π' = S̃T by uniqueness. Also ∆̂U = Π'∆̂T Π,
so from (5.19),

        Σ̂U(S) = (Π'S̃T Π)(Π'∆̂T Π)(Π'S̃T'Π) = Π'(S̃T ∆̂T S̃T')Π
               = Π' Σ̂T(S̃) Π = Π' Σ̂T(ΠSΠ') Π.

Now apply (5.3) with A = Π to obtain

(5.20)   R2( Σ̂U(S), Σ ) ≡ R2( Π'Σ̂T(ΠSΠ')Π, Σ ) = R2( Σ̂T(S), ΠΣΠ' ),

so Σ̂U and Σ̂T must have the same (constant) risk function, as asserted.   ¯


[verify], hence

        R2( ½(Σ̂T + Σ̂U), Σ ) < ½ R2(Σ̂T, Σ) + ½ R2(Σ̂U, Σ) = R2(Σ̂T, Σ).

Therefore the estimator ½(Σ̂T + Σ̂U) strictly dominates Σ̂T (and Σ̂U).
The preceding discussion suggests another estimator that strictly dom-
inates ½(Σ̂T + Σ̂U), namely

(5.21)        Σ̂P(S) := (1/p!) ∑_{Π∈P(p)} Π' Σ̂T(ΠSΠ') Π,

where P ≡ P(p) is the subgroup of all p × p permutation matrices. Again
the strict convexity of L2 implies that Σ̂P dominates Σ̂T, in fact [verify!]

        R2(Σ̂P, Σ) < R2( ½(Σ̂T + Σ̂U), Σ ) < R2(Σ̂T, Σ).

5.4. Orthogonally equivariant estimators of Σ.


The estimator Σ̂P(S) in (5.21) is the average over P of the transformed
estimators Π' Σ̂T(ΠSΠ') Π and is itself permutation-equivariant [verify]:

(5.22)        Σ̂P(ΠSΠ') = Π Σ̂P(S) Π'    ∀ Π ∈ P.

Because P is a proper subgroup of the orthogonal group O, the preceding
discussion suggests the following estimator, obtained by averaging over O
itself:

(5.23)        Σ̂O(S) = ∫_O Γ' Σ̂T(ΓSΓ') Γ dν(Γ),

where ν is the Haar probability measure on O, i.e. the unique (left ≡ right)
orthogonally invariant probability measure on O. Since [verify!]

(5.24)        Σ̂O(S) = ∫_O Γ' Σ̂P(ΓSΓ') Γ dν(Γ),
¯


the strict convexity of L2 implies that Σ̂O in turn dominates Σ̂P [verify!]:

        R2(Σ̂O, Σ) < R2(Σ̂P, Σ).

The estimator Σ̂O, first proposed¹¹ by Akimichi Takemura, is orthogo-
nally equivariant: for any Γ ∈ O,

        Σ̂O(ΓSΓ') = ∫_O Ψ' Σ̂T( Ψ(ΓSΓ')Ψ' ) Ψ dν(Ψ)
                  = Γ ∫_O (ΨΓ)' Σ̂T( (ΨΓ)S(ΨΓ)' ) (ΨΓ) dν(Ψ) Γ'
                  =* Γ [ ∫_O Φ' Σ̂T(ΦSΦ') Φ dν(Φ) ] Γ'
(5.25)            = Γ Σ̂O(S) Γ',

where * follows from the substitution Ψ → Φ ≡ ΨΓ and the orthogonal
invariance of ν: dν(Ψ) = dν(ΨΓ) ≡ dν(Φ). The estimator Σ̂O offers greater
improvement over S/n than does Σ̂T(S), often a reduction in risk of 20-30%.

Clearly the unbiased estimator S/n is orthogonally equivariant [verify].


The class of orthogonally equivariant estimators is characterized as follows:

Lemma 5.3. For any S ∈ Sp+ let S = ΓS D_{l(S)} ΓS' be its spectral decom-
position. Here l(S) = (l1(S), . . . , lp(S)) where l1 ≥ · · · ≥ lp (> 0) are the
ordered eigenvalues of S, the columns of ΓS are the corresponding eigen-
vectors, and D_{l(S)} = diag(l1(S), . . . , lp(S)). An estimator Σ̂ ≡ Σ̂(S) is
O-equivariant iff

(5.26)        Σ̂(S) = ΓS D_{φ(l(S))} ΓS',

where D_{φ(l)} = diag(φ1(l1, . . . , lp), . . . , φp(l1, . . . , lp)) with φ1 ≥ · · · ≥ φp > 0.

Proof. For any Γ ∈ O and S ∈ Sp+,

        Γ S Γ' = (ΓΓS) D_{l(S)} (ΓΓS)',

hence Γ_{ΓSΓ'} = ΓΓS and l(ΓSΓ') = l(S). Thus if Σ̂(S) satisfies (5.26) then

        Σ̂(ΓSΓ') = Γ_{ΓSΓ'} D_{φ(l(ΓSΓ'))} Γ_{ΓSΓ'}' = Γ Σ̂(S) Γ',

so Σ̂ is O-equivariant.
Conversely, if Σ̂ is O-equivariant then

(5.27)        Σ̂(S) = ΓS Σ̂(ΓS' S ΓS) ΓS' = ΓS Σ̂(D_{l(S)}) ΓS'.

But
        Σ̂(D_{l(S)}) = D± Σ̂(D_{l(S)}) D±    ∀ D± ≡ diag(±1, . . . , ±1) ∈ O,

hence (recall (5.14)) Σ̂(D_{l(S)}) must be a diagonal matrix whose entries
depend on S only through l(S). That is,

        Σ̂(D_{l(S)}) = D_{φ(l(S))}

for some φ(l(S)) ≡ (φ1(l(S)), . . . , φp(l(S))), so (5.27) yields (5.26).


¯

By (5.5), the risk function R2 (Σ̂, Σ) of an O-equivariant estimator Σ̂


is constant on O-orbits of Sp+ , hence satisfies

(5.28)        R2(Σ̂, Σ) = R2(Σ̂, D_{λ(Σ)}),

where λ(Σ) ≡ (λ1 (Σ) ≥ . . . ≥ λp (Σ) (> 0)) is the vector of the ordered
eigenvalues of Σ. Thus, by restricting consideration to orthogonally equiv-
ariant estimators, the problem of estimating Σ reduces to that of estimating
the population eigenvalues λ(Σ) based on the sample eigenvalues l(S).

Exercise 5.5. (Takemura). When p = 2, show that Σ̂O(S) has the form
(5.26) with

(5.29)   φ1(l1, l2) = [ (√l1 δ̂T,1)/(√l1 + √l2) + (√l2 δ̂T,2)/(√l1 + √l2) ] l1,

         φ2(l1, l2) = [ (√l2 δ̂T,1)/(√l1 + √l2) + (√l1 δ̂T,2)/(√l1 + √l2) ] l2,


where δ̂T,1 = 1/(n+1) and δ̂T,2 = 1/(n−1) (set p = 2 in (5.17)).
¯

Because 1/(n+1) < 1/n < 1/(n−1) and l1 > l2, Σ̂O “shrinks” the largest
eigenvalue of S/n and “expands” its smallest eigenvalue when p = 2 [verify],
and Takemura showed that this remains true of Σ̂O for all p ≥ 2.
Stein has argued that the shrinkage/expansion should be stronger than
that given by Σ̂O. For example, he suggested that for any p ≥ 2, if consid-
eration is restricted to orthogonally invariant estimators having the simple
form φi(l1, . . . , lp) = ci li for constants ci > 0, then the best choice of ci is
given by (recall (5.17))

(5.30)        ci = δ̂T,i = 1/(n + p + 1 − 2i),    i = 1, . . . , p.

Several reasons why such shrinkage/expansion is a desirable property


for orthogonally equivariant estimators are now presented.
First, the extremal representations

(5.31)        l1(S) = max_{x'x=1} x'Sx,
(5.32)        lp(S) = min_{x'x=1} x'Sx,

show that l1(S) and lp(S) are, respectively, convex and concave functions
of S [verify]. Thus by Jensen’s inequality,

(5.33)        EΣ[l1(S)] ≥ l1[E(S)] = l1(nΣ) ≡ n λ1(Σ),
(5.34)        EΣ[lp(S)] ≤ lp[E(S)] = lp(nΣ) ≡ n λp(Σ).

Thus l1/n tends to overestimate λ1 and should be shrunk, while lp/n tends to
underestimate λp and should be expanded. This holds for the other eigen-
values also: l2/n, l3/n, . . . should be shrunk while lp−1/n, lp−2/n, . . . should be
expanded.
Next from (3.53) and the concavity of log x,

        E[ ∏_{i=1}^p (1/n) li(S) ] = ∏_{i=1}^p λi(Σ) · ∏_{i=1}^p (n−p+i)/n
                                   ≤ ∏_{i=1}^p λi(Σ) · ( 1 − (p−1)/(2n) )^p
(5.35)                             ≤ ∏_{i=1}^p λi(Σ) · e^{−p(p−1)/(2n)}.


Thus ∏_{i=1}^p (1/n) li(S) will tend to underestimate ∏_{i=1}^p λi(Σ) unless n ≫ p²,
which does not usually hold in applications. This suggests that the shrink-
age/expansion of the sample eigenvalues should not be done in a linear
manner: the smaller li(S)/n’s should be expanded proportionately more than
the larger li(S)/n’s should be shrunk.
A more precise justification is based on the celebrated ”semi-circle” law
[draw figure] of the mathematical physicist E. P. Wigner, since extended by
many others. A strong consequence of these results is that when Σ = λ Ip
(equivalently, λ1(Σ) = · · · = λp(Σ) = λ) and both n, p → ∞ while p/n → η
for some fixed η ∈ (0, 1], then

(5.36)        (1/n) l1(S) → λ (1 + √η)²    a.s.,
(5.37)        (1/n) lp(S) → λ (1 − √η)²    a.s.

Thus if it were known that Σ = λ Ip then l1(S)/n should be shrunk by
the factor 1/(1 + √η)² while lp(S)/n should be expanded by the factor
1/(1 − √η)². Furthermore, the expansion is proportionately greater than
the shrinkage since

        1/(1 + √η)² · 1/(1 − √η)² = 1/(1 − η)² > 1.

Note that these two desired shrinkage factors for l1(S)/n and lp(S)/n
are even more extreme than n c1 ≡ n δ̂T,1 and n cp ≡ n δ̂T,p from (5.30):

(5.38)        1 > n δ̂T,1 ≡ n/(n + p − 1) ≈ 1/(1 + η) > 1/(1 + √η)²,
(5.39)        1 < n δ̂T,p ≡ n/(n − p + 1) ≈ 1/(1 − η) < 1/(1 − √η)².

The shrinkage and expansion factors in (5.36) and (5.37) are derived
only for the case Σ = λ Ip (the “worst case” in that the most shrink-
age/expansion is required). In general the appropriate shrinkage/expansion
factors (equivalently, the functions φ1 , . . . , φp in (5.26)) depend on the (un-
known) empirical distribution of λ1 (Σ), . . . , λp (Σ) so must themselves be
estimated adaptively. Stein12 proposed the following adaptive eigenvalue
12
I first learned of this result at Stein’s 1975 IMS Rietz Lecture in Atlanta, which
remains unpublished in English - Stein published his results in a Russian journal in
1977. I have copies of his handwritten lecture notes from his courses at Stanford and U.
of Washington. Similar results were later obtained independently by Len Haff at UCSD.


estimators:

(5.40)        φ*i(l1, . . . , lp) = li / [ n − p + 1 + 2 li ∑_{j≠i} 1/(li − lj) ]⁺.

The term inside the large parentheses can be negative, hence its positive part
is taken. Also the required ordering φ*1 > . . . > φ*p need not hold, in which
case the ordering is achieved by an isotonization algorithm – see Lin and
Perlman (1985) for details. Despite these complications, Stein’s estimator
offers substantial improvement over the other estimators considered thus
far – the reduction in risk can be 70-90% when Σ ≈ λ Ip!
If the population eigenvalues are widely dispersed, i.e.,

(5.41)        λ1(Σ) ≫ · · · ≫ λp(Σ),

then the sample eigenvalues {li} will also be widely dispersed, so

        li ∑_{j≠i} 1/(li − lj) = ∑_{j>i} li/(li − lj) + ∑_{j<i} li/(li − lj) ≈ (p − i) + 0,

in which case (5.40) reduces to [verify]

(5.42)        φi(l1, . . . , lp) = li /(n + p + 1 − 2i) ≡ δ̂T,i li

(recall (5.30)). On the other hand, if two or more λi(Σ)’s are nearly equal
then the same will be true for the corresponding li’s, in which case the
shrinkage/expansion offered by the φ*i’s will be more pronounced than in
(5.42), a desirable feature as indicated by (5.38) and (5.39).
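
A sketch of (5.40) in Python/NumPy (without the isotonization step, which the notes leave to Lin and Perlman (1985); the clipping floor used when the bracketed term is nonpositive is my own ad hoc choice, and the function name is mine):

    import numpy as np

    def stein_eigenvalue_estimates(S, n):
        """Raw Stein adaptive eigenvalue estimates (5.40); no isotonization."""
        l = np.linalg.eigvalsh(S)[::-1]                      # l_1 >= ... >= l_p
        p = len(l)
        phi = np.empty(p)
        for i in range(p):
            corr = sum(1.0 / (l[i] - l[j]) for j in range(p) if j != i)
            denom = n - p + 1 + 2 * l[i] * corr
            phi[i] = l[i] / max(denom, 1e-8)                 # positive part of the bracket
        return phi

    rng = np.random.default_rng(6)
    p, n = 5, 20
    X = rng.standard_normal((p, n))
    S = X @ X.T                                              # true Sigma = I, eigenvalues all 1
    print(np.round(np.linalg.eigvalsh(S)[::-1] / n, 3))      # sample eigenvalues of S/n
    print(np.round(stein_eigenvalue_estimates(S, n), 3))     # pulled back toward each other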

Remark. When p ≥ 3 it is difficult to evaluate the integral for Takemura’s


estimator Σ̂O (S) in (5.23). However, the integral can be approximated by
Monte Carlo simulation from the Haar probability distribution over O. This
can be accomplished as follows:

Lemma 5.6. Let X ∼ Np×p(0, Ip ⊗ Ip). The distribution of the ran-
dom orthogonal matrix Γ ≡ (XX')^{−1/2} X is the Haar measure on O, i.e.,
the unique left ≡ right orthogonally invariant probability measure on the
compact topological group O.

Proof. It suffices to show that the distribution is right orthogonally invari-
ant, i.e., that Γ ∼ Γ Ψ for all Ψ ∈ O. But this holds since

        Γ Ψ = [(XΨ)(XΨ)']^{−1/2} (XΨ) ∼ (XX')^{−1/2} X = Γ.
¯
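
A sketch of Lemma 5.6 together with the Monte Carlo approximation of (5.23) in Python/NumPy (the helper names are mine, and the number of Monte Carlo draws is an arbitrary choice):

    import numpy as np

    def stein_GT_estimator(S, n):
        p = S.shape[0]
        T = np.linalg.cholesky(S)
        d = 1.0 / (n + p + 1 - 2 * np.arange(1, p + 1))
        return T @ np.diag(d) @ T.T

    def haar_orthogonal(p, rng):
        """Gamma = (XX')^{-1/2} X with X ~ N_{pxp}(0, I (x) I), as in Lemma 5.6."""
        X = rng.standard_normal((p, p))
        lam, V = np.linalg.eigh(X @ X.T)
        return V @ np.diag(lam ** -0.5) @ V.T @ X            # symmetric (XX')^{-1/2} times X

    def takemura_estimator(S, n, rng, n_mc=2000):
        """Monte Carlo approximation of the integral (5.23)."""
        p = S.shape[0]
        out = np.zeros((p, p))
        for _ in range(n_mc):
            G = haar_orthogonal(p, rng)
            out += G.T @ stein_GT_estimator(G @ S @ G.T, n) @ G
        return out / n_mc

    rng = np.random.default_rng(7)
    p, n = 3, 12
    X = rng.standard_normal((p, n))
    S = X @ X.T
    print(np.round(takemura_estimator(S, n, rng), 3))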


6. Invariant Tests of Hypotheses. (See Lehmann TSH Ch. 6, 8.)


Motivation for invariant tests (and equivariant estimators):
(a) Respect the symmetries of a statistical problem.
(b) Unbiasedness fails to yield a UMPU test when testing more than one
parameter. Restricting to invariant tests sometimes leads to a UMPI test,
but at least reduces the class of tests to be compared.

6.1. Invariant statistical models and maximal invariant statistics.


A statistical model is a family P of probability distributions defined on a
sample space (X , A), where A is the sigma-field of measurable subsets of
X . Often P has a parametric representation: P = {Pθ | θ ∈ Θ}. (The
parameterization is assumed to be identifiable.)
Let G be a group of measurable mappings of X into itself. Then G
acts on X if
(1) (g1 g2 )x = g1 (g2 x) ∀g1 , g2 ∈ G, ∀x ∈ X .
(2) 1G x = x ∀x ∈ X . (1G denotes the identity element in G.)
Here (1) and (2) imply that the mapping g : X → gX is a bijection ∀ g ∈ G.

Definition 6.1. Suppose that G acts on X . For x ∈ X , the G-orbit of x


is the subset Gx := {gx | g ∈ G} ⊆ X , i.e., the set of all images of x under
the actions in G. The orbit space

X /G := {Gx | x ∈ X }

is the set of all G-orbits. The orbit projection π is the mapping

π : X → X /G
x → Gx.

Trivially, π is a G-invariant function, that is, π is constant on G-orbits:

π(x) = π(gx) ∀x, g.

[Since G itself is invariant under group multiplication: {gg  | g  ∈ G} = G.]


Definition 6.2. A function t : X → T is a maximal invariant statistic


(MIS) if it is equivalent to the orbit projection π, i.e., if t is constant on G-
orbits and distinguishes G-orbits (takes different values on different orbits.)

Lemma 6.3. Suppose that t : X → T satisfies


(3) t is G-invariant;
(4) if u : X → U is G-invariant, i.e., satisfies u(x) = u(gx) ∀x, g, then u
depends on x only through the value of t(x), i.e., u(x) = w(t(x)) for some
function w : T → U.
Then t is a maximal invariant statistic.
Proof. We need only show that t distinguishes G-orbits. This follows from
(4) with u = π.
¯

If G acts on X then G acts on P as follows: gP := P ◦ g −1 , that is,

(gP )(A) := P (g −1 (A)) ∀A ∈ A.

Equivalently, if X ∼ P then gX ∼ gP .

Definition 6.4. The statistical model P is G-invariant if gP ⊆ P ∀g ∈ G.

If P is G-invariant then by (1) and (2),


(5) (g1 g2 )P = g1 (g2 P ) ∀g1 , g2 ∈ G, ∀P ∈ P. [since (g1 g2 )−1 = g2−1 g1−1 ]
(6) 1G P = P ∀P ∈ P.
Then (5) implies that

P = g(g −1 P) ⊆ gP ∀g ∈ G,

so gP = P ∀g and the mapping g : P → gP is a bijection for each g ∈ G.


Furthermore, if P has a parametric representation {Pθ | θ ∈ Θ} then,
equivalently, G acts on Θ according to

Pgθ := gPθ ≡ Pθ ◦ g −1 .

Also equivalently, if X ∼ Pθ then gX ∼ Pgθ . In this case, (5) and (6)


become


(7) (g1 g2 )θ = g1 (g2 θ) ∀g1 , g2 ∈ G, ∀θ ∈ Θ.


(8) 1G θ = θ ∀θ ∈ Θ. (Thus, GΘ = Θ.)
Again, (7) and (8) imply that gΘ = Θ ∀g and the mapping g : Θ → gΘ
is a bijection for each g ∈ G. Note that if dPθ (x) = f (x, θ)dx then the
G-invariance of P is equivalent to
 
(9)   f(x, θ) = f(gx, gθ) |∂(gx)/∂x|    [verify].

Definition 6.5. Assume that P ≡ {Pθ | θ ∈ Θ} is G-invariant. For θ ∈ Θ,


the G-orbit of θ is the subset Gθ := {gθ | g ∈ G} ⊆ Θ. A function τ : Θ → Ξ
is a maximal invariant parameter (MIP) if it is constant on G-orbits and
distinguishes G-orbits.
¯

As in Lemma 6.3, τ is a maximal invariant parameter iff τ is G-invariant


and any G-invariant parameter σ(θ) depends on θ only through the value
of τ ≡ τ (θ).

Lemma 6.6. Assume that u : X → U is G-invariant. Then the distribution


of u depends on θ only through the value of the maximal invariant parameter
τ . (In particular, the distribution of a maximal invariant statistic t depends
only on τ .)
Proof. We need only show that the distribution of u is G-invariant. But
this is immediate, since for any measurable subset B ⊆ U,

Pgθ [ u(X) ∈ B] = Pθ [ u(gX) ∈ B] = Pθ [ u(X) ∈ B].

6.2. Invariant hypothesis testing problems.


Suppose that P ≡ {Pθ | θ ∈ Θ} is G-invariant and we wish to test

(6.1) H0 : θ ∈ Θ0 vs. H : θ ∈ Θ \ Θ0

based on X, where Θ0 is a proper subset of Θ such that P0 ≡ {Pθ | θ ∈ Θ0 }


is also G-invariant. Then (6.1) is called a G-invariant testing problem. A
sensible approach to such a testing problem is to respect the symmetry of
the problem (i.e., its G-invariance) and restrict attention to test statistics
that are G-invariant. Equivalently, this leads us to consider the “invariance-
reduced” problem where we test H0 vs. H based on the value of a MIS


t ≡ t(x) rather than on the value of x itself. In general this may entail
a loss of information, but optimal invariant tests often (but not always)
remain admissible among all possible tests.
Because P0 and P are G-invariant, the invariance-reduced testing prob-
lem can be restated equivalently as that of testing

(6.2) H0 : τ ∈ Ξ0 vs. H : τ ∈ Ξ \ Ξ0

based on a MIS t, for appropriate sets Ξ0 and Ξ in the range of the MIP τ .
Our goal will be to determine the distribution of the MIS t and apply the
principles of hypothesis testing to (6.2). In particular, if a UMP test exists
for (6.2), it is called UMP invariant (UMPI) with respect to G for (6.1).
In cases where the class of invariant tests still so large that no UMPI
test exists, the likelihood ratio test (LRT) for (6.1), which rejects H0 for
large values of the LRT statistic

maxΘ f (x, θ)
Λ(x) := ,
maxΘ0 f (x, θ)

is often a satisfactory G-invariant test.

Lemma 6.7. The LRT statistic is G-invariant:

Λ(gx) = Λ(x) ∀ g ∈ G.

Proof. Apply property (9) in §6.1.


¯

Example 6.8. Testing a mean vector with known covariance ma-


trix: one observation.
Consider the problem of testing

(6.3)        µ = 0 vs. µ ≠ 0 based on X ∼ Np(µ, Ip).

Here X = Θ = Rp and Θ0 = {0}. Let G = Op ≡ the group of all p × p


orthogonal matrices g acting on X and Θ via

X → gX and µ → gµ,


respectively. Because

gX ∼ Np (gµ, gg  ≡ Ip ),

Θ and Θ0 are G-invariant. For X, µ ∈ X ≡ Rp, the G-orbits of X and µ
are the spheres

        {y ∈ Rp : ‖y‖ = ‖X‖}    and    {ν ∈ Rp : ‖ν‖ = ‖µ‖},

respectively, so

        t ≡ t(X) = ‖X‖²    and    τ ≡ τ(µ) = ‖µ‖²

represent the MIS and MIP, resp. The distribution of t is χ²p(τ), the non-
central chi-square distribution with noncentrality parameter τ. Any G-
invariant statistic depends on X only through ‖X‖², and its distribution
depends on µ only through ‖µ‖². The invariance-reduced problem (6.2)
becomes that of testing

(6.4)        τ = 0 vs. τ > 0 based on ‖X‖² ∼ χ²p(τ).

Since χ²p(τ) has monotone likelihood ratio (MLR) in τ (see Appendix A on
MLR, Example A.14), by the Neyman-Pearson (NP) Lemma the uniformly
most powerful (UMP) level α test for (6.4) rejects ‖µ‖² = 0 if

        ‖X‖² > χ²p;α,

the upper α quantile of the χ2p distribution, and is unbiased. Thus this test
is UMPI level α for (6.3) and is unbiased for (6.3).
¯

Exercise 6.9. (a) In Example 6.8 show that the UMP invariant level α
test is the level α LRT based on X for (6.3).
(b) The power function of this LRT is given by

βp (τ ) := Pr τ [ X2 > χ2p; α ] ≡ Pr[ χ2p (τ ) > χ2p; α ].

It follows from the MLR property (or the log concavity of the normal pdf)
that βp (τ ) is increasing in τ , hence this test is unbiased. Show that for fixed
τ , βp (τ ) is decreasing in p. Hint: apply the NP Lemma.


(c) (Kiefer and Schwartz (1965) Ann. Math. Statist.) Show that the LRT
is a proper Bayes test for (6.3), and therefore is admissible among all tests
for (6.3).
Hint: consider the following prior distribution:

        Pr[ µ = 0 ] = γ,    Pr[ µ ≠ 0 ] = 1 − γ,
        µ | µ ≠ 0 ∼ Np(0, λIp),    (0 < γ < 1, λ > 0).

Example 6.10. Testing a mean vector with unknown covariance


matrix: one observation.
Consider the problem of testing

        µ = 0 vs. µ ≠ 0 based on X ∼ Np(µ, Σ)

with Σ > 0 unknown. Here

X = Rp , Θ = Rp × Sp+ , Θ0 = {0} × Sp+ .

Now we may take G = GL(p), the group of all p × p nonsingular matrices


g, acting on X and Θ via

X → gX and (µ, Σ) → (gµ, gΣg  )

respectively. Again Θ and Θ0 are G-invariant. Now there are only two G-
orbits in X : {0} and Rp \{0} [why?], so any G-invariant statistic is constant
on Rp \ {0}, hence its distribution does not depend on µ. Thus there is
no G-invariant test that can distinguish between the hypotheses µ = 0 and
µ ≠ 0 on the basis of a single observation X when Σ is unknown. ¯


Example 6.11. Testing a mean vector with unknown covariance


matrix: n + 1 observations.
Consider the problem of testing

(6.5)   µ = 0 vs. µ ≠ 0 based on (Y, W) ∼ Np(µ, Σ) × Wp(n, Σ)

with Σ > 0 unknown and n ≥ p. Here

X = Θ = Rp × Sp+ , Θ0 = {0} × Sp+ .

Let G = GL act on X and Θ via

(Y, W ) → (gY, gW g  ) and (µ, Σ) → (gµ, gΣg  ),

respectively. Because

(gY, gW g  ) ∼ Np (gµ, gΣg  ) × Wp (n, gΣg  ),

Θ and Θ0 are G-invariant. It follows from Lemma 6.3 that

        t ≡ t(Y, W) := Y'W^{−1}Y    and    τ ≡ τ(µ, Σ) := µ'Σ^{−1}µ

represent the MIS and MIP, respectively [verify!]. We have seen that

        Hotelling's T² ≡ Y'W^{−1}Y ∼ χ²p(τ) / χ²_{n−p+1},

the ratio of two independent chisquare variates, the first noncentral. (This is
the (nonnormalized) noncentral F distribution Fp, n−p+1(τ).) The invariance-
reduced problem (6.2) becomes that of testing

(6.6)        τ = 0 vs. τ > 0 based on T² ∼ Fp, n−p+1(τ).

Because Fp, n−p+1 (τ ) has MLR in τ (see Example A.15), the UMP level α
test for (6.6) rejects τ = 0 if T 2 > Fp, n−p+1; α and is unbiased. Thus this
test is UMPI level α for (6.5), and is unbiased for (6.5).
¯
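
A small sketch of the T² test in this canonical (Y, W) form (Python/SciPy; since here T² is the nonnormalized ratio of chi-squares, it is rescaled by (n−p+1)/p before comparison with the usual F quantile; the function name is mine):

    import numpy as np
    from scipy.stats import f as f_dist

    def t2_test(Y, W, n, alpha=0.05):
        """Canonical-form Hotelling T^2 test of mu = 0 based on (Y, W), Example 6.11."""
        p = len(Y)
        T2 = float(Y @ np.linalg.solve(W, Y))
        stat = T2 * (n - p + 1) / p                    # normalized: ~ F_{p, n-p+1} under H0
        p_value = f_dist.sf(stat, p, n - p + 1)
        return T2, p_value, p_value < alpha

    rng = np.random.default_rng(8)
    p, n = 4, 30
    Y = rng.standard_normal(p)                         # mu = 0, Sigma = I
    Z = rng.standard_normal((p, n))
    W = Z @ Z.T                                        # W ~ W_p(n, I)
    print(t2_test(Y, W, n))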


Exercise 6.12. (a) In Example 6.11, show that the UMP invariant level α
test ( ≡ the T 2 test) is the level α LRT based on (Y, W ) for (6.5).
(b) The power function of this LRT is given by

βp, n−p+1 (τ ) := Pr τ [ T 2 > Fp, n−p+1;α ] ≡ Pr[ Fp, n−p+1 (τ ) > Fp, n−p+1;α ].

It follows from MLR that βp, n−p+1 (τ ) is increasing in τ , hence this test is
unbiased. Show that for fixed τ and p, βp, n−p+1 (τ ) is increasing in n.
(c)* (Kiefer and Schwartz (1965) Ann. Math. Statist.). Show that the
LRT is a proper Bayes test for testing (6.5) based on (Y, W ), and thus is
admissible among all tests for (6.5).
Hint: consider the prior probability distribution on Θ0 ∪ Θ given by

        Pr[Θ0] = γ,    Pr[Θ] = 1 − γ    (0 < γ < 1);
        (µ, Σ) | Θ0 ∼ π0,    (µ, Σ) | Θ ∼ π,

where π0 and π are measures on Θ0 ≡ {0} × Sp+ and Θ ≡ Rp × Sp+ respec-
tively, defined as follows: π0 assigns all its mass to points of the form

        (µ, Σ) = (0, (Ip + ηη')^{−1}),    η ∈ Rp,

where η has pdf proportional to |Ip + ηη'|^{−(n+1)/2}; π assigns all its mass to
points of the form

        (µ, Σ) = ((Ip + ηη')^{−1}η, (Ip + ηη')^{−1}),    η ∈ Rp,

where η has pdf proportional to

        |Ip + ηη'|^{−(n+1)/2} exp( (1/2) η'(Ip + ηη')^{−1}η ).

Verify that π0 and π are proper measures, i.e., verify that the corresponding
pdfs of η have finite total mass. Show that the T 2 test is the Bayes test for
this prior distribution.
¯


Note: An entirely different method for showing the admissibility of the T 2


test among all tests for (6.5) was given by Stein (Ann. Math. Statist. 1956),
based on the exponential structure of the distribution of (Y, W ).

Example 6.13. Testing a mean vector with covariates and un-


known covariance matrix.
Similar to Example 6.11, but with the following changes. Partition Y, W,
µ, and Σ as

        Y = ( Y1 ),   W = ( W11  W12 ),   µ = ( µ1 ),   Σ = ( Σ11  Σ12 ),
            ( Y2 )        ( W21  W22 )        ( µ2 )        ( Σ21  Σ22 )

respectively, where Yi and µi are pi × 1, Wij and Σij are pi × pj, i, j = 1, 2,


where p1 + p2 = p. Suppose it is known that µ2 = 0, that is, the second
group of p2 variables are covariates. Consider the problem of testing

(6.7)   µ1 = 0 vs. µ1 ≠ 0    based on (Y, W) ∼ Np(µ, Σ) × Wp(n, Σ)

with Σ > 0 unknown and n ≥ p. Again X = Rp × Sp+ , but now

Θ = Rp1 × Sp+ , Θ0 = {0} × Sp+ .

Let G1 be the set of all non-singular block-triangular p × p matrices of the
form
        g = ( g11  g12 ),
            (  0   g22 )
so G1 is a subgroup of the invariance group GL in Example 6.11. Here
G1 ≡ {g} acts on X and Θ via the actions

(Y, W ) → (gY, gW g  ) and (µ1 , Σ) → (g11 µ1 , gΣg  ),

respectively. Then Θ and Θ0 are G1 -invariant [verify].


¯

Exercise 6.14. (a) In Example 6.13, apply Lemma 6.3 to show that

        (L, M) ≡ (L(Y, W), M(Y, W))

            := ( (Y1 − W12 W22^{−1} Y2)' W11·2^{−1} (Y1 − W12 W22^{−1} Y2) / (1 + Y2' W22^{−1} Y2),    Y2' W22^{−1} Y2 )


is a (two-dimensional!) MIS, while

        τ1 ≡ τ1(µ1, Σ) := µ1' Σ11·2^{−1} µ1

is a (one-dimensional!) MIP. Thus the invariance-reduced problem (6.2)
becomes that of testing

(6.8)        τ1 = 0 vs. τ1 > 0 based on (L, M).

(b) Show that the joint distribution of (L, M) ≡ (L(Y, W), M(Y, W)) can
be described as follows:

        L | M ∼ χ²p1( τ1/(1+M) ) / χ²_{n−p+1} ≡ F_{p1, n−p+1}( τ1/(1+M) ),
(6.9)
        M ∼ χ²p2 / χ²_{n−p2+1} ≡ F_{p2, n−p2+1}.

−1
Hint: Begin by finding the conditional distribution of Y1 −W12 W22 Y2 given
(Y2 , W22 ).
(c) Show that the level α LRT based on (Y, W ) for (6.7) is the test that
rejects (µ1 , µ2 ) = (0, 0) if

L > Fp1 , n−p+1; α .

This test is the conditionally UMP level α test for (6.8) given the ancillary
statistic M and is conditionally unbiased for (6.8), therefore unconditionally
unbiased for (6.7) ≡ (6.8).
(d)** Show that no UMP size α test exists for (6.8), so no UMPI test exists
for (6.7). Therefore the LRT is not UMPI. (See Remark 6.16).
(e)* In Exercise 6.12b, show βp,m (τ ) is decreasing in p for fixed τ and m.
Hint: Apply the results (6.9) concerning the joint distribution of (L, M )
¯

Remark 6.15. Since T 2 ≡ Y  W −1 Y = L(1 + M ) + M , the overall T 2 test


in Example 6.11 is also G1 -invariant in Example 6.13, so it is of interest to


compare its power function to that of the LRT in Example 6.13. Given M ,
the conditional power function of the LRT is given by
        Prτ[ F_{p1, n−p+1}( τ1/(1+M) ) > F_{p1, n−p+1; α} | M ] ≡ β_{p1, n−p+1}( τ1/(1+M) ),

while the (unconditional) power of the size-α T 2 test is βp, n−p+1 (τ1 ) because
τ = τ1 when µ2 = 0. Since βp,m (δ) is decreasing in p but increasing in δ
(recall Exercises 6.12b, 6.14e), neither power function dominates the other.
Another possible test in Example 6.13 rejects (µ1 , µ2 ) = (0, 0) iff
        T1² := Y1' W11^{−1} Y1 > F_{p1, n−p1+1; α},

a test that ignores the covariate information and is not G1-invariant [verify].
Since
        T1² ∼ F_{p1, n−p1+1}(τ̃1),

where τ̃1 := µ1' Σ11^{−1} µ1, the power function of the level α test based on T1² is
βp1 , n−p1 +1 (τ̃1 ). Because τ̃1 ≤ τ1 but βp,m (δ) is decreasing in p and increas-
ing in m, the power function of T12 neither dominates nor is dominated by
that of the LRT or of T 2 .
¯

Remark 6.16. Despite their apparent similarity, the invariant testing


problems (6.6) and (6.8) are fundamentally different, due to the fact that
in (6.8) the dimensionality of the MIS (L, M ) exceeds that of the MIP τ1 .
Marden and Perlman (1980) (Ann. Statist.) show that in Example 6.13,
no UMP invariant test exists, and the level α LRT is actually inadmissible
for typical (= small) α values, due to the fact that it does not make use of
the information in the ancillary statistic M . Nonetheless, use of the LRT
is recommended on the basis that it is the UMP conditional test given M ,
it is G1 -invariant, its power function compares well numerically to those of
T 2 , T12 , and other competing tests, and it is easy to apply.
¯

Exercise 6.17. Let (Y, W ) be as in Examples 6.11 and 6.13. Consider the
problem of testing µ2 = 0 vs. µ2 ≠ 0 with µ1 and Σ unknown. Find a
natural invariance group G2 such that the test that rejects µ2 = 0 if

        T2² := Y2' W22^{−1} Y2 > F_{p2, n−p2+1; α}

is UMP among all G2 -invariant level α tests.


¯


Example 6.18. Testing a covariance matrix.


Consider the problem of testing

(6.10)   Σ = Ip vs. Σ ≠ Ip based on S ∼ Wp(r, Σ) (r ≥ p).

Here X = Θ = Sp+ and Θ0 = {Ip }. This problem is invariant under the


action of G ≡ Op on Sp+ given by S → gSg  . It follows from Lemma
6.3 and the spectral decomposition of Σ ∈ Sp+ that the MIS and MIP are
represented by, respectively,

l(S) ≡ (l1 (S) ≥ · · · ≥ lp (S)) := the set of (ordered) eigenvalues of S,


λ(Σ) ≡ (λ1 (Σ) ≥ · · · ≥ λp (Σ)) := the set of (ordered) eigenvalues of Σ.

[verify!]. By Lemma 6.6, the distribution of l(S) depends on Σ only through


λ(Σ); this distribution is complicated when Σ is not of the form κIp for some
κ > 0. The invariance-reduced problem is that of testing

(6.11)   λ(Σ) = (1, . . . , 1) vs. λ(Σ) ≠ (1, . . . , 1) based on l(S).

Here, unlike Examples 6.8, 6.11, and 6.13, when p ≥ 2 the alternative
hypothesis remains multi-dimensional even after reduction by invariance,
so it is not to be expected that a UMPI test exists (it does not).
¯

Exercise 6.19a. In Example 6.18 derive the LRT for (6.10). Express the
test statistic in terms of l(S).
Answer: The LRT rejects Σ = Ip for large values of e^{tr S}/|S|, or equivalently,
for large values of

        ∑_{i=1}^p ( li(S) − log li(S) − 1 ).

Exercise 6.19b. Suppose that Σ = Cov(X). Show that

        λ1(Σ) = max_{‖a‖=1} Var(a'X) ≡ max_{‖a‖=1} a'Σa.

The maximizing linear combination a'X is the first principal component of X.


Hint: Apply the spectral decomposition of Σ.
¯


Exercise 6.20. Testing sphericity. Change (6.10) as follows: test

(6.12)   Σ = κ Ip, 0 < κ < ∞  vs.  Σ ≠ κ Ip,  based on S ∼ Wp(r, Σ).

Show that this problem remains invariant under the extended group

        Ḡ := { ḡ = ag | a > 0, g ∈ Op }.

Express a MIS and MIP for this problem in terms of l(S) and λ(Σ) respec-
tively. Find the LRT for this problem and express it in terms of l(S).
(The hypothesis Σ = κIp, 0 < κ < ∞, is called the hypothesis of sphericity.)

Answer: The LRT rejects the sphericity hypothesis for large values of
( (1/p) tr S ) / |S|^{1/p}, or equivalently, for large values of

        [ (1/p) ∑_{i=1}^p li(S) ] / [ ∏_{i=1}^p li(S) ]^{1/p},

the ratio of the arithmetic and geometric means of l1(S), . . . , lp(S).


¯
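
A quick sketch of the sphericity statistic (Python/NumPy; the function name is mine), returning the arithmetic/geometric mean ratio of the sample eigenvalues:

    import numpy as np

    def sphericity_statistic(S):
        """AM/GM ratio of the eigenvalues of S; large values reject Sigma = kappa * I."""
        l = np.linalg.eigvalsh(S)
        return l.mean() / np.exp(np.log(l).mean())

    rng = np.random.default_rng(9)
    p, r = 4, 25
    X = rng.standard_normal((p, r))
    print(sphericity_statistic(X @ X.T))                 # spherical case: ratio near 1
    X[0] *= 3.0                                          # inflate one variance
    print(sphericity_statistic(X @ X.T))                 # noticeably larger ratio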

Exercise 6.21. If the identity matrix Ip is replaced by any fixed matrix
Σ0 ∈ Sp+, show that the results in Exercises 6.19 and 6.20 can be applied af-
ter the linear transformations S → Σ0^{−1/2} S Σ0^{−1/2}' and Σ → Σ0^{−1/2} Σ Σ0^{−1/2}'.

Example 6.22. Testing independence of two sets of variates.


In the setting of Example 6.18, partition S and Σ as

        S = ( S11  S12 ),      Σ = ( Σ11  Σ12 ),
            ( S21  S22 )           ( Σ21  Σ22 )

respectively. Here, Sij and Σij are pi × pj matrices, i, j = 1, 2, where
p1 + p2 = p. Let Θ = Sp+ as before, but now take

        Θ0 = {Σ ∈ Sp+ | Σ12 = 0},

so (6.1) becomes the problem of testing

(6.13)   Σ12 = 0 vs. Σ12 ≠ 0 based on S ∼ Wp(n, Σ) (n ≥ p).


If G is the group of non-singular block-diagonal p × p matrices of the form


 
        g = ( g11   0  )
            (  0   g22 )

(so G = GL(p1) × GL(p2)), then (6.13) is invariant under the action of G on
Sp+ given by S → gSg' [verify]. It follows from Lemma 6.3 and the singular
value decomposition that a MIS is [verify!]

        r(S) ≡ (r1(S) ≥ · · · ≥ rq(S)) := the singular values of S11^{−1/2} S12 S22^{−1/2}',

the canonical correlation coefficients of S, and a MIP is [verify!]

        ρ(Σ) ≡ (ρ1(Σ) ≥ · · · ≥ ρq(Σ)) := the singular values of Σ11^{−1/2} Σ12 Σ22^{−1/2}',

the canonical correlation coefficients of Σ, where q = min{p1 , p2 } (see Ex-


ercise 6.25).
The distribution of r(S) depends on Σ only through ρ(Σ); it is com-
plicated when Σ12 = 0. The invariance-reduced problem is that of testing

(6.14) ρ(Σ) = (0, . . . , 0) vs. ρ(Σ) ≥ (0, . . . , 0) based on r(S).

When p ≥ 2 the alternative hypothesis remains multi-dimensional even after


reduction by invariance, so a UMPI test for (6.13) does not exist. ¯

Remark 6.23. This model and testing problem can be reduced to the
multivariate linear model and MANOVA testing problem (see Remark 8.5)
by conditioning on S22 :

(6.15)   Y := S12 S22^{−1/2}' | S22 ∼ N_{p1×p2}( β S22^{1/2}, Σ11·2 ⊗ Ip2 ),

where β = Σ12 Σ22^{−1}. Since Σ12 = 0 iff β = 0, the present testing problem


is equivalent to that of testing β = 0 vs. β = 0 based on (Y, S11·2 ), a
MANOVA testing problem under the conditional distribution of Y .
¯

Exercise 6.24. In Example 6.22 find the LRT for (6.13). Express the
test statistic in terms of r(S). Show this LRT statistic is equivalent to


the conditional LRT statistic for testing β = 0 vs. β = 0 based on the


−1/2 
conditional distribution of (S12 S22 , S11·2 ) given S22 (see Exercise 6.37a).
Show that when Σ12 = 0, the conditional and unconditional distributions
of the LRT statistic are identical. [This distribution can be expressed in
terms of Wilks’ distribution U (p1 , p2 , n − p2 ) – see Exercises 6.37c, d, e.]
Partial answer: The (unconditional and conditional) LRT rejects Σ12 = 0
for large values of

        |S11| |S22| / |S|,

or equivalently, for small values of

(6.16)        ∏_{i=1}^q ( 1 − r²i(S) ).
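
A sketch computing the canonical correlations and the statistic (6.16) in Python/NumPy (the function name is mine; the identity ∏(1 − r²i) = |S| / (|S11||S22|) provides a cross-check):

    import numpy as np

    def canonical_correlations(S, p1):
        """Singular values of S11^{-1/2} S12 S22^{-1/2} (the r_i(S) of Example 6.22)."""
        S11, S12, S22 = S[:p1, :p1], S[:p1, p1:], S[p1:, p1:]
        A = (np.linalg.inv(np.linalg.cholesky(S11)) @ S12
             @ np.linalg.inv(np.linalg.cholesky(S22)).T)
        return np.linalg.svd(A, compute_uv=False)

    rng = np.random.default_rng(10)
    p, p1, n = 5, 2, 40
    X = rng.standard_normal((p, n))
    S = X @ X.T
    r = canonical_correlations(S, p1)
    print(np.prod(1 - r**2))
    S11, S12, S22 = S[:p1, :p1], S[:p1, p1:], S[p1:, p1:]
    print(np.linalg.det(S) / (np.linalg.det(S11) * np.linalg.det(S22)))   # same value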
 
Exercise 6.25. Suppose that Σ = Cov( (X1', X2')' ). Show that

        ρ1(Σ) = max_{a1≠0, a2≠0} Cor(a1'X1, a2'X2) ≡ max_{a1≠0, a2≠0} a1'Σ12 a2 / √( a1'Σ11 a1 · a2'Σ22 a2 ).

Hint: Apply the Cauchy-Schwartz inequality.


¯

Example 6.26. Testing a multiple correlation coefficient.


In Example 6.22 set p1 = 1, so p2 = p − 1 and q ≡ min(1, p − 1) = 1. Now
the MIS r1 (S) ≥ 0 and the MIP ≡ ρ1 (Σ) ≥ 0 are one-dimensional and can
be expressed explicitly as follows:
        r1²(S) = S12 S22^{−1} S21 / S11 =: R²,        ρ1²(Σ) = Σ12 Σ22^{−1} Σ21 / Σ11 =: ρ².

The invariance-reduced problem (6.14) becomes that of testing

(6.17) ρ2 = 0 vs. ρ2 > 0 based on R2 .

By normality, the hypotheses

(6.18)        Σ12 = 0,    ρ² = 0,    and    X1 ⊥⊥ X2


are mutually equivalent. By (6.16) and (3.74) the size α LRT for testing
Σ12 = 0 vs. Σ12 ≠ 0 rejects Σ12 = 0 if R² > B_{(p−1)/2, (n−p+1)/2; α}.
Note: ρ and R are called the population (resp., sample) multiple correlation
coefficients for the following reason: if
   
        ( Σ11  Σ12 )  =  Cov( X1 ),
        ( Σ21  Σ22 )        ( X2 )

then
        ρ = max_{a2≠0} Cor(X1, a2'X2) = max_{a2≠0} Σ12 a2 / √( Σ11 · a2'Σ22 a2 ),

with equality attained at â2 = Σ22^{−1} Σ21. [Verify; apply the Cauchy-Schwartz
inequality – this implies that Σ12 Σ22^{−1} X2 is the best linear predictor of X1
based on X2 when EX = 0.]
¯

Exercise 6.27. Show that the R2 -test is UMPI and unbiased.


Solution: In Example A.18 of Appendix A it is shown that the pdf of R2 has
MLR in ρ2 . Thus this R2 -test is the UMP size α test for the invariance-
reduced problem (6.17), hence is the UMPI size α test for Σ12 = 0 vs.
Σ12 = 0, and is unbiased. ¯

Remark 6.28. (Kiefer and Schwartz (1965) Ann. Math. Statist.) By an


argument similar to that in Exercise 6.14c, the LRT is a proper Bayes test
for testing Σ12 = 0 vs. Σ12 = 0 based on S, and thus is admissible among
all tests for this problem. ¯

Remark 6.29. When Σ12 = 0, R² ≡ Q/(1 + Q) ∼ B( (p−1)/2, (n−p+1)/2 ) (see (3.74)),
so
        E(R²) = (p − 1)/n > 0 = ρ².

Thus, under the null hypothesis of independence, R² is an overestimate of
ρ1²(Σ) ≡ 0 (unless n ≫ p), hence might naively suggest dependence of X1
on X2.
¯


Example 6.30. Testing independence of k ≥ 3 sets of variates.


In the framework of Example 6.22, partition S and Σ as

        S = ( S11  · · ·  S1k ),      Σ = ( Σ11  · · ·  Σ1k ),
            (  ⋮           ⋮  )           (  ⋮           ⋮  )
            ( Sk1  · · ·  Skk )           ( Σk1  · · ·  Σkk )

respectively, where k ≥ 3. Again Sij and Σij are pi × pj matrices, i, j =
1, . . . , k, where p1 + · · · + pk = p. Take

        Θ0 = {Σ | Σij = 0, i ≠ j},

so (6.1) becomes the problem of testing

(6.19)   Σij = 0 ∀ i ≠ j  vs.  Σij ≠ 0 for some i ≠ j,  based on S ∼ Wp(n, Σ)

with n ≥ p. If G is the set of all non-singular block-diagonal p × p matrices

        g ≡ ( g11  · · ·   0  ),
            (  ⋮    ⋱     ⋮  )
            (  0   · · ·  gkk )

so G = GL(p1) × · · · × GL(pk), then (6.19) is G-invariant. Now
no explicit representation of the MIS and MIP is known (probably none
exists). Again the alternative hypothesis remains multi-dimensional even
after reduction by invariance, so a UMPI test does not exist.
¯

Exercise 6.31. In Example 6.30, derive the LRT for (6.19).


Answer: The LRT rejects Σij = 0, i ≠ j, for large values of ∏_{i=1}^k |Sii| / |S|.
Note: This LRT is proper Bayes and admissible among all tests for (6.19)
(Kiefer and Schwartz (1965) Ann. Math. Statist.) and is unbiased. ¯


Example 6.32. Testing equality of two covariance matrices.


Consider the problem of testing

(6.20)   Σ1 = Σ2  vs.  Σ1 ≠ Σ2,   based on (S1, S2) ∼ Wp(n1, Σ1) × Wp(n2, Σ2)

with n1 , n2 ≥ p. Here

X = Θ = Sp+ × Sp+ , Θ0 = Sp+ .

This problem is invariant under the action of GL on Sp+ × Sp+ given by

(6.21) (S1 , S2 ) → (gS1 g  , gS2 g  )

It follows from Lemma 6.3 and the simultaneous diagonalizability of two
positive definite matrices that the MIS and MIP are represented by

        f(S1, S2) ≡ (f1(S1, S2) ≥ · · · ≥ fp(S1, S2)) := the eigenvalues of S1 S2^{−1},
        φ(Σ1, Σ2) ≡ (φ1(Σ1, Σ2) ≥ · · · ≥ φp(Σ1, Σ2)) := the eigenvalues of Σ1 Σ2^{−1},

respectively [verify!]. By Lemma 6.6 the distribution of f(S1, S2) depends
on (Σ1, Σ2) only through φ(Σ1, Σ2); this distribution is complicated when
Σ1 ≠ κΣ2. The invariance-reduced problem becomes that of testing

(6.22)   φ(Σ1, Σ2) = (1, . . . , 1) vs. φ(Σ1, Σ2) ≠ (1, . . . , 1) based on f(S1, S2).

When p ≥ 2 the alternative hypothesis remains multi-dimensional even after


reduction by invariance, so a UMPI test for (6.20) does not exist. ¯

Exercise 6.33. In Example 6.32, derive the LRT for (6.20) and express the
test statistic in terms of f(S1, S2). Show that the LRT statistic is minimized
when (1/n1) S1 = (1/n2) S2.
Answer: The LRT rejects Σ1 = Σ2 for large values of

        |S1 + S2|^{n1+n2} / ( |S1|^{n1} |S2|^{n2} ),


or equivalently, for large values of

        ∏_{i=1}^p (1 + fi^{−1})^{n1} (1 + fi)^{n2},        (fi ≡ fi(S1, S2)).

The ith term in the product is minimized when fi = n1/n2.


¯

Example 6.34. Testing equality of k ≥ 3 covariance matrices.


Consider the problem of testing

(6.23)   Σ1 = · · · = Σk  vs.  Σi ≠ Σj for some i ≠ j,
         based on (S1, . . . , Sk) ∼ Wp(n1, Σ1) × · · · × Wp(nk, Σk),

with n1 ≥ p, . . . , nk ≥ p. Here

X = Θ = Sp+ × · · · × Sp+ (k times), Θ0 = Sp+ .

This problem is invariant under the action of GL on Sp+ × · · · × Sp+ given


by
(S1 , . . . , Sk ) → (gS1 g  , . . . , gSk g  ).
As in Example 6.30, no explicit representation of the MIS and MIP are
known (probably none exists). The alternative hypothesis is multidimen-
sional after reduction by invariance; no UMPI test for (6.23) exists.
¯

Exercise 6.35. In Example 6.34, derive the LRT for (6.23). Show that the
LRT statistic is minimized when (1/n1) S1 = · · · = (1/nk) Sk.

Answer: The LRT rejects Σ1 = · · · = Σk for large values of

        | ∑_{i=1}^k Si |^{∑ ni} / ∏_{i=1}^k |Si|^{ni}.

To minimize this, apply the case k = 2 repeatedly.


Note: This LRT, also called Bartlett’s test, is unbiased when k ≥ 2. (Perl-
man (1980) Ann. Statist.)
¯

89
STAT 542 Notes, Winter 2007; MDP

Example 6.36. The canonical MANOVA testing problem.


Consider the problem of testing

µ=0 vs. µ = 0 (Σ unknown)


(6.24)
based on (Y, W ) ∼ Np×r (µ, Σ ⊗ Ir ) × Wp (n, Σ)

with Σ > 0 unknown and n ≥ p. (Example 6.11 is the special case where
r = 1.) Here

X = Θ = Rp×r × Sp+ , Θ0 = {0} × Sp+ .

This problem is invariant under the action of the group GL × Or ≡ {(g, γ)}
acting on X and Θ via

(Y, W ) → (gY γ  , gW g  ),
(6.25)
(µ, Σ) → (gµγ  , gΣg  ),

respectively. It follows from Lemma 6.3 and the singular value decomposi-
tion that a MIS is [verify!]

f (Y, W ) ≡ (f1 (Y, W ) ≥ · · · ≥ fq (Y, W ))


:= the nonzero eigenvalues of Y  W −1 Y

where q := min(p, r) (or equivalently, the nonzero eigenvalues of Y Y  W −1 ),


and a MIP is [verify!]

φ(µ, Σ) ≡ (φ1 (µ, Σ) ≥ · · · ≥ φq (µ, Σ))


:= the nonzero eigenvalues of µ Σ−1 µ,

(or equivalently, the nonzero eigenvalues of µµ Σ−1 ). The distribution of


f (Y, W ) depends on (µ, Σ) only through φ(µ, Σ); it is complicated when
µ = 0. The invariance-reduced problem (6.2) becomes that of testing

(6.26) φ(µ, Σ) = (0, . . . , 0) vs. φ(µ, Σ) ≥ (0, . . . , 0) based on f (Y, W ).

Here the MIS and MIP have the same dimension, namely q, and a UMP
invariant test will not exist when q ≡ min(p, r) ≥ 2.

90
STAT 542 Notes, Winter 2007; MDP

Note that f (Y, W ) reduces to the T 2 statistic when r = 1, so in the gen-


eral case the distribution of f (Y, W ) is a generalization of the (central and
noncentral) F distribution. The distribution of (f1 (Y, W ), . . . , fq (Y, W ))
when µ = 0 is given in Exercise 7.2.
(The reduction of the general MANOVA testing problem to this canon-
ical form will be presented in §8.2.) ¯

Exercise 6.37a. In Example 6.36, derive the LRT for testing µ = 0 vs.
µ = 0 based on (Y, W ). Express the test statistic in terms of f (Y, W ). Show
that when µ = 0, W + Y Y  is independent of f (Y, W ), hence is independent
of the LRT statistic.
Partial solution: The LRT rejects µ = 0 for large values of

|W + Y Y  | q
 −1
(6.27) = |Ip + Y W Y | ≡ (1 + fi (Y, W )).
|W | i=1

When µ = 0, W + Y Y  is a complete and sufficient statistic for Σ, and


f (Y, W ) is an ancillary statistic, hence they are independent by Basu’s
Lemma. (Also see §7.1.)
¯

Exercise 6.37b. Let U be the matrix-variate Beta rv (recall Exercise 4.2)


defined as

(6.28) U := (W + Y Y  )−1/2 W (W + Y Y  )−1/2 .

|W |
Derive the moments of |W +Y Y  | ≡ |U | under the null hypothesis µ = 0.

Solution: By independence,

(6.29) E(|W |k ) = E(|U |k |W + Y Y  |k ) = E(|U |k ) E(|W + Y Y  |k ),

so (recall (4.16) in Exercise 4.2)


r+n
n

E(|W |k
) E(|S| k
) Γp Γ p + k
(6.30) E(|U |k ) = 
= = n 2
r+n2
.
¯
E(|W + Y Y | ) k E(|V | )
k Γp 2 Γp 2 + k

91
STAT 542 Notes, Winter 2007; MDP

Exercise 6.37c. Let U (p, r, n) denote the null (µ = 0) distribution of |U |.


(U (p, r, n) is called Wilks’ distribution.) Show that this distribution can
be represented as the product of independent Beta distributions:
r

(6.31) U (p, r, n) ∼ B n−p+i


2 , p
2 ,
i=1

where the Beta variates are mutually independent.


Note: The moments of |U | given in 6.30) or obtained directly from (6.31)
can be used to obtain the Box approximation, a chi-square approximation
to the Wilks’ distribution U (p, r, n). (See T.W.Anderson book, §8.5.) ¯

Exercise 6.37d. In Exercise 6.24 it was found that the LRT for testing

(6.32) Σ12 = 0 vs. Σ12 = 0

(i.e., testing independence of two sets of variates) rejects Σ12 = 0 for small
values of |S11|S|
||S22 | . Show that the null (Σ12 = 0) distribution of this LRT
statistic is U (p1 , p2 , n − p2 ) – see Exercise 6.24.
¯

Exercise 6.37e. Show that U (p, r, n) ∼ U (r, p, r + n − p), hence


p

U (p, r, n) ∼ B n−p+i
2 , r
2 .
¯
i=1

Remark 6.38. Perlman and Olkin (Annals of Statistics 1980) applied the
FKG inequality to show that the LRTs in Exercises 6.24 and 6.37a are
unbiased.
¯

Example 6.39. The canonical GMANOVA Model.


(Example 6.13 is a special case.) [To be completed]
¯

Example 6.40. An inadmissible UMPI test. (C. Stein – see Lehmann


TSH Example 11 p.305 and Example 9 p.522.)
Consider Example 6.32 (testing Σ1 = Σ2 ) with p > 1 but with n1 = n2 = 1,
so S1 and S2 are each singular of rank 1. This problem again remains
invariant under the action of GL on (S1 , S2 ) given by (6.21):

(S1 , S2 ) → (gS1 g  , gS2 g  ).

92
STAT 542 Notes, Winter 2007; MDP

Here, however, GL acts transitively [verify] on this sample space since S1 , S2


each have rank 1, so the MIS is trivial: t(S1 , S2 ) ≡ const. This implies that
the only size α invariant test is φ(S1 , S2 ) ≡ α, so its power is identically α.
However, there exist more powerful non-invariant tests. For any nonzero
a : p × 1, let

a S1 a a S1 a
(6.33) Va ≡  ∼  · F1,1 ≡ δa · F1,1
a S2 a a S2 a

and let φa denote the UMPU size α test for testing δa = 1 vs. δa = 1 based
on Va (cf. TSH Ch.5 §3). Then [verify]: φa is unbiased size α for testing
Σ1 = Σ2 , with power > α when δa = 1, so φa dominates the UMPI test φ.
Note: This failure of invariance to yield a nontrivial UMPI test is usually
attributed to the group GL being “too large”, i.e., not “amenable”.13 How-
ever, this example is somewhat artificial in that the sample sizes are too
small (n1 = n2 = 1) to permit estimation of Σ1 and Σ2 . It would be of
interest to find (if possible?) an example of a trivial UMPI test in a less
contrived model. ¯

Exercise 6.41. Another inadmissible UMPI test. (see Lehmann TSH


Problem 11 p.532.)
Consider Example 6.10 (testing µ = 0 with Σ unknown) with n > 1 obser-
vations but n < p. As in Example 6.40, show that the UMPI GL-invariant
test is trivial but there exists more powerful non-invariant tests. ¯

13
See Bondar and Milnes (1981) Zeit. f. Wahr. 57, pp. 103-128.

93
STAT 542 Notes, Winter 2007; MDP

7. Distribution of Eigenvalues. (See T.W.Anderson book, Ch. 13.)


In the invariant testing problems of Examples 6.22 (testing Σ12 = 0),
6.32 (testing Σ1 = Σ2 ), and 6.36 (the canonical MANOVA testing prob-
lem), the maximal invariant statistic (MIS) was represented as the set of
nontrivial eigenvalues of a matrix of one of the forms

ST −1 or S(S + T )−1 ,

where S and T are independent Wishart matrices.14 Because the LRT


statistic is invariant (Lemma 6.7), it is necessarily a function of these eigen-
values. When the dimensionality of the invariance-reduced alternative hy-
pothesis is ≥ 2,15 however, no single invariant test is UMPI, and other
reasonable invariant test statistics16 have been proposed: for example,
q ri2
r12 (Roy) and (Lawley − Hotelling)
i=1 1 − ri2
−1
in Example 6.22, where we may take (S, T ) = (S12 S22 S21 , S11·2 ), and
q
f1 (Roy) and fi (Lawley − Hotelling)
i=1

in Example 6.36, where (S, T ) = (Y Y  , W ).


Thus, to determine the distribution of such invariant test statistics it
is necessary to determine the distribution of the eigenvalues of ST −1 or
equivalently [why?] of S(S + T )−1 .

7.1. The central distribution of the eigenvalues of S(S + T )−1 .


Let S and T be independent with S ∼ Wp (r, Σ) and T ∼ Wp (n, Σ), Σ > 0.
Assume further that n ≥ p, so T > 0 w. pr. 1. Let

1 ≥ b1 ≥ · · · ≥ bq > 0 and f1 ≥ · · · ≥ fq > 0


14
In Example 6.36, Y Y  (≡ S here) has a noncentral Wishart distribution under the
−1
alternative hypothesis, i.e., E(Y ) = µ = 0. In Example 6.22, S12 S22 S21 (≡ S here)
has a conditional noncentral Wishart distribution under the alternative hypothesis.
15
For example, see (6.14), (6.22), and (6.26).
16
Schwartz (Ann. Math. Statist. (1967) 698-710), presents a sufficient condition and
a (weaker) necessary condition for an invariant test to be admissible among all tests.

94
STAT 542 Notes, Winter 2007; MDP

denote the q ≡ min(p, r) ordered nonzero17 eigenvalues of S(S + T )−1 (the


Beta form) and ST −1 (the F form), respectively. Set

b ≡ (b1 , . . . , bq ) ≡ {bi (S, T )},


(7.1)
f ≡ (f1 , · · · , fq ) ≡ {fi (S, T )}.

First we shall derive the pdf of b, then obtain the pdf of f using the relation

bi
(7.2) fi = .
1 − bi

Because b is GL-invariant, i.e.,

bi (S, T ) = bi (ASA , AT A ) ∀ A ∈ GL,

the distribution of b does not depend on Σ [verify], so we may set Σ = Ip .


Denote this distribution by b(p, n, r) and the corresponding distribution of
f by f (p, n, r).

Exercise 7.1. Show that (compare to Exercise 6.37e)

b(p, n, r) = b(r, n + r − p, p),


(7.3)
f (p, n, r) = f (r, n + r − p, p).

Outline of solution: Let W be a partitioned Wishart random matrix:

p r
 
p W11 W12
W ≡ ∼ Wp+r (m, Ip+r ).
r W21 W22

Assume that m ≥ max(p, r), so W11 > 0 and W22 > 0 w. pr. 1. By the
properties of the distribution of a partitioned Wishart matrix (Proposition
3.13),
−1 −1
(a) the distribution of the nonzero eigenvalues of W12 W22 W21 W11
is b(p, m − r, r) [verify!]
17
If p > r then q = r and p − r of the eigenvalues of S(S + T )−1 are trivially
≡ 1. By Okamoto’s Lemma the nonzero eigenvalues are distinct w. pr. 1.

95
STAT 542 Notes, Winter 2007; MDP

−1 −1
(b) the distribution of the nonzero eigenvalues of W21 W11 W12 W22
is b(r, m − p, p) [verify!].
But these two sets of eigenvalues are identical18 so the result follows by
setting n = m − r.
¯

By Exercise 7.1 it suffices to derive the distribution b(p, n, r) when


r ≥ p, where q = p. Because r ≥ p, also S > 0 w. pr. 1, so by (4.11) the
joint pdf of (S, T ) is
r−p−1 n−p−1
e− 2 tr(S+T ) ,
1
cp,r cp,n · |S| 2 |T | 2 S > 0, T > 0.

Make the transformation

(S, T ) → (S, V ≡ S + T ).

By the extended combination rule, the Jacobian is 1, so the joint pdf of


(S, V ) is
r−p−1 n−p−1
e− 2 trV ,
1
cp,r cp,n · |S| 2 |V − S| 2 V > S > 0.

By E.3, there exists a unique [verify] nonsingular p × p matrix E ≡ {eij }


with e1j > 0, j = 1, . . . , p, such that

S = E Db E  ,
(7.4)
V = E E,

where Db := Diag(b1 , . . . , bp ). Thus the joint pdf of (b, E) is given by


 ) 
r−p−1 n−p−1
 n+r−2p−2
− 12 tr EE   ∂(S, V
f (b, E) = cp,r cp,n ·|Db | 2 |Ip −Db | 2 |EE | 2 e  ,
∂(b, E)

where the range Rb,E is the Cartesian product Rb × RE with

Rb := {b | 1 > b1 > · · · > bp > 0},


RE := {E | e1j > 0, ∞ < eij < ∞, i = 2, . . . , p, j = 1, . . . , p}.
 
 λIp A 
18
Because |λIp − AB| =   = |λIp | · | λ1 (λIr − BA)|.
B Ir

96
STAT 542 Notes, Winter 2007; MDP

We will show that


 ∂(S, V ) 
  p+2
(7.5)   = 2p · |EE  | 2 · (bi − bj ),
∂(b, E) i<j

hence


p
r−p−1
p
n−p−1
f (b, E) =2 cp,r cp,n ·
p
bi 2
(1 − bi ) 2 (bi − bj )
i=1 i=1 1≤i<j≤p
n+r−p 
· |EE  | e− 2 tr EE .
1
2

Because Rb,E = Rb × RE , this implies that b and E are independent with


marginal pdfs given by


p
r−p−1 n−p−1
(7.6) f (b) = cb · bi 2
(1 − bi ) 2 · (bi − bj ), b ∈ Rb (p),
i=1 1≤i<j≤p
n+r−p 
f (E) = cE · |EE  | e− 2 tr EE ,
1
(7.7) 2 E ∈ RE (p),

where
cb cE = 2p cp,r cp,n .
Thus, to determine cb it suffices to determine cE . This is accomplished as
follows:

n+r−p 
−1
|EE  | 2 e− 2 tr EE dE
1
cE =
RE

n+r−p 
−p
|EE  | 2 e− 2 tr EE dE
1
=2 [by symmetry]
Rp2

p
−p p2
 n+r−p 1
√ e− 2 eij deij
1 2
=2 (2π) 2 |EE | 2

Rp2 i,j=1

p2
n+r−p

−p
=2 (2π) 2 · E |Wp (p, Ip )| 2 [why?]

−p p2 cp,p
=2 (2π) 2 · . [by (4.12)]
cp,n+r

97
STAT 542 Notes, Winter 2007; MDP

Therefore (recall (4.10))


cp,p cp,n cp,r
p2
(7.8) cb ≡ cb (p, n, r) = (2π) 2 ·
cp,n+r
p n+r

π 2 Γp
≡ n
r
2 p
.
Γp 2 Γp 2 Γp 2

This completes the derivation of the pdf f (b) in (7.6), hence determines the
distribution b(p, n, r) when r ≥ p. Note that this can be viewed as another
generalization of the Beta distribution.

Verification of the Jacobian (7.5):


By the linearization method (*) in §4.3,
   
 ∂(S, V )   ∂(dS, dV ) 
(7.9)    
 ∂(b, E)  =  ∂(db, dE) .

From (7.4),
dS = (dE)Db E  + E Ddb E  + EDb (dE) ,
dV = (dE)E  + E(dE) ,
hence, defining

dF = E −1 (dE),
dG = E −1 (dS)(E −1 ) ,
dH = E −1 (dV )(E −1 ) ,
we have
(7.10) dG = (dF )Db + Ddb + Db (dF ) ,
(7.11) dH = (dF ) + (dF ) .
 
 ∂(dS,dV ) 
To evaluate  ∂(db,dE) , apply the chain rule to the sequence

(db, dE) → (db, dF ) → (dG, dH) → (dS, dV )

to obtain
 ∂(dS, dV )   ∂(db, dF )   ∂(dG, dH)   ∂(dS, dV ) 
       
 = · · .
∂(db, dE) ∂(db, dE) ∂(db, dF ) ∂(dG, dH)

98
STAT 542 Notes, Winter 2007; MDP

By 4.2(d), 4.2(e), and the combination rule in §4.1,


 ∂(db, dF )   ∂(dF ) 
   
(7.12)  =  = |E|−p ,
∂(db, dE) ∂(dE)
 ∂(dS, dV )   ∂(dS)   ∂(dV ) 
     
(7.13)  = ·  = |E|p+1 |E|p+1 = |E|2(p+1) .
∂(dG, dH) ∂(dG) ∂(dH)
 
 ∂(dG,dH) 
Lastly we evaluate  ∂(db,dF )  =: J. Here dG ≡ {dgij } and dH ≡
{dhij } are p × p symmetric matrices, dF ≡ {dfij } is a p × p unconstrained
matrix, and db is a vector of dimension p. From (7.10) and (7.11),

dgii = 2(dfii )bi + dbi , i = 1, . . . , p,


dhii = 2dfii , i = 1, . . . , p,
dgij = (dfij )bj + bi (dfji ), 1 ≤ i < j ≤ p,
dhij = dfij + dfji , 1 ≤ i < j ≤ p.

Therefore 

 ∂ (dgii ), (dhii ), (dgij ), (dhij ) 
J = 

∂ (dbi ), (dfii ), (dfij ), (dfji )

 
 Ip 0 0 0 
 
 2D 2Ip 0 0 
= b ,
 0 0 D1 Ip(p−1)/2 
 
0 0 D2 Ip(p−1)/2
where

D1 := Diag(b2 , . . . , bp , b3 , . . . , bp , . . . , bp−1 , bp , bp )
D2 := Diag(b1 , . . . , b1 , b2 , . . . , b2 , . . . , bp−2 , bp−2 , bp−1 ),

hence [verify!]

(7.14) J = 2p |D1 − D2 | = 2p (bi − bj ).
1≤i<j≤p

The desired Jacobian (7.5) follows from (7.12), (7.13), and (7.14).
¯

99
STAT 542 Notes, Winter 2007; MDP

Exercise 7.2. Use (7.2) to show that if r ≥ p, the pdf of (f1 , · · · , fp ) is


given by


p
r−p−1 n+r
(7.15) cb (p, r, n) fi 2
(1 + fi )− 2 (fi − fj ),
i=1 1≤i<j≤p

where cb (p, n, r) is given by (7.8). If r < p then the pdf of (f1 , · · · , fr )


follows from f (p, n, r) = f (r, n + r − p, p) in (7.3). ¯

Exercise 7.3. Under the weaker assumption that n + r ≥ p, show that the
distribution of b ≡ {bi (S, T )} does not depend on Σ and that b and V are
independent. (Note that f ≡ {fi (S, T )} is not defined unless n ≥ p.)
Hint: Apply the GL-invariance of {bi (S, T )} and Basu’s Lemma. If n ≥ p
and r ≥ p the result also follows from Exercise 4.2. ¯

7.2. Eigenvalues and eigenvectors of one Wishart matrix.


In the invariant testing problems of Examples 6.18 (testing Σ = Ip ) and
Exercise 6.20 (testing Σ = κIp ), the maximal invariant statistic (MIS) can
be represented in terms of the set of ordered eigenvalues

{l1 ≥ · · · ≥ lp } ≡ {li (S)}

of a single Wishart matrix S ∼ Wp (r, Σ) (r ≥ p, Σ > 0). Again the LRT


statistic is invariant so is necessarily a function of these eigenvalues.
As in §7.1, when the dimensionality of the invariance-reduced alterna-
tive hypothesis is ≥ 2 (e.g. (6.11)), no single invariant test is UMPI – other
reasonable invariant test statistics include
p
2
1 − I{a<lp <l1 <b} (Roy) and 1
l
n i − 1 (Nagao)
i=1

in Example 6.18 and

l1 p p 1
(Roy) and li ·
lp i=1 i=1 li

in Example 6.20. To determine the distribution of such invariant test statis-


tics we need to find the distribution of (l1 , . . . , lp ) when Σ = Ip .

100
STAT 542 Notes, Winter 2007; MDP

Exercise 7.4. Eigenvalues of S ∼ Wp (r, Ip ).


Assume that r ≥ p. Show that the pdf of l ≡ (l1 , · · · , lp ) is
p
π2
p
r−p−1
− 12 li
(7.16) f (l) = pr p
r
li 2
e (li − lj )
2 2 Γp 2 Γp 2 i=1 1≤i<j≤p

on the range
Rl := {l | ∞ > l1 > · · · > lp > 0}.
Outline of solution. Use the limit representation

(7.17) li (S) = limn→∞ fi S, n1 T = limn→∞ nfi (S, T ) .

Let li = nfi , i = 1, . . . , p and derive the pdf of l1 , . . . , lp from the pdf of


(f1 , . . . , fp ) in (7.15). Now let n → ∞ and apply Stirling’s approximation
for the Gamma function.
¯

Alternate derivation of (7.16). Begin with the spectral decomposition

S = Γ Dl Γ ,
(7.18)
Ip = Γ Γ ,

where Dl = Diag(l1 , . . . , lp ). The joint pdf of (l, Γ) is given by


 ∂S 
r−p−1
− 12 tr S  
f (l, Γ) = cp,r · |S| 2 e · 
∂(l, Γ)

p  ∂(dS) 
r−p−1
 
e− 2 li
1
(7.19) = cp,r · li 2
· .
i=1
∂(dl, dΓ)

From (7.18),
dS = (dΓ) Dl Γ + Γ Ddl Γ + Γ Dl (dΓ) ,
0 = (dΓ) Γ + Γ (dΓ) ,
hence, defining dF = Γ−1 (dΓ),

(7.20) dG : = Γ−1 (dS)(Γ−1 ) = (dF )Dl + Ddl + Dl (dF ) ,


(7.21) 0 = E −1 (0)(E −1 ) = (dF ) + (dF ) .

101
STAT 542 Notes, Winter 2007; MDP

Thus dG ≡ {dgij } is symmetric, dF ≡ {dfij } is skew-symmetric, and

(7.22) dG = (dF )Dl + Ddl − Dl (dF ).


 
 ∂(dS) 
To evaluate  ∂(dl,dΓ) , apply the chain rule to the sequence

(dl, dΓ) → (dl, dF ) → dG → dS

to obtain
 ∂(dS)   ∂(dl, dF )   ∂(dG)   ∂(dS) 
       
 = · ·  [verify].
∂(dl, dΓ) ∂(dl, dΓ) ∂(dl, dF ) ∂(dG)
        
=1 ≡J =1

From (7.22),
dgii = dli , i = 1, . . . , p,
dgij = (dfij )(lj − li ), 1 ≤ i < j ≤ p,
(note that dfii = 0 by skew-symmetry), so

  

 ∂ (dgii ), (dgij )   Ip ∗ 
J = 
 =  = |D| = (li − lj ),
∂ (dli ), (dfij ) 0 D
1≤i<j≤p

where
D = Diag(l2 − l1 , . . . , lp − l1 , . . . , lp − lp−1 ).
Therefore from (7.19),


p
r−p−1
− 12 li
f (l, Γ) = cp,r · li 2
e (li − lj ),
i=1 1≤i<j≤p
so 
f (l) = f (l, Γ) dΓ
Op
'  (

p
r−p−1
e− 2 li
1
(7.23) cp,r dΓ · li 2
(li − lj ).
Op i=1 1≤i<j≤p

102
STAT 542 Notes, Winter 2007; MDP

The evaluation of this integral requires the theory of differential forms on


smooth manifolds.19 However, we have already obtained f (l) in (7.16), so
we can equate the constants in (7.16) and (7.23) to obtain
 p
π2
cp,r dΓ = pr p
r
,
Op 2 2 Γp 2 Γp 2
so from (4.10),
 p(p+1)
π 4
(7.24) dΓ =
.
¯
Op Γp n2

It follows from ((7.19) that l ⊥


⊥ Γ and that Γ is uniformly distributed
over Op w. r. to the measure dΓ – however, we have not defined this
measure explicitly. This is accomplished by the following proposition.

Proposition 7.5. Let S = ΓS Dl(S) ΓS be the spectral decomposition of


the Wishart matrix S ∼ Wp (r, I). Then the eigenvectors and eigenvalues
of S are independent, i.e., ΓS ⊥
⊥ l(S), and

ΓS ∼ Haar(Op ),

the unique orthogonally invariant probability distribution on Op .


Proof. It suffices to show that for any measurable sets A ∈ Op and B ∈ Rp ,

(7.25) Pr[ Ψ ΓS ∈ A | l(S) ∈ B] = Pr[ ΓS ∈ A | l(S) ∈ B] ∀ Ψ ∈ Op .

This will imply that the conditional distribution of ΓS is (left) orthogonally


invariant, hence, by the uniqueness of Haar measure,

ΓS  l(S) ∈ B ∼ Haar(Op ) ∀ B ∈ Rp .

This implies that ΓS ⊥


⊥ l(S) and ΓS ∼ Haar(Op ) unconditionally, as as-
serted.
19
This approach is followed in the books by R. J. Muirhead, Aspects of Multivariate
Statistical Theory (1982) and R. H. Farrell, Multivariate Calculation (1985).

103
STAT 542 Notes, Winter 2007; MDP

To establish (7.25), consider

S̃ = Ψ S Ψ ∼ Wp (r, I).
Then
S̃ = (ΨΓS )Dl(S) (ΨΓS ) ,
so
ΓS̃ = Ψ ΓS and l(S̃) = l(S).

Therefore

Pr[ Ψ ΓS ∈ A | l(S) ∈ B] = Pr[ ΓS̃ ∈ A | l(S̃) ∈ B]


= Pr[ ΓS ∈ A | l(S) ∈ B],

since S̃ ∼ S, so (7.25) holds.


¯.

7.3. Stein’s integral representation of the density of a maximal


invariant statistic.
Proposition 7.6. Suppose that the distribution of X is given by a pdf f (x)
w. r. to a measure µ on the sample space X . Assume that µ is invariant
under the action of a compact topological group G acting on X and that µ
is G-invariant, i.e., µ(gB) = µ(B) for all events B ⊆ X and all g ∈ G. If

t:X →T
x → t(x)

is a maximal invariant statistic then the pdf of t w.r. to the induced measure
µ̃ = µ(t−1 ) on T is given by

(7.26) f¯(x) = f (gx) dν(g),
G

where ν is the Haar probability measure on G.


Proof: First we show that f¯(x) is actually a function of the MIS t. The
integral is simply the average of f (·) over all members gx in the G-orbit of
x. By the G-invariance of µ, f¯(·) is also G-invariant:
 
f¯(g1 x) = f (gg1 x) dµ(g) = f (gx) dν(g) = f¯(x) ∀ g1 ∈ G,
G G

104
STAT 542 Notes, Winter 2007; MDP

hence f¯(x) = h(t(x)) for some function h(t).


Next, for any event A ⊆ T and any g ∈ G,

P [ t(X) ∈ A ] = IA (t(x)) f (x) dµ(x)
X
= IA (t(g −1 x)) f (x) dµ(x)
X
= IA (t(y)) f (gx) dµ(y),
X

by the G-invariance of t and µ, so


 
P [ t(X) ∈ A ] = IA (t(y)) f (gy) dµ(y) dν(g)
G X
 
= IA (t(y)) f (gy) dν(g) dµ(y)
X G

= IA (t(y)) h(t(y)) dµ(y)


X

= IA (t) h(t) dµ̃(t).
T

Thus h(t) ≡ f¯(x) is the pdf of t ≡ t(X) w. r. to dµ̃(t).


¯

Example 7.7. Let X = Rp , G = Op , and µ = Lebesgue measure on Rp ,


an Op -invariant measure. Here γ ∈ Op acts on Rp via x → γx. A maximal
invariant statistic is t(x) = x2 . If X has pdf f (x) w. r. to µ then the
integral representation (7.26) states that t(X) ≡ X2 has pdf

(7.27) h(t) = f (γx) dνp (γ)
Op

w. r. to dµ̃(t) on (0, ∞), where νp is the Haar probability measure on Op .


In particular, if f (x) is also Op -invariant, i.e., if

f (x) = k(x2 )

for some k(·) on (0, ∞), then the pdf of t(X) w.r.to dµ̃(t) is simply

(7.28) h(t) = k(t), t ∈ (0, ∞).

105
STAT 542 Notes, Winter 2007; MDP

The induced measure dµ̃(t) can be found by considering a special case:


If X ∼ Np (0, Ip ) then t ≡ X2 ∼ χ2p . Here

e− 2 x ≡ k(x2 )
1 2
1
f (x) = p w.r. to dµ(x),
(2π) 2
so t has pdf
e− 2 t
1 1
h(t) = k(t) = p w.r. to dµ̃(t).
(2π) 2

We also know, however, that t has the χ2p pdf


p
t 2 −1 e− 2 t
1
w(t) ≡ p
1
w.r. to dt (≡ Lebesgue measure).
22 Γ( p
2)

Therefore dµ̃(t) is determined as follows:

w(t) p
p
(7.29) dµ̃(t) = dt = Γ p t 2 −1 dt.
π2
k(t) (2)

Application: We can use Stein’s representation (7.27) to give an alternative


derivation of the noncentral chi-square pdf in (2.25) – (2.27). Suppose that
X ∼ Np (ξ, Ip ) with ξ = 0, so

t ≡ X2 ∼ χ2p (δ) with δ = ξ2 .

Here
e− 2 x−ξ ,
1 2
1
f (x) = p
(2π) 2

so by (7.27), t has pdf w. r. to dµ̃(t) given by



e− 2 x−γξ dνp (γ)
1 2
1
h(t) = p
(2π) 2 Op


− 12 ξ2 − 12 x2
= 1
p e e ex γξ dνp (γ)
(2π) 2 Op
 1 1
(1)
e− 2 e− 2
δ t
= 1
p et 2 δ 2 γ11 dνp (γ) [ verify! ]
(2π) 2 Op

 k 
− δ2 − 2t (tδ) 2
= 1
p e e k
γ11 dνp (γ) [ γ = {γij } ]
(2π) 2 k! Op
k=0

106
STAT 542 Notes, Winter 2007; MDP


 
(2) 1 − δ2 − 2t (tδ)k 2k
= p e e γ11 dνp (γ) [ verify! ]
(2π) 2 (2k)! Op
k=0
∞
(3) (tδ)k %
&k
e− 2 e− 2
δ t
= 1
p E Beta 12 , p−1
2 [ verify! ]
(2π) 2 (2k)!
k=0

p

 (tδ)k Γ 12 + k Γ 2
e− 2 e− 2
δ t
= 1
p

.
(2π) 2 (2k)! Γ p2 + k Γ 12
k=0

(1) This follows from the left and right invariance of the Haar measure νp .
(2) By the invariance of νp the distribution of γ11 is even, i.e., γ11 ∼ −γ11 ,
so its odd moments vanish.
(3) By left invariance, the first column of γ is uniformly

distributed on the
unit sphere in R , hence γ11 ∼ Beta 2 , 2
p 2 1 p−1
[verify!].
Thus from (7.29) and Legendre’s duplication formula, t has pdf w. r. to dt
given by
∞ 1

dµ̃(t)  t
p
2 −1 (tδ)k Γ + k
= 1p e− 2 e− 2
δ t
h(t) p 2
1

dt 22 (2k)! Γ 2 + k Γ 2
k=0
 

( ) δ k p+2k
−1 − t

 t e 2 
2
= e− 2
δ
(7.30) 2 ,
k! 2
p+2k
2 Γ p+2k
k=0
    
2

Poisson( δ2 ) weights pdf of χ2p+2k

as also found in (2.27).


¯

Example 7.9. Extend Example 7.8 as follows. Let

X = Rp×r , G = Op × Or , µ = Lebesgue measure on Rp×r ,

so µ is (Op × Or )-invariant. Here (γ, ψ) ∈ Op × Or acts on Rp×r via

x → γxψ  .

Assume first that r ≥ p. A maximal invariant statistic is [verify!]

(7.31) t(x) = ( l1 (xx ) ≥ · · · ≥ lp (xx ) ) ≡ l(xx ),

107
STAT 542 Notes, Winter 2007; MDP

the ordered nonzero eigenvalues of xx [verify]. If X has pdf f (x) w. r. to


µ then Stein’s integral representation (7.26) states that l ≡ l(XX  ) has pdf
 
(7.32) h(l) = f (γxψ  ) dνp (γ) dνr (ψ)
Op Or

w. r. to dµ̃(l) on Rl . In particular, if f (x) is also (Op × Or )-invariant, i.e.,


if
f (x) = k( l(xx ) )
for some k(·) on Rl , then the pdf of l(XX  ) w. r. to dµ̃(l) is simply

(7.33) h(l) = k(l), l ∈ Rl .

The induced measure dµ̃(l) can be found by considering a special case:

X ∼ Np×r (0, Ip ⊗ Ir ) =⇒ XX  ∼ Wp (r, Ip ).


Here

e− 2 tr xx
1 1
f (x) = pr
(2π) 2
q
− 12 li
= 1
pr e i=1 ≡ k(l) w.r. to dµ(x) on Rp×r ,
(2π) 2
so l has pdf
q
− 12 li
h(l) = k(l) = 1
pr e i=1 w.r. to dµ̃(l) on Rl .
(2π) 2

We also know from (7.16) that l has the pdf


p
π2
p
r−p−1
− 12 li
w(l) ≡ pr p
r
li 2
e (li − lj )
2 2 Γp 2 Γp 2 i=1 1≤i<j≤p

w.r. to dl (≡ Lebesgue measure) on Rl . Therefore dµ̃(l) is determined as


follows:

w(l)
p(r+1)
π 2
p
r−p−1
(7.34) dµ̃(l) = dl = p
r
li 2 (li − lj ) dl.
k(l) Γp 2 Γp 2 i=1 1≤i<j≤p

Finally, the case r < p follows from (7.33) by interchanging p and r, since
XX and X  X have the same nonzero eigenvalues.

108
STAT 542 Notes, Winter 2007; MDP

Application: Stein’s representation (7.32) provides an integral representa-


tion for the pdf of the eigenvalues of a noncentral Wishart matrix. If

X ∼ Np×r (ξ, Ip ⊗ Ir )

with ξ = 0, the distribution of XX  ≡ S depends on ξ only through ξξ  [ver-


ify], hence is designated the noncentral Wishart distribution Wp (r, Ip ; ξξ  ).
Assume first that r ≥ p. The distribution of the ordered eigenvalues

l ≡ l(XX  ) ≡ ( l1 (XX  ) ≥ · · · ≥ lp (XX  ) )

of S depends on ξξ  only through the ordered eigenvalues

λ ≡ λ(ξξ  ) ≡ ( l1 (ξξ  ) ≥ · · · ≥ lp (ξξ  ) )

of ξξ  , hence is designated by l(p, r; λ). Here



e− 2 tr (x−ξ)(x−ξ) ,
1 1
f (x) = pr
(2π) 2

so by (7.32), l has pdf w. r. to dµ̃(l) given by

h(l)
 
  
e− 2 tr (γxψ −ξ)(γxψ −ξ) dνp (γ) dνr (ψ)
1
1
= pr
(2π) 2 Op Or
 
− 12 tr ξξ  − 12 tr xx  
= 1
pr e e etr γxψ ξ dνp (γ) dνr (ψ)
(2π) 2 Op Or
    1 1
(1) 1 − 12 λi − 2 1
li tr γDl2 ψ̃  Dλ2
= pr e e e dνp (γ) dνr (ψ)
(2π) 2 Op Or
    p p 12 12
1 − 12 λi − 12 li l λ γ ψ
= pr e e e i=1 j=1 i j ji jidνp (γ) dνr (ψ)
(2π) 2 Op Or
 ' (2k 
   

p 
∞ 
p 1
(2)
= 1
e− 12 λi − 2
e
1
li  1 k dνp (γ) dνr (ψ).
pr
(2π) 2 (2k)! li λj2 γji ψji
Op Or i=1 k=0 j=1

109
STAT 542 Notes, Winter 2007; MDP

(1) Here Dλ = diag(λ1 , . . . , λp ), Dl = diag(l1 , . . . , lp ), and γ̃ is the leading


p × p submatrix of ψ. The equality follows from the left and right in-
variance of the Haar measures νp and νr and from the singular value
decompositions of ξ and x. The representation (1) is due to A. James
Ann. Math. Statist. (1961, 1964). Note that the double integral in (1)
is a convex and symmetric (≡ permutation-invariant) function of
1 1
l12 , . . . , lp2 on the unordered positive orthant Rp+ [explain].
(2) By the invariance of νp the distribution of γi ≡ (γ1i , . . . , γpi ) , the ith
column of γ, is even, i.e., γi ∼ −γi . Apply this for i = 1, . . . , p, using
the following expansion at each step:

 x2k
1 x −x
(e + e ) = .
2 (2k)!
k=0

Thus from (7.34), l has pdf fλ (l) w. r. to dl given by


dµ̃(l)
fλ (l) = h(l)
dl 
π 2 e− 2 λi
p 1
p
r−p−1
li 2 e− 2 li
1
= pr p
r
(li − lj )
2 2 Γp 2 Γp 2 i=1 1≤i<j≤p
 ' (2k 
 

p 
∞ 
p 1
(7.35) ·  1
l k
λ j γji ψji
2 dνp (γ) dνr (ψ)
(2k)! i
Op Or i=1 k=0 j=1

on the range Rl . The case r < p now follows by interchanging p and r in


(7.35), since XX and X  X have the same nonzero eigenvalues.
¯

Remark 7.10. The integrand in (7.35) is a multiple power series in {li },


and similarly in {λj } – this can be expanded and integrated term-by-term,
leading to an extension of the Poisson mixture representation (7.30) for
the noncentral chi-square pdf. However, important information already
can be obtained from the integral representation (7.35). By comparing the
noncentral pdf fλ (l) in (7.35) to the central pdf f (l) ≡ f0 (l) in (7.16) the
likelihood ratio
fλ (l)   
= e− 2 λi
1
(7.36) [· · ·],
f0 (l) Op Or

110
STAT 542 Notes, Winter 2007; MDP

the double integral in (7.35). From this representation it is immediate


1
that f (l) ≡ f0 (l) is strictly increasing in each li , hence in each li , and 2

as already noted in (1) its extension to the positive orthant Rp+ is convex
1
and symmetric in the {li2 }. Thus the symmetric extension to Rp+ of the
acceptance region A ⊆ Rl of any proper Bayes test for testing λ = 0 vs.
1
λ > 0 based on l must be convex and decreasing in {li2 } [explain and verify!].
Wald’s fundamental theorem of decision theory states that the closure
in the weak* topology of the set of all proper Bayes acceptance regions
determines an essentially complete class of tests. Because convexity and
monotonicity are preserved under weak* limits, this implies that the sym-
metric extension to Rp+ of any admissible acceptance region A ⊆ Rl must
1
be convex and decreasing in {li }. This shows, for example, that the test
2

which rejects λ = 0 for large values of the minimum eigenvalue lp (S) is


inadmissible among invariant tests [verify!], hence among all tests.
Furthermore, Perlman and Olkin (Ann. Statist. (1980) pp.1326-41)
used the monotonicity of the likelihood ratio (7.36) and the FKG inequality
to establish the unbiasedness of all monotone invariant tests, i.e., all tests
with acceptance regions of the form {g(l1 , . . . , gp ) ≤ c} with g nondecreasing
in each li .
¯

Exercise 7.11. Eigenvalues of S ∼ Wp (r, Σ) when Σ = Ip .


(a) Assume that r ≥ p and Σ > 0. Show that the pdf of l ≡ (l1 , · · · , lp ) is

π2
p
r−p−1
p
fλ (l) = pr p

p r li 2 (li − lj )
r
2 2 Γp 2 Γp 2 ( i=1 λi ) i=1 2
1≤i<j≤p

−1 
e− 2 tr Dλ γ Dl γ dνp (γ),
1
(7.37) · l ∈ Rl ,
Op

where l ≡ (l1 , · · · , lp ) and λ ≡ (λ1 , . . . , λp ) are the ordered eigenvalues of S


and Σ, respectively. (Compare to Exercise 7.4).
(b) Consider the problem of testing Σ = Ip vs. Σ ≥ Ip . Show that a
necessary condition for the admissibility of an invariant test is that the
symmetric extension to Rp+ of its acceptance region A ⊆ Rl must be convex
and decreasing in {li }. (Thus the test based on lp (S) is inadmissible.) )
¯

111
STAT 542 Notes, Winter 2007; MDP

Remark 7.12. Stein’s integral formula (7.26) for the pdf of a maximal
invariant statistic under the action of a compact topological group G can
be partially extended to the case where G is locally compact. Important
examples include the general linear group GL and the triangular groups GT
and GU . In this case, however, the integral representation does not provide
the normalizing constant for the pdf of the MIS, but still provides a useful
expression for the likelihood ratio, e.g. (7.36). References include:
S. A. Andersson (1982). Distributions of maximal invariants using quotient
measures. Ann. Statist. 10 955-961.
M. L. Eaton (1989). Group Invariance Applications in Statistics. Regional
Conference Series in Probability and Statistics Vol. 1, Institute of Mathe-
matical Statistics.
R. A. Wijsman (1990). Invariant Measures on Groups and their Use in
Statistics. Lecture Notes – Monograph Series Vol. 14, Institute of Mathe-
matical Statistics.

112
STAT 542 Notes, Winter 2007; MDP

8. The MANOVA Model and Testing Problem. (LehmannTSH Ch.8.)

8.1. Characterization of a MANOVA subspace.


In Section 3.3 the multivariate linear model was defined as follows:
Y1 , . . . , Ym (note m not n) are independent p × 1 vector observations hav-
ing common unknown pd covariance matrix Σ. Let Yj ≡ (Y1j , . . . , Ypj ) ,
j = 1, . . . , m. We assume that each of the p variates satisfies the same
univariate linear model, that is,

(8.1) E(Yi1 , . . . , Yim ) = βi X, i = 1, . . . , p,

where X : l × m is the design matrix, rank(X) = l ≤ m, and βi : 1 × l


is a vector of unknown regression coefficients. Equivalently, (8.1) can be
expressed geometrically as

(8.2) E(Yi1 , . . . , Yim ) ∈ L(X) ≡ row space of X ⊆ Rm , i = 1, . . . , p.

In matrix form, (8.1) and 8.2) can be written as

(8.3) E(Y ) ∈ {βX | β ∈ M(p, l)} =: Lp (X),

where M(a, b) denotes the vector space of all real a × b matrices,

Y ≡ (Y1 , . . . , Ym ) ∈ M(p, m),


 
β1
 
β ≡  ...  .
βp

Note that Lp (X) is a linear subspace of M(p, m) with

dim(Lp (X)) = p · dim(L(X)) = p l,

a multiple of p. Then (8.2) can be expressed equivalently as20


   
 v 1  
p  ..  
(8.4) E(Y ) ∈ ⊕i=1 L(X) ≡  v1 , . . . , vp ∈ L(X) .
 .  
vp
20
(8.3) and (8.4) can also be written as E(Y ) ∈ Rp ⊗ L(X).

113
STAT 542 Notes, Winter 2007; MDP

The forms (8.1) – (8.4) are all extrinsic, in that they require spec-
ification of the design matrix X, which in turn is specified only after a
choice of coordinate system. We seek to express these equivalent forms in
an intrinsic algebraic form that will allow us to determine when a specified
linear subspace L ⊆ M(p, m) can be written as Lp (X) for some X. This
is accomplished by means of an invariant ≡ coordinate-free definition of a
MANOVA subspace.

Definition 8.1. A linear subspace L ⊆ M(p, m) is called a MANOVA


subspace if

(8.5) M(p, p) L ⊆ L.

Because M(p, p) is in fact a matrix algebra (i.e., closed under matrix mul-
tiplication as well as matrix addition) that contains the identity matrix Ip ,
(8.5) is equivalent to the condition M(p, p) L = L.
¯

Proposition 8.2. Suppose that L is a linear subspace of M(p, m). The


following are equivalent:
(a) L is a MANOVA subspace.
(b) L = Lp (X) for some X ∈ M(l, m) of rank l ≤ m (so dim(L) = p l).
(c) There exists an orthogonal matrix Γ : m × m such that

(8.6) LΓ = {(µ, 0p×(m−l) )  µ ∈ M(p, l)} (l ≤ m).

((8.6) is the canonical form of a MANOVA subspace.)


(d) There exists a unique m × m projection matrix P such that

(8.7) L = {x ∈ M(p, m) | x = xP }.

Note: if L = Lp (X) then P = X  (XX  )−1 X and l = tr(P ) [verify]. Also, Γ


is obtained from the spectral decomposition P = Γ diag(Il , 0m−l ) Γ .
Proof. The equivalence of (b), (c), and (d) is proved exactly as for uni-
variate linear models [reference?] It is straightforward to show that (b) ⇒
(a). We now show that (a) ⇒ (d).

114
STAT 542 Notes, Winter 2007; MDP

Let i ≡ (0, . . . , 0, 1, 0, . . . , 0) denote the i-th coordinate vector in Rp ≡


M(1, p) and define Li := i L ⊆ M(1, n), i = 1, . . . , p. Then for every pair
i, j, it follows from (a) that

Lj = j L = i Πij L ⊆ i L = Li ,

where Πij ∈ M(p, p) is the i, j-permutation matrix, so

L1 = · · · = Lp =: L̃ ⊆ M(1, m).

Let P : m × m be the unique projection matrix onto L̃. Then for x ∈ L,


 p
xP = Ip xP = xPi i
i=1
 p  p  p

= i i xP = i i x = i i x = x,
i=1 i=1 i=1

where the third equality holds since i x ∈ Li ≡ L̃, hence

L ⊆ {x ∈ M(p, m) | x = xP }.

Conversely, for x ∈ M(p, m),

xP = x =⇒ i xP = i x, i = 1, . . . , p,
=⇒ i x ∈ L̃ ≡ Li
=⇒ i x = i xi for some xi ∈ L
 p p

=⇒ x ≡ i i x = (i i )xi ∈ L,
i=1 i=1

where the final membership follows from (a) and the assumption that L is
a linear subspace. Thus

L ⊇ {x ∈ M(p, m) | x = xP },

which completes the proof.


¯

115
STAT 542 Notes, Winter 2007; MDP

Remark 8.3. In the statistical literature, multivariate linear models often


occur in the form

(8.8) Lp (X, C) := {βX | β ∈ M(p, l), βC = 0},

where C : l × s (with rank(C) = s ≤ l) determines s linear constraints on


β. To see that Lp (X, C) is in fact a MANOVA subspace and thus can be
re-expressed in the form Lp (X0 ) for some design matrix X0 , by Proposition
8.2 it suffices to verify that

M(p, p) Lp (X, C) ⊆ Lp (X, C),

which is immediately evident.


¯

8.2. Reduction of a MANOVA testing problem to canonical form.


A normal MANOVA model is simply a normal multivariate linear model
(3.14), i.e., one observes

(8.9) Y ≡ (Y1 , . . . , Ym ) ∼ Np×m (η, Σ ⊗ Im ) with η ∈ L ⊆ Rp×m ,

where L is a MANOVA subspace of Rp×m and Σ > 0 is unknown.


The MANOVA testing problem is that of testing

(8.10) η ∈ L0 vs. η∈L based on Y,

for two MANOVA subspaces L0 ⊂ L ⊂ Rp×m with

dim(L0 ) ≡ p l0 < p l ≡ dim(L).


¯

Proposition 8.4. (extension of Proposition 8.2c). Let r = l−l0 , n = m−l.


There exists an m × m orthogonal matrix Γ∗ such that

L Γ∗ = {(ξ, µ, 0p×n ) | ξ ∈ M(p, l0 ), µ ∈ M(p, r) },


(8.11)
L0 Γ∗ = {(ξ, 0p×r , 0p×n ) | ξ ∈ M(p, l0 ) }.

Proof. Again this is proved exactly as for univariate linear subspaces:


From (8.6), choose Γ : n × n orthogonal such that

LΓ = {(ξ, µ, 0p×n ) | ξ ∈ M(p, l0 ), µ ∈ M(p, r)}.

116
STAT 542 Notes, Winter 2007; MDP
 
Il
By (8.5), L0 Γ is a MANOVA subspace of Rpl , so we can find
0n×l
Γ0 : l × l orthogonal so that
 
Il
L0 Γ Γ0 = {(ξ, 0p×r ) | ξ ∈ M(p, l0 )}.
0n×l
 
∗ Γ0 0l×n
Now take Γ = Γ and verify that (8.11) holds.
¯
0n×l In

From (8.11) the MANOVA testing problem (8.10) is transformed to


that of testing

µ = 0 vs. µ = 0 with ξ ∈ M(p, l0 ) and Σ unknown


(8.12)

based on Y ∗ := Γ∗ Y ≡ (U, Y, Z) ∼ Np×m (ξ, µ, 0p×n ), Σ ⊗ Im .

This testing problem is invariant under G∗ := M(p, l0 ) acting as a trans-


lation group on U (and ξ):

(U, Y, Z) → (U + b, Y, Z),
(8.13)
(ξ, µ, Σ) →
 (ξ + b, µ, Σ).

Since M(p, l0 ) acts transitively on itself, the MIS and MIP are (Y, Z) and
(µ, Σ), resp., and the invariance-reduced problem becomes that of testing

µ=0 vs. µ = 0 with Σ unknown


(8.14)

based on (Y, Z) ∼ Np×(r+n) (µ, 0p×n ), Σ ⊗ Ir+n .

For this problem, (Y, W ) := (Y, ZZ  ) is a sufficient statistic [verify], so


(8.14) is reduced by sufficiency to the canonical MANOVA testing problem
(6.24). As in Example 6.36, (6.24) is now reduced by invariance under (6.25)
to the testing problem (6.26) based on the nonzero eigenvalues of Y  W −1 Y .
(The condition n ≥ p, needed for the existence of the MLE Σ̂ in (6.24)
and (8.14), is equivalent to m ≥ l + p in (8.9) and (8.10).)
¯

117
STAT 542 Notes, Winter 2007; MDP

Remark 8.5. By Proposition 8.2b and Remark 8.3, Lp (X) and Lp (X, C)
are MANOVA subspaces of Rp×m such that Lp (X, C) ⊂ Lp (X). Thus the
general MANOVA testing problem (8.10) is often stated as that of testing

(8.15) η ∈ Lp (X, C) vs. η ∈ Lp (X).

[Add Examples]
¯

Exercise 8.6. Derive the LRT for (8.15).


Hint: The LRT already has been derived for the canonical MANOVA testing
problem in Exercise 6.37a. Now express the LRT statistic in terms of the
observation matrix Y, the design matrix X, and the constraint matrix C.
¯

8.3. Related topics.


8.3.1. Seemingly unrelated regressions (SUR).
If the p variates follow different univariate linear models, i.e., if (8.1) is
extended to

(8.16) E(Yi1 , . . . , Yim ) = βi Xi ∈ L(Xi ), i = 1, . . . , p,

where X1 : l1 × m, . . ., Xp : lp × m are design matrices with different row


spaces, the model (8.16) is called a seemingly unrelated regression (SUR)
model. The p univariate models are only “seemingly” unrelated because
they are correlated if Σ is not diagonal. Under the assumption of normality,
explicit likelihood inference (i.e., MLEs and LRTs) is not possible unless the
row spaces L(X1 ), . . . , L(Xp ) are nested. (But see Remark 8.9.)
¯

8.3.2. Invariant formulation of block-triangular matrices.


The invariant algebraic definition of a MANOVA subspace in Definition
8.1 suggests an invariant algebraic definition of generalized block-triangular
matrices. First, for any increasing sequence of integers

0 ≡ p0 < p1 < p2 < · · · < pr < pr+1 ≡ p (1 < r < p)

define the sequence

(8.17) {0} ⊂ V1 ⊂ V2 ⊂ · · · ⊂ Vr ⊂ Rp

118
STAT 542 Notes, Winter 2007; MDP

of proper linear subspaces of Rp as follows:

(8.18) Vi = span{1 , 2 , . . . , pi }, i = 1, . . . , r.

Consider a partitioned matrix

(8.19) A ≡ (Aij | 1 ≤ i, j ≤ r) ∈ M(p, p),

where Aij ∈ M(pi − pi−1 , pj − pj−1 ). Then A is upper block triangular,


i.e., Aij = 0 for 1 ≤ j < i ≤ r, if and only if [verify!]

AVi ⊆ Vi , i = 1, . . . , r

Thus the set of A of upper block-triangular matrices can be defined in the


following algebraic way:

(8.20) A ≡ A(p1 , . . . , pr ) := {A ∈ M(p, p) | AVi ⊆ Vi , i = 1, . . . , r}.

Exercise 8.7. Give an algebraic definition of the set of lower block trian-
gular matrices. ¯

More generally, let

(8.21) {0} ⊂ V1 ⊂ V2 ⊂ · · · ⊂ Vr ⊂ Rp

be a general increasing sequence of proper linear subspaces of Rp and define

(8.22) A ≡ A(V1 , . . . , Vr ) := {A ∈ M(p, p) | AVi ⊆ Vi , i = 1, . . . , p}.

Note that this is a completely invariant ≡ coordinate-free algebraic defi-


nition, and immediately implies that A is a matrix algebra, i.e., is closed
under matrix addition and multiplication [verify], and Ip ∈ A. The algebra
A is called the algebra of block-triangular matrices with respect to V1 , . . . , Vr .
The proper subset A∗ ⊂ A consisting of all nonsingular matrices in A is a
matrix group, i.e., it contains the identity matrix and is closed under matrix
inversion [verify]. Finally, it is readily seen that A(V1 , . . . , Vr ) is isomorphic
to A(p1 , . . . , pr ) under a similarity transformation, where pi := dim(Vi ). ¯

119
STAT 542 Notes, Winter 2007; MDP

Remark 8.8. Suppose that V1 , . . . , Vr is an arbitrary (i.e., non-nested)


finite collection of proper linear supspaces of Rp . Define A ≡ A(V1 , . . . , Vr )
as in (8.22). Then A is a generalized block-triangular matrix algebra [verify!]
and A∗ is a generalized block-triangular matrix group. Note too that

(8.23) A(V1 , . . . , Vr ) = A(L(V1 , . . . , Vr )),

where L(V1 , . . . , Vr ) is the lattice of linear subspaces generated from


(V1 , . . . , Vr ) by all possible finite unions and intersections.
¯

Remark 8.9. The algebra A ≡ A(L(V1 , . . . , Vr )) plays an important role


in the theory of normal lattice conditional independence (LCI) models (An-
dersson and Perlman (1993) Annals of Statistics). A subspace L ⊆ M(p, n)
is called an A-subspace if AL ⊆ L. It is shown by A&P (IMS Lecture Notes
Vol. 24, 1994) that if the linear model subspace L of a normal multivariate
linear model is an A-subspace and if the covariance structure satisfies a
corresponding set of LCI constraints, then the MLE and LRT statistics can
be obtained explicitly. This was extended to ADG covariance models by
A&P (J. Multivariate Analysis 1998), and to SUR models and non-nested
missing data models with conforming LCI covariance structure by Drton,
Andersson, and Perlman (J. Multivariate Analysis 2006). ¯

8.3.3. The GMANOVA model and testing problem.


(Recall Example 6.39.) [To be completed]
¯

120
STAT 542 Notes, Winter 2007; MDP

9. Testing and Estimation with Missing/Incomplete Data.


Let Y1 , . . . , Ym be an i.i.d. random sample from Np (µ, Σ) with µ and Σ
unknown. Partition Yk , µ, and Σ as

p1 p2
     
p1 Y1k µ1 p1 Σ11 Σ12
Yk = , µ= , Σ= .
p2 Y2k µ2 p2 Σ21 Σ22

Consider n additional i.i.d. observations V1 , . . . , Vn from Np2 (µ2 , Σ22 ), in-


dependent of Y1 , . . . , Ym . Here V1 , . . . , Vn can be viewed as incomplete ob-
servations from the original distribution Np (µ, Σ). We shall find the MLEs
µ̂, Σ̂ based on Y1 , . . . , Ym , V1 , . . . , Vn .
Because
Y1k | Y2k ∼ Np1 (α + βY2k , Σ11·2 ),
β = Σ12 Σ−1
22 ,
α = µ1 − βµ2 ,
the likelihood function (LF ≡ joint pdf of Y1 , . . . , Ym , V1 , . . . , Vn ) can be
written in the form

m
(1)

m
(2)

n
(2)
(9.1) fα,β,Σ11·2 (y1k | y2k ) fµ2 ,Σ22 (y2k ) fµ2 ,Σ22 (vk )
k=1 k=1 k=1
−m/2
−1 m ∗2

=c · |Σ11·2 | exp − 1
2 tr Σ11·2
k=1 (y 1k − α − βy2k )
−(m+n)/2
1 −1
%
m
∗2
n
&

· |Σ22 | exp − 2 tr Σ22 (y2k − µ2 ) + (vk − µ2 )∗2 ,


k=1 k=1

where (y)∗2 := yy  and the parameters α, β, Σ11·2 , µ2 , Σ22 vary indepen-


dently over their respective ranges. Thus we see that the LF is the product
of two LFs, the first that of a multivariate normal linear regression model
 
e
Np1 m ((α, β) , Σ11·2 )
Z

with e = (1, . . . , 1) : 1 × m and Z = (Y21 , . . . , Y2m ), and the second that of


m + n i.i.d. observations from Np (µ2 , Σ22 ).

121
STAT 542 Notes, Winter 2007; MDP

The MLEs for these models are given in (3.15), (3.16), (3.34), and
(3.35). To assure the existence of the MLE, the single condition m ≥ p + 1
is necessary and sufficient [verify!]. (This is the same condition required for
existence of the MLE based on the complete observations Y1 , . . . , Ym only.)
If this condition holds, then the MLEs of α, β, Σ11·2 , µ2 , Σ22 are as follows:

mȲ2 + nV̄
α̂ = Ȳ1 − β̂ Ȳ2 , µ̂2 = ,
m+n

−1 ∗2
(9.2) β̂ = S12 S22 , Σ̂22 1
= m+n S22 + T + mn
m+n (Ȳ2 − V̄ ) ,
1
Σ̂11·2 = m S11·2 ,

[verify!], where
m n
S= (Yk − Ȳ )∗2 , T = (Vk − V̄ )∗2 .
k=1 k=1

m+n
Verify that m+n−1 Σ̂22 is the sample covariance matrix based on the com-
bined sample Y21 , . . . , Y2m , V1 , . . . , Vn . Furthermore, the maximum value of
the LF is given by

(9.3) c · |Σ̂11·2 |−m/2 |Σ̂22 |−(m+n)/2 exp − 12 (mp + np2 ) .

Remark 9.1. The pairs (Ȳ , S) and (V̄ , T ) together form a complete and
sufficient statistic for the above incomplete data model.
¯

Remark 9.2. This analysis can be extended to the case of a monotone ≡


nested incomplete data model. The observed data consists of independent
observations of the forms
     
Y1
 Y2   Y2 
(9.4)  . ,  . , ..., 


,
 ..   .. 
Yr Yr Yr

where a complete observation Y ∼ Np (µ, Σ). The MLEs are obtained by


factoring the joint pdf of Y1 , . . . , Yr as

122
STAT 542 Notes, Winter 2007; MDP

(9.5) f (y1 , . . . , yr ) = f (y1 |y2 , . . . , yr )f (y2 |y3 , . . . , yr ) · · · f (yr−1 |yr )f (yr )

and noting that each conditional pdf is the LF of a normal linear regression
model.
¯

Exercise 9.4. Find the LRTs based on Y1 , . . . , Ym , V1 , . . . , Vn for testing


problems (i) and (ii) below. Argue that no explicit expression is available
for the LRT statistic in (iii). (Eaton and Kariya (1983) Ann. Statist.)

(i) H1 : µ2 = 0 vs. H : µ2 = 0 (µ1 and Σ unspecified).


(ii) H2 : µ1 = 0, µ2 = 0 vs.  0, µ2 = 0 ( Σ unspecified).
H : µ1 =
(iii) H3 : µ1 = 0 vs. H : µ1 = 0 (µ2 and Σ unspecified).

Partial solutions: First, for each testing problem, the LF is given by (9.1)
and its maximum under H given by (9.3).
(i) Because α = µ1 when µ2 = 0, it follows from (9.1) that the LF under
H1 is given by


m

−m/2 −1
c ·|Σ11·2 | exp − 1
2 tr Σ11·2 (y1k − µ1 − βy2k )∗2
k=1
(9.6)
−(m+n)/2
−1
%
m
∗2

n
&

· |Σ22 | exp − 1
2 tr Σ22 (y2k ) + (vk )∗2 ,
k=1 k=1

Thus the maximum of the LF under H1 is given by


(9.7) c · |Σ̂11·2 |−m/2 |Σ̃22 |−(m+n)/2 exp − 12 (mp + np2 ) ,

where
1
Σ̃22 := m+n (S̃22 + T̃ )
 m 
n
1 ∗2 ∗2
= m+n (Y2k ) + (Vk )
k=1 k=1

123
STAT 542 Notes, Winter 2007; MDP

[verify!]. Thus, by (9.3) and (9.7) the LRT rejects H2 in favor of H for large
values of [verify!]
 ∗2 
 mȲ2 +nV̄ 
|Σ̃22 | Σ̂22 + m+n 
=
|Σ̂22 | |Σ̂22 |

= 1+ mȲ2 +nV̄
m+n Σ̂−1
22
mȲ2 +nV̄
m+n

≡ 1 + T22 .

Note that T22 is exactly the T 2 statistic for testing µ2 = 0 vs. µ2 = 0 based
on the combined sample Y21 , . . . , Y2m , V1 , . . . , Vn , so the LRT ignores the
observations Y11 , . . . , Y1m .
(ii) The LRT statistic is the product of the LRT statistics for problem (i)
and for the problem of testing µ1 = 0, µ2 = 0 vs. µ1 = 0, µ2 = 0 (see
Exercise 6.14). Both LRTs can be obtained explicitly, but the distribution
of their product is not simple. (See Eaton and Kariya (1983).)
(iii) Under H3 : µ1 = 0, µ2 appears in different forms in the two exponen-
tials on the right-hand side of (9.1), hence maximization over µ2 cannot be
done explicitly.
¯

Exercise 9.5. For simplicity, assume µ is known, say µ = 0. Find the LRT
based on Y1 , . . . , Ym , V1 , . . . , Vn for testing

H0 : Σ12 = 0 vs. H : Σ12 = 0 (Σ11 and Σ22 unspecified).

Solution: The LRT statistic for this problem is the same as if the addi-
tional observations V1 , . . . , Vn were not present (cf. Exercise 6.24), namely
|S11 ||S22 |
|S| . This can be seen by examining the LF factorization in (9.1) when
µ = 0 (so α = 0 and µ2 = 0). The null hypothesis H0 : Σ12 = 0 is equivalent
to β = 0, so the second exponential on the right-hand side of (9.1) is the
same under H0 and H, hence has the same maximum value under H0 and
H. Thus this second factor cancels when forming the LRT statistic, hence
the LRT does not involve V1 , . . . , Vn .
¯

124
STAT 542 Notes, Winter 2007; MDP

9.1. Lattice conditional independence (LCI) models for non-


monotone missing/incomplete data.
If the incomplete data pattern is non-monotone ≡ non-nested, then no
explicit expressions exist for the MLEs. Instead, an iterative procedure
such as the EM algorithm must be used to compute the MLEs. (Caution:
convergence to the MLE is not always guaranteed, and the choice of starting
point may affect the convergence properties.)
An example of a non-monotone incomplete data pattern is
       
Y1 Y1
(9.8)  Y2  ,   ,  Y2  ,   .
Y3 Y3 Y3 Y3

Here no compatible factorization of the joint pdf such as (9.5) is possible.


However, Rubin (Multiple Imputation, 1987) and Andersson and Perlman
(Statist. Prob. Letters, 1991) have pointed out that a compatible factor-
ization is possible if a parsimonious set of lattice conditional independence
(LCI) restrictions determined by the incomplete data pattern is imposed on
the (unknown) covariance matrix Σ. In the present example, these restric-
tions reduce to the single condition Y1 ⊥ ⊥ Y2 | Y3 , in which case the joint
pdf of Y1 , Y2 , Y3 factors as

(9.9) f (y1 , y2 , y3 ) = f (y1 |y3 )f (y2 |y3 )f (y3 ).

Here again each conditional pdf is the LF of a normal linear regression


model, so the MLEs of the corresponding regression parameters can be
obtained explicitly.
Of course, the LCI restriction may not be defensible, but it can be
tested. If it is rejected, at least the MLEs obtained under the LCI restriction
may serve as a reasonable starting value for the EM algorithm. (See L. Wu
and M. D. Perlman (2000) Communications in Statistics - Simulation and
Computation 29 481-509.)

[Add handwritten notes on LCI models.]

125
STAT 542 Notes, Winter 2007; MDP

Appendix A. Monotone Likelihood Ratio and Total Positivity.

In Section 6 we study multivariate hypothesis testing problems which re-


main invariant under a group of symmetry transformations. In order to
respect these symmetries, we shall restrict consideration to test functions
that possess the same invariance properties and seek a uniformly most pow-
erful invariant (UMPI) test. Under multivariate normality, the distribution
of a UMPI test statistic is often a noncentral chi-square or related noncen-
tral distribution. To verify the UMPI property it is necessary to establish
that the noncentral distribution has monotone likelihood ratio (MLR) with
respect to the noncentrality parameter. For this we will rely on the relation
between the MLR property and total positivity of order 2.

Definition A.1. Let f (x, y) ≥ 0 be defined on A × B, a Cartesian product


of intervals in R1 . We say that f is totally positive of order 2 (TP2) if
 
 f (x1 , y1 ) f (x1 , y2 ) 
 
 f (x2 , y1 ) f (x2 , y2 )  ≥ 0 ∀ x1 < x2 , y1 < y2 ,

i .e., if
(A.1) f (x1 , y1 )f (x2 , y2 ) ≥ f (x1 , y2 )f (x2 , y1 ).

If f > 0 on A × B then (5.1) is equivalent to the following condition:

f (x2 , y)
(A.2) is nondecreasing in y ∀ x1 < x2 .
f (x1 , y)

Note that f (x, y) is TP2 on A × B iff f (y, x) is TP2 on B × A.


¯

Fact A.2. If f and g are TP2 on A × B then f · g is TP2 on A × B. In


particular, a(x)b(y)f (x, y) is TP2 for any a(·) ≥ 0 and b(·) ≥ 0.
¯

Fact A.3. If f is TP2 on A × B  and φ : A → A and ψ : B → B  are


both increasing or both decreasing, then f (φ(x), ψ(y)) is TP2 on A × B.
¯

∂ 2 log f
Fact A.4. If f (x, y) > 0 and ∂x∂y ≥ 0 on A × B then f is TP2.
¯

126
STAT 542 Notes, Winter 2007; MDP

Fact A.5. If f (x, y) = g(x − y) and g : R1 → [0, ∞) is log-concave, then


f is TP2 on R2 .
Proof. Let h(x) = log g(x). For x1 < x2 , y1 < y2 set

s = x1 − y1 , u = x1 − y2 ,
t = x2 − y2 , v = x2 − y1 .
Then [verify]
u ≤ min(s, t) ≤ max(s, t) ≤ v,
s + t = u + v,
so, since h is concave,
h(s) + h(t) ≥ h(u) + h(v),

which is equivalent to the TP2 condition (A.1) for f (x, y) ≡ g(x − y).
¯

These Facts yield the following examples of TP2 functions f (x, y):

Example A.6. Exponential kernel: f (x, y) = exy is TP2 on R1 × R1 .

Example A.7. Exponential family: f (x, y) = a(x)b(y)eφ(x)ψ(y) is TP2 on


A × B if a(·) ≥ 0 on A, b(·) ≥ 0 on B, φ(·) is increasing on A, and ψ(·) is
increasing on B. In particular, f (x, y) = xy is TP2 on (0, ∞) × R1 .

Example A.8. Order kernel: f (x, y) = (x − y)α + and f (x, y) = (x − y)−


α

are TP2 on R1 × R1 for α ≥ 0. [I(0,∞) and I(∞,0) are log concave on R1 .]

The following is a celebrated result in the theory of total positivity.


Proposition A.9. Composition Lemma ≡ Karlin’s Lemma (due to
Polya and Szego). If g(x, y) is TP2 on A × B and h(x, y) is TP2 on B × C,
then for any σ-finite measure µ,

(A.3) f (x, z) := g(x, y)h(y, z)dµ(y)
B

is TP2 on A × C.

127
STAT 542 Notes, Winter 2007; MDP

Proof. For x1 ≤ x2 and z1 ≤ z2 ,


f (x1 , z1 )f (x2 , z2 ) − f (x1 , z2 )f (x2 , z1 )

= g(x1 , y)g(x2 , u)[h(y, z1 )h(u, z2 ) − h(y, z2 )h(u, z1 )]dµ(y)dµ(u)
  
= + + .
{y<u} {y>u} {y=u}
  
=0

By interchanging the dummy variables y and u, however, we see that



g(x1 , y)g(x2 , u)[h(y, z1 )h(u, z2 ) − h(y, z2 )h(u, z1 )]dµ(y)dµ(u)
{y>u}

= g(x1 , u)g(x2 , y)[h(u, z1 )h(y, z2 ) − h(u, z2 )h(y, z1 )]dµ(y)dµ(u)
{y<u}
so   
+
{y<u} {y>u}

= [g(x1 , y)g(x2 , u) − g(x1 , u)g(x2 , y)]
{y<u}
· [h(y, z1 )h(u, z2 ) − h(y, z2 )h(u, z1 )]dµ(y)dµ(u) ≥ 0
since g and h are TP2. Thus h is TP2.
¯
∞ k k
Example A.10. Power series: f (x, y) = k=0 ck x y is TP2 on
(0, ∞) × (0, ∞) if ck ≥ 0 ∀k.
Proof. Apply the Composition Lemma with g(x, k) = xk , h(k, y) = y k ,
and µ the measure that assigns mass ck to k = 0, 1, . . ..
¯

Definition A.11. Let {f (x|λ) | λ ∈ Λ} be a 1-parameter family of pdfs


(discrete or continuous) for a real random variable X with range X , where
both X and Λ are intervals in R1 . We say that f (x|λ) has monotone
likelihood ratio (MLR) if f (x|λ) is TP2 on X × Λ. ¯

Proposition A.12. MLR preserves monotonicity. If f (x|λ) has MLR


and g(x) is nondecreasing on X , then

Eλ [g(X)] ≡ g(x) f (x|λ)dν(x)
X

128
STAT 542 Notes, Winter 2007; MDP

is nondecreasing in λ (ν is either counting measure or Lebesgue measure).


Proof. Set h(λ) = Eλ [g(X)]. Then for any λ1 ≤ λ2 in Λ,

h(λ2 ) − h(λ1 )

= g(x)[f (x|λ2 ) − f (x|λ1 )]dν(x)
11
= 12 [g(x) − g(y)] [f (x|λ2 )f (y|λ1 ) − f (y|λ2 )f (x|λ1 )]dν(x)dν(y)
≥ 0,

since the two [· · ·] terms are both ≥ 0 if x ≥ y or both ≤ 0 if x ≤ y.


¯

Remark A.13. If {f (x|λ)} has MLR and X ∼ f (x|λ), then for each a ∈ X ,
% &
Prλ [ X > a] ≡ Eλ I(a,∞)(x)

is nondecreasing in λ, hence X is stochastically increasing in λ.


¯

Example A.14. The noncentral chi-square distribution χ2n (δ) has


MLR w.r.to δ.
From (2.27), a noncentral chi-square rv χ2n (δ) with n df and noncentrality
parameter δ is a Poisson(δ/2)-mixture of central chi-square rvs:

(A.4) χ2n (δ)  K = k ∼ χ2n+2k , K ∼ Poisson(δ/2).

Thus if fn (x|δ) and fn (x) denote the pdfs of χ2n (δ) and χ2n , then


fn (x|δ) = fn+2k (x) Pr[K = k]
k=0

   δ
k 
 x 2 +k−1 e− 2
n x
e− 2 2δ
= n
·
k=0
2 2 +1 Γ n2 + k k!


2 −1 −x − δ2
n
(A.5) ≡x e 2 ·e · ck xk δ k ,
k=0

where ck ≥ 0. Thus by A.2, A.3, and A.10, fn (x|δ) is TP2 in (x, δ).
¯

129
STAT 542 Notes, Winter 2007; MDP

Example A.15. The noncentral F distribution Fm,n (δ) has MLR


w.r.to δ. Let
χ2m (δ)
distn
(A.6) Fm,n (δ) = ,
χ2n
the ratio of two independent chi-square rvs with χ2m (δ) noncentral and χ2n
central. From (A.4), Fm,n (δ) can be represented as a Poisson mixture of
central F distributions:

(A.7) Fm,n (δ)  K = k ∼ Fm+2k,n , K ∼ Poisson (δ/2) ,

so if fm,n (x|δ) and fm,n (x) now denote the pdfs of Fm,n (δ) and Fm,n , then


fm,n (x|δ) = fm, n+2k (x) Pr[K = k]
k=0



  δ
k 
 Γ m+n + k x
m
2 +k−1 e− 2 2δ
= m 2
n
· m+n ·
Γ 2 + k Γ 2 (x + 1) 2 +k−1 k!
k=0

x 2 −1
m
∞ x k
− δ2
(A.8) ≡ m+n ·e · dk δk ,
(x + 1) 2 −1 k=0
x+1

where dk ≥ 0. Thus by A.2 and A.10, fm,n (x|δ) is TP2 in (x, δ).
¯

Question A.16. Does χ2n (δ) have MLR w.r.to n? (δ fixed) Does Fm,n (δ)
have MLR w.r.to m? (n, δ fixed) ¯

Proposition A.17. Scale mixture of a TP2 kernel. Let g(x, y) be


TP2 on R1 × (0, ∞) and let h be a nonnegative function on (0, ∞) such
that h (y/ζ) is TP2 for (y, ζ) ∈ (0, ∞) × (0, ∞). Then
 ∞
(A.9) f (x, ζ) := g(x, ζz)h(z)dz
0

is TP2 on R1 × (0, ∞).


Proof. Set y = ζz, so
 ∞  
y dy
f (x, ζ) = g(x, y) h ,
0 ζ ζ

130
STAT 542 Notes, Winter 2007; MDP

hence the result follows from the Composition Lemma.


¯

Example A.18. The distribution of the multiple correlation coef-


ficient R2 has MLR w.r. to ρ2 .
Let R2 , ρ2 , U , ζ, and Z be as defined in Example 3.21 (also see Example
6.26 and Exercise 6.27). From (3.68),

U  Z ∼ Fp−1, n−p+1 (ζZ),
(A.10)
Z ∼ χ2n ,

so the unconditional pdf of u with parameter ζ is given by


 ∞
f (u|ζ) = fp−1, n−p+1 (u| ζz) fn (z)dz
0

where fp−1, n−p+1 (·| ζz) and fn (·) are the pdfs for Fp−1, n−p+1 (ζz) and χ2n ,
respectively. Then fp−1, n−p+1 (u|y) is TP2 in (u, y) by Example A.15, while
n2 −1 y
fn y
ζ =c· y
ζ e− 2ζ

is TP2 in (y, ζ) by Example A.7, so f (u|ζ) is TP2 in (u, ζ) by Proposition


A.17. Finally, because U and ζ are increasing functions of R2 and ρ2 ,
respectively, it follows by Fact A.3 that the distribution of R2 has MLR
w.r.to ρ2 .
¯

131

You might also like