
Department of Statistics

Class notes – Part 1


Multivariate Analysis
WST 311

Last revision: February 2023


Revision by: Prof JT Ferreira
© Copyright reserved
Contents

I MULTIVARIATE ANALYSIS

1 MATRIX ALGEBRA
1.1 General Definitions
1.1.1 Addition and Multiplication
1.1.2 Determinants
1.1.3 Elementary Operations and Inverses
1.1.4 Vector Space and Rank
1.1.5 Properties of the Rank of a Matrix
1.1.6 Orthogonal Vectors
1.2 Basic Results related to matrices
1.2.1 Properties of Trace and Inverse
1.2.2 Characteristic Roots and Vectors

2 MULTIVARIATE DISTRIBUTIONS
2.1 Moments and Characteristics of a Multivariate Distribution
2.1.1 Expected Values and Covariance Matrices
2.1.2 Sample Mean, Sample Covariance and Sample Correlation Matrices
2.1.3 The Multivariate Change of Variable Technique
2.1.4 Moment Generating Functions
2.2 The Multivariate Normal Distribution
2.2.1 Classical Definition of the Multivariate Normal Distribution
2.2.2 Standard Multivariate Normal Distribution
2.2.3 The Moment Generating Function of the Multivariate Normal Distribution
2.2.4 Conditional Distributions of Normal Random Vectors
2.2.5 Multiple Correlation

3 NORMAL SAMPLES
3.1 Estimation of Parameters
3.1.1 The Mean and Covariance Matrix
3.1.2 Correlation Coefficients
3.2 Sampling Distributions
3.2.1 The Mean and Covariance Matrix
3.2.2 Quadratic Forms of Normally Distributed Variates
3.2.3 The Correlation Coefficient
3.3 Inferences Concerning Multivariate Means
3.3.1 The distribution of $T^2$
3.3.2 The Case of Unknown $\Sigma$
3.4 Principal Component Analysis
3.4.1 Basics
3.4.2 Principal Components
3.4.3 Calculation of Principal Components for a Dataset

II THE LINEAR MODEL

4 SIMPLE LINEAR REGRESSION
4.1 The model
4.2 Estimation of $\beta_0$, $\beta_1$ and $\sigma^2$
4.3 Hypothesis test and confidence interval for $\beta_1$
4.4 Coefficient of determination

5 MULTIPLE LINEAR REGRESSION: Estimation
5.1 Introduction
5.2 The model
5.3 Estimation of $\beta$ and $\sigma^2$
5.3.1 Least-squares estimator for $\beta$
5.3.2 Properties of the least-squares estimator $\hat{\beta}$
5.3.3 An estimator for $\sigma^2$
5.4 Normal model
5.4.1 Assumptions
5.4.2 Maximum likelihood estimators for $\beta$ and $\sigma^2$
5.4.3 Properties of $\hat{\beta}$ and $\hat{\sigma}^2$
5.5 Coefficient of determination, $R^2$

6 MULTIPLE LINEAR REGRESSION: Test of Hypotheses
6.1 Test of overall regression
6.2 The general linear hypothesis test for $H_0 : C\beta = 0$
6.3 Testing one $\beta_j$
6.4 Additional topics

7 ONE-WAY ANALYSIS-OF-VARIANCE

8 ANALYSIS-OF-COVARIANCE

III THE GENERALISED LINEAR MODEL

9 EXPONENTIAL FAMILIES
9.1 Normal distribution
9.2 Poisson distribution
9.3 Bernoulli distribution
9.4 Binomial distribution
9.5 Exponential distribution
9.6 Gamma distribution

10 GENERALISED LINEAR MODEL (GLIM or GLM)
10.1 Components of a generalised linear model
10.2 Model fitting
10.3 Deviance and comparison of models
10.4 Residuals
Part I
MULTIVARIATE ANALYSIS

1 MATRIX ALGEBRA
In this section a broad overview is given of the tools and results that are essential when working with vectors and matrices. You need to be able to understand, use, and motivate these results in any of the subsequent sections of work.

1.1 General Definitions


1.1.1 Addition and Multiplication

Definition 1 A matrix $A : m \times n = (a_{ij})$ is a rectangular ordering of the elements $a_{ij}$, $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$ in $m$ rows and $n$ columns:
\[
A = A : m \times n = (a_{ij}) =
\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}.
\]
The transpose of $A$ is the matrix
\[
A' = A' : n \times m = (a_{ji}) =
\begin{pmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{pmatrix}.
\]
A matrix is said to be square if it has as many rows as it has columns (i.e. # of rows = # of columns). A square matrix $A$ is:

symmetric if $A' = A$
skew-symmetric if $A' = -A$
lower triangular if $a_{ij} = 0$ for $i < j$
upper triangular if $a_{ij} = 0$ for $i > j$.

Addition and multiplication are defined by
\[
A + B = (a_{ij} + b_{ij})
\]
and
\[
AB = \left( \sum_{\ell=1}^{n} a_{i\ell} b_{\ell j} \right).
\]
Addition and multiplication of matrices are defined only if the dimensions of the matrices correspond. From the definition of matrix multiplication it follows that if the matrices $A$ and $B$ consist of submatrices $A_{ij}$ and $B_{ij}$ with corresponding dimensions, then it is possible to express the matrix product in terms of the submatrices. That is, if
\[
A = A : m \times n = (A_{ij}) =
\begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1h} \\ A_{21} & A_{22} & \cdots & A_{2h} \\ \vdots & \vdots & & \vdots \\ A_{g1} & A_{g2} & \cdots & A_{gh} \end{pmatrix}
\]
and
\[
B = B : n \times k = (B_{ij}) =
\begin{pmatrix} B_{11} & B_{12} & \cdots & B_{1k} \\ B_{21} & B_{22} & \cdots & B_{2k} \\ \vdots & \vdots & & \vdots \\ B_{h1} & B_{h2} & \cdots & B_{hk} \end{pmatrix}
\]
then the product $C = AB$ is given by
\[
C = C : m \times k = (C_{ij}) =
\begin{pmatrix} C_{11} & C_{12} & \cdots & C_{1k} \\ C_{21} & C_{22} & \cdots & C_{2k} \\ \vdots & \vdots & & \vdots \\ C_{g1} & C_{g2} & \cdots & C_{gk} \end{pmatrix}
\]
with
\[
C_{ij} = \sum_{\ell=1}^{h} A_{i\ell} B_{\ell j}.
\]

Remark 1 What is the difference between the dimensions $n$, $k$, $m$, $g$ and $h$ in the example above?

The transpose of a matrix satisfies the following properties:

$(A')' = A$
$(A + B)' = A' + B'$
$(AB)' = B'A'$

The following special matrices are important:

1. Identity matrix:
\[
I = I : n \times n =
\begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}
\]

2. Diagonal matrix:
\[
D = D : n \times n = \mathrm{diag}(d_1, d_2, \ldots, d_n) =
\begin{pmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & d_n \end{pmatrix}
\]

3. Column vector:
\[
a = a : m \times 1 = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{pmatrix}
\]

4. Row vector:
\[
b' = b' : 1 \times n = (b_1, b_2, \ldots, b_n)
\]
1.1.2 Determinants

The determinant of the $2 \times 2$ matrix
\[
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}
\]
is
\[
|A| = \det(A) = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc.
\]

Determinants have the following properties:

1. $|A| = |A'|$
2. $|\mathrm{diag}(d_1, d_2, \ldots, d_n)| = d_1 d_2 \cdots d_n$.
3. $|A| = 0$ if any row (column) contains only zeros or if any row (column) is a multiple of another row (column).
4. If a single row (column) of a square matrix is multiplied by a scalar $r$, then the determinant of the resulting matrix is $r|A|$.
5. If $|A| = 0$, then the matrix is called singular.
6. $|AB| = |BA| = |A||B|$ provided that $A$ and $B$ are both square matrices.

1.1.3 Elementary Operations and Inverses

Definition 2 A square matrix $A$ is invertible if a matrix $C$ exists such that $AC = CA = I$. If $A$ is singular, $A$ is not invertible.

Theorem 1 If $A$ is square and $AC = DA = I$, then $C = D$.

Theorem 2 A square matrix $A$ is invertible if and only if $|A| \neq 0$.

Definition 3 A permutation matrix, $P$, is a row or column rearrangement of the identity matrix.

Theorem 3 For any permutation matrix $P$ it follows that $P'P = I = PP'$. Thus, $P$ is orthogonal.

The inverse of the matrix $A$ can also be determined by solving the system of equations
\[
Ax = y
\]
because
\[
x = A^{-1}y.
\]
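As a quick computational illustration (added sketch, not part of the original notes; the small system below is invented purely for the example), PROC IML can compare the explicit inverse with the direct solution of the linear system:

proc iml;
/* hypothetical 2x2 system Ax = y, used only for illustration */
A = {4 1, 2 3};
y = {5, 7};
detA   = det(A);      /* nonzero determinant confirms invertibility */
xinv   = inv(A)*y;    /* x = A^{-1}y via the explicit inverse       */
xsolve = solve(A, y); /* x obtained by solving Ax = y directly      */
print detA xinv xsolve;
quit;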

1.1.4 Vector Space and Rank

Suppose that the rows of the matrix $A : m \times n$ are $a_1', a_2', \ldots, a_m'$. Then
\[
A = \begin{pmatrix} a_1' \\ a_2' \\ \vdots \\ a_m' \end{pmatrix}.
\]
Any linear combination of the rows of $A$ can be written as
\[
c'A = \sum_{i=1}^{m} c_i a_i'.
\]
The rows of $A$ are linearly dependent if a vector $c \neq 0$ exists such that
\[
c'A = \sum_{i=1}^{m} c_i a_i' = 0'.
\]
Otherwise the rows of $A$ are linearly independent.

If $b' = c'A = \sum_{i=1}^{m} c_i a_i'$ is a linear combination of the rows of $A$, then
\[
kb' = kc'A = \sum_{i=1}^{m} kc_i a_i'
\]
is also a linear combination of the rows of $A$.

It also follows that if $b_1' = c_1'A = \sum_{i=1}^{m} c_{1i}a_i'$ and $b_2' = c_2'A = \sum_{i=1}^{m} c_{2i}a_i'$ are two linear combinations of the rows of $A$, then
\[
b_1' + b_2' = c_1'A + c_2'A = (c_1' + c_2')A = \sum_{i=1}^{m} (c_{1i} + c_{2i})a_i'
\]
is also a linear combination of the rows of $A$.

Let $V$ be the set of vectors which are linear combinations of the rows of $A$. Then $V$ is called a vector space and is generated by the rows of $A$. The vector space $V$ is also called the row space of $A$.

The column space of $A$ is defined similarly.

Suppose that $a_1, a_2, \ldots, a_n$ are the columns of $A : m \times n$, without confusing the notation previously used to denote the rows of $A$. (Note that $a_j$ now has a different meaning. It indicates columns and is not the transpose of the rows!) Then
\[
A = (a_1, a_2, \ldots, a_n).
\]
Any linear combination of the columns of $A$ can be written as
\[
Ac = \sum_{j=1}^{n} c_j a_j.
\]
The columns of $A$ are linearly dependent if a vector $c \neq 0$ exists such that
\[
Ac = \sum_{j=1}^{n} c_j a_j = 0.
\]
Otherwise the columns of $A$ are linearly independent.

If $b = Ac = \sum_{j=1}^{n} c_j a_j$ is a linear combination of the columns of $A$, then
\[
kb = kAc
\]
is also a linear combination of the columns of $A$. It also follows that if $b_1 = Ac_1$ and $b_2 = Ac_2$ are two linear combinations of the columns of $A$, then
\[
b_1 + b_2 = Ac_1 + Ac_2 = A(c_1 + c_2)
\]
is also a linear combination of the columns of $A$.

The set of vectors which are linear combinations of the columns of $A$ is the vector space generated by the columns of $A$. This vector space is called the column space of $A$.

Definition 4 The row (column) rank of a matrix $A$ is the maximum number of linearly independent rows (columns) of $A$.

It can be shown that the row rank is equal to the column rank. It follows that
\[
\mathrm{rank}(A) = \mathrm{rowrank}(A) = \mathrm{columnrank}(A).
\]

1.1.5 Properties of the Rank of a Matrix

If $A : m \times n$, then

1. $0 \leq \mathrm{rank}(A) \leq \min(m, n)$
2. $\mathrm{rank}(A) = \mathrm{rank}(A')$
3. $\mathrm{rank}(A + B) \leq \mathrm{rank}(A) + \mathrm{rank}(B)$
4. $\mathrm{rank}(AB) \leq \min(\mathrm{rank}(A), \mathrm{rank}(B))$
5. $\mathrm{rank}(AA') = \mathrm{rank}(A'A) = \mathrm{rank}(A) = \mathrm{rank}(A')$
6. If $B : m \times m$, $C : n \times n$, $|B| \neq 0$ and $|C| \neq 0$, then it is true for the matrix $A : m \times n$ that $\mathrm{rank}(BAC) = \mathrm{rank}(A)$
7. If $m = n$, then $\mathrm{rank}(A) = m$ if and only if $A$ is non-singular.
1.1.6 Orthogonal Vectors

Definition 5 The vectors $a' = (a_1, a_2, \ldots, a_n)$ and $b' = (b_1, b_2, \ldots, b_n)$ are orthogonal if
\[
a'b = \sum_{i=1}^{n} a_i b_i = 0.
\]

Definition 6 The matrix $C : n \times n$ is idempotent if $C^2 = C$.

Suppose that $C$ is idempotent and $\mathrm{rank}(C) = r$. Then
\[
(I - C)^2 = I - 2C + C^2 = I - C.
\]
Therefore $I - C$ is also idempotent and
\[
(I - C)C = 0.
\]
Therefore the rows of $I - C$ are orthogonal to the columns of $C$ and $\mathrm{rank}(I - C) \leq \mathrm{rank}(I) - \mathrm{rank}(C) = n - r$.
But $n = \mathrm{rank}((I - C) + C) \leq \mathrm{rank}(I - C) + \mathrm{rank}(C)$, i.e. $\mathrm{rank}(I - C) \geq n - r$.
This implies that $\mathrm{rank}(I - C) = n - r$.
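A concrete example (added for illustration; the same matrix reappears in Sections 2.1.2 and 3.2.2): take $C = \frac{1}{n}\mathbf{1}_n\mathbf{1}_n'$. Then
\[
C^2 = \tfrac{1}{n^2}\mathbf{1}_n(\mathbf{1}_n'\mathbf{1}_n)\mathbf{1}_n' = \tfrac{1}{n^2}\, n\,\mathbf{1}_n\mathbf{1}_n' = C, \qquad \mathrm{rank}(C) = 1,
\]
so that $I_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n'$ is idempotent with rank $n - 1$.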

1.2 Basic Results related to matrices


1.2.1 Properties of Trace and Inverse

Definition 7 The trace of the matrix $A : n \times n$ is defined as
\[
\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii}.
\]

Properties of the trace:

1. $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ if both $AB$ and $BA$ are square.
2. $\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$
3. $\mathrm{tr}(I : n \times n) = n$
4. $\mathrm{tr}(A') = \mathrm{tr}(A)$

Properties of the inverse of a matrix:

1. $(AB)^{-1} = B^{-1}A^{-1}$
2. $(A^{-1})^{-1} = A$
3. $(A')^{-1} = (A^{-1})'$
4. $[\mathrm{diag}(d_1, d_2, \ldots, d_n)]^{-1} = \mathrm{diag}\left(\tfrac{1}{d_1}, \tfrac{1}{d_2}, \ldots, \tfrac{1}{d_n}\right)$
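A small numerical check of these properties (an added sketch; the two matrices are arbitrary choices, not from the notes):

proc iml;
A = {2 1, 0 3};
B = {1 4, 2 5};
/* tr(AB) = tr(BA) and tr(A+B) = tr(A) + tr(B) */
print (trace(A*B)) (trace(B*A)) (trace(A+B)) (trace(A)+trace(B));
/* (AB)^{-1} = B^{-1}A^{-1} and (A')^{-1} = (A^{-1})' */
print (inv(A*B)) (inv(B)*inv(A));
print (inv(A`)) (t(inv(A)));
quit;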

Orthogonal matrices play a big role in the general handling and analysis of matrices! We need to
understand them properly.

Definition 8 The matrix $H : n \times n$ is orthogonal if
\[
H'H = I
\]
or
\[
H' = H^{-1}.
\]
Since a left inverse is also a right inverse it follows that
\[
HH' = I.
\]
It also follows that $|H| = \pm 1$.

1.2.2 Characteristic Roots and Vectors

Definition 9 The characteristic equation of the matrix $A : n \times n$ is
\[
|\lambda I - A| = 0.
\]
The LHS (left-hand side) is a polynomial of degree $n$ in $\lambda$ and therefore has $n$ roots. If $\lambda_i$ is a root, the matrix $\lambda_i I - A$ is singular and the system of equations
\[
(\lambda_i I - A)x_i = 0
\]
has a non-zero solution $x_i$. Then $\lambda_i$ is the $i$-th characteristic root of $A$ and $x_i$ is the corresponding characteristic vector. Characteristic roots are also called eigenvalues of a matrix, and the characteristic vectors are also called the eigenvectors of a matrix.

Theorem 4 If $A$ is real and symmetric, $\lambda_i$ and $x_i$ are real.

Suppose $H$ is an orthogonal matrix. Since
\[
|\lambda I - A| = |H'H(\lambda I - A)| = |H'(\lambda I - A)H| = |\lambda I - H'AH|
\]
it follows that the characteristic roots of $H'AH$ are the same as those of $A$.

Theorem 5 Any symmetric matrix $A$ can be diagonalized in the form
\[
H'AH = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)
\]
where $\lambda_1, \lambda_2, \ldots, \lambda_n$ are the characteristic roots of $A$ and the columns of $H$ are the corresponding characteristic vectors of $A$ ($H$ being orthogonal).
The characteristic equation of the symmetric matrix $A : n \times n$ is
\[
|\lambda I - A| = |\lambda I - H'AH| = \prod_{i=1}^{n} (\lambda - \lambda_i).
\]
It also implies that the characteristic roots are the solutions of the characteristic equation. Let $H = (h_1, h_2, \ldots, h_n)$. Then
\[
AH = H\Lambda
\]
with $i$-th column
\[
Ah_i = \lambda_i h_i.
\]
The last equality implies that the columns of $H$ are the corresponding characteristic vectors of the matrix $A$.

Corollary 1

1. Let
\[
\Lambda^{\frac{1}{2}} = \mathrm{diag}(\lambda_1^{\frac{1}{2}}, \lambda_2^{\frac{1}{2}}, \ldots, \lambda_n^{\frac{1}{2}}).
\]
It follows that
\[
A = H\Lambda H' = \left(H\Lambda^{\frac{1}{2}}H'\right)\left(H\Lambda^{\frac{1}{2}}H'\right) = A^{\frac{1}{2}}A^{\frac{1}{2}},
\]
say. The matrix $A^{\frac{1}{2}} = H\Lambda^{\frac{1}{2}}H'$ (which may have complex elements) is the symmetric square root of $A$.

2. Suppose that $A : n \times n$ is symmetric and $y : n \times 1$. A quadratic form is defined as
\[
s^2 = y'Ay = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}y_iy_j.
\]

3. The matrix $A : n \times n$ is positive definite ($A > 0$) if
\[
y'Ay > 0 \quad \forall\ y \neq 0
\]
and semi-positive definite ($A \geq 0$) if
\[
y'Ay \geq 0 \quad \forall\ y.
\]
Suppose that $H'AH = \Lambda$ with $H$ orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ diagonal with diagonal elements the characteristic roots of $A$. Let $x = H'y$. Then
\[
y'Ay = x'H'AHx = x'\Lambda x = \sum_{i=1}^{n} \lambda_i x_i^2.
\]
It follows that $A$ is positive definite if
\[
\lambda_i > 0 \quad \forall\ i
\]
and semi-positive definite if
\[
\lambda_i \geq 0 \quad \forall\ i.
\]

4. The symmetric matrix $A$ is idempotent if
\[
A^2 = A.
\]

5. If $A$ is idempotent, let $A = H\Lambda H'$ with $H$ orthogonal and $\Lambda$ diagonal. Then
\[
A^2 = H\Lambda H'H\Lambda H' = H\Lambda^2 H' = H\Lambda H' = A
\]
which implies that
\[
\Lambda^2 = \Lambda.
\]
It therefore follows that the characteristic roots of the idempotent matrix $A$ are all either 1 or 0.
Furthermore
\[
\mathrm{tr}(A) = \mathrm{tr}(H\Lambda H') = \mathrm{tr}(\Lambda H'H) = \mathrm{tr}(\Lambda) = \mathrm{rank}(A).
\]

Example 1 Let
\[
A = \begin{pmatrix} 5 & 1 & 1 & 1 \\ 1 & 4 & 2 & 1 \\ 1 & 2 & 3 & 2 \\ 1 & 1 & 2 & 2 \end{pmatrix}
\]
be a symmetric matrix with eigenvalue-eigenvector pairs given by $(\lambda_i, h_i)$, $i = 1, 2, 3, 4$. Let
\[
\Lambda = \begin{pmatrix} \lambda_1 & 0 & 0 & 0 \\ 0 & \lambda_2 & 0 & 0 \\ 0 & 0 & \lambda_3 & 0 \\ 0 & 0 & 0 & \lambda_4 \end{pmatrix}
\quad \text{and} \quad H = (h_1\ h_2\ h_3\ h_4).
\]
Determine the following using a computer program such as PROC IML.

1) The eigenvalues and eigenvectors of $A$. Let LA indicate the four eigenvalues and HA the four eigenvectors.
2) $|A|$ and $\mathrm{tr}(A)$.
3) $|\Lambda|$ and $\mathrm{tr}(\Lambda)$.
4) $A^2$.
5) $H\Lambda H'$.
6) $H\Lambda^2 H'$.
7) $A^{\frac{1}{2}} = H\Lambda^{\frac{1}{2}}H'$.
8) $A^{\frac{1}{2}}A^{\frac{1}{2}}$.
9) Show that $HH' = I$, and make notes about any similarities/equalities of the quantities that you computed above.
SAS PROC IML program and output:
proc iml;
reset nolog;
* the symmetric matrix A;
A={5 1 1 1, 1 4 2 1, 1 2 3 2, 1 1 2 2};
* eigenvalues in LA, eigenvectors (columns) in HA;
call eigen(LA,HA,A);
detA=det(A);
trA=trace(A);
detL=det(diag(LA));
trL=trace(diag(LA));
print LA HA detA trA detL trL;

A2=A*A;
print A2;

* spectral decomposition H*Lambda*H` and H*Lambda^2*H`;
A0=HA*diag(LA)*HA`;
A02=HA*DIAG(LA#LA)*HA`;
print A0 A02;

* symmetric square root A^(1/2) = H*Lambda^(1/2)*H`;
A012=HA*sqrt(diag(LA))*HA`;
print A012;
AA=A012*A012;
print AA;
quit;

LA HA detA trA detL trL


7.6191635 0.5473053 -0.829238 -0.100607 -0.051959 22 14 22 14
3.9198248 0.5404882 0.4305459 -0.700513 0.1782824
2.1122743 0.5158159 0.3295425 0.4319455 -0.662389
0.3487373 0.3771781 0.1356338 0.5590915 0.7257802

A2
28 12 12 10
12 22 17 11
12 17 18 13
10 11 13 10

A0 A02
5 1 1 1 28 12 12 10
1 4 2 1 12 22 17 11
1 2 3 2 12 17 18 13
1 1 2 2 10 11 13 10

A012
2.2045471 0.206625 0.1953853 0.2431111
0.206625 1.9053232 0.5409514 0.1855291
0.1953853 0.5409514 1.4796956 0.6926017
0.2431111 0.1855291 0.6926017 1.1944785

AA
5 1 1 1
1 4 2 1
1 2 3 2
1 1 2 2

2 MULTIVARIATE DISTRIBUTIONS
2.1 Moments and Characteristics of a Multivariate Distribution
2.1.1 Expected Values and Covariance Matrices

Expected values and (co)variances are essential measures that give us a first indication of the behaviour of random variables. If we are working with a vector/matrix of variables (thus, multivariate), then our expected value must also be a vector/matrix. The covariance structure must then be a matrix - this is because if we have $k$ random variables, there are $k$ variances (one for each variable) and then $k^2 - k$ interaction variances, what we call covariances. Suppose that the elements of $X : p \times m$ are random variables. The expected value of $X$ is defined as
\[
E(X) = (E(X_{ij})).
\]
Suppose that $A : r \times p$ and $B : m \times s$ are constant matrices. The $(i, j)$-th element of $AXB$ is $\sum_{\ell=1}^{p}\sum_{\nu=1}^{m} a_{i\ell}X_{\ell\nu}b_{\nu j}$ and the expected value of this element is
\[
E\left(\sum_{\ell=1}^{p}\sum_{\nu=1}^{m} a_{i\ell}X_{\ell\nu}b_{\nu j}\right) = \sum_{\ell=1}^{p}\sum_{\nu=1}^{m} a_{i\ell}E(X_{\ell\nu})b_{\nu j}
\]
which is the $(i, j)$-th element of $A\,E(X)\,B$. It follows that
\[
E(AXB) = A\,E(X)\,B.
\]
Note that
\[
AXB =
\begin{pmatrix} a_{11} & \cdots & a_{1p} \\ \vdots & & \vdots \\ a_{r1} & \cdots & a_{rp} \end{pmatrix}
\begin{pmatrix} X_{11} & \cdots & X_{1m} \\ \vdots & & \vdots \\ X_{p1} & \cdots & X_{pm} \end{pmatrix}
\begin{pmatrix} b_{11} & \cdots & b_{1s} \\ \vdots & & \vdots \\ b_{m1} & \cdots & b_{ms} \end{pmatrix}
= \begin{pmatrix} \sum_{\ell=1}^{p} a_{1\ell}X_{\ell 1} & \cdots & \sum_{\ell=1}^{p} a_{1\ell}X_{\ell m} \\ \vdots & & \vdots \\ \sum_{\ell=1}^{p} a_{r\ell}X_{\ell 1} & \cdots & \sum_{\ell=1}^{p} a_{r\ell}X_{\ell m} \end{pmatrix}
\begin{pmatrix} b_{11} & \cdots & b_{1s} \\ \vdots & & \vdots \\ b_{m1} & \cdots & b_{ms} \end{pmatrix}
\]
\[
= \begin{pmatrix} \sum_{\ell=1}^{p}\sum_{\nu=1}^{m} a_{1\ell}X_{\ell\nu}b_{\nu 1} & \cdots & \sum_{\ell=1}^{p}\sum_{\nu=1}^{m} a_{1\ell}X_{\ell\nu}b_{\nu s} \\ \vdots & & \vdots \\ \sum_{\ell=1}^{p}\sum_{\nu=1}^{m} a_{r\ell}X_{\ell\nu}b_{\nu 1} & \cdots & \sum_{\ell=1}^{p}\sum_{\nu=1}^{m} a_{r\ell}X_{\ell\nu}b_{\nu s} \end{pmatrix}.
\]
Consider the random vector $X : p \times 1$ with $i$-th element $X_i$. Suppose that $\mathrm{var}(X_i) = \sigma_i^2 = \sigma_{ii}$ and $\mathrm{cov}(X_i, X_j) = \sigma_{ij}$. The covariance matrix of $X$ is defined as
\[
\mathrm{cov}(X, X') = E[X - E(X)][X - E(X)]' = E(XX') - E(X)E(X') = \Sigma =
\begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}.
\]
Here we already see that for $p$ variables there are $p$ variances (on the diagonal) and $p^2 - p$ covariances (on the off-diagonals). The matrix must be symmetric because, for example, $\sigma_{12} = \sigma_{21}$: the interaction variance (covariance) between variables 1 and 2 must be the same as the covariance between variables 2 and 1. In effect, the matrix has $p^2$ entries, $p$ of which are variances (on the diagonal), leaving $p^2 - p$ covariances. Because it is symmetric, the covariance matrix thus has $\frac{p^2 - p}{2}$ "unique" covariances. Note that a covariance may be 0 - this will be the case when the two corresponding variables are independent of each other.

Similarly, if $Y : q \times 1$ is a random vector with $j$-th element $Y_j$ and $\mathrm{cov}(X_i, Y_j) = \sigma_{ij}$, then
\[
\mathrm{cov}(X, Y') = E[X - E(X)][Y - E(Y)]' =
\begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1q} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2q} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pq} \end{pmatrix}.
\]
It follows that if $A_1 : r \times p$ and $A_2 : s \times p$ are constant matrices, then
\[
\mathrm{cov}(A_1X, (A_2X)') = E[A_1X - E(A_1X)][A_2X - E(A_2X)]' = A_1E[X - E(X)][X - E(X)]'A_2' = A_1\Sigma A_2'.
\]
The second last step follows since for $A : n \times p$ and $B : p \times m$
\[
(AB)' = B'A'.
\]
It can also be shown that
\[
\mathrm{cov}(B_1X, (B_2Y)') = B_1\,\mathrm{cov}(X, Y')\,B_2'.
\]
In particular it follows that if $X : p \times 1$ and $Y : q \times 1$ are random vectors and $a : p \times 1$, $b : q \times 1$ and $c : p \times 1$ are constant vectors, then
\[
\mathrm{cov}(a'X, a'X) = \mathrm{cov}(a'X, (a'X)') = \mathrm{var}(a'X) = a'\Sigma a
\]
\[
\mathrm{cov}(a'X, c'X) = \mathrm{cov}(a'X, (c'X)') = \mathrm{cov}(a'X, X'c) = a'\Sigma c
\]
\[
\mathrm{cov}(a'X, b'Y) = \mathrm{cov}(a'X, (b'Y)') = \mathrm{cov}(a'X, Y'b) = a'\,\mathrm{cov}(X, Y')\,b.
\]
If $X_1 : p \times 1$, $X_2 : p \times 1$, $Y_1 : q \times 1$ and $Y_2 : q \times 1$ are random vectors and $a : p \times 1$ and $b : q \times 1$ are constant vectors, then
\[
\mathrm{cov}(X_1 + X_2 + a, (Y_1 + Y_2 + b)') = \mathrm{cov}(X_1, Y_1') + \mathrm{cov}(X_1, Y_2') + \mathrm{cov}(X_2, Y_1') + \mathrm{cov}(X_2, Y_2').
\]
The correlation matrix of $X$ is given by
\[
\mathrm{cor}(X, X') =
\begin{pmatrix} \rho_{11} & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & \rho_{22} & \cdots & \rho_{2p} \\ \vdots & \vdots & & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & \rho_{pp} \end{pmatrix}
\]
or
\[
(\rho_{ij}) = \left( \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}} \right).
\]
It is worthwhile to recall that the correlation matrix acts as a "standardised" covariance matrix, as entries of the correlation matrix are bounded between $-1$ and $1$.

Suppose that $X : p \times 1$ is a random vector with covariance matrix $\Sigma$. The following relation holds for any $a : p \times 1$:
\[
\mathrm{var}(a'X) = a'\Sigma a \geq 0.
\]
Therefore it follows that $\Sigma$ is always semi-positive definite.

Theorem 6 If $A : p \times p$ is symmetric with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ and normalized eigenvectors $h_1, h_2, \ldots, h_p$, then $A$ can be expressed as
\[
A = H\Lambda H' = \sum_{i=1}^{p} \lambda_i h_i h_i'
\]
where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$ and $H = (h_1, h_2, \ldots, h_p)$ is orthogonal, that is, $H'H = HH' = I_p$. It also follows that
\[
H'AH = \Lambda.
\]

Corollary 2 For the covariance matrix $\Sigma : p \times p$ there exists an orthogonal matrix $H$ such that
\[
H'\Sigma H = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)
\]
where $\lambda_1, \lambda_2, \ldots, \lambda_p$ are the characteristic roots (eigenvalues) of $\Sigma$ and the columns of $H$ are the characteristic vectors of $\Sigma$, $H$ orthogonal. If $H = (h_1, h_2, \ldots, h_p)$, then $\lambda_i = h_i'\Sigma h_i$. For $a : p \times 1$,
\[
a'\Sigma a = a'H\Lambda H'a = b'\Lambda b = \sum_{i=1}^{p} b_i^2\lambda_i
\]
with $b = H'a$. This means that $\Sigma$ is positive definite (or semi-positive definite) if and only if the characteristic roots of $\Sigma$ are positive (or $\geq 0$). This implies that a covariance matrix is always positive definite or semi-positive definite. If $\Sigma$ is non-singular, it is positive definite.

Corollary 3 Let
\[
\Lambda^{\frac{1}{2}} = \mathrm{diag}(\lambda_1^{\frac{1}{2}}, \lambda_2^{\frac{1}{2}}, \ldots, \lambda_p^{\frac{1}{2}}).
\]
It follows that
\[
A = H\Lambda H' = \left(H\Lambda^{\frac{1}{2}}H'\right)\left(H\Lambda^{\frac{1}{2}}H'\right) = A^{\frac{1}{2}}A^{\frac{1}{2}}.
\]
The matrix $A^{\frac{1}{2}} = H\Lambda^{\frac{1}{2}}H'$ is the symmetric square root of $A$.
Theorem 7 If $A$ is any $p \times p$ matrix with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$, then

(i) $|A| = \prod_{i=1}^{p} \lambda_i$

(ii) $\mathrm{tr}(A) = \sum_{i=1}^{p} \lambda_i$.

Example 2 The weight (in mg), $X : 4 \times 1$, of four different species of bumblebees, I, II, III and IV, has expected value and covariance matrix
\[
E(X) = \begin{pmatrix} 5 \\ 6 \\ 5 \\ 7 \end{pmatrix}
\qquad
\mathrm{cov}(X, X') = \Sigma = \begin{pmatrix} 3 & 2 & 2 & 1 \\ 2 & 4 & 2 & 1 \\ 2 & 2 & 4 & 1 \\ 1 & 1 & 1 & 3 \end{pmatrix}.
\]
Let
\[
Y = \begin{pmatrix} X_1 + X_2 \\ X_1 - X_2 \\ X_1 + X_2 + X_3 + X_4 \end{pmatrix}
= \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 1 & 1 & 1 & 1 \end{pmatrix}X.
\]
Note that the vector $Y$ represents (in this case) certain combinations of the four measurements under consideration. For example, the third combination represents the total weight of the four species, which the scientist may be interested in. These combinations are informed by whatever the scientist or researcher is interested in. The expected value and covariance matrix of $Y$ are
\[
E(Y) = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 1 & 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 5 \\ 6 \\ 5 \\ 7 \end{pmatrix} = \begin{pmatrix} 11 \\ -1 \\ 23 \end{pmatrix}
\]
and
\[
\mathrm{cov}(Y, Y') =
\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 1 & 1 & 1 & 1 \end{pmatrix}
\begin{pmatrix} 3 & 2 & 2 & 1 \\ 2 & 4 & 2 & 1 \\ 2 & 2 & 4 & 1 \\ 1 & 1 & 1 & 3 \end{pmatrix}
\begin{pmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}
= \begin{pmatrix} 5 & 6 & 4 & 2 \\ 1 & -2 & 0 & 0 \\ 8 & 9 & 9 & 6 \end{pmatrix}
\begin{pmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}
= \begin{pmatrix} 11 & -1 & 17 \\ -1 & 3 & -1 \\ 17 & -1 & 32 \end{pmatrix}.
\]
Write down next to the steps above which Theorems/Results from earlier in the notes are used to obtain the results. Furthermore,
\[
\mathrm{cor}(Y, Y') =
\begin{pmatrix} \frac{1}{\sqrt{11}} & 0 & 0 \\ 0 & \frac{1}{\sqrt{3}} & 0 \\ 0 & 0 & \frac{1}{\sqrt{32}} \end{pmatrix}
\begin{pmatrix} 11 & -1 & 17 \\ -1 & 3 & -1 \\ 17 & -1 & 32 \end{pmatrix}
\begin{pmatrix} \frac{1}{\sqrt{11}} & 0 & 0 \\ 0 & \frac{1}{\sqrt{3}} & 0 \\ 0 & 0 & \frac{1}{\sqrt{32}} \end{pmatrix}
= \begin{pmatrix} 1 & \frac{-1}{\sqrt{(11)(3)}} & \frac{17}{\sqrt{(11)(32)}} \\ \frac{-1}{\sqrt{(3)(11)}} & 1 & \frac{-1}{\sqrt{(3)(32)}} \\ \frac{17}{\sqrt{(32)(11)}} & \frac{-1}{\sqrt{(32)(3)}} & 1 \end{pmatrix}.
\]
The eigenvalues and eigenvectors of $\mathrm{cov}(X, X') = \Sigma$ as well as the symmetric square root are calculated in the following SAS IML program and given in the output. The calculation of the eigenvalues and eigenvectors is needed in principal component analysis, which we cover later in the course as well.

proc iml;
reset nolog;
EX={5,6,5,7};
Sig={3 2 2 1,
     2 4 2 1,
     2 2 4 1,
     1 1 1 3};
A={1 1 0 0,
   1 -1 0 0,
   1 1 1 1};
EY=A*EX;
Phi=A*Sig*A`;
print EY Phi;
cor_y=inv(sqrt(diag(Phi)))*Phi*inv(sqrt(diag(Phi)));
print cor_y [format=5.2];

call eigen(lambda,H,Sig);
SigHalf=H*diag(lambda##0.5)*H`;
test=SigHalf*SigHalf;
print lambda [format=5.2] H [format=5.2];
print SigHalf [format=5.2];
print test;

Output:

EY    Phi
11    11  -1  17
-1    -1   3  -1
23    17  -1  32

cor_y
 1.00 -0.17  0.91
-0.17  1.00 -0.10
 0.91 -0.10  1.00

lambda    H
 8.27     0.49 -0.07  0.00  0.87
 2.45     0.57 -0.23  0.71 -0.35
 2.00     0.57 -0.23 -0.71 -0.35
 1.29     0.31  0.94  0.00 -0.10

SigHalf
 1.56  0.50  0.50  0.24
 0.50  1.87  0.45  0.22
 0.50  0.45  1.87  0.22
 0.24  0.22  0.22  1.69

test
 3 2 2 1
 2 4 2 1
 2 2 4 1
 1 1 1 3

Example 3 If $\mathrm{cov}(X, X') = \Sigma : p \times p = (\sigma_{ij})$ and $a : p \times 1$ and $b : p \times 1$ are constant, then
\[
\mathrm{var}(a'X) = \sum_{i=1}^{p}\sum_{j=1}^{p} a_i a_j \sigma_{ij} = \sum_{i=1}^{p} a_i^2\sigma_{ii} + 2\sum_{i<j} a_i a_j \sigma_{ij}
\]
and
\[
\mathrm{cov}(a'X, b'X) = \sum_{i=1}^{p}\sum_{j=1}^{p} a_i b_j \sigma_{ij}.
\]
(See Bain and Engelhardt, Chapter 5)

Remark 2 You need to be able to use these results to calculate variances and covariances when confronted with linear combinations with coefficient vectors $a$ and $b$.
2.1.2 Sample Mean, Sample Covariance and Sample Correlation Matrices

Suppose $X_1, X_2, \ldots, X_N$, $X_i : p \times 1$ and $N > p$, is a random sample of $N$ vector observations from a population with expected value $E(X_i) = \mu$ and covariance matrix $\Sigma = \mathrm{cov}(X_i, X_i')$. The observed values can be written as a data matrix
\[
X : N \times p =
\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix}
= \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N' \end{pmatrix}
\]
\[
X' : p \times N =
\begin{pmatrix} x_{11} & x_{21} & \cdots & x_{N1} \\ x_{12} & x_{22} & \cdots & x_{N2} \\ \vdots & \vdots & & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{Np} \end{pmatrix}
= (x_1, x_2, \ldots, x_N).
\]
The sample mean is
\[
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{1}{N}X'1_N =
\begin{pmatrix} \frac{1}{N}\sum_{i=1}^{N} x_{i1} \\ \frac{1}{N}\sum_{i=1}^{N} x_{i2} \\ \vdots \\ \frac{1}{N}\sum_{i=1}^{N} x_{ip} \end{pmatrix}
= \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix}
\]
and the sample covariance matrix is $S$ where
\[
(N-1)S = (N-1)
\begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{pmatrix}
=
\begin{pmatrix}
\sum_{j=1}^{N}(x_{j1}-\bar{x}_1)^2 & \sum_{j=1}^{N}(x_{j1}-\bar{x}_1)(x_{j2}-\bar{x}_2) & \cdots & \sum_{j=1}^{N}(x_{j1}-\bar{x}_1)(x_{jp}-\bar{x}_p) \\
\sum_{j=1}^{N}(x_{j2}-\bar{x}_2)(x_{j1}-\bar{x}_1) & \sum_{j=1}^{N}(x_{j2}-\bar{x}_2)^2 & \cdots & \sum_{j=1}^{N}(x_{j2}-\bar{x}_2)(x_{jp}-\bar{x}_p) \\
\vdots & \vdots & & \vdots \\
\sum_{j=1}^{N}(x_{jp}-\bar{x}_p)(x_{j1}-\bar{x}_1) & \sum_{j=1}^{N}(x_{jp}-\bar{x}_p)(x_{j2}-\bar{x}_2) & \cdots & \sum_{j=1}^{N}(x_{jp}-\bar{x}_p)^2
\end{pmatrix}
\]
\[
= \sum_{j=1}^{N}(x_j - \bar{x})(x_j - \bar{x})'
= (X - 1_N\bar{x}')'(X - 1_N\bar{x}')
= X'\left( I_N - \tfrac{1}{N}1_N1_N' \right)X.
\]

Remark 3 Take some time to write out the last two lines in the derivation above for a special case (for example $N = 3$, $p = 2$) in order to observe the mathematical construction and correctness of the equality.

The sample correlation matrix is
\[
R = D^{-\frac{1}{2}}SD^{-\frac{1}{2}}
\]
where
\[
D^{-\frac{1}{2}} =
\begin{pmatrix} \frac{1}{\sqrt{s_{11}}} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{s_{22}}} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sqrt{s_{pp}}} \end{pmatrix}.
\]
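The sample quantities above translate directly into PROC IML. The small $4 \times 2$ data matrix below is invented purely for illustration; the formulas, not the numbers, are the point:

proc iml;
/* hypothetical data matrix X : N x p with N = 4, p = 2 */
X = {3 5, 4 7, 6 6, 7 10};
N = nrow(X);
one = j(N, 1, 1);
xbar = (1/N)*X`*one;                 /* sample mean (p x 1)          */
A = X`*(i(N) - (1/N)*one*one`)*X;    /* X'(I - (1/N)1 1')X           */
S = A/(N-1);                         /* sample covariance matrix S   */
D = inv(sqrt(diag(S)));              /* D^(-1/2)                     */
R = D*S*D;                           /* sample correlation matrix R  */
print xbar S R;
quit;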

2.1.3 The Multivariate Change of Variable Technique

Let $x_j = g_j(y_1, \ldots, y_p)$, $j = 1, \ldots, p$, define a set of one-to-one transformations from a vector of random variables $x = (x_1, \ldots, x_p)'$ to a vector of random variables $y = (y_1, \ldots, y_p)'$. If $f(x)$ denotes the joint probability density function of the random vector $x$, then the joint probability density function of the random vector $y$ is given by
\[
h(y) = f(g_1(y), \ldots, g_p(y))\,|J(x \to y)|
\]
where $J(x \to y) = \left(\frac{\partial x_i}{\partial y_j}\right)$, i.e.
\[
J(x \to y) = \frac{\partial x}{\partial y} =
\begin{pmatrix}
\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_p} \\
\frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \cdots & \frac{\partial x_2}{\partial y_p} \\
\vdots & \vdots & & \vdots \\
\frac{\partial x_p}{\partial y_1} & \frac{\partial x_p}{\partial y_2} & \cdots & \frac{\partial x_p}{\partial y_p}
\end{pmatrix}
\]
denotes the Jacobian matrix of the transformation from $x$ to $y$ and $|\cdot|$ is the absolute value of the determinant of the matrix. Furthermore, $|J(x \to y)|$ is known as the Jacobian of the transformation.

Remark 4 This is similar to work completed in previous years of your studies, see in particular Chapter 4 of Bain and Engelhardt (WST 211).
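For a linear transformation the Jacobian is particularly simple; this short added illustration is the case used repeatedly in Section 2.2:
\[
x = By,\ B \text{ nonsingular} \quad\Longrightarrow\quad J(x \to y) = \frac{\partial x}{\partial y} = B, \qquad h(y) = f(By)\,|\det B|.
\]
For example, with $x_1 = y_1$, $x_2 = y_2 - y_1$ (as in Exercise 1 below), $B = \begin{pmatrix} 1 & 0 \\ -1 & 1 \end{pmatrix}$ and $|\det B| = 1$.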
2.1.4 Moment Generating Functions

Suppose that $X : p \times 1$ is a random vector and that the joint density function of the elements of $X$ is $f_X(x) = f_{X_1, X_2, \ldots, X_p}(x_1, x_2, \ldots, x_p)$. The moment generating function of $X$ is
\[
M_X(t) = E\left(e^{t'X}\right) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{t'x}f_X(x)\,dx = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{t_1x_1 + t_2x_2 + \cdots + t_px_p}f_X(x)\,dx.
\]
The moment generating function of $X_i$ is given by $M_X(0, 0, \ldots, t_i, 0, \ldots, 0)$. This follows since
\[
M_X(0, 0, \ldots, t_i, 0, \ldots, 0) = E\left(e^{0X_1 + 0X_2 + \cdots + t_iX_i + \cdots + 0X_p}\right) = E\left(e^{t_iX_i}\right) = M_{X_i}(t_i).
\]

Theorem 8 Suppose that $M_X(t)$ is the moment generating function of $X$. Then $X_1, X_2, \ldots, X_p$ are mutually independent if and only if
\[
M_X(t) = M_X(t_1, 0, \ldots, 0)M_X(0, t_2, \ldots, 0)\cdots M_X(0, 0, \ldots, t_p) = \prod_{i=1}^{p} M_{X_i}(t_i).
\]

Proof. If $X_1, \ldots, X_p$ are mutually independent, then
\[
\begin{aligned}
M_X(t) &= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{t_1x_1 + t_2x_2 + \cdots + t_px_p}f_X(x)\,dx \\
&= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{t_1x_1}e^{t_2x_2}\cdots e^{t_px_p}f_{X_1}(x_1)f_{X_2}(x_2)\cdots f_{X_p}(x_p)\,dx_1\cdots dx_p \\
&= \int_{-\infty}^{\infty} e^{t_1x_1}f_{X_1}(x_1)\,dx_1 \int_{-\infty}^{\infty} e^{t_2x_2}f_{X_2}(x_2)\,dx_2 \cdots \int_{-\infty}^{\infty} e^{t_px_p}f_{X_p}(x_p)\,dx_p \\
&= \prod_{i=1}^{p} M_{X_i}(t_i) \\
&= M_X(t_1, 0, \ldots, 0)M_X(0, t_2, \ldots, 0)\cdots M_X(0, 0, \ldots, t_p).
\end{aligned}
\]
Conversely, suppose that
\[
M_X(t) = M_X(t_1, 0, \ldots, 0)M_X(0, t_2, \ldots, 0)\cdots M_X(0, 0, \ldots, t_p).
\]
Then
\[
\begin{aligned}
M_X(t) &= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{t_1x_1 + t_2x_2 + \cdots + t_px_p}f_X(x)\,dx \\
&= M_X(t_1, 0, \ldots, 0)M_X(0, t_2, \ldots, 0)\cdots M_X(0, 0, \ldots, t_p) \\
&= \prod_{i=1}^{p} M_{X_i}(t_i) = \prod_{i=1}^{p}\int_{-\infty}^{\infty} e^{t_ix_i}f_{X_i}(x_i)\,dx_i \\
&= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{t_1x_1}\cdots e^{t_px_p}f_{X_1}(x_1)f_{X_2}(x_2)\cdots f_{X_p}(x_p)\,dx_1\cdots dx_p.
\end{aligned}
\]
Since a moment generating function, if it exists, uniquely determines a distribution, it follows that
\[
f_X(x) = f_{X_1}(x_1)f_{X_2}(x_2)\cdots f_{X_p}(x_p)
\]
and therefore $X_1, X_2, \ldots, X_p$ are mutually independent.

Theorem 9 Let $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$. Then $X_1$ and $X_2$ are independent if and only if $M_X(t) = M_{X_1}(t_1)M_{X_2}(t_2)$.

Proof. Same as previous theorem.

Exercise 1 In this exercise we want to determine the moment generating function of a new (vector) variable $Y$, which is made up of some combinations of the variable $X$. Let $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ where $X_1$ and $X_2$ are independent and exponential, $X_i \sim EXP(1)$. The pdf of $X$ is
\[
f_X(x) = \begin{cases} e^{-x_1 - x_2} & x_1 > 0,\ x_2 > 0 \\ 0 & \text{elsewhere.} \end{cases}
\]
The moment generating function of $X$ is
\[
M_X(t) = M_{X_1}(t_1)M_{X_2}(t_2) = \frac{1}{(1-t_1)(1-t_2)}.
\]
Let $Y_1 = X_1$ and $Y_2 = X_1 + X_2$. Then $x_1 = y_1$ and $x_2 = y_2 - y_1$. The Jacobian of the transformation is
\[
|J| = \left|\frac{\partial x}{\partial y}\right| = \begin{vmatrix} 1 & 0 \\ -1 & 1 \end{vmatrix} = 1.
\]
The pdf of $Y$ is
\[
f_Y(y) = \begin{cases} e^{-y_2} & 0 < y_1 < y_2 < \infty \\ 0 & \text{elsewhere.} \end{cases}
\]
The moment generating function of $Y$ is
\[
\begin{aligned}
M_Y(t) &= E\left(e^{t'Y}\right) \\
&= \int_{0}^{\infty}\int_{0}^{y_2} e^{t_1y_1 + t_2y_2}e^{-y_2}\,dy_1\,dy_2 \\
&= \int_{0}^{\infty} e^{-y_2(1-t_2)}\left[\tfrac{1}{t_1}e^{y_1t_1}\right]_{0}^{y_2}dy_2 \\
&= \tfrac{1}{t_1}\int_{0}^{\infty} e^{-y_2(1-t_2)}\left[e^{y_2t_1} - 1\right]dy_2 \\
&= \tfrac{1}{t_1}\left[-\tfrac{1}{1-t_1-t_2}e^{-y_2(1-t_1-t_2)} + \tfrac{1}{1-t_2}e^{-y_2(1-t_2)}\right]_{0}^{\infty} \\
&= \tfrac{1}{t_1}\left[\tfrac{1}{1-t_1-t_2} - \tfrac{1}{1-t_2}\right] \\
&= \frac{1}{(1-t_2)(1-t_1-t_2)}.
\end{aligned}
\]
An alternative approach: The moment generating function of $Y$ may also be determined directly:
\[
\begin{aligned}
M_Y(t) &= E\left(e^{t'Y}\right) = E\left(e^{t_1Y_1 + t_2Y_2}\right) = E\left(e^{t_1X_1 + t_2(X_1+X_2)}\right) \\
&= E\left(e^{(t_1+t_2)X_1 + t_2X_2}\right) = M_X(t_1 + t_2, t_2) = \frac{1}{(1-t_1-t_2)(1-t_2)}.
\end{aligned}
\]
The marginal moment generating functions of $Y_1$ and $Y_2$ are
\[
M_{Y_1}(t_1) = M_Y(t_1, 0) = \frac{1}{1-t_1}, \qquad M_{Y_2}(t_2) = M_Y(0, t_2) = \frac{1}{(1-t_2)^2}.
\]
This shows that $M_Y(t) \neq M_{Y_1}(t_1)M_{Y_2}(t_2)$, demonstrating the dependence between $Y_1$ and $Y_2$.

Example 4 In the univariate case the moment generating function of the random variable $Z$ can be written as
\[
M_Z(t) = 1 + E(Z)t + E(Z^2)\frac{t^2}{2!} + E(Z^3)\frac{t^3}{3!} + \cdots = \sum_{r=0}^{\infty} E(Z^r)\frac{t^r}{r!}.
\]
If $M_Z(t)$ is expanded in a Taylor series around zero, then the coefficient of $\frac{t^r}{r!}$ is equal to $E(Z^r)$.
Also, $\left.\frac{\partial^r}{\partial t^r}M_Z(t)\right|_{t=0} = E(Z^r)$.

In the bivariate case, suppose $Z_1$ and $Z_2$ are jointly distributed random variables with joint pdf $f_Z(z)$ and moment generating function $M_Z(t)$. Then
\[
M_Z(t) = \sum_{r=0}^{\infty}\sum_{s=0}^{\infty} E(Z_1^rZ_2^s)\frac{t_1^rt_2^s}{r!s!}
\]
and
\[
\left.\frac{\partial^{r+s}}{\partial t_1^r\partial t_2^s}M_{Z_1,Z_2}(t_1, t_2)\right|_{t_1=t_2=0} = E(Z_1^rZ_2^s);
\]
the coefficient of $\frac{t_1^rt_2^s}{r!s!}$ is equal to $E(Z_1^rZ_2^s)$.

The moments of $Y$ around the origin and around the mean can be determined from $M_Y(t)$. The derivatives of $M_Y(t)$ with respect to $t$ may be obtained by expanding $M_Y(t)$ in a power series in $t_1$ and $t_2$:
\[
\begin{aligned}
M_Y(t) &= \frac{1}{(1-t_1-t_2)(1-t_2)} \\
&= \left[1 + (t_1+t_2) + (t_1+t_2)^2 + (t_1+t_2)^3 + \cdots\right]\left[1 + t_2 + t_2^2 + t_2^3 + \cdots\right] \\
&= 1 + t_1 + 2t_2 + t_1^2 + 2t_1t_2 + t_2^2 + t_1t_2 + t_2^2 + t_2^2 + \cdots \\
&= 1 + t_1 + 2t_2 + t_1^2 + 3t_1t_2 + 3t_2^2 + \cdots
\end{aligned}
\]
From this it follows that
\[
E(Y_1) = 1, \quad E(Y_2) = 2, \quad E(Y_1Y_2) = 3, \quad E(Y_1^2) = 2, \quad E(Y_2^2) = 6,
\]
\[
\mathrm{cov}(Y_1, Y_2) = 1, \quad \mathrm{var}(Y_1) = 1, \quad \mathrm{var}(Y_2) = 2.
\]
The moments of $Y$ can also be determined directly by using the moments of $X$ and the results in the previous section. Since
\[
E(X) = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad \mathrm{cov}(X, X') = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I_2
\]
and
\[
Y = \begin{pmatrix} X_1 \\ X_1 + X_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}X
\]
it follows that
\[
E(Y) = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}E(X) = \begin{pmatrix} 1 \\ 2 \end{pmatrix}
\]
and
\[
\mathrm{cov}(Y, Y') = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}.
\]

Example 5 Redo the example above, but where $X_i \sim EXP(i)$.
2.2 The Multivariate Normal Distribution
The multivariate normal distribution forms the basis of much of multivariate analysis due to its computationally
attractive features (it’s relatively easy to implement in code), and its intuitive analogy from the univariate normal
environment. We study the multivariate normal distribution and its properties to gain insight into the "common"
behaviour of data in a multivariate setting, but note that there are many other multivariate distributions which are
also often considered in practice.

2.2.1 Classical Definition of the Multivariate Normal Distribution

The random vector $X : p \times 1$ has the multivariate normal distribution if the density function of $X$ is given by
\[
f_X(x) = ke^{-\frac{1}{2}(x-b)'A(x-b)}, \qquad -\infty < x_i < \infty,
\]
where $b$ is a constant vector and $A > 0$.

Theorem 10 If the density of $X$ is given by the definition above, then $k = |A|^{\frac{1}{2}}/(2\pi)^{\frac{p}{2}}$, the expected value of $X$ is $b$ and the covariance matrix of $X$ is $A^{-1}$. Conversely, given a vector $\mu$ and a positive definite matrix $\Sigma$, there exists a multivariate normal density
\[
\frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu)'\Sigma^{-\frac{1}{2}}\Sigma^{-\frac{1}{2}}(x-\mu)}
\]
such that the expected value of the vector with this density is $\mu$ and the covariance matrix is $\Sigma$.

Proof. Let $Y = A^{\frac{1}{2}}(X - b)$, i.e. $X = A^{-\frac{1}{2}}Y + b$. Then
\[
|J(x \to y)| = |A^{-\frac{1}{2}}| = |A|^{-\frac{1}{2}}.
\]
The density function of $Y$ is
\[
f_Y(y) = k|A|^{-\frac{1}{2}}e^{-\frac{1}{2}y'y}.
\]
Since the density must integrate to one,
\[
|A|^{\frac{1}{2}}/k = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{-\frac{1}{2}y'y}\,dy = \prod_{i=1}^{p}\int_{-\infty}^{\infty} e^{-\frac{1}{2}y_i^2}\,dy_i = (2\pi)^{\frac{p}{2}}
\]
and
\[
k = |A|^{\frac{1}{2}}/(2\pi)^{\frac{p}{2}}.
\]
Also,
\[
\begin{aligned}
I_p &= \mathrm{cov}(Y, Y') \\
&= \mathrm{cov}\left(A^{\frac{1}{2}}(X-b), \{A^{\frac{1}{2}}(X-b)\}'\right) \\
&= \mathrm{cov}\left(A^{\frac{1}{2}}X, \{A^{\frac{1}{2}}X\}'\right) \\
&= A^{\frac{1}{2}}\mathrm{cov}(X, X')A^{\frac{1}{2}}
\end{aligned}
\]
so that
\[
\mathrm{cov}(X, X') = A^{-\frac{1}{2}}A^{-\frac{1}{2}} = A^{-1}.
\]
Further, $E(Y) = 0 = A^{\frac{1}{2}}E(X) - A^{\frac{1}{2}}b$, so that
\[
\mu = E(X) = b.
\]
The density function of $X$ is
\[
f_X(x) = \frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)}
\]
and it is denoted as $X : p \times 1 \sim N_p(\mu, \Sigma)$.
Exercise 2 Let p = 1. Write down the corresponding density from the multivariate normal density function. What
does p = 1 mean in this case?

Example 6 For $p = 2$, the bivariate normal distribution, it follows that
\[
\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}
\]
and
\[
|\Sigma| = \sigma_1^2\sigma_2^2 - \sigma_{12}^2 = \sigma_1^2\sigma_2^2(1-\rho^2).
\]
Since the inverse of a non-singular $2 \times 2$ matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ is given by
\[
A^{-1} = \frac{1}{|A|}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}
\]
it follows that
\[
\Sigma^{-1} = \frac{1}{\sigma_1^2\sigma_2^2(1-\rho^2)}\begin{pmatrix} \sigma_2^2 & -\sigma_{12} \\ -\sigma_{12} & \sigma_1^2 \end{pmatrix}
= \frac{1}{1-\rho^2}\begin{pmatrix} \frac{1}{\sigma_1^2} & \frac{-\rho}{\sigma_1\sigma_2} \\ \frac{-\rho}{\sigma_1\sigma_2} & \frac{1}{\sigma_2^2} \end{pmatrix}.
\]
The density function of $X$ is
\[
f_X(x) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2\right]\right\}.
\]

Example 7 If $Y : p \times 1$ has a $N_p(\mu, \Sigma)$ distribution with $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2)$, then the density function of $Y$ is
\[
f_Y(y) = \frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left\{-\tfrac{1}{2}(y-\mu)'\Sigma^{-1}(y-\mu)\right\}
= \prod_{i=1}^{p}\frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\left\{-\tfrac{1}{2}\left(\frac{y_i-\mu_i}{\sigma_i}\right)^2\right\}
= \prod_{i=1}^{p} f_{Y_i}(y_i).
\]
This implies that the elements of $Y$ are independently $N(\mu_i, \sigma_i^2)$ distributed.

Example 8 If $X : p \times 1$ has a $N_p(\mu, \Sigma)$ distribution with
\[
X = \begin{pmatrix} X_1 : p_1 \times 1 \\ X_2 : p_2 \times 1 \end{pmatrix}, \qquad
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}
\]
then $X_1$ and $X_2$ are independently $N_{p_i}(\mu_i, \Sigma_{ii})$ distributed, since
\[
|\Sigma| = \begin{vmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{vmatrix} = |\Sigma_{11}|\,|\Sigma_{22}|
\]
and
\[
\begin{aligned}
f_X(x) &= \frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)} \\
&= \frac{1}{(2\pi)^{\frac{p_1}{2}}|\Sigma_{11}|^{\frac{1}{2}}}e^{-\frac{1}{2}(x_1-\mu_1)'\Sigma_{11}^{-1}(x_1-\mu_1)}
\cdot\frac{1}{(2\pi)^{\frac{p_2}{2}}|\Sigma_{22}|^{\frac{1}{2}}}e^{-\frac{1}{2}(x_2-\mu_2)'\Sigma_{22}^{-1}(x_2-\mu_2)} \\
&= f_{X_1}(x_1)f_{X_2}(x_2).
\end{aligned}
\]

2.2.2 Standard Multivariate Normal Distribution

Consider the random vector $Z : p \times 1$ where the elements of $Z$, the $Z_i$'s, $i = 1, 2, \ldots, p$, are independent and $Z_i \sim N(0, 1)$, i.e. each has a standard normal distribution with expected value 0 and variance 1. The density function of $Z$ is
\[
f_Z(z) = \prod_{i=1}^{p}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}z_i^2}
= \frac{1}{(2\pi)^{\frac{p}{2}}}\exp\left\{-\tfrac{1}{2}\sum_{i=1}^{p}z_i^2\right\}
= \frac{1}{(2\pi)^{\frac{p}{2}}}e^{-\frac{1}{2}z'z}, \qquad -\infty < z_i < \infty\ \forall\ i.
\]
We say that $Z : p \times 1 \sim N_p(0, I_p)$ where
\[
E(Z) = 0 \quad \text{and} \quad \mathrm{cov}(Z, Z') = I_p.
\]
Let $\Sigma : p \times p$ be any non-singular covariance matrix and $\mu : p \times 1$ be any constant vector and consider the transformation
\[
X = \Sigma^{\frac{1}{2}}Z + \mu \qquad\Longleftrightarrow\qquad Z = \Sigma^{-\frac{1}{2}}(X - \mu).
\]
The Jacobian of the transformation is $|J(Z \to X)| = |\Sigma^{-\frac{1}{2}}| = |\Sigma|^{-\frac{1}{2}}$ and the density function of $X = \Sigma^{\frac{1}{2}}Z + \mu$ is
\[
\begin{aligned}
f_X(x) &= \frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left\{-\tfrac{1}{2}(x-\mu)'\Sigma^{-\frac{1}{2}}\Sigma^{-\frac{1}{2}}(x-\mu)\right\} \\
&= |2\pi\Sigma|^{-\frac{1}{2}}\exp\left\{-\tfrac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)\right\}.
\end{aligned}
\]
The last step follows from the fact that $|2\pi I_p| = (2\pi)^p$ and $\Sigma^{-\frac{1}{2}}\Sigma^{-\frac{1}{2}} = \Sigma^{-1}$. We say that $X : p \times 1 \sim N_p(\mu, \Sigma)$ where
\[
E(X) = \mu \quad \text{and} \quad \mathrm{cov}(X, X') = \Sigma.
\]

Remark 5 Consider $X = \Sigma^{\frac{1}{2}}Z + \mu$ in particular. What is its equivalence when $p = 1$?

Remark 6 For $A = \begin{pmatrix} a & c \\ c & b \end{pmatrix}$: $2^2|A| = 4ab - 4c^2$ and $|2A| = \begin{vmatrix} 2a & 2c \\ 2c & 2b \end{vmatrix} = 4ab - 4c^2$, i.e. $|2A| = 2^2|A|$ (the $p = 2$ case of $|cA| = c^p|A|$, which explains why $|2\pi\Sigma| = (2\pi)^p|\Sigma|$).
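The construction $X = \Sigma^{\frac{1}{2}}Z + \mu$ translates directly into a simulation recipe. The sketch below is an added illustration (it reuses the covariance matrix of Example 2; the mean vector and sample size are arbitrary choices):

proc iml;
call randseed(311);
mu  = {5, 6, 5, 7};
Sig = {3 2 2 1, 2 4 2 1, 2 2 4 1, 1 1 1 3};
p = nrow(Sig);  n = 10000;
call eigen(lambda, H, Sig);
SigHalf = H*diag(sqrt(lambda))*H`;       /* symmetric square root of Sigma  */
Z = j(n, p, .);
call randgen(Z, 'Normal');               /* rows of Z are N_p(0, I_p)       */
X = Z*SigHalf + j(n, 1, 1)*mu`;          /* each row is Sigma^(1/2)z + mu   */
xbar = X[:,];                            /* sample mean, close to mu'       */
S = (X - j(n,1,1)*xbar)`*(X - j(n,1,1)*xbar)/(n-1);  /* close to Sigma      */
print xbar, S;
quit;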

2.2.3 The Moment Generating Function of the Multivariate Normal Distribution

Theorem 11 Suppose that $X : p \times 1$ has the $N_p(\mu, \Sigma)$ distribution. The moment generating function of $X$ is given by
\[
M_X(t) = e^{t'\mu + \frac{1}{2}t'\Sigma t}.
\]

Proof. For $Z \sim N_p(0, I_p)$ we have
\[
M_Z(t) = E\left(e^{Z't}\right) = \prod_{i=1}^{p} M_{Z_i}(t_i) = \prod_{i=1}^{p} e^{\frac{1}{2}t_i^2} = e^{\frac{1}{2}t't}.
\]
Since $X = \Sigma^{\frac{1}{2}}Z + \mu$, we have
\[
\begin{aligned}
M_X(t) &= E\left(e^{X't}\right) = E\left(e^{(\Sigma^{\frac{1}{2}}Z + \mu)'t}\right) \\
&= e^{\mu't}E\left(e^{(\Sigma^{\frac{1}{2}}Z)'t}\right) = e^{\mu't}E\left(e^{Z'\Sigma^{\frac{1}{2}}t}\right) \\
&= e^{\mu't}M_Z(\Sigma^{\frac{1}{2}}t) = e^{t'\mu + \frac{1}{2}t'\Sigma t}.
\end{aligned}
\]

Theorem 12 Suppose that $X : p \times 1$ has a $N_p(\mu, \Sigma)$ distribution and let the rank of $D : q \times p$ be $q$ ($q \leq p$). Then $Y = DX$ has a $N_q(D\mu, D\Sigma D')$ distribution.

Proof.
\[
M_Y(t) = E\left(e^{Y't}\right) = E\left(e^{X'D't}\right) = M_X(D't) = e^{t'D\mu + \frac{1}{2}t'D\Sigma D't}
\]
which is the moment generating function of a $N_q(D\mu, D\Sigma D')$ distribution.

Theorem 13 Suppose that $X : p \times 1$ has a $N_p(\mu, \Sigma)$ distribution and let
\[
X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.
\]
Then $X_1 : q \times 1$ has a $N_q(\mu_1, \Sigma_{11})$ distribution.

Proof. Let $D = (I_q\ \vdots\ 0)$ in the previous Theorem.

Theorem 14 Suppose that $X : p \times 1$ has a $N_p(\mu, \Sigma)$ distribution and let
\[
X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.
\]
Then $X_1$ and $X_2$ are independent if and only if $\Sigma_{12} = 0$.
Proof. If $X_1$ and $X_2$ are independent, then
\[
\begin{aligned}
\mathrm{cov}(X_1, X_2') &= E\left\{[X_1 - E(X_1)][X_2 - E(X_2)]'\right\} \\
&= E\left\{[X_1 - E(X_1)][X_2' - E(X_2')]\right\} \\
&= E(X_1X_2') - E(X_1)E(X_2') - E(X_1)E(X_2') + E(X_1)E(X_2') \\
&= E(X_1)E(X_2') - E(X_1)E(X_2') \\
&= 0,
\end{aligned}
\]
where the second last step uses $E(X_1X_2') = E(X_1)E(X_2')$, which holds by independence.

Conversely, if $\Sigma_{12} = 0$ it follows that
\[
f_X(x) = f_{X_1}(x_1)f_{X_2}(x_2).
\]
Refer to Example 8 where $\Sigma_{12} = 0$.

Corollary 4 Independent normally distributed vectors can be combined such that the combined vector is also normally distributed.
If $X_i : p_i \times 1$, $i = 1, 2, \ldots, n$, are independent $N_{p_i}(\mu_i, \Sigma_i)$ and
\[
X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix},
\]
then $X : \sum_{i=1}^{n} p_i \times 1$ has a $N_{\sum p_i}(\mu, \Sigma)$ distribution where
\[
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{pmatrix}
\quad\text{and}\quad
\Sigma = \begin{pmatrix} \Sigma_1 & 0 & \cdots & 0 \\ 0 & \Sigma_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \Sigma_n \end{pmatrix}.
\]
Note that the zero matrices as well as the $\Sigma_i$'s may be of different order.

Corollary 5 If $X_i : p \times 1$, $i = 1, 2, \ldots, n$, are independently and identically $N_p(\mu, \Sigma)$ distributed, then
\[
X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix}
\]
has a $N_{np}$ distribution with mean vector and covariance matrix
\[
\begin{pmatrix} \mu \\ \mu \\ \vdots \\ \mu \end{pmatrix}
\quad\text{and}\quad
\begin{pmatrix} \Sigma & 0 & \cdots & 0 \\ 0 & \Sigma & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \Sigma \end{pmatrix}.
\]

The distribution of the sum of the random vectors,
\[
Y = \sum_{i=1}^{n} X_i = (I_p\ I_p\ \cdots\ I_p)X = (I_p\ I_p\ \cdots\ I_p)\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix},
\]
is the multivariate normal distribution, $N_p(n\mu, n\Sigma)$, since $Y = DX$ with $D = (I_p, I_p, \ldots, I_p)$ and $D : p \times np$. From Theorem 12, $Y \sim N_p(D\,E(X), D\,\mathrm{cov}(X, X')\,D')$ with
\[
D\,E(X) = n\mu \qquad D\,\mathrm{cov}(X, X')\,D' = n\Sigma.
\]

Corollary 6 Let $X : p \times 1$ with $E(X) = \mu$ and $\mathrm{cov}(X, X') = \Sigma$. If all linear combinations of $X$ are normally distributed, then $X$ also has a multivariate normal distribution.
Consider the linear combination $Y = d'X = d_1X_1 + \cdots + d_pX_p$. Then $E(Y) = E(d'X) = d'\mu$ and $\mathrm{var}(d'X) = d'\,\mathrm{cov}(X, X')\,d = d'\Sigma d$. Since $Y \sim N(d'\mu, d'\Sigma d)$,
\[
M_Y(t) = e^{d'\mu t + \frac{1}{2}d'\Sigma d\,t^2} = E\left(e^{d'Xt}\right) = E\left(e^{X'dt}\right) = M_X(dt).
\]
From this it follows that $X \sim N_p(\mu, \Sigma)$.

Corollary 7 If $X : p \times 1$ has the $N_p(\mu, \Sigma)$ distribution, it follows that the elements of $X$, namely $X_1, X_2, \ldots, X_p$, are mutually independent if and only if $\Sigma$ is a diagonal matrix.

Corollary 8 It also follows that $Y_1 = d_1'X$ and $Y_2 = d_2'X$ are independent if and only if $d_1'\Sigma d_2 = 0$.

Corollary 9 If $X : p \times 1$ has a multivariate $N_p(\mu, \Sigma)$ distribution, then $Z = \Sigma^{-\frac{1}{2}}(X - \mu) \sim N_p(0, I_p)$. This means that the elements of $Z$ are independently $N(0, 1)$ distributed. It then follows that
\[
Z'Z = (X - \mu)'\Sigma^{-1}(X - \mu) = \sum_{i=1}^{p} Z_i^2 \sim \chi^2(p).
\]
Example 9 Suppose that the random vectors $X_1$, $X_2$ and $X_3$ are independently $N_p(\mu, \Sigma)$ distributed.

a. Specify the joint distribution of $X_1$, $X_2$ and $X_3$.
\[
X = \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \sim N_{3p}\left(
\begin{pmatrix} \mu \\ \mu \\ \mu \end{pmatrix},
\begin{pmatrix} \Sigma & 0 & 0 \\ 0 & \Sigma & 0 \\ 0 & 0 & \Sigma \end{pmatrix}\right).
\]

b. Derive the joint distribution of $Y_1 = X_1 - X_2$, $Y_2 = X_1 + X_2 - 2X_3$ and $Y_3 = X_1 + X_2 + X_3$.
\[
Y = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix}
= \begin{pmatrix} X_1 - X_2 \\ X_1 + X_2 - 2X_3 \\ X_1 + X_2 + X_3 \end{pmatrix}
= \begin{pmatrix} I_p & -I_p & 0 \\ I_p & I_p & -2I_p \\ I_p & I_p & I_p \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} = DX.
\]
Hence $Y \sim N_{3p}(D\,E(X), D\,\mathrm{cov}(X, X')\,D')$ with
\[
D\,E(X) = D\begin{pmatrix} \mu \\ \mu \\ \mu \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 3\mu \end{pmatrix}
\]
\[
D\,\mathrm{cov}(X, X')\,D' =
\begin{pmatrix} I_p & -I_p & 0 \\ I_p & I_p & -2I_p \\ I_p & I_p & I_p \end{pmatrix}
\begin{pmatrix} \Sigma & 0 & 0 \\ 0 & \Sigma & 0 \\ 0 & 0 & \Sigma \end{pmatrix}
\begin{pmatrix} I_p & I_p & I_p \\ -I_p & I_p & I_p \\ 0 & -2I_p & I_p \end{pmatrix}
= \begin{pmatrix} \Sigma & -\Sigma & 0 \\ \Sigma & \Sigma & -2\Sigma \\ \Sigma & \Sigma & \Sigma \end{pmatrix}
\begin{pmatrix} I_p & I_p & I_p \\ -I_p & I_p & I_p \\ 0 & -2I_p & I_p \end{pmatrix}
= \begin{pmatrix} 2\Sigma & 0 & 0 \\ 0 & 6\Sigma & 0 \\ 0 & 0 & 3\Sigma \end{pmatrix}.
\]

c. Are $Y_1$, $Y_2$ and $Y_3$ mutually independent?

Yes, since the covariance matrices between the different $Y_i$'s are all zero matrices.

Remark 7 Take special care in the example above: our usual $A$ matrix consists of submatrices, or vector combinations, and therefore the components of $A$ are $I$ instead of 1.
2.2.4 Conditional Distributions of Normal Random Vectors

Theorem 15 Suppose $X : p \times 1$ is $N_p(\mu, \Sigma)$ distributed, i.e. with density function
\[
f(x) = (2\pi)^{-\frac{p}{2}}|\Sigma|^{-\frac{1}{2}}\exp\{-\tfrac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)\}
= |2\pi\Sigma|^{-\frac{1}{2}}\exp\{-\tfrac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)\}.
\]
Let $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ where $X_1 : q \times 1$ and $X_2 : (p-q) \times 1$, with corresponding $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$.

The conditional distribution of $X_1$, given $X_2 = x_2$, is a normal distribution with expected value
\[
E[X_1\,|\,X_2 = x_2] = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)
\]
and covariance matrix
\[
\mathrm{cov}(X_1, X_1'\,|\,X_2 = x_2) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \Sigma_{11.2}.
\]

Proof. The result will be given.

Remark 8 The matrix $\Sigma_{12}\Sigma_{22}^{-1}$ is the matrix of regression coefficients of $X_1$ on $x_2$. The vector $\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$ is often called the regression function.

Remark 9 The conditional variance of $X_i$, $i = 1, 2, \ldots, q$, given that $X_2 = x_2$, i.e. the $i$-th diagonal element of $\Sigma_{11.2}$, is called the partial variance of $X_i$, given that $X_2 = x_2$, and is written as $\sigma_{ii.q+1,\ldots,p}$. The conditional covariance between $X_i$ and $X_j$, given that $X_2 = x_2$, i.e. the $(i, j)$-th element of $\Sigma_{11.2}$, is called the partial covariance between $X_i$ and $X_j$, given $X_2 = x_2$, and is written as $\sigma_{ij.q+1,\ldots,p}$. The conditional correlation coefficient between $X_i$ and $X_j$, given that $X_2 = x_2$, is the partial correlation coefficient between $X_i$ and $X_j$ and is given by
\[
\rho_{ij.q+1,\ldots,p} = \frac{\sigma_{ij.q+1,\ldots,p}}{\sqrt{\sigma_{ii.q+1,\ldots,p}\,\sigma_{jj.q+1,\ldots,p}}}.
\]

Remark 10 Note that $\rho_{ij.q+1,\ldots,p}$ does not depend on $x_2$, which implies that $\rho_{ij.q+1,\ldots,p}$ is independent of the given value of $X_2$. We may consider the partial correlation coefficient as the correlation coefficient between $X_i$ and $X_j$ when the influence of $X_2$ is eliminated or removed (i.e. if $X_2$ is fixed).

Suppose $X : p \times 1$ is $N(\mu, \Sigma)$ distributed. It has already been shown that
\[
Y = DX + c \sim N(D\mu + c,\ D\Sigma D').
\]
In the case where $D = \mathrm{diag}(d_1, d_2, \ldots, d_p)$ is a diagonal matrix, we say that this transformation is a scale transformation. If the elements of $D$ are all positive, then it is a positive scale transformation. The correlation matrix of the elements of $Y$ in the case of a positive scale transformation is
\[
\rho_{ij}^{Y} = \frac{d_id_j\sigma_{ij}}{\sqrt{d_i^2\sigma_{ii}\,d_j^2\sigma_{jj}}} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}} = \rho_{ij}^{X}.
\]
The correlation matrix of $X$ is the covariance matrix of a specific positive scale transformation of $X$. Take
\[
D = \mathrm{diag}\left(\frac{1}{\sqrt{\sigma_{11}}}, \frac{1}{\sqrt{\sigma_{22}}}, \ldots, \frac{1}{\sqrt{\sigma_{pp}}}\right).
\]
It then follows that
\[
(\rho_{ij}) = D\Sigma D
\]
since $D$ is diagonal (and symmetric). Consequently, the correlation matrix of $X$ is simply the covariance matrix of $Y = DX$.

Theorem 16 Suppose $X : p \times 1$ is $N_p(\mu, \Sigma)$ distributed. The partial correlation coefficient $\rho_{ij.q+1,\ldots,p}$ is invariant with respect to a positive scale transformation of the stochastic variates. That means that if
\[
Y = DX + c \quad\text{and}\quad D = \mathrm{diag}(d_1, d_2, \ldots, d_p),
\]
the partial correlation coefficient between the $i$-th and $j$-th elements of $Y$ is equal to $\rho_{ij.q+1,\ldots,p}$, the corresponding partial correlation coefficient of $X$.

Proof. The result will be given.

Example 10 Ficus is a genus of approximately 850 species of shrubs, trees, and vines in the family Moraceae, sometimes loosely referred to as "fig trees", or figs. The fruit that these trees bear is of vital cultural importance around the world and also serves as an extremely important food resource for wildlife. Consider the hypothetical example where $p = 3$ (this is also called "trivariate") about heights (in cm) of three different ficus species that are found in South Africa:

$X_1$ = height of Ficus bizanae
$X_2$ = height of Ficus burtt-davyi
$X_3$ = height of Ficus tettensis

The interest in Ficus and the modelling of its characteristics is important for understanding ecological advancement, and wildlife as well as cultural conservation. The covariance matrix is given by
\[
\Sigma = (\sigma_{ij}) = \begin{pmatrix} 100 & 70 & 90 \\ 70 & 100 & 80 \\ 90 & 80 & 100 \end{pmatrix}.
\]
The partial covariance matrix of the heights of the first two trees, if the influence of the third tree is eliminated, is
\[
\Sigma_{11.2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
= \begin{pmatrix} 100 & 70 \\ 70 & 100 \end{pmatrix} - \begin{pmatrix} 90 \\ 80 \end{pmatrix}\frac{1}{100}\begin{pmatrix} 90 & 80 \end{pmatrix}
= \begin{pmatrix} 19 & -2 \\ -2 & 36 \end{pmatrix}.
\]
The partial correlation coefficient between the heights of the first two trees, if the influence of the height of the third tree is eliminated, then is
\[
\rho_{12.3} = \frac{-2}{\sqrt{19 \times 36}} = -0.076.
\]
This analysis tells us, for example, that the heights of the first two trees are not particularly highly correlated once the third is accounted for, and thus gives insight to farmers/scientists/the public that seeing one tall Ficus might not mean any Ficus will necessarily be tall. This aids with plantation design and wildlife conservation - i.e. how "much" Ficus and Ficus leaves might be available for wildlife consumption as part of herd and grazing planning.
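The partial covariance matrix and partial correlation coefficient in Example 10 can be verified numerically in PROC IML (an added sketch, using the covariance matrix given above):

proc iml;
Sig = {100 70 90, 70 100 80, 90 80 100};
S11 = Sig[1:2, 1:2];  S12 = Sig[1:2, 3];
S21 = Sig[3, 1:2];    S22 = Sig[3, 3];
S11_2 = S11 - S12*inv(S22)*S21;                     /* partial covariance matrix */
rho12_3 = S11_2[1,2]/sqrt(S11_2[1,1]*S11_2[2,2]);   /* partial correlation       */
print S11_2, rho12_3;  /* should give (19 -2, -2 36) and -0.076                  */
quit;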

Example 11 Redo the above example, but calculate the partial correlation coefficient between the heights of Ficus bizanae and Ficus tettensis, by eliminating the influence of Ficus burtt-davyi.

Example 12 Take note: the type of analysis above is crucial for understanding the relationships between, in this example, heights of shrubs/trees of the same genus but of different type. The elimination of the influence of one or more variables allows the scientist/practitioner to gain insight into the relationships that the remaining variables have. Think for example if your data is vast but the geographical area of your interest does not cater for Ficus bizanae. Then you can investigate the behaviour of the relationship of the remaining two types by disregarding the influence of Ficus bizanae!

Example 13 Finally: in this hypothetical example, we only considered 3 Ficus types. There are more than 800 Ficus types, so in the case where you have access to the heights from a sample of all 800, your covariance matrix would be 800 × 800 - which is huge! It is then that you would want to have access to a reasonable computer, good software, and your own strong coding skills to solve these computational challenges. :)

Take note:
Since the partial correlation coefficient is scale invariant, it may be calculated directly from the correlation matrix. For the case $p = 3$,
\[
(\sigma_{ij}) = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{pmatrix}
\quad\text{and}\quad
(\rho_{ij}) = \begin{pmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{21} & 1 & \rho_{23} \\ \rho_{31} & \rho_{32} & 1 \end{pmatrix}.
\]
In this case
\[
\Sigma_{11.2} = \frac{1}{\sigma_{33}}\begin{pmatrix} \sigma_{11}\sigma_{33} - \sigma_{13}^2 & \sigma_{12}\sigma_{33} - \sigma_{13}\sigma_{23} \\ \sigma_{12}\sigma_{33} - \sigma_{13}\sigma_{23} & \sigma_{22}\sigma_{33} - \sigma_{23}^2 \end{pmatrix}.
\]
The partial correlation coefficient follows as
\[
\begin{aligned}
\rho_{12.3} &= \frac{\sigma_{12.3}}{\sqrt{\sigma_{11.3}\,\sigma_{22.3}}} \\
&= \frac{\sigma_{12}\sigma_{33} - \sigma_{13}\sigma_{23}}{\sqrt{(\sigma_{11}\sigma_{33} - \sigma_{13}^2)(\sigma_{22}\sigma_{33} - \sigma_{23}^2)}} \\
&= \frac{\frac{1}{\sqrt{\sigma_{11}\sigma_{22}}\,\sigma_{33}}\left(\sigma_{12}\sigma_{33} - \sigma_{13}\sigma_{23}\right)}{\frac{1}{\sqrt{\sigma_{11}\sigma_{22}}\,\sigma_{33}}\sqrt{(\sigma_{11}\sigma_{33} - \sigma_{13}^2)(\sigma_{22}\sigma_{33} - \sigma_{23}^2)}} \\
&= \frac{\rho_{12} - \rho_{13}\rho_{23}}{\sqrt{(1 - \rho_{13}^2)(1 - \rho_{23}^2)}}.
\end{aligned}
\]

Example 14 Define $X_1$, $X_2$ and $X_3$ as in the previous example. Assume that
\[
\mu = \begin{pmatrix} 66 \\ 66 \\ 66 \end{pmatrix}.
\]
Let
\[
Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
\]
where $Y_1 = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ and $Y_2 = (X_3) = X_3$. The expected value of $Y_1 = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ given $X_3$ is
\[
\begin{aligned}
E(Y_1\,|\,Y_2 = y_2 = x_3) &= \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y_2 - \mu_2) \\
&= \begin{pmatrix} 66 \\ 66 \end{pmatrix} + \begin{pmatrix} 90 \\ 80 \end{pmatrix}\frac{1}{100}(x_3 - 66) \\
&= \begin{pmatrix} 6.6 \\ 13.2 \end{pmatrix} + \begin{pmatrix} 0.9 \\ 0.8 \end{pmatrix}x_3.
\end{aligned}
\]
The estimated values of $X_1$ and $X_2$, given that Ficus tettensis has a height of 72, are
\[
E\left[\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\,\middle|\,X_3 = 72\right]
= \begin{pmatrix} 6.6 + 0.9 \times 72 \\ 13.2 + 0.8 \times 72 \end{pmatrix}
= \begin{pmatrix} 71.4 \\ 70.8 \end{pmatrix}.
\]
The estimated values of $X_1$ and $X_2$, given that Ficus tettensis has a height of 60, are
\[
E\left[\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\,\middle|\,X_3 = 60\right]
= \begin{pmatrix} 6.6 + 0.9 \times 60 \\ 13.2 + 0.8 \times 60 \end{pmatrix}
= \begin{pmatrix} 60.6 \\ 61.2 \end{pmatrix}.
\]
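The regression function in Example 14 can likewise be checked numerically (an added sketch, using the mean vector and covariance matrix given above):

proc iml;
mu  = {66, 66, 66};
Sig = {100 70 90, 70 100 80, 90 80 100};
B = Sig[1:2, 3]*inv(Sig[3, 3]);           /* Sigma_12*Sigma_22^{-1} = (0.9, 0.8)' */
do x3 = 60 to 72 by 12;
   condmean = mu[1:2] + B*(x3 - mu[3]);   /* E[(X1,X2)' | X3 = x3]                */
   print x3 condmean;
end;
quit;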

2.2.5 Multiple Correlation

X1
Consider the random vector X : p 1 with expected value and covariance matrix . Let X = where
X2
0
1 11 1
X1 : 1 1, X 2 : (p 1) 1 and = and = correspondingly.
2 1 22
In order to …nd a measure of the relation between X1 on the one hand and the components of X 2 on the other,
the multiple correlation coe¢ cient between X1 and X 2 is de…ned. The correlation coe¢ cient between X1 and the
linear function 0 X 2 = 1 X2 + 2 X3 + + p 1 Xp is given by
0
cov (X1 ; X 2)
=p :
var(X1 )var( 0X
2)

The multiple correlation coe¢ cient between X1 and X 2 is de…ned as


( )
cov(X1 ; 0 X 2 )
R = max f g = max p :
var(X1 )var( 0 X 2 )
Thus the multiple correlation coe¢ cient between X1 and X 2 is de…ned as the maximum correlation between X1 and
the linear combination 0 X2 , where the maximum is taken over all vectors . It is now proven that the for which
this correlation is a maximum, is given by
= 221 1
i.e. the vector of regression coe¢ cients of X1 on X 2 .

Theorem 17 Of all linear combinations 0 X 2 , that combination which minimizes the variance of
0
X1 X 2 and maximizes the correlation between X1 and 0 X 2 , is given by
0 1
X 2 ; where = 22 1:

35
Proof. The result will be given.

Therefore it follows that the multiple correlation coefficient is given by
\[
R = \frac{\mathrm{cov}(X_1, \beta'X_2)}{\sqrt{\mathrm{var}(X_1)\,\mathrm{var}(\beta'X_2)}}
= \frac{\beta'\sigma_1}{\sqrt{\sigma_{11}\,\beta'\Sigma_{22}\beta}}
= \frac{\sigma_1'\Sigma_{22}^{-1}\sigma_1}{\sqrt{\sigma_{11}\,\sigma_1'\Sigma_{22}^{-1}\sigma_1}}
= \sqrt{\frac{\sigma_1'\Sigma_{22}^{-1}\sigma_1}{\sigma_{11}}}.
\]
It also follows that
\[
\sigma_{11.2,\ldots,p} = \sigma_{11} - \sigma_1'\Sigma_{22}^{-1}\sigma_1 = (1 - R^2)\sigma_{11}.
\]
This identity implies that $1 - R^2$ is the proportion of the variance of $X_1$ which is not explained by the variables $X_2, \ldots, X_p$, since $(1 - R^2)\sigma_{11}$ is the conditional variance of $X_1$, given $X_2, \ldots, X_p$. The greater $R$, the greater the reduction of the partial variance.

Theorem 18 The multiple correlation coefficient $R$ is invariant with respect to the transformation $X \to DX$ where
\[
D = \begin{pmatrix} d_1 & 0' \\ 0 & D_2 \end{pmatrix}, \qquad d_1 \neq 0 \text{ and } D_2 \text{ non-singular}.
\]

Proof. The covariance matrix of $DX$ is
\[
D\Sigma D' = \begin{pmatrix} d_1 & 0' \\ 0 & D_2 \end{pmatrix}\begin{pmatrix} \sigma_{11} & \sigma_1' \\ \sigma_1 & \Sigma_{22} \end{pmatrix}\begin{pmatrix} d_1 & 0' \\ 0 & D_2' \end{pmatrix}
= \begin{pmatrix} d_1^2\sigma_{11} & d_1\sigma_1'D_2' \\ d_1D_2\sigma_1 & D_2\Sigma_{22}D_2' \end{pmatrix}
\]
and the multiple correlation coefficient of the first element of $DX$ on the rest is
\[
\sqrt{\frac{d_1^2\,\sigma_1'D_2'(D_2\Sigma_{22}D_2')^{-1}D_2\sigma_1}{d_1^2\,\sigma_{11}}} = \sqrt{\frac{\sigma_1'\Sigma_{22}^{-1}\sigma_1}{\sigma_{11}}} = R.
\]

Example 15 In the previous example, we may be interested in the multiple correlation coefficient between the height of the first tree species and the heights of the second and third species. In this case
\[
\Sigma = (\sigma_{ij}) = \begin{pmatrix} 100 & 70 & 90 \\ 70 & 100 & 80 \\ 90 & 80 & 100 \end{pmatrix}
\]
\[
\begin{aligned}
\sigma_1'\Sigma_{22}^{-1}\sigma_1 &= (70,\ 90)\begin{pmatrix} 100 & 80 \\ 80 & 100 \end{pmatrix}^{-1}\begin{pmatrix} 70 \\ 90 \end{pmatrix} \\
&= (70,\ 90)\,\frac{1}{3600}\begin{pmatrix} 100 & -80 \\ -80 & 100 \end{pmatrix}\begin{pmatrix} 70 \\ 90 \end{pmatrix} \\
&= \frac{1}{3600}(-200,\ 3400)\begin{pmatrix} 70 \\ 90 \end{pmatrix} \\
&= \frac{292000}{3600} = 81.1111.
\end{aligned}
\]
Hence:
\[
R_{1.23} = \sqrt{\frac{81.1111}{100}} = \sqrt{0.8111} = 0.90.
\]
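A numerical check of Example 15 (added sketch, using the covariance matrix above):

proc iml;
Sig = {100 70 90, 70 100 80, 90 80 100};
sig11 = Sig[1,1];
sig1  = Sig[2:3, 1];              /* covariances of X1 with (X2, X3)' */
S22   = Sig[2:3, 2:3];
R     = sqrt(sig1`*inv(S22)*sig1/sig11);
print R;                          /* approximately 0.90               */
quit;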

Remark 11 Remind yourself here that $\sigma_1'\Sigma_{22}^{-1}\sigma_1$ need not lie between 0 and 1, but $R$ will always be bounded between 0 and 1.
3 NORMAL SAMPLES
3.1 Estimation of Parameters
3.1.1 The Mean and Covariance Matrix

Suppose $X_1, X_2, \ldots, X_N$, $N > p$, is a random sample of $N$ vector observations from a population with a $N_p(\mu, \Sigma)$ distribution, $\mu : p \times 1$. The observed values can be written as a data matrix
\[
X : N \times p =
\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix}
= \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N' \end{pmatrix}
\]
or
\[
X' : p \times N =
\begin{pmatrix} x_{11} & x_{21} & \cdots & x_{N1} \\ x_{12} & x_{22} & \cdots & x_{N2} \\ \vdots & \vdots & & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{Np} \end{pmatrix}
= (x_1, x_2, \ldots, x_N).
\]
The expected value of $X$ is
\[
E(X) = \begin{pmatrix} \mu' \\ \mu' \\ \vdots \\ \mu' \end{pmatrix} = 1_N\,\mu'
\]
and
\[
E(X') = E(X_1, X_2, \ldots, X_N) = (\mu, \mu, \ldots, \mu) = \mu\,1_N'.
\]
Also
\[
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{1}{N}X'1_N =
\begin{pmatrix} \frac{1}{N}\sum_{i=1}^{N} x_{i1} \\ \frac{1}{N}\sum_{i=1}^{N} x_{i2} \\ \vdots \\ \frac{1}{N}\sum_{i=1}^{N} x_{ip} \end{pmatrix}
= \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix}.
\]
Next, the maximum likelihood estimators of $\mu$ and $\Sigma$ are given.

Theorem 19 Suppose $X_1, X_2, \ldots, X_N$, $N > p$, is a random sample of $N$ vector observations from a population with a $N_p(\mu, \Sigma)$ distribution, $\mu : p \times 1$. The maximum likelihood estimators for $\mu$ and $\Sigma$ are given by
\[
\hat{\mu} = \bar{x} \quad\text{and}\quad \hat{\Sigma} = \frac{1}{N}A
\]
where $A = \sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})'$.

Proof. The result will be given.
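In PROC IML the maximum likelihood estimates take one line each; the sketch below reuses the invented $4 \times 2$ data matrix from the sample-statistics sketch in Section 2.1.2, so it is illustration only:

proc iml;
X = {3 5, 4 7, 6 6, 7 10};           /* hypothetical N x p data matrix   */
N = nrow(X);
one = j(N, 1, 1);
mu_hat  = (1/N)*X`*one;              /* m.l.e. of mu: the sample mean    */
A = X`*(i(N) - (1/N)*one*one`)*X;    /* sum of (x_a - xbar)(x_a - xbar)' */
Sig_hat = A/N;                       /* m.l.e. of Sigma (biased)         */
S       = A/(N-1);                   /* unbiased estimator (Theorem 21)  */
print mu_hat Sig_hat S;
quit;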

3.1.2 Correlation Coefficients

Lemma 1 Let $f(\theta) : \Re^p \to \Re$ and $\tau = \tau(\theta) : \Re^p \to \Re^p$ where $\tau(\cdot)$ is a one-to-one function, i.e. $\theta = \tau^{-1}(\tau)$. Let $g(\tau) = f(\tau^{-1}(\tau))$. If $f(\theta)$ is a maximum at $\theta = \theta_0$, then $g(\tau)$ is a maximum at $\tau = \tau_0 = \tau(\theta_0)$. If the maximum of $f(\theta)$ is unique, then the maximum of $g(\tau)$ is also unique.

Proof. Know how to apply the result.

Remark 12 Find the theorem from Bain and Engelhardt (WST 221) where this concept was covered.

It follows from the lemma that if $\hat{\theta}$ is a set of m.l.e.'s of $\theta$ and $\tau = \tau(\theta)$ where $\tau$ is one-to-one, then $\hat{\tau} = \tau(\hat{\theta})$ is a set of m.l.e.'s of $\tau$. Uniqueness is also carried over. Since the transformations of the parameters
\[
\mu_i,\ i = 1, 2, \ldots, p \quad\text{and}\quad \sigma_{ij},\ i, j = 1, 2, \ldots, p
\]
to the parameters
\[
\mu_i,\ i = 1, 2, \ldots, p; \qquad \sigma_{ii},\ i = 1, 2, \ldots, p
\]
and
\[
\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}},\ i \neq j = 1, 2, \ldots, p
\]
are one-to-one, it follows that the m.l.e.'s of the $\rho_{ij}$ are given by
\[
\hat{\rho}_{ij} = \frac{\hat{\sigma}_{ij}}{\sqrt{\hat{\sigma}_{ii}\hat{\sigma}_{jj}}} = \frac{a_{ij}}{\sqrt{a_{ii}a_{jj}}}
= \frac{\sum_{\alpha=1}^{N}(x_{\alpha i} - \bar{x}_i)(x_{\alpha j} - \bar{x}_j)}{\sqrt{\left\{\sum_{\alpha=1}^{N}(x_{\alpha i} - \bar{x}_i)^2\right\}\left\{\sum_{\alpha=1}^{N}(x_{\alpha j} - \bar{x}_j)^2\right\}}}.
\]
Suppose now that $X : p \times 1$ is $N_p(\mu, \Sigma)$ distributed. Let $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ where $X_1 : q \times 1$, with $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$ corresponding. Let $B = \Sigma_{12}\Sigma_{22}^{-1}$ (the matrix of regression coefficients, see Remark 8). The transformation of $\Sigma$ to $\Sigma_{11.2}$, $B$ and $\Sigma_{22}$ is one-to-one. The maximum likelihood estimators are then
\[
\hat{\Sigma}_{11.2} = \hat{\Sigma}_{11} - \hat{\Sigma}_{12}\hat{\Sigma}_{22}^{-1}\hat{\Sigma}_{21}, \qquad \hat{B} = \hat{\Sigma}_{12}\hat{\Sigma}_{22}^{-1}.
\]
Similarly it also follows that the maximum likelihood estimator of $\rho_{ij.q+1,\ldots,p}$ is given by
\[
\hat{\rho}_{ij.q+1,\ldots,p} = \frac{\hat{\sigma}_{ij.q+1,\ldots,p}}{\sqrt{\hat{\sigma}_{ii.q+1,\ldots,p}\,\hat{\sigma}_{jj.q+1,\ldots,p}}}.
\]
To determine the maximum likelihood estimator of the multiple correlation coefficient, suppose again that $X : p \times 1$ is $N(\mu, \Sigma)$ distributed and let $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ where $X_1 : 1 \times 1$, $X_2 : (p-1) \times 1$ and $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$, $\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_1' \\ \sigma_1 & \Sigma_{22} \end{pmatrix}$ corresponding. The multiple correlation coefficient is
\[
R = \sqrt{\frac{\sigma_1'\Sigma_{22}^{-1}\sigma_1}{\sigma_{11}}}.
\]
The transformation of $\Sigma$ to $R$, $\sigma_1$ and $\Sigma_{22}$ is again one-to-one, which implies that the maximum likelihood estimator of $R$ is given by
\[
\hat{R} = \sqrt{\frac{\hat{\sigma}_1'\hat{\Sigma}_{22}^{-1}\hat{\sigma}_1}{\hat{\sigma}_{11}}}.
\]
3.2 Sampling Distributions
3.2.1 The Mean and Covariance Matrix

Theorem 20 The mean of a random sample of $N$ observations from a $N_p(\mu, \Sigma)$ distribution is $N_p(\mu, \frac{1}{N}\Sigma)$ distributed, independent of $\hat{\Sigma} = \frac{1}{N}A = \frac{1}{N}\sum_{\alpha=1}^{N}(X_\alpha - \bar{X})(X_\alpha - \bar{X})'$, the m.l.e. of $\Sigma$. Furthermore $N\hat{\Sigma} = A = \sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'$, where $Z_\alpha$, $\alpha = 1, 2, \ldots, N-1$, are independently $N_p(0, \Sigma)$ distributed.

Proof. Suppose that the rows of $X$ are a random sample from a $N_p(\mu, \Sigma)$ distribution. Let $B : N \times N$ be an orthogonal matrix with last row
\[
\left(\tfrac{1}{\sqrt{N}}, \tfrac{1}{\sqrt{N}}, \ldots, \tfrac{1}{\sqrt{N}}\right).
\]
Thus, the sum of the elements of any other row of $B$ will be equal to zero. Let
\[
Z = BX =
\begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1N} \\ b_{21} & b_{22} & \cdots & b_{2N} \\ \vdots & \vdots & & \vdots \\ \tfrac{1}{\sqrt{N}} & \tfrac{1}{\sqrt{N}} & \cdots & \tfrac{1}{\sqrt{N}} \end{pmatrix}
\begin{pmatrix} X_1' \\ X_2' \\ \vdots \\ X_N' \end{pmatrix}
=
\begin{pmatrix} \sum_{i=1}^{N} b_{1i}X_i' \\ \sum_{i=1}^{N} b_{2i}X_i' \\ \vdots \\ \tfrac{1}{\sqrt{N}}\sum_{i=1}^{N} X_i' \end{pmatrix}
= \begin{pmatrix} Z_1' \\ Z_2' \\ \vdots \\ Z_N' \end{pmatrix}.
\]
Note that
\[
Z_\alpha' = \sum_{i=1}^{N} b_{\alpha i}X_i'.
\]

Linear combinations of normally distributed variables are again normally distributed. This means that the joint
distribution of all the elements of Z is normal with
\[
E(Z_\alpha') = \sum_{i=1}^{N} b_{\alpha i} E(X_i') = \sum_{i=1}^{N} b_{\alpha i}\, \mu'
= \begin{cases} 0' & \text{if } \alpha \neq N \\ \sqrt{N}\,\mu' & \text{if } \alpha = N \end{cases}
\]

and
\[
\mathrm{cov}(Z_\alpha, Z_\beta')
= \mathrm{cov}\!\left( \sum_{i=1}^{N} b_{\alpha i} X_i,\; \sum_{j=1}^{N} b_{\beta j} X_j' \right)
= \sum_{i=1}^{N} \mathrm{cov}\!\left( b_{\alpha i} X_i,\; b_{\beta i} X_i' \right) + 0
= \sum_{i=1}^{N} b_{\alpha i} b_{\beta i}\, \mathrm{cov}(X_i, X_i')
= \sum_{i=1}^{N} b_{\alpha i} b_{\beta i}\, \Sigma
= \begin{cases} \Sigma & \text{if } \alpha = \beta \\ 0 & \text{if } \alpha \neq \beta \end{cases}
\]
(the cross terms vanish because different observations $X_i$ and $X_j$, $i \neq j$, are independent, and $\sum_{i=1}^{N} b_{\alpha i} b_{\beta i}$ equals $1$ if $\alpha = \beta$ and $0$ otherwise since $B$ is orthogonal).

Since the covariance between any two different rows of $Z$ is a zero matrix, it follows that the rows of $Z$ are independent. Thus, for
\[
Z = \begin{pmatrix} Z_1' \\ Z_2' \\ \vdots \\ Z_N' \end{pmatrix},
\]
$Z_\alpha \sim N_p(0, \Sigma)$ if $\alpha = 1, 2, \ldots, N-1$ and $Z_N \sim N_p(\sqrt{N}\mu, \Sigma)$, all independent.

Since
\[
\bar{X} = \tfrac{1}{N} X' 1_N = \tfrac{1}{N} Z' B 1_N = \tfrac{1}{\sqrt{N}} Z_N,
\]
it follows that $\bar{X} \sim N_p\!\left(\mu, \tfrac{1}{N}\Sigma\right)$.

Next,
\[
\begin{aligned}
A &= \sum_{\alpha=1}^{N} (X_\alpha - \bar{X})(X_\alpha - \bar{X})' \\
  &= \sum_{\alpha=1}^{N} X_\alpha X_\alpha' - N \bar{X}\bar{X}' \\
  &= X'X - \tfrac{1}{N} X'1_N (1_N'X) \\
  &= (B'Z)'(B'Z) - \tfrac{1}{N} (B'Z)'1_N 1_N'(B'Z) \\
  &= Z'BB'Z - \tfrac{1}{N} Z'B 1_N 1_N' B'Z \\
  &= Z'Z - Z_N Z_N' \\
  &= \sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'
\end{aligned}
\]
since
\[
(B1_N)(1_N'B') = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ \sqrt{N} \end{pmatrix}
\begin{pmatrix} 0 & \cdots & 0 & \sqrt{N} \end{pmatrix}
= \begin{pmatrix} 0 & \cdots & 0 & 0 \\ \vdots & & & \vdots \\ 0 & \cdots & 0 & N \end{pmatrix}.
\]

Since $A$ is only a function of $Z_\alpha$, $\alpha = 1, 2, \ldots, N-1$, while $\bar{X} = \tfrac{1}{\sqrt{N}} Z_N$ is only a function of $Z_N$, and the rows of $Z$ are independent, it follows that $A$ and $\bar{X}$ are independent. From this, $\bar{X}$ is also independent of $\hat{\Sigma} = \tfrac{1}{N}A$.
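The orthogonal matrix $B$ used in the proof can be constructed explicitly (a Helmert-type matrix). The sketch below, for the illustrative choice $N = 4$, builds such a matrix and checks the two properties used above: $BB' = I_N$ and zero row sums for every row except the last.

proc iml;
N = 4;
B = j(N, N, 0);
do i = 1 to N-1;
   B[i, 1:i] = 1/sqrt(i*(i+1));        /* first i entries of row i            */
   B[i, i+1] = -i/sqrt(i*(i+1));       /* makes row i sum to zero             */
end;
B[N, ] = j(1, N, 1/sqrt(N));           /* last row (1/sqrt(N),...,1/sqrt(N))  */
orthoCheck = max(abs(B*B` - I(N)));    /* 0 if B is orthogonal                */
rowSums = B[, +];                      /* only the last row sum is nonzero    */
print B, orthoCheck rowSums;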

Theorem 21 An unbiased estimator for the covariance matrix of a random sample of $N$ observations from a $N_p(\mu, \Sigma)$ distribution is
\[
\tfrac{1}{N-1} A = \tfrac{1}{N-1} \sum_{\alpha=1}^{N} (X_\alpha - \bar{X})(X_\alpha - \bar{X})'.
\]

Proof. From the distribution of $Z_\alpha$, $Z_\alpha \sim N_p(0, \Sigma)$ for $\alpha = 1, 2, \ldots, N-1$, it follows that
\[
E(A) = E\!\left( \sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha' \right)
= \sum_{\alpha=1}^{N-1} \left[ E(Z_\alpha Z_\alpha') - E(Z_\alpha)E(Z_\alpha') \right]
= \sum_{\alpha=1}^{N-1} \mathrm{cov}(Z_\alpha, Z_\alpha')
= \sum_{\alpha=1}^{N-1} \Sigma = (N-1)\Sigma
\]
and $E(\hat{\Sigma}) = \frac{N-1}{N}\Sigma$. An unbiased estimator of $\Sigma$ would be
\[
\tfrac{1}{N-1} A = \tfrac{1}{N-1} \sum_{\alpha=1}^{N} (X_\alpha - \bar{X})(X_\alpha - \bar{X})'.
\]
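The unbiasedness can also be checked empirically. The following simulation sketch (with hypothetical $\mu$, $\Sigma$, sample size and seed) averages $A/(N-1)$ and $A/N$ over many samples; the first average should be close to $\Sigma$, the second slightly too small.

proc iml;
call randseed(12345);
mu    = {0 0};                         /* row vector, as required by randnormal */
Sigma = {2 1, 1 3};
N = 10;  reps = 5000;
sumS = j(2,2,0);  sumMLE = j(2,2,0);
do r = 1 to reps;
   X = randnormal(N, mu, Sigma);       /* N x p sample                          */
   one = j(N,1,1);
   A = X`*(I(N) - one*one`/N)*X;
   sumS   = sumS   + A/(N-1);
   sumMLE = sumMLE + A/N;
end;
avgS = sumS/reps;  avgMLE = sumMLE/reps;
print Sigma avgS avgMLE;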

3.2.2 Quadratic Forms of Normally Distributed Variates

Lemma 2 Let $X : p \times 1 \sim N_p(0, I_p)$ and let $A : p \times p$ be symmetric and idempotent of rank $r$. Then
\[
S = X'AX
\]
has the $\chi^2$ distribution with $r$ degrees of freedom.

Proof. Let $H$ be orthogonal such that
\[
H'AH = \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix}.
\]
Let $Y = H'X$. Then $Y \sim N_p(0, I_p)$ and
\[
S = X'AX = Y'H'AHY = \sum_{i=1}^{r} Y_i^2
\]
which has the $\chi^2$ distribution with $r$ degrees of freedom.

Example 16 Write down $S$ when $p = 2$ and $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$.

Corollary 10
1. $S = X'X \sim \chi^2(p)$.
2. If $X : n \times 1 \sim N_n(0, I_n)$, then $(n-1)S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 \sim \chi^2(n-1)$.

The second part of the Corollary follows from the fact that if
\[
Y = \begin{pmatrix} X_1 - \bar{X} \\ X_2 - \bar{X} \\ \vdots \\ X_n - \bar{X} \end{pmatrix}
= X - \tfrac{1}{n} 1_n 1_n' X
= \left( I_n - \tfrac{1}{n} 1_n 1_n' \right) X
\]
then $I_n - \tfrac{1}{n} 1_n 1_n'$ is idempotent and has rank $n-1$. From the previous lemma,
\[
Y'Y = X'\left( I_n - \tfrac{1}{n} 1_n 1_n' \right) X = (n-1)S^2 \sim \chi^2(n-1).
\]

Lemma 3 Suppose that $X : p \times 1 \sim N_p(0, I_p)$. Let
\[
S = X'AX \text{ with } A \geq 0 \qquad \text{and} \qquad Y = BX.
\]
If $BA = 0$, then $S$ and $Y$ are independent.

Proof. The result will be given.


Corollary 11 If $X : n \times 1 \sim N_n(0, I_n)$ then $\bar{X}$ and $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$ are independent.

Lemma 4 If $X : p \times 1 \sim N_p(\mu, \Sigma)$, then
\[
S = (X - \mu)' \Sigma^{-1} (X - \mu)
\]
has a $\chi^2$ distribution with $p$ degrees of freedom.

Proof. Let $Y = \Sigma^{-\frac12}(X - \mu) \sim N_p(0, I_p)$. The quadratic form
\[
S = (X - \mu)' \Sigma^{-1} (X - \mu) = Y'Y \sim \chi^2(p).
\]

From the three lemmas above it follows that:

1. If $X : N \times 1 \sim N_N(0, I_N)$, then $S = X'X \sim \chi^2(N)$.

2. If $X : N \times 1 \sim N_N(0, I_N)$, then $\sum_{i=1}^{N}(X_i - \bar{X})^2 \sim \chi^2(N-1)$. Define, for example,
\[
Y = \left[ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
- \begin{pmatrix} \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 \end{pmatrix} \right]
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix}
= \begin{pmatrix} \tfrac23 & -\tfrac13 & -\tfrac13 \\ -\tfrac13 & \tfrac23 & -\tfrac13 \\ -\tfrac13 & -\tfrac13 & \tfrac23 \end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix}
= \begin{pmatrix} \tfrac23 X_1 - \tfrac13 X_2 - \tfrac13 X_3 \\ -\tfrac13 X_1 + \tfrac23 X_2 - \tfrac13 X_3 \\ -\tfrac13 X_1 - \tfrac13 X_2 + \tfrac23 X_3 \end{pmatrix};
\]
then the rank of $\left[ I_3 - \tfrac13 1_3 1_3' \right] = 3 - 1 = 2$. Consider $Y'Y$ and the result follows (a numerical check of this construction is given in the sketch after this list).

3. If $X : N \times 1 \sim N_N(\mu 1_N, \sigma^2 I_N)$ then $\bar{X}$ and $\sum_{i=1}^{N}(X_i - \bar{X})^2$ are independent, where $\bar{X} \sim N(\mu, \tfrac{\sigma^2}{N})$ and $\tfrac{1}{\sigma^2}\sum_{i=1}^{N}(X_i - \bar{X})^2$ has a $\chi^2(N-1)$ distribution.
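The sketch below checks, for $n = 3$, that the centering matrix $C = I_n - \tfrac{1}{n}1_n 1_n'$ is idempotent with rank $n-1$, and that the quadratic form $X'CX$ equals $(n-1)S^2$. The data values are arbitrary and only illustrate the identity.

proc iml;
n = 3;
one = j(n,1,1);
C = I(n) - one*one`/n;                 /* centering matrix                       */
idemCheck = max(abs(C*C - C));         /* 0 if C is idempotent                   */
rankC = round(trace(C));               /* for an idempotent matrix, rank = trace */
X = {1, 4, 7};                         /* arbitrary illustrative observations    */
Q  = X`*C*X;                           /* the quadratic form Y`Y                 */
S2 = (X - X[:])`*(X - X[:])/(n-1);     /* usual sample variance                  */
print idemCheck rankC Q S2;            /* Q equals (n-1)*S2                      */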

3.2.3 The Correlation Coefficient


The sample covariance matrix $\hat{\Sigma}$ is distributed as $\tfrac{1}{N}A$, where $A$ is distributed as $\sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'$. In order to determine the sampling distribution of a correlation coefficient, let $p = 2$. Then the correlation coefficient is
\[
r = \frac{a_{12}}{\sqrt{a_{11}a_{22}}}.
\]

3.3 Inferences Concerning Multivariate Means
In this section we describe hypothesis testing for multivariate means. This is the multivariate analogue of the one-sample t-test studied in earlier years: instead of a single univariate sample (and thus a single mean), we now consider a multivariate sample whose observations are vectors with a multivariate mean vector. Please take time to revise Chapter 12 of WST 221 (Bain and Engelhardt, p. 389 - 404) in preparation for this section.

3.3.1 The distribution of $T^2$


Theorem 22 Suppose $Y : p \times 1 \sim N_p(0, \Sigma)$ is independent of $A = \sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'$, where $Z_\alpha \sim N_p(0, \Sigma)$, $\alpha = 1, \ldots, N-1$, all independent. Consider the statistic
\[
\frac{T^2}{N-1} = Y'A^{-1}Y.
\]
Then
\[
\frac{T^2}{N-1} \cdot \frac{N-p}{p} \sim F(p, N-p).
\]
The statistic $\dfrac{T^2}{N-1}$ is known as Hotelling's $T^2$ statistic.
N 1

Proof. The result will be given.


Remark 13 Take care to distinguish between the statistic $\frac{T^2}{N-1}$ and $\frac{T^2}{N-1}\cdot\frac{N-p}{p}$. The former is a function of the random variable $Y$, and is therefore a statistic. The latter is a "scaled" version of this statistic which has been proved to follow an $F$ distribution. This is useful, since the $F$ distribution is (i) known, (ii) implementable, and (iii) analytically available if required.

Remark 14 Also note that in the above theorem, $\Sigma$ is assumed to be known. This is equivalent to assuming, in the univariate case, that the variance $\sigma^2$ is known. When $\Sigma$ is not known, we use an estimate obtained from the available data (usually $A$ or $S$).

3.3.2 The Case of Unknown $\Sigma$

Test for $H_0 : \mu = \mu_0$. Consider a random sample of $N$ observations, $X_1, X_2, \ldots, X_N$, $N > p$, from a $N_p(\mu, \Sigma)$ distribution. In this case $\sqrt{N}(\bar{X} - \mu) \sim N_p(0, \Sigma)$, independent of $A = (N-1)S = N\hat{\Sigma} = \sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'$, where $Z_\alpha \sim N_p(0, \Sigma)$, all independent. Thus it follows that Hotelling's $T^2$ statistic is
\[
\frac{T^2}{N-1} = N(\bar{X} - \mu_0)' A^{-1} (\bar{X} - \mu_0)
= (\bar{X} - \mu_0)' \hat{\Sigma}^{-1} (\bar{X} - \mu_0)
= \frac{N}{N-1} (\bar{X} - \mu_0)' S^{-1} (\bar{X} - \mu_0)
\]
and
\[
\frac{T^2}{N-1}\cdot\frac{N-p}{p} \sim F(p, N-p).
\]
We reject $H_0$ if $\dfrac{T^2}{N-1}\cdot\dfrac{N-p}{p} > F_\alpha(p, N-p)$.

Exercise 3 Write down the alternative hypothesis.

Seal (1994) observed the head length ($X_1$) and head width ($X_2$) of a sample of 49 frogs, of which 14 were male and 35 female. The question to be answered is: do the mean head length and mean head width of the female frogs differ significantly from 25? The results are
\[
\bar{x}_m = \begin{pmatrix} 21.821 \\ 22.843 \end{pmatrix}, \qquad
\hat{\Sigma}_m = \frac{13}{14} S_m = \frac{1}{14} A_m = \begin{pmatrix} 17.159 & 17.731 \\ 17.731 & 19.273 \end{pmatrix}
\]
\[
\bar{x}_f = \begin{pmatrix} 22.860 \\ 24.397 \end{pmatrix}, \qquad
\hat{\Sigma}_f = \frac{34}{35} S_f = \frac{1}{35} A_f = \begin{pmatrix} 17.178 & 19.710 \\ 19.710 & 23.710 \end{pmatrix}.
\]
Consider the hypothesis
\[
H_0 : \mu_f = \begin{pmatrix} 25 \\ 25 \end{pmatrix}.
\]
The solution can be calculated by hand. This gives
\[
\frac{T^2}{N-1} = (\bar{x}_f - 25\,1_2)'\,\hat{\Sigma}_f^{-1}\,(\bar{x}_f - 25\,1_2)
= \begin{pmatrix} -2.140 & -0.603 \end{pmatrix}
\begin{pmatrix} 1.261 & -1.048 \\ -1.048 & 0.913 \end{pmatrix}
\begin{pmatrix} -2.140 \\ -0.603 \end{pmatrix}
= 3.401
\]
and since
\[
\frac{T^2}{N-1}\cdot\frac{N-p}{p} = 3.401 \times \frac{35-2}{2} = 56.116 > F_{0.01}(2, 33) = 5.312
\]
we reject $H_0$ at the 1% level of significance. At least one of the means differs significantly from 25.
The solution can also be obtained with SAS; the PROC IML code and output for the example are given below:

proc iml;
xf={22.860, 24.397};
sigf={17.178 19.710,
19.710 23.710};

isigf=inv(sigf); print isigf;

Hotelling=(xf-25#J(2,1,1))`*isigf*(xf-25#J(2,1,1));
FStat=Hotelling#33/2;
FCrit=finv(0.99,2,33);

print Hotelling FStat FCrit;

isigf
1.2607491 -1.048054
-1.048054 0.9134183

Hotelling FStat FCrit


3.4009934 56.116392 5.3120289

Example 17 Redo the example above i) by hand and ii) in PROC IML by testing $H_0 : \mu_m = \begin{pmatrix} 19.5 \\ 20.5 \end{pmatrix}$, and then by testing $H_0 : \mu_m = \begin{pmatrix} 19.9 \\ 21 \end{pmatrix}$. Is there a difference in the outcome? Comment and discuss.
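A starting sketch for Example 17 in PROC IML, assuming the male summary statistics given above (the null value mu0 below is the first of the two hypothesised vectors; change it for the second test):

proc iml;
xm   = {21.821, 22.843};
sigm = {17.159 17.731,
        17.731 19.273};                       /* Sigma_hat_m as given above         */
N = 14;  p = 2;
mu0 = {19.5, 20.5};                           /* use {19.9, 21} for the second test */
Hotelling = (xm - mu0)`*inv(sigm)*(xm - mu0); /* T^2/(N-1)                          */
FStat = Hotelling#(N-p)/p;
FCrit = finv(0.99, p, N-p);
print Hotelling FStat FCrit;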

Example 18 See https://online.stat.psu.edu/stat505/lesson/7/7.1/7.1.1 for an additional example and comparison to the univariate environment.

3.4 Principal Component Analysis
These notes are adapted from Lesson 11: Principal Component Analysis (PCA) of the STAT 505 course from the PennState Eberly College of Science, available on the internet.

3.4.1 Basics

Let $X : p \times 1$ with $\mathrm{cov}(X, X') = \Sigma$ and let $(\lambda_i, e_i)$, $i = 1, 2, \ldots, p$, be the $p$ eigenvalue-eigenvector pairs of $\Sigma$ such that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$, with $e_i'e_i = 1$ for $i = 1, 2, \ldots, p$ and $e_i'e_j = 0$ for $i \neq j$.

Definition 10 The total variation for the random vector $X : p \times 1$ is defined as
\[
\mathrm{tr}(\Sigma) = \sum_{i=1}^{p} \sigma_i^2 = \sum_{i=1}^{p} \lambda_i.
\]

The last equality follows from Theorem 7. From Theorem 6,
\[
\Sigma = P \Lambda P'
= \begin{pmatrix} e_1 & e_2 & \cdots & e_p \end{pmatrix}
\begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{pmatrix}
\begin{pmatrix} e_1' \\ e_2' \\ \vdots \\ e_p' \end{pmatrix}
= \sum_{i=1}^{p} \lambda_i e_i e_i'
\]
and
\[
\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{pmatrix}
= P' \Sigma P
= \begin{pmatrix} e_1' \\ e_2' \\ \vdots \\ e_p' \end{pmatrix} \Sigma \begin{pmatrix} e_1 & e_2 & \cdots & e_p \end{pmatrix}
= \begin{pmatrix} e_1'\Sigma e_1 & e_1'\Sigma e_2 & \cdots & e_1'\Sigma e_p \\ e_2'\Sigma e_1 & e_2'\Sigma e_2 & \cdots & e_2'\Sigma e_p \\ \vdots & \vdots & & \vdots \\ e_p'\Sigma e_1 & e_p'\Sigma e_2 & \cdots & e_p'\Sigma e_p \end{pmatrix}.
\]
Note: this process is also called the spectral decomposition of a matrix. From this last expression it follows that $\lambda_i = e_i'\Sigma e_i = e_i'\,\mathrm{cov}(X, X')\,e_i = \mathrm{cov}(e_i'X, X'e_i) = \mathrm{var}(Y_i)$ where $Y_i = e_i'X$. Also $e_i'\Sigma e_j = \mathrm{cov}(Y_i, Y_j) = 0$ for all $i \neq j$.
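The spectral decomposition can be verified numerically. In the sketch below the covariance matrix is hypothetical; CALL EIGEN returns the eigenvalues in descending order and the eigenvectors as the columns of E.

proc iml;
Sigma = {4 2 1,
         2 3 1,
         1 1 2};                            /* hypothetical covariance matrix */
p = ncol(Sigma);
call eigen(lambda, E, Sigma);               /* eigenvalues and eigenvectors   */
recon = j(p,p,0);
do i = 1 to p;
   recon = recon + lambda[i]*E[,i]*E[,i]`;  /* sum of lambda_i e_i e_i`       */
end;
totalVar = trace(Sigma);                    /* equals sum of the eigenvalues  */
print lambda totalVar, Sigma recon;         /* recon reproduces Sigma         */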

The variable $Y_i = e_i'X = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p$ is known as the $i$th principal component of the variables in $X$.

The proportion of total variation in $X$ explained by $Y_i = e_i'X$ (the $i$th principal component) is
\[
\frac{\lambda_i}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}.
\]

The purpose of principal component analysis is to approximate this expression for $\Sigma$, that is, to find $k \leq p$ such that
\[
\Sigma = \sum_{i=1}^{p} \lambda_i e_i e_i' \simeq \sum_{i=1}^{k} \lambda_i e_i e_i'.
\]
The proportion of variation explained by the first $k$ principal components is
\[
\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}.
\]
In this expression, take careful note that we can say "first $k$ principal components" because it is written in terms of the eigenvalues, and the eigenvalues are ordered.

3.4.2 Principal Components


0 1
X1
B X2 C
B C
Let X : p 1 be a random vector, X = B .. C ; with covariance matrix
@ . A
Xp
0 1
11 12 1p
B C
cov(X; X 0 ) = =B
@
21 22 2p C:
A
p1 p2 pp

Consider the linear combinations
\[
Y_i = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p = e_i'X \quad \text{for } i = 1, 2, \ldots, p
\]
with the constraint that $e_i'e_i = 1$ and $e_i'e_j = 0$ for $i \neq j$.

It follows that
\[
\mathrm{var}(Y_i) = \mathrm{cov}(e_i'X, X'e_i) = e_i'\,\mathrm{cov}(X, X')\,e_i = e_i'\Sigma e_i
\]
and
\[
\mathrm{cov}(Y_i, Y_j) = \mathrm{cov}(e_i'X, X'e_j) = e_i'\,\mathrm{cov}(X, X')\,e_j = e_i'\Sigma e_j.
\]
The $Y_i$'s are the $p$ principal components of the $X$ variables.

The 1st principal component is the linear combination of the X-variables that has the maximum variance, that is, $e_1$ is selected such that it maximizes
\[
\mathrm{var}(Y_1) = e_1'\Sigma e_1
\]
subject to the constraint
\[
e_1'e_1 = 1.
\]
This component accounts for as much of the variation in the data as possible.

The 2nd principal component is the linear combination of the X-variables that accounts for as much of the remaining variation in the data as possible. The vector $e_2$ is selected such that it maximizes
\[
\mathrm{var}(Y_2) = e_2'\Sigma e_2
\]
subject to the constraints
\[
e_2'e_2 = 1 \qquad\text{and}\qquad \mathrm{cov}(Y_1, Y_2) = e_1'\Sigma e_2 = 0.
\]
The last constraint implies that the principal components $Y_1$ and $Y_2$ are uncorrelated,
\[
\mathrm{corr}(Y_1, Y_2) = 0.
\]

Similarly, the $i$th principal component is the linear combination of the X-variables that accounts for as much of the remaining variation in the data as possible. The vector $e_i$ is selected such that it maximizes
\[
\mathrm{var}(Y_i) = e_i'\Sigma e_i = \lambda_i
\]
subject to the constraints
\[
e_i'e_i = 1 \text{ for } i = 1, 2, \ldots, p
\qquad\text{and}\qquad
\mathrm{cov}(Y_i, Y_j) = e_i'\Sigma e_j = 0 \text{ for } i \neq j.
\]
This implies that all the principal components are uncorrelated with each other.

Theorem 23 The $p$ principal components for the random vector $X : p \times 1$ are given by
\[
\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_p \end{pmatrix}
= \begin{pmatrix} e_1'X \\ e_2'X \\ \vdots \\ e_p'X \end{pmatrix}
\]
where $(\lambda_i, e_i)$, $i = 1, 2, \ldots, p$, are the $p$ eigenvalue-eigenvector pairs of $\Sigma = \mathrm{cov}(X, X')$ such that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$, with $e_i'e_i = 1$ for $i = 1, 2, \ldots, p$ and $e_i'e_j = 0$ for $i \neq j$.

3.4.3 Calculation of Principal Components for a Dataset

Suppose $X_1, X_2, \ldots, X_N$, $X_i : p \times 1$ and $N > p$, is a random sample of $N$ vector observations ($p$ variables) from a population with expected value $E(X_i) = \mu$ and covariance matrix $\Sigma = \mathrm{cov}(X_i, X_i')$. The principal components for $X : p \times 1$ can be calculated by the following steps.

1. Calculate $S = \frac{1}{N-1}(X - 1_N\bar{x}')'(X - 1_N\bar{x}') = \frac{1}{N-1} X'\left(I_N - \frac{1}{N}1_N 1_N'\right)X$.

2. Calculate the eigenvalue-eigenvector pairs of $S$: $(\hat{\lambda}_i, \hat{e}_i)$, $i = 1, 2, \ldots, p$.

3. Calculate the estimated principal components $\hat{Y}_i = \hat{e}_i'X$ for $i = 1, 2, \ldots, p$ (a PROC IML sketch of these steps is given after this list).
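A PROC IML sketch of the three steps above; the data matrix consists of the first five observations of the wafer data in Example 19 below and is used purely to keep the illustration short.

proc iml;
X = {7 4 3, 4 1 8, 6 3 5, 8 6 1, 8 5 7};
N = nrow(X);  p = ncol(X);
one = j(N,1,1);
S = X`*(I(N) - one*one`/N)*X/(N-1);     /* Step 1: sample covariance matrix       */
call eigen(lambda, E, S);               /* Step 2: eigenvalues and eigenvectors   */
xbar = X[:,];                           /* row vector of sample means             */
Scores = (X - one*xbar)*E;              /* Step 3: estimated principal components */
propVar = cusum(lambda)/sum(lambda);    /* cumulative proportion of variation     */
print lambda propVar, Scores;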

A fairly standard procedure is to use the difference between the variables and their sample means rather than the raw data. This is known as a translation of the random variables. Translation does not affect the interpretations because the variances of the original variables are the same as those of the translated variables.

The results of principal component analysis depend on the measurement scales. Variables with the highest sample variances tend to be emphasized in the first few principal components. Principal component analysis using the covariance matrix should only be considered if all of the variables have the same units of measurement. Otherwise the variance-covariance matrix of the standardized data, which is equal to the correlation matrix, should be used instead of $S$ in Step 1, that is, use
\[
R = D^{-\frac12} S D^{-\frac12}
\]
where
\[
D^{-\frac12} =
\begin{pmatrix}
\frac{1}{\sqrt{s_{11}}} & 0 & \cdots & 0 \\
0 & \frac{1}{\sqrt{s_{22}}} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{1}{\sqrt{s_{pp}}}
\end{pmatrix}.
\]
In this case the estimated value of the $i$th principal component is
\[
\hat{Y}_i = \hat{e}_i'Z = \hat{e}_{i1}Z_1 + \hat{e}_{i2}Z_2 + \cdots + \hat{e}_{ip}Z_p
\]
where
\[
Z_i = \frac{X_i - \bar{X}_i}{\sqrt{s_{ii}}}.
\]

Example 19 In the following 3-variate dataset with 10 observations, each observation consists of 3 measurements on a wafer: thickness, horizontal displacement and vertical displacement.
\[
X = \begin{pmatrix}
7 & 4 & 3 \\
4 & 1 & 8 \\
6 & 3 & 5 \\
8 & 6 & 1 \\
8 & 5 & 7 \\
7 & 2 & 9 \\
5 & 3 & 3 \\
9 & 5 & 8 \\
7 & 4 & 5 \\
8 & 2 & 2
\end{pmatrix}
\]
The purpose is to identify the number of principal components that will account for at least 85% of the variation in the data.

Solution 1: Using PROC PRINCOMP in SAS

data a;
input x1 x2 x3 @@;
cards;
7 4 3
4 1 8
6 3 5
8 6 1
8 5 7
7 2 9
5 3 3
9 5 8
7 4 5
8 2 2
;
proc princomp out=a;
var x1 x2 x3;
proc print data=a;
run;

The PRINCOMP Procedure


Observations 10
Variables 3

Simple Statistics
x1 x2 x3
Mean 6.900000000 3.500000000 5.100000000
StD 1.523883927 1.581138830 2.806737925

Correlation Matrix
x1 x2 x3
x1 1.0000 0.6687 -.1013
x2 0.6687 1.0000 -.2879
x3 -.1013 -.2879 1.0000

Eigenvalues of the Correlation Matrix


Eigenvalue Difference Proportion Cumulative
1 1.76877414 0.84169822 0.5896 0.5896
2 0.92707592 0.62292597 0.3090 0.8986
3 0.30414995 0.1014 1.0000

Eigenvectors
Prin1 Prin2 Prin3
x1 0.642005 0.384672 -.663217
x2 0.686362 0.097130 0.720745
x3 -.341669 0.917929 0.201666

Obs x1 x2 x3 Prin1 Prin2 Prin3
1 7 4 3 0.51481 -0.63084 0.03351
2 4 1 8 -2.66001 0.06281 0.33089
3 6 3 5 -0.58404 -0.29061 0.15659
4 8 6 1 2.04776 -0.90963 0.36627
5 8 5 7 0.88327 0.99120 0.34154
6 7 2 9 -1.08376 1.20857 -0.44706
7 5 3 3 -0.76187 -1.19712 0.44810
8 9 5 8 1.18284 1.57068 -0.02183
9 7 4 5 0.27135 0.02325 0.17721
10 8 2 2 0.18965 -0.82831 -1.38523

From this we see that the first two principal components will explain
\[
\frac{1.7687741 + 0.9270759}{1.7687741 + 0.9270759 + 0.3041499} \times 100 = 89.862\%
\]
of the total variation in $X$. The first principal component when using the correlation matrix is
\[
\hat{Y}_1 = 0.642\,\frac{X_1 - 6.9}{1.523883927} + 0.686\,\frac{X_2 - 3.5}{1.581138830} - 0.342\,\frac{X_3 - 5.1}{2.806737925}
= 0.642 Z_1 + 0.686 Z_2 - 0.342 Z_3
\]
and the second principal component is
\[
\hat{Y}_2 = 0.385\,\frac{X_1 - 6.9}{1.523883927} + 0.097\,\frac{X_2 - 3.5}{1.581138830} + 0.918\,\frac{X_3 - 5.1}{2.806737925}
= 0.385 Z_1 + 0.097 Z_2 + 0.918 Z_3.
\]
For example, for the first observation
\[
\hat{Y}_1 = 0.642\,\frac{7 - 6.9}{1.523883927} + 0.686\,\frac{4 - 3.5}{1.581138830} - 0.342\,\frac{3 - 5.1}{2.806737925} = 0.51495
\]
and
\[
\hat{Y}_2 = 0.385\,\frac{7 - 6.9}{1.523883927} + 0.097\,\frac{4 - 3.5}{1.581138830} + 0.918\,\frac{3 - 5.1}{2.806737925} = -0.63091.
\]
Small differences between the output and calculated values are due to rounding.

When using the covariance matrix to do the PCA, the SAS program and output are given below.
proc princomp cov out=a;
var x1 x2 x3;
proc print data=a;
run;

The PRINCOMP Procedure


Observations 10
Variables 3

Simple Statistics
x1 x2 x3
Mean 6.900000000 3.500000000 5.100000000
StD 1.523883927 1.581138830 2.806737925

Covariance Matrix
x1 x2 x3
x1 2.322222222 1.611111111 -0.433333333
x2 1.611111111 2.500000000 -1.277777778
x3 -0.433333333 -1.277777778 7.877777778

Total Variance 12.7

Eigenvalues of the Covariance Matrix


Eigenvalue Difference Proportion Cumulative
1 8.27394258 4.59781331 0.6515 0.6515
2 3.67612927 2.92620111 0.2895 0.9410
3 0.74992815 0.0590 1.0000

Eigenvectors
Prin1 Prin2 Prin3
x1 -.137571 0.699037 -.701727
x2 -.250460 0.660889 0.707457
x3 0.958303 0.273080 0.084162

Obs x1 x2 x3 Prin1 Prin2 Prin3


1 7 4 3 -2.15142 -0.17312 0.10682
2 4 1 8 3.80418 -2.88750 0.51044
3 6 3 5 0.15321 -0.98689 0.26941
4 8 6 1 -4.70652 1.30154 0.65168
5 8 5 7 1.29376 2.27913 0.44919
6 7 2 9 4.09931 0.14358 -0.80313
7 5 3 3 -1.62582 -2.23208 0.80281
8 9 5 8 2.11449 3.25124 -0.16837
9 7 4 5 -0.23482 0.37304 0.27514
10 8 2 2 -2.74638 -1.06894 -2.09399

From this we see that the first two principal components will explain
\[
\frac{8.27394258 + 3.67612927}{8.27394258 + 3.67612927 + 0.74992815} \times 100 = 94.095\%
\]
of the total variation in $X$.

The first principal component when using the covariance matrix is
\[
\hat{Y}_1 = -0.138\,(X_1 - 6.9) - 0.250\,(X_2 - 3.5) + 0.958\,(X_3 - 5.1)
\]
and the second principal component is
\[
\hat{Y}_2 = 0.699\,(X_1 - 6.9) + 0.661\,(X_2 - 3.5) + 0.273\,(X_3 - 5.1).
\]
For example, for the first observation
\[
\hat{Y}_1 = -0.138\,(7 - 6.9) - 0.250\,(4 - 3.5) + 0.958\,(3 - 5.1) = -2.15060
\]
and
\[
\hat{Y}_2 = 0.699\,(7 - 6.9) + 0.661\,(4 - 3.5) + 0.273\,(3 - 5.1) = -0.1729.
\]
Small differences between the output and calculated values are due to rounding.
Solution 2: Using PROC IML: The covariance and correlation matrices with corresponding eigenvalues and eigenvectors can also be calculated in SAS IML.
proc iml;
X={7 4 3,
4 1 8,
6 3 5,
8 6 1,
8 5 7,
7 2 9,
5 3 3,
9 5 8,
7 4 5,
8 2 2};
j=j(10,1,1);
S=1/9*X`*(I(10)-1/10*j*j`)*X;
D_half=sqrt(inv(diag(S)));
R=D_half*S*D_half;
call eigen(ls,es,S);
call eigen(lr,er,R);
print ls es;
print lr er;

ls es
8.2739426 -0.137571 0.6990371 -0.701727
3.6761293 -0.25046 0.6608892 0.707457
0.7499282 0.9583028 0.2730799 0.0841616

lr er
1.7687741 0.6420046 0.3846723 -0.663217
0.9270759 0.6863616 0.0971303 0.720745
0.3041499 -0.341669 0.9179286 0.2016662

Example 20 Biopsy Data on Breast Cancer Patients
This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison. The biopsy data set contains data from 683 patients who had a breast biopsy performed. Each tissue sample was scored according to 9 different characteristics, each on a scale from 1 to 10. Also, for each patient the final outcome (benign/malignant) was known. This data frame contains the following columns:
V1 clump thickness V6 bare nuclei
V2 uniformity of cell size V7 bland chromatin
V3 uniformity of cell shape V8 normal nucleoli
V4 marginal adhesion V9 mitoses
V5 single epithelial cell size V10 outcome: "benign" or "malignant"

The following SAS program and output give the summary statistics for V1-V9 by V10 and the results of a principal component analysis on V1-V9 using the correlation matrix.

proc means data=biopsy maxdec=3;


class V10;
var V1--V9;

proc princomp data=biopsy out=a ;


var V1--V9;

proc print;
var v1--v10 prin1 prin2;

proc sgscatter data=a ;


plot prin2*prin1/group=V10 markerattrs=(symbol=circlefilled);
run;

V10 Obs Variable N Mean Std Dev Minimum Maximum


---------------------------------------------------------------------------------------------
benign 444 V1 444 2.964 1.673 1.000 8.000
V2 444 1.306 0.856 1.000 9.000
V3 444 1.414 0.957 1.000 8.000
V4 444 1.347 0.917 1.000 10.000
V5 444 2.108 0.877 1.000 10.000
V6 444 1.347 1.178 1.000 10.000
V7 444 2.083 1.062 1.000 7.000
V8 444 1.261 0.955 1.000 8.000
V9 444 1.065 0.510 1.000 8.000
malignan 239 V1 239 7.188 2.438 1.000 10.000
V2 239 6.577 2.724 1.000 10.000
V3 239 6.561 2.569 1.000 10.000
V4 239 5.586 3.197 1.000 10.000
V5 239 5.326 2.443 1.000 10.000
V6 239 7.628 3.117 1.000 10.000
V7 239 5.975 2.282 1.000 10.000
V8 239 5.858 3.349 1.000 10.000
V9 239 2.603 2.564 1.000 10.000
---------------------------------------------------------------------------------------------

The PRINCOMP Procedure
Observations 683
Variables 9

Simple Statistics
V1 V2 V3 V4 V5
Mean 4.442166911 3.150805271 3.215226940 2.830161054 3.234260615
StD 2.820761319 3.065144856 2.988580818 2.864562190 2.223085456

Simple Statistics
V6 V7 V8 V9
Mean 3.544655930 3.445095168 2.869692533 1.603221083
StD 3.643857160 2.449696573 3.052666407 1.732674146

Correlation Matrix
V1 V2 V3 V4 V5 V6 V7 V8 V9
V1 1.0000 0.6425 0.6535 0.4878 0.5236 0.5931 0.5537 0.5341 0.3510
V2 0.6425 1.0000 0.9072 0.7070 0.7535 0.6917 0.7556 0.7193 0.4608
V3 0.6535 0.9072 1.0000 0.6859 0.7225 0.7139 0.7353 0.7180 0.4413
V4 0.4878 0.7070 0.6859 1.0000 0.5945 0.6706 0.6686 0.6031 0.4189
V5 0.5236 0.7535 0.7225 0.5945 1.0000 0.5857 0.6181 0.6289 0.4806
V6 0.5931 0.6917 0.7139 0.6706 0.5857 1.0000 0.6806 0.5843 0.3392
V7 0.5537 0.7556 0.7353 0.6686 0.6181 0.6806 1.0000 0.6656 0.3460
V8 0.5341 0.7193 0.7180 0.6031 0.6289 0.5843 0.6656 1.0000 0.4338
V9 0.3510 0.4608 0.4413 0.4189 0.4806 0.3392 0.3460 0.4338 1.0000

Eigenvalues of the Correlation Matrix


Eigenvalue Difference Proportion Cumulative
1 5.89949935 5.12355246 0.6555 0.6555
2 0.77594689 0.23669465 0.0862 0.7417
3 0.53925224 0.07962479 0.0599 0.8016
4 0.45962745 0.07935163 0.0511 0.8527
5 0.38027583 0.07839938 0.0423 0.8950
6 0.30187645 0.00747374 0.0335 0.9285
7 0.29440271 0.03366686 0.0327 0.9612
8 0.26073586 0.17235264 0.0290 0.9902
9 0.08838322 0.0098 1.0000

Eigenvectors
Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7 Prin8 Prin9
V1 0.302063 -.140801 0.866372 0.107828 0.080321 -.242518 -.008516 0.247707 -.002747
V2 0.380793 -.046640 -.019938 -.204255 -.145653 -.139032 -.205434 -.436300 -.733211
V3 0.377583 -.082422 0.033511 -.175866 -.108392 -.074527 -.127209 -.582727 0.667481
V4 0.332724 -.052094 -.412647 0.493173 -.019569 -.654629 0.123830 0.163434 0.046019
V5 0.336234 0.164404 -.087743 -.427384 -.636693 0.069309 0.211018 0.458669 0.066891
V6 0.335068 -.261261 0.000691 0.498618 -.124773 0.609221 0.402790 -.126653 -.076510
V7 0.345747 -.228077 -.213072 0.013047 0.227666 0.298897 -.700417 0.383719 0.062241
V8 0.335591 0.033966 -.134248 -.417113 0.690210 0.021518 0.459783 0.074012 -.022079
V9 0.230206 0.905557 0.080492 0.258988 0.105042 0.148345 -.132117 -.053537 0.007496

The next output shows the values of the first and second principal components for the first 5 observations.
Obs V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 Prin1 Prin2

1 5 1 1 1 2 1 3 1 1 benign -1.46909 -0.10420


2 5 4 4 5 7 10 3 2 1 benign 1.44099 -0.56972
3 3 1 1 1 2 2 3 1 1 benign -1.59131 -0.07606
4 6 8 8 1 3 4 3 7 1 benign 1.47873 -0.52806
5 4 1 1 3 2 1 3 1 1 benign -1.34388 -0.09065

This was calculated by using the standardised values and the corresponding eigenvectors in the output, that is
\[
\hat{Y}_1 = \hat{e}_1'Z = -1.46909
\]
where
\[
\hat{e}_1 = \begin{pmatrix}
0.302063 \\ 0.380793 \\ 0.377583 \\ 0.332724 \\ 0.336234 \\ 0.335068 \\ 0.345747 \\ 0.335591 \\ 0.230206
\end{pmatrix}
\qquad\text{and}\qquad
Z = \begin{pmatrix}
(5 - 4.44217)/2.82076 \\
(1 - 3.15081)/3.06514 \\
(1 - 3.21523)/2.98858 \\
(1 - 2.83016)/2.86456 \\
(2 - 3.23426)/2.22309 \\
(1 - 3.54466)/3.64386 \\
(3 - 3.44510)/2.44970 \\
(1 - 2.86969)/3.05267 \\
(1 - 1.60322)/1.73267
\end{pmatrix},
\]
i.e. $Z$ contains the standardised values of the first observation.

Plotting the values of Prin2 against Prin1 by outcome (benign or malignant), as in the PROC SGSCATTER step above, gives a scatter plot of the observations (the graph is not reproduced here). The first two principal components can be used in a logit analysis as covariates to predict the outcome.
