WST 311 - Part 1 2023
I MULTIVARIATE ANALYSIS

1 MATRIX ALGEBRA
1.1 General Definitions
1.1.1 Addition and Multiplication
1.1.2 Determinants
1.1.3 Elementary Operations and Inverses
1.1.4 Vector Space and Rank
1.1.5 Properties of the Rank of a Matrix
1.1.6 Orthogonal Vectors
1.2 Basic Results related to matrices
1.2.1 Properties of Trace and Inverse
1.2.2 Characteristic Roots and Vectors

2 MULTIVARIATE DISTRIBUTIONS
2.1 Moments and Characteristics of a Multivariate Distribution
2.1.1 Expected Values and Covariance Matrices
2.1.2 Sample Mean, Sample Covariance and Sample Correlation Matrices
2.1.3 The Multivariate Change of Variable Technique
2.1.4 Moment Generating Functions
2.2 The Multivariate Normal Distribution
2.2.1 Classical Definition of the Multivariate Normal Distribution
2.2.2 Standard Multivariate Normal Distribution
2.2.3 The Moment Generating Function of the Multivariate Normal Distribution
2.2.4 Conditional Distributions of Normal Random Vectors
2.2.5 Multiple Correlation

3 NORMAL SAMPLES
3.1 Estimation of Parameters
3.1.1 The Mean and Covariance Matrix
3.1.2 Correlation Coefficients
3.2 Sampling Distributions
3.2.1 The Mean and Covariance Matrix
3.2.2 Quadratic Forms of Normally Distributed Variates
3.2.3 The Correlation Coefficient
3.3 Inferences Concerning Multivariate Means
3.3.1 The distribution of T²
3.3.2 The Case of Unknown Σ
3.4 Principal Component Analysis
3.4.1 Basics
3.4.2 Principal Components
3.4.3 Calculation of Principal Components for a Dataset

7 ONE-WAY ANALYSIS-OF-VARIANCE
8 ANALYSIS-OF-COVARIANCE
9 EXPONENTIAL FAMILIES
9.1 Normal distribution
9.2 Poisson distribution
9.3 Bernoulli distribution
9.4 Binomial distribution
9.5 Exponential distribution
9.6 Gamma distribution
Part I
MULTIVARIATE ANALYSIS
1 MATRIX ALGEBRA
In this section a broad overview is given of the tools and results that are essential when working with vectors and matrices.
You need to be able to understand, use, and motivate these results in any of the subsequent sections of work.
and
$$B = B : n \times k = (B_{ij}) = \begin{pmatrix} B_{11} & B_{12} & \cdots & B_{1k} \\ B_{21} & B_{22} & \cdots & B_{2k} \\ \vdots & \vdots & & \vdots \\ B_{h1} & B_{h2} & \cdots & B_{hk} \end{pmatrix},$$
then the product $C = AB$ is given by
$$C = C : m \times k = (C_{ij}) = \begin{pmatrix} C_{11} & C_{12} & \cdots & C_{1k} \\ C_{21} & C_{22} & \cdots & C_{2k} \\ \vdots & \vdots & & \vdots \\ C_{g1} & C_{g2} & \cdots & C_{gk} \end{pmatrix}$$
with
$$C_{ij} = \sum_{\ell=1}^{h} A_{i\ell} B_{\ell j}.$$
Remark 1 What is the difference between the dimensions $n$, $k$, $m$, $g$ and $h$ in the example above?
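A minimal SAS IML sketch (with assumed example matrices, not from the notes) that checks the element formula in the scalar case, i.e. each block taken as $1 \times 1$:

proc iml;
/* Sketch: element (i,j) of C = A*B equals the sum over l of A[i,l]*B[l,j] */
A = {1 2 3, 4 5 6};          /* A : 2 x 3                                  */
B = {1 0, 0 1, 2 2};         /* B : 3 x 2                                  */
C = A*B;                     /* C : 2 x 2                                  */
i = 2; j = 1;
Cij = sum(A[i,] # B[,j]`);   /* elementwise product of row i of A and column j of B, summed */
print C Cij;                 /* Cij agrees with C[i,j]                     */
quit;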
$(A')' = A$
$(A + B)' = A' + B'$
$(AB)' = B'A'$
1. Identity matrix:
$$I = I : n \times n = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$$
2. Diagonal matrix:
$$D = D : n \times n = \operatorname{diag}(d_1, d_2, \ldots, d_n) = \begin{pmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_n \end{pmatrix}$$
3. Column vector:
$$a = a : m \times 1 = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{pmatrix}$$
4. Row vector:
$$b' = b' : 1 \times n = (b_1, b_2, \ldots, b_n)$$
1.1.2 Determinants
1. $|A| = |A'|$
2. $|\operatorname{diag}(d_1, d_2, \ldots, d_n)| = d_1 d_2 \cdots d_n$.
3. $|A| = 0$ if any row (column) contains only zeros or if any row (column) is a multiple of another row (column).
4. If a single row (column) of a square matrix is multiplied by a scalar $r$, then the determinant of the resulting matrix is $r|A|$.
5. If $|A| = 0$, then the matrix is called singular.
6. $|AB| = |BA| = |A||B|$, provided that $A$ and $B$ are both square matrices.
The inverse of the matrix $A$ can also be determined by solving the system of equations
$$Ax = y$$
because
$$x = A^{-1}y.$$
1.1.4 Vector Space and Rank
Suppose that the rows of the matrix $A : m \times n$ are $a_1', a_2', \ldots, a_m'$. Then
$$A = \begin{pmatrix} a_1' \\ a_2' \\ \vdots \\ a_m' \end{pmatrix}.$$
Let $V$ be the set of vectors which are linear combinations of the rows of $A$. Then $V$ is called a vector space and is generated by the rows of $A$. The vector space $V$ is also called the row space of $A$.
Suppose that $a_1, a_2, \ldots, a_n$ are the columns of $A : m \times n$, without confusing the notation previously used to denote the rows of $A$. (Note that $a_j$ now has a different meaning. It indicates columns and is not the transpose of the rows!) Then
$$A = (a_1, a_2, \ldots, a_n).$$
Any linear combination of the columns of $A$ can be written as
$$Ac = \sum_{j=1}^{n} c_j a_j.$$
The columns of $A$ are linearly dependent if a vector $c \neq 0$ exists such that
$$Ac = \sum_{j=1}^{n} c_j a_j = 0.$$
If $b = Ac$ is a linear combination of the columns of $A$, then
$$kb = kAc$$
is also a linear combination of the columns of $A$. It also follows that if $b_1 = Ac_1$ and $b_2 = Ac_2$ are two linear combinations of the columns of $A$, then $b_1 + b_2 = A(c_1 + c_2)$ is also a linear combination of the columns of $A$.
The set of vectors which are linear combinations of the columns of A, is the vector space generated by the columns
of A. This vector space is called the column space of A.
Definition 4 The row (column) rank of a matrix is the maximum number of linearly independent rows (columns) of $A$.
It can be shown that the row rank is equal to the column rank. It follows that, if $A : m \times n$, then
1. $0 \leq \operatorname{rank}(A) \leq \min(m, n)$
2. $\operatorname{rank}(A) = \operatorname{rank}(A')$
3. $\operatorname{rank}(A + B) \leq \operatorname{rank}(A) + \operatorname{rank}(B)$
4. $\operatorname{rank}(AB) \leq \min(\operatorname{rank}(A), \operatorname{rank}(B))$
5. $\operatorname{rank}(AA') = \operatorname{rank}(A'A) = \operatorname{rank}(A) = \operatorname{rank}(A')$
6. If $B : m \times m$, $C : n \times n$, $|B| \neq 0$ and $|C| \neq 0$, then it is true for the matrix $A : m \times n$ that $\operatorname{rank}(BAC) = \operatorname{rank}(A)$
7. If $m = n$, then $\operatorname{rank}(A) = m$ if and only if $A$ is non-singular.
1.1.6 Orthogonal Vectors
$$(I - C)^2 = I - 2C + C^2 = I - C.$$
3. $\operatorname{tr}(I : n \times n) = n$
4. $\operatorname{tr}(A') = \operatorname{tr}(A)$
Orthogonal matrices play a big role in the general handling and analysis of matrices! We need to understand them properly.
$$H'H = I$$
or
$$H' = H^{-1},$$
and hence also
$$HH' = I.$$
$$|\lambda I - A| = 0.$$
The LHS (left-hand side) is a polynomial of degree $n$ in $\lambda$ and therefore has $n$ roots. If $\lambda_i$ is a root, the matrix $\lambda_i I - A$ is singular and the system of equations
$$(\lambda_i I - A)x_i = 0$$
has a non-zero solution $x_i$. Then $\lambda_i$ is the $i$-th characteristic root of $A$ and $x_i$ is the corresponding characteristic vector. Characteristic roots are also called eigenvalues of a matrix, and the characteristic vectors are also called the eigenvectors of a matrix.
$$|\lambda I - A| = |H'H(\lambda I - A)| = |H'(\lambda I - A)H| = |\lambda I - H'AH|$$
$$H'AH = \Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$$
where $\lambda_1, \lambda_2, \ldots, \lambda_n$ are the characteristic roots of $A$ and the columns of $H$ are the corresponding characteristic vectors of $A$ ($H$ being orthogonal).
The characteristic equation of the symmetric matrix $A : n \times n$ is
$$|\lambda I - A| = |\lambda I - H'AH| = \prod_{i=1}^{n}(\lambda - \lambda_i).$$
It also implies the characteristic roots are the solution to the characteristic equation. Let $H = (h_1, h_2, \ldots, h_n)$. Then
$$AH = H\Lambda$$
with $i$-th column
$$Ah_i = \lambda_i h_i.$$
The last equality implies that the columns of H are the corresponding characteristic vectors of the matrix A.
Corollary 1
1. Let
$$\Lambda^{\frac{1}{2}} = \operatorname{diag}(\lambda_1^{\frac{1}{2}}, \lambda_2^{\frac{1}{2}}, \ldots, \lambda_n^{\frac{1}{2}}).$$
It follows that
$$A = H\Lambda H' = \left(H\Lambda^{\frac{1}{2}}H'\right)\left(H\Lambda^{\frac{1}{2}}H'\right) = A^{\frac{1}{2}}A^{\frac{1}{2}},$$
say. The matrix $A^{\frac{1}{2}} = H\Lambda^{\frac{1}{2}}H'$ (which may have complex elements) is the symmetric square root of $A$.
$$y'Ay > 0 \quad \forall \; y \neq 0$$
$$A^2 = H\Lambda H'\,H\Lambda H' = H\Lambda^2 H' = H\Lambda H' = A,$$
which implies that
$$\Lambda^2 = \Lambda.$$
It therefore follows that the characteristic roots of the idempotent matrix $A$ are all either 1 or 0. Furthermore
$$\operatorname{tr}(A) = \operatorname{tr}(H\Lambda H') = \operatorname{tr}(\Lambda H'H) = \operatorname{tr}(\Lambda) = \operatorname{rank}(A).$$
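As an illustration, a minimal SAS IML sketch (with an assumed example, namely the centering matrix $I_n - \frac{1}{n}1_n1_n'$, which is idempotent) showing that the eigenvalues are all 0 or 1 and that the trace equals the rank:

proc iml;
/* Sketch: the centering matrix C = I - (1/n)*J is idempotent (assumed example, n = 4) */
n = 4;
C = I(n) - (1/n)*j(n, n, 1);
call eigen(lambda, V, C);      /* eigenvalues are 1 (three times) and 0                */
trC = trace(C);                /* trace = 3 = rank(C)                                  */
diff = max(abs(C*C - C));      /* C*C - C is (numerically) the zero matrix             */
print lambda trC diff;
quit;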
Example 1 Let
$$A = \begin{pmatrix} 5 & 1 & 1 & 1 \\ 1 & 4 & 2 & 1 \\ 1 & 2 & 3 & 2 \\ 1 & 1 & 2 & 2 \end{pmatrix}$$
be a symmetric matrix with eigenvalue, eigenvector pairs given by $(\lambda_i, h_i)$, $i = 1, 2, 3, 4$. Let
$$\Lambda = \begin{pmatrix} \lambda_1 & 0 & 0 & 0 \\ 0 & \lambda_2 & 0 & 0 \\ 0 & 0 & \lambda_3 & 0 \\ 0 & 0 & 0 & \lambda_4 \end{pmatrix} \quad \text{and} \quad H = \begin{pmatrix} h_1 & h_2 & h_3 & h_4 \end{pmatrix}.$$
SAS PROC IML program and output:
proc iml;
reset nolog;
A={5 1 1 1, 1 4 2 1, 1 2 3 2, 1 1 2 2};
call eigen(LA,HA,A);            /* LA = eigenvalues, HA = eigenvectors of A  */
detA=det(A);
trA=trace(A);
detL=det(diag(LA));             /* product of the eigenvalues                */
trL=trace(diag(LA));            /* sum of the eigenvalues                    */
print LA HA detA trA detL trL;
A2=A*A;
print A2;
A0=HA*diag(LA)*HA`;             /* spectral decomposition H*Lambda*H' = A    */
A02=HA*DIAG(LA#LA)*HA`;         /* H*Lambda^2*H' = A*A                       */
print A0 A02;
A012=HA*sqrt(diag(LA))*HA`;     /* symmetric square root H*Lambda^(1/2)*H'   */
print A012;
AA=A012*A012;                   /* should recover A                          */
print AA;
quit;
A2
28 12 12 10
12 22 17 11
12 17 18 13
10 11 13 10
A0 A02
5 1 1 1 28 12 12 10
1 4 2 1 12 22 17 11
1 2 3 2 12 17 18 13
1 1 2 2 10 11 13 10
A012
2.2045471 0.206625 0.1953853 0.2431111
0.206625 1.9053232 0.5409514 0.1855291
0.1953853 0.5409514 1.4796956 0.6926017
0.2431111 0.1855291 0.6926017 1.1944785
AA
5 1 1 1
1 4 2 1
1 2 3 2
1 1 2 2
2 MULTIVARIATE DISTRIBUTIONS
2.1 Moments and Characteristics of a Multivariate Distribution
2.1.1 Expected Values and Covariance Matrices
Expected values and (co)variances are essential measures that give us a first indication of the behaviour of random variables. If we are working with a vector/matrix of variables (thus, multivariate), then our expected value must also be a vector/matrix, and the covariance structure must be a matrix. This is because if we have $k$ random variables, there will be $k$ variances (one for each variable), and then $(k^2 - k)/2$ interaction variances, which we call covariances. Suppose that the elements of $X : p \times m$ are random variables. The expected value of $X$ is defined as
$$E(X) = (E(X_{ij})).$$
Suppose that $A : r \times p$ and $B : m \times s$ are constant matrices. The $i,j$-th element of $AXB$ is $\sum_{\alpha=1}^{p}\sum_{\beta=1}^{m} a_{i\alpha}X_{\alpha\beta}b_{\beta j}$ and the expected value of this element is
$$E\left(\sum_{\alpha=1}^{p}\sum_{\beta=1}^{m} a_{i\alpha}X_{\alpha\beta}b_{\beta j}\right) = \sum_{\alpha=1}^{p}\sum_{\beta=1}^{m} a_{i\alpha}E(X_{\alpha\beta})b_{\beta j},$$
so that
$$E(AXB) = A\,E(X)\,B.$$
Note that
$$AXB = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & & \vdots \\ a_{r1} & a_{r2} & \cdots & a_{rp} \end{pmatrix}\begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1m} \\ X_{21} & X_{22} & \cdots & X_{2m} \\ \vdots & \vdots & & \vdots \\ X_{p1} & X_{p2} & \cdots & X_{pm} \end{pmatrix}\begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1s} \\ b_{21} & b_{22} & \cdots & b_{2s} \\ \vdots & \vdots & & \vdots \\ b_{m1} & b_{m2} & \cdots & b_{ms} \end{pmatrix}$$
$$= \begin{pmatrix} \sum_{\alpha=1}^{p} a_{1\alpha}X_{\alpha 1} & \sum_{\alpha=1}^{p} a_{1\alpha}X_{\alpha 2} & \cdots & \sum_{\alpha=1}^{p} a_{1\alpha}X_{\alpha m} \\ \vdots & \vdots & & \vdots \\ \sum_{\alpha=1}^{p} a_{r\alpha}X_{\alpha 1} & \sum_{\alpha=1}^{p} a_{r\alpha}X_{\alpha 2} & \cdots & \sum_{\alpha=1}^{p} a_{r\alpha}X_{\alpha m} \end{pmatrix}\begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1s} \\ b_{21} & b_{22} & \cdots & b_{2s} \\ \vdots & \vdots & & \vdots \\ b_{m1} & b_{m2} & \cdots & b_{ms} \end{pmatrix}$$
$$= \begin{pmatrix} \sum_{\alpha=1}^{p}\sum_{\beta=1}^{m} a_{1\alpha}X_{\alpha\beta}b_{\beta 1} & \sum_{\alpha=1}^{p}\sum_{\beta=1}^{m} a_{1\alpha}X_{\alpha\beta}b_{\beta 2} & \cdots & \sum_{\alpha=1}^{p}\sum_{\beta=1}^{m} a_{1\alpha}X_{\alpha\beta}b_{\beta s} \\ \vdots & \vdots & & \vdots \\ \sum_{\alpha=1}^{p}\sum_{\beta=1}^{m} a_{r\alpha}X_{\alpha\beta}b_{\beta 1} & \sum_{\alpha=1}^{p}\sum_{\beta=1}^{m} a_{r\alpha}X_{\alpha\beta}b_{\beta 2} & \cdots & \sum_{\alpha=1}^{p}\sum_{\beta=1}^{m} a_{r\alpha}X_{\alpha\beta}b_{\beta s} \end{pmatrix}.$$
Consider the random vector $X : p \times 1$ with $i$-th element $X_i$. Suppose that $\operatorname{var}(X_i) = \sigma_i^2 = \sigma_{ii}$ and $\operatorname{cov}(X_i, X_j) = \sigma_{ij}$. The covariance matrix of $X$ is defined as
$$\operatorname{cov}(X, X') = E\left\{[X - E(X)][X - E(X)]'\right\} = E(XX') - E(X)E(X') = \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}.$$
Here we already see that for $p$ variables, there are $p$ variances (on the diagonal) and $p^2 - p$ covariances (on the off-diagonals). The matrix must be symmetric because, for example, $\sigma_{12} = \sigma_{21}$, since the interaction variance (covariance) between variables 1 and 2 must be the same as the covariance between variables 2 and 1. In effect, the matrix has $p^2$ entries, $p$ of which are variances (on the diagonal), leaving $p^2 - p$ covariances. Because it is symmetric, the covariance matrix thus has $(p^2 - p)/2$ "unique" covariances. Note that a covariance may be 0 - this will be the case when the two corresponding variables are independent of each other.
Similarly, if $Y : q \times 1$ is a random vector with $j$-th element $Y_j$ and $\operatorname{cov}(X_i, Y_j) = \sigma_{ij}$, then
$$\operatorname{cov}(X, Y') = E\left\{[X - E(X)][Y - E(Y)]'\right\} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1q} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2q} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pq} \end{pmatrix}.$$
If $X$ and $Y$ are independent random vectors, then $\operatorname{cov}(X, Y') = 0$.
The correlation matrix of $X$ is given by
$$\operatorname{cor}(X, X') = \begin{pmatrix} \rho_{11} & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & \rho_{22} & \cdots & \rho_{2p} \\ \vdots & \vdots & & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & \rho_{pp} \end{pmatrix}$$
with
$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}.$$
It is worthwhile to recall that the correlation matrix acts as a "standardised" covariance matrix, as the entries of the correlation matrix are bounded between $-1$ and $1$.
Suppose that $X : p \times 1$ is a random vector with covariance matrix $\Sigma$. The following relation holds for any $a : p \times 1$:
$$\operatorname{var}(a'X) = a'\Sigma a \geq 0.$$
Therefore it follows that $\Sigma$ is always semi-positive definite.
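A minimal SAS IML sketch (with an assumed example covariance matrix, not from the notes) checking this numerically: the eigenvalues of a covariance matrix are non-negative, and $a'\Sigma a \geq 0$ for any $a$:

proc iml;
/* Sketch: eigenvalues of an assumed covariance matrix are non-negative */
sigma = {4 2 1, 2 3 1, 1 1 2};
call eigen(lambda, H, sigma);
a = {1, -2, 3};
aSa = a`*sigma*a;              /* var(a`X) = a`*Sigma*a, also non-negative */
print lambda aSa;
quit;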
Corollary 2 For the covariance matrix $\Sigma : p \times p$ there exists an orthogonal matrix $H$ such that
$$H'\Sigma H = \Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p),$$
where $\lambda_1, \lambda_2, \ldots, \lambda_p$ are the characteristic roots (eigenvalues) of $\Sigma$ and the columns of $H$ are the characteristic vectors of $\Sigma$, $H$ orthogonal. If $H = (h_1, h_2, \ldots, h_p)$, then $\lambda_i = h_i'\Sigma h_i$. For $a : p \times 1$,
$$a'\Sigma a = a'H\Lambda H'a = b'\Lambda b = \sum_{i=1}^{p} b_i^2\lambda_i$$
with $b = H'a$. This means that $\Sigma$ is positive definite (or semi-positive definite) if and only if the characteristic roots of $\Sigma$ are positive (or non-negative). This implies that a covariance matrix is always positive definite or semi-positive definite. If $\Sigma$ is non-singular, it is positive definite.
Corollary 3 Let
$$\Lambda^{\frac{1}{2}} = \operatorname{diag}(\lambda_1^{\frac{1}{2}}, \lambda_2^{\frac{1}{2}}, \ldots, \lambda_p^{\frac{1}{2}}).$$
It follows that
$$A = H\Lambda H' = \left(H\Lambda^{\frac{1}{2}}H'\right)\left(H\Lambda^{\frac{1}{2}}H'\right) = A^{\frac{1}{2}}A^{\frac{1}{2}}.$$
The matrix $A^{\frac{1}{2}} = H\Lambda^{\frac{1}{2}}H'$ is the symmetrical square root of $A$.
Theorem 7 If $A$ is any $p \times p$ matrix with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$, then
(i) $|A| = \prod_{i=1}^{p}\lambda_i$
(ii) $\operatorname{tr}(A) = \sum_{i=1}^{p}\lambda_i$.
Example 2 The weight (in mg), $X : 4 \times 1$, of four different species of bumblebees, I, II, III and IV, has expected value and covariance matrix
$$E(X) = \begin{pmatrix} 5 \\ 6 \\ 5 \\ 7 \end{pmatrix} \qquad \operatorname{cov}(X, X') = \Sigma = \begin{pmatrix} 3 & 2 & 2 & 1 \\ 2 & 4 & 2 & 1 \\ 2 & 2 & 4 & 1 \\ 1 & 1 & 1 & 3 \end{pmatrix}.$$
Let
$$Y = \begin{pmatrix} X_1 + X_2 \\ X_1 - X_2 \\ X_1 + X_2 + X_3 + X_4 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 1 & 1 & 1 & 1 \end{pmatrix}X.$$
Note that the vector $Y$ represents (in this case) certain combinations of the four measurements under consideration. For example, the third combination represents the total weight of the four species, which the scientist may be interested in. These combinations are informed by whatever the scientist or researcher is interested in. The expected value and covariance matrix of $Y$ are
$$E(Y) = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 1 & 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 5 \\ 6 \\ 5 \\ 7 \end{pmatrix} = \begin{pmatrix} 11 \\ -1 \\ 23 \end{pmatrix}$$
and
$$\operatorname{cov}(Y, Y') = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 1 & 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 3 & 2 & 2 & 1 \\ 2 & 4 & 2 & 1 \\ 2 & 2 & 4 & 1 \\ 1 & 1 & 1 & 3 \end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}$$
$$= \begin{pmatrix} 5 & 6 & 4 & 2 \\ 1 & -2 & 0 & 0 \\ 8 & 9 & 9 & 6 \end{pmatrix}\begin{pmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}$$
$$= \begin{pmatrix} 11 & -1 & 17 \\ -1 & 3 & -1 \\ 17 & -1 & 32 \end{pmatrix}.$$
Write down next to the steps above which Theorems/Results from earlier in the notes are used to obtain the results.
Furthermore
$$\operatorname{cor}(Y, Y') = \begin{pmatrix} \frac{1}{\sqrt{11}} & 0 & 0 \\ 0 & \frac{1}{\sqrt{3}} & 0 \\ 0 & 0 & \frac{1}{\sqrt{32}} \end{pmatrix}\begin{pmatrix} 11 & -1 & 17 \\ -1 & 3 & -1 \\ 17 & -1 & 32 \end{pmatrix}\begin{pmatrix} \frac{1}{\sqrt{11}} & 0 & 0 \\ 0 & \frac{1}{\sqrt{3}} & 0 \\ 0 & 0 & \frac{1}{\sqrt{32}} \end{pmatrix}$$
$$= \begin{pmatrix} 1 & \frac{-1}{\sqrt{(11)(3)}} & \frac{17}{\sqrt{(11)(32)}} \\ \frac{-1}{\sqrt{(3)(11)}} & 1 & \frac{-1}{\sqrt{(3)(32)}} \\ \frac{17}{\sqrt{(32)(11)}} & \frac{-1}{\sqrt{(32)(3)}} & 1 \end{pmatrix}.$$
The eigenvalues and eigenvectors of $\operatorname{cov}(X, X') = \Sigma$, as well as its symmetric square root, are calculated in a SAS IML program along the lines sketched below. The calculation of eigenvalues and eigenvectors is also needed in principal component analysis, which we cover later in the course. The covariance matrix entered is
$$\Sigma = \begin{pmatrix} 3 & 2 & 2 & 1 \\ 2 & 4 & 2 & 1 \\ 2 & 2 & 4 & 1 \\ 1 & 1 & 1 & 3 \end{pmatrix}.$$
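A minimal sketch of such a program, following the pattern of the earlier PROC IML example (variable names here are illustrative):

proc iml;
/* Sketch: eigenvalues, eigenvectors and symmetric square root of Sigma        */
sigma = {3 2 2 1, 2 4 2 1, 2 2 4 1, 1 1 1 3};
call eigen(L, H, sigma);            /* L = eigenvalues, H = orthogonal eigenvectors */
sigmaHalf = H*sqrt(diag(L))*H`;     /* symmetric square root H*Lambda^(1/2)*H'      */
check = sigmaHalf*sigmaHalf;        /* should reproduce sigma                       */
print L H sigmaHalf check;
quit;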
Also,
$$\operatorname{cov}(a'X, b'X) = \sum_{i=1}^{p}\sum_{j=1}^{p} a_i b_j\sigma_{ij}.$$
Remark 2 You need to be able to use these results to calculate variances and covariances when confronted with linear combinations with coefficient vectors $a$ and $b$.
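A minimal SAS IML sketch (using the bumblebee covariance matrix above and assumed coefficient vectors) of how such a covariance is computed as $a'\Sigma b$:

proc iml;
/* Sketch: cov(a`X, b`X) = a`*Sigma*b for assumed coefficient vectors a and b */
sigma = {3 2 2 1, 2 4 2 1, 2 2 4 1, 1 1 1 3};
a = {1, 1, 0, 0};                  /* X1 + X2                                 */
b = {1, 1, 1, 1};                  /* X1 + X2 + X3 + X4                       */
covab = a`*sigma*b;                /* equals 17, the (1,3) element of cov(Y,Y') above */
print covab;
quit;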
2.1.2 Sample Mean, Sample Covariance and Sample Correlation Matrices
$$\sum_{j=1}^{N}(x_j - \bar{x})(x_j - \bar{x})' = (X - 1_N\bar{x}')'(X - 1_N\bar{x}') = X'\left(I_N - \tfrac{1}{N}1_N1_N'\right)X.$$
Remark 3 Take some time to write out the last two lines in the derivation above for a special case (for example
N = 3; p = 2) in order to observe the mathematical construction and correctness of the equality.
The sample correlation matrix is
$$R = D^{-\frac{1}{2}}SD^{-\frac{1}{2}}$$
where
$$D^{-\frac{1}{2}} = \begin{pmatrix} \frac{1}{\sqrt{s_{11}}} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{s_{22}}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sqrt{s_{pp}}} \end{pmatrix}.$$
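A minimal SAS IML sketch (with an assumed small data matrix) of how $S$ and $R = D^{-1/2}SD^{-1/2}$ are computed from a data matrix:

proc iml;
/* Sketch with an assumed data matrix X (N = 4 observations on p = 2 variables) */
X = {1 2, 3 5, 4 4, 6 9};
N = nrow(X);
S = X`*(I(N) - (1/N)*j(N, N, 1))*X / (N - 1);   /* sample covariance matrix      */
Dhalf = sqrt(inv(diag(S)));                     /* D^(-1/2): 1/sqrt(s_ii) on the diagonal */
R = Dhalf*S*Dhalf;                              /* sample correlation matrix     */
print S R;
quit;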
2.1.3 The Multivariate Change of Variable Technique
Let $x_j = g_j(y_1, \ldots, y_p)$, $j = 1, \ldots, p$ define a set of one-to-one transformations from a vector of random variables $x = (x_1, \ldots, x_p)'$ to a vector of random variables $y = (y_1, \ldots, y_p)'$. If $f(x)$ denotes the joint probability density function of the random vector $x$, then the joint probability density function of the random vector $y$ is given by
$$f_Y(y) = f\big(x(y)\big)\,|J(x \to y)|$$
where $J(x \to y) = \left(\frac{\partial x_i}{\partial y_j}\right)$, i.e.
$$J(x \to y) = \frac{\partial x}{\partial y} = \begin{pmatrix} \frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_p} \\ \frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \cdots & \frac{\partial x_2}{\partial y_p} \\ \vdots & \vdots & & \vdots \\ \frac{\partial x_p}{\partial y_1} & \frac{\partial x_p}{\partial y_2} & \cdots & \frac{\partial x_p}{\partial y_p} \end{pmatrix}$$
denotes the Jacobian matrix of the transformation from $x$ to $y$ and $|\cdot|$ is the absolute value of the determinant of the matrix. Furthermore, $|J(x \to y)|$ is known as the Jacobian of the transformation.
Remark 4 This is similar to work completed in previous years of your studies, see in particular Chapter 4 of Bain
and Engelhardt (WST 211).
2.1.4 Moment Generating Functions
Suppose that $X : p \times 1$ is a random vector and that the joint density function of the elements of $X$ is $f_X(x) = f_{X_1, X_2, \ldots, X_p}(x_1, x_2, \ldots, x_p)$. The moment generating function of $X$ is
$$M_X(t) = E\left(e^{t'X}\right) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{t'x}f_X(x)\,dx = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{t_1x_1 + t_2x_2 + \cdots + t_px_p}f_X(x)\,dx.$$
The moment generating function of $X_i$ is given by $M_X(0, 0, \ldots, t_i, \ldots, 0)$. This follows since
$$M_X(0, 0, \ldots, t_i, \ldots, 0) = E\left(e^{0X_1 + 0X_2 + \cdots + t_iX_i + \cdots + 0X_p}\right) = E\left(e^{t_iX_i}\right) = M_{X_i}(t_i).$$
Theorem 8 Suppose that $M_X(t)$ is the moment generating function of $X$. Then $X_1, X_2, \ldots, X_p$ are mutually independent if and only if
$$M_X(t) = M_X(t_1, 0, \ldots, 0)M_X(0, t_2, \ldots, 0)\cdots M_X(0, 0, \ldots, t_p) = \prod_{i=1}^{p}M_{X_i}(t_i).$$
If $X_1, X_2, \ldots, X_p$ are mutually independent, then
$$M_X(t) = \prod_{i=1}^{p}M_{X_i}(t_i) = M_X(t_1, 0, \ldots, 0)M_X(0, t_2, \ldots, 0)\cdots M_X(0, 0, \ldots, t_p).$$
Conversely, suppose that
$$M_X(t) = M_X(t_1, 0, \ldots, 0)M_X(0, t_2, \ldots, 0)\cdots M_X(0, 0, \ldots, t_p).$$
Then
$$M_X(t) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{t_1x_1 + t_2x_2 + \cdots + t_px_p}f_X(x)\,dx.$$
Since a moment generating function, if it exists, uniquely determines a distribution, it follows that:

Theorem 9 Let $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$. Then $X_1$ and $X_2$ are independent if and only if $M_X(t) = M_{X_1}(t_1)M_{X_2}(t_2)$.
Exercise 1 In this example, we want to determine the moment generating function of a new (vector) variable $Y$, which is made up of some combinations of the variable $X$. Let $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ where $X_1$ and $X_2$ are independent and exponential, $X_i \sim EXP(1)$. The pdf of $X$ is
$$f_X(x) = \begin{cases} e^{-x_1 - x_2} & x_1 > 0,\; x_2 > 0 \\ 0 & \text{elsewhere.} \end{cases}$$
Let $Y_1 = X_1$ and $Y_2 = X_1 + X_2$, so that $x_1 = y_1$ and $x_2 = y_2 - y_1$. Then
$$|J| = \left|\frac{\partial x}{\partial y}\right| = \left|\begin{matrix} 1 & 0 \\ -1 & 1 \end{matrix}\right| = 1.$$
The pdf of $Y$ is
$$f_Y(y) = \begin{cases} e^{-y_2} & 0 < y_1 < y_2 < \infty \\ 0 & \text{elsewhere.} \end{cases}$$
The moment generating function of $Y$ is
$$\begin{aligned} M_Y(t) &= E\left(e^{t'Y}\right) \\ &= \int_0^{\infty}\int_0^{y_2} e^{t_1y_1 + t_2y_2}e^{-y_2}\,dy_1\,dy_2 \\ &= \int_0^{\infty} e^{-y_2(1-t_2)}\left[\tfrac{1}{t_1}e^{y_1t_1}\right]_0^{y_2}dy_2 \\ &= \tfrac{1}{t_1}\int_0^{\infty} e^{-y_2(1-t_2)}\left[e^{y_2t_1} - 1\right]dy_2 \\ &= \tfrac{1}{t_1}\left[-\tfrac{1}{1-t_1-t_2}e^{-y_2(1-t_1-t_2)} + \tfrac{1}{1-t_2}e^{-y_2(1-t_2)}\right]_0^{\infty} \\ &= \tfrac{1}{t_1}\left[\tfrac{1}{1-t_1-t_2} - \tfrac{1}{1-t_2}\right] \\ &= \tfrac{1}{(1-t_2)(1-t_1-t_2)}. \end{aligned}$$
An alternative approach: the moment generating function of $Y$ may also be determined directly:
$$\begin{aligned} M_Y(t) &= E\left(e^{t'Y}\right) \\ &= E\left(e^{t_1Y_1 + t_2Y_2}\right) \\ &= E\left(e^{t_1X_1 + t_2(X_1 + X_2)}\right) \\ &= E\left(e^{(t_1+t_2)X_1 + t_2X_2}\right) \\ &= M_X(t_1 + t_2, t_2) \\ &= \tfrac{1}{(1-t_1-t_2)(1-t_2)}. \end{aligned}$$
This shows that $M_Y(t) \neq M_{Y_1}(t_1)M_{Y_2}(t_2)$, demonstrating the dependence between $Y_1$ and $Y_2$.
Example 4 In the univariate case the moment generating function of the random variable $Z$ can be written as
$$M_Z(t) = 1 + E(Z)t + E(Z^2)\frac{t^2}{2!} + E(Z^3)\frac{t^3}{3!} + \ldots = \sum_{r=0}^{\infty}E(Z^r)\frac{t^r}{r!}.$$
If $M_Z(t)$ is expanded in a Taylor series around zero, then the coefficient of $\frac{t^r}{r!}$ is equal to $E(Z^r)$. Also,
$$\left.\frac{\partial^r}{\partial t^r}M_Z(t)\right|_{t=0} = E(Z^r).$$
In the bivariate case, suppose $Z_1$ and $Z_2$ are jointly distributed random variables with joint pdf $f_Z(z)$ and moment generating function $M_Z(t)$. Then
$$M_Z(t) = \sum_{r=0}^{\infty}\sum_{s=0}^{\infty}E(Z_1^rZ_2^s)\frac{t_1^rt_2^s}{r!s!}$$
and
$$\left.\frac{\partial^{r+s}}{\partial t_1^r\partial t_2^s}M_{Z_1,Z_2}(t_1, t_2)\right|_{t_1=t_2=0} = E(Z_1^rZ_2^s),$$
i.e. the coefficient of $\frac{t_1^rt_2^s}{r!s!}$ is equal to $E(Z_1^rZ_2^s)$.
The moments of $Y$ around the origin and around the mean can be determined from $M_Y(t)$. The derivatives of $M_Y(t)$ with respect to $t$ may be obtained by expanding $M_Y(t)$ in a power series in $t_1$ and $t_2$, where
$$M_Y(t) = \frac{1}{(1-t_1-t_2)(1-t_2)}.$$
From this it follows that
$$E(Y_1) = 1 \qquad E(Y_2) = 2 \qquad E(Y_1Y_2) = 3$$
$$E(Y_1^2) = 2 \qquad E(Y_2^2) = 6 \qquad \operatorname{cov}(Y_1, Y_2) = 1$$
$$\operatorname{var}(Y_1) = 1 \qquad \operatorname{var}(Y_2) = 2$$
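As a check (a short derivation added here for clarity, using the partial derivatives of $M_Y(t)$ evaluated at $t = 0$):
$$\frac{\partial M_Y}{\partial t_1} = \frac{1}{(1-t_1-t_2)^2(1-t_2)} \;\Rightarrow\; E(Y_1) = 1,$$
$$\frac{\partial M_Y}{\partial t_2} = \frac{1}{(1-t_1-t_2)^2(1-t_2)} + \frac{1}{(1-t_1-t_2)(1-t_2)^2} \;\Rightarrow\; E(Y_2) = 1 + 1 = 2,$$
$$\frac{\partial^2 M_Y}{\partial t_1\partial t_2} = \frac{2}{(1-t_1-t_2)^3(1-t_2)} + \frac{1}{(1-t_1-t_2)^2(1-t_2)^2} \;\Rightarrow\; E(Y_1Y_2) = 2 + 1 = 3,$$
so that $\operatorname{cov}(Y_1, Y_2) = E(Y_1Y_2) - E(Y_1)E(Y_2) = 3 - 2 = 1$.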
The moments of $Y$ can also be determined directly by using the moments of $X$ and the results in the previous section. Since
$$E(X) = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad \operatorname{cov}(X, X') = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I_2$$
and
$$Y = \begin{pmatrix} X_1 \\ X_1 + X_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}X,$$
it follows that
$$E(Y) = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}E(X) = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$$
and
$$\operatorname{cov}(Y, Y') = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}.$$
2.2 The Multivariate Normal Distribution
The multivariate normal distribution forms the basis of much of multivariate analysis due to its computationally attractive features (it is relatively easy to implement in code) and its intuitive analogy with the univariate normal setting. We study the multivariate normal distribution and its properties to gain insight into the "common" behaviour of data in a multivariate setting, but note that there are many other multivariate distributions which are also often considered in practice.
The random vector $X : p \times 1$ has the multivariate normal distribution if the density function of $X$ is given by
$$f_X(x) = ke^{-\frac{1}{2}(x-b)'A(x-b)}, \qquad -\infty < x_i < \infty,$$
where $b$ is a constant vector and $A > 0$.
Theorem 10 If the density of $X$ is given by the definition above, then $k = |A|^{\frac{1}{2}}/(2\pi)^{\frac{p}{2}}$, the expected value of $X$ is $b$ and the covariance matrix of $X$ is $A^{-1}$. Conversely, given a vector $\mu$ and a positive definite matrix $\Sigma$, there exists a multivariate normal density
$$\frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu)'\Sigma^{-\frac{1}{2}}\Sigma^{-\frac{1}{2}}(x-\mu)}$$
such that the expected value of the vector with this density is $\mu$ and the covariance matrix is $\Sigma$.

Proof. Let $Y = A^{\frac{1}{2}}(X - b)$ or $X = A^{-\frac{1}{2}}Y + b$. Then
$$|J(x \to y)| = |A^{-\frac{1}{2}}| = |A|^{-\frac{1}{2}},$$
$$|A|^{\frac{1}{2}}/k = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{-\frac{1}{2}y'y}\,dy = \prod_{i=1}^{p}\int_{-\infty}^{\infty}e^{-\frac{1}{2}y_i^2}\,dy_i = (2\pi)^{\frac{p}{2}}$$
and
$$k = |A|^{\frac{1}{2}}/(2\pi)^{\frac{p}{2}}.$$
Also,
$$\begin{aligned} I_p &= \operatorname{cov}(Y, Y') \\ &= \operatorname{cov}\left(A^{\frac{1}{2}}(X - b), \{A^{\frac{1}{2}}(X - b)\}'\right) \\ &= \operatorname{cov}\left(A^{\frac{1}{2}}X, \{A^{\frac{1}{2}}X\}'\right) \\ &= A^{\frac{1}{2}}\operatorname{cov}(X, X')A^{\frac{1}{2}}, \end{aligned}$$
so that
$$\operatorname{cov}(X, X') = A^{-\frac{1}{2}}A^{-\frac{1}{2}} = A^{-1}.$$
Furthermore, $E(Y) = 0 = A^{\frac{1}{2}}E(X) - A^{\frac{1}{2}}b$, and therefore
$$\mu = E(X) = b.$$
The density function of $X$ is
$$f_X(x) = \frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)}$$
and it is denoted as $X : p \times 1 \sim N_p(\mu, \Sigma)$.
Exercise 2 Let p = 1. Write down the corresponding density from the multivariate normal density function. What
does p = 1 mean in this case?
and
$$|\Sigma| = \sigma_1^2\sigma_2^2 - \sigma_{12}^2 = \sigma_1^2\sigma_2^2(1 - \rho^2).$$
Since the inverse of a non-singular $2 \times 2$ matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ is given by
$$A^{-1} = \frac{1}{|A|}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix},$$
it follows that
$$\Sigma^{-1} = \frac{1}{\sigma_1^2\sigma_2^2(1-\rho^2)}\begin{pmatrix} \sigma_2^2 & -\sigma_{12} \\ -\sigma_{12} & \sigma_1^2 \end{pmatrix} = \frac{1}{1-\rho^2}\begin{pmatrix} \frac{1}{\sigma_1^2} & \frac{-\rho}{\sigma_1\sigma_2} \\ \frac{-\rho}{\sigma_1\sigma_2} & \frac{1}{\sigma_2^2} \end{pmatrix}.$$
This implies that the elements of $Y$ are independently $N(\mu_i, \sigma_i^2)$ distributed.
Example 8 If $X : p \times 1$ has a $N_p(\mu, \Sigma)$ distribution with $X = \begin{pmatrix} X_1 : p_1 \times 1 \\ X_2 : p_2 \times 1 \end{pmatrix}$ and
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix},$$
then
$$|\Sigma| = \left|\begin{matrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{matrix}\right| = |\Sigma_{11}|\,|\Sigma_{22}|$$
and
$$\begin{aligned} f_X(x) &= \frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)} \\ &= \frac{1}{(2\pi)^{\frac{p_1}{2}}|\Sigma_{11}|^{\frac{1}{2}}}e^{-\frac{1}{2}(x_1-\mu_1)'\Sigma_{11}^{-1}(x_1-\mu_1)}\cdot\frac{1}{(2\pi)^{\frac{p_2}{2}}|\Sigma_{22}|^{\frac{1}{2}}}e^{-\frac{1}{2}(x_2-\mu_2)'\Sigma_{22}^{-1}(x_2-\mu_2)}. \end{aligned}$$
Consider the random vector $Z : p \times 1$ where the elements of $Z$, the $Z_i$'s, $i = 1, 2, \ldots, p$, are independent and $Z_i \sim N(0, 1)$, i.e. each has a standard normal distribution with expected value 0 and variance 1. The density function of $Z$ is
$$f_Z(z) = \prod_{i=1}^{p}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}z_i^2} = \frac{1}{(2\pi)^{\frac{p}{2}}}\exp\left(-\tfrac{1}{2}\sum_{i=1}^{p}z_i^2\right) = \frac{1}{(2\pi)^{\frac{p}{2}}}e^{-\frac{1}{2}z'z}, \qquad -\infty < z_i < \infty \;\;\forall\, i.$$
Let $\Sigma : p \times p$ be any non-singular covariance matrix and $\mu : p \times 1$ be any constant vector, and consider the transformation
$$X = \Sigma^{\frac{1}{2}}Z + \mu \qquad \Leftrightarrow \qquad Z = \Sigma^{-\frac{1}{2}}(X - \mu).$$
The Jacobian of the transformation is $|J(Z \to X)| = |\Sigma|^{-\frac{1}{2}}$ and the density function of $X = \Sigma^{\frac{1}{2}}Z + \mu$ is
$$\begin{aligned} f_X(x) &= \frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left\{-\tfrac{1}{2}(x-\mu)'\Sigma^{-\frac{1}{2}}\Sigma^{-\frac{1}{2}}(x-\mu)\right\} \\ &= |2\pi\Sigma|^{-\frac{1}{2}}\exp\left\{-\tfrac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)\right\}. \end{aligned}$$
The last step follows from the fact that $|2\pi I_p| = (2\pi)^p$ and $\Sigma^{-\frac{1}{2}}\Sigma^{-\frac{1}{2}} = \Sigma^{-1}$. We say that $X : p \times 1 \sim N_p(\mu, \Sigma)$ where
$$E(X) = \mu \quad \text{and} \quad \operatorname{cov}(X, X') = \Sigma.$$
Remark 5 Consider $X = \Sigma^{\frac{1}{2}}Z + \mu$ in particular. What is its equivalence when $p = 1$?

Remark 6 $2^2\left|\begin{matrix} a & c \\ c & b \end{matrix}\right| = 4ab - 4c^2$ and $\left|2\begin{pmatrix} a & c \\ c & b \end{pmatrix}\right| = \left|\begin{matrix} 2a & 2c \\ 2c & 2b \end{matrix}\right| = 4ab - 4c^2$.
Theorem 11 Suppose that $X : p \times 1$ has the $N_p(\mu, \Sigma)$ distribution. The moment generating function of $X$ is given by
$$M_X(t) = e^{\mu't + \frac{1}{2}t'\Sigma t}.$$

Proof. Use the fact that for $Z \sim N_p(0, I_p)$ we have
$$M_Z(t) = E\left(e^{Z't}\right) = \prod_{i=1}^{p}M_{Z_i}(t_i) = \prod_{i=1}^{p}e^{\frac{1}{2}t_i^2} = e^{\frac{1}{2}t't}.$$
Since $X = \Sigma^{\frac{1}{2}}Z + \mu$, we have
$$\begin{aligned} M_X(t) &= E\left(e^{X't}\right) \\ &= E\left(e^{(\Sigma^{\frac{1}{2}}Z + \mu)'t}\right) \\ &= e^{\mu't}E\left(e^{(\Sigma^{\frac{1}{2}}Z)'t}\right) \\ &= e^{\mu't}E\left(e^{Z'\Sigma^{\frac{1}{2}}t}\right) \\ &= e^{\mu't}M_Z(\Sigma^{\frac{1}{2}}t) \\ &= e^{\mu't + \frac{1}{2}t'\Sigma t}. \end{aligned}$$
Theorem 12 Suppose that $X : p \times 1$ has a $N_p(\mu, \Sigma)$ distribution and let the rank of $D : q \times p$ be $q$ ($q \leq p$). Then $Y = DX$ has a $N_q(D\mu, D\Sigma D')$ distribution.

Proof.
$$\begin{aligned} M_Y(t) &= E\left(e^{Y't}\right) \\ &= E\left(e^{X'D't}\right) \\ &= M_X(D't) \\ &= e^{\mu'D't + \frac{1}{2}t'D\Sigma D't}. \end{aligned}$$
Let
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \text{and} \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$
Then $X_1 : q \times 1$ has a $N_q(\mu_1, \Sigma_{11})$ distribution.

Proof. Let $D = \begin{pmatrix} I_q & \vdots & 0 \end{pmatrix}$ in the previous Theorem.

Again let
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \text{and} \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$
Then $X_1$ and $X_2$ are independent if and only if $\Sigma_{12} = \operatorname{cov}(X_1, X_2') = 0$.
Proof. If $X_1$ and $X_2$ are independent, then
$$\begin{aligned} \operatorname{cov}(X_1, X_2') &= E\left\{[X_1 - E(X_1)][X_2 - E(X_2)]'\right\} \\ &= E\left\{[X_1 - E(X_1)][X_2' - E(X_2')]\right\} \\ &= E\left\{X_1X_2' - X_1E(X_2') - E(X_1)X_2' + E(X_1)E(X_2')\right\} \\ &= E(X_1)E(X_2') - E(X_1)E(X_2') - E(X_1)E(X_2') + E(X_1)E(X_2') \\ &= 0. \end{aligned}$$
Corollary 4 Independent normally distributed vectors can be combined such that the combined vector is also normally
distributed.
If $X_i : p_i \times 1$, $i = 1, 2, \ldots, n$ are independent $N_{p_i}(\mu_i, \Sigma_i)$ and
$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix},$$
then $X : \sum_{i=1}^{n}p_i \times 1$ has a $N_{\sum p_i}(\mu, \Sigma)$ distribution where
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \Sigma_1 & 0 & \cdots & 0 \\ 0 & \Sigma_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma_n \end{pmatrix}.$$
Note that the zero matrices as well as the $\Sigma_i$'s may be of different order.
The distribution of the sum of the random vectors,
$$Y = \sum_{i=1}^{n}X_i = (I_p\;I_p\;\cdots\;I_p)X = (I_p\;I_p\;\cdots\;I_p)\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix},$$
follows with $D = (I_p\;I_p\;\cdots\;I_p)$. In particular, if the $X_i$ are independently and identically $N_p(\mu, \Sigma)$ distributed, then
$$D\mu = n\mu \qquad \text{and} \qquad D\Sigma D' = n\Sigma.$$
Corollary 6 Let $X : p \times 1$ with $E(X) = \mu$ and $\operatorname{cov}(X, X') = \Sigma$. If all linear combinations of $X$ are normally distributed, then $X$ also has a multivariate normal distribution.

Consider the linear combination $Y = d'X = d_1X_1 + \cdots + d_pX_p$. Then $E(Y) = E(d'X) = d'\mu$ and $\operatorname{var}(d'X) = d'\operatorname{cov}(X, X')d = d'\Sigma d$. Since $Y \sim N(d'\mu, d'\Sigma d)$,
$$M_Y(t) = e^{d'\mu t + \frac{1}{2}d'\Sigma d\,t^2} = E\left(e^{d'Xt}\right) = E\left(e^{X'dt}\right) = M_X(dt).$$

Corollary 8 It also follows that $Y_1 = d_1'X$ and $Y_2 = d_2'X$ are independent if and only if $d_1'\Sigma d_2 = 0$.
1
Corollary 9 If X : p 1 has a multivariate Np ( ; ) distribution, then Z = 2 (X ) N (0; I p ). This means
that the elements of Z are indpendently N (0; 1) distributed. It then follows that
P
p
Z 0 Z = (X )0 1
(X )= Zi2 2
(p) :
i=1
Example 9 Suppose that the random vectors $X_1$, $X_2$ and $X_3$ are independently $N_p(\mu, \Sigma)$ distributed.

Remark 7 Take special care in the example above: our usual $A$ matrix consists of submatrices, or vector combinations, and therefore the components of $A$ are $I$ instead of 1.
2.2.4 Conditional Distributions of Normal Random Vectors
Let $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ where $X_1 : q \times 1$ and $X_2 : (p-q) \times 1$, with corresponding $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$.

Remark 8 The matrix $\beta = \Sigma_{12}\Sigma_{22}^{-1}$ is the matrix of regression coefficients of $X_1$ on $x_2$. The vector $\mu_1 + \beta(x_2 - \mu_2)$ is often called the regression function.

Remark 9 The conditional variance of $X_i$, $i = 1, 2, \ldots, q$, given that $X_2 = x_2$, i.e. the $i$-th diagonal element of $\Sigma_{11.2}$, is called the partial variance of $X_i$, given that $X_2 = x_2$, and is written as $\sigma_{ii.q+1,\ldots,p}$. The conditional covariance between $X_i$ and $X_j$, given that $X_2 = x_2$, i.e. the $i,j$-th element of $\Sigma_{11.2}$, is called the partial covariance between $X_i$ and $X_j$, given $X_2 = x_2$, and is written as $\sigma_{ij.q+1,\ldots,p}$. The conditional correlation coefficient between $X_i$ and $X_j$, given that $X_2 = x_2$, is the partial correlation coefficient between $X_i$ and $X_j$ and is given by
$$\rho_{ij.q+1,\ldots,p} = \frac{\sigma_{ij.q+1,\ldots,p}}{\sqrt{\sigma_{ii.q+1,\ldots,p}\,\sigma_{jj.q+1,\ldots,p}}}.$$

Remark 10 Note that $\rho_{ij.q+1,\ldots,p}$ does not depend on $x_2$, which implies that $\rho_{ij.q+1,\ldots,p}$ is independent of the given value of $X_2$. We may consider the partial correlation coefficient as the correlation coefficient between $X_i$ and $X_j$ when the influence of $X_2$ is eliminated or removed (i.e. if $X_2$ is fixed).
$$Y = DX + c \sim N(D\mu + c, D\Sigma D').$$
In the case where $D = \operatorname{diag}(d_1, d_2, \ldots, d_p)$ is a diagonal matrix, we say that this transformation is a scale transformation. If the elements of $D$ are all positive, then it is a positive scale transformation. The correlation matrix of the elements of $Y$ in the case of a positive scale transformation is given by
$$\rho_{ij}^{Y} = \frac{d_id_j\sigma_{ij}}{\sqrt{d_i^2\sigma_{ii}\,d_j^2\sigma_{jj}}} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}} = \rho_{ij}^{X}.$$
The correlation matrix of $X$ is a covariance matrix of a specific positive scale transformation of $X$. Take
$$D = \operatorname{diag}\left(\frac{1}{\sqrt{\sigma_{11}}}, \frac{1}{\sqrt{\sigma_{22}}}, \ldots, \frac{1}{\sqrt{\sigma_{pp}}}\right).$$
Theorem 16 Suppose $X : p \times 1$ is $N_p(\mu, \Sigma)$ distributed. The partial correlation coefficient $\rho_{ij.q+1,\ldots,p}$ is invariant with respect to a positive scale transformation of the stochastic variates. That means that if
$$Y = DX + c \quad \text{and} \quad D = \operatorname{diag}(d_1, d_2, \ldots, d_p),$$
the partial correlation coefficient between the $i$-th and $j$-th elements of $Y$ is
$$\rho_{ij.q+1,\ldots,p}^{Y} = \rho_{ij.q+1,\ldots,p}^{X}.$$
Example 10 Ficus is a genus of approximately 850 species of shrubs, trees, and vines in the family Moraceae, sometimes loosely referred to as "fig trees", or figs. The fruit that these trees bear is of vital cultural importance around the world and also serves as an extremely important food resource for wildlife. Consider the hypothetical example where $p = 3$ (this is also called "trivariate") about heights (in cm) of three different ficus species that are found in South Africa:
$X_1$ = height of Ficus bizanae
$X_2$ = height of Ficus burtt-davyi
$X_3$ = height of Ficus tettensis
The interest in Ficus and the modelling of its characteristics is important for understanding ecological advancement, and wildlife as well as cultural conservation. The covariance matrix is given by
$$\Sigma = (\sigma_{ij}) = \begin{pmatrix} 100 & 70 & 90 \\ 70 & 100 & 80 \\ 90 & 80 & 100 \end{pmatrix}.$$
The partial covariance matrix of the heights of the first two trees, if the influence of the third tree is eliminated, is:
$$\begin{aligned} \Sigma_{11.2} &= \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \\ &= \begin{pmatrix} 100 & 70 \\ 70 & 100 \end{pmatrix} - \begin{pmatrix} 90 \\ 80 \end{pmatrix}\frac{1}{100}\begin{pmatrix} 90 & 80 \end{pmatrix} \\ &= \begin{pmatrix} 19 & -2 \\ -2 & 36 \end{pmatrix}. \end{aligned}$$
The partial correlation coefficient between the heights of the first two trees, if the influence of the height of the third tree is eliminated, then is:
$$\rho_{12.3} = \frac{-2}{\sqrt{19 \times 36}} = -0.076.$$
This analysis tells us, for example, that the heights of the first two trees are not particularly highly correlated once the third is accounted for, and thus gives insight to farmers/scientists/the public that seeing one tall ficus does not necessarily mean another ficus will be tall. This aids with plantation design and wildlife conservation, i.e. how "much" ficus and ficus leaves might be available for wildlife consumption as part of herd and grazing planning.
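A minimal SAS IML sketch of this calculation, using the covariance matrix of the example:

proc iml;
/* Sketch: partial covariance matrix and partial correlation for the Ficus example */
sigma = {100 70 90, 70 100 80, 90 80 100};
s11 = sigma[1:2, 1:2];
s12 = sigma[1:2, 3];
s22 = sigma[3, 3];
s11_2 = s11 - s12*inv(s22)*s12`;                      /* partial covariance matrix Sigma_11.2 */
rho12_3 = s11_2[1,2]/sqrt(s11_2[1,1]*s11_2[2,2]);     /* partial correlation rho_12.3         */
print s11_2 rho12_3;
quit;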
Example 11 Redo the above example, but calculate the partial correlation coefficient between the height of Ficus bizanae and Ficus tettensis, by eliminating the influence of Ficus burtt-davyi.

Example 12 Take note: the type of analysis above is crucial for understanding the relationships between, in this example, heights of shrubs/trees of the same genus but of different type. The elimination of the influence of one or more variables allows the scientist/practitioner to gain insight into the relationships that the remaining variables have. Think, for example, of the case where your data is vast but the geographical area of your interest does not cater for Ficus bizanae. Then you can investigate the behaviour of the relationship of the remaining two types by disregarding the influence of Ficus bizanae!

Example 13 Finally: in this hypothetical example, we only considered 3 ficus types. There are more than 800 ficus types, so in the case where you have access to the heights from a sample of all 800, your covariance matrix would be $800 \times 800$ - which is huge! It is then that you would want to have access to a reasonable computer, good software, and your own strong coding skills to solve these computational challenges. :)
Take note:
Since the partial correlation coefficient is scale invariant, it may be calculated directly from the correlation matrix. For the case $p = 3$,
$$\Sigma = (\sigma_{ij}) = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{pmatrix}$$
and
$$(\rho_{ij}) = \begin{pmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{21} & 1 & \rho_{23} \\ \rho_{31} & \rho_{32} & 1 \end{pmatrix}.$$
In this case
$$\Sigma_{11.2} = \frac{1}{\sigma_{33}}\begin{pmatrix} \sigma_{11}\sigma_{33} - \sigma_{13}^2 & \sigma_{12}\sigma_{33} - \sigma_{13}\sigma_{23} \\ \sigma_{12}\sigma_{33} - \sigma_{13}\sigma_{23} & \sigma_{22}\sigma_{33} - \sigma_{23}^2 \end{pmatrix}$$
so that
$$\rho_{12.3} = \frac{\sigma_{12}\sigma_{33} - \sigma_{13}\sigma_{23}}{\sqrt{(\sigma_{11}\sigma_{33} - \sigma_{13}^2)(\sigma_{22}\sigma_{33} - \sigma_{23}^2)}} = \frac{\frac{1}{\sqrt{\sigma_{11}\sigma_{22}}\,\sigma_{33}}(\sigma_{12}\sigma_{33} - \sigma_{13}\sigma_{23})}{\frac{1}{\sqrt{\sigma_{11}\sigma_{22}}\,\sigma_{33}}\sqrt{(\sigma_{11}\sigma_{33} - \sigma_{13}^2)(\sigma_{22}\sigma_{33} - \sigma_{23}^2)}} = \frac{\rho_{12} - \rho_{13}\rho_{23}}{\sqrt{(1 - \rho_{13}^2)(1 - \rho_{23}^2)}}.$$
Let
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}$$
where $Y_1 = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ and $Y_2 = (X_3) = X_3$. The expected value of $Y_1 = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ given $X_3$ is
$$\begin{aligned} E(Y_1 \mid Y_2 = y_2 = x_3) &= \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y_2 - \mu_2) \\ &= \begin{pmatrix} 66 \\ 66 \end{pmatrix} + \frac{1}{100}\begin{pmatrix} 90 \\ 80 \end{pmatrix}(x_3 - 66) \\ &= \begin{pmatrix} 6.6 \\ 13.2 \end{pmatrix} + \begin{pmatrix} 0.9 \\ 0.8 \end{pmatrix}x_3. \end{aligned}$$
The estimated values of $X_1$ and $X_2$ given that Ficus tettensis has a height of 72 are
$$E\left(\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\Bigm| X_3 = 72\right) = \begin{pmatrix} 6.6 + 0.9 \times 72 \\ 13.2 + 0.8 \times 72 \end{pmatrix} = \begin{pmatrix} 71.4 \\ 70.8 \end{pmatrix}.$$
The estimated values of $X_1$ and $X_2$ given that Ficus tettensis has a height of 60 are
$$E\left(\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\Bigm| X_3 = 60\right) = \begin{pmatrix} 6.6 + 0.9 \times 60 \\ 13.2 + 0.8 \times 60 \end{pmatrix} = \begin{pmatrix} 60.6 \\ 61.2 \end{pmatrix}.$$
Consider the random vector $X : p \times 1$ with expected value $\mu$ and covariance matrix $\Sigma$. Let $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ where $X_1 : 1 \times 1$, $X_2 : (p-1) \times 1$ and, correspondingly, $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_1' \\ \sigma_1 & \Sigma_{22} \end{pmatrix}$.
In order to find a measure of the relation between $X_1$ on the one hand and the components of $X_2$ on the other, the multiple correlation coefficient between $X_1$ and $X_2$ is defined. The correlation coefficient between $X_1$ and the linear function $\alpha'X_2 = \alpha_1X_2 + \alpha_2X_3 + \cdots + \alpha_{p-1}X_p$ is given by
$$\frac{\operatorname{cov}(X_1, \alpha'X_2)}{\sqrt{\operatorname{var}(X_1)\operatorname{var}(\alpha'X_2)}}.$$
Theorem 17 Of all linear combinations $\alpha'X_2$, that combination which minimizes the variance of $X_1 - \alpha'X_2$ and maximizes the correlation between $X_1$ and $\alpha'X_2$, is given by
$$\beta'X_2, \quad \text{where } \beta = \Sigma_{22}^{-1}\sigma_1.$$
Proof. The result will be given.

Therefore it follows that the multiple correlation coefficient is given by:
$$R = \frac{\operatorname{cov}(X_1, \beta'X_2)}{\sqrt{\operatorname{var}(X_1)\operatorname{var}(\beta'X_2)}} = \frac{\sigma_1'\beta}{\sqrt{\sigma_{11}\,\beta'\Sigma_{22}\beta}} = \frac{\sigma_1'\Sigma_{22}^{-1}\sigma_1}{\sqrt{\sigma_{11}\,\sigma_1'\Sigma_{22}^{-1}\sigma_1}} = \sqrt{\frac{\sigma_1'\Sigma_{22}^{-1}\sigma_1}{\sigma_{11}}}.$$
Theorem 18 The multiple correlation coefficient $R$ is invariant with respect to the transformation
$$X^* = DX$$
where $D = \begin{pmatrix} d_1 & 0' \\ 0 & D_2 \end{pmatrix}$, $d_1 \neq 0$ and $D_2$ is non-singular.

Proof.
$$\Sigma^* = D\Sigma D' = \begin{pmatrix} d_1 & 0' \\ 0 & D_2 \end{pmatrix}\begin{pmatrix} \sigma_{11} & \sigma_1' \\ \sigma_1 & \Sigma_{22} \end{pmatrix}\begin{pmatrix} d_1 & 0' \\ 0 & D_2' \end{pmatrix} = \begin{pmatrix} d_1^2\sigma_{11} & d_1\sigma_1'D_2' \\ d_1D_2\sigma_1 & D_2\Sigma_{22}D_2' \end{pmatrix}$$
and
$$R^* = \sqrt{\frac{d_1^2\,\sigma_1'D_2'(D_2\Sigma_{22}D_2')^{-1}D_2\sigma_1}{d_1^2\sigma_{11}}} = R.$$
Example 15 In the previous example, we may be interested in the multiple correlation coefficient between the height of the first tree species and the heights of the second and third species. In this case
$$\Sigma = (\sigma_{ij}) = \begin{pmatrix} 100 & 70 & 90 \\ 70 & 100 & 80 \\ 90 & 80 & 100 \end{pmatrix}$$
and
$$\begin{aligned} \sigma_1'\Sigma_{22}^{-1}\sigma_1 &= \begin{pmatrix} 70 & 90 \end{pmatrix}\begin{pmatrix} 100 & 80 \\ 80 & 100 \end{pmatrix}^{-1}\begin{pmatrix} 70 \\ 90 \end{pmatrix} \\ &= \begin{pmatrix} 70 & 90 \end{pmatrix}\frac{1}{3600}\begin{pmatrix} 100 & -80 \\ -80 & 100 \end{pmatrix}\begin{pmatrix} 70 \\ 90 \end{pmatrix} \\ &= \frac{1}{3600}\begin{pmatrix} -200 & 3400 \end{pmatrix}\begin{pmatrix} 70 \\ 90 \end{pmatrix} \\ &= \frac{292000}{3600} = 81.1111. \end{aligned}$$
Hence:
$$R_{1.23} = \sqrt{\frac{81.1111}{100}} = \sqrt{0.8111} = 0.90.$$

Remark 11 Remind yourself here that $\sigma_1'\Sigma_{22}^{-1}\sigma_1$ can be any (non-negative) value, but $R$ will be bounded between 0 and 1.
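A minimal SAS IML sketch of this multiple correlation calculation:

proc iml;
/* Sketch: multiple correlation R_{1.23} for the Ficus covariance matrix */
sigma = {100 70 90, 70 100 80, 90 80 100};
s11  = sigma[1, 1];
sig1 = sigma[2:3, 1];                 /* covariances of X1 with (X2, X3) */
s22  = sigma[2:3, 2:3];
R = sqrt(sig1`*inv(s22)*sig1/s11);    /* R_{1.23} = 0.90                 */
print R;
quit;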
3 NORMAL SAMPLES
3.1 Estimation of Parameters
3.1.1 The Mean and Covariance Matrix
and
$$E(X') = E(X_1, X_2, \ldots, X_N) = (\mu, \mu, \ldots, \mu) = \mu 1_N'.$$
Also
$$\bar{x} = \frac{1}{N}X'1_N = \frac{1}{N}\sum_{i=1}^{N}x_i = \begin{pmatrix} \frac{1}{N}\sum_{i=1}^{N}x_{i1} \\ \frac{1}{N}\sum_{i=1}^{N}x_{i2} \\ \vdots \\ \frac{1}{N}\sum_{i=1}^{N}x_{ip} \end{pmatrix} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix}.$$
The maximum likelihood estimators are
$$\hat{\mu} = \bar{x} \quad \text{and} \quad \hat{\Sigma} = \frac{1}{N}A$$
where $A = \sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})'$.
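A minimal SAS IML sketch of these estimators (with an assumed small data matrix):

proc iml;
/* Sketch: mu_hat = xbar and Sigma_hat = A/N for an assumed data matrix X (N x p) */
X = {7 4, 4 1, 6 3, 8 6, 8 5};
N = nrow(X);
xbar = X`*j(N, 1, 1)/N;                        /* sample mean vector                    */
A = X`*(I(N) - (1/N)*j(N, N, 1))*X;            /* sums of squares and cross products    */
SigmaHat = A/N;                                /* maximum likelihood estimator of Sigma */
print xbar SigmaHat;
quit;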
3.1.2 Correlation Coefficients

Lemma 1 Let $f(\theta) : \Re^p \to \Re$ and $\tau = h(\theta) : \Re^p \to \Re^p$ where $h(\cdot)$ is a one-to-one function, i.e. $\theta = h^{-1}(\tau)$. Let $g(\tau) = f(h^{-1}(\tau))$. If $f(\theta)$ is a maximum at $\theta = \theta_0$, then $g(\tau)$ is a maximum at $\tau = \tau_0 = h(\theta_0)$. If the maximum of $f(\theta)$ is unique, then the maximum of $g(\tau)$ is also unique.

Remark 12 Find the theorem from Bain and Engelhardt (WST 221) where this concept was covered.

It follows from the lemma that if $\hat{\theta}$ is a set of m.l.e.'s of $\theta$ and $\tau = h(\theta)$ where $h$ is one-to-one, then $\hat{\tau} = h(\hat{\theta})$ is a set of m.l.e.'s of $\tau$. Uniqueness is also carried over. Since the transformations of the parameters
$$\mu_i,\; i = 1, 2, \ldots, p \quad \text{and} \quad \sigma_{ij},\; i, j = 1, 2, \ldots, p$$
to the parameters
$$\mu_i,\; i = 1, 2, \ldots, p; \quad \sigma_{ii},\; i = 1, 2, \ldots, p$$
and
$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}},\; i \neq j = 1, 2, \ldots, p$$
are one-to-one, it follows that the m.l.e.'s of the $\rho_{ij}$ are given by
$$\hat{\rho}_{ij} = \frac{\hat{\sigma}_{ij}}{\sqrt{\hat{\sigma}_{ii}\hat{\sigma}_{jj}}} = \frac{a_{ij}}{\sqrt{a_{ii}a_{jj}}} = \frac{\sum_{\alpha=1}^{N}(x_{\alpha i} - \bar{x}_i)(x_{\alpha j} - \bar{x}_j)}{\sqrt{\left\{\sum_{\alpha=1}^{N}(x_{\alpha i} - \bar{x}_i)^2\right\}\left\{\sum_{\alpha=1}^{N}(x_{\alpha j} - \bar{x}_j)^2\right\}}}.$$
Similarly it also follows that the maximum likelihood estimator of $\rho_{ij.q+1,\ldots,p}$ is given by
$$\hat{\rho}_{ij.q+1,\ldots,p} = \frac{\hat{\sigma}_{ij.q+1,\ldots,p}}{\sqrt{\hat{\sigma}_{ii.q+1,\ldots,p}\,\hat{\sigma}_{jj.q+1,\ldots,p}}}.$$
To determine the maximum likelihood estimator of the multiple correlation coefficient, suppose again that $X : p \times 1$ is $N(\mu, \Sigma)$ distributed and let $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ where $X_1 : 1 \times 1$, $X_2 : (p-1) \times 1$ and, correspondingly, $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_1' \\ \sigma_1 & \Sigma_{22} \end{pmatrix}$. The multiple correlation coefficient is
$$R = \sqrt{\frac{\sigma_1'\Sigma_{22}^{-1}\sigma_1}{\sigma_{11}}}.$$
The transformation of $\Sigma$ to $R$, $\sigma_1$ and $\Sigma_{22}$ is again one-to-one, which implies that the maximum likelihood estimator of $R$ is given by
$$\hat{R} = \sqrt{\frac{\hat{\sigma}_1'\hat{\Sigma}_{22}^{-1}\hat{\sigma}_1}{\hat{\sigma}_{11}}}.$$
3.2 Sampling Distributions
3.2.1 The Mean and Covariance Matrix
Proof. Suppose that the rows of $X$ are a random sample from a $N_p(\mu, \Sigma)$ distribution. Let $B : N \times N$ be an orthogonal matrix with last row
$$\left(\tfrac{1}{\sqrt{N}}, \tfrac{1}{\sqrt{N}}, \ldots, \tfrac{1}{\sqrt{N}}\right).$$
Thus, the sum of the elements of any other row of $B$ will be equal to zero. Let
$$Z = BX = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1N} \\ b_{21} & b_{22} & \cdots & b_{2N} \\ \vdots & \vdots & & \vdots \\ \frac{1}{\sqrt{N}} & \frac{1}{\sqrt{N}} & \cdots & \frac{1}{\sqrt{N}} \end{pmatrix}\begin{pmatrix} X_1' \\ X_2' \\ \vdots \\ X_N' \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{N}b_{1i}X_i' \\ \sum_{i=1}^{N}b_{2i}X_i' \\ \vdots \\ \frac{1}{\sqrt{N}}\sum_{i=1}^{N}X_i' \end{pmatrix} = \begin{pmatrix} Z_1' \\ Z_2' \\ \vdots \\ Z_N' \end{pmatrix}.$$
Note that
$$Z_\alpha' = \sum_{i=1}^{N}b_{\alpha i}X_i'.$$
Linear combinations of normally distributed variables are again normally distributed. This means that the joint distribution of all the elements of $Z$ is normal with
$$E(Z_\alpha') = \sum_{i=1}^{N}b_{\alpha i}E(X_i') = \sum_{i=1}^{N}b_{\alpha i}\mu' = \begin{cases} 0' & \text{if } \alpha \neq N \\ \sqrt{N}\mu' & \text{if } \alpha = N \end{cases}$$
and
$$\operatorname{cov}(Z_\alpha, Z_\beta') = \operatorname{cov}\left(\sum_{i=1}^{N}b_{\alpha i}X_i, \sum_{j=1}^{N}b_{\beta j}X_j'\right) = \sum_{i=1}^{N}\operatorname{cov}(b_{\alpha i}X_i, b_{\beta i}X_i') + 0 = \sum_{i=1}^{N}b_{\alpha i}b_{\beta i}\operatorname{cov}(X_i, X_i') = \sum_{i=1}^{N}b_{\alpha i}b_{\beta i}\Sigma = \begin{cases} \Sigma & \text{if } \alpha = \beta \\ 0 & \text{if } \alpha \neq \beta. \end{cases}$$
Since the covariance between any two different rows of $Z$ is a zero matrix, it follows that the rows of $Z$ are independent. Thus, for
$$Z = \begin{pmatrix} Z_1' \\ Z_2' \\ \vdots \\ Z_N' \end{pmatrix},$$
$Z_\alpha \sim N_p(0, \Sigma)$ if $\alpha = 1, 2, \ldots, N-1$ and $Z_N \sim N_p(\sqrt{N}\mu, \Sigma)$, all independent.
Since $\bar{X} = \frac{1}{N}X'1_N = \frac{1}{N}Z'B1_N = \frac{1}{\sqrt{N}}Z_N$,
$$\bar{X} \sim N_p\left(\mu, \tfrac{1}{N}\Sigma\right).$$
Next,
$$\begin{aligned} A &= \sum_{\alpha=1}^{N}(X_\alpha - \bar{X})(X_\alpha - \bar{X})' \\ &= \sum_{\alpha=1}^{N}X_\alpha X_\alpha' - N\bar{X}\bar{X}' \\ &= X'X - \tfrac{1}{N}X'1_N(1_N'X) \\ &= (B'Z)'(B'Z) - \tfrac{1}{N}(B'Z)'1_N1_N'(B'Z) \\ &= Z'BB'Z - \tfrac{1}{N}Z'B1_N1_N'B'Z \\ &= Z'Z - Z_NZ_N' \\ &= \sum_{\alpha=1}^{N-1}Z_\alpha Z_\alpha' \end{aligned}$$
since
$$(B1_N)1_N'B' = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ \sqrt{N} \end{pmatrix}\begin{pmatrix} 0 & 0 & \cdots & \sqrt{N} \end{pmatrix} = \begin{pmatrix} 0 & \cdots & 0 & 0 \\ \vdots & & \vdots & \vdots \\ 0 & \cdots & 0 & 0 \\ 0 & \cdots & 0 & N \end{pmatrix}.$$
Theorem 21 An unbiased estimator for the covariance matrix of a random sample of $N$ observations from a $N_p(\mu, \Sigma)$ distribution is
$$\frac{1}{N-1}A = \frac{1}{N-1}\sum_{\alpha=1}^{N}(X_\alpha - \bar{X})(X_\alpha - \bar{X})'.$$
Since $E(\hat{\Sigma}) = \frac{N-1}{N}\Sigma$, an unbiased estimator of $\Sigma$ would be
$$\frac{N}{N-1}\hat{\Sigma} = \frac{1}{N-1}A = \frac{1}{N-1}\sum_{\alpha=1}^{N}(X_\alpha - \bar{X})(X_\alpha - \bar{X})'.$$
3.2.2 Quadratic Forms of Normally Distributed Variates
Example 16 Write down $S$ when $p = 2$ and $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$.
Corollary 10
1. $S = X'X \sim \chi^2(p)$.
2. If $X : n \times 1 \sim N_n(0, I_n)$, then $(n-1)S^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2 \sim \chi^2(n-1)$.
The second part of the Corollary follows from the fact that if
$$Y = \begin{pmatrix} X_1 - \bar{X} \\ X_2 - \bar{X} \\ \vdots \\ X_n - \bar{X} \end{pmatrix} = X - \frac{1}{n}1_n1_n'X = \left(I_n - \frac{1}{n}1_n1_n'\right)X,$$
then $I_n - \frac{1}{n}1_n1_n'$ is idempotent and has a rank of $n - 1$. From the previous theorem, $Y'Y = X'\left(I_n - \frac{1}{n}1_n1_n'\right)X = (n-1)S^2 \sim \chi^2(n-1)$.
Lemma 4 If $X : p \times 1 \sim N_p(\mu, \Sigma)$, then
$$S = (X - \mu)'\Sigma^{-1}(X - \mu)$$
has a $\chi^2$ distribution with $p$ degrees of freedom.

Proof. Let $Y = \Sigma^{-\frac{1}{2}}(X - \mu) \sim N_p(0, I_p)$. The quadratic form
$$S = (X - \mu)'\Sigma^{-1}(X - \mu) = Y'Y \sim \chi^2(p).$$
From the three lemmas above it follows that:
1. If $X : N \times 1 \sim N_N(0, I_N)$, then $S = X'X \sim \chi^2(N)$.
2. If $X : N \times 1 \sim N_N(0, I_N)$, then $\sum_{i=1}^{N}(X_i - \bar{X})^2 \sim \chi^2(N-1)$.
Define, for example,
$$Y = \left(\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} - \begin{pmatrix} \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{pmatrix}\right)\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} = \begin{pmatrix} \frac{2}{3} & -\frac{1}{3} & -\frac{1}{3} \\ -\frac{1}{3} & \frac{2}{3} & -\frac{1}{3} \\ -\frac{1}{3} & -\frac{1}{3} & \frac{2}{3} \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} = \begin{pmatrix} \frac{2}{3}X_1 - \frac{1}{3}X_2 - \frac{1}{3}X_3 \\ -\frac{1}{3}X_1 + \frac{2}{3}X_2 - \frac{1}{3}X_3 \\ -\frac{1}{3}X_1 - \frac{1}{3}X_2 + \frac{2}{3}X_3 \end{pmatrix};$$
then the rank of
$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} - \begin{pmatrix} \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{pmatrix}$$
is $3 - 1 = 2$. Consider $Y'Y$ and the result follows.
3. If $X : N \times 1 \sim N_N(\mu 1_N, \sigma^2I_N)$, then $\bar{X}$ and $\sum_{i=1}^{N}(X_i - \bar{X})^2$ are independent, where $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{N}\right)$ and $\frac{1}{\sigma^2}\sum_{i=1}^{N}(X_i - \bar{X})^2$ has a $\chi^2(N-1)$ distribution.
3.3 Inferences Concerning Multivariate Means
In this section, we describe the testing of hypotheses about multivariate means. This is the multivariate equivalent of the one-sample t-test studied in earlier years, but instead of having only one variable (and thus one mean), we now consider a multivariate sample (a vector) which has a multivariate mean. Please take time to revise Chapter 12 from WST 221 (Bain and Engelhardt, p. 389-404) in preparation for this section.
$$\frac{T^2}{N-1} = Y'A^{-1}Y.$$
Then
$$\frac{T^2}{N-1}\cdot\frac{N-p}{p} \sim F(p, N-p).$$
The statistic $T^2$ is known as Hotelling's $T^2$ statistic.
Remark 14 Also note that in the above theorem, $\Sigma$ is assumed to be known. This is equivalent to assuming that the variance $\sigma^2$ of a univariate sample is known. When $\Sigma$ is not known, we use something that can be "estimated" from the available data (which is $A$ or $S$, usually).
$$\frac{T^2}{N-1} = N(\bar{X} - \mu_0)'A^{-1}(\bar{X} - \mu_0) = (\bar{X} - \mu_0)'\hat{\Sigma}^{-1}(\bar{X} - \mu_0) = \frac{N}{N-1}(\bar{X} - \mu_0)'S^{-1}(\bar{X} - \mu_0)$$
and
$$\frac{T^2}{N-1}\cdot\frac{N-p}{p} \sim F(p, N-p).$$
We reject $H_0$ if $\dfrac{T^2}{N-1}\cdot\dfrac{N-p}{p} > F_\alpha(p, N-p)$.
Exercise 3 Write down the alternative hypothesis.
Seal (1994) observed the head length ($X_1$) and head width ($X_2$) of a sample of 49 frogs, of which 14 were male and 35 female. In this case, the question to be answered is: does the average for female frogs for both head length and head width differ significantly from 25? The results are
$$\bar{x}_m = \begin{pmatrix} 21.821 \\ 22.843 \end{pmatrix} \qquad \hat{\Sigma}_m = \frac{13}{14}S_m = \frac{1}{14}A_m = \begin{pmatrix} 17.159 & 17.731 \\ 17.731 & 19.273 \end{pmatrix}$$
$$\bar{x}_f = \begin{pmatrix} 22.860 \\ 24.397 \end{pmatrix} \qquad \hat{\Sigma}_f = \frac{34}{35}S_f = \frac{1}{35}A_f = \begin{pmatrix} 17.178 & 19.710 \\ 19.710 & 23.710 \end{pmatrix}.$$
Consider the hypothesis
$$H_0 : \mu_f = \begin{pmatrix} 25 \\ 25 \end{pmatrix}.$$
This solution can be calculated by hand:
$$\begin{aligned} \frac{T^2}{N-1} &= (\bar{x}_f - 25\,1_2)'\hat{\Sigma}_f^{-1}(\bar{x}_f - 25\,1_2) \\ &= \begin{pmatrix} -2.140 & -0.603 \end{pmatrix}\begin{pmatrix} 1.261 & -1.048 \\ -1.048 & 0.913 \end{pmatrix}\begin{pmatrix} -2.140 \\ -0.603 \end{pmatrix} \\ &= 3.401 \end{aligned}$$
and since
$$\frac{T^2}{N-1}\cdot\frac{N-p}{p} = 3.401\times\frac{35-2}{2} = 56.116 > F_{0.01}(2, 33) = 5.312,$$
we reject $H_0$ at the 1% level of significance. At least one of the means differs significantly from 25.
The solution can also be obtained with SAS PROC IML; the code and output for the example:
proc iml;
xf={22.860, 24.397};
sigf={17.178 19.710,
      19.710 23.710};
isigf=inv(sigf);                                   /* inverse of the estimated covariance matrix */
Hotelling=(xf-25#J(2,1,1))`*isigf*(xf-25#J(2,1,1));
FStat=Hotelling#33/2;
FCrit=finv(0.99,2,33);
print isigf Hotelling FStat FCrit;
quit;
isigf
1.2607491 -1.048054
-1.048054 0.9134183
Example 17 Redo the example above i) by hand and ii) in PROC IML by testing $H_0 : \mu_m = \begin{pmatrix} 19.5 \\ 20.5 \end{pmatrix}$, and then by testing $H_0 : \mu_m = \begin{pmatrix} 19.9 \\ 21 \end{pmatrix}$. Is there a difference in the outcome? Comment and discuss.
3.4 Principal Component Analysis
These notes are adapted from Lesson 11: Principal Component Analysis (PCA) of the STAT 505 course from the PennState Eberly College of Science, available on the internet.
3.4.1 Basics
Let $X : p \times 1$ with $\operatorname{cov}(X, X') = \Sigma$ and let $(\lambda_i, e_i)$, $i = 1, 2, \ldots, p$ be the $p$ eigenvalue, eigenvector pairs for $\Sigma$ such that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$ and $e_i'e_i = 1$ for $i = 1, 2, \ldots, p$ and $e_i'e_j = 0$ for $i \neq j$. Then
$$\Sigma = P\Lambda P' = \begin{pmatrix} e_1 & e_2 & \cdots & e_p \end{pmatrix}\begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{pmatrix}\begin{pmatrix} e_1' \\ e_2' \\ \vdots \\ e_p' \end{pmatrix} = \sum_{i=1}^{p}\lambda_ie_ie_i'$$
and
$$\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{pmatrix} = P'\Sigma P = \begin{pmatrix} e_1' \\ e_2' \\ \vdots \\ e_p' \end{pmatrix}\Sigma\begin{pmatrix} e_1 & e_2 & \cdots & e_p \end{pmatrix} = \begin{pmatrix} e_1'\Sigma e_1 & e_1'\Sigma e_2 & \cdots & e_1'\Sigma e_p \\ e_2'\Sigma e_1 & e_2'\Sigma e_2 & \cdots & e_2'\Sigma e_p \\ \vdots & \vdots & & \vdots \\ e_p'\Sigma e_1 & e_p'\Sigma e_2 & \cdots & e_p'\Sigma e_p \end{pmatrix}.$$
Note: this process is also called the spectral decomposition of a matrix. From this last expression it follows that $\lambda_i = e_i'\Sigma e_i = e_i'\operatorname{cov}(X, X')e_i = \operatorname{cov}(e_i'X, X'e_i) = \operatorname{var}(Y_i)$ where $Y_i = e_i'X$. Also, $e_i'\Sigma e_j = \operatorname{cov}(Y_i, Y_j) = 0$ for all $i \neq j$.
The variable $Y_i = e_i'X = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p$ is known as the $i$th principal component of the variables in $X$.
The proportion of total variation in $X$ explained by $Y_i = e_i'X$ (the $i$th principal component) is
$$\frac{\lambda_i}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}.$$
The purpose of principal component analysis is to approximate this expression for $\Sigma$, that is, to find $k \leq p$ such that
$$\Sigma = \sum_{i=1}^{p}\lambda_ie_ie_i' \simeq \sum_{i=1}^{k}\lambda_ie_ie_i'.$$
The proportion of total variation explained by the first $k$ principal components is then
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}.$$
In this expression, take careful note that we can say "first $k$ principal components" because it is written in terms of the eigenvalues, and the eigenvalues are ordered.
It follows that
$$\operatorname{var}(Y_i) = \operatorname{cov}(e_i'X, X'e_i) = e_i'\operatorname{cov}(X, X')e_i = e_i'\Sigma e_i$$
and
$$\operatorname{cov}(Y_i, Y_j) = \operatorname{cov}(e_i'X, X'e_j) = e_i'\operatorname{cov}(X, X')e_j = e_i'\Sigma e_j.$$
The $Y_i$'s are the $p$ principal components for the $X$ variables.
The 1st principal component is the linear combination of the $X$-variables that has the maximum variance, that is, $e_1$ is selected such that it maximizes
$$\operatorname{var}(Y_1) = e_1'\Sigma e_1$$
subject to the constraint
$$e_1'e_1 = 1.$$
This component accounts for as much of the variation as possible in the data.
The 2nd principal component is the linear combination of the $X$-variables that accounts for as much of the remaining variation as possible in the data. The vector $e_2$ is selected such that it maximizes
$$\operatorname{var}(Y_2) = e_2'\Sigma e_2$$
subject to the constraints
$$e_2'e_2 = 1 \quad \text{and} \quad \operatorname{cov}(Y_1, Y_2) = e_1'\Sigma e_2 = 0.$$
The last constraint implies that the principal components $Y_1$ and $Y_2$ are uncorrelated,
$$\operatorname{corr}(Y_1, Y_2) = 0.$$
Similarly, the $i$th principal component is the linear combination of the $X$-variables that accounts for as much of the remaining variation as possible in the data. The vector $e_i$ is selected such that it maximizes $\operatorname{var}(Y_i) = e_i'\Sigma e_i$ subject to the constraints
$$e_i'e_i = 1 \quad \text{for } i = 1, 2, \ldots, p$$
$$\operatorname{cov}(Y_i, Y_j) = e_i'\Sigma e_j = 0 \quad \text{for } i \neq j.$$
This implies that all the principal components are uncorrelated with each other.
Theorem 23 The $p$ principal components for the random vector $X : p \times 1$ are given by
$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_p \end{pmatrix} = \begin{pmatrix} e_1'X \\ e_2'X \\ \vdots \\ e_p'X \end{pmatrix}$$
where $(\lambda_i, e_i)$, $i = 1, 2, \ldots, p$ are the $p$ eigenvalue, eigenvector pairs for $\Sigma = \operatorname{cov}(X, X')$ such that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$, and $e_i'e_i = 1$ for $i = 1, 2, \ldots, p$ and $e_i'e_j = 0$ for $i \neq j$.
3.4.3 Calculation of Principal Components for a Dataset
1. Calculate $S = \frac{1}{N-1}(X - 1_N\bar{x}')'(X - 1_N\bar{x}') = \frac{1}{N-1}X'\left(I_N - \frac{1}{N}1_N1_N'\right)X$.
2. Calculate the eigenvalues and eigenvectors of $S$: $(\hat{\lambda}_i, \hat{e}_i)$, $i = 1, 2, \ldots, p$.
A fairly standard procedure is to use the difference between the variables and their sample means rather than the raw data. This is known as a translation of the random variables. Translation does not affect the interpretations because the variances of the original variables are the same as those of the translated variables.
The results of principal component analysis depend on the measurement scales. Variables with the highest sample variances tend to be emphasized in the first few principal components. Principal component analysis using the covariance function should only be considered if all of the variables have the same units of measurement. Otherwise the variance-covariance matrix of the standardized data, which is equal to the correlation matrix, should be used instead of $S$ in Step 1, that is, use
$$R = D^{-\frac{1}{2}}SD^{-\frac{1}{2}}$$
where
$$D^{-\frac{1}{2}} = \begin{pmatrix} \frac{1}{\sqrt{s_{11}}} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{s_{22}}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sqrt{s_{pp}}} \end{pmatrix}.$$
In this case the estimated value of the $i$th principal component is
$$\hat{Y}_i = \hat{e}_i'Z = \hat{e}_{i1}Z_1 + \hat{e}_{i2}Z_2 + \cdots + \hat{e}_{ip}Z_p$$
where
$$Z_i = \frac{X_i - \bar{X}_i}{\sqrt{s_{ii}}}.$$
Example 19 In the following 3-variate dataset with 10 observations, each observation consists of 3 measurements on a wafer: thickness, horizontal displacement and vertical displacement.
$$X = \begin{pmatrix} 7 & 4 & 3 \\ 4 & 1 & 8 \\ 6 & 3 & 5 \\ 8 & 6 & 1 \\ 8 & 5 & 7 \\ 7 & 2 & 9 \\ 5 & 3 & 3 \\ 9 & 5 & 8 \\ 7 & 4 & 5 \\ 8 & 2 & 2 \end{pmatrix}$$
The purpose is to identify the number of principal components that will account for at least 85% of the variation in the data.
data a;
input x1 x2 x3 @@;
cards;
7 4 3
4 1 8
6 3 5
8 6 1
8 5 7
7 2 9
5 3 3
9 5 8
7 4 5
8 2 2
;
proc princomp out=a;
var x1 x2 x3;
proc print data=a;
run;
Simple Statistics
x1 x2 x3
Mean 6.900000000 3.500000000 5.100000000
StD 1.523883927 1.581138830 2.806737925
Correlation Matrix
x1 x2 x3
x1 1.0000 0.6687 -.1013
x2 0.6687 1.0000 -.2879
x3 -.1013 -.2879 1.0000
Eigenvectors
Prin1 Prin2 Prin3
x1 0.642005 0.384672 -.663217
x2 0.686362 0.097130 0.720745
x3 -.341669 0.917929 0.201666
Obs x1 x2 x3 Prin1 Prin2 Prin3
1 7 4 3 0.51481 -0.63084 0.03351
2 4 1 8 -2.66001 0.06281 0.33089
3 6 3 5 -0.58404 -0.29061 0.15659
4 8 6 1 2.04776 -0.90963 0.36627
5 8 5 7 0.88327 0.99120 0.34154
6 7 2 9 -1.08376 1.20857 -0.44706
7 5 3 3 -0.76187 -1.19712 0.44810
8 9 5 8 1.18284 1.57068 -0.02183
9 7 4 5 0.27135 0.02325 0.17721
10 8 2 2 0.18965 -0.82831 -1.38523
From this we see that the first two principal components will explain
$$\frac{1.7687741 + 0.9270759}{1.7687741 + 0.9270759 + 0.3041499}\times 100 = 89.862\%$$
of the total variation in $X$. The first two principal components when using the correlation matrix are
$$\hat{Y}_1 = 0.642\frac{X_1 - 6.9}{1.523883927} + 0.686\frac{X_2 - 3.5}{1.581138830} - 0.342\frac{X_3 - 5.1}{2.806737925} = 0.642Z_1 + 0.686Z_2 - 0.342Z_3$$
$$\hat{Y}_2 = 0.385\frac{X_1 - 6.9}{1.523883927} + 0.097\frac{X_2 - 3.5}{1.581138830} + 0.918\frac{X_3 - 5.1}{2.806737925} = 0.385Z_1 + 0.097Z_2 + 0.918Z_3.$$
For the first observation, for example,
$$\hat{Y}_2 = 0.385\frac{7 - 6.9}{1.523883927} + 0.097\frac{4 - 3.5}{1.581138830} + 0.918\frac{3 - 5.1}{2.806737925} = -0.63091.$$
Small differences between the output and calculated values are due to rounding.
When using the covariance to do the PCA, the SAS program and output are given below.
proc princomp cov out=a;
var x1 x2 x3;
proc print data=a;
run;
Simple Statistics
x1 x2 x3
Mean 6.900000000 3.500000000 5.100000000
StD 1.523883927 1.581138830 2.806737925
Covariance Matrix
x1 x2 x3
x1 2.322222222 1.611111111 -0.433333333
x2 1.611111111 2.500000000 -1.277777778
x3 -0.433333333 -1.277777778 7.877777778
Eigenvectors
Prin1 Prin2 Prin3
x1 -.137571 0.699037 -.701727
x2 -.250460 0.660889 0.707457
x3 0.958303 0.273080 0.084162
From this we see that the first two principal components will explain
$$\frac{8.27394258 + 3.67612927}{8.27394258 + 3.67612927 + 0.74992815}\times 100 = 94.095\%$$
of the total variation in $X$. For example, for the first observation
$$\hat{Y}_2 = 0.699(7 - 6.9) + 0.661(4 - 3.5) + 0.273(3 - 5.1) = -0.1729.$$
Small differences between the output and calculated values are due to rounding.
Solution 2: Using PROC IML: The covariance and correlation matrices with corresponding eigenvalues and eigen-
vectors can also be calculated in SAS IML.
proc iml;
X={7 4 3,
4 1 8,
6 3 5,
8 6 1,
8 5 7,
7 2 9,
5 3 3,
9 5 8,
7 4 5,
8 2 2};
j=j(10,1,1);
S=1/9*X`*(I(10)-1/10*j*j`)*X;
D_half=sqrt(inv(diag(S)));
R=D_half*S*D_half;
call eigen(ls,es,S);
call eigen(lr,er,R);
print ls es;
print lr er;
ls es
8.2739426 -0.137571 0.6990371 -0.701727
3.6761293 -0.25046 0.6608892 0.707457
0.7499282 0.9583028 0.2730799 0.0841616
lr er
1.7687741 0.6420046 0.3846723 -0.663217
0.9270759 0.6863616 0.0971303 0.720745
0.3041499 -0.341669 0.9179286 0.2016662
Example 20 Biopsy Data on Breast Cancer Patients
This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison. The biopsy data set contains data from 683 patients who had a breast biopsy performed. Each tissue sample was scored according to 9 different characteristics, each on a scale from 1 to 10. Also, for each patient the final outcome (benign/malignant) was known. This data frame contains the following columns:
V1 clump thickness
V2 uniformity of cell size
V3 uniformity of cell shape
V4 marginal adhesion
V5 single epithelial cell size
V6 bare nuclei
V7 bland chromatin
V8 normal nucleoli
V9 mitoses
V10 outcome: "benign" or "malignant"
The following SAS program and output gives the summary statistics for V1-V9 by V10 and the results of a principal
component analysis on V1-V9 using the correlation matrix.
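The PROC PRINCOMP step itself is not reproduced above; a minimal sketch of what it could look like (assuming the biopsy data have already been read into a SAS data set named biopsy, a hypothetical name):

proc princomp data=biopsy out=a;   /* PCA on the correlation matrix (the PROC PRINCOMP default) */
   var v1-v9;
run;

The PROC PRINT step below then lists the original variables together with the first two principal component scores from the output data set a.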
proc print;
var v1--v10 prin1 prin2;
The PRINCOMP Procedure
Observations 683
Variables 9
Simple Statistics
V1 V2 V3 V4 V5
Mean 4.442166911 3.150805271 3.215226940 2.830161054 3.234260615
StD 2.820761319 3.065144856 2.988580818 2.864562190 2.223085456
Simple Statistics
V6 V7 V8 V9
Mean 3.544655930 3.445095168 2.869692533 1.603221083
StD 3.643857160 2.449696573 3.052666407 1.732674146
Correlation Matrix
V1 V2 V3 V4 V5 V6 V7 V8 V9
V1 1.0000 0.6425 0.6535 0.4878 0.5236 0.5931 0.5537 0.5341 0.3510
V2 0.6425 1.0000 0.9072 0.7070 0.7535 0.6917 0.7556 0.7193 0.4608
V3 0.6535 0.9072 1.0000 0.6859 0.7225 0.7139 0.7353 0.7180 0.4413
V4 0.4878 0.7070 0.6859 1.0000 0.5945 0.6706 0.6686 0.6031 0.4189
V5 0.5236 0.7535 0.7225 0.5945 1.0000 0.5857 0.6181 0.6289 0.4806
V6 0.5931 0.6917 0.7139 0.6706 0.5857 1.0000 0.6806 0.5843 0.3392
V7 0.5537 0.7556 0.7353 0.6686 0.6181 0.6806 1.0000 0.6656 0.3460
V8 0.5341 0.7193 0.7180 0.6031 0.6289 0.5843 0.6656 1.0000 0.4338
V9 0.3510 0.4608 0.4413 0.4189 0.4806 0.3392 0.3460 0.4338 1.0000
Eigenvectors
Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7 Prin8 Prin9
V1 0.302063 -.140801 0.866372 0.107828 0.080321 -.242518 -.008516 0.247707 -.002747
V2 0.380793 -.046640 -.019938 -.204255 -.145653 -.139032 -.205434 -.436300 -.733211
V3 0.377583 -.082422 0.033511 -.175866 -.108392 -.074527 -.127209 -.582727 0.667481
V4 0.332724 -.052094 -.412647 0.493173 -.019569 -.654629 0.123830 0.163434 0.046019
V5 0.336234 0.164404 -.087743 -.427384 -.636693 0.069309 0.211018 0.458669 0.066891
V6 0.335068 -.261261 0.000691 0.498618 -.124773 0.609221 0.402790 -.126653 -.076510
V7 0.345747 -.228077 -.213072 0.013047 0.227666 0.298897 -.700417 0.383719 0.062241
V8 0.335591 0.033966 -.134248 -.417113 0.690210 0.021518 0.459783 0.074012 -.022079
V9 0.230206 0.905557 0.080492 0.258988 0.105042 0.148345 -.132117 -.053537 0.007496
The next output shows the values of the …rst and second principal components for the …rst 5 observations.
Obs V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 Prin1 Prin2
This was calculated by using the standardised values and the corresponding eigenvectors in the output, that is,
$$\hat{Y}_1 = \hat{e}_1'Z = 1.46909$$
where
$$\hat{e}_1 = \begin{pmatrix} 0.302063 \\ 0.380793 \\ 0.377583 \\ 0.332724 \\ 0.336234 \\ 0.335068 \\ 0.345747 \\ 0.335591 \\ 0.230206 \end{pmatrix} \quad \text{and} \quad Z = \begin{pmatrix} (1 - 4.44217)/2.82076 \\ (5 - 3.15081)/3.06514 \\ (1 - 3.21523)/2.98858 \\ (1 - 2.83016)/2.86456 \\ (1 - 3.23426)/2.22309 \\ (2 - 3.54466)/3.64386 \\ (1 - 3.44510)/2.44970 \\ (3 - 2.86969)/3.05267 \\ (1 - 1.60322)/1.73267 \end{pmatrix}.$$
Drawing the values of Prin2 against Prin1 by outcome (benign or malignant) gives the following graph. The …rst
two principal components can be used in a logit analysis as covariates to predict the outcome.
[Figure: scatter plot of Prin2 against Prin1, by outcome (benign/malignant)]
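A minimal sketch (an assumed illustration, with V10 as the outcome variable and the scores in data set a as above) of such a logit analysis:

proc logistic data=a;
   model V10(event='malignant') = prin1 prin2;   /* first two principal components as covariates */
run;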