
Multivariate Analysis Lecture Notes

for
Stat 5353

J. D. Tubbs
Department of Mathematical Sciences

Fall Semester 2002


Contents

1 Matrices, Random Variables, and Distributions
  1.1 Some Matrix Algebra
      1.1.1 Special Matrices
      1.1.2 Addition
      1.1.3 Multiplication
      1.1.4 Kronecker or Direct Product
      1.1.5 Partitioned Matrices
      1.1.6 Inverse
      1.1.7 Transpose
      1.1.8 Trace
      1.1.9 Rank
      1.1.10 Quadratic Forms
      1.1.11 Idempotent Matrices
      1.1.12 Orthogonal Matrices
      1.1.13 The Generalized Inverse
      1.1.14 Solution of Linear Equations
  1.2 Eigenvalues and Eigenvectors
      1.2.1 Properties of Eigenvalues–Eigenvectors
  1.3 Random Variables and Vectors
      1.3.1 Univariate Random Variables
      1.3.2 Random Vectors

2 Distributions
  2.1 Normal Distribution
      2.1.1 Univariate Case
      2.1.2 Bivariate Case
      2.1.3 Multivariate Case
      2.1.4 Estimation of Parameters in the Multivariate Case
      2.1.5 Matrix Normal Distribution
  2.2 Other Distributions
      2.2.1 Chi-Square, T and F Distributions
      2.2.2 The Wishart Distribution
      2.2.3 Hotelling's T² Distribution
  2.3 Quadratic Forms of Normal Variables
  2.4 Sampling Distributions of μ̂ and S

3 Assessing the Normality Assumption
  3.1 QQ Plots
  3.2 Chi-Square Probability Plots
  3.3 Example – US Navy Officers
  3.4 Box-Cox Transformations
      3.4.1 US Navy Example with Box-Cox Transformations

4 Multivariate Plots
  4.1 2-Dimensional Plots
      4.1.1 Multiple Scatterplots
      4.1.2 Chernoff Faces
      4.1.3 Star Plots
  4.2 3-Dimensional Plots

5 Inference for the Mean
  5.1 One Population Case
      5.1.1 Univariate Case
      5.1.2 Multivariate Case
      5.1.3 Example – Sweat Data
  5.2 Two Population Case
      5.2.1 Hotelling's Two Sample T-Test

6 The General Linear Model
  6.1 Univariate Case
      6.1.1 Inference
      6.1.2 Estimation of σ²
      6.1.3 ANOVA Table
      6.1.4 Expected Values of the Sums of Squares
      6.1.5 Distribution of the Mean Squares
      6.1.6 Testing Linear Hypotheses
      6.1.7 Constrained Least Squares
  6.2 Multivariate Case
      6.2.1 ANOVA Table
      6.2.2 Testing Linear Hypotheses
      6.2.3 One-Way MANOVA – Examples
      6.2.4 Multivariate Regression – Example

7 Inference for Covariance Matrices
  7.1 One Sample Case
      7.1.1 Univariate Problem
      7.1.2 Multivariate Problem
      7.1.3 SAS IML Example
      7.1.4 Test for Sphericity
      7.1.5 Test for Equicorrelation or Compound Symmetry
      7.1.6 Test for Equality of Several Covariances
      7.1.7 SAS Example

8 Profile Analysis
  8.1 One Sample Case
      8.1.1 Example – One Sample Profile Analysis
  8.2 Two Sample Case
      8.2.1 Example – Two Sample Profile Analysis
  8.3 Mixed Models Theory
      8.3.1 Estimating G and R in the Mixed Model
      8.3.2 Maximum Likelihood Estimation
      8.3.3 Restricted Maximum Likelihood Estimation
      8.3.4 REML Estimation for the Linear Mixed Model
      8.3.5 Estimating β and γ in the Mixed Model
      8.3.6 Model Selection
      8.3.7 Statistical Properties
      8.3.8 Inference and Test Statistics
  8.4 Example using PROC MIXED

9 Principal Component Analysis
  9.1 Eigenvalues and Eigenvectors
      9.1.1 Properties of Eigenvalues–Eigenvectors
  9.2 Principal Components
      9.2.1 Example – FOC Sales

10 Canonical Correlation
  10.1 Examples
      10.1.1 Job Satisfaction – Johnson-Wichern Data
      10.1.2 Police Applicant Files – Johnson Data

11 Factor Analysis
  11.1 SAS – PROC FACTOR
      11.1.1 Overview
      11.1.2 Background
  11.2 Police Applicant Example

12 Classification and Discrimination Analysis
  12.1 Discriminant Analysis
      12.1.1 Two Population Problem
      12.1.2 Multiple Population Problem
      12.1.3 SAS – PROC CANDISC
      12.1.4 SAS – PROC STEPDISC
      12.1.5 SAS – Examples
  12.2 Classification Analysis
      12.2.1 Two Population Case
      12.2.2 Minimizing the Total Cost of Misclassification
      12.2.3 Likelihood Ratio Method
      12.2.4 Maximizing the Posterior Probability
      12.2.5 Minimax Allocation
  12.3 Evaluating Classification Functions
      12.3.1 SAS – Classification Error-Rate Estimates
  12.4 Multiple Population Case
      12.4.1 Parametric Methods
      12.4.2 Nonparametric Methods
      12.4.3 Nearest Neighbor Method
  12.5 SAS – Examples
      12.5.1 Salmon Size Example
  12.6 Classification Trees

13 Cluster Analysis and Multidimensional Scaling
  13.1 Similarity Measures
  13.2 Clustering Methods
      13.2.1 Nonhierarchical Methods
  13.3 Hierarchical Clustering
      13.3.1 Agglomerative Linkage
      13.3.2 Ward's Hierarchical Clustering Method
  13.4 Example
      13.4.1 Splus Example
  13.5 Multidimensional Scaling
      13.5.1 The Basic Algorithm
      13.5.2 SAS – Example
Chapter 1

Matrices, Random Variables, and Distributions

1.1 Some Matrix Algebra


A matrix $A = (a_{ij})$, $i = 1, 2, \ldots, r$, $j = 1, 2, \ldots, c$, is said to be an $r \times c$ matrix. Its transpose is $A' = (a_{ji})$; both are given by

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1c} \\ a_{21} & a_{22} & \cdots & a_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ a_{r1} & a_{r2} & \cdots & a_{rc} \end{pmatrix} \qquad A' = \begin{pmatrix} a_{11} & a_{21} & \cdots & a_{r1} \\ a_{12} & a_{22} & \cdots & a_{r2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1c} & a_{2c} & \cdots & a_{rc} \end{pmatrix}.$$

A vector $x$ is an $n \times 1$ column vector; its transpose $x'$ is the $1 \times n$ row vector given by

$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \qquad x' = (x_1, x_2, \cdots, x_n).$$

1.1.1 Special Matrices


1. $D = \mathrm{diag}(A)$ is the diagonal of the $r \times r$ matrix $A$, given by

$$D = \begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{rr} \end{pmatrix}.$$

2. $T$, the upper triangular matrix (of $A$), is given by

$$T = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1c} \\ 0 & a_{22} & \cdots & a_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{rc} \end{pmatrix}.$$

3. $I_n$ is the $n \times n$ identity matrix, given by

$$I = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}.$$

4. $J$ is an $n \times n$ matrix with each element equal to one.

5. $j$ is an $n \times 1$ vector with each element equal to one, where $J = jj'$.

1.1.2 Addition
C = A ± B is defined as cij = aij ± bij provided both A and B have the same number of rows and columns.
It can easily be shown that (A ± B) ± C = A ± (B ± C) and A + B = B + A.

1.1.3 Multiplication
$C = AB$ is defined as $c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}$, provided $A$ and $B$ are conformable matrices ($A$ is $r \times p$ and $B$ is $p \times c$).
1. Even if both $AB$ and $BA$ are defined, they are not necessarily equal.
2. It follows that $A(B \pm C) = AB \pm AC$.
3. Two vectors $a$ and $b$ are said to be orthogonal, denoted by $a \perp b$, if $a'b = \sum_{i=1}^{n} a_i b_i = 0$.
4. $\|a\|^2 = a'a = \sum_{i=1}^{n} a_i^2$.
5. $\|j\|^2 = j'j = n$.
6. $\sum_{i=1}^{n} a_i = j'a = a'j$.
7. $j'J = nj'$, $Jj = nj$.

1.1.4 Kronecker or Direct Product


If $A$ is $m \times n$ and $B$ is $s \times t$, the direct or Kronecker product of $A$ and $B$, denoted by $A \otimes B$, is an $ms \times nt$ matrix given by

$$A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B \\ a_{21}B & a_{22}B & \cdots & a_{2n}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{pmatrix}.$$

Its properties include the following (a small numerical check follows the list):
1. $(A \otimes B)(C \otimes D) = (AC \otimes BD)$.
2. $(A + B) \otimes (C + D) = (A \otimes C) + (A \otimes D) + (B \otimes C) + (B \otimes D)$.
3. $A \otimes (B \otimes C) = (A \otimes B) \otimes C$.
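These identities are easy to check numerically. The following is a minimal SAS/IML sketch with made-up matrices (in IML the operator @ forms the direct product):

proc iml;
A = {1 2, 3 4};   B = {0 1, 1 0};
C = {2 0, 0 2};   D = {1 1, 0 1};
left  = (A @ B) * (C @ D);   /* (A x B)(C x D)            */
right = (A*C) @ (B*D);       /* (AC) x (BD) -- property 1 */
print left right;            /* the two 4 x 4 matrices agree */
quit;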

1.1.5 Partitioned Matrices
The $r \times c$ matrix $A$ can be partitioned into four submatrices as

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \qquad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix},$$

where $A_{ij}$ is $r_i \times c_j$ and $r = r_1 + r_2$, $c = c_1 + c_2$. Suppose the matrices $A$ and $B$ are such that $AB$ is defined; then

$$AB = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}.$$

1.1.6 Inverse
The $n \times n$ matrix $A$ is said to be nonsingular if there exists a matrix $B$ satisfying $AB = BA = I_n$. $B$ is called the inverse of $A$ and is denoted by $A^{-1}$.
1. $(AB)^{-1} = B^{-1}A^{-1}$, provided these inverses exist.
2. $(I + A)^{-1} = I - A(A + I)^{-1}$.
3. $(A + B)^{-1} = A^{-1} - A^{-1}B(A + B)^{-1} = A^{-1} - A^{-1}(A^{-1} + B^{-1})^{-1}A^{-1}$ and $B(A + B)^{-1}A = (A^{-1} + B^{-1})^{-1}$.
4. $(A^{-1} + B^{-1})^{-1} = A(A + B)^{-1}B$.
5. If $A$ is an $r \times c$ matrix with $\mathrm{rank}[A] = r$, then $(AA')^{-1}$ exists while $A^{-1}$ does not (unless $r = c$). If $\mathrm{rank}[A] = c$ then $(A'A)^{-1}$ exists.
6. $|A|$ is the determinant of the matrix $A$ [note $|A|$ is a scalar]. $|A| \neq 0$ whenever $A^{-1}$ exists. $A$ is said to be singular if its inverse does not exist, in which case $|A| = 0$.
7. $|AB| = |A|\,|B|$.
8. Suppose
$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$$
and the inverse of $A$ exists and is given by
$$A^{-1} = B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix};$$
then (a numerical check of (b)--(e) follows the list)
(a) $B_{11}^{-1}$ and $B_{22}^{-1}$ exist.
(b) $B_{11} = [A_{11} - A_{12}A_{22}^{-1}A_{21}]^{-1} = A_{11}^{-1} + A_{11}^{-1}A_{12}B_{22}A_{21}A_{11}^{-1}$.
(c) $B_{12} = -A_{11}^{-1}A_{12}[A_{22} - A_{21}A_{11}^{-1}A_{12}]^{-1} = -A_{11}^{-1}A_{12}B_{22}$.
(d) $B_{22} = [A_{22} - A_{21}A_{11}^{-1}A_{12}]^{-1} = A_{22}^{-1} + A_{22}^{-1}A_{21}B_{11}A_{12}A_{22}^{-1}$.
(e) $B_{21} = -A_{22}^{-1}A_{21}[A_{11} - A_{12}A_{22}^{-1}A_{21}]^{-1} = -A_{22}^{-1}A_{21}B_{11}$.
(f) $|A| = \frac{|A_{11}|}{|B_{22}|} = \frac{|A_{22}|}{|B_{11}|}$ and $|A_{11}B_{11}| = |A_{22}B_{22}|$.
(g) $|A| = |A_{22}|\,|A_{11} - A_{12}A_{22}^{-1}A_{21}|$.
(h) $|A| = |A_{11}|\,|A_{22} - A_{21}A_{11}^{-1}A_{12}|$.
(i) $(A + cc')^{-1} = A^{-1} - \dfrac{A^{-1}cc'A^{-1}}{1 + c'A^{-1}c}$.
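The block-inverse identities (b)--(e) can be verified numerically; a minimal SAS/IML sketch with a made-up matrix (submatrices are extracted with IML subscript ranges):

proc iml;
A = {4 1 0 1,
     1 3 1 0,
     0 1 3 1,
     1 0 1 5};
B   = inv(A);
A11 = A[1:2,1:2];  A12 = A[1:2,3:4];
A21 = A[3:4,1:2];  A22 = A[3:4,3:4];
B11 = inv(A11 - A12*inv(A22)*A21);          /* identity (b) */
B22 = inv(A22 - A21*inv(A11)*A12);          /* identity (d) */
print (B[1:2,1:2]) B11, (B[3:4,3:4]) B22;   /* matching blocks */
quit;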

1.1.7 Transpose
If $A$ is $r \times c$ then the transpose of $A$, denoted by $A'$, is a $c \times r$ matrix. It follows that
1. $(A')' = A$.
2. $(A \pm B)' = A' \pm B'$.
3. $(AB)' = B'A'$.
4. If $A = A'$ then $A$ is said to be symmetric.
5. $A'A$ and $AA'$ are symmetric.
6. $(A \otimes B)' = A' \otimes B'$.

1.1.8 Trace
Definition 1.1 Suppose that the matrix $A = (a_{ij})$, $i = 1, \ldots, n$, $j = 1, \ldots, n$; then the trace of $A$ is given by $\mathrm{tr}[A] = \sum_{i=1}^{n} a_{ii}$.

Provided the matrices are conformable,

1. $\mathrm{tr}[A] = \mathrm{tr}[A']$.
2. $\mathrm{tr}[A \pm B] = \mathrm{tr}[A] \pm \mathrm{tr}[B]$.
3. $\mathrm{tr}[AB] = \mathrm{tr}[BA]$.
4. $\mathrm{tr}[ABC] = \mathrm{tr}[CAB] = \mathrm{tr}[BCA]$.
5. $\mathrm{tr}[A \otimes B] = \mathrm{tr}[A]\,\mathrm{tr}[B]$.

For a square matrix $A$, if one can write $Ax = \lambda x$ for some non-null vector $x$, then $\lambda$ is called a characteristic value (eigenvalue or latent root) of $A$, and $x$ is called the corresponding characteristic vector (eigenvector or latent vector).

If $A$ is a symmetric $n \times n$ matrix with eigenvalues $\lambda_i$ for $i = 1, 2, \ldots, n$, then (see the sketch after this list)

6. $\mathrm{tr}[A] = \sum_{i=1}^{n} \lambda_i$.
7. $\mathrm{tr}[A^s] = \sum_{i=1}^{n} \lambda_i^s$.
8. $\mathrm{tr}[A^{-1}] = \sum_{i=1}^{n} \lambda_i^{-1}$, for $A$ nonsingular.
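A minimal SAS/IML check of properties 6--8 on a made-up symmetric matrix:

proc iml;
A = {4 1 0,
     1 3 1,
     0 1 2};
call eigen(lambda, V, A);      /* eigenvalues of a symmetric matrix */
t1 = trace(A);       s1 = sum(lambda);       /* property 6          */
t2 = trace(A*A);     s2 = sum(lambda##2);    /* property 7, s = 2   */
t3 = trace(inv(A));  s3 = sum(1/lambda);     /* property 8          */
print t1 s1, t2 s2, t3 s3;
quit;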

1.1.9 Rank
Suppose that $A$ is an $r \times c$ matrix. Its rows $a_1, a_2, \ldots, a_r$ are said to be linearly independent if no $a_i$ can be expressed as a linear combination of the remaining $a_i$'s, that is, there does not exist a non-null vector $c = (c_1, c_2, \ldots, c_r)'$ such that $\sum_{i=1}^{r} c_i a_i = 0$. It can be shown that the number of linearly independent rows equals the number of linearly independent columns of any matrix $A$, and that number is the rank of the matrix. If the rank of $A$ is $r$ then the matrix $A$ is said to be of full row rank; if the rank of $A$ is $c$ then $A$ is said to be of full column rank.
1. $\mathrm{rank}[A] = 0$ if and only if $A = 0$.
2. $\mathrm{rank}[A] = \mathrm{rank}[A']$.
3. $\mathrm{rank}[A] = \mathrm{rank}[A'A] = \mathrm{rank}[AA']$.
4. $\mathrm{rank}[AB] \leq \min\{\mathrm{rank}[A], \mathrm{rank}[B]\}$.
5. If $A$ is any matrix, and $P$ and $Q$ are any conformable nonsingular matrices, then $\mathrm{rank}[PAQ] = \mathrm{rank}[A]$.
6. If $A$ is $r \times c$ with rank $r$, then $AA'$ is nonsingular ($(AA')^{-1}$ exists and $\mathrm{rank}[AA'] = r$). If the rank of $A$ is $c$, then $A'A$ is nonsingular ($(A'A)^{-1}$ exists and $\mathrm{rank}[A'A] = c$).
7. If A is symmetric, then rank[A] is equal to the number of nonzero eigenvalues.

1.1.10 Quadratic Forms


Let $A$ be a symmetric $n \times n$ matrix and $x = (x_1, x_2, \ldots, x_n)'$ a vector. Then $q = x'Ax$ is called a quadratic form of $A$. The quadratic form is a second degree polynomial in the $x_i$'s.
1. A symmetric matrix $A$ is said to be positive semidefinite (p.s.d.) if and only if $q = x'Ax \geq 0$ for all $x$.
(a) The eigenvalues of p.s.d. matrices are nonnegative.
(b) If $A$ is p.s.d. then $\mathrm{tr}[A] \geq 0$.
(c) $A$ is p.s.d. of rank $r$ if and only if there exists an $n \times n$ matrix $R$ of rank $r$ such that $A = RR'$.
(d) If $A$ is an $n \times n$ p.s.d. matrix of rank $r$, then there exists an $n \times r$ matrix $S$ of rank $r$ such that $S'AS = I_r$.
(e) If $A$ is p.s.d., then $X'AX = 0 \Rightarrow AX = 0$.
2. A symmetric matrix $A$ is said to be positive definite (p.d.) if and only if $q = x'Ax > 0$ for all $x \neq 0$.
(a) The eigenvalues of p.d. matrices are positive.
(b) $A$ is p.d. if and only if there exists a nonsingular matrix $R$ such that $A = RR'$.
(c) If $A$ is p.d. then so is $A^{-1}$.
(d) If $A$ is p.d. then $\mathrm{rank}[CAC'] = \mathrm{rank}[C]$.
(e) If $A$ is an $n \times n$ p.d. matrix and $C$ is a $p \times n$ matrix of rank $p$, then $CAC'$ is p.d.
(f) If $X$ is $n \times p$ of rank $p$ then $X'X$ is p.d.
(g) $A$ is p.d. if and only if all the leading minor determinants of $A$ are positive.
(h) The diagonal elements of a p.d. matrix are all positive.
(i) (Cholesky decomposition) If $A$ is p.d. there exists a unique upper triangular matrix $T$ with positive diagonal elements such that $A = T'T$.

1.1.11 Idempotent Matrices


A matrix $P$ is said to be idempotent if $P^2 = P$. A symmetric idempotent matrix is called a projection matrix.
1. If P is symmetric, then P is idempotent and of rank r if and only if it has r eigenvalues equal to unity
and n − r eigenvalues equal to zero.
2. If $P$ is a projection matrix then $\mathrm{tr}[P] = \mathrm{rank}[P]$.
3. If P is idempotent, so is I − P .
4. Projection matrices are positive semidefinite.

1.1.12 Orthogonal Matrices
An $n \times n$ matrix $A$ is said to be orthogonal if and only if $A^{-1} = A'$. If $A$ is orthogonal then
1. $-1 \leq a_{ij} \leq 1$ for every element $a_{ij}$.
2. $AA' = A'A = I_n$.
3. $|A| = \pm 1$.

Vector and Matrix Differentiation


Let $X$ be an $n \times m$ matrix with elements $x_{ij}$; then if $f(X)$ is a function of the elements of $X$, we define

$$\frac{df}{dX} = \left( \frac{df}{dx_{ij}} \right).$$

Then (a finite-difference check of rule 2 follows this list)

1. $\dfrac{d(\beta'a)}{d\beta} = a$.
2. $\dfrac{d(\beta'A\beta)}{d\beta} = 2A\beta$ ($A$ symmetric).
3. If $f(X) = a'Xb$, then $\dfrac{df}{dX} = ab'$.
4. If $f(X) = \mathrm{tr}[AXB]$, then $\dfrac{df}{dX} = A'B'$.
5. If $X$ is symmetric and $f(X) = a'Xb$, then $\dfrac{df}{dX} = ab' + ba' - \mathrm{diag}(ab')$.
6. If $X$ is symmetric and $f(X) = \mathrm{tr}[AXB]$, then $\dfrac{df}{dX} = A'B' + BA - \mathrm{diag}(BA)$.
7. If $X$ is $n \times n$ and $f(X) = \mathrm{tr}(X)$, then $\dfrac{df}{dX} = I_n$.
8. If $X$ is $n \times n$ and $f(X) = \mathrm{tr}(X'AX)$, then $\dfrac{df}{dX} = (A + A')X = 2AX$ when $A = A'$.
9. If $X$ and $A$ are symmetric and $f(X) = \mathrm{tr}[AXAX]$, then $\dfrac{df}{dX} = 2AXA$.

Suppose that $Y$ is a $k \times p$ matrix; then the partial derivative of $Y$ with respect to $x_{ij}$ is

$$\frac{\partial Y}{\partial x_{ij}} = \begin{pmatrix} \frac{\partial y_{11}}{\partial x_{ij}} & \frac{\partial y_{12}}{\partial x_{ij}} & \cdots & \frac{\partial y_{1p}}{\partial x_{ij}} \\ \frac{\partial y_{21}}{\partial x_{ij}} & \frac{\partial y_{22}}{\partial x_{ij}} & \cdots & \frac{\partial y_{2p}}{\partial x_{ij}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_{k1}}{\partial x_{ij}} & \frac{\partial y_{k2}}{\partial x_{ij}} & \cdots & \frac{\partial y_{kp}}{\partial x_{ij}} \end{pmatrix}.$$

1. $\dfrac{\partial |X|}{\partial X} = [X_{ij}] = (\mathrm{adj}\,X)'$ for $|X| \neq 0$, where $X_{ij}$ is the cofactor of $x_{ij}$.
2. If $X$ is symmetric, then $\dfrac{\partial |X|}{\partial X} = 2[X_{ij}] - \mathrm{diag}[X_{ij}]$ for $|X| \neq 0$.
3. $\dfrac{\partial \log(|X|)}{\partial X} = (X^{-1})'$ for $|X| \neq 0$.
4. If $X$ is symmetric, then $\dfrac{\partial \log(|X|)}{\partial X} = 2X^{-1} - \mathrm{diag}[X^{-1}]$ for $|X| \neq 0$.
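As a sanity check, rule 2 can be compared with a central finite difference; a minimal SAS/IML sketch with made-up values:

proc iml;
A    = {2 1, 1 3};                /* symmetric */
beta = {1, 2};
g = 2*A*beta;                     /* analytic gradient of beta`*A*beta */
h = 1e-6;
fd = j(2, 1, 0);
do i = 1 to 2;
   e = j(2, 1, 0);  e[i] = h;
   fd[i] = ((beta+e)`*A*(beta+e) - (beta-e)`*A*(beta-e)) / (2*h);
end;
print g fd;                       /* columns agree to rounding error */
quit;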

1.1.13 The Generalized Inverse
A matrix $B$ is said to be a generalized inverse of $A$ if it satisfies $ABA = A$. The generalized inverse of $A$ is denoted by $A^-$. If $A$ is nonsingular then $A^- = A^{-1}$. If $A$ is singular then $A^-$ exists but is not unique.
1. If $A$ is an $r \times c$ matrix of rank $c$, then a generalized inverse of $A$ is $A^- = (A'A)^{-1}A'$.
2. If $A$ is an $r \times c$ matrix of rank $r$, then a generalized inverse of $A$ is $A^- = A'(AA')^{-1}$.
3. If $A$ is an $r \times c$ matrix of rank $c$, then $A(A'A)^- A'$ is symmetric, idempotent, of rank $c$, and unique (invariant to the choice of $(A'A)^-$).

1.1.14 Solution of Linear Equations


A system of linear equations $Ax = b$ is said to be consistent if it has at least one solution. When the system is consistent, a solution can be expressed as $\tilde{x} = A^- b$. If $A$ is nonsingular then $\tilde{x} = A^{-1}b$ is the unique solution.
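In SAS/IML the GINV function returns the Moore-Penrose generalized inverse, which is one particular choice of A-. A minimal sketch with a made-up consistent system:

proc iml;
A = {1 2,
     2 4,
     1 0};                 /* 3 x 2, rank 2 (full column rank)     */
b = {5, 10, 1};            /* chosen so that Ax = b is consistent  */
xtilde = ginv(A)*b;        /* a solution x~ = A- b (here {1, 2})   */
check  = A*ginv(A)*A;      /* reproduces A, since A A- A = A       */
print xtilde, check;
quit;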

1.2 Eigenvalues and Eigenvectors


Suppose that $A$ is an $n \times n$ matrix. Is there a nonzero vector $\vec{x}$ that $A$ transforms into a constant multiple of itself? That is, does there exist a vector $\vec{x}$ satisfying

$$A\vec{x} = \lambda \vec{x}$$

for some constant $\lambda$? If so, then

$$(A - \lambda I_n)\vec{x} = 0$$

for some $\vec{x} \neq 0$, and it follows that

$$|A - \lambda I_n| = 0,$$

which can be expanded as

$$\sum_{i=0}^{n} a_i \lambda^i = 0.$$

This last equation is called the characteristic equation of an $n \times n$ matrix $A$. The matrix $A$ has possibly $n$ roots or solutions for the value $\lambda$ in the characteristic equation. These solutions are called the characteristic values, characteristic roots, or eigenvalues of the matrix $A$. Suppose that $\lambda_1$ is a solution and

$$A\vec{x}_1 = \lambda_1 \vec{x}_1, \qquad \vec{x}_1 \neq 0;$$

then $\vec{x}_1$ is said to be a characteristic vector or eigenvector of $A$ corresponding to the eigenvalue $\lambda_1$. Note: the eigenvalues may or may not be real numbers.
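Eigenvalues and eigenvectors are computed in SAS/IML with the EIGEN call; a minimal sketch using a made-up symmetric matrix (so the eigenvalues are real):

proc iml;
A = {4 1,
     1 3};
call eigen(lambda, V, A);        /* columns of V are the eigenvectors   */
resid = A*V - V*diag(lambda);    /* A x_i = lambda_i x_i, so resid ~ 0  */
print lambda, V, resid;
quit;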

1.2.1 Properties of Eigenvalues–Eigenvectors


1. The $n \times n$ matrix $A$ has at least one eigenvalue equal to zero if and only if $A$ is singular.
2. $\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i$.
3. $|A| = \prod_{i=1}^{n} \lambda_i$.
4. $\mathrm{rank}(A)$ equals the number of nonzero eigenvalues of $A$.
5. If $A$ is symmetric and its eigenvalues are all either $+1$ or $-1$, then $A$ is an orthogonal matrix.
6. The matrices $A$, $C^{-1}AC$, and $CAC^{-1}$ have the same set of eigenvalues for any nonsingular matrix $C$.
7. The matrices $A$ and $A'$ have the same set of eigenvalues but need not have the same eigenvectors.
8. Let $A$ be a nonsingular matrix with eigenvalue $\lambda$; then $1/\lambda$ is an eigenvalue of $A^{-1}$.
9. The eigenvectors are not unique, for if $\vec{x}_1$ is an eigenvector corresponding to $\lambda_1$ then $c\vec{x}_1$ is also an eigenvector, since $A(c\vec{x}_1) = \lambda_1 (c\vec{x}_1)$, for any nonzero value $c$.
10. Let $A$ be an $n \times n$ real matrix; then there exists a nonsingular, complex matrix $Q$ such that $Q^{-1}AQ = T$, where $T$ is a complex, upper triangular matrix, and the eigenvalues of $A$ are the diagonal elements of the matrix $T$.
11. Let $P$ be an idempotent matrix; then the eigenvalues of $P$ are either 1 or 0. The only nonsingular idempotent matrix is the identity matrix.
12. Suppose that $A$ is a symmetric matrix:
(a) The eigenvalues of $A$ are real numbers.
(b) For each eigenvalue there exists a real eigenvector (each element is a real number).
(c) Let $\lambda_1 \neq \lambda_2$ be eigenvalues of $A$ with corresponding eigenvectors $\vec{x}_1$ and $\vec{x}_2$; then $\vec{x}_1$ and $\vec{x}_2$ are orthogonal vectors, that is, $\vec{x}_1'\vec{x}_2 = 0$.
(d) There exists an orthogonal matrix $P$ ($P'P = PP' = I_n$) such that $P'AP = D$, where $D$ is a diagonal matrix whose diagonal elements are the eigenvalues of the matrix $A$.
13. Let $A$ be a symmetric matrix of order $n$ and $B$ a p.d. matrix, where $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$ are the roots of $|A - \lambda B| = 0$. Then for any $\vec{y} \neq 0$,
(a) $\lambda_n \leq y'Ay/y'By \leq \lambda_1$.
(b) $\min_{y \neq 0}(y'Ay/y'By) = \lambda_n$.
(c) $\max_{y \neq 0}(y'Ay/y'By) = \lambda_1$.
14. Properties of the roots of $|H - \lambda E| = 0$:
(a) The roots of $|E - \nu(H + E)| = 0$ are related to the roots of $|H - \lambda E| = 0$ by
$$\lambda_i = \frac{1 - \nu_i}{\nu_i}, \qquad \nu_i = (1 + \lambda_i)^{-1}.$$
(b) The roots of $|H - \theta(H + E)| = 0$ are related to the roots of $|H - \lambda E| = 0$ by
$$\lambda_i = \frac{\theta_i}{1 - \theta_i}, \qquad \theta_i = \frac{\lambda_i}{1 + \lambda_i}.$$
(c) The roots of $|E - \nu(H + E)| = 0$ satisfy
$$\nu_i = 1 - \theta_i.$$

1.3 Random Variables and Vectors
1.3.1 Univariate Random Variables
Expectations
Let $U$ denote a random variable with expectation $E(U)$ and $Var(U) = E[(U - E(U))^2]$. Let $a$ and $b$ denote any constants; then we have
1. $E(aU \pm b) = aE(U) \pm b$.
2. $Var(aU \pm b) = a^2 Var(U)$.
Suppose that $t(x)$ is a statistic that is used to estimate a parameter $\theta$. If $E(t(x)) = \theta$, the statistic is said to be an unbiased estimate of $\theta$. If $E(t(x)) = \eta \neq \theta$ then $t(x)$ is biased, with bias given by $\mathrm{Bias} = (\theta - \eta)$, in which case the mean square error is given by

$$MSE(t) = E(t(x) - \theta)^2 = Var(t(x)) + \mathrm{Bias}^2.$$

Covariance
Let $U$ and $V$ denote two random variables with respective means $\mu_u$ and $\mu_v$. The covariance between the two random variables is defined by

$$Cov(U, V) = E[(U - \mu_u)(V - \mu_v)] = E(UV) - \mu_u \mu_v.$$

If $U$ and $V$ are independent then $Cov(U, V) = 0$. One also has the following:

1. $Cov(aU \pm b, cV \pm d) = ac\,Cov(U, V)$.
2. $-1 \leq Corr(U, V) = \rho = \dfrac{Cov(U, V)}{[Var(U)Var(V)]^{1/2}} \leq 1$.

Linear Combinations
Suppose that one has $n$ random variables $u_1, u_2, \ldots, u_n$ and one defines

$$u = \sum_{i=1}^{n} a_i u_i,$$

where $E(u_i) = \mu_i$, $Var(u_i) = \sigma_i^2$, and $cov(u_i, u_j) = \sigma_{ij}$ when $i \neq j$. Then

1. $E(u) = \sum_{i=1}^{n} a_i \mu_i$,
2. $Var(u) = \sum_{i=1}^{n} a_i^2 \sigma_i^2 + \sum\sum_{i \neq j} a_i a_j \sigma_{ij}$.

1.3.2 Random Vectors


Let $\vec{U} = (u_1, u_2, \ldots, u_n)'$ denote an $n$-dimensional vector of random variables. Then the expected value of $\vec{U}$ is given by

$$E(\vec{U}) = (E(u_1), E(u_2), \ldots, E(u_n))'.$$

The covariance matrix is an $n \times n$ matrix given by

$$cov(\vec{U}) = E[(\vec{U} - E(\vec{U}))(\vec{U} - E(\vec{U}))'] = \Sigma = (\sigma_{ij}),$$

where $\sigma_{ij} = cov(u_i, u_j)$.

Properties for Σ
1. $\Sigma$ is symmetric and at least a p.s.d. $n \times n$ matrix.
2. $E(\vec{U}\vec{U}') = \Sigma + E(\vec{U})E(\vec{U})'$.
3. $cov(\vec{U} + d) = cov(\vec{U})$.
4. $\mathrm{tr}[cov(\vec{U})] = \mathrm{tr}\,E[(\vec{U} - E(\vec{U}))(\vec{U} - E(\vec{U}))'] = E[(\vec{U} - E(\vec{U}))'(\vec{U} - E(\vec{U}))] = \sum_{i=1}^{n} \sigma_{ii}$ is the total variance of $\vec{U}$.

Suppose that $A$ is an $r \times n$ matrix and one defines $\vec{V} = A\vec{U} \pm b$; then

5. $E(\vec{V}) = AE(\vec{U}) \pm b$.
6. $cov(\vec{V}) = A\,cov(\vec{U})\,A' = A\Sigma A'$. Note $cov(\vec{V})$ is an $r \times r$ symmetric and at least p.s.d. matrix.

Suppose that $C$ is an $s \times n$ matrix and one defines $\vec{W} = C\vec{U} \pm d$; then

7. $cov(\vec{V}, \vec{W}) = A\Sigma C'$. Note $cov(\vec{V}, \vec{W})$ is an $r \times s$ matrix.
8. Define $\Sigma^{1/2}$ as
$$\Sigma^{1/2} = \mathrm{diag}(\sqrt{\sigma_{11}}, \sqrt{\sigma_{22}}, \ldots, \sqrt{\sigma_{nn}})$$
and the correlation matrix as
$$\rho = (\Sigma^{1/2})^{-1} \Sigma (\Sigma^{1/2})^{-1}.$$
9. Let $\vec{U}_1 = (u_1, u_2, \ldots, u_q)'$ and $\vec{U}_2 = (u_{q+1}, u_{q+2}, \ldots, u_n)'$; then
$$\vec{\mu} = E(\vec{U}) = (\vec{\mu}_1 \,\vdots\, \vec{\mu}_2)' = (\mu_1, \mu_2, \ldots, \mu_q \,\vdots\, \mu_{q+1}, \mu_{q+2}, \ldots, \mu_n)'$$
and
$$Cov(\vec{U}) = \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
where
$$\Sigma_{11} = \begin{pmatrix} \sigma_{1,1} & \sigma_{1,2} & \cdots & \sigma_{1,q} \\ \sigma_{2,1} & \sigma_{2,2} & \cdots & \sigma_{2,q} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{q,1} & \sigma_{q,2} & \cdots & \sigma_{q,q} \end{pmatrix},$$
$$\Sigma_{22} = \begin{pmatrix} \sigma_{q+1,q+1} & \sigma_{q+1,q+2} & \cdots & \sigma_{q+1,n} \\ \sigma_{q+2,q+1} & \sigma_{q+2,q+2} & \cdots & \sigma_{q+2,n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n,q+1} & \sigma_{n,q+2} & \cdots & \sigma_{n,n} \end{pmatrix},$$
$$\Sigma_{12} = \Sigma_{21}' = \begin{pmatrix} \sigma_{1,q+1} & \sigma_{1,q+2} & \cdots & \sigma_{1,n} \\ \sigma_{2,q+1} & \sigma_{2,q+2} & \cdots & \sigma_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{q,q+1} & \sigma_{q,q+2} & \cdots & \sigma_{q,n} \end{pmatrix}.$$
10. The random vectors $\vec{U}_1$ and $\vec{U}_2$ are uncorrelated if and only if $\Sigma_{12} = 0$.

Chapter 2

Distributions

2.1 Normal Distribution


2.1.1 Univariate Case
The normal density function for $y \sim N(\mu, \sigma^2)$ is given by

$$f_y(y) = k \exp\left[-\frac{(y - \mu)^2}{2\sigma^2}\right], \quad -\infty < y < \infty,$$

where $E(y) = \mu$, $var(y) = \sigma^2$, and $k$ is the normalizing constant given by $k = (2\pi\sigma^2)^{-1/2}$.
In order to estimate the parameters $\mu$ and $\sigma^2$ using a sample of size $n$, given by $y_1, y_2, \ldots, y_n$, one defines the likelihood function

$$L(y_1, y_2, \ldots, y_n \mid \mu, \sigma^2) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left[-\frac{(y_i - \mu)^2}{2\sigma^2}\right] = (2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2\right].$$

In order to find the values $\hat{\mu}$ and $\hat{\sigma}^2$ which maximize $L(\cdot \mid \cdot)$, one defines the log likelihood function $l = \log(L)$, takes the partial derivatives with respect to $\mu$ and $\sigma^2$, sets these to zero, and solves. One obtains

$$\hat{\mu} = \bar{y}$$

and

$$\hat{\sigma}^2 = n^{-1} \sum_{i=1}^{n} (y_i - \bar{y})^2.$$
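A minimal SAS/IML sketch of these estimates on a made-up sample (note the divisor n for the maximum likelihood estimate of sigma-squared, not n - 1):

proc iml;
y = {4.2, 5.1, 3.8, 4.9, 5.6};    /* made-up sample              */
n = nrow(y);
muhat   = y[:];                    /* MLE of mu: the sample mean  */
sig2hat = ssq(y - muhat)/n;        /* MLE of sigma^2: divisor n   */
print muhat sig2hat;
quit;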

2.1.2 Bivariate Case


Suppose that $\vec{y} = (y_1, y_2)'$; then the density function for $\vec{y} \sim N_2(\vec{\mu}, \Sigma)$ is given by

$$f(\vec{y}) = k \exp[Q],$$

where $k = (2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2})^{-1}$,

$$Q = \frac{-1}{2(1 - \rho^2)}\left[\left(\frac{y_1 - \mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{y_1 - \mu_1}{\sigma_1}\right)\left(\frac{y_2 - \mu_2}{\sigma_2}\right) + \left(\frac{y_2 - \mu_2}{\sigma_2}\right)^2\right],$$

$\vec{\mu} = (\mu_1, \mu_2)'$, and
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},$$
where $E(y_i) = \mu_i$, $Var(y_i) = \sigma_i^2$, $i = 1, 2$, and $corr(y_1, y_2) = \rho$.
By letting $z_i = \frac{y_i - \mu_i}{\sigma_i}$, for $i = 1, 2$, $Q$ can be written as

$$Q = \frac{-1}{2(1 - \rho^2)}[z_1^2 - 2\rho z_1 z_2 + z_2^2].$$

The quadratic in $Q$ defines an ellipse: the contours of equal density are ellipses with major axis $\pm c\sqrt{\lambda_1}\,e_1$ and minor axis $\pm c\sqrt{\lambda_2}\,e_2$, where $\lambda_1 \geq \lambda_2$ are the eigenvalues of $\Sigma$ and $e_1, e_2$ are the corresponding eigenvectors with $e_1'e_1 = e_2'e_2 = 1$ and $e_1'e_2 = 0$ (see the sketch below).
It can be shown that

$$\Pr[Q \leq \chi_2^2(\alpha)] = (1 - \alpha),$$

where $\chi_2^2(\alpha)$ is the critical point from a chi-square distribution with 2 degrees of freedom.
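The ellipse axes come directly from the eigen-decomposition of Sigma. A minimal SAS/IML sketch, with assumed toy values sigma1 = 2, sigma2 = 1, rho = 0.6:

proc iml;
sig1 = 2;  sig2 = 1;  rho = 0.6;             /* assumed toy parameters */
Sigma = (sig1##2       || rho#sig1#sig2) //
        (rho#sig1#sig2 || sig2##2      );
call eigen(lambda, E, Sigma);  /* lambda[1] >= lambda[2]; columns of E are e1, e2 */
print lambda E;  /* major axis along E[,1], half-length proportional to sqrt(lambda[1]) */
quit;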

2.1.3 Multivariate Case


Let $\vec{y} = (y_1, y_2, \ldots, y_p)'$ denote a $p$-dimensional vector with density function given by

$$f(\vec{y}) = f(y_1, y_2, \ldots, y_p) = k \exp[-\tfrac{1}{2}(\vec{y} - \vec{\mu})' \Sigma^{-1} (\vec{y} - \vec{\mu})];$$

then $\vec{y}$ is said to have a $p$-dimensional multivariate normal distribution with mean $\vec{\mu}$ and covariance matrix $\Sigma$, provided $\Sigma$ is nonsingular. This is denoted by $\vec{y} \sim N_p(\mu, \Sigma)$.
1. $k = (2\pi)^{-p/2} |\Sigma|^{-1/2}$ is the normalizing constant, where $|\Sigma|$ is the determinant of $\Sigma$.
2. $E(y) = \vec{\mu} = (\mu_1, \mu_2, \ldots, \mu_p)'$ and $cov(y) = \Sigma$.

Properties of the Multivariate Normal Distribution


1. Normality of linear combinations of variables:
(a) Suppose that $\vec{y} \sim N_p(\vec{\mu}, \Sigma)$ and $A$ is $r \times p$. Then it follows that $\vec{u} = A\vec{y} \pm \vec{b} \sim N_r(\vec{\mu}_u = A\vec{\mu} \pm \vec{b},\ \Sigma_u = A\Sigma A')$ provided $A\Sigma A'$ is nonsingular (i.e., $\mathrm{rank}(A) = r \leq p$, so that $A$ is of full row rank).
2. If $\vec{y} \sim N_p(\vec{\mu}, \Sigma)$ then $\vec{z} \sim N_p(0, I_p)$ where
$$z = (T')^{-1}(\vec{y} - \mu) = (\Sigma^{1/2})^{-1}(\vec{y} - \mu),$$
with $\Sigma = T'T$ the Cholesky decomposition, or $\Sigma = \Sigma^{1/2}\Sigma^{1/2}$.
3. $Q = (\vec{y} - \vec{\mu})'\Sigma^{-1}(\vec{y} - \vec{\mu}) = \vec{z}'\vec{z} \sim \chi_p^2$, where $\chi_p^2$ is a chi-square with $p$ degrees of freedom. $Q$ is called the Mahalanobis distance for the vector $y$.

4. Conditional distribution:
Suppose that $\vec{y}_1 = (y_1, y_2, \ldots, y_q)'$ and $\vec{y}_2 = (y_{q+1}, y_{q+2}, \ldots, y_p)'$ and $(\vec{y}_1, \vec{y}_2)' \sim N_p(\mu, \Sigma)$, where
$$\mu = (\mu_1 \,\vdots\, \mu_2)' = (\mu_1, \mu_2, \ldots, \mu_q \,\vdots\, \mu_{q+1}, \mu_{q+2}, \ldots, \mu_p)'$$
and
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
where
$$\Sigma_{11} = \begin{pmatrix} \sigma_{1,1} & \sigma_{1,2} & \cdots & \sigma_{1,q} \\ \sigma_{2,1} & \sigma_{2,2} & \cdots & \sigma_{2,q} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{q,1} & \sigma_{q,2} & \cdots & \sigma_{q,q} \end{pmatrix}, \quad
\Sigma_{22} = \begin{pmatrix} \sigma_{q+1,q+1} & \sigma_{q+1,q+2} & \cdots & \sigma_{q+1,p} \\ \sigma_{q+2,q+1} & \sigma_{q+2,q+2} & \cdots & \sigma_{q+2,p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p,q+1} & \sigma_{p,q+2} & \cdots & \sigma_{p,p} \end{pmatrix},$$
$$\Sigma_{12} = \Sigma_{21}' = \begin{pmatrix} \sigma_{1,q+1} & \sigma_{1,q+2} & \cdots & \sigma_{1,p} \\ \sigma_{2,q+1} & \sigma_{2,q+2} & \cdots & \sigma_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{q,q+1} & \sigma_{q,q+2} & \cdots & \sigma_{q,p} \end{pmatrix};$$
then $(Y_1 \mid Y_2 = \vec{y}_2) \sim N_q(\mu_{1 \cdot 2}, \Sigma_{1 \cdot 2})$, where
$$\mu_{1 \cdot 2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(\vec{y}_2 - \mu_2),$$
$$\Sigma_{1 \cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$
The matrix $\Sigma_{12}\Sigma_{22}^{-1}$ is called the matrix of regression coefficients because it relates $E(\vec{y}_1 \mid \vec{y}_2)$ to $\vec{y}_2$.

5. Independence:
(a) The above vectors $\vec{y}_1$ and $\vec{y}_2$ are independent if and only if $\Sigma_{12} = \Sigma_{21} = 0$.
(b) If the vectors $\vec{y}_1$ and $\vec{y}_2$ are independent then
$$\vec{y}_1 \pm \vec{y}_2 \sim N_q(\vec{\mu}_1 \pm \vec{\mu}_2, \Sigma_{11} + \Sigma_{22})$$
when both vectors have the same dimension (that is, when $\vec{y}_1 \pm \vec{y}_2$ is defined).
6. Partial correlation:
The partial correlation between two variables $y_i$ and $y_j$ given that $Y_2 = \vec{y}_2$ is given by
$$\rho_{i,j \mid Y_2 = \vec{y}_2} = \frac{\sigma_{i,j \mid Y_2 = \vec{y}_2}}{\sqrt{\sigma_{i,i \mid Y_2 = \vec{y}_2}}\,\sqrt{\sigma_{j,j \mid Y_2 = \vec{y}_2}}},$$
where $\sigma_{i,j \mid Y_2 = \vec{y}_2}$ is the $ij$th element of the matrix $\Sigma_{1 \cdot 2}$, for $i \neq j = 1, 2, \ldots, q$.

2.1.4 Estimation of Parameters in the Multivariate Case


Suppose that $\mu$ and $\Sigma$ are unknown and one observes $n$ observations from an $N_p(\mu, \Sigma)$ given by

$$Y = \begin{pmatrix} \vec{y}_1' \\ \vec{y}_2' \\ \vdots \\ \vec{y}_n' \end{pmatrix} = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1p} \\ y_{21} & y_{22} & \cdots & y_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{np} \end{pmatrix}.$$

The likelihood function is given by

$$L = L(\mu, \Sigma \mid Y) = (2\pi)^{-np/2} |\Sigma|^{-n/2} \exp\left[-\tfrac{1}{2}\sum_{i=1}^{n} (\vec{y}_i - \mu)' \Sigma^{-1} (\vec{y}_i - \mu)\right].$$

Taking the derivative of the log likelihood with respect to $\mu$ and setting it to zero gives

$$\sum_{i=1}^{n} \Sigma^{-1}(\vec{y}_i - \mu) = 0,$$

which gives (since $\Sigma$ is nonsingular)

$$\hat{\mu} = \bar{y} = \frac{1}{n} Y'j = \sum_{i=1}^{n} \vec{y}_i / n.$$

Now using this value and taking the derivative with respect to $\Sigma$, one has

$$\hat{\Sigma} = \sum_{i=1}^{n} (\vec{y}_i - \bar{y})(\vec{y}_i - \bar{y})'/n = [(n - 1)/n]S,$$

where

$$(n - 1)S = Y'[I - j(j'j)^{-1}j']Y = Y'Y - n\bar{y}\bar{y}'$$

and $j$ is an $n \times 1$ vector of ones. A short numerical sketch of these formulas follows.
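A minimal SAS/IML sketch of these formulas; it assumes some n x p SAS data set (the USNAVY2 set built in Chapter 3 would do):

proc iml;
use usnavy2;  read all into Y;         /* n x p data matrix                   */
n = nrow(Y);
jv = j(n, 1, 1);                       /* n x 1 vector of ones                */
ybar = Y`*jv / n;                      /* muhat                               */
Scp  = Y`*(i(n) - jv*jv`/n)*Y;         /* (n-1)S = Y'[I - j(j'j)^{-1}j']Y     */
S      = Scp/(n-1);                    /* sample covariance matrix            */
SigHat = Scp/n;                        /* MLE: [(n-1)/n] S                    */
print ybar, S, SigHat;
quit;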

2.1.5 Matrix Normal Distribution


Let $Y$ denote an $n \times p$ matrix. $Y$ is said to have a matrix normal distribution, denoted $Y \sim N_{n,p}(\mu, \Sigma \otimes W)$, where $\Sigma$ and $W$ are positive definite matrices, $E(Y) = \mu$, and the covariance of $Y$ is $\Sigma \otimes W$. This distribution arises when $Y$ is a matrix made up of $n$ $p$-dimensional normal vectors that are correlated with covariance structure specified by $W$. If the vectors were independent then $W = I_n$. The density function for $Y$ is given by

$$f_Y(y) = (2\pi)^{-np/2} |\Sigma|^{-n/2} |W|^{-p/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{trace}[\Sigma^{-1}(y - \mu)'W^{-1}(y - \mu)]\right\}.$$

2.2 Other Distributions


2.2.1 Chi-Square, T and F Distributions
Recall from univariate statistics that if $z_i \sim N(0, 1)$ independently for $i = 1, 2, \ldots, n$, then
1. $z_i^2 \sim \chi^2(1)$ and $\sum_{i=1}^{n} z_i^2 \sim \chi^2(n)$.
2. $(n - 1)s_z^2 = \sum_{i=1}^{n} (z_i - \bar{z})^2 \sim \chi^2(n - 1)$.
3. $\bar{z}$ and $s_z^2$ are independent.
4. If $z \sim N(0, 1)$ and $u \sim \chi^2(n)$ are independent, then $\frac{z}{\sqrt{u/n}} \sim$ t-dist$(n)$.
5. If $u \sim \chi^2(n)$ and $v \sim \chi^2(m)$ are independent, then $\frac{u/n}{v/m} \sim$ F-dist$(n, m)$.
6. If $z = (z_1, z_2, \ldots, z_n)'$ then $z'z = \sum_{i=1}^{n} z_i^2 \sim \chi^2(n)$.
7. If $x \sim N(\mu, 1)$ then $x^2 \sim \chi^2(df = 1, \lambda = \mu^2)$; $x^2$ is said to have a noncentral chi-square distribution with noncentrality parameter $\lambda$.

2.2.2 The Wishart Distribution
Let $Q = Y'Y$ where $Y \sim N_{n,p}(\mu, \Sigma \otimes I_n)$. Then $Q$ is said to have a noncentral $p$-dimensional Wishart distribution, denoted by $W_p(n, \Sigma, \Gamma)$, where $\Gamma = \mu'\mu\Sigma^{-1}$ is the noncentrality parameter. Whenever $\mu = 0$, $Q$ is said to have a central Wishart distribution, denoted by $W_p(n, \Sigma)$. The properties of a Wishart are:
1. $E(Q) = n\Sigma + \mu'\mu$.
2. $Q \sim W_p(n, \Sigma, \Gamma)$ if and only if $a'Qa/a'\Sigma a \sim \chi^2(df = n,\ \gamma = a'\mu'\mu a/a'\Sigma a)$ for all $a \neq 0$.
3. $E(Y'AY) = \mathrm{trace}(A)\Sigma + \mu'A\mu$.
4. If $A_1 \sim W_p(n_1, \Sigma)$ and $A_2 \sim W_p(n_2, \Sigma)$, with $A_1$ independent of $A_2$, then $A_1 + A_2 \sim W_p(n_1 + n_2, \Sigma)$.
5. If $A \sim W_p(n, \Sigma)$, then $CAC' \sim W_p(n, C\Sigma C')$.

2.2.3 Hotelling’s T 2 Distribution


Hotelling's $T^2$ distribution represents the multivariate extension of the univariate Student's $t$ distribution. Let $Y$ and $Q$ be independent random variables where $Y \sim N_p(\mu, \Sigma)$ and $Q \sim W_p(n, \Sigma)$; then

$$T^2 = nY'Q^{-1}Y$$

has a distribution proportional to a noncentral $F$ distribution,

$$\frac{(n - p + 1)}{np}\, T^2 \sim F(p,\ n - p + 1,\ \gamma),$$

where $\gamma = \mu'\Sigma^{-1}\mu$.

2.3 Quadratic Forms of Normal Variables


1. Let $z = (z_1, z_2, \ldots, z_n)' \sim N_n(0, I_n)$ and define the quadratic form $q = z'Az$. Then
(a) The expected value of $q$ is $E(q) = \mathrm{tr}[A]$.
(b) The variance of $q$ is $Var(q) = 2\,\mathrm{tr}[A^2]$.
(c) $q \sim \chi^2(a)$ if and only if $A^2 = A$ ($A$ is idempotent), where $a = \mathrm{rank}[A] = \mathrm{tr}[A]$.
2. Let $x = (x_1, x_2, \ldots, x_n)' \sim N_n(\mu, I_n)$ and define the quadratic form $q = x'Ax$. Then
(a) The expected value of $q$ is $E(q) = \mathrm{tr}[A] + \mu'A\mu$.
(b) The variance of $q$ is $Var(q) = 2\,\mathrm{tr}[A^2] + 4\mu'A^2\mu$.
(c) $q \sim \chi^2(a, \lambda)$ if and only if $A^2 = A$ ($A$ is idempotent), where $a = \mathrm{rank}[A] = \mathrm{tr}[A]$ and $\lambda = \tfrac{1}{2}\mu'A\mu$.
(d) If $x \sim N_n(\mu, \sigma^2 I_n)$ then $(x - \mu)'A(x - \mu)/\sigma^2 \sim \chi^2(a)$ if and only if $A$ is idempotent and $a = \mathrm{tr}[A]$.
3. Let $x = (x_1, x_2, \ldots, x_n)' \sim N_n(\mu, V)$ (this means that the $x_i$'s are not independent of one another) and define the quadratic form $q_1 = x'Ax$. Then (a Monte Carlo check of (a) appears after this list)
(a) The expected value of $q_1$ is $E(q_1) = \mathrm{tr}[AV] + \mu'A\mu$.
(b) The variance of $q_1$ is $Var(q_1) = 2\,\mathrm{tr}[AVAV] + 4\mu'AVA\mu$.
(c) $q_1 \sim \chi^2(a, \lambda)$ if and only if $(AV)^2 = AV$ ($AV$ is idempotent), where $a = \mathrm{rank}[A]$ and $\lambda = \tfrac{1}{2}\mu'A\mu$.
Suppose that $q_2 = x'Bx$ and $t = Cx$ where $C$ is a $c \times n$ matrix. Then
(d) $cov(q_1, q_2) = 2\,\mathrm{tr}[AVBV] + 4\mu'AVB\mu$.
(e) $cov(x, q_1) = 2\,VA\mu$.
(f) $cov(t, q_1) = 2\,CVA\mu$.
(g) $q_1$ and $q_2$ are independent if and only if $AVB = BVA = 0$.
(h) $q_1$ and $t$ are independent if and only if $CVA = 0$.
4. (Cochran's Theorem) Let $x \sim N_n(\mu, V)$, and let $A_i$, $i = 1, 2, \ldots, m$, be symmetric with $\mathrm{rank}[A_i] = r_i$ and
$$A = \sum A_i$$
with $\mathrm{rank}[A] = r$. If $AV$ is idempotent and $r = \sum r_i$, then the $q_i = x'A_i x$ are mutually independent with $q_i \sim \chi^2(df = r_i,\ \lambda_i = \mu'A_i\mu/2)$.
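A quick Monte Carlo check of 3(a), using the RANDNORMAL module (available in the more recent SAS/IML releases; all matrices here are made up):

proc iml;
call randseed(1234);
mu = {1, 2, 0};
V  = {2 1 0, 1 3 1, 0 1 2};         /* positive definite covariance   */
A  = {1 0 1, 0 2 0, 1 0 1};         /* symmetric                      */
x  = randnormal(100000, mu`, V);    /* rows are draws from N(mu, V)   */
q  = ((x*A)#x)[,+];                 /* q1 = x' A x, row by row        */
mc = q[:];                          /* Monte Carlo mean               */
ex = trace(A*V) + mu`*A*mu;         /* E(q1) = tr[AV] + mu'A mu       */
print mc ex;                        /* the two should be close        */
quit;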

2.4 Sampling Distributions of µ̂ and S


Assuming that $Y \sim N_{n,p}(\mu, \Sigma \otimes I_n)$, the sampling distributions for $\hat{\mu}$ and $S$ are:
1. $\hat{\mu} = \bar{y} \sim N_p(\mu, (1/n)\Sigma)$.
2. $(n - 1)S \sim W_p(n - 1, \Sigma)$ where $W_p$ is a Wishart random variable.
3. $\hat{\mu}$ and $S$ are independent of one another.

Whenever $n - p$ is large, it follows by the central limit theorem that

4. $\sqrt{n}(\hat{\mu} - \mu) \approx N_p(0, \Sigma)$.
5. $n(\hat{\mu} - \mu)'S^{-1}(\hat{\mu} - \mu) \approx \chi_p^2$.

Chapter 3

Assessing the Normality Assumption

Assessing the univariate normality assumption can be done using QQ plots, as in regression analysis. In addition, one can use any of the goodness-of-fit procedures together with univariate plots such as histograms and boxplots.

3.1 QQ Plots
Suppose that $x_1, x_2, \ldots, x_n$ represent $n$ univariate observations on a random variable $X$. Let $x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n)}$ represent the order statistics. It follows that

$$\Pr[Z \leq q_{(i)}] = p_{(i)} = \int_{-\infty}^{q_{(i)}} (2\pi)^{-1/2} e^{-z^2/2}\, dz = \Phi(q_{(i)}) = (i - 1/2)/n \ \text{ or } \ (i - 3/8)/(n + 1/4),$$

where $\Phi(\cdot)$ is the c.d.f. of the standard normal distribution. If the data $x_1, x_2, \ldots, x_n$ are normally distributed then the plot of the pairs $(q_{(i)}, x_{(i)})$ will be approximately linear with $x_{(i)} = \sigma q_{(i)} + \mu$. Note: if the $x_i$'s have been standardized then the line will be the line $x_{(\cdot)} = q_{(\cdot)}$.
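A minimal SAS/IML sketch of the QQ pairs. It assumes a data set named NORMCHK with a numeric variable X; both names are placeholders. PROBIT returns standard normal quantiles:

proc iml;
use normchk;  read all var {x} into x;    /* assumed input data set          */
n = nrow(x);
r  = rank(x);                             /* ranks of the observations       */
xs = x;  xs[r] = x;                       /* ordered values x(1) <= ... <= x(n) */
p = (T(1:n) - 0.375) / (n + 0.25);        /* positions (i - 3/8)/(n + 1/4)   */
q = probit(p);                            /* normal quantiles q(i)           */
create qqpairs var {q xs};  append;       /* pairs (q(i), x(i)) for plotting */
quit;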

3.2 Chi-Square Probability Plots


To assess multivariate normality, one can use the result that if $\vec{x} \sim N_p(\mu, \Sigma)$ then $q(x) = (\vec{x} - \mu)'\Sigma^{-1}(\vec{x} - \mu) \sim \chi^2(df = p)$. Since $\mu$ and $\Sigma$ are usually unknown, one can compute

$$d_i^2 = (\vec{x}_i - \bar{x})'S^{-1}(\vec{x}_i - \bar{x})$$

whenever $n$ and $n - p$ are greater than 30. The procedure is as follows:

1. Order the distances $d_i^2$ from smallest to largest as $d_{(1)}^2 \leq d_{(2)}^2 \leq \ldots \leq d_{(n)}^2$.
2. Graph the pairs $(q_{(i)}, d_{(i)}^2)$ where $q_{(i)}$ is the $100(i - 1/2)/n$ percentile of the chi-square distribution with degrees of freedom $= p$. If the data are multivariate normal then the plot should be linear.

A modification of the above procedure is given in Rencher (1995), page 111, where $u_i = \frac{n d_i^2}{(n - 1)^2}$ has a Beta distribution (a short sketch follows the steps below).

1. Order the $u_i$ from smallest to largest as $u_{(1)} \leq u_{(2)} \leq \ldots \leq u_{(n)}$.
2. Graph the pairs $(v_{(i)}, u_{(i)})$ where $v_{(i)}$ is the $100\left(\frac{i - \alpha}{n - \alpha - \beta + 1}\right)$ percentile of the Beta$(\alpha, \beta)$ distribution, where
$$\alpha = \frac{p - 2}{2p} \quad \text{and} \quad \beta = \frac{n - p - 2}{2(n - p - 1)}.$$

A nonlinear pattern in the plot would indicate a departure from multivariate normality.
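A minimal SAS/IML sketch of the Beta version. It assumes the squared distances are already in the data set DIST created by the Section 3.3 code, and uses p = 7 as in that example; BETAINV is the Beta quantile function:

proc iml;
use dist;  read all var {dist} into d2;   /* d_i^2 from the earlier IML step */
n = nrow(d2);  p = 7;                     /* p = number of response variables */
u = n#d2/((n-1)##2);                      /* u_i = n d_i^2 / (n-1)^2          */
r = rank(u);  us = u;  us[r] = u;         /* ordered u_(i)                    */
alpha = (p-2)/(2#p);
beta  = (n-p-2)/(2#(n-p-1));
i = T(1:n);
v = betainv((i-alpha)/(n-alpha-beta+1), alpha, beta);  /* Beta quantiles v_(i) */
create betaqq var {v us};  append;        /* pairs (v_(i), u_(i)) for plotting */
quit;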

3.3 Example – US Navy Officers


The Johnson text presents the following SAS/IML code for computing the Chi-Square Probability Plots;
OPTIONS LINESIZE=75 PAGESIZE=54 NODATE PAGENO=1;
TITLE ’U.S. NAVY BACHELOR OFFICERS’’ QUARTERS’;
DATA USNAVY;
INPUT SITE 1-2 ADO MAC WHR CUA WNGS OBC RMS MMH;
LABEL ADO = ’AVERAGE DAILY OCCUPANCY’
MAC = ’AVERAGE NUMBER OF CHECK-INS PER MO.’
WHR = ’WEEKLY HRS OF SERVICE DESK OPERATION’
CUA = ’SQ FT OF COMMON USE AREA’
WNGS= ’NUMBER OF BUILDING WINGS’
OBC = ’OPERATIONAL BERTHING CAPACITY’
RMS = ’NUMBER OF ROOMS’
MMH = ’MONTHLY MAN-HOURS’ ;
CARDS;
1 2 4 4 1.26 1 6 6 180.23
2 3 1.58 40 1.25 1 5 5 182.61
3 16.6 23.78 40 1 1 13 13 164.38
4 7 2.37 168 1 1 7 8 284.55
5 5.3 1.67 42.5 7.79 3 25 25 199.92
6 16.5 8.25 168 1.12 2 19 19 267.38
7 25.89 3.00 40 0 3 36 36 999.09
8 44.42 159.75 168 .6 18 48 48 1103.24
9 39.63 50.86 40 27.37 10 77 77 944.21
10 31.92 40.08 168 5.52 6 47 47 931.84
11 97.33 255.08 168 19 6 165 130 2268.06
12 56.63 373.42 168 6.03 4 36 37 1489.5
13 96.67 206.67 168 17.86 14 120 120 1891.7
14 54.58 207.08 168 7.77 6 66 66 1387.82
15 113.88 981 168 24.48 6 166 179 3559.92
16 149.58 233.83 168 31.07 14 185 202 3115.29
17 134.32 145.82 168 25.99 12 192 192 2227.76
18 188.74 937.00 168 45.44 26 237 237 4804.24
19 110.24 410 168 20.05 12 115 115 2628.32
20 96.83 677.33 168 20.31 10 302 210 1880.84
21 102.33 288.83 168 21.01 14 131 131 3036.63
22 274.92 695.25 168 46.63 58 363 363 5539.98
23 811.08 714.33 168 22.76 17 242 242 3534.49

24 384.50 1473.66 168 7.36 24 540 453 8266.77
25 95 368 168 30.26 9 292 196 1845.89
RUN;

TITLE2 ’MULTIVARIATE NORMALITY PLOT’;

DATA USNAVY2;
SET USNAVY;
DROP SITE MMH;

PROC IML ; WORKSPACE=50;


RESET NOLOG LINESIZE=75 PAGESIZE=54;
USE USNAVY2;
READ ALL INTO X ;
N= NROW(X);
P= NCOL(X);
MEAN=( X[+,])/N; MEAN=MEAN`;

PRINT "The Sample Mean is equal to" MEAN;

SUMSQ=X`*X-N#MEAN*MEAN`;
S=SUMSQ/(N-{1});

PRINT, "The Sample Covariance Matrix is equal to" S;

DIST = (X - J(N,{1})*MEAN`)* INV(S)*(X - J(N,{1})*MEAN`)`;

D = VECDIAG(DIST); CNAME={"DIST"};
CREATE DIST FROM D[COLNAME=CNAME];
APPEND FROM D[COLNAME=DIST];
QUIT;

PROC PRINT DATA=DIST;

DATA ; SET DIST; X=DIST;

PROC RANK OUT=RANKS;


VAR X;
RANKS R;

DATA PLOTDATA; SET RANKS;

/* NOTE: The following two numbers need to be changed
   for every new data set. */

NN = 25 ; * THIS IS THE NUMBER OF OBSERVATIONS IN THE DATA SET;


P = 7 ; * THIS IS THE NUMBER OF RESPONSE VARIABLES;
RSTAR=(R-.5)/NN;
ETA=P/2;

V=GAMINV(RSTAR,ETA);
V=2*V;

PROC PRINT;
PROC SORT; BY R;

PROC PRINT;
VARIABLES X R V;
FORMAT X 6.2 R 3.0 V 6.3;run;
symbol1 color=black i=join v=none;
symbol2 color=red v=diamond i=none;
PROC GPLOT DATA=PLOTDATA;
PLOT X*V=2 V*V = 1 /OVERLAY VZERO HZERO;
run;
The SAS output is;
U.S. NAVY BACHELOR OFFICERS’ QUARTERS
MULTIVARIATE NORMALITY PLOT

MEAN

The Sample Mean is equal to 118.3556


330.5056
135.94
15.7172
11.12
137.4
126.28

The Sample Covariance Matrix is equal to

28832.439 40247.12 3457.2293 917.78395 998.66888 14123.879 13378.186


40247.12 146539.3 10578.937 2527.0561 2546.7756 43626.324 38384.902
3457.2293 10578.937 3436.9233 317.57045 268.84083 3718.1917 3346.83
917.78395 2527.0561 317.57045 194.62539 115.27493 1111.362 1075.6971
998.66888 2546.7756 268.84083 115.27493 145.02667 1092.6167 1064.715
14123.879 43626.324 3718.1917 1111.362 1092.6167 17996.333 15286.842
13378.186 38384.902 3346.83 1075.6971 1064.715 15286.842 13570.793

Obs DIST

1 5.2149
2 2.9011
3 2.9139
4 2.9547
5 2.5796

6 2.8535
7 3.4292
8 7.6582
9 5.7796
10 2.1490
11 2.0194
12 3.8978
13 0.9649
14 1.3660
15 12.4224
16 8.6965
17 7.8778
18 9.7558
19 1.1235
20 7.8310
21 0.7295
22 17.8888
23 22.7630
24 20.0683
25 12.1617

Obs DIST X R NN P RSTAR ETA V

1 5.2149 5.2149 14 25 7 0.54 3.5 6.7071


2 2.9011 2.9011 9 25 7 0.34 3.5 4.9997
3 2.9139 2.9139 10 25 7 0.38 3.5 5.3280
4 2.9547 2.9547 11 25 7 0.42 3.5 5.6597
5 2.5796 2.5796 7 25 7 0.26 3.5 4.3391
6 2.8535 2.8535 8 25 7 0.30 3.5 4.6713
7 3.4292 3.4292 12 25 7 0.46 3.5 5.9979
8 7.6582 7.6582 16 25 7 0.62 3.5 7.4869
9 5.7796 5.7796 15 25 7 0.58 3.5 7.0858
10 2.1490 2.1490 6 25 7 0.22 3.5 3.9981
11 2.0194 2.0194 5 25 7 0.18 3.5 3.6417
12 3.8978 3.8978 13 25 7 0.50 3.5 6.3458
13 0.9649 0.9649 2 25 7 0.06 3.5 2.3205
14 1.3660 1.3660 4 25 7 0.14 3.5 3.2595
15 12.4224 12.4224 22 25 7 0.86 3.5 10.9685
16 8.6965 8.6965 19 25 7 0.74 3.5 8.8989
17 7.8778 7.8778 18 25 7 0.70 3.5 8.3834
18 9.7558 9.7558 20 25 7 0.78 3.5 9.4801
19 1.1235 1.1235 3 25 7 0.10 3.5 2.8331
20 7.8310 7.8310 17 25 7 0.66 3.5 7.9167
21 0.7295 0.7295 1 25 7 0.02 3.5 1.5643
22 17.8888 17.8888 23 25 7 0.90 3.5 12.0170
23 22.7630 22.7630 25 25 7 0.98 3.5 16.6224
24 20.0683 20.0683 24 25 7 0.94 3.5 13.5397
25 12.1617 12.1617 21 25 7 0.82 3.5 10.1542

Obs X R V

1 0.73 1 1.564
2 0.96 2 2.320
3 1.12 3 2.833
4 1.37 4 3.260
5 2.02 5 3.642
6 2.15 6 3.998
7 2.58 7 4.339
8 2.85 8 4.671
9 2.90 9 5.000
10 2.91 10 5.328
11 2.95 11 5.660
12 3.43 12 5.998
13 3.90 13 6.346
14 5.21 14 6.707
15 5.78 15 7.086
16 7.66 16 7.487
17 7.83 17 7.917
18 7.88 18 8.383
19 8.70 19 8.899
20 9.76 20 9.480
21 12.16 21 10.154
22 12.42 22 10.968
23 17.89 23 12.017
24 20.07 24 13.540
25 22.76 25 16.622
The graph is shown below.

[Figure: plot of the ordered distances X = d^2_(i) against the chi-square quantiles V, overlaid with the reference line V versus V.]

3.4 Box-Cox Transformations
One method that has been proposed as a general way of finding a normalizing transformation is the Box-Cox transformation.
Suppose that $y > 0$; define the Box-Cox transformation as

$$y_i^{(\lambda)} = \begin{cases} (y_i^{\lambda} - 1)/\lambda & \text{when } \lambda \neq 0, \\ \ln y_i & \text{when } \lambda = 0, \end{cases}$$

where $i = 1, 2, \ldots, n$. One determines $\lambda$ by maximizing

$$\ell(\lambda) = (\lambda - 1)\sum_{i=1}^{n} \ln(y_i) - \frac{n}{2}\log[\hat{\sigma}^2(\lambda)],$$

where $\hat{\sigma}^2(\lambda) = \frac{1}{n}\vec{y}^{(\lambda)\prime}[I - H]\vec{y}^{(\lambda)}$, i.e., it is the (scaled) sum of squares for the error term when $y_i^{(\lambda)}$ is used instead of $y_i$, and $\vec{y}^{(\lambda)} = (y_1^{(\lambda)}, y_2^{(\lambda)}, \ldots, y_n^{(\lambda)})'$. The matrix $H$ is $X(X'X)^{-1}X'$ in the linear model $Y = X\beta + E$; otherwise $H = j(j'j)^{-1}j'$.

Since there is no closed-form solution to the above maximization, one usually plots $\ell(\lambda)$ versus $\lambda$. Another approach is to compute a confidence interval using the fact that $2[\ell(\hat{\lambda}) - \ell(\lambda)] \sim \chi^2(df = 1)$ approximately; one can then use any $\lambda$ for which $\ell(\lambda) \geq \ell(\hat{\lambda}) - \tfrac{1}{2}\chi_1^2(\alpha)$, i.e., any $\lambda$ contained in the interval. (This is the level95 = maxl_lik - .5*3.84 cutoff used in the macro below.)

Splus Plots

[Figure: Box-Cox profile log-likelihood plots produced in S-PLUS.]

SAS Plots

[Figure: Box-Cox profile log-likelihood plot produced in SAS.]

3.4.1 USnavy Example with Box Cox Transformations


The SAS code (including a listing of the boxcox macro) is as follows;
%macro boxcox(name1=, lower2=, number=, increas=);
/*
This program contains the MACRO in order to find the BOX-COX Power
Transformation (Box and Cox 1964).
The program reads an input SAS filename bc, i.e.,

The data needs to consist of a vector of observations. The program then
performs the Box-Cox procedure to find a transformation that will aid in
transforming the data to follow the normal distribution more closely.

The user inputs the lower bound where the search is to begin and the
number of searches to be made. The increment for the search is also
input by the user.

USAGE: %BOXCOX

INPUT:

1. variable from data bc;

2. The lower bound for which the search to maximize the
log-likelihood is to be made (e.g. -1.2).

3. The number of searches to be made (e.g. 10).

4. The incremental unit to be added at each stage of the search


beginning at the lower bound (e.g. 0.2).

OUTPUT: LAMBDA (the power transformation to be used to make the


original data more normally distributed).

MAXL_LIK (the value of the log-likelihood for the LAMDA


transformation;this value is the maximum
log-likelihood over the grid search).
Graph:

REFERENCES:
Box,G.E.P., and D.R.Cox. (1964). An analysis of
transformations (with discussion). Journal of the Royal
Statistical Society.B.26:211-252.
Johnson,R.A., and D.Wichern. (1982) Applied Multivariate
Statistical Analysis.Englewood Cliffs,N.J.:Prentice-Hall.

*/

data a; set bc;


xx = &name1;
run;
proc sort data=a;
by xx;
data a;set a;
title1 ’THE BOX-COX TRANSFORMATION ’;
*title2 ’The Lower and Upper Bounds for the Search Grid’;
ll=&lower2;
incr=&increas;
numb=&number;
uu = ll+(numb-1)*incr;

data aa;set a;
mn =_N_;
if mn = 1;
drop mn;

proc print data=aa split =’*’ noobs ;


var ll uu;
label ll = ’Lower*Bound’

uu = ’Upper*Bound’;

%do i = 1 %to &number;


data a;set a;
lower=&lower2;
increase =&increas;
%let j=%eval(&i-1);
j=&j;

/* Transformation is increased by the increment at each stage


of the search
*/
lambda=lower+increase*j;
if lambda=0 then lambda=0.001;

data a;set a;
y=log(xx);
z=(xx**lambda-1)/lambda;

proc means data=a noprint;


var y z;
id lambda;
output out=b n=n1 n2 var=var1 var2 sum = sum1 sum2;

/* The log-likelihood is calculated at each stage of the search */

data b;set b;
c91=var2*(n2-1)/n2;
c71=sum1;
c51=(-n2*log(c91)/2)+(lambda-1)*c71;
if abs(lambda)<0.000001 then lambda=0;
lambda&i=lambda;
loglik&i=c51;
loglik=c51;

data dat&i;set b;
keep lambda&i loglik&i;
%end;
data cc;set dat1;
lambda=lambda1;
loglik=loglik1;
keep loglik lambda;

%do j=2 %to &number;

data cc&j;set dat&j;


lambda=lambda&j;
loglik=loglik&j;
keep loglik lambda;

proc append base=cc data=cc&j force;
%end;

data cc;set cc;


indicate=’a’;
proc sort data=cc;
by lambda;
proc print data=cc noobs split =’*’;
var lambda loglik ;
label loglik = ’Log*Likelihood’;
*title2 ’The Log-Likelihood is Maximized giving the required Power Transformation’;
run;

/* The LAMDA power Transformation corresponding to the value that


maximizes the log-likelihood is obtained */

proc means data=cc noprint;


var loglik;
output out=dd max=maxlik;
data dd;set dd;
indicate=’a’;
data ee;merge cc dd;
by indicate;
if loglik=maxlik;
drop indicate;
maxl_lik=maxlik;

data ee; set ee; level95 = maxl_lik - .5*3.84;


proc print data =ee split = ’*’ noobs;
var lambda maxl_lik level95;
label maxl_lik = ’Maximum*Log-Likelihood’;
label level95 = ’Level for 95% CI’;
/*title1 ’THE BOX-COX TRANSFORMATION ’;
title2 ’If LAMBDA = 0 or close to 0 Then Use the LOGe Transformation’;
title3 ’If LAMBDA = 1 or close to 1 Then Use NO Transformation’;
title4 ’If LAMBDA = 0.5 or close to 0.5 Then Use the Square Root Transformation’;
title5 ’If LAMBDA = -1 or close to -1 Then Use the Reciprocal Transformation’;
*/
run;
data ff;
if _n_=1 then set ee;
set cc;
run;

PROC GPLOT DATA=ff;


PLOT LOGLIK*LAMBDA level95*lambda

/overlay href=-1 -.5 0 .5 1 lh=2 ch=green;
SYMBOL1 V=NONE I=JOIN L=1;
symbol2 v=none i=join color=red;
symbol3 v=none i=join color=green;

%mend boxcox;
TITLE ’U.S. NAVY BACHELOR OFFICERS’’ QUARTERS’;
DATA USNAVY;
INPUT SITE 1-2 ADO MAC WHR CUA WNGS OBC RMS MMH;
LABEL ADO = ’AVERAGE DAILY OCCUPANCY’
MAC = ’AVERAGE NUMBER OF CHECK-INS PER MO.’
WHR = ’WEEKLY HRS OF SERVICE DESK OPERATION’
CUA = ’SQ FT OF COMMON USE AREA’
WNGS= ’NUMBER OF BUILDING WINGS’
OBC = ’OPERATIONAL BERTHING CAPACITY’
RMS = ’NUMBER OF ROOMS’
MMH = ’MONTHLY MAN-HOURS’ ;
CARDS;
1 2 4 4 1.26 1 6 6 180.23
2 3 1.58 40 1.25 1 5 5 182.61
3 16.6 23.78 40 1 1 13 13 164.38
4 7 2.37 168 1 1 7 8 284.55
5 5.3 1.67 42.5 7.79 3 25 25 199.92
6 16.5 8.25 168 1.12 2 19 19 267.38
7 25.89 3.00 40 0 3 36 36 999.09
8 44.42 159.75 168 .6 18 48 48 1103.24
9 39.63 50.86 40 27.37 10 77 77 944.21
10 31.92 40.08 168 5.52 6 47 47 931.84
11 97.33 255.08 168 19 6 165 130 2268.06
12 56.63 373.42 168 6.03 4 36 37 1489.5
13 96.67 206.67 168 17.86 14 120 120 1891.7
14 54.58 207.08 168 7.77 6 66 66 1387.82
15 113.88 981 168 24.48 6 166 179 3559.92
16 149.58 233.83 168 31.07 14 185 202 3115.29
17 134.32 145.82 168 25.99 12 192 192 2227.76
18 188.74 937.00 168 45.44 26 237 237 4804.24
19 110.24 410 168 20.05 12 115 115 2628.32
20 96.83 677.33 168 20.31 10 302 210 1880.84
21 102.33 288.83 168 21.01 14 131 131 3036.63
22 274.92 695.25 168 46.63 58 363 363 5539.98
23 811.08 714.33 168 22.76 17 242 242 3534.49
24 384.50 1473.66 168 7.36 24 540 453 8266.77
25 95 368 168 30.26 9 292 196 1845.89
RUN;

TITLE2 ’MULTIVARIATE NORMALITY PLOT’;



/* When running Macro Boxcox the following results were found

data bc; set usnavy;


%boxcox(name1=ado, lower2=-0.95, number=25, increas=.1)run;
%boxcox(name1=mac, lower2=-0.95, number=25, increas=.1)run;
%boxcox(name1=whr, lower2=-0.95, number=25, increas=.1)run;
%boxcox(name1=cua, lower2=-0.95, number=25, increas=.1)run;
%boxcox(name1=wngs, lower2=-0.95, number=25, increas=.1)run;
%boxcox(name1=obc, lower2=-0.95, number=25, increas=.1)run;
%boxcox(name1=rms, lower2=-0.95, number=25, increas=.1)run;
%boxcox(name1=mmh, lower2=-0.95, number=25, increas=.1)run;
proc print data=usnavy;run;
*/
DATA USNAVY2;
SET USNAVY;lado=ado**.15; lmac=mac**.25; lcua=log(cua + .5);
lobc=obc**.25;lrms=rms**.25;
DROP SITE MMH ado mac cua wngs obc rms whr;run;

proc print data=usnavy2;run;

PROC IML ; WORKSPACE=50;


RESET NOLOG LINESIZE=75 PAGESIZE=54;
USE USNAVY2;
READ ALL INTO X ;
N= NROW(X);
P= NCOL(X);
MEAN=( X[+,])/N; MEAN=MEAN`;

PRINT "The Sample Mean is equal to" MEAN;

SUMSQ=X`*X-N#MEAN*MEAN`;
S=SUMSQ/(N-{1});

PRINT, "The Sample Covariance Matrix is equal to" S;

DIST = (X - J(N,{1})*MEAN`)* INV(S)*(X - J(N,{1})*MEAN`)`;

D = VECDIAG(DIST); CNAME={"DIST"};
CREATE DIST FROM D[COLNAME=CNAME];
APPEND FROM D[COLNAME=DIST];
QUIT;

PROC PRINT DATA=DIST;

DATA ; SET DIST; X=DIST;

PROC RANK OUT=RANKS;


VAR X;
RANKS R;

DATA PLOTDATA; SET RANKS;

/* NOTE: The following two numbers need to be changed
   for every new data set. */

NN = 25 ; * THIS IS THE NUMBER OF OBSERVATIONS IN THE DATA SET;


P = 5 ; * THIS IS THE NUMBER OF RESPONSE VARIABLES;
RSTAR=(R-.5)/NN;
ETA=P/2;
V=GAMINV(RSTAR,ETA);
V=2*V;

PROC PRINT;
PROC SORT; BY R;

PROC PRINT;
VARIABLES X R V;
FORMAT X 6.2 R 3.0 V 6.3;run;
symbol1 color=black i=join v=none;
symbol2 color=red v=diamond i=none;
PROC GPLOT DATA=PLOTDATA;
PLOT X*V=2 V*V = 1 /OVERLAY VZERO HZERO;
run;

An example Box-Cox plot for these data is given below.

[Figure: Box-Cox profile log-likelihood versus lambda, with the 95% confidence-level reference line.]

The SAS output (excluding the Box-Cox output) is:


U.S. NAVY BACHELOR OFFICERS’ QUARTERS
MULTIVARIATE NORMALITY PLOT

Obs lado lmac lcua lobc lrms

1 1.10957 1.41421 0.56531 1.56508 1.56508


2 1.17915 1.12115 0.55962 1.49535 1.49535
3 1.52411 2.20827 0.40547 1.89883 1.89883
4 1.33895 1.24076 0.40547 1.62658 1.68179
5 1.28423 1.13679 2.11505 2.23607 2.23607
6 1.52273 1.69478 0.48243 2.08780 2.08780
7 1.62918 1.31607 -0.69315 2.44949 2.44949
8 1.76659 3.55517 0.09531 2.63215 2.63215
9 1.73662 2.67051 3.32755 2.96226 2.96226
10 1.68116 2.51612 1.79509 2.61833 2.61833
11 1.98718 3.99640 2.97041 3.58402 3.37665
12 1.83213 4.39592 1.87641 2.44949 2.46633
13 1.98515 3.79157 2.91017 3.30975 3.30975
14 1.82203 3.79345 2.11263 2.85027 2.85027
15 2.03454 5.59651 3.21808 3.58944 3.65774
16 2.11949 3.91043 3.45221 3.68802 3.76997
17 2.08555 3.47500 3.27677 3.72242 3.72242
18 2.19472 5.53267 3.82734 3.92362 3.92362

19 2.02465 4.49983 3.02286 3.27472 3.27472
20 1.98564 5.10153 3.03543 4.16871 3.80675
21 2.00217 4.12250 3.06852 3.38312 3.38312
22 2.32210 5.13494 3.85291 4.36492 4.36492
23 2.73124 5.16981 3.14674 3.94415 3.94415
24 2.44194 6.19583 2.06179 4.82057 4.61344
25 1.97997 4.37988 3.42622 4.13376 3.74166

MEAN

The Sample Mean is equal to 1.8528319


3.5188041
2.1726645
3.0711573
3.0333066

The Sample Covariance Matrix is equal to

0.1519886 0.5456573 0.3671253 0.3318444 0.3217499


0.5456573 2.4757068 1.5232469 1.3093377 1.2467932
0.3671253 1.5232469 1.828358 0.9933855 0.957028
0.3318444 1.3093377 0.9933855 0.8832163 0.8295946
0.3217499 1.2467932 0.957028 0.8295946 0.7905382

Obs DIST

1 4.2893
2 3.2561
3 2.8820
4 2.4979
5 5.6011
6 2.0426
7 10.5513
8 5.3148
9 3.4606
10 0.5301
11 3.0106
12 7.2053
13 0.5964
14 0.7616
15 7.5920
16 3.9497
17 3.5994
18 3.1316
19 1.4102

20 7.8354
21 0.6741
22 4.0409
23 15.6272
24 10.4920
25 9.6478

Obs DIST X R NN P RSTAR ETA V

1 4.2893 4.2893 16 25 5 0.62 2.5 5.3033


2 3.2561 3.2561 11 25 5 0.42 2.5 3.7902
3 2.8820 2.8820 8 25 5 0.30 2.5 2.9999
4 2.4979 2.4979 7 25 5 0.26 2.5 2.7400
5 5.6011 5.6011 18 25 5 0.70 2.5 6.0644
6 2.0426 2.0426 6 25 5 0.22 2.5 2.4767
7 10.5513 10.5513 24 25 5 0.94 2.5 10.5962
8 5.3148 5.3148 17 25 5 0.66 2.5 5.6668
9 3.4606 3.4606 12 25 5 0.46 2.5 4.0657
10 0.5301 0.5301 1 25 5 0.02 2.5 0.7519
11 3.0106 3.0106 9 25 5 0.34 2.5 3.2598
12 7.2053 7.2053 19 25 5 0.74 2.5 6.5065
13 0.5964 0.5964 2 25 5 0.06 2.5 1.2499
14 0.7616 0.7616 4 25 5 0.14 2.5 1.9207
15 7.5920 7.5920 20 25 5 0.78 2.5 7.0086
16 3.9497 3.9497 14 25 5 0.54 2.5 4.6505
17 3.5994 3.5994 13 25 5 0.50 2.5 4.3515
18 3.1316 3.1316 10 25 5 0.38 2.5 3.5224
19 1.4102 1.4102 5 25 5 0.18 2.5 2.2058
20 7.8354 7.8354 21 25 5 0.82 2.5 7.5952
21 0.6741 0.6741 3 25 5 0.10 2.5 1.6103
22 4.0409 4.0409 15 25 5 0.58 2.5 4.9664
23 15.6272 15.6272 25 25 5 0.98 2.5 13.3882
24 10.4920 10.4920 23 25 5 0.90 2.5 9.2364
25 9.6478 9.6478 22 25 5 0.86 2.5 8.3092

Obs X R V

1 0.53 1 0.752
2 0.60 2 1.250
3 0.67 3 1.610
4 0.76 4 1.921
5 1.41 5 2.206
6 2.04 6 2.477
7 2.50 7 2.740
8 2.88 8 3.000
9 3.01 9 3.260
10 3.13 10 3.522
11 3.26 11 3.790

12 3.46 12 4.066
13 3.60 13 4.351
14 3.95 14 4.651
15 4.04 15 4.966
16 4.29 16 5.303
17 5.31 17 5.667
18 5.60 18 6.064
19 7.21 19 6.507
20 7.59 20 7.009
21 7.84 21 7.595
22 9.65 22 8.309
23 10.49 23 9.236
24 10.55 24 10.596
25 15.63 25 13.388
The graph is shown in the figure.

Note that the data appear to be more normal with the transformation than without.

Chapter 4

Multivariate Plots

In this chapter I have presented a number of graphical plots for displaying higher dimensional data. In one
dimension, one can use plots like histograms, stem-and-leaf plots, and boxplots. When considering two dimensional
data, the scatterplot is a convenient display. The next section considers some 2-dimensional displays of higher
dimensional data.

4.1 2 Dimensional Plots


4.1.1 Multiple Scatterplots
The side by side scatterplot is a useful way of “looking” at the variables.

Additional two dimensional plots are the Chernoff Faces and Star Plots.

4.1.2 Chernoff Faces


The Splus code is;
#load MASS library
#load Usnavy data into Splus using windows
>attach(Usnavy) #loading data into Splus object
>Usnavy #list of the data
>faces(data.matrix(Usnavy[2:9])) #faces plot for all variables except SITE
>title("Chernoff Faces for U S Navy") #title on plot
>stars(data.matrix(Usnavy[2:9])) #stars plot
>title("Stars for U S Navy") #title on plot

4.1.3 Star Plots

The format for the star plots is shown in the figure.

4.2 3 Dimensional Plots
Johnson gives several examples of three and higher dimensional plots in his text, pages 55-61, with corre-
sponding SAS code. I have several graphs produced using Splus in Windows. The first is a three dimensional
scatterplot of the US Navy data for (ADO,MAC) vs. MMH. The second is for (log(ADO), log(MAC) vs.
MMH with the size of the triangle proportional to another variable.

Chapter 5

Inference for the Mean

The purpose of this chapter is to describe inference procedures for the mean of normally distributed popu-
lations.
Rencher (1995), pages 127-128, gives a brief discussion of the advantages of using multivariate tests over many
univariate tests. He first indicates that the number of parameters for dimension p can be staggering, since
there are p means, p variances, and p(p − 1)/2 covariances, for a total of p(p + 3)/2 values that need to be
estimated or determined from the data of size n. Clearly, n needs to be very large whenever p is even of
moderate size. He lists other reasons that I have included for completeness. They are;
1. The use of p univariate tests inflates the type I error rate.
2. The univariate tests ignore the correlation structure in the p variables.
3. The multivariate test is more powerful.
In the first section we will consider the one population case. That is, it is assumed that X ∼ Np (µ, Σ).

5.1 One Population Case


5.1.1 Univariate Case
Recall that in the univariate case, if one wants to test the hypothesis

H0: µ = µ0 vs. H1: µ ≠ µ0

using the data x1, x2, . . . , xn, the exact form of the test statistic depends upon whether or not σ² is known.
If σ² is known, then the test statistic

t = (x̄ − µ0)/(σ/√n)

has a standard normal distribution whenever the null hypothesis is true, in which case one rejects H0 in
favor of H1 if |t| ≥ z_{α/2}, where Pr[Z ≥ z_{α/2}] = α/2.
Whenever σ² is unknown, one replaces σ² with s², the sample variance, and the new test statistic becomes

t = (x̄ − µ0)/(s/√n),

which has a t-distribution with n − 1 degrees of freedom whenever the null hypothesis is true. In this case
one rejects H0 in favor of H1 if |t| ≥ t_{α/2}(df = n − 1), where Pr[T ≥ t_{α/2}(n − 1)] = α/2.
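A minimal Splus sketch of the unknown-variance test (assuming the sample is stored in a vector x and mu0
holds the hypothesized mean; both names are placeholders) is;
n <- length(x)
t <- (mean(x) - mu0)/(sqrt(var(x))/sqrt(n))  # t statistic with sigma^2 unknown
pvalue <- 2*(1 - pt(abs(t), n - 1))          # two-sided p-value
t
pvalue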

Univariate Likelihood Ratio Test
Although the above results are often simply stated in a methods text, the truth of these statements follows
from what is called the likelihood ratio test. That is, suppose one wants to test the above hypothesis
whenever σ² is unknown. The likelihood ratio is given by

λ = L(µ0, σ̃² | x1, x2, . . . , xn) / L(µ̂, σ̂² | x1, x2, . . . , xn),

where σ̃² is the maximum likelihood estimate of σ² under H0, (µ̂, σ̂²) are the unrestricted maximum
likelihood estimates, and L is the likelihood function given by

L = Π_{i=1}^n fx(xi | µ, σ²).

The Likelihood Ratio Test (LRT) is as follows: H0 is rejected in favor of H1 whenever λ is "too small,"
where "too small" depends upon the level of the test and the distribution of λ. Note that λ ≤ 1. In the
normal univariate case, it can be shown that

λ^{2/n} = Σ_{i=1}^n (xi − x̄)² / Σ_{i=1}^n (xi − µ0)².

By observing that

Σ_{i=1}^n (xi − µ0)² = Σ_{i=1}^n (xi − x̄ + x̄ − µ0)² = Σ_{i=1}^n (xi − x̄)² + n(x̄ − µ0)²,

it follows that

λ^{2/n} = 1 / [1 + n(x̄ − µ0)²/Σ_{i=1}^n (xi − x̄)²] = 1 / [1 + t²/(n − 1)].

It is not necessary to know the exact distribution of λ, since t² = 0 implies λ = 1 and λ → 0 as t² → ∞.
Therefore, one can reject H0 whenever t² becomes "too large". Furthermore, one can show that t = √(t²)
has a t-distribution with df = n − 1.

5.1.2 Multivariate Case


In this section suppose that one has n p-dimensional vectors from a normal distribution with unknown mean
vector µ and covariance matrix Σ. Let’s consider the LRT for this case.

Multivariate Likelihood Ratio Test


The likelihood function L is given by

L = L(µ, Σ | x̃1, x̃2, . . . , x̃n) = (2π)^{−np/2} |Σ|^{−n/2} exp[−(1/2) Σ_{i=1}^n (x̃i − µ)′Σ⁻¹(x̃i − µ)].

When the null hypothesis is true, L is maximized when µ̃ = µ0 and nΣ̃ = Σ_{i=1}^n (x̃i − µ0)(x̃i − µ0)′.
In general, L is maximized when µ̂ = x̄ and nΣ̂ = Σ_{i=1}^n (x̃i − x̄)(x̃i − x̄)′. In which case the likelihood
ratio becomes

λ^{2/n} = | Σ_{i=1}^n (xi − x̄)(xi − x̄)′ | / | Σ_{i=1}^n (xi − µ0)(xi − µ0)′ |.

Let

Qe = Σ_{i=1}^n (xi − x̄)(xi − x̄)′ = X′[In − (1/n)jj′]X,

where j is the n × 1 vector of ones and

    [ x̃1′ ]   [ x11  x12  ...  x1p ]
X = [ x̃2′ ] = [ x21  x22  ...  x2p ]
    [  :  ]   [  :    :          :  ]
    [ x̃n′ ]   [ xn1  xn2  ...  xnp ]
Note that

Σ_{i=1}^n (xi − µ0)(xi − µ0)′ = Qe + n(x̄ − µ0)(x̄ − µ0)′ = Qe + Qh,

where Qh = n(x̄ − µ0)(x̄ − µ0)′. This follows since Σ_{i=1}^n (xi − µ0)(xi − µ0)′ =
Σ_{i=1}^n (xi − x̄ + x̄ − µ0)(xi − x̄ + x̄ − µ0)′ and (x̄ − µ0)[Σ_{i=1}^n (xi − x̄)]′ = [Σ_{i=1}^n (xi − x̄)](x̄ − µ0)′ = 0.
Note that the rank of Qh is the same as the rank of (x̄ − µ0), namely 1.
Thus,

Λ = λ^{2/n} = |Qe| / |Qe + Qh| = |Qe| / |Qe + n(x̄ − µ0)(x̄ − µ0)′|.

It can be shown that

Λ = |Qe| / { |Qe| [1 + n(x̄ − µ0)′Qe⁻¹(x̄ − µ0)] } = 1 / [1 + T²/(n − 1)],

where S = Qe/(n − 1) and

T² = n(x̄ − µ0)′S⁻¹(x̄ − µ0) = nD².

D² is called the Mahalanobis distance between the vectors x̄ and µ0. The statistic T² is called Hotelling's
T² statistic. It can be shown that this statistic is related to Fisher's F-distribution. That is,

(n − p)T²/[(n − 1)p] = (n − p)nD²/[(n − 1)p] ∼ F(p, n − p),

whenever H0 is true. Thus, one can reject H0 in favor of H1 if T 2 ≥ [(n − 1)p/(n − p)]Fα (p, n − p)
where Fα (p, n − p) is the α critical point for an F-distribution with numerator degrees of freedom = p and
denominator degrees of freedom equal to n-p.
Note: It should be mentioned that Qe is the sum of squares for the error term and Qh is the sum of squares
explained by the model, in this case by the hypothesis that µ = µ0. Returning to the likelihood ratio

Λ = |Qe| / |Qe + Qh|.

It can be shown that Qe and Qh are independent Wishart random variables and that

Λ = Π_{i=1}^s (1 + λi)⁻¹

where λi is the ith largest eigenvalue of the matrix QhQe⁻¹ and s = rank(Qh). The statistic Λ is called
Wilks' statistic. In the event that s > 1, several other statistics have been proposed using the eigenvalues
of QhQe⁻¹. They are;

• Roy's statistic:

θ = λ1/(1 + λ1)

where λ1 is the largest eigenvalue of QhQe⁻¹.

• Lawley-Hotelling trace:

U(s) = γe Σ_{i=1}^s λi = γe Tr[QhQe⁻¹]

where γe is the error degrees of freedom.

• Pillai's trace:

V(s) = Σ_{i=1}^s λi/(1 + λi) = Tr[Qh(Qe + Qh)⁻¹].

Again, it should be noted that in testing the hypothesis H0: µ = µ0, s = 1 and all four criteria
are the same. This will not be the case when one has a more general null hypothesis, as in ANOVA models.
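As a numerical check of these definitions, a minimal Splus sketch (assuming qh and qe hold the matrices
Qh and Qe and ge holds the error degrees of freedom; all three names are placeholders) is;
lam <- Re(eigen(qh %*% solve(qe))$values)  # eigenvalues of Qh Qe^{-1}
lam <- lam[lam > 1e-8]                     # keep the s nonzero roots
wilks <- prod(1/(1 + lam))                 # Wilks' statistic
roy <- lam[1]/(1 + lam[1])                 # Roy's statistic (largest root)
lawley <- ge*sum(lam)                      # Lawley-Hotelling trace as defined above
pillai <- sum(lam/(1 + lam))               # Pillai's trace
In the SAS output shown later, the Hotelling-Lawley trace is reported as sum(lam), without the gamma_e factor.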

5.1.3 Example–Sweat Data


The following example is found in Johnson and Wichern page 229. The SAS-IML code is as follows;
OPTIONS LINESIZE=75 PAGESIZE=54 NODATE PAGENO=1;
TITLE 'Sweat Data - Johnson Wichern page 229';
DATA DAT;
INPUT sweat sodium potass;
CARDS;
3.7 48.5 9.3
5.7 65.1 8
3.8 47.2 10.9
3.2 53.2 12
3.1 55.5 9.7
4.6 36.1 7.9
2.4 24.8 14
7.2 33.1 7.6
6.7 47.4 8.5
5.4 54.1 11.3
3.9 36.9 12.7
4.5 58.8 12.3
3.5 27.8 9.8
4.5 40.2 8.4
1.5 13.5 10.1
8.5 56.4 7.1
4.5 71.6 8.2
6.5 52.8 10.9
4.1 44.1 11.2
5.5 40.9 9.4
;
title2 ’Test for mu_o = (4, 50, 10)’;
/* The test for Ho: mu = mu_o = (4, 50 ,10)
is done two ways. First, Using PROC GLM and defining a new variable
int = 1 for every observation.

Second, using PROC IML. The first procedure is better!
*/
* Hotellings one sample t-test with proc GLM;
data new; set dat; int = 1;
sweat = sweat - 4;
sodium = sodium - 50;
potass = potass - 10;
run;
PROC GLM DATA=new;
MODEL sweat sodium potass=INT/NOUNI NOINT;
MANOVA H=INT/PRINTE;
RUN;
* Hotellings one sample t-test with proc IML;
DATA XX; SET DAT;
PROC IML;
RESET NOLOG;
USE XX; READ ALL INTO X;
PRINT "The Data Matrix is" X;
N=NROW(X); P=NCOL(X);
print n p;
XBAR = X(|+,|)‘/N;
PRINT, "XBAR = " XBAR;
SUMSQ=X‘*X-(XBAR*XBAR‘)#N;
S=SUMSQ/(N-1);
PRINT , S;
NU=N-1;
W=NU#S;
mu_0 = {4 50 10}‘;
print mu_0;
T2=n*(xbar - mu_0)‘*INV(S)*(xbar -mu_0);
PRINT, "Hotellings T2 = " T2;
F_stat=(n-P)# T2/(n-1)/p;
p_hat = 1-PROBf(f_stat,p,n-p);
PRINT, f_stat p_hat;
QUIT;
The SAS output is;
Sweat Data - Johnson Wichern page 229
Test for mu_o = (4, 50, 10)

The GLM Procedure

Number of observations 20
Sweat Data - Johnson Wichern page 229
Test for mu_o = (4, 50, 10)

The GLM Procedure


Multivariate Analysis of Variance

E = Error SSCP Matrix

sweat sodium potass

sweat 54.708 190.19 -34.372


sodium 190.19 3795.98 -107.16
potass -34.372 -107.16 68.9255

Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|

DF = 19 sweat sodium potass

sweat 1.000000 0.417350 -0.559744


0.0671 0.0103

sodium 0.417350 1.000000 -0.209498


0.0671 0.3754

potass -0.559744 -0.209498 1.000000


0.0103 0.3754
Sweat Data - Johnson Wichern page 229
Test for mu_o = (4, 50, 10)

The GLM Procedure


Multivariate Analysis of Variance

Characteristic Roots and Vectors of: E Inverse * H, where


H = Type III SSCP Matrix for int
E = Error SSCP Matrix

Characteristic Characteristic Vector V’EV=1


Root Percent sweat sodium potass

0.51256699 100.00 0.15376515 -0.01380445 0.05204607


0.00000000 0.00 0.04104950 0.00467799 0.13579778
0.00000000 0.00 0.07430678 0.01033833 0.00000000

MANOVA Test Criteria and Exact F Statistics


for the Hypothesis of No Overall int Effect
H = Type III SSCP Matrix for int
E = Error SSCP Matrix

S=1 M=0.5 N=7.5

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.66112774 2.90 3 17 0.0649


Pillai’s Trace 0.33887226 2.90 3 17 0.0649

Hotelling-Lawley Trace 0.51256699 2.90 3 17 0.0649
Roy’s Greatest Root 0.51256699 2.90 3 17 0.0649

Proc IML Output

Sweat Data - Johnson Wichern page 229


Test for mu_o = (4, 50, 10)

The Data Matrix is 3.7 48.5 9.3


5.7 65.1 8
3.8 47.2 10.9
3.2 53.2 12
3.1 55.5 9.7
4.6 36.1 7.9
2.4 24.8 14
7.2 33.1 7.6
6.7 47.4 8.5
5.4 54.1 11.3
3.9 36.9 12.7
4.5 58.8 12.3
3.5 27.8 9.8
4.5 40.2 8.4
1.5 13.5 10.1
8.5 56.4 7.1
4.5 71.6 8.2
6.5 52.8 10.9
4.1 44.1 11.2
5.5 40.9 9.4

N P

20 3

XBAR

XBAR = 4.64
45.4
9.965

2.8793684 10.01 -1.809053


10.01 199.78842 -5.64
-1.809053 -5.64 3.6276579

MU_0

4
50
10

T2

Hotellings T2 = 9.7387729

Sweat Data - Johnson Wichern page 229


Test for mu_o = (4, 50, 10)

F_STAT P_HAT

2.9045463 0.0649283

The Splus code for this same problem is;
attach(T5.1)
x<-data.matrix(T5.1)
n<-dim(x)[1] # define n
p<-dim(x)[2] # define p
mu0<-cbind(4,50,10) # define mu_0
dm<-cbind(mean(x[,1]),mean(x[,2]),mean(x[,3])) - mu0 # xbar-mu0
t2<- n*dm%*%solve(var(x))%*%t(dm) # compute Hotelling’s T2
t2
pvalue<-1 - pf((n-p)/p/(n-1)*t2,p,n-p) # compute p-value for test
pvalue

5.2 Two Population Case


Suppose that one has two normally distributed populations. As in the univariate case, the test procedure
for testing equality of means is dependent upon whether or not the variances are known or unknown, equal
or unequal. The problem of testing equality of variances/covariances will be covered in a separate chapter.
The simplest two sample test is the two sample Hotelling’s T-Test.

5.2.1 Hotelling’s Two Sample T-Test


Suppose that one has samples of size ni for i = 1, 2 from two different normal populations with possibly
different means but with equal covariances. That is, x11 , x12 , . . . , x1n1 ∼ Np (µ1 , Σ) and x21 , x22 , . . . , x2n2 ∼
Np (µ2 , Σ). Then test the hypothesis that,
H0 : µ1 = µ2
versus the alternative hypothesis that the two means are unequal. This test can actually be performed using
PROC GLM with a one-way MANOVA where k=2 is the number of treatment groups. However, it can also
be expressed in terms that are similar to the univariate case where the test statistic is,
T0² = [n1n2/(n1 + n2)] (x̄1 − x̄2)′Sp⁻¹(x̄1 − x̄2)

where Sp is the pooled covariance matrix given by

Sp = [(n1 − 1)S1 + (n2 − 1)S2]/(n1 + n2 − 2).

The null hypothesis is rejected if,

[(n1 + n2 − p − 1)/(p(n1 + n2 − 2))] T0² ≥ Fα(p, n1 + n2 − p − 1).
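A minimal Splus sketch of this test (assuming x1 and x2 hold the two samples as matrices with p columns;
the names are placeholders) is;
n1 <- dim(x1)[1]; n2 <- dim(x2)[1]; p <- dim(x1)[2]
d <- apply(x1, 2, mean) - apply(x2, 2, mean)               # xbar1 - xbar2
sp <- ((n1 - 1)*var(x1) + (n2 - 1)*var(x2))/(n1 + n2 - 2)  # pooled covariance
t2 <- (n1*n2/(n1 + n2)) * t(d) %*% solve(sp) %*% d         # two sample T0^2
f <- (n1 + n2 - p - 1)/(p*(n1 + n2 - 2)) * t2
pvalue <- 1 - pf(f, p, n1 + n2 - p - 1)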

Test with Unequal Covariance Matrices


Paired Populations
If the two populations have the same sample size and there is a reason for pairing the observations, the
multivariate case can be handled as in the paired univariate t-test. That is, reduce the problem to the one
sample case by computing di = x1i − x2i. Then the test statistic is

T0² = n d̄′Sd⁻¹d̄

where Sd is the sample covariance matrix of the difference data di and d̄ is their mean vector.
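A minimal Splus sketch of the paired version (assuming x1 and x2 are n × p matrices of paired
observations) is;
d <- x1 - x2                                   # paired differences
n <- dim(d)[1]; p <- dim(d)[2]
dbar <- apply(d, 2, mean)
t2 <- n * t(dbar) %*% solve(var(d)) %*% dbar   # paired T0^2, a one sample test
f <- (n - p)/(p*(n - 1)) * t2
pvalue <- 1 - pf(f, p, n - p)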

Separate Populations with Different Covariances
If the data are not paired, one can compute,

T0² = (x̄1 − x̄2)′(n1⁻¹S1 + n2⁻¹S2)⁻¹(x̄1 − x̄2).

It can be shown that the upper α critical point for the above statistic is,

kα = χ²_p(α)[1 + 0.5(k1/p + k2χ²_p(α)/(p(p + 2)))],

where

k1 = Σ_{i=1}^2 (tr[W⁻¹Wi])²/(ni − 1),

k2 = Σ_{i=1}^2 {(tr[W⁻¹Wi])² + 2 tr[(W⁻¹Wi)²]}/(ni − 1),

Wi = Si/ni,    W = W1 + W2,

and χ²_p(α) is the upper α critical point of a chi-square distribution with p degrees of freedom.
An alternative procedure suggested by Yao (1965) involves computing the degrees of freedom given by

1/ν = Σ_{i=1}^2 (ni − 1)⁻¹ [(x̄1 − x̄2)′W⁻¹WiW⁻¹(x̄1 − x̄2)/T0²]²,

where the new critical point becomes

T²_{p,ν,α} = [pν/(ν − p + 1)] Fα(p, ν − p + 1).

As I have already mentioned, SAS PROC GLM can be used to test the equality of two means if the
covariance matrices are equal; however, there is no simple way to handle the test of two means with
unequal covariance matrices. As an example of assuming equality of covariances and testing for equality of means,
the SAS code is;
proc glm;
class gen;
model tot ami = gen;
manova h=_all_/printe printh;
run;
The corresponding SAS output is;
Multivariate Linear Regression

The GLM Procedure

Class Level Information

Class Levels Values

gen 2 0 1

Number of observations 17

Multivariate Linear Regression

The GLM Procedure

Dependent Variable: tot

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 1 288658.119 288658.119 0.58 0.4567

Error 15 7417282.117 494485.474

Corrected Total 16 7705940.235

R-Square Coeff Var Root MSE tot Mean

0.037459 62.75904 703.1966 1120.471

Source DF Type I SS Mean Square F Value Pr > F

gen 1 288658.1186 288658.1186 0.58 0.4567

Source DF Type III SS Mean Square F Value Pr > F

gen 1 288658.1186 288658.1186 0.58 0.4567

Multivariate Linear Regression

The GLM Procedure

Dependent Variable: ami

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 1 532382.166 532382.166 1.13 0.3050

Error 15 7077995.717 471866.381

Corrected Total 16 7610377.882

R-Square Coeff Var Root MSE ami Mean

0.069955 77.85154 686.9253 882.3529

Source DF Type I SS Mean Square F Value Pr > F

gen 1 532382.1657 532382.1657 1.13 0.3050

Source DF Type III SS Mean Square F Value Pr > F

gen 1 532382.1657 532382.1657 1.13 0.3050

Multivariate Linear Regression

The GLM Procedure


Multivariate Analysis of Variance

E = Error SSCP Matrix

tot ami

tot 7417282.1167 7082751.3167


ami 7082751.3167 7077995.7167

Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|

DF = 15 tot ami

tot 1.000000 0.977517


<.0001

ami 0.977517 1.000000


<.0001

Multivariate Linear Regression

The GLM Procedure


Multivariate Analysis of Variance

H = Type III SSCP Matrix for gen

tot ami

tot 288658.11863 392015.8598
ami 392015.8598 532382.16569

Characteristic Roots and Vectors of: E Inverse * H, where


H = Type III SSCP Matrix for gen
E = Error SSCP Matrix

Characteristic Characteristic Vector V’EV=1


Root Percent tot ami

0.18801392 100.00 -0.00134880 0.00158745


0.00000000 0.00 0.00110142 -0.00081103

MANOVA Test Criteria and Exact F Statistics for the Hypothesis of No Overall gen Effect
H = Type III SSCP Matrix for gen
E = Error SSCP Matrix

S=1 M=0 N=6

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.84174098 1.32 2 14 0.2994


Pillai’s Trace 0.15825902 1.32 2 14 0.2994
Hotelling-Lawley Trace 0.18801392 1.32 2 14 0.2994
Roy’s Greatest Root 0.18801392 1.32 2 14 0.2994

The Splus code for test of means with unequal covariances is;
attach(amitriptyline) #data in Johnson-Wichern page 455
amitriptyline
tot ami gen amt pr diap qrs
3 1131 810 0 3600 205 60 111
8 1111 941 0 1500 200 70 93
14 500 384 0 2000 160 60 80
15 781 501 0 4500 180 0 100
16 1070 405 0 1500 170 90 120
1 3389 3149 1 7500 220 0 140
2 1101 653 1 1975 200 0 100
4 596 448 1 675 160 60 120
5 896 844 1 750 185 70 83
6 1767 1450 1 2500 180 60 80
7 807 493 1 350 154 80 98
9 645 547 1 375 137 60 105
10 628 392 1 1050 167 60 74
11 1360 1283 1 3000 180 60 80
12 652 458 1 450 160 64 60

13 860 722 1 1750 135 90 79
17 1754 1520 1 3000 180 0 129
x1<-amitriptyline[1:5,1:2]
x1
x2<-amitriptyline[6:17,1:2]
x2
n1<-dim(x1)[1]
p<-dim(x1)[2]
n2<-dim(x2)[1]
x1bar<-cbind(mean(x1[,1]),mean(x1[,2]))
x2bar<-cbind(mean(x2[,1]),mean(x2[,2]))
s1<-var(x1)
s2<-var(x2)
s1
s2
st<-1/n1*s1 + 1/n2*s2
stinv<-solve(st)
t2<-(x1bar-x2bar)%*%stinv%*%t((x1bar-x2bar))
w1<-s1/n1
w2<-s2/n2
w<-w1+w2
winv<-solve(w)
k1<-1/(n1-1)*sum(diag(winv%*%w1))^2 + 1/(n2-1)*sum(diag(winv%*%w2))^2
k2<-((sum(diag(winv%*%w1))^2 + 2*sum(diag(winv%*%w1%*%winv%*%w1))))/(n1-1) +
((sum(diag(winv%*%w2))^2 + 2*sum(diag(winv%*%w2%*%winv%*%w2))))/(n2-1)
qalpha<-qchisq(.99,p)
kalpha<-qalpha*(1+.5*(k1/p+k2*qalpha/p/(p+2)))
kalpha
t2
#Yao’s procedure
nu<-1/(1/(n1-1)*((x1bar-x2bar)%*%winv%*%w1%*%winv%*%t(x1bar-x2bar)/t2)^2 +
1/(n2-1)*((x1bar-x2bar)%*%winv%*%w2%*%winv%*%t(x1bar-x2bar)/t2)^2)
talpha<-p*nu/(nu-p+1)*qf(.99,p,nu-p+1)
One can use the above code knowing just the summary statistics. The Splus code for creating a vector
and a matrix is;
x1bar<-cbind(12,15)
s1<-rbind(cbind(20,-5),cbind(-5,15))
produces,
> x1bar
[,1] [,2]
[1,] 12 15
> s1
[,1] [,2]
[1,] 20 -5
[2,] -5 15

Chapter 6

The General Linear Model

In this chapter some of the inference results for the mean of a single population are extended to the more
general linear model. In the univariate case, these models include simple linear regression, multiple
regression, ANOVA models, and analysis of covariance models.

6.1 Univariate Case


The general linear (not to be confused with the generalized linear model) model can be written as

yi = β0 + β1 xi1 + β2 xi2 + . . . + βk xik + ei

for i = 1, 2, . . . , n where the independent variables (xi1 , xi2 , . . . , xik ) are;


1. Polynomial Regression when xij = xi^j, i = 1, 2, . . . , n, j = 1, 2, . . . , k. This equation is called a
polynomial of degree k.
2. Multiple Regression when each of the variables is a different independent variable.
3. ANOVA when the variables are indicators; 0's or 1's.
4. ANCOVA (analysis of covariance) when there are both regression and indicator variables.
Here ei represents the unobserved error, or residual, by which the observed data value yi differs from the
predicted surface given by β0 + β1xi1 + β2xi2 + . . . + βkxik. This model can be written as

ỹ = Xβ̃ + ẽ

where

    [ y1 ]       [ 1  x11  x12  ...  x1k ]       [ β0 ]       [ e1 ]
ỹ = [ y2 ],  X = [ 1  x21  x22  ...  x2k ],  β̃ = [ β1 ],  ẽ = [ e2 ]
    [ :  ]       [ :   :    :          :  ]       [ :  ]       [ :  ]
    [ yn ]       [ 1  xn1  xn2  ...  xnk ]       [ βk ]       [ en ]
and
~y is n × 1 vector of dependent observations.
X is an n × (p = k + 1) matrix of independent observations or known values.

β̃ is a p × 1 vector of parameters.
ẽ is an n × 1 vector of unobservable errors or residuals.
The Least Squares problem consists of minimizing

Q(β) = e′e = (y − Xβ)′(y − Xβ) = y′y − β′X′y − y′Xβ + β′X′Xβ.

By using the properties of differentiation with matrices one has

∂Q(β)/∂β = −2X′y + 2X′Xβ = 0,

from which one obtains the normal equations given by

X′Xβ = X′y.

If rank[X] = p the normal equations have a unique solution given by

β̂ = (X′X)⁻¹X′y.

If rank[X] < p the normal equations no longer have a unique solution. However, one can find a nonunique
solution given by

β̃ = (X′X)⁻X′y.

Since β̃ is no longer unique, one must restrict its use to what are called estimable functions. That is, a
parametric function c′β is said to be estimable if there exists a vector a such that E(a′y) = c′β. Estimable
functions are the only functions of β̃ that are unique.
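A minimal Splus sketch of these computations (assuming y holds the responses and X a full-rank design
matrix that includes the column of ones) is;
n <- dim(X)[1]; p <- dim(X)[2]
betahat <- solve(t(X) %*% X) %*% t(X) %*% y   # unique normal equations solution
H <- X %*% solve(t(X) %*% X) %*% t(X)         # hat matrix
ehat <- y - H %*% y                           # residuals
sigma2hat <- sum(ehat^2)/(n - p)              # Qe/(n - p), see Section 6.1.2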

6.1.1 Inference
The least square problem is an optimization rather than a statistical problem. In order for this problem
to become statistical it is necessary to introduce a distributional assumption. That is, assume that the
unobserved errors ei, i = 1, 2, . . . , n, are i.i.d. normal with mean 0 and variance σ². This assumption
becomes

ẽ = (e1, e2, . . . , en)′ ∼ Nn(0, σ²In).
Using the properties of linear transformations of normal variables one has
ỹ = (y1, y2, . . . , yn)′ ∼ Nn(Xβ, σ²In).

Now using the fact that β̂ = Ly, where L = (X′X)⁻¹X′, it follows that

1. β̂ ∼ Np(β, σ²(X′X)⁻¹).

(a) β̂ is an unbiased estimate of β.

(b) var(β̂i) = σ²((X′X)⁻¹)ii.
(c) cov(β̂i, β̂j) = σ²((X′X)⁻¹)ij.
(d) corr(β̂i, β̂j) = ((X′X)⁻¹)ij/[((X′X)⁻¹)ii((X′X)⁻¹)jj]^{1/2}.

2. ŷ ∼ Nn(Xβ, σ²H).
(a) var(ŷi) = σ²hii.
(b) cov(ŷi, ŷj) = σ²hij, where H = (hij). Notice that the ŷi's are not independent of one another
unless hij = 0.
(c) corr(ŷi, ŷj) = hij/[hiihjj]^{1/2}.
3. ê ∼ Nn(0, σ²(I − H)).
(a) var(êi) = σ²(1 − hii).
(b) cov(êi, êj) = −σ²hij.
(c) corr(êi, êj) = −hij/[(1 − hii)(1 − hjj)]^{1/2}.

6.1.2 Estimation of σ²
The estimation of σ² follows from observing that the residual sum of squares Q(β̂) is a quadratic form
given by Qe = ê′ê = y′(I − H)y. Using the expected value of a quadratic form, i.e., E(y′Ay) = σ²tr[A] + µ′Aµ
when y has mean µ and covariance σ²I, one has

E(Q(β̂)) = E[y′(I − H)y]
        = σ²tr[I − H] + β′X′(I − H)Xβ
        = σ²(tr[I] − tr[H])
        = σ²(n − p),

since

X′(I − H) = X′ − X′X(X′X)⁻¹X′ = X′ − X′ = 0,
(I − H)X = X − X(X′X)⁻¹X′X = X − X = 0.

From here one can define an estimate for σ² with

σ̂² = y′(I − H)y/(n − p) = SSE/(n − p) = Qe/(n − p).

6.1.3 ANOVA Table


The analysis of variance table is given by

Source             Sum of Squares                       Degrees of Freedom   Mean Square
due to β           SS(β) = β̂′X′y = y′Hy                 p                    MS(β) = SS(β)/p
Residual           SSE = y′y − β̂′X′y = y′(I − H)y       n − p                MSE = SSE/(n − p)
Uncorrected Total  y′y                                  n
Since only the coefficients β1, . . . , βp−1 of the independent variables are needed to determine whether the
independent variables explain the dependent variable y, the sum of squares term is adjusted for β0, that is,

y′Hy = y′(H − (1/n)jj′)y + (1/n)y′jj′y

where (1/n)y′jj′y = nȳ² is the correction factor, or the sum of squares due to β0, and j is the n × 1 vector
of ones. In this case the ANOVA table becomes

Source             Sum of Squares                       Degrees of Freedom   Mean Square
due to β∗ | β0     SSβ∗|β0 = y′Hy − (1/n)y′jj′y         p − 1                MSβ∗|β0 = SSβ∗|β0/(p − 1)
Residual           SSE = y′y − β̂′X′y = y′(I − H)y       n − p                MSE = SSE/(n − p)
Corrected Total    SSCT = y′y − (1/n)y′jj′y             n − 1
due to β0          SSβ0 = (1/n)y′jj′y = nȳ²             1
Uncorrected Total  y′y                                  n

where β∗ = (β1, β2, . . . , βp−1).

6.1.4 Expected Values of the Sum of Squares


Observe that the terms in the sum of squares column are actually quadratic forms. Using the properties
of the expected value of quadratic forms: let x = (x1, x2, . . . , xn)′ ∼ Nn(µ, V) (so the xi's need not be
independent of one another) and define the quadratic form q1 = x′Ax; then the expected value of q1 is
E(q1) = tr[AV] + µ′Aµ. It follows when V = σ²In that
1. E(y′Hy) = σ²tr[H] + β′X′HXβ = pσ² + β′X′Xβ.
2. E(y′(I − H)y) = σ²tr[I − H] + β′X′(I − H)Xβ = (n − p)σ², since X′(I − H) = (I − H)X = 0.
The R² is an indicator of how much of the variation in the data is explained by the model. It is defined
as

R² = SSβ∗|β0/SSCT = [y′Hy − (1/n)y′jj′y]/[y′y − (1/n)y′jj′y].
The adjusted R² is the R² value adjusted for the number of parameters in the model and is given by

adj R² = 1 − [(n − i)/(n − p)](1 − R²)

where i is 1 if the model includes the y intercept (β0), and is 0 otherwise. Tolerance (TOL) and variance
inflation factors (VIF) measure the strength of the interrelationships among the regressor variables in the
model. For the jth regressor they are given as

TOLj = 1 − Rj²    and    VIFj = TOLj⁻¹,

where Rj² is the R² from regressing the jth regressor on the remaining regressors.

6.1.5 Distribution of the Mean Squares


Again from the properties of the distribution of quadratic forms it can be shown that
1. SS(β)/σ² ∼ χ²(df = p, λ = β′X′Xβ/(2σ²)).
2. SSE/σ² ∼ χ²(df = n − p).
3. SS(β) and SSE are independent since H(I − H) = 0.
4. F = MS(β)/MSE ∼ F(df1 = p, df2 = n − p, λ = β′X′Xβ/(2σ²)).
5. When β = 0 it follows that SS(β)/σ² ∼ χ²(df = p) and F = MS(β)/MSE ∼ F(df1 = p, df2 = n − p).
6. SSβ∗|β0/σ² ∼ χ²(df = p − 1, λ = β′X′(H − (1/n)jj′)Xβ/(2σ²)).
7. SSβ∗|β0 and SSE are independent since j′(I − H) = 0 and H(I − H) = 0.
8. F = MSβ∗|β0/MSE ∼ F(df1 = p − 1, df2 = n − p, λ = β′X′(H − (1/n)jj′)Xβ/(2σ²)).
9. When β∗ = 0 it follows that SSβ∗|β0/σ² ∼ χ²(df = p − 1) and F = MSβ∗|β0/MSE ∼ F(df1 =
p − 1, df2 = n − p).

6.1.6 Testing Linear Hypothesis


The general form of a linear hypothesis for the parameters is

H0: Lβ = c

where L is a q × p matrix of rank q. The approach is to estimate Lβ − c with Lβ̂ − c. If Lβ is estimable
then Lβ̂ − c is unique. Using the properties of expectation and covariance we have
1. E(Lβ̂ − c) = Lβ − c.
2. Cov(Lβ̂ − c) = σ²L(X′X)⁻¹L′.
3. The quadratic form given by

Q = (Lβ̂ − c)′(L(X′X)⁻¹L′)⁻¹(Lβ̂ − c)

satisfies Q/σ² ∼ χ²(df = q, λ = (Lβ − c)′(L(X′X)⁻¹L′)⁻¹(Lβ − c)/(2σ²)).
4. (Q/q)/σ̂² ∼ F(q, n − p) whenever Lβ = c.
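Continuing the sketch above, a minimal Splus version of this test (assuming a q × p hypothesis matrix L
and a right-hand side cc, written cc so as not to mask the c() function) is;
q <- dim(L)[1]
r <- L %*% betahat - cc                        # L betahat - c
Q <- t(r) %*% solve(L %*% solve(t(X) %*% X) %*% t(L)) %*% r
f <- (Q/q)/sigma2hat                           # F statistic of item 4
pvalue <- 1 - pf(f, q, n - p)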

6.1.7 Constrained Least Squares


A related topic involves finding β̃ which minimizes (y − Xβ)′(y − Xβ) subject to the restriction that Lβ = d.
The approach is to define the Lagrangean function given by

Λ = (y − Xβ)′(y − Xβ) + 2λ′(d − Lβ).

Taking the derivatives of Λ with respect to both β and λ and setting them equal to zero gives the following solution

β̃ = β̂ + (X′X)⁻¹L′[L(X′X)⁻¹L′]⁻¹(d − Lβ̂),

where β̂ is the usual unrestricted estimator of β.
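A minimal Splus sketch of the restricted estimator (same assumed objects, with d the vector of
restrictions) is;
xxinv <- solve(t(X) %*% X)
betatilde <- betahat + xxinv %*% t(L) %*%
  solve(L %*% xxinv %*% t(L)) %*% (d - L %*% betahat)  # satisfies L betatilde = d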

6.2 Multivariate Case
The previous model can be extended to the multivariate case. Suppose that Y is a matrix of n k-dimensional
random variables, such that

Y = XB + E
where
• ỹi′ = [yi1, yi2, . . . , yik] is the ith row of the matrix Y.
• X is a known design n × p matrix of rank r ≤ p < n.
• B is an unknown p × k matrix of nonrandom coefficients. The jth column of B is

βj = [β1j, β2j, . . . , βpj]′.

• E is an n × k matrix of unobservable errors such that E(ei) = 0 and Var(ei) = Σ, where ei′ is the ith
row of E, and the rows of E are independent of one another, i.e., cov(ei, ej) = 0 for i ≠ j.
As in the univariate case, one can estimate the coefficient matrix B using a least squares method. In the
multivariate case one needs to minimize,

Q(B) = Tr[(Y − XB)′(Y − XB)].

As before, the normal equations can be written as

(X′X)B = X′Y,

which have a unique solution given by

B̂ = (X′X)⁻¹X′Y
whenever rank[X] = r = p. The normal equations have a nonunique solution given by

B̃ = (X′X)⁻X′Y

whenever rank[X] = r < p.
The estimation of Σ follows from observing that the residual sum of squares is Qe = Ê′Ê = Y′(I − H)Y
and Qh = Y′HY, where H = X(X′X)⁻¹X′ or H = X(X′X)⁻X′. Note that even though (X′X)⁻ is not
unique, H = X(X′X)⁻X′ is. It can be shown that

E(Qe) = (n − r)Σ.

From here one can define an estimate for Σ with

Σ̂ = Qe/(n − r).
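A minimal Splus sketch (assuming Y is the n × k response matrix and X a design matrix of full rank
r = p) is;
n <- dim(Y)[1]; r <- dim(X)[2]
Bhat <- solve(t(X) %*% X) %*% t(X) %*% Y   # coefficient matrix estimate
H <- X %*% solve(t(X) %*% X) %*% t(X)      # hat matrix
Qe <- t(Y) %*% (diag(n) - H) %*% Y         # residual SSCP matrix
Sigmahat <- Qe/(n - r)                     # estimate of Sigma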

6.2.1 ANOVA Table


As in the univariate case one has the analysis of variance table given by

Source             Sum of Squares                     Degrees of Freedom   Mean Square
Model              Qh = B̂′X′Y = Y′HY                  r                    MSh = Qh/r
Residual           Qe = Y′Y − B̂′X′Y = Y′(I − H)Y      n − r                MSe = Qe/(n − r)
Uncorrected Total  Y′Y                                n

As in the univariate case the sum of squares term is adjusted for the intercept row β1 of B, that is,

Y′HY = Y′(H − (1/n)jj′)Y + (1/n)Y′jj′Y

where (1/n)Y′jj′Y = nȳȳ′, with ȳ the k × 1 vector of sample means, is the correction factor or the sum of
squares due to β1. In this case the ANOVA table becomes

Source             Sum of Squares                       Degrees of Freedom   Mean Square
due to β∗ | β1     SSβ∗|β1 = Y′HY − (1/n)Y′jj′Y         r − 1                MSβ∗|β1 = SSβ∗|β1/(r − 1)
Residual           SSE = Y′Y − B̂′X′Y = Y′(I − H)Y       n − r                MSE = SSE/(n − r)
Corrected Total    SSCT = Y′Y − (1/n)Y′jj′Y             n − 1
due to β1          SSβ1 = (1/n)Y′jj′Y                   1
Uncorrected Total  Y′Y                                  n

where β∗ = (β2, β3, . . . , βp) denotes the remaining rows of B.

6.2.2 Testing Linear Hypothesis


The general form of a linear hypothesis for the parameters is

H0: LBM = c

where L is a q × p matrix of rank q, M is a k × k matrix of rank k, and c is a specified (usually 0) q × k matrix.
The approach is to estimate LBM − c with LB̂M − c. From here one follows similar lines as in the univariate
case by forming two matrices, given by

Qh = (LB̂M − c)′[L(X′X)⁻¹L′]⁻¹(LB̂M − c)

and

Qe = M′Y′[I − H]YM.
The diagonal elements of Qh and Qe correspond to the hypothesis and error SS for univariate tests. When
the M matrix is the identity matrix (SAS default), these tests are for the original dependent variables on
the left-hand side of the MODEL statement. When an M matrix other than the identity is specified, the
tests are for transformed variables defined by the columns of the M matrix. These tests can be studied
by requesting the SUMMARY option, which produces univariate analyses for each original or transformed
variable.
Four multivariate test statistics, all functions of the eigenvalues of Qe⁻¹Qh or (Qe + Qh)⁻¹Qh, are
constructed:

• Wilks' lambda: Λ = |Qe| / |Qh + Qe| = Π_{i=1}^s (1 + λi)⁻¹

• Pillai's trace: V(s) = tr[Qh(Qh + Qe)⁻¹] = Σ_{i=1}^s λi/(1 + λi)

• Hotelling-Lawley trace: U(s) = tr[Qe⁻¹Qh] = Σ_{i=1}^s λi

• Roy's maximum root: θ = λ1/(1 + λ1)

where λi is the ith largest eigenvalue of Qe⁻¹Qh, and s = min(νh = rank(Qh), p).

6.2.3 One-Way MANOVA – Examples
Consider some examples found in Johnson chapter 10. The SAS code is;
OPTIONS LINESIZE=74 PAGESIZE=54 NODATE;
TITLE ’Ex. 11.1 - MANOVA’’S on Cooked Turkey Data’;
DATA STORAGE;
INPUT REP 1 TRT 3 CKG_LOSS 5-8 PH 10-13 MOIST 15-19 FAT 21-25 HEX 27-30
NONHEM 32-35 CKG_TIME 37-38;
CARDS;
1 1 36.2 6.20 57.01 11.72 1.32 13.1 47
1 2 36.2 6.34 59.95 10.00 1.18 09.2 47
1 3 31.5 6.52 61.93 09.85 0.50 05.3 46
1 4 33.0 6.49 59.95 10.32 1.55 07.8 48
1 5 32.5 6.50 60.98 09.88 0.88 08.8 44
2 1 34.8 6.49 61.40 09.35 1.00 08.6 48
2 2 33.0 6.62 59.96 09.32 0.40 06.8 52
2 3 26.8 6.73 63.37 10.08 0.30 04.8 54
2 4 34.0 6.67 59.81 09.20 0.75 06.4 50
2 5 32.8 6.86 60.37 09.18 0.32 07.2 48
3 1 37.5 6.20 57.01 09.92 0.58 09.8 50
3 2 33.2 6.67 60.86 10.18 0.48 08.8 47
3 3 27.8 6.78 61.92 09.38 0.20 05.8 47
3 4 34.2 6.64 59.34 11.32 0.73 08.0 49
3 5 32.0 6.78 58.50 10.48 0.35 07.2 48
4 1 38.5 7.34 59.25 10.58 1.48 08.2 48
4 2 35.0 6.61 61.12 10.05 0.90 07.0 48
4 3 33.8 6.65 60.40 09.52 0.32 06.6 50
4 4 35.8 6.47 61.08 10.52 1.58 06.6 49
4 5 36.5 6.72 61.61 09.70 0.55 07.4 48
5 1 38.5 6.40 56.25 10.18 0.90 09.6 47
5 2 34.2 6.67 61.37 09.48 0.65 06.8 47
5 3 33.5 6.74 61.60 09.60 0.22 05.2 45
5 4 37.5 6.47 60.78 10.18 1.30 07.2 48
5 5 36.2 6.70 59.57 10.12 0.88 07.0 48
;
*PROC PRINT;run;

PROC GLM;
CLASSES TRT;
MODEL CKG_LOSS moist hex nonhem = TRT;
* MEANS TRT/LSD;
MANOVA H=TRT/PRINTE;
RUN;
The SAS output is;

Ex. 11.1 - MANOVA’S on Cooked Turkey Data

The GLM Procedure

Class Level Information

Class Levels Values

TRT 5 1 2 3 4 5

Number of observations 25

Ex. 11.1 - MANOVA’S on Cooked Turkey Data

The GLM Procedure

Dependent Variable: CKG_LOSS

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 4 106.7240000 26.6810000 5.92 0.0026

Error 20 90.1560000 4.5078000

Corrected Total 24 196.8800000

R-Square Coeff Var Root MSE CKG_LOSS Mean

0.542076 6.208064 2.123158 34.20000

Source DF Type I SS Mean Square F Value Pr > F

TRT 4 106.7240000 26.6810000 5.92 0.0026

Source DF Type III SS Mean Square F Value Pr > F

TRT 4 106.7240000 26.6810000 5.92 0.0026

Ex. 11.1 - MANOVA’S on Cooked Turkey Data

The GLM Procedure

Dependent Variable: MOIST

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 4 34.85089600 8.71272400 5.41 0.0040

Error 20 32.18872000 1.60943600

Corrected Total 24 67.03961600

R-Square Coeff Var Root MSE MOIST Mean

0.519855 2.106822 1.268635 60.21560

Source DF Type I SS Mean Square F Value Pr > F

TRT 4 34.85089600 8.71272400 5.41 0.0040

Source DF Type III SS Mean Square F Value Pr > F

TRT 4 34.85089600 8.71272400 5.41 0.0040

Ex. 11.1 - MANOVA’S on Cooked Turkey Data

The GLM Procedure

Dependent Variable: HEX

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 4 2.48762400 0.62190600 6.32 0.0019

Error 20 1.96768000 0.09838400

Corrected Total 24 4.45530400

R-Square Coeff Var Root MSE HEX Mean

0.558351 40.58776 0.313662 0.772800

Source DF Type I SS Mean Square F Value Pr > F

TRT 4 2.48762400 0.62190600 6.32 0.0019

Source DF Type III SS Mean Square F Value Pr > F

TRT 4 2.48762400 0.62190600 6.32 0.0019

Ex. 11.1 - MANOVA’S on Cooked Turkey Data

The GLM Procedure

Dependent Variable: NONHEM

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 4 47.63440000 11.90860000 8.98 0.0003

Error 20 26.52000000 1.32600000

Corrected Total 24 74.15440000

R-Square Coeff Var Root MSE NONHEM Mean

0.642368 15.21565 1.151521 7.568000

Source DF Type I SS Mean Square F Value Pr > F

TRT 4 47.63440000 11.90860000 8.98 0.0003

Source DF Type III SS Mean Square F Value Pr > F

TRT 4 47.63440000 11.90860000 8.98 0.0003

---------------------------------MANOVA RESULTS ------------------------------------------

Ex. 11.1 - MANOVA’S on Cooked Turkey Data

The GLM Procedure


Multivariate Analysis of Variance

E = Error SSCP Matrix

CKG_LOSS MOIST HEX NONHEM

CKG_LOSS 90.156 -11.8018 3.6296 0.092


MOIST -11.8018 32.18872 1.9382 -12.2128
HEX 3.6296 1.9382 1.96768 1.2708
NONHEM 0.092 -12.2128 1.2708 26.52

Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|

DF = 20 CKG_LOSS MOIST HEX NONHEM

CKG_LOSS 1.000000 -0.219078 0.272511 0.001881


0.3400 0.2320 0.9935

MOIST -0.219078 1.000000 0.243540 -0.418000


0.3400 0.2874 0.0593

HEX 0.272511 0.243540 1.000000 0.175919


0.2320 0.2874 0.4456

NONHEM 0.001881 -0.418000 0.175919 1.000000


0.9935 0.0593 0.4456

Ex. 11.1 - MANOVA’S on Cooked Turkey Data

The GLM Procedure


Multivariate Analysis of Variance

Characteristic Roots and Vectors of: E Inverse * H, where


H = Type III SSCP Matrix for TRT
E = Error SSCP Matrix

Characteristic Characteristic Vector V’EV=1


Root Percent CKG_LOSS MOIST HEX
NONHEM

3.30759898 79.63 0.04170123 -0.06603053 0.30481595


0.09144639
0.76531486 18.43 -0.04427401 -0.09676340 0.75225835
-0.16839923
0.07403236 1.78 0.05768607 0.18866028 0.00400671
0.07286519
0.00654206 0.16 -0.08346829 0.02920328 0.23295316
0.10805741

MANOVA Test Criteria and F Approximations for


the Hypothesis of No Overall TRT Effect
H = Type III SSCP Matrix for TRT
E = Error SSCP Matrix

S=4 M=-0.5 N=7.5

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.12164472 3.26 16 52.573 0.0006


Pillai’s Trace 1.27680982 2.34 16 80 0.0067

Hotelling-Lawley Trace 4.15348825 4.19 16 28.471 0.0004
Roy’s Greatest Root 3.30759898 16.54 4 20 <.0001

NOTE: F Statistic for Roy’s Greatest Root is an upper bound.


In the above example I selected the significant variables from Johnson’s example pages 444-454. Below I
have chosen some of the non-significant variables.
PROC GLM;
CLASSES TRT;
MODEL ph fat ckg_time = TRT/NOUNI;
* MEANS TRT/LSD;
MANOVA H=TRT/PRINTE;
RUN;
with output;
Ex. 11.1 - MANOVA’S on Cooked Turkey Data

The GLM Procedure

Class Level Information

Class Levels Values

TRT 5 1 2 3 4 5

Number of observations 25

-------------------------MANOVA RESULTS-------------------------------

The GLM Procedure


Multivariate Analysis of Variance

E = Error SSCP Matrix

PH FAT CKG_TIME

PH 1.12168 -0.26992 1.396


FAT -0.26992 7.2868 -3.002
CKG_TIME 1.396 -3.002 93.6

Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|

DF = 20 PH FAT CKG_TIME

PH 1.000000 -0.094413 0.136243


0.6840 0.5559

FAT -0.094413 1.000000 -0.114949
0.6840 0.6198

CKG_TIME 0.136243 -0.114949 1.000000


0.5559 0.6198

Ex. 11.1 - MANOVA’S on Cooked Turkey Data

The GLM Procedure


Multivariate Analysis of Variance

Characteristic Roots and Vectors of: E Inverse * H, where


H = Type III SSCP Matrix for TRT
E = Error SSCP Matrix

Characteristic Characteristic Vector V’EV=1


Root Percent PH FAT CKG_TIME

0.36051435 78.26 -0.48890698 0.29304617 0.04089892


0.08011478 17.39 0.39152178 0.18903075 -0.08034885
0.02002093 4.35 0.72243792 0.13549794 0.05361957

MANOVA Test Criteria and F Approximations for


the Hypothesis of No Overall TRT Effect
H = Type III SSCP Matrix for TRT
E = Error SSCP Matrix

S=3 M=0 N=8

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.66714138 0.66 12 47.915 0.7798


Pillai’s Trace 0.35878430 0.68 12 60 0.7644
Hotelling-Lawley Trace 0.46065007 0.66 12 27.465 0.7709
Roy’s Greatest Root 0.36051435 1.80 4 20 0.1680

NOTE: F Statistic for Roy’s Greatest Root is an upper bound.


As you can see there is agreement between the individual univariate models and the multivariate model.
This may not always be the case. One could extend the one way model to higher order models but the
method is the same. Instead, I have included a regression example.

6.2.4 Multivariate Regression – Example


The following example is taken from the Johnson-Wichern text page 455. The SAS code is;
title ’Multivariate Linear Regression’;
data mra;
input tot ami gen amt pr diap qrs;
cards;

3389 3149 1 7500 220 0 140
1101 653 1 1975 200 0 100
1131 810 0 3600 205 60 111
596 448 1 675 160 60 120
896 844 1 750 185 70 83
1767 1450 1 2500 180 60 80
807 493 1 350 154 80 98
1111 941 0 1500 200 70 93
645 547 1 375 137 60 105
628 392 1 1050 167 60 74
1360 1283 1 3000 180 60 80
652 458 1 450 160 64 60
860 722 1 1750 135 90 79
500 384 0 2000 160 60 80
781 501 0 4500 180 0 100
1070 405 0 1500 170 90 120
1754 1520 1 3000 180 0 129
;
proc glm;
model tot ami = gen amt /ss1 ss2;
manova h=_all_/printe printh;
run;
The SAS output is;
Multivariate Linear Regression

The GLM Procedure

Number of observations 17

Multivariate Linear Regression

The GLM Procedure

Dependent Variable: tot

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 2 5905583.873 2952791.936 22.96 <.0001

Error 14 1800356.362 128596.883

Corrected Total 16 7705940.235

R-Square Coeff Var Root MSE tot Mean

0.766368 32.00477 358.6041 1120.471

Source DF Type I SS Mean Square F Value Pr > F

gen 1 288658.119 288658.119 2.24 0.1563


amt 1 5616925.754 5616925.754 43.68 <.0001

Source DF Type II SS Mean Square F Value Pr > F

gen 1 880450.891 880450.891 6.85 0.0203


amt 1 5616925.754 5616925.754 43.68 <.0001

Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 56.7200533 206.7033686 0.27 0.7878


gen 507.0730843 193.7908247 2.62 0.0203
amt 0.3289618 0.0497750 6.61 <.0001

Multivariate Linear Regression

The GLM Procedure

Dependent Variable: ami

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 2 5989720.538 2994860.269 25.87 <.0001

Error 14 1620657.344 115761.239

Corrected Total 16 7610377.882

R-Square Coeff Var Root MSE ami Mean

0.787046 38.56020 340.2370 882.3529

Source DF Type I SS Mean Square F Value Pr > F

gen 1 532382.166 532382.166 4.60 0.0500


amt 1 5457338.373 5457338.373 47.14 <.0001

Source DF Type II SS Mean Square F Value Pr > F

gen 1 1258789.188 1258789.188 10.87 0.0053
amt 1 5457338.373 5457338.373 47.14 <.0001

Standard
Parameter Estimate Error t Value Pr > |t|

Intercept -241.3479096 196.1164016 -1.23 0.2387


gen 606.3096657 183.8652145 3.30 0.0053
amt 0.3242549 0.0472256 6.87 <.0001

-------------------------MANOVA RESULTS---------------------------------------------------

Multivariate Linear Regression

The GLM Procedure


Multivariate Analysis of Variance

E = Error SSCP Matrix

tot ami

tot 1800356.3625 1546194.2227


ami 1546194.2227 1620657.344

Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|

DF = 14 tot ami

tot 1.000000 0.905189


<.0001

ami 0.905189 1.000000


<.0001

Multivariate Linear Regression

The GLM Procedure


Multivariate Analysis of Variance

H = Type II SSCP Matrix for gen

tot ami

tot 880450.89128 1052759.2612


ami 1052759.2612 1258789.1875

Characteristic Roots and Vectors of: E Inverse * H, where
H = Type II SSCP Matrix for gen
E = Error SSCP Matrix

Characteristic Characteristic Vector V’EV=1


Root Percent tot ami

0.83036254 100.00 -0.00044572 0.00118496


0.00000000 0.00 -0.00169597 0.00141839

MANOVA Test Criteria and Exact F Statistics


for the Hypothesis of No Overall gen Effect
H = Type II SSCP Matrix for gen
E = Error SSCP Matrix

S=1 M=0 N=5.5

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.54633985 5.40 2 13 0.0197


Pillai’s Trace 0.45366015 5.40 2 13 0.0197
Hotelling-Lawley Trace 0.83036254 5.40 2 13 0.0197
Roy’s Greatest Root 0.83036254 5.40 2 13 0.0197

H = Type II SSCP Matrix for amt

tot ami

tot 5616925.7542 5536557.094


ami 5536557.094 5457338.3727

Multivariate Linear Regression

The GLM Procedure


Multivariate Analysis of Variance

Characteristic Roots and Vectors of: E Inverse * H, where


H = Type II SSCP Matrix for amt
E = Error SSCP Matrix

Characteristic Characteristic Vector V’EV=1


Root Percent tot ami

3.42870806 100.00 0.00023456 0.00055467


0.00000000 0.00 -0.00173781 0.00176303

MANOVA Test Criteria and Exact F Statistics
for the Hypothesis of No Overall amt Effect
H = Type II SSCP Matrix for amt
E = Error SSCP Matrix

S=1 M=0 N=5.5

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.22579949 22.29 2 13 <.0001


Pillai’s Trace 0.77420051 22.29 2 13 <.0001
Hotelling-Lawley Trace 3.42870806 22.29 2 13 <.0001
Roy’s Greatest Root 3.42870806 22.29 2 13 <.0001
Each of the above models are dependent upon the assumption of homogeneity of variance-covariance. The
next chapter addresses procedure for determining whether or not these assumptions are satisfied.

Chapter 7

Inference for Covariance Matrices

In this chapter we consider the problem of inference for various forms of the covariance matrix. The first
case is the univariate test for the variance of a population.

7.1 One Sample Case


7.1.1 Univariate Problem
Suppose that one assumes that X ∼ N(µ, σ²) and wants to test the hypothesis H0: σ² = σ0² versus
H1: σ² ≠ σ0². The test statistic is

t = (n − 1)σ̂²/σ0²,

where

(n − 1)σ̂² = Σ_{i=1}^n (xi − x̄)².

If the null hypothesis is true then the distribution of t is χ²(df = n − 1). That is, one rejects H0 if
t ≤ χ²_{1−α/2}(n − 1) or if t ≥ χ²_{α/2}(n − 1), where χ²_ν(n − 1) is the upper ν × 100% critical point for a
chi-square distribution with n − 1 degrees of freedom.
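A minimal Splus sketch (assuming the sample is in a vector x and sigma0sq holds the hypothesized
variance; both names are placeholders) is;
n <- length(x)
t <- (n - 1)*var(x)/sigma0sq                              # chi-square statistic
pvalue <- 2*min(pchisq(t, n - 1), 1 - pchisq(t, n - 1))   # two-sided p-value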

7.1.2 Multivariate Problem


In the multivariate case, the null hypothesis becomes H0: Σ = Σ0 versus H1: Σ ≠ Σ0, where Σ0 is a specified
p × p covariance matrix. The procedure is to compute the Likelihood Ratio Test where the mean µ is a free
parameter which is estimated by x̄ in both the numerator and denominator, in which case one obtains

Λ* = (e/ν)^{pν/2} |Σ0⁻¹W|^{ν/2} exp(−(1/2)tr[Σ0⁻¹W])

where ν = n − 1 and W = (n − 1)Σ̂. Johnson provides SAS IML code for computing Λ* (called LAM_STAR below).

7.1.3 SAS IML Example


The SAS code is;
OPTIONS LINESIZE=75 PAGESIZE=54 NODATE PAGENO=1;
TITLE ’Ex. 10.1 - A test on SIGMA, the variance-covariance matrix.’;

DATA DAT;
INPUT ID X1-X3;
CARDS;
1 1 3 5
2 2 3 5
3 1 4 6
4 1 4 4
5 3 4 7
6 2 5 6
DATA XX; SET DAT; DROP ID;
PROC IML;
RESET NOLOG;
USE XX; READ ALL INTO X;
PRINT "The Data Matrix is" X;
N=NROW(X); P=NCOL(X);
XBAR = X(|+,|)‘/N;
PRINT, "XBAR = " XBAR;
SUMSQ=X‘*X-(XBAR*XBAR‘)#N;
S=SUMSQ/(N-1);
PRINT , "The Variance-Covariance Matrix is " S;
NU=N-1;
W=NU#S;
SIG0 = {2 1 0, 1 2 0, 0 0 1};
Q=INV(SIG0)*W;
LAM_STAR = (EXP(1)/NU)##(P#NU/2)#(DET(Q))##(NU/2)#EXP(-.5#TRACE(Q));
PRINT, "LAM_STAR = " LAM_STAR;
L = -2#LOG(LAM_STAR);
PRINT, "L = " L;
B2 = P#(2#P#P+3#P-1)/24;
B3 = -P#(P-1)#(P+1)#(P+2)/32;
F=P#(P+1)/2;
A1 = 1-PROBCHI(L,F);
A2 = 1-PROBCHI(L,F+2);
A3 = 1-PROBCHI(L,F+4);
ALPHA = A1+(1/NU)#B2#(A2-A1)
+(1/(6#NU#NU))#((3#B2#B2-4#B3)#A3-6#B2#B2#A2
+(3#B2#B2+4#B3)#A1);
PRINT, "ALPHA = ", ALPHA;
QUIT;
The SAS output is;
Ex. 10.1 - A test on SIGMA, the variance-covariance matrix.
X

The Data Matrix is 1 3 5


2 3 5
1 4 6
1 4 4
3 4 7
2 5 6

XBAR

XBAR = 1.6666667
3.8333333
5.5

The Variance-Covariance Matrix is 0.6666667 0.1333333 0.6


0.1333333 0.5666667 0.3
0.6 0.3 1.1

LAM_STAR = 0.0162956

L = 8.2337203

ALPHA = 0.3842768
The Splus code is;
attach(e10) # or
e10<-t(cbind(c(1, 3, 5),c(2, 3, 5),c(1, 4, 6),c(1, 4, 4),c(3, 4, 7),c(2, 5, 6)))
x<-data.matrix(e10)
n<-dim(x)[1]
p<-dim(x)[2]
xbar<-cbind(mean(x[,1]),mean(x[,2]),mean(x[,3]))
nu<-n-1
w<-nu*var(x)
sig0<-cbind(c(2, 1, 0),c(1, 2 ,0),c(0, 0, 1))
q<-solve(sig0)%*%w
lamstar<-(exp(1)/nu)^(p*nu/2)*prod(eigen(q)$values)^(nu/2)*exp(-.5*sum(diag(q)))
b2<-p*(2*p*p+3*p-1)/24
b3<--p*(p-1)*(p+1)*(p+2)/32
f<-p*(p+1)/2
a1<- 1 - pchisq(-2*log(lamstar),f)
a2<- 1 - pchisq(-2*log(lamstar),f+2)
a3<- 1 - pchisq(-2*log(lamstar),f+4)
alpha<-a1 + b2*(a2-a1)/nu + ((3*b2*b2-4*b3)*a3 - 6*b2*b2*a2 + (3*b2*b2 + 4*b3)*a1)/(6*nu*nu)
Johnson gives several other related tests with corresponding SAS IML code. I have included some of his
examples;

7.1.4 Test for Sphericity


Suppose that one wants to test H0: Σ = σ²I versus the two sided alternative. The test statistic is a special
case of the Likelihood Ratio Test given by

Λ = |W| / [(1/p)tr(W)]^p.

The SAS IML code is;
OPTIONS LINESIZE=75 PAGESIZE=54 NODATE PAGENO=1;
TITLE
’Ex. 10.2 - A test for sphericity of the variance-covariance matrix.’;
DATA DAT;
INPUT ID X1-X3;
CARDS;
1 1 3 5
2 2 3 5
3 1 4 6
4 1 4 4
5 3 4 7
6 2 5 6
DATA XX; SET DAT; DROP ID;
PROC IML;
RESET NOLOG;
USE XX; READ ALL INTO X;
PRINT "The Data Matrix is" X;
N=NROW(X); P=NCOL(X);
XBAR = X(|+,|)‘/N;
PRINT, "XBAR = " XBAR;
SUMSQ=X‘*X-(XBAR*XBAR‘)#N;
S=SUMSQ/(N-1);
PRINT , "The Variance-Covariance Matrix is " S;
NU=N-1;
W=NU#S;
LAMDA = DET(W)/((1/P)#TRACE(W))##P;
PRINT, "LAMDA = " LAMDA;
M = NU - (2#P#P+P+2)/6/P;
L = -M#LOG(LAMDA);
PRINT, "L = " L;
A = (P+1)#(P-1)#(P+2)#(2#P#P#P+6#P#P+3#P+2)/288/P/P;
F=P#(P+1)/2-1;
A1 = 1-PROBCHI(L,F);
A3 = 1-PROBCHI(L,F+4);
ALPHA = A1+(A/M/M)#(A3-A1);
PRINT, "ALPHA = ", ALPHA;
QUIT;
And the SAS output is;
Ex. 10.2 - A test for sphericity of the variance-covariance matrix.

The Data Matrix is 1 3 5


2 3 5
1 4 6
1 4 4
3 4 7
2 5 6

XBAR = 1.6666667
3.8333333
5.5

The Variance-Covariance Matrix is 0.6666667 0.1333333 0.6


0.1333333 0.5666667 0.3
0.6 0.3 1.1

LAMDA = 0.3825656

L = 3.5765164

ALPHA = 0.654943
The Splus code is;
attach(e10) # or
e10<-t(cbind(c(1, 3, 5),c(2, 3, 5),c(1, 4, 6),c(1, 4, 4),c(3, 4, 7),c(2, 5, 6)))
x<-data.matrix(e10)
n<-dim(x)[1]
p<-dim(x)[2]
xbar<-cbind(mean(x[,1]),mean(x[,2]),mean(x[,3]))
nu<-n-1
w<-nu*var(x)
lambda<-prod(eigen(w)$values)/((1/p)*sum(diag(w)))^p
m<-nu - (2*p*p+p+2)/6/p
L<--m*log(lambda)
a<-(p+1)*(p-1)*(p+2)*(2*p^3+6*p*p+3*p+2)/288/p/p
f<-p*(p+1)/2-1
a1<- 1 - pchisq(L,f)
a3<- 1 - pchisq(L,f+4)
alpha<-a1 + (a/m/m)*(a3-a1)
lambda
L
alpha

7.1.5 Test for Equicorrelation or Compound Symmetry


In this section the null hypothesis is

           [ 1  ρ  ...  ρ ]
H0: Σ = σ² [ ρ  1  ...  ρ ]
           [ :  :        : ]
           [ ρ  ρ  ...  1 ].

The test statistic is

Λ = |Σ̂| / { (s²)^p (1 − r)^{p−1} [1 + (p − 1)r] }

where

s² = Σ_{i=1}^p σ̂ii/p    and    r = 2 Σ_{i<j} σ̂ij / [p(p − 1)s²].

One rejects the null hypothesis for large n if Q ≥ χ²_{α,f}, where

Q = −[n − 1 − p(p + 1)²(2p − 3)/(6(p − 1)(p² + p − 4))] log(Λ)    and    f = [p(p + 1) − 4]/2.

The SAS IML code is;

OPTIONS LINESIZE=75 PAGESIZE=54 NODATE PAGENO=1;


TITLE
’Ex. 10.3 - A test for compound symmetry of a var-cov matrix.’;
DATA DAT;
INPUT ID X1-X3;
CARDS;
1 1 3 5
2 2 3 5
3 1 4 6
4 1 4 4
5 3 4 7
6 2 5 6
DATA XX; SET DAT; DROP ID;
PROC IML;
RESET NOLOG;
USE XX; READ ALL INTO X;
PRINT "The Data Matrix is" X;
N=NROW(X); P=NCOL(X);
XBAR = X(|+,|)‘/N;
PRINT, "XBAR = " XBAR;
SUMSQ=X‘*X-(XBAR*XBAR‘)#N;
S=SUMSQ/(N-1);
PRINT , "The Variance-Covariance Matrix is " S;
S2 = (1/P)#TRACE(S);
R = (2/(P#(P-1)#S2))#(S(|+,+|)-TRACE(S))/2;
LAMDA = DET(S)/((S2##P)#((1-R)##(P-1))#(1+(P-1)#R));
Q = -(N-1-(P#(P+1)#(P+1)#(2#P-3))/(6#(P-1)#(P#P+P-4)))#LOG(LAMDA);
F=(P#(P+1)-4)/2;
ALPHA = 1 - PROBCHI(Q,F);
PRINT, "S2 = " S2, "R = " R, "LAMDA = " LAMDA, "Q = " Q;
PRINT, "DEGREES OF FREEDOM = " F;
PRINT, "ALPHA = " ALPHA;
QUIT;

The SAS output is;


Ex. 10.3 - A test for compound symmetry of a var-cov matrix.

The Data Matrix is 1 3 5


2 3 5
1 4 6
1 4 4
3 4 7
2 5 6

XBAR

XBAR = 1.6666667
3.8333333
5.5
S

The Variance-Covariance Matrix is 0.6666667 0.1333333 0.6


0.1333333 0.5666667 0.3
0.6 0.3 1.1

S2 = 0.7777778
R = 0.4428571
LAMDA = 0.6535772
Q = 1.4885312
DEGREES OF FREEDOM = 4
ALPHA = 0.8286711
The Splus code is;
attach(e10) # or
e10<-t(cbind(c(1, 3, 5),c(2, 3, 5),c(1, 4, 6),c(1, 4, 4),c(3, 4, 7),c(2, 5, 6)))
x<-data.matrix(e10)
n<-dim(x)[1]
p<-dim(x)[2]
xbar<-cbind(mean(x[,1]),mean(x[,2]),mean(x[,3]))
nu<-n-1
s<-var(x)
s2<-sum(diag(s))/p
r<-(2/(p*(p-1)*s2))*(sum(s)-sum(diag(s)))/2
lambda<-prod(eigen(s)$values)/((s2^p)*((1-r)^(p-1))*(1+(p-1)*r))
Q<- -(nu-(p*(p+1)^2*(2*p-3))/6/(p-1)/(p^2+p-4))*log(lambda)
f<-(p*(p+1)-4)/2
alpha<- 1 - pchisq(Q,f)
s2
r
lambda
Q
f
alpha

7.1.6 Test for Equality of Several Covariances


Suppose that one has k populations and wishes to test the hypothesis,

H0 : Σ1 = Σ2 = . . . = Σk

versus the alternative that there is at least one inequality. Bartlett's test is based upon the Likelihood Ratio
Test with the test statistic given by

Λ = Π_{i=1}^k |Σ̂i|^{(ni−1)/2} / |Σ̂|^{(n−k)/2}

where Σ̂i is the sample covariance matrix for population i,

Σ̂ = Σ_{i=1}^k (ni − 1)Σ̂i/(n − k),

and n = Σ_{i=1}^k ni, where ni is the sample size for the ith population. It can be shown that

−2 log(Λ) ∼ χ²(df = p(p + 1)(k − 1)/2).

SAS uses a modification of this statistic where the test statistic is

Q = −2ρ log[ n^{pn/2} Λ / Π_{i=1}^k ni^{pni/2} ]

where

ρ = 1 − [Σ_{i=1}^k ni⁻¹ − n⁻¹][(2p² + 3p − 1)/(6(p + 1)(k − 1))].

Then one rejects the null hypothesis if Q ≥ χ²_α(df = p(p + 1)(k − 1)/2).
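The text gives no Splus code for this test; a minimal sketch of Bartlett's statistic with the large-sample
chi-square approximation (assuming xlist is a list holding the k data matrices, each with p columns) is;
k <- length(xlist)
p <- dim(xlist[[1]])[2]
ni <- sapply(xlist, function(z) dim(z)[1])
n <- sum(ni)
logdet <- function(A) sum(log(eigen(A)$values))  # log determinant via eigenvalues
Sp <- matrix(0, p, p)
logLam <- 0
for (i in 1:k) {
  Si <- var(xlist[[i]])
  Sp <- Sp + (ni[i] - 1)*Si
  logLam <- logLam + ((ni[i] - 1)/2)*logdet(Si)
}
Sp <- Sp/(n - k)                             # pooled covariance matrix
logLam <- logLam - ((n - k)/2)*logdet(Sp)    # log of Bartlett's Lambda
Q <- -2*logLam
pvalue <- 1 - pchisq(Q, p*(p + 1)*(k - 1)/2)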

7.1.7 SAS Example


SAS uses PROC DISCRIM to test the homogeneity of covariance assumption. The SAS code is;
title ’Multivariate Linear Regression’;
data mra;
input tot ami gen amt pr diap qrs;
cards;
3389 3149 1 7500 220 0 140
1101 653 1 1975 200 0 100
1131 810 0 3600 205 60 111
596 448 1 675 160 60 120
896 844 1 750 185 70 83
1767 1450 1 2500 180 60 80
807 493 1 350 154 80 98
1111 941 0 1500 200 70 93
645 547 1 375 137 60 105
628 392 1 1050 167 60 74
1360 1283 1 3000 180 60 80
652 458 1 450 160 64 60
860 722 1 1750 135 90 79
500 384 0 2000 160 60 80
781 501 0 4500 180 0 100
1070 405 0 1500 170 90 120
1754 1520 1 3000 180 0 129
;
proc discrim
method=normal pool=test;
class gen;
var tot ami amt;
run;

The SAS output is;
The DISCRIM Procedure
Test of Homogeneity of Within Covariance Matrices

Notation: K = Number of Groups


P = Number of Variables
N = Total Number of Observations - Number of Groups
N(i) = Number of Observations in the i’th Group - 1

__ N(i)/2
|| |Within SS Matrix(i)|
V = -----------------------------------
N/2
|Pooled SS Matrix|
_ _ 2
| 1 1 | 2P + 3P - 1
RHO = 1.0 - | SUM ----- - --- | -------------
|_ N(i) N _| 6(P+1)(K-1)

DF = .5(K-1)P(P+1)
_ _
| PN/2 |
| N V |
Under the null hypothesis: -2 RHO ln | ------------------ |
| __ PN(i)/2 |
|_ || N(i) _|

is distributed approximately as Chi-Square(DF).

Chi-Square DF Pr > ChiSq


12.609630 6 0.0497

Since the Chi-Square value is significant at the 0.1 level, the within covariance
matrices will be used in the discriminant function.
Reference: Morrison, D.F. (1976) Multivariate Statistical Methods p252.

Chapter 8

Profile Analysis

In this chapter the one and two sample Hotelling's T² tests are extended to profile analysis. The method
for doing the analysis is given in the chapter on the General Linear Model; hence, the analysis can be done
using SAS PROC GLM.

8.1 One Sample Case


Suppose that one has n subjects that are measured repeatedly over p successive time periods. The data
then follow the General Linear Multivariate Model, where Y is a matrix of n p-dimensional random variables
such that

yij = µj + eij,

for i = 1, 2, . . . , n, j = 1, 2, . . . , p. This can be written as

Y = XB + E

where
• ỹi′ = [yi1, yi2, . . . , yip] is the ith row of the matrix Y, representing the measurements for the p time
periods for the ith subject.
• X is an n × 1 matrix of ones, that is, X = j̃n.
• B is an unknown 1 × p matrix of means, B = (µ1 , µ2 , . . . , µp ) where µj is the expected value of y for
the j th time period.
• E is an n × p matrix of unobservable errors such that E(ei) = 0 and Var(ei) = Σ, where ei′ is the ith
row of E, and the rows of E are independent of one another, i.e., cov(ei, ej) = 0 for i ≠ j.
The null hypothesis of interest is;
H0 : µ1 = µ2 = . . . = µp
versus the alternative that the null hypothesis is false. This can be expressed in the general form of a linear
hypothesis for the parameters as
H0 : LBM = c

where L = 1 is 1 × 1, M is a p × (p − 1) matrix of rank p − 1, c is a 1 × (p − 1) matrix of zeroes, and

    [  1    0  ...   0 ]
    [  0    1  ...   0 ]
M = [  :    :        : ]
    [  0    0  ...   1 ]
    [ −1   −1  ...  −1 ].
From here one computes

Qh = (LB̂M − c)′[L(X′X)⁻¹L′]⁻¹(LB̂M − c)

and

Qe = M′Y′[I − H]YM.
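A minimal Splus sketch of this contrast matrix (assuming p holds the number of time periods) is;
M <- rbind(diag(p - 1), rep(-1, p - 1))  # identity over a row of -1's: p x (p-1)
Each column of M contrasts one of the first p − 1 time periods with the last one, matching the contrasts
labeled time_N in the SAS output below.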

8.1.1 Example – One Sample Profile Analysis


The SAS code is;
title ’Profile Analysis for Probe-Word Data’;
data word;
input subject t1-t5;
cards;
1 51 36 50 35 42
2 27 20 26 17 27
3 37 22 41 37 30
4 42 36 32 34 27
5 27 18 33 14 29
6 43 32 43 35 40
7 41 22 36 25 38
8 38 21 31 20 16
9 36 23 27 25 28
10 26 31 31 32 36
11 29 20 25 26 25
;
proc glm;
model t1-t5= /nouni;
repeated time 5 contrast(5) / summary printm printh printe;
run;
The SAS output is;
Profile Analysis for Probe-Word Data

The GLM Procedure

Number of observations 11
Repeated Measures Level Information

Dependent Variable t1 t2 t3 t4 t5

Level of time 1 2 3 4 5

Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|

DF = 10 t1 t2 t3 t4 t5

t1 1.000000 0.614390 0.757185 0.575073 0.413057


0.0443 0.0070 0.0642 0.2067

t2 0.614390 1.000000 0.547390 0.749777 0.547660


0.0443 0.0813 0.0079 0.0812

t3 0.757185 0.547390 1.000000 0.605272 0.691893


0.0070 0.0813 0.0485 0.0183

t4 0.575073 0.749777 0.605272 1.000000 0.523888


0.0642 0.0079 0.0485 0.0981

t5 0.413057 0.547660 0.691893 0.523888 1.000000


0.2067 0.0812 0.0183 0.0981

time_N represents the contrast between the nth level of time and the 5th

M Matrix Describing Transformed Variables

t1 t2 t3 t4 t5

time_1 1.000000000 0.000000000 0.000000000 0.000000000 -1.000000000


time_2 0.000000000 1.000000000 0.000000000 0.000000000 -1.000000000
time_3 0.000000000 0.000000000 1.000000000 0.000000000 -1.000000000
time_4 0.000000000 0.000000000 0.000000000 1.000000000 -1.000000000

E = Error SSCP Matrix

time_N represents the contrast between the nth level of time and the 5th

time_1 time_2 time_3 time_4

time_1 724.55 380.73 392.55 378.82


time_2 380.73 475.64 176.73 385.09
time_3 392.55 176.73 366.55 227.82
time_4 378.82 385.09 227.82 576.73

Partial Correlation Coefficients from the Error SSCP Matrix of the


Variables Defined by the Specified Transformation / Prob > |r|

DF = 10 time_1 time_2 time_3 time_4

time_1 1.000000 0.648550 0.761716 0.586020
0.0309 0.0064 0.0581

time_2 0.648550 1.000000 0.423255 0.735259


0.0309 0.1946 0.0099

time_3 0.761716 0.423255 1.000000 0.495495


0.0064 0.1946 0.1212

time_4 0.586020 0.735259 0.495495 1.000000


0.0581 0.0099 0.1212

Sphericity Tests

Mauchly’s
Variables DF Criterion Chi-Square Pr > ChiSq

Transformed Variates 9 0.0879791 20.458026 0.0153


Orthogonal Components 9 0.4796455 6.1837927 0.7214

H = Type III SSCP Matrix for time

time_N represents the contrast between the nth level of time and the 5th

time_1 time_2 time_3 time_4

time_1 316.45454545 -305.7272727 198.45454545 -203.8181818


time_2 -305.7272727 295.36363636 -191.7272727 196.90909091
time_3 198.45454545 -191.7272727 124.45454545 -127.8181818
time_4 -203.8181818 196.90909091 -127.8181818 131.27272727

Manova Test Criteria and Exact F Statistics for the Hypothesis of no time Effect
H = Type III SSCP Matrix for time
E = Error SSCP Matrix

S=1 M=1 N=2.5

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.24822547 5.30 4 7 0.0277


Pillai’s Trace 0.75177453 5.30 4 7 0.0277
Hotelling-Lawley Trace 3.02859542 5.30 4 7 0.0277
Roy’s Greatest Root 3.02859542 5.30 4 7 0.0277

Repeated Measures Analysis of Variance


Univariate Tests of Hypotheses for Within Subject Effects

Adj Pr > F
Source DF Type III SS Mean Square F Value Pr > F G - G H - F

time 4 867.5272727 216.8818182 9.25 <.0001 0.0001 <.0001


Error(time) 40 938.0727273 23.4518182

Greenhouse-Geisser Epsilon 0.7851


Huynh-Feldt Epsilon 1.1860

The GLM Procedure


Repeated Measures Analysis of Variance
Analysis of Variance of Contrast Variables

time_N represents the contrast between the nth level of time and the 5th

Contrast Variable: time_1

Source DF Type III SS Mean Square F Value Pr > F

Mean 1 316.4545455 316.4545455 4.37 0.0631


Error 10 724.5454545 72.4545455

Contrast Variable: time_2

Source DF Type III SS Mean Square F Value Pr > F

Mean 1 295.3636364 295.3636364 6.21 0.0319


Error 10 475.6363636 47.5636364

Contrast Variable: time_3

Source DF Type III SS Mean Square F Value Pr > F

Mean 1 124.4545455 124.4545455 3.40 0.0952


Error 10 366.5454545 36.6545455

Contrast Variable: time_4

Source DF Type III SS Mean Square F Value Pr > F

Mean 1 131.2727273 131.2727273 2.28 0.1623

Error 10 576.7272727 57.6727273

8.2 Two Sample Case


Suppose that one has n_u subjects, for u = 1, 2, that are measured repeatedly over p successive time periods.
The data then follow the General Linear Multivariate Model, where Y is a matrix of n = n_1 + n_2 p-dimensional
random variables such that

y_{ujk} = µ_{uk} + e_{ujk},

for k = 1, 2, \ldots, p, j = 1, 2, \ldots, n_u, u = 1, 2. This can be written as

Y = XB + E

where
• \vec y_{uj}' = [y_{uj1}, y_{uj2}, \ldots, y_{ujp}] is the jth row for sample u = 1, 2, representing the measurements for the p
time periods for the jth subject in sample u. That is,

Y = \begin{pmatrix} \vec y_{11}' \\ \vec y_{12}' \\ \vdots \\ \vec y_{1n_1}' \\ \vec y_{21}' \\ \vec y_{22}' \\ \vdots \\ \vec y_{2n_2}' \end{pmatrix}.

• X is an n × 2 matrix given by

X = \begin{pmatrix} \vec j_{n_1} & \vec 0_{n_1} \\ \vec 0_{n_2} & \vec j_{n_2} \end{pmatrix},

where \vec j_n is an n × 1 vector of ones and \vec 0_n is an n × 1 vector of zeros.


• B is an unknown 2 × p matrix of means,

B = \begin{pmatrix} µ_{11} & µ_{12} & \cdots & µ_{1p} \\ µ_{21} & µ_{22} & \cdots & µ_{2p} \end{pmatrix},

where µ_{uj} is the expected value of y for the jth time period in sample u.
• E is an n × p matrix of unobservable errors, such that E(e_i') = 0 and Var(e_i') = Σ, and the rows of E
are independent of one another, i.e., cov(e_i, e_j) = 0.
There are three hypotheses of interest:
1. Are the profiles for the two samples parallel?
2. Are there differences among the time periods? Are the lines flat?
3. Are there differences between the two samples?

1. Using the general linear hypothesis H_0: LBM = c, the matrices for testing parallel profiles are
L = (1  -1), which is 1 × 2; M, a p × (p − 1) matrix of rank p − 1; and c = (0 0 \ldots 0), a 1 × (p − 1) vector,
with

M = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 & 0 \\ -1 & 1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & -1 & 1 \\ 0 & 0 & 0 & \cdots & 0 & -1 \end{pmatrix}.
From here one computes

Q_h = (L\hat{B}M - c)'[L(X'X)^{-1}L']^{-1}(L\hat{B}M - c)

and

Q_e = M'Y'[I - H]YM.

2. For the second hypothesis, that the profiles are flat given that they are parallel, the matrices are
L = (1/2  1/2), which is 1 × 2; M, a p × (p − 1) matrix of rank p − 1; and c = (0 0 \ldots 0), a 1 × (p − 1) vector,
with

M = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ -1 & -1 & -1 & \cdots & -1 \end{pmatrix}.
3. For the third hypothesis, that the two profiles are the same, the matrices are L = (1  -1), which is 1 × 2;
M = I_p, the p-dimensional identity matrix; and c = (0 0 \ldots 0), a 1 × p vector of zeros.
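The following PROC IML fragment is a minimal sketch of the L and M matrices for the three hypotheses,
written out for p = 5 time periods:

proc iml;
/* 1. Parallel profiles: successive-difference contrasts */
L1 = {1 -1};
M1 = { 1  0  0  0,
      -1  1  0  0,
       0 -1  1  0,
       0  0 -1  1,
       0  0  0 -1};
/* 2. Flat profiles, given parallelism: average the two groups */
L2 = {0.5 0.5};
M2 = { 1  0  0  0,
       0  1  0  0,
       0  0  1  0,
       0  0  0  1,
      -1 -1 -1 -1};
/* 3. Coincident profiles: compare the full mean vectors */
L3 = {1 -1};
M3 = i(5);
quit;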

8.2.1 Example – Two Sample Profile Analysis


The SAS code is;
options center nodate;
title 'Profile Analysis for 2 Sample Probe-Word Data';
data word;
input group subject t1-t5;
cards;
1 1 20 21 42 32 32
1 2 67 29 56 39 41
1 3 37 25 28 31 34
1 4 42 38 36 19 35
1 5 57 32 21 30 29
1 6 39 38 54 31 28
1 7 43 20 46 42 31
1 8 35 34 43 35 42
1 9 41 23 51 27 30

1 10 39 24 35 26 32
2 1 47 25 36 21 27
2 2 53 32 48 46 54
2 3 38 33 42 48 49
2 4 60 41 67 53 50
2 5 37 35 45 34 46
2 6 59 37 52 36 52
2 7 67 33 61 31 50
2 8 43 27 36 33 32
2 9 64 53 62 40 43
2 10 41 34 47 37 46
;
title2 'Test for parallel profiles';
proc glm;
model t1-t5= group/nouni;
manova h=group m=t1-t2,t2-t3,t3-t4,t4-t5/ printh printe;
run;
title2 'Test for level parallel profiles';
proc glm;
model t1-t5= /nouni;
manova h=intercept m=t1-t5,t2-t5,t3-t5,t4-t5/ printh printe;
run;
title2 'Test for similar profiles';
proc glm;
model t1-t5=group/nouni;
manova h=group / printh printe;
run;
The SAS output is;
Hypothesis #1

Profile Analysis for 2 Sample Probe-Word Data


Test for parallel profiles

The GLM Procedure

Number of observations 20

The GLM Procedure


Multivariate Analysis of Variance

M Matrix Describing Transformed Variables

t1 t2 t3 t4 t5

MVAR1 1 -1 0 0 0
MVAR2 0 1 -1 0 0
MVAR3 0 0 1 -1 0
MVAR4 0 0 0 1 -1

E = Error SSCP Matrix

MVAR1 MVAR2 MVAR3 MVAR4

MVAR1 2313.3 -740.8 274.7 16.2


MVAR2 -740.8 2026 -1103.8 -230.6
MVAR3 274.7 -1103.8 1996.1 -505
MVAR4 16.2 -230.6 -505 917.6

Partial Correlation Coefficients from the Error SSCP Matrix of the


Variables Defined by the Specified Transformation / Prob > |r|

DF = 18 MVAR1 MVAR2 MVAR3 MVAR4

MVAR1 1.000000 -0.342188 0.127836 0.011119


0.1516 0.6020 0.9640

MVAR2 -0.342188 1.000000 -0.548883 -0.169127


0.1516 0.0149 0.4888

MVAR3 0.127836 -0.548883 1.000000 -0.373141


0.6020 0.0149 0.1156

MVAR4 0.011119 -0.169127 -0.373141 1.000000


0.9640 0.4888 0.1156

H = Type III SSCP Matrix for group

MVAR1 MVAR2 MVAR3 MVAR4

MVAR1 26.45 -20.7 19.55 -55.2


MVAR2 -20.7 16.2 -15.3 43.2
MVAR3 19.55 -15.3 14.45 -40.8
MVAR4 -55.2 43.2 -40.8 115.2

Characteristic Roots and Vectors of: E Inverse * H, where


H = Type III SSCP Matrix for group
E = Error SSCP Matrix

Variables have been transformed by the M Matrix

Characteristic Characteristic Vector V’EV=1


Root Percent MVAR1 MVAR2 MVAR3 MVAR4

0.19181198 100.00 -0.00186620 0.01637997 0.01483907 0.03902363


0.00000000 0.00 0.00140848 0.01724386 0.02868271 0.00436691

0.00000000 0.00 0.00797379 0.02144933 -0.00238864 -0.00506870
0.00000000 0.00 0.02073438 0.00387216 0.00479861 0.01018267

MANOVA Test Criteria and Exact F Statistics for the Hypothesis of No Overall group Effect
on the Variables Defined by the M Matrix Transformation
H = Type III SSCP Matrix for group
E = Error SSCP Matrix

S=1 M=1 N=6.5

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.83905852 0.72 4 15 0.5919


Pillai’s Trace 0.16094148 0.72 4 15 0.5919
Hotelling-Lawley Trace 0.19181198 0.72 4 15 0.5919
Roy’s Greatest Root 0.19181198 0.72 4 15 0.5919

Hypothesis #2
The GLM Procedure

Number of observations 20

Profile Analysis for 2 Sample Probe-Word Data


Test for level parallel profiles

M Matrix Describing Transformed Variables

t1 t2 t3 t4 t5

MVAR1 1 0 0 0 -1
MVAR2 0 1 0 0 -1
MVAR3 0 0 1 0 -1
MVAR4 0 0 0 1 -1

E = Error SSCP Matrix

MVAR1 MVAR2 MVAR3 MVAR4

MVAR1 2708.2 874.7 900.5 260.6


MVAR2 874.7 1380.95 645.25 299.6
MVAR3 900.5 645.25 1951.75 487
MVAR4 260.6 299.6 487 1032.8

Partial Correlation Coefficients from the Error SSCP Matrix of the


Variables Defined by the Specified Transformation / Prob > |r|

DF = 19 MVAR1 MVAR2 MVAR3 MVAR4

MVAR1 1.000000 0.452303 0.391680 0.155821


0.0453 0.0877 0.5118

MVAR2 0.452303 1.000000 0.393031 0.250868


0.0453 0.0865 0.2860

MVAR3 0.391680 0.393031 1.000000 0.343012


0.0877 0.0865 0.1387

MVAR4 0.155821 0.250868 0.343012 1.000000


0.5118 0.2860 0.1387

H = Type III SSCP Matrix for Intercept

MVAR1 MVAR2 MVAR3 MVAR4

MVAR1 1065.8 -1087.7 912.5 -671.6


MVAR2 -1087.7 1110.05 -931.25 685.4
MVAR3 912.5 -931.25 781.25 -575
MVAR4 -671.6 685.4 -575 423.2

Characteristic Roots and Vectors of: E Inverse * H, where


H = Type III SSCP Matrix for Intercept
E = Error SSCP Matrix

Variables have been transformed by the M Matrix

Characteristic Characteristic Vector V’EV=1


Root Percent MVAR1 MVAR2 MVAR3 MVAR4

3.31284027 100.00 -0.01094274 0.02384048 -0.01404780 0.01341285


0.00000000 0.00 0.00538403 -0.01506510 -0.00170142 0.03063142
0.00000000 0.00 -0.01325539 0.00580254 0.02239892 0.00000000
0.00000000 0.00 0.01318958 0.01292402 0.00000000 0.00000000

MANOVA Test Criteria and Exact F Statistics for the Hypothesis of No Overall Intercept Effect
on the Variables Defined by the M Matrix Transformation
H = Type III SSCP Matrix for Intercept
E = Error SSCP Matrix

S=1 M=1 N=7

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.23186576 13.25 4 16 <.0001
Pillai’s Trace 0.76813424 13.25 4 16 <.0001
Hotelling-Lawley Trace 3.31284027 13.25 4 16 <.0001
Roy’s Greatest Root 3.31284027 13.25 4 16 <.0001

Hypothesis #3
The GLM Procedure

Number of observations 20

Profile Analysis for 2 Sample Probe-Word Data


Test for similar profiles

E = Error SSCP Matrix

t1 t2 t3 t4 t5

t1 2546.9 597 946.6 257.9 385.9


t2 597 960.4 569.2 155.2 299.4
t3 946.6 569.2 2204 686.2 599.8
t4 257.9 155.2 686.2 1164.5 573.1
t5 385.9 299.4 599.8 573.1 899.3

Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|

DF = 18 t1 t2 t3 t4 t5

t1 1.000000 0.381718 0.399535 0.149753 0.254986


0.1068 0.0901 0.5406 0.2921

t2 0.381718 1.000000 0.391231 0.146756 0.322161


0.1068 0.0977 0.5488 0.1786

t3 0.399535 0.391231 1.000000 0.428327 0.426038


0.0901 0.0977 0.0673 0.0689

t4 0.149753 0.146756 0.428327 1.000000 0.560026


0.5406 0.5488 0.0673 0.0126

t5 0.254986 0.322161 0.426038 0.560026 1.000000


0.2921 0.1786 0.0689 0.0126

Profile Analysis for 2 Sample Probe-Word Data


Test for similar profiles

H = Type III SSCP Matrix for group

t1 t2 t3 t4 t5

t1 396.05 293.7 373.8 298.15 511.75


t2 293.7 217.8 277.2 221.1 379.5
t3 373.8 277.2 352.8 281.4 483
t4 298.15 221.1 281.4 224.45 385.25
t5 511.75 379.5 483 385.25 661.25

Characteristic Roots and Vectors of: E Inverse * H, where


H = Type III SSCP Matrix for group
E = Error SSCP Matrix

Characteristic Characteristic Vector V’EV=1


Root Percent t1 t2 t3 t4 t5

0.79831634 100.00 0.00323780 0.00686279 -0.00143367 -0.00116599 0.03002809


0.00000000 0.00 -0.00781849 -0.01305718 0.02465120 0.00017725 -0.00456484
0.00000000 0.00 -0.00192145 -0.00468339 -0.00957782 0.03375163 -0.00849313
0.00000000 0.00 -0.01373876 0.03294670 0.00121557 0.00676858 -0.01310726
0.00000000 0.00 0.01555660 0.00731192 0.00238472 0.01327870 -0.02571402

MANOVA Test Criteria and Exact F Statistics for the Hypothesis of No Overall group Effect
H = Type III SSCP Matrix for group
E = Error SSCP Matrix

S=1 M=1.5 N=6

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.55607569 2.24 5 14 0.1083


Pillai’s Trace 0.44392431 2.24 5 14 0.1083
Hotelling-Lawley Trace 0.79831634 2.24 5 14 0.1083
Roy’s Greatest Root 0.79831634 2.24 5 14 0.1083

SAS PROC GLM provides another method for doing a portion of the profile analysis. The following code
provides a test for a within-subject factor, say time or method, along with a treatment or group variable. The
first null hypothesis (parallelism) can be tested through the first-order interaction term, time*group, while the
second hypothesis (flatness) corresponds to the test for the repeated factor, time. For the above example,
consider the new SAS code given by;
PROC GLM;
CLASS group;
MODEL T1-T5 = group/NOUNI;
REPEATED TIME 5/PRINTE;
TITLE 'Word Probe Data - Using GLM for Profile Analysis and H-F Conditions';
RUN;
The additional output is;

Word Probe Data - Using GLM for Profile Analysis and H-F Conditions

The GLM Procedure

Class Level Information

Class Levels Values

group 2 1 2

Number of observations 20

Repeated Measures Level Information

Dependent Variable t1 t2 t3 t4 t5

Level of TIME 1 2 3 4 5

Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|

DF = 18 t1 t2 t3 t4 t5

t1 1.000000 0.381718 0.399535 0.149753 0.254986


0.1068 0.0901 0.5406 0.2921

t2 0.381718 1.000000 0.391231 0.146756 0.322161


0.1068 0.0977 0.5488 0.1786

t3 0.399535 0.391231 1.000000 0.428327 0.426038


0.0901 0.0977 0.0673 0.0689

t4 0.149753 0.146756 0.428327 1.000000 0.560026


0.5406 0.5488 0.0673 0.0126

t5 0.254986 0.322161 0.426038 0.560026 1.000000


0.2921 0.1786 0.0689 0.0126

E = Error SSCP Matrix

TIME_N represents the contrast between the nth level of TIME and the last

TIME_1 TIME_2 TIME_3 TIME_4

TIME_1 2674.4 811.0 860.2 198.2


TIME_2 811.0 1260.9 569.3 182.0
TIME_3 860.2 569.3 1903.7 412.6

TIME_4 198.2 182.0 412.6 917.6

Partial Correlation Coefficients from the Error SSCP Matrix of the


Variables Defined by the Specified Transformation / Prob > |r|

DF = 18 TIME_1 TIME_2 TIME_3 TIME_4

TIME_1 1.000000 0.441639 0.381230 0.126521


0.0584 0.1073 0.6058

TIME_2 0.441639 1.000000 0.367453 0.169202


0.0584 0.1217 0.4886

TIME_3 0.381230 0.367453 1.000000 0.312179


0.1073 0.1217 0.1932

TIME_4 0.126521 0.169202 0.312179 1.000000


0.6058 0.4886 0.1932

Sphericity Tests

Mauchly’s
Variables DF Criterion Chi-Square Pr > ChiSq

Transformed Variates 9 0.4217072 14.174874 0.1162


Orthogonal Components 9 0.5692268 9.250404 0.4145

Manova Test Criteria and Exact F Statistics


for the Hypothesis of no TIME Effect
H = Type III SSCP Matrix for TIME
E = Error SSCP Matrix

S=1 M=1 N=6.5

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.21986464 13.31 4 15 <.0001


Pillai’s Trace 0.78013536 13.31 4 15 <.0001
Hotelling-Lawley Trace 3.54825288 13.31 4 15 <.0001
Roy’s Greatest Root 3.54825288 13.31 4 15 <.0001

Manova Test Criteria and Exact F Statistics


for the Hypothesis of no TIME*group Effect
H = Type III SSCP Matrix for TIME*group
E = Error SSCP Matrix

S=1 M=1 N=6.5

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.83905852 0.72 4 15 0.5919


Pillai’s Trace 0.16094148 0.72 4 15 0.5919
Hotelling-Lawley Trace 0.19181198 0.72 4 15 0.5919
Roy’s Greatest Root 0.19181198 0.72 4 15 0.5919

The GLM Procedure


Repeated Measures Analysis of Variance
Tests of Hypotheses for Between Subjects Effects

Source DF Type III SS Mean Square F Value Pr > F

group 1 1772.410000 1772.410000 8.90 0.0080


Error 18 3583.140000 199.063333

The GLM Procedure


Repeated Measures Analysis of Variance
Univariate Tests of Hypotheses for Within Subject Effects

Source DF Type III SS Mean Square F Value Pr > F

TIME 4 3371.300000 842.825000 14.48 <.0001


TIME*group 4 79.940000 19.985000 0.34 0.8479
Error(TIME) 72 4191.960000 58.221667

Adj Pr > F
Source G - G H - F

TIME <.0001 <.0001


TIME*group 0.8068 0.8479
Error(TIME)

Greenhouse-Geisser Epsilon 0.8009


Huynh-Feldt Epsilon 1.0487

The above methods for examining the profile curves using PROC GLM did not consider the covariance
structure within the repeated measures (i.e., the subject covariance over time). PROC MIXED is a more
general procedure for dealing with repeated measures in that it allows one to specify a covariance structure
for the repeated measures. Refer to Littell, Milliken, Stroup, and Wolfinger (SAS System for Mixed Models).
PROC MIXED can likewise carry out a portion of the profile analysis: the first null hypothesis is again tested
through the time*group interaction, and the second through the repeated factor, time. The data must first be
restructured to one observation per subject per time period. For the above example, consider the new SAS
code given by;
data newword;set word;
t=t1; time=1; output;

t=t2; time=2; output;
t=t3; time=3; output;
t=t4; time=4; output;
t=t5; time=5; output;
drop t1-t5;
*proc print data=newword;run;
title 'Combine test with MIXED';
proc mixed data=newword method=ml;
class group time subject;
model t = group time group*time / s;
repeated / type=cs subject=subject;
run;
The additional output is;
Combine test with MIXED
The Mixed Procedure
Model Information

Data Set WORK.NEWWORD


Dependent Variable t
Covariance Structure Compound Symmetry
Subject Effect subject
Estimation Method ML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Between-Within

Class Level Information

Class Levels Values

group 2 1 2
time 5 1 2 3 4 5
subject 10 1 2 3 4 5 6 7 8 9 10

Dimensions

Covariance Parameters 2
Columns in X 18
Columns in Z 0
Subjects 10
Max Obs Per Subject 10
Observations Used 100
Observations Not Used 0
Total Observations 100

Iteration History

Iteration Evaluations -2 Log Like Criterion

0 1 719.13884791
1 1 707.99610545 0.00000000

Convergence criteria met.

Covariance Parameter Estimates

Cov Parm Subject Estimate

CS subject 16.8304
Residual 60.9206

Combine test with MIXED


The Mixed Procedure
Fit Statistics

-2 Log Likelihood 708.0


AIC (smaller is better) 732.0
AICC (smaller is better) 735.6
BIC (smaller is better) 735.6

Null Model Likelihood Ratio Test

DF Chi-Square Pr > ChiSq

1 11.14 0.0008

Solution for Fixed Effects

Standard
Effect group time Estimate Error DF t Value Pr > |t|

Intercept 44.9000 2.7884 9 16.10 <.0001


group 1 -11.5000 3.4906 9 -3.29 0.0093
group 2 0 . . . .
time 1 6.0000 3.4906 36 1.72 0.0942
time 2 -9.9000 3.4906 36 -2.84 0.0074
time 3 4.7000 3.4906 36 1.35 0.1866
time 4 -7.0000 3.4906 36 -2.01 0.0525
time 5 0 . . . .
group*time 1 1 2.6000 4.9364 36 0.53 0.6016
group*time 1 2 4.9000 4.9364 36 0.99 0.3275

group*time 1 3 3.1000 4.9364 36 0.63 0.5340
group*time 1 4 4.8000 4.9364 36 0.97 0.3374
group*time 1 5 0 . . . .
group*time 2 1 0 . . . .
group*time 2 2 0 . . . .
group*time 2 3 0 . . . .
group*time 2 4 0 . . . .
group*time 2 5 0 . . . .

Type 3 Tests of Fixed Effects

Num Den
Effect DF DF F Value Pr > F

group 1 9 29.09 0.0004


time 4 36 13.83 <.0001
group*time 4 36 0.33 0.8573

8.3 Mixed Models Theory


This section provides an overview of a likelihood-based approach to general linear mixed models. This
approach simplifies and unifies many common statistical analyses, including those involving repeated measures,
random effects, and random coefficients. The basic assumption is that the data are linearly related to
unobserved multivariate normal random variables. Extensions to nonlinear and nonnormal situations are possible
but are not discussed here. Additional theory with examples is provided in Littell et al. (1996) and Verbeke
and Molenberghs (1997).

Matrix Notation
Suppose that you observe n data points y1 , y2 , . . . , yn and that you want to explain them using n values
for each of p explanatory variables x11 , x12 , . . . , x1p , x21 , x22 , . . . , x2p , . . . , xn1 , xn2 , . . . , xnp . The xij values
may be either regression-type continuous variables or dummy variables indicating class membership. The
standard linear model for this setup is
y_i = \sum_{j=1}^{p} x_{ij} β_j + ε_i, \qquad i = 1, \ldots, n,

where β_1, \ldots, β_p are unknown fixed-effects parameters to be estimated and ε_1, \ldots, ε_n are unknown independent
and identically distributed normal (Gaussian) random variables with mean 0 and variance σ².
The preceding equations can be written simultaneously using vectors and a matrix, as follows:

\vec y = Xβ + ε,

where \vec y denotes the vector of observed y_i's, X is the known matrix of x_{ij}'s, β is the unknown fixed-effects
parameter vector, and ε is the unobserved vector of independent and identically distributed Gaussian random
errors, that is, ε ∼ N_n(0, Σ = σ² I_n).

Formulation of the Mixed Model
The previous general linear model is certainly a useful one (Searle 1971), and it is the one fitted by the GLM
procedure. However, many times the distributional assumption about ε is too restrictive. The mixed model
extends the general linear model by allowing a more flexible specification of the covariance matrix of ε. In
other words, it allows for both correlation and heterogeneous variances, although you still assume normality.
The mixed model is written as

\vec y = Xβ + Zγ + ε
where everything is the same as in the general linear model except for the addition of the known design
matrix, Z, and the vector of unknown random-effects parameters, γ. The matrix Z can contain either
continuous or dummy variables, just like X. The name mixed model comes from the fact that the model
contains both fixed-effects parameters, β, and random-effects parameters, γ. Refer to Henderson (1990) and
Searle, Casella, and McCulloch (1992) for historical developments of the mixed model.
A key assumption in the foregoing analysis is that γ and ε are normally distributed with

E\begin{pmatrix} γ \\ ε \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}
\qquad \text{and} \qquad
Var\begin{pmatrix} γ \\ ε \end{pmatrix} = \begin{pmatrix} G & 0 \\ 0 & R \end{pmatrix}.
The variance of \vec y is, therefore, V = ZGZ' + R. You can model V by setting up the random-effects design
matrix Z and by specifying covariance structures for G and R. Note that this is a general specification of the
mixed model, in contrast to many texts and articles that discuss only simple random effects. Simple random
effects are a special case of the general specification with Z containing dummy variables, G containing
variance components in a diagonal structure, and R = σ² I_n, where I_n denotes the n-dimensional identity
matrix. The general linear model is a further special case with Z = 0 and R = σ² I_n.
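As a small illustration of V = ZGZ' + R, the following PROC IML sketch builds the marginal covariance
matrix for one subject under a random-intercept (compound symmetry) specification, using the CS estimates
reported for the probe-word data earlier in this chapter (CS = 16.8304, Residual = 60.9206):

proc iml;
g = 16.8304;              /* between-subject variance component */
z = j(5, 1, 1);           /* random-intercept design for one subject, p = 5 */
r = 60.9206 * i(5);       /* R = sigma^2 I_5 */
v = z * g * z` + r;       /* compound-symmetric marginal covariance */
print v;
quit;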

8.3.1 Estimating G and R in the Mixed Model


Estimation is more difficult in the mixed model than in the general linear model. Not only do you have β
as in the general linear model, but you have unknown parameters in γ, G, and R as well. Least squares is
no longer the best method. Generalized least squares (GLS) is more appropriate, minimizing

(\vec y - Xβ)' V^{-1} (\vec y - Xβ).

However, it requires knowledge of V and, therefore, knowledge of G and R. Lacking such information, one
approach is to use estimated GLS, in which you insert some reasonable estimate for V into the minimization
problem. The goal thus becomes finding a reasonable estimate of G and R.
In many situations, the best approach is to use likelihood-based methods, exploiting the assumption
that γ and  are normally distributed (Hartley and Rao 1967; Patterson and Thompson 1971; Harville
1977; Laird and Ware 1982; Jennrich and Schluchter 1986). PROC MIXED implements two likelihood-
based methods: maximum likelihood (ML) and restricted/residual maximum likelihood (REML). A favorable
theoretical property of ML and REML is that they accommodate data that are missing at random (Rubin
1976; Little 1995). PROC MIXED constructs an objective function associated with ML or REML and
maximizes it over all unknown parameters. Using calculus, it is possible to reduce this maximization problem
to one over only the parameters in G and R.
That is, let α denote the vector of all the variance and covariance parameters found in V. Let θ = (β', α')'
be the s-dimensional vector of all parameters in the marginal model for \vec y, and let Θ = Θ_β × Θ_α denote the
parameter space for θ, where Θ_β denotes the parameter space for the fixed parameters and Θ_α denotes the

parameter space for the variance components. Note Θα is restricted so that both G and R are positive
(semi-)definite.
The classical approach to inference is based on estimators obtained from maximizing the marginal
likelihood function given by

L_{ml}(θ) = (2π)^{-n/2} |V(α)|^{-1/2} \exp\left[ -\tfrac{1}{2} \sum_{i=1}^{n} (\vec y_i - X_i β)' V_i(α)^{-1} (\vec y_i - X_i β) \right],

where the sum runs over the subjects and V_i(α) denotes the ith diagonal block of V(α), with respect to θ.
If we assume that α is known, then the MLE for β is given by

\hat β(α) = \left[ \sum_{i=1}^{n} X_i' V_i^{-1} X_i \right]^{-1} \sum_{i=1}^{n} X_i' V_i^{-1} \vec y_i.

When α is not known but an estimate α̂ is available, one can replace V_i^{-1} with V̂_i^{-1} = V_i(α̂)^{-1}. The two
commonly used methods for estimating α are maximum likelihood and restricted maximum likelihood.

8.3.2 Maximum Likelihood Estimation


The maximum likelihood estimator for α is obtained by maximizing the above likelihood function when β is
replaced by β̂. This approach is the same as simultaneously maximizing the likelihood function with respect
to both β and α.

8.3.3 Restricted Maximum Likelihood Estimation


Verbeke and Molenberghs (2000) introduced restricted maximum likelihood with a familiar problem. I have
reproduced their example. Suppose that you have n observations y_1, \ldots, y_n from N(µ, σ²) for which you want
to estimate σ². When µ is known, the MLE for σ² is σ̂² = \sum_{i=1}^{n}(y_i - µ)²/n, which is an unbiased estimator
for σ². When µ is unknown, one can replace µ by ȳ, from which one obtains a biased estimator for σ². This
estimator can be made unbiased by considering s² = [n/(n - 1)]σ̂² = \sum_{i=1}^{n}(y_i - ȳ)²/(n - 1).
From this one notes that the estimation of σ² does not require that µ be estimated first. That is, suppose
that \vec y is an n × 1 vector of the observations with \vec y ∼ N_n(\vec µ = \vec j_n µ, Σ = σ² I_n). Let A denote any n × (n − 1) full
column rank matrix that is orthogonal to \vec j_n and define \vec u = A'\vec y. We know that \vec u ∼ N_{n-1}(A'\vec j_n µ = 0, σ² A'A).
Maximizing the corresponding likelihood function with respect to the parameter σ² provides the MLE as
σ̂² = \vec y' A(A'A)^{-1} A' \vec y / (n - 1), which can be shown to be the same as s² regardless of one's choice of the
matrix A.
Now suppose that one has \vec y = Xβ + ε where ε ∼ N_n(0, σ² I_n). The MLE for σ² can be shown to equal the
biased estimator

σ̂² = r'r/n = \vec y'(I - H)\vec y / n,

where r = (I - H)\vec y and H = X(X'X)^{-1}X' is the so-called "hat matrix". Using the above approach, let
\vec u = A'\vec y where A is any n × (n − p) full column rank matrix that is orthogonal to the matrix X. It follows
that \vec u ∼ N_{n-p}(0, σ² A'A). Again, it can be shown that

σ̂² = r'r/(n - p) = \vec y'(I - H)\vec y / (n - p)

is an unbiased estimator for σ².
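A minimal PROC IML sketch of this comparison, using a small hypothetical regression data set:

proc iml;
y = {3, 5, 4, 7, 9, 8};
x = j(6, 1, 1) || {1, 2, 3, 4, 5, 6};   /* intercept and one regressor */
n = nrow(x);  p = ncol(x);
h = x * inv(x`*x) * x`;                 /* hat matrix */
r = (i(n) - h) * y;                     /* residual vector */
sig2_ml   = r`*r / n;                   /* biased ML estimator */
sig2_reml = r`*r / (n - p);             /* unbiased REML-type estimator */
print sig2_ml sig2_reml;
quit;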

8.3.4 REML Estimation for the Linear Mixed Model


The mixed model is written as

\vec y = Xβ + Zγ + ε.

The marginal distribution for \vec y is normal with mean vector Xβ and with covariance matrix V(α), where
V(α) is a block diagonal matrix with main-diagonal blocks V_i(α). Again define \vec u = A'\vec y, where A is any
n × (n − p) full column rank matrix that is orthogonal to X, from which \vec u ∼ N_{n-p}(0, A'V(α)A). The
corresponding log-likelihood is

l_R(G, R) = -\tfrac{1}{2} \log|V(α)| - \tfrac{1}{2} \log|X'V^{-1}(α)X| - \tfrac{1}{2} r'V^{-1}(α)r - \tfrac{n-p}{2} \log(2π),

where r = y - X(X'V^{-1}(α)X)^{-}X'V^{-1}(α)y and p is the rank of X. PROC MIXED actually minimizes
-2 times these functions using a ridge-stabilized Newton-Raphson algorithm. Lindstrom and Bates (1988)
provide reasons for preferring Newton-Raphson to the Expectation-Maximization (EM) algorithm described in
Dempster, Laird, and Rubin (1977) and Laird, Lange, and Stram (1987), as well as analytical details for
implementing a QR-decomposition approach to the problem. Wolfinger, Tobias, and Sall (1994) present the
sweep-based algorithms that are implemented in PROC MIXED.
One advantage of using the Newton-Raphson algorithm is that the second derivative matrix of the
objective function evaluated at the optima is available upon completion. Denoting this matrix K, the
asymptotic theory of maximum likelihood (refer to Serfling 1980) shows that 2K −1 is an asymptotic variance-
covariance matrix of the estimated parameters of G and R. Thus, tests and confidence intervals based on
asymptotic normality can be obtained. However, these can be unreliable in small samples, especially for
parameters such as variance components which have sampling distributions that tend to be skewed to the
right.
If a residual variance σ 2 is a part of your mixed model, it can usually be profiled out of the likelihood.
This means solving analytically for the optimal σ 2 and plugging this expression back into the likelihood
formula (refer to Wolfinger, Tobias, and Sall 1994). This reduces the number of optimization parameters
by one and can improve convergence properties. PROC MIXED profiles the residual variance out of the
log likelihood whenever it appears reasonable to do so. This includes the case when R = σ² I_n and when R has
blocks with a compound symmetry, time series, or spatial structure. PROC MIXED does not profile the log
likelihood when R has unstructured blocks, when you use the HOLD= or NOITER option in the PARMS
statement, or when you use the NOPROFILE option in the PROC MIXED statement.
Instead of ML or REML, you can use the noniterative MIVQUE0 method to estimate G and R (Rao
1972; LaMotte 1973; Wolfinger, Tobias, and Sall 1994). In fact, by default PROC MIXED uses MIVQUE0
estimates as starting values for the ML and REML procedures. For variance component models, another
estimation method involves equating Type I, II, or III expected mean squares to their observed values and
solving the resulting system. However, Swallow and Monahan (1984) present simulation evidence favoring
REML and ML over MIVQUE0 and other method-of-moment estimators.

8.3.5 Estimating β and γ in the Mixed Model


ML, REML, MIVQUE0, or the Type 1-Type 3 methods provide estimates of G and R, which are denoted Ĝ and R̂,
respectively. To obtain estimates of β and γ, the standard method is to solve the mixed model equations
(Henderson 1984):

\begin{pmatrix} X'R̂^{-1}X & X'R̂^{-1}Z \\ Z'R̂^{-1}X & Z'R̂^{-1}Z + Ĝ^{-1} \end{pmatrix} \begin{pmatrix} β̂ \\ γ̂ \end{pmatrix} = \begin{pmatrix} X'R̂^{-1}y \\ Z'R̂^{-1}y \end{pmatrix}.

The solutions can also be written as

β̂ = (X'V̂^{-1}X)^{-} X'V̂^{-1}y,
γ̂ = ĜZ'V̂^{-1}(y - Xβ̂),
and have connections with empirical Bayes estimators (Laird and Ware 1982).
Note that the mixed model equations are extended normal equations and that the preceding expression
assumes that Ĝ is nonsingular. For the extreme case when the eigenvalues of Ĝ are very large, Ĝ−1 contributes
very little to the equations and γ̂ is close to what it would be if γ actually contained fixed-effects parameters.

On the other hand, when the eigenvalues of Ĝ are very small, Ĝ−1 dominates the equations and γ̂ is close
to 0. For intermediate cases, Ĝ−1 can be viewed as shrinking the fixed-effects estimates of γ towards 0
(Robinson 1991).
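A toy PROC IML sketch of solving the mixed model equations for given Ĝ and R̂ (a two-subject
random-intercept layout; all values below are hypothetical):

proc iml;
x = j(6, 1, 1) || {1, 2, 3, 1, 2, 3};   /* fixed effects: intercept and time */
z = {1 0, 1 0, 1 0, 0 1, 0 1, 0 1};     /* random intercepts for 2 subjects */
y = {10, 12, 13, 8, 9, 11};
ghat = 4 * i(2);                        /* Ghat = sigma_b^2 I_2 */
rinv = inv(2 * i(6));                   /* Rhat^{-1} with Rhat = sigma^2 I_6 */
lhs = (x`*rinv*x || x`*rinv*z) //
      (z`*rinv*x || z`*rinv*z + inv(ghat));
rhs = (x`*rinv*y) // (z`*rinv*y);
sol = solve(lhs, rhs);                  /* stacked (betahat // gammahat) */
print sol;
quit;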
If Ĝ is singular, then the mixed model equations are modified (Henderson 1984) as follows:

\begin{pmatrix} X'R̂^{-1}X & X'R̂^{-1}ZL̂ \\ L̂'Z'R̂^{-1}X & L̂'Z'R̂^{-1}ZL̂ + I \end{pmatrix} \begin{pmatrix} β̂ \\ τ̂ \end{pmatrix} = \begin{pmatrix} X'R̂^{-1}y \\ L̂'Z'R̂^{-1}y \end{pmatrix},

where L̂ is the lower-triangular Cholesky root of Ĝ, satisfying Ĝ = L̂L̂'. Both τ̂ and a generalized inverse of
the left-hand-side coefficient matrix are then transformed using L̂ to determine γ̂.
An example of when the singular form of the equations is necessary is when a variance component estimate
falls on the boundary constraint of 0.

8.3.6 Model Selection


The previous section on estimation assumes the specification of a mixed model in terms of X, Z, G, and
R. Even though X and Z have known elements, their specific form and construction is flexible, and several
possibilities may present themselves for a particular data set. Likewise, several different covariance structures
for G and R might be reasonable.
Space does not permit a thorough discussion of model selection, but a few brief comments and references
are in order. First, subject matter considerations and objectives are of great importance when selecting a
model; refer to Diggle (1988) and Lindsey (1993).
Second, when the data themselves are looked to for guidance, many of the graphical methods and di-
agnostics appropriate for the general linear model extend to the mixed model setting as well (Christensen,
Pearson, and Johnson 1992).
Finally, a likelihood-based approach to the mixed model provides several statistical measures for model
adequacy as well. The most common of these are the likelihood ratio test and Akaike’s and Schwarz’s criteria
(Bozdogan 1987; Wolfinger 1993).

8.3.7 Statistical Properties


If G and R are known, β̂ is the best linear unbiased estimator (BLUE) of β, and γ̂ is the best linear unbiased
predictor (BLUP) of γ (Searle 1971; Harville 1988, 1990; Robinson 1991; McLean, Sanders, and Stroup
1991). Here, “best” means minimum mean squared error. The covariance matrix of (β̂ − β, γ̂ − γ) is
C = \begin{pmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{pmatrix}^{-},

where ^{-} denotes a generalized inverse (refer to Searle 1971).


However, G and R are usually unknown and are estimated using one of the aforementioned methods.
These estimates, Ĝ and R̂, are therefore simply substituted into the preceding expression to obtain

Ĉ = \begin{pmatrix} X'R̂^{-1}X & X'R̂^{-1}Z \\ Z'R̂^{-1}X & Z'R̂^{-1}Z + Ĝ^{-1} \end{pmatrix}^{-}

as the approximate variance-covariance matrix of (β̂ − β, γ̂ − γ). In this case, the BLUE and BLUP acronyms
no longer apply, but the word empirical is often added to indicate such an approximation. The appropriate
acronyms thus become EBLUE and EBLUP. Ĉ can also be written as

Ĉ = \begin{pmatrix} Ĉ_{11} & Ĉ_{21}' \\ Ĉ_{21} & Ĉ_{22} \end{pmatrix},

where

Ĉ_{11} = (X'V̂^{-1}X)^{-},
Ĉ_{21} = -ĜZ'V̂^{-1}X Ĉ_{11},
Ĉ_{22} = (Z'R̂^{-1}Z + Ĝ^{-1})^{-1} - Ĉ_{21} X'V̂^{-1}Z Ĝ.

Note that Ĉ_{11} is the familiar estimated generalized least-squares formula for the variance-covariance
matrix of β̂.
As a cautionary note, Ĉ tends to underestimate the true sampling variability of ( β̂ γ̂ ) because no
account is made for the uncertainty in estimating G and R. Although inflation factors have been proposed
(Kackar and Harville 1984; Kass and Steffey 1989; Prasad and Rao 1990), they tend to be small for data
sets that are fairly well balanced. PROC MIXED does not compute any inflation factors by default, but
rather accounts for the downward bias by using the approximate t and F statistics described subsequently.
The DDFM=KENWARDROGER option in the MODEL statement prompts PROC MIXED to compute a
specific inflation factor along with Satterthwaite-based degrees of freedom.

8.3.8 Inference and Test Statistics


For inferences concerning the covariance parameters in your model, you can use likelihood-based statistics.
One common likelihood-based statistic is the Wald Z, which is computed as the parameter estimate divided
by its asymptotic standard error. The asymptotic standard errors are computed from the inverse of the
second derivative matrix of the likelihood with respect to each of the covariance parameters. The Wald Z
is valid for large samples, but it can be unreliable for small data sets and for parameters such as variance
components, which are known to have a skewed or bounded sampling distribution.
A better alternative is the likelihood ratio χ2 . This statistic compares two covariance models, one a
special case of the other. To compute it, you must run PROC MIXED twice, once for each of the two
models, and then subtract the corresponding values of -2 times the log likelihoods. You can use either ML
or REML to construct this statistic, which tests whether the full model is necessary beyond the reduced
model.
As long as the reduced model does not occur on the boundary of the covariance parameter space, the
χ2 statistic computed in this fashion has a large-sample sampling distribution that is χ2 with degrees of
freedom equal to the difference in the number of covariance parameters between the two models. If the
reduced model does occur on the boundary of the covariance parameter space, the asymptotic distribution
becomes a mixture of χ2 distributions (Self and Liang 1987). A common example of this is when you are
testing that a variance component equals its lower boundary constraint of 0.
A final possibility for obtaining inferences concerning the covariance parameters is to simulate or resample
data from your model and construct empirical sampling distributions of the parameters. The SAS macro
language and the ODS system are useful tools in this regard.
For inferences concerning the fixed- and random-effects parameters in the mixed model, consider estimable
linear combinations of the following form:
 
L \begin{pmatrix} β \\ γ \end{pmatrix}
The estimability requirement (Searle 1971) applies only to the β-portion of L, as any linear combination
of γ is estimable. Such a formulation in terms of a general L matrix encompasses a wide variety of common
inferential procedures such as those employed with Type I-Type III tests and LS-means. The CONTRAST
and ESTIMATE statements in PROC MIXED enable you to specify your own L matrices. Typically,
inference on fixed effects is the focus, and, in this case, the γ-portion of L is assumed to contain all 0s.
Statistical inferences are obtained by testing the hypothesis

 
H_0: L \begin{pmatrix} β \\ γ \end{pmatrix} = 0
or by constructing point and interval estimates.
When L consists of a single row, a general t-statistic can be constructed as follows (refer to McLean and
Sanders 1988; Stroup 1989a):

t = \frac{ L \begin{pmatrix} β̂ \\ γ̂ \end{pmatrix} }{ \sqrt{LĈL'} }.
Under the assumed normality of γ and ε, t has an exact t-distribution only for data exhibiting certain types
of balance and for some special unbalanced cases. In general, t is only approximately t-distributed, and its
degrees of freedom must be estimated. See the DDFM= option for a description of the various degrees-of-
freedom methods available in PROC MIXED. With ν̂ denoting the approximate degrees of freedom, the
associated confidence interval is

L \begin{pmatrix} β̂ \\ γ̂ \end{pmatrix} ± t_{ν̂,α/2} \sqrt{LĈL'},

where t_{ν̂,α/2} is the (1 − α/2)100th percentile of the t_{ν̂} distribution.
When the rank of L is greater than 1, PROC MIXED constructs the following general F-statistic:

F = \frac{ \begin{pmatrix} β̂ \\ γ̂ \end{pmatrix}' L' (LĈL')^{-1} L \begin{pmatrix} β̂ \\ γ̂ \end{pmatrix} }{ \mathrm{rank}(L) }.
Analogous to t, F in general has an approximate F-distribution with rank(L) numerator degrees of freedom
and ν̂ denominator degrees of freedom.
The t- and F- statistics enable you to make inferences about your fixed effects, which account for the
variance-covariance model you select. An alternative is the χ2 statistic associated with the likelihood ratio
test. This statistic compares two fixed-effects models, one a special case of the other. It is computed just as
when comparing different covariance models, although you should use ML and not REML here because the
penalty term associated with restricted likelihoods depends upon the fixed-effects specification.
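As an illustration of such L matrices, the CONTRAST and ESTIMATE statements below could be added
to the probe-word PROC MIXED fit of the previous section; the contrast labels are arbitrary, and the
coefficient orderings assume the class-level orders shown in that output.

proc mixed data=newword method=ml;
class group time subject;
model t = group time group*time / s;
repeated / type=cs subject=subject;
contrast 'time 1 vs time 5'
         time 1 0 0 0 -1  group*time .5 0 0 0 -.5  .5 0 0 0 -.5;
estimate 'group 1 - group 2 at time 1'
         group 1 -1  group*time 1 0 0 0 0  -1 0 0 0 0;
run;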

8.4 Example using PROC MIXED


The following is a common example of repeated measures or growth-curve data. It has been reproduced
from the SAS User's Guide.
data pr;
input Person Gender $ y1 y2 y3 y4;
y=y1; Age=8; output;
y=y2; Age=10; output;
y=y3; Age=12; output;
y=y4; Age=14; output;
drop y1-y4;
datalines;
1 F 21.0 20.0 21.5 23.0
2 F 21.0 21.5 24.0 25.5
3 F 20.5 24.0 24.5 26.0

107
4 F 23.5 24.5 25.0 26.5
5 F 21.5 23.0 22.5 23.5
6 F 20.0 21.0 21.0 22.5
7 F 21.5 22.5 23.0 25.0
8 F 23.0 23.0 23.5 24.0
9 F 20.0 21.0 22.0 21.5
10 F 16.5 19.0 19.0 19.5
11 F 24.5 25.0 28.0 28.0
12 M 26.0 25.0 29.0 31.0
13 M 21.5 22.5 23.0 26.5
14 M 23.0 22.5 24.0 27.5
15 M 25.5 27.5 26.5 27.0
16 M 20.0 23.5 22.5 26.0
17 M 24.5 25.5 27.0 28.5
18 M 22.0 22.0 24.5 26.5
19 M 24.0 21.5 24.5 25.5
20 M 23.0 20.5 31.0 26.0
21 M 27.5 28.0 31.0 31.5
22 M 23.0 23.0 23.5 25.0
23 M 21.5 23.5 24.0 28.0
24 M 17.0 24.5 26.0 29.5
25 M 22.5 25.5 25.5 26.0
26 M 23.0 24.5 26.0 30.0
27 M 22.0 21.5 23.5 25.0
;

proc mixed data=pr method=ml covtest;


class Person Gender;
model y = Gender Age Gender*Age / s;
repeated / type=un subject=Person r;
run;
/*
Note that two of the estimates equal 0; this is a result of the overparameterized
model used by PROC MIXED. You can obtain a full rank parameterization by using
the following MODEL statement:
*/
proc mixed data=pr method=ml covtest;
class Person Gender;
model y = Gender Gender*Age / noint s;
repeated /type=un subject=Person r;
run;
proc mixed data=pr method=ml;
class Person Gender;
model y = Gender Age Gender*Age / s;
repeated / type=ar(1) sub=Person r;
run;
*To fit a random coefficients model, use the following code: ;
proc mixed data=pr method=ml;
class Person Gender;

model y = Gender Age Gender*Age / s;
random intercept Age / type=un sub=Person g;
run;
/*
This specifies an unstructured covariance matrix for the random intercept and slope.
In mixed model notation, G is block diagonal with identical 2 x 2 unstructured
blocks for each person. By default, R becomes sigma^2*I. See Example 41.5 for further
information on this model.

Finally, you can fit a compound symmetry structure by using TYPE=CS.


*/
proc mixed data=pr method=ml covtest;
class Person Gender;
model y = Gender Age Gender*Age / s;
repeated / type=cs subject=Person r;
run;
The output is as follows;
The SAS System
The Mixed Procedure

Model Information

Data Set WORK.PR


Dependent Variable y
Covariance Structure Unstructured
Subject Effect Person
Estimation Method ML
Residual Variance Method None
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Between-Within

Class Level Information

Class Levels Values

Person 27 1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23
24 25 26 27
Gender 2 F M

Dimensions

Covariance Parameters 10
Columns in X 6
Columns in Z 0
Subjects 27
Max Obs Per Subject 4

Observations Used 108
Observations Not Used 0
Total Observations 108

Iteration History

Iteration Evaluations -2 Log Like Criterion

0 1 478.24175986
1 2 419.47721707 0.00000152
2 1 419.47704812 0.00000000
Convergence criteria met.

The SAS System


The Mixed Procedure

Estimated R Matrix for Person 1

Row Col1 Col2 Col3 Col4

1 5.1192 2.4409 3.6105 2.5222


2 2.4409 3.9279 2.7175 3.0624
3 3.6105 2.7175 5.9798 3.8235
4 2.5222 3.0624 3.8235 4.6180

Covariance Parameter Estimates


Standard Z
Cov Parm Subject Estimate Error Value Pr Z

UN(1,1) Person 5.1192 1.4169 3.61 0.0002


UN(2,1) Person 2.4409 0.9835 2.48 0.0131
UN(2,2) Person 3.9279 1.0824 3.63 0.0001
UN(3,1) Person 3.6105 1.2767 2.83 0.0047
UN(3,2) Person 2.7175 1.0740 2.53 0.0114
UN(3,3) Person 5.9798 1.6279 3.67 0.0001
UN(4,1) Person 2.5222 1.0649 2.37 0.0179
UN(4,2) Person 3.0624 1.0135 3.02 0.0025
UN(4,3) Person 3.8235 1.2508 3.06 0.0022
UN(4,4) Person 4.6180 1.2573 3.67 0.0001

Fit Statistics
-2 Log Likelihood 419.5
AIC (smaller is better) 447.5
AICC (smaller is better) 452.0
BIC (smaller is better) 465.6

Null Model Likelihood Ratio Test
DF Chi-Square Pr > ChiSq
9 58.76 <.0001

Solution for Fixed Effects


Standard
Effect Gender Estimate Error DF t Value Pr > |t|

Intercept 15.8423 0.9356 25 16.93 <.0001


Gender F 1.5831 1.4658 25 1.08 0.2904
Gender M 0 . . . .
Age 0.8268 0.07911 25 10.45 <.0001

The SAS System


The Mixed Procedure

Solution for Fixed Effects


Standard
Effect Gender Estimate Error DF t Value Pr > |t|

Age*Gender F -0.3504 0.1239 25 -2.83 0.0091


Age*Gender M 0 . . . .

Type 3 Tests of Fixed Effects

Num Den
Effect DF DF F Value Pr > F

Gender 1 25 1.17 0.2904


Age 1 25 110.54 <.0001
Age*Gender 1 25 7.99 0.0091
---------------------------------------------
The SAS System
The Mixed Procedure

Model Information

Data Set WORK.PR


Dependent Variable y
Covariance Structure Unstructured
Subject Effect Person
Estimation Method ML
Residual Variance Method None
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Between-Within

Class Level Information

Class Levels Values

Person 27 1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23
24 25 26 27
Gender 2 F M

Dimensions
Covariance Parameters 10
Columns in X 4
Columns in Z 0
Subjects 27
Max Obs Per Subject 4
Observations Used 108
Observations Not Used 0
Total Observations 108

Iteration History

Iteration Evaluations -2 Log Like Criterion

0 1 478.24175986
1 2 419.47721707 0.00000152
2 1 419.47704812 0.00000000
Convergence criteria met.

The SAS System


The Mixed Procedure
Estimated R Matrix for Person 1

Row Col1 Col2 Col3 Col4


1 5.1192 2.4409 3.6105 2.5222
2 2.4409 3.9279 2.7175 3.0624
3 3.6105 2.7175 5.9798 3.8235
4 2.5222 3.0624 3.8235 4.6180

Covariance Parameter Estimates


Standard Z
Cov Parm Subject Estimate Error Value Pr Z

UN(1,1) Person 5.1192 1.4169 3.61 0.0002


UN(2,1) Person 2.4409 0.9835 2.48 0.0131
UN(2,2) Person 3.9279 1.0824 3.63 0.0001
UN(3,1) Person 3.6105 1.2767 2.83 0.0047
UN(3,2) Person 2.7175 1.0740 2.53 0.0114
UN(3,3) Person 5.9798 1.6279 3.67 0.0001
UN(4,1) Person 2.5222 1.0649 2.37 0.0179

UN(4,2) Person 3.0624 1.0135 3.02 0.0025
UN(4,3) Person 3.8235 1.2508 3.06 0.0022
UN(4,4) Person 4.6180 1.2573 3.67 0.0001

Fit Statistics
-2 Log Likelihood 419.5
AIC (smaller is better) 447.5
AICC (smaller is better) 452.0
BIC (smaller is better) 465.6

Null Model Likelihood Ratio Test


DF Chi-Square Pr > ChiSq
9 58.76 <.0001

Solution for Fixed Effects


Standard
Effect Gender Estimate Error DF t Value Pr > |t|

Gender F 17.4254 1.1284 25 15.44 <.0001


Gender M 15.8423 0.9356 25 16.93 <.0001
Age*Gender F 0.4764 0.09541 25 4.99 <.0001
Age*Gender M 0.8268 0.07911 25 10.45 <.0001

The SAS System


The Mixed Procedure

Type 3 Tests of Fixed Effects


Num Den
Effect DF DF F Value Pr > F

Gender 2 25 262.60 <.0001


Age*Gender 2 25 67.07 <.0001
------------------------------------------------
The SAS System
The Mixed Procedure

Model Information
Data Set WORK.PR
Dependent Variable y
Covariance Structure Autoregressive
Subject Effect Person
Estimation Method ML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Between-Within

Class Level Information
Class Levels Values
Person 27 1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23
24 25 26 27
Gender 2 F M

Dimensions
Covariance Parameters 2
Columns in X 6
Columns in Z 0
Subjects 27
Max Obs Per Subject 4
Observations Used 108
Observations Not Used 0
Total Observations 108

Iteration History

Iteration Evaluations -2 Log Like Criterion

0 1 478.24175986
1 2 440.68100623 0.00000000
Convergence criteria met.

Estimated R Matrix for Person 1


Row Col1 Col2 Col3 Col4
1 4.8910 2.9696 1.8030 1.0947
2 2.9696 4.8910 2.9696 1.8030
3 1.8030 2.9696 4.8910 2.9696
4 1.0947 1.8030 2.9696 4.8910

Covariance Parameter Estimates


Cov Parm Subject Estimate

AR(1) Person 0.6071


Residual 4.8910

Fit Statistics
-2 Log Likelihood 440.7
AIC (smaller is better) 452.7
AICC (smaller is better) 453.5
BIC (smaller is better) 460.5

Null Model Likelihood Ratio Test

DF Chi-Square Pr > ChiSq


1 37.56 <.0001

Solution for Fixed Effects


Standard
Effect Gender Estimate Error DF t Value Pr > |t|

Intercept 16.5920 1.3299 25 12.48 <.0001


Gender F 0.7297 2.0836 25 0.35 0.7291
Gender M 0 . . . .
Age 0.7696 0.1147 79 6.71 <.0001
Age*Gender F -0.2858 0.1797 79 -1.59 0.1157
Age*Gender M 0 . . . .

Type 3 Tests of Fixed Effects

Num Den
Effect DF DF F Value Pr > F

Gender 1 25 0.12 0.7291


Age 1 79 48.63 <.0001
Age*Gender 1 79 2.53 0.1157
-------------------------------------------------
The SAS System
The Mixed Procedure
Model Information

Data Set WORK.PR


Dependent Variable y
Covariance Structure Unstructured
Subject Effect Person
Estimation Method ML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

Class Level Information


Class Levels Values
Person 27 1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23
24 25 26 27
Gender 2 F M

Dimensions

Covariance Parameters 4
Columns in X 6
Columns in Z Per Subject 2
Subjects 27
Max Obs Per Subject 4
Observations Used 108
Observations Not Used 0
Total Observations 108

Iteration History
Iteration Evaluations -2 Log Like Criterion

0 1 478.24175986
1 1 427.80595080 0.00000000
Convergence criteria met.

Estimated G Matrix
Row Effect Person Col1 Col2

1 Intercept 1 4.5569 -0.1983


2 Age 1 -0.1983 0.02376

Covariance Parameter Estimates


Cov Parm Subject Estimate

UN(1,1) Person 4.5569


UN(2,1) Person -0.1983
UN(2,2) Person 0.02376
Residual 1.7162
Fit Statistics
-2 Log Likelihood 427.8
AIC (smaller is better) 443.8
AICC (smaller is better) 445.3
BIC (smaller is better) 454.2

Null Model Likelihood Ratio Test

DF Chi-Square Pr > ChiSq


3 50.44 <.0001

Solution for Fixed Effects


Standard
Effect Gender Estimate Error DF t Value Pr > |t|

Intercept 16.3406 0.9801 25 16.67 <.0001

Gender F 1.0321 1.5355 54 0.67 0.5043
Gender M 0 . . . .
Age 0.7844 0.08275 25 9.48 <.0001
Age*Gender F -0.3048 0.1296 54 -2.35 0.0224
Age*Gender M 0 . . . .

Type 3 Tests of Fixed Effects


Num Den
Effect DF DF F Value Pr > F

Gender 1 54 0.45 0.5043


Age 1 25 95.04 <.0001
Age*Gender 1 54 5.53 0.0224
-----------------------------------------------------
The SAS System
The Mixed Procedure
Model Information

Data Set WORK.PR


Dependent Variable y
Covariance Structure Compound Symmetry
Subject Effect Person
Estimation Method ML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Between-Within

Class Level Information


Class Levels Values
Person 27 1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23
24 25 26 27
Gender 2 F M

Dimensions
Covariance Parameters 2
Columns in X 6
Columns in Z 0
Subjects 27
Max Obs Per Subject 4
Observations Used 108
Observations Not Used 0
Total Observations 108

Iteration History
Iteration Evaluations -2 Log Like Criterion

0 1 478.24175986
1 1 428.63905802 0.00000000
Convergence criteria met.

Estimated R Matrix for Person 1

Row Col1 Col2 Col3 Col4


1 4.9052 3.0306 3.0306 3.0306
2 3.0306 4.9052 3.0306 3.0306
3 3.0306 3.0306 4.9052 3.0306
4 3.0306 3.0306 3.0306 4.9052

Covariance Parameter Estimates


Standard Z
Cov Parm Subject Estimate Error Value Pr Z

CS Person 3.0306 0.9552 3.17 0.0015


Residual 1.8746 0.2946 6.36 <.0001

Fit Statistics
-2 Log Likelihood 428.6
AIC (smaller is better) 440.6
AICC (smaller is better) 441.5
BIC (smaller is better) 448.4

Null Model Likelihood Ratio Test


DF Chi-Square Pr > ChiSq
1 49.60 <.0001

Solution for Fixed Effects


Standard
Effect Gender Estimate Error DF t Value Pr > |t|

Intercept 16.3406 0.9631 25 16.97 <.0001


Gender F 1.0321 1.5089 25 0.68 0.5003
Gender M 0 . . . .
Age 0.7844 0.07654 79 10.25 <.0001
Age*Gender F -0.3048 0.1199 79 -2.54 0.0130
Age*Gender M 0 . . . .

Type 3 Tests of Fixed Effects


Num Den
Effect DF DF F Value Pr > F

Gender 1 25 0.47 0.5003

Age 1 79 111.10 <.0001
Age*Gender 1 79 6.46 0.0130

Chapter 9

Principal Component Analysis

Principal component analysis is concerned with explaining the variance-covariance structure of a set of p
variables through a few (q < p) linear combinations of the original p variables. This method is used
to aid one in (1) dimension reduction and (2) data interpretation.
Before considering an example using this method it is necessary to review some matrix algebra concerning
eigenvalues and eigenvectors.

9.1 Eigenvalues and Eigenvectors


Suppose that A is an n × n matrix. Does there exist a nonzero vector \vec x that A transforms into a constant
multiple of itself? That is, does there exist a vector \vec x satisfying

A\vec x = λ\vec x

for some constant λ? If so, then

(A - λI_n)\vec x = 0,

which has a solution \vec x ≠ 0 only if

|A - λI_n| = 0.

Expanding this determinant gives a polynomial equation in λ; that is, it can be shown that

\sum_{i=0}^{n} a_i λ^i = 0.

This last equation is called the characteristic equation for the n × n matrix A. The matrix A can have as
many as n different values of λ that satisfy the characteristic equation. These solutions are called the
characteristic values or eigenvalues of the matrix A. Suppose that λ_1 is a solution and

A\vec x_1 = λ_1 \vec x_1, \quad \vec x_1 ≠ 0;

then \vec x_1 is said to be a characteristic vector or eigenvector of A corresponding to the eigenvalue λ_1. Note: the
eigenvalues may or may not be real numbers.

9.1.1 Properties of Eigenvalues–Eigenvectors


1. The n × n matrix A has at least one eigenvalue equal to zero if and only if A is singular.

2. The matrices A, C^{-1}AC, and CAC^{-1} have the same set of eigenvalues for any nonsingular matrix C.
3. The matrices A and A' have the same set of eigenvalues but need not have the same eigenvectors.
4. Let A be a nonsingular matrix with eigenvalue λ; then 1/λ is an eigenvalue of A^{-1}.

5. The eigenvectors are not unique, for if \vec x_1 is an eigenvector corresponding to λ_1 then c\vec x_1 is also an
eigenvector, since A(c\vec x_1) = λ_1(c\vec x_1), for any nonzero value c.
6. Let A be an n × n real matrix and let T denote its Schur decomposition. That is, there exists a
unitary complex matrix Q such that Q*AQ = T, where T is a complex, upper triangular matrix,
and the eigenvalues of A are the diagonal elements of the matrix T.

7. Suppose that A is a symmetric matrix;


(a) Then the eigenvalues of A are real numbers.
(b) For each eigenvalue there exists a real eigenvector (each element is a real number).
(c) Let λ_1 and λ_2 be distinct eigenvalues of A with corresponding eigenvectors \vec x_1 and \vec x_2; then \vec x_1 and \vec x_2 are
orthogonal vectors, that is, \vec x_1'\vec x_2 = 0.
(d) There exists an orthogonal matrix P (P'P = PP' = I_n) such that P'AP = D, where D is a
diagonal matrix whose diagonal elements are the eigenvalues of the matrix A.
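Property 7(d) is easy to verify numerically; a minimal PROC IML sketch (the matrix A is hypothetical):

proc iml;
a = {4 1,
     1 3};                 /* a symmetric matrix */
call eigen(d, p, a);       /* d = eigenvalues, p = orthogonal eigenvectors */
check = p` * a * p;        /* should equal diag(d), up to rounding */
print d check;
quit;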

9.2 Principal Components


Suppose that one has a p-dimensional vector X = (x_1, x_2, \ldots, x_p)' that has covariance matrix Σ. The idea
is to find a new q-dimensional vector Y = (y_1, y_2, \ldots, y_q)' where

y_i = \sum_{j=1}^{p} a_{ij} x_j,

for i = 1, 2, \ldots, q, with var(y_i) = a_i'Σa_i, cov(y_i, y_k) = a_i'Σa_k = 0, and var(y_1) ≥ var(y_2) ≥ \ldots ≥ var(y_q), where
a_i' = (a_{i1}, a_{i2}, \ldots, a_{ip}).
This problem has the following solution:
1. Suppose that the matrix Σ has associated real eigenvalue-eigenvector pairs (λ_i, a_i) where λ_1 ≥ λ_2 ≥ \ldots ≥ λ_p ≥ 0;
then the ith principal component is given by

y_i = a_i'X = a_{i1}x_1 + a_{i2}x_2 + \ldots + a_{ip}x_p,

and var(y_i) = λ_i for i = 1, 2, \ldots, p, with cov(y_i, y_k) = a_i'Σa_k = 0 for i ≠ k. Note that the eigenvalues λ_i are
unique; however, the eigenvectors (and hence the components y_i) are not.
2. The total variance for the p dimensions is tr[Σ] = \sum_{i=1}^{p} λ_i. Hence, the proportion of variance explained
by the kth principal component is λ_k / \sum_{i=1}^{p} λ_i.
3. If the matrix X is centered and scaled so that Σ is the correlation matrix, then \sum_{i=1}^{p} λ_i = p.
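A small PROC IML sketch of items 2 and 3, computing the proportion of variance explained from the
eigenvalues of a hypothetical correlation matrix:

proc iml;
r = {1.0 0.6 0.3,
     0.6 1.0 0.5,
     0.3 0.5 1.0};
call eigen(lambda, e, r);
prop = lambda / sum(lambda);    /* lambda_k / p, since tr(R) = p */
cum  = cusum(prop);             /* cumulative proportion explained */
print lambda prop cum;
quit;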

9.2.1 Example – FOC Sales
This example uses the FOC sales data found in Dielman's text. The SAS code is,

title1 'FOC Sales - Dielman Chapter 8';


data foc_sales;
input SALES MONTH FOV COMPOSITE INDUSTRIAL TRANS UTILITY FINANCE PROD HOUSE;
datalines;
.
.
.
;
proc princomp data=foc_sales;
var FOV COMPOSITE INDUSTRIAL TRANS UTILITY FINANCE PROD HOUSE;
run;

and the corresponding output is;


FOC Sales - Dielman Chapter 8
The PRINCOMP Procedure

Observations 265
Variables 8

Simple Statistics

FOV COMPOSITE INDUSTRIAL TRANS

Mean 1069001.796 384.7471698 481.8603774 347.0226415


StD 288773.614 113.6248822 140.1715647 90.2328609

UTILITY FINANCE PROD HOUSE

Mean 270.7320755 340.8150943 72.38867925 331.3584906


StD 66.2953098 120.9088365 32.78401345 45.1613665

Correlation Matrix

FOV COMPOSITE INDUSTRIAL TRANS UTILITY FINANCE PROD HOUSE

FOV 1.0000 0.4480 0.4564 0.4107 0.3752 0.4395 0.4660 0.1227


COMPOSITE 0.4480 1.0000 0.9992 0.9808 0.9516 0.9942 0.9622 0.4756
INDUSTRIAL 0.4564 0.9992 1.0000 0.9798 0.9444 0.9914 0.9638 0.4704
TRANS 0.4107 0.9808 0.9798 1.0000 0.8977 0.9879 0.9166 0.4520
UTILITY 0.3752 0.9516 0.9444 0.8977 1.0000 0.9308 0.9335 0.5214
FINANCE 0.4395 0.9942 0.9914 0.9879 0.9308 1.0000 0.9446 0.4634
PROD 0.4660 0.9622 0.9638 0.9166 0.9335 0.9446 1.0000 0.6018
HOUSE 0.1227 0.4756 0.4704 0.4520 0.5214 0.4634 0.6018 1.0000

Eigenvalues of the Correlation Matrix

Eigenvalue Difference Proportion Cumulative

1 6.29775087 5.40865169 0.7872 0.7872


2 0.88909918 0.24608105 0.1111 0.8984
3 0.64301813 0.53207160 0.0804 0.9787
4 0.11094653 0.06361914 0.0139 0.9926
5 0.04732739 0.04035421 0.0059 0.9985
6 0.00697318 0.00209528 0.0009 0.9994
7 0.00487790 0.00487108 0.0006 1.0000
8 0.00000682 0.0000 1.0000

The PRINCOMP Procedure

Eigenvectors

Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7 Prin8

FOV 0.196279 0.767804 0.601657 -.008748 0.097951 0.011645 -.012393 0.000243


COMPOSITE 0.395296 0.025306 -.149454 0.040571 -.010696 -.046339 0.400148 -.810394
INDUSTRIAL 0.394798 0.036677 -.144876 0.069547 -.119048 0.180686 0.682217 0.552018
TRANS 0.385564 0.016319 -.201304 0.531083 0.267185 0.527935 -.422252 0.010084
UTILITY 0.380283 -.079879 -.111648 -.770369 0.448507 0.095961 -.166056 0.071625
FINANCE 0.392185 0.029463 -.169716 0.242321 0.118966 -.818914 -.199303 0.182498
PROD 0.389979 -.054709 0.064182 -.202320 -.818674 0.078091 -.351489 -.000664
HOUSE 0.224016 -.630859 0.713282 0.135168 0.138538 -.020693 0.071907 -.000346
From the output one notices that the effective dimension of the original variables is much smaller than 8 (look at the size of the smaller eigenvalues) and that the first two principal components explain nearly 90% (0.8984) of the variance found in the original variables. In considering the first two principal components, one observes that the first component is essentially a weighted average of the 8 original variables, with FOV and HOUSE weighted the least, whereas principal component 2 is essentially FOV − HOUSE.
Considering the amount of variance explained, the first principal component explains about 79% of the variance (6.2978/8 = 0.7872), the first two principal components explain nearly 90%, and the first three about 98%. Thus this 8-dimensional data set can be reduced to a 2- or 3-dimensional data set, an effective dimension reduction.

Chapter 10

Canonical Correlation

Canonical correlation analysis seeks to identify and quantify the associations between two sets of variables. It focuses on finding the linear combination of the variables in one set that has the highest correlation with a linear combination of the variables in the second set. These linear combinations are called the canonical variables, and their correlations are called the canonical correlations.
Suppose that one has two random vectors, X^(1) of p dimensions and X^(2) of q dimensions, where p ≤ q. Let µ^(1) and Σ_11 denote the mean vector and covariance matrix of X^(1), and let µ^(2) and Σ_22 denote the mean vector and covariance matrix of X^(2). Let Σ_12 = Σ_21' denote the covariance matrix between X^(1) and X^(2).
Define two linear combinations as

    U = a'X^(1) ,
    V = b'X^(2) .

In which case one has

    E(U) = a'µ^(1) ,   Var(U) = a'Σ_11 a ,
    E(V) = b'µ^(2) ,   Var(V) = b'Σ_22 b ,
    Cov(U, V) = a'Σ_12 b .
The idea of canonical correlation is to find a and b such that

    Corr(U, V) = a'Σ_12 b / \sqrt{ (a'Σ_11 a)(b'Σ_22 b) }

is as large as possible.
Let (U_i, V_i) denote the ith canonical pair (of linear combinations); then it can be shown that

    U_i = a_i'X^(1) = e_i'Σ_11^{-1/2} X^(1) ,   V_i = b_i'X^(2) = f_i'Σ_22^{-1/2} X^(2) ,

where e_1, e_2, . . . , e_p are the eigenvectors of Σ_11^{-1/2} Σ_12 Σ_22^{-1} Σ_21 Σ_11^{-1/2} with the corresponding eigenvalues ordered in descending order. Likewise, f_1, f_2, . . . , f_p are the eigenvectors of Σ_22^{-1/2} Σ_21 Σ_11^{-1} Σ_12 Σ_22^{-1/2} with the corresponding eigenvalues ordered in descending order.
Furthermore, it follows that

    Var(U_i) = Var(V_i) = 1 ,
    Cov(U_i, U_k) = Cov(V_i, V_k) = Cov(U_i, V_k) = 0 for i ≠ k ,
    Corr(U_i, V_i) = \sqrt{λ_i} ,

where λ_i is the ith largest eigenvalue of Σ_11^{-1/2} Σ_12 Σ_22^{-1} Σ_21 Σ_11^{-1/2} and Σ_11 = Σ_11^{1/2} Σ_11^{1/2}, Σ_22 = Σ_22^{1/2} Σ_22^{1/2}.
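The following is a minimal SAS/IML sketch of the eigenvalue computation above; the partitioned covariance matrices S11, S22, and S12 are small hypothetical examples (p = q = 2), not data from the text.

proc iml;
   S11 = {1.0 0.4,  0.4 1.0};        /* hypothetical Cov(X1) */
   S22 = {1.0 0.3,  0.3 1.0};        /* hypothetical Cov(X2) */
   S12 = {0.4 0.2,  0.3 0.1};        /* hypothetical Cov(X1, X2) */
   call eigen(d1, P1, S11);          /* spectral decomposition of S11 */
   S11h = P1 * diag(1/sqrt(d1)) * P1`;   /* symmetric S11^{-1/2} */
   M = S11h * S12 * inv(S22) * S12` * S11h;
   call eigen(lambda, E, M);         /* lambda(i) = squared canonical correlations */
   rho = sqrt(lambda);               /* canonical correlations, largest first */
   print lambda rho;
quit;

The coefficient vectors are then a_i = Σ_11^{-1/2} e_i, with the analogous computation on Σ_22 giving the b_i. In practice one would simply use PROC CANCORR, as in the examples that follow; the sketch only makes the eigenvalue structure explicit.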

10.1 Examples
SAS PROC CANCORR is the procedure for finding the canonical correlations. I have included two examples.

10.1.1 Job Satisfaction – Johnson-Wichern Data


The SAS code is:

title 'Canonical Correlation Analysis';
title2 'Johnson-Wichern Data page 605';
data job_sat (type=corr);
   _type_ = 'corr';
   input feedbk task_sign task_var task_id auto sup_sat career
         fin_sat work_sat comp_id kind_sat gen_sat;
cards;
1.00  .    .    .    .    .    .    .    .    .    .    .
0.49 1.00  .    .    .    .    .    .    .    .    .    .
0.53 0.57 1.00  .    .    .    .    .    .    .    .    .
0.49 0.46 0.48 1.00  .    .    .    .    .    .    .    .
0.51 0.53 0.57 0.57 1.00  .    .    .    .    .    .    .
0.33 0.30 0.31 0.24 0.38 1.00  .    .    .    .    .    .
0.32 0.21 0.23 0.22 0.32 0.43 1.00  .    .    .    .    .
0.20 0.16 0.14 0.12 0.17 0.27 0.33 1.00  .    .    .    .
0.19 0.08 0.07 0.19 0.23 0.24 0.26 0.25 1.00  .    .    .
0.30 0.27 0.24 0.21 0.32 0.34 0.54 0.46 0.28 1.00  .    .
0.37 0.35 0.37 0.29 0.36 0.37 0.32 0.29 0.30 0.35 1.00  .
0.21 0.20 0.18 0.16 0.27 0.40 0.58 0.45 0.27 0.59 0.31 1.00
;
proc cancorr data=job_sat;
   var feedbk task_sign task_var task_id auto;
   with sup_sat career fin_sat work_sat comp_id kind_sat gen_sat;
run;

The SAS output is:
Canonical Correlation Analysis
Johnson-Wichern Data page 605
Canonical Correlation Analysis

Adjusted Approximate Squared


Canonical Canonical Standard Canonical
Correlation Correlation Error Correlation

1 0.553706 0.553073 0.006934 0.306591


2 0.236404 0.234689 0.009442 0.055887
3 0.119186 . 0.009858 0.014205
4 0.072228 . 0.009948 0.005217
5 0.057270 . 0.009968 0.003280

Eigenvalues of Inv(E)*H
= CanRsq/(1-CanRsq)

Eigenvalue Difference Proportion Cumulative

1 0.4421 0.3830 0.8433 0.8433


2 0.0592 0.0448 0.1129 0.9562
3 0.0144 0.0092 0.0275 0.9837
4 0.0052 0.0020 0.0100 0.9937
5 0.0033 0.0063 1.0000

Test of H0: The canonical correlations in the
current row and all that follow are zero

Likelihood Approximate
Ratio F Value Num DF Den DF Pr > F

1 0.63988477 134.42 35 42018 <.0001


2 0.92280941 33.82 24 34849 <.0001
3 0.97743541 15.26 15 27578 <.0001
4 0.99152030 10.66 8 19982 <.0001
5 0.99672015 10.96 3 9992 <.0001

Multivariate Statistics and F Approximations

S=5 M=0.5 N=4993

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.63988477 134.42 35 42018 <.0001


Pillai’s Trace 0.38517977 119.14 35 49960 <.0001
Hotelling-Lawley Trace 0.52428968 149.60 35 28415 <.0001
Roy’s Greatest Root 0.44214935 631.14 7 9992 <.0001

NOTE: F Statistic for Roy’s Greatest Root is an upper bound.

Raw Canonical Coefficients for the VAR Variables

V1 V2 V3

feedbk 0.4217037005 0.3428520819 -0.857665357


task_sign 0.1951059282 -0.66829859 0.4434261983
task_var 0.1676125195 -0.853155876 -0.259213388
task_id -0.022889257 0.3560702979 -0.42310623
auto 0.4596557262 0.7287240893 0.9799053152

Raw Canonical Coefficients for the VAR Variables

V4 V5

feedbk -0.788408809 0.0308427228


task_sign -0.269128849 0.9832287828
task_var 0.4687568565 -0.914141074
task_id 1.0423235086 0.5243667708
auto -0.168165047 -0.439242006

Raw Canonical Coefficients for the WITH Variables

W1 W2 W3

sup_sat 0.4251801045 -0.087992909 0.4918143621
career 0.2088846933 0.436262871 -0.78320006
fin_sat -0.035894963 -0.092909718 -0.47784774
work_sat 0.023525029 0.9260127356 -0.00651068
comp_id 0.2902803477 -0.101101165 0.2831089034
kind_sat 0.5157248004 -0.554289712 -0.41249444
gen_sat -0.11014262 -0.031722209 0.9284585695

Raw Canonical Coefficients for the WITH Variables

W4 W5

sup_sat -0.128429606 -0.482345255


career -0.340530647 -0.749890932
fin_sat -0.605914061 0.3457245048
work_sat 0.4043753719 0.3115896198
comp_id -0.446854955 0.7029742619
kind_sat 0.6875998752 0.1795657307
gen_sat 0.2738895655 -0.014489511

Standardized Canonical Coefficients for the VAR Variables

V1 V2 V3 V4 V5

feedbk 0.4217 0.3429 -0.8577 -0.7884 0.0308


task_sign 0.1951 -0.6683 0.4434 -0.2691 0.9832
task_var 0.1676 -0.8532 -0.2592 0.4688 -0.9141
task_id -0.0229 0.3561 -0.4231 1.0423 0.5244
auto 0.4597 0.7287 0.9799 -0.1682 -0.4392

Standardized Canonical Coefficients for the WITH Variables

W1 W2 W3 W4 W5

sup_sat 0.4252 -0.0880 0.4918 -0.1284 -0.4823


career 0.2089 0.4363 -0.7832 -0.3405 -0.7499
fin_sat -0.0359 -0.0929 -0.4778 -0.6059 0.3457
work_sat 0.0235 0.9260 -0.0065 0.4044 0.3116
comp_id 0.2903 -0.1011 0.2831 -0.4469 0.7030
kind_sat 0.5157 -0.5543 -0.4125 0.6876 0.1796
gen_sat -0.1101 -0.0317 0.9285 0.2739 -0.0145

Correlations Between the VAR Variables and Their Canonical Variables

V1 V2 V3 V4 V5

feedbk 0.8293 0.1093 -0.4853 -0.2469 0.0611


task_sign 0.7304 -0.4366 0.2001 0.0021 0.4857

task_var 0.7533 -0.4661 -0.1056 0.3020 -0.3360
task_id 0.6160 0.2225 -0.2053 0.6614 0.3026
auto 0.8606 0.2660 0.3886 0.1484 -0.1246

Correlations Between the WITH Variables and Their Canonical Variables

W1 W2 W3 W4 W5

sup_sat 0.7564 0.0446 0.3395 -0.1294 -0.3370


career 0.6439 0.3582 -0.1717 -0.3530 -0.3335
fin_sat 0.3872 0.0373 -0.1767 -0.5348 0.4148
work_sat 0.3772 0.7919 -0.0054 0.2886 0.3341
comp_id 0.6532 0.1084 0.2092 -0.4376 0.4346
kind_sat 0.8040 -0.2416 -0.2348 0.4052 0.1964
gen_sat 0.5024 0.1628 0.4933 -0.1890 0.0678

Correlations Between the VAR Variables and the


Canonical Variables of the WITH Variables

W1 W2 W3 W4 W5

feedbk 0.4592 0.0258 -0.0578 -0.0178 0.0035


task_sign 0.4044 -0.1032 0.0239 0.0002 0.0278
task_var 0.4171 -0.1102 -0.0126 0.0218 -0.0192
task_id 0.3411 0.0526 -0.0245 0.0478 0.0173
auto 0.4765 0.0629 0.0463 0.0107 -0.0071

Correlations Between the WITH Variables and


the Canonical Variables of the VAR Variables

V1 V2 V3 V4 V5

sup_sat 0.4188 0.0105 0.0405 -0.0093 -0.0193


career 0.3565 0.0847 -0.0205 -0.0255 -0.0191
fin_sat 0.2144 0.0088 -0.0211 -0.0386 0.0238
work_sat 0.2088 0.1872 -0.0006 0.0208 0.0191
comp_id 0.3617 0.0256 0.0249 -0.0316 0.0249
kind_sat 0.4452 -0.0571 -0.0280 0.0293 0.0112
gen_sat 0.2782 0.0385 0.0588 -0.0136 0.0039

10.1.2 Police Applicant Files – Johnson Data


The SAS code is:
options nodate ps=60 PAGENO=1 LINESIZE=75;
TITLE ’Police Department Applicant Data Johnson page 160’;
data police;
input ID REACT HEIGHT WEIGHT SHLDR PELVIC CHEST
THIGH PULSE DIAST CHNUP BREATH RECVR SPEED ENDUR FAT;
cards;

1 0.310 179.6 74.20 41.7 27.3 82.4 19.0 64 64 2 158 108 5.5 4.0 11.91
2 0.345 175.6 62.04 37.5 29.1 84.1 5.5 88 78 20 166 108 5.5 4.0 3.13
3 0.293 166.2 72.96 39.4 26.8 88.1 22.0 100 88 7 167 116 5.5 4.0 16.89
4 0.254 173.8 85.92 41.2 27.6 97.6 19.5 64 62 4 220 120 5.5 4.0 19.59
5 0.384 184.8 65.88 39.8 26.1 88.2 14.5 80 68 9 210 120 5.5 5.0 7.74
6 0.406 189.1 102.26 43.3 30.1 101.2 22.0 60 68 4 188 91 6.0 4.0 30.42
7 0.344 191.5 84.04 42.8 28.4 91.0 18.0 64 48 1 272 110 6.0 3.0 13.70
8 0.321 180.2 68.34 41.6 27.3 90.4 5.5 74 64 14 193 117 5.5 4.0 3.04
9 0.425 183.8 95.14 42.3 30.1 100.2 13.5 80 78 4 199 105 5.5 4.0 20.26
10 0.385 163.1 54.28 37.2 24.2 80.5 7.0 84 78 13 157 113 6.0 4.0 3.04
11 0.317 169.6 75.92 39.4 27.2 92.0 16.5 65 78 6 180 110 5.0 5.0 12.83
12 0.353 171.6 71.70 39.1 27.0 86.2 25.5 68 72 0 193 105 5.5 4.0 15.95
13 0.413 180.0 80.68 40.8 28.3 87.4 17.5 73 88 4 218 109 5.0 4.2 11.86
14 0.392 174.6 70.40 39.8 25.9 83.9 16.5 104 78 6 190 129 5.0 4.0 9.93
15 0.312 181.8 91.40 40.6 29.5 95.1 32.0 92 88 1 206 139 5.0 3.5 32.63
16 0.342 167.4 65.74 39.7 26.4 86.0 13.0 80 86 6 181 120 5.5 4.0 6.64
17 0.293 173.0 79.28 41.2 26.9 96.1 11.5 72 68 6 184 111 5.5 3.9 11.57
18 0.317 179.8 92.06 40.0 29.8 100.9 15.0 60 78 0 205 92 5.0 4.0 24.21
19 0.333 176.8 87.96 41.2 28.4 100.8 20.5 76 90 1 228 147 4.0 3.5 22.39
20 0.317 179.3 77.66 41.4 31.6 90.1 9.5 58 86 15 198 98 5.5 4.1 6.29
21 0.427 193.5 98.44 41.6 29.2 95.7 21.0 54 74 0 254 110 5.5 3.8 23.63
22 0.266 178.8 65.42 39.3 27.1 83.0 16.5 88 72 7 206 121 5.5 4.0 10.53
23 0.311 179.6 97.04 43.8 30.1 100.8 22.0 100 74 3 194 124 5.0 4.0 20.62
24 0.284 172.6 81.72 40.9 27.3 91.5 22.0 74 76 4 201 113 5.5 5.1 18.39
25 0.259 171.5 69.60 40.4 27.8 87.7 15.5 70 72 10 175 110 5.5 3.0 11.14
26 0.317 168.9 63.66 39.8 26.7 83.9 6.0 68 70 7 179 119 5.5 5.0 5.16
27 0.263 183.1 87.24 43.2 28.3 95.7 11.0 88 74 7 245 115 5.5 4.0 9.60
28 0.336 163.6 64.86 37.5 26.6 84.0 15.5 64 64 6 146 115 5.0 4.4 11.93
29 0.267 184.3 84.68 40.3 29.0 93.2 8.5 64 76 2 213 109 5.5 5.0 8.55
30 0.271 181.0 73.78 42.8 29.7 90.3 8.5 56 88 11 181 109 6.0 5.0 4.94
31 0.264 180.2 75.84 41.4 28.7 88.1 13.5 76 76 9 192 144 5.5 3.6 10.62
32 0.357 184.1 70.48 42.0 28.9 81.3 14.0 84 72 5 231 123 5.5 4.5 8.46
33 0.259 178.9 86.90 42.5 28.7 95.0 16.0 54 68 12 186 118 6.0 4.0 13.47
34 0.221 170.0 76.68 39.7 27.7 93.6 15.0 50 72 4 178 108 5.5 4.5 12.81
35 0.333 180.6 77.32 42.1 27.3 89.5 16.0 88 72 11 200 119 5.5 4.6 13.34
36 0.359 179.0 79.90 40.8 28.2 90.3 26.5 80 80 3 201 124 5.5 3.7 24.57
37 0.314 186.6 100.36 42.5 31.5 100.3 27.0 62 76 2 208 120 5.5 4.1 28.35
38 0.295 181.4 91.66 41.9 28.9 96.6 25.5 68 78 2 211 125 6.0 3.0 26.12
39 0.296 176.5 79.00 40.7 29.1 86.5 20.5 60 66 5 210 117 5.5 4.2 15.21
40 0.308 174.0 69.10 40.9 27.0 88.1 18.0 92 74 5 161 140 5.0 5.5 12.51
41 0.327 178.2 87.78 42.9 27.2 100.3 16.5 72 72 4 189 115 5.5 3.5 20.50
42 0.303 177.1 70.18 39.4 27.6 85.5 16.0 72 74 14 201 110 6.0 4.8 10.67
43 0.297 180.0 67.66 40.9 28.7 86.1 15.0 76 76 5 177 110 5.5 4.5 10.76
44 0.244 176.8 86.12 41.3 28.2 92.7 12.5 76 68 7 181 110 5.5 4.0 14.55
45 0.282 176.3 65.00 39.0 26.0 83.3 7.0 88 72 12 167 127 5.5 5.0 5.27
46 0.285 192.4 99.14 43.7 28.7 96.1 20.5 64 68 4 174 105 6.0 4.0 17.94
47 0.299 175.2 75.70 39.4 27.3 90.8 19.0 56 76 7 174 111 5.5 4.5 12.64
48 0.280 175.9 78.62 43.4 29.3 90.7 18.0 64 72 7 170 117 5.5 3.7 10.81
49 0.268 174.6 64.88 42.3 29.2 82.6 3.5 72 80 11 199 113 6.0 4.5 2.01

50 0.362 179.0 71.00 41.2 27.3 85.6 16.0 68 90 5 150 108 5.5 5.0 10.00
;
proc cancorr data=police;
var HEIGHT WEIGHT SHLDR PELVIC CHEST
THIGH PULSE FAT;
with DIAST CHNUP BREATH RECVR SPEED ENDUR REACT;
run;
The SAS output is:
Police Department Applicant Data Johnson page 160
The CANCORR Procedure

Canonical Correlation Analysis

Adjusted Approximate Squared


Canonical Canonical Standard Canonical
Correlation Correlation Error Correlation

1 0.828134 0.766849 0.044885 0.685806


2 0.692968 0.474986 0.074256 0.480205
3 0.672133 . 0.078320 0.451762
4 0.572969 . 0.095958 0.328293
5 0.370474 0.232790 0.123250 0.137251
6 0.290571 . 0.130796 0.084431
7 0.116698 0.090059 0.140912 0.013618

Eigenvalues of Inv(E)*H
= CanRsq/(1-CanRsq)

Eigenvalue Difference Proportion Cumulative

1 2.1827 1.2589 0.4660 0.4660


2 0.9238 0.0998 0.1972 0.6632
3 0.8240 0.3353 0.1759 0.8391
4 0.4887 0.3297 0.1043 0.9434
5 0.1591 0.0669 0.0340 0.9774
6 0.0922 0.0784 0.0197 0.9971
7 0.0138 0.0029 1.0000

Test of H0: The canonical correlations in the


current row and all that follow are zero

Likelihood Approximate
Ratio F Value Num DF Den DF Pr > F

1 0.04685970 2.65 56 193.79 <.0001


2 0.14914244 2.05 42 172.31 0.0007
3 0.28692530 1.83 30 150 0.0097
4 0.52335933 1.37 20 126.98 0.1501
5 0.77914858 0.85 12 103.48 0.5964

6 0.90310008 0.70 6 80 0.6527
7 0.98638151 0.28 2 41 0.7550

Multivariate Statistics and F Approximations

S=7 M=0 N=16.5

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.04685970 2.65 56 193.79 <.0001


Pillai’s Trace 2.18136682 2.32 56 287 <.0001
Hotelling-Lawley Trace 4.68445901 2.81 56 108.32 <.0001
Roy’s Greatest Root 2.18274398 11.19 8 41 <.0001

NOTE: F Statistic for Roy’s Greatest Root is an upper bound.

Canonical Correlation Analysis

Raw Canonical Coefficients for the VAR Variables

V1 V2 V3 V4

HEIGHT -0.03369978 -0.130730762 0.0777632323 0.1240318137


WEIGHT -0.031527077 -0.029616695 -0.098190313 -0.029064582
SHLDR -0.276851469 0.0241774481 0.2583675421 -0.432237505
PELVIC 0.4606242408 0.2580542149 -0.37455071 0.3947911092
CHEST 0.1196387402 0.0804406702 -0.031291718 0.1329780013
THIGH -0.000066706 0.0879242726 -0.020739688 0.0527928224
PULSE 0.0202969948 0.0336594758 0.0307722355 0.0546216245
FAT -0.138142293 0.0072196998 0.1023757778 -0.068035676

Raw Canonical Coefficients for the VAR Variables

V5 V6 V7

HEIGHT -0.044057143 -0.041544567 -0.011211444


WEIGHT -0.070782966 -0.000820503 -0.110706488
SHLDR 0.7432141994 0.0930975973 0.479205673
PELVIC 0.3211430261 0.4964168676 -0.203128828
CHEST 0.0331152404 -0.291251069 0.0501325869
THIGH 0.1006104456 -0.189587407 -0.300202599
PULSE 0.0088398018 -0.004610415 0.0032985514
FAT -0.076839829 0.28227855 0.3263432609

Raw Canonical Coefficients for the WITH Variables

W1 W2 W3 W4

DIAST 0.0176353469 0.0421264144 -0.035847635 0.0411990106

CHNUP 0.2189806787 -0.058683974 0.0409208152 0.0789130807
BREATH 0.0040725029 -0.020969651 -0.011241439 0.0331536948
RECVR -0.024918952 0.0226276962 0.086682768 0.0059240991
SPEED -1.401961388 -1.221248504 1.0630790256 -0.828522641
ENDUR 0.2485158692 -0.694038707 0.4126618238 -0.204662786
REACT -5.154743075 -4.324972169 13.336435687 5.4449477881

Raw Canonical Coefficients for the WITH Variables

W5 W6 W7

DIAST 0.0449626941 0.0930163199 -0.050826629


CHNUP -0.001445028 -0.018065475 0.1065978856
BREATH 0.0113879377 -0.011443896 -0.008210919
RECVR 0.0445544767 -0.009993506 -0.008657341
SPEED 1.8622567268 1.8089626692 -0.467293724
ENDUR -0.27522717 -0.427315494 -1.608351594
REACT -12.66111399 8.3020224802 2.9536753902

Standardized Canonical Coefficients for the VAR Variables

V1 V2 V3 V4

HEIGHT -0.2259 -0.8763 0.5213 0.8314


WEIGHT -0.3616 -0.3397 -1.1261 -0.3333
SHLDR -0.4391 0.0383 0.4098 -0.6856
PELVIC 0.6620 0.3709 -0.5383 0.5674
CHEST 0.7144 0.4803 -0.1868 0.7940
THIGH -0.0004 0.5365 -0.1266 0.3221
PULSE 0.2620 0.4346 0.3973 0.7052
FAT -1.0150 0.0530 0.7522 -0.4999

Standardized Canonical Coefficients for the VAR Variables

V5 V6 V7

HEIGHT -0.2953 -0.2785 -0.0752


WEIGHT -0.8118 -0.0094 -1.2696
SHLDR 1.1788 0.1477 0.7601
PELVIC 0.4615 0.7134 -0.2919
CHEST 0.1977 -1.7390 0.2993
THIGH 0.6139 -1.1568 -1.8318
PULSE 0.1141 -0.0595 0.0426
FAT -0.5646 2.0741 2.3978

Standardized Canonical Coefficients for the WITH Variables

W1 W2 W3 W4

DIAST 0.1431 0.3418 -0.2909 0.3343
CHNUP 0.9592 -0.2570 0.1792 0.3456
BREATH 0.1037 -0.5338 -0.2861 0.8439
RECVR -0.2786 0.2530 0.9691 0.0662
SPEED -0.5098 -0.4441 0.3866 -0.3013
ENDUR 0.1402 -0.3916 0.2328 -0.1155
REACT -0.2485 -0.2085 0.6430 0.2625

Standardized Canonical Coefficients for the WITH Variables

W5 W6 W7

DIAST 0.3648 0.7547 -0.4124


CHNUP -0.0063 -0.0791 0.4669
BREATH 0.2899 -0.2913 -0.2090
RECVR 0.4981 -0.1117 -0.0968
SPEED 0.6772 0.6578 -0.1699
ENDUR -0.1553 -0.2411 -0.9074
REACT -0.6105 0.4003 0.1424

Canonical Structure

Correlations Between the VAR Variables and Their Canonical Variables

V1 V2 V3 V4

HEIGHT -0.4690 -0.5849 -0.1467 0.6031


WEIGHT -0.6252 0.0215 -0.6025 0.3991
SHLDR -0.4257 -0.2246 -0.2598 0.1893
PELVIC -0.0912 -0.0963 -0.6641 0.4866
CHEST -0.4013 0.1649 -0.6225 0.3602
THIGH -0.7780 0.4706 -0.1039 0.1899
PULSE 0.1789 0.4304 0.6802 0.4242
FAT -0.7694 0.3838 -0.3359 0.3220

Correlations Between the VAR Variables and Their Canonical Variables

V5 V6 V7

HEIGHT 0.2250 -0.0007 0.0309


WEIGHT 0.1127 -0.1184 0.1803
SHLDR 0.7434 -0.1304 0.2726
PELVIC 0.4103 0.3586 0.0385
CHEST 0.0521 -0.3543 0.4064
THIGH 0.0363 0.0136 -0.3142
PULSE 0.0356 -0.0229 0.0679
FAT -0.0984 0.0773 0.1398

Correlations Between the WITH Variables and Their Canonical Variables

W1 W2 W3 W4

DIAST 0.2804 0.5144 -0.0832 0.3439


CHNUP 0.8506 -0.2571 0.2804 -0.1059
BREATH -0.3562 -0.3713 -0.1966 0.7589
RECVR -0.0604 0.5217 0.6184 0.2798
SPEED -0.0927 -0.6909 0.0226 -0.3885
ENDUR 0.3756 -0.2301 0.2091 -0.2746
REACT -0.2580 -0.1485 0.3306 0.4341

Correlations Between the WITH Variables and Their Canonical Variables

W5 W6 W7

DIAST 0.0653 0.5969 -0.4135


CHNUP 0.1626 0.1707 0.2546
BREATH 0.2078 -0.2785 -0.0034
RECVR 0.3759 -0.3389 -0.0857
SPEED 0.4263 0.3935 0.1616
ENDUR -0.2185 -0.0931 -0.7940
REACT -0.6670 0.4064 0.0599

Correlations Between the VAR Variables and the


Canonical Variables of the WITH Variables

W1 W2 W3 W4

HEIGHT -0.3884 -0.4053 -0.0986 0.3455


WEIGHT -0.5178 0.0149 -0.4050 0.2287
SHLDR -0.3525 -0.1556 -0.1746 0.1084
PELVIC -0.0756 -0.0668 -0.4464 0.2788
CHEST -0.3323 0.1143 -0.4184 0.2064
THIGH -0.6443 0.3261 -0.0699 0.1088
PULSE 0.1481 0.2982 0.4572 0.2431
FAT -0.6371 0.2660 -0.2258 0.1845

Correlations Between the VAR Variables and the


Canonical Variables of the WITH Variables

W5 W6 W7

HEIGHT 0.0834 -0.0002 0.0036


WEIGHT 0.0417 -0.0344 0.0210
SHLDR 0.2754 -0.0379 0.0318
PELVIC 0.1520 0.1042 0.0045
CHEST 0.0193 -0.1029 0.0474
THIGH 0.0135 0.0039 -0.0367

PULSE 0.0132 -0.0067 0.0079
FAT -0.0364 0.0225 0.0163

Correlations Between the WITH Variables and


the Canonical Variables of the VAR Variables

V1 V2 V3 V4

DIAST 0.2322 0.3564 -0.0559 0.1970


CHNUP 0.7044 -0.1782 0.1885 -0.0607
BREATH -0.2950 -0.2573 -0.1321 0.4348
RECVR -0.0500 0.3615 0.4157 0.1603
SPEED -0.0768 -0.4788 0.0152 -0.2226
ENDUR 0.3111 -0.1595 0.1405 -0.1573
REACT -0.2137 -0.1029 0.2222 0.2487

Correlations Between the WITH Variables and


the Canonical Variables of the VAR Variables

V5 V6 V7

DIAST 0.0242 0.1735 -0.0483


CHNUP 0.0603 0.0496 0.0297
BREATH 0.0770 -0.0809 -0.0004
RECVR 0.1392 -0.0985 -0.0100
SPEED 0.1579 0.1143 0.0189
ENDUR -0.0809 -0.0271 -0.0927
REACT -0.2471 0.1181 0.0070

Chapter 11

Factor Analysis

Factor Analysis (FA) is a topic related to Principal Component Analysis (PCA) in that there are some similarities, although FA has existed as a separate and somewhat controversial topic within multivariate statistical analysis. I have chosen to leave the discussion of the similarities and differences to the reader; it is given in a number of texts that cover FA exclusively. Whereas in PCA the objective was to find linear combinations of the original measurable variables that explained a specified proportion of the variance/covariance, in FA one postulates the existence of unobservable constructs, or latent variables, that are called factors. The existence of these hypothetical variables is open to question since they are unobservable.
The basic Factor Analysis (FA) model is

    X = ΛF + ε ,

where X is an observable (manifest) random vector (p × 1), F is an unobservable (latent) vector of k common factors, ε is a vector of random errors, and Λ is a p × k matrix of regression weights or loadings. The following assumptions are made:

• E(X) = E(F) = E(ε) = 0.
• Var(X) = Σ.
• Var(F) = I_k.
• Var(ε) = Ψ = diag( ψ_1 ψ_2 . . . ψ_p ), where each ψ_i > 0.


• These assumptions imply that
      Σ = ΛΛ' + Ψ ,
  or
      Var(x_i) = σ_i^2 = h_i^2 + ψ_i ,
  where h_i^2 = \sum_{j=1}^{k} λ_{ij}^2 is called the common variance or communality and ψ_i is called the unique or specific variance of x_i (a small numerical sketch of this decomposition follows the list).
• The covariance between x_i and x_m is given by
      cov(x_i, x_m) = σ_{im} = \sum_{j=1}^{k} λ_{ij} λ_{mj} .
  If Σ is the correlation matrix, then
      corr(x_i, x_m) = ρ_{im} = \sum_{j=1}^{k} λ_{ij} λ_{mj} .
• Cov(X, F) = Λ, or cov(x_i, f_j) = λ_{ij}.
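The following is a minimal SAS/IML sketch, using a hypothetical 3 × 1 loading matrix (p = 3, k = 1) and specific variances, of how Σ = ΛΛ' + Ψ reproduces the covariance structure and how the communalities arise.

proc iml;
   Lambda = {0.9, 0.7, 0.5};          /* hypothetical loadings, p = 3, k = 1 */
   Psi    = diag({0.19 0.51 0.75});   /* specific variances psi_i = 1 - h_i^2 */
   Sigma  = Lambda*Lambda` + Psi;     /* implied covariance matrix */
   h2     = (Lambda##2)[,+];          /* communalities h_i^2 */
   print Sigma h2;
quit;

Because each h_i^2 + ψ_i = 1 here, the implied Σ has unit diagonal; that is, it is a correlation matrix.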

It can be shown that the factor loadings Λ are unique only up to an orthogonal transformation (a small numerical check of this invariance is sketched after the list below). That is, suppose that Λ* = ΛT where TT' = I_k; then Λ*(Λ*)' = ΛTT'Λ' = ΛΛ'. Since one has this type of nonuniqueness, the problem is to find the "best" matrix of loadings that reproduces the existing covariance structure, where "best" is usually defined as the transformation that leads to the clearest understanding of the nature of the latent variables. Some authors address this problem by finding a rotation or transformation that leads to loadings such that:
1. Each row of Λ has at least one zero.
2. Each column of Λ has at least one zero.
3. For every pair of columns of Λ, there are several rows with a zero loading in one column and a nonzero loading in the other.
4. If k ≥ 4, several pairs of columns of Λ should have two zero loadings and a small number of nonzero loadings.
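The following is a minimal SAS/IML sketch (with hypothetical p = 3, k = 2 loadings) verifying that an orthogonal rotation T leaves ΛΛ', and hence the implied covariance structure, unchanged.

proc iml;
   Lambda = {0.8  0.3,
             0.7 -0.2,
             0.6  0.5};              /* hypothetical loading matrix */
   t = constant('pi')/6;             /* a 30-degree planar rotation */
   T = (cos(t) || -sin(t)) //
       (sin(t) ||  cos(t));          /* orthogonal: T*T` = I_2 */
   LamStar = Lambda * T;             /* rotated loadings */
   LL      = Lambda  * Lambda`;
   LLstar  = LamStar * LamStar`;
   print LL LLstar;                  /* identical up to rounding */
quit;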
When one uses an orthogonal transformation, the common factors remain uncorrelated with one another. To allow for ease of interpretation, some investigators suggest using nonorthogonal or oblique rotations or transformations, in which case the resulting factors are no longer uncorrelated with one another.
There are a number of procedures for estimating Λ and for determining k, the number of factors. SAS uses a variety of methods. Rather than continue with this problem, I have included below some comments found in the SAS Users manual. The interested reader should consult texts that deal specifically with FA.

11.1 SAS – PROC FACTOR


11.1.1 Overview
The FACTOR procedure performs a variety of common factor and component analyses and rotations. Input can be multivariate data, a correlation matrix, a covariance matrix, a factor pattern, or a matrix of scoring coefficients. The procedure can factor either the correlation or covariance matrix, and you can save most results in an output data set.
PROC FACTOR can process output from other procedures. For example, it can rotate the canonical coefficients from multivariate analyses in the GLM procedure.
The methods for factor extraction are principal component analysis, principal factor analysis, iterated
principal factor analysis, unweighted least-squares factor analysis, maximum-likelihood (canonical) factor
analysis, alpha factor analysis, image component analysis, and Harris component analysis. A variety of
methods for prior communality estimation is also available.
The methods for rotation are varimax, quartimax, parsimax, equamax, orthomax with user-specified
gamma, promax with user-specified exponent, Harris-Kaiser case II with user-specified exponent, and oblique
Procrustean with a user-specified target pattern.

Output includes means, standard deviations, correlations, Kaiser’s measure of sampling adequacy, eigen-
values, a scree plot, eigenvectors, prior and final communality estimates, the unrotated factor pattern,
residual and partial correlations, the rotated primary factor pattern, the primary factor structure, interfac-
tor correlations, the reference structure, reference axis correlations, the variance explained by each factor
both ignoring and eliminating other factors, plots of both rotated and unrotated factors, squared multiple
correlation of each factor with the variables, and scoring coefficients.
Any topics that are not given explicit references are discussed in Mulaik (1972) or Harman (1976).

11.1.2 Background
Common factor analysis was invented by Spearman (1904). Kim and Mueller (1978a,b) provide a very
elementary discussion of the common factor model. Gorsuch (1974) contains a broad survey of factor analysis,
and Gorsuch (1974) and Cattell (1978) are useful as guides to practical research methodology. Harman (1976)
gives a lucid discussion of many of the more technical aspects of factor analysis, especially oblique rotation.
Morrison (1976) and Mardia, Kent, and Bibby (1979) provide excellent statistical treatments of common
factor analysis. Mulaik (1972) is the most thorough and authoritative general reference on factor analysis
and is highly recommended to anyone familiar with matrix algebra. Stewart (1981) gives a nontechnical
presentation of some issues to consider when deciding whether or not a factor analysis may be appropriate.
A frequent source of confusion in the field of factor analysis is the term factor. It sometimes refers to
a hypothetical, unobservable variable, as in the phrase common factor. In this sense, factor analysis must
be distinguished from component analysis since a component is an observable linear combination. Factor is
also used in the sense of matrix factor, in that one matrix is a factor of a second matrix if the first matrix
multiplied by its transpose equals the second matrix. In this sense, factor analysis refers to all methods of
data analysis using matrix factors, including component analysis and common factor analysis.
A common factor is an unobservable, hypothetical variable that contributes to the variance of at least
two of the observed variables. The unqualified term "factor" often refers to a common factor. A unique
factor is an unobservable, hypothetical variable that contributes to the variance of only one of the observed
variables. The model for common factor analysis posits one unique factor for each observed variable.
If the original variables are standardized to unit variance, the preceding formula yields correlations instead
of covariances. It is in this sense that common factors explain the correlations among the observed variables.
The difference between the correlation predicted by the common factor model and the actual correlation is
the residual correlation. A good way to assess the goodness-of-fit of the common factor model is to examine
the residual correlations.
The common factor model implies that the partial correlations among the variables, removing the effects
of the common factors, must all be 0. When the common factors are removed, only unique factors, which
are by definition uncorrelated, remain.
The assumptions of common factor analysis imply that the common factors are, in general, not linear
combinations of the observed variables. In fact, even if the data contain measurements on the entire popu-
lation of observations, you cannot compute the scores of the observations on the common factors. Although
the common factor scores cannot be computed directly, they can be estimated in a variety of ways.
The problem of factor score indeterminacy has led several factor analysts to propose methods yielding
components that can be considered approximations to common factors. Since these components are defined
as linear combinations, they are computable. The methods include Harris component analysis and image
component analysis. The advantage of producing determinate component scores is offset by the fact that,
even if the data fit the common factor model perfectly, component methods do not generally recover the
correct factor solution. You should not use any type of component analysis if you really want a common
factor analysis (Dziuban and Harris 1973; Lee and Comrey 1979).
After the factors are estimated, it is necessary to interpret them. Interpretation usually means assigning
to each common factor a name that reflects the importance of the factor in predicting each of the observed
variables, that is, the coefficients in the pattern matrix corresponding to the factor. Factor interpretation is

a subjective process. It can sometimes be made less subjective by rotating the common factors, that is, by
applying a nonsingular linear transformation. A rotated pattern matrix in which all the coefficients are close
to 0 or ±1 is easier to interpret than a pattern with many intermediate elements. Therefore, most rotation
methods attempt to optimize a function of the pattern matrix that measures, in some sense, how close the
elements are to 0 or ±1.
After the initial factor extraction, the common factors are uncorrelated with each other. If the factors are
rotated by an orthogonal transformation, the rotated factors are also uncorrelated. If the factors are rotated
by an oblique transformation, the rotated factors become correlated. Oblique rotations often produce more
useful patterns than do orthogonal rotations. However, a consequence of correlated factors is that there is
no single unambiguous measure of the importance of a factor in explaining a variable. Thus, for oblique
rotations, the pattern matrix does not provide all the necessary information for interpreting the factors; you
must also examine the factor structure and the reference structure. Rotating a set of factors does not change
the statistical explanatory power of the factors. You cannot say that any rotation is better than any other
rotation from a statistical point of view; all rotations are equally good statistically. Therefore, the choice
among different rotations must be based on nonstatistical grounds. For most applications, the preferred
rotation is that which is most easily interpretable.
If two rotations give rise to different interpretations, those two interpretations must not be regarded as
conflicting. Rather, they are two different ways of looking at the same thing, two different points of view
in the common-factor space. Any conclusion that depends on one and only one rotation being correct is
invalid.
Johnson considers an example using police applicant candidates on page 159. He uses SPSS for his factor
analysis. I have tried to produce similar results using SAS.

11.2 Police Applicant Example


The SAS code used is:
TITLE ’Police Department Applicant Data Johnson page 160’;
data police;
input ID REACT HEIGHT WEIGHT SHLDR PELVIC CHEST
THIGH PULSE DIAST CHNUP BREATH RECVR SPEED ENDUR FAT;
cards;
1 0.310 179.6 74.20 41.7 27.3 82.4 19.0 64 64 2 158 108 5.5 4.0 11.91
2 0.345 175.6 62.04 37.5 29.1 84.1 5.5 88 78 20 166 108 5.5 4.0 3.13
3 0.293 166.2 72.96 39.4 26.8 88.1 22.0 100 88 7 167 116 5.5 4.0 16.89
4 0.254 173.8 85.92 41.2 27.6 97.6 19.5 64 62 4 220 120 5.5 4.0 19.59
5 0.384 184.8 65.88 39.8 26.1 88.2 14.5 80 68 9 210 120 5.5 5.0 7.74
6 0.406 189.1 102.26 43.3 30.1 101.2 22.0 60 68 4 188 91 6.0 4.0 30.42
7 0.344 191.5 84.04 42.8 28.4 91.0 18.0 64 48 1 272 110 6.0 3.0 13.70
8 0.321 180.2 68.34 41.6 27.3 90.4 5.5 74 64 14 193 117 5.5 4.0 3.04
9 0.425 183.8 95.14 42.3 30.1 100.2 13.5 80 78 4 199 105 5.5 4.0 20.26
10 0.385 163.1 54.28 37.2 24.2 80.5 7.0 84 78 13 157 113 6.0 4.0 3.04
11 0.317 169.6 75.92 39.4 27.2 92.0 16.5 65 78 6 180 110 5.0 5.0 12.83
12 0.353 171.6 71.70 39.1 27.0 86.2 25.5 68 72 0 193 105 5.5 4.0 15.95
13 0.413 180.0 80.68 40.8 28.3 87.4 17.5 73 88 4 218 109 5.0 4.2 11.86
14 0.392 174.6 70.40 39.8 25.9 83.9 16.5 104 78 6 190 129 5.0 4.0 9.93
15 0.312 181.8 91.40 40.6 29.5 95.1 32.0 92 88 1 206 139 5.0 3.5 32.63
16 0.342 167.4 65.74 39.7 26.4 86.0 13.0 80 86 6 181 120 5.5 4.0 6.64
17 0.293 173.0 79.28 41.2 26.9 96.1 11.5 72 68 6 184 111 5.5 3.9 11.57

18 0.317 179.8 92.06 40.0 29.8 100.9 15.0 60 78 0 205 92 5.0 4.0 24.21
19 0.333 176.8 87.96 41.2 28.4 100.8 20.5 76 90 1 228 147 4.0 3.5 22.39
20 0.317 179.3 77.66 41.4 31.6 90.1 9.5 58 86 15 198 98 5.5 4.1 6.29
21 0.427 193.5 98.44 41.6 29.2 95.7 21.0 54 74 0 254 110 5.5 3.8 23.63
22 0.266 178.8 65.42 39.3 27.1 83.0 16.5 88 72 7 206 121 5.5 4.0 10.53
23 0.311 179.6 97.04 43.8 30.1 100.8 22.0 100 74 3 194 124 5.0 4.0 20.62
24 0.284 172.6 81.72 40.9 27.3 91.5 22.0 74 76 4 201 113 5.5 5.1 18.39
25 0.259 171.5 69.60 40.4 27.8 87.7 15.5 70 72 10 175 110 5.5 3.0 11.14
26 0.317 168.9 63.66 39.8 26.7 83.9 6.0 68 70 7 179 119 5.5 5.0 5.16
27 0.263 183.1 87.24 43.2 28.3 95.7 11.0 88 74 7 245 115 5.5 4.0 9.60
28 0.336 163.6 64.86 37.5 26.6 84.0 15.5 64 64 6 146 115 5.0 4.4 11.93
29 0.267 184.3 84.68 40.3 29.0 93.2 8.5 64 76 2 213 109 5.5 5.0 8.55
30 0.271 181.0 73.78 42.8 29.7 90.3 8.5 56 88 11 181 109 6.0 5.0 4.94
31 0.264 180.2 75.84 41.4 28.7 88.1 13.5 76 76 9 192 144 5.5 3.6 10.62
32 0.357 184.1 70.48 42.0 28.9 81.3 14.0 84 72 5 231 123 5.5 4.5 8.46
33 0.259 178.9 86.90 42.5 28.7 95.0 16.0 54 68 12 186 118 6.0 4.0 13.47
34 0.221 170.0 76.68 39.7 27.7 93.6 15.0 50 72 4 178 108 5.5 4.5 12.81
35 0.333 180.6 77.32 42.1 27.3 89.5 16.0 88 72 11 200 119 5.5 4.6 13.34
36 0.359 179.0 79.90 40.8 28.2 90.3 26.5 80 80 3 201 124 5.5 3.7 24.57
37 0.314 186.6 100.36 42.5 31.5 100.3 27.0 62 76 2 208 120 5.5 4.1 28.35
38 0.295 181.4 91.66 41.9 28.9 96.6 25.5 68 78 2 211 125 6.0 3.0 26.12
39 0.296 176.5 79.00 40.7 29.1 86.5 20.5 60 66 5 210 117 5.5 4.2 15.21
40 0.308 174.0 69.10 40.9 27.0 88.1 18.0 92 74 5 161 140 5.0 5.5 12.51
41 0.327 178.2 87.78 42.9 27.2 100.3 16.5 72 72 4 189 115 5.5 3.5 20.50
42 0.303 177.1 70.18 39.4 27.6 85.5 16.0 72 74 14 201 110 6.0 4.8 10.67
43 0.297 180.0 67.66 40.9 28.7 86.1 15.0 76 76 5 177 110 5.5 4.5 10.76
44 0.244 176.8 86.12 41.3 28.2 92.7 12.5 76 68 7 181 110 5.5 4.0 14.55
45 0.282 176.3 65.00 39.0 26.0 83.3 7.0 88 72 12 167 127 5.5 5.0 5.27
46 0.285 192.4 99.14 43.7 28.7 96.1 20.5 64 68 4 174 105 6.0 4.0 17.94
47 0.299 175.2 75.70 39.4 27.3 90.8 19.0 56 76 7 174 111 5.5 4.5 12.64
48 0.280 175.9 78.62 43.4 29.3 90.7 18.0 64 72 7 170 117 5.5 3.7 10.81
49 0.268 174.6 64.88 42.3 29.2 82.6 3.5 72 80 11 199 113 6.0 4.5 2.01
50 0.362 179.0 71.00 41.2 27.3 85.6 16.0 68 90 5 150 108 5.5 5.0 10.00
;
proc princomp data=police out=new;
var REACT HEIGHT WEIGHT SHLDR PELVIC CHEST
THIGH PULSE DIAST CHNUP BREATH RECVR SPEED ENDUR FAT; run;
proc factor data=police reorder;
var REACT HEIGHT WEIGHT SHLDR PELVIC CHEST
THIGH PULSE DIAST CHNUP BREATH RECVR SPEED ENDUR FAT;
title2 ’Principle Component FA’;
run;
proc factor data=police rotate=varimax reorder;
var REACT HEIGHT WEIGHT SHLDR PELVIC CHEST
THIGH PULSE DIAST CHNUP BREATH RECVR SPEED ENDUR FAT;
title2 ’Principle Component FA with Varimax Rotation’;
run;
proc factor data=police method=ml heywood n=5 reorder;
var REACT HEIGHT WEIGHT SHLDR PELVIC CHEST

THIGH PULSE DIAST CHNUP BREATH RECVR SPEED ENDUR FAT;
title2 ’Maximum Likelihood FA for 5 factors’;
run;
proc factor data=police method=ml heywood n=5 rotate=V reorder;
var REACT HEIGHT WEIGHT SHLDR PELVIC CHEST
THIGH PULSE DIAST CHNUP BREATH RECVR SPEED ENDUR FAT;
title2 ’Maximum Likelihood FA for 5 factors with Varimax Rotation’;
run;
proc factor data=police method=ml heywood n=5 rotate=promax reorder;
var REACT HEIGHT WEIGHT SHLDR PELVIC CHEST
THIGH PULSE DIAST CHNUP BREATH RECVR SPEED ENDUR FAT;
title2 ’Maximum Likelihood FA for 5 factors with Oblique Promax Rotation’;
run;
The SAS output is:
Police Department Applicant Data Johnson page 160

The PRINCOMP Procedure

Observations 50
Variables 15

Simple Statistics

REACT HEIGHT WEIGHT SHLDR PELVIC

Mean 0.3162000000 177.9060000 78.35240000 40.95200000 28.10600000


StD 0.0482146561 6.7031339 11.46836991 1.58606688 1.43717552

Simple Statistics

CHEST THIGH PULSE DIAST CHNUP

Mean 90.62000000 16.13000000 73.08000000 74.60000000 6.280000000


StD 5.97091590 6.10186489 12.91074541 8.11398390 4.380103437

Simple Statistics

BREATH RECVR SPEED ENDUR FAT

Mean 193.3400000 115.5400000 5.480000000 4.174000000 13.78240000


StD 25.4547297 11.1798105 0.363654916 0.564171779 7.34763118

Correlation Matrix

REACT HEIGHT WEIGHT SHLDR PELVIC CHEST THIGH PULSE

REACT 1.0000 0.2223 0.0562 -.0938 -.0559 -.0318 0.1324 0.1631
HEIGHT 0.2223 1.0000 0.6353 0.6543 0.5859 0.4259 0.2232 -.1815
WEIGHT 0.0562 0.6353 1.0000 0.6656 0.6470 0.8887 0.5542 -.2638
SHLDR -.0938 0.6543 0.6656 1.0000 0.5824 0.5545 0.2046 -.1682
PELVIC -.0559 0.5859 0.6470 0.5824 1.0000 0.5221 0.2075 -.3229
CHEST -.0318 0.4259 0.8887 0.5545 0.5221 1.0000 0.3978 -.2463
THIGH 0.1324 0.2232 0.5542 0.2046 0.2075 0.3978 1.0000 -.0062
PULSE 0.1631 -.1815 -.2638 -.1682 -.3229 -.2463 -.0062 1.0000
DIAST 0.1473 -.1866 -.0517 -.1449 0.1477 -.0084 0.0487 0.2341
CHNUP -.1585 -.2760 -.5758 -.2739 -.1582 -.4536 -.6695 0.1548
BREATH 0.1595 0.5878 0.4502 0.3677 0.3536 0.3473 0.2065 -.0644
RECVR -.1296 -.1213 -.1219 -.0239 -.2069 -.0832 0.2031 0.5039
SPEED -.1493 0.2156 -.0526 0.2018 0.0412 -.1624 -.2080 -.2996
ENDUR -.0525 -.1892 -.3680 -.2448 -.2294 -.3275 -.3357 0.0073
FAT 0.1648 0.3651 0.8095 0.3306 0.4132 0.7246 0.8442 -.0948

Police Department Applicant Data Johnson page 160

The PRINCOMP Procedure

Correlation Matrix

DIAST CHNUP BREATH RECVR SPEED ENDUR FAT

REACT 0.1473 -.1585 0.1595 -.1296 -.1493 -.0525 0.1648


HEIGHT -.1866 -.2760 0.5878 -.1213 0.2156 -.1892 0.3651
WEIGHT -.0517 -.5758 0.4502 -.1219 -.0526 -.3680 0.8095
SHLDR -.1449 -.2739 0.3677 -.0239 0.2018 -.2448 0.3306
PELVIC 0.1477 -.1582 0.3536 -.2069 0.0412 -.2294 0.4132
CHEST -.0084 -.4536 0.3473 -.0832 -.1624 -.3275 0.7246
THIGH 0.0487 -.6695 0.2065 0.2031 -.2080 -.3357 0.8442
PULSE 0.2341 0.1548 -.0644 0.5039 -.2996 0.0073 -.0948
DIAST 1.0000 0.0537 -.1648 0.1538 -.3209 0.1337 0.0446
CHNUP 0.0537 1.0000 -.3576 -.0602 0.3239 0.2128 -.6912
BREATH -.1648 -.3576 1.0000 0.0913 -.0312 -.3146 0.2987
RECVR 0.1538 -.0602 0.0913 1.0000 -.4365 -.0728 0.0665
SPEED -.3209 0.3239 -.0312 -.4365 1.0000 -.0225 -.2055
ENDUR 0.1337 0.2128 -.3146 -.0728 -.0225 1.0000 -.4055
FAT 0.0446 -.6912 0.2987 0.0665 -.2055 -.4055 1.0000

Eigenvalues of the Correlation Matrix

Eigenvalue Difference Proportion Cumulative

1 5.21852549 2.81172746 0.3479 0.3479


2 2.40679803 1.09411978 0.1605 0.5084

3 1.31267825 0.08160302 0.0875 0.5959
4 1.23107523 0.02722958 0.0821 0.6779
5 1.20384565 0.35594143 0.0803 0.7582
6 0.84790423 0.14315618 0.0565 0.8147
7 0.70474804 0.12633945 0.0470 0.8617
8 0.57840860 0.18486182 0.0386 0.9003
9 0.39354677 0.02535508 0.0262 0.9265
10 0.36819169 0.04160632 0.0245 0.9510
11 0.32658537 0.13971542 0.0218 0.9728
12 0.18686996 0.04806356 0.0125 0.9853
13 0.13880640 0.09492503 0.0093 0.9945
14 0.04388137 0.00574644 0.0029 0.9975
15 0.03813492 0.0025 1.0000

Police Department Applicant Data Johnson page 160

The PRINCOMP Procedure

Eigenvectors

Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7 Prin8

REACT 0.050679 0.152438 0.111390 0.055590 0.821021 0.112335 0.039864 -.195870


HEIGHT 0.305474 -.217386 0.365729 0.053823 0.207982 -.153731 0.192102 0.087542
WEIGHT 0.416679 -.028657 -.067019 0.090966 -.055609 -.003716 0.049983 -.217041
SHLDR 0.300144 -.211114 0.274988 0.106404 -.211416 -.060832 0.298153 -.178103
PELVIC 0.293831 -.192972 0.076787 0.439018 -.099467 0.106493 -.125905 0.225978
CHEST 0.360777 0.003599 -.127985 0.143322 -.172201 -.005218 -.139044 -.488893
THIGH 0.284124 0.306774 -.230006 -.205456 0.012004 0.106243 0.348454 0.347371
PULSE -.119324 0.381339 0.462508 0.018235 -.002604 0.186025 0.329322 -.330122
DIAST -.035778 0.276632 -.042841 0.698232 0.048207 0.239704 0.012464 0.354343
CHNUP -.291889 -.235778 0.239828 0.216371 -.114520 0.341144 -.011821 -.141973
BREATH 0.252284 -.032462 0.439492 -.154562 0.161941 -.238549 -.427747 0.326682
RECVR -.025487 0.423898 0.399832 -.103577 -.380320 -.128264 0.091771 0.184475
SPEED -.029346 -.491403 0.027360 -.198160 0.040617 0.313603 0.468765 0.262473
ENDUR -.204355 -.053576 -.130568 0.325749 0.083824 -.735143 0.407882 -.006655
FAT 0.368171 0.223891 -.233374 -.065212 0.014090 0.129600 0.165113 -.003158

Eigenvectors

Prin9 Prin10 Prin11 Prin12 Prin13 Prin14 Prin15

REACT -.253233 0.044063 0.274577 0.239172 0.159100 -.007271 0.089545


HEIGHT -.097720 0.204687 0.011100 -.692312 -.235902 0.154691 -.029567
WEIGHT 0.146352 0.018241 0.066675 -.083460 0.030823 -.837258 0.158004
SHLDR -.398108 -.441559 -.011248 0.387215 -.227161 0.054628 -.226447
PELVIC -.198253 0.402639 -.363504 0.221906 0.412291 0.103203 0.161878
CHEST 0.303459 -.034681 0.327976 -.012938 0.024762 0.479319 0.336882
THIGH -.010404 0.213359 -.066997 0.239692 -.473685 0.082677 0.371141

PULSE 0.331791 -.020677 -.496585 -.009483 0.099315 0.034112 0.076313
DIAST 0.146511 -.400896 0.156265 -.158356 -.118531 -.009948 -.038833
CHNUP 0.143353 0.499748 0.294961 0.215904 -.415019 -.094720 -.141276
BREATH 0.473400 -.112931 0.009034 0.302890 -.107551 -.011775 -.071042
RECVR -.219312 0.142011 0.514234 -.020974 0.319568 -.050876 0.087558
SPEED 0.314410 -.180529 0.214643 0.032084 0.359654 0.038933 0.140015
ENDUR 0.232765 0.160019 0.048775 0.185324 0.040703 -.001605 -.012893
FAT 0.193846 0.223611 0.068805 0.020013 0.175096 0.101581 -.760033

Police Department Applicant Data Johnson page 160


Principle Component FA

The FACTOR Procedure


Initial Factor Method: Principal Components

Prior Communality Estimates: ONE

Eigenvalues of the Correlation Matrix: Total = 15 Average = 1

Eigenvalue Difference Proportion Cumulative

1 5.21852549 2.81172746 0.3479 0.3479


2 2.40679803 1.09411978 0.1605 0.5084
3 1.31267825 0.08160302 0.0875 0.5959
4 1.23107523 0.02722958 0.0821 0.6779
5 1.20384565 0.35594143 0.0803 0.7582
6 0.84790423 0.14315618 0.0565 0.8147
7 0.70474804 0.12633945 0.0470 0.8617
8 0.57840860 0.18486182 0.0386 0.9003
9 0.39354677 0.02535508 0.0262 0.9265
10 0.36819169 0.04160632 0.0245 0.9510
11 0.32658537 0.13971542 0.0218 0.9728
12 0.18686996 0.04806356 0.0125 0.9853
13 0.13880640 0.09492503 0.0093 0.9945
14 0.04388137 0.00574644 0.0029 0.9975
15 0.03813492 0.0025 1.0000

5 factors will be retained by the MINEIGEN criterion.

Factor Pattern

Factor1 Factor2 Factor3 Factor4 Factor5

WEIGHT 0.95187 -0.04446 -0.07678 0.10093 -0.06101

FAT 0.84105 0.34734 -0.26738 -0.07236 0.01546
CHEST 0.82416 0.00558 -0.14664 0.15902 -0.18894
HEIGHT 0.69783 -0.33725 0.41902 0.05972 0.22820
SHLDR 0.68565 -0.32752 0.31506 0.11806 -0.23196
PELVIC 0.67123 -0.29937 0.08798 0.48711 -0.10913
THIGH 0.64905 0.47592 -0.26352 -0.22796 0.01317
BREATH 0.57632 -0.05036 0.50353 -0.17149 0.17768
ENDUR -0.46683 -0.08312 -0.14959 0.36143 0.09197
CHNUP -0.66679 -0.36578 0.27478 0.24007 -0.12565
RECVR -0.05822 0.65763 0.45810 -0.11492 -0.41729
PULSE -0.27258 0.59160 0.52991 0.02023 -0.00286
SPEED -0.06704 -0.76236 0.03135 -0.21987 0.04457
DIAST -0.08173 0.42916 -0.04908 0.77472 0.05289
REACT 0.11577 0.23649 0.12762 0.06168 0.90082

Police Department Applicant Data Johnson page 160


Principle Component FA

The FACTOR Procedure


Initial Factor Method: Principal Components

Variance Explained by Each Factor

Factor1 Factor2 Factor3 Factor4 Factor5

5.2185255 2.4067980 1.3126783 1.2310752 1.2038457

Final Communality Estimates: Total = 11.372923

REACT HEIGHT WEIGHT SHLDR PELVIC CHEST THIGH PULSE

0.90090440 0.83192324 0.92783038 0.74439622 0.79709996 0.76176311 0.76936042 0.70551551

DIAST CHNUP BREATH RECVR SPEED ENDUR FAT

0.79625170 0.72733481 0.64920781 0.83305264 0.63699092 0.38630835 0.90498318

Police Department Applicant Data Johnson page 160


Principle Component FA with Varimax Rotation

The FACTOR Procedure


Initial Factor Method: Principal Components

Prior Communality Estimates: ONE

Eigenvalues of the Correlation Matrix: Total = 15 Average = 1

Eigenvalue Difference Proportion Cumulative

1 5.21852549 2.81172746 0.3479 0.3479


2 2.40679803 1.09411978 0.1605 0.5084
3 1.31267825 0.08160302 0.0875 0.5959
4 1.23107523 0.02722958 0.0821 0.6779
5 1.20384565 0.35594143 0.0803 0.7582
6 0.84790423 0.14315618 0.0565 0.8147
7 0.70474804 0.12633945 0.0470 0.8617
8 0.57840860 0.18486182 0.0386 0.9003
9 0.39354677 0.02535508 0.0262 0.9265
10 0.36819169 0.04160632 0.0245 0.9510
11 0.32658537 0.13971542 0.0218 0.9728
12 0.18686996 0.04806356 0.0125 0.9853
13 0.13880640 0.09492503 0.0093 0.9945
14 0.04388137 0.00574644 0.0029 0.9975
15 0.03813492 0.0025 1.0000

5 factors will be retained by the MINEIGEN criterion.

Factor Pattern

Factor1 Factor2 Factor3 Factor4 Factor5

WEIGHT 0.95187 -0.04446 -0.07678 0.10093 -0.06101


FAT 0.84105 0.34734 -0.26738 -0.07236 0.01546
CHEST 0.82416 0.00558 -0.14664 0.15902 -0.18894
HEIGHT 0.69783 -0.33725 0.41902 0.05972 0.22820
SHLDR 0.68565 -0.32752 0.31506 0.11806 -0.23196
PELVIC 0.67123 -0.29937 0.08798 0.48711 -0.10913
THIGH 0.64905 0.47592 -0.26352 -0.22796 0.01317
BREATH 0.57632 -0.05036 0.50353 -0.17149 0.17768
ENDUR -0.46683 -0.08312 -0.14959 0.36143 0.09197
CHNUP -0.66679 -0.36578 0.27478 0.24007 -0.12565
RECVR -0.05822 0.65763 0.45810 -0.11492 -0.41729
PULSE -0.27258 0.59160 0.52991 0.02023 -0.00286
SPEED -0.06704 -0.76236 0.03135 -0.21987 0.04457
DIAST -0.08173 0.42916 -0.04908 0.77472 0.05289
REACT 0.11577 0.23649 0.12762 0.06168 0.90082

Police Department Applicant Data Johnson page 160


Principle Component FA with Varimax Rotation

The FACTOR Procedure

Initial Factor Method: Principal Components

Variance Explained by Each Factor

Factor1 Factor2 Factor3 Factor4 Factor5

5.2185255 2.4067980 1.3126783 1.2310752 1.2038457

Final Communality Estimates: Total = 11.372923

REACT HEIGHT WEIGHT SHLDR PELVIC CHEST THIGH PULSE

0.90090440 0.83192324 0.92783038 0.74439622 0.79709996 0.76176311 0.76936042 0.70551551

DIAST CHNUP BREATH RECVR SPEED ENDUR FAT

0.79625170 0.72733481 0.64920781 0.83305264 0.63699092 0.38630835 0.90498318

Police Department Applicant Data Johnson page 160


Principle Component FA with Varimax Rotation

The FACTOR Procedure


Rotation Method: Varimax

Orthogonal Transformation Matrix

1 2 3 4 5

1 0.69851 0.70292 -0.10336 -0.07220 0.04570


2 0.49361 -0.36287 0.69546 0.34334 0.15206
3 -0.45431 0.50851 0.65356 -0.23492 0.22955
4 -0.24900 0.32685 -0.09783 0.90636 -0.01077
5 -0.00560 -0.09389 -0.26256 0.01540 0.96020

Rotated Factor Pattern

Factor1 Factor2 Factor3 Factor4 Factor5

FAT 0.89834 0.30408 -0.01710 0.05600 0.04550


THIGH 0.86470 0.07378 0.11051 -0.02796 0.05664
CHEST 0.60652 0.57244 -0.14309 0.11808 -0.17828
ENDUR -0.38966 -0.26455 -0.16683 0.36931 0.01610
CHNUP -0.83023 -0.10598 0.00362 0.07366 -0.14626
HEIGHT 0.11446 0.82407 -0.09857 -0.20697 0.29527
SHLDR 0.14604 0.82138 -0.04338 -0.13253 -0.17015
PELVIC 0.16044 0.79465 -0.23908 0.26790 -0.10469

WEIGHT 0.65304 0.68489 -0.17334 0.02459 -0.04056
BREATH 0.19065 0.60670 0.20462 -0.32989 0.30673
RECVR 0.10678 -0.04500 0.88357 0.01179 -0.19694
PULSE -0.14414 -0.12994 0.78471 0.11661 0.19618
SPEED -0.38288 0.16941 -0.49297 -0.46286 -0.06663
DIAST -0.01615 0.01011 0.18516 0.86776 0.09270
REACT 0.11922 -0.00396 -0.00664 0.11263 0.93485

Variance Explained by Each Factor

Factor1 Factor2 Factor3 Factor4 Factor5

3.4799455 3.3769250 1.8753118 1.3949543 1.2457860

Final Communality Estimates: Total = 11.372923

REACT HEIGHT WEIGHT SHLDR PELVIC CHEST THIGH PULSE

0.90090440 0.83192324 0.92783038 0.74439622 0.79709996 0.76176311 0.76936042 0.70551551

Police Department Applicant Data Johnson page 160


Principle Component FA with Varimax Rotation

The FACTOR Procedure


Rotation Method: Varimax

DIAST CHNUP BREATH RECVR SPEED ENDUR FAT

0.79625170 0.72733481 0.64920781 0.83305264 0.63699092 0.38630835 0.90498318

Police Department Applicant Data Johnson page 160


Maximum Likelihood FA for 5 factors

The FACTOR Procedure


Initial Factor Method: Maximum Likelihood

Prior Communality Estimates: SMC

REACT HEIGHT WEIGHT SHLDR PELVIC CHEST THIGH PULSE

0.43327302 0.74642701 0.94073711 0.73988702 0.72886627 0.89250807 0.84259198 0.47263879

DIAST CHNUP BREATH RECVR SPEED ENDUR FAT

0.35875934 0.71587234 0.50649838 0.58340532 0.62235763 0.28692528 0.93750738

Preliminary Eigenvalues: Total = 62.2250448 Average = 4.14833632

Eigenvalue Difference Proportion Cumulative

1 45.0397746 36.0684793 0.7238 0.7238


2 8.9712953 5.7571439 0.1442 0.8680
3 3.2141514 0.5487643 0.0517 0.9196
4 2.6653871 1.1212112 0.0428 0.9625
5 1.5441759 0.0401474 0.0248 0.9873
6 1.5040285 0.5148665 0.0242 1.0115
7 0.9891620 0.4442794 0.0159 1.0274
8 0.5448826 0.4398188 0.0088 1.0361
9 0.1050638 0.2296711 0.0017 1.0378
10 -0.1246073 0.0888343 -0.0020 1.0358
11 -0.2134417 0.1361614 -0.0034 1.0324
12 -0.3496030 0.0549062 -0.0056 1.0268
13 -0.4045092 0.1203960 -0.0065 1.0203
14 -0.5249052 0.2109050 -0.0084 1.0118
15 -0.7358101 -0.0118 1.0000

5 factors will be retained by the NFACTOR criterion.

Iteration Criterion Ridge Change Communalities

1 1.4462130 0.0000 0.3216 0.27784 0.90750 0.97656 0.67692 0.49074 0.90980


0.88450 0.34842 0.29435 0.53165 0.37336 0.90501
0.58936 0.19221 0.95775
2 1.3727907 0.0000 0.1714 0.44924 0.90056 0.97368 0.69783 0.50217 0.91056
0.91069 0.34950 0.18679 0.53482 0.40610 1.00000
0.56468 0.19049 0.94759
3 1.3547530 0.0000 0.1011 0.55035 0.90113 0.97335 0.70023 0.50795 0.91400
0.93346 0.36290 0.14267 0.53867 0.41749 1.00000
0.51945 0.18087 0.94224
4 1.3510559 0.0000 0.0381 0.58850 0.89787 0.97319 0.69362 0.51216 0.91573
0.94370 0.37064 0.13328 0.53673 0.41591 1.00000
0.49421 0.17575 0.94147
5 1.3503942 0.0000 0.0204 0.60887 0.89477 0.97281 0.69046 0.51399 0.91723
0.94544 0.37426 0.13018 0.53581 0.41459 1.00000
0.48406 0.17443 0.94167
6 1.3502608 0.0000 0.0100 0.61886 0.89296 0.97259 0.68871 0.51469 0.91805
0.94531 0.37607 0.12947 0.53527 0.41366 1.00000
0.48035 0.17415 0.94198
7 1.3502311 0.0000 0.0050 0.62382 0.89201 0.97247 0.68790 0.51495 0.91847
0.94503 0.37694 0.12937 0.53500 0.41315 1.00000
0.47882 0.17410 0.94216

8 1.3502243 0.0000 0.0024 0.62622 0.89154 0.97241 0.68750 0.51506 0.91866
0.94486 0.37735 0.12939 0.53486 0.41288 1.00000
0.47814 0.17409 0.94226
9 1.3502227 0.0000 0.0012 0.62738 0.89131 0.97239 0.68731 0.51511 0.91876
0.94476 0.37755 0.12940 0.53479 0.41275 1.00000
0.47783 0.17409 0.94230
10 1.3502223 0.0000 0.0006 0.62794 0.89120 0.97237 0.68722 0.51513 0.91881
0.94471 0.37765 0.12941 0.53476 0.41269 1.00000
0.47768 0.17409 0.94233

Convergence criterion satisfied.

Significance Tests Based on 50 Observations

Pr >
Test DF Chi-Square ChiSq

H0: No common factors 105 473.1958 <.0001


HA: At least one common factor
H0: 5 Factors are sufficient 40 53.7839 0.0714
HA: More factors are needed

Chi-Square without Bartlett’s Correction 66.160895


Akaike’s Information Criterion -13.839105
Schwarz’s Bayesian Criterion -90.320025
Tucker and Lewis’s Reliability Coefficient 0.901730

Squared Canonical Correlations

Factor1 Factor2 Factor3 Factor4 Factor5

1.0000000 0.9864528 0.9298891 0.8552151 0.6953437

Police Department Applicant Data Johnson page 160


Maximum Likelihood FA for 5 factors

The FACTOR Procedure


Initial Factor Method: Maximum Likelihood

Eigenvalues of the Weighted Reduced Correlation Matrix: Total = 94.2680654 Average = 6.73343324

Eigenvalue Difference Proportion Cumulative

1 Infty Infty
2 72.8157670 59.5526525 0.7724 0.7724

3 13.2631146 7.3563181 0.1407 0.9131
4 5.9067965 3.6244087 0.0627 0.9758
5 2.2823878 1.5748067 0.0242 1.0000
6 0.7075811 0.1211672 0.0075 1.0075
7 0.5864139 0.2073404 0.0062 1.0137
8 0.3790735 0.2137267 0.0040 1.0177
9 0.1653468 0.1429644 0.0018 1.0195
10 0.0223823 0.0949270 0.0002 1.0197
11 -0.0725446 0.1454124 -0.0008 1.0190
12 -0.2179570 0.0952176 -0.0023 1.0167
13 -0.3131747 0.1702114 -0.0033 1.0133
14 -0.4833861 0.2903498 -0.0051 1.0082
15 -0.7737358 -0.0082 1.0000

Factor Pattern

Factor1 Factor2 Factor3 Factor4 Factor5

RECVR 1.00000 -0.00000 0.00000 -0.00000 -0.00000


PULSE 0.50389 -0.19460 -0.08711 0.04878 0.27561
SPEED -0.43652 -0.12948 0.17123 0.29858 -0.38970
WEIGHT -0.12188 0.95896 0.18902 -0.04564 -0.01046
FAT 0.06652 0.91307 -0.31229 -0.04848 0.06581
CHEST -0.08323 0.85373 0.24075 -0.34999 0.05062
THIGH 0.20313 0.72563 -0.58376 0.16414 -0.09586
SHLDR -0.02394 0.59185 0.49095 0.18223 -0.24918
HEIGHT -0.12126 0.58658 0.45458 0.56854 0.05041
PELVIC -0.20686 0.58103 0.33755 0.10606 -0.09779
BREATH 0.09128 0.44303 0.29501 0.32258 0.13023
ENDUR -0.07280 -0.40654 0.04336 -0.01653 0.03696
CHNUP -0.06025 -0.66088 0.29276 -0.06020 -0.07086
REACT -0.12957 0.09818 -0.16829 0.33604 0.67845
DIAST 0.15384 -0.02090 -0.12208 -0.16473 0.25156

Variance Explained by Each Factor

Factor Weighted Unweighted

Factor1 2.5448036 1.62770415


Factor2 72.8157670 4.90368487
Factor3 13.2631146 1.34836092

Police Department Applicant Data Johnson page 160


Maximum Likelihood FA for 5 factors

The FACTOR Procedure


Initial Factor Method: Maximum Likelihood

Variance Explained by Each Factor

Factor Weighted Unweighted

Factor4 5.9067965 0.86110178


Factor5 2.2823878 0.86514424

Final Communality Estimates and Variable Weights


Total Communality: Weighted = 96.812869 Unweighted = 9.605996

Variable Communality Weight

REACT 0.62797040 2.6877327


HEIGHT 0.89119601 9.1909879
WEIGHT 0.97237461 36.1990669
SHLDR 0.68719064 3.1971113
PELVIC 0.51513802 2.0624119
CHEST 0.91880673 12.3160354
THIGH 0.94471334 18.0873827
PULSE 0.37770382 1.6068142
DIAST 0.12942491 1.1486505
CHNUP 0.53474783 2.1494398
BREATH 0.41265447 1.7026689
RECVR 1.00000000 Infty
SPEED 0.47765130 1.9145480
ENDUR 0.17409446 1.2107922
FAT 0.94232943 17.3392265

Police Department Applicant Data Johnson page 160


Maximum Likelihood FA for 5 factors with Varimax Rotation

The FACTOR Procedure


Initial Factor Method: Maximum Likelihood

Prior Communality Estimates: SMC

REACT HEIGHT WEIGHT SHLDR PELVIC CHEST THIGH PULSE

0.43327302 0.74642701 0.94073711 0.73988702 0.72886627 0.89250807 0.84259198 0.47263879

DIAST CHNUP BREATH RECVR SPEED ENDUR FAT

0.35875934 0.71587234 0.50649838 0.58340532 0.62235763 0.28692528 0.93750738

Preliminary Eigenvalues: Total = 62.2250448 Average = 4.14833632

Eigenvalue Difference Proportion Cumulative

1 45.0397746 36.0684793 0.7238 0.7238


2 8.9712953 5.7571439 0.1442 0.8680
3 3.2141514 0.5487643 0.0517 0.9196
4 2.6653871 1.1212112 0.0428 0.9625
5 1.5441759 0.0401474 0.0248 0.9873
6 1.5040285 0.5148665 0.0242 1.0115
7 0.9891620 0.4442794 0.0159 1.0274
8 0.5448826 0.4398188 0.0088 1.0361
9 0.1050638 0.2296711 0.0017 1.0378
10 -0.1246073 0.0888343 -0.0020 1.0358
11 -0.2134417 0.1361614 -0.0034 1.0324
12 -0.3496030 0.0549062 -0.0056 1.0268
13 -0.4045092 0.1203960 -0.0065 1.0203
14 -0.5249052 0.2109050 -0.0084 1.0118
15 -0.7358101 -0.0118 1.0000

5 factors will be retained by the NFACTOR criterion.

Iteration Criterion Ridge Change Communalities

1 1.4462130 0.0000 0.3216 0.27784 0.90750 0.97656 0.67692 0.49074 0.90980


0.88450 0.34842 0.29435 0.53165 0.37336 0.90501
0.58936 0.19221 0.95775
2 1.3727907 0.0000 0.1714 0.44924 0.90056 0.97368 0.69783 0.50217 0.91056
0.91069 0.34950 0.18679 0.53482 0.40610 1.00000
0.56468 0.19049 0.94759
3 1.3547530 0.0000 0.1011 0.55035 0.90113 0.97335 0.70023 0.50795 0.91400
0.93346 0.36290 0.14267 0.53867 0.41749 1.00000
0.51945 0.18087 0.94224
4 1.3510559 0.0000 0.0381 0.58850 0.89787 0.97319 0.69362 0.51216 0.91573
0.94370 0.37064 0.13328 0.53673 0.41591 1.00000
0.49421 0.17575 0.94147
5 1.3503942 0.0000 0.0204 0.60887 0.89477 0.97281 0.69046 0.51399 0.91723
0.94544 0.37426 0.13018 0.53581 0.41459 1.00000
0.48406 0.17443 0.94167
6 1.3502608 0.0000 0.0100 0.61886 0.89296 0.97259 0.68871 0.51469 0.91805
0.94531 0.37607 0.12947 0.53527 0.41366 1.00000
0.48035 0.17415 0.94198
7 1.3502311 0.0000 0.0050 0.62382 0.89201 0.97247 0.68790 0.51495 0.91847
0.94503 0.37694 0.12937 0.53500 0.41315 1.00000
0.47882 0.17410 0.94216
8 1.3502243 0.0000 0.0024 0.62622 0.89154 0.97241 0.68750 0.51506 0.91866
0.94486 0.37735 0.12939 0.53486 0.41288 1.00000
0.47814 0.17409 0.94226

9 1.3502227 0.0000 0.0012 0.62738 0.89131 0.97239 0.68731 0.51511 0.91876
0.94476 0.37755 0.12940 0.53479 0.41275 1.00000
0.47783 0.17409 0.94230
10 1.3502223 0.0000 0.0006 0.62794 0.89120 0.97237 0.68722 0.51513 0.91881
0.94471 0.37765 0.12941 0.53476 0.41269 1.00000
0.47768 0.17409 0.94233

Convergence criterion satisfied.

Significance Tests Based on 50 Observations

Pr >
Test DF Chi-Square ChiSq

H0: No common factors 105 473.1958 <.0001


HA: At least one common factor
H0: 5 Factors are sufficient 40 53.7839 0.0714
HA: More factors are needed

Chi-Square without Bartlett’s Correction 66.160895


Akaike’s Information Criterion -13.839105
Schwarz’s Bayesian Criterion -90.320025
Tucker and Lewis’s Reliability Coefficient 0.901730

Police Department Applicant Data Johnson page 160


Maximum Likelihood FA for 5 factors with Varimax Rotation

The FACTOR Procedure


Initial Factor Method: Maximum Likelihood

Squared Canonical Correlations

Factor1 Factor2 Factor3 Factor4 Factor5

1.0000000 0.9864528 0.9298891 0.8552151 0.6953437

Eigenvalues of the Weighted Reduced Correlation Matrix: Total = 94.2680654 Average = 6.73343324

Eigenvalue Difference Proportion Cumulative

1 Infty Infty
2 72.8157670 59.5526525 0.7724 0.7724

3 13.2631146 7.3563181 0.1407 0.9131
4 5.9067965 3.6244087 0.0627 0.9758
5 2.2823878 1.5748067 0.0242 1.0000
6 0.7075811 0.1211672 0.0075 1.0075
7 0.5864139 0.2073404 0.0062 1.0137
8 0.3790735 0.2137267 0.0040 1.0177
9 0.1653468 0.1429644 0.0018 1.0195
10 0.0223823 0.0949270 0.0002 1.0197
11 -0.0725446 0.1454124 -0.0008 1.0190
12 -0.2179570 0.0952176 -0.0023 1.0167
13 -0.3131747 0.1702114 -0.0033 1.0133
14 -0.4833861 0.2903498 -0.0051 1.0082
15 -0.7737358 -0.0082 1.0000

Factor Pattern

Factor1 Factor2 Factor3 Factor4 Factor5

RECVR 1.00000 -0.00000 0.00000 -0.00000 -0.00000


PULSE 0.50389 -0.19460 -0.08711 0.04878 0.27561
SPEED -0.43652 -0.12948 0.17123 0.29858 -0.38970
WEIGHT -0.12188 0.95896 0.18902 -0.04564 -0.01046
FAT 0.06652 0.91307 -0.31229 -0.04848 0.06581
CHEST -0.08323 0.85373 0.24075 -0.34999 0.05062
THIGH 0.20313 0.72563 -0.58376 0.16414 -0.09586
SHLDR -0.02394 0.59185 0.49095 0.18223 -0.24918
HEIGHT -0.12126 0.58658 0.45458 0.56854 0.05041
PELVIC -0.20686 0.58103 0.33755 0.10606 -0.09779
BREATH 0.09128 0.44303 0.29501 0.32258 0.13023
ENDUR -0.07280 -0.40654 0.04336 -0.01653 0.03696
CHNUP -0.06025 -0.66088 0.29276 -0.06020 -0.07086
REACT -0.12957 0.09818 -0.16829 0.33604 0.67845
DIAST 0.15384 -0.02090 -0.12208 -0.16473 0.25156

Police Department Applicant Data Johnson page 160


Maximum Likelihood FA for 5 factors with Varimax Rotation

The FACTOR Procedure


Initial Factor Method: Maximum Likelihood

Variance Explained by Each Factor

Factor Weighted Unweighted

Factor1 2.5448036 1.62770415


Factor2 72.8157670 4.90368487
Factor3 13.2631146 1.34836092

Factor4 5.9067965 0.86110178
Factor5 2.2823878 0.86514424

Final Communality Estimates and Variable Weights


Total Communality: Weighted = 96.812869 Unweighted = 9.605996

Variable Communality Weight

REACT 0.62797040 2.6877327


HEIGHT 0.89119601 9.1909879
WEIGHT 0.97237461 36.1990669
SHLDR 0.68719064 3.1971113
PELVIC 0.51513802 2.0624119
CHEST 0.91880673 12.3160354
THIGH 0.94471334 18.0873827
PULSE 0.37770382 1.6068142
DIAST 0.12942491 1.1486505
CHNUP 0.53474783 2.1494398
BREATH 0.41265447 1.7026689
RECVR 1.00000000 Infty
SPEED 0.47765130 1.9145480
ENDUR 0.17409446 1.2107922
FAT 0.94232943 17.3392265

Police Department Applicant Data Johnson page 160


Maximum Likelihood FA for 5 factors with Oblique Promax Rotation

The FACTOR Procedure


Initial Factor Method: Maximum Likelihood

Prior Communality Estimates: SMC

REACT HEIGHT WEIGHT SHLDR PELVIC CHEST THIGH PULSE

0.43327302 0.74642701 0.94073711 0.73988702 0.72886627 0.89250807 0.84259198 0.47263879

DIAST CHNUP BREATH RECVR SPEED ENDUR FAT

0.35875934 0.71587234 0.50649838 0.58340532 0.62235763 0.28692528 0.93750738

Preliminary Eigenvalues: Total = 62.2250448 Average = 4.14833632

Eigenvalue Difference Proportion Cumulative

1 45.0397746 36.0684793 0.7238 0.7238


2 8.9712953 5.7571439 0.1442 0.8680

3 3.2141514 0.5487643 0.0517 0.9196
4 2.6653871 1.1212112 0.0428 0.9625
5 1.5441759 0.0401474 0.0248 0.9873
6 1.5040285 0.5148665 0.0242 1.0115
7 0.9891620 0.4442794 0.0159 1.0274
8 0.5448826 0.4398188 0.0088 1.0361
9 0.1050638 0.2296711 0.0017 1.0378
10 -0.1246073 0.0888343 -0.0020 1.0358
11 -0.2134417 0.1361614 -0.0034 1.0324
12 -0.3496030 0.0549062 -0.0056 1.0268
13 -0.4045092 0.1203960 -0.0065 1.0203
14 -0.5249052 0.2109050 -0.0084 1.0118
15 -0.7358101 -0.0118 1.0000

5 factors will be retained by the NFACTOR criterion.

Iteration Criterion Ridge Change Communalities

1 1.4462130 0.0000 0.3216 0.27784 0.90750 0.97656 0.67692 0.49074 0.90980


0.88450 0.34842 0.29435 0.53165 0.37336 0.90501
0.58936 0.19221 0.95775
2 1.3727907 0.0000 0.1714 0.44924 0.90056 0.97368 0.69783 0.50217 0.91056
0.91069 0.34950 0.18679 0.53482 0.40610 1.00000
0.56468 0.19049 0.94759
3 1.3547530 0.0000 0.1011 0.55035 0.90113 0.97335 0.70023 0.50795 0.91400
0.93346 0.36290 0.14267 0.53867 0.41749 1.00000
0.51945 0.18087 0.94224
4 1.3510559 0.0000 0.0381 0.58850 0.89787 0.97319 0.69362 0.51216 0.91573
0.94370 0.37064 0.13328 0.53673 0.41591 1.00000
0.49421 0.17575 0.94147
5 1.3503942 0.0000 0.0204 0.60887 0.89477 0.97281 0.69046 0.51399 0.91723
0.94544 0.37426 0.13018 0.53581 0.41459 1.00000
0.48406 0.17443 0.94167
6 1.3502608 0.0000 0.0100 0.61886 0.89296 0.97259 0.68871 0.51469 0.91805
0.94531 0.37607 0.12947 0.53527 0.41366 1.00000
0.48035 0.17415 0.94198
7 1.3502311 0.0000 0.0050 0.62382 0.89201 0.97247 0.68790 0.51495 0.91847
0.94503 0.37694 0.12937 0.53500 0.41315 1.00000
0.47882 0.17410 0.94216
8 1.3502243 0.0000 0.0024 0.62622 0.89154 0.97241 0.68750 0.51506 0.91866
0.94486 0.37735 0.12939 0.53486 0.41288 1.00000
0.47814 0.17409 0.94226
9 1.3502227 0.0000 0.0012 0.62738 0.89131 0.97239 0.68731 0.51511 0.91876
0.94476 0.37755 0.12940 0.53479 0.41275 1.00000
0.47783 0.17409 0.94230
10 1.3502223 0.0000 0.0006 0.62794 0.89120 0.97237 0.68722 0.51513 0.91881
0.94471 0.37765 0.12941 0.53476 0.41269 1.00000

0.47768 0.17409 0.94233

Convergence criterion satisfied.

Significance Tests Based on 50 Observations

Pr >
Test DF Chi-Square ChiSq

H0: No common factors 105 473.1958 <.0001


HA: At least one common factor
H0: 5 Factors are sufficient 40 53.7839 0.0714
HA: More factors are needed

Chi-Square without Bartlett’s Correction 66.160895


Akaike’s Information Criterion -13.839105
Schwarz’s Bayesian Criterion -90.320025
Tucker and Lewis’s Reliability Coefficient 0.901730

Police Department Applicant Data Johnson page 160


Maximum Likelihood FA for 5 factors with Oblique Promax Rotation

The FACTOR Procedure


Initial Factor Method: Maximum Likelihood

Squared Canonical Correlations

Factor1 Factor2 Factor3 Factor4 Factor5

1.0000000 0.9864528 0.9298891 0.8552151 0.6953437

Eigenvalues of the Weighted Reduced Correlation Matrix: Total = 94.2680654 Average = 6.73343324

Eigenvalue Difference Proportion Cumulative

1 Infty Infty
2 72.8157670 59.5526525 0.7724 0.7724
3 13.2631146 7.3563181 0.1407 0.9131
4 5.9067965 3.6244087 0.0627 0.9758
5 2.2823878 1.5748067 0.0242 1.0000
6 0.7075811 0.1211672 0.0075 1.0075
7 0.5864139 0.2073404 0.0062 1.0137

8 0.3790735 0.2137267 0.0040 1.0177
9 0.1653468 0.1429644 0.0018 1.0195
10 0.0223823 0.0949270 0.0002 1.0197
11 -0.0725446 0.1454124 -0.0008 1.0190
12 -0.2179570 0.0952176 -0.0023 1.0167
13 -0.3131747 0.1702114 -0.0033 1.0133
14 -0.4833861 0.2903498 -0.0051 1.0082
15 -0.7737358 -0.0082 1.0000

Factor Pattern

Factor1 Factor2 Factor3 Factor4 Factor5

RECVR 1.00000 -0.00000 0.00000 -0.00000 -0.00000


PULSE 0.50389 -0.19460 -0.08711 0.04878 0.27561
SPEED -0.43652 -0.12948 0.17123 0.29858 -0.38970
WEIGHT -0.12188 0.95896 0.18902 -0.04564 -0.01046
FAT 0.06652 0.91307 -0.31229 -0.04848 0.06581
CHEST -0.08323 0.85373 0.24075 -0.34999 0.05062
THIGH 0.20313 0.72563 -0.58376 0.16414 -0.09586
SHLDR -0.02394 0.59185 0.49095 0.18223 -0.24918
HEIGHT -0.12126 0.58658 0.45458 0.56854 0.05041
PELVIC -0.20686 0.58103 0.33755 0.10606 -0.09779
BREATH 0.09128 0.44303 0.29501 0.32258 0.13023
ENDUR -0.07280 -0.40654 0.04336 -0.01653 0.03696
CHNUP -0.06025 -0.66088 0.29276 -0.06020 -0.07086
REACT -0.12957 0.09818 -0.16829 0.33604 0.67845
DIAST 0.15384 -0.02090 -0.12208 -0.16473 0.25156

Police Department Applicant Data Johnson page 160


Maximum Likelihood FA for 5 factors with Oblique Promax Rotation

The FACTOR Procedure


Initial Factor Method: Maximum Likelihood

Variance Explained by Each Factor

Factor Weighted Unweighted

Factor1 2.5448036 1.62770415


Factor2 72.8157670 4.90368487
Factor3 13.2631146 1.34836092
Factor4 5.9067965 0.86110178
Factor5 2.2823878 0.86514424

Final Communality Estimates and Variable Weights

Total Communality: Weighted = 96.812869 Unweighted = 9.605996

Variable Communality Weight

REACT 0.62797040 2.6877327


HEIGHT 0.89119601 9.1909879
WEIGHT 0.97237461 36.1990669
SHLDR 0.68719064 3.1971113
PELVIC 0.51513802 2.0624119
CHEST 0.91880673 12.3160354
THIGH 0.94471334 18.0873827
PULSE 0.37770382 1.6068142
DIAST 0.12942491 1.1486505
CHNUP 0.53474783 2.1494398
BREATH 0.41265447 1.7026689
RECVR 1.00000000 Infty
SPEED 0.47765130 1.9145480
ENDUR 0.17409446 1.2107922
FAT 0.94232943 17.3392265

Chapter 12

Classification and Discrimination Analysis

Although classification and discrimination are different topics with differing purposes, they have enough
similarities to merit being considered together in a single chapter. The goal of discrimination is to find
linear combinations of the original variables that best separate two (or more) populations. These linear
combinations are called the discriminants. The goal of classification is to define a rule by which an
unknown vector can be assigned to one of two (or more) populations. Rencher (2001) describes these two
problems as:
1. The description of group separation (for either populations or samples from populations), in which linear
functions (discriminant functions) of the variables are used to describe or elucidate the differences between
two or more groups. The goals of discriminant analysis include assessing the relative contributions of the
p variables to the separation of the groups and finding the optimal plane on which the points can be
projected to best illustrate the configuration (separation) of the groups.
2. Prediction or allocation, in which linear or quadratic functions (classification functions) of the variables
are employed to assign an individual sampling unit to one of the groups. The measured values (in the
observation vector) for an individual or object are evaluated by the classification functions to see to
which group the individual most likely belongs.

12.1 Discriminant Analysis


12.1.1 Two Population Problem
Assume that two groups have the same covariance matrix Σ but distinct means µ1 and µ2 . Furthermore,
suppose that one has a sample of size n1 from group 1 and a sample of size n2 from group 2, with sample
mean vectors ȳ1 and ȳ2 . The discriminant function is the linear combination z = a′y that maximizes the
standardized distance between the transformed group means. That is, in the sample one maximizes

[a′(ȳ1 − ȳ2 )]² / (a′Sp a),

where Sp is the pooled sample covariance matrix. The maximum is attained by any scalar multiple of

a = Sp⁻¹ (ȳ1 − ȳ2 ).
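As a small numerical sketch — the summary statistics below are hypothetical, purely for illustration — the
coefficient vector can be computed directly in SAS/IML:

proc iml;
   /* Hypothetical two-group summary statistics */
   ybar1 = {182.5, 36.1};           /* sample mean vector, group 1 */
   ybar2 = {168.0, 30.4};           /* sample mean vector, group 2 */
   Sp    = {55.3 21.7,
            21.7 16.2};             /* pooled sample covariance matrix */
   a = inv(Sp) * (ybar1 - ybar2);   /* discriminant coefficient vector */
   print a;
quit;

Any nonzero scalar multiple of the printed vector yields the same discriminant.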

12.1.2 Multiple Population Problem
The multiple population problem is similar to the two population problem, where

[a′(ȳ1 − ȳ2 )]² / (a′Sp a) = [a′(ȳ1 − ȳ2 )(ȳ1 − ȳ2 )′a] / (a′Sp a),

and the matrix H from the one-way MANOVA is used instead of (ȳ1 − ȳ2 )(ȳ1 − ȳ2 )′ and the error matrix E
is used instead of Sp . That is, one writes

λ = (a′Ha)/(a′Ea),

which can be written as

a′Ha = λ a′Ea

or

a′(H − λE)a = 0.

Maximizing λ with respect to a leads to the eigenproblem

(E⁻¹H − λI)a = 0,

so that λ1 ≥ λ2 ≥ . . . ≥ λs and a1 , a2 , . . . , as are the eigenvalues and corresponding eigenvectors
of E⁻¹H. Thus λ1 is the maximized value of the discriminant criterion, and z1 = a1′y is the discriminant
function that maximizes the separation between the means. More generally, zi = ai′y provides the ith
discriminant function, and these functions are uncorrelated. The relative importance of the ith discriminant
function is given by

λi / (λ1 + λ2 + · · · + λs ).
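A minimal SAS/IML sketch of this eigenanalysis — the H and E matrices below are hypothetical, standing
in for the one-way MANOVA hypothesis and error matrices — might look as follows:

proc iml;
   /* Hypothetical MANOVA hypothesis (H) and error (E) matrices */
   H = {36 14,
        14  8};
   E = {90 20,
        20 60};
   lambda = eigval(inv(E) * H);  /* eigenvalues of inv(E)*H; for a nonsymmetric */
                                 /* argument EIGVAL may return real and         */
                                 /* imaginary parts in two columns              */
   lam  = lambda[, 1];           /* real parts; the eigenvalues are real here   */
                                 /* since E is positive definite                */
   prop = lam / sum(lam);        /* relative importance of each discriminant    */
   print lam prop;
quit;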

This problem is covered in SAS using PROC CANDISC. I have included a brief overview of this procedure
as given in the SAS User’s manual.

12.1.3 SAS – PROC CANDISC


Canonical discriminant analysis is a dimension-reduction technique related to principal component analysis
and canonical correlation. The methodology used in deriving the canonical coefficients parallels that of a
one-way MANOVA. Whereas in MANOVA the goal is to test for equality of the mean vector across class lev-
els, in a canonical discriminant analysis we find linear combinations of the quantitative variables that provide
maximal separation between the classes or groups. Given a classification variable and several quantitative
variables, the CANDISC procedure derives canonical variables, linear combinations of the quantitative vari-
ables that summarize between-class variation in much the same way that principal components summarize
total variation.
The CANDISC procedure performs a canonical discriminant analysis, computes squared Mahalanobis
distances between class means, and performs both univariate and multivariate one-way analyses of variance.
Two output data sets can be produced: one containing the canonical coefficients and another containing,
among other things, scored canonical variables. The canonical coefficients output data set can be rotated
by the FACTOR procedure. It is customary to standardize the canonical coefficients so that the canonical
variables have means that are equal to zero and pooled within-class variances that are equal to one. PROC
CANDISC displays both standardized and unstandardized canonical coefficients. Correlations between the
canonical variables and the original variables as well as the class means for the canonical variables are also
displayed; these correlations, sometimes known as loadings, are called canonical structures. The scored
canonical variables output data set can be used in conjunction with the PLOT procedure or the %PLOTIT
macro to plot pairs of canonical variables.
Given two or more groups of observations with measurements on several quantitative variables, canonical
discriminant analysis derives a linear combination of the variables that has the highest possible multiple
correlation with the groups. This maximal multiple correlation is called the first canonical correlation.
The coefficients of the linear combination are the canonical coefficients or canonical weights. The variable
defined by the linear combination is the first canonical variable or canonical component. The second canonical
correlation is obtained by finding the linear combination uncorrelated with the first canonical variable that
has the highest possible multiple correlation with the groups. The process of extracting canonical variables
can be repeated until the number of canonical variables equals the number of original variables or the number
of classes minus one, whichever is smaller.
The first canonical correlation is at least as large as the multiple correlation between the groups and
any of the original variables. If the original variables have high within-group correlations, the first canonical
correlation can be large even if all the multiple correlations are small. In other words, the first canonical
variable can show substantial differences between the classes, even if none of the original variables do.
Canonical variables are sometimes called discriminant functions, but this usage is ambiguous because the
DISCRIM procedure produces very different functions for classification that are also called discriminant
functions.
For each canonical correlation, PROC CANDISC tests the hypothesis that it and all smaller canonical
correlations are zero in the population. An F approximation (Rao 1973; Kshirsagar 1972) is used that
gives better small-sample results than the usual chi-square approximation. The variables should have an
approximate multivariate normal distribution within each class, with a common covariance matrix in order
for the probability levels to be valid.
Canonical discriminant analysis is equivalent to canonical correlation analysis between the quantitative
variables and a set of dummy variables coded from the class variable. Canonical discriminant analysis is also
equivalent to performing the following steps:
1. Transform the variables so that the pooled within-class covariance matrix is an identity matrix.
2. Compute class means on the transformed variables.
3. Perform a principal component analysis on the means, weighting each mean by the number of obser-
vations in the class. The eigenvalues are equal to the ratio of between-class variation to within-class
variation in the direction of each principal component.
4. Back-transform the principal components into the space of the original variables, obtaining the canon-
ical variables.
An interesting property of the canonical variables is that they are uncorrelated whether the correlation
is calculated from the total sample or from the pooled within-class correlations. The canonical coefficients
are not orthogonal, however, so the canonical variables do not represent perpendicular directions through
the space of the original variables.

12.1.4 SAS – PROC STEPDISC


SAS has a procedure for performing stepwise discrimination that is often used as a first step when there are
many variables. The procedure is similar to stepwise regression, whereby variables are either added
or removed according to the F values found in the MANOVA. The following discussion is found in the SAS
User’s manual for PROC STEPDISC.
Given a classification variable and several quantitative variables, the STEPDISC procedure performs
a stepwise discriminant analysis to select a subset of the quantitative variables for use in discriminating
among the classes. The set of variables that make up each class is assumed to be multivariate normal with a
common covariance matrix. The STEPDISC procedure can use forward selection, backward elimination, or
stepwise selection (Klecka 1980). The STEPDISC procedure is a useful prelude to further analyses using the
CANDISC procedure or the DISCRIM procedure. With PROC STEPDISC, variables are chosen to enter or
leave the model according to one of two criteria:

• the significance level of an F-test from an analysis of covariance, where the variables already chosen
act as covariates and the variable under consideration is the dependent variable
• the squared partial correlation for predicting the variable under consideration from the CLASS variable,
controlling for the effects of the variables already selected for the model

Forward selection begins with no variables in the model. At each step, PROC STEPDISC enters the variable
that contributes most to the discriminatory power of the model as measured by Wilks’ Lambda, the likelihood
ratio criterion. When none of the unselected variables meets the entry criterion, the forward selection process
stops.
Backward elimination begins with all variables in the model except those that are linearly dependent on
previous variables in the VAR statement. At each step, the variable that contributes least to the discrimi-
natory power of the model as measured by Wilks’ Lambda is removed. When all remaining variables meet
the criterion to stay in the model, the backward elimination process stops.
Stepwise selection begins, like forward selection, with no variables in the model. At each step, the model
is examined. If the variable in the model that contributes least to the discriminatory power of the model as
measured by Wilks’ lambda fails to meet the criterion to stay, then that variable is removed. Otherwise, the
variable not in the model that contributes most to the discriminatory power of the model is entered. When
all variables in the model meet the criterion to stay and none of the other variables meet the criterion to
enter, the stepwise selection process stops. Stepwise selection is the default method of variable selection.
It is important to realize that, in the selection of variables for entry, only one variable can be entered into
the model at each step. The selection process does not take into account the relationships between variables
that have not yet been selected. Thus, some important variables could be excluded in the process. Also,
Wilks’ Lambda may not be the best measure of discriminatory power for your application. However, if you
use PROC STEPDISC carefully, in combination with your knowledge of the data and careful cross-validation,
it can be a valuable aid in selecting variables for a discrimination model.
As with any stepwise procedure, it is important to remember that, when many significance tests are
performed, each at a level of, for example, 5%, the probability of rejecting at least one true null hypothesis is
much larger than 5%. If you want to guard against including variables that contribute nothing to the
discriminatory power of the model in the population, you should specify a very small
significance level. In most applications, all variables considered have some discriminatory power, however
small. To choose the model that provides the best discrimination using the sample estimates, you need only
to guard against estimating more parameters than can be reliably estimated with the given sample size.
Costanza and Afifi (1979) use Monte Carlo studies to compare alternative stopping rules that can be
used with the forward selection method in the two-group multivariate normal classification problem. Five
different numbers of variables, ranging from 10 to 30, are considered in the studies. The comparison is based
on conditional and estimated unconditional probabilities of correct classification. They conclude that the
use of a moderate significance level, in the range of 10 percent to 25 percent, often performs better than the
use of a much larger or a much smaller significance level.
The significance level and the squared partial correlation criteria select variables in the same order,
although they may select different numbers of variables. Increasing the sample size tends to increase the
number of variables selected when using significance levels, but it has little effect on the number selected
using squared partial correlations.

12.1.5 SAS – Examples


proc format;
value specfmt
1=’Bream’

2=’Roach’
3=’Whitefish’
4=’Parkki’
5=’Perch’
6=’Pike’
7=’Smelt’;
data fish (drop=HtPct WidthPct);
title ’Fish Measurement Data’;
input Species Weight Length1 Length2 Length3 HtPct WidthPct @@;
Height=HtPct*Length3/100;
Width=WidthPct*Length3/100;
format Species specfmt.;
datalines;
1 242.0 23.2 25.4 30.0 38.4 13.4 1 290.0 24.0 26.3 31.2 40.0 13.8
1 340.0 23.9 26.5 31.1 39.8 15.1 1 363.0 26.3 29.0 33.5 38.0 13.3
1 430.0 26.5 29.0 34.0 36.6 15.1 1 450.0 26.8 29.7 34.7 39.2 14.2
1 500.0 26.8 29.7 34.5 41.1 15.3 1 390.0 27.6 30.0 35.0 36.2 13.4
1 450.0 27.6 30.0 35.1 39.9 13.8 1 500.0 28.5 30.7 36.2 39.3 13.7
1 475.0 28.4 31.0 36.2 39.4 14.1 1 500.0 28.7 31.0 36.2 39.7 13.3
1 500.0 29.1 31.5 36.4 37.8 12.0 1 . 29.5 32.0 37.3 37.3 13.6
1 600.0 29.4 32.0 37.2 40.2 13.9 1 600.0 29.4 32.0 37.2 41.5 15.0
1 700.0 30.4 33.0 38.3 38.8 13.8 1 700.0 30.4 33.0 38.5 38.8 13.5
1 610.0 30.9 33.5 38.6 40.5 13.3 1 650.0 31.0 33.5 38.7 37.4 14.8
1 575.0 31.3 34.0 39.5 38.3 14.1 1 685.0 31.4 34.0 39.2 40.8 13.7
1 620.0 31.5 34.5 39.7 39.1 13.3 1 680.0 31.8 35.0 40.6 38.1 15.1
1 700.0 31.9 35.0 40.5 40.1 13.8 1 725.0 31.8 35.0 40.9 40.0 14.8
1 720.0 32.0 35.0 40.6 40.3 15.0 1 714.0 32.7 36.0 41.5 39.8 14.1
1 850.0 32.8 36.0 41.6 40.6 14.9 1 1000.0 33.5 37.0 42.6 44.5 15.5
1 920.0 35.0 38.5 44.1 40.9 14.3 1 955.0 35.0 38.5 44.0 41.1 14.3
1 925.0 36.2 39.5 45.3 41.4 14.9 1 975.0 37.4 41.0 45.9 40.6 14.7
1 950.0 38.0 41.0 46.5 37.9 13.7
2 40.0 12.9 14.1 16.2 25.6 14.0 2 69.0 16.5 18.2 20.3 26.1 13.9
2 78.0 17.5 18.8 21.2 26.3 13.7 2 87.0 18.2 19.8 22.2 25.3 14.3
2 120.0 18.6 20.0 22.2 28.0 16.1 2 0.0 19.0 20.5 22.8 28.4 14.7
2 110.0 19.1 20.8 23.1 26.7 14.7 2 120.0 19.4 21.0 23.7 25.8 13.9
2 150.0 20.4 22.0 24.7 23.5 15.2 2 145.0 20.5 22.0 24.3 27.3 14.6
2 160.0 20.5 22.5 25.3 27.8 15.1 2 140.0 21.0 22.5 25.0 26.2 13.3
2 160.0 21.1 22.5 25.0 25.6 15.2 2 169.0 22.0 24.0 27.2 27.7 14.1
2 161.0 22.0 23.4 26.7 25.9 13.6 2 200.0 22.1 23.5 26.8 27.6 15.4
2 180.0 23.6 25.2 27.9 25.4 14.0 2 290.0 24.0 26.0 29.2 30.4 15.4
2 272.0 25.0 27.0 30.6 28.0 15.6 2 390.0 29.5 31.7 35.0 27.1 15.3
3 270.0 23.6 26.0 28.7 29.2 14.8 3 270.0 24.1 26.5 29.3 27.8 14.5
3 306.0 25.6 28.0 30.8 28.5 15.2 3 540.0 28.5 31.0 34.0 31.6 19.3
3 800.0 33.7 36.4 39.6 29.7 16.6 3 1000.0 37.3 40.0 43.5 28.4 15.0
4 55.0 13.5 14.7 16.5 41.5 14.1 4 60.0 14.3 15.5 17.4 37.8 13.3
4 90.0 16.3 17.7 19.8 37.4 13.5 4 120.0 17.5 19.0 21.3 39.4 13.7
4 150.0 18.4 20.0 22.4 39.7 14.7 4 140.0 19.0 20.7 23.2 36.8 14.2
4 170.0 19.0 20.7 23.2 40.5 14.7 4 145.0 19.8 21.5 24.1 40.4 13.1
4 200.0 21.2 23.0 25.8 40.1 14.2 4 273.0 23.0 25.0 28.0 39.6 14.8

4 300.0 24.0 26.0 29.0 39.2 14.6
5 5.9 7.5 8.4 8.8 24.0 16.0 5 32.0 12.5 13.7 14.7 24.0 13.6
5 40.0 13.8 15.0 16.0 23.9 15.2 5 51.5 15.0 16.2 17.2 26.7 15.3
5 70.0 15.7 17.4 18.5 24.8 15.9 5 100.0 16.2 18.0 19.2 27.2 17.3
5 78.0 16.8 18.7 19.4 26.8 16.1 5 80.0 17.2 19.0 20.2 27.9 15.1
5 85.0 17.8 19.6 20.8 24.7 14.6 5 85.0 18.2 20.0 21.0 24.2 13.2
5 110.0 19.0 21.0 22.5 25.3 15.8 5 115.0 19.0 21.0 22.5 26.3 14.7
5 125.0 19.0 21.0 22.5 25.3 16.3 5 130.0 19.3 21.3 22.8 28.0 15.5
5 120.0 20.0 22.0 23.5 26.0 14.5 5 120.0 20.0 22.0 23.5 24.0 15.0
5 130.0 20.0 22.0 23.5 26.0 15.0 5 135.0 20.0 22.0 23.5 25.0 15.0
5 110.0 20.0 22.0 23.5 23.5 17.0 5 130.0 20.5 22.5 24.0 24.4 15.1
5 150.0 20.5 22.5 24.0 28.3 15.1 5 145.0 20.7 22.7 24.2 24.6 15.0
5 150.0 21.0 23.0 24.5 21.3 14.8 5 170.0 21.5 23.5 25.0 25.1 14.9
5 225.0 22.0 24.0 25.5 28.6 14.6 5 145.0 22.0 24.0 25.5 25.0 15.0
5 188.0 22.6 24.6 26.2 25.7 15.9 5 180.0 23.0 25.0 26.5 24.3 13.9
5 197.0 23.5 25.6 27.0 24.3 15.7 5 218.0 25.0 26.5 28.0 25.6 14.8
5 300.0 25.2 27.3 28.7 29.0 17.9 5 260.0 25.4 27.5 28.9 24.8 15.0
5 265.0 25.4 27.5 28.9 24.4 15.0 5 250.0 25.4 27.5 28.9 25.2 15.8
5 250.0 25.9 28.0 29.4 26.6 14.3 5 300.0 26.9 28.7 30.1 25.2 15.4
5 320.0 27.8 30.0 31.6 24.1 15.1 5 514.0 30.5 32.8 34.0 29.5 17.7
5 556.0 32.0 34.5 36.5 28.1 17.5 5 840.0 32.5 35.0 37.3 30.8 20.9
5 685.0 34.0 36.5 39.0 27.9 17.6 5 700.0 34.0 36.0 38.3 27.7 17.6
5 700.0 34.5 37.0 39.4 27.5 15.9 5 690.0 34.6 37.0 39.3 26.9 16.2
5 900.0 36.5 39.0 41.4 26.9 18.1 5 650.0 36.5 39.0 41.4 26.9 14.5
5 820.0 36.6 39.0 41.3 30.1 17.8 5 850.0 36.9 40.0 42.3 28.2 16.8
5 900.0 37.0 40.0 42.5 27.6 17.0 5 1015.0 37.0 40.0 42.4 29.2 17.6
5 820.0 37.1 40.0 42.5 26.2 15.6 5 1100.0 39.0 42.0 44.6 28.7 15.4
5 1000.0 39.8 43.0 45.2 26.4 16.1 5 1100.0 40.1 43.0 45.5 27.5 16.3
5 1000.0 40.2 43.5 46.0 27.4 17.7 5 1000.0 41.1 44.0 46.6 26.8 16.3
6 200.0 30.0 32.3 34.8 16.0 9.7 6 300.0 31.7 34.0 37.8 15.1 11.0
6 300.0 32.7 35.0 38.8 15.3 11.3 6 300.0 34.8 37.3 39.8 15.8 10.1
6 430.0 35.5 38.0 40.5 18.0 11.3 6 345.0 36.0 38.5 41.0 15.6 9.7
6 456.0 40.0 42.5 45.5 16.0 9.5 6 510.0 40.0 42.5 45.5 15.0 9.8
6 540.0 40.1 43.0 45.8 17.0 11.2 6 500.0 42.0 45.0 48.0 14.5 10.2
6 567.0 43.2 46.0 48.7 16.0 10.0 6 770.0 44.8 48.0 51.2 15.0 10.5
6 950.0 48.3 51.7 55.1 16.2 11.2 6 1250.0 52.0 56.0 59.7 17.9 11.7
6 1600.0 56.0 60.0 64.0 15.0 9.6 6 1550.0 56.0 60.0 64.0 15.0 9.6
6 1650.0 59.0 63.4 68.0 15.9 11.0
7 6.7 9.3 9.8 10.8 16.1 9.7 7 7.5 10.0 10.5 11.6 17.0 10.0
7 7.0 10.1 10.6 11.6 14.9 9.9 7 9.7 10.4 11.0 12.0 18.3 11.5
7 9.8 10.7 11.2 12.4 16.8 10.3 7 8.7 10.8 11.3 12.6 15.7 10.2
7 10.0 11.3 11.8 13.1 16.9 9.8 7 9.9 11.3 11.8 13.1 16.9 8.9
7 9.8 11.4 12.0 13.2 16.7 8.7 7 12.2 11.5 12.2 13.4 15.6 10.4
7 13.4 11.7 12.4 13.5 18.0 9.4 7 12.2 12.1 13.0 13.8 16.5 9.1
7 19.7 13.2 14.3 15.2 18.9 13.6 7 19.9 13.8 15.0 16.2 18.1 11.6
;
*proc step discriminate;
proc stepdisc data=fish;
class Species;

run;
*proc candiscriminate;
proc candisc data=fish ncan=3 out=outcan;
class Species;
var Weight Length1 Length2 Length3 Height Width;
run;
%plotit(data=outcan, plotvars=Can2 Can1,
labelvar=_blank_, symvar=symbol, typevar=symbol,
symsize=1, symlen=4, tsize=1.5, exttypes=symbol, ls=100,
plotopts=vaxis=-5 to 15 by 5, vtoh=, extend=close);

SAS – Output
The SAS output for the above program is:
Fish Measurement Data
The STEPDISC Procedure

The Method for Selecting Variables is STEPWISE

Observations 158 Variable(s) in the Analysis 6


Class Levels 7 Variable(s) will be Included 0
Significance Level to Enter 0.15
Significance Level to Stay 0.15

Class Level Information

Variable
Species Name Frequency Weight Proportion

Bream Bream 34 34.0000 0.215190


Parkki Parkki 11 11.0000 0.069620
Perch Perch 56 56.0000 0.354430
Pike Pike 17 17.0000 0.107595
Roach Roach 20 20.0000 0.126582
Smelt Smelt 14 14.0000 0.088608
Whitefish Whitefish 6 6.0000 0.037975
Fish Measurement Data
The STEPDISC Procedure
Stepwise Selection: Step 1

Statistics for Entry, DF = 6, 151

Variable R-Square F Value Pr > F Tolerance

Weight 0.3750 15.10 <.0001 1.0000


Length1 0.6017 38.02 <.0001 1.0000
Length2 0.6098 39.32 <.0001 1.0000
Length3 0.6280 42.49 <.0001 1.0000

Height 0.7553 77.69 <.0001 1.0000
Width 0.4806 23.29 <.0001 1.0000

Variable Height will be entered.

Variable(s) that have been Entered

Height

Multivariate Statistics

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.244670 77.69 6 151 <.0001


Pillai’s Trace 0.755330 77.69 6 151 <.0001
Average Squared Canonical Correlation 0.125888

Fish Measurement Data


The STEPDISC Procedure
Stepwise Selection: Step 2

Statistics for Removal, DF = 6, 151

Variable R-Square F Value Pr > F

Height 0.7553 77.69 <.0001

No variables can be removed.

Statistics for Entry, DF = 6, 150

Partial
Variable R-Square F Value Pr > F Tolerance

Weight 0.7388 70.71 <.0001 0.4690


Length1 0.9220 295.35 <.0001 0.6083
Length2 0.9229 299.31 <.0001 0.5892
Length3 0.9173 277.37 <.0001 0.5056
Width 0.8783 180.44 <.0001 0.3699

Variable Length2 will be entered.

Variable(s) that have been Entered

Length2 Height

Multivariate Statistics

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.018861 157.04 12 300 <.0001


Pillai’s Trace 1.554349 87.78 12 302 <.0001
Average Squared Canonical Correlation 0.259058

Fish Measurement Data


The STEPDISC Procedure
Stepwise Selection: Step 3

Statistics for Removal, DF = 6, 150

Partial
Variable R-Square F Value Pr > F

Length2 0.9229 299.31 <.0001


Height 0.9517 492.27 <.0001

No variables can be removed.

Statistics for Entry, DF = 6, 149

Partial
Variable R-Square F Value Pr > F Tolerance

Weight 0.4743 22.41 <.0001 0.1218


Length1 0.2894 10.12 <.0001 0.0006
Length3 0.8826 186.77 <.0001 0.0042
Width 0.7569 77.34 <.0001 0.1442

Variable Length3 will be entered.

Variable(s) that have been Entered

Length2 Length3 Height

Multivariate Statistics

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.002213 180.09 18 421.92 <.0001


Pillai’s Trace 2.305626 83.56 18 453 <.0001
Average Squared Canonical Correlation 0.384271

Fish Measurement Data

The STEPDISC Procedure
Stepwise Selection: Step 4

Statistics for Removal, DF = 6, 149

Partial
Variable R-Square F Value Pr > F

Length2 0.8906 202.13 <.0001


Length3 0.8826 186.77 <.0001
Height 0.8726 170.03 <.0001

No variables can be removed.

Statistics for Entry, DF = 6, 148

Partial
Variable R-Square F Value Pr > F Tolerance

Weight 0.4508 20.25 <.0001 0.0039


Length1 0.2881 9.98 <.0001 0.0005
Width 0.5775 33.72 <.0001 0.0024

Variable Width will be entered.

Variable(s) that have been Entered

Length2 Length3 Height Width

Multivariate Statistics

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.000935 137.66 24 517.52 <.0001


Pillai’s Trace 2.712044 52.99 24 604 <.0001
Average Squared Canonical Correlation 0.452007

Fish Measurement Data


The STEPDISC Procedure
Stepwise Selection: Step 5

Statistics for Removal, DF = 6, 148

Partial
Variable R-Square F Value Pr > F

Length2 0.7770 85.96 <.0001

Length3 0.7960 96.26 <.0001
Height 0.7633 79.53 <.0001
Width 0.5775 33.72 <.0001

No variables can be removed.

Statistics for Entry, DF = 6, 147

Partial
Variable R-Square F Value Pr > F Tolerance

Weight 0.4461 19.73 <.0001 0.0023


Length1 0.2910 10.06 <.0001 0.0005

Variable Weight will be entered.

Variable(s) that have been Entered

Weight Length2 Length3 Height Width

Multivariate Statistics

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.000518 110.70 30 590 <.0001


Pillai’s Trace 2.969307 36.80 30 755 <.0001
Average Squared Canonical Correlation 0.494885

Fish Measurement Data


The STEPDISC Procedure
Stepwise Selection: Step 6

Statistics for Removal, DF = 6, 147

Partial
Variable R-Square F Value Pr > F

Weight 0.4461 19.73 <.0001


Length2 0.7672 80.76 <.0001
Length3 0.7954 95.22 <.0001
Height 0.7508 73.81 <.0001
Width 0.5739 33.00 <.0001

No variables can be removed.

Statistics for Entry, DF = 6, 146

Partial
Variable R-Square F Value Pr > F Tolerance

Length1 0.2987 10.36 <.0001 0.0005

Variable Length1 will be entered.

All variables have been entered.

Multivariate Statistics

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.000363 90.71 36 643.89 <.0001


Pillai’s Trace 3.104651 26.99 36 906 <.0001
Average Squared Canonical Correlation 0.517442

Fish Measurement Data


The STEPDISC Procedure
Stepwise Selection: Step 7

Statistics for Removal, DF = 6, 146

Partial
Variable R-Square F Value Pr > F

Weight 0.4521 20.08 <.0001


Length1 0.2987 10.36 <.0001
Length2 0.5250 26.89 <.0001
Length3 0.7948 94.25 <.0001
Height 0.7257 64.37 <.0001
Width 0.5757 33.02 <.0001

No variables can be removed.


No further steps are possible.

Fish Measurement Data


The STEPDISC Procedure

Stepwise Selection Summary

Average
Squared
Number Partial Wilks’ Pr < Canonical Pr >
Step In Entered Removed R-Square F Value Pr > F Lambda Lambda Correlation ASCC

1 1 Height 0.7553 77.69 <.0001 0.24466983 <.0001 0.12588836 <.0001

2 2 Length2 0.9229 299.31 <.0001 0.01886065 <.0001 0.25905822 <.0001
3 3 Length3 0.8826 186.77 <.0001 0.00221342 <.0001 0.38427100 <.0001
4 4 Width 0.5775 33.72 <.0001 0.00093510 <.0001 0.45200732 <.0001
5 5 Weight 0.4461 19.73 <.0001 0.00051794 <.0001 0.49488458 <.0001
6 6 Length1 0.2987 10.36 <.0001 0.00036325 <.0001 0.51744189 <.0001
------------------------

Fish Measurement Data


The CANDISC Procedure

Observations 158 DF Total 157


Variables 6 DF Within Classes 151
Classes 7 DF Between Classes 6

Class Level Information

Variable
Species Name Frequency Weight Proportion

Bream Bream 34 34.0000 0.215190


Parkki Parkki 11 11.0000 0.069620
Perch Perch 56 56.0000 0.354430
Pike Pike 17 17.0000 0.107595
Roach Roach 20 20.0000 0.126582
Smelt Smelt 14 14.0000 0.088608
Whitefish Whitefish 6 6.0000 0.037975

Fish Measurement Data


The CANDISC Procedure

Multivariate Statistics and F Approximations

S=6 M=-0.5 N=72

Statistic Value F Value Num DF Den DF Pr > F

Wilks’ Lambda 0.00036325 90.71 36 643.89 <.0001


Pillai’s Trace 3.10465132 26.99 36 906 <.0001
Hotelling-Lawley Trace 52.05799676 209.24 36 413.64 <.0001
Roy’s Greatest Root 39.13499776 984.90 6 151 <.0001

NOTE: F Statistic for Roy’s Greatest Root is an upper bound.

Fish Measurement Data


The CANDISC Procedure

Adjusted Approximate Squared


Canonical Canonical Standard Canonical

Correlation Correlation Error Correlation

1 0.987463 0.986671 0.001989 0.975084


2 0.952349 0.950095 0.007425 0.906969
3 0.838637 0.832518 0.023678 0.703313
4 0.633094 0.623649 0.047821 0.400809
5 0.344157 0.334170 0.070356 0.118444
6 0.005701 . 0.079806 0.000033

Test of H0: The canonical correlations in


Eigenvalues of Inv(E)*H the current row and all that follow are zero
= CanRsq/(1-CanRsq)
Likelihood Approximate
Eigenvalue Difference Proportion Cumulative Ratio F Value Num DF Den DF Pr > F

1 39.1350 29.3859 0.7518 0.7518 0.00036325 90.71 36 643.89 <.0001


2 9.7491 7.3786 0.1873 0.9390 0.01457896 46.46 25 547.58 <.0001
3 2.3706 1.7016 0.0455 0.9846 0.15671134 23.61 16 452.79 <.0001
4 0.6689 0.5346 0.0128 0.9974 0.52820347 12.09 9 362.78 <.0001
5 0.1344 0.1343 0.0026 1.0000 0.88152702 4.88 4 300 0.0008
6 0.0000 0.0000 1.0000 0.99996749 0.00 1 151 0.9442

Fish Measurement Data


The CANDISC Procedure

Total Canonical Structure

Variable Can1 Can2 Can3

Weight 0.230560 0.420828 0.409512


Length1 0.102173 0.670793 0.490738
Length2 0.119374 0.664922 0.506495
Length3 0.222321 0.665900 0.484137
Height 0.763233 0.131542 0.484394
Width 0.240650 0.273018 0.696271

Between Canonical Structure

Variable Can1 Can2 Can3

Weight 0.371793 0.654482 0.560838


Length1 0.130068 0.823569 0.530566
Length2 0.150956 0.810939 0.543964
Length3 0.277024 0.800241 0.512339
Height 0.867181 0.144143 0.467417
Width 0.342763 0.375037 0.842247

Pooled Within Canonical Structure

Variable Can1 Can2 Can3

Weight 0.046034 0.162357 0.282143


Length1 0.025554 0.324182 0.423532
Length2 0.030163 0.324651 0.441629
Length3 0.057538 0.333012 0.432369
Height 0.243560 0.081113 0.533406
Width 0.052710 0.115551 0.526255

Fish Measurement Data


The CANDISC Procedure

Total-Sample Standardized Canonical Coefficients

Variable Can1 Can2 Can3

Weight -0.23287040 -1.87861652 -2.00951527


Length1 -3.30254226 -6.28154797 -29.41614257
Length2 -26.71741394 -7.41786401 43.47030076
Length3 30.20558626 20.98357040 -13.25763695
Height 4.80387980 -3.06026678 1.21255849
Width -2.44489997 -1.53319065 1.25337213

Pooled Within-Class Standardized Canonical Coefficients

Variable Can1 Can2 Can3

Weight -0.18772539 -1.51442189 -1.61994419


Length1 -2.12531351 -4.04241874 -18.93042395
Length2 -17.01855083 -4.72505670 27.68986268
Length3 18.78500902 13.04979006 -8.24499242
Height 2.42294578 -1.54351499 0.61158138
Width -1.79661023 -1.12664977 0.92102794

Raw Canonical Coefficients

Variable Can1 Can2 Can3

Weight -0.000648508 -0.005231659 -0.005596192


Length1 -0.329435762 -0.626598051 -2.934324102
Length2 -2.486133674 -0.690253987 4.045038893
Length3 2.595648437 1.803175454 -1.139264914
Height 1.121983854 -0.714749340 0.283202557
Width -1.446386704 -0.907025481 0.741486686

Class Means on Canonical Variables

Species Can1 Can2 Can3

Bream 10.94142464 0.52078394 0.23496708


Parkki 2.58903743 -2.54722416 -0.49326158
Perch -4.47181389 -1.70822715 1.29281314
Pike -4.89689441 8.22140791 -0.16469132
Roach -0.35837149 0.08733611 -1.10056438
Smelt -4.09136653 -2.35805841 -4.03836098
Whitefish -0.39541755 -0.42071778 1.06459242
The discriminant plot of Can2 versus Can1 produced by the %plotit macro call above is given in the graphics output.

12.2 Classification Analysis


12.2.1 Two Population Case
Suppose that one has a p dimensional random vector X that comes from one of two populations, π1 or π2 ,
with corresponding probability density functions f1 (x) and f2 (x). A classification rule R1 is a subset of
p-dimensional space, i.e., R1 ⊂ Rp , such that an unknown vector x0 is assigned to population π1 if x0 ∈ R1 ;
otherwise x0 ∈ R2 = Rp − R1 and is assigned to π2 . If the boundary between the region defined by R1 and its
complement is a plane (a point in one dimension, a line in two dimensions), then the classification rule is said
to be linear; otherwise it is described as quadratic or curved.
From here one can define the conditional probabilities

Pr[2 | 1] = Pr[X ∈ R2 | π1 ] = ∫R2 f1 (x) dx

and

Pr[1 | 2] = Pr[X ∈ R1 | π2 ] = ∫R1 f2 (x) dx.

Suppose that one has prior information concerning the proportion of objects belonging to the respective
populations, given by p1 = Pr[X ∈ π1 ] and p2 = Pr[X ∈ π2 ], where p1 + p2 = 1. These priors weight the
respective probabilities of misclassifying an object under the rule defined by R1 : the probability
that an observation belonging to π1 is misclassified into π2 equals Pr[2 | 1]p1 , and Pr[1 | 2]p2 is
the probability of misclassifying a vector into π1 when it belongs to population π2 .
To make the problem completely general, suppose that the costs associated with misclassifying an object
are not equal. That is, let c(2 | 1) denote the cost of misclassifying an object from π1 into π2 , and let c(1 | 2)
denote the cost of misclassifying an object from π2 into π1 .
There are a number of different classification procedures. The first is found by minimizing the total
or expected cost of misclassification. Others include the likelihood ratio rule, maximizing the posterior
probability, and the minimax allocation.

12.2.2 Minimizing the Total Cost of Misclassification


The expected cost of misclassification (ECM) is given by

ECM = c(2 | 1) Pr[2 | 1]p1 + c(1 | 2) Pr[1 | 2]p2 .

The objective would be to find or specify the region R1 such that the value of the ECM is as small as
possible. This minimum is obtained in the following special cases:
• p1 = p2 , in which case

R1 : f1 (x)/f2 (x) ≥ c(1 | 2)/c(2 | 1),   R2 : f1 (x)/f2 (x) < c(1 | 2)/c(2 | 1)

• c(1 | 2) = c(2 | 1), in which case

R1 : f1 (x)/f2 (x) ≥ p2 /p1 ,   R2 : f1 (x)/f2 (x) < p2 /p1

• p2 /p1 = c(1 | 2)/c(2 | 1) = 1, in which case

R1 : f1 (x)/f2 (x) ≥ 1,   R2 : f1 (x)/f2 (x) < 1

The above follows easily in the special case where c(1 | 2) = c(2 | 1) = 1, since then

ECM = p1 (1 − ∫R1 f1 (x) dx) + p2 ∫R1 f2 (x) dx = p1 + ∫R1 [p2 f2 (x) − p1 f1 (x)] dx,

which is minimized by choosing R1 to be the set of points where p2 f2 (x) − p1 f1 (x) ≤ 0.
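As a hypothetical numerical illustration of the second special case, take equal costs with p1 = 0.75 and
p2 = 0.25. Then x is assigned to π1 whenever f1 (x)/f2 (x) ≥ p2 /p1 = 1/3, so an observation may be
allocated to π1 even when its density ratio is below one, simply because π1 is three times as prevalent.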

The following gives a special case in which the distributions f1 (x) and f2 (x) are specified. Suppose that
fi (x) is the N (µi , Σi ) density for i = 1, 2, where Σ1 = Σ2 = Σ. That is,

fi (x) = k |Σ|^{−1/2} exp[−(1/2)(x − µi )′Σ⁻¹ (x − µi )],

where k = (2π)^{−p/2} , and

f1 (x)/f2 (x) = exp[−(1/2)(x − µ1 )′Σ⁻¹ (x − µ1 ) + (1/2)(x − µ2 )′Σ⁻¹ (x − µ2 )]
             = exp[(µ1 − µ2 )′Σ⁻¹ x − (1/2)(µ1 − µ2 )′Σ⁻¹ (µ1 + µ2 )].

Let

D(x) = (µ1 − µ2 )′Σ⁻¹ [x − (1/2)(µ1 + µ2 )];

then the equation D(x) = log(p2 /p1 ) defines a hyperplane separating the two groups. Let

∆² = (µ1 − µ2 )′Σ⁻¹ (µ1 − µ2 ),

the squared Mahalanobis distance between the two vectors µ1 and µ2 . It follows that

E(D(x) | x ∈ πi ) = D(µi ) = (1/2)(−1)^{i+1} ∆²

and

Var(D(x) | x ∈ πi ) = ∆².

From here one has

Pr[2 | 1] = Pr[D(x) ≤ log(p2 /p1 ) | x ∈ π1 ] = Pr[Z ≤ (log(p2 /p1 ) − .5∆²)/∆]

and

Pr[1 | 2] = Pr[Z ≤ −(log(p2 /p1 ) + .5∆²)/∆].
If p1 = p2 , then assign x to π1 if

(µ1 − µ2 )′Σ⁻¹ x > (1/2)(µ1 − µ2 )′Σ⁻¹ (µ1 + µ2 ).
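As a quick numerical check (hypothetical values), suppose p1 = p2 and ∆² = 4, so ∆ = 2 and
log(p2 /p1 ) = 0. Then Pr[2 | 1] = Pr[Z ≤ −.5∆] = Pr[Z ≤ −1] ≈ 0.159, and by symmetry
Pr[1 | 2] ≈ 0.159 as well.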

Classification Using Sample Data


If one assumes that the data are normal but the parameters µ1 , µ2 , and Σ1 = Σ2 are unknown, then one uses
the following rule to minimize the ECM:
Allocate an unknown x0 to π1 if

(x̄1 − x̄2 )′Spooled⁻¹ x0 ≥ log[c(1 | 2)p2 /(c(2 | 1)p1 )] + (1/2)(x̄1 − x̄2 )′Spooled⁻¹ (x̄1 + x̄2 );

otherwise allocate x0 to π2 , where ni is the sample size from population πi and

Spooled = [(n1 − 1)S1 + (n2 − 1)S2 ]/(n1 + n2 − 2).

This rule is called a linear classification rule.
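As an illustrative sketch (the data sets train and new, the class variable group, and the variables x1–x3 are
hypothetical), PROC DISCRIM applies exactly this linear rule when a pooled covariance matrix is assumed:

proc discrim data=train testdata=new testout=scored
             method=normal pool=yes;
   class group;               /* population indicator */
   priors '1'=0.5 '2'=0.5;    /* p1 and p2; costs are taken as equal */
   var x1 x2 x3;
run;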

12.2.3 Likelihood Ratio Method


If p1 is unknown, then the likelihood ratio method allocates x0 to π1 if

f1 (x0 )/f2 (x0 ) ≥ 1.

This is a special case of the previous rule with p1 = p2 = 1/2 and equal costs.

12.2.4 Maximizing the Posterior Probability
This method assumes that one has prior information concerning the allocation of x0 . That is, suppose that
one specifies

qi (x0 ) = Pr[x0 ∈ πi ] = fi (x0 )pi / [f1 (x0 )p1 + f2 (x0 )p2 ].

Then allocate x0 to π1 if q1 (x0 ) > q2 (x0 ).

12.2.5 Minimax Allocation


The minimax allocation is one that minimizes the worst case. It can be shown that one should allocate x0
to π1 if f1 (x0 )/f2 (x0 ) > c where c satisfies

Pr[1 | 2] = Pr[2 | 1].
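In the equal-covariance normal case of Section 12.2.2 this condition can be solved explicitly: there
Pr[2 | 1] = Pr[Z ≤ (log c − .5∆²)/∆] and Pr[1 | 2] = Pr[Z ≤ −(log c + .5∆²)/∆], and the two are equal
precisely when log c = 0, so the minimax rule reduces to the likelihood ratio rule with c = 1.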

12.3 Evaluating Classification Functions


Since the population parameters are usually unknown, one cannot compute the probability of misclassifi-
cation or the expected cost of misclassification. Instead, one must estimate these quantities. In some cases,
one may define the classification rule using a subset of the available data, called the training set, and then
classify another set, called the test set, in order to evaluate the classification rule. However, this procedure
assumes that one has a large amount of data. Alternative procedures have been suggested that are similar
to the “leave one out” rule used in regression. Such a procedure was proposed by Lachenbruch and Mickey
and is available in SAS PROC DISCRIM through the CROSSVALIDATE option. I have included the
description from the SAS User’s manual.

12.3.1 SAS – Classification Error-Rate Estimates


A classification criterion can be evaluated by its performance in the classification of future observations.
PROC DISCRIM uses two types of error-rate estimates to evaluate the derived classification criterion based
on parameters estimated by the training sample:
• error-count estimates
• posterior probability error-rate estimates.
The error-count estimate is calculated by applying the classification criterion derived from the training
sample to a test set and then counting the number of misclassified observations. The group-specific error-
count estimate is the proportion of misclassified observations in the group. When the test set is independent
of the training sample, the estimate is unbiased. However, it can have a large variance, especially if the test
set is small.
When the input data set is an ordinary SAS data set and no independent test sets are available, the same
data set can be used both to define and to evaluate the classification criterion. The resulting error-count
estimate has an optimistic bias and is called an apparent error rate. To reduce the bias, you can split the
data into two sets, one set for deriving the discriminant function and the other set for estimating the error
rate. Such a split-sample method has the unfortunate effect of reducing the effective sample size.
Another way to reduce bias is cross validation (Lachenbruch and Mickey 1968). Cross validation treats
n-1 out of n training observations as a training set. It determines the discriminant functions based on these
n-1 observations and then applies them to classify the one observation left out. This is done for each of the
n training observations. The misclassification rate for each group is the proportion of sample observations
in that group that are misclassified. This method achieves a nearly unbiased estimate but with a relatively
large variance.
To reduce the variance in an error-count estimate, smoothed error-rate estimates are suggested (Glick
1978). Instead of summing terms that are either zero or one as in the error-count estimator, the smoothed
estimator uses a continuum of values between zero and one in the terms that are summed. The resulting es-
timator has a smaller variance than the error-count estimate. The posterior probability error-rate estimates
provided by the POSTERR option in the PROC DISCRIM statement (see the following section, ”Posterior
Probability Error-Rate Estimates”) are smoothed error-rate estimates. The posterior probability estimates
for each group are based on the posterior probabilities of the observations classified into that same group.
The posterior probability estimates provide good estimates of the error rate when the posterior probabili-
ties are accurate. When a parametric classification criterion (linear or quadratic discriminant function) is
derived from a nonnormal population, the resulting posterior probability error-rate estimators may not be
appropriate.
The overall error rate is estimated through a weighted average of the individual group-specific error-rate
estimates, where the prior probabilities are used as the weights.
To reduce both the bias and the variance of the estimator, Hora and Wilcox (1982) compute the posterior
probability estimates based on cross validation. The resulting estimates are intended to have both low vari-
ance from using the posterior probability estimate and low bias from cross validation. They use Monte Carlo
studies on two-group multivariate normal distributions to compare the cross validation posterior probability
estimates with three other estimators: the apparent error rate, cross validation estimator, and posterior
probability estimator. They conclude that the cross validation posterior probability estimator has a lower
mean squared error in their simulations.
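For example, both types of error-rate estimates described above could be requested for the fish data from
the earlier example; a sketch:

proc discrim data=fish method=normal pool=yes
             crossvalidate posterr;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;

CROSSVALIDATE requests the leave-one-out error-count estimates, and POSTERR requests the posterior
probability error-rate estimates.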

12.4 Multiple Population Case


The above methods can easily be extended to the multiple class problem. SAS PROC DISCRIM allows for
both parametric and nonparametric classification procedures. The following descriptions have been taken
from the SAS manual for PROC DISCRIM.

12.4.1 Parametric Methods


The following notation is used to describe the classification methods:
• x – a p-dimensional vector containing the quantitative variables of an observation
• Sp – the pooled covariance matrix
• t – a subscript to distinguish the groups
• nt – the number of training set observations in group t

• mt – the p-dimensional vector containing variable means in group t


• St – the covariance matrix within group t
• | St | – the determinant of St
• qt – the prior probability of membership in group t

• p(t | x) – the posterior probability of an observation x belonging to group t


• ft – the probability density function for group t

• ft (x) – the group-specific density estimate at x from group t
• f (x) = Σt qt ft (x) – the estimated unconditional density at x
• et – the classification error rate for group t

Bayes’ Theorem
Assuming that the prior probabilities of group membership are known and that the group-specific densities
at x can be estimated, PROC DISCRIM computes p(t | x), the posterior probability of x belonging to group t,
by applying Bayes’ theorem:

p(t | x) = qt ft (x)/f (x).

PROC DISCRIM partitions a p-dimensional vector space into regions Rt , where the region Rt is the
subspace containing all p-dimensional vectors y such that p(t | y) is the largest among all groups. An
observation is classified as coming from group t if it lies in region Rt .

Parametric Methods
Assuming that each group has a multivariate normal distribution, PROC DISCRIM develops a discriminant
function or classification criterion using a measure of generalized squared distance. The classification criterion
is based on either the individual within-group covariance matrices or the pooled covariance matrix; it also
takes into account the prior probabilities of the classes. Each observation is placed in the class from which
it has the smallest generalized squared distance. PROC DISCRIM also computes the posterior probability
of an observation belonging to each class.
The squared Mahalanobis distance from x to group t is

dt²(x) = (x − mt )′Vt⁻¹ (x − mt ),

where Vt = St if the within-group covariance matrices are used, or Vt = Sp if the pooled covariance matrix
is used.
The group-specific density estimate at x from group t is then given by

ft (x) = (2π)^{−p/2} |Vt |^{−1/2} exp[−.5 dt²(x)].

Using Bayes’ theorem, the posterior probability of x belonging to group t is

p(t | x) = qt ft (x) / Σu qu fu (x),

where the summation is over all groups.


The generalized squared distance from x to group t is defined as

Dt²(x) = dt²(x) + g1 (t) + g2 (t),

where g1 (t) = ln |St | if the within-group covariance matrices are used (and is zero otherwise), and
g2 (t) = −2 ln(qt ) if the prior probabilities are not all equal (and is zero otherwise).
The posterior probability of x belonging to group t is then equal to

p(t | x) = exp[−.5Dt²(x)] / Σu exp[−.5Du²(x)].

The discriminant scores are −0.5Du²(x). An observation is classified into group u if setting t = u produces the
largest value of p(t | x) or the smallest value of Dt²(x). If this largest posterior probability is less than the
threshold specified, x is classified into group OTHER.
12.4.2 Nonparametric Methods
Whenever the density function for πi is unspecified, one can use a nonparametric method for estimating
fi (x). SAS PROC DISCRIM provides a number of nonparametric options. I have included their description
of some of the options.

Nonparametric Methods

Nonparametric discriminant methods are based on nonparametric estimates of group-specific probability


densities. Either a kernel method or the k-nearest-neighbor method can be used to generate a nonparametric
density estimate in each group and to produce a classification criterion. The kernel method uses uniform,
normal, Epanechnikov, biweight, or triweight kernels in the density estimation.
Either Mahalanobis distance or Euclidean distance can be used to determine proximity. When the k-
nearest-neighbor method is used, the Mahalanobis distances are based on the pooled covariance matrix.
When a kernel method is used, the Mahalanobis distances are based on either the individual within-group
covariance matrices or the pooled covariance matrix. Either the full covariance matrix or the diagonal matrix
of variances can be used to calculate the Mahalanobis distances.
The squared distance between two observation vectors, x and y, in group t is given by

dt²(x, y) = (x − y)′Vt⁻¹ (x − y),

where Vt is one of the following forms: Sp , diag(Sp ), St , diag(St ), or I.
The classification of an observation vector x is based on the estimated group-specific densities from
the training set. From these estimated densities, the posterior probabilities of group membership at x are
evaluated. An observation x is classified into group u if setting t=u produces the largest value of p(t | x). If
there is a tie for the largest probability or if this largest probability is less than the threshold specified, x is
classified into group OTHER.
The kernel method uses a fixed radius, r, and a specified kernel, Kt , to estimate the group t density at
each observation vector x. Let z be a p-dimensional vector. Then the volume of a p-dimensional unit sphere
bounded by z′z = 1 is

v0 = π^{p/2} / Γ(p/2 + 1),

where Γ(·) represents the gamma function (refer to SAS Language Reference: Dictionary).
Thus, in group t, the volume of a p-dimensional ellipsoid bounded by {z | z′Vt⁻¹ z = r²} is

vr (t) = r^p |Vt |^{1/2} v0 .
The kernel method uses one of the following densities as the kernel density in group t.

Uniform Kernel

Kt (z) = vr (t)⁻¹

if z′Vt⁻¹ z ≤ r², and zero otherwise.

Normal Kernel, N(0, σ² = r²Vt )

Kt (z) = [1/c0 (t)] exp(−[1/(2r²)] z′Vt⁻¹ z),

where c0 (t) = (2π)^{p/2} r^p |Vt |^{1/2} .

Epanechnikov Kernel

Kt (z) = c1 (t)(1 − r⁻² z′Vt⁻¹ z)

if z′Vt⁻¹ z ≤ r², where c1 (t) = vr (t)⁻¹ (1 + p/2).

Biweight Kernel

Kt (z) = c2 (t)(1 − r⁻² z′Vt⁻¹ z)²

if z′Vt⁻¹ z ≤ r², where c2 (t) = (1 + p/4) c1 (t).

Triweight Kernel

Kt (z) = c3 (t)(1 − r⁻² z′Vt⁻¹ z)³

if z′Vt⁻¹ z ≤ r², where c3 (t) = (1 + p/6) c2 (t).
The group t density at x is estimated by

ft (x) = (1/nt ) Σy Kt (x − y),

where the summation is over all observations y in group t, and Kt is the specified kernel function. The
posterior probability of membership in group t is then given by

p(t | x) = qt ft (x)/f (x),

where f (x) = Σu qu fu (x) is the estimated unconditional density and qu is the prior probability for
population u. If f (x) is zero, the observation x is classified into group OTHER.
The uniform-kernel method treats Kt (z) as a multivariate uniform function with density uniformly dis-
tributed over z′Vt⁻¹ z ≤ r². Let kt be the number of training set observations y from group t within the closed
ellipsoid centered at x specified by dt²(x, y) ≤ r². Then the group t density at x is estimated by

ft (x) = kt /(nt vr (t)).


When the identity matrix or the pooled within-group covariance matrix is used in calculating the squared
distance, vr (t) is a constant, independent of group membership. The posterior probability of x belonging to
group t is then given by

p(t | x) = (qt kt /nt ) / Σu (qu ku /nu ).

If the closed ellipsoid centered at x does not include any training set observations, f (x) is zero and x is
classified into group OTHER. When the prior probabilities are equal, p(t | x) is proportional to kt /nt and
x is classified into the group that has the highest proportion of observations in the closed ellipsoid. When
the prior probabilities are proportional to the group sizes, p(t | x) = kt /Σu ku , and x is classified into the group
that has the largest number of observations in the closed ellipsoid.
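A kernel-based run on the fish data might be requested as follows; the normal kernel and the radius R=0.5
are arbitrary illustrative choices.

proc discrim data=fish method=npar kernel=normal r=0.5
             pool=yes crossvalidate;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;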

12.4.3 Nearest Neighbor Method
The nearest-neighbor method fixes the number, k, of training set points for each observation x. The method
finds the radius r_k(x) that is the distance from x to the k-th nearest training set point in the metric V_t^{-1}.
Consider a closed ellipsoid centered at x bounded by {z | (z − x)'V_t^{-1}(z − x) = r_k^2(x)}; the nearest-neighbor
method is equivalent to the uniform-kernel method with a location-dependent radius r_k(x). Note that, with
ties, more than k training set points may be in the ellipsoid.
Using the k-nearest-neighbor rule, the k (or more with ties) smallest distances are saved. Of these
k distances, let k_t represent the number of distances that are associated with group t. Then, as in the
uniform-kernel method, the estimated group t density at x is

    f_t(x) = k_t / (n_t v_k(x))

where v_k(x) is the volume of the ellipsoid bounded by {z | (z − x)'V_t^{-1}(z − x) = r_k^2(x)}. Since the pooled
within-group covariance matrix is used to calculate the distances used in the nearest-neighbor method, the
volume v_k(x) is a constant independent of group membership. When k = 1 is used in the nearest-neighbor
rule, x is classified into the group associated with the point y that yields the smallest squared distance
d_t^2(x, y). Prior probabilities affect nearest-neighbor results in the same way that they affect uniform-kernel
results.
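The same bookkeeping gives a short k-nearest-neighbor sketch. The fragment below pools hypothetical
training data, computes Euclidean distances from x, and counts the group labels among the k nearest points;
with equal priors, x is assigned to the group with the larger count.
PROC IML;
  train = {1 2, 2 1, 1 1, 2 2, 8 9, 9 8, 8 8, 9 9};
  grp   = {1, 1, 1, 1, 2, 2, 2, 2};   /* group labels (assumed)      */
  x = {2 2};
  k = 4;                              /* number of nearest neighbors */
  d = sqrt(((train - repeat(x, nrow(train), 1))##2)[,+]);
  near = grp[loc(rank(d) <= k)];      /* labels of the k nearest     */
  k1 = sum(near = 1);
  k2 = sum(near = 2);
  PRINT k1 k2;                        /* classify into larger count  */
QUIT;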
With a specified squared distance formula (METRIC=, POOL=), the values of r and k determine the
degree of irregularity in the estimate of the density function, and they are called smoothing parameters.
Small values of r or k produce jagged density estimates, and large values of r or k produce smoother density
estimates. Various methods for choosing the smoothing parameters have been suggested, and there is as yet
no simple solution to this problem.
For a fixed kernel shape, one way to choose the smoothing parameter r is to plot estimated densities with
different values of r and to choose the estimate that is most in accordance with the prior information about
the density. For many applications, this approach is satisfactory.
Another way of selecting the smoothing parameter r is to choose a value that optimizes a given criterion.
Different groups may have different sets of optimal values. Assume that the unknown density has bounded
and continuous second derivatives and that the kernel is a symmetric probability density function. One
criterion is to minimize an approximate mean integrated square error of the estimated density (Rosenblatt
1956). The resulting optimal value of r depends on the density function and the kernel. A reasonable choice
for the smoothing parameter r is to optimize the criterion under the assumption that group t has a normal
distribution with covariance matrix V_t. Then, in group t, the resulting optimal value for r is given by

    r = (A(K_t)/n_t)^{1/(p+4)}

where the optimal constant A(K_t) depends on the kernel K_t (Epanechnikov 1969).
These selections of A(K_t) are derived under the assumption that the data in each group are from a
multivariate normal distribution with covariance matrix V_t. However, when Euclidean distances are
used in calculating the squared distance (V_t = I), the smoothing constant should be multiplied by s, where
s is an estimate of the standard deviation for all variables. A reasonable choice for s is

    s = ((1/p) Σ_j s_jj)^{1/2}

where the s_jj are the group t marginal variances.
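As a small numerical illustration, the normal-theory optimal radius is a one-line computation once the
constant A(K_t) is known; the value A = 0.9 below is an assumed placeholder, not one of the tabled constants.
DATA _NULL_;
   A  = 0.9;                  /* kernel constant A(Kt) -- assumed value */
   nt = 50;                   /* group sample size                      */
   p  = 2;                    /* number of variables                    */
   r  = (A/nt)**(1/(p+4));    /* normal-theory optimal radius           */
   PUT 'optimal r = ' r;
RUN;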
The DISCRIM procedure uses only a single smoothing parameter for all groups. However, with the
selection of the matrix to be used in the distance formula (using the METRIC= or POOL= option), individual
groups and variables can have different scalings. When Vt , the matrix used in calculating the squared
distances, is an identity matrix, the kernel estimate on each data point is scaled equally for all variables
in all groups. When Vt is the diagonal matrix of a covariance matrix, each variable in group t is scaled

separately by its variance in the kernel estimation, where the variance can be the pooled variance (Vt = Sp )
or an individual within-group variance (Vt = St ). When Vt is a full covariance matrix, the variables in group
t are scaled simultaneously by Vt in the kernel estimation.
In nearest-neighbor methods, the choice of k is usually relatively uncritical (Hand 1982). A practical
approach is to try several different values of the smoothing parameters within the context of the particular
application and to choose the one that gives the best cross-validated estimate of the error rate.
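One convenient way to carry out that search in SAS is a small macro loop that reruns PROC DISCRIM over
a grid of radii and compares the cross-validated error counts. This is only a sketch: the data set and variable
names anticipate the salmon example of the next section, and the grid of radii is arbitrary.
%MACRO tryR(rlist, n);
   %DO i = 1 %TO &n;
      %LET r = %SCAN(&rlist, &i);
      PROC DISCRIM DATA=salmon METHOD=npar R=&r KERNEL=normal CROSSVALIDATE;
         CLASS locale;
         VAR fresh marine;
      RUN;
   %END;
%MEND tryR;
%tryR(1 2 4 8, 4);   /* compare the four cross-validated error tables */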

12.5 SAS – Examples


In this section several examples are included using PROC DISCRIM.

12.5.1 Salmon Size Example


This example is given in Johnson and Wichern, page 659. The SAS code is;
OPTIONS PAGENO=1;
OPTIONS PAGESIZE=54 LINESIZE=75 NODATE;
TITLE ’Salmon Data Johnson-Wichren page 659’;
DATA salmon;
INPUT locale gender fresh marine;
label locale = ’Alaskan = 1 Canadian=2’
gender = ’female = 1 male = 2’
fresh = ’ growth in freshwater’
marine = ’ growth in marine’;
CARDS;
1 2 108 368
1 1 131 355
1 1 105 469
1 2 86 506
.
.
.
2 1 133 375
2 1 128 383
2 2 123 349
2 1 144 373
2 2 140 388
2 2 150 339
2 2 124 341
2 1 125 346
2 1 153 352
2 1 108 339
;
* Parametric Method;
PROC DISCRIM DATA=salmon short POOL=test posterr CROSSVALIDATE;
CLASS locale;
VAR fresh marine;
RUN;
* Nonparametric Method with Uniform Kernel, r=4;

PROC DISCRIM DATA=salmon short method=npar r=4 posterr CROSSVALIDATE;
CLASS locale;
VAR fresh marine;
RUN;
* Nonparametric Method with Normal Kernel, r=4;
PROC DISCRIM DATA=salmon short method=npar r=4 kernel=normal posterr CROSSVALIDATE;
CLASS locale;
VAR fresh marine;
RUN;
* Nonparametric Method with Epanechnikov Kernel, r=4;
PROC DISCRIM DATA=salmon short method=npar r=4 kernel=epa posterr CROSSVALIDATE;
CLASS locale;
VAR fresh marine;
RUN;
* Nonparametric Method with k=4 Nearest Neighbor;
PROC DISCRIM DATA=salmon short method=npar k=4 posterr CROSSVALIDATE;
CLASS locale;
VAR fresh marine;
RUN;
quit;
The output is;

Parametric Method
Salmon Data Johnson-Wichren page 659

The DISCRIM Procedure

Observations 100 DF Total 99


Variables 2 DF Within Classes 98
Classes 2 DF Between Classes 1

Class Level Information

Variable Prior
locale Name Frequency Weight Proportion Probability

1 _1 50 50.0000 0.500000 0.500000


2 _2 50 50.0000 0.500000 0.500000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Test of Homogeneity of Within Covariance Matrices

Notation: K = Number of Groups

P = Number of Variables

N = Total Number of Observations - Number of Groups

N(i) = Number of Observations in the i’th Group - 1

__ N(i)/2
|| |Within SS Matrix(i)|
V = -----------------------------------
N/2
|Pooled SS Matrix|

_ _ 2
| 1 1 | 2P + 3P - 1
RHO = 1.0 - | SUM ----- - --- | -------------
|_ N(i) N _| 6(P+1)(K-1)

DF = .5(K-1)P(P+1)
_ _
| PN/2 |
| N V |
Under the null hypothesis: -2 RHO ln | ------------------ |
| __ PN(i)/2 |
|_ || N(i) _|

is distributed approximately as Chi-Square(DF).

Chi-Square DF Pr > ChiSq

10.696146 3 0.0135

Since the Chi-Square value is significant at the 0.1 level, the within
covariance matrices will be used in the discriminant function.
Reference: Morrison, D.F. (1976) Multivariate Statistical Methods
p252.

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.SALMON
Resubstitution Summary using Quadratic Discriminant Function

Generalized Squared Distance Function

2 _ -1 _
D (X) = (X-X )’ COV (X-X ) + ln |COV |
j j j j j

Posterior Probability of Membership in Each locale

2 2

Pr(j|X) = exp(-.5 D (X)) / SUM exp(-.5 D (X))
j k k

Number of Observations and Percent Classified into locale

From locale 1 2 Total

1 45 5 50
90.00 10.00 100.00

2 2 48 50
4.00 96.00 100.00

Total 47 53 100
47.00 53.00 100.00

Priors 0.5 0.5

Error Count Estimates for locale

1 2 Total

Rate 0.1000 0.0400 0.0700


Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Results for Calibration Data: WORK.SALMON
Resubstitution Results using Quadratic Discriminant Function

Generalized Squared Distance Function

2 _ -1 _
D (X) = (X-X )’ COV (X-X ) + ln |COV |
j j j j j

Posterior Probability of Membership in Each locale

2 2
Pr(j|X) = exp(-.5 D (X)) / SUM exp(-.5 D (X))
j k k

Number of Observations and Average Posterior


Probabilities Classified into locale

From locale 1 2

1 45 5
0.9599 0.7391

2 2 48
0.7380 0.9266

Total 47 53
0.9505 0.9089

Priors 0.5 0.5

Posterior Probability Error Rate Estimates for locale

Estimate 1 2 Total

Stratified 0.1066 0.0365 0.0716


Unstratified 0.1066 0.0365 0.0716
Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.SALMON
Cross-validation Summary using Quadratic Discriminant Function

Generalized Squared Distance Function

2 _ -1 _
D (X) = (X-X )’ COV (X-X ) + ln |COV |
j (X)j (X)j (X)j (X)j

Posterior Probability of Membership in Each locale

2 2
Pr(j|X) = exp(-.5 D (X)) / SUM exp(-.5 D (X))
j k k

Number of Observations and Percent Classified into locale

From locale 1 2 Total

1 45 5 50
90.00 10.00 100.00

2 3 47 50
6.00 94.00 100.00

Total 48 52 100
48.00 52.00 100.00

Priors 0.5 0.5

Error Count Estimates for locale

1 2 Total

Rate 0.1000 0.0600 0.0800


Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Results for Calibration Data: WORK.SALMON
Cross-validation Results using Quadratic Discriminant Function

Generalized Squared Distance Function

2 _ -1 _
D (X) = (X-X )’ COV (X-X ) + ln |COV |
j (X)j (X)j (X)j (X)j

Posterior Probability of Membership in Each locale

2 2
Pr(j|X) = exp(-.5 D (X)) / SUM exp(-.5 D (X))
j k k

Number of Observations and Average Posterior


Probabilities Classified into locale

From locale 1 2

1 45 5
0.9582 0.7609

2 3 47
0.7070 0.9315

Total 48 52
0.9425 0.9151

Priors 0.5 0.5

Posterior Probability Error Rate Estimates for locale

Estimate 1 2 Total

Stratified 0.0952 0.0483 0.0718


Unstratified 0.0952 0.0483 0.0718
Priors 0.5000 0.5000

Nonparametric Method – Uniform Kernel


Salmon Data Johnson-Wichren page 659
The DISCRIM Procedure

Observations 100 DF Total 99


Variables 2 DF Within Classes 98
Classes 2 DF Between Classes 1

Class Level Information

Variable Prior
locale Name Frequency Weight Proportion Probability

1 _1 50 50.0000 0.500000 0.500000


2 _2 50 50.0000 0.500000 0.500000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.SALMON
Resubstitution Summary using Uniform Kernel Density

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

m (X) = Proportion of obs in group k within the radius R of X


k

Pr(j|X) = m (X) PRIOR / SUM ( m (X) PRIOR )


j j k k k

Number of Observations and Percent Classified into locale

From locale 1 2 Other Total

1 41 7 2 50
82.00 14.00 4.00 100.00

2 2 48 0 50
4.00 96.00 0.00 100.00

Total 43 55 2 100
43.00 55.00 2.00 100.00

Priors 0.5 0.5

Error Count Estimates for locale

1 2 Total

Rate 0.1800 0.0400 0.1100


Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Results for Calibration Data: WORK.SALMON
Resubstitution Results using Uniform Kernel Density

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

m (X) = Proportion of obs in group k within the radius R of X


k

Pr(j|X) = m (X) PRIOR / SUM ( m (X) PRIOR )


j j k k k

Number of Observations and Average Posterior


Probabilities Classified into locale

From locale 1 2

1 41 7
0.6263 0.5160

2 2 48
0.5214 0.5966

Total 43 55
0.6214 0.5864

Priors 0.5 0.5

Posterior Probability Error Rate Estimates for locale

Estimate 1 2 Total

Stratified 0.4656 0.3550 0.4103


Unstratified 0.4656 0.3550 0.4103
Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.SALMON
Cross-validation Summary using Uniform Kernel Density

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

m (X) = Proportion of obs in group k within the radius R of X


k

Pr(j|X) = m (X) PRIOR / SUM ( m (X) PRIOR )


j j k k k

Number of Observations and Percent Classified into locale

From locale 1 2 Other Total

1 41 7 2 50
82.00 14.00 4.00 100.00

2 2 48 0 50
4.00 96.00 0.00 100.00

Total 43 55 2 100
43.00 55.00 2.00 100.00

Priors 0.5 0.5

Error Count Estimates for locale

1 2 Total

Rate 0.1800 0.0400 0.1100


Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Results for Calibration Data: WORK.SALMON
Cross-validation Results using Uniform Kernel Density

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

m (X) = Proportion of obs in group k within the radius R of X


k

Pr(j|X) = m (X) PRIOR / SUM ( m (X) PRIOR )


j j k k k

Number of Observations and Average Posterior


Probabilities Classified into locale

From locale 1 2

1 41 7
0.6263 0.5164

2 2 48
0.5220 0.5966

Total 43 55
0.6214 0.5864

Priors 0.5 0.5

Posterior Probability Error Rate Estimates for locale

Estimate 1 2 Total

Stratified 0.4656 0.3550 0.4103


Unstratified 0.4656 0.3550 0.4103
Priors 0.5000 0.5000

Nonparametric Method – Normal Kernel

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure

Observations 100 DF Total 99


Variables 2 DF Within Classes 98
Classes 2 DF Between Classes 1

Class Level Information

Variable Prior
locale Name Frequency Weight Proportion Probability

1 _1 50 50.0000 0.500000 0.500000


2 _2 50 50.0000 0.500000 0.500000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.SALMON
Resubstitution Summary using Normal Kernel Density

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

-1 2 2
F(X|j) = n SUM exp( -.5 D (X,Y ) / R )
j i ji

Pr(j|X) = PRIOR F(X|j) / SUM PRIOR F(X|k)


j k k

Number of Observations and Percent Classified into locale

From locale 1 2 Total

1 44 6 50
88.00 12.00 100.00

2 1 49 50
2.00 98.00 100.00

Total 45 55 100

45.00 55.00 100.00

Priors 0.5 0.5

Error Count Estimates for locale

1 2 Total

Rate 0.1200 0.0200 0.0700


Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Results for Calibration Data: WORK.SALMON
Resubstitution Results using Normal Kernel Density

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

-1 2 2
F(X|j) = n SUM exp( -.5 D (X,Y ) / R )
j i ji

Pr(j|X) = PRIOR F(X|j) / SUM PRIOR F(X|k)


j k k

Number of Observations and Average Posterior


Probabilities Classified into locale

From locale 1 2

1 44 6
0.5718 0.5232

2 1 49
0.5414 0.5629

Total 45 55
0.5711 0.5586

Priors 0.5 0.5

Posterior Probability Error Rate Estimates for locale

Estimate 1 2 Total

Stratified 0.4860 0.3856 0.4358


Unstratified 0.4860 0.3856 0.4358
Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.SALMON
Cross-validation Summary using Normal Kernel Density

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

-1 2 2
F(X|j) = n SUM exp( -.5 D (X,Y ) / R )
j i ji

Pr(j|X) = PRIOR F(X|j) / SUM PRIOR F(X|k)


j k k

Number of Observations and Percent Classified into locale

From locale 1 2 Total

1 44 6 50
88.00 12.00 100.00

2 1 49 50
2.00 98.00 100.00

Total 45 55 100
45.00 55.00 100.00

Priors 0.5 0.5

Error Count Estimates for locale

1 2 Total

Rate 0.1200 0.0200 0.0700

Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Results for Calibration Data: WORK.SALMON
Cross-validation Results using Normal Kernel Density

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

-1 2 2
F(X|j) = n SUM exp( -.5 D (X,Y ) / R )
j i ji

Pr(j|X) = PRIOR F(X|j) / SUM PRIOR F(X|k)


j k k

Number of Observations and Average Posterior


Probabilities Classified into locale

From locale 1 2

1 44 6
0.5712 0.5243

2 1 49
0.5431 0.5623

Total 45 55
0.5706 0.5582

Priors 0.5 0.5

Posterior Probability Error Rate Estimates for locale

Estimate 1 2 Total

Stratified 0.4865 0.3860 0.4363


Unstratified 0.4865 0.3860 0.4363
Priors 0.5000 0.5000

Nonparametric Method – Epanechnikov Kernel

Salmon Data Johnson-Wichren page 659
The DISCRIM Procedure

Observations 100 DF Total 99


Variables 2 DF Within Classes 98
Classes 2 DF Between Classes 1

Class Level Information

Variable Prior
locale Name Frequency Weight Proportion Probability

1 _1 50 50.0000 0.500000 0.500000


2 _2 50 50.0000 0.500000 0.500000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.SALMON
Resubstitution Summary using Epanechnikov Kernel Density

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

-1 2 2
F(X|j) = n SUM ( 1.0 - D (X,Y ) / R )
j i ji

Pr(j|X) = PRIOR F(X|j) / SUM PRIOR F(X|k)


j k k

Number of Observations and Percent Classified into locale

From locale 1 2 Total

1 44 6 50
88.00 12.00 100.00

2 1 49 50
2.00 98.00 100.00

Total 45 55 100
45.00 55.00 100.00

Priors 0.5 0.5

Error Count Estimates for locale

1 2 Total

Rate 0.1200 0.0200 0.0700


Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Results for Calibration Data: WORK.SALMON
Resubstitution Results using Epanechnikov Kernel Density

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

-1 2 2
F(X|j) = n SUM ( 1.0 - D (X,Y ) / R )
j i ji

Pr(j|X) = PRIOR F(X|j) / SUM PRIOR F(X|k)


j k k

Number of Observations and Average Posterior


Probabilities Classified into locale

From locale 1 2

1 44 6
0.7448 0.5691

2 1 49
0.6350 0.6983

Total 45 55
0.7423 0.6842

Priors 0.5 0.5

Posterior Probability Error Rate Estimates for locale

Estimate 1 2 Total

Stratified 0.3319 0.2473 0.2896


Unstratified 0.3319 0.2473 0.2896
Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.SALMON
Cross-validation Summary using Epanechnikov Kernel Density

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

-1 2 2
F(X|j) = n SUM ( 1.0 - D (X,Y ) / R )
j i ji

Pr(j|X) = PRIOR F(X|j) / SUM PRIOR F(X|k)


j k k

Number of Observations and Percent Classified into locale

From locale 1 2 Total

1 44 6 50
88.00 12.00 100.00

2 1 49 50
2.00 98.00 100.00

Total 45 55 100
45.00 55.00 100.00

Priors 0.5 0.5

Error Count Estimates for locale

1 2 Total

Rate 0.1200 0.0200 0.0700


Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659
The DISCRIM Procedure
Classification Results for Calibration Data: WORK.SALMON
Cross-validation Results using Epanechnikov Kernel Density

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

-1 2 2
F(X|j) = n SUM ( 1.0 - D (X,Y ) / R )
j i ji

Pr(j|X) = PRIOR F(X|j) / SUM PRIOR F(X|k)


j k k

Number of Observations and Average Posterior


Probabilities Classified into locale

From locale 1 2

1 44 6
0.7438 0.5725

2 1 49
0.6414 0.6971

Total 45 55
0.7415 0.6835

Priors 0.5 0.5

Posterior Probability Error Rate Estimates for locale

Estimate 1 2 Total

Stratified 0.3326 0.2481 0.2904


Unstratified 0.3326 0.2481 0.2904
Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659

Nonparametric Method – Nearest Neighbor

The DISCRIM Procedure

Observations 100 DF Total 99


Variables 2 DF Within Classes 98
Classes 2 DF Between Classes 1

Class Level Information

Variable Prior
locale Name Frequency Weight Proportion Probability

1 _1 50 50.0000 0.500000 0.500000


2 _2 50 50.0000 0.500000 0.500000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.SALMON
Resubstitution Summary using 4 Nearest Neighbors

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

m (X) = Proportion of obs in group k in 4


k nearest neighbors of X

Pr(j|X) = m (X) PRIOR / SUM ( m (X) PRIOR )


j j k k k

Number of Observations and Percent Classified into locale

From locale 1 2 Other Total

1 44 1 5 50
88.00 2.00 10.00 100.00

2 2 46 2 50
4.00 92.00 4.00 100.00

Total 46 47 7 100
46.00 47.00 7.00 100.00

Priors 0.5 0.5

Error Count Estimates for locale

1 2 Total

Rate 0.1200 0.0800 0.1000


Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Results for Calibration Data: WORK.SALMON
Resubstitution Results using 4 Nearest Neighbors

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

m (X) = Proportion of obs in group k in 4


k nearest neighbors of X

Pr(j|X) = m (X) PRIOR / SUM ( m (X) PRIOR )


j j k k k

Number of Observations and Average Posterior


Probabilities Classified into locale

From locale 1 2

1 44 1
0.9909 0.7500

2 2 46
0.6750 0.9674

Total 46 47
0.9772 0.9628

Priors 0.5 0.5

Posterior Probability Error Rate Estimates for locale

Estimate 1 2 Total

Stratified 0.1010 0.0950 0.0980

Unstratified 0.1010 0.0950 0.0980
Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.SALMON
Cross-validation Summary using 4 Nearest Neighbors

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

m (X) = Proportion of obs in group k in 4


k nearest neighbors of X

Pr(j|X) = m (X) PRIOR / SUM ( m (X) PRIOR )


j j k k k

Number of Observations and Percent Classified into locale

From locale 1 2 Total

1 46 4 50
92.00 8.00 100.00

2 2 48 50
4.00 96.00 100.00

Total 48 52 100
48.00 52.00 100.00

Priors 0.5 0.5

Error Count Estimates for locale

1 2 Total

Rate 0.0800 0.0400 0.0600


Priors 0.5000 0.5000

Salmon Data Johnson-Wichren page 659


The DISCRIM Procedure
Classification Results for Calibration Data: WORK.SALMON
Cross-validation Results using 4 Nearest Neighbors

Squared Distance Function

2 -1
D (X,Y) = (X-Y)’ COV (X-Y)

Posterior Probability of Membership in Each locale

m (X) = Proportion of obs in group k in 4


k nearest neighbors of X

Pr(j|X) = m (X) PRIOR / SUM ( m (X) PRIOR )


j j k k k

Number of Observations and Average Posterior


Probabilities Classified into locale

From locale 1 2

1 46 4
0.9517 0.8096

2 2 48
0.8731 0.9300

Total 48 52
0.9484 0.9207

Priors 0.5 0.5

Posterior Probability Error Rate Estimates for locale

Estimate 1 2 Total

Stratified 0.0895 0.0424 0.0660


Unstratified 0.0895 0.0424 0.0660
Priors 0.5000 0.5000

12.6 Classification Trees


Another classification procedure, first suggested by Breiman, is called classification trees. The procedure
is very easy to use and interpret, although it is not easy to understand how the trees are
actually formed. Since the subject is not one that is normally covered in a multivariate analysis course, I
will limit my presentation to a single example.
Using the Salmon data, the classification tree produced by Splus is shown below;

The partition induced by the classification tree is;

Chapter 13

Cluster Analysis and Multidimensional Scaling

The basic objective of cluster analysis is to discover natural groupings of items or variables within a data set.
These groupings are based upon how similar or dissimilar objects are according to some measure that
quantifies these similarities and dissimilarities.

13.1 Similarity Measures


Perhaps the most common dissimilarity measure for two vectors x and y is a metric ∆ that maps R^d × R^d
into R^1 and satisfies the following:
1. ∆(x, y) ≥ 0 for all x, y ∈ R^d.
2. ∆(x, y) = 0 if and only if x = y.
3. ∆(x, y) = ∆(y, x) for all x, y ∈ R^d.
4. ∆(x, y) ≤ ∆(x, z) + ∆(z, y) for all x, y, z ∈ R^d.
A number of measures have been suggested. Some include;
• The "Lp-norm", or Minkowski metric,

    ∆_p(x, y) = ( Σ_{j=1}^d |x_j − y_j|^p )^{1/p}.

• The most common of these metrics are;

1. p = 1, the city-block metric.

2. p = 2, the L2 or Euclidean norm given by

    ∆_2(x, y) = ( Σ_{j=1}^d |x_j − y_j|^2 )^{1/2}.

3. p = ∞, the sup norm given by

    ∆_∞(x, y) = sup_{1≤j≤d} |x_j − y_j|.

• The Mahalanobis distance,

    ∆_M(x, y) = ( (x − y)'S^{-1}(x − y) )^{1/2}

where S = Σ_j (x_j − x̄)(x_j − x̄)'/(n − 1) is the sample covariance matrix.

A number of other metrics have been suggested and are available to the user of most statistics packages.
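These metrics are simple to compute directly. The PROC IML sketch below evaluates the city-block,
Euclidean, sup-norm, and Mahalanobis distances for two hypothetical vectors, with an assumed covariance
matrix S.
PROC IML;
   x = {1, 2};
   y = {4, 6};
   S = {2 1, 1 3};                           /* assumed covariance matrix */
   d1   = sum(abs(x - y));                   /* city block (p = 1)        */
   d2   = sqrt(ssq(x - y));                  /* Euclidean  (p = 2)        */
   dinf = max(abs(x - y));                   /* sup norm   (p = infinity) */
   dM   = sqrt((x - y)` * inv(S) * (x - y)); /* Mahalanobis distance      */
   PRINT d1 d2 dinf dM;
QUIT;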

13.2 Clustering Methods


There are essentially two approaches for finding clusters: nonhierarchical and hierarchical methods.
Nonhierarchical methods are sometimes referred to as disjoint clustering.

13.2.1 Nonhierarchical Methods

These methods begin the search for clusters by finding an initial set of starting points (seed clusters) and then
build clusters around them by choosing points that are close in some sense. Once a cluster becomes too
large, it is split into smaller clusters or combined with other clusters. PROC FASTCLUS is such
a procedure, and it is appropriate when one has a very large data set and does not wish to specify the number
or location of the clusters in advance.

PROC FASTCLUS – Overview


The FASTCLUS procedure performs a disjoint cluster analysis on the basis of distances computed from
one or more quantitative variables. The observations are divided into clusters such that every observation
belongs to one and only one cluster; the clusters do not form a tree structure as they do in the CLUSTER
procedure. If you want separate analyses for different numbers of clusters, you can run PROC FASTCLUS
once for each analysis. Alternatively, to do hierarchical clustering on a large data set, use PROC FASTCLUS
to find initial clusters, then use those initial clusters as input to PROC CLUSTER.
By default, the FASTCLUS procedure uses Euclidean distances, so the cluster centers are based on
least-squares estimation. This kind of clustering method is often called a k-means model, since the cluster
centers are the means of the observations assigned to each cluster when the algorithm is run to complete
convergence. Each iteration reduces the least-squares criterion until convergence is achieved.
Often there is no need to run the FASTCLUS procedure to convergence. PROC FASTCLUS is designed
to find good clusters (but not necessarily the best possible clusters) with only two or three passes over the
data set. The initialization method of PROC FASTCLUS guarantees that, if there exist clusters such that
all distances between observations in the same cluster are less than all distances between observations in
different clusters, and if you tell PROC FASTCLUS the correct number of clusters to find, it can always find
such a clustering without iterating. Even with clusters that are not as well separated, PROC FASTCLUS
usually finds initial seeds that are sufficiently good so that few iterations are required. Hence, by default,
PROC FASTCLUS performs only one iteration.
The initialization method used by the FASTCLUS procedure makes it sensitive to outliers. PROC
FASTCLUS can be an effective procedure for detecting outliers because outliers often appear as clusters
with only one member.
The FASTCLUS procedure can use an Lp (least pth powers) clustering criterion (Spath 1985, pp. 62
-63) instead of the least-squares (L2) criterion used in k-means clustering methods. The LEAST=p option
specifies the power p to be used. Using the LEAST= option increases execution time since more iterations

are usually required, and the default iteration limit is increased when you specify LEAST=p. Values of p
less than 2 reduce the effect of outliers on the cluster centers compared with least-squares methods; values
of p greater than 2 increase the effect of outliers.
The FASTCLUS procedure is intended for use with large data sets, with 100 or more observations. With
small data sets, the results may be highly sensitive to the order of the observations in the data set.
PROC FASTCLUS produces brief summaries of the clusters it finds. For more extensive examination of
the clusters, you can request an output data set containing a cluster membership variable.
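As a minimal sketch of that workflow (the data set and variable names below are assumed), the OUT= data
set carries a CLUSTER variable that can then be tabulated or examined directly:
PROC FASTCLUS DATA=mydata MAXCLUSTERS=3 MAXITER=10 OUT=clusout;
   VAR x1 x2 x3;              /* quantitative clustering variables */
RUN;
PROC FREQ DATA=clusout;       /* cluster sizes                     */
   TABLES cluster;
RUN;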

Background
The FASTCLUS procedure combines an effective method for finding initial clusters with a standard iterative
algorithm for minimizing the sum of squared distances from the cluster means. The result is an efficient
procedure for disjoint clustering of large data sets. PROC FASTCLUS was directly inspired by Hartigan’s
(1975) leader algorithm and MacQueen’s (1967) k-means algorithm. PROC FASTCLUS uses a method that
Anderberg (1973) calls nearest centroid sorting. A set of points called cluster seeds is selected as a first guess
of the means of the clusters. Each observation is assigned to the nearest seed to form temporary clusters. The
seeds are then replaced by the means of the temporary clusters, and the process is repeated until no further
changes occur in the clusters. Similar techniques are described in most references on clustering (Anderberg
1973; Hartigan 1975; Everitt 1980; Spath 1980).
The FASTCLUS procedure differs from other nearest centroid sorting methods in the way the initial
cluster seeds are selected. The importance of initial seed selection is demonstrated by Milligan (1980).
The clustering is done on the basis of Euclidean distances computed from one or more numeric variables.
If there are missing values, PROC FASTCLUS computes an adjusted distance using the nonmissing values.
Observations that are very close to each other are usually assigned to the same cluster, while observations
that are far apart are in different clusters.
The FASTCLUS procedure operates in four steps:
1. Observations called cluster seeds are selected.
2. If you specify the DRIFT option, temporary clusters are formed by assigning each observation to the
cluster with the nearest seed. Each time an observation is assigned, the cluster seed is updated as
the current mean of the cluster. This method is sometimes called incremental, on-line, or adaptive
training.
3. If the maximum number of iterations is greater than zero, clusters are formed by assigning each obser-
vation to the nearest seed. After all observations are assigned, the cluster seeds are replaced by either
the cluster means or other location estimates (cluster centers) appropriate to the LEAST=p option.
This step can be repeated until the changes in the cluster seeds become small or zero (MAXITER=).
4. Final clusters are formed by assigning each observation to the nearest seed.

If PROC FASTCLUS runs to complete convergence, the final cluster seeds will equal the cluster means
or cluster centers. If PROC FASTCLUS terminates before complete convergence, which often happens with
the default settings, the final cluster seeds may not equal the cluster means or cluster centers. If you want
complete convergence, specify CONVERGE=0 and a large value for the MAXITER= option.
The initial cluster seeds must be observations with no missing values. You can specify the maximum
number of seeds (and, hence, clusters) using the MAXCLUSTERS= option. You can also specify a minimum
distance by which the seeds must be separated using the RADIUS= option.
PROC FASTCLUS always selects the first complete (no missing values) observation as the first seed.
The next complete observation that is separated from the first seed by at least the distance specified in
the RADIUS= option becomes the second seed. Later observations are selected as new seeds if they are

separated from all previous seeds by at least the radius, as long as the maximum number of seeds is not
exceeded.
If an observation is complete but fails to qualify as a new seed, PROC FASTCLUS considers using it to
replace one of the old seeds. Two tests are made to see if the observation can qualify as a new seed.
First, an old seed is replaced if the distance between the observation and the closest seed is greater than
the minimum distance between seeds. The seed that is replaced is selected from the two seeds that are
closest to each other. The seed that is replaced is the one of these two with the shortest distance to the
closest of the remaining seeds when the other seed is replaced by the current observation.
If the observation fails the first test for seed replacement, a second test is made. The observation replaces
the nearest seed if the smallest distance from the observation to all seeds other than the nearest one is greater
than the shortest distance from the nearest seed to all other seeds. If the observation fails this test, PROC
FASTCLUS goes on to the next observation.
You can specify the REPLACE= option to limit seed replacement. You can omit the second test for seed
replacement (REPLACE=PART), causing PROC FASTCLUS to run faster, but the seeds selected may not
be as widely separated as those obtained by the default method. You can also suppress seed replacement
entirely by specifying REPLACE=NONE. In this case, PROC FASTCLUS runs much faster, but you must
choose a good value for the RADIUS= option in order to get good clusters. This method is similar to
Hartigan’s (1975, pp. 74 -78) leader algorithm and the simple cluster seeking algorithm described by Tou
and Gonzalez (1974, pp. 90 -92).

13.3 Hierarchical Clustering

Agglomerative hierarchical methods start with as many clusters as there are objects; the objects are then
merged into fewer clusters according to similarity measures. That is, objects which are close together are
combined. Divisive hierarchical methods work in the opposite direction: all the objects start in a
single cluster and are then divided according to how far apart the objects are. The divisions and mergers
can be presented or displayed in a two-dimensional diagram known as a dendrogram. Efficient methods
for carrying out these operations are called linkage methods. Several methods will be discussed. The first is
called single linkage (minimum distance or nearest neighbor), the second is complete linkage (maximum
distance or farthest neighbor), and the third is called average linkage (average distance).

13.3.1 Agglomerative Linkage


The steps are;
1. Start with n clusters (one for each object) and a corresponding n × n distance matrix D.
2. Search the distance matrix for the nearest pair of objects. Let the distance between them be d_UV.
3. Merge clusters U and V, update the distances from this cluster to the remaining objects, and select the
pair that is nearest. Repeat this step n − 1 times (until all n objects are in a single cluster). Record
which objects are merged at each step.
The distance from the merged cluster UV to a remaining cluster W is defined as
• Single Linkage – d_(UV)W = min{d_UW, d_VW}.
• Complete Linkage – d_(UV)W = max{d_UW, d_VW}.
• Average Linkage –

    d_(UV)W = ( Σ_i Σ_k d_ik ) / ( N_(UV) N_W ),

where d_ik is the distance between object i in cluster UV and object k in cluster W, and N_(UV) and
N_W are the number of objects in the respective clusters.
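The three update rules in step 3 are each a single line. A hypothetical IML fragment for one merge, with
assumed current distances and cluster sizes:
PROC IML;
   dUW = 3;  dVW = 5;          /* distances from U and V to W       */
   nUV = 4;  nW  = 2;          /* cluster sizes after merging U, V  */
   sumD = 26;                  /* sum of all pairwise distances     */
                               /* between items in UV and in W      */
   dSingle   = min(dUW, dVW);  /* single linkage                    */
   dComplete = max(dUW, dVW);  /* complete linkage                  */
   dAverage  = sumD/(nUV*nW);  /* average linkage                   */
   PRINT dSingle dComplete dAverage;
QUIT;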

13.3.2 Ward's Hierarchical Clustering Method

Another hierarchical method, proposed by Ward, is based upon minimizing the 'loss of information' from
joining two groups into a single cluster. The loss of information is taken as the increase in an error sum of
squares criterion. First, for a given cluster k, let ESS_k be the sum of squared deviations of every item in the
cluster from the cluster mean. Suppose that there are K clusters and define ESS = Σ_{k=1}^K ESS_k. Then
at each step in the process, the union of every possible pair of clusters is considered, and the two clusters
whose combination yields the smallest increase in ESS are joined, and the next step begins. Initially, each
item is a separate cluster.
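The ESS_k term in Ward's criterion is just a within-cluster sum of squares. A small IML sketch for one
hypothetical cluster:
PROC IML;
   X = {1 2, 2 1, 2 3, 3 2};               /* items in one cluster (rows) */
   m = X[:,];                              /* cluster mean vector         */
   ESSk = ssq(X - repeat(m, nrow(X), 1));  /* sum of squared deviations   */
   PRINT ESSk;                             /* Ward joins the pair whose   */
                                           /* merge increases ESS least   */
QUIT;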

SAS – PROC CLUSTER


SAS implements the above procedures in PROC CLUSTER, in addition to many other methods which are
beyond the scope of this course. I have included a brief overview of PROC CLUSTER.

Overview

The CLUSTER procedure hierarchically clusters the observations in a SAS data set using one of eleven
methods. The CLUSTER procedure finds hierarchical clusters of the observations in a SAS data set. The
data can be coordinates or distances. If the data are coordinates, PROC CLUSTER computes (possibly
squared) Euclidean distances. If you want to perform a cluster analysis on non-Euclidean distance data, it
is possible to do so by using a TYPE=DISTANCE data set as input. The DISTANCE macro in the SAS/STAT
sample library can compute many kinds of distance matrices.
One situation where analyzing non-Euclidean distance data can be useful is when you have categori-
cal data, where the distance data are calculated using an association measure. For more information, see
Example 23.5. The clustering methods available are average linkage, the centroid method, complete link-
age, density linkage (including Wong’s hybrid and kth-nearest-neighbor methods), maximum likelihood for
mixtures of spherical multivariate normal distributions with equal variances but possibly unequal mixing
proportions, the flexible-beta method, McQuitty’s similarity analysis, the median method, single linkage,
two-stage density linkage, and Ward’s minimum-variance method. All methods are based on the usual ag-
glomerative hierarchical clustering procedure. Each observation begins in a cluster by itself. The two closest
clusters are merged to form a new cluster that replaces the two old clusters. Merging of the two closest
clusters is repeated until only one cluster is left. The various clustering methods differ in how the distance
between two clusters is computed. Each method is described in the section ”Clustering Methods”.
The CLUSTER procedure is not practical for very large data sets because, with most methods, the
CPU time varies as the square or cube of the number of observations. The FASTCLUS procedure requires
time proportional to the number of observations and can, therefore, be used with much larger data sets than
PROC CLUSTER. If you want to cluster a very large data set hierarchically, you can use PROC FASTCLUS
for a preliminary cluster analysis producing a large number of clusters and then use PROC CLUSTER to
cluster the preliminary clusters hierarchically. This method is used to find clusters for the Fisher Iris data
in Example 23.3, later in this chapter.
PROC CLUSTER displays a history of the clustering process, giving statistics useful for estimating the
number of clusters in the population from which the data are sampled. PROC CLUSTER also creates an
output data set that can be used by the TREE procedure to draw a tree diagram of the cluster hierarchy or
to output the cluster membership at any desired level. For example, to obtain the six-cluster solution, you
could first use PROC CLUSTER with the OUTTREE= option then use this output data set as the input
data set to the TREE procedure. With PROC TREE, specify NCLUSTERS=6 and the OUT= options to

obtain the six-cluster solution and draw a tree diagram. For an example, see Example 66.1 in Chapter 66,
”The TREE Procedure.”
Before you perform a cluster analysis on coordinate data, it is necessary to consider scaling or transforming
the variables since variables with large variances tend to have more effect on the resulting clusters than
those with small variances. The ACECLUS procedure is useful for performing linear transformations of the
variables. You can also use the PRINCOMP procedure with the STD option, although in some cases it
tends to obscure clusters or magnify the effect of error in the data when all components are retained. The
STD option in the CLUSTER procedure standardizes the variables to mean 0 and standard deviation 1.
Standardization is not always appropriate. See Milligan and Cooper (1987) for a Monte Carlo study on
various methods of variable standardization. You should remove outliers before using PROC PRINCOMP
or before using PROC CLUSTER with the STD option unless you specify the TRIM= option.
Nonlinear transformations of the variables may change the number of population clusters and should,
therefore, be approached with caution. For most applications, the variables should be transformed so that
equal differences are of equal practical importance. An interval scale of measurement is required if raw data
are used as input. Ordinal or ranked data are generally not appropriate.
Agglomerative hierarchical clustering is discussed in all standard references on cluster analysis, for ex-
ample, Anderberg (1973), Sneath and Sokal (1973), Hartigan (1975), Everitt (1980), and Spath (1980). An
especially good introduction is given by Massart and Kaufman (1983). Anyone considering doing a hierar-
chical cluster analysis should study the Monte Carlo results of Milligan (1980), Milligan and Cooper (1985),
and Cooper and Milligan (1988). Other essential, though more advanced, references on hierarchical clus-
tering include Hartigan (1977, pp. 60 -68; 1981), Wong (1982), Wong and Schaack (1982), and Wong and
Lane (1983). Refer to Blashfield and Aldenderfer (1978) for a discussion of the confusing terminology in
hierarchical cluster analysis.

13.4 Example
In this section I have included the output for SAS PROC FASTCLUS and PROC CLUSTER. The example
is the U.S. Navy officer data that I used in earlier examples. The SAS code is;
*options nodate ps=60 PAGENO=1 LINESIZE=75;
/*dm ’log;clear;out;clear;’;
*/
TITLE ’U.S. NAVY BACHELOR OFFICERS’’ QUARTERS’;
DATA USNAVY;
INPUT SITE 1-2 ADO MAC WHR CUA WNGS OBC RMS MMH;
LOGADO=LOG(ADO);
LOGMAC=LOG(MAC);
LABEL ADO = ’AVG DAILY OCCUPANCY’
MAC = ’AVG NUMBER OF CHECK-INS PER MO.’
WHR = ’WEEKLY HRS OF SERVICE DESK OPERATION’
CUA = ’SQ FT OF COMMON USE AREA’
WNGS= ’NUMBER OF BUILDING WINGS’
OBC = ’OPERATIONAL BERTHING CAPACITY’
RMS = ’NUMBER OF ROOMS’
MMH = ’MONTHLY MAN-HOURS’
LOGADO = ’LOG OCCUPANCY’
LOGMAC = ’LOG CHK-INS’;
CARDS;
1 2 4 4 1.26 1 6 6 180.23

2 3 1.58 40 1.25 1 5 5 182.61
3 16.6 23.78 40 1 1 13 13 164.38
4 7 2.37 168 1 1 7 8 284.55
5 5.3 1.67 42.5 7.79 3 25 25 199.92
6 16.5 8.25 168 1.12 2 19 19 267.38
7 25.89 3.00 40 0 3 36 36 999.09
8 44.42 159.75 168 .6 18 48 48 1103.24
9 39.63 50.86 40 27.37 10 77 77 944.21
10 31.92 40.08 168 5.52 6 47 47 931.84
11 97.33 255.08 168 19 6 165 130 2268.06
12 56.63 373.42 168 6.03 4 36 37 1489.5
13 96.67 206.67 168 17.86 14 120 120 1891.7
14 54.58 207.08 168 7.77 6 66 66 1387.82
15 113.88 981 168 24.48 6 166 179 3559.92
16 149.58 233.83 168 31.07 14 185 202 3115.29
17 134.32 145.82 168 25.99 12 192 192 2227.76
18 188.74 937.00 168 45.44 26 237 237 4804.24
19 110.24 410 168 20.05 12 115 115 2628.32
20 96.83 677.33 168 20.31 10 302 210 1880.84
21 102.33 288.83 168 21.01 14 131 131 3036.63
22 274.92 695.25 168 46.63 58 363 363 5539.98
23 811.08 714.33 168 22.76 17 242 242 3534.49
24 384.50 1473.66 168 7.36 24 540 453 8266.77
25 95 368 168 30.26 9 292 196 1845.89
*goptions device=ps2ega;
/*
PROC G3D data=usnavy;
SCATTER LOGADO*LOGMAC=MMH;
TITLE2 ’3-D Plot’;
run;
*/
*proc princomp;var ADO MAC WHR CUA WNGS OBC RMS; run;
proc fastclus data=usnavy maxc=2 maxiter=10 out=clus;
var ADO MAC WHR CUA WNGS OBC RMS MMH;
run;

proc fastclus data=usnavy maxc=3 maxiter=10 out=clus;


var ADO MAC WHR CUA WNGS OBC RMS MMH;
run;

proc cluster data=usnavy outtree=Tree method=single


ccc pseudo print=15;
var ADO MAC WHR CUA WNGS OBC RMS MMH;
id site;
run;
goptions vsize=8in htext=1pct htitle=2.5pct;
axis1 order=(0 to 1 by 0.2);
proc tree data=Tree out=New nclusters=3
graphics haxis=axis1 horizontal;

* height _rsq_;
copy ADO MAC WHR CUA WNGS OBC RMS MMH;
id site;
run;
The SAS output is;
U.S. NAVY BACHELOR OFFICERS’ QUARTERS
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02

Initial Seeds

Cluster ADO MAC WHR CUA

1 2.000000 4.000000 4.000000 1.260000


2 384.500000 1473.660000 168.000000 7.360000

Initial Seeds

Cluster WNGS OBC RMS MMH

1 1.000000 6.000000 6.000000 180.230000


2 24.000000 540.000000 453.000000 8266.770000

Minimum Distance Between Initial Seeds = 8258.981

Iteration History

Relative Change
in Cluster Seeds
Iteration Criterion 1 2

1 684.1 0.1702 0.2567


2 428.1 0 0

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 428.1

Cluster Summary

Maximum Distance
RMS Std from Seed Radius Nearest Distance Between
Cluster Frequency Deviation to Observation Exceeded Cluster Cluster Centroids

1 22 419.7 2172.0 2 4739.9

2 3 664.4 2120.3 1 4739.9

Statistics for Variables

Variable Total STD Within STD R-Square RSQ/(1-RSQ)

ADO 169.80118 161.49838 0.133095 0.153529


MAC 382.80452 281.57259 0.481507 0.928665

U.S. NAVY BACHELOR OFFICERS’ QUARTERS

The FASTCLUS Procedure


Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02

Statistics for Variables

Variable Total STD Within STD R-Square RSQ/(1-RSQ)

WHR 58.62528 58.60040 0.042480 0.044365


CUA 13.95082 12.57287 0.221631 0.284737
WNGS 12.04270 7.71887 0.606290 1.539939
OBC 134.15041 100.27587 0.464542 0.867560
RMS 116.49375 81.70549 0.528574 1.121222
MMH 1946 1212 0.628616 1.692628
OVER-ALL 706.98292 446.32832 0.618049 1.618137

Pseudo F Statistic = 37.22

Approximate Expected Over-All R-Squared = 0.73361

Cubic Clustering Criterion = -1.845

WARNING: The two values above are invalid for correlated variables.

Cluster Means

Cluster ADO MAC WHR CUA

1 95.942273 234.396818 131.568182 13.340909


2 282.720000 1035.303333 168.000000 33.143333

Cluster Means

Cluster WNGS OBC RMS MMH

1 7.727273 104.318182 95.636364 1551.075909
2 36.000000 380.000000 351.000000 6203.663333

Cluster Standard Deviations

Cluster ADO MAC WHR CUA

1 166.279665 267.800162 61.327438 11.207760


2 98.112815 398.407067 0.000000 22.336948

U.S. NAVY BACHELOR OFFICERS’ QUARTERS

The FASTCLUS Procedure


Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02

Cluster Standard Deviations

Cluster WNGS OBC RMS MMH

1 5.530709 93.842008 78.679304 1136.137709


2 19.078784 152.213666 108.498848 1824.180686

U.S. NAVY BACHELOR OFFICERS’ QUARTERS


The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02

Initial Seeds

Cluster ADO MAC WHR CUA

1 2.000000 4.000000 4.000000 1.260000


2 274.920000 695.250000 168.000000 46.630000
3 384.500000 1473.660000 168.000000 7.360000

Initial Seeds

Cluster WNGS OBC RMS MMH

1 1.000000 6.000000 6.000000 180.230000


2 58.000000 363.000000 363.000000 5539.980000
3 24.000000 540.000000 453.000000 8266.770000

Minimum Distance Between Initial Seeds = 2845.249

Iteration History

Relative Change in Cluster Seeds
Iteration Criterion 1 2 3

1 513.6 0.3538 0.5700 0


2 302.6 0.0308 0.0674 0
3 299.4 0 0 0

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 299.4

Cluster Summary

Maximum Distance
RMS Std from Seed Radius Nearest Distance Between
Cluster Frequency Deviation to Observation Exceeded Cluster Cluster Centroids

1 17 284.1 1205.2 2 2725.5


2 7 397.7 1810.5 1 2725.5
3 1 . 0 2 4623.8

U.S. NAVY BACHELOR OFFICERS’ QUARTERS


The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02

Statistics for Variables

Variable Total STD Within STD R-Square RSQ/(1-RSQ)

ADO 169.80118 137.60145 0.398028 0.661208


MAC 382.80452 224.33367 0.685192 2.176536
WHR 58.62528 56.56608 0.146598 0.171780
CUA 13.95082 10.91767 0.438601 0.781265
WNGS 12.04270 10.08043 0.357725 0.556965
OBC 134.15041 93.36186 0.556017 1.252338
RMS 116.49375 74.21000 0.628010 1.688245
MMH 1946 853.16306 0.823852 4.677029
OVER-ALL 706.98292 319.13772 0.813212 4.353652

Pseudo F Statistic = 47.89

Approximate Expected Over-All R-Squared = 0.86912

Cubic Clustering Criterion = -1.486

WARNING: The two values above are invalid for correlated variables.

Cluster Means

Cluster ADO MAC WHR CUA

1 48.448235 148.749412 120.852941 10.242941


2 250.110000 608.605714 168.000000 30.205714
3 384.500000 1473.660000 168.000000 7.360000

Cluster Means

Cluster WNGS OBC RMS MMH

1 6.294118 85.647059 72.647059 1073.471765


2 21.000000 205.571429 209.857143 3745.552857
3 24.000000 540.000000 453.000000 8266.770000

U.S. NAVY BACHELOR OFFICERS’ QUARTERS

The FASTCLUS Procedure


Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02

Cluster Standard Deviations

Cluster ADO MAC WHR CUA

1 41.393598 187.049524 66.329614 10.734843


2 254.668606 302.038282 0.000000 11.390857
3 . . . .

Cluster Standard Deviations

Cluster WNGS OBC RMS MMH

1 5.132795 96.472108 70.579690 768.346928


2 17.387735 84.510073 83.119249 1046.247777
3 . . . .

U.S. NAVY BACHELOR OFFICERS’ QUARTERS

The CLUSTER Procedure
Single Linkage Cluster Analysis

Eigenvalues of the Covariance Matrix

Eigenvalue Difference Proportion Cumulative

1 3947958.95 3921316.06 0.9873 0.9873


2 26642.90 10137.40 0.0067 0.9940
3 16505.49 11872.07 0.0041 0.9981
4 4633.43 2073.01 0.0012 0.9993
5 2560.41 2344.28 0.0006 0.9999
6 216.14 170.55 0.0001 1.0000
7 45.58 9.66 0.0000 1.0000
8 35.92 0.0000 1.0000

Root-Mean-Square Total-Sample Standard Deviation = 706.9829


Mean Distance Between Observations = 2151.088

Cluster History
Norm T
Min i
NCL -----Clusters Joined------ FREQ SPRSQ RSQ ERSQ CCC PSF PST2 Dist e

15 12 14 2 0.0002 .999 . . 712 . 0.0927


14 CL19 8 4 0.0004 .999 . . 623 4.0 0.0975
13 13 25 2 0.0003 .998 . . 592 . 0.1173
12 CL14 CL15 6 0.0035 .995 . . 226 19.1 0.1349
11 CL13 20 3 0.0011 .994 . . 219 3.4 0.145
10 CL17 CL11 5 0.0024 .991 . . 189 4.6 0.1666
9 CL10 19 6 0.0033 .988 . . 163 3.4 0.184
8 CL9 CL18 8 0.0143 .974 . . 89.5 11.6 0.1983
7 CL12 CL8 14 0.0557 .918 . . 33.6 25.7 0.2107
6 CL16 CL7 20 0.1197 .798 . . 15.0 26.3 0.3028
5 15 23 2 0.0030 .795 .966 -6.6 19.4 . 0.3503
4 18 22 2 0.0033 .792 .946 -5.1 26.6 . 0.3719
3 CL6 CL5 22 0.1036 .688 .901 -4.6 24.3 10.1 0.4044
2 CL3 CL4 24 0.2589 .429 .763 -4.3 17.3 18.3 0.5816
1 CL2 24 25 0.4294 .000 .000 0.00 . 17.3 1.3227

The Cluster Tree is;

13.4.1 Splus Example


The same data can be analyzed using Splus. The output is;

*** Agglomerative Hierarchical Clustering ***


Call:
agnes(x = menuModelFrame(data = Usnavy, variables =
"ADO,MAC,WHR,CUA,WNGS,OBC,RMS,MMH", subset = NULL, na.rm = T),
diss = F, metric = "euclidean", stand = F, method = "single",
save.x = T, save.diss = T)
Merge:
[,1] [,2]

[1,] -4 -6
[2,] -2 -3
[3,] 2 -5
[4,] -1 3
[5,] -7 -9
[6,] 5 -10
[7,] -16 -21
[8,] -11 -17
[9,] 4 1
[10,] -12 -14
[11,] 6 -8
[12,] -13 -25
[13,] 11 10
[14,] 12 -20
[15,] 8 14
[16,] 15 -19
[17,] 16 7
[18,] 13 17
[19,] 9 18
[20,] -15 -23
[21,] -18 -22
[22,] 19 20
[23,] 22 21
[24,] 23 -24
Order of objects:
[1] 1 2 3 5 4 6 7 9 10 8 12 14 11 17 13 25 20 19 16 21 15 23
[23] 18 22 24
Height:
[1] 36.20112 33.73715 34.02875 143.48285 26.18488 651.30164
[7] 98.24102 137.86258 209.82213 290.12504 199.39922 453.16504
[13] 139.95699 358.46153 252.32414 311.93894 395.87403 426.58963
[19] 139.65372 869.86791 753.46956 1251.04620 799.97551 2845.24880
Agglomerative coefficient:
[1] 0.8754856

Available arguments:
[1] "order" "height" "ac" "merge" "order.lab"
[6] "diss" "data" "call"

The Splus Cluster Tree is;

13.5 Multidimensional Scaling


In this section one is interested in displaying high-dimensional data in a low-dimensional space. Procedures
such as principal components and discriminant scores are two methods of reducing the dimensionality
of the data. Often, the distances between observations in the high-dimensional space become distorted in
the lower dimensions when these types of procedures are used. Multidimensional scaling is a procedure
that seeks to preserve the distances between observations when transforming from high dimensions into
lower dimensions (typically two dimensions).
The problem is to find a representation in, say, two dimensions that "nearly matches" the distances between
points in the higher dimensions. It is not possible to preserve this property for every pair of observations
or points; consequently one might compute a "stress" that measures how much of the higher-dimensional
ordering is preserved in each lower-dimensional representation. Since it is possible to arrange N
items in a low-dimensional coordinate system using only the rank orders of the N(N − 1)/2 original distances,
and not their actual distances, procedures which use these rank orderings are called nonmetric multidimensional
scaling. If the actual distances are used, the procedures are called metric multidimensional scaling. Principal
component analysis is a metric multidimensional scaling procedure.

13.5.1 The Basic Algorithm

For N items there are M = N(N − 1)/2 distances (similarities) between pairs of items. These similarities
constitute the basic data. If one assumes that there are no ties in these similarities, then they can be
arranged in a strictly ascending order as

    s_{i_1,k_1} < s_{i_2,k_2} < . . . < s_{i_M,k_M}

where s_{i_1,k_1} represents the pair of points with the smallest or least similarity. Now we want to find a
q-dimensional representation of the N items such that the distances d_{ik}^{(q)} between pairs of items match
the above ordering when the distances are laid out in descending magnitude. That is, a match is perfect if

    d_{i_1,k_1}^{(q)} > d_{i_2,k_2}^{(q)} > . . . > d_{i_M,k_M}^{(q)}.

Now for any given value of q << p, define the stress as

    Stress(q) = { Σ_{i<k} (d_{ik}^{(q)} − d̂_{ik}^{(q)})^2 / Σ_{i<k} [d_{ik}^{(q)}]^2 }^{1/2}

where the d̂_{ik}^{(q)}'s are monotone functions of the distances. The interpretation of the Stress value is
subjective: values greater than 10% indicate a poor fit, while values in the 5%–10% range are good to excellent.
The following measure is becoming a more acceptable criterion;

    Stress = { Σ_{i<k} (d_{ik}^2 − d̂_{ik}^2)^2 / Σ_{i<k} d_{ik}^4 }^{1/2}

which is a value between 0 and 1. Any value less than .1 typically indicates a good representation.
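Given vectors of observed distances d_ik and fitted distances d̂_ik (stacked over the i < k pairs), either
criterion is a one-line computation. A sketch of the second measure, with hypothetical values:
PROC IML;
   d    = {1.0, 2.0, 3.0, 4.5};   /* observed distances d(ik)  */
   dhat = {1.1, 1.9, 3.2, 4.4};   /* fitted distances          */
   stress2 = sqrt( ssq(d##2 - dhat##2) / sum(d##4) );
   PRINT stress2;                 /* values below .1 indicate  */
                                  /* a good representation     */
QUIT;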

13.5.2 SAS – Example


SAS PROC MDS is used to perform multidimensional scaling. The procedure is very general and can be used
for a number of models that are beyond the scope of this course. I have included a single example.

Breakfast Cereal Example


The SAS code is;
options ps=60 nodate center;
data cereal;
title ’Breakfast Cereal Data’;
input brand $ manufact $ Cal Protein Fat Sodium Fiber Carbs Sugar Potass Group;
datalines;
ACCheerios G 110 2 2 180 1.5 10.5 10 70 1
Cheerios G 110 6 2 290 2.0 17.0 1 105 1

CocoaPuffs G 110 1 1 180 0.0 12.0 13 55 1
CountChocula G 110 1 1 180 0.0 12.0 13 65 1
GoldenGrahams G 110 1 1 280 0.0 15.0 9 45 1
HoneyNutCheerios G 110 3 1 250 1.5 11.5 10 90 1
Kix G 110 2 1 260 0.0 21.0 3 40 1
LuckyCharms G 110 2 1 180 0.0 12.0 12 55 1
MultiGrainCheerios G 100 2 1 220 2.0 15.0 6 90 1
OatmealRaisinCrisp G 130 3 2 170 1.5 13.5 10 120 1
RaisinNutBran G 100 3 2 140 2.5 10.5 8 140 1
TotalCornFlakes G 110 2 1 200 0.0 21.0 3 35 1
TotalRaisinBran G 140 3 1 190 4.0 15.0 14 230 1
TotalWholeGrain G 100 3 1 200 3.0 16.0 3 110 1
Trix G 110 1 1 140 0.0 13.0 12 25 1
Cheaties G 100 3 1 200 3.0 17.0 3 110 1
WheatiesHoneyGold G 110 2 1 200 1.0 16.0 8 60 1
AllBran K 70 4 1 260 9.0 7.0 5 320 2
AppleJacks K 110 2 0 125 1.0 11.0 14 30 2
CornFlakes K 100 2 0 290 1.0 21.0 2 35 2
CornPops K 110 1 0 90 1.0 13.0 12 20 2
CracklinOatBran K 110 3 3 140 4.0 10.0 7 160 2
Crispix K 110 2 0 220 1.0 21.0 3 30 2
FrootLoops K 110 2 1 125 1.0 11.0 13 30 2
FrostedFlakes K 110 1 0 200 1.0 14.0 11 25 2
FrostedMiniWheats K 100 3 0 0 3.0 14.0 7 100 2
FruitfulBran K 120 3 0 240 5.0 14.0 12 190 2
JustRightCrunchyNuggets K 110 2 1 170 1.0 17.0 6 60 2
MueslixCrispyBlend K 160 3 2 150 3.0 17.0 13 160 2
NutNHoneyCrunch K 120 2 1 190 0.0 15.0 9 40 2
NutriGrainAlmondRaisin K 140 3 2 220 3.0 21.0 7 130 2
NutriGrainWheat K 90 3 0 170 3.0 18.0 2 90 2
Product19 K 100 3 0 320 1.0 20.0 3 45 2
RaisinBran K 120 3 1 210 5.0 14.0 12 240 2
RiceKrispies K 110 2 0 290 0.0 22.0 3 35 2
Smacks K 110 2 1 70 1.0 9.0 15 40 2
SpecialK K 110 6 0 230 1.0 16.0 3 55 2
CapNCrunch Q 120 1 2 220 0.0 12.0 12 35 3
HoneyGrahamOhs Q 120 1 2 220 1.0 12.0 11 45 3
Life Q 100 4 2 150 2.0 12.0 6 95 3
PuffedRice Q 50 1 0 0 0.0 13.0 0 15 3
PuffedWheat Q 50 2 0 0 1.0 10.0 0 50 3
QuakerOatmeal Q 100 5 2 0 2.7 1.0 1 110 3
;
PROC STANDARD DATA=Cereal MEAN=0 STD=1 OUT=Cereal2;
VAR Cal Protein Sodium Fiber Carbs Sugar Potass;
RUN;

data cereal4; set cereal; keep brand;run;

DATA ONE; SET Cereal2;

I = _N_;
KEEP Cal Protein Sodium Fiber Carbs Sugar Potass I;

DATA ORIG; SET ONE;


DO J=1 TO 43;
OUTPUT; END;

DATA DUP; SET ORIG;


II=J; JJ=I; I=II; J=JJ; DROP JJ II;
Y1=cal; Y2=protein; Y3=sodium; Y4=fiber; Y5=carbs; Y6=sugar; Y7=potass;
NN = _N_;
DROP Cal Protein Sodium Fiber Carbs Sugar Potass;

PROC SORT DATA=DUP; BY I J;

****************************************************
* CREATED A DUPLICATE DATA SET BY CREATING DUMMY
VARIABLES I AND J FOR MERGING SETS TO ALLOW US
TO CALCULATE H , WHICH IS DISTANCE.
******************************************************;

DATA COMB; MERGE ORIG DUP; BY I J;


DROP I J;

******************************************************************
* CALCULATE H, THE DISTANCE BETWEEN ALL PAIRS OF POINTS
*****************************************************************;

H = SQRT((cal-Y1)**2 + (protein-Y2)**2 +(sodium-Y3)**2 + (fiber-Y4)**2 +(carbs-Y5)**2


+(sugar-Y6)**2 +(potass-Y7)**2);
KEEP H;
RUN;

PROC IML; RESET NOLOG;


USE COMB;
READ ALL INTO DIST;
USE Cereal4;
READ ALL VAR _CHAR_ INTO brand;
* PRINT brand;
N=SQRT(NROW(DIST));
* print n;
DD=SHAPE(DIST,N,N);
*PRINT ’DISTANCE MATRIX = ’,DD;
CREATE DIST_DAT FROM DD;
APPEND FROM DD;
run;
*proc print data=cereal4;run;
DATA FINAL;
MERGE Cereal4 DIST_DAT;

RUN;

PROC MDS DATA=FINAL OUT=OUT OUTRES=RES SHAPE=SQUARE; ID brand;


TITLE3 ’MDS Analyses and Plots’;
RUN;

data out; set out; if _type_='CRITERION' then delete; run;


%let plotitop = gopts = gsfmode = replace
gaccess = gsasfile device = pslepsf
hsize = 8.63 vsize = 6.5
cback = white,
color = black,
colors = black,
options = noclip border expand, post=myplot.ps;
title1 ’Plot of configuration’;
%plotit(data=out(where=(_type_=’CONFIG’)), datatype=mds,
labelvar=brand, vtoh=1.75);run;
%plotit(data=out(where=(_type_=’CONFIG’)), datatype=mds,
plotvars=dim2 dim1, labelvar=brand, vtoh=1.75);
run;
quit;
The output from PROC MDS is in the form of output data sets. The only printed output from the procedure
concerns the iterations of the algorithm used to find the lower-dimensional configuration. These output data
sets can be used to form graphs. An example is;

