Document for Stat 5353
J. D. Tubbs
Department of Mathematical Sciences
2 Distributions 11
2.1 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1 Univariate Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.2 Bivariate Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Multivariate Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.4 Estimation of Parameters in the Multivariate Case . . . . . . . . . . . . . . . . . . . . 13
2.1.5 Matrix Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Other Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Chi-Square, T and F Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 The Wishart Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.3 Hotelling’s T 2 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Quadratic Forms of Normal Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Sampling Distributions of µ̂ and S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Assessing the Normality Assumption 17
3.1 QQ Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Chi-Square Probability Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Example – US Navy Officers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4 Box-Cox Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.1 USnavy Example with Box Cox Transformations . . . . . . . . . . . . . . . . . . . . . 25
4 Multivariate Plots 36
4.1 2 Dimensional Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.1 Multiple Scatterplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.2 Chernoff Faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.3 Star Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 3 Dimensional Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8 Profile Analysis 83
8.1 One Sample Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.1.1 Example – One Sample Profile Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 84
8.2 Two Sample Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.2.1 Example – Two Sample Profile Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.3 Mixed Models Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.3.1 Estimating G and R in the Mixed Model . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.3.2 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.3.3 Restricted Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . 103
8.3.4 REML Estimation for the Linear Mixed Model . . . . . . . . . . . . . . . . . . . . . . 103
8.3.5 Estimating β and γ in the Mixed Model . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.3.6 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
8.3.7 Statistical Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
8.3.8 Inference and Test Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.4 Example using PROC MIXED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
12.4.2 Nonparametric Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
12.4.3 Nearest Neighbor Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
12.5 SAS – Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
12.5.1 Salmon Size Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
12.6 Classification Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Chapter 1
x = (x1 , x2 , · · · , xn )' is an n × 1 column vector, so that x' = (x1 , x2 , · · · , xn ).
D = diag(a11 , a22 , . . . , arr ), the r × r diagonal matrix with the aii on the diagonal and zeroes elsewhere.
3. In is called the n × n identity matrix given by
I_n = diag(1, 1, . . . , 1), the n × n matrix with ones on the diagonal and zeroes elsewhere.
1.1.2 Addition
C = A ± B is defined as cij = aij ± bij provided both A and B have the same number of rows and columns.
It can easily be shown that (A ± B) ± C = A ± (B ± C) and A + B = B + A.
1.1.3 Multiplication
C = AB is defined as c_ij = Σ_{k=1}^p a_ik b_kj provided A and B are conformable matrices (A is r × p and B is p × c).
1. Even if both AB and BA are defined they are not necessarily equal.
2. It follows that A(B ± C) = AB ± AC.
3. Two vectors a and b are said to be orthogonal, denoted by a ⊥ b, if a'b = Σ_{i=1}^n a_i b_i = 0.
4. ‖a‖² = a'a = Σ_{i=1}^n a_i².
5. ‖j‖² = j'j = n.
6. Σ_{i=1}^n a_i = j'a = a'j.
7. j'J = nj', Jj = nj.
1.1.5 Partitioned Matrices
The r × c matrix A can be partitioned into four submatrices, and similarly for B:
A = ( A11  A12 ),     B = ( B11  B12 ),
    ( A21  A22 )          ( B21  B22 )
where Aij is ri × cj and r = r1 + r2 , c = c1 + c2 . Suppose matrices A and B are such that AB is defined
then
AB = ( A11  A12 ) ( B11  B12 ) = ( A11 B11 + A12 B21   A11 B12 + A12 B22 ).
     ( A21  A22 ) ( B21  B22 )   ( A21 B11 + A22 B21   A21 B12 + A22 B22 )
1.1.6 Inverse
The n × n matrix A is said to be nonsingular if there exists a matrix B satisfying AB = BA = In . B is
called the inverse of A and is denoted by A−1 .
1. (AB)−1 = B −1 A−1 provided these inverses exist.
2. (I + A)−1 = I − A(A + I)−1 .
3. (A + B)−1 = A−1 − A−1 B(A + B)−1 = A−1 − A−1 (A−1 + B −1 )−1 A−1 and B(A + B)−1 A = (A−1 +
B −1 )−1 .
4. (A−1 + B −1 )−1 = (I + AB −1 )−1 A.
5. If A is an r × c matrix with rank[A] = r then (AA')−1 exists (while A−1 itself does not exist unless r = c). If rank[A] = c then (A'A)−1 exists.
6. | A | is the determinant of the matrix A [note | A | is a scalar]. | A | ≠ 0 whenever the inverse of A exists. A is said to be singular if its inverse does not exist, in which case | A | = 0.
7. | AB |=| A || B |.
8. Suppose
A = ( A11  A12 ),
    ( A21  A22 )
and the inverse of A exists and is given by
A−1 = B = ( B11  B12 ),
          ( B21  B22 )
then
(a) B11^{-1} and B22^{-1} exist.
(b) B11 = [A11 − A12 A22^{-1} A21]^{-1} = A11^{-1} + A11^{-1} A12 B22 A21 A11^{-1}.
(c) B12 = −A11^{-1} A12 [A22 − A21 A11^{-1} A12]^{-1} = −A11^{-1} A12 B22.
(d) B22 = [A22 − A21 A11^{-1} A12]^{-1} = A22^{-1} + A22^{-1} A21 B11 A12 A22^{-1}.
(e) B21 = −A22^{-1} A21 [A11 − A12 A22^{-1} A21]^{-1} = −A22^{-1} A21 B11.
(f) | A | = |A11| / |B22| = |A22| / |B11| and | A11 B11 | = | A22 B22 |.
(g) | A | = | A22 | | A11 − A12 A22^{-1} A21 |.
(h) | A | = | A11 | | A22 − A21 A11^{-1} A12 |.
(i) (A + cc')^{-1} = A^{-1} − A^{-1} c c' A^{-1} / (1 + c' A^{-1} c). (A small numerical check of this identity is sketched below.)
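A quick numerical check of identity (i) can be done in Splus; the matrix A and vector c below are made up purely for illustration:
# check (A + cc')^{-1} = A^{-1} - A^{-1} c c' A^{-1} / (1 + c'A^{-1}c)
A <- diag(c(2, 3, 4))                       # any nonsingular matrix
cvec <- matrix(c(1, -1, 2), ncol = 1)       # any column vector c
lhs <- solve(A + cvec %*% t(cvec))          # direct inverse of A + cc'
Ainv <- solve(A)
rhs <- Ainv - (Ainv %*% cvec %*% t(cvec) %*% Ainv) /
       as.numeric(1 + t(cvec) %*% Ainv %*% cvec)
max(abs(lhs - rhs))                         # should be essentially zero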
1.1.7 Transpose
If A is r × c then the transpose of A, denoted by A0 , is a c × r matrix. It follows that
1. (A0 )0 = A
2. (A ± B)0 = A0 ± B 0
3. (AB)0 = B 0 A0
4. If A = A0 then A is said to be symmetric.
5. A0 A and AA0 are symmetric.
6. (A ⊗ B)0 = (A0 ⊗ B 0 ).
1.1.8 Trace
Definition 1.1 Suppose that the matrix A = (aij ), i = 1, . . . , n, j = 1, . . . , n; then the trace of A is given by tr[A] = Σ_{i=1}^n aii .
1.1.9 Rank
Suppose that A is an r × c matrix. Its r rows a1 , a2 , . . . , ar are said to be linearly independent if no ai can be expressed as a linear combination of the remaining rows, that is, there does not exist a non-null vector c = (c1 , c2 , . . . , cr )' such that Σ_{i=1}^r ci ai = 0. It can be shown that the number of linearly independent rows is equal to the number of linearly independent columns of any matrix A, and that common number is the rank of the matrix. If the rank of A is r then A is said to be of full row rank. If the rank of A is c then A is said to be of full column rank.
1. rank[A] = 0 if and only if A = 0.
2. rank[A] = rank[A0 ].
3. rank[A] = rank[A0 A] = rank[AA0 ].
4. rank[AB] ≤ min{rank[A], rank[B]}
5. If A is any matrix, and P and Q are any conformable nonsingular matrices then rank[P AQ] = rank[A].
6. If A is r × c with rank r then AA' is nonsingular ((AA')−1 exists and rank[AA'] = r). If the rank of A is c then A'A is nonsingular ((A'A)−1 exists and rank[A'A] = c).
7. If A is symmetric, then rank[A] is equal to the number of nonzero eigenvalues.
1.1.12 Orthogonal Matrices
An n × n matrix A is said to be orthogonal if and only if A−1 = A'. If A is orthogonal then
1. −1 ≤ aij ≤ 1 for every element of A.
2. AA0 = A0 A = In .
3. | A | = ±1.
For a scalar function f of the matrix X = (xij ), define the matrix derivative df /dX as the matrix whose (i, j) element is df /dxij . Then
1. d(β'a)/dβ = a.
2. d(β'Aβ)/dβ = 2Aβ (A symmetric).
3. If f (X) = a'Xb, then df /dX = ab'.
4. If f (X) = tr[AXB], then df /dX = A'B'.
5. If X is symmetric and f (X) = a'Xb, then df /dX = ab' + ba' − diag(ab').
6. If X is symmetric and f (X) = tr[AXB], then df /dX = A'B' + BA − diag(BA).
7. If X is n × n and f (X) = tr(X), then df /dX = In .
8. If X is n × n and f (X) = tr(X'AX), then df /dX = (A + A')X = 2AX when A = A'.
9. If X and A are symmetric and f (X) = tr[AXAX], then df /dX = 2AXA.
Suppose that Y is a k × p matrix; then the partial derivative of Y with respect to xij is the k × p matrix
∂Y /∂xij = ( ∂yuv /∂xij ),
whose (u, v) element is ∂yuv /∂xij .
1. ∂|X|/∂X = [Xij ] = (adjoint X)' for | X | ≠ 0.
2. If X is symmetric, then ∂|X|/∂X = 2[Xij ] − diag[Xij ] for | X | ≠ 0.
3. ∂ log(|X|)/∂X = (X −1 )' for | X | ≠ 0.
4. If X is symmetric, then ∂ log(|X|)/∂X = 2X −1 − diag[X −1 ] for | X | ≠ 0.
1.1.13 The Generalized Inverse
A matrix B is said to be the generalized inverse of A if it satisfies ABA = A. The generalized inverse of A
is denoted by A− . If A is nonsingular then A−1 = A− . If A is singular then A− exists but is not unique.
1. If A is an r × c matrix of rank c, then a generalized inverse of A is A− = (A'A)−1 A'.
A x = λ x, or equivalently (A − λI)x = 0, which has a nonzero solution x if and only if | A − λI | = 0. This last equation is called the characteristic equation for an n × n matrix A. The matrix A has n roots or solutions for the value λ of the characteristic equation (not necessarily distinct or real). These solutions are called the characteristic values or roots or the eigenvalues of the matrix A. Suppose that λ1 is a solution and A x1 = λ1 x1 ; then x1 is said to be a characteristic vector or eigenvector of A corresponding to the eigenvalue λ1 . Note: The eigenvalues may or may not be real numbers.
6. The matrices A, C −1 AC and CAC −1 have the same set of eigenvalues for any nonsingular matrix C.
7. The matrices A and A0 have the same set of eigenvalues but need not have the same eigenvectors.
8. Let A be a nonsingular matrix with eigenvalue λ, then 1/λ is an eigenvalue of A−1 .
9. The eigenvectors are not unique, for if x1 is an eigenvector corresponding to λ1 then cx1 is also an eigenvector, since A(cx1 ) = λ1 (cx1 ), for any nonzero value c.
10. Let A be an n × n real matrix, then there exists a nonsingular, complex matrix Q such that Q0 AQ = T ,
where T is a complex, upper triangular matrix, and the eigenvalues of A are the diagonal elements of
the matrix T .
11. Let P be an idempotent matrix then the eigenvalues of P are either 1 or 0. The only nonsingular
idempotent matrix is the identity matrix.
12. Suppose that A is a symmetric matrix;
(a) Then the eigenvalues of A are real numbers.
(b) For each eigenvalue there exists a real eigenvector (each element is a real number).
(c) Let λ1 and λ2 be eigenvalues of A with corresponding eigenvectors ~x1 and ~x2 , then ~x1 and ~x2 are
orthogonal vectors, that is, ~x01 ~x2 = 0.
(d) There exists an orthogonal matrix P (P 0 P = P P 0 = In ) such that P 0 AP = D, where D is a
diagonal matrix whose diagonal elements are the eigenvalues of the matrix A.
13. Let A be a symmetric matrix of order n and B a p.d. matrix, and let λ1 ≥ λ2 ≥ . . . ≥ λn be the roots of | A − λB | = 0. Then for any y ≠ 0
(a) λn ≤ y'Ay/y'By ≤ λ1 .
(b) min_{y≠0} (y'Ay/y'By) = λn .
(c) max_{y≠0} (y'Ay/y'By) = λ1 .
14. Properties of the roots of | H − λE | = 0.
(a) The roots of | E − ν(H + E) | = 0 are related to the roots of | H − λE | = 0 by
λi = (1 − νi )/νi ,   νi = (1 + λi )−1 ,
and, writing θi = λi /(1 + λi ),
λi = θi /(1 − θi ),   νi = (1 − θi ).
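Item 12(d) above (the spectral decomposition of a symmetric matrix) is easy to verify numerically in Splus; the 3 × 3 matrix below is made up for illustration:
A <- matrix(c(4, 1, 0,
              1, 3, 1,
              0, 1, 2), 3, 3)       # a symmetric matrix
ev <- eigen(A)
P <- ev$vectors                     # columns are eigenvectors
D <- diag(ev$values)                # diagonal matrix of eigenvalues
max(abs(t(P) %*% A %*% P - D))      # P'AP = D, up to rounding error
max(abs(t(P) %*% P - diag(3)))      # P'P = I, so P is orthogonal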
1.3 Random Variables and Vectors
1.3.1 Univariate Random Variables
Expectations
Let U denote a random variable with expectation E(U ) and variance V ar(U ) = E[(U − E(U ))2 ]. Let a and b denote any constants; then we have
1. E(aU ± b) = aE(U ) ± b.
2. V ar(aU ± b) = a2 V ar(U ).
Suppose that t(x) is a statistic that is used to estimate a parameter θ. If E(t(x)) = θ, the statistic is said to be an unbiased estimate for θ. If E(t(x)) = η ≠ θ then t(x) is biased and the bias is given by Bias = (θ − η), in which case the mean square error is given by MSE = E[(t(x) − θ)2 ] = V ar(t(x)) + (Bias)2 .
Covariance
Let U and V denote two random variables with respective means µu and µv . The covariance between the two random variables is defined by cov(U, V ) = E[(U − µu )(V − µv )] = E(U V ) − µu µv .
Linear Combinations
Suppose that one has n random variables u1 , u2 , . . . , un and one defines u = Σ_{i=1}^n ai ui . Then E(u) = Σ_{i=1}^n ai E(ui ) and V ar(u) = Σ_i Σ_j ai aj cov(ui , uj ).
Properties for Σ
1. Σ is symmetric and at least a p.s.d. n × n matrix.
2. Σ = E(U U') − E(U )E(U )', that is, E(U U') = Σ + E(U )E(U )'.
3. cov(U + d) = cov(U ).
4. tr[cov(U )] = E[tr((U − E(U ))(U − E(U ))')] = E[(U − E(U ))'(U − E(U ))] = Σ_{i=1}^n σii is the total variance of U .
Suppose that A is an r × n matrix and one defines V = AU ± b; then
5. E(V ) = AE(U ) ± b.
6. cov(V ) = A cov(U ) A' = AΣA'. Note cov(V ) is an r × r symmetric and at least p.s.d. matrix.
Suppose that C is an s × n matrix and one defines W = CU ± d; then
7. cov(V, W ) = AΣC'. Note cov(V, W ) is an r × s matrix.
8. Define Σ1/2 = diag(√σ11 , √σ22 , . . . , √σnn ) and the correlation matrix as
ρ = (Σ1/2 )−1 Σ (Σ1/2 )−1 .
9. Let U1 = (u1 , u2 , . . . , uq )' and U2 = (uq+1 , uq+2 , . . . , un )'; then
µ = E(U ) = (µ1' : µ2')' = (µ1 , µ2 , . . . , µq : µq+1 , µq+2 , . . . , µn )'
and
Cov(U ) = Σ = ( Σ11  Σ12 ),
              ( Σ21  Σ22 )
where Σ11 = (σi,j ), i, j = 1, . . . , q, is q × q; Σ22 = (σi,j ), i, j = q + 1, . . . , n, is (n − q) × (n − q); and Σ12 = Σ21' = (σi,j ), i = 1, . . . , q, j = q + 1, . . . , n, is q × (n − q).
10. The random vectors U1 and U2 are uncorrelated if and only if Σ12 = 0.
Chapter 2
Distributions
fy (y) = k exp[−(y − µ)2 /(2σ 2 )],   −∞ < y < ∞,
where E(y) = µ, var(y) = σ 2 and k is the normalizing constant given by k = (2πσ 2 )−1/2 .
In order to estimate the parameters µ and σ 2 using a sample of size n, given by y1 , y2 , . . . , yn , one defines
the likelihood function given by,
L(y1 , y2 , . . . , yn | µ, σ 2 ) = Π_{i=1}^n (2πσ 2 )−1/2 exp[−(yi − µ)2 /(2σ 2 )] = (2πσ 2 )−n/2 exp[−(1/(2σ 2 )) Σ_{i=1}^n (yi − µ)2 ].
In order to find the values µ̂ and σ̂ 2 which maximize L(· | ·), one defines the log likelihood function as
l = log(L) and takes the partial derivative with respect to µ and σ 2 . Set these values to zero and solve. In
which case one obtains,
µ̂ = ȳ
and
σ̂ 2 = n−1 Σ_{i=1}^n (yi − ȳ)2 .
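For reference, the differentiation step just described can be written out explicitly (a standard calculation, included here only as a sketch):
l(µ, σ 2 ) = −(n/2) log(2πσ 2 ) − (1/(2σ 2 )) Σ_{i=1}^n (yi − µ)2 ,
∂l/∂µ = (1/σ 2 ) Σ_{i=1}^n (yi − µ) = 0,
∂l/∂σ 2 = −n/(2σ 2 ) + (1/(2σ 4 )) Σ_{i=1}^n (yi − µ)2 = 0,
and solving these two equations simultaneously gives the estimates µ̂ and σ̂ 2 above.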
f (y) = k exp[Q],
where k = (2πσ1 σ2 √(1 − ρ2 ))−1 and
Q = −[((y1 − µ1 )/σ1 )2 − 2ρ((y1 − µ1 )/σ1 )((y2 − µ2 )/σ2 ) + ((y2 − µ2 )/σ2 )2 ] / [2(1 − ρ2 )],
µ = (µ1 , µ2 )', and Σ = ( σ1²      ρσ1 σ2 ),
                          ( ρσ1 σ2   σ2²   )
where E(yi ) = µi , V ar(yi ) = σi2 , i = 1, 2, and corr(y1 , y2 ) = ρ.
By letting zi = (yi − µi )/σi , for i = 1, 2, Q can be written as
Q = −[z12 − 2ρz1 z2 + z22 ] / [2(1 − ρ2 )].
The contours of equal density, given by Q = c, are ellipses with major axis ±c√λ1 e1 and minor axis ±c√λ2 e2 , where λ1 ≥ λ2 are the eigenvalues of Σ and e1 , e2 are the corresponding eigenvectors, with e1'e1 = e2'e2 = 1 and e1'e2 = 0.
It can be shown that
Pr[Q ≤ χ22 (α)] = (1 − α),
where χ2_2 (α) is the upper α critical point from a chi-square distribution with 2 degrees of freedom.
f (y) = f (y1 , y2 , . . . , yp ) = k exp[−(1/2)(y − µ)'Σ−1 (y − µ)],
where k = (2π)−p/2 | Σ |−1/2 .
2. If y ∼ Np (µ, Σ) then z ∼ Np (0, Ip ), where z = Σ−1/2 (y − µ).
4. Conditional Distribution
Suppose that y1 = (y1 , y2 , . . . , yq )' and y2 = (yq+1 , yq+2 , . . . , yp )' and (y1', y2')' ∼ Np (µ, Σ), where
µ = (µ1' : µ2')' = (µ1 , µ2 , . . . , µq : µq+1 , µq+2 , . . . , µp )'
and
Σ = ( Σ11  Σ12 ),
    ( Σ21  Σ22 )
where Σ11 = (σi,j ), i, j = 1, . . . , q; Σ22 = (σi,j ), i, j = q + 1, . . . , p; and Σ12 = Σ21' = (σi,j ), i = 1, . . . , q, j = q + 1, . . . , p. Then the conditional distribution of y1 given y2 = a is Nq (µ1 + Σ12 Σ22−1 (a − µ2 ), Σ11 − Σ12 Σ22−1 Σ21 ).
5. Independence
(a) The above vectors ~y1 and ~y2 are independent if and only if Σ12 = Σ21 = 0.
(b) If the vectors y1 and y2 are independent then
y1 ± y2 ∼ Nq (µ1 ± µ2 , Σ11 + Σ22 )
when both vectors have the same dimension (that is, y1 ± y2 is defined).
6. Partial Correlation
The partial correlation between two variables yi and yj given that Y2 = ~y2 is given by
ρi,j|y2 = σi,j|y2 / (√σi,i|y2 √σj,j|y2 ),
where the σi,j|y2 are the elements of the conditional covariance matrix Σ11 − Σ12 Σ22−1 Σ21 .
The likelihood function is given by
L = L(µ, Σ | Y ) = (2π)−np/2 | Σ |−n/2 exp[−(1/2) Σ_{i=1}^n (yi − µ)'Σ−1 (yi − µ)].
Maximizing over µ gives µ̂ = ȳ; using this value and taking the derivative of L with respect to Σ, one has
Σ̂ = (1/n) Σ_{i=1}^n (yi − ȳ)(yi − ȳ)' = [(n − 1)/n]S,
where
(n − 1)S = Y 0 [I − j(j0 j)−1 j0 ]Y = Y 0 Y − nȲ 0 Ȳ
and j is an n × 1 vector of ones.
5. If u ∼ χ2 (n) and v ∼ χ2 (m) are independent, then (u/n)/(v/m) ∼ F (n, m).
6. If z = (z1 , z2 , . . . , zn )' ∼ Nn (0, In ), then z'z = Σ_{i=1}^n zi2 ∼ χ2 (n).
2.2.2 The Wishart Distribution
Let Q = Y 'Y where Y ∼ Nn,p (µ, Σ ⊗ In ). Then Q is said to have a noncentral p-dimensional Wishart distribution, denoted by Wp (n, Σ, Γ), where Γ = µ'µΣ−1 is the noncentrality parameter. Whenever µ = 0, Q is said to have a central Wishart distribution, denoted by Wp (n, Σ). The properties of a Wishart are:
1. E(Q) = nΣ + µ0 µ.
2. Q ∼ Wp (n, Σ, Γ) if and only if a'Qa/a'Σa ∼ χ2 (df = n, γ = a'µ'µa/a'Σa) for all a ≠ 0.
3. E(Y 0 AY ) = trace(A)Σ + µ0 Aµ.
4. A1 ∼ Wp1 (n, Σ) and A2 ∼ Wp2 (n, Σ), A1 is independent of A2 , then A1 + A2 ∼ Wp1 +p2 (n, Σ).
5. If A ∼ Wp (n, Σ), then CAC 0 ∼ Wp (n, CΣC 0 ).
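Property 1 can be illustrated by simulation; the Splus sketch below uses a made-up Σ with µ = 0 and simply averages Q = Y'Y over many replications:
p <- 2; n <- 10
Sigma <- matrix(c(2, 1, 1, 3), 2, 2)     # a made-up Sigma
R <- chol(Sigma)                         # Sigma = R'R
nrep <- 2000
Qbar <- matrix(0, p, p)
for(k in 1:nrep) {
  Y <- matrix(rnorm(n * p), n, p) %*% R  # rows of Y are iid N_p(0, Sigma)
  Qbar <- Qbar + t(Y) %*% Y/nrep         # running average of Q = Y'Y
}
Qbar                                     # should be close to n * Sigma
n * Sigma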
T 2 = nY 'Q−1 Y,
where γ = µ'Σ−1 µ.
(d) cov(q1 , q2 ) = 2 tr[AV BV ] + 4µ'AV Bµ.
(e) cov(x, q1 ) = 2 V Aµ.
(f) cov(T, q1 ) = 2 CV Aµ.
(g) q1 and q2 are independent if and only if AV B = BV A = 0.
(h) q1 and T are independent if and only if CV A = 0.
4. (Cochran’s Theorem) Let x ∼ Nn (µ, V ), Ai , i = 1, 2, . . . , m be symmetric, rank[Ai ] = ri , and
A = Σ_{i=1}^m Ai
Chapter 3
Assessing the Normality Assumption
The univariate normality assumption can be assessed using QQ plots, as in regression analysis. In addition, one can use any of the goodness-of-fit procedures together with univariate plots such as histograms and boxplots.
3.1 QQ Plots
Suppose that x1 , x2 , . . . , xn represent n univariate observations for a r.v. X. Let x(1) ≤ x(2) ≤ . . . ≤ x(n) represent the order statistics. It follows that
Pr[Z ≤ q(i) ] = p(i) = ∫_{−∞}^{q(i)} (2π)−1/2 e−z²/2 dz = Φ(q(i) ) = (i − 1/2)/n or (i − 3/8)/(n + 1/4),
where Φ(·) is the c.d.f. for the standard normal distribution. If the data x1 , x2 , . . . , xn are normally distributed then the plot of the pairs (q(i) , x(i) ) will be approximately linear with x(i) ≈ σq(i) + µ.
Note: If the xi 's have been standardized then the line will be the line x(·) = q(·) .
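A minimal Splus sketch of the QQ plot, assuming the observations are already in a vector x, is;
n <- length(x)
p <- ((1:n) - 0.5)/n                  # plotting positions (i - 1/2)/n
q <- qnorm(p)                         # standard normal quantiles q_(i)
plot(q, sort(x), xlab = "normal quantile", ylab = "ordered data")
abline(mean(x), sqrt(var(x)))         # reference line x = sigma q + mu (estimated)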
2. Graph the pairs (q(i) , d2(i) ) where q(i) is the 100(i − 1/2)/n percentile of the chi-square distribution with
degrees of freedom = p. If the data are multivariate normal then the plot should be linear.
A modification of the above procedure is given in Rencher (1995), page 111, where ui = n d2i /(n − 1)2 has a Beta distribution.
1. Order the distances ui from smallest to largest as u(1) ≤ u(2) ≤ . . . ≤ u(n) .
2. Graph the pairs (v(i) , u(i) ), where v(i) is the 100(i − α)/(n − α − β + 1) percentile of the Beta(α, β) distribution, with
α = (p − 2)/(2p)
and
β = (n − p − 2)/(2(n − p − 1)).
A nonlinear pattern in the plot would indicate a departure from multivariate normality.
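A sketch of the chi-square probability plot in Splus, assuming the data are in an n × p matrix x and using d2i = (xi − x̄)'S −1 (xi − x̄), is;
n <- nrow(x); p <- ncol(x)
xbar <- apply(x, 2, mean)
S <- var(x)
Sinv <- solve(S)
cen <- sweep(x, 2, xbar)                               # rows are x_i - xbar
d2 <- apply(cen, 1, function(z) t(z) %*% Sinv %*% z)   # squared distances d_i^2
q <- qchisq(((1:n) - 0.5)/n, p)                        # chi-square(p) percentiles
plot(q, sort(d2), xlab = "chi-square quantile", ylab = "ordered distance")
abline(0, 1)                                           # 45-degree reference line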
24 384.50 1473.66 168 7.36 24 540 453 8266.77
25 95 368 168 30.26 9 292 196 1845.89
RUN;
DATA USNAVY2;
SET USNAVY;
DROP SITE MMH;
SUMSQ=X‘*X-N#MEAN*MEAN‘;
S=SUMSQ/(N-{1});
D = VECDIAG(DIST); CNAME={"DIST"};
CREATE DIST FROM D[COLNAME=CNAME];
APPEND FROM D[COLNAME=DIST];
QUIT;
V=GAMINV(RSTAR,ETA);
V=2*V;
PROC PRINT;
PROC SORT; BY R;
PROC PRINT;
VARIABLES X R V;
FORMAT X 6.2 R 3.0 V 6.3;run;
symbol1 color=black i=join v=none;
symbol2 color=red v=diamond i=none;
PROC GPLOT DATA=PLOTDATA;
PLOT X*V=2 V*V = 1 /OVERLAY VZERO HZERO;
run;
The SAS output is;
U.S. NAVY BACHELOR OFFICERS’ QUARTERS
MULTIVARIATE NORMALITY PLOT
MEAN
Obs DIST
1 5.2149
2 2.9011
3 2.9139
4 2.9547
5 2.5796
6 2.8535
7 3.4292
8 7.6582
9 5.7796
10 2.1490
11 2.0194
12 3.8978
13 0.9649
14 1.3660
15 12.4224
16 8.6965
17 7.8778
18 9.7558
19 1.1235
20 7.8310
21 0.7295
22 17.8888
23 22.7630
24 20.0683
25 12.1617
Obs X R V
1 0.73 1 1.564
2 0.96 2 2.320
3 1.12 3 2.833
4 1.37 4 3.260
5 2.02 5 3.642
6 2.15 6 3.998
7 2.58 7 4.339
8 2.85 8 4.671
9 2.90 9 5.000
10 2.91 10 5.328
11 2.95 11 5.660
12 3.43 12 5.998
13 3.90 13 6.346
14 5.21 14 6.707
15 5.78 15 7.086
16 7.66 16 7.487
17 7.83 17 7.917
18 7.88 18 8.383
19 8.70 19 8.899
20 9.76 20 9.480
21 12.16 21 10.154
22 12.42 22 10.968
23 17.89 23 12.017
24 20.07 24 13.540
25 22.76 25 16.622
The graph is
3.4 Box-Cox Transformations
One method that has been proposed as a general approach to finding a normalizing transformation is the Box-Cox transformation.
Suppose that y > 0; define the Box-Cox transformation as
yi(λ) = (yiλ − 1)/λ  when λ ≠ 0,
yi(λ) = ln yi        when λ = 0,
where i = 1, 2, . . . , n. One determines λ by maximizing
−(n/2) log[s2 (λ)] = (λ − 1) Σ_{i=1}^n ln(yi ) − (n/2) log[σ̂ 2 (λ)],
where σ̂ 2 (λ) = (1/n) y(λ)' [I − H] y(λ) , i.e., it is the sum of squares for the error term when yi(λ) is used instead of yi , and y(λ) = (y1(λ) , y2(λ) , . . . , yn(λ) )'. The matrix H is X(X'X)−1 X' in the linear model Y = Xβ + E; otherwise H = j(j'j)−1 j'.
Since there is no closed-form solution to the above maximization, one usually plots −(n/2) log[s2 (λ)] versus λ. Another approach is to compute a confidence interval using the fact that 2{ℓ(λ̂) − ℓ(λ)} is approximately χ2 (df = 1), where ℓ(λ) denotes the criterion −(n/2) log[s2 (λ)] and λ̂ is its maximizing value. One can then use any λ which is contained in the confidence interval.
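Since the maximization has no closed form, a simple grid search works well; a minimal Splus sketch (no regression model, so H = j(j'j)−1 j' and σ̂ 2 (λ) is just the variance of the transformed data about its mean; y is assumed to be the vector of positive observations) is;
boxcox.ll <- function(lambda, y) {
  # profile criterion: -(n/2) log[sigma2(lambda)] + (lambda - 1) sum(log(y))
  n <- length(y)
  if(abs(lambda) < 1e-6) yl <- log(y) else yl <- (y^lambda - 1)/lambda
  s2 <- sum((yl - mean(yl))^2)/n
  -(n/2) * log(s2) + (lambda - 1) * sum(log(y))
}
lam <- seq(-2, 2, by = 0.1)                 # grid of lambda values
ll <- sapply(lam, boxcox.ll, y = y)
plot(lam, ll, type = "l", xlab = "lambda", ylab = "log-likelihood")
lam[ll == max(ll)]                          # lambda maximizing the criterion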
Splus Plots
SAS Plots
The user inputs the lower bound where the search is to begin and the
number of searches to be made. The increment for the search is also input
by the user.
USAGE: %BOXCOX
INPUT:
2. The lower bound for which the search to maximize the
log-likelihood is to be made (e.g. -1.2).
REFERENCES:
Box,G.E.P., and D.R.Cox. (1964). An analysis of
transformations (with discussion). Journal of the Royal
Statistical Society.B.26:211-252.
Johnson,R.A., and D.Wichern. (1982) Applied Multivariate
Statistical Analysis.Englewood Cliffs,N.J.:Prentice-Hall.
*/
data aa;set a;
mn =_N_;
if mn = 1;
drop mn;
uu = ’Upper*Bound’;
data a;set a;
y=log(xx);
z=(xx**lambda-1)/lambda;
data b;set b;
c91=var2*(n2-1)/n2;
c71=sum1;
c51=(-n2*log(c91)/2)+(lambda-1)*c71;
if abs(lambda)<0.000001 then lambda=0;
lambda&i=lambda;
loglik&i=c51;
loglik=c51;
data dat&i;set b;
keep lambda&i loglik&i;
%end;
data cc;set dat1;
lambda=lambda1;
loglik=loglik1;
keep loglik lambda;
proc append base=cc data=cc&j force;
%end;
/overlay href=-1 -.5 0 .5 1 lh=2 ch=green;
SYMBOL1 V=NONE I=JOIN L=1;
symbol2 v=none i=join color=red;
symbol3 v=none i=join color=green;
%mend boxcox;
TITLE ’U.S. NAVY BACHELOR OFFICERS’’ QUARTERS’;
DATA USNAVY;
INPUT SITE 1-2 ADO MAC WHR CUA WNGS OBC RMS MMH;
LABEL ADO = ’AVERAGE DAILY OCCUPANCY’
MAC = ’AVERAGE NUMBER OF CHECK-INS PER MO.’
WHR = ’WEEKLY HRS OF SERVICE DESK OPERATION’
CUA = ’SQ FT OF COMMON USE AREA’
WNGS= ’NUMBER OF BUILDING WINGS’
OBC = ’OPERATIONAL BERTHING CAPACITY’
RMS = ’NUMBER OF ROOMS’
MMH = ’MONTHLY MAN-HOURS’ ;
CARDS;
1 2 4 4 1.26 1 6 6 180.23
2 3 1.58 40 1.25 1 5 5 182.61
3 16.6 23.78 40 1 1 13 13 164.38
4 7 2.37 168 1 1 7 8 284.55
5 5.3 1.67 42.5 7.79 3 25 25 199.92
6 16.5 8.25 168 1.12 2 19 19 267.38
7 25.89 3.00 40 0 3 36 36 999.09
8 44.42 159.75 168 .6 18 48 48 1103.24
9 39.63 50.86 40 27.37 10 77 77 944.21
10 31.92 40.08 168 5.52 6 47 47 931.84
11 97.33 255.08 168 19 6 165 130 2268.06
12 56.63 373.42 168 6.03 4 36 37 1489.5
13 96.67 206.67 168 17.86 14 120 120 1891.7
14 54.58 207.08 168 7.77 6 66 66 1387.82
15 113.88 981 168 24.48 6 166 179 3559.92
16 149.58 233.83 168 31.07 14 185 202 3115.29
17 134.32 145.82 168 25.99 12 192 192 2227.76
18 188.74 937.00 168 45.44 26 237 237 4804.24
19 110.24 410 168 20.05 12 115 115 2628.32
20 96.83 677.33 168 20.31 10 302 210 1880.84
21 102.33 288.83 168 21.01 14 131 131 3036.63
22 274.92 695.25 168 46.63 58 363 363 5539.98
23 811.08 714.33 168 22.76 17 242 242 3534.49
24 384.50 1473.66 168 7.36 24 540 453 8266.77
25 95 368 168 30.26 9 292 196 1845.89
RUN;
/* When running Macro Boxcox the following results were found
SUMSQ=X‘*X-N#MEAN*MEAN‘;
S=SUMSQ/(N-{1});
D = VECDIAG(DIST); CNAME={"DIST"};
CREATE DIST FROM D[COLNAME=CNAME];
APPEND FROM D[COLNAME=DIST];
QUIT;
DATA PLOTDATA; SET RANKS;
PROC PRINT;
PROC SORT; BY R;
PROC PRINT;
VARIABLES X R V;
FORMAT X 6.2 R 3.0 V 6.3;run;
symbol1 color=black i=join v=none;
symbol2 color=red v=diamond i=none;
PROC GPLOT DATA=PLOTDATA;
PLOT X*V=2 V*V = 1 /OVERLAY VZERO HZERO;
run;
An example Box-Cox plot for these data is given by;
19 2.02465 4.49983 3.02286 3.27472 3.27472
20 1.98564 5.10153 3.03543 4.16871 3.80675
21 2.00217 4.12250 3.06852 3.38312 3.38312
22 2.32210 5.13494 3.85291 4.36492 4.36492
23 2.73124 5.16981 3.14674 3.94415 3.94415
24 2.44194 6.19583 2.06179 4.82057 4.61344
25 1.97997 4.37988 3.42622 4.13376 3.74166
MEAN
Obs DIST
1 4.2893
2 3.2561
3 2.8820
4 2.4979
5 5.6011
6 2.0426
7 10.5513
8 5.3148
9 3.4606
10 0.5301
11 3.0106
12 7.2053
13 0.5964
14 0.7616
15 7.5920
16 3.9497
17 3.5994
18 3.1316
19 1.4102
20 7.8354
21 0.6741
22 4.0409
23 15.6272
24 10.4920
25 9.6478
Obs X R V
1 0.53 1 0.752
2 0.60 2 1.250
3 0.67 3 1.610
4 0.76 4 1.921
5 1.41 5 2.206
6 2.04 6 2.477
7 2.50 7 2.740
8 2.88 8 3.000
9 3.01 9 3.260
10 3.13 10 3.522
11 3.26 11 3.790
12 3.46 12 4.066
13 3.60 13 4.351
14 3.95 14 4.651
15 4.04 15 4.966
16 4.29 16 5.303
17 5.31 17 5.667
18 5.60 18 6.064
19 7.21 19 6.507
20 7.59 20 7.009
21 7.84 21 7.595
22 9.65 22 8.309
23 10.49 23 9.236
24 10.55 24 10.596
25 15.63 25 13.388
The graph is
Note that the data appear to be more normal with the transformation than without.
Chapter 4
Multivariate Plots
In this chapter I have presented a number of graphical displays for higher dimensional data. In one dimension, one can use plots such as histograms, stem-and-leaf plots, and boxplots. When considering two dimensional data, the scatterplot is a convenient display. The next section considers some 2-dimensional displays of higher dimensional data.
Additional two dimensional plots are the Chernoff Faces and Star Plots.
4.1.3 Star Plots
4.2 3 Dimensional Plots
Johnson gives several examples of three and higher dimensional plots in his text, pages 55-61, with corre-
sponding SAS code. I have several graphs produced using Splus in Windows. The first is a three dimensional
scatterplot of the US Navy data for (ADO, MAC) vs. MMH. The second is for (log(ADO), log(MAC)) vs. MMH with the size of the triangle proportional to another variable.
Chapter 5
The purpose of this chapter is to describe inference procedures for the mean of normally distributed popu-
lations.
Rencher (1995) pages 127-128 gives a brief discussion of the advantages of using multivariate tests over many univariate tests. He first indicates that the number of parameters for dimension p could be staggering, since there are p means, p variances, and p(p − 1)/2 covariances, which sum to p(p + 3)/2 values that need to be estimated or determined from the data of size n. Clearly, n needs to be very large whenever p is even of moderate size. He lists other reasons that I have included for completeness. They are;
1. The use of p univariate tests inflates the type I error rate.
2. The univariate tests ignore the correlation structure in the p variables.
3. The multivariate test is more powerful.
In the first section we will consider the one population case. That is, it is assumed that X ∼ Np (µ, Σ).
Recall the univariate case: when testing
H0 : µ = µ0  vs.  H1 : µ ≠ µ0
using the data x1 , x2 , . . . , xn , the exact form of the test statistic depends upon whether or not σ 2 is known. That is, if σ 2 is known then the test statistic
t = (x̄ − µ0 )/(σ/√n)
has a standard normal distribution whenever the null hypothesis is true. In this case, one rejects H0 in favor of H1 if | t | ≥ zα/2 where Pr[Z ≥ zα/2 ] = α/2.
Whenever σ 2 is unknown, one replaces σ 2 with s2 , the sample variance, and the test statistic becomes
t = (x̄ − µ0 )/(s/√n),
which has a t-distribution with n − 1 degrees of freedom whenever the null hypothesis is true. In this case, one rejects H0 in favor of H1 if | t | ≥ tα/2 (df = n − 1) where Pr[T ≥ tα/2 (n − 1)] = α/2.
Univariate Likelihood Ratio Test
Although the above results are often simply stated in a methods text, they follow from what is called the likelihood ratio test. That is, suppose one wants to test the above hypothesis whenever σ 2 is unknown. The likelihood ratio is given by
λ = L(µ0 , σ̂02 | x1 , x2 , . . . , xn ) / L(µ̂, σ̂ 2 | x1 , x2 , . . . , xn ),
where σ̂02 maximizes the likelihood under H0 and (µ̂, σ̂ 2 ) are the unrestricted maximum likelihood estimates.
By observing that
Σ_{i=1}^n (xi − µ0 )2 = Σ_{i=1}^n (xi − x̄ + x̄ − µ0 )2 = Σ_{i=1}^n (xi − x̄)2 + n(x̄ − µ0 )2 ,
it follows that
λ2/n = 1 / [1 + n(x̄ − µ0 )2 / Σ_{i=1}^n (xi − x̄)2 ] = 1 / [1 + t2 /(n − 1)].
It is not necessary to know the exact distribution of λ since t2 = 0 implies λ = 1 and λ → 0 as t2 → ∞. Therefore, one can reject H0 whenever t2 becomes "too large". Furthermore, one can show that t = √t2 has a t-distribution with df = n − 1.
Let
Qe = Σ_{i=1}^n (xi − x̄)(xi − x̄)' = X'[In − j(j'j)−1 j']X,
where X is the n × p data matrix whose ith row is xi' = (xi1 , xi2 , . . . , xip ) and j is the n × 1 vector of ones.
Note that
Σ_{i=1}^n (xi − µ0 )(xi − µ0 )' = Qe + n(x̄ − µ0 )(x̄ − µ0 )' = Qe + Qh ,
where Qh = n(x̄ − µ0 )(x̄ − µ0 )'. This follows since Σ_{i=1}^n (xi − µ0 )(xi − µ0 )' = Σ_{i=1}^n (xi − x̄ + x̄ − µ0 )(xi − x̄ + x̄ − µ0 )' and (x̄ − µ0 )[Σ_{i=1}^n (xi − x̄)]' = [Σ_{i=1}^n (xi − x̄)](x̄ − µ0 )' = 0. Note that the rank of Qh is the same as the rank of (x̄ − µ0 ), namely 1.
Thus,
Λ = λ2/n = | Qe | / | Qe + Qh | = | Qe | / | Qe + n(x̄ − µ0 )(x̄ − µ0 )' |.
It can be shown that
Λ = | Qe | / (| Qe | | 1 + n(x̄ − µ0 )'Qe−1 (x̄ − µ0 ) |) = 1 / [1 + T 2 /(n − 1)],
where T 2 = n(x̄ − µ0 )'S −1 (x̄ − µ0 ) = nD2 is Hotelling's T 2 and (n − 1)S = Qe . Furthermore,
(n − p)T 2 / [(n − 1)p] = (n − p)nD2 / [(n − 1)p] ∼ F (p, n − p),
whenever H0 is true. Thus, one can reject H0 in favor of H1 if T 2 ≥ [(n − 1)p/(n − p)]Fα (p, n − p)
where Fα (p, n − p) is the α critical point for an F-distribution with numerator degrees of freedom = p and
denominator degrees of freedom equal to n-p.
Note: Qe is the sum of squares for the error term and Qh is the sum of squares explained by the model, in this case by the hypothesis that µ = µ0 . Returning to the likelihood ratio,
Λ = | Qe | / | Qe + Qh |.
It can be shown that Qe and Qh are independent Wishart random variables and that
Λ = Π_{i=1}^s (1 + λi )−1 ,
where λ1 ≥ λ2 ≥ . . . ≥ λs are the nonzero eigenvalues of Qh Qe−1 .
• Wilks' lambda:
Λ = Π_{i=1}^s (1 + λi )−1 .
• Pillai's trace:
Σ_{i=1}^s λi /(1 + λi ).
• Roy's statistic:
θs = λ1 /(1 + λ1 ),
where λ1 is the largest eigenvalue of Qh Qe−1 .
• Lawley-Hotelling trace:
U (s) = νe Σ_{i=1}^s λi = νe tr[Qh Qe−1 ],
where νe is the error degrees of freedom.
Again, it should be noted that in testing the hypothesis H0 : µ = µ0 , s = 1 and all four criteria are equivalent. This will not be the case when one has a more general null hypothesis as in ANOVA models.
Second, using PROC IML. The first procedure is better!
*/
* Hotellings one sample t-test with proc GLM;
data new; set dat; int = 1;
sweat = sweat - 4;
sodium = sodium - 50;
potass = potass - 10;
run;
PROC GLM DATA=new;
MODEL sweat sodium potass=INT/NOUNI NOINT;
MANOVA H=INT/PRINTE;
RUN;
* Hotellings one sample t-test with proc IML;
DATA XX; SET DAT;
PROC IML;
RESET NOLOG;
USE XX; READ ALL INTO X;
PRINT "The Data Matrix is" X;
N=NROW(X); P=NCOL(X);
print n p;
XBAR = X(|+,|)‘/N;
PRINT, "XBAR = " XBAR;
SUMSQ=X‘*X-(XBAR*XBAR‘)#N;
S=SUMSQ/(N-1);
PRINT , S;
NU=N-1;
W=NU#S;
mu_0 = {4 50 10}‘;
print mu_0;
T2=n*(xbar - mu_0)‘*INV(S)*(xbar -mu_0);
PRINT, "Hotellings T2 = " T2;
F_stat=(n-P)# T2/(n-1)/p;
p_hat = 1-PROBf(f_stat,p,n-p);
PRINT, f_stat p_hat;
QUIT;
The SAS output is;
Sweat Data - Johnson Wichren page 229
Test for mu_o = (4, 50, 10)
Number of observations 20
Sweat Data - Johnson Wichren page 229
Test for mu_o = (4, 50, 10)
sweat sodium potass
Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|
Hotelling-Lawley Trace 0.51256699 2.90 3 17 0.0649
Roy’s Greatest Root 0.51256699 2.90 3 17 0.0649
N P
20 3
XBAR
XBAR = 4.64
45.4
9.965
MU_0
4
50
10
T2
Hotellings T2 = 9.7387729
F_STAT P_HAT
2.9045463 0.0649283
The Splus code for this same problem is;
attach(T5.1)
x<-data.matrix(T5.1)
n<-dim(x)[1] # define n
p<-dim(x)[2] # define p
mu0<-cbind(4,50,10) # define mu_0
dm<-cbind(mean(x[,1]),mean(x[,2]),mean(x[,3])) - mu0 # xbar-mu0
t2<- n*dm%*%solve(var(x))%*%t(dm) # compute Hotelling’s T2
t2
pvalue<-1 - pf((n-p)/p/(n-1)*t2,p,n-p) # compute p-value for test
pvalue
Separate Populations with Different Covariances
If the data are not paired, one can compute
T02 = (x̄1 − x̄2 )'[S1 /n1 + S2 /n2 ]−1 (x̄1 − x̄2 ).
It can be shown that the upper α critical point for the above statistic is approximately
kα = χ2p (α)[1 + 0.5(k1 /p + k2 χ2p (α)/(p(p + 2)))],
where
k1 = Σ_{i=1}^2 (tr[W −1 Wi ])2 /(ni − 1),
k2 = Σ_{i=1}^2 {(tr[W −1 Wi ])2 + 2 tr[W −1 Wi W −1 Wi ]}/(ni − 1),
Wi = Si /ni ,   W = W1 + W2 ,
and χ2p (α) is the upper α critical point of a chi-square distribution with p degrees of freedom.
An alternative procedure suggested by Yao (1965) involves computing the degrees of freedom given by
1/ν = Σ_{i=1}^2 (ni − 1)−1 [(x̄1 − x̄2 )'W −1 Wi W −1 (x̄1 − x̄2 ) / T02 ]2 ,
after which one rejects H0 if T02 exceeds [pν/(ν − p + 1)]Fα (p, ν − p + 1).
As already mentioned, SAS PROC GLM can be used to test the equality of two means if the covariances are equal; however, there is no simple way to handle the test of two means with unequal covariances. As an example, assuming equality of covariances and testing for equality of means, the SAS code is;
proc glm;
class gen;
model tot ami = gen;
manova h=_all_/printe printh;
run;
The corresponding SAS output is;
Multivariate Linear Regression
gen 2 0 1
Number of observations 17
Sum of
Source DF Squares Mean Square F Value Pr > F
Sum of
Source DF Squares Mean Square F Value Pr > F
Corrected Total 16 7610377.882
tot ami
Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|
DF = 15 tot ami
tot ami
tot 288658.11863 392015.8598
ami 392015.8598 532382.16569
MANOVA Test Criteria and Exact F Statistics for the Hypothesis of No Overall gen Effect
H = Type III SSCP Matrix for gen
E = Error SSCP Matrix
The Splus code for test of means with unequal covariances is;
attach(amitriptyline) #data in Johnson-Wichren page 455
amitriptyline
tot ami gen amt pr diap qrs
3 1131 810 0 3600 205 60 111
8 1111 941 0 1500 200 70 93
14 500 384 0 2000 160 60 80
15 781 501 0 4500 180 0 100
16 1070 405 0 1500 170 90 120
1 3389 3149 1 7500 220 0 140
2 1101 653 1 1975 200 0 100
4 596 448 1 675 160 60 120
5 896 844 1 750 185 70 83
6 1767 1450 1 2500 180 60 80
7 807 493 1 350 154 80 98
9 645 547 1 375 137 60 105
10 628 392 1 1050 167 60 74
11 1360 1283 1 3000 180 60 80
12 652 458 1 450 160 64 60
13 860 722 1 1750 135 90 79
17 1754 1520 1 3000 180 0 129
x1<-amitriptyline[1:5,1:2]
x1
x2<-amitriptyline[6:17,1:2]
x2
n1<-dim(x1)[1]
p<-dim(x1)[2]
n2<-dim(x2)[1]
x1bar<-cbind(mean(x1[,1]),mean(x1[,2]))
x2bar<-cbind(mean(x2[,1]),mean(x2[,2]))
s1<-var(x1)
s2<-var(x2)
s1
s2
st<-1/n1*s1 + 1/n2*s2
stinv<-solve(st)
t2<-(x1bar-x2bar)%*%stinv%*%t((x1bar-x2bar))
w1<-s1/n1
w2<-s2/n2
w<-w1+w2
winv<-solve(w)
k1<-1/(n1-1)*sum(diag(winv%*%w1))^2 + 1/(n2-1)*sum(diag(winv%*%w2))^2
k2<-((sum(diag(winv%*%w1))^2 + 2*sum(diag(winv%*%w1%*%winv%*%w1))))/(n1-1) +   # trace terms use matrix products
((sum(diag(winv%*%w2))^2 + 2*sum(diag(winv%*%w2%*%winv%*%w2))))/(n2-1)
qalpha<-qchisq(.99,p)
kalpha<-qalpha*(1+.5*(k1/p+k2*qalpha/p/(p+2)))
kalpha
t2
#Yao’s procedure
nu<-1/(1/(n1-1)*((x1bar-x2bar)%*%winv%*%w1%*%winv%*%t(x1bar-x2bar)/t2)^2 +
1/(n2-1)*((x1bar-x2bar)%*%winv%*%w2%*%winv%*%t(x1bar-x2bar)/t2)^2)
talpha<-p*nu/(nu-p+1)*qf(.99,p,nu-p+1)
One can use the above code with just knowing the summary statistics. The Splus code for creating a vector
and a matrix is;
x1bar<-cbind(12,15)
s1<-rbind(cbind(20,-5),cbind(-5,15))
produces,
> x1bar
[,1] [,2]
[1,] 12 15
> s1
[,1] [,2]
[1,] 20 -5
[2,] -5 15
Chapter 6
In this chapter some of the inference results for the mean of a single population are extended to the more general linear model. In the univariate case, these models include the simple linear regression, multiple regression, ANOVA models, and analysis of covariance models.
The univariate linear model is written as
y = Xβ + e,
where
y = (y1 , y2 , . . . , yn )',   β = (β0 , β1 , β2 , . . . , βk )',   e = (e1 , e2 , . . . , en )',
and X is the n × (k + 1) matrix whose ith row is (1, xi1 , xi2 , . . . , xik ),
and
~y is n × 1 vector of dependent observations.
X is an n × (p = k + 1) matrix of independent observations or known values.
β is a p × 1 vector of parameters.
e is an n × 1 vector of unobservable errors or residuals.
The least squares problem consists of minimizing
Q(β) = e'e = (y − Xβ)'(y − Xβ) = y'y − β'X'y − y'Xβ + β'X'Xβ.
Setting the derivative of Q(β) with respect to β equal to zero gives the normal equations
X'Xβ = X'y.
If rank[X] = p, the unique solution is
β̂ = (X'X)−1 X'y.
If rank[X] < p the normal equation no longer has a unique solution. However, one can find a nonunique
solution given by
β̃ = (X 0 X)− X 0 ~y .
Since β̃ is no longer unique, one must restrict its use to what are called estimable functions. That is, a parametric function c'β is said to be estimable if there exists a vector a such that E(a'y) = c'β. Estimable functions are the only functions of β̃ that are unique.
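A small Splus sketch of the least squares computations (assuming a full-rank design matrix X, with a leading column of ones, and a response vector y are already available) is;
betahat <- solve(t(X) %*% X) %*% t(X) %*% y    # solution of the normal equations
H <- X %*% solve(t(X) %*% X) %*% t(X)          # hat matrix
yhat <- H %*% y                                # fitted values
e <- y - yhat                                  # residuals
s2 <- sum(e^2)/(nrow(X) - ncol(X))             # estimate of sigma^2 = Qe/(n - p)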
6.1.1 Inference
The least square problem is an optimization rather than a statistical problem. In order for this problem
to become statistical it is necessary to introduce a distributional assumption. That is, assume that the
unobserved error ei , i = 1, 2, . . . , n are i.i.d normals with mean = 0 and variance = σ 2 . This assumption
becomes
e = (e1 , e2 , . . . , en )' ∼ Nn (0, σ 2 In ).
Using the properties of linear transformations of normal variables one has
y = (y1 , y2 , . . . , yn )' ∼ Nn (Xβ, σ 2 In ).
(b) var(β̂i ) = σ 2 ((X 0 X)−1 )ii .
(c) cov(β̂i , β̂j ) = σ 2 ((X 0 X)−1 )ij .
(d) corr(β̂i , β̂j ) = ((X'X)−1 )ij /[((X'X)−1 )ii ((X'X)−1 )jj ]1/2 .
2. ŷ ∼ Nn (Xβ, σ 2 H).
(a) var(ŷi ) = σ 2 hii .
(b) cov(ŷi , ŷj ) = σ 2 hij , where H = (hij ). Notice that the ŷi0 s are not independent of one another
unless hij = 0.
(c) corr(ŷi , ŷj ) = hij /[hii hjj ]1/2 .
3. ê ∼ Nn (0, σ 2 (I − H)).
(a) var(êi ) = σ 2 (1 − hii ).
(b) cov(êi , êj ) = −σ 2 hij .
(c) corr(êi , êj ) = −hij /[(1 − hii )(1 − hjj )]1/2 .
6.1.2 Estimation of σ 2
The estimation of σ 2 follows from observing that the residual sum of squares Q(β̂) is a quadratic form, Qe = ê'ê = y'(I − H)y, and using the expected value of a quadratic form, i.e., E(y'Ay) = σ 2 tr[A] + µ'Aµ when cov(y) = σ 2 I. One has
E(Q(β̂)) = E[y'(I − H)y]
         = σ 2 tr[(I − H)] + β'X'(I − H)Xβ
         = σ 2 tr[(I − H)]
         = σ 2 (tr[I] − tr[H])
         = σ 2 (n − p),
since
X'(I − H) = X' − X'X(X'X)−1 X' = X' − X' = 0,
(I − H)X = X − X(X'X)−1 X'X = X − X = 0.
From here one can define an estimate for σ 2 as σ̂ 2 = M SE = Qe /(n − p).
Since β1 is the only parameter that is needed if the independent variable x explains the dependent variable y, the sum of squares term is adjusted for β0 , that is
y'Hy = y'(H − j(j'j)−1 j')y + y'j(j'j)−1 j'y,
where y'j(j'j)−1 j'y = nȳ 2 is the correction factor or the sum of squares due to β0 . In which case the ANOVA table becomes
where i is 1 if the model includes the y intercept (β0 ), and is 0 otherwise. Tolerance (TOL) and variance
inflation factors (VIF) measure the strength of interrelationships among regressor variables in the model.
They are given as
T OL = 1 − R2
and
V IF = T OL−1 .
2. SSE /σ 2 ∼ χ2 (df = (n − p)).
3. SS(β) and SSE are independent since H(I − H) = 0.
4. F = M S(β)/M SE ∼ F (df1 = p, df2 = (n − p), λ = 1/2β 0 X 0 Xβ).
5. When β = 0 it follows that SS(β)/σ 2 ∼ χ2 (df = p) and F = M S(β)/M SE ∼ F (df1 = p, df2 = (n−p)).
9. When β∗ = 0 it follows that SSβ∗ |β0 /σ 2 ∼ χ2 (df = p − 1) and F = M Sβ∗ |β0 /M SE ∼ F (df1 =
p − 1, df2 = (n − p)).
H0 : Lβ = c
where L is a q × p matrix of rank q. The approach is to estimate Lβ − c with Lβ̂ − c. If L is estimable then
Lβ̂ − c is unique. Using the properties of expectation and covariance we have
1. E(Lβ̂ − c) = Lβ − c.
Q = (Lβ̂ − c)'(L(X'X)−1 L')−1 (Lβ̂ − c)/σ 2 ∼ χ2 (df = q, λ = (1/(2σ 2 ))(Lβ − c)'(L(X'X)−1 L')−1 (Lβ − c)).
Taking the derivative of Λ with respect to both β and λ and setting equal to zero gives the following solution
6.2 Multivariate Case
The previous model can be extended to the multivariate case. Suppose that Y is a matrix of n k-dimensional random variables, such that
Y = XB + E
where
• ~yi0 = [yi1 , yi2 , . . . , yik ] is the ith row of the matrix Y .
• X is a known design n × p matrix of rank r ≤ p < n.
• B is an unknown p × k matrix of nonrandom coefficients. The j th column of B contains the coefficients for the j th response variable.
• E is a n × k matrix of unobservable errors, such that E(e0i ) = 0 and V ar(e0i ) = Σ and the rows of E
are independent of one another, i.e., cov(ei , ej ) = 0.
As in the univariate case, one can estimate the coefficient matrix B using a least squares method. In the multivariate case one needs to minimize tr[(Y − XB)'(Y − XB)], which leads to the normal equations
(X'X)B = X'Y.
The solution is B̂ = (X'X)−1 X'Y when rank[X] = p, and a nonunique solution is
B̃ = (X'X)− X'Y
whenever rank[X] = r < p.
The estimation of Σ follows from observing that the residual sum of squares matrix is Qe = Ê'Ê = Y '(I − H)Y and Qh = Y 'HY , where H = X(X'X)−1 X' or H = X(X'X)− X'. Note that even though (X'X)− is not unique, H = X(X'X)− X' is. It can be shown that
E(Qe ) = (n − r)Σ.
From here one can define an estimate for Σ with
Σ̂ = Qe /(n − r).
As in the univariate case the sum of squares term is adjusted for β0 , that is
Y 'HY = Y '(H − j(j'j)−1 j')Y + Y 'j(j'j)−1 j'Y,
where Y 'j(j'j)−1 j'Y = nȲ 'Ȳ is the correction factor or the sum of squares due to β0 . In which case the ANOVA table becomes
where β∗ = (β2 , β3 , . . . , βq )
H0 : LBM = c,
for which one computes
Qh = (LB̂M − c)'[L(X'X)− L']−1 (LB̂M − c)
and
Qe = M'Y '[I − H]Y M.
The diagonal elements of Qh and Qe correspond to the hypothesis and error SS for univariate tests. When
the M matrix is the identity matrix (SAS default), these tests are for the original dependent variables on
the left-hand side of the MODEL statement. When an M matrix other than the identity is specified, the
tests are for transformed variables defined by the columns of the M matrix. These tests can be studied
by requesting the SUMMARY option, which produces univariate analyses for each original or transformed
variable.
Four multivariate test statistics, all functions of the eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λs of Qe−1 Qh or of (Qe + Qh )−1 Qh , are constructed:
• Wilks' lambda = Λ = | Qe | / | Qh + Qe | = Π_{i=1}^s (1 + λi )−1
• Pillai's trace = U (s) = tr[Qh (Qh + Qe )−1 ] = Σ_{i=1}^s λi /(1 + λi )
• Hotelling-Lawley trace = V (s) = tr[Qe−1 Qh ] = Σ_{i=1}^s λi
• Roy's maximum root = θ = λ1 /(1 + λ1 )
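Given Qh and Qe as matrices, all four criteria are simple functions of the eigenvalues; a short Splus sketch is;
lam <- Re(eigen(solve(Qe) %*% Qh)$values)    # eigenvalues of Qe^{-1} Qh
lam <- lam[lam > 1e-10]                      # keep the s nonzero roots
wilks <- prod(1/(1 + lam))                   # Wilks' lambda
pillai <- sum(lam/(1 + lam))                 # Pillai's trace
hl <- sum(lam)                               # Hotelling-Lawley trace
roy <- max(lam)/(1 + max(lam))               # Roy's maximum root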
6.2.3 One-Way MANOVA – Examples
Consider some examples found in Johnson chapter 10. The SAS code is;
OPTIONS LINESIZE=74 PAGESIZE=54 NODATE;
TITLE ’Ex. 11.1 - MANOVA’’S on Cooked Turkey Data’;
DATA STORAGE;
INPUT REP 1 TRT 3 CKG_LOSS 5-8 PH 10-13 MOIST 15-19 FAT 21-25 HEX 27-30
NONHEM 32-35 CKG_TIME 37-38;
CARDS;
1 1 36.2 6.20 57.01 11.72 1.32 13.1 47
1 2 36.2 6.34 59.95 10.00 1.18 09.2 47
1 3 31.5 6.52 61.93 09.85 0.50 05.3 46
1 4 33.0 6.49 59.95 10.32 1.55 07.8 48
1 5 32.5 6.50 60.98 09.88 0.88 08.8 44
2 1 34.8 6.49 61.40 09.35 1.00 08.6 48
2 2 33.0 6.62 59.96 09.32 0.40 06.8 52
2 3 26.8 6.73 63.37 10.08 0.30 04.8 54
2 4 34.0 6.67 59.81 09.20 0.75 06.4 50
2 5 32.8 6.86 60.37 09.18 0.32 07.2 48
3 1 37.5 6.20 57.01 09.92 0.58 09.8 50
3 2 33.2 6.67 60.86 10.18 0.48 08.8 47
3 3 27.8 6.78 61.92 09.38 0.20 05.8 47
3 4 34.2 6.64 59.34 11.32 0.73 08.0 49
3 5 32.0 6.78 58.50 10.48 0.35 07.2 48
4 1 38.5 7.34 59.25 10.58 1.48 08.2 48
4 2 35.0 6.61 61.12 10.05 0.90 07.0 48
4 3 33.8 6.65 60.40 09.52 0.32 06.6 50
4 4 35.8 6.47 61.08 10.52 1.58 06.6 49
4 5 36.5 6.72 61.61 09.70 0.55 07.4 48
5 1 38.5 6.40 56.25 10.18 0.90 09.6 47
5 2 34.2 6.67 61.37 09.48 0.65 06.8 47
5 3 33.5 6.74 61.60 09.60 0.22 05.2 45
5 4 37.5 6.47 60.78 10.18 1.30 07.2 48
5 5 36.2 6.70 59.57 10.12 0.88 07.0 48
;
*PROC PRINT;run;
PROC GLM;
CLASSES TRT;
MODEL CKG_LOSS moist hex nonhem = TRT;
* MEANS TRT/LSD;
MANOVA H=TRT/PRINTE;
RUN;
The SAS output is;
Class Level Information
TRT 5 1 2 3 4 5
Number of observations 25
Sum of
Source DF Squares Mean Square F Value Pr > F
Sum of
Source DF Squares Mean Square F Value Pr > F
Error 20 32.18872000 1.60943600
Sum of
Source DF Squares Mean Square F Value Pr > F
Ex. 11.1 - MANOVA’S on Cooked Turkey Data
Sum of
Source DF Squares Mean Square F Value Pr > F
Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|
DF = 20 CKG_LOSS MOIST HEX NONHEM
Hotelling-Lawley Trace 4.15348825 4.19 16 28.471 0.0004
Roy’s Greatest Root 3.30759898 16.54 4 20 <.0001
TRT 5 1 2 3 4 5
Number of observations 25
-------------------------MANOVA RESULTS-------------------------------
PH FAT CKG_TIME
Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|
DF = 20 PH FAT CKG_TIME
FAT -0.094413 1.000000 -0.114949
0.6840 0.6198
3389 3149 1 7500 220 0 140
1101 653 1 1975 200 0 100
1131 810 0 3600 205 60 111
596 448 1 675 160 60 120
896 844 1 750 185 70 83
1767 1450 1 2500 180 60 80
807 493 1 350 154 80 98
1111 941 0 1500 200 70 93
645 547 1 375 137 60 105
628 392 1 1050 167 60 74
1360 1283 1 3000 180 60 80
652 458 1 450 160 64 60
860 722 1 1750 135 90 79
500 384 0 2000 160 60 80
781 501 0 4500 180 0 100
1070 405 0 1500 170 90 120
1754 1520 1 3000 180 0 129
;
proc glm;
model tot ami = gen amt /ss1 ss2;
manova h=_all_/printe printh;
run;
The SAS output is;
Multivariate Linear Regression
Number of observations 17
Sum of
Source DF Squares Mean Square F Value Pr > F
Source DF Type I SS Mean Square F Value Pr > F
Standard
Parameter Estimate Error t Value Pr > |t|
Sum of
Source DF Squares Mean Square F Value Pr > F
gen 1 1258789.188 1258789.188 10.87 0.0053
amt 1 5457338.373 5457338.373 47.14 <.0001
Standard
Parameter Estimate Error t Value Pr > |t|
-------------------------MANOVA RESULTS---------------------------------------------------
tot ami
Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|
DF = 14 tot ami
tot ami
Characteristic Roots and Vectors of: E Inverse * H, where
H = Type II SSCP Matrix for gen
E = Error SSCP Matrix
tot ami
MANOVA Test Criteria and Exact F Statistics
for the Hypothesis of No Overall amt Effect
H = Type II SSCP Matrix for amt
E = Error SSCP Matrix
Chapter 7
In this chapter the problem of inference for various forms of the covariance matrix is considered. The first case is the univariate test for the variance of a population.
For testing H0 : σ 2 = σ02 , the test statistic is t = (n − 1)s2 /σ02 . If the null hypothesis is true then the distribution of t is χ2 (df = n − 1). That is, one rejects H0 if t ≤ χ21−α/2 (n − 1) or if t ≥ χ2α/2 (n − 1), where χ2ν (n − 1) is the upper ν × 100% critical point for a chi-square distribution with n − 1 degrees of freedom.
The likelihood ratio statistic for testing H0 : Σ = Σ0 is
Λ* = (e/ν)pν/2 | Σ0−1 W |ν/2 exp[−(1/2) tr(Σ0−1 W )],
where ν = (n − 1) and W = (n − 1)Σ̂ = νΣ̂. Johnson provides SAS IML code for computing Λ* (called LAM_STAR).
DATA DAT;
INPUT ID X1-X3;
CARDS;
1 1 3 5
2 2 3 5
3 1 4 6
4 1 4 4
5 3 4 7
6 2 5 6
DATA XX; SET DAT; DROP ID;
PROC IML;
RESET NOLOG;
USE XX; READ ALL INTO X;
PRINT "The Data Matrix is" X;
N=NROW(X); P=NCOL(X);
XBAR = X(|+,|)‘/N;
PRINT, "XBAR = " XBAR;
SUMSQ=X‘*X-(XBAR*XBAR‘)#N;
S=SUMSQ/(N-1);
PRINT , "The Variance-Covariance Matrix is " S;
NU=N-1;
W=NU#S;
SIG0 = {2 1 0, 1 2 0, 0 0 1};
Q=INV(SIG0)*W;
LAM_STAR = (EXP(1)/NU)##(P#NU/2)#(DET(Q))##(NU/2)#EXP(-.5#TRACE(Q));
PRINT, "LAM_STAR = " LAM_STAR;
L = -2#LOG(LAM_STAR);
PRINT, "L = " L;
B2 = P#(2#P#P+3#P-1)/24;
B3 = -P#(P-1)#(P+1)#(P+2)/32;
F=P#(P+1)/2;
A1 = 1-PROBCHI(L,F);
A2 = 1-PROBCHI(L,F+2);
A3 = 1-PROBCHI(L,F+4);
ALPHA = A1+(1/NU)#B2#(A2-A1)
+(1/(6#NU#NU))#((3#B2#B2-4#B3)#A3-6#B2#B2#A2
+(3#B2#B2+4#B3)#A1);
PRINT, "ALPHA = ", ALPHA;
QUIT;
The SAS output is;
Ex. 10.1 - A test on SIGMA, the variance-covariance matrix.
X
XBAR
XBAR = 1.6666667
3.8333333
5.5
LAM_STAR = 0.0162956
L = 8.2337203
ALPHA = 0.3842768
The Splus code is;
attach(e10) # or
e10<-t(cbind(c(1, 3, 5),c(2, 3, 5),c(1, 4, 6),c(1, 4, 4),c(3, 4, 7),c(2, 5, 6)))
x<-data.matrix(e10)
n<-dim(x)[1]
p<-dim(x)[2]
xbar<-cbind(mean(x[,1]),mean(x[,2]),mean(x[,3]))
nu<-n-1
w<-nu*var(x)
sig0<-cbind(c(2, 1, 0),c(1, 2 ,0),c(0, 0, 1))
q<-solve(sig0)%*%w
lamstar<-(exp(1)/nu)^(p*nu/2)*prod(eigen(q)$values)^(nu/2)*exp(-.5*sum(diag(q)))
b2<-p*(2*p*p+3*p-1)/24
b3<--p*(p-1)*(p+1)*(p+2)/32
f<-p*(p+1)/2
a1<- 1 - pchisq(-2*log(lamstar),f)
a2<- 1 - pchisq(-2*log(lamstar),f+2)
a3<- 1 - pchisq(-2*log(lamstar),f+4)
alpha<-a1 + b2*(a2-a1)/nu + ((3*b2*b2-4*b3)*a3 - 6*b2*b2*a2 + (3*b2*b2 + 4*b3)*a1)/(6*nu*nu)
Johnson gives several other related tests with corresponding SAS IML code. I have included some of his examples;
The SAS IML code is;
OPTIONS LINESIZE=75 PAGESIZE=54 NODATE PAGENO=1;
TITLE
’Ex. 10.2 - A test for sphericity of the variance-covariance matrix.’;
DATA DAT;
INPUT ID X1-X3;
CARDS;
1 1 3 5
2 2 3 5
3 1 4 6
4 1 4 4
5 3 4 7
6 2 5 6
DATA XX; SET DAT; DROP ID;
PROC IML;
RESET NOLOG;
USE XX; READ ALL INTO X;
PRINT "The Data Matrix is" X;
N=NROW(X); P=NCOL(X);
XBAR = X(|+,|)‘/N;
PRINT, "XBAR = " XBAR;
SUMSQ=X‘*X-(XBAR*XBAR‘)#N;
S=SUMSQ/(N-1);
PRINT , "The Variance-Covariance Matrix is " S;
NU=N-1;
W=NU#S;
LAMDA = DET(W)/((1/P)#TRACE(W))##P;
PRINT, "LAMDA = " LAMDA;
M = NU - (2#P#P+P+2)/6/P;
L = -M#LOG(LAMDA);
PRINT, "L = " L;
A = (P+1)#(P-1)#(P+2)#(2#P#P#P+6#P#P+3#P+2)/288/P/P;
F=P#(P+1)/2-1;
A1 = 1-PROBCHI(L,F);
A3 = 1-PROBCHI(L,F+4);
ALPHA = A1+(A/M/M)#(A3-A1);
PRINT, "ALPHA = ", ALPHA;
QUIT;
And the SAS output is;
Ex. 10.2 - A test for sphericity of the variance-covariance matrix.
XBAR = 1.6666667
3.8333333
5.5
LAMDA = 0.3825656
L = 3.5765164
ALPHA = 0.654943
The Splus code is;
attach(e10) # or
e10<-t(cbind(c(1, 3, 5),c(2, 3, 5),c(1, 4, 6),c(1, 4, 4),c(3, 4, 7),c(2, 5, 6)))
x<-data.matrix(e10)
n<-dim(x)[1]
p<-dim(x)[2]
xbar<-cbind(mean(x[,1]),mean(x[,2]),mean(x[,3]))
nu<-n-1
w<-nu*var(x)
lambda<-prod(eigen(w)$values)/((1/p)*sum(diag(w)))^p
m<-nu - (2*p*p+p+2)/6/p
L<--m*log(lambda)
a<-(p+1)*(p-1)*(p+2)*(2*p^3+6*p*p+3*p+2)/288/p/p
f<-p*(p+1)/2-1
a1<- 1 - pchisq(L,f)
a3<- 1 - pchisq(L,f+4)
alpha<-a1 + (a/m/m)*(a3-a1)
lambda
L
alpha
One rejects the null hypothesis, for large n, if Q ≥ χ2α,f , where the statistic Q and the degrees of freedom f are computed as in the code below.
XBAR
XBAR = 1.6666667
3.8333333
5.5
S
S2 = 0.7777778
R = 0.4428571
LAMDA = 0.6535772
Q = 1.4885312
DEGREES OF FREEDOM = 4
ALPHA = 0.8286711
The Splus code is;
attach(e10) # or
e10<-t(cbind(c(1, 3, 5),c(2, 3, 5),c(1, 4, 6),c(1, 4, 4),c(3, 4, 7),c(2, 5, 6)))
x<-data.matrix(e10)
n<-dim(x)[1]
p<-dim(x)[2]
xbar<-cbind(mean(x[,1]),mean(x[,2]),mean(x[,3]))
nu<-n-1
s<-var(x)
s2<-sum(diag(s))/p
r<-(2/(p*(p-1)*s2))*(sum(s)-sum(diag(s)))/2
lambda<-prod(eigen(s)$values)/((s2^p)*((1-r)^(p-1))*(1+(p-1)*r))
Q<- -(nu-(p*(p+1)^2*(2*p-3))/6/(p-1)/(p^2+p-4))*log(lambda)
f<-(p*(p+1)-4)/2
alpha<- 1 - pchisq(Q,f)
s2
r
lambda
Q
f
alpha
H0 : Σ1 = Σ2 = . . . = Σk
versus the alternative that there is at least one inequality. Bartlett’s test is based upon the Likelihood Ratio
Test with the test statistic given by
Λ = Π_{i=1}^k | Σ̂i |(ni −1)/2 / | Σ̂ |(n−k)/2 ,
where Σ̂i is the sample covariance matrix for population i,
Σ̂ = Σ_{i=1}^k (ni − 1)Σ̂i /(n − k),
and n = Σ_{i=1}^k ni , where ni is the sample size for the ith population. It can be shown that −2 log Λ has an approximate chi-square distribution for large samples.
SAS uses a modification of this statistic, where the test statistic is
Q = −2ρ log[ npn/2 Λ / Π_{i=1}^k ni pni /2 ],
where
ρ = 1 − [Σ_{i=1}^k 1/ni − 1/n][(2p2 + 3p − 1)/(6(p + 1)(k − 1))].
Then one rejects the null hypothesis if Q ≥ χ2α (df = p(p + 1)(k − 1)/2).
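A Splus sketch of this statistic, following the formulas above (the data are assumed to be in an n × p matrix x with a group indicator g of length n; SAS's PROC DISCRIM version differs slightly in how the sums of squares matrices enter), is;
box.m <- function(x, g) {
  ldet <- function(M) sum(log(eigen(M)$values))  # log determinant of a p.d. matrix
  lev <- unique(g); k <- length(lev)
  p <- ncol(x); n <- nrow(x)
  ni <- numeric(k)
  Spool <- matrix(0, p, p)
  logtop <- 0
  for(i in 1:k) {
    xi <- x[g == lev[i], , drop = F]
    ni[i] <- nrow(xi)
    Si <- var(xi)                                # covariance matrix for group i
    Spool <- Spool + (ni[i] - 1) * Si
    logtop <- logtop + (ni[i] - 1)/2 * ldet(Si)
  }
  Spool <- Spool/(n - k)                         # pooled covariance matrix
  logLam <- logtop - (n - k)/2 * ldet(Spool)     # log of Lambda
  rho <- 1 - (sum(1/ni) - 1/n) * (2*p^2 + 3*p - 1)/(6*(p + 1)*(k - 1))
  Q <- -2 * rho * (logLam + (p*n/2)*log(n) - sum((p*ni/2)*log(ni)))
  df <- p*(p + 1)*(k - 1)/2
  list(Q = Q, df = df, p.value = 1 - pchisq(Q, df))
}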
The SAS output is;
The DISCRIM Procedure
Test of Homogeneity of Within Covariance Matrices
__ N(i)/2
|| |Within SS Matrix(i)|
V = -----------------------------------
N/2
|Pooled SS Matrix|
_ _ 2
| 1 1 | 2P + 3P - 1
RHO = 1.0 - | SUM ----- - --- | -------------
|_ N(i) N _| 6(P+1)(K-1)
DF = .5(K-1)P(P+1)
_ _
| PN/2 |
| N V |
Under the null hypothesis: -2 RHO ln | ------------------ |
| __ PN(i)/2 |
|_ || N(i) _|
Since the Chi-Square value is significant at the 0.1 level, the within covariance
matrices will be used in the discriminant function.
Reference: Morrison, D.F. (1976) Multivariate Statistical Methods p252.
Chapter 8
Profile Analysis
In this chapter the one and two sample Hotelling's T 2 tests are extended to profile analysis. The method for doing the analysis is given in the chapter on the General Linear Model; hence, the analysis can be done using SAS PROC GLM.
yij = µj + eij ,  i = 1, . . . , n,  j = 1, . . . , p, or in matrix form
Y = XB + E
where
• ~yi0 = [yi1 , yi2 , . . . , yip ] is the ith row of the matrix Y , representing the measurements for the p time
periods for the ith subject.
• X is a n × 1 matrix of ones, that is, X = ~jn .
• B is an unknown 1 × p matrix of means, B = (µ1 , µ2 , . . . , µp ) where µj is the expected value of y for
the j th time period.
• E is a n × p matrix of unobservable errors, such that E(e0i ) = 0 and V ar(e0i ) = Σ and the rows of E
are independent of one another, i.e., cov(ei , ej ) = 0.
The null hypothesis of interest is;
H0 : µ1 = µ2 = . . . = µp
versus the alternative that the null hypothesis is false. This can be expressed in the general form of a linear
hypothesis for the parameters as
H0 : LBM = c
where L = 1 is 1 × 1, M is a p × (p − 1) matrix of rank p − 1, and c is a vector of p − 1 zeroes, with
M = (  1   0  ...  0
       0   1  ...  0
       .   .       .
       0   0  ...  1
      −1  −1  ... −1 ).
From here one computes;
Qh = (LB̂M − c)0 [L(X 0 X)−1 L0 ]−1 (LB̂M − c)
and
Qe = M 0 Y 0 [I − H]Y M.
Number of observations 11
Repeated Measures Level Information
Dependent Variable t1 t2 t3 t4 t5
Level of time 1 2 3 4 5
Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|
DF = 10 t1 t2 t3 t4 t5
time_N represents the contrast between the nth level of time and the 5th
t1 t2 t3 t4 t5
time_N represents the contrast between the nth level of time and the 5th
time_1 1.000000 0.648550 0.761716 0.586020
0.0309 0.0064 0.0581
Sphericity Tests
Mauchly’s
Variables DF Criterion Chi-Square Pr > ChiSq
time_N represents the contrast between the nth level of time and the 5th
Manova Test Criteria and Exact F Statistics for the Hypothesis of no time Effect
H = Type III SSCP Matrix for time
E = Error SSCP Matrix
Adj Pr > F
Source DF Type III SS Mean Square F Value Pr > F G - G H - F
time_N represents the contrast between the nth level of time and the 5th
Error 10 576.7272727 57.6727273
Y = XB + E
where
• y'ui = [yui1 , yui2 , . . . , yuip ] is the ith row for sample u = 1, 2, representing the measurements for the p time periods for the ith subject in sample u. That is, the rows of Y are y'11 , y'12 , . . . , y'1n1 followed by y'21 , y'22 , . . . , y'2n2 , so that n = n1 + n2 .
• X is the n × 2 design matrix of group indicators (the first column indicates sample 1 and the second column indicates sample 2).
• B is the 2 × p matrix of means, B = (µuj ), where µuj is the expected value of y for the j th time period for sample u.
• E is a n × p matrix of unobservable errors, such that E(e0i ) = 0 and V ar(e0i ) = Σ and the rows of E
are independent of one another, i.e., cov(ei , ej ) = 0.
There are three hypotheses of interest:
1. Are the profiles for the two samples parallel?
2. Are there differences among the time periods? Are the lines flat?
3. Are there differences between the two samples?
1. Using the general linear hypothesis H0 : LBM = c. When testing the hypothesis of parallel profiles the matrices are:
L = (1  −1), which is 1 × 2; M is p × (p − 1) of rank p − 1; and c = (0 0 . . . 0) is a 1 × (p − 1) matrix, with
M = (  1   0   0  ...  0   0
      −1   1   0  ...  0   0
       0  −1   1  ...  0   0
       .   .   .       .   .
       0   0   0  ... −1   1
       0   0   0  ...  0  −1 ).
From here one computes
Qh = (LB̂M − c)'[L(X'X)−1 L']−1 (LB̂M − c)
and
Qe = M'Y '[I − H]Y M.
2. For the second hypothesis, that the profiles are flat given that they are the same, the matrices are:
L = (1/2  1/2), which is 1 × 2; M is p × (p − 1) of rank p − 1; and c = (0 0 . . . 0) is a 1 × (p − 1) matrix, with
M = (  1   0   0  ...  0
       0   1   0  ...  0
       0   0   1  ...  0
       .   .   .       .
       0   0   0  ...  1
      −1  −1  −1  ... −1 ).
3. For the third hypothesis, that the two profiles are the same, the matrices are L = (1  −1), which is 1 × 2, M = Ip , the p-dimensional identity matrix, and c = (0 0 . . . 0), a 1 × p vector of zeroes.
1 10 39 24 35 26 32
2 1 47 25 36 21 27
2 2 53 32 48 46 54
2 3 38 33 42 48 49
2 4 60 41 67 53 50
2 5 37 35 45 34 46
2 6 59 37 52 36 52
2 7 67 33 61 31 50
2 8 43 27 36 33 32
2 9 64 53 62 40 43
2 10 41 34 47 37 46
;
title2 ’Test for parallel profiles’;
proc glm;
model t1-t5= group/nouni;
manova h=group m=t1-t2,t2-t3,t3-t4,t4-t5/ printh printe;
run;
title2 ’Test for level parallel profiles’;
proc glm;
model t1-t5= /nouni;
manova h=intercept m=t1-t5,t2-t5,t3-t5,t4-t5/ printh printe;
run;
title2 ’Test for similar profiles’;
proc glm;
model t1-t5=group/nouni;
manova h=group / printh printe;
run;
The SAS output is;
Hypothesis #1
Number of observations 20
t1 t2 t3 t4 t5
MVAR1 1 -1 0 0 0
MVAR2 0 1 -1 0 0
MVAR3 0 0 1 -1 0
MVAR4 0 0 0 1 -1
MANOVA Test Criteria and Exact F Statistics for the Hypothesis of No Overall group Effect
on the Variables Defined by the M Matrix Transformation
H = Type III SSCP Matrix for group
E = Error SSCP Matrix
(test statistics omitted)
Hypothesis #2
The GLM Procedure
Number of observations 20
t1 t2 t3 t4 t5
MVAR1 1 0 0 0 -1
MVAR2 0 1 0 0 -1
MVAR3 0 0 1 0 -1
MVAR4 0 0 0 1 -1
MANOVA Test Criteria and Exact F Statistics for the Hypothesis of No Overall Intercept Effect
on the Variables Defined by the M Matrix Transformation
H = Type III SSCP Matrix for Intercept
E = Error SSCP Matrix
Statistic Value F Value Num DF Den DF Pr > F
Wilks’ Lambda 0.23186576 13.25 4 16 <.0001
Pillai’s Trace 0.76813424 13.25 4 16 <.0001
Hotelling-Lawley Trace 3.31284027 13.25 4 16 <.0001
Roy’s Greatest Root 3.31284027 13.25 4 16 <.0001
Hypothesis #3
The GLM Procedure
Number of observations 20
Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r| (DF = 18)
MANOVA Test Criteria and Exact F Statistics for the Hypothesis of No Overall group Effect
H = Type III SSCP Matrix for group
E = Error SSCP Matrix
(test statistics omitted)
SAS PROC GLM provides another method for doing a portion of the profile analysis. The following code provides a test for a repeated factor, say time or method, along with a treatment or group variable. The first null hypothesis (parallel profiles) can be tested using the first-order interaction term, time*group, while the second hypothesis (flat profiles) can be examined by testing the repeated variable, time. For the above example, consider the new SAS code given by:
PROC GLM;
CLASS group;
MODEL T1-T5 = group/NOUNI;
REPEATED TIME 5/PRINTE;
TITLE ’Word Probe Data - Using GLM for Profile Analysis and H-F Conditions’;
RUN;
The additional output is;
Word Probe Data - Using GLM for Profile Analysis and H-F Conditions
The output (abridged) lists the class level information (group: 2 levels, 1 2; 20 observations), the repeated measures level information (dependent variables t1–t5 at TIME levels 1–5), the partial correlation coefficients from the error SSCP matrix (DF = 18), Mauchly's sphericity tests, and the multivariate and univariate tests for TIME and TIME*group, the latter with Greenhouse–Geisser and Huynh–Feldt adjusted p-values. The transformed variables TIME_1, …, TIME_4 represent the contrast between the nth level of TIME and the last.
The above methods for considering the profile curves using PROC GLM did not consider the covariance
structure within the repeated measures (i.e., the subject covariance over time). PROC MIXED is a more
general procedure for dealing with repeated measures in that it allows for specifying a covariance structure
for the repeated measures. Refer to Littell, Milliken, Stroup, and Wolfinger (SAS System for Mixed Models).
SAS PROC MIXED provides another method for doing a portion of the profile analysis. The following code provides a test for a repeated factor, say time or method, along with a treatment or group variable. The first null hypothesis (parallel profiles) can be tested using the first-order interaction term, time*group, while the second hypothesis (flat profiles) can be examined by testing the repeated variable, time. For the above example, consider the new SAS code given by:
data newword;set word;
t=t1; time=1; output;
t=t2; time=2; output;
t=t3; time=3; output;
t=t4; time=4; output;
t=t5; time=5; output;
drop t1-t5;
*proc print data=newword;run;
title ’Combine test with MIXED’;
proc mixed data=newword method=ml;
class group time subject;
model t = group time group*time / s;
repeated / type=cs subject=subject;
run;
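The TYPE= option on the REPEATED statement controls the assumed covariance structure, and compound symmetry (CS) is only one choice. As an illustration (not part of the original example), the same model can be refit with an unstructured or a first-order autoregressive structure and the fits compared using the information criteria that PROC MIXED reports:
proc mixed data=newword method=ml;
class group time subject;
model t = group time group*time / s;
repeated time / type=un subject=subject;
run;
proc mixed data=newword method=ml;
class group time subject;
model t = group time group*time / s;
repeated time / type=ar(1) subject=subject;
run;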
The additional output is;
Combine test with MIXED
The Mixed Procedure
Class Level Information
Class Levels Values
group 2 1 2
time 5 1 2 3 4 5
subject 10 1 2 3 4 5 6 7 8 9 10
Dimensions
Covariance Parameters 2
Columns in X 18
Columns in Z 0
Subjects 10
Max Obs Per Subject 10
Observations Used 100
Observations Not Used 0
Total Observations 100
Iteration History
Iteration Evaluations -2 Log Like Criterion
0 1 719.13884791
1 1 707.99610545 0.00000000
Covariance Parameter Estimates
Cov Parm Subject Estimate
CS subject 16.8304
Residual 60.9206
Null Model Likelihood Ratio Test
DF Chi-Square Pr > ChiSq
1 11.14 0.0008
Solution for Fixed Effects
Standard
Effect group time Estimate Error DF t Value Pr > |t|
group*time 1 3 3.1000 4.9364 36 0.63 0.5340
group*time 1 4 4.8000 4.9364 36 0.97 0.3374
group*time 1 5 0 . . . .
group*time 2 1 0 . . . .
group*time 2 2 0 . . . .
group*time 2 3 0 . . . .
group*time 2 4 0 . . . .
group*time 2 5 0 . . . .
Type 3 Tests of Fixed Effects
Num Den
Effect DF DF F Value Pr > F
Matrix Notation
Suppose that you observe n data points y1 , y2 , . . . , yn and that you want to explain them using n values
for each of p explanatory variables x11 , x12 , . . . , x1p , x21 , x22 , . . . , x2p , . . . , xn1 , xn2 , . . . , xnp . The xij values
may be either regression-type continuous variables or dummy variables indicating class membership. The
standard linear model for this setup is
$$y_i = \sum_{j=1}^{p} x_{ij}\beta_j + \epsilon_i, \qquad i = 1, \ldots, n,$$
where β1, …, βp are unknown fixed-effects parameters to be estimated and ε1, …, εn are unknown independent and identically distributed normal (Gaussian) random variables with mean 0 and variance σ².
The preceding equations can be written simultaneously using vectors and a matrix, as follows:
$$\vec{y} = X\beta + \epsilon$$
where ~y denotes the vector of observed yi's, X is the known matrix of xij's, β is the unknown fixed-effects parameter vector, and ε is the unobserved vector of independent and identically distributed Gaussian random errors; that is, ε ∼ N_n(0, Σ = σ²I_n).
Formulation of the Mixed Model
The previous general linear model is certainly a useful one (Searle 1971), and it is the one fitted by the GLM
procedure. However, many times the distributional assumption about ε is too restrictive. The mixed model extends the general linear model by allowing a more flexible specification of the covariance matrix of ε. In other words, it allows for both correlation and heterogeneous variances, although you still assume normality. The mixed model is written as
$$\vec{y} = X\beta + Z\gamma + \epsilon$$
where everything is the same as in the general linear model except for the addition of the known design
matrix, Z, and the vector of unknown random-effects parameters, γ. The matrix Z can contain either
continuous or dummy variables, just like X. The name mixed model comes from the fact that the model
contains both fixed-effects parameters, β, and random-effects parameters, γ. Refer to Henderson (1990) and
Searle, Casella, and McCulloch (1992) for historical developments of the mixed model.
A key assumption in the foregoing analysis is that γ and ε are normally distributed with
$$E\begin{pmatrix} \gamma \\ \epsilon \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \qquad\text{and}\qquad \mathrm{Var}\begin{pmatrix} \gamma \\ \epsilon \end{pmatrix} = \begin{pmatrix} G & 0 \\ 0 & R \end{pmatrix}.$$
The variance of ~y is, therefore, $V = ZGZ' + R$. You can model V by setting up the random-effects design
matrix Z and by specifying covariance structures for G and R. Note that this is a general specification of the
mixed model, in contrast to many texts and articles that discuss only simple random effects. Simple random
effects are a special case of the general specification with Z containing dummy variables, G containing
variance components in a diagonal structure, and R = σ 2 In , where In denotes the n-dimensional identity
matrix. The general linear model is a further special case with Z = 0 and R = σ 2 In .
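In PROC MIXED terms, the RANDOM statement supplies the columns of Z and the structure of G, while the REPEATED statement specifies the structure of R. A schematic call, with a hypothetical data set and variable names:
proc mixed data=mydata;
class trt subject;
model y = trt;                          /* X and the fixed effects beta */
random intercept / subject=subject;     /* Z and G: random subject effects */
repeated / type=ar(1) subject=subject;  /* R: AR(1) errors within each subject */
run;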
A standard estimation approach is generalized least squares (GLS), which minimizes $(\vec{y} - X\beta)'V^{-1}(\vec{y} - X\beta)$. However, GLS requires knowledge of V and, therefore, knowledge of G and R. Lacking such information, one approach is to use estimated GLS, in which you insert some reasonable estimate for V into the minimization problem. The goal thus becomes finding a reasonable estimate of G and R.
In many situations, the best approach is to use likelihood-based methods, exploiting the assumption
that γ and are normally distributed (Hartley and Rao 1967; Patterson and Thompson 1971; Harville
1977; Laird and Ware 1982; Jennrich and Schluchter 1986). PROC MIXED implements two likelihood-
based methods: maximum likelihood (ML) and restricted/residual maximum likelihood (REML). A favorable
theoretical property of ML and REML is that they accommodate data that are missing at random (Rubin
1976; Little 1995). PROC MIXED constructs an objective function associated with ML or REML and
maximizes it over all unknown parameters. Using calculus, it is possible to reduce this maximization problem
to one over only the parameters in G and R.
That is, let α denote the vector of all the variance and covariance parameters found in V. Let θ = (β′, α′)′ be the s-dimensional vector of all parameters in the marginal model for ~y and let Θ = Θβ × Θα denote the parameter space for θ, where Θβ denotes the parameter space for the fixed parameters and Θα denotes the parameter space for the variance components. Note that Θα is restricted so that both G and R are positive (semi-)definite.
The classical approach to inference is based on estimators obtained from maximizing the marginal like-
lihood function given by,
$$L_{ml}(\theta) = (2\pi)^{-n/2}\,|V(\alpha)|^{-n/2}\exp\Big[-\tfrac{1}{2}\sum_{i=1}^{n}(\vec{y}_i - X_i\beta)'\,V(\alpha)^{-1}\,(\vec{y}_i - X_i\beta)\Big]$$
with respect to θ. If we assume that α is known, then the MLE for β is given by
$$\hat{\beta}(\alpha) = \Big[\sum_{i=1}^{n} X_i'V^{-1}X_i\Big]^{-1}\sum_{i=1}^{n} X_i'V^{-1}\vec{y}_i.$$
When α is not known, but an estimate α̂ is available one can replace V −1 with V̂ −1 = V (α̂)−1 . The two
commonly used methods for estimating α are maximum likelihood and restricted maximum likelihood.
where $r = (I - H)\vec{y}$ and $H = X(X'X)^{-1}X'$ is the so-called "hat matrix". Using the above approach, let $\vec{u} = A'\vec{y}$, where A is any n × (n − p) full column rank matrix that is orthogonal to the matrix X. It follows that $\vec{u} \sim N_{n-p}(0, \sigma^2 A'A)$. Again, it can be shown that the resulting estimates do not depend on the particular choice of A.
The marginal distribution for ~y is normal with mean vector Xβ and with covariance matrix V(α), where V(α) is a block diagonal matrix with main diagonal terms given by $V_i(\alpha)$. Again define $\vec{u} = A'\vec{y}$, where A is any n × (n − p) full column rank matrix that is orthogonal to X, from which $\vec{u} \sim N_{n-p}(0, \sigma^2 A'V(\alpha)A)$.
The corresponding restricted log-likelihood is
$$l_R(G, R) = -\tfrac{1}{2}\log|V(\alpha)| - \tfrac{1}{2}\log|X'V^{-1}(\alpha)X| - \tfrac{1}{2}\,r'V^{-1}(\alpha)r - \tfrac{n-p}{2}\log(2\pi)$$
where $r = \vec{y} - X(X'V^{-1}(\alpha)X)^{-}X'V^{-1}(\alpha)\vec{y}$ and p is the rank of X. PROC MIXED actually minimizes
-2 times these functions using a ridge-stabilized Newton-Raphson algorithm. Lindstrom and Bates (1988)
provide reasons for preferring Newton-Raphson to the Expectation-Maximum (EM) algorithm described in
Dempster, Laird, and Rubin (1977) and Laird, Lange, and Stram (1987), as well as analytical details for
implementing a QR-decomposition approach to the problem. Wolfinger, Tobias, and Sall (1994) present the
sweep-based algorithms that are implemented in PROC MIXED.
One advantage of using the Newton-Raphson algorithm is that the second derivative matrix of the
objective function evaluated at the optima is available upon completion. Denoting this matrix K, the
asymptotic theory of maximum likelihood (refer to Serfling 1980) shows that 2K −1 is an asymptotic variance-
covariance matrix of the estimated parameters of G and R. Thus, tests and confidence intervals based on
asymptotic normality can be obtained. However, these can be unreliable in small samples, especially for
parameters such as variance components which have sampling distributions that tend to be skewed to the
right.
If a residual variance σ 2 is a part of your mixed model, it can usually be profiled out of the likelihood.
This means solving analytically for the optimal σ 2 and plugging this expression back into the likelihood
formula (refer to Wolfinger, Tobias, and Sall 1994). This reduces the number of optimization parameters
by one and can improve convergence properties. PROC MIXED profiles the residual variance out of the
log likelihood whenever it appears reasonable to do so. This includes the case when R = σ²In and when R has
blocks with a compound symmetry, time series, or spatial structure. PROC MIXED does not profile the log
likelihood when R has unstructured blocks, when you use the HOLD= or NOITER option in the PARMS
statement, or when you use the NOPROFILE option in the PROC MIXED statement.
Instead of ML or REML, you can use the noniterative MIVQUE0 method to estimate G and R (Rao
1972; LaMotte 1973; Wolfinger, Tobias, and Sall 1994). In fact, by default PROC MIXED uses MIVQUE0
estimates as starting values for the ML and REML procedures. For variance component models, another
estimation method involves equating Type I, II, or III expected mean squares to their observed values and
solving the resulting system. However, Swallow and Monahan (1984) present simulation evidence favoring
REML and ML over MIVQUE0 and other method-of-moment estimators.
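These estimation methods correspond to the METHOD= option in the PROC MIXED statement; for example (hypothetical data set and variable names):
proc mixed data=mydata method=mivque0;  /* noniterative MIVQUE0 estimates of G and R */
class trt subject;
model y = trt;
random intercept / subject=subject;
run;
Replacing method=mivque0 with method=ml or method=reml (the default) requests the likelihood-based estimates discussed above; in later SAS releases METHOD=TYPE1, TYPE2, or TYPE3 requests the method-of-moments approach based on expected mean squares.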
On the other hand, when the eigenvalues of Ĝ are very small, Ĝ−1 dominates the equations and γ̂ is close
to 0. For intermediate cases, Ĝ−1 can be viewed as shrinking the fixed-effects estimates of γ towards 0
(Robinson 1991).
If Ĝ is singular, then the mixed model equations are modified (Henderson 1984) as follows:
$$\begin{pmatrix} X'\hat{R}^{-1}X & X'\hat{R}^{-1}Z\hat{L} \\ \hat{L}'Z'\hat{R}^{-1}X & \hat{L}'Z'\hat{R}^{-1}Z\hat{L} + I \end{pmatrix}\begin{pmatrix} \hat{\beta} \\ \hat{\tau} \end{pmatrix} = \begin{pmatrix} X'\hat{R}^{-1}\vec{y} \\ \hat{L}'Z'\hat{R}^{-1}\vec{y} \end{pmatrix}$$
where L̂ is the lower-triangular Cholesky root of Ĝ, satisfying Ĝ = L̂L̂0 . Both τ̂ and a generalized inverse of
the left-hand-side coefficient matrix are then transformed using L̂ to determine γ̂.
An example of when the singular form of the equations is necessary is when a variance component estimate
falls on the boundary constraint of 0.
Inverting the coefficient matrix of the mixed model equations (with the estimates Ĝ and R̂ inserted) yields a matrix Ĉ that serves as the approximate variance-covariance matrix of (β̂ − β, γ̂ − γ). In this case, the BLUE and BLUP acronyms
no longer apply, but the word empirical is often added to indicate such an approximation. The appropriate
acronyms thus become EBLUE and EBLUP. Ĉ can also be written as
$$\hat{C} = \begin{pmatrix} \hat{C}_{11} & \hat{C}_{21}' \\ \hat{C}_{21} & \hat{C}_{22} \end{pmatrix}$$
where
$$\hat{C}_{11} = (X'\hat{V}^{-1}X)^{-},$$
$$\hat{C}_{21} = -\hat{G}Z'\hat{V}^{-1}X\hat{C}_{11},$$
$$\hat{C}_{22} = (Z'\hat{R}^{-1}Z + \hat{G}^{-1})^{-1} - \hat{C}_{21}X'\hat{V}^{-1}Z\hat{G}.$$
Note that Ĉ11 is the familiar estimated generalized least-squares formula for the variance-covariance
matrix of β̂.
As a cautionary note, Ĉ tends to underestimate the true sampling variability of ( β̂ γ̂ ) because no
account is made for the uncertainty in estimating G and R. Although inflation factors have been proposed
(Kackar and Harville 1984; Kass and Steffey 1989; Prasad and Rao 1990), they tend to be small for data
sets that are fairly well balanced. PROC MIXED does not compute any inflation factors by default, but
rather accounts for the downward bias by using the approximate t and F statistics described subsequently.
The DDFM=KENWARDROGER option in the MODEL statement prompts PROC MIXED to compute a
specific inflation factor along with Satterthwaite-based degrees of freedom.
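For example, the Kenward–Roger adjustment is requested on the MODEL statement. Reusing the word-probe example from earlier in this chapter (illustrative only):
proc mixed data=newword;
class group time subject;
model t = group time group*time / s ddfm=kenwardroger;
repeated time / type=cs subject=subject;
run;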
Inference about the fixed effects β and the random effects γ proceeds by testing hypotheses of the form
$$H: L\begin{pmatrix} \beta \\ \gamma \end{pmatrix} = 0$$
or by constructing point and interval estimates.
When L consists of a single row, a general t-statistic can be constructed as follows (refer to McLean and
Sanders 1988, Stroup 1989a):
$$t = \frac{L\begin{pmatrix} \hat{\beta} \\ \hat{\gamma} \end{pmatrix}}{\sqrt{L\hat{C}L'}}$$
Under the assumed normality of γ and ε, t has an exact t-distribution only for data exhibiting certain types of balance and for some special unbalanced cases. In general, t is only approximately t-distributed, and its degrees of freedom must be estimated. See the DDFM= option for a description of the various degrees-of-freedom methods available in PROC MIXED. With ν̂ denoting the approximate degrees of freedom, the associated confidence interval is
$$L\begin{pmatrix} \hat{\beta} \\ \hat{\gamma} \end{pmatrix} \pm t_{\hat{\nu},\alpha/2}\sqrt{L\hat{C}L'}$$
where $t_{\hat{\nu},\alpha/2}$ is the $(1-\alpha/2)100$th percentile of the $t_{\hat{\nu}}$-distribution.
When the rank of L is greater than 1, PROC MIXED constructs the following general F-statistic:
$$F = \frac{\begin{pmatrix} \hat{\beta} \\ \hat{\gamma} \end{pmatrix}'L'\,(L\hat{C}L')^{-1}\,L\begin{pmatrix} \hat{\beta} \\ \hat{\gamma} \end{pmatrix}}{\mathrm{rank}(L)}$$
Analogous to t, F in general has an approximate F-distribution with rank(L) numerator degrees of freedom
and ν̂ denominator degrees of freedom.
The t- and F- statistics enable you to make inferences about your fixed effects, which account for the
variance-covariance model you select. An alternative is the χ2 statistic associated with the likelihood ratio
test. This statistic compares two fixed-effects models, one a special case of the other. It is computed just as
when comparing different covariance models, although you should use ML and not REML here because the
penalty term associated with restricted likelihoods depends upon the fixed-effects specification.
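In practice these t and F statistics are obtained from the Type 3 tests and from LSMEANS, ESTIMATE, or CONTRAST statements in PROC MIXED. A small illustration, again using the word-probe example (the Satterthwaite option estimates the denominator degrees of freedom ν̂):
proc mixed data=newword method=reml;
class group time subject;
model t = group time group*time / ddfm=satterthwaite;
repeated time / type=cs subject=subject;
lsmeans group*time / diff cl;   /* t statistics and confidence limits for differences of cell means */
run;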
4 F 23.5 24.5 25.0 26.5
5 F 21.5 23.0 22.5 23.5
6 F 20.0 21.0 21.0 22.5
7 F 21.5 22.5 23.0 25.0
8 F 23.0 23.0 23.5 24.0
9 F 20.0 21.0 22.0 21.5
10 F 16.5 19.0 19.0 19.5
11 F 24.5 25.0 28.0 28.0
12 M 26.0 25.0 29.0 31.0
13 M 21.5 22.5 23.0 26.5
14 M 23.0 22.5 24.0 27.5
15 M 25.5 27.5 26.5 27.0
16 M 20.0 23.5 22.5 26.0
17 M 24.5 25.5 27.0 28.5
18 M 22.0 22.0 24.5 26.5
19 M 24.0 21.5 24.5 25.5
20 M 23.0 20.5 31.0 26.0
21 M 27.5 28.0 31.0 31.5
22 M 23.0 23.0 23.5 25.0
23 M 21.5 23.5 24.0 28.0
24 M 17.0 24.5 26.0 29.5
25 M 22.5 25.5 25.5 26.0
26 M 23.0 24.5 26.0 30.0
27 M 22.0 21.5 23.5 25.0
;
proc mixed data=pr method=ml;   /* data set name and ML method inferred from the Model Information output below */
class Person Gender;
model y = Gender Age Gender*Age / s;
random intercept Age / type=un sub=Person g;
run;
/*
This specifies an unstructured covariance matrix for the random intercept and slope.
In mixed model notation, G is block diagonal with identical 2 x 2 unstructured
blocks for each person. By default, R becomes sigma^2*I. See Example 41.5 for further
information on this model.
*/
Class Level Information
Class Levels Values
Person 27 1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23
24 25 26 27
Gender 2 F M
Dimensions
Covariance Parameters 10
Columns in X 6
Columns in Z 0
Subjects 27
Max Obs Per Subject 4
Observations Used 108
Observations Not Used 0
Total Observations 108
Iteration History
Iteration Evaluations -2 Log Like Criterion
0 1 478.24175986
1 2 419.47721707 0.00000152
2 1 419.47704812 0.00000000
Convergence criteria met.
Fit Statistics
-2 Log Likelihood 419.5
AIC (smaller is better) 447.5
AICC (smaller is better) 452.0
BIC (smaller is better) 465.6
Null Model Likelihood Ratio Test
DF Chi-Square Pr > ChiSq
9 58.76 <.0001
Type 3 Tests of Fixed Effects
Num Den
Effect DF DF F Value Pr > F
Model Information
Class Level Information
Class Levels Values
Person 27 1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23
24 25 26 27
Gender 2 F M
Dimensions
Covariance Parameters 10
Columns in X 4
Columns in Z 0
Subjects 27
Max Obs Per Subject 4
Observations Used 108
Observations Not Used 0
Total Observations 108
Iteration History
Iteration Evaluations -2 Log Like Criterion
0 1 478.24175986
1 2 419.47721707 0.00000152
2 1 419.47704812 0.00000000
Convergence criteria met.
Covariance Parameter Estimates
UN(4,2) Person 3.0624 1.0135 3.02 0.0025
UN(4,3) Person 3.8235 1.2508 3.06 0.0022
UN(4,4) Person 4.6180 1.2573 3.67 0.0001
Fit Statistics
-2 Log Likelihood 419.5
AIC (smaller is better) 447.5
AICC (smaller is better) 452.0
BIC (smaller is better) 465.6
Model Information
Data Set WORK.PR
Dependent Variable y
Covariance Structure Autoregressive
Subject Effect Person
Estimation Method ML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Between-Within
Class Level Information
Class Levels Values
Person 27 1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23
24 25 26 27
Gender 2 F M
Dimensions
Covariance Parameters 2
Columns in X 6
Columns in Z 0
Subjects 27
Max Obs Per Subject 4
Observations Used 108
Observations Not Used 0
Total Observations 108
Iteration History
Iteration Evaluations -2 Log Like Criterion
0 1 478.24175986
1 2 440.68100623 0.00000000
Convergence criteria met.
Fit Statistics
-2 Log Likelihood 440.7
AIC (smaller is better) 452.7
AICC (smaller is better) 453.5
BIC (smaller is better) 460.5
Null Model Likelihood Ratio Test
Type 3 Tests of Fixed Effects
Num Den
Effect DF DF F Value Pr > F
Dimensions
Covariance Parameters 4
Columns in X 6
Columns in Z Per Subject 2
Subjects 27
Max Obs Per Subject 4
Observations Used 108
Observations Not Used 0
Total Observations 108
Iteration History
Iteration Evaluations -2 Log Like Criterion
0 1 478.24175986
1 1 427.80595080 0.00000000
Convergence criteria met.
Estimated G Matrix
Row Effect Person Col1 Col2
Solution for Fixed Effects
Standard
Effect Gender Estimate Error DF t Value Pr > |t|
Gender F 1.0321 1.5355 54 0.67 0.5043
Gender M 0 . . . .
Age 0.7844 0.08275 25 9.48 <.0001
Age*Gender F -0.3048 0.1296 54 -2.35 0.0224
Age*Gender M 0 . . . .
Dimensions
Covariance Parameters 2
Columns in X 6
Columns in Z 0
Subjects 27
Max Obs Per Subject 4
Observations Used 108
Observations Not Used 0
Total Observations 108
Iteration History
Iteration Evaluations -2 Log Like Criterion
0 1 478.24175986
1 1 428.63905802 0.00000000
Convergence criteria met.
Fit Statistics
-2 Log Likelihood 428.6
AIC (smaller is better) 440.6
AICC (smaller is better) 441.5
BIC (smaller is better) 448.4
Type 3 Tests of Fixed Effects
Num Den
Effect DF DF F Value Pr > F
Age 1 79 111.10 <.0001
Age*Gender 1 79 6.46 0.0130
Chapter 9
Principal Component Analysis
Principal Component Analysis is concerned with explaining the variance-covariance structure of a set of p variables through a few (q < p) linear combinations of the original p variables. This method is used to aid in (1) dimension reduction and (2) data interpretation.
Before considering an example using this method it is necessary to review some matrix algebra concerning
eigenvalues and eigenvectors.
$$A\vec{x} = \lambda\vec{x}$$
Nontrivial solutions exist only when λ satisfies the characteristic equation $|A - \lambda I| = 0$ of the n × n matrix A. The matrix A could possibly have as many as n different values λ that satisfy the characteristic equation. These solutions are called the characteristic values or the eigenvalues of the matrix A. Suppose that λ1 is a solution and $A\vec{x}_1 = \lambda_1\vec{x}_1$; then $\vec{x}_1$ is said to be a characteristic vector or eigenvector of A corresponding to the eigenvalue λ1. Note: the eigenvalues may or may not be real numbers.
2. The matrices A, $C^{-1}AC$, and $CAC^{-1}$ have the same set of eigenvalues for any nonsingular matrix C.
3. The matrices A and A′ have the same set of eigenvalues but need not have the same eigenvectors.
4. Let A be a nonsingular matrix with eigenvalue λ; then 1/λ is an eigenvalue of $A^{-1}$.
5. The eigenvectors are not unique, for if $\vec{x}_1$ is an eigenvector corresponding to λ1 then $c\vec{x}_1$ is also an eigenvector, since $A(c\vec{x}_1) = \lambda_1(c\vec{x}_1)$, for any nonzero value c.
6. Let A be an n × n real matrix and let T denote its Schur form. That is, there exists a unitary complex matrix Q such that $Q'AQ = T$, where T is a complex, upper triangular matrix, and the eigenvalues of A are the diagonal elements of the matrix T.
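A small numerical illustration of these computations, using a hypothetical symmetric 2 × 2 matrix in SAS PROC IML:
proc iml;
A = {5 2,
2 3};                           /* example symmetric matrix */
call eigen(lambda, E, A);       /* lambda = eigenvalues (descending), columns of E = eigenvectors */
print lambda E;
resid = A*E - E*diag(lambda);   /* should be numerically zero */
print resid;
quit;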
The principal components are the linear combinations $y_i = \vec{a}_i'X$, for i = 1, 2, …, q, chosen so that $\mathrm{var}(y_i) = \vec{a}_i'\Sigma\vec{a}_i$ is maximized subject to $\mathrm{cov}(y_i, y_k) = \vec{a}_i'\Sigma\vec{a}_k = 0$ for k < i, with var(y1) ≥ var(y2) ≥ … ≥ var(yq), where $\vec{a}_i' = (a_{i1}\ a_{i2}\ \ldots\ a_{ip})$.
This problem has the following solution:
1. Suppose that the matrix Σ has real eigenvalue–eigenvector pairs (λi, ai) where λ1 ≥ λ2 ≥ … ≥ λp ≥ 0. Then the ith principal component is given by $y_i = \vec{a}_i'X$, and var(yi) = λi for i = 1, 2, …, p, with $\mathrm{cov}(y_i, y_k) = \vec{a}_i'\Sigma\vec{a}_k = 0$ for i ≠ k. Note that the eigenvalues λi are unique; however, the eigenvectors (and hence the components yi) are not.
2. The total variance for the p dimensions is $\mathrm{tr}[\Sigma] = \sum_{i=1}^{p}\lambda_i$. Hence, the proportion of variance explained by the kth principal component is $\lambda_k / \sum_{i=1}^{p}\lambda_i$.
3. If the matrix X is centered and scaled so that Σ is the correlation matrix, then $\sum_{i=1}^{p}\lambda_i = p$.
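In SAS the eigenvalues, the proportions of variance explained, and the component coefficients (eigenvectors) are produced by PROC PRINCOMP. A minimal sketch, assuming a hypothetical data set mydata with numeric variables x1–x5:
proc princomp data=mydata out=pcscores;  /* the correlation matrix is analyzed by default */
var x1-x5;
run;
The OUT= data set contains the principal component scores; adding the COV option to the PROC PRINCOMP statement analyzes the covariance matrix instead of the correlation matrix.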
9.2.1 Example – FOC Sales
This example uses the FOC sales data found in Dielman's text. The SAS output (abridged) reports 265 observations on 8 variables and includes the simple statistics, the correlation matrix, the eigenvalues of the correlation matrix, and the eigenvectors.
Chapter 10
Canonical Correlation
Canonical correlation analysis seeks to identify and quantify the associations between two sets of variables.
It focuses on finding linear combinations of the variables in one set that have the highest correlations with linear combinations of the variables in the second set. These linear combinations are called the canonical variables, and their correlations are called the canonical correlations.
Suppose that one has two random vectors, X (1) of p dimensions and X (2) of q dimensions where p ≤ q.
Let µ(1) and Σ11 denote the mean vector and covariance matrix for X (1) and µ(2) and Σ22 denote the mean
vector and covariance matrix for X (2) . Let Σ12 = Σ021 denote the covariance matrix for X (1) and X (2) .
Define two linear combinations
$$U = \vec{a}'X^{(1)}, \qquad V = \vec{b}'X^{(2)}.$$
In which case one has
$$E(U) = \vec{a}'\mu^{(1)}, \quad \mathrm{Var}(U) = \vec{a}'\Sigma_{11}\vec{a},$$
$$E(V) = \vec{b}'\mu^{(2)}, \quad \mathrm{Var}(V) = \vec{b}'\Sigma_{22}\vec{b},$$
$$\mathrm{Cov}(U, V) = \vec{a}'\Sigma_{12}\vec{b}.$$
The idea of canonical correlation is to find $\vec{a}$ and $\vec{b}$ such that
$$\mathrm{Corr}(U, V) = \frac{\vec{a}'\Sigma_{12}\vec{b}}{\sqrt{\vec{a}'\Sigma_{11}\vec{a}\;\vec{b}'\Sigma_{22}\vec{b}}}$$
is as large as possible.
Let (Ui, Vi) denote the ith canonical pair (linear combinations); then it can be shown that
$$U_i = \vec{a}_i'X^{(1)} = \vec{e}_i'\Sigma_{11}^{-1/2}X^{(1)}, \qquad V_i = \vec{b}_i'X^{(2)} = \vec{f}_i'\Sigma_{22}^{-1/2}X^{(2)},$$
where $\vec{e}_1, \vec{e}_2, \ldots, \vec{e}_p$ are the eigenvectors of $\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}$ when the corresponding eigenvalues are ordered in descending order. Likewise, $\vec{f}_1, \vec{f}_2, \ldots, \vec{f}_p$ are the eigenvectors of $\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1/2}$ when the corresponding eigenvalues are ordered in descending order.
Furthermore it follows that
$$\mathrm{Var}(U_i) = \mathrm{Var}(V_i) = 1,$$
$$\mathrm{Cov}(U_i, U_k) = \mathrm{Cov}(V_i, V_k) = \mathrm{Cov}(U_i, V_k) = 0, \quad i \neq k,$$
$$\mathrm{Corr}(U_i, V_i) = \sqrt{\lambda_i},$$
where $\lambda_i$ is the ith largest eigenvalue of $\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}$ and $\Sigma_{11} = \Sigma_{11}^{1/2}\Sigma_{11}^{1/2}$, $\Sigma_{22} = \Sigma_{22}^{1/2}\Sigma_{22}^{1/2}$.
10.1 Examples
SAS PROC CANCORR is the procedure for finding the canonical correlations. I have included two examples.
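The two sets of variables are given in the VAR and WITH statements. A generic sketch (hypothetical data set and variable names; the actual examples follow):
proc cancorr data=mydata vprefix=V wprefix=W;
var x1 x2 x3;        /* first set of variables  */
with y1 y2 y3 y4;    /* second set of variables */
run;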
The output for the first example (a job-characteristics and job-satisfaction data set; abridged) includes the eigenvalues of Inv(E)*H (= CanRsq/(1−CanRsq)); the tests of H0 that the canonical correlations in the current row and all that follow are zero (likelihood ratio and approximate F statistics); the raw canonical coefficients for the two sets of variables (sup_sat, career, fin_sat, work_sat, comp_id, kind_sat, gen_sat and task_var, task_id, auto, …); and the canonical structure, i.e., the correlations between the variables and the canonical variables V1–V5 and W1–W5.
The second example uses the police department applicant data (Johnson, page 160); the data and code follow.
1 0.310 179.6 74.20 41.7 27.3 82.4 19.0 64 64 2 158 108 5.5 4.0 11.91
2 0.345 175.6 62.04 37.5 29.1 84.1 5.5 88 78 20 166 108 5.5 4.0 3.13
3 0.293 166.2 72.96 39.4 26.8 88.1 22.0 100 88 7 167 116 5.5 4.0 16.89
4 0.254 173.8 85.92 41.2 27.6 97.6 19.5 64 62 4 220 120 5.5 4.0 19.59
5 0.384 184.8 65.88 39.8 26.1 88.2 14.5 80 68 9 210 120 5.5 5.0 7.74
6 0.406 189.1 102.26 43.3 30.1 101.2 22.0 60 68 4 188 91 6.0 4.0 30.42
7 0.344 191.5 84.04 42.8 28.4 91.0 18.0 64 48 1 272 110 6.0 3.0 13.70
8 0.321 180.2 68.34 41.6 27.3 90.4 5.5 74 64 14 193 117 5.5 4.0 3.04
9 0.425 183.8 95.14 42.3 30.1 100.2 13.5 80 78 4 199 105 5.5 4.0 20.26
10 0.385 163.1 54.28 37.2 24.2 80.5 7.0 84 78 13 157 113 6.0 4.0 3.04
11 0.317 169.6 75.92 39.4 27.2 92.0 16.5 65 78 6 180 110 5.0 5.0 12.83
12 0.353 171.6 71.70 39.1 27.0 86.2 25.5 68 72 0 193 105 5.5 4.0 15.95
13 0.413 180.0 80.68 40.8 28.3 87.4 17.5 73 88 4 218 109 5.0 4.2 11.86
14 0.392 174.6 70.40 39.8 25.9 83.9 16.5 104 78 6 190 129 5.0 4.0 9.93
15 0.312 181.8 91.40 40.6 29.5 95.1 32.0 92 88 1 206 139 5.0 3.5 32.63
16 0.342 167.4 65.74 39.7 26.4 86.0 13.0 80 86 6 181 120 5.5 4.0 6.64
17 0.293 173.0 79.28 41.2 26.9 96.1 11.5 72 68 6 184 111 5.5 3.9 11.57
18 0.317 179.8 92.06 40.0 29.8 100.9 15.0 60 78 0 205 92 5.0 4.0 24.21
19 0.333 176.8 87.96 41.2 28.4 100.8 20.5 76 90 1 228 147 4.0 3.5 22.39
20 0.317 179.3 77.66 41.4 31.6 90.1 9.5 58 86 15 198 98 5.5 4.1 6.29
21 0.427 193.5 98.44 41.6 29.2 95.7 21.0 54 74 0 254 110 5.5 3.8 23.63
22 0.266 178.8 65.42 39.3 27.1 83.0 16.5 88 72 7 206 121 5.5 4.0 10.53
23 0.311 179.6 97.04 43.8 30.1 100.8 22.0 100 74 3 194 124 5.0 4.0 20.62
24 0.284 172.6 81.72 40.9 27.3 91.5 22.0 74 76 4 201 113 5.5 5.1 18.39
25 0.259 171.5 69.60 40.4 27.8 87.7 15.5 70 72 10 175 110 5.5 3.0 11.14
26 0.317 168.9 63.66 39.8 26.7 83.9 6.0 68 70 7 179 119 5.5 5.0 5.16
27 0.263 183.1 87.24 43.2 28.3 95.7 11.0 88 74 7 245 115 5.5 4.0 9.60
28 0.336 163.6 64.86 37.5 26.6 84.0 15.5 64 64 6 146 115 5.0 4.4 11.93
29 0.267 184.3 84.68 40.3 29.0 93.2 8.5 64 76 2 213 109 5.5 5.0 8.55
30 0.271 181.0 73.78 42.8 29.7 90.3 8.5 56 88 11 181 109 6.0 5.0 4.94
31 0.264 180.2 75.84 41.4 28.7 88.1 13.5 76 76 9 192 144 5.5 3.6 10.62
32 0.357 184.1 70.48 42.0 28.9 81.3 14.0 84 72 5 231 123 5.5 4.5 8.46
33 0.259 178.9 86.90 42.5 28.7 95.0 16.0 54 68 12 186 118 6.0 4.0 13.47
34 0.221 170.0 76.68 39.7 27.7 93.6 15.0 50 72 4 178 108 5.5 4.5 12.81
35 0.333 180.6 77.32 42.1 27.3 89.5 16.0 88 72 11 200 119 5.5 4.6 13.34
36 0.359 179.0 79.90 40.8 28.2 90.3 26.5 80 80 3 201 124 5.5 3.7 24.57
37 0.314 186.6 100.36 42.5 31.5 100.3 27.0 62 76 2 208 120 5.5 4.1 28.35
38 0.295 181.4 91.66 41.9 28.9 96.6 25.5 68 78 2 211 125 6.0 3.0 26.12
39 0.296 176.5 79.00 40.7 29.1 86.5 20.5 60 66 5 210 117 5.5 4.2 15.21
40 0.308 174.0 69.10 40.9 27.0 88.1 18.0 92 74 5 161 140 5.0 5.5 12.51
41 0.327 178.2 87.78 42.9 27.2 100.3 16.5 72 72 4 189 115 5.5 3.5 20.50
42 0.303 177.1 70.18 39.4 27.6 85.5 16.0 72 74 14 201 110 6.0 4.8 10.67
43 0.297 180.0 67.66 40.9 28.7 86.1 15.0 76 76 5 177 110 5.5 4.5 10.76
44 0.244 176.8 86.12 41.3 28.2 92.7 12.5 76 68 7 181 110 5.5 4.0 14.55
45 0.282 176.3 65.00 39.0 26.0 83.3 7.0 88 72 12 167 127 5.5 5.0 5.27
46 0.285 192.4 99.14 43.7 28.7 96.1 20.5 64 68 4 174 105 6.0 4.0 17.94
47 0.299 175.2 75.70 39.4 27.3 90.8 19.0 56 76 7 174 111 5.5 4.5 12.64
48 0.280 175.9 78.62 43.4 29.3 90.7 18.0 64 72 7 170 117 5.5 3.7 10.81
49 0.268 174.6 64.88 42.3 29.2 82.6 3.5 72 80 11 199 113 6.0 4.5 2.01
50 0.362 179.0 71.00 41.2 27.3 85.6 16.0 68 90 5 150 108 5.5 5.0 10.00
;
proc cancorr data=police;
var HEIGHT WEIGHT SHLDR PELVIC CHEST
THIGH PULSE FAT;
with DIAST CHNUP BREATH RECVR SPEED ENDUR REACT;
run;
The SAS output is;
Police Department Applicant Data Johnson page 160
The CANCORR Procedure
The output (abridged) includes the eigenvalues of Inv(E)*H (= CanRsq/(1−CanRsq)); the tests of H0 that the canonical correlations in the current row and all that follow are zero, with likelihood ratios and approximate F values (for instance, the last two rows of this table are 6: 0.90310008, F = 0.70, 6 and 80 DF, p = 0.6527, and 7: 0.98638151, F = 0.28, 2 and 41 DF, p = 0.7550); the raw canonical coefficients for the VAR variables (HEIGHT, WEIGHT, SHLDR, PELVIC, CHEST, THIGH, PULSE, FAT) and the WITH variables (DIAST, CHNUP, BREATH, RECVR, SPEED, ENDUR, REACT); and the canonical structure, i.e., the correlations between the variables and the canonical variables V1–V7 and W1–W7.
Chapter 11
Factor Analysis
Factor Analysis (FA) is a topic related to Principal Component Analysis (PCA) in that there are some similarities, although FA has existed as a separate and somewhat controversial topic within multivariate statistical analysis. I have chosen to leave the discussion of the similarities and differences to the reader; this discussion is given in a number of texts that cover FA exclusively. Whereas in PCA the objective was to find linear combinations of the original measurable variables that explain a specified proportion of the variance/covariance structure, in FA one postulates the existence of unobservable constructs or latent variables that are called factors. The existence of these hypothetical variables is open to question since they are unobservable.
The basic Factor Analysis (FA) model is
$$X = \Lambda F + \epsilon,$$
where X is an observable (manifest) random vector (p × 1), F is an unobservable (latent) vector of k common factors, ε is a vector of random errors, and Λ is a p × k matrix of regression weights or loadings. The following assumptions are made:
• E(X) = E(F) = E(ε) = 0.
• Var(X) = Σ.
• Var(F) = $I_k$.
• Var(ε) = Ψ, a diagonal matrix, and Cov(F, ε) = 0.
• Cov(X, F) = Λ, or $\mathrm{cov}(x_i, f_j) = \lambda_{ij}$.
Under these assumptions Σ = ΛΛ′ + Ψ. If Σ is the correlation matrix then
$$\mathrm{corr}(x_i, x_m) = \rho_{im} = \sum_{j=1}^{k}\lambda_{ij}\lambda_{mj}.$$
It can be shown that the factor loadings Λ are unique only up to an orthogonal transformation. That
is, suppose that Λ∗ = ΛT where T T 0 = Ik , then Λ∗ (Λ∗ )0 = ΛT T 0 Λ0 = ΛΛ0 . Since one has this type of
nonuniqueness, the problem is to find the “best” matrix loadings which reproduce the existing covariance
structure, where "best" is usually defined as the transformation that leads to a clear understanding of the
nature of the latent variables. Some authors address this problem by finding a rotation or transformation
that leads to loadings such that,
1. Each row of Λ should have at least one zero.
2. Each column of Λ should have at least one zero.
3. For all pairs of columns of Λ, there should be several rows with a zero loading in one column and a nonzero loading in the other.
4. If k ≥ 4, several pairs of columns of Λ should have two zero loadings and a small number of nonzero loadings.
When one finds an orthogonal transformation, the common factors will be uncorrelated with one another.
To allow for ease of interpretation some investigators suggest using nonorthogonal or oblique rotations or
transformations. In which case, the resulting factors are no longer uncorrelated with one another.
There are a number of procedures for estimating Λ and for determining k, the number of factors. SAS uses a variety of methods. Rather than continue with this problem, I have included below some comments found in the SAS User's manual. The interested reader should consult texts that deal specifically with FA.
Output includes means, standard deviations, correlations, Kaiser’s measure of sampling adequacy, eigen-
values, a scree plot, eigenvectors, prior and final communality estimates, the unrotated factor pattern,
residual and partial correlations, the rotated primary factor pattern, the primary factor structure, interfac-
tor correlations, the reference structure, reference axis correlations, the variance explained by each factor
both ignoring and eliminating other factors, plots of both rotated and unrotated factors, squared multiple
correlation of each factor with the variables, and scoring coefficients.
Any topics that are not given explicit references are discussed in Mulaik (1972) or Harman (1976).
11.1.2 Background
Common factor analysis was invented by Spearman (1904). Kim and Mueller (1978a,b) provide a very
elementary discussion of the common factor model. Gorsuch (1974) contains a broad survey of factor analysis,
and Gorsuch (1974) and Cattell (1978) are useful as guides to practical research methodology. Harman (1976)
gives a lucid discussion of many of the more technical aspects of factor analysis, especially oblique rotation.
Morrison (1976) and Mardia, Kent, and Bibby (1979) provide excellent statistical treatments of common
factor analysis. Mulaik (1972) is the most thorough and authoritative general reference on factor analysis
and is highly recommended to anyone familiar with matrix algebra. Stewart (1981) gives a nontechnical
presentation of some issues to consider when deciding whether or not a factor analysis may be appropriate.
A frequent source of confusion in the field of factor analysis is the term factor. It sometimes refers to
a hypothetical, unobservable variable, as in the phrase common factor. In this sense, factor analysis must
be distinguished from component analysis since a component is an observable linear combination. Factor is
also used in the sense of matrix factor, in that one matrix is a factor of a second matrix if the first matrix
multiplied by its transpose equals the second matrix. In this sense, factor analysis refers to all methods of
data analysis using matrix factors, including component analysis and common factor analysis.
A common factor is an unobservable, hypothetical variable that contributes to the variance of at least
two of the observed variables. The unqualified term ”factor” often refers to a common factor. A unique
factor is an unobservable, hypothetical variable that contributes to the variance of only one of the observed
variables. The model for common factor analysis posits one unique factor for each observed variable.
If the original variables are standardized to unit variance, the preceding formula yields correlations instead
of covariances. It is in this sense that common factors explain the correlations among the observed variables.
The difference between the correlation predicted by the common factor model and the actual correlation is
the residual correlation. A good way to assess the goodness-of-fit of the common factor model is to examine
the residual correlations.
The common factor model implies that the partial correlations among the variables, removing the effects
of the common factors, must all be 0. When the common factors are removed, only unique factors, which
are by definition uncorrelated, remain.
The assumptions of common factor analysis imply that the common factors are, in general, not linear
combinations of the observed variables. In fact, even if the data contain measurements on the entire popu-
lation of observations, you cannot compute the scores of the observations on the common factors. Although
the common factor scores cannot be computed directly, they can be estimated in a variety of ways.
The problem of factor score indeterminacy has led several factor analysts to propose methods yielding
components that can be considered approximations to common factors. Since these components are defined
as linear combinations, they are computable. The methods include Harris component analysis and image
component analysis. The advantage of producing determinate component scores is offset by the fact that,
even if the data fit the common factor model perfectly, component methods do not generally recover the
correct factor solution. You should not use any type of component analysis if you really want a common
factor analysis (Dziuban and Harris 1973; Lee and Comrey 1979).
After the factors are estimated, it is necessary to interpret them. Interpretation usually means assigning
to each common factor a name that reflects the importance of the factor in predicting each of the observed
variables, that is, the coefficients in the pattern matrix corresponding to the factor. Factor interpretation is
a subjective process. It can sometimes be made less subjective by rotating the common factors, that is, by
applying a nonsingular linear transformation. A rotated pattern matrix in which all the coefficients are close
to 0 or ±1 is easier to interpret than a pattern with many intermediate elements. Therefore, most rotation
methods attempt to optimize a function of the pattern matrix that measures, in some sense, how close the
elements are to 0 or ±1.
After the initial factor extraction, the common factors are uncorrelated with each other. If the factors are
rotated by an orthogonal transformation, the rotated factors are also uncorrelated. If the factors are rotated
by an oblique transformation, the rotated factors become correlated. Oblique rotations often produce more
useful patterns than do orthogonal rotations. However, a consequence of correlated factors is that there is
no single unambiguous measure of the importance of a factor in explaining a variable. Thus, for oblique
rotations, the pattern matrix does not provide all the necessary information for interpreting the factors; you
must also examine the factor structure and the reference structure. Rotating a set of factors does not change
the statistical explanatory power of the factors. You cannot say that any rotation is better than any other
rotation from a statistical point of view; all rotations are equally good statistically. Therefore, the choice
among different rotations must be based on nonstatistical grounds. For most applications, the preferred
rotation is that which is most easily interpretable.
If two rotations give rise to different interpretations, those two interpretations must not be regarded as
conflicting. Rather, they are two different ways of looking at the same thing, two different points of view
in the common-factor space. Any conclusion that depends on one and only one rotation being correct is
invalid.
Johnson considers an example using police applicant candidates on page 159. He uses SPSS for his factor
analysis. I have tried to produce similar results using SAS.
18 0.317 179.8 92.06 40.0 29.8 100.9 15.0 60 78 0 205 92 5.0 4.0 24.21
19 0.333 176.8 87.96 41.2 28.4 100.8 20.5 76 90 1 228 147 4.0 3.5 22.39
20 0.317 179.3 77.66 41.4 31.6 90.1 9.5 58 86 15 198 98 5.5 4.1 6.29
21 0.427 193.5 98.44 41.6 29.2 95.7 21.0 54 74 0 254 110 5.5 3.8 23.63
22 0.266 178.8 65.42 39.3 27.1 83.0 16.5 88 72 7 206 121 5.5 4.0 10.53
23 0.311 179.6 97.04 43.8 30.1 100.8 22.0 100 74 3 194 124 5.0 4.0 20.62
24 0.284 172.6 81.72 40.9 27.3 91.5 22.0 74 76 4 201 113 5.5 5.1 18.39
25 0.259 171.5 69.60 40.4 27.8 87.7 15.5 70 72 10 175 110 5.5 3.0 11.14
26 0.317 168.9 63.66 39.8 26.7 83.9 6.0 68 70 7 179 119 5.5 5.0 5.16
27 0.263 183.1 87.24 43.2 28.3 95.7 11.0 88 74 7 245 115 5.5 4.0 9.60
28 0.336 163.6 64.86 37.5 26.6 84.0 15.5 64 64 6 146 115 5.0 4.4 11.93
29 0.267 184.3 84.68 40.3 29.0 93.2 8.5 64 76 2 213 109 5.5 5.0 8.55
30 0.271 181.0 73.78 42.8 29.7 90.3 8.5 56 88 11 181 109 6.0 5.0 4.94
31 0.264 180.2 75.84 41.4 28.7 88.1 13.5 76 76 9 192 144 5.5 3.6 10.62
32 0.357 184.1 70.48 42.0 28.9 81.3 14.0 84 72 5 231 123 5.5 4.5 8.46
33 0.259 178.9 86.90 42.5 28.7 95.0 16.0 54 68 12 186 118 6.0 4.0 13.47
34 0.221 170.0 76.68 39.7 27.7 93.6 15.0 50 72 4 178 108 5.5 4.5 12.81
35 0.333 180.6 77.32 42.1 27.3 89.5 16.0 88 72 11 200 119 5.5 4.6 13.34
36 0.359 179.0 79.90 40.8 28.2 90.3 26.5 80 80 3 201 124 5.5 3.7 24.57
37 0.314 186.6 100.36 42.5 31.5 100.3 27.0 62 76 2 208 120 5.5 4.1 28.35
38 0.295 181.4 91.66 41.9 28.9 96.6 25.5 68 78 2 211 125 6.0 3.0 26.12
39 0.296 176.5 79.00 40.7 29.1 86.5 20.5 60 66 5 210 117 5.5 4.2 15.21
40 0.308 174.0 69.10 40.9 27.0 88.1 18.0 92 74 5 161 140 5.0 5.5 12.51
41 0.327 178.2 87.78 42.9 27.2 100.3 16.5 72 72 4 189 115 5.5 3.5 20.50
42 0.303 177.1 70.18 39.4 27.6 85.5 16.0 72 74 14 201 110 6.0 4.8 10.67
43 0.297 180.0 67.66 40.9 28.7 86.1 15.0 76 76 5 177 110 5.5 4.5 10.76
44 0.244 176.8 86.12 41.3 28.2 92.7 12.5 76 68 7 181 110 5.5 4.0 14.55
45 0.282 176.3 65.00 39.0 26.0 83.3 7.0 88 72 12 167 127 5.5 5.0 5.27
46 0.285 192.4 99.14 43.7 28.7 96.1 20.5 64 68 4 174 105 6.0 4.0 17.94
47 0.299 175.2 75.70 39.4 27.3 90.8 19.0 56 76 7 174 111 5.5 4.5 12.64
48 0.280 175.9 78.62 43.4 29.3 90.7 18.0 64 72 7 170 117 5.5 3.7 10.81
49 0.268 174.6 64.88 42.3 29.2 82.6 3.5 72 80 11 199 113 6.0 4.5 2.01
50 0.362 179.0 71.00 41.2 27.3 85.6 16.0 68 90 5 150 108 5.5 5.0 10.00
;
proc princomp out=new;
var REACT HEIGHT WEIGHT SHLDR PELVIC CHEST
THIGH PULSE DIAST CHNUP BREATH RECVR SPEED ENDUR FAT; run;
proc factor data=police reorder;
var REACT HEIGHT WEIGHT SHLDR PELVIC CHEST
THIGH PULSE DIAST CHNUP BREATH RECVR SPEED ENDUR FAT;
title2 ’Principle Component FA’;
run;
proc factor data=police rotate=varimax reorder;
var REACT HEIGHT WEIGHT SHLDR PELVIC CHEST
THIGH PULSE DIAST CHNUP BREATH RECVR SPEED ENDUR FAT;
title2 ’Principle Component FA with Varimax Rotation’;
run;
proc factor data=police method=ml heywood n=5 reorder;
var REACT HEIGHT WEIGHT SHLDR PELVIC CHEST
THIGH PULSE DIAST CHNUP BREATH RECVR SPEED ENDUR FAT;
title2 ’Maximum Likelihood FA for 5 factors’;
run;
proc factor data=police method=ml heywood n=5 rotate=V reorder;
var REACT HEIGHT WEIGHT SHLDR PELVIC CHEST
THIGH PULSE DIAST CHNUP BREATH RECVR SPEED ENDUR FAT;
title2 ’Maximum Likelihood FA for 5 factors with Varimax Rotation’;
run;
proc factor data=police method=ml heywood n=5 rotate=promax reorder;
var REACT HEIGHT WEIGHT SHLDR PELVIC CHEST
THIGH PULSE DIAST CHNUP BREATH RECVR SPEED ENDUR FAT;
title2 ’Maximum Likelihood FA for 5 factors with Oblique Promax Rotation’;
run;
The SAS output is;
Police Department Applicant Data Johnson page 160
Observations 50
Variables 15
Simple Statistics
Correlation Matrix
REACT HEIGHT WEIGHT SHLDR PELVIC CHEST THIGH PULSE
REACT 1.0000 0.2223 0.0562 -.0938 -.0559 -.0318 0.1324 0.1631
HEIGHT 0.2223 1.0000 0.6353 0.6543 0.5859 0.4259 0.2232 -.1815
WEIGHT 0.0562 0.6353 1.0000 0.6656 0.6470 0.8887 0.5542 -.2638
SHLDR -.0938 0.6543 0.6656 1.0000 0.5824 0.5545 0.2046 -.1682
PELVIC -.0559 0.5859 0.6470 0.5824 1.0000 0.5221 0.2075 -.3229
CHEST -.0318 0.4259 0.8887 0.5545 0.5221 1.0000 0.3978 -.2463
THIGH 0.1324 0.2232 0.5542 0.2046 0.2075 0.3978 1.0000 -.0062
PULSE 0.1631 -.1815 -.2638 -.1682 -.3229 -.2463 -.0062 1.0000
DIAST 0.1473 -.1866 -.0517 -.1449 0.1477 -.0084 0.0487 0.2341
CHNUP -.1585 -.2760 -.5758 -.2739 -.1582 -.4536 -.6695 0.1548
BREATH 0.1595 0.5878 0.4502 0.3677 0.3536 0.3473 0.2065 -.0644
RECVR -.1296 -.1213 -.1219 -.0239 -.2069 -.0832 0.2031 0.5039
SPEED -.1493 0.2156 -.0526 0.2018 0.0412 -.1624 -.2080 -.2996
ENDUR -.0525 -.1892 -.3680 -.2448 -.2294 -.3275 -.3357 0.0073
FAT 0.1648 0.3651 0.8095 0.3306 0.4132 0.7246 0.8442 -.0948
Eigenvalue Difference Proportion Cumulative
3 1.31267825 0.08160302 0.0875 0.5959
4 1.23107523 0.02722958 0.0821 0.6779
5 1.20384565 0.35594143 0.0803 0.7582
6 0.84790423 0.14315618 0.0565 0.8147
7 0.70474804 0.12633945 0.0470 0.8617
8 0.57840860 0.18486182 0.0386 0.9003
9 0.39354677 0.02535508 0.0262 0.9265
10 0.36819169 0.04160632 0.0245 0.9510
11 0.32658537 0.13971542 0.0218 0.9728
12 0.18686996 0.04806356 0.0125 0.9853
13 0.13880640 0.09492503 0.0093 0.9945
14 0.04388137 0.00574644 0.0029 0.9975
15 0.03813492 0.0025 1.0000
Eigenvectors
PULSE 0.331791 -.020677 -.496585 -.009483 0.099315 0.034112 0.076313
DIAST 0.146511 -.400896 0.156265 -.158356 -.118531 -.009948 -.038833
CHNUP 0.143353 0.499748 0.294961 0.215904 -.415019 -.094720 -.141276
BREATH 0.473400 -.112931 0.009034 0.302890 -.107551 -.011775 -.071042
RECVR -.219312 0.142011 0.514234 -.020974 0.319568 -.050876 0.087558
SPEED 0.314410 -.180529 0.214643 0.032084 0.359654 0.038933 0.140015
ENDUR 0.232765 0.160019 0.048775 0.185324 0.040703 -.001605 -.012893
FAT 0.193846 0.223611 0.068805 0.020013 0.175096 0.101581 -.760033
Factor Pattern
Factor1 Factor2 Factor3 Factor4 Factor5
FAT 0.84105 0.34734 -0.26738 -0.07236 0.01546
CHEST 0.82416 0.00558 -0.14664 0.15902 -0.18894
HEIGHT 0.69783 -0.33725 0.41902 0.05972 0.22820
SHLDR 0.68565 -0.32752 0.31506 0.11806 -0.23196
PELVIC 0.67123 -0.29937 0.08798 0.48711 -0.10913
THIGH 0.64905 0.47592 -0.26352 -0.22796 0.01317
BREATH 0.57632 -0.05036 0.50353 -0.17149 0.17768
ENDUR -0.46683 -0.08312 -0.14959 0.36143 0.09197
CHNUP -0.66679 -0.36578 0.27478 0.24007 -0.12565
RECVR -0.05822 0.65763 0.45810 -0.11492 -0.41729
PULSE -0.27258 0.59160 0.52991 0.02023 -0.00286
SPEED -0.06704 -0.76236 0.03135 -0.21987 0.04457
DIAST -0.08173 0.42916 -0.04908 0.77472 0.05289
REACT 0.11577 0.23649 0.12762 0.06168 0.90082
Eigenvalues of the Correlation Matrix: Total = 15 Average = 1
Initial Factor Method: Principal Components
Factor Pattern
Factor1 Factor2 Factor3 Factor4 Factor5
WEIGHT 0.65304 0.68489 -0.17334 0.02459 -0.04056
BREATH 0.19065 0.60670 0.20462 -0.32989 0.30673
RECVR 0.10678 -0.04500 0.88357 0.01179 -0.19694
PULSE -0.14414 -0.12994 0.78471 0.11661 0.19618
SPEED -0.38288 0.16941 -0.49297 -0.46286 -0.06663
DIAST -0.01615 0.01011 0.18516 0.86776 0.09270
REACT 0.11922 -0.00396 -0.00664 0.11263 0.93485
Maximum Likelihood FA (output abridged; similar listings are produced for the 5-factor ML run and for its Varimax and Promax rotations):
Preliminary Eigenvalues: Total = 62.2250448 Average = 4.14833632
The iteration history reports the criterion and the communality estimates for the 15 variables; one communality is held at 1.00000 (a Heywood case). Significance tests for the number of factors (DF, chi-square, Pr > ChiSq) are also printed.
Eigenvalues of the Weighted Reduced Correlation Matrix: Total = 94.2680654 Average = 6.73343324
Eigenvalue Difference Proportion Cumulative
1 Infty Infty
2 72.8157670 59.5526525 0.7724 0.7724
3 13.2631146 7.3563181 0.1407 0.9131
4 5.9067965 3.6244087 0.0627 0.9758
5 2.2823878 1.5748067 0.0242 1.0000
6 0.7075811 0.1211672 0.0075 1.0075
7 0.5864139 0.2073404 0.0062 1.0137
8 0.3790735 0.2137267 0.0040 1.0177
9 0.1653468 0.1429644 0.0018 1.0195
10 0.0223823 0.0949270 0.0002 1.0197
11 -0.0725446 0.1454124 -0.0008 1.0190
12 -0.2179570 0.0952176 -0.0023 1.0167
13 -0.3131747 0.1702114 -0.0033 1.0133
14 -0.4833861 0.2903498 -0.0051 1.0082
15 -0.7737358 -0.0082 1.0000
Also reported are the factor pattern, the variance explained by each factor, and the final communality estimates (Total Communality: Weighted = 96.812869, Unweighted = 9.605996).
Chapter 12
Although classification and discrimination are different topics and have differing purposes, they have some
similarities that merit them being considered together in a single chapter. The goal of discrimination is to
find linear combinations of the original variables which best separate two (or more) populations. These
linear combinations are called the discriminants. The goal of classification is to define a rule by which an
unknown vector can be assigned to one of two (or more) populations. Rencher (2001) describes these two
problems as:
1. The description of group separation (for either populations or samples from a population), in which linear functions (discriminant functions) of the variables are used to describe or elucidate the differences between two or more groups. The goals of discriminant analysis include assessing the relative contributions of the p variables to the separation of the groups and finding the optimal plane on which the points can be projected to best illustrate the configuration (separation) of the groups.
2. Prediction or allocation, in which linear or quadratic functions (classification functions) of the variables
are employed to assign an individual sampling unit to one of the groups. The measured values (in the
observation vector) for an individual or object are evaluated by the classification functions to see to
which group the individual most likely belongs.
12.1.2 Multiple Population Problem
The multiple population problem is similar to the two population problem, except that the matrix H from the one-way MANOVA is used instead of $(\bar{y}_1 - \bar{y}_2)(\bar{y}_1 - \bar{y}_2)'$ and the error matrix E is used instead of $S_p$. That is, one writes
$$\lambda = \frac{a'Ha}{a'Ea},$$
which can be written as
$$a'Ha = \lambda\,a'Ea$$
or
$$a'(H - \lambda E)a = 0.$$
Maximizing λ over a leads to
$$(E^{-1}H - \lambda I)a = 0,$$
so that $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_s$ and $a_1, a_2, \ldots, a_s$ are the eigenvalues and corresponding eigenvectors of $E^{-1}H$. Thus λ1 is the solution that maximizes the discriminant criterion and $z_1 = a_1'\vec{y}$ is the discriminant function that maximizes the separation between the means. Furthermore, $z_i = a_i'\vec{y}$ provides the ith discriminant function, and these functions are uncorrelated. The relative importance of the ith discriminant function is given by
$$\frac{\lambda_i}{\sum_{j=1}^{s}\lambda_j}.$$
This problem is covered in SAS by PROC CANDISC. I have included a brief overview of this procedure as given in the SAS User's manual.
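As a sketch of the procedure call (the data set and variable names are placeholders, patterned after the fish data introduced at the end of this chapter):
proc candisc data=fish out=canscores;
class Species;                               /* grouping variable */
var Weight Length1 Length2 Length3 Height;   /* quantitative variables */
run;
proc print data=canscores(obs=5); run;       /* canonical variable scores Can1, Can2, ... */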
Given two or more groups of observations with measurements on several quantitative variables, canonical
discriminant analysis derives a linear combination of the variables that has the highest possible multiple
correlation with the groups. This maximal multiple correlation is called the first canonical correlation.
The coefficients of the linear combination are the canonical coefficients or canonical weights. The variable
defined by the linear combination is the first canonical variable or canonical component. The second canonical
correlation is obtained by finding the linear combination uncorrelated with the first canonical variable that
has the highest possible multiple correlation with the groups. The process of extracting canonical variables
can be repeated until the number of canonical variables equals the number of original variables or the number
of classes minus one, whichever is smaller.
The first canonical correlation is at least as large as the multiple correlation between the groups and
any of the original variables. If the original variables have high within-group correlations, the first canonical
correlation can be large even if all the multiple correlations are small. In other words, the first canonical
variable can show substantial differences between the classes, even if none of the original variables do.
Canonical variables are sometimes called discriminant functions, but this usage is ambiguous because the
DISCRIM procedure produces very different functions for classification that are also called discriminant
functions.
For each canonical correlation, PROC CANDISC tests the hypothesis that it and all smaller canonical
correlations are zero in the population. An F approximation (Rao 1973; Kshirsagar 1972) is used that
gives better small-sample results than the usual chi-square approximation. The variables should have an
approximate multivariate normal distribution within each class, with a common covariance matrix in order
for the probability levels to be valid.
Canonical discriminant analysis is equivalent to canonical correlation analysis between the quantitative
variables and a set of dummy variables coded from the class variable. Canonical discriminant analysis is also
equivalent to performing the following steps:
1. Transform the variables so that the pooled within-class covariance matrix is an identity matrix.
2. Compute class means on the transformed variables.
3. Perform a principal component analysis on the means, weighting each mean by the number of obser-
vations in the class. The eigenvalues are equal to the ratio of between-class variation to within-class
variation in the direction of each principal component.
4. Back-transform the principal components into the space of the original variables, obtaining the canon-
ical variables.
An interesting property of the canonical variables is that they are uncorrelated whether the correlation
is calculated from the total sample or from the pooled within-class correlations. The canonical coefficients
are not orthogonal, however, so the canonical variables do not represent perpendicular directions through
the space of the original variables.
A subset of variables can first be selected with stepwise discriminant analysis (PROC STEPDISC); the selected variables can then be used in the
CANDISC procedure or the DISCRIM procedure. With PROC STEPDISC, variables are chosen to enter or
leave the model according to one of two criteria:
• the significance level of an F-test from an analysis of covariance, where the variables already chosen
act as covariates and the variable under consideration is the dependent variable
• the squared partial correlation for predicting the variable under consideration from the CLASS variable,
controlling for the effects of the variables already selected for the model
Forward selection begins with no variables in the model. At each step, PROC STEPDISC enters the variable
that contributes most to the discriminatory power of the model as measured by Wilks’ Lambda, the likelihood
ratio criterion. When none of the unselected variables meets the entry criterion, the forward selection process
stops.
Backward elimination begins with all variables in the model except those that are linearly dependent on
previous variables in the VAR statement. At each step, the variable that contributes least to the discrimi-
natory power of the model as measured by Wilks’ Lambda is removed. When all remaining variables meet
the criterion to stay in the model, the backward elimination process stops.
Stepwise selection begins, like forward selection, with no variables in the model. At each step, the model
is examined. If the variable in the model that contributes least to the discriminatory power of the model as
measured by Wilks’ lambda fails to meet the criterion to stay, then that variable is removed. Otherwise, the
variable not in the model that contributes most to the discriminatory power of the model is entered. When
all variables in the model meet the criterion to stay and none of the other variables meet the criterion to
enter, the stepwise selection process stops. Stepwise selection is the default method of variable selection.
It is important to realize that, in the selection of variables for entry, only one variable can be entered into
the model at each step. The selection process does not take into account the relationships between variables
that have not yet been selected. Thus, some important variables could be excluded in the process. Also,
Wilks’ Lambda may not be the best measure of discriminatory power for your application. However, if you
use PROC STEPDISC carefully, in combination with your knowledge of the data and careful cross-validation,
it can be a valuable aid in selecting variables for a discrimination model.
As with any stepwise procedure, it is important to remember that, when many significance tests are performed, each at a level of, for example, 5% (0.05), the overall probability of rejecting at least one true null hypothesis is much larger than 5%. If you want to prevent including any variables that do not contribute to the discriminatory power of the model in the population, you should specify a very small
significance level. In most applications, all variables considered have some discriminatory power, however
small. To choose the model that provides the best discrimination using the sample estimates, you need only
to guard against estimating more parameters than can be reliably estimated with the given sample size.
Costanza and Afifi (1979) use Monte Carlo studies to compare alternative stopping rules that can be
used with the forward selection method in the two-group multivariate normal classification problem. Five
different numbers of variables, ranging from 10 to 30, are considered in the studies. The comparison is based
on conditional and estimated unconditional probabilities of correct classification. They conclude that the
use of a moderate significance level, in the range of 10 percent to 25 percent, often performs better than the
use of a much larger or a much smaller significance level.
The significance level and the squared partial correlation criteria select variables in the same order,
although they may select different numbers of variables. Increasing the sample size tends to increase the
number of variables selected when using significance levels, but it has little effect on the number selected
using squared partial correlations.
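As an illustration, a forward selection using a moderate entry significance level in the range recommended by Costanza and Afifi could be requested as follows; the data set and variable names are placeholders.

proc stepdisc data=train method=forward slentry=0.15;
   class group;
   var x1-x10;
run;

METHOD=BACKWARD with SLSTAY=, or the default METHOD=STEPWISE with both SLENTRY= and SLSTAY=, can be requested in the same way.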
proc format;
   value specfmt
      1='Bream'
      2='Roach'
3=’Whitefish’
4=’Parkki’
5=’Perch’
6=’Pike’
7=’Smelt’;
data fish (drop=HtPct WidthPct);
title ’Fish Measurement Data’;
input Species Weight Length1 Length2 Length3 HtPct WidthPct @@;
Height=HtPct*Length3/100;
Width=WidthPct*Length3/100;
format Species specfmt.;
datalines;
1 242.0 23.2 25.4 30.0 38.4 13.4 1 290.0 24.0 26.3 31.2 40.0 13.8
1 340.0 23.9 26.5 31.1 39.8 15.1 1 363.0 26.3 29.0 33.5 38.0 13.3
1 430.0 26.5 29.0 34.0 36.6 15.1 1 450.0 26.8 29.7 34.7 39.2 14.2
1 500.0 26.8 29.7 34.5 41.1 15.3 1 390.0 27.6 30.0 35.0 36.2 13.4
1 450.0 27.6 30.0 35.1 39.9 13.8 1 500.0 28.5 30.7 36.2 39.3 13.7
1 475.0 28.4 31.0 36.2 39.4 14.1 1 500.0 28.7 31.0 36.2 39.7 13.3
1 500.0 29.1 31.5 36.4 37.8 12.0 1 . 29.5 32.0 37.3 37.3 13.6
1 600.0 29.4 32.0 37.2 40.2 13.9 1 600.0 29.4 32.0 37.2 41.5 15.0
1 700.0 30.4 33.0 38.3 38.8 13.8 1 700.0 30.4 33.0 38.5 38.8 13.5
1 610.0 30.9 33.5 38.6 40.5 13.3 1 650.0 31.0 33.5 38.7 37.4 14.8
1 575.0 31.3 34.0 39.5 38.3 14.1 1 685.0 31.4 34.0 39.2 40.8 13.7
1 620.0 31.5 34.5 39.7 39.1 13.3 1 680.0 31.8 35.0 40.6 38.1 15.1
1 700.0 31.9 35.0 40.5 40.1 13.8 1 725.0 31.8 35.0 40.9 40.0 14.8
1 720.0 32.0 35.0 40.6 40.3 15.0 1 714.0 32.7 36.0 41.5 39.8 14.1
1 850.0 32.8 36.0 41.6 40.6 14.9 1 1000.0 33.5 37.0 42.6 44.5 15.5
1 920.0 35.0 38.5 44.1 40.9 14.3 1 955.0 35.0 38.5 44.0 41.1 14.3
1 925.0 36.2 39.5 45.3 41.4 14.9 1 975.0 37.4 41.0 45.9 40.6 14.7
1 950.0 38.0 41.0 46.5 37.9 13.7
2 40.0 12.9 14.1 16.2 25.6 14.0 2 69.0 16.5 18.2 20.3 26.1 13.9
2 78.0 17.5 18.8 21.2 26.3 13.7 2 87.0 18.2 19.8 22.2 25.3 14.3
2 120.0 18.6 20.0 22.2 28.0 16.1 2 0.0 19.0 20.5 22.8 28.4 14.7
2 110.0 19.1 20.8 23.1 26.7 14.7 2 120.0 19.4 21.0 23.7 25.8 13.9
2 150.0 20.4 22.0 24.7 23.5 15.2 2 145.0 20.5 22.0 24.3 27.3 14.6
2 160.0 20.5 22.5 25.3 27.8 15.1 2 140.0 21.0 22.5 25.0 26.2 13.3
2 160.0 21.1 22.5 25.0 25.6 15.2 2 169.0 22.0 24.0 27.2 27.7 14.1
2 161.0 22.0 23.4 26.7 25.9 13.6 2 200.0 22.1 23.5 26.8 27.6 15.4
2 180.0 23.6 25.2 27.9 25.4 14.0 2 290.0 24.0 26.0 29.2 30.4 15.4
2 272.0 25.0 27.0 30.6 28.0 15.6 2 390.0 29.5 31.7 35.0 27.1 15.3
3 270.0 23.6 26.0 28.7 29.2 14.8 3 270.0 24.1 26.5 29.3 27.8 14.5
3 306.0 25.6 28.0 30.8 28.5 15.2 3 540.0 28.5 31.0 34.0 31.6 19.3
3 800.0 33.7 36.4 39.6 29.7 16.6 3 1000.0 37.3 40.0 43.5 28.4 15.0
4 55.0 13.5 14.7 16.5 41.5 14.1 4 60.0 14.3 15.5 17.4 37.8 13.3
4 90.0 16.3 17.7 19.8 37.4 13.5 4 120.0 17.5 19.0 21.3 39.4 13.7
4 150.0 18.4 20.0 22.4 39.7 14.7 4 140.0 19.0 20.7 23.2 36.8 14.2
4 170.0 19.0 20.7 23.2 40.5 14.7 4 145.0 19.8 21.5 24.1 40.4 13.1
4 200.0 21.2 23.0 25.8 40.1 14.2 4 273.0 23.0 25.0 28.0 39.6 14.8
4 300.0 24.0 26.0 29.0 39.2 14.6
5 5.9 7.5 8.4 8.8 24.0 16.0 5 32.0 12.5 13.7 14.7 24.0 13.6
5 40.0 13.8 15.0 16.0 23.9 15.2 5 51.5 15.0 16.2 17.2 26.7 15.3
5 70.0 15.7 17.4 18.5 24.8 15.9 5 100.0 16.2 18.0 19.2 27.2 17.3
5 78.0 16.8 18.7 19.4 26.8 16.1 5 80.0 17.2 19.0 20.2 27.9 15.1
5 85.0 17.8 19.6 20.8 24.7 14.6 5 85.0 18.2 20.0 21.0 24.2 13.2
5 110.0 19.0 21.0 22.5 25.3 15.8 5 115.0 19.0 21.0 22.5 26.3 14.7
5 125.0 19.0 21.0 22.5 25.3 16.3 5 130.0 19.3 21.3 22.8 28.0 15.5
5 120.0 20.0 22.0 23.5 26.0 14.5 5 120.0 20.0 22.0 23.5 24.0 15.0
5 130.0 20.0 22.0 23.5 26.0 15.0 5 135.0 20.0 22.0 23.5 25.0 15.0
5 110.0 20.0 22.0 23.5 23.5 17.0 5 130.0 20.5 22.5 24.0 24.4 15.1
5 150.0 20.5 22.5 24.0 28.3 15.1 5 145.0 20.7 22.7 24.2 24.6 15.0
5 150.0 21.0 23.0 24.5 21.3 14.8 5 170.0 21.5 23.5 25.0 25.1 14.9
5 225.0 22.0 24.0 25.5 28.6 14.6 5 145.0 22.0 24.0 25.5 25.0 15.0
5 188.0 22.6 24.6 26.2 25.7 15.9 5 180.0 23.0 25.0 26.5 24.3 13.9
5 197.0 23.5 25.6 27.0 24.3 15.7 5 218.0 25.0 26.5 28.0 25.6 14.8
5 300.0 25.2 27.3 28.7 29.0 17.9 5 260.0 25.4 27.5 28.9 24.8 15.0
5 265.0 25.4 27.5 28.9 24.4 15.0 5 250.0 25.4 27.5 28.9 25.2 15.8
5 250.0 25.9 28.0 29.4 26.6 14.3 5 300.0 26.9 28.7 30.1 25.2 15.4
5 320.0 27.8 30.0 31.6 24.1 15.1 5 514.0 30.5 32.8 34.0 29.5 17.7
5 556.0 32.0 34.5 36.5 28.1 17.5 5 840.0 32.5 35.0 37.3 30.8 20.9
5 685.0 34.0 36.5 39.0 27.9 17.6 5 700.0 34.0 36.0 38.3 27.7 17.6
5 700.0 34.5 37.0 39.4 27.5 15.9 5 690.0 34.6 37.0 39.3 26.9 16.2
5 900.0 36.5 39.0 41.4 26.9 18.1 5 650.0 36.5 39.0 41.4 26.9 14.5
5 820.0 36.6 39.0 41.3 30.1 17.8 5 850.0 36.9 40.0 42.3 28.2 16.8
5 900.0 37.0 40.0 42.5 27.6 17.0 5 1015.0 37.0 40.0 42.4 29.2 17.6
5 820.0 37.1 40.0 42.5 26.2 15.6 5 1100.0 39.0 42.0 44.6 28.7 15.4
5 1000.0 39.8 43.0 45.2 26.4 16.1 5 1100.0 40.1 43.0 45.5 27.5 16.3
5 1000.0 40.2 43.5 46.0 27.4 17.7 5 1000.0 41.1 44.0 46.6 26.8 16.3
6 200.0 30.0 32.3 34.8 16.0 9.7 6 300.0 31.7 34.0 37.8 15.1 11.0
6 300.0 32.7 35.0 38.8 15.3 11.3 6 300.0 34.8 37.3 39.8 15.8 10.1
6 430.0 35.5 38.0 40.5 18.0 11.3 6 345.0 36.0 38.5 41.0 15.6 9.7
6 456.0 40.0 42.5 45.5 16.0 9.5 6 510.0 40.0 42.5 45.5 15.0 9.8
6 540.0 40.1 43.0 45.8 17.0 11.2 6 500.0 42.0 45.0 48.0 14.5 10.2
6 567.0 43.2 46.0 48.7 16.0 10.0 6 770.0 44.8 48.0 51.2 15.0 10.5
6 950.0 48.3 51.7 55.1 16.2 11.2 6 1250.0 52.0 56.0 59.7 17.9 11.7
6 1600.0 56.0 60.0 64.0 15.0 9.6 6 1550.0 56.0 60.0 64.0 15.0 9.6
6 1650.0 59.0 63.4 68.0 15.9 11.0
7 6.7 9.3 9.8 10.8 16.1 9.7 7 7.5 10.0 10.5 11.6 17.0 10.0
7 7.0 10.1 10.6 11.6 14.9 9.9 7 9.7 10.4 11.0 12.0 18.3 11.5
7 9.8 10.7 11.2 12.4 16.8 10.3 7 8.7 10.8 11.3 12.6 15.7 10.2
7 10.0 11.3 11.8 13.1 16.9 9.8 7 9.9 11.3 11.8 13.1 16.9 8.9
7 9.8 11.4 12.0 13.2 16.7 8.7 7 12.2 11.5 12.2 13.4 15.6 10.4
7 13.4 11.7 12.4 13.5 18.0 9.4 7 12.2 12.1 13.0 13.8 16.5 9.1
7 19.7 13.2 14.3 15.2 18.9 13.6 7 19.9 13.8 15.0 16.2 18.1 11.6
;
*proc step discriminate;
proc stepdisc data=fish;
class Species;
run;
*proc candiscriminate;
proc candisc data=fish ncan=3 out=outcan;
class Species;
var Weight Length1 Length2 Length3 Height Width;
run;
%plotit(data=outcan, plotvars=Can2 Can1,
labelvar=_blank_, symvar=symbol, typevar=symbol,
symsize=1, symlen=4, tsize=1.5, exttypes=symbol, ls=100,
plotopts=vaxis=-5 to 15 by 5, vtoh=, extend=close);
SAS – Output
The SAS output for the above program is:
Fish Measurement Data
The STEPDISC Procedure

[Class Level Information: Species, Variable Name, Frequency, Weight, Proportion]

Statistics for entry (partial R-square, F value, Pr > F, and tolerance) are printed at each
step; for example, at the first step Height (0.7553, F = 77.69, Pr < .0001, tolerance 1.0000)
and Width (0.4806, F = 23.29, Pr < .0001, tolerance 1.0000) are among the candidates, and
Height is entered. At a later step the partial R-squares include Length3 (0.7960, F = 96.26),
Height (0.7633, F = 79.53), and Width (0.5775, F = 33.72), all with Pr < .0001.

Stepwise Selection Summary
                                                                       Average
                                                                       Squared
       Number                  Partial            Wilks'      Pr <    Canonical    Pr >
 Step    In    Entered Removed R-Square  F Value  Pr > F      Lambda  Lambda  Correlation  ASCC
   2      2    Length2          0.9229   299.31   <.0001  0.01886065  <.0001  0.25905822  <.0001
   3      3    Length3          0.8826   186.77   <.0001  0.00221342  <.0001  0.38427100  <.0001
   4      4    Width            0.5775    33.72   <.0001  0.00093510  <.0001  0.45200732  <.0001
   5      5    Weight           0.4461    19.73   <.0001  0.00051794  <.0001  0.49488458  <.0001
   6      6    Length1          0.2987    10.36   <.0001  0.00036325  <.0001  0.51744189  <.0001

The CANDISC Procedure

[Class Level Information; Canonical Correlations (canonical correlation, adjusted canonical
correlation, approximate standard error, squared canonical correlation); Pooled Within
Canonical Structure; Class Means on Canonical Variables]
complement is a plane (point in one dimension, line in two dimensions) then the classification rule is said to
be linear, otherwise it will be described as being quadratic or curved.
From here one can define the conditional probabilities
Pr[2 | 1] = Pr[X ∈ R2 | π1] = ∫_{R2} f1(x) dx
and
Pr[1 | 2] = Pr[X ∈ R1 | π2] = ∫_{R1} f2(x) dx.
Suppose that one has prior information concerning the proportion of objects belonging to the respective populations, given by p1 = Pr[X ∈ π1] and p2 = Pr[X ∈ π2] where p1 + p2 = 1. Then the respective probabilities of misclassifying an object under the classification rule defined by R1 and R2 are as follows: the probability that an observation belonging to π1 is misclassified as belonging to π2 equals Pr[2 | 1] p1, and Pr[1 | 2] p2 is the probability of misclassifying a vector into π1 when it belongs to population π2.
To make the problem completely general, suppose that the costs associated with misclassifying an object are not equal. That is, let c(2 | 1) denote the cost of misclassifying an object from π1 into π2 and let c(1 | 2) denote the cost of misclassifying an object from π2 into π1.
There are a number of different classification procedures. The first is found by minimizing the total
or expected cost of misclassification. Others include the likelihood ratio rule, maximizing the posterior
probability, and the minimax allocation.
The objective would be to find or specify the region R1 such that the value of the ECM is as small as possible. This minimum is obtained in the following special cases:

• p1 = p2 :
  R1 : f1(x)/f2(x) ≥ c(1 | 2)/c(2 | 1),    R2 : f1(x)/f2(x) < c(1 | 2)/c(2 | 1);

• p1 = p2 and c(1 | 2) = c(2 | 1) :
  R1 : f1(x)/f2(x) ≥ 1,    R2 : f1(x)/f2(x) < 1.

The above easily follows in the special case where c(1 | 2) = c(2 | 1), since then

  ECM = p1 ( 1 − ∫_{R1} f1(x) dx ) + p2 ∫_{R1} f2(x) dx
      = p1 + ∫_{R1} [ p2 f2(x) − p1 f1(x) ] dx,

which is minimized by choosing R1 to contain exactly those x for which p2 f2(x) − p1 f1(x) ≤ 0, that is, f1(x)/f2(x) ≥ p2/p1.
The following gives a special case when the distributions f1(x) and f2(x) are specified. Suppose that fi(x) ∼ N(µi, Σi) for i = 1, 2, where Σ1 = Σ2 = Σ. That is,
fi(x) = (2π)^{-p/2} |Σ|^{-1/2} exp[ −1/2 (x − µi)' Σ^{-1} (x − µi) ],   i = 1, 2,
and
f1(x)/f2(x) = exp[ −1/2 (x − µ1)' Σ^{-1} (x − µ1) + 1/2 (x − µ2)' Σ^{-1} (x − µ2) ]
            = exp[ (µ1 − µ2)' Σ^{-1} x − 1/2 (µ1 − µ2)' Σ^{-1} (µ1 + µ2) ].
Let
D(x) = (µ1 − µ2)' Σ^{-1} [ x − 1/2 (µ1 + µ2) ];
then the equation D(x) = log(p2/p1) defines a hyperplane for separating the two groups. Let
∆² = (µ1 − µ2)' Σ^{-1} (µ1 − µ2),
the squared Mahalanobis distance between the two mean vectors µ1 and µ2. It follows that
E[D(x) | x ∈ π1] = ∆²/2,   E[D(x) | x ∈ π2] = −∆²/2,
and
Var(D(x) | x ∈ πi) = ∆².
From here one has
Pr[2 | 1] = Pr[ D(x) ≤ log(p2/p1) | x ∈ π1 ] = Pr[ Z ≤ (log(p2/p1) − .5∆²)/∆ ]
and
Pr[1 | 2] = Pr[ D(x) > log(p2/p1) | x ∈ π2 ] = Pr[ Z ≤ −(log(p2/p1) + .5∆²)/∆ ].
When the parameters are unknown and are estimated from training samples, the sample classification rule assigns x0 to π1 if
(x̄1 − x̄2)' S_pooled^{-1} x0 ≥ log[ C(1 | 2) p2 / (C(2 | 1) p1) ] + 1/2 (x̄1 − x̄2)' S_pooled^{-1} (x̄1 + x̄2),
and otherwise allocates x0 to π2, where ni is the sample size from population πi and
S_pooled = [ (n1 − 1) S1 + (n2 − 1) S2 ] / (n1 + n2 − 2).
12.2.4 Maximizing the Posterior Probability
This method assumes that one has prior information concerning the allocation of x0 . That is, suppose that
one specifies
qi(x0) = Pr[x0 ∈ πi] = fi(x0) pi / ( f1(x0) p1 + f2(x0) p2 ).
Then allocate x0 to π1 if q1 (x0 ) > q2 (x0 ).
in that group that are misclassified. This method achieves a nearly unbiased estimate but with a relatively
large variance.
To reduce the variance in an error-count estimate, smoothed error-rate estimates are suggested (Glick
1978). Instead of summing terms that are either zero or one as in the error-count estimator, the smoothed
estimator uses a continuum of values between zero and one in the terms that are summed. The resulting es-
timator has a smaller variance than the error-count estimate. The posterior probability error-rate estimates
provided by the POSTERR option in the PROC DISCRIM statement (see the following section, ”Posterior
Probability Error-Rate Estimates”) are smoothed error-rate estimates. The posterior probability estimates
for each group are based on the posterior probabilities of the observations classified into that same group.
The posterior probability estimates provide good estimates of the error rate when the posterior probabili-
ties are accurate. When a parametric classification criterion (linear or quadratic discriminant function) is
derived from a nonnormal population, the resulting posterior probability error-rate estimators may not be
appropriate.
The overall error rate is estimated through a weighted average of the individual group-specific error-rate
estimates, where the prior probabilities are used as the weights.
To reduce both the bias and the variance of the estimator, Hora and Wilcox (1982) compute the posterior
probability estimates based on cross validation. The resulting estimates are intended to have both low vari-
ance from using the posterior probability estimate and low bias from cross validation. They use Monte Carlo
studies on two-group multivariate normal distributions to compare the cross validation posterior probability
estimates with three other estimators: the apparent error rate, cross validation estimator, and posterior
probability estimator. They conclude that the cross validation posterior probability estimator has a lower
mean squared error in their simulations.
• ft(x) – the group-specific density estimate at x from group t
• f(x) = Σ_t qt ft(x) – the estimated unconditional density at x
• et – the classification error rate for group t
Bayes’ Theorem
Assuming that the prior probabilities of group membership are known and that the group-specific densities at x can be estimated, PROC DISCRIM computes p(t | x), the posterior probability of x belonging to group t, by applying Bayes' theorem:
p(t | x) = qt ft(x) / f(x).
PROC DISCRIM partitions a p-dimensional vector space into regions Rt, where the region Rt is the subspace containing all p-dimensional vectors y such that p(t | y) is the largest among all groups. An observation is classified as coming from group t if it lies in region Rt.
Parametric Methods
Assuming that each group has a multivariate normal distribution, PROC DISCRIM develops a discriminant
function or classification criterion using a measure of generalized squared distance. The classification criterion
is based on either the individual within-group covariance matrices or the pooled covariance matrix; it also
takes into account the prior probabilities of the classes. Each observation is placed in the class from which
it has the smallest generalized squared distance. PROC DISCRIM also computes the posterior probability
of an observation belonging to each class.
The squared Mahalanobis distance from x to group t is
dt²(x) = (x − mt)' Vt^{-1} (x − mt),
where Vt = St if the within-group covariance matrices are used, or Vt = Sp if the pooled covariance matrix is used.
The group-specific density estimate at x from group t is then given by
ft(x) = (2π)^{-p/2} |Vt|^{-1/2} exp[ −.5 dt²(x) ].
Using Bayes' theorem, the posterior probability of x belonging to group t is
p(t | x) = qt ft(x) / Σ_u qu fu(x).
The discriminant scores are −0.5 Du²(x). An observation is classified into group u if setting t = u produces the largest value of p(t | x) or the smallest value of Dt²(x). If this largest posterior probability is less than the threshold specified, x is classified into group OTHER.
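A typical parametric call consistent with this description might look as follows; the data set and variable names are placeholders, and POOL=TEST lets the procedure choose between the pooled and the within-group covariance matrices based on a homogeneity test.

proc discrim data=train method=normal pool=test crossvalidate posterr;
   class group;
   var x1-x4;
   priors proportional;
run;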
12.4.2 Nonparametric Methods
Whenever the density function for πi is unspecified, one can use a nonparametric method for estimating
fi (x). SAS PROC DISCRIM provides a number of nonparametric options. I have included their description
of some of the options.
Nonparametric Methods
The volume of a p-dimensional unit sphere is
v0 = π^{p/2} / Γ(p/2 + 1),
where Γ(·) represents the gamma function (refer to SAS Language Reference: Dictionary).
Thus, in group t, the volume of a p-dimensional ellipsoid bounded by {z | z' Vt^{-1} z = r²} is
vr(t) = r^p |Vt|^{1/2} v0.
The kernel method uses one of the following densities as the kernel density in group t.
Uniform Kernel
Kt(z) = vr(t)^{-1} if z' Vt^{-1} z ≤ r², and Kt(z) = 0 otherwise.
Epanechnikov Kernel
Kt(z) = c1(t) (1 − r^{-2} z' Vt^{-1} z)   if z' Vt^{-1} z ≤ r², where c1(t) = vr(t)^{-1} (1 + p/2).
Biweight Kernel
Kt(z) = c2(t) (1 − r^{-2} z' Vt^{-1} z)²  if z' Vt^{-1} z ≤ r², where c2(t) = (1 + p/4) c1(t).
Triweight Kernel
Kt(z) = c3(t) (1 − r^{-2} z' Vt^{-1} z)³  if z' Vt^{-1} z ≤ r², where c3(t) = (1 + p/6) c2(t).
The group t density at x is estimated by
ft(x) = (1/nt) Σ_y Kt(x − y),
where the summation is over all observations y in group t, and Kt is the specified kernel function. The
posterior probability of membership in group t is then given by
p(t | x) = (qt kt / nt) / Σ_u (qu ku / nu).
If the closed ellipsoid centered at x does not include any training set observations, f(x) is zero and x is classified into group OTHER. When the prior probabilities are equal, p(t | x) is proportional to kt/nt and x is classified into the group that has the highest proportion of observations in the closed ellipsoid. When the prior probabilities are proportional to the group sizes, p(t | x) = kt / Σ_u ku, and x is classified into the group that has the largest number of observations in the closed ellipsoid.
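The following PROC IML code is a minimal sketch of the uniform-kernel calculation for two groups with a pooled covariance matrix; the training data, the radius, and the point being classified are all artificial.

proc iml;
   /* small artificial training samples (rows are observations) */
   X1 = {1 2, 2 1, 2 3, 3 2};                   /* group 1 */
   X2 = {5 6, 6 5, 6 7, 7 6};                   /* group 2 */
   n1 = nrow(X1);  n2 = nrow(X2);
   q1 = 0.5;  q2 = 0.5;                         /* prior probabilities      */
   r  = 3;                                      /* kernel radius            */

   /* pooled within-group covariance matrix and its inverse */
   m1 = X1[:, ];   m2 = X2[:, ];
   S1 = t(X1 - repeat(m1, n1, 1)) * (X1 - repeat(m1, n1, 1)) / (n1 - 1);
   S2 = t(X2 - repeat(m2, n2, 1)) * (X2 - repeat(m2, n2, 1)) / (n2 - 1);
   Vinv = inv( ((n1-1)*S1 + (n2-1)*S2) / (n1 + n2 - 2) );

   /* count the training points of a group inside the ellipsoid of radius r */
   start countIn(X, x0, Vinv, r);
      k = 0;
      do i = 1 to nrow(X);
         z = X[i, ] - x0;
         if z * Vinv * t(z) <= r*r then k = k + 1;
      end;
      return(k);
   finish;

   x0 = {3 3};                                  /* observation to classify  */
   k1 = countIn(X1, x0, Vinv, r);
   k2 = countIn(X2, x0, Vinv, r);

   /* uniform-kernel posterior, p(t|x) = (qt kt/nt) / SUM_u (qu ku/nu) */
   denom = q1*k1/n1 + q2*k2/n2;
   if denom > 0 then p1 = (q1*k1/n1) / denom;
   else p1 = .;                                 /* empty ellipsoid: classify as OTHER */
   print k1 k2 p1;
quit;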
12.4.3 Nearest Neighbor Method
The nearest-neighbor method fixes the number, k, of training set points for each observation x. The method
finds the radius rk (x) that is the distance from x to the k th nearest training set point in the metric Vt−1 .
Consider a closed ellipsoid centered at x bounded by {z | (z − x)' Vt^{-1} (z − x) = rk²(x)}; the nearest-neighbor
method is equivalent to the uniform-kernel method with a location-dependent radius rk (x). Note that, with
ties, more than k training set points may be in the ellipsoid.
Using the k-nearest-neighbor rule, the k (or more with ties) smallest distances are saved. Of these k distances, let kt represent the number of distances that are associated with group t. Then, as in the uniform-kernel method, the estimated group t density at x is
ft(x) = kt / (nt vk(x)),
where vk(x) is the volume of the ellipsoid bounded by {z | (z − x)' Vt^{-1} (z − x) = rk²(x)}.

When Vt is a diagonal matrix, each variable in group t is scaled separately by its variance in the kernel estimation, where the variance can be the pooled variance (Vt = Sp)
or an individual within-group variance (Vt = St ). When Vt is a full covariance matrix, the variables in group
t are scaled simultaneously by Vt in the kernel estimation.
In nearest-neighbor methods, the choice of k is usually relatively uncritical (Hand 1982). A practical
approach is to try several different values of the smoothing parameters within the context of the particular
application and to choose the one that gives the best cross validated estimate of the error rate.
* Nonparametric Method with Uniform Kernel r=4;
PROC DISCRIM DATA=salmon short method=npar r=4 posterr CROSSVALIDATE;
CLASSES locale;
VAR fresh marine;
RUN;
* Nonparametric Method with Normal Kernel r=4;
PROC DISCRIM DATA=salmon short method=npar r=4 kernel=normal posterr CROSSVALIDATE;
CLASSES locale;
VAR fresh marine;
RUN;
* Nonparametric Method with EPANECHNIKOV Kernel r=4;
PROC DISCRIM DATA=salmon short method=npar r=4 kernel=epa posterr CROSSVALIDATE;
CLASSES locale;
VAR fresh marine;
RUN;
* Nonparametric Method with k=4 Nearest Neighbor;
PROC DISCRIM DATA=salmon short method=npar k=4 posterr CROSSVALIDATE;
CLASSES locale;
VAR fresh marine;
RUN;
quit;
The output is;
Parametric Method

Salmon Data Johnson-Wichern page 659
The DISCRIM Procedure

[Class Level Information for locale: two groups of 50 observations each, equal prior probabilities 0.5 and 0.5]

Test of Homogeneity of Within Covariance Matrices

   P = Number of Variables
   N = Total Number of Observations - Number of Groups

   V = prod over i of |Within SS Matrix(i)|^(N(i)/2)  /  |Pooled SS Matrix|^(N/2)

   RHO = 1.0 - [ SUM 1/N(i) - 1/N ] (2P^2 + 3P - 1) / ( 6(P+1)(K-1) )

   DF = .5 (K-1) P (P+1)

   Under the null hypothesis, -2 RHO ln( N^(PN/2) V / prod over i of N(i)^(PN(i)/2) )
   is distributed approximately as chi-square with DF degrees of freedom.

        Chi-Square    DF    Pr > ChiSq
         10.696146     3        0.0135

Since the Chi-Square value is significant at the 0.1 level, the within
covariance matrices will be used in the discriminant function.
Reference: Morrison, D.F. (1976) Multivariate Statistical Methods p252.

Generalized Squared Distance Function:
   D_j^2(X) = (X - Xbar_j)' COV_j^{-1} (X - Xbar_j) + ln |COV_j|
Posterior Probability of Membership in Each locale:
   Pr(j|X) = exp(-.5 D_j^2(X)) / SUM_k exp(-.5 D_k^2(X))
Resubstitution Summary using Quadratic Discriminant Function

Number of Observations and Percent Classified into locale
From locale        1        2    Total
1                 45        5       50
               90.00    10.00   100.00
2                  2       48       50
                4.00    96.00   100.00
Total             47       53      100
               47.00    53.00   100.00

Number of Observations and Average Posterior Probabilities Classified into locale
From locale        1        2
1                 45        5
              0.9599   0.7391
2                  2       48
              0.7380   0.9266
Total             47       53
              0.9505   0.9089

Cross-validation Summary using Quadratic Discriminant Function
   D_j^2(X) = (X - Xbar_(X)j)' COV_(X)j^{-1} (X - Xbar_(X)j) + ln |COV_(X)j|
   Pr(j|X)  = exp(-.5 D_j^2(X)) / SUM_k exp(-.5 D_k^2(X))

Number of Observations and Percent Classified into locale
From locale        1        2    Total
1                 45        5       50
               90.00    10.00   100.00
2                  3       47       50
                6.00    94.00   100.00
Total             48       52      100
               48.00    52.00   100.00

Number of Observations and Average Posterior Probabilities Classified into locale
From locale        1        2
1                 45        5
              0.9582   0.7609
2                  3       47
              0.7070   0.9315
Total             48       52
              0.9425   0.9151
Nonparametric Method – Uniform Kernel r=4

[Class Level Information for locale: two groups of 50 observations each, equal prior probabilities 0.5 and 0.5]

Squared Distance Function:
   D^2(X,Y) = (X - Y)' COV^{-1} (X - Y)

Resubstitution Summary
Number of Observations and Percent Classified into locale
From locale        1        2    Other    Total
1                 41        7        2       50
               82.00    14.00     4.00   100.00
2                  2       48        0       50
                4.00    96.00     0.00   100.00
Total             43       55        2      100
               43.00    55.00     2.00   100.00

Number of Observations and Average Posterior Probabilities Classified into locale
From locale        1        2
1                 41        7
              0.6263   0.5160
2                  2       48
              0.5214   0.5966
Total             43       55
              0.6214   0.5864

Cross-validation Summary
Number of Observations and Percent Classified into locale
From locale        1        2    Other    Total
1                 41        7        2       50
               82.00    14.00     4.00   100.00
2                  2       48        0       50
                4.00    96.00     0.00   100.00
Total             43       55        2      100
               43.00    55.00     2.00   100.00

Number of Observations and Average Posterior Probabilities Classified into locale
From locale        1        2
1                 41        7
              0.6263   0.5164
2                  2       48
              0.5220   0.5966
Total             43       55
              0.6214   0.5864
Nonparametric Method – Normal Kernel r=4

[Class Level Information for locale: two groups of 50 observations each, equal prior probabilities 0.5 and 0.5]

Squared Distance Function and Kernel Density:
   D^2(X,Y) = (X - Y)' COV^{-1} (X - Y)
   F(X|j) = n_j^{-1} SUM_i exp( -.5 D^2(X, Y_ji) / R^2 )

Resubstitution Summary
Number of Observations and Percent Classified into locale
From locale        1        2    Total
1                 44        6       50
               88.00    12.00   100.00
2                  1       49       50
                2.00    98.00   100.00
Total             45       55      100
               45.00    55.00   100.00

Number of Observations and Average Posterior Probabilities Classified into locale
From locale        1        2
1                 44        6
              0.5718   0.5232
2                  1       49
              0.5414   0.5629
Total             45       55
              0.5711   0.5586

Cross-validation Summary
Number of Observations and Percent Classified into locale
From locale        1        2    Total
1                 44        6       50
               88.00    12.00   100.00
2                  1       49       50
                2.00    98.00   100.00
Total             45       55      100
               45.00    55.00   100.00

Number of Observations and Average Posterior Probabilities Classified into locale
From locale        1        2
1                 44        6
              0.5712   0.5243
2                  1       49
              0.5431   0.5623
Total             45       55
              0.5706   0.5582
Salmon Data Johnson-Wichern page 659
The DISCRIM Procedure

Nonparametric Method – Epanechnikov Kernel r=4

[Class Level Information for locale: two groups of 50 observations each, equal prior probabilities 0.5 and 0.5]

Squared Distance Function and Kernel Density:
   D^2(X,Y) = (X - Y)' COV^{-1} (X - Y)
   F(X|j) = n_j^{-1} SUM_i ( 1.0 - D^2(X, Y_ji) / R^2 )

Resubstitution Summary
Number of Observations and Percent Classified into locale
From locale        1        2    Total
1                 44        6       50
               88.00    12.00   100.00
2                  1       49       50
                2.00    98.00   100.00
Total             45       55      100
               45.00    55.00   100.00

Number of Observations and Average Posterior Probabilities Classified into locale
From locale        1        2
1                 44        6
              0.7448   0.5691
2                  1       49
              0.6350   0.6983
Total             45       55
              0.7423   0.6842

Classification Results for Calibration Data: WORK.SALMON
Cross-validation Results using Epanechnikov Kernel Density

Number of Observations and Percent Classified into locale
From locale        1        2    Total
1                 44        6       50
               88.00    12.00   100.00
2                  1       49       50
                2.00    98.00   100.00
Total             45       55      100
               45.00    55.00   100.00

Number of Observations and Average Posterior Probabilities Classified into locale
From locale        1        2
1                 44        6
              0.7438   0.5725
2                  1       49
              0.6414   0.6971
Total             45       55
              0.7415   0.6835
The DISCRIM Procedure

Nonparametric Method – k=4 Nearest Neighbor

[Class Level Information for locale: two groups of 50 observations each, equal prior probabilities 0.5 and 0.5]

Squared Distance Function:
   D^2(X,Y) = (X - Y)' COV^{-1} (X - Y)

Resubstitution Summary
Number of Observations and Percent Classified into locale
From locale        1        2    Other    Total
1                 44        1        5       50
               88.00     2.00    10.00   100.00
2                  2       46        2       50
                4.00    92.00     4.00   100.00
Total             46       47        7      100
               46.00    47.00     7.00   100.00

Number of Observations and Average Posterior Probabilities Classified into locale
From locale        1        2
1                 44        1
              0.9909   0.7500
2                  2       46
              0.6750   0.9674
Total             46       47
              0.9772   0.9628

Posterior Probability Error Rate Estimates for locale
Estimate             1        2     Total
Unstratified    0.1010   0.0950    0.0980
Priors          0.5000   0.5000

Cross-validation Summary
Number of Observations and Percent Classified into locale
From locale        1        2    Total
1                 46        4       50
               92.00     8.00   100.00
2                  2       48       50
                4.00    96.00   100.00
Total             48       52      100
               48.00    52.00   100.00

Number of Observations and Average Posterior Probabilities Classified into locale
From locale        1        2
1                 46        4
              0.9517   0.8096
2                  2       48
              0.8731   0.9300
Total             48       52
              0.9484   0.9207
The partition of the classification trees is;
Chapter 13

Cluster Analysis

The basic objective of cluster analysis is to discover natural groupings of items or variables within a data set.
These groupings will be based upon how similar or dissimilar objects are according to some measure that
quantifies these similarities and dissimilarities.
3. p = ∞, the sup norm, given by
∆∞(x, y) = sup_{1 ≤ j ≤ d} | xj − yj | .
A number of other metrics have been suggested and are available to the user of most statistics packages.
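For two hypothetical observations, the p = 1, p = 2, and p = ∞ metrics can be computed directly, as the following PROC IML sketch shows.

proc iml;
   x = {1 4 2};   y = {3 1 2};                  /* two arbitrary observations */
   d2   = sqrt( sum((x - y)##2) );              /* p = 2, Euclidean distance  */
   d1   = sum( abs(x - y) );                    /* p = 1, city-block distance */
   dinf = max( abs(x - y) );                    /* p = infinity, the sup norm */
   print d2 d1 dinf;
quit;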
are usually required, and the default iteration limit is increased when you specify LEAST=p. Values of p
less than 2 reduce the effect of outliers on the cluster centers compared with least-squares methods; values
of p greater than 2 increase the effect of outliers.
The FASTCLUS procedure is intended for use with large data sets, with 100 or more observations. With
small data sets, the results may be highly sensitive to the order of the observations in the data set.
PROC FASTCLUS produces brief summaries of the clusters it finds. For more extensive examination of
the clusters, you can request an output data set containing a cluster membership variable.
Background
The FASTCLUS procedure combines an effective method for finding initial clusters with a standard iterative
algorithm for minimizing the sum of squared distances from the cluster means. The result is an efficient
procedure for disjoint clustering of large data sets. PROC FASTCLUS was directly inspired by Hartigan’s
(1975) leader algorithm and MacQueen’s (1967) k-means algorithm. PROC FASTCLUS uses a method that
Anderberg (1973) calls nearest centroid sorting. A set of points called cluster seeds is selected as a first guess
of the means of the clusters. Each observation is assigned to the nearest seed to form temporary clusters. The
seeds are then replaced by the means of the temporary clusters, and the process is repeated until no further
changes occur in the clusters. Similar techniques are described in most references on clustering (Anderberg
1973; Hartigan 1975; Everitt 1980; Spath 1980).
The FASTCLUS procedure differs from other nearest centroid sorting methods in the way the initial
cluster seeds are selected. The importance of initial seed selection is demonstrated by Milligan (1980).
The clustering is done on the basis of Euclidean distances computed from one or more numeric variables.
If there are missing values, PROC FASTCLUS computes an adjusted distance using the nonmissing values.
Observations that are very close to each other are usually assigned to the same cluster, while observations
that are far apart are in different clusters.
The FASTCLUS procedure operates in four steps:
1. Observations called cluster seeds are selected.
2. If you specify the DRIFT option, temporary clusters are formed by assigning each observation to the
cluster with the nearest seed. Each time an observation is assigned, the cluster seed is updated as
the current mean of the cluster. This method is sometimes called incremental, on-line, or adaptive
training.
3. If the maximum number of iterations is greater than zero, clusters are formed by assigning each obser-
vation to the nearest seed. After all observations are assigned, the cluster seeds are replaced by either
the cluster means or other location estimates (cluster centers) appropriate to the LEAST=p option.
This step can be repeated until the changes in the cluster seeds become small or zero (MAXITER=).
4. Final clusters are formed by assigning each observation to the nearest seed.
If PROC FASTCLUS runs to complete convergence, the final cluster seeds will equal the cluster means
or cluster centers. If PROC FASTCLUS terminates before complete convergence, which often happens with
the default settings, the final cluster seeds may not equal the cluster means or cluster centers. If you want
complete convergence, specify CONVERGE=0 and a large value for the MAXITER= option.
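For example, a run that requests complete convergence as described above might look like the following, where the data set and variable names are placeholders.

proc fastclus data=mydata maxclusters=4 maxiter=100 converge=0 drift out=clusout;
   var x1-x5;
run;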
The initial cluster seeds must be observations with no missing values. You can specify the maximum
number of seeds (and, hence, clusters) using the MAXCLUSTERS= option. You can also specify a minimum
distance by which the seeds must be separated using the RADIUS= option.
PROC FASTCLUS always selects the first complete (no missing values) observation as the first seed.
The next complete observation that is separated from the first seed by at least the distance specified in
the RADIUS= option becomes the second seed. Later observations are selected as new seeds if they are
separated from all previous seeds by at least the radius, as long as the maximum number of seeds is not
exceeded.
If an observation is complete but fails to qualify as a new seed, PROC FASTCLUS considers using it to
replace one of the old seeds. Two tests are made to see if the observation can qualify as a new seed.
First, an old seed is replaced if the distance between the observation and the closest seed is greater than
the minimum distance between seeds. The seed that is replaced is selected from the two seeds that are
closest to each other. The seed that is replaced is the one of these two with the shortest distance to the
closest of the remaining seeds when the other seed is replaced by the current observation.
If the observation fails the first test for seed replacement, a second test is made. The observation replaces
the nearest seed if the smallest distance from the observation to all seeds other than the nearest one is greater
than the shortest distance from the nearest seed to all other seeds. If the observation fails this test, PROC
FASTCLUS goes on to the next observation.
You can specify the REPLACE= option to limit seed replacement. You can omit the second test for seed
replacement (REPLACE=PART), causing PROC FASTCLUS to run faster, but the seeds selected may not
be as widely separated as those obtained by the default method. You can also suppress seed replacement
entirely by specifying REPLACE=NONE. In this case, PROC FASTCLUS runs much faster, but you must
choose a good value for the RADIUS= option in order to get good clusters. This method is similar to
Hartigan’s (1975, pp. 74 -78) leader algorithm and the simple cluster seeking algorithm described by Tou
and Gonzalez (1974, pp. 90 -92).
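A call that suppresses seed replacement and instead relies on a user-chosen radius might look like the following; the data set, variable names, and radius value are placeholders.

proc fastclus data=mydata maxclusters=10 replace=none radius=2.5 out=seeds;
   var x1-x5;
run;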
where dik is the distance between object i in cluster U V and object k in cluster W , and N(U V ) and
NW are the number of objects in the respective clusters.
Overview
The CLUSTER procedure hierarchically clusters the observations in a SAS data set using one of eleven methods. The data can be coordinates or distances. If the data are coordinates, PROC CLUSTER computes (possibly squared) Euclidean distances. If you want to perform a cluster analysis on non-Euclidean distance data, it is possible to do so by using a TYPE=DISTANCE data set as input. Macros in the SAS/STAT sample library can compute many kinds of distance matrices.
One situation where analyzing non-Euclidean distance data can be useful is when you have categori-
cal data, where the distance data are calculated using an association measure. For more information, see
Example 23.5. The clustering methods available are average linkage, the centroid method, complete link-
age, density linkage (including Wong’s hybrid and kth-nearest-neighbor methods), maximum likelihood for
mixtures of spherical multivariate normal distributions with equal variances but possibly unequal mixing
proportions, the flexible-beta method, McQuitty’s similarity analysis, the median method, single linkage,
two-stage density linkage, and Ward’s minimum-variance method. All methods are based on the usual ag-
glomerative hierarchical clustering procedure. Each observation begins in a cluster by itself. The two closest
clusters are merged to form a new cluster that replaces the two old clusters. Merging of the two closest
clusters is repeated until only one cluster is left. The various clustering methods differ in how the distance
between two clusters is computed. Each method is described in the section ”Clustering Methods”.
The CLUSTER procedure is not practical for very large data sets because, with most methods, the
CPU time varies as the square or cube of the number of observations. The FASTCLUS procedure requires
time proportional to the number of observations and can, therefore, be used with much larger data sets than
PROC CLUSTER. If you want to cluster a very large data set hierarchically, you can use PROC FASTCLUS
for a preliminary cluster analysis producing a large number of clusters and then use PROC CLUSTER to
cluster the preliminary clusters hierarchically. This method is used to find clusters for the Fisher Iris data
in Example 23.3, later in this chapter.
PROC CLUSTER displays a history of the clustering process, giving statistics useful for estimating the
number of clusters in the population from which the data are sampled. PROC CLUSTER also creates an
output data set that can be used by the TREE procedure to draw a tree diagram of the cluster hierarchy or
to output the cluster membership at any desired level. For example, to obtain the six-cluster solution, you
could first use PROC CLUSTER with the OUTTREE= option then use this output data set as the input
data set to the TREE procedure. With PROC TREE, specify NCLUSTERS=6 and the OUT= options to
obtain the six-cluster solution and draw a tree diagram. For an example, see Example 66.1 in Chapter 66,
”The TREE Procedure.”
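For example, the six-cluster solution described above could be obtained with code such as the following, where the data set, variable names, and the ID variable are placeholders.

proc cluster data=mydata method=ward outtree=tree;
   var x1-x5;
   id idvar;
run;

proc tree data=tree nclusters=6 out=clus6 noprint;
   id idvar;
run;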
Before you perform a cluster analysis on coordinate data, it is necessary to consider scaling or transforming
the variables since variables with large variances tend to have more effect on the resulting clusters than
those with small variances. The ACECLUS procedure is useful for performing linear transformations of the
variables. You can also use the PRINCOMP procedure with the STD option, although in some cases it
tends to obscure clusters or magnify the effect of error in the data when all components are retained. The
STD option in the CLUSTER procedure standardizes the variables to mean 0 and standard deviation 1.
Standardization is not always appropriate. See Milligan and Cooper (1987) for a Monte Carlo study on
various methods of variable standardization. You should remove outliers before using PROC PRINCOMP
or before using PROC CLUSTER with the STD option unless you specify the TRIM= option.
Nonlinear transformations of the variables may change the number of population clusters and should,
therefore, be approached with caution. For most applications, the variables should be transformed so that
equal differences are of equal practical importance. An interval scale of measurement is required if raw data
are used as input. Ordinal or ranked data are generally not appropriate.
Agglomerative hierarchical clustering is discussed in all standard references on cluster analysis, for ex-
ample, Anderberg (1973), Sneath and Sokal (1973), Hartigan (1975), Everitt (1980), and Spath (1980). An
especially good introduction is given by Massart and Kaufman (1983). Anyone considering doing a hierar-
chical cluster analysis should study the Monte Carlo results of Milligan (1980), Milligan and Cooper (1985),
and Cooper and Milligan (1988). Other essential, though more advanced, references on hierarchical clus-
tering include Hartigan (1977, pp. 60 -68; 1981), Wong (1982), Wong and Schaack (1982), and Wong and
Lane (1983). Refer to Blashfield and Aldenderfer (1978) for a discussion of the confusing terminology in
hierarchical cluster analysis.
13.4 Example
In this section I have included the output for SAS PROC FASTCLUS and PROC CLUSTER. The example is the US Navy officer data used in earlier examples. The SAS code is:
*options nodate ps=60 PAGENO=1 LINESIZE=75;
/*dm ’log;clear;out;clear;’;
*/
TITLE ’U.S. NAVY BACHELOR OFFICERS’’ QUARTERS’;
DATA USNAVY;
INPUT SITE 1-2 ADO MAC WHR CUA WNGS OBC RMS MMH;
LOGADO=LOG(ADO);
LOGMAC=LOG(MAC);
LABEL ADO = ’AVG DAILY OCCUPANCY’
MAC = ’AVG NUMBER OF CHECK-INS PER MO.’
WHR = ’WEEKLY HRS OF SERVICE DESK OPERATION’
CUA = ’SQ FT OF COMMON USE AREA’
WNGS= ’NUMBER OF BUILDING WINGS’
OBC = ’OPERATIONAL BERTHING CAPACITY’
RMS = ’NUMBER OF ROOMS’
MMH = ’MONTHLY MAN-HOURS’
LOGADO = ’LOG OCCUPANCY’
LOGMAC = ’LOG CHK-INS’;
CARDS;
1 2 4 4 1.26 1 6 6 180.23
2 3 1.58 40 1.25 1 5 5 182.61
3 16.6 23.78 40 1 1 13 13 164.38
4 7 2.37 168 1 1 7 8 284.55
5 5.3 1.67 42.5 7.79 3 25 25 199.92
6 16.5 8.25 168 1.12 2 19 19 267.38
7 25.89 3.00 40 0 3 36 36 999.09
8 44.42 159.75 168 .6 18 48 48 1103.24
9 39.63 50.86 40 27.37 10 77 77 944.21
10 31.92 40.08 168 5.52 6 47 47 931.84
11 97.33 255.08 168 19 6 165 130 2268.06
12 56.63 373.42 168 6.03 4 36 37 1489.5
13 96.67 206.67 168 17.86 14 120 120 1891.7
14 54.58 207.08 168 7.77 6 66 66 1387.82
15 113.88 981 168 24.48 6 166 179 3559.92
16 149.58 233.83 168 31.07 14 185 202 3115.29
17 134.32 145.82 168 25.99 12 192 192 2227.76
18 188.74 937.00 168 45.44 26 237 237 4804.24
19 110.24 410 168 20.05 12 115 115 2628.32
20 96.83 677.33 168 20.31 10 302 210 1880.84
21 102.33 288.83 168 21.01 14 131 131 3036.63
22 274.92 695.25 168 46.63 58 363 363 5539.98
23 811.08 714.33 168 22.76 17 242 242 3534.49
24 384.50 1473.66 168 7.36 24 540 453 8266.77
25 95 368 168 30.26 9 292 196 1845.89
*goptions device=ps2ega;
/*
PROC G3D data=usnavy;
SCATTER LOGADO*LOGMAC=MMH;
TITLE2 ’3-D Plot’;
run;
*/
/* proc princomp; var ADO MAC WHR CUA WNGS OBC RMS; run; */
proc fastclus data=usnavy maxc=2 maxiter=10 out=clus;
var ADO MAC WHR CUA WNGS OBC RMS MMH;
run;
proc cluster data=usnavy method=single outtree=tree;  /* the original PROC CLUSTER statement
      was lost at a page break; METHOD=SINGLE matches the output below, and the data set
      name and other options are assumptions */
   var ADO MAC WHR CUA WNGS OBC RMS MMH;
*  height _rsq_;
   copy ADO MAC WHR CUA WNGS OBC RMS MMH;
   id site;
run;
The SAS output is:

U.S. NAVY BACHELOR OFFICERS' QUARTERS
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02

[Initial Seeds and Iteration History tables]

Cluster Summary
                        RMS Std     Maximum Distance     Nearest     Distance Between
Cluster   Frequency    Deviation    from Seed to Obs.    Cluster     Cluster Centroids
   2           3         664.4             2120.3           1               4739.9

[Pseudo F Statistic, Approximate Expected Over-All R-Squared, Cubic Clustering Criterion]
WARNING: The two values above are invalid for correlated variables.

Cluster Means
Cluster         WNGS           OBC           RMS            MMH
   1        7.727273    104.318182     95.636364    1551.075909
   2       36.000000    380.000000    351.000000    6203.663333

A second FASTCLUS run with three clusters produced the analogous Initial Seeds,
Iteration History, Cluster Summary, and Cluster Means tables, with
Approximate Expected Over-All R-Squared = 0.86912
WARNING: The two values above are invalid for correlated variables.

The CLUSTER Procedure
Single Linkage Cluster Analysis

Cluster History
NCL   --Clusters Joined--   FREQ   SPRSQ   RSQ   ERSQ   CCC   PSF   PST2   Norm Min Dist   Tie
The Cluster Tree is;
[1,] -4 -6
[2,] -2 -3
[3,] 2 -5
[4,] -1 3
[5,] -7 -9
[6,] 5 -10
[7,] -16 -21
[8,] -11 -17
[9,] 4 1
[10,] -12 -14
[11,] 6 -8
[12,] -13 -25
[13,] 11 10
[14,] 12 -20
[15,] 8 14
[16,] 15 -19
[17,] 16 7
[18,] 13 17
[19,] 9 18
[20,] -15 -23
[21,] -18 -22
[22,] 19 20
[23,] 22 21
[24,] 23 -24
Order of objects:
[1] 1 2 3 5 4 6 7 9 10 8 12 14 11 17 13 25 20 19 16 21 15 23
[23] 18 22 24
Height:
[1] 36.20112 33.73715 34.02875 143.48285 26.18488 651.30164
[7] 98.24102 137.86258 209.82213 290.12504 199.39922 453.16504
[13] 139.95699 358.46153 252.32414 311.93894 395.87403 426.58963
[19] 139.65372 869.86791 753.46956 1251.04620 799.97551 2845.24880
Agglomerative coefficient:
[1] 0.8754856
Available arguments:
[1] "order" "height" "ac" "merge" "order.lab"
[6] "diss" "data" "call"
The Splus Cluster Tree is;
items in a low-dimensional coordinate system using only the rank orders of the M = N(N − 1)/2 original distances, and not their actual values, the procedures which use these rank orderings are called nonmetric multidimensional scaling. If the actual distances are used, the procedures are called metric multidimensional scaling. Principal component analysis is a metric multidimensional scaling procedure.

The observed similarities s_{ik} between pairs of items constitute the basic data. If one assumes that there are no ties in these similarities, then they can be arranged in a strictly ascending order as
s_{i1,k1} < s_{i2,k2} < · · · < s_{iM,kM},
where s_{i1,k1} represents the pair of points with the smallest or least similarity. Now we want to find a q-dimensional representation of the N items such that the distances d^{(q)}_{ik} between pairs of items match the above ordering when the distances are laid out in descending magnitude. That is, a match is perfect if
d^{(q)}_{i1,k1} > d^{(q)}_{i2,k2} > · · · > d^{(q)}_{iM,kM}.
Now for any given value of q << p, define the stress as
Stress(q) = { Σ_{i<k} ( d^{(q)}_{ik} − d̂^{(q)}_{ik} )²  /  Σ_{i<k} [ d^{(q)}_{ik} ]² }^{1/2},
where the d̂^{(q)}_{ik} are monotone functions of the distances. The value of the Stress function is subjective: values greater than 10% are poor, while values in the 5% - 10% range are good to excellent.

The following measure, often called SStress, is becoming a more acceptable criterion:
SStress = { Σ_{i<k} ( d²_{ik} − d̂²_{ik} )²  /  Σ_{i<k} d⁴_{ik} }^{1/2},
which is a value between 0 and 1. Any value less than .1 is typically an acceptable value for indicating there is a good representation.
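As a small numerical check, the following PROC IML sketch evaluates both criteria for a set of hypothetical observed distances and monotone fitted values.

proc iml;
   /* hypothetical observed distances and monotone fitted values for the pairs i < k */
   d    = {4.1 2.3 5.6 3.0 1.8 4.4};
   dhat = {4.0 2.5 5.5 3.2 1.7 4.6};
   stress  = sqrt( sum((d - dhat)##2)       / sum(d##2) );   /* first criterion            */
   sstress = sqrt( sum((d##2 - dhat##2)##2) / sum(d##4) );   /* squared-distance criterion */
   print stress sstress;
quit;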
CocoaPuffs G 110 1 1 180 0.0 12.0 13 55 1
CountChocula G 110 1 1 180 0.0 12.0 13 65 1
GoldenGrahams G 110 1 1 280 0.0 15.0 9 45 1
HoneyNutCheerios G 110 3 1 250 1.5 11.5 10 90 1
Kix G 110 2 1 260 0.0 21.0 3 40 1
LuckyCharms G 110 2 1 180 0.0 12.0 12 55 1
MultiGrainCheerios G 100 2 1 220 2.0 15.0 6 90 1
OatmealRaisinCrisp G 130 3 2 170 1.5 13.5 10 120 1
RaisinNutBran G 100 3 2 140 2.5 10.5 8 140 1
TotalCornFlakes G 110 2 1 200 0.0 21.0 3 35 1
TotalRaisinBran G 140 3 1 190 4.0 15.0 14 230 1
TotalWholeGrain G 100 3 1 200 3.0 16.0 3 110 1
Trix G 110 1 1 140 0.0 13.0 12 25 1
Cheaties G 100 3 1 200 3.0 17.0 3 110 1
WheatiesHoneyGold G 110 2 1 200 1.0 16.0 8 60 1
AllBran K 70 4 1 260 9.0 7.0 5 320 2
AppleJacks K 110 2 0 125 1.0 11.0 14 30 2
CornFlakes K 100 2 0 290 1.0 21.0 2 35 2
CornPops K 110 1 0 90 1.0 13.0 12 20 2
CracklinOatBran K 110 3 3 140 4.0 10.0 7 160 2
Crispix K 110 2 0 220 1.0 21.0 3 30 2
FrootLoops K 110 2 1 125 1.0 11.0 13 30 2
FrostedFlakes K 110 1 0 200 1.0 14.0 11 25 2
FrostedMiniWheats K 100 3 0 0 3.0 14.0 7 100 2
FruitfulBran K 120 3 0 240 5.0 14.0 12 190 2
JustRightCrunchyNuggets K 110 2 1 170 1.0 17.0 6 60 2
MueslixCrispyBlend K 160 3 2 150 3.0 17.0 13 160 2
NutNHoneyCrunch K 120 2 1 190 0.0 15.0 9 40 2
NutriGrainAlmondRaisin K 140 3 2 220 3.0 21.0 7 130 2
NutriGrainWheat K 90 3 0 170 3.0 18.0 2 90 2
Product19 K 100 3 0 320 1.0 20.0 3 45 2
RaisinBran K 120 3 1 210 5.0 14.0 12 240 2
RiceKrispies K 110 2 0 290 0.0 22.0 3 35 2
Smacks K 110 2 1 70 1.0 9.0 15 40 2
SpecialK K 110 6 0 230 1.0 16.0 3 55 2
CapNCrunch Q 120 1 2 220 0.0 12.0 12 35 3
HoneyGrahamOhs Q 120 1 2 220 1.0 12.0 11 45 3
Life Q 100 4 2 150 2.0 12.0 6 95 3
PuffedRice Q 50 1 0 0 0.0 13.0 0 15 3
PuffedWheat Q 50 2 0 0 1.0 10.0 0 50 3
QuakerOatmeal Q 100 5 2 0 2.7 1.0 1 110 3
;
PROC STANDARD DATA=Cereal MEAN=0 STD=1 OUT=Cereal2;
VAR Cal Protein Sodium Fiber Carbs Sugar Potass;
RUN;
I = _N_;
KEEP Cal Protein Sodium Fiber Carbs Sugar Potass I;
****************************************************
* CREATED A DUPLICATE DATA SET BY CREATING DUMMY
VARIABLES I AND J FOR MERGING SETS TO ALLOW US
TO CALCULATE H , WHICH IS DISTANCE.
******************************************************;
******************************************************************
* CALCULATE H, THE DISTANCE BETWEEN ALL PAIRS OF POINTS
*****************************************************************;
RUN;