Multivariate Statistical Analysis: Old School
John I. Marden
Department of Statistics
University of Illinois at Urbana-Champaign
2011
© by John I. Marden
Preface
The goal of this text is to give the reader a thorough grounding in old-school multivariate statistical analysis. The emphasis is on multivariate normal modeling and inference, both theory and implementation. Linear models form a central theme of the book. Several chapters are devoted to developing the basic models, including multivariate regression and analysis of variance, and especially the “both-sides models” (i.e., generalized multivariate analysis of variance models), which allow modeling relationships among individuals as well as variables. Growth curve and repeated measure models are special cases.
The linear models are concerned with means. Inference on covariance matrices
covers testing equality of several covariance matrices, testing independence and conditional independence of (blocks of) variables, factor analysis, and some symmetry
models. Principal components, though mainly a graphical/exploratory technique,
also lends itself to some modeling.
Classification and clustering are related areas. Both attempt to categorize indi-
viduals. Classification tries to classify individuals based upon a previous sample of
observed individuals and their categories. In clustering, there is no observed categorization, nor often even knowledge of how many categories there are. These must be
estimated from the data.
Other useful multivariate techniques include biplots, multidimensional scaling,
and canonical correlations.
The bulk of the results here are mathematically justified, but I have tried to arrange
the material so that the reader can learn the basic concepts and techniques while
plunging as much or as little as desired into the details of the proofs.
Practically all the calculations and graphics in the examples are implemented
using the statistical computing environment R [R Development Core Team, 2010].
Throughout the notes we have scattered some of the actual R code we used. Many of
the data sets and original R functions can be found in the file http://www.istics.net/r/multivariateOldSchool.r. For others we refer to available R packages.
Contents
Preface iii
Contents iv
2 Multivariate Distributions 27
2.1 Probability distributions 27
2.1.1 Distribution functions 27
2.1.2 Densities 28
2.1.3 Representations 29
2.1.4 Conditional distributions 30
2.2 Expected values 32
2.3 Means, variances, and covariances 33
2.3.1 Vectors and matrices 34
2.3.2 Moment generating functions 35
2.4 Independence 35
2.5 Additional properties of conditional distributions 37
2.6 Affine transformations 40
2.7 Exercises 41
11 Classification 199
11.1 Mixture models 199
11.2 Classifiers 201
11.3 Fisher’s linear discrimination 203
11.4 Cross-validation estimate of error 205
11.4.1 Example: Iris data 207
11.5 Fisher’s quadratic discrimination 210
11.5.1 Example: Iris data, continued 210
11.6 Modifications to Fisher’s discrimination 211
11.7 Conditioning on X: Logistic regression 212
11.7.1 Example: Iris data 214
11.7.2 Example: Spam 214
11.8 Trees 217
11.8.1 CART 220
11.9 Exercises 224
12 Clustering 229
12.1 K-Means 230
12.1.1 Example: Sports data 230
12.1.2 Gap statistics 231
12.1.3 Silhouettes 233
12.1.4 Plotting clusters in one and two dimensions 233
12.1.5 Example: Sports data, using R 236
12.2 K-medoids 238
12.3 Model-based clustering 240
12.3.1 Example: Automobile data 240
12.3.2 Some of the models in mclust 243
12.4 An example of the EM algorithm 245
12.5 Soft K-means 246
12.5.1 Example: Sports data 247
12.6 Hierarchical clustering 247
12.6.1 Example: Grades data 248
12.6.2 Example: Sports data 249
12.7 Exercises 251
Bibliography 295
Index 301
Chapter 1

Multivariate Data
In this chapter, we try to give a sense of what multivariate data sets look like, and
introduce some of the basic matrix manipulations needed throughout these notes.
Chapters 2 and 3 lay down the distributional theory. Linear models are probably the
most popular statistical models ever. With multivariate data, we can model relation-
ships between individuals or between variables, leading to what we call “both-sides
models,” which do both simultaneously. Chapters 4 through 8 present these models
in detail. The linear models are concerned with means. Before turning to models
on covariances, Chapter 9 briefly reviews likelihood methods, including maximum
likelihood estimation, likelihood ratio tests, and model selection criteria (Bayes and
Akaike). Chapter 10 looks at a number of models based on covariance matrices, in-
cluding equality of covariances, independence and conditional independence, factor
analysis, and other structural models. Chapter 11 deals with classification, in which
the goal is to find ways to classify individuals into categories, e.g., healthy or un-
healthy, based on a number of observed variables. Chapter 12 has a similar goal,
except that the categories are unknown and we seek groupings of individuals using
just the observed variables. Finally, Chapter 13 explores principal components, which
we first see in Section 1.6. It is an approach for reducing the number of variables,
or at least finding a few interesting ones, by searching through linear combinations of
the observed variables. Multidimensional scaling has a similar objective, but tries to
exhibit the individual data points in a low-dimensional space while preserving the
original inter-point distances. Canonical correlations has two sets of variables, and
finds linear combinations of the two sets to explain the correlations between them.
On to the data.
The data are arranged in an n × q data matrix, with rows denoting individuals and the columns denoting variables:

Y = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1q} \\ y_{21} & y_{22} & \cdots & y_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nq} \end{pmatrix}.  (1.1)

Then y_{ij} is the value of variable j for individual i. Much more complex data
structures exist, but this course concentrates on these straightforward data matrices.
1.2 Glyphs
Graphical displays of univariate data, that is, data on one variable, are well-known:
histograms, stem-and-leaf plots, pie charts, box plots, etc. For two variables, scatter plots are valuable. It is more of a challenge when dealing with three or more
variables.
Glyphs provide an option. A little picture is created for each individual, with characteristics based on the values of the variables. Chernoff’s faces [Chernoff, 1973] may be the most famous glyphs. The idea is that people intuitively respond to characteristics of faces, so that many variables can be summarized in a face.
Figure 1.1 exhibits faces for the nine planets. We use the faces routine by H. P. Wolf
in the R package aplpack, Wolf and Bielefeld [2010]. The distance the planet is from
the sun is represented by the height of the face (Pluto has a long face), the length of
the planet’s day by the width of the face (Venus has a wide face), etc. One can then
cluster the planets. Mercury, Earth and Mars look similar, as do Saturn and Jupiter.
These face plots are more likely to be amusing than useful, especially if the number
of individuals is large. A star plot is similar. Each individual is represented by a
p-pointed star, where each point corresponds to a variable, and the distance of the point from the center is based on the variable’s value for that individual. See Figure 1.2.

Figure 1.1: Chernoff’s faces for the planets. Each feature represents a variable. For the first six variables, from the faces help file, 1-height of face, 2-width of face, 3-shape of face, 4-height of mouth, 5-width of mouth, 6-curve of smile.
1.3 Scatter plots

Two-dimensional scatter plots can be enhanced by using different symbols for the observations instead of plain dots. For example, different colors could be used for
different groups of points, or glyphs representing other variables could be plotted.
Figure 1.2 plots the planets with the logarithms of day length and year length as the
axes, where the stars created from the other four variables are the plotted symbols.
Note that the planets pair up in a reasonable way. Mercury and Venus are close, both
in terms of the scatter plot and in the look of their stars. Similarly, Earth and Mars
pair up, as do Jupiter and Saturn, and Uranus and Neptune. See Listing 1.1 for the R
code.
A scatter plot matrix arranges all possible two-way scatter plots in a q × q matrix.
These displays can be enhanced with brushing, in which individual points or groups of points can be selected in one plot, and be simultaneously highlighted in the
other plots.
Listing 1.1: R code for the star plot of the planets, Figure 1.2. The data are in the
matrix planets. The first statement normalizes the variables to range from 0 to 1. The
ep matrix is used to place the names of the planets. Tweaking is necessary, depending
on the size of the plot.
p <- apply(planets,2,function(z) (z-min(z))/(max(z)-min(z)))
x <- log(planets[,2])
y <- log(planets[,3])
ep <- rbind(c(-.3,.4),c(-.5,.4),c(.5,0),c(.5,0),c(.6,-1),c(-.5,1.4),
      c(1,-.6),c(1.3,.4),c(1,-.5))   # offsets for placing the planet names
symbols(x,y,stars=p[,-(2:3)],xlab='log(day)',ylab='log(year)',inches=.4)
text(x+ep[,1],y+ep[,2],labels=rownames(planets),cex=.5)
Figure 1.2: Scatter plot of log(day) versus log(year) for the planets, with plotting
symbols being stars created from the other four variables, distance, diameter, tem-
perature, moons.
[Scatter plot matrix; the diagonal panels are labeled Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.]
Figure 1.3: A scatter plot matrix for the Fisher/Anderson iris data. In each plot, “s”
indicates setosa plants, “v” indicates versicolor, and “g” indicates virginica.
The most famous data set in multivariate analysis is the iris data analyzed by Fisher
[1936] based on data collected by Anderson [1935]. See also Anderson [1936]. There
are fifty specimens each of three species of iris: setosa, versicolor, and virginica. There
are four variables measured on each plant, sepal length, sepal width, petal length, and
petal width. Thus n = 150 and q = 4. Figure 1.3 contains the corresponding scatter
plot matrix, with species indicated by letter. We used the R function pairs. Note that
the three species separate fairly well, with setosa especially different from the other
two.
As a preview of classification (Chapter 11), Figure 1.4 uses faces to exhibit five
observations from each species, and five random observations without their species
label. Can you guess which species each is? See page 20. The setosas are not too
difficult, since they have small faces, but distinguishing the other two can be a chal-
lenge.
Figure 1.4: Five specimens from each iris species, plus five from unspecified species.
Here, “set” indicates setosa plants, “vers” indicates versicolor, and “virg” indicates
virginica.
1.4 Sample means, variances, and covariances

For n observations x_1, . . . , x_n on a single variable, the sample mean and sample variance are

\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i,  (1.3)

s_x^2 = s_{xx} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2.  (1.4)

Note the two notations: The s_x^2 is most common when dealing with individual variables, but the s_{xx} transfers better to multivariate data. Often one is tempted to divide by n − 1 instead of n. That’s fine, too. With a second set of values z_1, . . . , z_n, we have
s_{xz} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(z_i - \bar{z}).  (1.5)
So the covariance between the xi ’s and themselves is the variance, which is to say that
s_x^2 = s_{xx}. The sample correlation coefficient is a normalization of the covariance that ranges between −1 and +1, defined by

r_{xz} = \frac{s_{xz}}{s_x s_z},  (1.6)
provided both variances are positive. (See Corollary 8.1.) In a scatter plot of x versus
z, the correlation coefficient is +1 if all the points lie on a line with positive slope,
and −1 if they all lie on a line with negative slope.
For a data matrix Y (1.1) with q variables, there are q means:
\bar{y}_j = \frac{1}{n} \sum_{i=1}^n y_{ij},  (1.7)

which are collected into the 1 × q mean vector

\bar{y} = (\bar{y}_1, . . . , \bar{y}_q).  (1.8)
The n × 1 one vector is 1_n = (1, 1, . . . , 1)', the vector of all 1’s. Then the mean vector (1.8) can be written

\bar{y} = \frac{1}{n} 1_n' Y.  (1.9)
To find the variances and covariances, we first have to subtract the means from the
individual observations in Y: change yij to yij − ȳ j for each i, j. That can be achieved
by subtracting the n × q matrix 1n ȳ from Y to get the matrix of deviations. Using
(1.9), we can write
Y - 1_n \bar{y} = Y - \frac{1}{n} 1_n 1_n' Y = (I_n - \frac{1}{n} 1_n 1_n') Y \equiv H_n Y.  (1.10)
There are two important matrices in that formula: The n × n identity matrix In ,
I_n = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix},  (1.11)
and the centering matrix H_n,

H_n = I_n - \frac{1}{n} 1_n 1_n'.  (1.12)

For n × 1 vectors x = (x_1, . . . , x_n)' and z = (z_1, . . . , z_n)', the sum of cross-products in (1.5) can be written

\sum_{i=1}^n (x_i - \bar{x})(z_i - \bar{z}) = (x - \bar{x} 1_n)'(z - \bar{z} 1_n).  (1.14)
Thus taking the deviations matrix in (1.10), (H_n Y)'(H_n Y) contains all the \sum (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k)'s. We will call that matrix the sum of squares and cross-products matrix. Notice that

(H_n Y)'(H_n Y) = Y' H_n' H_n Y = Y' H_n Y.  (1.15)

What happened to the H_n's? First, H_n is clearly symmetric, so that H_n' = H_n. Then notice that H_n H_n = H_n. Such a matrix is called idempotent, that is, a square matrix
A is idempotent if AA = A. (1.16)
Dividing the sum of squares and cross-products matrix by n gives the sample
variance-covariance matrix, or more simply sample covariance matrix:
S = \frac{1}{n} Y' H_n Y = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1q} \\ s_{21} & s_{22} & \cdots & s_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ s_{q1} & s_{q2} & \cdots & s_{qq} \end{pmatrix},  (1.17)
where s jj is the sample variance of the jth variable (column), and s jk is the sample
covariance between the jth and kth variables. (When doing inference later, we may
divide by n − df instead of n for some “degrees-of-freedom” integer df.)
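To make the matrix formulas concrete, here is a small numerical check in R (our own illustration, not from the text; the data matrix y is made up):

n <- 5; q <- 3
y <- matrix(rnorm(n*q),nrow=n)        # a made-up n x q data matrix
Hn <- diag(n) - matrix(1,n,n)/n       # the centering matrix H_n = I_n - (1/n) 1_n 1_n'
all.equal(Hn%*%Hn,Hn)                 # H_n is idempotent, as in (1.16)
S <- t(y)%*%Hn%*%y/n                  # the sample covariance matrix (1.17), divisor n
all.equal(S,var(y)*(n-1)/n)           # R's var() uses divisor n-1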
transforming the original data matrix to another one, albeit with only one variable.
Now there is a histogram for each vector b. A one-dimensional grand tour runs
through the vectors b, displaying the histogram for Yb as it goes. (See Asimov [1985]
and Buja and Asimov [1986] for general grand tour methodology.) Actually, one does
not need all b, e.g., the vectors b = (1, 2, 5) and b = (2, 4, 10) would give the same
histogram. Just the scale of the horizontal axis on the histograms would be different.
One simplification is to look at only the b’s with norm 1. That is, the norm of a vector x = (x_1, . . . , x_q)' is

\|x\| = \sqrt{x_1^2 + \cdots + x_q^2} = \sqrt{x' x},  (1.20)

so one would run through the b’s with \|b\| = 1. Note that the one-dimensional marginals are special cases: take b to have one element equal to 1 and the rest equal to 0.
Scatter plots of two linear combinations are more common. That is, there are two sets of coefficients (the b_{1j}'s and b_{2j}'s), yielding the two variables w_1 = Yb_1 and w_2 = Yb_2.
In general, the data matrix generated from p linear combinations can be written
W = YB, (1.23)
where W is n × p, and B is q × p with column k containing the coefficients for the kth
linear combination. As for one linear combination, the coefficient vectors are taken to
have norm 1, i.e., \|(b_{1k}, . . . , b_{qk})\| = 1, which is equivalent to having all the diagonals of B'B being 1.
Another common restriction is to have the linear combination vectors be orthogo-
nal, where two column vectors b and c are orthogonal if b'c = 0. Geometrically, or-
thogonality means the vectors are perpendicular to each other. One benefit of restrict-
ing to orthogonal linear combinations is that one avoids scatter plots that are highly
correlated but not meaningfully so, e.g., one might have w1 be Height + Weight, and
w2 be .99 × Height + 1.01 × Weight. Having those two highly correlated does not tell
us anything about the data set. If the columns of B are orthogonal to each other, as
well as having norm 1, then
B'B = I_p.  (1.24)
A set of norm 1 vectors that are mutually orthogonal are said to be orthonormal .
Return to two orthonormal linear combinations. A two-dimensional grand tour
plots the two variables as the q × 2 matrix B runs through all the matrices with a pair
of orthonormal columns.
1.5.1 Rotations
If the B in (1.24) is q × q, i.e., there are as many orthonormal linear combinations as variables, then B is an orthogonal matrix. That is, a q × q matrix G is orthogonal if

G'G = GG' = I_q.  (1.25)
Note that the definition says that the columns are orthonormal, and the rows are
orthonormal. In fact, the rows are orthonormal if and only if the columns are (if the matrix is square), hence the middle equality in (1.25) is not strictly needed in the
definition.
Think of the data matrix Y being the set of n points in q-dimensional space. For
orthogonal matrix G, what does the set of points W = YG look like? It looks exactly
like Y, but rotated or flipped. Think of a pinwheel turning, or a chicken on a rotisserie,
or the earth spinning around its axis or rotating about the sun. Figure 1.5 illustrates
a simple rotation of two variables. In particular, the norms of the points in Y are the
same as in W, so each point remains the same distance from 0.
Rotating point clouds for three variables work by first multiplying the n × 3 data
matrix by a 3 × 3 orthogonal matrix, then making a scatter plot of the first two re-
sulting variables. By running through the orthogonal matrices quickly, one gets the
illusion of three dimensions. See the discussion immediately above Exercise 1.9.21
for some suggestions on software for real-time rotations.
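As a small illustration (ours, not the book’s), the following R snippet rotates a made-up three-variable data set by a random orthogonal matrix and checks that each point keeps its distance from 0:

y <- matrix(rnorm(100*3),ncol=3)         # a made-up n x 3 data matrix
G <- qr.Q(qr(matrix(rnorm(9),3,3)))      # a random 3 x 3 orthogonal matrix
all.equal(t(G)%*%G,diag(3))              # G'G = I_3, as in (1.25)
w <- y%*%G                               # the rotated data
all.equal(rowSums(y^2),rowSums(w^2))     # squared norms of the points are unchanged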
1.6 Principal components

The sample mean and covariance matrix of W = YB in (1.23) can be obtained from the mean and covariance matrix of Y using (1.8) and (1.15):

\bar{w} = \frac{1}{n} 1_n' W = \frac{1}{n} 1_n' Y B = \bar{y} B,  (1.26)

by (1.9), and

S_W = \frac{1}{n} W' H_n W = \frac{1}{n} B' Y' H_n Y B = B' S B,  (1.27)

where S is the covariance matrix of Y in (1.17). In particular, for a column vector b, the sample variance of Yb is b'Sb. Thus the principal components aim to maximize g'Sg for g's of unit length.
Definition 1.2. Suppose S is the sample covariance matrix for the n × q data matrix Y. Let
g_1, . . . , g_q be an orthonormal set of q × 1 vectors such that

g_1' S g_1 = \max \{ g' S g : \|g\| = 1 \},
g_i' S g_i = \max \{ g' S g : \|g\| = 1,\; g' g_1 = \cdots = g' g_{i-1} = 0 \},  i = 2, . . . , q.  (1.28)

Then Yg_i is the i-th sample principal component, g_i is its loading vector, and l_i \equiv g_i' S g_i is its sample variance.
Because the function g'Sg is continuous in g, and the maximizations are over
compact sets, these principal components always exist. They may not be unique,
although for sample covariance matrices, if n ≥ q, they almost always are unique, up
to sign. See Section 13.1 for further discussion.
By the construction in (1.28), we have that the sample variances of the principal
components are ordered as
l1 ≥ l2 ≥ · · · ≥ l q . (1.29)
What is not as obvious, but quite important, is that the principal components are
uncorrelated, as in the next lemma, proved in Section 1.8.
Lemma 1.1. The S and g_1, . . . , g_q in Definition 1.2 satisfy

g_i' S g_j = 0 for i ≠ j.  (1.30)
Figure 1.5: The sepal length and sepal width for the setosa iris data. The first plot is
the raw data, centered. The second shows the two principal components.
Collecting the loading vectors into the orthogonal matrix G = (g_1, . . . , g_q) and the variances into the diagonal matrix L with diagonal elements l_1, . . . , l_q, we can write the spectral decomposition

S = GLG'.  (1.33)
Although we went through the derivation with S being a covariance matrix, all
we really needed for this theorem was that S is symmetric. The gi ’s and li ’s have
mathematical names, too: Eigenvectors and eigenvalues.
Definition 1.3 (Eigenvalues and eigenvectors). Suppose A is a q × q matrix. Then λ is
an eigenvalue of A if there exists a non-zero q × 1 vector u such that Au = λu. The vector
u is the corresponding eigenvector. Similarly, u ≠ 0 is an eigenvector if there exists an
eigenvalue to which it corresponds.
A little linear algebra shows that indeed, each gi is an eigenvector of S correspond-
ing to li . Hence the following:
Symbol Principal components Spectral decomposition
li Variance Eigenvalue (1.34)
gi Loadings Eigenvector
Figure 1.5 plots the principal components for the q = 2 variables sepal length and
sepal width for the fifty iris observations of the species setosa. The data has been
centered, so that the means are zero. The variances of the two original variables are
0.124 and 0.144, respectively. The first graph shows the two variables are highly cor-
related, with most of the points lining up near the 45◦ line. The principal component
loading matrix G rotates the points approximately 45◦ clockwise as in the second
graph, so that the data are now most spread out along the horizontal axis (variance is
0.234), and least along the vertical (variance is 0.034). The two principal components
are also, as it appears, uncorrelated.
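The computation behind the figure can be sketched in R as follows (our own reconstruction, not the book’s code; it uses the iris data frame built into R):

x <- as.matrix(iris[iris$Species=='setosa',c('Sepal.Length','Sepal.Width')])
x <- scale(x,scale=FALSE)          # center the two variables, but do not rescale
S <- t(x)%*%x/nrow(x)              # sample covariance matrix, divisor n as in (1.17)
eg <- eigen(S)
eg$values                          # the l_i's: variances of the two principal components
pc <- x%*%eg$vectors               # the principal components, as plotted in Figure 1.5
round(t(pc)%*%pc/nrow(pc),4)       # diagonal matrix: the components are uncorrelated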
Best K components
In the process above, we found the principal components one by one. It may be
that we would like to find the rotation for which the first K variables, say, have the
maximal sum of variances. That is, we wish to find the orthonormal set of q × 1
vectors b_1, . . . , b_K to maximize

b_1' S b_1 + \cdots + b_K' S b_K.  (1.35)

Fortunately, the answer is the same, i.e., take b_i = g_i for each i, the principal compo-
nents. See Proposition 1.1 in Section 1.8. Section 13.1 explores principal components
further.
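A quick numerical illustration of this result (ours, with made-up data): among matrices B with K orthonormal columns, the sum of variances trace(B'SB) is largest when the columns are the first K loading vectors.

y <- matrix(rnorm(200*5),ncol=5)             # made-up data, q = 5
S <- var(y)
eg <- eigen(S)
K <- 2
sum(eg$values[1:K])                          # the maximum, l_1 + ... + l_K
B <- qr.Q(qr(matrix(rnorm(5*K),5,K)))        # a random 5 x K matrix with orthonormal columns
sum(diag(t(B)%*%S%*%B))                      # no larger than the maximum above
sum(diag(t(eg$vectors[,1:K])%*%S%*%eg$vectors[,1:K]))   # achieves the maximum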
1.6.1 Biplots
When plotting observations using the first few principal component variables, the
relationship between the original variables and principal components is often lost.
An easy remedy is to rotate and plot the original axes as well. Imagine in the original
data space, in addition to the observed points, one plots arrows of length λ along the
axes. That is, the arrows are the line segments
where an arrowhead is added at the non-origin end of the segment. If Y is the matrix
of observations, and G1 the matrix containing the first p loading vectors, then
\hat{X} = Y G_1,  (1.37)
\hat{A} = (a_1, . . . , a_q) G_1.  (1.38)
The plot consisting of the points \hat{X} and the arrows \hat{A} is then called the biplot. See Gabriel [1981]. The points of the arrows in \hat{A} are just the rows of

\lambda G_1,  (1.39)
so that in practice all we need to do is for each axis, draw an arrow pointing from
the origin to λ× (the i th row of G1 ). The value of λ is chosen by trial-and-error, so
that the arrows are amidst the observations. Notice that the components of these
arrows are proportional to the loadings, so that the length of the arrows represents
the weight of the corresponding variables on the principal components.
j      1      2     3     4    5     6     7
l_j    10.32  4.28  3.98  3.3  2.74  2.25  0        (1.41)
The first eigenvalue is 10.32, quite a bit larger than the second. The second through
sixth are fairly equal, so it may be reasonable to look at just the first component.
(The seventh eigenvalue is 0, but that follows because the rank vectors all sum to
1 + · · · + 7 = 28, hence exist in a six-dimensional space.)
We create the biplot using the first two dimensions. We first plot the people:
ev <- eg$vectors
w <- y%*%ev # The principal components
lm <- range(w)
plot(w[,1:2],xlim=lm,ylim=lm)
The biplot adds in the original axes. Thus we want to plot the seven (q = 7) points as
in (1.39), where Γ1 contains the first two eigenvectors. Plotting the arrows and labels:
arrows(0,0,5*ev[,1],5*ev[,2])
text(7*ev[,1:2],labels=colnames(y))
The constants “5” (which is the λ) and “7” were found by trial and error so that the
graph, Figure 1.6, looks good. We see two main clusters. The left-hand cluster of
people is associated with the team sports’ arrows (baseball, football and basketball),
and the right-hand cluster is associated with the individual sports’ arrows (cycling,
swimming, jogging). Tennis is a bit on its own, pointing south.
Figure 1.6: Biplot of the sports data, using the first two principal components.
has two variables, height and weight, measured on a number of adults. The variance
of height, in inches, is about 9, and the variance of weight, in pounds, is 900 (= 30^2).
One would expect the first principal component to be close to the weight variable,
because that is where the variation is. On the other hand, if height were measured in
millimeters, and weight in tons, the variances would be more like 6000 (for height)
and 0.0002 (for weight), so the first principal component would be essentially the
height variable. In general, if the variables are not measured in the same units, it can
be problematic to decide what units to use for the variables. See Section 13.1.1. One common approach is to divide each variable by its standard deviation \sqrt{s_{jj}}, so that the resulting variables all have variance 1.
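In R, that standardization is what scale does by default; a brief illustration (ours, on the iris measurements):

y <- as.matrix(iris[,1:4])
ystd <- scale(y)            # subtract the means and divide by the standard deviations
eigen(var(ystd))$values     # eigenvalues of the scaled data's covariance matrix
eigen(cor(y))$values        # the same values: this is just the correlation matrix of y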
Another caution is that the linear combination with largest variance is not necessarily the most interesting, e.g., you may want one which is maximally correlated
with another variable, or which distinguishes two populations best, or which shows
the most clustering.
Popular objective functions to maximize, other than variance, are skewness, kur-
tosis and negative entropy. The idea is to find projections that are not normal (in the
sense of the normal distribution). The hope is that these will show clustering or some
other interesting feature.
Skewness measures a certain lack of symmetry, where one tail is longer than the
other. It is measured by the normalized sample third central (meaning subtract the
mean) moment:
Skewness = \frac{\sum_{i=1}^n (x_i - \bar{x})^3/n}{(\sum_{i=1}^n (x_i - \bar{x})^2/n)^{3/2}}.  (1.42)
Positive values indicate a longer tail to the right, and negative to the left. Kurtosis is based on the normalized sample fourth central moment:

Kurtosis = \frac{\sum_{i=1}^n (x_i - \bar{x})^4/n}{(\sum_{i=1}^n (x_i - \bar{x})^2/n)^2} - 3.  (1.43)
The “−3” is there so that exactly normal data will have kurtosis 0. A variable with
low kurtosis is more “boxy” than the normal. One with high kurtosis tends to have
thick tails and a pointy middle. (A variable with low kurtosis is platykurtic, and one
with high kurtosis is leptokurtic, from the Greek: kyrtos = curved, platys = flat, like a
platypus, and lepto = thin.) Bimodal distributions often have low kurtosis.
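The formulas (1.42) and (1.43) translate directly into R; a minimal version (our own functions, not from the text):

skew <- function(x) mean((x-mean(x))^3)/(mean((x-mean(x))^2))^(3/2)
kurt <- function(x) mean((x-mean(x))^4)/(mean((x-mean(x))^2))^2 - 3
z <- rnorm(10000)
c(skew(z),kurt(z))     # both are near 0 for normal data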
Entropy
(You may wish to look through Section 2.1 before reading this section.) The entropy
of a random variable Y with pdf f(y) is

Entropy(f) = E_f[-\log(f(Y))] = -\int f(y) \log(f(y)) dy.  (1.44)

Entropy is supposed to measure lack of structure, so that the larger the entropy, the more diffuse the distribution is. For the normal, we have that
Entropy(N(\mu, \sigma^2)) = E_f\left[\log(\sqrt{2\pi}\,\sigma) + \frac{(Y-\mu)^2}{2\sigma^2}\right] = \frac{1}{2}(1 + \log(2\pi\sigma^2)).  (1.45)
Note that it does not depend on the mean μ, and that it increases without bound as σ2
increases. Thus maximizing entropy unrestricted is not an interesting task. However,
one can imagine maximizing entropy for a given mean and variance, which leads to
the next lemma, to be proved in Section 1.8.
Lemma 1.2. The N (μ, σ2 ) uniquely maximizes the entropy among all pdf’s with mean μ and
variance σ2 .
Thus a measure of nonnormality of g is its entropy subtracted from that of the
normal with the same variance. Since there is a negative sign in front of the entropy
of g, this difference is called negentropy, defined for any g as

Negent(g) = \frac{1}{2}(1 + \log(2\pi\sigma^2)) - Entropy(g),  where \sigma^2 = Var_g[Y].  (1.46)
With data, one does not know the pdf g, so one must estimate the negentropy. This
value is known as the Kullback-Leibler distance, or discrimination information, from
g to the normal density. See Kullback and Leibler [1951].
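One simple way to estimate it (a rough sketch of ours; the negent routines used later in the text may well do something different) is to estimate the entropy from a histogram of the data and plug into (1.46):

negent.hist <- function(x,nbins=ceiling(sqrt(length(x)))) {
   h <- hist(x,breaks=nbins,plot=FALSE)
   p <- h$counts/sum(h$counts)             # proportion of observations in each bin
   w <- diff(h$breaks)                     # bin widths
   ent <- -sum(ifelse(p>0,p*log(p/w),0))   # histogram estimate of the entropy
   sig2 <- mean((x-mean(x))^2)
   (1+log(2*pi*sig2))/2 - ent              # negentropy as in (1.46)
}
negent.hist(rnorm(5000))    # near 0 for normal data
negent.hist(rexp(5000))     # positive for skewed, nonnormal data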
Figure 1.7: Projection pursuit for the iris data. The first plot is based on maximizing
the variances of the projections, i.e., principal components. The second plot maxi-
mizes estimated entropies.
produce different projections. The first principal component weights equally on the
two length variables, while the first entropy variable is essentially petal length.
                 Variance          Entropy
                 g1      g2        g1*     g2*
Sepal length     0.63    0.43      0.08    0.74
Sepal width     −0.36    0.90      0.00   −0.68
Petal length     0.69    0.08     −1.00    0.06        (1.47)
Figure 1.7 graphs the results. The plots both show separation between setosa and
the other two species, but the principal components plot has the observations more
spread out, while the entropy plot shows the two groups much tighter.
The matrix iris has the iris data, with the first four columns containing the mea-
surements, and the fifth specifying the species. The observations are listed with the
fifty setosas first, then the fifty versicolors, then the fifty virginicas. To find the prin-
cipal components for the first three variables, we use the following:
y <- scale(as.matrix(iris[,1:3]))
g <- eigen(var(y))$vectors
pc <- y%*%g
The first statement centers and scales the variables. The plot of the first two columns
of pc is the first plot in Figure 1.7. The procedure we used for entropy is negent3D in
Listing A.3, explained in Appendix A.1. The code is
gstar <- negent3D(y,nstart=10)$vectors
ent <- y%*%gstar
To create plots like the ones in Figure 1.7, use
par(mfrow=c(1,2))
sp <- rep(c('s','v','g'),c(50,50,50))
plot(pc[,1:2],pch=sp) # pch specifies the characters to plot.
plot(ent[,1:2],pch=sp)
1.8 Proofs
Proof of the principal components result, Lemma 1.1
The idea here was taken from Davis and Uhl [1999]. Consider the g1 , . . . , gq as defined
in (1.28). Take i < j, and for angle θ, let g(θ) = cos(θ) g_i + sin(θ) g_j and h(θ) = g(θ)' S g(θ). According to the i-th stage in (1.28), h(θ) is maximized when g(θ) = g_i, i.e., when θ = 0. The function is differentiable, hence its derivative must be zero at θ = 0. To
verify (1.30), differentiate:

0 = \frac{d}{d\theta} h(\theta)\Big|_{\theta=0}
  = \frac{d}{d\theta}\left(\cos^2(\theta)\, g_i' S g_i + 2\sin(\theta)\cos(\theta)\, g_i' S g_j + \sin^2(\theta)\, g_j' S g_j\right)\Big|_{\theta=0}
  = 2 g_i' S g_j.  (1.50)
Best K components
We next consider finding the set of orthonormal vectors b_1, . . . , b_K that maximizes the sum of variances, \sum_{i=1}^K b_i' S b_i, as in (1.35). It is convenient here to have the next definition.
Definition 1.4 (Trace). The trace of an m × m matrix A is the sum of its diagonals, trace(A) = \sum_{i=1}^m a_{ii}.
Lemma 1.3. Suppose S and B_K are as in Proposition 1.1, and S = GLG' is its spectral decomposition. Then (1.52) holds.
where the a_{ij}'s are the elements of A, and c_i = \sum_{j=1}^K a_{ij}^2. Because the columns of A have norm one, and the rows of A have norms less than or equal to one,

\sum_{i=1}^q c_i = \sum_{j=1}^K \left[\sum_{i=1}^q a_{ij}^2\right] = K  and  c_i \leq 1.  (1.54)
To maximize (1.53) under those constraints on the ci ’s, we try to make the earlier
ci ’s as large as possible, which means that c1 = · · · = cK = 1 and cK +1 = · · · =
cq = 0. The resulting value is then l1 + · · · + lK . Note that taking A with aii = 1,
i = 1, . . . , K, and 0 elsewhere (so that A consists of the first K columns of Iq ), achieves
that maximum. With that A, we have that B = (g1 , . . . , gK ).
The last two terms in (1.55) are equal, since Y has the same mean and variance under
f and g.
At this point we need an important inequality about convexity, to wit, what
follows is a definition and lemma.
1.9 Exercises
Exercise 1.9.1. Let Hn be the centering matrix in (1.12). (a) What is Hn 1n ? (b) Suppose
x is an n × 1 vector whose elements sum to zero. What is Hn x? (c) Show that Hn is
idempotent (1.16).
Exercise 1.9.2. Define the matrix J_n = (1/n) 1_n 1_n', so that H_n = I_n − J_n. (a) What
does Jn do to a vector? (That is, what is Jn a?) (b) Show that Jn is idempotent. (c) Find
the spectral decomposition (1.33) for Jn explicitly when n = 3. [Hint: In G, the first
column (eigenvector) is proportional to 13 . The remaining two eigenvectors can be
any other vectors such that the three eigenvectors are orthonormal. Once you have a
G, you can find the L.] (d) Find the spectral decomposition for H3 . [Hint: Use the
same eigenvectors as for J3 , but in a different order.] (e) What do you notice about
the eigenvalues for these two matrices?
Exercise 1.9.3. A covariance matrix has intraclass correlation structure if all the vari-
ances are equal, and all the covariances are equal. So for n = 3, it would look like
A = \begin{pmatrix} a & b & b \\ b & a & b \\ b & b & a \end{pmatrix}.  (1.61)
Find the spectral decomposition for this type of matrix. [Hint: Use the G in Exercise
1.9.2, and look at G'AG.]
Exercise 1.9.7. In (1.53), show that trace(A'LA) = \sum_{i=1}^q [(\sum_{j=1}^K a_{ij}^2) l_i].
Exercise 1.9.8. This exercise is to show that the eigenvalue matrix of a covariance
matrix S is unique. Suppose S has two spectral decompositions, S = GLG' = HMH', where G and H are orthogonal matrices, and L and M are diagonal matrices with nonincreasing diagonal elements. Use Lemma 1.3 on both decompositions of S to show that for each K = 1, . . . , q, l_1 + · · · + l_K = m_1 + · · · + m_K. Thus L = M.
Exercise 1.9.9. Suppose Y is a data matrix, and Z = YF for some orthogonal matrix
F, so that Z is a rotated version of Y. Show that the variances of the principal com-
ponents are the same for Y and Z. (This result should make intuitive sense.) [Hint:
Find the spectral decomposition of the covariance of Z from that of Y, then note that
these covariance matrices have the same eigenvalues.]
Exercise 1.9.10. Show that in the spectral decomposition (1.33), each li is an eigen-
value, with corresponding eigenvector gi , i.e., Sgi = li gi .
Exercise 1.9.11. Suppose λ is an eigenvalue of the covariance matrix S. Show that
λ must equal one of the li ’s in the spectral decomposition of S. [Hint: Let u be
an eigenvector corresponding to λ. Show that λ is also an eigenvalue of L, with
corresponding eigenvector v = G'u, hence l_i v_i = λ v_i for each i.]
Exercise 1.9.12. Verify the expression for \int g(y) \log(f(y)) dy in (1.55).
Exercise 1.9.13. Consider the setup in Jensen’s inequality, Lemma 1.4. (a) Show that if
h is convex, E [ h(W )] ≥ h( E [W ]). [Hint: Set x0 = E [W ] in Definition 1.5.] (b) Suppose
h is strictly convex. Give an example of a random variable W for which E [ h(W )] =
h(E[W]). (c) Show that if h is strictly convex and W is not constant, then E[h(W)] > h(E[W]).
Exercise 1.9.14 (Spam). In the Hewlett-Packard spam data, a set of n = 4601 emails
were classified according to whether they were spam, where “0” means not spam, “1”
means spam. Fifty-seven explanatory variables based on the content of the emails
were recorded, including various word and symbol frequencies. The emails were
sent to George Forman (not the boxer) at Hewlett-Packard labs, hence emails with
the words “George” or “hp” would likely indicate non-spam, while “credit” or “!”
would suggest spam. The data were collected by Hopkins et al. [1999], and are in the
data matrix Spam. ( They are also in the R data frame spam from the ElemStatLearn
package [Halvorsen, 2009], as well as at the UCI Machine Learning Repository [Frank
and Asuncion, 2010].)
Based on an email’s content, is it possible to accurately guess whether it is spam
or not? Here we use Chernoff’s faces. Look at the faces of some emails known to
be spam and some known to be non-spam (the “training data”). Then look at some
randomly chosen faces (the “test data”). E.g., to have twenty observations known
to be spam, twenty known to be non-spam, and twenty test observations, use the
following R code:
x0 <- Spam[Spam[,'spam']==0,] # The non-spam
x1 <- Spam[Spam[,'spam']==1,] # The spam
train0 <- x0[1:20,]
train1 <- x1[1:20,]
test <- rbind(x0[-(1:20),],x1[-(1:20),])[sample(1:4561,20),]
Based on inspecting the training data, try to classify the test data. How accurate are
your guesses? The faces program uses only the first fifteen variables of the input
matrix, so you should try different sets of variables. For example, for each variable
find the value of the t-statistic for testing equality of the spam and email groups, then
choose the variables with the largest absolute t’s.
Exercise 1.9.15 (Spam). Continue with the spam data from Exercise 1.9.14. (a) Plot the
variances of the explanatory variables (the first 57 variables) versus the index (i.e., the
x-axis has (1, 2, . . . , 57), and the y-axis has the corresponding variances.) You might
not see much, so repeat the plot, but taking logs of the variances. What do you see?
Which three variables have the largest variances? (b) Find the principal components
using just the explanatory variables. Plot the eigenvalues versus the index. Plot the
log of the eigenvalues versus the index. What do you see? (c) Look at the loadings for
the first three principal components. (E.g., if spamload contains the loadings (eigen-
vectors), then you can try plotting them using matplot(1:57,spamload[,1:3]).) What
is the main feature of the loadings? How do they relate to your answer in part (a)?
(d) Now scale the explanatory variables so each has mean zero and variance one:
spamscale <- scale(Spam[,1:57]). Find the principal components using this matrix.
Plot the eigenvalues versus the index. What do you notice, especially compared to
the results of part (b)? (e) Plot the loadings of the first three principal components
obtained in part (d). How do they compare to those from part (c)? Why is there such
a difference?
Exercise 1.9.16 (Sports data). Consider the Louis Roussos sports data described in
Section 1.6.2. Use faces to cluster the observations. Use the raw variables, or the prin-
cipal components, and try different orders of the variables (which maps the variables
to different sets of facial features). After clustering some observations, look at how
they ranked the sports. Do you see any pattern? Were you able to distinguish be-
tween people who like team sports versus individual sports? Those who like (dislike)
tennis? Jogging?
Exercise 1.9.17 (Election). The data set election has the results of the first three US
presidential races of the 2000’s (2000, 2004, 2008). The observations are the 50 states
plus the District of Columbia, and the values are the ( D − R)/( D + R) for each state
and each year, where D is the number of votes the Democrat received, and R is the
number the Republican received. (a) Without scaling the variables, find the principal
components. What are the first two principal component loadings measuring? What
is the ratio of the standard deviation of the first component to the second’s? (c) Plot
the first versus second principal components, using the states’ two-letter abbrevia-
tions as the plotting characters. (They are in the vector stateabb.) Make the plot so
that the two axes cover the same range. (d) There is one prominent outlier. What
is it, and for which variable is it mostly outlying? (e) Comparing how states are
grouped according to the plot and how close they are geographically, can you make
any general statements about the states and their voting profiles (at least for these
three elections)?
Exercise 1.9.18 (Painters). The data set painters has ratings of 54 famous painters. It
is in the MASS package [Venables and Ripley, 2002]. See Davenport and Studdert-
Kennedy [1972] for a more in-depth discussion. The R help file says
The subjective assessment, on a 0 to 20 integer scale, of 54 classical painters.
The painters were assessed on four characteristics: composition, drawing,
colour and expression. The data is due to the Eighteenth century art critic,
de Piles.
The fifth variable gives the school of the painter, using the following coding:
A: Renaissance; B: Mannerist; C: Seicento; D: Venetian; E: Lombard; F:
Sixteenth Century; G: Seventeenth Century; H: French
Create the two-dimensional biplot for the data. Start by turning the data into a matrix,
then centering both dimensions, then scaling:
x <- scale(as.matrix(painters[,1:4]),scale=F)
x <- t(scale(t(x),scale=F))
x <- scale(x)
Use the fifth variable, the painters’ schools, as the plotting character, and the four
rating variables as the arrows. Interpret the two principal component variables. Can
you make any generalizations about which schools tend to rate high on which scores?
Exercise 1.9.19 (Cereal). Chakrapani and Ehrenberg [1981] analyzed people’s atti-
tudes towards a variety of breakfast cereals. The data matrix cereal is 8 × 11, with
rows corresponding to eight cereals, and columns corresponding to potential at-
tributes about cereals. The attributes: Return (a cereal one would come back to),
tasty, popular (with the entire family), digestible, nourishing, natural flavor, afford-
able, good value, crispy (stays crispy in milk), fit (keeps one fit), and fun (for children).
The original data consisted of the percentage of subjects who thought the given ce-
real possessed the given attribute. The present matrix has been doubly centered, so
that the row means and column means are all zero. (The original data can be found
in the S-Plus [TIBCO Software Inc., 2009] data set cereal.attitude.) Create the two-
dimensional biplot for the data with the cereals as the points (observations), and the
attitudes as the arrows (variables). What do you see? Are there certain cereals/at-
tributes that tend to cluster together? (You might want to look at the Wikipedia entry
[Wikipedia, 2011] on breakfast cereals.)
Exercise 1.9.20 (Decathlon). The decathlon data set has scores on the top 24 men in
the decathlon (a set of ten events) at the 2008 Olympics. The scores are the numbers
of points each participant received in each event, plus each person’s total points. The
data can be found at the NBC Olympic site [Olympics, 2008]. Create the biplot for
these data based on the first ten variables (i.e., do not use their total scores). Doubly
center, then scale, the data as in Exercise 1.9.18. The events should be the arrows. Do
you see any clustering of the events? The athletes?
The remaining questions require software that will display rotating point clouds of
three dimensions, and calculate some projection pursuit objective functions. The
Spin program at http://stat.istics.net/MultivariateAnalysis is sufficient for
our purposes. GGobi [Cook and Swayne, 2007] has an excellent array of graphical
tools for interactively exploring multivariate data. See also the spin3R routine in the
R package aplpack [Wolf and Bielefeld, 2010].
Exercise 1.9.21 (Iris). Consider the three variables X = Sepal Length, Y = Petal Length,
and Z = Petal Width in the Fisher/Anderson iris data. (a) Look at the data while
rotating. What is the main feature of these three variables? (b) Scale the data so that
the variables all have the same sample variance. (The Spin program automatically
performs the scaling.) For various objective functions (variance, skewness, kurtosis,
negative kurtosis, negentropy), find the rotation that maximizes the function. (That
is, the first component of the rotation maximizes the criterion over all rotations. The
second then maximizes the criterion for components orthogonal to the first. The third
component is then whatever is orthogonal to the first two.) Which criteria are most
effective in yielding rotations that exhibit the main feature of the data? Which are
least effective? (c) Which of the original variables are most prominently represented
in the first two components of the most effective rotations?
Exercise 1.9.22 (Automobiles). The data set cars [Consumers’ Union, 1990] contains
q = 11 size measurements on n = 111 models of automobile. The original data can be
found in the S-Plus [TIBCO Software Inc., 2009] data frame cu.dimensions. In cars,
the variables have been normalized to have medians of 0 and median absolute devi-
ations (MAD) of 1.4826 (the MAD for a N (0, 1)). Inspect the three-dimensional data
set consisting of the variables length, width, and height. (In the Spin program, the
data set is called “Cars.”) (a) Find the linear combination with the largest variance.
What is the best linear combination? (Can you interpret it?) What is its variance?
Does the histogram look interesting? (b) Now find the linear combination to maxi-
mize negentropy. What is the best linear combination, and its entropy? What is the
main feature of the histogram? (c) Find the best two linear combinations for entropy.
What are they? What feature do you see in the scatter plot?
Exercise 1.9.23 (RANDU). RANDU [IBM, 1970] is a venerable, fast, efficient, and very
flawed random number generator. See Dudewicz and Ralley [1981] for a thorough re-
view of old-time random number generators. For given “seed” x0 , RANDU produces
xi+1 from xi via
x_{i+1} = (65539 x_i) mod 2^{31}.  (1.62)

The “random” Uniform(0,1) values are then u_i = x_i / 2^{31}. The R data set randu is
based on a sequence generated using RANDU, where each of n = 400 rows is a set
of p = 3 consecutive u i ’s. Rotate the data, using objective criteria if you wish, to look
for significant non-randomness in the data matrix. If the data are really random, the
points should uniformly fill up the three-dimensional cube. What feature do you see
that reveals the non-randomness?
Chapter 2

Multivariate Distributions
This chapter reviews the elements of distribution theory that we need, especially for
vectors and matrices. (Classical multivariate analysis is basically linear algebra, so
everything we do eventually gets translated into matrix equations.) See any good
mathematical statistics book such as Hogg, McKean, and Craig [2004], Bickel and
Doksum [2000], or Lehmann and Casella [1998] for a more comprehensive treatment.
2.1 Probability distributions

2.1.1 Distribution functions

The distribution function of the collection X = (X_1, . . . , X_N) is the function

F : R^N → [0, 1]

defined by

F(x_1, x_2, . . . , x_N) = P[X_1 ≤ x_1, X_2 ≤ x_2, . . . , X_N ≤ x_N].  (2.1)
Note that it is defined on all of R N , not just the space of X. It is nondecreasing, and
continuous from the right, in each x_i. The limit as all x_i → −∞ is zero, and as all
xi → ∞, the limit is one. The distribution function uniquely defines the distribution,
though we will not find much use for it.
2.1.2 Densities
A collection of random variables X is said to have a density with respect to Lebesgue measure on R^N if there is a nonnegative function f(x),

f : X → [0, ∞),  (2.2)

such that for any subset A ⊂ X,

P[X ∈ A] = \int_A f(x) dx
         = \int \cdots \int_A f(x_1, . . . , x_N) dx_1 \cdots dx_N.  (2.3)

The second line is there to emphasize that we have a multiple integral. (The Lebesgue measure of a subset A of R^N is the integral \int_A dx, i.e., as if f(x) = 1 in (2.3). Thus if
N = 1, the Lebesgue measure of a line segment is its length. In two dimensions, the
Lebesgue measure of a set is its area. For N = 3, it is the volume.)
We will call a density f as in (2.3) the “pdf,” for “probability density function.”
Because P [X ∈ X ] = 1, the integral of the pdf over the entire space X must be 1.
Random variables or collections that have pdf’s are continuous in the sense that the
probability X equals a specific value x is 0. (There are continuous distributions that
do not have pdf’s, such as the uniform distribution on the unit circle.)
If X does have a pdf, then it can be obtained from the distribution function in (2.1)
by differentiation:
f(x_1, . . . , x_N) = \frac{\partial^N}{\partial x_1 \cdots \partial x_N} F(x_1, . . . , x_N).  (2.4)
If the space X is a countable (which includes finite) set, then its probability can be
given by specifying the probability of each individual point. The probability mass
function f , or “pmf,” with
f : X −→ [0, 1], (2.5)
is given by
f (x) = P [X = x] = P [{x}]. (2.6)
The probability of any subset A is the sum of the probabilities of the individual points
in A,
P[A] = \sum_{x \in A} f(x).  (2.7)
Such an X is called discrete. (A pmf is also a density, but with respect to counting
measure on X , not Lebesgue measure.)
Not all random variables are either discrete or continuous, and especially a collec-
tion of random variables could have some discrete and some continuous members. In
such cases, the probability of a set is found by integrating over the continuous parts
and summing over the discrete parts. For example, suppose our collection is a 1 × N vector combining two other collections, i.e.,

(X, Y), where X is 1 × N_x, Y is 1 × N_y, and N = N_x + N_y.  (2.8)
We will use the generic term “density” to mean pdf, pmf, or the mixed type of density
in (2.11). There are other types of densities, but we will not need to deal with them.
2.1.3 Representations
Representations are very useful, especially when no pdf exists. For example, suppose
Y = (Y_1, Y_2) is uniform on the unit circle, by which we mean Y has space Y = {y ∈ R^2 | \|y\| = 1}, and it is equally likely to be any point on that circle. There is no
pdf, because the area of the circle in R2 is zero, so the integral over any subset of
Y of any function is zero. The distribution can be thought of in terms of the angle
y makes with the x-axis, that is, y is equally likely to be at any angle. Thus we can
let X ∼ Uniform(0, 2π ]: X has space (0, 2π ] and pdf f X ( x ) = 1/(2π ). Then we can
define
Y = (cos( X ), sin( X )). (2.12)
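One can simulate from this representation directly; a brief R illustration (ours, not from the text):

x <- runif(1000,0,2*pi)       # X ~ Uniform(0, 2*pi]
y <- cbind(cos(x),sin(x))     # Y = (cos(X), sin(X)), points on the unit circle
summary(rowSums(y^2))         # every squared norm equals 1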
In general, suppose we are given the distribution for X with space X and function
g,
g : X −→ Y . (2.13)
Then for any B ⊂ Y, we can define the probability of Y by

P[Y ∈ B] = P[{x ∈ X | g(x) ∈ B}].  (2.14)

Marginal distributions are a special case: the marginal distribution of X from the collection (X, Y) arises from the function

g(x, y) = x.  (2.15)
If f (x, y) is the density for (X, Y), then the density of X can be found by “integrating
(or summing) out” the y. That is, if f is a pdf, then f X (x) is the pdf for X, where
f_X(x) = \int_{Y_x} f(x, y) dy,  (2.17)
and
Y_x = Y_x^W = {y ∈ R^{N_y} | (x, y) ∈ W}  (2.18)
is the conditional space (2.10) with A = W . If y has some discrete components, then
they are summed in (2.17).
Note that we can find the marginals of any subset, not just sets of consecutive
elements. E.g., if X = ( X1 , X2 , X3 , X4 , X5 ), we can find the marginal of ( X2 , X4 , X5 )
by integrating out the X1 and X3 .
Probability distributions can also be represented through conditioning, discussed
in the next section.
2.1.4 Conditional distributions

The conditional distribution of Y given X = x is denoted

Y | X = x.  (2.19)

What this means is that for each fixed value x, there is a possibly different distribution
for Y.
Very generally, such conditional distributions will exist, though they may be hard
to figure out, even what they mean. In the discrete case, the concept is straightfor-
ward, and by analogy the case with densities follows. For more general situations,
we will use properties of conditional distributions rather than necessarily specifying
them.
We start with the (X, Y) as in (2.8), and assume we have their joint distribution
P. The word “joint” is technically unnecessary, but helps to emphasize that we are
considering the two collections together. The joint space is W , and let X denote the
marginal space of X as in (2.16), and for each x ∈ X , the conditional space of Y given
X = x, Yx , is given in (2.18). For example, if the space W = {( x, y) | 0 < x < y < 1},
then X = (0, 1), and for x ∈ X , Y x = ( x, 1).
Next, given the joint distribution of (X, Y), we define the conditional distribution
(2.19) in the discrete, then pdf, cases.
Discrete case
For sets A and B, the conditional probability of A given B is defined as

P[A | B] = \frac{P[A \cap B]}{P[B]}  if B ≠ ∅.  (2.20)

If B is empty, then the conditional probability is not defined since we would have 0/0.
For a discrete pair (X, Y), let f(x, y) be the pmf. Then the conditional distribution of Y given X = x is given by the probabilities

P[Y = y | X = x],  y ∈ Y_x,  (2.21)

at least if P[X = x] > 0. The expression in (2.21) is, for fixed x, the conditional pmf
for Y:
f_{Y|X}(y | x) = P[Y = y | X = x] = \frac{P[Y = y \text{ and } X = x]}{P[X = x]} = \frac{f(x, y)}{f_X(x)},  y ∈ Y_x,  (2.22)
if f X (x) > 0, where f X (x) is the marginal pmf of X from (2.17) with sums.
Pdf case
In the discrete case, the restriction that P [X = x] > 0 is not worrisome, since the
chance is 0 that we will have an x with P[X = x] = 0. In the continuous case, we cannot
follow the same procedure, since P [X = x] = 0 for all x ∈ X . However, if we have
pdf’s, or general densities, we can analogize (2.22) and declare that the conditional
density of Y given X = x is
f_{Y|X}(y | x) = \frac{f(x, y)}{f_X(x)},  y ∈ Y_x,  (2.23)
if f X (x) > 0. In this case, as in the discrete one, the restriction that f X (x) > 0 is not
worrisome, since the set on which X has density zero has probability zero. It turns
out that the definition (2.23) is mathematically legitimate.
The Y and X can be very general. Often, both will be functions of a collection
of random variables, so that we may be interested in conditional distributions of the
type
g (Y ) | h (Y ) = z (2.24)
for some functions g and h.
if the latter exists. Thus we often can find the expected values of functions of Y based
on the distribution of X.
Conditioning
If (X, Y) has a joint distribution, then we can define the conditional expectation of
g(Y) given X = x to be the regular expected value of g(Y), but we use the conditional
distribution Y | X = x. In the pdf case, we write
E[g(Y) | X = x] = \int_{Y_x} g(y) f_{Y|X}(y|x) dy \equiv e_g(x).  (2.32)
Note that the conditional expectation is a function of x. We can then take the expected
value of that, using the marginal distribution of X. We end up with the same result
(if we end up with anything) as taking the usual expected value of g(Y). That is,

E[g(Y)] = E[ E[g(Y) | X = x] ].  (2.33)

There is a bit of a notational glitch in the formula, since the inner expected value is a
function of x, a constant, and we really want to take the expected value over X. We
cannot just replace x with X, however, because then we would have the undesired
E [ g(Y) | X = X]. So a more precise way to express the result is to use the e g (x) in
(2.32), so that
E [ g(Y)] = E [ e g (X)]. (2.34)
This result holds in general. It is not hard to see in the pdf case:
E[e_g(X)] = \int_X e_g(x) f_X(x) dx
          = \int_X \left( \int_{Y_x} g(y) f_{Y|X}(y|x) dy \right) f_X(x) dx   by (2.32)
          = \int_X \int_{Y_x} g(y) f(x, y) dy dx   by (2.25)
          = \int_W g(y) f(x, y) dx dy   by (2.25)
          = E[g(Y)].  (2.35)
If X has a pmf, then we sum. The total probability formula,
P[Y ∈ B] = ∫_X P[Y ∈ B | X = x] f_X(x) dx, (2.36)
follows by taking g to be the indicator function I_B, given as
I_B(y) = 1 if y ∈ B, and I_B(y) = 0 if y ∉ B. (2.37)
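To make the conditioning formulas concrete, here is a small R check of (2.34) by simulation. The particular choices X ∼ Uniform(0, 1), Y | X = x ∼ N(x, 1), and g(y) = y² are not from the text; they are just a convenient illustration.

# Monte Carlo illustration of E[g(Y)] = E[e_g(X)] in (2.34).
# Model (chosen only for illustration): X ~ Uniform(0,1), Y | X = x ~ N(x, 1).
set.seed(1)
nsim <- 10^6
x <- runif(nsim)
y <- rnorm(nsim, mean = x, sd = 1)
mean(y^2)        # direct estimate of E[g(Y)] with g(y) = y^2
# e_g(x) = E[Y^2 | X = x] = 1 + x^2 for this model, so estimate E[e_g(X)]:
mean(1 + x^2)
# Both should be near the exact value 1 + E[X^2] = 4/3.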
if both variances are positive. Compare these definitions to those of the sample analogs, (1.3), (1.4), (1.5), and (1.6). So, e.g., Var[X_j] = Cov[X_j, X_j].
The mean of the collection X is the corresponding collection of means. The variances and covariances are arranged in the covariance matrix:
Σ = Cov[X] = \begin{pmatrix} Var[X_1] & Cov[X_1, X_2] & ··· & Cov[X_1, X_N] \\ Cov[X_2, X_1] & Var[X_2] & ··· & Cov[X_2, X_N] \\ ⋮ & ⋮ & ⋱ & ⋮ \\ Cov[X_N, X_1] & Cov[X_N, X_2] & ··· & Var[X_N] \end{pmatrix}, (2.43)
so that the elements of Σ are the σjk ’s. Compare this arrangement to that of the
sample covariance matrix (1.17). If X is a row vector, and μ = E[X], a convenient expression for its covariance is
Cov[X] = E[(X − μ)′(X − μ)]. (2.44)
M_X(t) < ∞ and M_X(t) = M_Y(t) for all t such that ‖t‖ < ε, (2.48)
See Ash [1970] for an approach to proving this result. The mgf does not always
exist, that is, often the integral or sum defining the expected value diverges. That
is ok, as long as it is finite for t in a neighborhood of 0. If one knows complex
variables, the characteristic function is handy because it always exists. It is defined as φ_X(t) = E[exp(i X t′)].
If a distribution's mgf is finite when ‖t‖ < ε for some ε > 0, then all of its moments are finite, and can be calculated via differentiation:
E[X_1^{k_1} ··· X_N^{k_N}] = ∂^K/(∂t_1^{k_1} ··· ∂t_N^{k_N}) M_X(t) |_{t=0}, where K = k_1 + ··· + k_N. (2.49)
2.4 Independence
Two sets of random variables are independent if the values of one set do not affect
the values of the other. More precisely, suppose the collection is (X, Y) as in (2.8),
with space W . Let X and Y be the marginal spaces (2.16) of X and Y, respectively.
First, we need the following:
Definition 2.3. Given the setup above, the collections X and Y are independent if W =
X × Y , and for every A ⊂ X and B ⊂ Y ,
P [(X, Y) ∈ A × B ] = P [X ∈ A] P [Y ∈ B ]. (2.51)
In the definition, the left-hand side uses the joint probability distribution for (X, Y),
and the right-hand side uses the marginal probabilities for X and Y, respectively.
If the joint collection (X, Y) has density f, then X and Y are independent if and only if W = X × Y, and
f(x, y) = f_X(x) f_Y(y) for all (x, y) ∈ W, (2.52)
where f_X and f_Y are the marginal densities (2.17) of X and Y, respectively. (Techni-
cally, (2.52) only has to hold with probability one. Also, except for sets of probability
zero, the requirements (2.51) or (2.52) imply that W = X × Y , so that the requirement
we place on the spaces is redundant. But we keep it for emphasis.)
A useful result is that X and Y are independent if and only if
E[g(X)h(Y)] = E[g(X)] E[h(Y)] (2.53)
for all functions g and h with finite expectations. In particular, if the random variables X and Y are independent with finite variances, then
Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[X − E[X]] E[Y − E[Y]] = 0.
The second equality uses (2.53), and the final equality uses that E[X − E[X]] = E[X] − E[X] = 0. Be aware that the reverse is not true, that is, variables can have 0 covariance but still not be independent.
If the collections X and Y are independent, then Cov[X_k, Y_l] = 0 for all k, l, so that
Cov[(X, Y)] = \begin{pmatrix} Cov[X] & 0 \\ 0 & Cov[Y] \end{pmatrix}, (2.55)
at least if the covariances exist. (Throughout this book, “0” represents a matrix of
zeroes, its dimension implied by the context.)
Collections Y and X are independent if and only if the conditional distribution of
Y given X = x does not depend on x. If (X, Y) has a pdf or pmf, this property is easy
to see. If X and Y are independent, then Yx = Y since W = X × Y , and by (2.23) and
(2.52),
f_{Y|X}(y | x) = f(x, y)/f_X(x) = f_Y(y) f_X(x)/f_X(x) = f_Y(y), (2.56)
so that the conditional distribution does not depend on x. On the other hand, if
the conditional distribution does not depend on x, then the conditional space and
pdf cannot depend on x, in which case they are the marginal space and pdf, so that
W = X × Y and
f(x, y)/f_X(x) = f_Y(y) ⟹ f(x, y) = f_X(x) f_Y(y). (2.57)
Analogous results hold for more than two collections. In particular, if the random variables X_1, . . . , X_N are mutually independent, then the joint density factors as
f(x_1, . . . , x_N) = f_1(x_1) ··· f_N(x_N), (2.60)
where f j is the density of X j . Also, if the variances exist, the covariance matrix is
diagonal:
Cov[X] = \begin{pmatrix} Var[X_1] & 0 & ··· & 0 \\ 0 & Var[X_2] & ··· & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & ··· & Var[X_N] \end{pmatrix}. (2.61)
Plug-in formula
Suppose the collection of random variables is given by (X, Y), and we are interested
in the conditional distribution of the function g(X, Y) given X = x. Then
g(X, Y) | X = x =^D g(x, Y) | X = x. (2.62)
That is, the conditional distribution of g(X, Y) given X = x is the same as that of g(x, Y) given X = x. (The "=^D" means "equal in distribution.") Furthermore, if Y and X are independent, we can take off the conditional part at the end of (2.62):
g(X, Y) | X = x =^D g(x, Y). (2.63)
This property may at first seem so obvious as to be meaningless, but it can be very
useful. For example, suppose X and Y are independent N (0, 1)’s, and g( X, Y ) =
X + Y, so we wish to find X + Y | X = x. The official way is to let W = X + Y, and
Z = X, and use the transformation of variables to find the space and pdf of (W, Z ).
One can then figure out Wz , and use the formula (2.23). Instead, using the plug-in
formula with independence (2.63), we have that
X + Y | X = x =^D x + Y, (2.64)
which is N(x, 1).
Conditional independence
Given a set of three collections, (X, Y, Z), X and Y are said to be conditionally independent given Z = z if
P[X ∈ A and Y ∈ B | Z = z] = P[X ∈ A | Z = z] P[Y ∈ B | Z = z] for all sets A and B. (2.65)
A related idea is conditioning on a function of X. If the conditional distribution of Y given X = x depends on x only through a function h(x), then with v = h(x),
Y | X = x =^D Y | h(X) = v. (2.66)
For example, suppose
Y | X = x ∼ Uniform(−√(1 − x²), √(1 − x²)). (2.67)
Note that the distribution depends on x only through h(x) = x², so that, e.g., conditioning on X = 1/2 is the same as conditioning on X = −1/2. The statement (2.66) then yields
Y | X² = v ∼ Uniform(−√(1 − v), √(1 − v)). (2.68)
That is, we have managed to turn a statement about conditioning on X to one about
conditioning on X2 .
Variance decomposition
The formula (2.34) shows that the expected value of g(Y) is the expected value of the
conditional expected value, e g (X). A similar formula holds for the variance, but it is
not simply that the variance is the expected value of the conditional variance. Using
the well-known identity Var [ Z ] = E [ Z2 ] − E [ Z ]2 on Z = g(Y), as well as (2.34) on
g(Y) and g(Y)2 , we have
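As a quick numerical check of this variance decomposition, the following R sketch uses the same kind of artificial model as above (X ∼ Uniform(0, 1), Y | X = x ∼ N(x, 1)); the model is chosen only for illustration.

# Var[Y] versus E[Var[Y | X]] + Var[E[Y | X]] = 1 + Var[X].
set.seed(1)
nsim <- 10^6
x <- runif(nsim)
y <- rnorm(nsim, mean = x, sd = 1)
var(y)         # close to 1 + 1/12
1 + var(x)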
Bayes theorem
Bayes formula reverses conditional distributions, that is, it takes the conditional distri-
bution of Y given X, and the marginal of X, and returns the conditional distribution of
X given Y. Bayesian inference is based on this formula, starting with the distribution
of the data given the parameters, and a marginal (“prior”) distribution of the pa-
rameters, and producing the conditional distribution (“posterior”) of the parameters
given the data. Inferences are then based on this posterior, which is the distribution
one desires because the data are observed while the parameters are not.
Theorem 2.2 (Bayes). In the setup of (2.8), suppose that the conditional density of Y given
X = x is f Y|X (y | x), and the marginal density of X is f X (x). Then for (x, y) ∈ W , the
conditional density of X given Y = y is
f_{X|Y}(x | y) = f_{Y|X}(y | x) f_X(x) / ∫_{X_y} f_{Y|X}(y | z) f_X(z) dz. (2.75)
Proof. From (2.23) and (2.25),
f_{X|Y}(x | y) = f(x, y)/f_Y(y) = f_{Y|X}(y | x) f_X(x)/f_Y(y). (2.76)
By (2.26), using z for x, to avoid confusion with the x in (2.76),
f_Y(y) = ∫_{X_y} f_{Y|X}(y | z) f_X(z) dz, (2.77)
which, substituted in the denominator of (2.76), shows (2.75).
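A small R sketch of Bayes' formula (2.75), with X discretized so that the integral in the denominator becomes a sum. The grid, the binomial conditional, and the observed y = 7 are all made-up choices for illustration.

# Posterior of X given Y = y via (2.75), with X on a grid.
xgrid <- seq(0.01, 0.99, by = 0.01)
fX <- rep(1 / length(xgrid), length(xgrid))      # marginal (prior) of X
y <- 7
fYgivenX <- dbinom(y, size = 10, prob = xgrid)   # f_{Y|X}(y | x)
fXgivenY <- fYgivenX * fX / sum(fYgivenX * fX)   # Bayes formula; denominator as a sum
sum(xgrid * fXgivenY)                            # posterior mean of X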
2.6 Affine transformations
An affine transformation of the collection X = (X_1, . . . , X_N) is a collection Y = (Y_1, . . . , Y_M) of the form
Y_j = a_j + b_{j1} X_1 + ··· + b_{jN} X_N, j = 1, . . . , M, (2.78)
the a j ’s and b jk ’s being constants. Note that marginals are examples of affine trans-
formations: the a j ’s are 0, and most of the b jk ’s are 0, and a few are 1. Depending on
how the elements of X and Y are arranged, affine transformations can be written as a
matrix equation. For example, if X and Y are row vectors, and B is M × N, then
Y = a + XB′, (2.79)
where a is 1 × M. If X and Y are arranged as matrices, then for constant matrices A, C, and D of appropriate dimensions, an affine transformation has the form
Y = A + CXD′. (2.80)
E [ cX ] = cE [ X ] and E [ X + Y ] = E [ X ] + E [Y ], (2.81)
which can be seen from (2.28) and (2.29) by the linearity of integrals and sums. Con-
sidering any constant a as a (nonrandom) random variable, with E [ a] = a, (2.81) can
be used to show, e.g.,
E [ a + bX + cY ] = a + bE [ X ] + cE [Y ]. (2.82)
Applied to the affine transformation (2.78), these properties yield
E[Y_j] = a_j + b_{j1} E[X_1] + ··· + b_{jN} E[X_N], j = 1, . . . , M. (2.83)
If the collections are arranged as vectors or matrices, then so are the means, so that for the row vector (2.79) and matrix (2.80) examples, one has, respectively,
E[Y] = a + E[X]B′ and E[Y] = A + C E[X] D′. (2.84)
Next consider covariances. Suppose X and Y are row vectors, and (2.79) holds. Then from (2.44),
Cov[Y] = E[(Y − E[Y])′(Y − E[Y])]
       = E[(a + XB′ − (a + E[X]B′))′(a + XB′ − (a + E[X]B′))]
       = E[(XB′ − E[X]B′)′(XB′ − E[X]B′)]
       = E[B(X − E[X])′(X − E[X])B′]
       = B E[(X − E[X])′(X − E[X])] B′   by second part of (2.84)
       = B Cov[X] B′. (2.85)
Compare this formula to the sample version in (1.27). Though modest looking, the formula Cov[XB′] = B Cov[X] B′ is extremely useful. It is often called a "sandwich" formula, with the B's as the slices of bread. The formula for column vectors is the same. Compare this result to the familiar one from univariate analysis: Var[a + bX] = b² Var[X].
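Here is a small simulation check of the sandwich formula in R; the distribution of X and the matrix B are arbitrary choices for illustration.

# Check Cov[XB'] = B Cov[X] B' from (2.85) on simulated rows.
set.seed(1)
nsim <- 10^5
X <- cbind(rnorm(nsim), rnorm(nsim) + rexp(nsim))  # rows are iid copies of a 1 x 2 X
B <- matrix(c(1, 2, 0, 1, -1, 3), nrow = 3)        # a 3 x 2 constant matrix
Y <- X %*% t(B)                                    # each row is XB'
cov(Y)                                             # sample Cov[XB']
B %*% cov(X) %*% t(B)                              # sandwich formula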
For matrices, we again will wait. (We are waiting for Kronecker products as in
Definition 3.5, in case you are wondering.)
2.7 Exercises
Exercise 2.7.1. Consider the pair of random variables ( X, Y ), where X is discrete and
Y is continuous. Their space is W = {(x, y) | x ∈ {1, 2, 3}, 0 < y < x}, and their density is
f(x, y) = (x + y)/21. (2.87)
Let A = {(x, y) ∈ W | y ≤ x/2}. (It is a good idea to sketch W and A.) (a) Find X^A.
(b) Find Y^A_x for each x ∈ X^A. (c) Find P[A]. (d) Find the marginal density and space
of X. (e) Find the marginal space of Y. (f) Find the conditional space of X given Y,
X y , for each y. (Do it separately for y ∈ (0, 1), y ∈ [1, 2) and y ∈ [2, 3).) (g) Find the
marginal density of Y.
Exercise 2.7.2. Given the setup in (2.8) through (2.10), show that for A ⊂ W,
Exercise 2.7.4. Show that X and Y are independent if and only if E [ g(X)h(Y)] =
E [ g(X)] E [ h(Y)] as in (2.53) for all g and h with finite expectations. You can assume
densities exist, i.e., (2.52). [Hint: To show independence implies (2.53), write out the
sums/integrals. For the other direction, consider indicator functions for g and h as in
(2.37).]
Exercise 2.7.5. Prove (2.31), E [ g(Y)] = E [ g(h(X))] for Y = h(X), in the discrete case.
[Hint: Start by writing
f_Y(y) = P[Y = y] = P[h(X) = y] = Σ_{x ∈ X_y} f_X(x), (2.90)
so that E[g(Y)] = Σ_{y ∈ Y} g(y) f_Y(y) = Σ_{y ∈ Y} Σ_{x ∈ X_y} g(y) f_X(x).
In the inner summation in the final expression, h(x) is always equal to y. (Why?) Sub-
stitute h(x) for y in the g, then. Now the summand is free of y. Argue that the dou-
ble summation is the same as summing over x ∈ X , yielding ∑x∈X g(h( x )) f X (x) =
E [ g(h(X))].]
Exercise 2.7.6. (a) Prove the plug-in formula (2.62) in the discrete case. [Hint: For z
in the range of g, write P [ g(X, Y) = z | X = x] = P [ g(X, Y) = z and X = x] /P [X = x],
then note that in the numerator, the X can be replaced by x.] (b) Prove (2.63). [Hint:
Follow the proof in part (a), then note the two events g(x, Y) = z and X = x are
independent.]
Exercise 2.7.7. Suppose (X, Y, Z) has a discrete distribution, X and Y are condi-
tionally independent given Z (as in (2.65)), and X and Z are independent. Show
that (X, Y) is independent of Z. [Hint: Use the total probability formula (2.36) on
P [X ∈ A and (Y, Z) ∈ B ], conditioning on Z. Then argue that the summand can be
written
Use the independence of X and Z on the first probability in the final expression, and
bring it out of the summation.]
Exercise 2.7.8. Prove (2.67). [Hint: Find Y x and the marginal f X ( x ).]
Exercise 2.7.9. Suppose Y = (Y1 , Y2 , Y3 , Y4 ) is multinomial with parameters n and
p = (p_1, p_2, p_3, p_4). Thus n is a positive integer, the p_i's are positive and sum to 1, and the Y_i's are nonnegative integers that sum to n. The pmf is
f(y) = \binom{n}{y_1, y_2, y_3, y_4} p_1^{y_1} ··· p_4^{y_4}, (2.93)
where \binom{n}{y_1, y_2, y_3, y_4} = n!/(y_1! ··· y_4!). Consider the conditional distribution of (Y_1, Y_2)
given (Y3 , Y4 ) = (c, d). (a) What is the conditional space of (Y1 , Y2 ) given (Y3 , Y4 ) =
(c, d)? Give Y2 as a function of Y1 , c, and d. What is the conditional range of Y1 ? (b)
Write the conditional pmf of (Y1 , Y2 ) given (Y3 , Y4 ) = (c, d), and simplify noting that
\binom{n}{y_1, y_2, c, d} = \binom{n}{n − c − d, c, d} \binom{n − c − d}{y_1, y_2}. (2.94)
What is the conditional distribution of Y1 | (Y3 , Y4 ) = (c, d)? (c) What is the condi-
tional distribution of Y1 given Y3 + Y4 = a?
Exercise 2.7.10. Prove (2.44). [Hint: Write out the elements of the matrix (X − μ) (X −
μ), then use (2.42).]
Exercise 2.7.11. Suppose X, 1 × N, has finite covariance matrix. Show that Cov[X] =
E [X X] − E [X] E [X].
Exercise 2.7.12. (a) Prove the variance decomposition holds for the 1 × q vector Y, as
in (2.74). (b) Write Cov[Yi , Yj ] as a function of the conditional quantities Cov[Yi , Yj | X =
x], E [Yi | X = x], and E [Yj | X = x].
Exercise 2.7.13 (Beta-binomial). Suppose X ∼ Beta(α, β) and, conditionally on X = x, Y ∼ Binomial(n, x). (a) Find the marginal pdf of Y. (b) The conditional mean and variance of Y are nx and nx(1 − x). (Right?) The unconditional mean and variance of X are α/(α + β) and αβ/((α + β)²(α + β + 1)). What are the unconditional mean and variance of Y? (c)
Compare the variance of a Binomial(n, p) to that of a Beta-binomial(n, α, β), where
p = α/(α + β). (d) Find the joint density of ( X, Y ). (e) Find the pmf of the beta-
binomial. [Hint: Notice that the part of the joint density depending on x looks like a
Beta pdf, but without the constant. Thus integrating out x yields the reciprocal of the
constant.]
Exercise 2.7.14 (Bayesian inference). This question develops Bayesian inference for a
binomial. Suppose
Y | P = p ∼ Binomial(n, p) and P ∼ Beta(α0 , β0 ), (2.97)
that is, the probability of success P has a beta prior. (a) Show that the posterior
distribution is
P | Y = y ∼ Beta(α0 + y, β0 + n − y). (2.98)
The beta prior is called the conjugate prior for the binomial p, meaning the posterior
has the same form, but with updated parameters. [Hint: Exercise 2.7.13 (d) has the
joint density of ( P, Y ).] (b) Find the posterior mean, E [ P | Y = y]. Show that it can be
written as a weighted mean of the sample proportion p̂ = y/n and the prior mean p_0 = α_0/(α_0 + β_0).
Exercise 2.7.15. Do the mean and variance formulas (2.33) and (2.72) work if g is a
function of X and Y? [Hint: Consider the collection (X, W), where W = (X, Y).]
Exercise 2.7.16. Suppose h(y) is a histogram with K equal-sized bins. That is, we
have bins (bi−1, bi ], i = 1, . . . , K, where bi = b0 + d × i, d being the width of each bin.
Then
h(y) = p_i/d if b_{i−1} < y ≤ b_i, i = 1, . . . , K, and h(y) = 0 if y ∉ (b_0, b_K], (2.99)
where the pi ’s are probabilities that sum to 1. Suppose Y is a random variable with
pdf h. For y ∈ (b0 , bK ], let I(y) be y’s bin, i.e., I(y) = i if bi−1 < y ≤ bi . (a) What is
the distribution of the random variable I(Y )? Find its mean and variance. (b) Find
the mean and variance of bI(Y ) = b0 + dI(Y ). (c) What is the conditional distribution
of Y given I(Y ) = i, for each i = 1, . . . , K? [It is uniform. Over what range?] Find the
conditional mean and variance. (d) Show that unconditionally,
E[Y] = b_0 + d(E[I] − 1/2) and Var[Y] = d²(Var[I] + 1/12). (2.100)
(e) Recall the entropy in (1.44). Note that for our pdf, h(Y ) = pI(Y ) /d. Show that
Entropy(h) = − Σ_{i=1}^K p_i log(p_i) + log(d). (2.101)
Exercise 2.7.17. Suppose for random vector ( X, Y ), one observes X = x, and wishes
to guess the value of Y by h( x ), say, using the least squares criterion: Choose h to
minimize E [ q ( X, Y )], where q ( X, Y ) = (Y − h( X ))2 . This h is called the regression
function of Y on X. Assume all the relevant means and variances are finite. (a)
Write E [ q ( X, Y )] as the expected value of the conditional expected value conditioning
on X = x, eq ( x ). For fixed x, note that h( x ) is a scalar, hence one can minimize
eq ( x ) over h( x ) using differentiation. What h( x ) achieves the minimum conditional
expected value of q? (b) Show that the h found in part (a) minimizes the unconditional
expected value E [ q ( X, Y )]. (c) Find the value of E [ q ( X, Y )] for the minimizing h.
Exercise 2.7.18. Continue with Exercise 2.7.17, but this time restrict h to be a linear
function, h( x ) = α + βx. Thus we wish to find α and β to minimize E [(Y − α − βX )2 ].
The minimizing function is the linear regression function of Y on X. (a) Find the
α and β to minimize E [(Y − α − βX )2 ]. [You can differentiate that expected value
directly, without worrying about conditioning.] (b) Find the value of E [(Y − α − βX )2 ]
for the minimizing α and β.
κ_i = ∂^i/∂t^i c_X(t) |_{t=0}. (2.103)
Show that κ_3/κ_2^{3/2} is the population analog of skewness (1.42), and κ_4/κ_2² is the population analog of kurtosis (1.43), i.e.,
κ_3/κ_2^{3/2} = E[(X − μ)³]/σ³ and κ_4/κ_2² = E[(X − μ)⁴]/σ⁴ − 3, (2.104)
∂²/(∂t_i ∂t_j) c_X(t) |_{t=0}, i ≠ j. (2.105)
Exercise 2.7.23. A study was conducted on people near Newcastle on Tyne in 1972-
74 [Appleton et al., 1996], and followed up twenty years later. We will focus on 1314
women in the study. The three variables we will consider are Z: age group (three
values); X: whether they smoked or not (in 1974); and Y: whether they were still
alive in 1994. Here are the frequencies:
  Age group     Young (18–34)   Middle (35–64)   Old (65+)
  Smoker?        Yes     No      Yes     No       Yes    No
  Died             5      6       92     59        42   165        (2.106)
  Lived          174    213      262    261         7    28
(a) Find P[Y = Lived | X = Smoker] and P[Y = Lived | X = Non-smoker]. Who were more likely to live, smokers or non-smokers? (b) Find P[X = Smoker | Z = z] for z = Young, Middle, and Old. What do you notice? (c) Find
P[Y = Lived | X = Smoker & Z = z] (2.108)
and
P[Y = Lived | X = Non-smoker & Z = z] (2.109)
for z= Young, Middle, and Old. Adjusting for age group, who were more likely to
live, smokers or non-smokers? (d) Conditionally on age, the relationship between
smoking and living is negative for each age group. Is it true that marginally (not
conditioning on age), the relationship between smoking and living is negative? What
is the explanation? (Simpson’s Paradox.)
Exercise 2.7.24. Suppose in a large population, the proportion of people who are
infected with the HIV virus is 1/100,000. People can take a blood test to see
whether they have the virus. The test is 99% accurate: The chance the test is positive
given the person has the virus is 99%, and the chance the test is negative given the
person does not have the virus is also 99%. Suppose a randomly chosen person takes
the test. (a) What is the chance that this person does have the virus given that the test
is positive? Is this close to 99%? (b) What is the chance that this person does have the
virus given that the test is negative? Is this close to 1%? (c) Do the probabilities in (a)
and (b) sum to 1?
Exercise 2.7.25. Suppose Z_1, Z_2, Z_3 are iid with P[Z_i = −1] = P[Z_i = +1] = 1/2. Let
X1 = Z1 Z2 , X2 = Z1 Z3 , X3 = Z2 Z3 . (2.110)
(a) Find the conditional distribution of ( X1 , X2 ) | Z1 = +1. Are X1 and X2 con-
ditionally independent given Z1 = +1? (b) Find the conditional distribution of
( X1 , X2 ) | Z1 = −1. Are X1 and X2 conditionally independent given Z1 = −1? (c) Is
( X1 , X2 ) independent of Z1 ? Are X1 and X2 independent (unconditionally)? (d) Are
X1 and X3 independent? Are X2 and X3 independent? Are X1 , X2 and X3 mutually
independent? (e) What is the space of ( X1 , X2 , X3 )? (f) What is the distribution of
X1 X2 X3 ?
Exercise 2.7.26. Yes/no questions: (a) Suppose X1 and X2 are independent, X1 and
X3 are independent, and X2 and X3 are independent. Are X1 , X2 and X3 mutually
independent? (b) Suppose X1 , X2 and X3 are mutually independent. Are X1 and X2
conditionally independent given X3 = x3 ?
Exercise 2.7.27. (a) Let U ∼Uniform(0, 1), so that it has space (0, 1) and pdf f U (u ) =
1. Find its distribution function (2.1), FU (u ). (b) Suppose X is a random variable with
space ( a, b ) and pdf f X ( x ), where f X ( x ) > 0 for x ∈ ( a, b ). [Either or both of a and
b may be infinite.] Thus the inverse function FX−1 (u ) exists for u ∈ (0, 1). (Why?)
Show that the distribution of Y = FX ( X ) is Uniform(0, 1). [Hint: For y ∈ (0, 1), write
P [Y ≤ y] = P [ FX ( X ) ≤ y] = P [ X ≤ FX−1 (y)], then use the definition of FX .] (c)
Suppose U ∼ Uniform(0, 1). For the X in part (b), show that FX−1 (U ) has the same
distribution as X. [Note: This fact provides a way of generating random variables X
from random uniforms.]
Exercise 2.7.28. Suppose Y is n × 2 with covariance matrix
Σ = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}. (2.111)
Let W = YB , for
B = \begin{pmatrix} 1 & 1 \\ 1 & c \end{pmatrix} (2.112)
for some c. Find c so that the covariance between the two variables in W is zero.
What are the variances of the resulting two variables?
Exercise 2.7.29. Suppose Y = (Y_1, Y_2, Y_3, Y_4), where
Y_j = μ_j + B + E_j, j = 1, . . . , 4,
where the μ j are constants, B has mean zero and variance σB2 , the E j ’s are independent,
each with mean zero and variance σE2 , and B is independent of the E j ’s. (a) Find the
mean and covariance matrix of
X ≡ (B, E_1, E_2, E_3, E_4). (2.113)
(b) Write Y as an affine transformation of X. (c) Find the mean and covariance matrix
of Y. (d) Cov[Y] can be written as
Cov[Y] = a I_4 + b 1_4 1_4′.
Give a and b in terms of σ_B² and σ_E². (e) What are the mean and covariance matrix of Y̅ = (Y_1 + ··· + Y_4)/4?
Exercise 2.7.30. Suppose Y is a 5 × 4 data matrix, and
where the B_i's are independent, each with mean zero and variance σ_B², the E_ij's are independent, each with mean zero and variance σ_E², and the B_i's are independent of the E_ij's. (Thus each row of Y is distributed as the vector in Exercise 2.7.29, for some
particular values of μ j ’s.) [Note: This model is an example of a randomized block
model, where the rows of Y represent the blocks. For example, a farm might be
broken into 5 blocks, and each block split into four plots, where two of the plots
(Yi1 , Yi2 ) get one fertilizer, and two of the plots (Yi3 , Yi4 ) get another fertilizer.] (a)
E[Y] = xβz′. Give x, β, and z′. [The β contains the parameters μ and γ. The x and
z contain known constants.] (b) Are the rows of Y independent? (c) Find Cov[Y].
(d) Setting which parameter equal to zero guarantees that all elements of Y have the
same mean? (e) Setting which parameter equal to zero guarantees that all elements
of Y are uncorrelated?
Chapter 3
The Multivariate Normal Distribution
3.1 Definition
There are not very many commonly used multivariate distributions to model a data
matrix Y. The multivariate normal is by far the most common, at least for contin-
uous data. Which is not to say that all data are distributed normally, nor that all
techniques assume such. Rather, typically one either assumes normality, or makes
few assumptions at all and relies on asymptotic results.
The multivariate normal arises from the standard normal:
Definition 3.1. The random variable Z is standard normal, written Z ∼ N (0, 1), if it has
space R and pdf
φ(z) = (1/√(2π)) e^{−z²/2}. (3.1)
It is not hard to show that if Z ∼ N(0, 1),
E[Z] = 0, Var[Z] = 1, and M_Z(t) = e^{t²/2}. (3.2)
Definition 3.2. The collection of random variables Z = ( Z1 , . . . , Z M ) is a standard normal
collection if the Zi ’s are mutually independent standard normal random variables.
Because the variables in a standard normal collection are independent, by (3.2), (2.61) and (2.59),
E[Z] = 0, Cov[Z] = I_M, and M_Z(t) = e^{‖t‖²/2}. (3.3)
The multivariate normal is defined as an affine transformation of a standard normal collection: Y = μ + ZB′, where Z = (Z_1, . . . , Z_M) is a standard normal collection, μ is a 1 × q constant vector, and B is a q × M constant matrix. The mgf of such a Y can be found using (3.3):
M_Y(s) = E[exp(Ys′)]
       = E[exp((μ + ZB′)s′)]
       = exp(μs′) E[exp(Z(sB)′)]
       = exp(μs′) M_Z(sB)
       = exp(μs′) exp(½ ‖sB‖²)   by (3.3)
       = exp(μs′ + ½ sBB′s′)
       = exp(μs′ + ½ sΣs′). (3.6)
The mgf depends on B only through Σ = BB′. Because the mgf determines the distribution (Theorem 2.1), two different B's can produce the same distribution. That is, as long as BB′ = CC′, the distributions of μ + ZB′ and μ + ZC′ are the same. Which is to say that the distribution of the multivariate normal depends on only the mean and covariance. Thus it is legitimate to write
Y ∼ N(μ, Σ). (3.7)
I.e., both vectors are N(0, Σ). Note that the two expressions are based on differing numbers of standard normals, not just different linear combinations.
Which μ and Σ are legitimate parameters in (3.7)? Any μ ∈ R^q is. The covariance matrix Σ can be BB′ for any q × M matrix B. Any such matrix B is considered a square root of Σ. Clearly, Σ must be symmetric, but we already knew that. It must
also be nonnegative definite, which we define now.
Note that bBB′b′ = ‖bB‖² ≥ 0, which means that Σ must be nonnegative definite. But from (2.85),
bΣb′ = Cov[Yb′] = Var[Yb′] ≥ 0, (3.13)
because all variances are nonnegative. That is, any covariance matrix has to be non-
negative definite, not just multivariate normal ones.
So we know that Σ must be symmetric and nonnegative definite. Are there any
other restrictions, or for any symmetric nonnegative definite matrix is there a corre-
sponding B? In fact, there are potentially many square roots of Σ. These follow from
the spectral decomposition theorem, Theorem 1.1. Because Σ is symmetric, we can
write
Σ = ΓΛΓ′, (3.14)
where Γ is orthogonal, and Λ is diagonal with diagonal elements λ_1 ≥ λ_2 ≥ ··· ≥ λ_q. Because Σ is nonnegative definite, the eigenvalues are nonnegative (Exercise 3.7.12), hence they have square roots. Consider
B = ΓΛ^{1/2}, (3.15)
where Λ^{1/2} is the diagonal matrix with diagonal elements the λ_j^{1/2}'s. Then, indeed,
BB′ = ΓΛ^{1/2}Λ^{1/2}Γ′ = ΓΛΓ′ = Σ. (3.16)
That is, in (3.7), μ is unrestricted, and Σ can be any symmetric nonnegative definite matrix. Note that C = ΓΛ^{1/2}Ψ for any q × q orthogonal matrix Ψ is also a square root of Σ. If we take Ψ = Γ′, then we have the symmetric square root, ΓΛ^{1/2}Γ′.
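The square root (3.15) gives a direct way to generate multivariate normals in R via Y = μ + ZB′. The μ and Σ below are example values only; this is a sketch, not a function from the text.

# Generate N(mu, Sigma) vectors using B = Gamma Lambda^(1/2) from (3.14)-(3.15).
mu <- c(1, 2, 3)
Sigma <- matrix(c(4, 2, 1,
                  2, 3, 1,
                  1, 1, 2), 3, 3)
eig <- eigen(Sigma, symmetric = TRUE)
B <- eig$vectors %*% diag(sqrt(eig$values))  # one square root: B %*% t(B) = Sigma
nsim <- 1000
Z <- matrix(rnorm(nsim * 3), nsim, 3)        # rows are standard normal collections
Y <- sweep(Z %*% t(B), 2, mu, "+")           # rows are N(mu, Sigma)
colMeans(Y)                                  # roughly mu
cov(Y)                                       # roughly Sigma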
If q = 1, then we have a normal random variable, say Y, and Y ∼ N (μ, σ2 ) signifies
that it has mean μ and variance σ². If Y is a multivariate normal collection represented as an n × q matrix, we write
Y ∼ N_{n×q}(M, Ω),
where, as in (3.4), M = E[Y] and Ω = Cov[Y]. Of course, the mean and covariance result we already knew from (2.84) and (2.85).
Because marginals are special cases of affine transformations, marginals of multi-
variate normals are also multivariate normal. One needs just to pick off the appro-
priate means and covariances. So if Y = (Y1 , . . . , Y5 ) is N5 (μ, Σ), and W = (Y2 , Y5 ),
then
W ∼ N_2( (μ_2, μ_5), \begin{pmatrix} σ_{22} & σ_{25} \\ σ_{52} & σ_{55} \end{pmatrix} ). (3.20)
In Section 2.4, we showed that independence of two random variables means that
their covariance is 0, but that a covariance of 0 does not imply independence. But,
with multivariate normals, it does. That is, if X is a multivariate normal collection,
and Cov[ X j , Xk ] = 0, then X j and Xk are independent. The next theorem generalizes
this independence to sets of variables.
where
A = \begin{pmatrix} B & 0 \\ 0 & C \end{pmatrix}. (3.22)
Which shows that W has distribution given by ZA′. With that representation, we have that X = Z_1B′ and Y = Z_2C′. Because the Z_i's are mutually independent, and
the subsets Z1 and Z2 do not overlap, Z1 and Z2 are independent, which means that
X and Y are independent.
The theorem can also be proved using mgf’s or pdf’s. See Exercises 3.7.15 and
8.8.12.
In the iid case, the vectors all have the same mean μ and covariance matrix Σ. Thus
the mean of the entire matrix M = E [Y] is
M = \begin{pmatrix} μ \\ μ \\ ⋮ \\ μ \end{pmatrix}. (3.24)
For the covariance of the Y, we need to string all the elements out, as in (2.45), as (Y_1, . . . , Y_n). By independence, the covariance between variables from different individuals is 0, that is, Cov[Y_ij, Y_kl] = 0 if i ≠ k. Each group of q variables from a single individual has covariance Σ, so that Cov[Y] is block diagonal:
Ω = Cov[Y] = \begin{pmatrix} Σ & 0 & ··· & 0 \\ 0 & Σ & ··· & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & ··· & Σ \end{pmatrix}. (3.25)
Patterned matrices such as (3.24) and (3.25) can be more efficiently represented as
Kronecker products.
Definition 3.5. If A is a p × q matrix and B is an n × m matrix, then the Kronecker
product is the (np) × (mq) matrix A ⊗ B given by
A ⊗ B = \begin{pmatrix} a_{11}B & a_{12}B & ··· & a_{1q}B \\ a_{21}B & a_{22}B & ··· & a_{2q}B \\ ⋮ & ⋮ & ⋱ & ⋮ \\ a_{p1}B & a_{p2}B & ··· & a_{pq}B \end{pmatrix}. (3.26)
Thus the mean in (3.24) and covariance matrix in (3.25) can be written as follows:
M = 1n ⊗ μ and Ω = In ⊗ Σ. (3.27)
Recall that 1n is the n × 1 vector of all 1’s, and In is the n × n identity matrix. Now if
the rows of Y are iid multivariate normal, we write
Y ∼ Nn×q (1n ⊗ μ, In ⊗ Σ). (3.28)
Often the rows are independent with common covariance Σ, but not necessarily hav-
ing the same means. Then we have
Y ∼ Nn×q (M, In ⊗ Σ). (3.29)
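In R, the function kronecker (or the operator %x%) follows Definition 3.5, so the patterned matrices in (3.27) can be built directly. The values of n, μ, and Σ below are examples only.

# Building 1_n (x) mu and I_n (x) Sigma as in (3.27).
n <- 4
mu <- matrix(c(1, 2), nrow = 1)              # 1 x q mean, q = 2
Sigma <- matrix(c(2, 1, 1, 3), 2, 2)         # q x q covariance
M <- matrix(1, n, 1) %x% mu                  # n x q; each row equals mu
Omega <- diag(n) %x% Sigma                   # (nq) x (nq); block diagonal with Sigma blocks
M
Omega[1:4, 1:4]                              # the first two Sigma blocks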
We have already seen examples of linear combinations of elements in the data
matrix. In (1.9) and (1.10), we had combinations of the form CY, where the matrix
multiplied Y on the left. The linear combinations are of the individuals within the
variable, so that each variable is affected in the same way. In (1.23), and for principal
components, the matrix is on the right: YD′. In this case, the linear combinations are of the variables, with the variables for each individual affected the same way. More generally, we have affine transformations of the form (2.80),
W = A + CYD′. (3.30)
Because W is an affine transformation of Y, it is also multivariate normal. When
Cov[Y] has the form as in (3.29), then so does W.
The mean part follows directly from the second part of (2.84). For the covari-
ance, we need some facts about Kronecker products, proofs of which are tedious but
straightforward. See Exercises 3.7.17 to 3.7.18.
Proposition 3.2. Presuming the matrix operations make sense and the inverses exist,
(A ⊗ B)′ = A′ ⊗ B′ (3.32a)
(A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD) (3.32b)
(A ⊗ B)^{−1} = A^{−1} ⊗ B^{−1} (3.32c)
row(CYD′) = row(Y)(C′ ⊗ D′) (3.32d)
trace(A ⊗ B) = trace(A) trace(B) (3.32e)
|A ⊗ B| = |A|^b |B|^a, where A is a × a and B is b × b. (3.32f)
One direct application of the proposition is the sample mean in the iid case (3.28),
so that Y ∼ Nn×q (1n ⊗ μ, In ⊗ Σ). Then from (1.9),
Y̅ = (1/n) 1_n′ Y, (3.35)
hence
Y̅ ∼ N_q( ((1/n) 1_n′1_n) ⊗ μ, ((1/n) 1_n′) I_n ((1/n) 1_n) ⊗ Σ ) = N_q(μ, (1/n) Σ). (3.36)
Rather than diving in to joint densities, as in (2.23), we start by predicting the vector
Y from X with an affine transformation. That is, we wish to find α, 1 × q, and β, p × q,
so that
Y ≈ α + Xβ. (3.38)
We use the least squares criterion, which is to find (α, β) to minimize
q(α, β) = E[‖Y − (α + Xβ)‖²]. (3.39)
We start by noting that if the 1 × q vector W has finite covariance matrix, then E[‖W − c‖²] is uniquely minimized over c ∈ R^q by c = E[W]. See Exercise 3.7.21. Letting W = Y − Xβ, we have that fixing β, q(α, β) is minimized over α by taking
α = E[Y − Xβ] = μ_Y − μ_X β. (3.40)
It remains to minimize q(μ_Y − μ_Xβ, β) over β. Using the trick that for a row vector z, ‖z‖² = trace(z′z), and letting X* = X − μ_X and Y* = Y − μ_Y, we can write (3.41) as a trace, which is minimized by
β = Σ_XX^{−1} Σ_XY, (3.44)
The prediction of Y is then α + Xβ. Define the residual to be the error in the
prediction:
R = Y − α − Xβ. (3.46)
Next step is to find the joint distribution of (X, R). Because it is an affine transforma-
tion of (X, Y), the joint distribution is multivariate normal, hence we just need to find
the mean and covariance matrix. The mean of X we know is μ X , and the mean of R
is 0, from (3.40). The transform is
(X  R) = (X  Y) \begin{pmatrix} I_p & −β \\ 0 & I_q \end{pmatrix} + (0  −α), (3.47)
hence
Cov[(X  R)] = \begin{pmatrix} I_p & 0 \\ −β′ & I_q \end{pmatrix} \begin{pmatrix} Σ_XX & Σ_XY \\ Σ_YX & Σ_YY \end{pmatrix} \begin{pmatrix} I_p & −β \\ 0 & I_q \end{pmatrix} = \begin{pmatrix} Σ_XX & 0 \\ 0 & Σ_{YY·X} \end{pmatrix}, (3.48)
where
Σ_{YY·X} = Σ_YY − Σ_YX Σ_XX^{−1} Σ_XY, (3.49)
the minimizer in (3.45).
Note the zero in the covariance matrix. Because we have multivariate normality, X and R are thus independent, and R ∼ N(0, Σ_{YY·X}). Then since Y = α + Xβ + R, the plug-in formula with independence (2.63) gives
Y | X = x =^D α + xβ + R ∼ N(α + xβ, Σ_{YY·X}), (3.51)
where α is given in (3.40), β is given in (3.44), and the conditional covariance matrix is given in (3.49).
The conditional distribution is particularly nice:
• It is multivariate normal;
• The conditional mean is an affine transformation of x;
• It is homoskedastic, that is, the conditional covariance matrix does not depend on
x.
These properties are the typical assumptions in linear regression.
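The conditional parameters are easy to compute numerically. The following R sketch evaluates β, α, and Σ_YY·X from (3.44), (3.40), and (3.49) for a made-up mean and partitioned covariance matrix (X is 1 × 2 and Y is 1 × 1 here).

# Conditional distribution parameters from a given mean and covariance.
muX <- c(0, 0); muY <- 1
SigmaXX <- matrix(c(2, 1, 1, 3), 2, 2)
SigmaXY <- matrix(c(1, -1), 2, 1)
SigmaYY <- matrix(2, 1, 1)
beta <- solve(SigmaXX, SigmaXY)              # Sigma_XX^{-1} Sigma_XY, as in (3.44)
alpha <- muY - muX %*% beta                  # as in (3.40)
SigmaYY.X <- SigmaYY - t(SigmaXY) %*% beta   # as in (3.49)
alpha + c(1, 2) %*% beta                     # conditional mean of Y given X = (1, 2)
SigmaYY.X                                    # conditional covariance (a scalar here)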
See Exercise 3.7.23. We again have the same β, but α is a bit expanded (it is n × q):
α = M_Y − M_X β, β = Σ_XX^{−1} Σ_XY. (3.55)
With R = Y − α − Xβ, we obtain that X and R are independent, and
Y | X = x ∼ Nn×q (α + xβ, In ⊗ ΣYY · X ). (3.56)
The zeroes in the covariance show that Y̅ and H_nY are independent (as they are in the familiar univariate case), implying that Y̅ and W are independent. Also,
U ≡ H_nY ∼ N(0, H_n ⊗ Σ). (3.60)
Because H_n is idempotent, W = Y′H_nY = U′U. At this point, instead of trying to
figure out the distribution of W, we define it to be what it is. Actually, Wishart [1928]
did this a while ago. Next is the formal definition.
Definition 3.6 (Wishart distribution). If Z ∼ N_{ν×p}(0, I_ν ⊗ Σ), then Z′Z is Wishart on ν degrees of freedom, with parameter Σ, written
Z′Z ∼ Wishart_p(ν, Σ). (3.61)
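A Wishart matrix can be simulated in R directly from this definition; the ν and Σ below are example values. (R's built-in rWishart does the same job; simulating Z′Z by hand just mirrors the definition.)

# Simulate W = Z'Z with Z ~ N_{nu x 2}(0, I_nu (x) Sigma), and average many draws.
set.seed(1)
nu <- 5
Sigma <- matrix(c(2, 1, 1, 3), 2, 2)
B <- chol(Sigma)                             # t(B) %*% B = Sigma
one.wishart <- function() {
  Z <- matrix(rnorm(nu * 2), nu, 2) %*% B    # rows are N(0, Sigma)
  t(Z) %*% Z
}
W <- replicate(10^4, one.wishart())
apply(W, c(1, 2), mean)                      # close to nu * Sigma
nu * Sigma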
The difference between the distribution of U and the Z in the definition is the
former has Hn where we would prefer an identity matrix. We can deal with this issue
by rotating the Hn . We need its spectral decomposition. More generally, suppose J is
an n × n symmetric and idempotent matrix, with spectral decomposition (Theorem 1.1) J = ΓΛΓ′, where Γ is orthogonal and Λ is diagonal with nonincreasing diagonal elements. Because it is idempotent, JJ = J, hence
ΓΛΓ′ = ΓΛΓ′ΓΛΓ′ = ΓΛΛΓ′, (3.62)
so that Λ = Λ², or λ_i = λ_i² for the eigenvalues, i = 1, . . . , n. That means that each of
the eigenvalues is either 0 or 1. If matrices A and B have the same dimensions, then
trace(AB′) = trace(B′A). (3.63)
See Exercise 3.7.5. Thus
trace(J) = trace(ΓΛΓ′) = trace(ΛΓ′Γ) = trace(Λ) = λ_1 + ··· + λ_n, (3.64)
which is the number of eigenvalues that are 1. Because the eigenvalues are ordered
from largest to smallest, λ1 = · · · = λtrace(J) = 1, and the rest are 0. Hence the
following result.
Lemma 3.1. Suppose J, n × n, is symmetric and idempotent. Then its spectral decomposition
is
J = (Γ_1  Γ_2) \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} Γ_1′ \\ Γ_2′ \end{pmatrix} = Γ_1Γ_1′, (3.65)
where k = trace(J), Γ1 is n × k, and Γ2 is n × (n − k).
Now suppose
U ∼ N (0, J ⊗ Σ), (3.66)
for J as in the lemma. Letting Γ = (Γ_1, Γ_2) in (3.65), we have E[Γ′U] = 0 and
Cov[Γ′U] = Cov[ \begin{pmatrix} Γ_1′U \\ Γ_2′U \end{pmatrix} ] = \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} ⊗ Σ. (3.67)
Thus Γ_2′U has mean and covariance zero, hence it must be zero itself (with probability one). That is,
U′U = U′ΓΓ′U = U′Γ_1Γ_1′U + U′Γ_2Γ_2′U = U′Γ_1Γ_1′U. (3.68)
By (3.66), and since J = Γ_1Γ_1′ in (3.65), Γ_1′U ∼ N(0, I_k ⊗ Σ), so that by Definition 3.6, U′U = (Γ_1′U)′(Γ_1′U) ∼ Wishart(trace(J), Σ).
Mean
Letting Z_1, . . . , Z_ν be the rows of Z in Definition 3.6, we have that
W = Z′Z = Σ_{i=1}^ν Z_i′Z_i, (3.72)
and because E[Z_i′Z_i] = Cov[Z_i] = Σ for each i,
E[W] = νΣ. (3.73)
In particular, for the S in (3.57), because ν = n − 1, E[S] = ((n − 1)/n)Σ, so that an unbiased estimator of Σ is
Σ̂ = (1/(n − 1)) Y′H_nY. (3.74)
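In R, the function cov computes exactly this unbiased estimator, as the following check (with simulated data, and H_n taken to be the usual centering matrix I_n − (1/n)1_n1_n′) illustrates.

# cov(Y) agrees with Y' H_n Y / (n - 1) in (3.74).
set.seed(1)
n <- 20
Y <- matrix(rnorm(n * 3), n, 3)
Hn <- diag(n) - matrix(1, n, n) / n          # centering matrix H_n
SigmaHat <- t(Y) %*% Hn %*% Y / (n - 1)
all.equal(SigmaHat, cov(Y))                  # TRUE up to rounding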
Chi-squares
If Z_1, . . . , Z_ν are independent N(0, σ²)'s, then Z = (Z_1, . . . , Z_ν)′ ∼ N_{ν×1}(0, I_ν ⊗ σ²), so that
Z_1² + ··· + Z_ν² = Z′Z ∼ Wishart_1(ν, σ²) ≡ σ²χ²_ν, (3.76)
the chi-square distribution on ν degrees of freedom scaled by σ².
Linear transformations
If Z ∼ N_{ν×q}(0, I_ν ⊗ Σ), then for a p × q matrix A, ZA′ ∼ N_{ν×p}(0, I_ν ⊗ AΣA′). Using the definition of Wishart (3.61), (ZA′)′(ZA′) = AZ′ZA′ is Wishart, i.e., for W = Z′Z ∼ Wishart_q(ν, Σ),
AWA′ ∼ Wishart_p(ν, AΣA′). (3.79)
Marginals
Because marginals are special cases of linear transformations, central blocks of a
Wishart are Wishart. E.g., if W11 is the upper-left p × p block of W, then W11 ∼
Wishart p (ν, Σ11 ), where Σ11 is the upper-left block of Σ. See Exercise 3.7.9. A special
case of such marginal is a diagonal element, Wii , which is Wishart1 (ν, σii ), i.e., σii χ2ν .
Furthermore, if Σ is diagonal, then the diagonals of W are independent because the
corresponding normals are.
3.7 Exercises
Exercise 3.7.1. Verify the calculations in (3.9).
Exercise 3.7.2. Find the matrix B for which W ≡ (Y2 , Y5 ) = (Y1 , . . . , Y5 )B, and verify
(3.20).
Exercise 3.7.5. Suppose that A and B are both n × p matrices. Denote the elements
of A by a_ij, and of B by b_ij. (a) Give the following in terms of those elements: (AB′)_ii (the i-th diagonal element of the matrix AB′); and (B′A)_jj (the j-th diagonal element of the matrix B′A). (c) Using the above, show that trace(AB′) = trace(B′A).
Exercise 3.7.7. Explicitly write the sum of W_1 and W_2 as in (3.75) as a sum of Z_i′Z_i's as in (3.72).
Exercise 3.7.8. Suppose W ∼ σ2 χ2ν from (3.76), that is, W = Z12 + · · · + Zν2 , where the
Z_i's are independent N(0, σ²)'s. This exercise shows that W has pdf
f_W(w | ν, σ²) = 1/(Γ(ν/2)(2σ²)^{ν/2}) w^{ν/2 − 1} e^{−w/(2σ²)}, w > 0. (3.80)
It will help to know that U has the Gamma(α, λ) density if α > 0, λ > 0, and
f_U(u | α, λ) = 1/(Γ(α)λ^α) u^{α−1} e^{−u/λ} for u > 0. (3.81)
The Γ function is defined in (2.96). (It is the constant needed to have the pdf integrate
to one.) We’ll use moment generating functions. Working directly with convolutions
is another possibility. (a) Show that the moment generating function of U in (3.81) is
(1 − λt)−α when it is finite. For which t is the mgf finite? (b) Let Z ∼ N (0, σ2 ), so that
Z2 ∼ σ2 χ21 . Find the moment generating function for Z2 . [Hint: Write E [exp(tZ2 )] as
an integral using the pdf of Z, then note the exponential term in the integrand looks
like a normal with mean zero and some variance, but without the constant. Thus the
integral over that exponential is the reciprocal of the constant.] (c) Find the moment
generating function for W. (See Exercise 2.7.20.) (d) W has a gamma distribution.
What are the parameters? Does this gamma pdf coincide with (3.80)? (e) [Aside] The
density of Z2 can be derived by writing
P[Z² ≤ w] = ∫_{−√w}^{√w} f_Z(z) dz, (3.82)
then taking the derivative. Match the result with the σ2 χ21 density found above. What
is Γ ( 12 )?
Exercise 3.7.9. Suppose W ∼ Wishart p+q (ν, Σ), where W and Σ are partitioned as
W = \begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix} and Σ = \begin{pmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix}, (3.83)
where W11 and Σ11 are p × p, etc. (a) What matrix A in (3.79) is used to show
that W11 ∼ Wishart p (ν, Σ11 )? (b) Argue that if Σ12 = 0, then W11 and W22 are
independent.
Exercise 3.7.10. The balanced one-way random effects model in analysis of variance
has
Yij = μ + Ai + eij , i = 1, . . . , g; j = 1, . . . , r, (3.84)
where the A_i's are iid N(0, σ_A²), the e_ij's are iid N(0, σ_e²), and the e_ij's are independent of the A_i's. Let Y be the g × r matrix of the Y_ij's. Show that
Y ∼ N_{g×r}(M, I_g ⊗ Σ), (3.85)
and give M and Σ in terms of the μ, σ_A², and σ_e².
Exercise 3.7.16. True/false questions: (a) If A and B are identity matrices, then A ⊗ B
is an identity matrix. (b) If A and B are orthogonal, then A ⊗ B is orthogonal. (c)
If A is orthogonal and B is not orthogonal, then A ⊗ B is orthogonal. (d) If A and
B are diagonal, then A ⊗ B is diagonal. (e) If A and B are idempotent, then A ⊗ B
is idempotent. (f) If A and B are permutation matrices, then A ⊗ B is a permutation
matrix. (A permutation matrix is a square matrix with exactly one 1 in each row, one
1 in each column, and 0’s elsewhere.) (g) If A and B are upper triangular, then A ⊗ B
is upper triangular. (An upper triangular matrix is a square matrix whose elements below the diagonal are 0. I.e., if A is upper triangular, then a_ij = 0 if i > j.) (h) If A is
upper triangular and B is not upper triangular, then A ⊗ B is upper triangular. (i) If
A is not upper triangular and B is upper triangular, then A ⊗ B is upper triangular.
(j) If A and C have the same dimensions, and B and D have the same dimensions,
then A ⊗ B + C ⊗ D = (A + C) ⊗ (B + D). (k) If A and C have the same dimensions,
then A ⊗ B + C ⊗ B = (A + C) ⊗ B. (l) If B and D have the same dimensions, then
A ⊗ B + A ⊗ D = A ⊗ (B + D).
Exercise 3.7.17. Prove (3.32a), (3.32b) and (3.32c).
Exercise 3.7.18. Take C, Y and D to all be 2 × 2. Show (3.32d) explicitly.
Exercise 3.7.19. Suppose A is a × a and B is b × b. (a) Show that (3.32e) for the
trace of A ⊗ B holds. (b) Show that (3.32f) determinant of A ⊗ B holds. [Hint: Write
A ⊗ B = (A ⊗ I_b)(I_a ⊗ B). You can use the fact that the determinant of a product is the product of the determinants. For |I_a ⊗ B|, permute the rows and columns so it looks like |B ⊗ I_a|.]
Exercise 3.7.20. Suppose the spectral decompositions of A and B are A = GLG′ and B = HKH′. Is the equation
A ⊗ B = (G ⊗ H)(L ⊗ K)(G ⊗ H)′
the spectral decomposition of A ⊗ B? If not, what is wrong, and how can it be fixed?
Exercise 3.7.21. Suppose W is a 1 × q vector with finite covariance matrix. Show that q(c) = E[‖W − c‖²] is minimized over c ∈ R^q by c = E[W], and the minimum value is q(E[W]) = trace(Cov[W]). [Hint: Write
Show that the middle (cross-product) term in line (3.90) is zero (E [W] and c are
constants), and argue that the second term in line (3.91) is uniquely minimized by
c = E [W]. (No need to take derivatives.)]
Exercise 3.7.22. Verify the matrix multiplication in (3.48).
Exercise 3.7.23. Suppose (X, Y) is as in (3.53). (a) Show that (3.54) follows. [Be
careful about the covariance, since row(X, Y) ≠ (row(X), row(Y)) if n > 1.] (b)
Apply Proposition 3.3 to (3.54) to obtain
where
What are β and ΣYY · X ? (c) Use Proposition 3.2 to derive (3.56) from part (b).
Exercise 3.7.24. Suppose ( X, Y, Z ) is multivariate normal with covariance matrix
(X, Y, Z) ∼ N( (0, 0, 0), \begin{pmatrix} 5 & 1 & 2 \\ 1 & 5 & 2 \\ 2 & 2 & 3 \end{pmatrix} ). (3.94)
Find c so that the conditional correlation between X and Y given Z = z is 0 (so that
X and Y are conditionally independent, because of their normality).
Exercise 3.7.26. Let Y | X = x ∼ N (0, x2 ) and X ∼ N (2, 1). (a) Find E [Y ] and Var [Y ].
(b) Let Z = Y/X. What is the conditional distribution of Z | X = x? Is Z independent
of X? What is the marginal distribution of Z? (c) What is the conditional distribution
of Y | | X | = r?
Exercise 3.7.27. Suppose that conditionally, (Y1 , Y2 ) | X = x are iid N (α + βx, 10),
and that marginally, E [ X ] = Var [ X ] = 1. (The X is not necessarily normal.) (a) Find
Var [Yi ], Cov[Y1, Y2 ], and the (unconditional) correlation between Y1 and Y2 . (b) What
is the conditional distribution of Y1 + Y2 | X = x? Is Y1 + Y2 independent of X? (c)
What is the conditional distribution of Y1 − Y2 | X = x? Is Y1 − Y2 independent of X?
[Hint: Show that X and Y − α − Xβ are independent normals, and find the A so that
(X, Y) = (X, Y − α − Xβ)A.] (b) Show that the conditional distribution X | Y = y is multivariate normal with mean given in (3.97) and
Cov[X | Y = y] = Σ_XX − Σ_XX β (Σ + β′Σ_XX β)^{−1} β′Σ_XX. (3.98)
(You can assume any covariance that needs to be invertible is invertible.)
Exercise 3.7.29 (Bayesian inference). Suppose
Y | μ = m ∼ N(m, Σ) and μ ∼ N(μ_0, Σ_0), (3.99)
where Σ, μ_0, and Σ_0 are known. That is, the mean vector μ is a random variable, with a multivariate normal prior. (a) Use Exercise 3.7.28 to show that the posterior distribution of μ, i.e., μ given Y = y, is multivariate normal with
E[μ | Y = y] = (y Σ^{−1} + μ_0 Σ_0^{−1})(Σ^{−1} + Σ_0^{−1})^{−1} (3.100)
and
Cov[μ | Y = y] = (Σ^{−1} + Σ_0^{−1})^{−1}. (3.101)
Thus the posterior mean is a weighted average of the data y and the prior mean, with
weights inversely proportional to their respective covariance matrices. [Hint: What
are the α and β in this case? It takes some matrix manipulations to get the mean and
covariance in the given form.] (b) Show that the marginal distribution of Y is
Y ∼ N(μ_0, Σ + Σ_0). (3.102)
[Hint: See (3.96).] [Note that the inverse of the posterior covariance is the sum of
the inverses of the conditional covariance of Y and the prior covariance, while the
marginal covariance of the Y is the sum of the conditional covariance of Y and the
prior covariance.] (c) Replace Y with Y, the sample mean of n iid vectors, so that
Y | μ = m ∼ N (m, Σ/n ). Keep the same prior on μ. Find the posterior distribution
of μ given the Y = y. What are the posterior mean and covariance matrix, approxi-
mately, when n is large?
Exercise 3.7.30 (Bayesian inference). Consider a matrix version of Exercise 3.7.29, i.e.,
Y | μ = m ∼ N(m, K^{−1} ⊗ Σ) and μ ∼ N(μ_0, K_0^{−1} ⊗ Σ), (3.103)
where K, Σ, μ_0 and K_0 are known, and the covariance matrices are invertible. [So if Y is a sample mean vector, K would be n, and if Y is β̂ from multivariate regression, K would be x′x.] Notice that the Σ is the same in the conditional distribution of Y and in the prior. Show that the posterior distribution of μ is multivariate normal, with
E[μ | Y = y] = (K + K_0)^{−1}(Ky + K_0 μ_0) (3.104)
and
Cov[μ | Y = y] = (K + K_0)^{−1} ⊗ Σ. (3.105)
[Hint: Use (3.100) and (3.101) on row(Y) and row(μ), then use properties of Kro-
necker products, e.g., (3.32d) and Exercise 3.7.16 (l).]
Let
R = Y − XC − D (3.107)
for some matrices C and D. Instead of using least squares as in Section 3.4, here
we try to find C and D so that the residuals have mean zero and are independent
of X. (a) What are the dimensions of R, C and D? (b) Show that (X, R) is an affine
transformation of (X, Y). That is, find A and B so that
(c) Find the distribution of (X, R). (d) What must C be in order for X and R to be
independent? (You can assume Σ XX is invertible.) (e) Using the C found in part (d),
find Cov[R]. (It should be In ⊗ ΣYY · X .) (f) Sticking with the C from parts (d) and (e),
find D so that E [R] = 0. (g) Using the C and D from parts (d), (e), (f), what is the
distribution of R? The distribution of R R?
(a) Find C so that E[YC′] = xβ. (b) Assuming that x′x is invertible, what is the distribution of Q_xYC′, where Q_x = I_n − x(x′x)^{−1}x′? (Is Q_x idempotent? Such matrices will appear again in equation 5.19.) (c) What is the distribution of CY′Q_xYC′?
Exercise 3.7.34. Here, W ∼ Wishart p (n, Σ). (a) Is E [trace(W)] = n trace(Σ)? (b)
Are the diagonal elements of W independent? (c) Suppose Σ = σ2 I p . What is the
distribution of trace(W)?
Exercise 3.7.35. Suppose Z = ( Z1 , Z2 ) ∼ N1×2 (0, I2 ). Let (θ, R) be the polar coordi-
nates, so that
Z1 = R cos(θ ) and Z2 = R sin(θ ). (3.110)
In order for the transformation to be one-to-one, remove 0 from the space of Z. Then
the space of (θ, R) is [0, 2π ) × (0, ∞ ). The question is to derive the distribution of
(θ, R). (a) Write down the density of Z. (b) Show that the Jacobian of the transforma-
tion is r. (c) Find the density of (θ, R). What is the marginal distribution of θ? What
is the marginal density of R? Are R and θ independent? (d) Find the distribution
function FR (r ) for R. (e) Find the inverse function of FR . (f) Argue that if U1 and U2
are independent Uniform(0, 1) random variables, then
√(−2 log(U_2)) × ( cos(2πU_1)  sin(2πU_1) ) ∼ N_{1×2}(0, I_2). (3.111)
Thus we can generate two random normals from two random uniforms. Equation
(3.111) is called the Box-Muller transformation [Box and Muller, 1958] [Hint: See
Exercise 2.7.27.] (g) Find the pdf of W = R2 . What is the distribution of W? Does it
check with (3.80)?
Chapter 4
Linear Models
This chapter presents some basic types of linear model. We start with the usual
linear model, with just one Y-variable. Multivariate regression extends the idea to
several variables, placing the same model on each variable. We then introduce linear
models that model the variables within the observations, basically reversing the roles
of observations and variables. Finally, we introduce the both-sides model, which
simultaneously models the observations and variables. Subsequent chapters present
estimation and hypothesis testing for these models.
Compare this to (3.51). The variance σR2 plays the role of σYY · X . The model (4.1)
assumes that the residuals Ri are iid N (0, σR2 ).
Some examples follow. There are thousands of books on linear regression and
linear models. Scheffé [1999] is the classic theoretical reference, and Christensen
[2002] provides a more modern treatment. A fine applied reference is Weisberg [2005].
Analysis of variance
In analysis of variance, observations are classified into different groups, and one
wishes to compare the means of the groups. If there are three groups, with two
observations in each group, the model could be
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} μ_1 \\ μ_2 \\ μ_3 \end{pmatrix} + R. (4.3)
Other design matrices x yield the same model (See Section 5.4), e.g., we could just as
well write
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \end{pmatrix} = \begin{pmatrix} 1 & 2 & −1 \\ 1 & 2 & −1 \\ 1 & −1 & 2 \\ 1 & −1 & 2 \\ 1 & −1 & −1 \\ 1 & −1 & −1 \end{pmatrix} \begin{pmatrix} μ \\ α \\ β \end{pmatrix} + R, (4.4)
where μ is the grand mean, α is the effect for the first group, and β is the effect for
the second group. We could add the effect for group three, but that would lead to
a redundancy in the model. More complicated models arise when observations are
classified in multiple ways, e.g., sex, age, and ethnicity.
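Design matrices like these are easy to construct in R; for instance, the x in (4.3) is a Kronecker product of a 3 × 3 identity with a column of two 1's, and the x in (4.4) can be typed column by column. This is just a sketch of one way to do it.

# The design matrix in (4.3): three groups of two observations each.
x3 <- diag(3) %x% matrix(1, 2, 1)
x3
# The design matrix in (4.4): grand mean plus two group-effect columns.
x4 <- cbind(1, c(2, 2, -1, -1, -1, -1), c(-1, -1, 2, 2, -1, -1))
x4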
Analysis of covariance
It may be that the main interest is in comparing the means of groups as in analysis
of variance, but there are other variables that potentially affect the Y. For example, in
a study comparing three drugs’ effectiveness in treating leprosy, there were bacterial
measurements before and after treatment. The Y is the “after” measurement, and one
would expect the “before” measurement, in addition to the drugs, to affect the after
The actual experiment had ten observations in each group. See Section 7.5.
This model looks very much like the linear regression model in (4.1), and it is. It is
actually just a concatenation of q linear models, one for each variable (column) of Y.
Note that (4.8) places the same model on each variable, in the sense of using the same
x’s, but allows different coefficients represented by the different columns of β. That
is, (4.8) implies
Y1 = xβ 1 + R1 , . . . , Yq = xβ q + Rq , (4.9)
Consider predicting the midterms and final exam scores from gender, and the
homework, labs, and inclass scores. The model is Y = xβ + R, where Y is 107 × 2
(the Midterms and Finals), x is 107 × 5 (with Gender, HW, Labs, InClass, plus the first
column of 1107 ), and β is 5 × 2:
β = \begin{pmatrix} β_{0M} & β_{0F} \\ β_{GM} & β_{GF} \\ β_{HM} & β_{HF} \\ β_{LM} & β_{LF} \\ β_{IM} & β_{IF} \end{pmatrix}. (4.11)
Chapter 6 shows how to estimate the β_ij's. In this case the estimates are
              Midterms    Final
  Intercept     56.472   43.002
  Gender        −3.044   −1.922
  HW             0.244    0.305        (4.12)
  Labs           0.052    0.005
  InClass        0.048    0.076
Note that the largest slopes (not counting the intercepts) are the negative ones for
gender, but to truly assess the sizes of the coefficients, we will need to find their
standard errors, which we will do in Chapter 6.
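Estimation is covered in Chapter 6; as a preview, estimates like those in (4.12) can be obtained by least squares, β̂ = (x′x)^{−1}x′Y, applied to all columns of Y at once. In the R sketch below, the data frame grades and its column names are hypothetical stand-ins for the actual data set.

# Least squares for the multivariate regression Y = x beta + R (sketch).
Y <- as.matrix(grades[, c("Midterms", "Final")])                          # 107 x 2
x <- cbind(1, as.matrix(grades[, c("Gender", "HW", "Labs", "InClass")]))  # 107 x 5
beta.hat <- solve(t(x) %*% x, t(x) %*% Y)                                 # 5 x 2
beta.hat
# Equivalently: lm(cbind(Midterms, Final) ~ Gender + HW + Labs + InClass, data = grades)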
Mouth sizes
Measurements were made on the size of mouths of 27 children at four ages: 8, 10,
12, and 14. The measurement is the distance from the “center of the pituitary to the
pteryomaxillary fissure"1 in millimeters. These data can be found in Potthoff and
Roy [1964]. There are 11 girls (Sex=1) and 16 boys (Sex=0). See Table 4.1. Figure 4.1
contains a plot of the mouth sizes over time. These curves are generally increasing.
There are some instances where the mouth sizes decrease over time. The measure-
ments are between two defined locations in the mouth, and as people age, the mouth
shape can change, so it is not that people's mouths are really getting smaller. Note that
generally the boys have bigger mouths than the girls, as they are generally bigger
overall.
For the linear model, code x where the first column is 1 = girl, 0 = boy, and the
second column is 0 = girl, 1 = boy:
Y = xβ + R = \begin{pmatrix} 1_{11} & 0_{11} \\ 0_{16} & 1_{16} \end{pmatrix} \begin{pmatrix} β_{11} & β_{12} & β_{13} & β_{14} \\ β_{21} & β_{22} & β_{23} & β_{24} \end{pmatrix} + R. (4.13)
Here, Y and R are 27 × 4. So now the first row of β has the (population) means of the
girls for the four ages, and the second row has the means for the boys. The sample
means are
Histamine in dogs
Sixteen dogs were treated with drugs to see the effects on their blood histamine
levels. The dogs were split into four groups: Two groups received the drug morphine,
and two received the drug trimethaphan, both given intravenously. For one group
within each pair of drug groups, the dogs had their supply of histamine depleted
before treatment, while the other group had histamine intact. So this was a two-way
1
Actually, I believe it is the pterygomaxillary fissure. See Wikipedia [2010] for an illustration and some
references.
Table 4.1: The mouth size data, from Potthoff and Roy [1964].
Y = \begin{pmatrix} 1_4 & −1_4 & −1_4 & 1_4 \\ 1_4 & −1_4 & 1_4 & −1_4 \\ 1_4 & 1_4 & −1_4 & −1_4 \\ 1_4 & 1_4 & 1_4 & 1_4 \end{pmatrix} \begin{pmatrix} μ_0 & μ_1 & μ_3 & μ_5 \\ α_0 & α_1 & α_3 & α_5 \\ β_0 & β_1 & β_3 & β_5 \\ γ_0 & γ_1 & γ_3 & γ_5 \end{pmatrix} + R. (4.15)
Figure 4.1: Mouth sizes over time. The boys are indicated by dashed lines, the girls by solid lines. The top graph has the individual curves, the bottom the averages for the boys and girls.
The estimate of β is
Table 4.2: The data on histamine levels in dogs. The value with the asterisk is missing,
but for illustration purposes I filled it in. The dogs are classified according to the
drug administered (morphine or trimethaphan), and whether the dog’s histamine
was artificially depleted.
individuals, but the same for each variable. Models on the variables switch the roles of
variable and individual.
and z is a fixed q × l matrix. The model (4.17) looks like just a transpose of model
(4.1), but (4.17) does not have iid residuals, because the observations are all on the
same individual. Simple repeated measures models and growth curve models are special
cases. (Simple because there is only one individual. Actual models would have more
than one.)
A repeated measure model is used if the y j ’s represent replications of the same
measurement. E.g., one may measure blood pressure of the same person several
times, or take a sample of several leaves from the same tree. If no systematic differ-
ences are expected in the measurements, the model would have the same mean μ for
each variable:
Y = μ(1, . . . , 1) + R = μ1_q′ + R. (4.18)
Figure 4.2: Plots of the dogs over time. The top plot has the individual dogs, the bottom has the means of the groups. The groups: MI = Morphine, Intact; MD = Morphine, Depleted; TI = Trimethaphan, Intact; TD = Trimethaphan, Depleted.
side:
Y = (β_0, β_1, β_2) \begin{pmatrix} 1 & ··· & ··· & 1 \\ x_1 & x_2 & ··· & x_q \\ x_1² & x_2² & ··· & x_q² \end{pmatrix} + R. (4.19)
Figure 4.3: Plots of the effects in the analysis of variance for the dogs data, over time.
This model makes sense if one takes a random sample of n individuals, and makes
repeated measurements from each. More generally, a growth curve model as in (4.19),
but with n individuals measured, is
Y = 1_n (β_0, β_1, β_2) \begin{pmatrix} 1 & ··· & ··· & 1 \\ z_1 & z_2 & ··· & z_q \\ z_1² & z_2² & ··· & z_q² \end{pmatrix} + R. (4.22)
Example: Births
The average births for each hour of the day for four different hospitals is given in
Table 4.3. The data matrix Y here is 4 × 24, with the rows representing the hospitals
and the columns the hours. Figure 4.4 plots the curves.
One might wish to fit sine waves (Figure 4.5) to the four hospitals’ data, presuming
one day reflects one complete cycle. The model is
Y = βz′ + R, (4.23)
where
β = \begin{pmatrix} β_{10} & β_{11} & β_{12} \\ β_{20} & β_{21} & β_{22} \\ β_{30} & β_{31} & β_{32} \\ β_{40} & β_{41} & β_{42} \end{pmatrix} (4.24)
1 2 3 4 5 6 7 8
Hosp1 13.56 14.39 14.63 14.97 15.13 14.25 14.14 13.71
Hosp2 19.24 18.68 18.89 20.27 20.54 21.38 20.37 19.95
Hosp3 20.52 20.37 20.83 21.14 20.98 21.77 20.66 21.17
Hosp4 21.14 21.14 21.79 22.54 21.66 22.32 22.47 20.88
9 10 11 12 13 14 15 16
Hosp1 14.93 14.21 13.89 13.60 12.81 13.27 13.15 12.29
Hosp2 20.62 20.86 20.15 19.54 19.52 18.89 18.41 17.55
Hosp3 21.21 21.68 20.37 20.49 19.70 18.36 18.87 17.32
Hosp4 22.14 21.86 22.38 20.71 20.54 20.66 20.32 19.36
17 18 19 20 21 22 23 24
Hosp1 12.92 13.64 13.04 13.00 12.77 12.37 13.45 13.53
Hosp2 18.84 17.18 17.20 17.09 18.19 18.41 17.58 18.19
Hosp3 18.79 18.55 18.19 17.38 18.41 19.10 19.49 19.10
Hosp4 20.02 18.84 20.40 18.44 20.83 21.00 19.57 21.35
Table 4.3: The data on average number of births for each hour of the day for four
hospitals.
Figure 4.4: Plots of the four hospitals' births, over twenty-four hours.
Figure 4.5: Sine and cosine waves, where one cycle spans twenty-four hours.
and
z′ = \begin{pmatrix} 1 & 1 & ··· & 1 \\ cos(1 · 2π/24) & cos(2 · 2π/24) & ··· & cos(24 · 2π/24) \\ sin(1 · 2π/24) & sin(2 · 2π/24) & ··· & sin(24 · 2π/24) \end{pmatrix}. (4.25)
The estimates of the coefficients are now β* = (18.35, −0.25, 1.34), which is the average of the rows of β̂. The fit is graphed as the thick line in Figure 4.6.
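The matrix in (4.25) is straightforward to build in R. In the sketch below, births stands for the 4 × 24 data matrix of Table 4.3 (the name is hypothetical), and the commented line shows a least squares fit of the kind used for the estimates quoted above.

# The sine/cosine rows of (4.25), and a least squares fit of Y = beta z' + R.
theta <- (1:24) * 2 * pi / 24
zprime <- rbind(1, cos(theta), sin(theta))   # 3 x 24, as displayed in (4.25)
z <- t(zprime)                               # 24 x 3
# beta.hat <- births %*% z %*% solve(t(z) %*% z)   # one row of estimates per hospital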
Figure 4.6: Plots of the four hospitals' births, with the fitted sine waves. The thick line fits one curve to all four hospitals.
Y = xβz′ + R, (4.28)
The “0m ”’s are m × 1 vectors of 0’s. Thus ( β g0 , β g1 , β g2 ) contains the coefficients for
the girls’ growth curve, and ( β b0 , β b1 , β b2 ) the boys’. Some questions which can be
addressed include
• Are the girls’ and boys’ curves parallel (are β g1 = β b1 and β g2 = β b2 , but maybe
not β g0 = β b0 )?
See also Ware and Bowden [1977] for a circadian application and Zerbe and Jones
[1980] for a time-series context. The model is often called the generalized multivari-
ate analysis of variance, or GMANOVA, model. Extensions are many. For examples,
see Gleser and Olkin [1970], Chinchilli and Elswick [1985], and the book by Kariya
[1985].
4.4 Exercises
Exercise 4.4.1 (Prostaglandin). Below are data from Ware and Bowden [1977] taken at
six four-hour intervals (labelled T1 to T6) over the course of a day for 10 individuals.
The measurements are prostaglandin contents in their urine.
Person T1 T2 T3 T4 T5 T6
1 146 280 285 215 218 161
2 140 265 289 231 188 69
3 288 281 271 227 272 150
4 121 150 101 139 99 103
5 116 132 150 125 100 86 (4.30)
6 143 172 175 222 180 126
7 174 276 317 306 139 120
8 177 313 237 135 257 152
9 294 193 306 204 207 148
10 76 151 333 144 135 99
(a) Write down the "xβz′" part of the model that fits a separate sine wave to each
person. (You don’t have to calculate the estimates or anything. Just give the x, β and
z matrices.) (b) Do the same but for the model that fits one sine wave to all people.
Exercise 4.4.2 (Skulls). The data concern the sizes of Egyptian skulls over time, from
Thomson and Randall-MacIver [1905]. There are 30 skulls from each of five time
periods, so that n = 150 all together. There are four skull size measurements, all in
millimeters: maximum length, basibregmatic height, basialveolar length, and nasal
height. The model is a multivariate analysis of variance one, where x distinguishes
between the time periods, and we do not use a z. Use polynomials for the time
periods (code them as 1, 2, 3, 4, 5), so that x = w ⊗ 130 . Find w.
Exercise 4.4.3. Suppose Yb and Ya are n × 1 with n = 4, and consider the model
(Yb Ya ) ∼ N (xβ, In ⊗ Σ), (4.31)
where
x = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & -1 \\ 1 & -1 & 1 \\ 1 & -1 & -1 \end{pmatrix}. (4.32)
(a) What are the dimensions of β and Σ? The conditional distribution of Ya given
Yb = (4, 2, 6, 3) is
Ya | Yb = (4, 2, 6, 3) ∼ N (x∗ β∗ , In ⊗ Ω) (4.33)
for some fixed matrix x∗ , parameter matrix β∗ , and covariance matrix Ω. (b) What are
the dimensions of β∗ and Ω? (c) What is x∗ ? (d) What is the most precise description
of the conditional model?
Exercise 4.4.4 (Caffeine). Henson et al. [1996] conducted an experiment to see whether
caffeine has a negative effect on short-term visual memory. High school students
were randomly chosen: 9 from eighth grade, 10 from tenth grade, and 9 from twelfth
grade. Each person was tested once after having caffeinated Coke, and once after
having decaffeinated Coke. After each drink, the person was given ten seconds to
try to memorize twenty small, common objects, then allowed a minute to write down
as many as could be remembered. The main question of interest is whether people
remembered more objects after the Coke without caffeine than after the Coke with
caffeine. The data are
Grade 8 Grade 10 Grade 12
Without With Without With Without With
5 6 6 3 7 7
9 8 9 11 8 6
6 5 4 4 9 6
8 9 7 6 11 7
(4.34)
7 6 6 8 5 5
6 6 7 6 9 4
8 6 6 8 9 7
6 8 9 8 11 8
6 7 10 7 10 9
10 6
“Grade" is the grade in school, and the “Without" and “With" entries are the numbers
of items remembered after drinking Coke without or with caffeine. Consider the
model
Y = xβz + R, (4.35)
where the Y is 28 × 2, the first column being the scores without caffeine, and the
second being the scores with caffeine. The x is 28 × 3, being a polynomial (quadratic)
matrix in the three grades. (a) The z has two columns. The first column of z represents
the overall mean (of the number of objects a person remembers), and the second
column represents the difference between the number of objects remembered with
caffeine and without caffeine. Find z. (b) What is the dimension of β? (c) What
effects do the β ij ’s represent? (Choices: overall mean, overall linear effect of grade,
overall quadratic effect of grade, overall difference in mean between caffeinated and
decaffeinated coke, linear effect of grade in the difference between caffeinated and
decaffeinated coke, quadratic effect of grade in the difference between caffeinated
and decaffeinated coke, interaction of linear and quadratic effects of grade.)
Y = xβz + R, (4.36)
measurement (at time 0), and the last three columns represent polynomial effects
(constant, linear, and quadratic) for just the three “after" time points (times 1, 3, 5).
(a) What is z? (b) What effects do the β ij ’s represent? (Choices: overall drug effect
for the after measurements, overall drug effect for the before measurement, average
“after" measurement, drug × depletion interaction for the “before" measurement,
linear effect in “after" time points for the drug effect.)
Exercise 4.4.6 (Leprosy). Below are data on leprosy patients found in Snedecor and
Cochran [1989]. There were 30 patients, randomly allocated to three groups of 10.
The first group received drug A, the second drug D, and the third group received a
placebo. Each person had their bacterial count taken before and after receiving the
treatment.
Drug A Drug D Placebo
Before After Before After Before After
11 6 6 0 16 13
8 0 6 2 13 10
5 2 7 3 11 18
14 8 8 1 9 5
(4.37)
19 11 18 18 21 23
6 4 8 4 16 12
10 13 19 14 12 5
6 1 8 9 12 16
11 8 5 1 7 1
3 0 15 9 12 20
(a) Consider the model Y = xβ + R for the multivariate analysis of variance with three
groups and two variables (so that Y is 30 × 2), where R ∼ N30×2 (0, I30 ⊗ Σ R ). The
x has vectors for the overall mean, the contrast between the drugs and the placebo,
and the contrast between Drug A and Drug D. Because there are ten people in each
group, x can be written as w ⊗ 110 . Find w. (b) Because the before measurements
were taken before any treatment, the means for the three groups on that variable
should be the same. Describe that constraint in terms of the β. (c) With Y = (Yb Ya ),
find the model for the conditional distribution
Ya | Yb = yb ∼ N(x∗ β∗, In ⊗ Ω). (4.38)
Give the x∗ in terms of x and yb , and give Ω in terms of the elements of Σ R . (Hint:
Write down what it would be with E [Y] = (μb μ a ) using the conditional formula,
then see what you get when μb = xβ b and μ a = xβ a .)
Exercise 4.4.7 (Parity). Johnson and Wichern [2007] present data (in their Exercise
6.17) on an experiment. Each of 32 subjects was given several sets of pairs of integers,
and had to say whether the two numbers had the same parity (i.e., both odd or both
even), or different parities. So (1, 3) have the same parity, while (4, 5) have differ-
ent parity. Some of the integer pairs were given numerically, like (2, 4), and some
were written out, i.e., ( Two, Four ). The time it took to decide the parity for each pair
was measured. Each person had a little two-way analysis of variance, where the two
factors are Parity, with levels different and same, and Format, with levels word and nu-
meric. The measurements were the median time for each Parity/Format combination
for that person. Person i then had observation vector yi = (yi1 , yi2 , yi3 , yi4 ), which in
the ANOVA could be arranged as
            Format
Parity      Word    Numeric
Different   yi1     yi2
Same        yi3     yi4
(4.39)
Exercise 4.4.8 (Sine waves). Let θ be an angle running from 0 to 2π, so that a sine/cosine wave with one cycle has the form
Yj = A + B cos(θj + C) + Rj, θj = 2πj/q, j = 1, . . . , q,
where the Rj are the residuals. (a) Is the model linear in the parameters A, B, C? Why
or why not? (b) Show that the model can be rewritten as
Yj = β1 + β2 cos(2πj/q) + β3 sin(2πj/q) + Rj, j = 1, . . . , q, (4.44)
and give the β k ’s in terms of A, B, C. [Hint: What is cos( a + b )?] (c) Write this model
as a linear model, Y = βz + R, where Y is 1 × q. What is the z? (d) Waves with m ≥ 1
cycles can be added to the model by including cosine and sine terms with θ replaced
by mθ:
cos(2πmj/q), sin(2πmj/q). (4.45)
If q = 6, then with the constant term, we can fit in the cosine and sine terms for
the wave with m = 1 cycle, and the cosine and sine terms for the wave with m = 2
cycles. The x cannot have more than 6 columns (or else it won’t be invertible). Find
the cosine and sine terms for m = 3. What do you notice? Which one should you put
in the model?
Chapter 5
In this chapter, we briefly review linear subspaces and projections onto them. Most of
the chapter is abstract, in the sense of not necessarily tied to statistics. The main result
we need for the rest of the book is the least-squares estimate given in Theorem 5.2.
Further results can be found in Chapter 1 of Rao [1973], an excellent compendium of
facts on linear subspaces and matrices.
x, y ∈ W =⇒ x + y ∈ W , and (5.1)
c ∈ R, x ∈ W =⇒ cx ∈ W . (5.2)
By convention, the span of the empty set is just {0}. It is not hard to show that
any span is a linear subspace. Some examples: For K = 2, span{(1, 1)} is the set
of vectors of the form ( a, a), that is, the equiangular line through 0. For K = 3,
span{(1, 0, 0), (0, 1, 0)} is the set of vectors of the form ( a, b, 0), which is the x/y
plane, considering the axes to be x, y, z.
We will usually write the span in matrix form. Letting D be the M × K matrix with columns d1 , . . . , dK , we have the following representations of the subspace W :
W = span{d1 , . . . , dK }
= span{columns of D}
= {Db | b ∈ RK (b is K × 1)}
= span{rows of D }
= {bD | b ∈ RK (b is 1 × K )}. (5.4)
Not only is any span a subspace, but any subspace is a span of some vectors. In
fact, any subspace of RK can be written as a span of at most K vectors, although not
in a unique way. For example, for K = 3,
span{(1, 0, 0), (0, 1, 0)} = span{(1, 0, 0), (0, 1, 0), (1, 1, 0)}
= span{(1, 0, 0), (1, 1, 0)}
= span{(2, 0, 0), (0, −7, 0), (33, 2, 0)}. (5.5)
Any invertible transformation of the vectors yields the same span, as in the next
lemma. See Exercise 5.6.4 for the proof.
Lemma 5.1. Suppose W is the span of the columns of the M × K matrix D as in (5.4), and
A is an invertible K × K matrix. Then W is also the span of the columns of DA, i.e.,
span{columns of D} = span{columns of DA}. (5.6)
Note that the space in (5.5) can be a span of two or three vectors, or a span of any
number more than three as well. It cannot be written as a span of only one vector.
In the two sets of three vectors, there is a redundancy, that is, one of the vectors can
be written as a linear combination of the other two: (1, 1, 0) = (1, 0, 0) + (0, 1, 0) and
(2, 0, 0) = (4/(33 × 7))(0, −7, 0) + (2/33) × (33, 2, 0). Such sets are called linearly
dependent. We first define the opposite.
Definition 5.3. The vectors d1 , . . . , dK in RK are linearly independent if
b1 d1 + · · · + bK dK = 0 =⇒ b1 = · · · = bK = 0. (5.7)
Equivalently, the vectors are linearly independent if no one of them (as long as
it is not 0) can be written as a linear combination of the others. That is, there is no di ≠ 0 and set of coefficients b j such that
d i = b1 d1 + · · · + b i − 1 d i − 1 + b i + 1 d i + 1 + . . . + b K d K . (5.8)
The vectors are linearly dependent if and only if they are not linearly independent.
In (5.5), the sets with three vectors are linearly dependent, and those with two
vectors are linearly independent. To see the latter fact for {(1, 0, 0), (1, 1, 0)}, suppose that b1 (1, 0, 0) + b2 (1, 1, 0) = (0, 0, 0). Then
b1 + b2 = 0 and b2 = 0 =⇒ b1 = b2 = 0, (5.9)
5.2 Projections
In linear models, the mean of the data matrix is presumed to lie in a linear subspace,
and an aspect of fitting the model is to find the point in the subspace closest to the
data. This closest point is called the projection. Before we get to the formal definition,
we need to define orthogonality. Recall from Section 1.5 that two column vectors v
and w are orthogonal if v w = 0 (or vw = 0 if they are row vectors).
Definition 5.6. The vector v ∈ R M is orthogonal to the subspace W ⊂ R M if v is orthogonal
to w for all w ∈ W . Also, subspace V ⊂ R M is orthogonal to W if v and w are orthogonal
for all v ∈ V and w ∈ W .
Geometrically, two objects are orthogonal if they are perpendicular. For example,
in R3 , the z-axis is orthogonal to the x/y-plane. Exercise 5.6.6 is to prove the next
result.
Lemma 5.2. Suppose W = span{d1 , . . . , dK }. Then y is orthogonal to W if and only if y
is orthogonal to each d j .
Definition 5.7. The projection of y onto W is the ŷ that satisfies
ŷ ∈ W and y − ŷ is orthogonal to W . (5.10)
The projection thus splits y into two orthogonal pieces, ŷ and y − ŷ, so that
‖y‖² = ‖ŷ‖² + ‖y − ŷ‖², (5.11)
which is Pythagoras' Theorem. In a regression setting, the left-hand side is the total sum-of-squares, and the right-hand side is the regression sum-of-squares (‖ŷ‖²) plus the residual sum-of-squares, although usually the sample mean of the yi's is subtracted from y and ŷ.
Exercise 5.6.8 proves the following useful result.
Theorem 5.1 (Projection). Suppose y ∈ RK and W is a subspace of RK, and ŷ is the projection of y onto W . Then
‖y − ŷ‖² < ‖y − w‖² for all w ∈ W , w ≠ ŷ. (5.12)
y ≈ bD′. (5.13)
In Section 5.3.1, we specialize to the both-sides model (4.28). Our first objective is to
find the best value of b, where we define “best” by least squares.
Definition 5.8. A least-squares estimate of b in the equation (5.13) is any b̂ such that
‖y − b̂D′‖² = min_{b ∈ RK} ‖y − bD′‖². (5.14)
Part (d) of Theorem 5.1 implies that a least squares estimate of b is any b̂ for which b̂D′ is the projection of y onto the subspace W . Thus y − b̂D′ is orthogonal to W , and by Lemma 5.2, is orthogonal to each dj. The result is the set of normal equations:
(y − b̂D′)dj = 0 for each j = 1, . . . , K. (5.15)
We then have
(5.15) =⇒ (y − b̂D′)D = 0
       =⇒ b̂D′D = yD (5.16)
       =⇒ b̂ = yD(D′D)−1, (5.17)
where the final equation holds if D D is invertible, which occurs if and only if the
columns of D constitute a basis of W . See Exercise 5.6.33. Summarizing:
Theorem 5.2 (Least squares). Any solution b to the least-squares equation (5.14) satisfies
the normal equations (5.16). The solution is unique if and only if D D is invertible, in which
case (5.17) holds.
If D′D is invertible, the projection of y onto W can be written
ŷ = yPD, (5.18)
where
PD = D(D′D)−1D′. (5.19)
The matrix PD is called the projection matrix for W . The residuals are then
y − ŷ = y − yPD = y(IM − PD) = yQD, (5.20)
where
QD = IM − PD. (5.21)
The minimum value in (5.14) is then
‖y − ŷ‖² = yQD y′, (5.22)
and the fit is orthogonal to the residuals:
ŷ(y − ŷ)′ = 0. (5.23)
These two facts are consequences of parts (a) and (c) of the following proposition.
See Exercises 5.6.10 to 5.6.12.
Proposition 5.1 (Projection matrices). Suppose PD is defined as in (5.18), where D D is
invertible. Then the following hold.
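As a small numerical illustration of (5.17) through (5.23), here is a sketch in R; the design and response below are made up purely for illustration.
D <- cbind(1, 1:5, (1:5)^2)                 # a 5 x 3 design with linearly independent columns
y <- c(2, 3, 5, 4, 6)                       # treated as a 1 x 5 row vector
PD <- D %*% solve(t(D) %*% D) %*% t(D)      # projection matrix, as in (5.19)
QD <- diag(5) - PD                          # as in (5.21)
bhat <- y %*% D %*% solve(t(D) %*% D)       # least-squares estimate, as in (5.17)
yhat <- y %*% PD                            # projection of y onto the span of D's columns
resid <- y %*% QD                           # residuals, as in (5.20)
sum(yhat * resid)                           # essentially zero, illustrating (5.23)
y %*% QD %*% y                              # the minimum value, as in (5.22)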
Then the least squares estimate of β is found as in (5.17), where we make the identifications y = row(Y), b = row(β) and D = x ⊗ z (hence M = nq and K = pl):
row(β̂) = row(Y)(x ⊗ z)[(x ⊗ z)′(x ⊗ z)]−1
        = row(Y)(x ⊗ z)(x′x ⊗ z′z)−1
        = row(Y)(x(x′x)−1 ⊗ z(z′z)−1). (5.26)
See Proposition 3.2. Now we need that x′x and z′z are invertible. Undoing the row operation as in (3.32d), the estimate can be written
β̂ = (x′x)−1 x′ Y z (z′z)−1.
When z = Iq (multivariate regression), this reduces to β̂ = (x′x)−1 x′Y, the usual estimate for regression. The repeated measures and growth curve models such as in (4.23) have x = In, so that β̂ = Y z (z′z)−1. Thus, indeed, the both-sides model has estimating matrices on both sides.
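In R, the estimate can be computed directly from this formula. The sketch below defines a small helper; the function name is ours (not from the book's function collection), and it assumes Y (n × q), x (n × p), and z (q × l) are supplied with x′x and z′z invertible.
# A sketch of the both-sides least-squares estimate (x'x)^{-1} x' Y z (z'z)^{-1}
bothsideshat <- function(Y, x, z) {
  solve(t(x) %*% x, t(x)) %*% Y %*% z %*% solve(t(z) %*% z)
}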
Other linear models can have different distributional assumptions, e.g., covariance
restrictions, but do have to have the mean lie in a linear subspace.
There are many different parametrizations of a given linear model, for the same
reason that there are many different bases for the mean space W . For example, it
may not be obvious, but
x = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad z = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \end{pmatrix} (5.35)
and
x∗ = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}, \quad z∗ = \begin{pmatrix} 1 & -1 & 1 \\ 1 & 0 & -2 \\ 1 & 1 & 1 \end{pmatrix} (5.36)
lead to exactly the same model, though different interpretations of the parameters.
In fact, with x being n × p and z being q × l,
The representation in (5.36) has the advantage that the columns of the x∗ ⊗ z∗ are
orthogonal, which makes it easy to find the least squares estimates as the D D matrix
is diagonal, hence easy to invert. Note the z is the matrix for a quadratic. The z∗ is
the corresponding set of orthogonal polynomials, as discussed in Section 5.5.2.
for QD1 defined in (5.21) and (5.19). Then the columns of D1 and D2·1 are orthogonal,
D2 ·1 D1 = 0, (5.41)
and
W = span{columns of (D1 , D2·1 )}. (5.42)
Proof. D2·1 is the matrix of residuals for the least-squares model D2 = D1 β + R, i.e.,
D1 is the x and D2 is the Y in the multivariate regression model (5.28). Equation
(5.41) then follows from part (d) of Proposition 5.1: D2 ·1 D1 = D2 QD1 D1 = 0. For
(5.42),
(D1 D2·1) = (D1 D2) \begin{pmatrix} I_{K_1} & -(D_1′D_1)^{-1}D_1′D_2 \\ 0 & I_{K_2} \end{pmatrix}. (5.43)
The final matrix is invertible, hence by Lemma 5.1, the spans of the columns of
(D1 , D2 ) and (D1 , D2·1 ) are the same.
The next section derives some important matrix decompositions based on the
Gram-Schmidt orthogonalization. Section 5.5.2 applies the orthogonalization to poly-
nomials.
These matrices are upper unitriangular, meaning they are upper triangular (i.e.,
all elements below the diagonal are zero), and all diagonal elements are one. We will
use the notation
g1, g2 ∈ G =⇒ g1 g2 ∈ G, (5.57)
g ∈ G =⇒ g−1 ∈ G. (5.58)
Now suppose the columns of D are linearly independent, which means that all the columns of D∗ are nonzero (see Exercise 5.6.21). Then we can divide each column of D∗ by its norm, so that the resulting vectors are orthonormal:
qi = d_{i·{1:(i−1)}} / ‖d_{i·{1:(i−1)}}‖,   Q = (q1 · · · qK) = D∗ Δ−1, (5.61)
where Δ is the diagonal matrix with the norms on the diagonal. Letting R = ΔB−1 ,
we have that
D = QR, (5.62)
where R is upper triangular with positive diagonal elements, the Δii's. The set of such matrices R is also a group, denoted by
Tq+ = {T | T is q × q, tii > 0 for all i, tij = 0 for i > j}. (5.63)
Hence we have the next result. The uniqueness for M = K is shown in Exercise 5.6.26.
Theorem 5.3 (QR-decomposition). Suppose the M × K matrix D has linearly independent
columns (hence K ≤ M). Then there is a unique decomposition D = QR, where Q, M × K,
has orthonormal columns and R ∈ TK+ .
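In R, the decomposition of Theorem 5.3 can be checked with the built-in qr() function; the matrix below is just an arbitrary example.
D <- cbind(c(1, 1, 1, 1), c(1, 2, 3, 4), c(1, 4, 9, 16))  # 4 x 3, linearly independent columns
qrD <- qr(D)
Q <- qr.Q(qrD)          # orthonormal columns
R <- qr.R(qrD)          # upper triangular
max(abs(Q %*% R - D))   # essentially zero
# Note: qr() may return negative diagonal elements in R; flipping the signs of the
# corresponding columns of Q and rows of R gives the unique form with R in T_K^+.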
Gram-Schmidt also has useful implications for the matrix S = D′D. From (5.43) we have
S = \begin{pmatrix} I_{K_1} & 0 \\ D_2′D_1(D_1′D_1)^{-1} & I_{K_2} \end{pmatrix} \begin{pmatrix} D_1′D_1 & 0 \\ 0 & D_{2·1}′D_{2·1} \end{pmatrix} \begin{pmatrix} I_{K_1} & (D_1′D_1)^{-1}D_1′D_2 \\ 0 & I_{K_2} \end{pmatrix}
  = \begin{pmatrix} I_{K_1} & 0 \\ S_{21}S_{11}^{-1} & I_{K_2} \end{pmatrix} \begin{pmatrix} S_{11} & 0 \\ 0 & S_{22·1} \end{pmatrix} \begin{pmatrix} I_{K_1} & S_{11}^{-1}S_{12} \\ 0 & I_{K_2} \end{pmatrix}, (5.64)
where S_{22·1} = S_{22} − S_{21}S_{11}^{-1}S_{12} as in (3.49). See Exercise 5.6.27. Then using steps as in Gram-Schmidt, we have
S = (B^{-1})′ \begin{pmatrix} S_{11} & 0 & 0 & \cdots & 0 \\ 0 & S_{22·1} & 0 & \cdots & 0 \\ 0 & 0 & S_{33·\{1,2\}} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & S_{KK·\{1:(K−1)\}} \end{pmatrix} B^{-1} = R′R, (5.65)
and R is given by
R_{ij} = \begin{cases} \sqrt{S_{ii·\{1,\dots,i−1\}}} & \text{if } j = i, \\ S_{ij·\{1,\dots,i−1\}} / \sqrt{S_{ii·\{1,\dots,i−1\}}} & \text{if } j > i, \\ 0 & \text{if } j < i. \end{cases} (5.67)
Exercise 5.6.30 shows this decomposition works for any positive definite symmetric
matrix. It is then called the Cholesky decomposition:
Theorem 5.4 (Cholesky decomposition). If S ∈ Sq+ (5.34), then there exists a unique
R ∈ Tq+ such that S = R R.
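In R, chol() computes this decomposition, returning the upper-triangular R with S = R′R; below is a sketch with an arbitrary positive definite matrix.
S <- matrix(c(4, 2, 1,
              2, 3, 1,
              1, 1, 2), 3, 3)   # symmetric and positive definite
R <- chol(S)                    # upper triangular with positive diagonal
max(abs(t(R) %*% R - S))        # essentially zero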
Note that the ages (values 8, 10, 12, 14) are equally spaced. Thus we can just as well
code the ages as (0,1,2,3), so that we actually start with
(d1 d2 d3 d4) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \end{pmatrix}. (5.69)
Because D′D is diagonal, the least-squares estimates of the coefficients are found via
b̂j = y d_{j·{1:(j−1)}} / ‖d_{j·{1:(j−1)}}‖², (5.74)
which here yields
b̂ = (22.6475, 0.4795, −0.0125, 0.0165). (5.75)
These are the coefficients for the cubic model. The coefficients for the quadratic model set b̂4 = 0, but the other three are as for the cubic. Likewise, the linear model has b̂ equalling (22.6475, 0.4795, 0, 0), and the constant model has (22.6475, 0, 0, 0).
In contrast, if one uses the original vectors in either (5.68) or (5.69), one has to recal-
culate the coefficients separately for each model. Using (5.69), we have the following
estimates:
Model        b̂∗1       b̂∗2      b̂∗3       b̂∗4
Cubic        21.1800   1.2550   −0.2600   0.0550
Quadratic    21.1965   0.9965   −0.0125   0
Linear       21.2090   0.9590   0         0
Constant     22.6475   0        0         0
(5.76)
Note that the non-zero values in each column are not equal.
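The orthogonalization itself is easy to carry out in R. The sketch below applies Gram-Schmidt to the design in (5.69); the response vector y in the final comment is only a placeholder for the 1 × 4 row vector used in the example.
D <- cbind(1, 0:3, (0:3)^2, (0:3)^3)   # the design in (5.69), ages coded 0, 1, 2, 3
Dstar <- D
for (j in 2:4) {                       # regress each column on the orthogonalized earlier ones
  prev <- Dstar[, 1:(j - 1), drop = FALSE]
  Dstar[, j] <- D[, j] - prev %*% solve(t(prev) %*% prev, t(prev) %*% D[, j])
}
round(t(Dstar) %*% Dstar, 10)          # diagonal, so the new columns are orthogonal
# For a 1 x 4 response row vector y, the coefficients in (5.74) would be
# bhat <- (y %*% Dstar) / colSums(Dstar^2)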
5.6 Exercises
Exercise 5.6.1. Show that the span in (5.3) is indeed a linear subspace.
Exercise 5.6.2. Verify that the four spans given in (5.5) are the same.
Exercise 5.6.3. Show that for matrices C (M × J) and D (M × K),
span{columns of D} ⊂ span{columns of C} ⇒ D = CA, (5.77)
for some J × K matrix A. [Hint: Each column of D must be a linear combination of
the columns of C.]
[Hint: Any vector in the left-hand space equals DAb for some L × 1 vector b. For
what vector b∗ is DAb = Db∗ ?] (b) Prove Lemma 5.1. [Use part (a) twice, once for
A and once for A−1 .] (c) Show that if the columns of D are linearly independent,
and A is K × K and invertible, then the columns of DA are linearly independent.
[Hint: Suppose the columns of DA are linearly dependent, so that for some b ≠ 0, DAb = 0. Then there is a b∗ ≠ 0 with Db∗ = 0. What is it?]
Exercise 5.6.5. Let d1 , . . . , dK be vectors in R M . (a) Suppose (5.8) holds. Show that
the vectors are linearly dependent. [That is, find b j ’s, not all zero, so that ∑ bi di = 0.]
(b) Suppose the vectors are linearly dependent. Find an index i and constants b j so
that (5.8) holds.
Exercise 5.6.6. Prove Lemma 5.2.
Exercise 5.6.7. Suppose the set of M × 1 vectors γ1 , . . . , γK are nonzero and mutually
orthogonal. Show that they are linearly independent. [Hint: Suppose they are linearly
dependent, and let γi be the vector on the left-hand side in (5.8). Then take γi times
each side of the equation, to arrive at a contradiction.]
Exercise 5.6.8. (a) Prove part (a) of Theorem 5.1. [Hint: Show that the difference of y − ŷ1 and y − ŷ2 is orthogonal to W , as well as in W . Then show that such a vector must be zero.] (b) Prove part (b) of Theorem 5.1. (c) Prove part (c) of Theorem 5.1. (d) Prove part (d) of Theorem 5.1. [Hint: Start by writing ‖y − w‖² = ‖(y − ŷ) − (w − ŷ)‖², then expand. Explain why y − ŷ and w − ŷ are orthogonal.]
Exercise 5.6.9. Derive the normal equations (5.15) by differentiating y − bD 2 with
respect to the bi ’s.
Exercise 5.6.10. This Exercise proves part (a) of Proposition 5.1. Suppose W =
span{columns of D}, where D is M × K and D D is invertible. (a) Show that the
projection matrix PD = D(D D)−1 D as in (5.19) is symmetric and idempotent. (b)
Show that trace(PD ) = K.
Exercise 5.6.11. This Exercise proves part (b) of Proposition 5.1. Suppose P is a
symmetric and idempotent M × M matrix. Find a set of linearly independent vectors
d1 , . . . , dK , where K = trace(P), so that P is the projection matrix for span{d1 , . . . , dK }.
[Hint: Write P = Γ1 Γ1 where Γ1 has orthonormal columns, as in Lemma 3.1. Show
that P is the projection matrix onto the span of the columns of the Γ1 , and use Exercise
5.6.7 to show that those columns are a basis. What is D, then?]
Exercise 5.6.12. (a) Prove part (c) of Proposition 5.1. (b) Prove part (d) of Proposition
5.1. (c) Prove (5.22). (d) Prove (5.23).
Exercise 5.6.13. Consider the projection of y ∈ RK onto span{1K }. (a) Find the
projection. (b) Find the residual. What does it contain? (c) Find the projection matrix
P. What is Q = IK − P? Have we seen it before?
Exercise 5.6.14. Verify the steps in (5.26), detailing which parts of Proposition 3.2 are
used at each step.
Exercise 5.6.15. Show that the equation for d j·1 in (5.44) does follow from the deriva-
tion of D2·1 .
Exercise 5.6.16. Give an argument for why the set of equations in (5.51) follows from
the Gram-Schmidt algorithm.
Exercise 5.6.17. Given that a subspace is a span of a set of vectors, explain how one
would obtain an orthogonal basis for the space.
Exercise 5.6.18. Let Z1 be a M × K matrix with linearly independent columns. (a)
How would you find a M × ( M − K ) matrix Z2 so that (Z1 , Z2 ) is an invertible M × M
matrix, and Z1 Z2 = 0 (i.e., the columns of Z1 are orthogonal to those of Z2 ). [Hint:
Start by using Lemma 5.3 with D1 = Z1 and D2 = I M . (What is the span of the
columns of (Z1 , I M )?) Then use Gram-Schmidt on D2·1 to find a set of vectors to use
as the Z2 . Do you recognize D2·1 ?] (b) Suppose the columns of Z are orthonormal.
How would you modify the Z2 in part (a) so that (Z1 , Z2 ) is an orthogonal matrix?
Exercise 5.6.19. Consider the matrix B( k) defined in (5.55). (a) Show that the inverse
of B( k) is of the same form, but with the − bkj ’s changed to bkj ’s. That is, the inverse
is the K × K matrix C( k) , where
C^{(k)}_{ij} = \begin{cases} 1 & \text{if } i = j, \\ b_{kj} & \text{if } i = k \text{ and } j > k, \\ 0 & \text{otherwise}. \end{cases} (5.79)
Thus C is the inverse of the B in (5.59), where C = C( K −1) · · · C(1) . (b) Show that
C is unitriangular, where the bij ’s are in the upper triangular part, i.e, Cij = bij for
j > i, as in (5.60). (c) The R in (5.62) is then ΔC, where Δ is the diagonal matrix with
diagonal elements being the norms of the columns of D∗ . Show that R is given by
R_{ij} = \begin{cases} \|d_{i·\{1:(i−1)\}}\| & \text{if } j = i, \\ d_{j·\{1:(i−1)\}}′ \, d_{i·\{1:(i−1)\}} / \|d_{i·\{1:(i−1)\}}\| & \text{if } j > i, \\ 0 & \text{if } j < i. \end{cases} (5.80)
Exercise 5.6.20. Verify (5.66).
Exercise 5.6.21. Suppose d1 , . . . , dK are vectors in R M , and d1∗ , . . . , d∗K are the corre-
sponding orthogonal vectors resulting from the Gram-Schmidt algorithm, i.e., d1∗ =
d1 , and for i > 1, d∗i = di·{1: ( i−1)} in (5.49). (a) Show that the d1∗ , . . . , d∗K are linearly
independent if and only if they are all nonzero. Why? [Hint: Recall Exercise 5.6.7.]
(b) Show that d1 , . . . , dK are linearly independent if and only if all the d∗j are nonzero.
Exercise 5.6.25. Show that if d1 , . . . , dK are vectors in R M with K > M, that the di ’s
are linearly dependent. (This fact should make sense, since there cannot be more axes
than there are dimensions in Euclidean space.) [Hint: Use Exercise 5.6.24 on the first
M vectors, then show how d M+1 is a linear combination of them.]
Exercise 5.6.26. Show that the QR decomposition in Theorem 5.3 is unique when
M = K. That is, suppose Q1 and Q2 are K × K orthogonal matrices, and R1 and
R2 are K × K upper triangular matrices with positive diagonals, and Q1 R1 = Q2 R2 .
Show that Q1 = Q2 and R1 = R2 . [Hint: Show that Q ≡ Q2 Q1 = R2 R1−1 ≡ R,
so that the orthogonal matrix Q equals the upper triangular matrix R with positive
diagonals. Show that therefore Q = R = IK .] [Extra credit: Show the uniqueness
when M > K.]
Exercise 5.6.27. Verify (5.64). In particular: (a) Show that
\begin{pmatrix} I_{K_1} & A \\ 0 & I_{K_2} \end{pmatrix}^{-1} = \begin{pmatrix} I_{K_1} & -A \\ 0 & I_{K_2} \end{pmatrix}. (5.81)
(b) Argue that the 0s in the middle matrix on the left-hand side of (5.64) are correct.
(c) Show S22·1 = D2 ·1 D2·1 .
[Hint: Use (5.64) and (5.81).] (c) Use part (b) to show that
where [S−1]22 is the lower-right K2 × K2 block of S−1. Under what condition on the Sij's is [S−1]22 = S22−1?
Exercise 5.6.30. Suppose S ∈ SK+ . Prove Theorem 5.4, i.e., show that we can write
S = R R, where R is upper triangular with positive diagonal elements. [Hint: Use
the spectral decomposition S = GLG from(1.33). Then let D = L1/2 G in (5.62). Are
the columns of this D linearly independent?]
Exercise 5.6.32. Show that the Cholesky decomposition in Theorem 5.4 is unique.
That is, if R1 and R2 are K × K upper triangular matrices with positive diagonals,
that R1 R1 = R2 R2 implies that R1 = R2 . [Hint: Let R = R1 R2−1 , and show that
R R = IK . Then show that this R must be IK , just as in Exercise 5.6.26.]
Exercise 5.6.33. Show that the M × K matrix D has linearly independent columns if and only if D D is invertible. [Hint: If D has linearly independent columns, then D D = R R as in Theorem 5.4, and R is invertible. If the columns are linearly dependent, there is a b ≠ 0 with D Db = 0. Why does that equation imply D D has no inverse?]
Exercise 5.6.34. Suppose D is M × K and C is M × J, K > J, and both matrices have
linearly independent columns. Furthermore, suppose
Thus this space has two bases with differing numbers of elements. (a) Let A be the
J × K matrix such that D = CA, guaranteed by Exercise 5.6.3. Show that the columns
of A are linearly independent. [Hint: Note that Db ≠ 0 for any K × 1 vector b ≠ 0. Hence Ab ≠ 0 for any b ≠ 0.] (b) Use Exercise 5.6.25 to show that such an A cannot
exist. (c) What do you conclude?
Exercise 5.6.35. This exercise is to show that any linear subspace W in R M has a basis.
If W = {0}, the basis is the empty set. So you can assume W has more than just the
zero vector. (a) Suppose d1 , . . . , d J are linearly independent vectors in R M . Show that
d ∈ R M but d ∉ span{d1 , . . . , d J } implies that d1 , . . . , d J , d are linearly independent.
[Hint: If they are not linearly independent, then some linear combination of them
equals zero. The coefficient of d in that linear combination must be nonzero. (Why?)
Thus d must be in the span of the others.] (b) Take d1 ∈ W , d1 ≠ 0. [I guess we are
assuming the Axiom of Choice.] If span{d1 } = W , then we have the basis. If not,
there must be a d2 ∈ W − span{d1 }. If span{d1 , d2 } = W , we are done. Explain
how to continue. (Also, explain why part (a) is important here.) How do you know
this process stops? (c) Argue that any linear subspace has a corresponding projection
matrix.
Exercise 5.6.36. Suppose P and P∗ are projection matrices for the linear subspace
W ⊂ R M . Show that P = P∗ , i.e., the projection matrix is unique to the subspace.
[Hint: Because the projection of any vector is unique, Py = P∗ y for all y. Consider
the columns of I M .]
Exercise 5.6.37. Let D = (D1 , D2 ), where D1 is M × K1 and D2 is M × K2 , and
suppose that the columns of D are linearly independent. Show that
where D2·1 = QD1 D2 . [Hint: Use Lemma 5.3 and the uniqueness in Exercise 5.6.36.]
Exercise 5.6.38. Find the orthogonal polynomial matrix (up to cubic) for the four time
points 1, 2, 4, 5.
Exercise 5.6.39 (Skulls). For the model on skull measurements described in Exercise
4.4.2, replace the polynomial matrix w with that for orthogonal polynomials.
Exercise 5.6.41 (Leprosy). Consider again the model for the leprosy data in Exercise
4.4.6. An alternate expression for x is w∗ ⊗ 110 , where the first column of w∗ rep-
resents the overall mean, the second tells whether the treatment is one of the drugs,
and the third whether the treatment is Drug A, so that
w∗ = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}. (5.90)
Use Gram-Schmidt to orthogonalize the columns of w∗ . How does this matrix differ
from w? How does the model using w∗ differ from that using w?
Chapter 6
6.1 Distribution of β̂
The both-sides model as defined in (4.28) is
where we define
β̂ ∼ Np×l(β, Cx ⊗ Σz). (6.6)
The x and z matrices are known, but Σ R must be estimated. Before tackling that issue,
we will consider fits and residuals.
where for matrix u, Pu = u(u u)−1 u , the projection matrix on the span of the
columns of u, as in (5.19). We estimate the residuals R by subtracting:
R = Y − Px YPz .
= Y−Y (6.9)
and R
The joint distribution of Y is multivariate normal because the collection is a
linear transformation of Y. The means are straightforward:
E [R ] = xβz − xβz = 0.
] = E [Y] − E [Y (6.11)
The covariance matrix of the fit is not hard to obtain, but the covariance of the
residuals, as well as the joint covariance of the fit and residuals, are less obvious since
the residuals are not of the form AYB . Instead, we break the residuals into two parts,
the residuals from the left-hand side of the model, and the residuals of the right-hand
part on the fit of the left-hand part. That is, we write
R̂ = Y − Px Y Pz
  = Y − Px Y + Px Y − Px Y Pz
  = (In − Px)Y + Px Y(Iq − Pz)
  = Qx Y + Px Y Qz
  ≡ R̂1 + R̂2, (6.12)
For the joint covariance of the fit and two residual components, we use the row function to write
(row(Ŷ)  row(R̂1)  row(R̂2)) = row(Y) (Px ⊗ Pz   Qx ⊗ Iq   Px ⊗ Qz). (6.13)
where we use Proposition 5.1, parts (a) and (c), on the projection matrices.
One fly in the ointment is that the fit and residuals are not in general independent, due to the possible non-zero correlation between the fit and R̂2. The fit is independent of R̂1, though. We obtain the distributions of the fit and residuals to be
Ŷ ∼ N(xβz′, Px ⊗ Pz ΣR Pz), (6.15)
and
R̂ ∼ N(0, Qx ⊗ ΣR + Px ⊗ Qz ΣR Qz). (6.16)
Also,
R̂1 = Qx Y ∼ N(0, Qx ⊗ ΣR). (6.17)
R̂1′R̂1 = Y′Qx Y ∼ Wishartq(n − p, ΣR), (6.18)
The diagonals of Σ̂z are chi-squareds:
σ̂zjj ∼ (σzjj/(n − p)) χ²_{n−p}, (6.22)
and the estimate of the variance of β̂ij in (6.7) is
V̂ar(β̂ij) = Cxii σ̂zjj ∼ (1/(n − p)) Cxii σzjj χ²_{n−p}. (6.23)
Applying the definition to the β̂ij's, from (6.6), (6.7), and (6.23),
Z = (β̂ij − βij)/√(Cxii σzjj) ∼ N(0, 1) and U = (n − p) V̂ar(β̂ij)/(Cxii σzjj), (6.26)
6.4 Examples
6.4.1 Mouth sizes
Recall the measurements on the size of mouths of 27 kids at four ages, where there
are 11 girls (Sex=1) and 16 boys (Sex=0) in Section 4.2.1. Here’s the model where the x
matrix compares the boys and girls, and the z matrix specifies orthogonal polynomial
growth curves:
Y = xβz + R
  = \begin{pmatrix} 1_{11} & 1_{11} \\ 1_{16} & 0_{16} \end{pmatrix} \begin{pmatrix} β_0 & β_1 & β_2 & β_3 \\ δ_0 & δ_1 & δ_2 & δ_3 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 & 1 \\ -3 & -1 & 1 & 3 \\ 1 & -1 & -1 & 1 \\ -1 & 3 & -3 & 1 \end{pmatrix} + R, (6.28)
Compare this model to that in (4.13). Here the first row of coefficients contains the boys' coefficients, and the sum of the two rows gives the girls' coefficients, hence the second row is the girls' minus the boys'. We find the estimated coefficients using R, first creating the x and z matrices:
x <- cbind(1,rep(c(1,0),c(11,16)))
z <- cbind(c(1,1,1,1),c(-3,-1,1,3),c(1,-1,-1,1),c(-1,3,-3,1))
estx <- solve(t(x)%*%x,t(x))
estz <- solve(t(z)%*%z,t(z))
betahat <- estx%*%mouths[,1:4]%*%t(estz)
The β̂ is
Intercept Linear Quadratic Cubic
Boys 24.969 0.784 0.203 −0.056 (6.29)
Girls − Boys −2.321 −0.305 −0.214 0.072
Before trying to interpret the coefficients, we would like to estimate their standard errors. We calculate Σ̂R = Y′Qx Y/(n − p), where n = 27 and p = 2, then Σ̂z = (z′z)−1 z′ Σ̂R z (z′z)−1 and Cx = (x′x)−1 of (6.5):
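Continuing the R session above, a sketch of these calculations is as follows (the objects x, z, estz, and mouths are as defined earlier; the names cx and sigmazhat are reused in the standard-error command below).
y <- mouths[, 1:4]
qx <- diag(27) - x %*% solve(t(x) %*% x, t(x))   # Q_x
sigmaRhat <- t(y) %*% qx %*% y / (27 - 2)        # estimate of Sigma_R
sigmazhat <- estz %*% sigmaRhat %*% t(estz)      # estimate of Sigma_z
cx <- solve(t(x) %*% x)                          # C_x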
By (6.7), the standard errors of the β̂ij's are estimated by multiplying the ith diagonal of Cx and jth diagonal of Σ̂z, then taking the square root. We can obtain the matrix of standard errors using the command
se <- sqrt(outer(diag(cx),diag(sigmazhat),"*"))
The t-statistics then divide the estimates by their standard errors, betahat/se:
Standard Errors
Intercept Linear Quadratic Cubic
Boys 0.4860 0.0860 0.1276 0.0887
Girls − Boys 0.7614 0.1347 0.1999 0.1389
(6.32)
t-statistics
Intercept Linear Quadratic Cubic
Boys 51.38 9.12 1.59 −0.63
Girls − Boys −3.05 −2.26 −1.07 0.52
(The function bothsidesmodel of Section A.2 performs these calculations as well.) It
looks like the quadratic and cubic terms are unnecessary, so that straight lines for each
sex fit well. It is clear that the linear term for boys is necessary, and the intercepts for
the boys and girls are different (the two-sided p-value for 3.05 with 25 df is 0.005).
The p-value for the Girls−Boys slope is 0.033, which may or may not be significant,
depending on whether you take into account multiple comparisons.
Y = xβz + R, (6.33)
where, as in (4.15) but using a Kronecker product to express the x matrix,
x = \begin{pmatrix} 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & 1 & 1 & 1 \end{pmatrix} ⊗ 1_4, (6.34)
β = \begin{pmatrix} μ_b & μ_0 & μ_1 & μ_2 \\ α_b & α_0 & α_1 & α_2 \\ β_b & β_0 & β_1 & β_2 \\ γ_b & γ_0 & γ_1 & γ_2 \end{pmatrix} (6.35)
and
z′ = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 \\ 0 & -1 & 0 & 1 \\ 0 & 1 & -2 & 1 \end{pmatrix}. (6.36)
The μ’s are for the overall mean of the groups, the α’s for the drug effects, the β’s
for the depletion effect, and the γ’s for the interactions. The “b” subscript indicates
the before means, and the 0, 1 and 2 subscripts indicate the constant, linear, and
quadratic terms of the growth curves. We first set up the design matrices:
x <- cbind(1,rep(c(-1,1),c(8,8)),rep(c(-1,1,-1,1),c(4,4,4,4)))
x <- cbind(x,x[,2]*x[,3])
z <- cbind(c(1,0,0,0),c(0,1,1,1),c(0,-1,0,1),c(0,1,-2,1))
The rest of the calculations follow as in the previous section, where here the y is in the R matrix histamine. Because of the orthogonal columns of x, the matrix Cx is diagonal; in fact, Cx = (1/16) I4. The coefficients' estimates, standard errors, and t-statistics are below. Note the pattern in the standard errors.
Estimates
Before Intercept Linear Quadratic
Mean 0.0769 0.3858 −0.1366 0.0107
Drug −0.0031 0.1842 −0.0359 −0.0082
Depletion 0.0119 −0.2979 0.1403 −0.0111
Interaction 0.0069 −0.1863 0.0347 0.0078
Standard Errors
Before Intercept Linear Quadratic
Mean 0.0106 0.1020 0.0607 0.0099
(6.37)
Drug 0.0106 0.1020 0.0607 0.0099
Depletion 0.0106 0.1020 0.0607 0.0099
Interaction 0.0106 0.1020 0.0607 0.0099
t-statistics
Before Intercept Linear Quadratic
Mean 7.25 3.78 −2.25 1.09
Drug −0.29 1.81 −0.59 −0.83
Depletion 1.12 −2.92 2.31 −1.13
Interaction 0.65 −1.83 0.57 0.79
Here n = 16 and p = 4, so the degrees of freedom in the t-statistics are 12. It
looks like the quadratic terms are not needed, and that the basic assumption that the
treatment effects for the before measurements are 0 is reasonable. It also looks like the
drug and interaction effects are 0, so that the statistically significant effects are the
intercept and linear effects for the mean and depletion effects. See Figure 4.3 for a
plot of these effects. Chapter 7 deals with testing blocks of β ij ’s equal to zero, which
may be more appropriate for these data.
6.5 Exercises
Exercise 6.5.1. Justify the steps in (6.4) by refering to the appropriate parts of Propo-
sition 3.2.
Exercise 6.5.2. Verify the calculations in (6.12).
Exercise 6.5.3 (Bayesian inference). This exercise extends the Bayesian results in Exer-
cises 3.7.29 and 3.7.30 to the β in multivariate regression. We start with the estimator
β in (6.6), where the z = Iq , hence Σz = Σ R . The model is then
where Σ R , β 0 , and K0 are known. Note that the Σ R matrix appears in the prior,
which makes the posterior tractable. (a) Show that the posterior distribution of β is
multivariate normal, with
E[β | β̂ = b̂] = (x′x + K0)−1((x′x)b̂ + K0 β0), (6.39)
and
Cov[β | β̂ = b̂] = (x′x + K0)−1 ⊗ ΣR. (6.40)
[Hint: Same hint as in Exercise 3.7.30.] (b) Set the prior parameters β 0 = 0 and
K0 = k0 I p for some k0 > 0. Show that
E[β | β̂ = b̂] = (x′x + k0 Ip)−1 x′y. (6.41)
This conditional mean is the ridge regression estimator of β. See Hoerl and Kennard
[1970]. This estimator can be better than the least squares estimator (a little biased, but
much less variable) when x x is nearly singular, that is, one or more of its eigenvalues
are close to zero.
Exercise 6.5.4 (Prostaglandin). Continue with the data described in Exercise 4.4.1.
The data are in the R matrix prostaglandin. Consider the both-sides model (6.1),
where the ten people have the same mean, so that x = 110 , and z contains the cosine
and sine vectors for m = 1, 2 and 3, as in Exercise 4.4.8. (Thus z is 6 × 6.) (a) What is
z? (b) Are the columns of z orthogonal? What are the squared norms of the columns?
(c) Find β̂. (d) Find Σ̂z. (e) Find the (estimated) standard errors of the β̂j's. (f) Find the t-statistics for the β̂j's. (g) Based on the t-statistics, which model appears most
appropriate? Choose from the constant model; the one-cycle model (just m=1); the
model with one cycle and two cycles; the model with one, two and three cycles.
Exercise 6.5.5 (Skulls). This question continues with the data described in Exercise
4.4.2. The data are in the R matrix skulls, obtained from http://lib.stat.cmu.edu/
DASL/Datafiles/EgyptianSkulls.html at DASL Project [1996]. The Y ∼ N (xβ, Im ⊗
Σ R ), where the x represents the orthogonal polynomials over time periods (from
Exercise 5.6.39). (a) Find β. (b) Find (x x)−1 . (c) Find Σ
R . What are the degrees
of freedom? (d) Find the standard errors of the βij ’s. (e) Which of the β ij ’s have
t-statistic larger than 2 in absolute value? (Ignore the first row, since those are the
overall means.) (f) Explain what the parameters with | t| > 2 are measuring. (g)
There is a significant linear trend for which measurements? (h) There is a significant
quadratic trend for which measurements?
Exercise 6.5.6 (Caffeine). This question uses the caffeine data (in the R matrix caffeine)
and the model from Exercise 4.4.4. (a) Fit the model, and find the relevant estimates.
ij ’s. (c) What do you conclude? (Choose as many
(b) Find the t-statistics for the β
conclusions as appropriate from the following: On average the students do about the
same with or without caffeine; on average the students do significantly better with-
out caffeine; on average the students do significantly better with caffeine; the older
students do about the same as the younger ones on average; the older students do
significantly better than the younger ones on average; the older students do signifi-
cantly worse than the younger ones on average; the deleterious effects of caffeine are
not significantly different for the older students than for the younger; the deleterious
effects of caffeine are significantly greater for the older students than for the younger;
the deleterious effects of caffeine are significantly greater for the younger students
than for the older; the quadratic effects are not significant.)
Exercise 6.5.7 (Grades). Consider the grades data in (4.10). Let Y be the 107 × 5
matrix consisting of the variables homework, labs, inclass, midterms, and final. The
x matrix indicates gender. Let the first column of x be 1n . There are 70 women and
37 men in the class, so let the second column have 0.37 for the women and −0.70
for the men. (That way, the columns of x are orthogonal.) For the z, we want the
overall mean score; a contrast between the exams (midterms and final) and other
scores (homework, labs, inclass); a contrast between (homework, labs) and inclass; a
contrast between homework and labs; and a contrast between midterms and final.
Thus
z = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ 2 & 2 & 2 & -3 & -3 \\ 1 & 1 & -2 & 0 & 0 \\ 1 & -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & -1 \end{pmatrix}. (6.42)
Let
β = \begin{pmatrix} β_1 & β_2 & β_3 & β_4 & β_5 \\ δ_1 & δ_2 & δ_3 & δ_4 & δ_5 \end{pmatrix}. (6.43)
(a) Briefly describe what each of the parameters represents. (b) Find β. (c) Find the
standard errors of the β ij ’s. (d) Which of the parameters have |t-statistic| over 2? (e)
Based on the results in part (d), discuss whether there is any difference between the
grade profiles of the men and women.
Chapter 7
Testing a single β ij = 0 is easy using the t-test based on (6.27). It is often informative to test whether a set of β ij's is 0, e.g., a row from β, or a column, or a block, or some other
configuration. In Section 7.1, we present a general test statistic and its χ2 approxima-
tion for testing any set of parameters equals zero. Section 7.2 refines the test statistic
when the set of β ij ’s of interest is a block.
θ̂Ω−1/2 ∼ N(0, IK), (7.3)
hence
θ̂Ω−1θ̂′ ∼ χ²_K. (7.4)
Typically, Ω will have to be estimated, in which case we use
T² ≡ θ̂ Ω̂−1 θ̂′, (7.5)
where under appropriate conditions (e.g., large sample size relative to K), under H0,
T² ≈ χ²_K. (7.6)
θ = ( β2 , β3 , δ1 , δ2 , δ3 ). (7.7)
The estimate is
θ̂ = (0.203, −0.056, −0.305, −0.214, 0.072). (7.8)
To find the estimated covariance matrix Ω, we need to pick off the relevant elements
of the matrix Cx ⊗ Σz using the values in (6.30). In terms of row( β), we are interested
in elements 3, 4, 6, 7, and 8. Continuing the R work from Section 6.4.1, we have
omegahat <- kronecker(cx,sigmazhat)[c(3,4,6,7,8),c(3,4,6,7,8)]
so that
Ω̂ = \begin{pmatrix} 0.01628 & -0.00036 & 0.00314 & -0.01628 & 0.00036 \\ -0.00036 & 0.00786 & -0.00057 & 0.00036 & -0.00786 \\ 0.00314 & -0.00057 & 0.01815 & -0.00770 & 0.00139 \\ -0.01628 & 0.00036 & -0.00770 & 0.03995 & -0.00088 \\ 0.00036 & -0.00786 & 0.00139 & -0.00088 & 0.01930 \end{pmatrix}. (7.9)
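A sketch of the resulting statistic and its chi-squared approximation, continuing in R (the estimate θ̂ is taken from (7.8)):
thetahat <- c(0.203, -0.056, -0.305, -0.214, 0.072)    # from (7.8)
t2 <- c(thetahat %*% solve(omegahat) %*% thetahat)     # T^2 as in (7.5)
pvalue <- 1 - pchisq(t2, 5)                            # chi-squared approximation (7.6), K = 5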
which uses the rows 1, 4, and 5, and the columns 1 and 3. Consider testing the
hypothesis
H0 : β∗ = 0. (7.12)
The corresponding estimate of β∗ has distribution
where Cx∗ and Σz∗ are the appropriate p∗ × p∗ and l ∗ × l ∗ submatrices of, respectively,
Cx and Σz . In the example (7.11),
Cx∗ = \begin{pmatrix} C_{x11} & C_{x14} & C_{x15} \\ C_{x41} & C_{x44} & C_{x45} \\ C_{x51} & C_{x54} & C_{x55} \end{pmatrix}, and Σz∗ = \begin{pmatrix} σ_{z11} & σ_{z13} \\ σ_{z31} & σ_{z33} \end{pmatrix}. (7.14)
Also, letting Σ̂z∗ be the corresponding submatrix of (6.21), we have
ν Σ̂z∗ ∼ Wishartl∗(ν, Σz∗), ν = n − p. (7.15)
We take
θ = row(β∗) and Ω = Cx∗ ⊗ Σz∗, (7.16)
and θ̂, Ω̂ as the obvious estimates. Then using (3.32c) and (3.32d), we have that
T² = row(β̂∗)(Cx∗ ⊗ Σ̂z∗)−1 row(β̂∗)′
   = row(Cx∗−1 β̂∗ Σ̂z∗−1) row(β̂∗)′
   = trace(Cx∗−1 β̂∗ Σ̂z∗−1 β̂∗′), (7.17)
where the final equation results by noting that for p × q matrices A and D,
row(A) row(D)′ = ∑_{i=1}^{p} ∑_{j=1}^{q} aij dij = trace(AD′). (7.18)
F = (ν/p∗)(B/W) ∼ Fp∗,ν. (7.25)
p∗ W
This is the classical problem in multiple (univariate) regression, and this test is the
regular F test.
where
Z = β̂∗ / √Cx∗. (7.27)
(Note that Cx∗ is a scalar.) From (7.19), T2 can be variously written
In this situation, the statistic is called Hotelling’s T 2 . The next proposition shows that
the distribution of the F version of Hotelling’s T 2 in (7.22) is exact, setting p∗ = 1.
The proof of the proposition is in Section 8.4.
Proposition 7.1. Suppose W and Z are independent, W ∼ Wishartl ∗ (ν, Σ) and Z ∼
N1×l ∗ (0, Σ), where ν ≥ l ∗ and Σ is invertible. Then
((ν − l∗ + 1)/l∗) Z W−1 Z′ ∼ Fl∗, ν−l∗+1. (7.29)
if ν > 2. Otherwise, the expected value is +∞. For T² in (7.19) and (7.21), again by independence of B and W,
because E[B] = p∗Σz∗ by (3.73). To finish, we need the following lemma, which extends the results on E[1/χ²_ν].
E[W−1] = (1/(ν − l∗ − 1)) Σ−1. (7.32)
Then
E[T²] = (νp∗/(ν − l∗ − 1)) trace(Σz∗−1 Σz∗) = νp∗l∗/(ν − l∗ − 1), (7.33)
and
((ν − l∗ + 1)/(νp∗l∗)) E[T²] = (ν − l∗ + 1)/(ν − l∗ − 1) = E[Fp∗l∗, ν−l∗+1]. (7.34)
Wilks’ Λ
The statistic is based on the likelihood ratio statistic (see Section 9.3.1), and is defined
as
Λ = |W| / |W + B|. (7.35)
See Exercise 9.6.8. Its distribution under the null hypothesis is the Wilks' Λ distribution, which is a generalization of the beta distribution.
Definition 7.2 (Wilks’ Λ). Suppose W and B are independent Wishart’s, with distributions
as in (7.21). Then Λ has the Wilks’ Λ distribution with dimension l ∗ and degrees of freedom
( p∗ , ν), written
Λ ∼ Wilksl ∗ ( p∗ , ν). (7.36)
Pillai trace
This one is the locally most powerful invariant test. (Don’t worry about what that
means exactly, but it has relatively good power if in the alternative the β∗ is not far
from 0.) The statistic is
trace((W + B)−1 B). (7.38)
Asymptotically, as ν → ∞,
7.3 Examples
In this section we further analyze the mouth size and histamine data. The book Hand
and Taylor [1987] contains a number of other nice examples.
Y = xβz + R
  = \begin{pmatrix} 1_{11} & 1_{11} \\ 1_{16} & 0_{16} \end{pmatrix} \begin{pmatrix} β_0 & β_1 & β_2 & β_3 \\ δ_0 & δ_1 & δ_2 & δ_3 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 & 1 \\ -3 & -1 & 1 & 3 \\ 1 & -1 & -1 & 1 \\ -1 & 3 & -3 & 1 \end{pmatrix} + R. (7.40)
((ν − l∗ + 1)/(νl∗)) T² ∼ Fl∗, ν−l∗+1 ⟶ (22/100) × 16.5075 = 3.632, (7.43)
which, compared to an F4,22, has p-value 0.02. So we reject H0, showing there is a difference in the sexes.
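The conversion from T² to the F statistic and its p-value is easy to do in R; a sketch for the values just quoted:
t2 <- 16.5075
nu <- 25; lstar <- 4
f <- (nu - lstar + 1) / (nu * lstar) * t2      # 3.632
pvalue <- 1 - pf(f, lstar, nu - lstar + 1)     # approximately 0.02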
Next, consider testing that the two curves are actually linear, that is, the quadratic
and cubic terms are 0 for both curves:
H0 : \begin{pmatrix} β_2 & β_3 \\ δ_2 & δ_3 \end{pmatrix} = 0. (7.44)
curves presuming the quadratic and cubic terms are 0. The β∗ = (−2.321, −0.305),
p∗ = 1 and l ∗ = 2. Hotelling’s T 2 = 13.1417, and the F = [(25 − 2 + 1)/(25 × 2)] ×
13.1417 = 6.308. Compared to an F2,25, the p-value is 0.006. Note that this is quite
a bit smaller than the p-value before (0.02), since we have narrowed the focus by
eliminating insignificant terms from the statistic.
Our conclusion that the boys’ and girls’ curves are linear but different appears
reasonable given the Figure 4.1.
Here, T 2 = 41.5661 and F = 3.849, which has degrees of freedom (9,10). The p-value
is 0.024, which does indicate a difference in groups.
Or, if only the second column of Y is of interest, then one might wish to test
H0 : CβD = 0, (7.52)
where C (p∗ × p) and D (l ∗ × l) are fixed matrices that express the desired restrictions.
To test the hypothesis, we use
Cβ̂D′ ∼ N(CβD′, CCxC′ ⊗ DΣzD′), (7.53)
and
DΣ̂zD′ ∼ (1/(n − p)) Wishart(n − p, DΣzD′). (7.54)
Then, assuming the appropriate matrices are invertible, we set
which puts us back at the distributions in (7.21). Thus T 2 or any of the other test
statistics can be used as above. In fact, the hypothesis β∗ in (7.12) and (7.11) can be
written as β∗ = CβD for C and D with 0’s and 1’s in the right places.
7.5 Covariates
A covariate is a variable that is of (possibly) secondary importance, but is recorded
because it might help adjust some estimates in order to make them more precise.
As a simple example, consider the Leprosy example described in Exercise 4.4.6. The
main interest is the effect of the treatments on the after-treatment measurements. The
before-treatment measurements constitute the covariate in this case. These measure-
ments are indications of health of the subjects before treatment, the higher the less
healthy. Because of the randomization, the before measurements have equal popula-
tion means for the three treatments. But even with a good randomization, the sample
means will not be exactly the same for the three groups, so this covariate can be used
to adjust the after comparisons for whatever inequities appear.
The multivariate analysis of variance model we use here, with the before and after measurements as the two Y variables, is
Y = (Yb Ya) = xβ + R
  = \left[ \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & -1 \\ 1 & -2 & 0 \end{pmatrix} ⊗ 1_{10} \right] \begin{pmatrix} μ_b & μ_a \\ α_b & α_a \\ β_b & β_a \end{pmatrix} + R, (7.56)
where
Cov(Y) = I_{30} ⊗ \begin{pmatrix} σ_{bb} & σ_{ba} \\ σ_{ab} & σ_{aa} \end{pmatrix}. (7.57)
The treatment vectors in x are the contrasts Drugs versus Control and Drug A versus
Drug D. The design restriction that the before means are the same for the three groups
is represented by
\begin{pmatrix} α_b \\ β_b \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. (7.58)
Interest centers on α a and β a . (We do not care about the μ’s.)
The estimates and standard errors using the usual calculations for the before and
after α’s and β’s are given in the next table:
Before After
Estimate se Estimate se
(7.59)
Drug vs. Control −1.083 0.605 −2.200 0.784
Drug A vs. Drug D −0.350 1.048 −0.400 1.357
Looking at the after parameters, we see a significant difference in the first contrast,
showing that the drugs appear effective. The second contrast is not significant, hence
we cannot say there is any reason to suspect difference between the drugs. Note,
though, that the first contrast for the before measurements is somewhat close to sig-
nificance, that is, by chance the control group received on average less healthy people.
Thus we wonder whether the significance of this contrast for the after measurements
is at least partly due to this fortuitous randomization.
To take into account the before measurements, we condition on them, that is, con-
sider the conditional distribution of the after measurements given the before mea-
surements (see equation 3.56 with X = Yb ):
Ya | Yb = yb ∼ N30×1 (α + yb γ, I30 × σaa·b ), (7.60)
where γ = σab /σbb , and recalling (7.58),
α + yb γ = E[Ya] − E[Yb]γ + yb γ
  = x \begin{pmatrix} μ_a \\ α_a \\ β_a \end{pmatrix} − x \begin{pmatrix} μ_b \\ 0 \\ 0 \end{pmatrix} γ + yb γ
  = x \begin{pmatrix} μ^∗ \\ α_a \\ β_a \end{pmatrix} + yb γ
  = (x, yb) \begin{pmatrix} μ^∗ \\ α_a \\ β_a \\ γ \end{pmatrix}, (7.61)
where
C(x,yb) = ((x, yb)′(x, yb))−1 (7.63)
as in (6.5).
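A sketch of the covariate adjustment in R follows; it assumes the before and after bacterial counts are in 30 × 1 vectors yb and ya, stacked in the group order of (7.56) — the object names are illustrative.
w <- cbind(1, c(1, 1, -2), c(1, -1, 0))            # the 3 x 3 part of x in (7.56)
xlep <- kronecker(w, rep(1, 10))                    # the 30 x 3 design x
xyb <- cbind(xlep, yb)                              # augment x with the covariate
coefhat <- solve(t(xyb) %*% xyb, t(xyb) %*% ya)     # (mu*, alpha_a*, beta_a*, gamma) as in (7.61)
cxyb <- solve(t(xyb) %*% xyb)                       # C_(x,yb) as in (7.63)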
The next tables have the comparisons of the original estimates and the covariate-
adjusted estimates for α a and β a :
         Original                     Covariate-adjusted
         Estimate   se      t                 Estimate   se      t
αa       −2.20     0.784   −2.81     α∗a      −1.13     0.547   −2.07
βa       −0.40     1.357   −0.29     β∗a      −0.05     0.898   −0.06
(7.64)
The covariates helped the precision of the estimates, lowering the standard errors by
about 30%. Also, the first effect is significant but somewhat less so than without the
covariate. This is due to the control group having somewhat higher before means
than the drug groups.
Whether the original or covariate-adjusted estimates are more precise depends on
a couple of terms. The covariance matrices for the two cases are
Cx ⊗ \begin{pmatrix} σ_{bb} & σ_{ba} \\ σ_{ab} & σ_{aa} \end{pmatrix} and C(x,yb) ⊗ σ_{aa·b}. (7.65)
Because
σ_{aa·b} = σ_{aa} − σ²_{ab}/σ_{bb} = σ_{aa}(1 − ρ²), (7.66)
where ρ is the correlation between the before and after measurements, the covariate-
adjusted estimates are relatively better the higher ρ is. From the original model we
estimate ρ̂ = 0.79, which is quite high. The estimates from the two models are σ̂aa = 36.86 and σ̂aa·b = 16.05. On the other hand,
[Cx ]2:3,2:3 “ < ” [C(x,yb ) ]2:3,2:3, (7.67)
where the subscripts are to indicate we are taking the second and third row/columns
from the matrices. The inequality holds unless x yb = 0, and favors the original
estimates. Here the two matrices (7.67) are
\begin{pmatrix} 0.0167 & 0 \\ 0 & 0.050 \end{pmatrix} and \begin{pmatrix} 0.0186 & 0.0006 \\ 0.0006 & 0.0502 \end{pmatrix}, (7.68)
respectively, which are not very different. Thus in this case, the covariate-adjusted
estimates are better because the gain from the σ terms is much larger than the loss
from the Cx terms. Note that the covariate-adjusted estimate of the variances has lost
a degree of freedom.
The parameters in model (7.56) with the constraint (7.58) can be classified into
three types:
and
x is n × p, Yb is n × q b , Ya is n × q a , μb is p1 × q b , μ a is p1 × q a ,
β b is p2 × q b , β a is p2 × q a , Σbb is q b × q b , and Σ aa is q a × q a . (7.71)
Here, Yb contains the q b covariate variables, the β a contains the parameters of interest,
β b is assumed zero, and μb and μ a are not of interest. For example, the covariates
might consist of a battery of measurements before treatment.
Conditioning on the covariates again, we have that
where Σ_{aa·b} = Σ_{aa} − Σ_{ab}Σ_{bb}^{−1}Σ_{ba}, γ = Σ_{bb}^{−1}Σ_{ba}, and
α = xβ∗ = E[Ya] − E[Yb]γ = x \begin{pmatrix} μ_a \\ β_a \end{pmatrix} − x \begin{pmatrix} μ_b \\ 0 \end{pmatrix} γ = x \begin{pmatrix} μ_a − μ_bγ \\ β_a \end{pmatrix}. (7.73)
7.5.1 Pseudo-covariates
The covariates considered above were “real” in the sense that they were collected
purposely in such a way that their distribution was independent of the ANOVA
groupings. At times, one finds variables that act like covariates after transforming
the Y’s. Continue the mouth size example from Section 7.3.1, with cubic model over
time as in (7.40). We will assume that the cubic and quadratic terms are zero, as
testing the hypothesis (7.44) suggests. Then β is of the form ( β a β b ) = ( β a 0), where
β_a = \begin{pmatrix} β_0 & β_1 \\ δ_0 & δ_1 \end{pmatrix}, and β_b = \begin{pmatrix} β_2 & β_3 \\ δ_2 & δ_3 \end{pmatrix}, (7.75)
as in (7.69) but with no μ a or μb . (Plus the columns are in opposite order.) But the z
is in the way. Note though that z is square, and invertible, so that we can use the 1-1
function of Y, Y(z′)−1:
Y^(z) ≡ Y(z′)−1 = x(β_a 0) + R^(z), (7.76)
where
R^(z) ∼ Nn×q(0, In ⊗ Σz), Σz = z−1 ΣR (z′)−1. (7.77)
(This Σz is the same as the one in (6.5) because z itself is invertible.)
Now we are back in the covariate case (7.69), so we have Y^(z) = (Ya^(z) Yb^(z)), and have the conditional linear model
[Ya^(z) | Yb^(z) = yb^(z)] = (x  yb^(z)) \begin{pmatrix} β_a \\ γ \end{pmatrix} + Ra^(z), (7.78)
where γ = Σ_{z,bb}^{−1} Σ_{z,ba}, and Ra^(z) ∼ Nn×qa(0, In ⊗ Σ_{z,aa·b}), and estimation proceeds as in the general case (7.69). See Section 9.5 for the calculations.
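For the mouth-size data, this can be sketched in R, reusing the x, z, and mouths objects from Section 6.4.1 (a sketch only, following the partition in (7.75)):
yz <- mouths[, 1:4] %*% solve(t(z))   # Y^(z) = Y (z')^{-1}
ya <- yz[, 1:2]                        # constant and linear columns (the beta_a part)
yb <- yz[, 3:4]                        # quadratic and cubic columns, used as pseudo-covariates
xyb <- cbind(x, yb)
coefhat <- solve(t(xyb) %*% xyb, t(xyb) %*% ya)   # first two rows estimate beta_a, last two gamma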
When z is not square, so that z is q × k for k < q, pseudo-covariates can be created
by filling out z, that is, find a q × (q − k) matrix z2 so that z z2 = 0 and (z z2 ) is
invertible. (Such z2 can always be found. See Exercise 5.6.18.) Then the model is
Y = xβz′ + R
  = x (β 0) \begin{pmatrix} z′ \\ z_2′ \end{pmatrix} + R, (7.79)
(We actually do not use the normality assumption on R in what follows.) This ap-
proach is found in Hasegawa [1986] and Gupta and Kabe [2000].
The observed Y is dependent on the value of other variables represented by x and
z. The objective is to use the observed data to predict a new variable Y New based
on its x New and z New . For example, an insurance company may have data on Y, the
payouts the company has made to a number of people, and (x, z), the basic data (age,
sex, overall health, etc.) on these same people. But the company is really wondering
whether to insure new people, whose (x New , z New ) they know, but Y New they must
predict. The prediction, Ŷ^New, is a function of (x^New, z^New) and the observed Y. A good predictor has Ŷ^New close to Y^New.
The model (7.81), with β being p × l, is the largest model under consideration, and
will be called the “big” model. The submodels we will look at will be
Y = x∗ β∗ z∗ + R, (7.82)
where R New has the same distribution as R, and Y New and Y are independent. It is
perfectly reasonable to want to predict Y New ’s for different x and z than in the data,
but the analysis is a little easier if they are the same, plus it is a good starting point.
For a given submodel (7.82), we predict Y^New by
Ŷ∗ = x∗β̂∗z∗, (7.84)
where β̂∗ is the usual estimate based on the smaller model (7.82) and the observed Y,
Ŷ∗ = Px∗ Y Pz∗. (7.86)
To assess how well a model predicts, we use the sum of squares between Y^New and Ŷ∗:
PredSS∗ = ‖Y^New − Ŷ∗‖²
        = ∑_{i=1}^{n} ∑_{j=1}^{q} (y_{ij}^New − ŷ∗_{ij})²
        = trace((Y^New − Ŷ∗)′(Y^New − Ŷ∗)). (7.87)
Of course, we cannot calculate PredSS∗ because the Y^New is not observed (if it were, we wouldn't need to predict it), so instead we look at the expected value:
EPredSS∗ = E[trace((Y^New − Ŷ∗)′(Y^New − Ŷ∗))]. (7.88)
The expected value is taken assuming the big model, (7.81) and (7.83). We cannot
observe EPredSS ∗ either, because it is a function of the unknown parameters β and
Σ R , but we can estimate it. So the program is to
1. Estimate EPredSS ∗ for each submodel (7.82);
2. Find the submodel(s) with the smallest estimated EPredSS ∗ ’s.
Whether prediction is the ultimate goal or not, the above is a popular way to choose
a submodel.
We will discuss Mallows’ C p as a method to estimate EPredSS ∗ . Cross-validation
is another popular method that will come up in classification, Chapter 11. The temp-
tation is to use the observed Y in place of Y New to estimate EPredSS ∗ , that is, to use
the observed residual sum of squares
ResidSS∗ = trace((Y − Ŷ∗)′(Y − Ŷ∗)). (7.89)
This estimate is likely to be too optimistic, because the prediction Ŷ∗ is based on the observed Y. The ResidSS∗ is estimating its expected value,
EResidSS∗ = E[trace((Y − Ŷ∗)′(Y − Ŷ∗))]. (7.90)
We will use
For the mean, note that by (7.81) and (7.83), Y and Y New both have mean xβz , hence,
E [Y New − Px∗ YPz∗ ] = E [Y − Px∗ YPz∗ ] = xβz − Px∗ xβz Pz∗ ≡ Δ. (7.93)
The covariance term for EPredSS ∗ is easy because Y New and Y are independent.
Using (6.15),
Thus
Applying (7.91) with (7.93), (7.95), and (7.97), we have the following result.
Lemma 7.2. For Δ given in (7.93),

EPredSS∗ = trace(Δ′Δ) + n trace(Σ_R) + p∗ trace(Pz∗ Σ_R), and
EResidSS∗ = trace(Δ′Δ) + n trace(Σ_R) − p∗ trace(Pz∗ Σ_R).

Note that both quantities can be decomposed into a bias part, Δ′Δ, and a covariance
part. They have the same bias, but the residuals underestimate the prediction
error by having a “− p∗” in place of the “+ p∗”:

EPredSS∗ − EResidSS∗ = 2 p∗ trace(Pz∗ Σ_R).    (7.99)
So to use the residuals to estimate the prediction error unbiasedly, we need to add
an unbiased estimate of the term in (7.99). That is easy, because we have an unbiased
estimator of Σ R .
Proposition 7.2. An unbiased estimator of EPredSS∗ is Mallows' Cp statistic,

Cp∗ = ResidSS∗ + 2 p∗ trace(Pz∗ Σ̂_R),

where

Σ̂_R = (1/(n − p)) Y′ Qx Y.    (7.101)
Some comments:
• The ResidSS∗ is calculated from the submodel, while the Σ̂_R is calculated from
the big model.
• The estimate of prediction error takes the residual error, and adds a penalty
depending (partly) on the number of parameters in the submodel. So the larger
the submodel, generally, the smaller the residuals and the larger the penalty. A
good model balances the two.
y <- mouths[,1:4]                     # the four mouth-size measurements
xstar <- x[,ii]                       # columns of x used in the submodel
zstar <- z[,jj]                       # columns of z used in the submodel
pzstar <- zstar%*%solve(t(zstar)%*%zstar,t(zstar))            # projection matrix P_z*
yhat <- xstar%*%solve(t(xstar)%*%xstar,t(xstar))%*%y%*%pzstar # fitted values P_x* Y P_z*
residss <- sum((y-yhat)^2)            # ResidSS*
pstar <- length(ii)
penalty <- 2*pstar*tr(sigmaRhat%*%pzstar)   # tr computes the trace of a matrix
cp <- residss + penalty
So, for example, the full model takes ii <− 1:2 and jj <− 1:4, while the model with
no difference between boys and girls, and a quadratic for the growth curve, would
take ii <− 1 and jj <− 1:3.
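The same calculation can be looped over all eight (p∗, l∗) submodels to reproduce the
table below. This is only a sketch: it assumes the objects mouths, x, z, and sigmaRhat
from the snippet above are already in place, and it includes a one-line version of tr()
in case that function is not already defined.

tr <- function(a) sum(diag(a))        # trace, as used above
y <- mouths[,1:4]
results <- NULL
for(pstar in 1:2) {
  for(lstar in 1:4) {
    ii <- 1:pstar; jj <- 1:lstar      # ii, jj select the columns of x and z for the submodel
    xstar <- x[,ii,drop=FALSE]
    zstar <- z[,jj,drop=FALSE]
    pzstar <- zstar%*%solve(t(zstar)%*%zstar,t(zstar))
    yhat <- xstar%*%solve(t(xstar)%*%xstar,t(xstar))%*%y%*%pzstar
    residss <- sum((y-yhat)^2)
    penalty <- 2*pstar*tr(sigmaRhat%*%pzstar)
    results <- rbind(results, c(pstar, lstar, residss, penalty, residss+penalty))
  }
}
colnames(results) <- c("pstar","lstar","ResidSS","Penalty","Cp")
results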
Here are the results for the eight models of interest:
p∗ l∗ ResidSS ∗ Penalty Cp
1 1 917.7 30.2 947.9
1 2 682.3 35.0 717.3
1 3 680.9 37.0 717.9
1 4 680.5 42.1 722.6 (7.103)
2 1 777.2 60.5 837.7
2 2 529.8 69.9 599.7
2 3 527.1 74.1 601.2
2 4 526.0 84.2 610.2
Note that in general, the larger the model, the smaller the ResidSS ∗ but the larger
the penalty. The Cp statistic aims to balance fit and complexity. The model
with the lowest C p is the (2, 2) model, which fits separate linear growth curves to the
boys and girls. We arrived at this model in Section 7.3.1 as well. The (2, 3) model
is essentially as good, but is a little more complicated. Generally, one looks for the
model with the smallest C p , but if several models have approximately the same low
C p , one chooses the simplest.
7.7 Exercises
Exercise 7.7.1. Show that when p∗ = l ∗ = 1, the T 2 in (7.17) equals the square of the
t-statistic in (6.27), assuming β ij = 0 there.
Exercise 7.7.2. Verify the equalities in (7.28).
Exercise 7.7.3. If A ∼ Gamma(α, λ) and B ∼ Gamma( β, λ), and A and B are indepen-
dent, then U = A/( A + B ) is distributed Beta(α, β). Show that when l ∗ = 1, Wilks’ Λ
(Definition 7.2) is Beta(α, β), and give the parameters in terms of p∗ and ν. [Hint: See
Exercise 3.7.8 for the Gamma distribution, whose pdf is given in (3.81). Also, the Beta
pdf is found in Exercise 2.7.13, in equation (2.95), though you do not need it here.]
Exercise 7.7.5. Show that E [1/χ2ν ] = 1/(ν − 2) if ν ≥ 2, as used in (7.30). The pdf of
the chi-square is given in (3.80).
Exercise 7.7.6. Find the matrices C and D so that the hypothesis in (7.52) is the same
as that in (7.12), where β is 5 × 4.
Exercise 7.7.7. Consider the model in (7.69) and (7.70). Let W = y Qx y. Show that
W aa·b = ya Qx∗ y a . Thus one has the same estimate of Σ aa·b in the original model and
in the covariate-adjusted model. [Hint: Write out Waa·b the usual way, where the
blocks in W are Ya Qx Ya , etc. Note that the answer is a function of Qx y a and Qx yb .
Then use (5.89) with D1 = x and D2 = yb .]
Exercise 7.7.8. Prove (7.91). [Hint: Use (7.18) and Exercise 2.7.11 on row(U).]
Exercise 7.7.9. Show that Δ in (7.93) is zero in the big model, i.e, x∗ = x and z∗ = z.
Exercise 7.7.10. Verify the second equality in (7.97). [Hint: Note that trace(QΣQ) =
trace(ΣQ) if Q is idempotent. Why?]
Exercise 7.7.11 (Mouth sizes). In the mouth size data in Section 7.1.1, there are n G =
11 girls and n B = 16 boys, and q = 4 measurements on each. Thus Y is 27 × 4.
Assume that
Y ∼ N (xβ, In ⊗ Σ R ), (7.104)
where this time
x = ( 1_11  0_11
      0_16  1_16 ),    (7.105)

and

β = ( μ_G )   ( μ_G1  μ_G2  μ_G3  μ_G4
    ( μ_B ) = ( μ_B1  μ_B2  μ_B3  μ_B4 ).    (7.106)
The sample means for the two groups are μ̂_G and μ̂_B. Consider testing

H0 : μ_G = μ_B.    (7.107)

(a) For what constant c is

B ≡ (μ̂_G − μ̂_B)/c ∼ N(0, Σ_R)?    (7.108)
(b) The unbiased estimate of Σ_R is Σ̂_R. Then

W = ν Σ̂_R ∼ Wishart(ν, Σ_R).    (7.109)
What is ν? (c) What is the value of Hotelling’s T 2 ? (d) Find the constant d and the
degrees of freedom a, b so that F = d T 2 ∼ Fa,b. (e) What is the value of F? What is
the resulting p-value? (f) What do you conclude? (g) Compare these results to those
in Section 7.1.1, equations (7.10) and below.
Exercise 7.7.12 (Skulls). This question continues Exercise 6.5.5 on Egyptian skulls. (a)
Consider testing that there is no difference among the five time periods for all four
measurements at once. What are p∗ , l ∗ , ν and T 2 in (7.22) for this hypothesis? What
is the F and its degrees of freedom? What is the p-value? What do you conclude?
(b) Now consider testing whether there is a non-linear effect on skull size over time,
that is, test whether the last three rows of the β matrix are all zero. What are l ∗ , ν, the
F-statistic obtained from T 2 , the degrees of freedom, and the p-value? What do you
conclude? (e) Finally, consider testing whether there is a linear effect on skull size
over time. Find the F-statistic obtained from T 2 . What do you conclude?
for θi = 2πi/6, i = 1, . . . , 6. (a) In Exercise 4.4.8, the one-cycle wave is given by the
equation A + B cos(θ + C ). The null hypothesis that the model does not include that
wave is expressed by setting B = 0. What does this hypothesis translate to in terms
of the β ij ’s? (b) Test whether the one-cycle wave is in the model. What is p∗ ? (c)
Test whether the two-cycle wave is in the model. What is p∗ ? (d) Test whether the
three-cycle wave is in the model. What is p∗ ? (e) Test whether just the one-cycle wave
needs to be in the model. (I.e., test whether the two- and three-cycle waves have zero
coefficients.) (f) Using the results from parts (b) through (e), choose the best model
among the models with (1) No waves; (2) Just the one-cycle wave; (3) Just the one-
and two-cycle waves; (4) The one-, two-, and three-cycle waves. (g) Use Mallows’ C p
to choose among the four models listed in part (f).
Exercise 7.7.14 (Histamine in dogs). Consider the model for the histamine in dogs
example in (4.15), i.e.,
Y = xβ + R = [ ( 1 −1 −1  1
                 1 −1  1 −1
                 1  1 −1 −1
                 1  1  1  1 ) ⊗ 1_4 ] ( μ_b  μ_1  μ_2  μ_3
                                        α_b  α_1  α_2  α_3
                                        β_b  β_1  β_2  β_3
                                        γ_b  γ_1  γ_2  γ_3 ) + R.    (7.111)
For the following two null hypotheses, specify which parameters are set to zero, then
find p∗ , l ∗ , ν, the T 2 and its F version, the degrees of freedom for the F, the p-value,
and whether you accept or reject. Interpret the finding in terms of the groups and
variables. (a) The four groups have equal means (for all four time points). Compare
the results to that for the hypothesis in (7.46). (b) The four groups have equal before
means. (c) Now consider testing the null hypothesis that the after means are equal,
but using the before measurements as a covariate. (So we assume that αb = β b =
γb = 0.) What are the dimensions of the resulting Ya and the x matrix augmented
with the covariate? What are p∗ , l ∗ , ν, and the degrees of freedom in the F for testing
the null hypothesis. (d) The x x from the original model (not using the covariates)
is 16 × I4 , so that the [(x x)−1 ] ∗ = (1/16)I3 . Compare the diagonals (i.e., 1/16)
to the diagonals of the analogous matrix in the model using the covariate. How
much smaller or larger, percentagewise, are the covariate-based diagonals than the
original? (e) The diagonals of the Σ ∗z in the original model are 0.4280, 0.1511, and
0.0479. Compare these diagonals to the diagonals of the analogous matrix in the
model using the covariate. How much smaller or larger, percentagewise, are the
covariate-based diagonals than the original? (f) Find the T 2 , the F statistic, and the
p-value for testing the hypothesis using the covariate. What do you conclude? How
does this result compare to that without the covariates?
Exercise 7.7.15 (Histamine, cont.). Continue the previous question, using as a starting
point the model with the before measurements as the covariate, so that
Y∗ = x∗ ( μ∗_1  μ∗_2  μ∗_3
          α∗_1  α∗_2  α∗_3
          β∗_1  β∗_2  β∗_3
          γ∗_1  γ∗_2  γ∗_3
          δ_1   δ_2   δ_3 ) z′ + R∗,    (7.112)
where Y∗ has just the after measurements, x∗ is the x in (7.111) augmented with
the before measurements, and z represents orthogonal polynomials for the after time
points,

z = ( 1 −1  1
      1  0 −2
      1  1  1 ).    (7.113)
Now consider the equivalent model resulting from multiplying both sides of the
equation on the right by (z )−1 . (a) Find the estimates and standard errors for
the quadratic terms, (μ3∗ , α3∗ , β∗2 , γ3∗ ). Test the null hypothesis that (μ3∗ , α3∗ , β∗3 , γ3∗ ) =
(0, 0, 0, 0). What is ν? What is the p-value? Do you reject this null? (The answer
should be no.) (b) Now starting with the model from part (a), use the vector of
quadratic terms as the covariate. Find the estimates and standard errors of the relevant
parameters, i.e.,

( μ∗_1  μ∗_2
  α∗_1  α∗_2
  β∗_1  β∗_2
  γ∗_1  γ∗_2 ).    (7.114)
(c) Use Hotelling’s T 2 to test the interaction terms are zero, i.e., that (γ1∗ , γ2∗ ) = (0, 0).
(What are l ∗ and ν?) Also, do the t-tests for the individual parameters. What do you
conclude?
Exercise 7.7.16 (Caffeine). This question uses the data on the effects of caffeine on
memory described in Exercise 4.4.4. The model is as in (4.35), with x as described
there, and
z = ( 1 −1
      1  1 ).    (7.115)
The goal of this problem is to use Mallows’ C p to find a good model, choosing among
the constant, linear and quadratic models for x, and the “overall mean" and “overall
mean + difference models" for the scores. Thus there are six models. (a) For each
of the 6 models, find the p∗ , l ∗ , residual sum of squares, penalty, and C p values. (b)
Which model is best in terms of C p ? (c) Find the estimate of β∗ for the best model.
(d) What do you conclude?
Chapter 8

Technical Results
This chapter contains a number of results useful for linear models and other models,
including the densities of the multivariate normal and Wishart. We collect them here
so as not to interrupt the flow of the narrative.
y = γ d    (8.2)

for some constant γ.
If U and V are random variables, with E [|UV |] < ∞, then the Cauchy-Schwarz
inequality becomes
E [UV ]2 ≤ E [U 2 ] E [V 2 ], (8.5)
with equality if and only if V is zero with probability one, or
U = bV (8.6)
for constant b = E [UV ] /E [V 2 ]. See Exercise 8.8.2. The next result is well-known in
statistics.
Corollary 8.1 (Correlation inequality). Suppose Y and X are random variables with finite
positive variances. Then
−1 ≤ Corr [Y, X ] ≤ 1, (8.7)
with equality if and only if, for some constants a and b,
Y = a + bX. (8.8)
from which (8.7) follows. Then (8.8) follows from (8.6), with b = Cov[Y, X ] /Var [ X ]
and a = E [Y ] − bE [ X ], so that (8.8) is the least squares fit of X to Y.
This inequality for the sample correlation coefficient of n × 1 vectors x and y fol-
lows either by using Lemma 8.1 on Hn y and Hn x, where Hn is the centering matrix
(1.12), or by using Corollary 8.1 with X and Y having the empirical distributions
given by x and y, respectively, i.e.,
P[X = x] = (1/n) #{x_i = x} and P[Y = y] = (1/n) #{y_i = y}.    (8.10)
The next result also follows from Cauchy-Schwarz. It will be useful for Hotelling’s
T 2 in Section 8.4.1, and for canonical correlations in Section 13.3.
Corollary 8.2. Suppose y and d are 1 × K vectors, where ‖y‖ = 1. Then (y d′)² ≤ ‖d‖², (8.11)
with equality if and only if

y = ± d/‖d‖.    (8.12)
Proposition 8.1. Consider the situation above, where Σ XX is invertible and ν ≥ p. Then
W_XY | W_XX = w_xx ∼ N(w_xx Σ_XX^{-1} Σ_XY, w_xx ⊗ Σ_YY·X),    (8.17)

and

W_XX ∼ Wishart_p(ν, Σ_XX).    (8.18)
Proof. The final equation is just the marginal of a Wishart, as in Section 3.6. By
Definition 3.6 of the Wishart,
(The α = 0 because the means of X and Y are zero.) Note that (8.20) is the both-
sides model (6.1), with z = Iq and Σ R = ΣYY · X . Thus by Theorem 6.1 and the
plug-in property (2.62) of conditional distributions, β = (X X)−1 X Y and Y QX Y are
conditionally independent given X = x,
and
Y QX Y | X = x ∼ Wishartq (n − p, ΣYY · X ). (8.22)
The conditional distribution in (8.22) does not depend on x, hence Y′Q_X Y is
(unconditionally) independent of the pair (X, β̂), as in (2.65) and therebelow, hence
hence
X′Y = (X′X) β̂ | X′X = x′x ∼ N((x′x) β, (x′x) ⊗ Σ_YY·X).    (8.25)
Translating to W using (8.19), noting that Y QX Y = WYY · X , we have that (8.23) is
(8.15), (8.22) is (8.16), and (8.25) is (8.17).
For any q × q orthogonal matrix Γ, ΓUΓ has the same distribution as U, hence in
particular
E [U−1 ] = E [(ΓUΓ )−1 ] = ΓE [U−1 ] Γ . (8.27)
Exercise 8.8.6 shows that any symmetric q × q matrix A for which A = ΓAΓ for all
orthogonal Γ must be of the form a11 Iq . Thus
Using (5.85),

E[(U^{-1})_{11}] = E[1/U_{11·(2:l∗)}] = E[1/χ²_{ν−l∗+1}] = 1/(ν − l∗ − 1).    (8.29)

Thus

E[U^{-1}] = (1/(ν − l∗ − 1)) I_{l∗}.    (8.30)
Next take W ∼ Wishartq (ν, Σ), with Σ invertible. Then, W = D Σ1/2 UΣ1/2 , and
Write

Z W^{-1} Z′ = ‖Z‖² · (Z W^{-1} Z′ / ‖Z‖²),    (8.32)

and consider the conditional distribution

(Z W^{-1} Z′ / ‖Z‖²) | Z = z.    (8.34)
Because Z and W are independent, we can use the plugin formula (2.63), so that
(Z W^{-1} Z′ / ‖Z‖²) | Z = z =_D z W^{-1} z′ / ‖z‖² = g_1 W^{-1} g_1′,    (8.35)
where g_1 = z/‖z‖. Note that on the right-hand side we have the unconditional
distribution for W. Let G be any l ∗ × l ∗ orthogonal matrix with g1 as its first row.
(Exercise 5.6.18 guarantees there is one.) Then
by (5.85).
Note that the distribution of U, hence [U−1 ]11 , does not depend on z, which means
that
Z W^{-1} Z′ / ‖Z‖² is independent of Z.    (8.38)
Furthermore, by (8.16), where p = l∗ − 1 and q = 1,

Z W^{-1} Z′ = ‖Z‖² / U_{11·(2:l∗)} =_D χ²_{l∗} / χ²_{ν−l∗+1},    (8.40)
where the two χ2 ’s are independent. Then (7.29) follows from Definition 7.1 for the
F.
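A quick Monte Carlo sketch can check this distributional claim numerically. The values of
l∗, ν, and Σ below are arbitrary choices for illustration, not anything from the text.

set.seed(1)
lstar <- 3; nu <- 10
Sigma <- diag(lstar) + 0.5                 # an arbitrary positive definite Sigma
A <- chol(Sigma)
fstat <- replicate(10000, {
  z <- rnorm(lstar) %*% A                  # Z ~ N(0, Sigma), a 1 x lstar row vector
  x <- matrix(rnorm(nu*lstar), nu, lstar) %*% A
  w <- t(x) %*% x                          # W ~ Wishart(nu, Sigma), independent of Z
  t2 <- drop(nu * z %*% solve(w) %*% t(z)) # Hotelling's T^2
  ((nu - lstar + 1)/(nu*lstar)) * t2       # should be F(lstar, nu - lstar + 1)
})
qqplot(qf(ppoints(10000), lstar, nu - lstar + 1), fstat)  # points should hug the line
abline(0, 1)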
T_a = Z a / √(a′ W a / ν) ∼ t_ν,    (8.42)

or, since t²_ν = F_{1,ν},

T²_a = (Z a)² / (a′ W a / ν) ∼ F_{1,ν}.    (8.43)
For any a, we can do a regular F test. The projection pursuit approach is to find the
a that gives the most significant result. That is, we wish to find
Then
T² = max_{b ≠ 0} ν (V b)² / (b′ b), where V = Z W^{-1/2}.    (8.46)
Letting g = b/‖b‖, so that ‖g‖ = 1, Corollary 8.2 of Cauchy-Schwarz shows that (see
Exercise 8.8.9)

T² = ν max_{g : ‖g‖ = 1} (V g)² = ν V V′ = ν Z W^{-1} Z′,    (8.47)
The density of Z is

f(z | 0, I_N) = (2π)^{-N/2} e^{-(z_1² + ··· + z_N²)/2} = (2π)^{-N/2} e^{-z z′/2},    (8.52)
so that

f(y | μ, Ω) = (2π)^{-N/2} abs(|(A′)^{-1}|) e^{-(y−μ)(A′)^{-1} A^{-1} (y−μ)′/2}
            = (2π)^{-N/2} |AA′|^{-1/2} e^{-(y−μ)(AA′)^{-1}(y−μ)′/2},    (8.53)
from which (8.48) follows.
When Z can be written as a matrix with a Kronecker product for its covariance
matrix, as is often the case for us, the pdf can be compactified.
Corollary 8.3. Suppose Y ∼ Nn×q (M, C ⊗ Σ), where C (n × n) and Σ (q × q) are positive
definite. Then
f(y | M, C, Σ) = (2π)^{-nq/2} |C|^{-q/2} |Σ|^{-n/2} e^{-trace(C^{-1}(y−M) Σ^{-1} (y−M)′)/2}.    (8.54)
(2π )nq/2 |C| q/2 | Σ| n/2
See Exercise 8.8.15 for the proof.
Next, suppose Y ∼ Nν×q (0, Iν ⊗ Σ), where Σ is invertible. Let A be the matrix
such that
Σ = A A, where A ∈ Tq+ of (5.63), (8.62)
Q = D ΓQ. (8.64)
Proof. From above, we see that Z and Y = ZA have the same Q. The distribution of Z
does not depend on Σ, hence neither does the distribution of Q, proving part (ii). For
part (iii), consider ΓY, which has the same distribution as Y. We have ΓY = (ΓQ)V.
Since ΓQ also has orthonormal columns, the uniqueness of the QR decomposition
implies that ΓQ is the “Q” for ΓY. Thus Q and ΓQ have the same distribution.
Proving the independence result of part (i) takes some extra machinery from math-
ematical statistics. See, e.g., Lehmann and Casella [1998]. Rather than providing all
the details, we outline how one can go about the proof. First, V can be shown to be a
complete sufficient statistic for the model Y ∼ N (0, Iν ⊗ Σ). Basu’s Lemma says that
any statistic whose distribution does not depend on the parameter, in this case Σ, is
independent of the complete sufficient statistic. Thus by part (ii), Q is independent
of V.
If n = q, the Q is an orthogonal matrix, and its distribution has the Haar probabil-
ity measure, or uniform distribution, over Oν . It is the only probability distribution
that does have the above invariance property, although proving the fact is beyond
this book. See Halmos [1950]. Thus one can generate a random q × q orthogonal
matrix by first generating an q × q matrix of independent N (0, 1)’s, then performing
Gram-Schmidt orthogonalization on the columns, normalizing the results so that the
columns have norm 1.
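Here is a small R sketch of that recipe. It uses R's qr() in place of explicit Gram-Schmidt;
the sign adjustment makes the triangular factor's diagonal positive, which matches the QR
convention of Theorem 5.3, so that the resulting matrix is (to the best of this sketch's
assumptions) uniformly distributed over the orthogonal group.

rand.orthogonal <- function(q) {
  z <- matrix(rnorm(q*q), q, q)            # q x q matrix of independent N(0,1)'s
  qrz <- qr(z)
  qr.Q(qrz) %*% diag(sign(diag(qr.R(qrz))), q)   # flip signs so "R" has positive diagonal
}
g <- rand.orthogonal(5)
round(t(g) %*% g, 10)                      # should be the 5 x 5 identity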
f_k(u) = (1/(Γ(k/2) 2^{(k/2)−1})) u^{k−1} e^{−u²/2}.    (8.65)
f_R(r) = (1/c(ν, q)) r_11^{ν−1} r_22^{ν−2} ··· r_qq^{ν−q} e^{−trace(r′r)/2},    (8.66)

where

c(ν, q) = π^{q(q−1)/4} 2^{(νq/2)−q} ∏_{j=1}^{q} Γ((ν − j + 1)/2).    (8.67)
f_V(v | Σ) = (1/c(ν, q)) [ v_11^{ν−1} v_22^{ν−2} ··· v_qq^{ν−q}
             / (a_11^{ν−1} a_22^{ν−2} ··· a_qq^{ν−q} · a_11 a_22² ··· a_qq^q) ]
             e^{−trace((A′)^{-1} v′ v A^{-1})/2}
           = (1/(c(ν, q) |Σ|^{ν/2})) v_11^{ν−1} v_22^{ν−2} ··· v_qq^{ν−q} e^{−trace(Σ^{-1} v′v)/2},    (8.69)
where
d(ν, q) = π^{q(q−1)/4} 2^{νq/2} ∏_{j=1}^{q} Γ((ν − j + 1)/2).    (8.72)
8.8 Exercises
Exercise 8.8.1. Suppose y is 1 × K, D is M × K, and DD′ is invertible. Let ŷ be
the projection of y onto the span of the rows of D, so that ŷ = γ̂D, where γ̂ =
yD′(DD′)^{-1} is the least-squares estimate as in (5.17). Show that
(Notice the projection matrix, from (5.19).) Show that in the case D = d , i.e., M = 1,
(8.73) yields the equality in (8.4).
Exercise 8.8.2. Prove the Cauchy-Schwarz inequality for random variables U and V
given in (8.5) and (8.6), assuming that V is not zero with probability one. [Hint: Use
least squares, by finding b to minimize E [(U − bV )2 ].]
Exercise 8.8.3. Prove Corollary 8.2. [Hint: Show that (8.11) follows from (8.1), and
that (8.2) implies that γ = ±1/‖d‖, using the fact that ‖y‖ = 1.]
Exercise 8.8.4. For W in (8.19), verify that X X = W XX , X Y = W XY , and Y QX Y =
WYY · X , where QX = In − X(X X)−1 X .
Exercise 8.8.5. Suppose Z ∼ N1×l ∗ (0, Σ) and W ∼ Wishartl ∗ (ν, Σ) are as in Proposi-
tion 7.1. (a) Show that for any l ∗ × l ∗ invertible matrix A,
to show that a11 = a22 and a1i = a2i for i = 3, . . . , q. Similar equalities can be obtained
by switching other pairs of rows and columns.] (b) Suppose (8.75) holds for all Γ that
are diagonal, with each diagonal element being either +1 or −1. (They needn’t all be
the same sign.) Show that all off-diagonals must be 0. (c) Suppose (8.75) holds for all
orthogonal Γ. Show that A must be of the form a11 Iq . [Hint: Use parts (a) and (b).]
Exercise 8.8.7. Verify the three equalities in (8.37).
Exercise 8.8.9. Find the g in (8.47) that maximizes (Vg )2 , and show that the maxi-
mum is indeed VV . (Use Corollary 8.2.) What is a maximizing a in (8.44)?
Exercise 8.8.10. Suppose that z = yB, where z and y are 1 × N, and B is N × N and
invertible. Show that | ∂z/∂y| = |B|. (Recall (8.51)).
Exercise 8.8.11. Show that for N × N matrix A, abs |(A )−1 | = |AA | −1/2 .
Exercise 8.8.12. Let

(X, Y) ∼ N_{p+q}( (0, 0), ( Σ_XX   0
                            0      Σ_YY ) ).    (8.77)
Exercise 8.8.15. Prove Corollary 8.3. [Hint: Make the identifications z = row(y),
μ = row(M), and Ω = C ⊗ Σ in (8.48). Use (3.32f) for the determinant term in the
density. For the term in the exponent, use (3.32d) to help show that
Exercise 8.8.18. Verify (8.68). [Hint: Vectorize the matrices by row, leaving out the
structural zeroes, i.e., for q = 3, v → (v11 , v12 , v13 , v22 , v23 , v33 ). Then the matrix of
derivatives will be lower triangular.]
[Hint: Show that with V = RA as in (8.61) and (8.62), Vjj = a jj R jj . Apply (5.67) to the
A and Σ.]
Exercise 8.8.24 (Bayesian inference). Consider Bayesian inference for the covariance
matrix. It turns out that the conjugate prior is an inverse Wishart on the covariance
matrix, which means Σ−1 has a Wishart prior. Specifically, let
where Σ0 is the prior guess of Σ, and ν0 is the “prior sample size.” (The larger the ν0 ,
the more weight is placed on the prior vs. the data.) Then the model in terms of the
inverse covariance parameter matrices is
W | Ψ = ψ ∼ Wishartq (ν, ψ −1 )
Ψ ∼ Wishartq (ν0 , Ψ0 ) , (8.84)
where ν ≥ q, ν0 ≥ q and Ψ0 is positive definite, so that Ψ, hence Σ, is invertible with
probability one. Note that the prior mean for Ψ is Σ0−1 . (a) Show that the joint density
of (W, Ψ) is
f_{W|Ψ}(w | ψ) f_Ψ(ψ) = c(w) |ψ|^{(ν+ν0−q−1)/2} e^{−trace((w + Ψ0^{-1}) ψ)/2},    (8.85)
where c(w) is some constant that does not depend on ψ, though it does depend on
Ψ0 and ν0 . (b) Without doing any calculations, show that the posterior distribution
of Ψ is
Ψ | W = w ∼ Wishartq (ν + ν0 , (w + Ψ0−1 )−1 ). (8.86)
[Hint: Dividing the joint density in (8.85) by the marginal density of W, f W (w), yields
the posterior density just like the joint density, but with a different constant, say,
c∗ (w). With ψ as the variable, the density is a Wishart one, with given parameters.]
(c) Letting S = W/ν be the sample covariance matrix, show that the posterior mean
of Σ is
E[Σ | W = w] = (νS + ν0 Σ0) / (ν + ν0 − q − 1),    (8.87)
close to a weighted average of the prior guess and observed covariance matrices.
[Hint: Use Lemma 7.1 on Ψ, rather than trying to find the distribution of Σ.]
Here, K0 , μ0 , Ψ0 and ν0 are known, where K0 and Ψ0 are positive definite, and ν0 ≥ q.
Show that unconditionally, E[μ] = μ0 and, using (8.83),

Cov[μ] = (1/(ν0 − q − 1)) (K0 ⊗ Ψ0)^{-1} = (ν0/(ν0 − q − 1)) K0^{-1} ⊗ Σ0.    (8.89)
for the Wishart constant d(ν0 , q ) given in (8.72). [Hint: Use the pdfs in (8.54) and
(8.69).] (b) Argue that the final two terms in (8.90) (the | ψ | term and the exponential
term) look like the density of Ψ if
but without the constants, hence integrating over ψ yields the inverse of those
constants. Then show that the marginal density of μ is

f_μ(m) = ∫ f_{μ,Ψ}(m, ψ) dψ
       = (d(ν0 + p, q)/((2π)^{pq/2} d(ν0, q))) |Ψ0|^{−ν0/2} |K0|^{q/2} |(m − μ0)′ K0 (m − μ0) + Ψ0^{-1}|^{−(ν0+p)/2}
       = (1/c(ν0, p, q)) |Ψ0|^{p/2} |K0|^{q/2} / |(m − μ0)′ K0 (m − μ0) Ψ0 + Iq|^{(ν0+p)/2},
(8.92)
where
c(ν0 , p, q ) = (2π ) pq/2 d(ν0 , q )/d(ν0 + p, q ). (8.93)
This density for μ is a type of multivariate t. Hotelling's T² is another type. (c) Show
that if p = q = 1, μ0 = 0, K0 = 1/ν0 and Ψ0 = 1, then the pdf (8.92) is that of a
Student's t on ν0 degrees of freedom:

f(t | ν0) = (Γ((ν0 + 1)/2)/(√(ν0 π) Γ(ν0/2))) · 1/(1 + t²/ν0)^{(ν0+1)/2}.    (8.94)
Exercise 8.8.27 (Bayesian inference). Now we add some data to the prior in Exercise
8.8.25. The conditional model for the data is
where Y and W are independent given μ and Ψ. Note that W’s distribution does not
depend on the μ. The conjugate prior is given in (8.88), with the conditions given
therebelow. The K is a fixed positive definite matrix. A curious element is that prior
covariance of the mean and the conditional covariance of Y have the same ψ, which
helps tractability (as in Exercise 6.5.3). (a) Justify the following equations:
(b) Show that the conditional distribution of μ given Y and Ψ is multivariate normal
with
E [ μ | Y = y, Ψ = ψ ] = (K + K0 )−1 (Ky + K0 μ0 ),
Cov[ μ | Y = y, Ψ = ψ ] = (K + K0 )−1 ⊗ ψ −1 . (8.97)
[Hint: Follows from Exercise 3.7.29, noting that ψ is fixed (conditioned upon) for this
calculation.] (c) Show that
[Hint: See (3.102).] (d) Let Z = (K−1 + K0−1 )−1/2 (Y − μ0 ), and show that the middle
two densities in the last line of (8.96) can be combined into the density of
U = W + Z Z | Ψ = ψ ∼ Wishartq (ν + p, ψ −1 ), (8.99)
that is,
f Y | Ψ (y | ψ ) f W | Ψ (w | ψ ) = c∗ (u, w) f U | Ψ (u | ψ ) (8.100)
for some constant c∗ (u, w) that does not depend on ψ. (e) Now use Exercise 8.8.24 to
show that
Ψ | U = u ∼ Wishartq (ν + ν0 + p, (u + Ψ0−1 )−1 ). (8.101)
(f) Thus the posterior distribution of μ and Ψ in (8.97) and (8.101) are given in the
same two stages as the prior in (8.88). The only differences are in the parameters.
The prior parameters are μ0 , K0 , ν0 , and Ψ0 . What are the corresponding posterior
parameters? (g) Using (8.83), show that the posterior means of μ and Σ are
E[μ | Y = y, W = w] = (K + K0)^{-1}(Ky + K0 μ0),
E[Σ | Y = y, W = w] = (u + ν0 Σ0)/(ν + ν0 + p − q − 1),    (8.102)

and

Cov[μ | Y = y, W = w] = (K + K0)^{-1} ⊗ (u + ν0 Σ0)/(ν + ν0 + p − q − 1).    (8.103)
Chapter 9
Likelihood Methods
For the linear models, we derived estimators of β using the least-squares principle,
and found estimators of Σ R in an obvious manner. Likelihood provides another
general approach to deriving estimators, hypothesis tests, and model selection proce-
dures. We start with a very brief introduction, then apply the principle to the linear
models. Chapter 10 considers MLE’s for models concerning the covariance matrix.
9.1 Introduction
Throughout this chapter, we assume we have a statistical model consisting of a ran-
dom object (usually a matrix or a set of matrices) Y with space Y , and a set of
distributions { Pθ | θ ∈ Θ }, where Θ is the parameter space. We assume that these
distributions have densities, with Pθ having associated density f (y | θ).
Definition 9.1. For a statistical model with densities, the likelihood function is defined for
each fixed y ∈ Y as the function L (· ; y) : θ → [0, ∞ ) given by
L (θ ; y) = a(y) f (y | θ), (9.1)
for any positive a(y).
Likelihoods are to be interpreted in only relative fashion, that is, to say the likeli-
hood of a particular θ1 is L (θ1 ; y) does not mean anything by itself. Rather, meaning
is attributed to saying that the relative likelihood of θ1 to θ2 (in light of the data y) is
L (θ1 ; y)/L (θ2 ; y). Which is why the “a(y)” in (9.1) is allowed. There is a great deal of
controversy over what exactly the relative likelihood means. We do not have to worry
about that particularly, since we are just using likelihood as a means to an end. The
general idea, though, is that the data supports θ’s with relatively high likelihood.
The next few sections consider maximum likelihood estimation. Subsequent sec-
tions look at likelihood ratio tests, and two popular model selection techniques (AIC
and BIC). Our main applications are to multivariate normal parameters.
estimate.
Definition 9.2. The maximum likelihood estimate (MLE) of the parameter θ based on the
data y is the unique value, if it exists, θ̂(y) ∈ Θ that maximizes the likelihood L(θ; y).
It may very well be that the maximizer is not unique, or does not exist at all, in
which case there is no MLE for that particular y. The MLE of a function of θ, g(θ),
is defined to be the function of the MLE, that is, the MLE of g(θ) is g(θ̂). See Exercises 9.6.1 and
9.6.2 for justification.
β̂ = (x′x)^{-1} x′y. We show that this is in fact the MLE. Write

y − xβ = y − xβ̂ + xβ̂ − xβ = Q_x y + x(β̂ − β),    (9.4)

where Q_x = I_n − x(x′x)^{-1}x′ (see Proposition 5.1), so that
g(Σ) = (1/|Σ|^{a/2}) e^{−trace(Σ^{-1}U)/2}    (9.7)

is maximized over Σ ∈ S_q^+ by

Σ̂ = (1/a) U,    (9.8)

and the maximum is

g(Σ̂) = (1/|Σ̂|^{a/2}) e^{−aq/2}.    (9.9)
Applying this lemma to (9.6) yields

Σ̂_R = y′ Q_x y / n.    (9.10)
where as in (6.5),
Σ_z = z^{-1} Σ_R (z′)^{-1}.    (9.14)
or

Y^(z) ≡ Y ((z z2)′)^{-1} ∼ N_{n×q}( x (β 0), I_n ⊗ Σ_z ),    (9.18)

where

Σ_z = (z z2)^{-1} Σ_R ((z z2)′)^{-1}.    (9.19)
As before, ( β, Σz ) and ( β, Σ R ) are in one-to-one correspondence, so it will be
sufficient to first find the MLE of the former. Because of the “0” in the mean of Y(z) ,
the least-squares estimate of β in (5.27) is not the MLE. Instead, we have to proceed
conditionally. That is, partition Y(z) similar to (7.69),
Y^(z) = (Y_a^(z) Y_b^(z)), where Y_a^(z) is n × l and Y_b^(z) is n × (q − l),    (9.20)
and Σz similar to (7.70),
Σ_z = ( Σz,aa  Σz,ab
        Σz,ba  Σz,bb ).    (9.21)
The density of Y^(z) is a product of the conditional density of Y_a^(z) given Y_b^(z) = y_b^(z),
and the marginal density of Y_b^(z):

f(y^(z) | β, Σ_z) = f(y_a^(z) | y_b^(z), β, γ, Σz,aa·b) × f(y_b^(z) | Σz,bb),    (9.22)
where
γ = Σz,bb^{-1} Σz,ba and Σz,aa·b = Σz,aa − Σz,ab Σz,bb^{-1} Σz,ba.    (9.23)
The notation in (9.22) makes explicit the facts that the conditional distribution of Y_a^(z)
given Y_b^(z) = y_b^(z) depends on (β, Σ_z) through only (β, γ, Σz,aa·b), and the marginal
distribution of Y_b^(z) depends on (β, Σ_z) through only Σz,bb.
The set of parameters ( β, γ, Σz,aa·b, Σz,bb ) can be seen to be in one-to-one corre-
spondence with ( β, Σz ), and has space R p×l × R( q−l )×l × Sl+ × Sq+−l . That is, the
parameters in the conditional density are functionally independent of those in the
marginal density, which means that we can find the MLE’s of these parameters sepa-
rately.
Conditional part
We know as in (7.72) (without the “μ” parts) and (7.73) that
) *
( z) β
Ya | Yb = yb ∼ Nn×l x yb , In ⊗ Σz,aa·b . (9.24)
γ
We are in the multivariate regression case, that is, without a z, so the MLE of the
(β′, γ′)′ parameter is the least-squares estimate

(β̂′, γ̂′)′ = ((x y_b^(z))′ (x y_b^(z)))^{-1} (x y_b^(z))′ Y_a^(z),    (9.25)
and
Σ̂z,aa·b = (1/n) Y_a^(z)′ Q_{(x, y_b^(z))} Y_a^(z).    (9.26)
Marginal part
From (9.19), we have that
Y_b^(z) ∼ N_{n×(q−l)}(0, I_n ⊗ Σz,bb),    (9.27)

hence

Σ̂z,bb = (1/n) Y_b^(z)′ Y_b^(z)    (9.28)

is the MLE.
The maximum of the likelihood from (9.22), ignoring the constants, is
(1/|Σ̂z,aa·b|^{n/2}) (1/|Σ̂z,bb|^{n/2}) e^{−nq/2}.    (9.29)
Putting things back together, we first note that the MLE of β is given in (9.25), and
that for Σz,bb is given in (9.28). By (9.23), the other parts of Σz have MLE
Σ̂z,ba = Σ̂z,bb γ̂ and Σ̂z,aa = Σ̂z,aa·b + γ̂′ Σ̂z,bb γ̂.    (9.30)
If one is mainly interested in β, then the MLE can be found using the pseudo-
covariate approach in Section 7.5.1, and the estimation of Σz,bb and the reconstruction
of Σ̂_R are unnecessary.
We note that a similar approach can be used to find the MLE’s in the covariate
model (7.69), with just a little more complication to take care of the μ parts. Again, if
one is primarily interested in the β a , then the MLE is found as in that section.
g(Σ) = (1/|U|^{a/2}) h(U^{-1/2} Σ U^{-1/2}), where h(Ψ) ≡ (1/|Ψ|^{a/2}) e^{−trace(Ψ^{-1})/2}    (9.32)

is a function of Ψ ∈ S_q^+. Exercise 9.6.7 shows that (9.32) is maximized by Ψ̂ = (1/a) I_q,
hence

g(Σ̂) = (1/(|U|^{a/2} |(1/a) I_q|^{a/2})) e^{−a·trace(I_q)/2},    (9.33)

and

Ψ̂ = U^{-1/2} Σ̂ U^{-1/2} ⇒ Σ̂ = U^{1/2} (1/a) I_q U^{1/2} = (1/a) U,    (9.34)

which proves (9.8). 2
H0 : θ ∈ Θ0 versus H A : θ ∈ Θ A , (9.35)
where
Θ0 ⊂ Θ A ⊂ Θ. (9.36)
Technically, the space in H A should be Θ A − Θ0 , but we take that to be implicit.
The likelihood ratio statistic for problem (9.35) is defined to be
LR = sup_{θ∈Θ_A} L(θ; y) / sup_{θ∈Θ_0} L(θ; y),    (9.37)
H0 : β_2 = 0 versus H_A : β_2 ≠ 0.    (9.40)
The maximized likelihoods under the null and alternative are easy to find using
(9.10) and (9.11). The MLE’s of Σ under H0 and H A are, respectively,
0 = 1 y Qx1 y and Σ
Σ A = 1 y Qx y, (9.41)
n n
the former because under the null, the model is multivariate regression with mean
x1 β 1 . Then the likelihood ratio from (9.37) is
LR = (|Σ̂_0| / |Σ̂_A|)^{n/2} = (|y′Q_{x_1} y| / |y′Q_x y|)^{n/2}.    (9.42)
We can use the approximation (9.38) under the null, where here d f = p2 q. It turns
out that the statistic is equivalent to Wilk’s Λ in (7.2),
Λ = (LR)^{−2/n} = |W| / |W + B| ∼ Wilks_q(p_2, n − p),    (9.43)
where
W = y′Q_x y and B = y′(Q_{x_1} − Q_x) y.    (9.44)
See Exercise 9.6.8. Thus we can use Bartlett’s approximation in (7.37), with l∗ = q
and p∗ = p2 .
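A sketch of this test in R, for generic inputs y (n × q), x1 (n × p1), and x2 (n × p2), is
below; these names are hypothetical placeholders, not objects defined in the text. The
function returns Wilks' Λ, the statistic 2 log(LR) = −n log Λ, and the χ² p-value with
df = p2 q (Bartlett's refinement in (7.37) could be substituted for the cutoff).

lrt.beta2 <- function(y, x1, x2) {
  n <- nrow(y); q <- ncol(y); p2 <- ncol(x2)
  Q <- function(x) diag(n) - x %*% solve(t(x) %*% x, t(x))   # Q_x = I - P_x
  W <- t(y) %*% Q(cbind(x1, x2)) %*% y                       # y' Q_x y
  B <- t(y) %*% (Q(x1) - Q(cbind(x1, x2))) %*% y             # y' (Q_x1 - Q_x) y
  lambda <- det(W) / det(W + B)                              # Wilks' Lambda, (9.43)
  chi2 <- -n * log(lambda)                                   # 2 log(LR)
  c(lambda = lambda, chi2 = chi2, pvalue = 1 - pchisq(chi2, df = p2 * q))
}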
Let
be the loglikelihoods for the models. The constant C (y) is arbitrary, and as long as
it is the same for each k, it will not affect the outcome of the following procedures.
Define the deviance of the model Mk at parameter value θk by
It is a measure of fit of the model to the data; the smaller the deviance, the better the
fit. The MLE of θk for model Mk minimizes this deviance, giving us the observed
deviance,
deviance(M_k(θ̂_k); y) = −2 l_k(θ̂_k; y) = −2 max_{θ_k ∈ Θ_k} l_k(θ_k; y).    (9.48)
Note that the likelihood ratio statistic in (9.38) is just the difference in observed
deviance of the two hypothesized models:
2 log(LR) = deviance(H_0(θ̂_0); y) − deviance(H_A(θ̂_A); y).    (9.49)
At first blush one might decide the best model is the one with the smallest ob-
served deviance. The problem with that approach is that because the deviances are
based on minus the maximum of the likelihoods, the model with the best observed
deviance will be the largest model, i.e., one with highest dimension. Instead, we add
a penalty depending on the dimension of the parameter space, as for Mallows’ C p in
(7.102). The two most popular procedures are the Bayes information criterion (BIC)
of Schwarz [1978] and the Akaike information criterion (AIC) of Akaike [1974] (who
actually meant for the “A” to stand for “An”):

AIC(M_k; y) = deviance(M_k(θ̂_k); y) + 2 d_k    (9.50)

and

BIC(M_k; y) = deviance(M_k(θ̂_k); y) + log(n) d_k,    (9.51)

where d_k = dim(Θ_k). In either case, the model with the smallest value of the criterion
is chosen. The BIC can also be used to estimate the posterior probabilities of the
models. More specifically, if the prior probability that model Mk is the true one is π k ,
then the BIC-based estimate of the posterior probability is
P_BIC[M_k | y] = e^{−BIC(M_k; y)/2} π_k / (e^{−BIC(M_1; y)/2} π_1 + ··· + e^{−BIC(M_K; y)/2} π_K).    (9.53)
If the prior probabilities are taken to be equal, then because each posterior probability
has the same denominator, the model that has the highest posterior probability is
indeed the model with the smallest value of BIC. The advantage of the posterior
probability form is that it is easy to assess which models are nearly as good as the
best, if there are any.
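Turning BIC values into the probabilities in (9.53) is a one-line computation; the sketch
below subtracts the minimum BIC first for numerical stability, which does not change the
ratios. The function name is a hypothetical helper, not one from the text's R file.

bic.probs <- function(bic, prior = rep(1, length(bic))) {
  w <- exp(-(bic - min(bic))/2) * prior    # relative weights exp(-BIC/2) times priors
  w / sum(w)
}
# For the two mouth-size models of Section 9.5 below (BICs 30.499 and 40.714):
round(bic.probs(c(30.499, 40.714)), 3)     # roughly 0.994 and 0.006, as in (9.98)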
To see where the approximation arises, we first need a prior on the parameter
space. In this case, there are several parameter spaces, one for each model under
consideration. Thus is it easier to find conditional priors for each θk , conditioning on
the model:
θk | M k ∼ ρ k ( θk ) , (9.54)
for some density ρk on Θ k . The marginal probability of each model is the prior
probability:
π k = P [ Mk ]. (9.55)
The conditional density of (Y, θk ) given Mk is
gk (y | M k ) π k
P [ Mk | y] = . (9.58)
g1 (y | M1 )π1 + · · · + gK (y | MK )π K
The following requires a number of regularity assumptions, not all of which we will
detail. One is that the data y consists of n iid observations, another that n is large.
Many of the standard likelihood-based assumptions needed can be found in Chapter
6 of Lehmann and Casella [1998], or any other good mathematical statistics text. For
convenience we drop the “k”, and from (9.57) consider
∫_Θ f(y | θ) ρ(θ) dθ = ∫_Θ e^{l(θ; y)} ρ(θ) dθ.    (9.59)
The Laplace approximation expands l(θ; y) around its maximum, the maximum occurring
at the maximum likelihood estimator θ̂. Then, assuming all the derivatives exist,

l(θ; y) ≈ l(θ̂; y) + (θ − θ̂)′ ∇(θ̂) + ½ (θ − θ̂)′ H(θ̂) (θ − θ̂),    (9.60)
θ ∼ N_d(θ̂, (n F̂)^{-1}),    (9.66)

but without the constant. Thus the integral is just the reciprocal of that constant, i.e.,

∫_Θ e^{−(θ−θ̂)′ n F̂(θ̂) (θ−θ̂)/2} dθ = (2π)^{d/2} |n F̂|^{-1/2} = (2π)^{d/2} |F̂|^{-1/2} n^{-d/2}.    (9.67)
Putting (9.64) and (9.67) together gives
log( ∫_Θ f(y | θ) ρ(θ) dθ ) ≈ l(θ̂; y) − (d/2) log(n) + log(ρ(θ̂)) + (d/2) log(2π) − ½ log(|F̂|)
                            ≈ l(θ̂; y) − (d/2) log(n)
                            = −½ BIC(M; y).    (9.68)
Dropping the last three terms in the first line is justified by noting that as n → ∞,
l(θ̂; y) is of order n (in the iid case), (d/2) log(n) is clearly of order log(n), and the
other terms are bounded. (This step may be a bit questionable since n has to be
extremely large before log(n ) starts to dwarf a constant.)
There are a number of approximations and heuristics in this derivation, and in-
deed the resulting approximation may not be especially good. See Berger, Ghosh,
and Mukhopadhyay [1999], for example. A nice property is that under conditions,
if one of the considered models is the correct one, then the BIC chooses the correct
model as n → ∞.
deviance(M_k(θ̂_k); y^New) = −2 l_k(θ̂_k; y^New),    (9.69)

as in (9.48), except that here, while the MLE θ̂_k is based on the data y, the loglikelihood
is evaluated at the new variable y^New. The expected prediction deviance is then

EPredDeviance(M_k) = E[deviance(M_k(θ̂_k); Y^New)],    (9.70)

where the expected value is over both the data Y (through θ̂_k) and the new variable
Y^New. The best model is then the one that minimizes this value.
We need to estimate the expected prediction deviance, and the observed deviance
(9.48) is the obvious place to start. As for Mallows’ C p , the observed deviance is
likely to be an underestimate because the parameter is chosen to be optimal for the
particular data y. Thus we would like to find how much of an underestimate it is,
i.e., find
Δ = EPredDeviance(M_k) − E[deviance(M_k(θ̂_k); Y)].    (9.71)
Akaike argues that for large n, the answer is Δ = 2dk , i.e.,
EPredDeviance(M_k) ≈ E[deviance(M_k(θ̂_k); Y)] + 2 d_k,    (9.72)
from which the AIC in (9.51) arises. One glitch in the proceedings is that the approxi-
mation assumes that the true model is in fact Mk (or a submodel thereof), rather than
the most general model, as in (7.83) for Mallows’ C p .
Rather than justify the result in full generality, we will show the exact value for Δ
for multivariate regression, as Hurvich and Tsai [1989] did in the multiple regression
model.
l(β, Σ_R; y) = −(n/2) log(|Σ_R|) − ½ trace(Σ_R^{-1} (y − xβ)′(y − xβ)).    (9.74)

The MLE's are then

β̂ = (x′x)^{-1} x′y and Σ̂_R = (1/n) y′Q_x y,    (9.75)
deviance(M(β̂, Σ̂_R); y) = n log(|Σ̂_R|) + nq,    (9.76)

and

deviance(M(β̂, Σ̂_R); Y^New) = n log(|Σ̂_R|) + trace(Σ̂_R^{-1} (Y^New − xβ̂)′(Y^New − xβ̂)).    (9.77)
Using the deviance from (9.76), the difference Δ from (9.71) can be written

Δ = (n + p) E[trace(Σ̂_R^{-1} Σ_R)] − nq = n(n + p) E[trace(W^{-1} Σ_R)] − nq,    (9.81)

where

W = Y′Q_x Y ∼ Wishart(n − p, Σ_R).    (9.82)

Now

E[W^{-1}] = (1/(n − p − q − 1)) Σ_R^{-1},    (9.83)

hence

Δ = n(n + p) q/(n − p − q − 1) − nq = (n/(n − p − q − 1)) q (2p + q + 1).    (9.84)
Thus
AIC∗(M; y) = deviance(M(β̂, Σ̂_R); y) + nq (2p + q + 1)/(n − p − q − 1)    (9.85)
for the multivariate regression model. For large n, Δ ≈ 2 dim(Θ ). See Exercise 9.6.11.
In univariate regression q = 1, and (9.85) is the value given in Hurvich and Tsai
[1989].
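For multivariate regression, (9.85) is easy to compute directly. The sketch below is a
hypothetical helper (not from the text's R file) that, for generic inputs y (n × q) and
x (n × p), returns both the exact AIC∗ and the usual large-n version with penalty
2 dim(Θ).

aic.mvreg <- function(y, x) {
  n <- nrow(y); q <- ncol(y); p <- ncol(x)
  Qx <- diag(n) - x %*% solve(t(x) %*% x, t(x))
  sigma <- t(y) %*% Qx %*% y / n                     # MLE of Sigma_R, (9.75)
  dev <- n * log(det(sigma)) + n * q                 # observed deviance, (9.76)
  c(AICstar = dev + n*q*(2*p + q + 1)/(n - p - q - 1),    # exact correction, (9.85)
    AIC     = dev + 2*(p*q + q*(q + 1)/2))                # penalty 2*dim(Theta)
}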
We look at two of the models in detail. The full model M24 in (7.40) is actually just
( z)
multivariate regression, so there is no “before” variable yb . Thus
β̂ = (x′x)^{-1} x′ y^(z) = (  24.969   0.784   0.203  −0.056
                             −2.321  −0.305  −0.214   0.072 ),    (9.87)
deviance(M_24(β̂, Σ̂_R); y^(z)) = n log(|Σ̂_R|) + nq = −18.611,    (9.89)
with n = 27 and q = 4.
The best model in (7.103) was M_22, which fit different linear equations to the boys
and girls. In this case, y_a^(z) consists of the first two columns of y^(z), and y_b^(z) the final
two columns. As in (9.25), to find the MLE of the coefficients, we shift the y_b^(z) to be
with the x, yielding
(β̂′, γ̂′)′ = ((x y_b^(z))′ (x y_b^(z)))^{-1} (x y_b^(z))′ y_a^(z)
           = (  24.937   0.827
               −2.272  −0.350
               −0.189  −0.191
               −1.245   0.063 ),    (9.90)
where the top 2 × 2 submatrix contains the estimates of the β ij ’s. Notice that they are
similar but not exactly the same as the estimates in (9.87) for the corresponding coef-
ficients. The bottom submatrix contains coefficients relating the “after” and “before”
measurements, which are not of direct interest.
There are two covariance matrices needed, both 2 × 2:
Σ̂z,aa·b = (1/27) (y_a^(z) − (x y_b^(z))(β̂′, γ̂′)′)′ (y_a^(z) − (x y_b^(z))(β̂′, γ̂′)′)
        = ( 3.313  0.065
            0.065  0.100 )    (9.91)
and
Then by (9.29),
deviance(M_22(β̂, Σ̂_z); y^(z)) = n log(|Σ̂z,aa·b|) + n log(|Σ̂z,bb|) + nq
                              = 27(−1.116 − 3.463 + 4)
                              = −15.643.    (9.93)
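The M_22 calculations can be sketched in R as follows, assuming yz holds the transformed
data y^(z) = Y(z′)^{-1} (27 × 4) and x is the 27 × 2 design matrix; those object names are
assumptions for this sketch, not objects constructed in the text.

ya <- yz[,1:2]; yb <- yz[,3:4]
xyb <- cbind(x, yb)                                  # shift y_b^(z) over to the x side
betagamma <- solve(t(xyb) %*% xyb, t(xyb) %*% ya)    # (9.25): estimates of beta and gamma
Qxyb <- diag(27) - xyb %*% solve(t(xyb) %*% xyb, t(xyb))
sigma.aa.b <- t(ya) %*% Qxyb %*% ya / 27             # (9.26)
sigma.bb <- t(yb) %*% yb / 27                        # (9.28): marginal MLE (zero mean)
deviance22 <- 27 * (log(det(sigma.aa.b)) + log(det(sigma.bb)) + 4)   # as in (9.93)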
That value is obviously not significant, but formally the chi-square test would com-
pare the statistic to the cutoff from a χ2d f where d f = d24 − d22 . The dimension for
model M p∗ l ∗ is
d_{p∗l∗} = p∗ l∗ + q(q + 1)/2 = p∗ l∗ + 10,    (9.96)
because the model has p∗ l ∗ non-zero β ij ’s, and Σ R is 4 × 4. Note that all the models
we are considering have the same dimension Σ R . For the hypotheses (9.94), d22 = 14
and d24 = 18, hence the d f = 4 (rather obviously since we are setting four of the β ij ’s
to 0). The test shows that there is no reason to reject the smaller model in favor of the
full one.
The AIC (9.50) and BIC (9.51) are easy to find as well (log(27) ≈ 3.2958):
AIC BIC Cp
M22 −15.643 + 2(14) = 12.357 −15.643 + log(27)(14) = 30.499 599.7 (9.97)
M24 −18.611 + 2(18) = 17.389 −18.611 + log(27)(18) = 40.714 610.2
The Mallows’ C p ’s are from (7.103). Whichever criterion you use, it is clear the smaller
model optimizes it. It is also interesting to consider the BIC-based approximations to
the posterior probabilities of these models in (9.53). With π_22 = π_24 = ½, we have

P_BIC[M_22 | y^(z)] = e^{−BIC(M_22)/2} / (e^{−BIC(M_22)/2} + e^{−BIC(M_24)/2}) ≈ 0.994.    (9.98)
That is, between these two models, the smaller one has an estimated probability of
99.4%, quite high.
We repeat the process for each of the models in (7.103) to obtain the following
table (the last column is explained below):
For AIC, BIC, and C p , the best model is the one with linear fits for each sex, M22 .
The next best is the model with quadratic fits for each sex, M23 . The penultimate
column has the BIC-based estimated posterior probabilities, taking the prior proba-
bilities equal. Model M22 is the overwhelming favorite, with about 82% estimated
probability, and M23 is next with about 11%, not too surprising considering the plots
in Figure 4.1. The only other models with estimated probability over 1% are the lin-
ear and quadratic fits with boys and girls equal. The probability that a model shows
differences between the boys and girls can be estimated by summing the last four
probabilities, obtaining 93%.
The table in (9.99) also contains a column “GOF,” which stands for “goodness-
of-fit.” Perlman and Wu [2003] suggest in such model selection settings to find the
p-value for each model when testing the model (as null) versus the big model. Thus
here, for model M p∗ l ∗ , we find the p-value for testing
As in (9.49), we use the difference in the models’ deviances, which under the null has
an approximate χ2 distribution, with the degrees of freedom being the difference in
their dimensions. Thus
9.6 Exercises
Exercise 9.6.1. Consider the statistical model with space Y and densities f (y | θ)
for θ ∈ Θ. Suppose the function g : Θ → Ω is one-to-one and onto, so that a
Exercise 9.6.2. Again consider the statistical model with space Y and densities f (y | θ)
for θ ∈ Θ, and suppose g : Θ → Ω is just onto. Let g∗ be any function of θ such that
the joint function h(θ) = ( g(θ), g∗ (θ)), h : Θ → Λ, is one-to-one and onto, and set
the reparametrized density as f∗(y | λ) = f(y | h^{-1}(λ)). Exercise 9.6.1 shows that if
θ̂ uniquely maximizes f(y | θ) over Θ, then λ̂ = h(θ̂) uniquely maximizes f∗(y | λ)
over Λ. Argue that if θ̂ is the MLE of θ, then it is legitimate to define g(θ̂) to be the
MLE of ω = g(θ).
Exercise 9.6.3. Show that (9.5) holds. [What are Qx Qx and Qx x?]
Exercise 9.6.4. Show that if A (p × p) and B (q × q) are positive definite, and u is
p × q, that
trace (Bu Au) > 0 (9.102)
unless u = 0. [Hint: See (8.79).]
Exercise 9.6.5. From (9.18), (9.21), and (9.23), give ( β, Σz ) as a function of ( β, γ, Σz,aa·b,
Σz,bb ), and show that the latter set of parameters has space R p×l × R( q−l )×l × Sl ×
Sq−l .
Exercise 9.6.6. Verify (9.32).
Exercise 9.6.7. Consider maximizing h(Ψ) in (9.32) over Ψ ∈ Sq . (a) Let Ψ = ΓΛΓ
be the spectral decomposition of Ψ, so that the diagonals of Λ are the eigenvalues
λ_1 ≥ λ_2 ≥ ··· ≥ λ_q ≥ 0. (Recall Theorem 1.1.) Show that

(1/|Ψ|^{a/2}) e^{−trace(Ψ^{-1})/2} = ∏_{i=1}^{q} [ λ_i^{−a/2} e^{−1/(2λ_i)} ].    (9.103)

(b) Find λ̂_i, the maximizer of λ_i^{−a/2} exp(−1/(2λ_i)), for each i = 1, . . . , q. (c) Show that
these λi ’s satisfy the conditions on the eigenvalues of Λ. (d) Argue that then Ψ =
(1/a)Iq maximizes h(Ψ).
Exercise 9.6.8. Suppose the null hypothesis in (9.40) holds, so that Y ∼ Nn×q (x1 β 1 , In ⊗
Σ R ). Exercise 5.6.37 shows that Qx1 − Qx = Px2·1 , where x2·1 = Qx1 x2 . (a) Show that
Px2·1 x1 = 0 and Qx Px2·1 = 0. [Hint: See Lemma 5.3.] (b) Show that Qx Y and Px2·1 Y
are independent, and find their distributions. (c) Part (b) shows that W = Y Qx Y
and B = Y Px2·1 Y are independent. What are their distributions? (d) Verify the Wilk’s
distribution in (9.42) for |W| /|W + B|.
Exercise 9.6.9. Consider the multivariate regression model (9.2), where Σ R is known.
(a) Use (9.5) to show that
1
l ( β ; y) − l ( β ; y) = − trace(Σ− 1
R ( β − β ) x x( β − β )). (9.104)
2
(b) Show that in this case, (9.60) is actually an equality, and give H, which is a function
of Σ R and x x.
Exercise 9.6.11. (a) Show that for the model in (9.73), dim(Θ ) = pq + q (q + 1)/2,
where Θ is the joint space of β and Σ R . (b) Show that in (9.84), Δ → 2 dim(Θ ) as
n → ∞.
Exercise 9.6.12 (Caffeine). This question continues the caffeine data in Exercises 4.4.4
and 6.5.6. Start with the both-sides model Y = xβz + R, where as before the Y is
2 × 28, the first column being the scores without caffeine, and the second being the
scores with caffeine. The x is a 28 × 3 ANOVA matrix for the three grades, with
orthogonal polynomials. The linear vector is (−1_9′, 0_10′, 1_9′)′, and the quadratic vector
is (1_9′, −1.8·1_10′, 1_9′)′. The z looks at the sum and difference of scores:

z = ( 1 −1
      1  1 ).    (9.105)
The goal of this problem is to use BIC to find a good model, choosing among the
constant, linear and quadratic models for x, and the “overall mean” and “overall
mean + difference models” for the scores. Start by finding Y( z) = Y(z )−1 . (a) For
each of the 6 models, find the deviance (just the log-sigma parts), number of free
parameters, BIC, and estimated probability. (b) Which model has highest probability?
(c) What is the chance that the difference effect is in the model? (d) Find the MLE of
β for the best model.
Exercise 9.6.13 (Leprosy, Part I). This question continues Exercises 4.4.6 and 5.6.41 on
the leprosy data. The model is
(Y^(b), Y^(a)) = xβ + R = [ ( 1  1  1
                              1  1 −1
                              1 −2  0 ) ⊗ 1_10 ] ( μ_b  μ_a
                                                   0    α_a
                                                   0    β_a ) + R,    (9.106)
where
R ∼ N( 0, I_n ⊗ ( σ_bb  σ_ba
                  σ_ab  σ_aa ) ).    (9.107)
Because of the zeros in the β, the MLE is not the usual one for multivariate regres-
sion. Instead, the problem has to be broken up into the conditional part (“after”
conditioning on “before”), and the marginal of the before measurements, as for the
both-sided model in Sections 9.2.2 and 9.5. The conditional is
Y^(a) | Y^(b) = y^(b) ∼ N( (x y^(b)) (μ∗, α_a, β_a, γ)′, σ_aa·b I_n ),    (9.108)

where

μ∗ = μ_a − γ μ_b and γ = σ_ab / σ_bb.    (9.109)
In this question, give the answers symbolically, not the actual numerical values. Those
come in the next exercise. (a) What is the marginal distribution of Y( b) ? Write it as
a linear model, without any zeroes in the coefficient matrix. (Note that the design
matrix will not be the entire x.) (b) What are the MLE’s for μ b and σbb ? (c) Give the
MLE’s of μ a , σab , and σaa in terms of the MLE’s of μ ∗ , μ b , γ, σbb and σaa·b. (d) What is
the deviance of this model? How many free parameters (give the actual number) are
there? (e) Consider the model with β a = 0. Is the MLE of αb the same or different
than in the original model? What about the MLE of σaa·b ? Or of σbb ?
Exercise 9.6.14 (Leprosy, Part II). Continue with the leprosy example from Part I,
Exercise 9.6.13. (a) For the original model in Part I, give the values of the MLE’s of
α a , β a , σaa·b and σbb . (Note that the MLE of σaa·b will be different than the unbiased
estimate of 16.05.) (b) Now consider four models: The original model, the model
with β a = 0, the model with α a = 0, and the model with α a = β a = 0. For each, find
the MLE’s of σaa·b , σbb , the deviance (just using the log terms, not the nq), the number
of free parameters, the BIC, and the BIC-based estimate of the posterior probability
(in percent) of the model. Which model has the highest probability? (c) What is the
probability (in percent) that the drug vs. placebo effect is in the model? The Drug A
vs. Drug D effect?
Exercise 9.6.15 (Skulls). For the data on Egyptian skulls (Exercises 4.4.2, 6.5.5, and
7.7.12), consider the linear model over time, so that
x = ( 1 −3
      1 −1
      1  0
      1  1
      1  3 ) ⊗ 1_30.    (9.110)
Exercise 9.6.16. (This is a discussion question, in that there is no exact answer. Your
reasoning should be sound, though.) Suppose you are comparing a number of mod-
els using BIC, and the lowest BIC is bmin . How much larger than bmin would a BIC
have to be for you to consider the corresponding model ignorable? That is, what is δ
so that models with BIC > bmin + δ don’t seem especially viable. Why?
Exercise 9.6.17. Often, in hypothesis testing, people misinterpret the p-value to be the
probability that the null is true, given the data. We can approximately compare the
two values using the ideas in this chapter. Consider two models, the null (M0 ) and
alternative (M A ), where the null is contained in the alternative. Let deviance0 and
deviance A be their deviances, and dim0 and dim A be their dimensions, respectively.
Supposing that the assumptions are reasonable, the p-value for testing the null is
p-value = P [ χ2ν > δ], where ν = dim A − dim0 and δ = deviance0 − deviance A . (a)
Give the BIC-based estimate of the probability of the null for a given ν, δ and sample
size n. (b) For each of various values of n and ν (e.g, n = 1, 5, 10, 25, 100, 1000 and
ν = 1, 5, 10, 25), find the δ that gives a p-value of 5%, and find the corresponding
estimate of the probability of the null. (c) Are the probabilities of the null close to
5%? What do you conclude?
Chapter 10

Covariance Models
The models so far have been on the means of the variables. In this chapter, we
look at some models for the covariance matrix. We start with testing the equality of
covariance matrices, then move on to testing independence and conditional indepen-
dence of sets of variables. Next is factor analysis, where the relationships among the
variables are assumed to be determined by latent (unobserved) variables. Principal
component analysis is sometimes thought of as a type of factor analysis, although it is
more of a decomposition than actual factor analysis. See Section 13.1.5. We conclude
with a particular class of structural models, called invariant normal models
We will base our hypothesis tests on Wishart matrices (one, or several independent
ones). In practice, these matrices will often arise from the residuals in linear models,
especially the Y Qx Y as in (6.18). If U ∼ Wishartq (ν, Σ), where Σ is invertible and
ν ≥ q, then the likelihood is
L(Σ; U) = |Σ|^{−ν/2} e^{−trace(Σ^{-1} U)/2}.    (10.1)
The likelihood follows from the density in (8.71). An alternative derivation is to note
that by (8.54), Z ∼ N(0, I_ν ⊗ Σ) has likelihood L∗(Σ; z) = L(Σ; z′z). Thus z′z is a
sufficient statistic, and there is a theorem that states that the likelihood for any X is
the same as the likelihood for its sufficient statistic. Since Z′Z =_D U, (10.1) is the
likelihood for U.
Recall from (5.34) that Sq+ denotes the set of q × q positive definite symmetric
matrices. Then Lemma 9.1 shows that the MLE of Σ ∈ Sq+ based on (10.1) is
Σ̂ = U / ν,    (10.2)
H0 : Σ_1 = Σ_2 versus H_A : Σ_1 ≠ Σ_2,    (10.5)
where both Σ1 and Σ2 are in Sq+ . (That is, we are not assuming any particular struc-
ture for the covariance matrices.) We need the likelihoods under the two hypotheses.
Because the Ui ’s are independent,
L(Σ_1, Σ_2; U_1, U_2) = |Σ_1|^{−ν_1/2} e^{−trace(Σ_1^{-1} U_1)/2} |Σ_2|^{−ν_2/2} e^{−trace(Σ_2^{-1} U_2)/2},    (10.6)
where Σ is the common value of Σ1 and Σ2 . The MLE under the alternative hypoth-
esis is found by maximizing (10.5), which results in two separate maximizations:
Under H_A : Σ̂_A1 = U_1/ν_1, Σ̂_A2 = U_2/ν_2.    (10.8)
Under H_0 : Σ̂_01 = Σ̂_02 = (U_1 + U_2)/(ν_1 + ν_2).    (10.9)
Thus

sup_{H_A} L = |U_1/ν_1|^{−ν_1/2} e^{−ν_1 q/2} |U_2/ν_2|^{−ν_2/2} e^{−ν_2 q/2},    (10.10)

and

sup_{H_0} L = |(U_1 + U_2)/(ν_1 + ν_2)|^{−(ν_1+ν_2)/2} e^{−(ν_1+ν_2) q/2}.    (10.11)
Taking the ratio, note that the parts in the e cancel, hence
And

2 log(LR) = (ν_1 + ν_2) log|(U_1 + U_2)/(ν_1 + ν_2)| − ν_1 log|U_1/ν_1| − ν_2 log|U_2/ν_2|.    (10.13)
Under the null hypothesis, 2 log( LR) approaches a χ2 as in (9.37). To figure out
the degrees of freedom, we have to find the number of free parameters under each
Thus, under H0 ,
2 log( LR) −→ χ2q( q+1) /2. (10.15)
and

Women: (1/ν_2) U_2 = ( 121.76  113.31   58.33  40.79   40.91
                       113.31  212.33  124.65  52.51   50.60
                        58.33  124.65  373.84  56.29   74.49
                        40.79   52.51   56.29  88.47   60.93
                        40.91   50.60   74.49  60.93  112.88 ).    (10.17)
These covariance matrices are clearly not equal, but are the differences significant?
The pooled estimate, i.e., the common estimate under H0 , is
(1/(ν_1 + ν_2)) (U_1 + U_2) = ( 137.04  144.89   74.75  44.53   48.21
                                144.89  251.11  152.79  55.64   57.03
                                 74.75  152.79  525.59  51.16   63.30
                                 44.53   55.64   51.16  85.69   57.29
                                 48.21   57.03   63.30  57.29  107.46 ).    (10.18)
Then
2 log(LR) = (ν_1 + ν_2) log|(U_1 + U_2)/(ν_1 + ν_2)| − ν_1 log|U_1/ν_1| − ν_2 log|U_2/ν_2|
          = 105 log(2.6090 × 10^10) − 36 log(2.9819 × 10^10) − 69 log(1.8149 × 10^10)
          = 20.2331.    (10.19)
The degrees of freedom for the χ2 is q (q + 1)/2 = 5 × 6/2 = 15. The p-value is 0.16,
which shows that we have not found a significant difference between the covariance
matrices.
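A sketch of this calculation in R, taking the two Wishart matrices and their degrees of
freedom as generic inputs (the function name is a hypothetical helper):

cov.equal.test <- function(u1, nu1, u2, nu2) {
  q <- nrow(u1)
  lr2 <- (nu1 + nu2) * log(det((u1 + u2)/(nu1 + nu2))) -
         nu1 * log(det(u1/nu1)) - nu2 * log(det(u2/nu2))     # (10.13)
  df <- q*(q + 1)/2
  c(chi2 = lr2, df = df, pvalue = 1 - pchisq(lr2, df))
}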
where
U11 and Σ11 are q1 × q1 , and U22 and Σ22 are q2 × q2 ; q = q1 + q2 . (10.24)
Presuming the Wishart arises from multivariate normals, we wish to test whether the
two blocks of variables are independent, which translates to testing
Under the alternative, the likelihood is just the one in (10.1), hence
sup_{H_A} L(Σ; U) = |U/ν|^{−ν/2} e^{−νq/2}.    (10.26)
Under the null, the two diagonal blocks can be maximized separately, so that

sup_{H_0} L(Σ; U) = |U_11/ν|^{−ν/2} |U_22/ν|^{−ν/2} e^{−νq/2}.    (10.28)
Taking the ratio of (10.26) and (10.28), the parts in the exponent of the e again
cancel, hence
2 log( LR) = ν (log(|U11 /ν|) + log(|U22 /ν|) − log(|U/ν|)). (10.30)
(The ν’s in the denominators of the determinants cancel, so they can be erased if
desired.)
Section 13.3 considers canonical correlations, which are a way to summarize rela-
tionships between two sets of variables.
Here the degrees of freedom in the χ2 are q1 × q2 = 6, because that is the number of
covariances we are setting to 0 in the null. Or you can count
dim( H A ) = q (q + 1)/2 = 15,
dim( H0 ) = q1 (q1 + 1)/2 + q2 (q2 + 1)/2 = 6 + 3 = 9, (10.32)
which has dim( H A ) − dim( H0 ) = 6. In either case, the result is clearly significant (the
p-value is less than 0.0001), hence indeed the two sets of scores are not independent.
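The statistic (10.30) is equally easy to compute; the sketch below takes the Wishart
matrix u, its degrees of freedom nu, and the size q1 of the first block as generic inputs.

indep.test <- function(u, nu, q1) {
  q <- nrow(u); i1 <- 1:q1; i2 <- (q1 + 1):q
  lr2 <- nu * (log(det(u[i1, i1]/nu)) + log(det(u[i2, i2]/nu)) - log(det(u/nu)))  # (10.30)
  df <- q1 * (q - q1)                     # number of covariances set to zero
  c(chi2 = lr2, df = df, pvalue = 1 - pchisq(lr2, df))
}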
Testing the independence of several blocks of variables is almost as easy. Consider
the three variables homework, labs, and midterms, which have covariance matrix

Σ = ( σ_11  σ_13  σ_14
      σ_31  σ_33  σ_34
      σ_41  σ_43  σ_44 )    (10.33)
where

Σ_R = ( Σ_11  Σ_12  Σ_13
        Σ_21  Σ_22  Σ_23
        Σ_31  Σ_32  Σ_33 ),    (10.37)
so that Σii is q i × q i . The null hypothesis is
we know from Proposition 8.1 that the conditional covariance is also Wishart, but
loses q3 degrees of freedom:
U_{(1:2)(1:2)·3} ≡ ( U_{11·3}  U_{12·3}
                     U_{21·3}  U_{22·3} ) ∼ Wishart_{q_1+q_2}( ν − q_3, ( Σ_{11·3}  Σ_{12·3}
                                                                          Σ_{21·3}  Σ_{22·3} ) ).    (10.42)
(The U is partitioned analogously to the Σ R .) Then testing the hypothesis Σ12·3 here
is the same as (10.30) but after dotting out 3:
( Y1 , Y2 ) | Y3 = y 3 ∼ N ( x ∗ β ∗ , I n ⊗ Σ ∗ ) , (10.44)
where
Σ11·3 Σ12·3
x∗ = (x, y3 ) and Σ∗ = . (10.45)
Σ21·3 Σ22·3
Then
U(1:2)(1:2)·3 = (Y1 , Y2 ) Qx∗ (Y1 , Y2 ). (10.46)
See Exercise 7.7.7.
We note that there appears to be an ambiguity in the denominators of the Ui ’s for
the 2 log( LR). That is, if we base the likelihood on the original Y of (10.36), then the
denominators will be n. If we use the original U in (10.41), the denominators will
be n − p. And what we actually used, based on the conditional covariance matrix
in (10.42), were n − p − q3 . All three possibilities are fine in that the asymptotics as
n → ∞ are valid. We chose the one we did because it is the most focussed, i.e., there
are no parameters involved (e.g., β) that are not directly related to the hypotheses.
Testing the independence of three or more blocks of variables, given another
block, again uses the dotted-out Wishart matrix. For example, consider Example
10.2.1 with variables homework, inclass, and midterms, but test whether those three
are conditionally independent given the “block 4” variables, labs and final. The
conditional U matrix is now denoted U(1:3)(1:3)·4, and the degrees of freedom are
ν − q4 = 105 − 2 = 103, so that the estimate of the conditional covariance matrix is
( σ̂_{11·4}  σ̂_{12·4}  σ̂_{13·4}
  σ̂_{21·4}  σ̂_{22·4}  σ̂_{23·4}
  σ̂_{31·4}  σ̂_{32·4}  σ̂_{33·4} ) = (1/(ν − q_4)) U_{(1:3)(1:3)·4}
                                   = (  51.9536  −18.3868   5.2905
                                       −18.3868  432.1977   3.8627
                                         5.2905    3.8627  53.2762 ).    (10.47)
Then, to test
H0 : σ12·4 = σ13·4 = σ23·4 = 0 versus H A : not, (10.48)
we use the statistic analogous to (10.35),
The degrees of freedom for the χ2 is again 3, so we accept the null: There does not
appear to be significant relationship among these three variables given the labs and
final scores. This implies, among other things, that once we know someone’s labs and
final scores, knowing the homework or inclass will not help in guessing the midterms
score. We could also look at the sample correlations, unconditionally (from (10.18))
and conditionally:
Unconditional Conditional on Labs, Final
HW InClass Midterms HW InClass Midterms
HW 1.00 0.28 0.41 1.00 −0.12 0.10
InClass 0.28 1.00 0.24 −0.12 1.00 0.03
Midterms 0.41 0.24 1.00 0.10 0.03 1.00
(10.50)
Notice that the conditional correlations are much smaller than the unconditional
ones, and the conditional correlation between homework and inclass scores is nega-
tive, though not significantly so. Thus it appears that the labs and final scores explain
the relationships among the other variables.
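These conditional correlations can be computed directly from the pooled covariance matrix. A minimal sketch, assuming sigmahat holds the pooled 5 × 5 covariance matrix of (10.18), with the variables ordered homework, labs, inclass, midterms, final:
a <- c(1,3,4) # homework, inclass, midterms
b <- c(2,5) # labs, final
cov2cor(sigmahat[a,a]) # unconditional correlations
condcov <- sigmahat[a,a] - sigmahat[a,b]%*%solve(sigmahat[b,b])%*%sigmahat[b,a]
cov2cor(condcov) # correlations conditional on labs and final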
Ψ = ΣYY − ΣYX ΣXX^{−1} ΣXY  ⇒  ΣYY = ΣYX ΣXX^{−1} ΣXY + Ψ,    (10.53)
so that marginally,
Y ∼ N(Dγ, In ⊗ (ΣYX ΣXX^{−1} ΣXY + Ψ)).    (10.54)
Because Y is all we observe, we cannot estimate ΣXY or ΣXX separately, but only the
function ΣYX ΣXX^{−1} ΣXY. Note that if we replace X with X* = AX for some invertible
matrix A,
ΣYX ΣXX^{−1} ΣXY = ΣYX* ΣX*X*^{−1} ΣX*Y,    (10.55)
X ∼ N(0, In ⊗ Ip).    (10.56)
Then, letting β = ΣXX^{−1} ΣXY = ΣXY, we can write the model as
Y = Dγ + Xβ + R,  R ∼ N(0, In ⊗ Ψ),
where X and R are independent. The equation decomposes each variable (column)
in Y into the fixed mean plus the part depending on the factors plus the parts unique
to the individual variables. The element β ij is called the loading of factor i on the
variable j. The variance ψ j is the unique variance of variable j, i.e., the part not
explained by the factors. Any measurement error is assumed to be part of the unique
variance.
There is the statistical problem of estimating the model, meaning the β β and Ψ
(and γ, but we already know about that), and the interpretative problem of finding
and defining the resulting factors. We will take these concerns up in the next two
subsections.
10.3.1 Estimation
We estimate the γ using least squares as usual, i.e.,
γ̂ = (D′D)^{−1} D′Y.    (10.59)
Then the residual sum of squares matrix is used to estimate the β and Ψ:
The parameters are still not estimable, because for any p × p orthogonal matrix
Γ, (Γβ) (Γβ) yields the same β β. We can use the QR decomposition from Theorem
5.3. Our β is p × q with p < q. Write β = ( β 1 , β 2 ), where β 1 has the first p columns
of β. We apply the QR decomposition to β 1 , assuming the columns are linearly
independent. Then β 1 = QR, where Q is orthogonal and R is upper triangular with
positive diagonal elements. Thus we can write Q′β1 = R, or
Q′β = (Q′β1 , Q′β2 ) = (R, R*) ≡ β*,    (10.61)
where the β∗ii ’s are positive. If we require that β satisfies constraints (10.62), then it
is estimable. (Exercise 10.5.6.) Note that there are p( p − 1)/2 non-free parameters
(since β ij = 0 for i > j), which means the number of free parameters in the model
is pq − p( p − 1)/2 for the β part, and q for Ψ. Thus for the p-factor model M p , the
number of free parameters is
dp ≡ dim(Mp) = q(p + 1) − p(p − 1)/2.    (10.63)
(We are ignoring the parameters in the γ, because they are the same for all the models
we consider.) In order to have a hope of estimating the factors, the dimension of the
factor model cannot exceed the dimension of the most general model, ΣYY ∈ Sq+ ,
which has q (q + 1)/2 parameters. Thus for identifiability we need
q(q + 1)/2 − dp = ((q − p)² − p − q)/2 ≥ 0.    (10.64)
E.g., if there are q = 10 variables, at most p = 6 factors can be estimated.
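As a quick check of (10.63) and (10.64) in R (hypothetical helper functions, not from the text):
dp <- function(p,q) q*(p+1) - p*(p-1)/2 # dimension of the p-factor model (10.63)
fits <- function(p,q) ((q-p)^2 - p - q)/2 >= 0 # identifiability condition (10.64)
max(which(sapply(1:9,fits,q=10))) # yields 6: at most six factors when q = 10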
There are many methods for estimating β and Ψ. As in (10.1), the maximum
likelihood estimator maximizes
L(β, Ψ; U) = ( 1 / |β′β + Ψ|^{ν/2} ) e^{−(1/2) trace((β′β + Ψ)^{−1} U)}    (10.65)
over β satisfying (10.62) and Ψ being diagonal. There is not a closed form solution to
the maximization, so it must be done numerically. There may be problems, too, such
as having one or more of the ψj 's being driven to 0. It is not obvious, but if β̂ and Ψ̂
are the MLE's, then the maximum of the likelihood is, similar to (10.3),
L(β̂, Ψ̂; U) = ( 1 / |β̂′β̂ + Ψ̂|^{ν/2} ) e^{−νq/2}.    (10.66)
The MLE for HA is Σ̂YY = U/ν, so that
LR = ( |β̂′β̂ + Ψ̂| / |U/ν| )^{ν/2}.    (10.68)
Now
2 log(LR) = ν (log(|β̂′β̂ + Ψ̂|) − log(|U/ν|)),    (10.69)
which is asymptotically χ²_df with df being the difference in (10.64). Bartlett suggests
a slight adjustment to the factor ν, similar to the Box approximation for Wilks' Λ, so
that under the null,
2 log(LR)* = (ν − (2q + 5)/6 − 2p/3) (log(|β̂′β̂ + Ψ̂|) − log(|U/ν|)) −→ χ²_df,    (10.70)
where
df = ((q − p)² − p − q)/2.    (10.71)
Alternatively, one can use AIC (9.50) or BIC (9.51) to assess M p for several p.
Because νq is the same for all models, we can take
deviance(Mp(β̂, Ψ̂); y) = ν log(|β̂′β̂ + Ψ̂|),    (10.72)
so that
where
β* = (β′β + Ψ)^{−1} β′,  α* = −Dγβ*,    (10.77)
and
ΣXX·Y = Ip − β(β′β + Ψ)^{−1} β′.    (10.78)
The estimated factor scores are then
X̂ = (y − Dγ̂)(β̂′β̂ + Ψ̂)^{−1} β̂′.    (10.79)
E(Y) = Dγ,    (10.80)
where D is a 107 × 2 matrix that distinguishes men from women. The first step is to
estimate ΣYY:
Σ̂YY = (1/ν) Y′QD Y,    (10.81)
where here ν = 107 − 2 (since D has two columns), which is the pooled covariance
matrix in (10.18).
We illustrate with the R program factanal. The input to the program can be a data
matrix or a covariance matrix or a correlation matrix. In any case, the program will
base its calculations on the correlation matrix. Unless D is just a column of 1’s, you
shouldn’t give it Y, but S = Y QD Y/ν, where ν = n − k if D is n × k. You need to also
specify how many factors you want, and the number of observations (actually, ν + 1
for us). We’ll start with one factor. The sigmahat is the S, and covmat= indicates to R
that you are giving it a covariance matrix. (Do the same if you are giving a correlation
matrix.) In such cases, the program does not know what n or k is, so you should set
the parameter n.obs. It assumes that D is 1n , i.e., that k = 1, so to trick it into using
another k, set n.obs to n − k + 1, which in our case is 106. Then the one-factor model
is fit to the sigmahat in (10.18) using
f <- factanal(covmat=sigmahat,factors=1,n.obs=106)
and
              HW     Labs   InClass  Midterms  Final
β̂:  Factor1  0.868   0.886   0.415    0.484    0.463      (10.83)
The given loadings and uniquenesses are based on the correlation matrix, so the fitted
correlation matrix can be found using
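One way (a sketch, not necessarily the exact code used) is to reconstruct β̂′β̂ + Ψ̂ from the loadings and uniquenesses:
fit1 <- f$loadings%*%t(f$loadings) + diag(f$uniquenesses) # fitted correlation matrix
round(fit1,2)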
The result is
Compare that to the observed correlation matrix, which is in the matrix f$corr:
The fitted correlations are reasonably close to the observed ones, except for the
midterms/final correlation: The actual is 0.60, but the estimate from the one-factor
model is only 0.22. It appears that this single factor is more focused on other correla-
tions.
For a formal goodness-of-fit test, we have
We can use either the correlation or covariance matrices, as long as we are consistent,
and since factanal gives the correlation, we might as well use that. The MLE under
H A is then corrA, the correlation matrix obtained from S, and under H0 is corr0. Then
the statistic 2 log(LR) is found in R using
105*log(det(corr0)/det(f$corr))
yielding the value 37.65. It is probably better to use Bartlett's refinement (10.70),
(105 − (2*5+5)/6 − 2/3)*log(det(corr0)/det(f$corr))
which gives 36.51. This value can be found in f$STATISTIC, or by printing out f. The
degrees of freedom for the statistic in (10.71) is ((q − p)2 − p − q )/2 = 5, since p = 1
and q = 5. Thus H0 is rejected: The one-factor model does not fit.
Two factors
With small q, we have to be careful not to ask for too many factors. By (10.64), two is
the maximum when q = 5. In R, we just need to set factors=2 in the factanal function.
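For example (the name f2 is ours, not from the text):
f2 <- factanal(covmat=sigmahat,factors=2,n.obs=106)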
The χ2 for goodness-of-fit is 2.11, on one degree of freedom, hence the two-factor
model fits fine. The estimated correlation matrix is now
Two-factor model    HW    Labs  InClass  Midterms  Final
HW                 1.00   0.78    0.35     0.40     0.40
Labs               0.78   1.00    0.42     0.38     0.35
InClass            0.35   0.42    1.00     0.24     0.25      (10.88)
Midterms           0.40   0.38    0.24     1.00     0.60
Final              0.40   0.35    0.25     0.60     1.00
which is quite close to the observed correlation matrix (10.85) above. Only the In-
Class/HW correlation is a bit off, but not by much.
The uniquenesses and loadings for this model are
and
HW Labs InClass Midterms Final
β : Factor 1 0.742 0.982 0.391 0.268 0.211 (10.90)
Factor 2 0.299 0.173 0.208 0.672 0.807
The routine gives the loadings using the varimax criterion.
Looking at the uniquenesses, we notice that inclass’s is quite large, which suggests
that it has a factor unique to itself, e.g., being able to get to class. It has fairly low
loadings on both factors. We see that the first factor loads highly on homework
and labs, especially labs, and the second loads heavily on the exams, midterms and
final. (These results are not surprising given the example in Section 10.2.2, where we
see homework, inclass, and midterms are conditionally independent given labs and
final.) So one could label the factors “Diligence” and “Test taking ability”.
The exact same fit can be achieved by using other rotations Γβ, for a 2 × 2 orthog-
onal matrix Γ. Consider the rotation
Γ = (1/√2) ⎛ 1   1 ⎞ .    (10.91)
           ⎝ 1  −1 ⎠
Now Factor∗ 1 could be considered an overall ability factor, and Factor∗ 2 a contrast
of HW+Lab and Midterms+Final.
Any rotation is fine — whichever you can interpret easiest is the one to take.
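To try the rotation (10.91) in R (a sketch, assuming f2 holds the two-factor fit as above; recall that R stores the loadings matrix as the transpose of β̂):
G <- matrix(c(1,1,1,-1),2,2)/sqrt(2) # the Γ of (10.91)
rotated <- f2$loadings%*%t(G) # loadings of the rotated factors Γβ̂
round(rotated,3)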
The only difference between the two BIC columns is that the second one has 127.226
added to each element, making it easier to compare them. These results conform to
what we had before. The one-factor model is untenable, and the two-factor model is
fine, with 77% estimated probability. The full model has a decent probability as well.
To estimate the factor scores, we first adjust the grades for the design (intercept and gender), then use the scores option of factanal on the residuals:
x <- cbind(1,grades[,1])
gammahat <- solve(t(x)%*%x,t(x)%*%grades[,2:6])
resids <- grades[,2:6] - x%*%gammahat
xhat <- factanal(resids,factors=2,scores='regression')$scores
Now we can use the factor scores in scatter plots. For example, Figure 10.1 contains
a scatter plot of the estimated factor scores for the two-factor model. They are by
construction uncorrelated, but one can see how diligence has a much longer lower
tail (lazy people?).
We also calculated box plots to compare the women’s and men’s distribution on
the factors:
par(mfrow=c(1,2))
yl <- range(xhat) # To obtain the same y-scales
w <- (x[,2]==1) # Whether women (T) or not.
boxplot(list(Women=xhat[w,1],Men=xhat[!w,1]),main='Factor 1',ylim=yl)
boxplot(list(Women=xhat[w,2],Men=xhat[!w,2]),main='Factor 2',ylim=yl)
See Figure 10.2. There do not appear to be any large overall differences.
Figure 10.2: Box plots comparing the women and men on their factor scores.
Thus we can define the mean zero invariant normal model based on G to be (10.95)
with
Σ ∈ Sq+ (G) ≡ {Σ ∈ Sq+ | Σ = g′Σg for all g ∈ G}.    (10.98)
A few examples are in order at this point. Typically, the groups are fairly simple
groups.
and ⎛ ⎞
Σ11 Σ12 ··· Σ1K
⎜ Σ21 Σ22 ··· Σ2K ⎟
⎜ ⎟
Σ=⎜ . .. .. .. ⎟ , Σkl is q k × q l . (10.100)
⎝ .. . . . ⎠
ΣK1 ΣK2 ··· ΣKK
Independence of one block of variables from the others entails setting the covari-
ance to zero, that is, Σkl = 0 means Yk and Yl are independent. Invariant normal
models can specify a block being independent of all the other blocks. For example,
suppose K = 3. Then the model that Y1 is independent of (Y2 , Y3 ) has
⎛ ⎞
Σ11 0 0
Σ=⎝ 0 Σ22 Σ23 ⎠ . (10.101)
0 Σ32 Σ33
The group that gives rise to that model consists of two elements:
⎧⎛ ⎞ ⎛ ⎞⎫
⎨ Iq1 0 0 − Iq1 0 0 ⎬
G = ⎝ 0 Iq2 0 ⎠ , ⎝ 0 Iq2 0 ⎠ . (10.102)
⎩ ⎭
0 0 Iq3 0 0 Iq3
(The first element is just Iq , of course.) It is easy to see that Σ of (10.101) is invariant
under G of (10.102). Lemma 10.1 below can be used to show any Σ in S + (G) is of the
form (10.101).
If the three blocks Y1 , Y2 and Y3 are mutually independent, then Σ is block diago-
nal, ⎛ ⎞
Σ11 0 0
Σ=⎝ 0 Σ22 0 ⎠, (10.103)
0 0 Σ33
and the corresponding G consists of the eight matrices
⎧⎛ ⎞⎫
⎨ ± Iq1 0 0 ⎬
G= ⎝ 0 ± Iq2 0 ⎠ . (10.104)
⎩ ⎭
0 0 ± Iq3
measuring the same ability. In such cases, the covariance matrix would have equal
variances, and equal covariances:
⎛ ⎞
1 ρ ··· ρ
⎜ ρ 1 ··· ρ ⎟
⎜ ⎟
Σ = σ2 ⎜ .. .. .. .. ⎟. (10.105)
⎝ . . . . ⎠
ρ ··· ρ 1
Compound symmetry
Compound symmetry is an extension of intraclass symmetry, where there are groups
of variables, and the variables within each group are interchangeable. Such models
might arise, e.g., if students are given three interchangeable batteries of math ques-
tions, and two interchangeable batteries of verbal questions. The covariance matrix
would then have the form
⎛ ⎞
a b b c c
⎜ b a b c c ⎟
⎜ ⎟
Σ=⎜ b b a c c ⎟. (10.106)
⎝ c c c d e ⎠
c c c e d
In general, the group would consist of block diagonal matrices, with permutation
matrices as the blocks. That is, with Σ partitioned as in (10.100),
G = { diag(G1 , G2 , . . . , GK ) | G1 ∈ Pq1 , G2 ∈ Pq2 , . . . , GK ∈ PqK }.    (10.107)
For a finite group G with #G elements, define the average
Σ̄ = ( ∑g∈G g′Σg ) / #G.    (10.108)
It should be clear that if Σ ∈ Sq+ (G), then Σ̄ = Σ. The next lemma shows that all
averages are in Sq+ (G).
h′Σ̄h = ( ∑g∈G h′g′Σgh ) / #G
     = ( ∑g*∈G g*′Σg* ) / #G
     = Σ̄.    (10.109)
The second line follows by setting g∗ = gh, and noting that as g runs over G , so does
g∗ . (This is where the requirement that G is a group is needed.) But (10.109) implies
that Σ̄ ∈ Sq+ (G).
so that one can discover the structure of covariance matrices invariant under a partic-
ular group by averaging a generic Σ. That is how one finds the structures in (10.101),
(10.103), (10.106), and (10.107) from their respective groups.
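The averaging is also easy to do numerically, which gives a quick way to see what a given group implies. A small sketch for the intraclass case with q = 3, averaging g′Σg over the six permutation matrices (the matrices and Σ here are purely illustrative):
perms <- list(c(1,2,3),c(1,3,2),c(2,1,3),c(2,3,1),c(3,1,2),c(3,2,1))
Sigma <- matrix(c(4,1,2, 1,5,3, 2,3,6),3,3) # a generic symmetric matrix
avg <- matrix(0,3,3)
for(p in perms) {
g <- diag(3)[,p] # the permutation matrix for this permutation
avg <- avg + t(g)%*%Sigma%*%g
}
avg/length(perms) # has the intraclass form (10.105)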
L(Σ; y) = ( 1 / |Σ|^{n/2} ) e^{−(1/2) trace(Σ^{−1} u)},  Σ ∈ Sq+ (G),  where u = y′y.    (10.111)
Thus
L(Σ; y) = ( 1 / |Σ|^{n/2} ) e^{−(1/2) trace(Σ^{−1} ū)},    (10.114)
where ū is the average of the g′ug over g ∈ G. We know from Lemma 9.1 that the
maximizer of L in (10.114) over Σ ∈ Sq+ is
Σ̂ = ū / n,    (10.115)
but since that maximizer is in Sq+ (G) by Lemma 10.1, and Sq+ (G) ⊂ Sq+, it must
be the maximizer over Sq+ (G). That is, (10.115) is indeed the maximum likelihood
estimate for (10.111).
To illustrate, let S = U/n. Then if G is as in (10.104), so that the model is that
three sets of variables are independent (10.103), the maximum likelihood estimate is
the sample analog
         ⎛ S11   0    0  ⎞
Σ̂(G) =  ⎜  0   S22   0  ⎟ .    (10.116)
         ⎝  0    0   S33 ⎠
In the intraclass correlation model (10.105), the group is the set of q × q permutation
matrices, and the maximum likelihood estimate has the same form,
            ⎛ 1   ρ̂   · · ·  ρ̂ ⎞
Σ̂(G) = σ̂²  ⎜ ρ̂   1   · · ·  ρ̂ ⎟ ,    (10.117)
            ⎝ ρ̂  · · ·   ρ̂   1 ⎠
where
σ̂² = (1/q) ∑_{i=1}^q sii ,  and  ρ̂ σ̂² = ( ∑_{1≤i<j≤q} sij ) / ( q(q − 1)/2 ).    (10.118)
That is, the common variance is the average of the original variances, and the common
covariance is the average of the original covariances.
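In R these estimates are immediate. A sketch, assuming s holds a q × q sample covariance matrix:
q <- nrow(s)
sigma2hat <- mean(diag(s)) # average of the variances
rhohat <- mean(s[upper.tri(s)])/sigma2hat # average covariance over average variance
sigma0 <- sigma2hat*((1-rhohat)*diag(q) + rhohat) # the estimate in (10.117)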
deviance(M(G)) = n log(|Σ̂(G)|),    (10.119)
where we drop the exponential term since nq is the same for all models. We can
then use this deviance in finding AIC’s or BIC’s for comparing such models, once we
figure out the dimensions of the models, which is usually not too hard. E.g., if the
model is that Σ is unrestricted, so that G A = {Iq }, the trivial group, the dimension
for HA is q(q + 1)/2. The dimension for the independence model in (10.103) and (10.104)
sums the dimensions for the diagonal blocks: q1(q1 + 1)/2 + q2(q2 + 1)/2 + q3(q3 + 1)/2. The dimension
for the intraclass correlation model (10.105) is 2 (for the variance and covariance).
Also, the likelihood ratio statistic for testing two nested invariant normal models
is easy to find. These testing problems use two nested groups, G A ⊂ G0 , so that the
hypotheses are
H0 : Σ ∈ Sq+ (G0 ) versus H A : Σ ∈ Sq+ (G A ), (10.120)
Note that the larger G , the smaller Sq+ (G), since fewer covariance matrices are in-
variant under a larger group. Then the likelihood ratio test statistic, 2 log( LR), is the
difference of the deviances, as in (9.49).
Σ̂ = U/n  or  U/(n − p),    (10.121)
(where x is n × p), depending on whether you want the maximum likelihood estimate
or an unbiased estimate. In testing, I would suggest taking the unbiased versions,
then using
deviance(M(G)) = (n − p) log(|Σ̂(G)|).    (10.122)
and
      ⎛ 5.260  3.285  3.285  3.285 ⎞
Σ̂0 =  ⎜ 3.285  5.260  3.285  3.285 ⎟ .    (10.124)
      ⎜ 3.285  3.285  5.260  3.285 ⎟
      ⎝ 3.285  3.285  3.285  5.260 ⎠
To test the null hypothesis that the intraclass correlation structure holds, versus
the general model, we have from (10.119)
2 log(LR) = 25 (log(|Σ̂0 |) − log(|Σ̂A |)) = 9.374.    (10.125)
The dimension for the general model is d A = q (q + 1)/2 = 10, and for the null is
just d0 = 2, thus the degrees of freedom for this statistic is d f = d A − d0 = 8. The
intraclass correlation structure appears to be plausible.
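A sketch of this test in R, assuming sigmaA holds the unrestricted estimate Σ̂A, sigma0 the intraclass estimate Σ̂0 (as computed above), and nu = 25 as in (10.125):
lr <- nu*(log(det(sigma0)) - log(det(sigmaA))) # 2 log(LR) of (10.125)
df <- 10 - 2 # dA - d0
1 - pchisq(lr,df) # p-value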
We can exploit this structure (10.105) on the Σ R to more easily test hypotheses
about the β in both-sides models like (7.40). First, we transform the matrix Σ R into a
diagonal matrix with two distinct variances. Notice that we can write this covariance
as
Σ R = σ2 (1 − ρ)Iq + σ2 ρ1q 1q . (10.126)
Γ′ΣRΓ = σ²(1 − ρ) Γ′Γ + σ²ρ Γ′1q 1q′Γ
      = σ²(1 − ρ) Iq + σ²ρ (√q, 0, . . . , 0)′ (√q, 0, . . . , 0)
      = σ² ⎛ 1 + (q − 1)ρ       0       ⎞  ≡ Λ.    (10.127)
           ⎝       0       (1 − ρ)Iq−1 ⎠
We used the fact that because all columns of Γ except the first are orthogonal to 1q ,
√
Γ 1q = q (1, 0, . . . , 0) . As suggested by the notation, this Λ is indeed the eigenvalue
matrix for Σ R , and Γ contains a corresponding set of eigenvectors.
In the model (7.40), the z is almost an appropriate Γ:
⎛ ⎞
1 −3 1 −3
⎜ 1 −1 −1 1 ⎟
z=⎜
⎝ 1
⎟. (10.128)
1 −1 −1 ⎠
1 3 1 3
The columns are orthogonal, and the first is 14 , so we just have to divide each column
by its length to obtain orthonormal columns. The squared lengths of the columns are
the diagonals of z′z: (4, 20, 4, 20). Let Δ be the square root of z′z,
    ⎛ 2   0   0   0  ⎞
Δ = ⎜ 0  √20  0   0  ⎟ ,    (10.129)
    ⎜ 0   0   2   0  ⎟
    ⎝ 0   0   0  √20 ⎠
and set
Γ = zΔ−1 and β∗ = βΔ, (10.130)
so that the both-sides model can be written
Y∗ ≡ YΓ = xβ∗ + R∗ , (10.132)
where
R∗ ≡ RΓ ∼ N (0, In ⊗ Γ Σ R Γ ) = N (0, In ⊗ Λ). (10.133)
This process is so far similar to that in Section 7.5.1. The estimate of β∗ is straightfor-
ward:
β̂* = (x′x)^{−1} x′Y* = ⎛ 49.938   3.508   0.406  −0.252 ⎞ .    (10.134)
                       ⎝ −4.642  −1.363  −0.429   0.323 ⎠
These estimates are the same as those for model (6.28), multiplied by Δ as in (10.130).
The difference is in their covariance matrix:
To estimate the standard errors of the estimates, we look at the sum of squares
and cross products of the estimated residuals,
U* = Y*′Qx Y* ∼ Wishartq (ν, Λ),    (10.136)
U*11 ∼ τ0² χ²ν ,  U*jj ∼ τ1² χ²ν , j = 2, . . . , q = 4.    (10.137)
Standard errors
              Constant  Linear  Quadratic   Cubic
Boys            0.972    0.351     0.351    0.351      (10.141)
Girls−Boys      1.523    0.550     0.550    0.550

t-statistics
              Constant  Linear  Quadratic   Cubic
Boys           51.375    9.984     1.156   −0.716      (10.142)
Girls−Boys     −3.048   −2.477    −0.779    0.586
These statistics are not much different from what we found in Section 6.4.1, but the
degrees of freedom for all but the first column are now 75, rather than 25. The main
impact is in the significance of δ1 , the difference between the girls’ and boys’ slopes.
Previously, the p-value was 0.033 (the t = −2.26 on 25 df). Here, the p-value is 0.016,
a bit stronger suggestion of a difference.
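A sketch of the whole transformation in R, assuming y is the n × 4 data matrix, x the n × 2 design matrix, and z the 4 × 4 matrix in (10.128):
delta <- diag(sqrt(diag(t(z)%*%z))) # Δ of (10.129)
gamma <- z%*%solve(delta) # Γ = zΔ^{-1}
ystar <- y%*%gamma # Y* = YΓ
betastar <- solve(t(x)%*%x,t(x)%*%ystar) # β̂* of (10.134)
ustar <- t(ystar)%*%ystar - t(ystar)%*%x%*%betastar # U* = Y*'Q_x Y* of (10.136)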
10.5 Exercises
Exercise 10.5.1. Verify the likelihood ratio statistic (10.21) for testing the equality of
several covariance matrices as in (10.20).
Exercise 10.5.2. Verify that trace(Σ^{−1}U) = trace(Σ11^{−1}U11) + trace(Σ22^{−1}U22), as in
(10.27), for Σ being block-diagonal, i.e., Σ12 = 0 in (10.23).
Exercise 10.5.3. Show that the value of 2 log( LR) of (10.31) does not change if the ν’s
in the denominators are erased.
Exercise 10.5.4. Suppose U ∼ Wishartq (ν, Σ), where Σ is partitioned as
⎛ ⎞
Σ11 Σ12 ··· Σ1K
⎜ Σ21 Σ22 ··· Σ2K ⎟
⎜ ⎟
Σ=⎜ .. .. .. .. ⎟, (10.143)
⎝ . . . . ⎠
ΣK1 ΣK2 ··· ΣKK
where Σij is q i × q j , and the q i ’s sum to q. Consider testing the null hypothesis that
the blocks are mutually independent, i.e.,
versus the alternative that Σ is unrestricted. (a) Find the 2 log( LR), and the degrees
of freedom in the χ2 approximation. (The answer is analogous to that in (10.35).) (b)
Let U∗ = AUA for some diagonal matrix A with positive diagonal elements. Replace
the U in 2 log( LR) with U∗ . Show that the value of the statistic remains the same.
(c) Specialize to the case that all q i = 1 for all i, so that we are testing the mutual
independence of all the variables. Let C be the sample correlation matrix. Show that
2 log(LR) = −ν log(|C|). [Hint: Find the appropriate A from part (b).]
Exercise 10.5.14 (Mouth sizes). For the boys’ and girls’ mouth size data in Table 4.1,
let Σ B be the covariance matrix for the boys’ mouth sizes, and Σ G be the covariance
matrix for the girls’ mouth sizes. Consider testing
H0 : ΣB = ΣG versus HA : ΣB ≠ ΣG .    (10.147)
(a) What are the degrees of freedom for the boys’ and girls’ sample covariance matrices?
(b) Find |Σ̂B |, |Σ̂G |, and the pooled |Σ̂|. (Use the unbiased estimates of the Σi ’s.)
(c) Find 2 log(LR). What are the degrees of freedom for the χ2? What is the p-value?
Do you reject the null hypothesis (if α = .05)? (d) Look at trace(Σ̂B ), trace(Σ̂G ). Also,
look at the correlation matrices for the girls and for the boys. What do you see?
Exercise 10.5.15 (Mouth sizes). Continue with the mouth size data from Exercise
10.5.14. (a) Test whether Σ B has the intraclass correlation structure (versus the gen-
eral alternative). What are the degrees of freedom for the χ2 ? (b) Test whether Σ G
has the intraclass correlation structure. (c) Now assume that both Σ B and Σ G have
the intraclass correlation structure. Test whether the covariances matrices are equal.
What are the degrees of freedom for this test? What is the p-value? Compare this
p-value to that in Exercise 10.5.14, part (c). Why is it so much smaller (if it is)?
Exercise 10.5.16 (Grades). This problem considers the grades data. In what follows,
use the pooled covariance matrix in (10.18), which has ν = 105. (a) Test the indepen-
dence of the first three variables (homework, labs, inclass) from the fourth variable,
the midterms score. (So leave out the final exams at this point.) Find l1 , l2 , |Σ̂11 |, |Σ̂22 |,
and |Σ̂|. Also, find 2 log(LR) and the degrees of freedom for the χ2. Do you accept or
reject the null hypothesis? (b) Now test the conditional independence of the set (home-
work, labs, inclass) from the midterms, conditioning on the final exam score. What
is the ν for the estimated covariance matrix now? Find the new l1 , l2 , |Σ̂11 |, |Σ̂22 |, and
|Σ̂|. Also, find 2 log(LR) and the degrees of freedom for the χ2. Do you accept or
reject the null hypothesis? (c) Find the correlations between the homework, labs and
inclass scores and the midterms scores, as well as the conditional correlations given
the final exam. What do you notice?
Exercise 10.5.17 (Grades). The table in (10.93) has the BIC’s for the one-factor, two-
factor, and unrestricted models for the Grades data. Find the deviance, dimension,
and BIC for the zero-factor model, M0 . (See Exercise 10.5.9.) Find the estimated
probabilities of the four models. Compare the results to those without M0 .
Exercise 10.5.18 (Exams). The exams matrix has data on 191 statistics students, giving
their scores (out of 100) on the three midterm exams, and the final exam. (a) What
is the maximum number of factors that can be estimated? (b) Give the number of
parameters in the covariance matrices for the 0, 1, 2, and 3 factor models (even if they
are not estimable). (d) Plot the data. There are three obvious outliers. Which obser-
vations are they? What makes them outliers? For the remaining exercise, eliminate
these outliers, so that there are n = 188 observations. (c) Test the null hypothesis
that the four exams are mutually independent. What are the adjusted 2 log( LR)∗
(in (10.70)) and degrees of freedom for the χ2 ? What do you conclude? (d) Fit the
one-factor model. What are the loadings? How do you interpret them? (e) Look at
the residual matrix C − β β,
where C is the observed correlation matrix of the orig-
inal variables. If the model fits exactly, what values would the off-diagonals of the
residual matrix be? What is the largest off-diagonal in this observed matrix? Are the
diagonals of this matrix the uniquenesses? (f) Does the one-factor model fit?
Exercise 10.5.19 (Exams). Continue with the Exams data from Exercise 10.5.18. Again,
do not use the outliers found in part (d). Consider the invariant normal model where
the group G consists of 4 × 4 matrices of the form
G = ⎛ G*   03 ⎞    (10.148)
    ⎝ 03′   1 ⎠
where G∗ is a 3 × 3 permutation matrix. Thus the model is an example of com-
pound symmetry, from Section 10.4.1. The model assumes the three midterms are
interchangeable. (a) Give the form of a covariance matrix Σ which is invariant un-
der that G . (It should be like the upper-left 4 × 4 block of the matrix in (10.106).)
How many free parameters are there? (b) For the exams data, give the MLE of the
covariance matrix under the assumption that it is G -invariant. (c) Test whether this
symmetry assumption holds, versus the general model. What are the degrees of free-
dom? For which elements of Σ is the null hypothesis least tenable? (d) Assuming Σ
is G-invariant, test whether the first three variables are independent of the last. (That
is, the null hypothesis is that Σ is G -invariant and σ14 = σ24 = σ34 = 0, while the
alternative is that Σ is G -invariant, but otherwise unrestricted.) What are the degrees
of freedom for this test? What do you conclude?
Exercise 10.5.20 (South Africa heart disease). The data for this question comes from
a study of heart disease in adult males in South Africa from Rousseauw et al. [1983].
(We return to these data in Section 11.8.) The R data frame is SAheart, found in
the ElemStatLearn package [Halvorsen, 2009]. The main variable of interest is “chd”,
coronary heart disease, where 1 indicates the person has the disease, 0 he does
not. Explanatory variables include sbp (measurements on blood pressure), tobacco
use, ldl (bad cholesterol), adiposity (fat %), family history of heart disease (absent
or present), type A personality, obesity, alcohol usage, and age. Here you are to
find common factors among the explanatory variables excluding age and family
history. Take logs of the variables sbp, ldl, and obesity, and cube roots of alcohol
and tobacco, so that the data look more normal. Age is used as a covariate. Thus
Y is n × 7, and D = (1n xage ). Here, n = 462. (a) What is there about the tobacco
and alcohol variables that is distinctly non-normal? (b) Find the sample correlation
matrix of the residuals from the Y = Dγ + R model. Which pairs of variables have
correlations over 0.25, and what are their correlations? How would you group these
variables? (c) What is the largest number of factors that can be fit for this Y? (d) Give
the BIC-based probabilities of the p-factor models for p = 0 to the maximum found
in part (c), and for the unrestricted model. Which model has the highest probability?
Does this model fit, according to the χ2 goodness of fit test? (e) For the most probable
model from part (d), which variables’ loadings are highest (over 0.25) for each factor?
(Use the varimax rotation for the loadings.) Give relevant names to the two factors.
Compare the factors to what you found in part (b). (f) Keeping the same model,
find the estimated factor scores. For each factor, find the two-sample t-statistic for
comparing the people with heart disease to those without. (The statistics are not
actually distributed as Student’s t, but do give some measure of the difference.) (g)
Based on the statistics in part (f), do any of the factors seem to be important factors in
predicting heart disease in these men? If so, which one(s). If not, what are the factors
explaining?
Exercise 10.5.21 (Decathlon). Exercise 1.9.20 created a biplot for the decathlon data.
The data consist of the scores (number of points) on each of ten events for the top 24
men in the decathlon at the 2008 Olympics. For convenience, rearrange the variables
so that the running events come first, then the jumping, then throwing (ignoring the
overall total):
y <- decathlon[,c(1,5,10,6,3,9,7,2,4,8)]
Fit the 1, 2, and 3 factor models. (The chi-squared approximations for the fit might
not be very relevant, because the sample size is too small.) Based on the loadings, can
you give an interpretation of the factors? Based on the uniquenesses, which events
seem to be least correlated with the others?
Chapter 11
Classification
(Xi , Yi ), i = 1, . . . , n, (11.1)
where Xi is the 1 × p vector of predictors for observation i, and Yi is the group number
of observation i, so that Yi ∈ {1, . . . , K }. Marginally, the proportion of the population
in group k is
P [Y = k] = π k , k = 1, . . . , K. (11.2)
Figure 11.1: Three densities, plus a mixture of the three (the thick line).
The classification task arises when a new observation, X New , arrives without its group
identification Y New, so its density is that of the mixture. We have to guess what the
group is.
In clustering, the data themselves are without group identification, so we have just
the marginal distributions of the Xi . Thus the joint pdf for the data is the product of the mixture densities, ∏_{i=1}^n (π1 f1 (xi ) + · · · + πK fK (xi )).
Thus clustering is similar to classifying new observations, but without having any
previous y data to help estimate the π k ’s and f k ’s. See Section 12.3.
11.2 Classifiers
A classifier is a function C that takes the new observation, and emits a guess at its
group:
C : X −→ {1, . . . , K }, (11.10)
where X is the space of X New . The classifier may depend on previous data, as well
as on the π k ’s and f k ’s, but not on the Y New . A good classifier is one that is unlikely
to make a wrong classification. Thus a reasonable criterion for a classifier is the
probability of an error:
P [C(X New ) ≠ Y New ]. (11.11)
We would like to minimize that probability. (This criterion assumes that any type of
misclassification is equally bad. If that is an untenable assumption, then one can use
a weighted probability:
∑_{k=1}^K ∑_{l=1}^K wkl P[C(X New ) = k and Y New = l ],    (11.12)
and
P [C(X New ) ≠ Y New ] = E [ I [C(X New ) ≠ Y New ]]. (11.15)
Thus if we minimize the last expression in (11.17) for each x New , we have minimized
the expected value in (11.16). Minimizing (11.17) is the same as maximizing
This sum equals P [Y New = l | X New = x New ] for whichever k C chooses, so to maxi-
mize the sum, choose the l with the highest conditional probability, as in (11.13).
Now the conditional distribution of Y New given X New is obtained from (11.4) and
(11.5) (it is Bayes theorem, Theorem 2.2):
P [Y New = k | X New = x New ] = f k (x New ) π k / ( f 1 (x New ) π 1 + · · · + f K (x New ) π K ).    (11.20)
Since, given x New , the denominator is the same for each k, we just have to choose the
k to maximize the numerator:
C B (x) = k if f k (x New ) π k > f l (x New ) π l for l ≠ k. (11.21)
We are assuming there is a unique maximum, which typically happens in practice
with continuous variables. If there is a tie, any of the top categories will yield the
optimum.
Consider the example in (11.6). Because the π k ’s are equal, it is sufficient to look at
the conditional pdfs. A given x is then classified into the group with highest density,
as given in Figure 11.2.
Thus the classifications are
⎧
⎨ 1 if 3.640 < x < 6.360
CB (x) = 2 if x < 3.640 or 6.360 < x < 8.067 or x > 15.267 (11.22)
⎩ 3 if 8.067 < x < 15.267
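In R, the rule (11.22) amounts to comparing the three density values at x. A sketch (the densities are those in Figure 11.2):
cb <- function(x) which.max(c(dnorm(x,5,1),dnorm(x,5,2),dnorm(x,10,1)))
sapply(c(5,7,12,20),cb) # yields 1 2 3 2, agreeing with (11.22)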
In practice, the π k ’s and f k ’s are not known, but can be estimated from the data.
Consider the joint density of the data as in (11.7). The π k ’s appear in only the first
term. They can be estimated easily (as in a multinomial situation) by
π̂ k = N k / n.    (11.23)
Figure 11.2: Three densities, and the regions in which each is the highest. The den-
sities are 1: N(5,1), solid line; 2: N(5,4), dashed line; 3: N(10,1), dashed/dotted line.
Density 2 is also the highest for x > 15.267.
The parameters for the f k ’s can be estimated using the xi ’s that are associated with
group k. These estimates are then plugged into the Bayes formula to obtain an ap-
proximate Bayes classifier. The next section shows what happens in the multivariate
normal case.
f k (x | μ k , Σ) = c ( 1 / |Σ|^{1/2} ) e^{−(1/2)(x − μ k ) Σ^{−1} (x − μ k )′}
               = c ( 1 / |Σ|^{1/2} ) e^{−(1/2) x Σ^{−1} x′} e^{x Σ^{−1} μ k ′ − (1/2) μ k Σ^{−1} μ k ′}.    (11.25)
We can ignore the factors that are the same for each group, i.e., that do not depend
on k, because we are in quest of the highest pdf× π k . Thus for a given x, we choose
the k to maximize
π k e^{x Σ^{−1} μ k ′ − (1/2) μ k Σ^{−1} μ k ′},    (11.26)
or, by taking logs, the k that maximizes
d*k (x) ≡ x Σ^{−1} μ k ′ − (1/2) μ k Σ^{−1} μ k ′ + log(π k ).    (11.27)
These d∗k ’s are called the discriminant functions. Note that in this case, they are
linear in x, hence linear discriminant functions. It is often convenient to target one
group (say the K th ) as a benchmark, then use the functions
μ̂ k = (1/N k ) ∑_{i | y i = k} x i ,    (11.29)
and estimate Σ by the MLE, i.e., because we are assuming the covariances are equal,
the pooled covariance:
Σ̂ = (1/n) ∑_{k=1}^K ∑_{i | y i = k} (x i − μ̂ k )′ (x i − μ̂ k ).    (11.30)
(The numerator equals X QX for Q being the projection matrix for the design matrix
indicating which groups the observations are from. We could divide by n − K to
obtain the unbiased estimator, but the classifications would still be essentially the
same, exactly so if the π k ’s are equal.) Thus the estimated discriminant functions are
d̂ k (x) = x â k ′ + ĉ k ,    (11.31)
where
â k = (μ̂ k − μ̂ K ) Σ̂^{−1}  and  ĉ k = −(1/2) ( μ̂ k Σ̂^{−1} μ̂ k ′ − μ̂ K Σ̂^{−1} μ̂ K ′ ) + log(π̂ k /π̂ K ).    (11.32)
Now we can define the classifier based upon Fisher’s linear discrimination func-
tion to be
ĈFLD (x) = k if d̂ k (x) > d̂ l (x) for l ≠ k. (11.33)
(The hat is there to emphasize the fact that the classifier is estimated from the data.) If
p = 2, each set {x | dk (x) = dl (x)} defines a line in the x-space. These lines divide the
space into a number of polygonal regions (some infinite). Each region has the same
Ĉ(x). Similarly, for general q, the regions are bounded by hyperplanes. Figure 11.3
illustrates for the iris data when using just the sepal length and width. The solid line
is the line for which the discriminant functions for setosas and versicolors are equal.
It is basically perfect for these data. The dashed line tries to separate setosas and
virginicas. There is one misclassification. The dashed/dotted line tries to separate
the versicolors and virginicas. It is not particularly successful. See Section 11.4.1 for
a better result using all the variables.
Figure 11.3: Fisher’s linear discrimination for the iris data using just sepal length
and width. The solid line separates setosa (s) and versicolor (v); the dashed line
separates setosa and virginica (r); and the dashed/dotted line separates versicolor
and virginica.
Remark
Fisher’s original derivation in Fisher [1936] of the classifier (11.33) did not start with
the multivariate normal density. Rather, in the case of two groups, he obtained the
p × 1 vector a that maximized the ratio of the squared difference of means of the
variable Xi a for the two groups to the variance:
â = (μ̂ 1 − μ̂ 2 ) Σ̂^{−1} ,    (11.35)
which is the a1 in (11.32). Even though our motivation leading to (11.27) is different
than Fisher’s, because we end up with his coefficients, we will refer to (11.31) as
Fisher’s.
error to the data at hand, we take the criterion to be the probability of error given the
observed Xi ’s, (c.f. the prediction error in (7.88)),
As in prediction, this error will be an underestimate because we are using the same
data to estimate the classifier and test it out. A common approach to a fair estimate is
to initially set aside a random fraction of the observations (e.g., 10% to 25%) to be test
data, and use the remaining so-called training data to estimate the classifier. Then
this estimated classifier is tested on the test data.
Cross-validation is a method that takes the idea one step further, by repeatedly
separating the data into test and training data. The “leave-one-out” cross-validation
uses single observations as the test data. It starts by setting aside the first observation,
(x1 , y1 ), and calculating the classifier using the data (x2 , y2 ), . . . , (xn , yn ). (That is, we
find the sample means, covariances, etc., leaving out the first observation.) Call the
resulting classifier C(−1) . Then determine whether this classifier classifies the first
observation correctly:
I [ C(−1) (x1 ) ≠ y1 ]. (11.38)
The C(−1) and Y1 are independent, so the function in (11.38) is almost an unbiased
estimate (conditionally on X1 = x1 ) of the error
P [ C(X New ) ≠ Y New | X New = x1 ],    (11.39)
the only reason it is not exactly unbiased is that C(−1) is based on n − 1 observations,
rather than the n for C. This difference should be negligible.
Repeat the process, leaving out each observation in turn, so that C(−i) is the classi-
fier calculated without observation i. Then the almost unbiased estimate of ClassError
in (11.36) is
ClassError LOOCV = (1/n) ∑_{i=1}^n I [ C(−i) (xi ) ≠ yi ].    (11.40)
If n is large, and calculating the classifier is computationally challenging, then leave-
one-out cross-validation can use up too much computer time (especially if one is
trying a number of different classifiers). Also, the estimate, though nearly unbiased,
might have a high variance. An alternative is to leave out more than one observation
each time, e.g., the 10% cross-validation would break the data set into 10 sets of size ≈
n/10, and for each set, use the other 90% to classify the observations. This approach
is much more computationally efficient, and less variable, but does introduce more
bias. Kshirsagar [1972] contains a number of other suggestions for estimating the
classification error.
k                           âk                            ĉk
1 (Setosa)        11.325   20.309  −29.793  −39.263     18.428
2 (Versicolor)     3.319    3.456   −7.709  −14.944     32.159      (11.41)
3 (Virginica)      0        0        0        0          0
Note that the final coefficients are zero, because of the way we normalize the functions
in (11.28).
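A sketch of one way to compute the âk and ĉk of (11.29), (11.30), and (11.32) (the function name fisher.ld and the use of R's built-in iris data frame are ours, not from the text):
fisher.ld <- function(x,y) {
K <- max(y); n <- nrow(x); p <- ncol(x)
mu <- matrix(0,K,p); pihat <- rep(0,K); sigma <- matrix(0,p,p)
for(k in 1:K) {
xk <- x[y==k,,drop=FALSE]
mu[k,] <- colMeans(xk)
pihat[k] <- nrow(xk)/n
xc <- scale(xk,center=TRUE,scale=FALSE) # center within group k
sigma <- sigma + t(xc)%*%xc
}
sigma <- sigma/n # pooled MLE, as in (11.30)
siginv <- solve(sigma)
a <- siginv%*%(t(mu)-mu[K,]) # p x K; column k is a_k' of (11.32)
qk <- diag(mu%*%siginv%*%t(mu)) # mu_k Sigma^{-1} mu_k'
c <- -0.5*(qk-qk[K]) + log(pihat/pihat[K]) # c_k of (11.32)
list(a=a,c=c)
}
ld <- fisher.ld(as.matrix(iris[,1:4]),rep(1:3,each=50))
Something like ld$a and ld$c should then be close to the values in (11.41), and can play the role of ld.iris$a and ld.iris$c in the code below.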
To see how well the classifier works on the data, we have to first calculate the
dk (xi ). The following places these values in an n × K matrix disc:
disc <- x%*%ld.iris$a
disc <- sweep(disc,2,ld.iris$c,'+')
The rows corresponding to the first observation from each species are
                       k
  i          1          2        3
  1       97.703     47.400      0
 51      −32.305      9.296      0      (11.42)
101     −120.122    −19.142      0
The classifier (11.33) classifies each observation into the group corresponding to the
column with the largest entry. Applied to the observations in (11.42), we have
that is, each of these observations is correctly classified into its group. To find the
CFLD ’s for all the observations, use
imax <- function(z) ((1:length(z))[z==max(z)])[1]
yhat <- apply(disc,1,imax)
where imax is a little function to give the index of the largest value in a vector. To see
how close the predictions are to the observed, use the table command:
table(yhat,y.iris)
which yields
          y
ŷ       1    2    3
1       50    0    0
2        0   48    1      (11.44)
3        0    2   49
ClassError Obs = #{ ĈFLD (xi ) ≠ yi } / n = 3/150 = 0.02.    (11.45)
Here, for each i, we calculate the classifier without observation i, then apply it to
that left-out observation i, the predictions placed in the vector yhat.cv. We then count
how many observations were misclassified. In this case, ClassError LOOCV = 0.02,
just the same as the observed classification error. In fact, the same three observations
were misclassified.
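A sketch of how such a loop might look, using the hypothetical fisher.ld function sketched earlier and the same x and y.iris:
yhat.cv <- rep(NA,nrow(x))
for(i in 1:nrow(x)) {
ldi <- fisher.ld(x[-i,],y.iris[-i]) # classifier estimated without observation i
di <- x[i,]%*%ldi$a + ldi$c # discriminant functions at the left-out x_i
yhat.cv[i] <- which.max(di)
}
mean(yhat.cv!=y.iris) # the leave-one-out estimate (11.40)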
Subset selection
The above classifications used all four iris variables. We now see if we can obtain
equally good or better results using a subset of the variables. We use the same loop
as above, setting varin to the vector of indices for the variables to be included. For
example, varin = c(1,3) will use just variables 1 and 3, sepal length and petal length.
Below is a table giving the observed error and leave-one-out cross-validation error
(in percentage) for 15 models, depending on which variables are included in the
classification.
Figure 11.4: Boxplots of the petal widths for the three species of iris. The solid
line separates the setosas from the versicolors, and the dashed line separates the
versicolors from the virginicas.
Classification errors
Variables     Observed   Cross-validation
1               25.3          25.3
2               44.7          48.0
3                5.3           6.7
4                4.0           4.0
1, 2            20.0          20.7
1, 3             3.3           4.0
1, 4             4.0           4.7
2, 3             4.7           4.7          (11.46)
2, 4             3.3           4.0
3, 4             4.0           4.0
1, 2, 3          3.3           4.0
1, 2, 4          4.0           5.3
1, 3, 4          2.7           2.7
2, 3, 4          2.0           4.0
1, 2, 3, 4       2.0           2.0
Note that the cross-validation error estimates are either the same, or a bit larger,
than the observed error rates. The best classifier uses all 4 variables, with an estimated
2% error. Note, though, that Variable 4 (Petal Width) alone has only a 4% error rate.
Also, adding Variable 1 to Variable 4 actually worsens the prediction a little, showing
that adding the extra variation is not worth it. Looking at just the observed error, the
prediction stays the same.
Figure 11.4 shows the classifications using just petal widths. Because the sample
sizes are equal, and the variances are assumed equal, the separating lines between
two species are just the average of their means. We did not plot the line for setosa
versus virginica. There are six misclassifications, two versicolors and four virginicas.
(Two of the latter had the same petal width, 1.5.)
Both models have the same unrestricted means, and we can consider the π k ’s fixed,
so we can work with just the sample covariance matrices, as in Section 10.1. Let
U1 , U2 , and U3 be the sum of squares and cross-products matrices (1.15) for the three
species, and U = U1 + U2 + U3 be the pooled version. The degrees of freedom for
each species is νk = 50 − 1 = 49. Thus from (10.10) and (10.11), we can find the
deviances (9.47) to be
To test the null hypothesis MSame versus the alternative MDiff , as in (9.49),
AIC BIC
MSame −1443.90 −1414.00 (11.55)
MDiff −1550.57 −1460.86
They, too, favor the separate covariance model. Cross-validation above suggests that
the equal-covariance model is slightly better. Thus there seems to be a conflict be-
tween AIC/BIC and cross-validation. The conflict can be explained by noting that
AIC/BIC are trying to model the xi ’s and yi ’s jointly, while cross-validation tries to
model the conditional distribution of the yi ’s given the xi ’s. The latter does not really
care about the distribution of the xi ’s, except to the extent it helps in predicting the
yi ’s.
Σ = Ip =⇒ q(x; μ k , Ip ) = −(1/2) ‖x − μ k ‖², or
Σ = Δ, diagonal =⇒ q(x; μ k , Δ) = −(1/2) ∑ (xi − μ ki )²/δii .    (11.57)
The first case is regular Euclidean distance. In the second case, one would need to
estimate the δii ’s by the pooled sample variances. These alternatives may be better
when there are not many observations per group, and a fairly large number of vari-
ables p, so that estimating a full Σ introduces enough extra random error into the
classification to reduce its effectiveness.
Another modification is to use functions of the individual variables. E.g., in the
iris data, one could generate quadratic boundaries by using the variables
in the x. The resulting set of variables certainly would not be multivariate normal,
but the classification based on them may still be reasonable. See the next section for
another method of incorporating such functions.
f(x | θ) = a(x) e^{t(x)θ′ − ψ(θ)}
for some 1 × m parameter θ, 1 × m function t(x) (the sufficient statistic), and function
a(x), where ψ(θ) is the normalizing constant.
Suppose that the conditional density of X given Y = k is f (x | θk ), that is, each
group has the same form of the density, but a different parameter value. Then the
analog to equations (11.27) and (11.28) yields discriminant functions like those in
(11.31),
dk (x) = γk + t(x)αk , (11.60)
a linear function of t(x), where αk = θk − θK , and γk is a constant depending on the
parameters. (Note that dK (x) = 0.) To implement the classifier, we need to estimate
the parameters θk and π k , usually by finding the maximum likelihood estimates.
(Note that Fisher’s quadratic discrimination in Section 11.5 also has discriminant
functions (11.48) of the form (11.60), where the t is a function of the x and its square,
xx .) In such models the conditional distribution of Y given X is given by
e d k ( x)
P [ Y = k | X = x] = (11.61)
e d1 ( x) + · · · + e d K − 1 ( x) + 1
for the dk ’s in (11.60). This conditional model is called the logistic regression model.
Then an alternative method for estimating the γk ’s and αk ’s is to find the values that
maximize the conditional likelihood,
n
L ((γ1 , α1 ), . . . , (γK −1 , aK −1 ) ; (x1 , y1 ), . . . , (xn , yn )) = ∏ P [ Y = y i | X i = xi ] . (11.62)
i =1
(We know that αK = 0 and γK = 0.) There is no closed-form solution for solving
the likelihood equations, so one must use some kind of numerical procedure like
Newton-Raphson. Note that this approach estimates the slopes and intercepts of the
discriminant functions directly, rather than (in the normal case) estimating the means
and variances, and the π k ’s, then finding the slopes and intercepts as functions of
those estimates.
Whether using the exponential family model unconditionally or the logistic model
conditionally, it is important to realize that both lead to the exact same classifier.
The difference is in the way the slopes and intercepts are estimated in (11.60). One
question is then which gives the better estimates. Note that the joint distribution of
the (X, Y ) is the product of the conditional of Y given X in (11.61) and the marginal
of X in (11.5), so that for the entire data set,
∏_{i=1}^n f (xi | θ yi ) π yi = ( ∏_{i=1}^n P [ Y = yi | Xi = xi ] )
                             × ( ∏_{i=1}^n ( π1 f (xi | θ1 ) + · · · + πK f (xi | θK ) ) ).    (11.63)
Thus using just the logistic likelihood (11.62), which is the first term on the right-
hand side in (11.63), in place of the complete likelihood on the left, leaves out the
information about the parameters that is contained in the mixture likelihood (the
second term on the right). As we will see in Chapter 12, there is information in the
mixture likelihood. One would then expect that the complete likelihood gives better
estimates in the sense of asymptotic efficiency of the estimates. It is not clear whether
that property always translates to yielding better classification schemes, but maybe.
On the other hand, the conditional logistic model is more general in that it yields
valid estimates even when the exponential family assumption does not hold. We can
entertain the assumption that the conditional distributions in (11.61) hold for any
statistics t(x) we wish to use, without trying to model the marginal distributions of
the X’s at all. This realization opens up a vast array of models to use, that is, we can
contemplate any functions t we wish.
In what follows, we restrict ourselves to having K = 2 groups, and renumber the
groups {0, 1}, so that Y is conditionally Bernoulli:
Yi | Xi = xi ∼ Bernoulli(ρ(xi )), (11.64)
where
ρ (x) = P [ Y = 1 | X = x] . (11.65)
The modeling assumption from (11.61) can be translated to the logit (log odds) of ρ,
logit(ρ) = log(ρ/(1 − ρ)). Then
logit(ρ(x)) = logit(ρ(x | γ, α)) = γ + xα . (11.66)
(We have dropped the t from the notation. You can always define x to be whatever
functions of the data you wish.) The form (11.66) exhibits the reason for calling the
model “logistic regression.” Letting
logit(ρ) = ( logit(ρ(x1 | γ, α)), logit(ρ(x2 | γ, α)), . . . , logit(ρ(xn | γ, α)) )′,    (11.67)
we can set up the model to look like the regular linear model,
           ⎛ 1  x1 ⎞
logit(ρ) = ⎜ 1  x2 ⎟ ⎛ γ ⎞  = Xβ.    (11.68)
           ⎜ ⋮   ⋮ ⎟ ⎝ α′ ⎠
           ⎝ 1  xn ⎠
We turn to examples.
Positive Negative
3d our over remove internet order make address will hp hpl george lab
mail addresses free business you data 85 parts pm cs meeting original
credit your font 000 money 650 tech- project re edu table conference ;
nology ! $ #
(11.71)
Computational details
In R, logistic regression models with two categories can be fit using the generalized
linear model function, glm. The spam data is in the data frame Spam. The indicator
variable, Yi , for spam is called spam. We first must change the data matrix into a data
frame for glm: Spamdf <- data.frame(Spam). The full logistic regression model is fit
using
spamfull <− glm(spam ~ .,data=Spamdf,family=binomial)
The “spam ~ .” tells the program that the spam variable is the Y, and the dot means
use all the variables except for spam in the X. The “family = binomial” tells the pro-
gram to fit logistic regression. The summary command, summary(spamfull), will print
out all the coefficients, which I will not reproduce here, and some other statistics, in-
cluding
Null deviance: 6170.2 on 4600 degrees of freedom
Residual deviance: 1815.8 on 4543 degrees of freedom
AIC: 1931.8
The “residual deviance” is the regular deviance in (9.47). The full model uses 58
variables, hence
AIC = deviance +2p = 1815.8 + 2 × 58 = 1931.8, (11.72)
which checks. The BIC is found by substituting log(4601) for the 2.
We can find the predicted classifications from this fit using the function predict,
which returns the estimated linear Xβ̂ from (11.68) for the fitted model. The Ŷi 's
are then 1 or 0 as the ρ̂(xi | γ̂, α̂) is greater than or less than 1/2. Thus to find the
predictions and overall error rate, do
yhat <- ifelse(predict(spamfull)>0,1,0)
sum(yhat!=Spamdf[,'spam'])/4601
We find the observed classification error to be 6.87%.
Cross-validation
We will use 46-fold cross-validation to estimate the classification error. We randomly
divide the 4601 observations into 46 groups of 100, leaving one observation who
doesn’t get to play. First, permute the indices from 1 to n:
o <- sample(1:4601)
Then the first hundred are the indices for the observations in the first leave-out-block,
the second hundred in the second leave-out-block, etc. The loop is next, where the
err collects the number of classification errors in each block of 100.
err <- NULL
for(i in 1:46) {
oi <- o[(1:100)+(i-1)*100]
yfiti <- glm(spam ~ ., family = binomial,data = Spamdf,subset=(1:4601)[-oi])
dhati <- predict(yfiti,newdata=Spamdf[oi,])
yhati <- ifelse(dhati>0,1,0)
err <- c(err,sum(yhati!=Spamdf[oi,'spam']))
}
In the loop for cross-validation, the oi is the vector of indices being left out. We then fit
the model without those by using the keyword subset=(1:4601)[-oi], which indicates
using all indices except those in oi. The dhati is then the vector of discriminant
functions evaluated for the left out observations (the newdata). The mean of err is the
estimated error, which for us is 7.35%. See the entry in table in (11.70).
Stepwise
The command to use for stepwise regression is step. To have the program search
through the entire set of variables, use one of the two statements
spamstepa <- step(spamfull,scope=list(upper= ~.,lower = ~1))
spamstepb <- step(spamfull,scope=list(upper= ~.,lower = ~1),k=log(4601))
The first statement searches on AIC, the second on BIC. The first argument in the step
function is the return value of glm for the full data. The upper and lower inputs refer
to the formulas of the largest and smallest models one wishes to entertain. In our
case, we wish the smallest model to have just the 1n vector (indicated by the “~1”),
and the largest model to contain all the vectors (indicated by the “~.”).
These routines may take a while, and will spit out a lot of output. The end result
is the best model found using the given criterion. (If using the BIC version, while
calculating the steps, the program will output the BIC values, though calling them
“AIC.” The summary output will give the AIC, calling it “AIC.” Thus if you use just
the summary output, you must calculate the BIC for yourself. )
To find the cross-validation estimate of classification error, we need to insert the
stepwise procedure after fitting the model leaving out the observations, then predict
those left out using the result of the stepwise procedure. So for the best BIC model,
use the following:
errb <- NULL
for(i in 1:46) {
oi <- o[(1:100)+(i-1)*100]
yfiti <- glm(spam ~ ., family = binomial, data = Spamdf,subset=(1:4601)[-oi])
stepi <- step(yfiti,scope=list(upper= ~.,lower = ~1),k=log(4501))
dhati <- predict(stepi,newdata=Spamdf[oi,])
yhati <- ifelse(dhati>0,1,0)
errb <- c(errb,sum(yhati!=Spamdf[oi,'spam']))
}
The estimate for the best AIC model uses the same statements but with k = 2 in
the step function. This routine will take a while, because each stepwise procedure
is time consuming. Thus one might consider using cross-validation on the model
chosen using the BIC (or AIC) criterion for the full data.
The neural networks R package nnet [Venables and Ripley, 2002] can be used to fit
logistic regression models for K > 2.
11.8 Trees
The presentation here will also use just K = 2 groups, labeled 0 and 1, but can be
extended to any number of groups. In the logistic regression model (11.61), we mod-
eled P [Y = 1 | X = x] ≡ ρ(x) using a particular parametric form. In this section we
Figure 11.5: Splitting on age and adiposity. The open triangles indicate no heart
disease, the solid discs indicate heart disease. The percentages are the percentages of
men with heart disease in each region of the plot.
use a simpler, nonparametric form, where ρ(x) is constant over rectangular regions
of the X-space.
To illustrate, we will use the South African heart disease data from Rousseauw
et al. [1983], which was used in Exercise 10.5.20. The Y is coronary heart disease
(chd), where 1 indicates the person has the disease, 0 he does not. Explanatory
variables include various health measures. Hastie et al. [2009] apply logistic regres-
sion to these data. Here we use trees. Figure 11.5 plots the chd variable for the
age and adiposity (fat percentage) variables. Consider the vertical line. It splits the
data according to whether age is less than 31.5 years. The splitting point 31.5 was
chosen so that the proportions of heart disease in each region would be very dif-
ferent. Here, 10/117 = 8.55% of the men under age 31.5 had heart disease, while
150/345 = 43.48% of those above 31.5 had the disease.
The next step is to consider just the men over age 31.5, and split them on the
adiposity variable. Taking the value 25, we have that 41/106 = 38.68% of the men
over age 31.5 but with adiposity under 25 have heart disease; 109/239 = 45.61% of
the men over age 31.5 and with adiposity over 25 have the disease. We could further
split the younger men on adiposity, or split them on age again. Subsequent steps
split the resulting rectangles, each time with either a vertical or horizontal segment.
There are also the other variables we could split on. It becomes easier to represent
the splits using a tree diagram, as in Figure 11.6. There we have made several splits,
at the nodes. Each node needs a variable and a cutoff point, such that people for
which the variable is less than the cutoff are placed in the left branch, and the others
go to the right. The ends of the branches are terminal nodes or leaves. This plot has 15
leaves. At each leaf, there are a certain number of observations. The plot shows the
proportion of 0’s (the top number) and 1’s (the bottom number) at each leaf.
For classification, we place a 0 or 1 at each leaf, depending on whether the pro-
portion of 1’s is less than or greater than 1/2. Figure 11.7 shows the results. Note
that for some splits, both leaves have the same classification, because although their
proportions of 1’s are quite different, they are both on the same side of 1/2. For
classification purposes, we can snip some of the branches off. Further analysis (Sec-
tion 11.8.1) leads us to the even simpler tree in Figure 11.8. The tree is very easy to
interpret, hence such trees are popular among people (e.g., doctors) who need to use them.
The tree also makes sense, showing that age, type A personality, tobacco use, and family
history are important factors in predicting heart disease among these men. Trees
are also flexible, incorporating continuous or categorical variables, avoiding having
to consider transformations, and automatically incorporating interactions. E.g., the
type A variable shows up only for people between the ages of 31.5 and 50.5, and
family history and tobacco use show up only for people over 50.5.
Though simple to interpret, it is easy to imagine that finding the “best” tree is a
computationally challenging task.
11.8.1 CART
Two popular commercial products for fitting trees are Classification and Regression
Trees (CART®), by Breiman et al. [1984], and C5.0, by Quinlan [1993]. We will take
the CART approach, the main reason being the availability of an R version. It seems
that CART would appeal more to statisticians, and C5.0 to data miners, but I do not
think the results of the two methods would differ much.
We first need an objective function to measure the fit of a tree to the data. We will
use deviance, although other measures such as the observed misclassification rate are
certainly reasonable. For a tree T with L leaves, each observation is placed in one of
the leaves. If observation yi is placed in leaf l, then that observation’s ρ(xi ) is given
by the parameter for leaf l, say pl . The likelihood for that Bernoulli observation is
p_l^{y_i} (1 − p_l)^{1 − y_i}. (11.73)
Assuming the observations are independent, at leaf l there is a sample of iid Bernoulli
random variables with parameter pl , hence the overall likelihood of the sample is
L(p_1, . . . , p_L | y_1, . . . , y_n) = ∏_{l=1}^{L} p_l^{w_l} (1 − p_l)^{n_l − w_l}, (11.74)
where
n_l = #{i at leaf l}, w_l = #{y_i = 1 at leaf l}. (11.75)
This likelihood is maximized over the p_l's by taking p_l = w_l/n_l. Then the deviance
(9.47) for this tree is
deviance(T) = −2 ∑_{l=1}^{L} ( w_l log(p_l) + (n_l − w_l) log(1 − p_l) ). (11.76)
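As a small illustration (not from the original text), the deviance in (11.76) can be computed directly from the vectors of leaf counts, with the convention 0 log(0) = 0; here w and n hold the w_l's and n_l's:

xlogy <- function(x, y) ifelse(x == 0, 0, x * log(y))   # handles 0*log(0) = 0
tree.deviance <- function(w, n) {
  p <- w / n                                   # p_l = w_l / n_l
  -2 * sum(xlogy(w, p) + xlogy(n - w, 1 - p))
}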
The CART method has two main steps: grow the tree, then prune the tree. The
tree is grown in a stepwise, greedy fashion, at each stage trying to find the next
split that maximally reduces the objective function. We start by finding the single
split (variable plus cutoff point) that minimizes the deviance among all such splits.
Then the observations at each resulting leaf are optimally split, again finding the
variable/cutoff split with the lowest deviance. The process continues until the leaves
have just a few observations, e.g., stopping when any split would result in a leaf with
fewer than five observations.
To grow the tree for the South African heart disease data in R, we need to install
the package called tree [Ripley, 2010]. A good explanation of it can be found in
Venables and Ripley [2002]. We use the data frame SAheart in the ElemStatLearn
package [Halvorsen, 2009]. The dependent variable is chd.
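To grow the base tree, a call along the following lines can be used (a sketch; the exact call is an assumption here, but wrapping chd in as.factor makes tree fit a classification rather than a regression tree, and basetree is the name assumed by the pruning commands below):

basetree <- tree(as.factor(chd) ~ ., data = SAheart)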
The prune.tree function can be used to find the subtree with the lowest BIC. It takes
the base tree and a value k as inputs, then finds the subtree T′ of the base tree that minimizes
deviance(T′) + k × #{leaves of T′}.
Thus for the best AIC subtree we would take k = 4, and for BIC we would take
k = 2 log(n ):
aictree <− prune.tree(basetree,k=4)
bictree <− prune.tree(basetree,k=2∗log(462)) # n = 462 here.
If the k is not specified, then the routine calculates the numbers of leaves and de-
viances of best subtrees for all values of k. The best AIC subtree is in fact the full
base tree, as in Figure 11.7. Figure 11.9 exhibits the best BIC subtree, which has eight
leaves. There are also routines in the tree package that use cross-validation to choose
a good factor k to use in pruning.
Note that the tree in Figure 11.9 has some redundant splits. Specifically, all leaves
to the left of the first split (age < 31.5) lead to classification “0.” To snip at that node,
we need to determine its index in basetree. One approach is to print out the tree,
resulting in the output in Listing 11.1. We see that node #2 is “age < 31.5,” which is
where we wish to snip, hence we use
bictree.2 <− snip.tree(bictree,nodes=2)
Plotting the result yields Figure 11.8. It is reasonable to stick with the presnipped
tree, in case one wishes to classify using a cutoff point for the p_l's other than 1/2.
There are some drawbacks to this tree-fitting approach. Because of the stepwise
nature of the growth, if we start with the wrong variable, it is difficult to recover. That
Figure 11.9: The best subtree using the BIC criterion, before snipping redundant
leaves.
Listing 11.1: Text representation of the output of tree for the tree in Figure 11.9
node), split, n, deviance, yval, (yprob)
∗ denotes terminal node
is, even though the best single split may be on age, the best two-variable split may
be on type A and alcohol. There is inherent instability, because having a different
variable at a given node can completely change the further branches. Additionally,
if there are several splits, the sample sizes for estimating the pl ’s at the farther-out
leaves can be quite small. Boosting, bagging, and random forests are among the tech-
niques proposed that can help ameliorate some of these problems and lead to better
classifications. They are more black-box-like, though, losing some of the simplicity of
the simple trees. See Hastie et al. [2009].
11.9 Exercises
Exercise 11.9.1. Show that (11.19) follows from (11.18).
Exercise 11.9.2. Compare the statistic in (11.34) and its maximum using the â in
(11.35) to the motivation for Hotelling's T² presented in Section 8.4.1.
Exercise 11.9.3. Write the γk in (11.60) as a function of the θi ’s and π i ’s.
Exercise 11.9.4 (Spam). Consider the spam data from Section 11.7.2 and Exercise
1.9.14. Here we simplify it a bit, and just look at four of the 0/1 predictors: Whether
or not the email contains the words “free” or “remove” or the symbols “!” or “$”.
The following table summarizes the data, where the first four columns indicate the
presence (1) or absence (0) of the word or symbol, and the last two columns give the
numbers of corresponding emails that are spam or not spam. E.g., there are 98 emails
containing “remove” and “!”, but not “free” nor “$”, 8 of which are not spam, 90 are
spam.
free remove ! $ not spam spam
0 0 0 0 1742 92
0 0 0 1 157 54
0 0 1 0 554 161
0 0 1 1 51 216
0 1 0 0 15 28
0 1 0 1 4 17
0 1 1 0 8 90
0 1 1 1 5 166 (11.80)
1 0 0 0 94 42
1 0 0 1 28 20
1 0 1 0 81 159
1 0 1 1 38 305
1 1 0 0 1 16
1 1 0 1 0 33
1 1 1 0 2 116
1 1 1 1 8 298
Assuming a multinomial distribution for the 2^5 = 32 possibilities, find the estimated
Bayes classifier of email as “spam” or “not spam” based on the other four variables
in the table. What is the observed error rate?
Exercise 11.9.5 (Crabs). This problem uses data on 200 crabs, categorized into two
species, Orange and Blue, and two sexes. It is in the MASS R package [Venables
and Ripley, 2002]. The data is in the data frame crabs. There are 50 crabs in each
species×sex category; the first 50 are blue males, then 50 blue females, then 50 orange
males, then 50 orange females. The five measurements are frontal lobe size, rear
width, carapace length, carapace width, and body depth, all in millimeters. The goal
here is to find linear discrimination procedures for classifying new crabs into species
and sex categories. (a) The basic model is that Y ∼ N (xβ, I200 ⊗ Σ), where x is any
analysis of variance design matrix (n × 4) that distinguishes the four groups. Find the
MLE of Σ, Σ̂. (b) Find the c_k's and a_k's in Fisher's linear discrimination for classifying
all four groups, i.e., classifying on species and sex simultaneously. (Take π k = 1/4 for
all four groups.) Use the version wherein dK = 0. (c) Using the procedure in part (b)
on the observed data, how many crabs had their species misclassified? How many
had their sex misclassified? What was the overall observed misclassification rate (for
simultaneous classification of color and sex)? (d) Use leave-one-out cross-validation
to estimate the overall misclassification rate. What do you get? Is it higher than the
observed rate in part (c)?
Exercise 11.9.6 (Crabs). Continue with the crabs data from Exercise 11.9.5, but use
classification trees to classify the crabs by just species. (a) Find the base tree using the
command
crabtree <− tree(sp ~ FL+RW+CL+CW+BD,data=crabs)
How many leaves does the tree have? Snip off redundant nodes. How many leaves
does the snipped tree have? What is its observed misclassification rate? (b) Find the
226 Chapter 11. Classification
BIC for the subtrees found using prune.tree. Give the number of leaves, deviance,
and dimension for the subtree with the best BIC. (c) Consider the subtree with the
best BIC. What is its observed misclassification rate? What two variables figure most
prominently in the tree? Which variables do not appear? (d) Now find the leave-
one-out cross-validation estimate of the misclassification error rate for the best model
using BIC. How does this rate compare with the observed rate?
Exercise 11.9.7 (South African heart disease). This question uses the South African
heart disease study discussed in Section 11.8. The objective is to use logistic regres-
sion to classify people on the presence of heart disease, variable chd. (a) Use the
logistic model that includes all the explanatory variables to do the classification. (b)
Find the best logistic model using the stepwise function, with BIC as the criterion.
Which variables are included in the best model from the stepwise procedure? (c) Use
the model with just the variables suggested by the factor analysis of Exercise 10.5.20:
tobacco, ldl, adiposity, obesity, and alcohol. (d) Find the BIC, observed error rate, and
leave-one-out cross-validation rate for the three models in parts (a), (b) and (c). (e)
True or false: (i) The full model has the lowest observed error rate; (ii) The factor-
analysis based model is generally best; (iii) The cross-validation-based error rates are
somewhat larger than the corresponding observed error rates; (iv) The model with
the best observed error rate has the best cv-based error rate as well; (v) The best
model of these three is the one chosen by the stepwise procedure; (vi) Both adiposity
and obesity seem to be important factors in classifying heart disease.
Exercise 11.9.9 (Spam). Use classification trees to classify the spam data. It is best to
start as follows:
Spamdf <− data.frame(Spam)
spamtree <− tree(as.factor(spam)~.,data=Spamdf)
Turning the matrix into a data frame makes the labeling on the plots simpler. (a)
Find the BIC’s for the subtrees obtained using prune.tree. How many leaves in the
best model? What is its BIC? What is its observed error rate? (b) You can obtain a
cross-validation estimate of the error rate by using
cvt <− cv.tree(spamtree,method=’misclass’,K=46)
The “46” means use 46-fold cross-validation, which is the same as leaving 100 out.
The vector cvt$dev contains the number of left-outs misclassified for the various mod-
els. The cv.tree function randomly splits the data, so you should run it a few times,
and use the combined results to estimate the misclassification rates for the best model
you chose in part (a). What do you see? (c) Repeat parts (a) and (b), but using the
first ten principal components of the spam explanatory variables as the predictors.
(Exercise 1.9.15 calculated the principal components.) Repeat again, but this time
using the first ten principal components based on the scaled explanatory variables,
scale(Spam[,1:57]). Compare the effectiveness of the three approaches.
Exercise 11.9.10. This question develops a Bayes classifier when there is a mix of nor-
mal and binomial explanatory variables. Consider the classification problem based
on (Y, X, Z ), where Y is the variable to be classified, with values 0 and 1, and X and
Z are predictors. X is a 1 × 2 continuous vector, and Z takes the values 0 and 1. The
model for (Y, X, Z) is given by
X | Y = y & Z = z ∼ N(μ_yz, Σ), (11.81)
and
P [Y = y & Z = z] = pyz , (11.82)
so that p00 + p01 + p10 + p11 = 1. (a) Find an expression for P [Y = y | X = x & Z =
z]. (b) Find the 1 × 2 vector α_z and the constant γ_z (which depend on z and the
parameters) so that
log ( P[Y = 1 | X = x & Z = z] / P[Y = 0 | X = x & Z = z] ) = x α_z′ + γ_z. (11.83)
(c) Suppose the data are (Yi , Xi , Zi ), i = 1, . . . , n, iid, distributed as above. Find
expressions for the MLE’s of the parameters (the four μyz ’s, the four pyz ’s, and Σ).
Exercise 11.9.11 (South African heart disease). Apply the classification method in Ex-
ercise 11.9.10 to the South African heart disease data, with Y indicating heart disease
(chd), X containing the two variables age and type A, and Z being the family history
of heart disease variable (history: 0 = absent, 1 = present). Randomly divide the data
into two parts: The training data with n = 362, and the test data with n = 100. E.g.,
use
random.index <− sample(462,100)
sahd.train <− SAheart[−random.index,]
sahd.test <− SAheart[random.index,]
(a) Estimate the αz and γz using the training data. Find the observed misclassification
rate on the training data, where you classify an observation as Ŷ_i = 1 if x_i α̂_z′ + γ̂_z > 0,
and Ŷ_i = 0 otherwise. What is the misclassification rate for the test data (using the
estimates from the training data)? Give the 2 × 2 table showing true and predicted
Y’s for the test data. (b) Using the same training data, find the classification tree. You
don’t have to do any pruning. Just take the full tree from the tree program. Find the
misclassification rates for the training data and the test data. Give the table showing
true and predicted Y’s for the test data. (c) Still using the training data, find the
classification using logistic regression, with the X and Z as the explanatory variables.
What are the coefficients for the explanatory variables? Find the misclassification
rates for the training data and the test data. (d) What do you conclude?
Chapter 12
Clustering
The classification and prediction we have covered in previous chapters were cases
of supervised learning. For example, in classification, we try to find a function that
classifies individuals into groups using their x values, where in the training set we
know what the proper groups are because we observe their y’s. In clustering, we
again wish to classify observations into groups using their x’s, but do not know the
correct groups even in the training set, i.e., we do not observe the y’s, nor often even
know how many groups there are. Clustering is a case of unsupervised learning.
There are many clustering algorithms. Most are reasonably easy to implement
given the number K of clusters. The difficult part is deciding what K should be.
Unlike in classification, there is no obvious cross-validation procedure to balance
the number of clusters with the tightness of the clusters. Only in the model-based
clustering do we have direct AIC or BIC criteria. Otherwise, a number of reasonable
but ad hoc measures have been proposed. We will look at two: gap statistics, and
silhouettes.
In some situations one is not necessarily assuming that there are underlying clus-
ters, but rather is trying to divide the observations into a certain number of groups
for other purposes. For example, a teacher in a class of 40 students might want to
break up the class into four sections of about ten each based on general ability (to
give more focused instruction to each group). The teacher does not necessarily think
there will be wide gaps between the groups, but still wishes to divide for pedagogical
purposes. In such cases K is fixed, so the task is a bit simpler.
In general, though, when clustering one is looking for groups that are well sep-
arated. There is often an underlying model, just as in Chapter 11 on model-based
classification. That is, the data are
(Y1 , X1 ), . . . , (Yn , Xn ), iid, (12.1)
where yi ∈ {1, . . . , K },
X | Y = k ∼ f k (x) = f (x | θk ) and P [Y = k] = π k , (12.2)
as in (11.2) and (11.3). If the parameters are known, then the clustering proceeds
exactly as for classification, where an observation x is placed into the group
C(x) = k that maximizes f_k(x) π_k / ( f_1(x) π_1 + · · · + f_K(x) π_K ). (12.3)
See (11.13). The fly in the ointment is that we do not observe the yi ’s (neither in the
training set nor for the new observations), nor do we necessarily know what K is, let
alone the parameter values.
The following sections look at some approaches to clustering. The first, K-means,
does not explicitly use a model, but has in the back of its mind f k ’s being N (μk , σ2 I p ).
Hierarchical clustering avoids the problem of choosing the number of clusters by creating a tree
containing clusterings of all sizes, from K = 1 to n. Finally, model-based cluster-
ing explicitly assumes the f_k's are multivariate normal (or some other given distribu-
tion), with various possibilities for the covariance matrices.
12.1 K-Means
For a given number K of groups, K-means assumes that each group has a mean vector
μk . Observation xi is assigned to the group with the closest mean. To estimate these
means, we minimize the sum of the squared distances from the observations to their
group means:
obj(μ_1, . . . , μ_K) = ∑_{i=1}^{n} min_{k=1,...,K} ‖x_i − μ_k‖². (12.4)
An algorithm for finding the clusters starts with a random set of means μ̂_1, . . . , μ̂_K
(e.g., randomly choose K observations from the data), then iterates the following two
steps:
1. Given the current means, assign each observation to the group whose mean is closest:
Ĉ(x_i) = k that minimizes ‖x_i − μ̂_k‖². (12.5)
2. Given the assignments, recalculate each group mean as the average of the observations assigned to it:
μ̂_k = ( 1 / #{Ĉ(x_i) = k} ) ∑_{i | Ĉ(x_i) = k} x_i. (12.6)
The algorithm is guaranteed to converge, but not necessarily to the global mini-
mum. It is a good idea to try several random starts, then take the one that yields the
lowest obj in (12.4). The resulting means and assignments are the K-means and their
clustering.
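A bare-bones R sketch of the two steps above (for illustration only; it does not handle empty clusters, and the kmeans function used later is what one would use in practice):

kmeans.sketch <- function(x, K, iters = 20) {
  x <- as.matrix(x)
  mu <- x[sample(nrow(x), K), , drop = FALSE]    # start at K randomly chosen observations
  for (it in 1:iters) {
    # squared distances from each observation to each current mean (n x K)
    d2 <- apply(mu, 1, function(m) rowSums(sweep(x, 2, m)^2))
    cl <- apply(d2, 1, which.min)                # step 1: assign to the closest mean
    for (k in 1:K)                               # step 2: recompute the means
      mu[k, ] <- colMeans(x[cl == k, , drop = FALSE])
  }
  list(cluster = cl, centers = mu)
}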
Figure 12.1: The first plot shows the log of the total sums of squares for cluster sizes
from K = 1 to 10 for the data (solid line), and for 100 random uniform samples (the
clump of curves). The second plot exhibits the gap statistics with ± SD lines.
Tibshirani, Walther, and Hastie [2001] take a different approach, proposing the gap
statistic, which compares the observed log(SS (K ))’s with what would be expected
from a sample with no cluster structure. We are targeting the values
Gap(K) = E_0[ log(SS(K)) ] − log(SS(K)), (12.10)
where E_0[·] denotes expected value under some null distribution on the X_i's. Tib-
shirani et al. [2001] suggest taking a uniform distribution over the range of the data,
possibly after rotating the data to the principal components. A large value of Gap(K )
indicates that the observed clustering is substantially better than what would be ex-
pected if there were no clusters. Thus we look for a K with a large gap.
Because the sports data are rankings, it is natural to consider as a null distribu-
tion that the observations are independent and uniform over the permutations of
{1, 2, . . . , 7}. We cannot analytically determine the expected value in (12.10), so we
use simulations. For each b = 1, . . . , B = 100, we generate n = 130 random rankings,
perform K-means clustering for K = 1, . . . , 10, and find the corresponding SSb (K )’s.
These make up the dense clump of curves in the first plot of Figure 12.1.
The Gap(K ) in (12.10) is then estimated by using the average of the random curves,
Ĝap(K) = (1/B) ∑_{b=1}^{B} log(SS_b(K)) − log(SS(K)). (12.11)
The second plot in Figure 12.1 graphs this estimated curve, along with curves plus
or minus one standard deviation of the SSb (K )’s. Clearly K = 2 is much better than
K = 1; K’s larger than two do not appear to be better than two, so that the gap statistic
suggests K = 2 to be appropriate. Even if K = 3 had a higher gap, unless it is higher
by a standard deviation, one may wish to stick with the simpler K = 2. Of course,
interpretability is a strong consideration as well.
12.1.3 Silhouettes
Another measure of clustering efficacy is Rousseeuw’s [1987] notion of silhouettes.
The silhouette of an observation i measures how well it fits in its own cluster versus
how well it fits in its next closest cluster. Adapted to K-means, we have
a(i) = ‖x_i − μ̂_k‖² and b(i) = ‖x_i − μ̂_l‖², (12.12)
where observation i is assigned to group k, and group l has the next-closest group
mean to xi . Then its silhouette is
silhouette(i) = ( b(i) − a(i) ) / max{ a(i), b(i) }. (12.13)
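The function silhouette.km used later (given in Section A.4.1) computes these; a simplified sketch of the idea, given the data matrix and the K × p matrix of cluster means (not necessarily the same code as in the appendix), is:

silhouette.sketch <- function(x, centers) {
  x <- as.matrix(x)
  # squared distances from each observation to each cluster mean (n x K)
  d2 <- apply(centers, 1, function(m) rowSums(sweep(x, 2, m)^2))
  d2 <- t(apply(d2, 1, sort))      # sort each row: closest mean first
  a <- d2[, 1]                     # a(i): distance to the assigned (closest) mean
  b <- d2[, 2]                     # b(i): distance to the next-closest mean
  (b - a) / pmax(a, b)
}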
wi = xi z , i = 1, . . . , N. (12.15)
Figure 12.4 is the histogram for the wi ’s, where group 1 has wi > 0 and group 2 has
wi < 0. We can see that the clusters are well-defined in that the bulk of each cluster
is far from the center of the other cluster.
We have also plotted the sports, found by creating a “pure” ranking for each sport.
Thus the pure ranking for baseball would give baseball the rank of 1, and the other
sports the rank of 4.5, so that the sum of the ranks, 28, is the same as for the other
rankings. Adding these sports to the plot helps in interpreting the groups: team
sports on the left, individual sports on the right, with tennis on the individual-sport
side, but close to the border.
Figure 12.2: The silhouettes for K = 2, . . . , 5 clusters. The horizontal axis indexes the
observations, and the vertical axis exhibits the values of the silhouettes. The average
silhouette widths are 0.625 (K = 2), 0.555 (K = 3), 0.508 (K = 4), and 0.534 (K = 5).
Figure 12.3: The average silhouette width for K = 2, . . . , 10.
Figure 12.4: The histogram for the observations along the line connecting the two
means for K = 2 groups.
If K = 3, then the three means lie in a plane, hence we would like to project the
observations onto that plane. One approach is to use principal components (Section
1.6) on the means. Because there are three, only the first two principal components
will have positive variance, so that all the action will be in the first two. Letting
Z = ( μ̂_1′ μ̂_2′ μ̂_3′ )′, the matrix whose rows are the three group means, (12.16)
we apply the spectral decomposition (1.33) in Theorem 1.1 to the sample covariance
matrix of Z:
(1/3) Z′ H_3 Z = G L G′, (12.17)
where G is orthogonal and L is diagonal. The diagonals of L here are 11.77, 4.07, and
five zeros. We then rotate the data and the means using G,
Figure 12.5 plots the first two variables for W and W^{(means)}, along with the seven pure
rankings. We see the people who like team sports to the right, and the people who
like individual sports to the left, divided into those who can and those who cannot
abide jogging. Compare this plot to the biplot that appears in Figure 1.6.
Figure 12.5: The scatter plot for the data projected onto the plane containing the
means for K = 3.
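The clusterings referred to below can be obtained with the kmeans function; a sketch of the calls, assuming the sports rankings are in the matrix sportsranks (as in the code below) and collecting the results in the list kms:

kms <- vector("list", 10)     # kms[[K]] will hold the K-group clustering
for (K in 2:10) {
  kms[[K]] <- kmeans(sportsranks, centers = K, nstart = 10)
}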
The centers input specifies the number of groups desired, and nstart=10 means ran-
domly start the algorithm ten times, then use the one with lowest within sum of
squares. The output in kms[[K]] for the K-group clustering is a list with centers, the
K × p matrix of estimated cluster means; cluster, an n-vector that assigns each observation
to its cluster (i.e., the yi ’s); withinss, the K-vector of SSk ’s (so that SS (K ) is found
by sum(kms[[K]]$withinss)); and size, the K-vector giving the numbers of observations
assigned to each group.
Gap statistic
To find the gap statistic, we first calculate the vector of SS (K )’s in (12.9) for K =
1, . . . , 10. For K = 1, there is just one large group, so that SS (1) is the sum of the
sample variances of the variables, times n − 1. Thus
n <− nrow(sportsranks) # n=130
ss <− tr(var(sportsranks))∗(n−1) # For K=1
for(K in 2:10) {
ss <− c(ss, sum(kms[[K]]$withinss))
}
The solid line in the first plot in Figure 12.1 is K versus log(ss). (Or something like it;
there is randomness in the results.)
For the summation term on the right-hand side of (12.11), we use uniformly dis-
tributed permutations of {1, . . . , 7}, which uses the command sample(7). For non-
rank statistics, one has to try some other randomization. For each b = 1, . . . , 100,
we create n = 130 random permutations, then go through the K-means process for
K = 1, . . . , 10. The xstar is the n × 7 random data set.
ssb <− NULL
for(b in 1:100) {
xstar <− NULL
for(i in 1:n) xstar <− rbind(xstar,sample(7))
sstar <− tr(var(xstar))∗(n−1)
for(K in 2:10) {
sstar <− c(sstar,sum(kmeans(xstar,centers=K,nstart=10)$withinss))
}
ssb <− rbind(ssb,sstar)
}
Now column K of ssb contains the B = 100 simulated values of SS_b(K). The gap statistics (12.11)
and the two plots in Figure 12.1 are found using
par(mfrow=c(1,2)) # Set up two plots
matplot(1:10,log(cbind(ss,t(ssb))),type=’l’,xlab=’K’,ylab=’log(SS)’)
ssbm <− apply(log(ssb),2,mean) # Mean of log(ssb[,K])’s
ssbsd <− sqrt(apply(log(ssb),2,var)) # SD of log(ssb[,K])’s
gap <− ssbm − log(ss) # The vector of gap statistics
matplot(1:10,cbind(gap,gap+ssbsd,gap−ssbsd),type=’l’,xlab=’K’,ylab=’Gap’)
Silhouettes
Section A.4.1 contains a simple function for calculating the silhouettes in (12.13) for
a given K-means clustering. The sort.silhouette function in Section A.4.2 sorts the
silhouette values for plotting. The following statements produce Figure 12.2:
sil.ave <− NULL # To collect silhouette’s means for each K
par(mfrow=c(3,3))
for(K in 2:10) {
sil <− silhouette.km(sportsranks,kms[[K]]$centers)
sil.ave <− c(sil.ave,mean(sil))
ssil <− sort.silhouette(sil,kms[[K]]$cluster)
plot(ssil,type=’h’,xlab=’Observations’,ylab=’Silhouettes’)
title(paste(’K =’,K))
}
The sil.ave calculated above can then be used to obtain Figure 12.3:
plot(2:10,sil.ave,type=’l’,xlab=’K’,ylab=’Average silhouette width’)
12.2 K-medoids
Clustering with medoids [Kaufman and Rousseeuw, 1990] works directly on dis-
tances between objects. Suppose we have n objects, o1 , . . . , on , and a dissimilarity
measure d(o_i, o_j) between pairs. This d satisfies
d(o_i, o_j) ≥ 0, d(o_i, o_i) = 0, and d(o_i, o_j) = d(o_j, o_i), (12.19)
but it may not be an actual metric in that it need not satisfy the triangle inequality.
Note that one cannot necessarily impute distances between an object and another
vector, e.g., a mean vector. Rather than clustering around means, the clusters are then
built around some of the objects. That is, K-medoids finds K of the objects (c1 , . . . , cK )
to act as centers (or medoids), the objective being to find the set that minimizes
obj(c_1, . . . , c_K) = ∑_{i=1}^{N} min_{k=1,...,K} d(o_i, c_k). (12.20)
Silhouettes are defined as in (12.13), except that here, for each observation i,
a(i) = ∑_{j ∈ Group k} d(o_i, o_j) and b(i) = ∑_{j ∈ Group l} d(o_i, o_j), (12.21)
where group k is object i’s group, and group l is its next closest group.
In R, one can use the package cluster [Maechler et al., 2005], which implements
K-medoids clustering in the function pam, which stands for partitioning around
medoids. Consider the grades data in Section 4.2.1. We will cluster the five vari-
ables, homework, labs, inclass, midterms, and final, not the 107 people. A natural
measure of similarity between two variables is their correlation. Instead of using the
usual Pearson coefficient, we will use Kendall's τ, which is more robust. For n × 1
vectors x and y, Kendall's τ is
τ(x, y) = ( 2 / (n(n − 1)) ) ∑_{i<j} sign(x_i − x_j) sign(y_i − y_j).
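A sketch of how such a clustering can be run, assuming the five grade variables are the columns of a matrix called grades and using 1 − τ as the dissimilarity (both of these details are assumptions here, not the original code):

library(cluster)
taus <- cor(grades, method = "kendall")   # Kendall's tau between each pair of variables
dis <- as.dist(1 - taus)                  # turn similarities into dissimilarities
pam3 <- pam(dis, k = 3)                   # partition the five variables around 3 medoids
pam3$clustering                           # cluster assignment of each variable
pam3$silinfo$avg.width                    # average silhouette width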
We see that K = 3 has the best average silhouette. The assigned groups for this clus-
tering can be found in pam3$clustering, which is (1,1,2,3,3), meaning the groupings
are, reasonably enough,
{homework, labs}, {inclass}, {midterms, final}.
The medoids, i.e., the objects chosen as centers, are in this case labs, inclass, and
midterms, respectively.
12.3 Model-based clustering
In model-based clustering, the parameters in (12.2) are estimated from the data, and
observation x_i is then assigned to the group
Ĉ(x_i) = k that maximizes f(x_i | θ̂_k) π̂_k / ( f(x_i | θ̂_1) π̂_1 + · · · + f(x_i | θ̂_K) π̂_K ). (12.26)
The estimation is based on the marginal distribution of the x_i's, which is the mixture density
f(x_i) = f(x_i | θ_1, . . . , θ_K, π_1, . . . , π_K)
      = f(x_i | θ_1) π_1 + · · · + f(x_i | θ_K) π_K. (12.27)
Here the f(x | θ_k)'s are taken to be multivariate normal densities with θ_k = (μ_k, Σ_k).
We will assume the μk ’s are free to vary, although models in which there are equalities
among some of the elements are certainly reasonable. There are also a variety of
structural and equality assumptions on the Σk ’s used.
Figure 12.6: −BIC's for the model-based clusterings of the automobile data, for the various covariance-structure codes and numbers of groups K.
data frame cu.dimensions. The variables in cars have been normalized to have medians
of 0 and median absolute deviations (MAD) of 1.4826 (the MAD for a N (0, 1)).
The routine we’ll use is Mclust (be sure to capitalize the M). It will try various
forms of the covariance matrices and group sizes, and pick the best based on the BIC.
To use the default options and have the results placed in mcars, use
mcars <− Mclust(cars)
There are many options for plotting in the package. To see a plot of the BIC’s, use
plot(mcars,cars,what=’BIC’)
You have to click on the graphics window, or hit enter, to reveal the plot. The result
is in Figure 12.6. The horizontal axis specifies the K, and the vertical axis gives the
BIC values, although these are the negatives of our BIC's. The symbols plotted
on the graph are codes for various structural hypotheses on the covariances. See
(12.35). In this example, the best model is Model “VVV” with K = 2, which means
the covariance matrices are arbitrary and unequal.
Some pairwise plots (length versus height, width versus front head room, and
rear head room versus luggage) are given in Figure 12.7. The plots include ellipses to
illustrate the covariance matrices. Indeed we see that the two ellipses in each plot are
arbitrary and unequal. To plot variable 1 (length) versus variable 4 (height), use
plot(mcars,cars,what=’classification’,dimens=c(1,4))
We also plot the first two principal components (Section 1.6). The matrix of eigenvec-
tors, G in (1.33), is given by eigen(var(cars))$vectors:
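A sketch of one way to compute and plot the principal components (an assumption about the exact code; the components are the data multiplied by the eigenvector matrix):

g <- eigen(var(cars))$vectors        # columns are the eigenvectors of S
pc <- as.matrix(cars) %*% g          # the principal components
plot(pc[, 1], pc[, 2], xlab = "PC1", ylab = "PC2")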
Figure 12.7: Some two-variable plots of the clustering produced by Mclust. The solid
triangles indicate group 1, and the open squares indicate group 2. The fourth graph
plots the first two principal components of the data.
To obtain the ellipses, we redid the clustering using the principal components as the
data, and specifying G=2 groups in Mclust.
Look at the plots. The lower left graph shows that group 2 is almost constant
on the luggage variable. In addition, the upper left and lower right graphs indicate
that group 2 can be divided into two groups, although the BIC did not pick up the
difference. The Table 12.1 exhibits four of the variables for the 15 automobiles in
group 2.
We have divided this group as suggested by the principal component plot. Note
that the first group of five are all sports cars. They have no back seats or luggage areas,
hence the values in the data set for the corresponding variables are coded somehow.
The other ten automobiles are minivans. They do not have specific luggage areas, i.e.,
trunks, either, although in a sense the whole vehicle is a big luggage area. Thus this
group really is a union of two smaller groups, both of which are quite a bit different
than group 1.
Table 12.1: The automobiles in group 2 of the clustering of all the data.
Figure 12.8: −BIC’s for the data set without the sports cars or minivans.
Each covariance matrix is decomposed as
Σ_k = c_k Γ_k Δ_k Γ_k′, (12.31)
where c_k > 0, Γ_k is orthogonal, and Δ_k is diagonal with positive diagonal elements whose product is
|Δ_k| = 1. (12.32)
The shape, volume, and orientation of Σ_k are then defined by
Shape(Σ_k) = Δ_k; Volume(Σ_k) = |Σ_k| = c_k^p; Orientation(Σ_k) = Γ_k. (12.33)
The covariance matrices are then classified into spherical, diagonal, and ellipsoidal:
Spherical ⇒ Δ_k = I_p ⇒ Σ_k = c_k I_p;
Diagonal ⇒ Γ_k = I_p ⇒ Σ_k = c_k Δ_k;
Ellipsoidal ⇒ Σ_k is arbitrary. (12.34)
The various models are defined by the type of covariances, and what equalities
there are among them. I haven’t been able to crack the code totally, but the descrip-
tions tell the story. When K ≥ 2 and p ≥ 2, the following table may help translate the
descriptions into restrictions on the covariance matrices through (12.33) and (12.34):
Code   Description                                              Σ_k
EII    spherical, equal volume                                  σ² I_p
VII    spherical, unequal volume                                σ_k² I_p
EEI    diagonal, equal volume and shape                         Λ
VEI    diagonal, varying volume, equal shape                    c_k Δ
EVI    diagonal, equal volume, varying shape                    c Δ_k
VVI    diagonal, varying volume and shape                       Λ_k
EEE    ellipsoidal, equal volume, shape, and orientation        Σ
EEV    ellipsoidal, equal volume and equal shape                Γ_k Λ Γ_k′
VEV    ellipsoidal, equal shape                                 c_k Γ_k Δ Γ_k′
VVV    ellipsoidal, varying volume, shape, and orientation      arbitrary
(12.35)
Here, Λ’s are diagonal matrices with positive diagonals, Δ’s are diagonal matrices
with positive diagonals whose product is 1 as in (12.32), Γ’s are orthogonal matrices,
Σ’s are arbitrary nonnegative definite symmetric matrices, and c’s are positive scalars.
A subscript k on an element means the groups can have different values for that
element. No subscript means that element is the same for each group.
If there is only one variable, but K ≥ 2, then the only two models are “E,” meaning
the variances of the groups are equal, and “V,” meaning the variances can vary. If
there is only one group, then the models are as follows:
Code Description Σ
X one-dimensional σ2
XII spherical σ2 I p (12.36)
XXI diagonal Λ
XXX ellipsoidal arbitrary
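To fit one particular structure rather than letting Mclust search over all of them, the code and number of groups can be given explicitly; for example (a sketch, reusing the cars data from above):

mcars.eee <- Mclust(cars, G = 2, modelNames = "EEE")
summary(mcars.eee)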
Suppose we start with initial estimates of the π k ’s, μk ’s, and Σk ’s. E.g., one could
first perform a K-means procedure, then use the sample means and covariance ma-
trices of the groups to estimate the means and covariances, and estimate the π k ’s by
the proportions of observations in the groups. Then, as in (12.26), for step 1 we use
P[Y = k | X = x_i] = f(x_i | μ̂_k, Σ̂_k) π̂_k / ( f(x_i | μ̂_1, Σ̂_1) π̂_1 + · · · + f(x_i | μ̂_K, Σ̂_K) π̂_K ) ≡ w_k^{(i)}, (12.37)
where θ̂_k = (μ̂_k, Σ̂_k).
Note that for each i, the w_k^{(i)} can be thought of as weights, because their sum over
k is 1. Then in Step 2, we find the weighted means and covariances of the x_i's:
μ̂_k = (1/n̂_k) ∑_{i=1}^{n} w_k^{(i)} x_i and Σ̂_k = (1/n̂_k) ∑_{i=1}^{n} w_k^{(i)} (x_i − μ̂_k)′ (x_i − μ̂_k),
where n̂_k = ∑_{i=1}^{n} w_k^{(i)}. Also, π̂_k = n̂_k / n. (12.38)
The two steps are iterated until convergence. The convergence may be slow, and
it may not approach the global maximum likelihood, but it is guaranteed to increase
the likelihood at each step. As in K-means, it is a good idea to try different starting
points.
In the end, the observations are clustered using the conditional probabilities, be-
cause from (12.26),
Ĉ(x_i) = k that maximizes w_k^{(i)}. (12.39)
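A sketch of one iteration of these two steps for the unrestricted ("VVV") model, using dmvnorm from the mvtnorm package for the multivariate normal density (this code is an illustration only, not the routine used by mclust):

library(mvtnorm)
em.step <- function(x, mu, Sigma, prob) {
  # x: n x p data matrix; mu: K x p means; Sigma: list of K p x p matrices; prob: K probabilities
  x <- as.matrix(x)
  n <- nrow(x); K <- length(prob)
  w <- sapply(1:K, function(k) prob[k] * dmvnorm(x, mu[k, ], Sigma[[k]]))
  w <- w / rowSums(w)                       # the w_k^(i) of (12.37)
  nk <- colSums(w)
  for (k in 1:K) {
    mu[k, ] <- colSums(w[, k] * x) / nk[k]  # weighted means, (12.38)
    xc <- sweep(x, 2, mu[k, ])
    Sigma[[k]] <- t(xc) %*% (w[, k] * xc) / nk[k]   # weighted covariances
  }
  list(mu = mu, Sigma = Sigma, prob = nk / n, w = w)
}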
If f_k is the N(μ_k, σ² I_p) density, then
f_k(x_i) = (c / σ^p) e^{−‖x_i − μ_k‖² / (2σ²)}. (12.41)
If σ is fixed, then the EM algorithm proceeds as above, except that the covariance
calculation in (12.38) is unnecessary. If we let σ → 0 in (12.37), fixing the means, we
have that
P[Y = k | X = x_i] −→ w_k^{(i)} (12.42)
for the w_k^{(i)} in (12.40), at least if all the π̂_k's are positive. Thus for small fixed σ,
K-means and model-based clustering are practically the same.
Allowing σ to be estimated as well leads to what we call soft K-means, soft be-
cause we use a weighted mean, where the weights depend on the distances from the
observations to the group means. In this case, the EM algorithm is as in (12.37) and
(12.38), but with the estimate of the covariance replaced with the pooled estimate of σ²,
σ̂² = (1/n) ∑_{k=1}^{K} ∑_{i=1}^{n} w_k^{(i)} ‖x_i − μ̂_k‖². (12.43)
12.6 Hierarchical clustering
Figure 12.9: Hierarchical clustering of the grades, using complete linkage.
For a set of objects, the question is which clusters to combine at each stage. At
the first stage, we combine the two closest objects, that is, the pair (oi , o j ) with the
smallest d(o1 , o j ). At any further stage, we may wish to combine two individual
objects, or a single object to a group, or two groups. Thus we need to decide how
to measure the dissimilarity between any two groups of objects. There are many
possibilities. Three popular ones look at the minimum, average, and maximum of
the individuals’ distances. That is, suppose A and B are subsets of objects. Then the
three distances between the subsets are the single, average, and complete linkage distances,
d_single(A, B) = min{ d(a, b) | a ∈ A, b ∈ B },
d_average(A, B) = ( 1 / (#A · #B) ) ∑_{a ∈ A} ∑_{b ∈ B} d(a, b),
d_complete(A, B) = max{ d(a, b) | a ∈ A, b ∈ B }.
In all cases, d({a}, {b}) = d(a, b). Complete linkage is an example of Hausdorff
distance, at least when d is a distance.
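In R, the hclust function carries out these agglomerations; a sketch of commands behind dendrograms like those in Figure 12.11, assuming Euclidean distances between the individuals' ranking vectors in sportsranks (the dissimilarity actually used there is an assumption):

d <- dist(sportsranks)                          # Euclidean distances between individuals
hc.complete <- hclust(d, method = "complete")
hc.single <- hclust(d, method = "single")
par(mfrow = c(2, 1))
plot(hc.complete, main = "Complete linkage", labels = FALSE)
plot(hc.single, main = "Single linkage", labels = FALSE)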
Figure 12.11: Clustering the individuals in the sports data, using complete linkage
(top) and single linkage (bottom).
Complete linkage tends to favor similar-sized clusters, because by using the max-
imum distance, it is easier for two small clusters to get together than anything to
attach itself to a large cluster. Single linkage tends to favor a few large clusters, and
the rest small, because the larger the cluster, the more likely it will be close to small
clusters. These ideas are borne out in the plot, where complete linkage yields a more
treey-looking dendrogram.
12.7 Exercises
Exercise 12.7.1. Show that |Σ_k| = c_k^p in (12.33) follows from (12.31) and (12.32).
Exercise 12.7.2. (a) Show that the EM algorithm, where we use the w_k^{(i)}'s in (12.40) as
the estimate of P[Y = k | X = x_i], rather than that in (12.37), is the K-means algorithm
of Section 12.1. [Note: You have to worry only about the mean in (12.38).] (b) Show
that the limit as σ → 0 of P[Y = k | X = x_i] is indeed given in (12.40), if we use the f_k
in (12.41) in (12.37).
Exercise 12.7.3 (Grades). This problem is to cluster the students in the grades data
based on variables 2–6: homework, labs, inclass, midterms, and final. (a) Use K-
means clustering for K = 2. (Use nstart=100, which is a little high, but makes sure
everyone gets similar answers.) Look at the centers, and briefly characterize the
clusters. Compare the men and women (variable 1, 0=Male, 1=Female) on which
clusters they are in. (Be sure to take into account that there are about twice as many
women as men.) Any differences? (b) Same question, for K = 3. (c) Same question,
for K = 4. (d) Find the average silhouettes for the K = 2, 3 and 4 clusterings from
parts (a), (b) and (c). Which K has the highest average silhouette? (e) Use soft K-
means to find the K = 1, 2, 3 and 4 clusterings. Which K is best according to the
BIC’s? (Be aware that the BIC’s in Mclust are negative what we use.) Is it the same as
for the best K-means clustering (based on silhouettes) found in part (d)? (f) For each
of K = 2, 3, 4, compare the classifications of the data using regular K-means to that of
soft K-means. That is, match the clusters produced by both methods for given K, and
count how many observations were differently clustered.
Exercise 12.7.4 (Diabetes). The R package mclust contains the data set diabetes [Reaven
and Miller, 1979]. There are n = 145 subjects and four variables. The first variable
(class) is a categorical variable indicating whether the subject has overt diabetes (my
interpretation: symptoms are obvious), chemical diabetes (my interpretation: can
only be detected through chemical analysis of the blood), or is normal (no diabetes).
The other three variables are blood measurements: glucose, insulin, sspg. First,
normalize the three blood measurement variables so that they have means zero and
variances 1:
blood <− scale(diabetes[,2:4])
(a) Use K-means to cluster the observations on the three normalized blood measure-
ment variables for K = 1, 2, . . . , 9. (b) Find the gap statistics for the clusterings in part
(a). To generate a random observation, use three independent uniforms, where their
ranges coincide with the ranges of the three variables. So to generate a random data
set xstar:
n <− nrow(blood)
p <− ncol(blood)
ranges <− apply(blood,2,range) # Obtains the mins and maxes
xstar <− NULL # To contain the new data set
for(j in 1:p) {
xstar <− cbind(xstar,runif(n,ranges[1,j],ranges[2,j]))
}
Which K would you choose based on this criterion? (c) Find the average silhouettes
for the clusterings found in part (a), except for K = 1. Which K would you choose
based on this criterion? (d) Use model-based clustering, again with K = 1, . . . , 9.
Which model and K has the best BIC? (e) For each of the three “best” clusterings
in parts (b), (c), and (d), plot each pair of variables, indicating which cluster each
point was assigned to, as in Figure 12.7. Compare these to the same plots that use
the class variable as the indicator. What do you notice? (f) For each of the three
best clusterings, find the table comparing the clusters with the class variable. Which
clustering was closest to the class variable? Why do you suppose that clustering was
closest? (Look at the plots.)
Exercise 12.7.5 (Iris). This question applies model-based clustering to the iris data,
pretending we do not know which observations are in which species. (a) Do the
model-based clustering without any restrictions (i.e., use the defaults). Which model
and number K was best, according to BIC? Compare the clustering for this best model
to the actual species. (b) Now look at the BIC’s for the model chosen in part (a), but
for the various K’s from 1 to 9. Calculate the corresponding estimated posterior
probabilities. What do you see? (c) Fit the same model, but with K = 3. Now
compare the clustering to the true species.
Exercise 12.7.6 (Grades). Verify the dissimilarity matrices in (12.47) and (12.49).
Exercise 12.7.7 (Soft drinks). The data set softdrinks has 23 people's rankings of 8
soft drinks: Coke, Pepsi, Sprite, 7-up, and their diet equivalents. Do a hierarchical
clustering on the drinks, so that the command is
hclust(dist(t(softdrinks2)))
then plot the tree with the appropriate labels. Describe the tree. Does the clustering
make sense?
Exercise 12.7.8 (Cereal). Exercise 1.9.19 presented the cereal data (in the R data ma-
trix cereal), finding the biplot. Do hierarchical clustering on the cereals, and on the
attributes. Do the clusters make sense? What else would you like to know from these
data? Compare the clusterings to the biplot.
Chapter 13
Data reduction is a common goal in multivariate analysis — one has too many vari-
ables, and wishes to reduce the number of them without losing much information.
How to approach the reduction depends of course on the goal of the analysis. For
example, in linear models, there are clear dependent variables (in the Y matrix) that
we are trying to explain or predict from the explanatory variables (in the x matrix,
and possibly the z matrix). Then Mallows’ C p or cross-validation are reasonable ap-
proaches. If the correlations between the Y’s are of interest, then factor analysis is
appropriate, where the likelihood ratio test is a good measure of how many factors
to take. In classification, using cross-validation is a good way to decide on the vari-
ables. In model-based clustering, and in fact any situation with a likelihood, one can
balance the fit and complexity of the model using something like AIC or BIC.
There are other situations in which the goal is not so clear cut as in those above;
one is more interested in exploring the data, using data reduction to get a better
handle on the data, in the hope that something interesting will reveal itself. The
reduced data may then be used in more formal models, although I recommend first
considering targeted reductions as mentioned in the previous paragraph, rather than
immediately jumping to principal components.
Below we discuss principal components in more depth, then present multidimen-
sional scaling, and canonical correlations.
13.1 Principal components, redux
Recall that the Fisher/Anderson iris data (Section 1.3.1) has n = 150 observations
and q = 4 variables. The measurements of the petals and sepals are in centimeters,
so it is reasonable to leave the data unscaled. On the other hand, the variances of
the variables do differ, so scaling so that each has unit variance is also reasonable.
Furthermore, we could either leave the data unadjusted in the sense of subtracting
the overall mean when finding the covariance matrix, or adjust the data for species by
subtracting from each observation the mean of its species. Thus we have four reason-
able starting points for principal components, based on whether we adjust for species
and whether we scale the variables. Figure 13.1 has plots of the first two principal
components for each of these possibilities. Note that there is a stark difference be-
tween the plots based on adjusted and unadjusted data. The unadjusted plots show
a clear separation based on species, while the adjusted plots have the species totally
mixed, which would be expected because there are differences in means between
the species. Adjusting hides those differences. There are less obvious differences
between the scaled and unscaled plots within adjusted/unadjusted pairs. For the ad-
justed data, the unscaled plot seems to have fairly equal spreads for the three species,
while the scaled data has the virginica observations more spread out than the other
two species.
The table below shows the sample variances, s2 , and first principal component’s
loadings (sample eigenvector), PC1 , for each of the four sets of principal components:
Figure 13.1: Plots of the first two principal components for the iris data, depending on
whether adjusting for species and whether scaling the variables to unit variance. For
the individual points, “s” indicates setosa, “v” indicates versicolor, and “g” indicates
virginica.
                      Unadjusted                     Adjusted
                Unscaled       Scaled          Unscaled       Scaled
                s²     PC1     s²     PC1      s²     PC1     s²     PC1
Sepal Length    0.69   0.36    1      0.52     0.26   0.74    1     −0.54
Sepal Width     0.19  −0.08    1     −0.27     0.11   0.32    1     −0.47
Petal Length    3.12   0.86    1      0.58     0.18   0.57    1     −0.53
Petal Width     0.58   0.36    1      0.56     0.04   0.16    1     −0.45      (13.1)
Note that whether adjusted or not, the relative variances of the variables affect
the relative weighting they have in the principal component. For example, for the
unadjusted data, petal length has the highest variance in the unscaled data, and
receives the highest loading in the eigenvector. That is, the first principal component
is primarily petal length. But for the scaled data, all variables are forced to have
the same variance, and now the loadings of the variables are much more equal. The
opposite holds for sepal width. A similar effect is seen for the adjusted data. The
sepal length has the highest unscaled variance and highest loading in PC1, and petal
width the lowest variance and loading. But scaled, the loadings are approximately
equal.
Figure 13.2: The left-hand plot is a scree plot (i versus l_i) of the eigenvalues for the
automobile data. The right-hand plot shows i versus log(l_i/l_{i+1}), the successive log-
proportional gaps.
Any of the four sets of principal components is reasonable. Which to use depends
on what one is interested in, e.g., if wanting to distinguish between species, the
unadjusted plots are likely more interesting, while when interested in relations within
species, adjusting makes sense. We mention that in cases where the units are vastly
different for the variables, e.g., population in thousands and areas in square miles of
cities, leaving the data unscaled is less defensible.
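A sketch of how the four variants can be computed (an illustration only, assuming the data are in the usual iris data frame; the book's own code is not shown here):

x <- as.matrix(iris[, 1:4])
species <- iris[, 5]
xadj <- x - apply(x, 2, function(v) ave(v, species))   # subtract each species' means
pcs <- function(z) {                                   # principal components of z
  z <- scale(z, scale = FALSE)                         # center
  z %*% eigen(var(z))$vectors
}
w.uu <- pcs(x)             # unadjusted, unscaled
w.us <- pcs(scale(x))      # unadjusted, scaled
w.au <- pcs(xadj)          # adjusted, unscaled
w.as <- pcs(scale(xadj))   # adjusted, scaled
plot(w.uu[, 1:2])          # e.g., the first two components, unadjusted and unscaled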
For the automobile data, using the scaled variables, the sample eigenvalues of S are
6.210, 1.833, 0.916, 0.691, 0.539, 0.279, 0.221, 0.138, 0.081, 0.061, 0.030. (13.2)
The scree plot is the first one in Figure 13.2. Note that there is a big drop from the
first to the second eigenvalue. There is a smaller drop to the third, then the values
13.1. Principal components, redux 257
seem to level off. Other simple plots can highlight the gaps. For example, the second
plot in the figure shows the logarithms of the successive proportional drops via
log(ratio_i) ≡ log( l_i / l_{i+1} ). (13.3)
The biggest drops are again from #1 to #2, and #2 to #3, but there are almost as large
proportional drops at the fifth and tenth stages.
One may have outside information or requirements that aid in choosing the com-
ponents. For example, there may be a reason one wishes a certain number of com-
ponents (say, three if the next step is a three-dimensional plot), or to have as few
components as possible in order to achieve a certain percentage (e.g., 95%) of the to-
tal variance. If one has an idea that the measurement error for the observed variables
is c, then it makes sense to take just the principal components that have eigenvalue
significantly greater than c2 . Or, as in the iris data, all the data is accurate just to one
decimal place, so that taking c = 0.05 is certainly defensible.
To assess significance, assume that
U ∼ Wishart_q(ν, Σ), and S = (1/ν) U, (13.4)
where ν > q and Σ is invertible. Although we do not necessarily expect this dis-
tribution to hold in practice, it will help develop guidelines to use. Let the spectral
decompositions of S and Σ be
S = G L G′ and Σ = Γ Λ Γ′, (13.5)
where G and Γ are orthogonal, and L and Λ are diagonal with nonincreasing diagonal
elements (the eigenvalues), as in Theorem 1.1. The eigenvalues of S will be distinct
with probability 1. If we assume that the eigenvalues of Σ are also distinct, then
Theorem 13.5.1 in Anderson [1963] shows that for large ν, the sample eigenvalues are
approximately independent, and l_i ≈ N(λ_i, 2λ_i²/ν). If components with λ_i ≤ c² are
ignorable, then it is reasonable to ignore the l_i for which
√ν (l_i − c²) / (√2 l_i) < 2, equivalently, l_i < c² / ( 1 − 2√2/√ν ). (13.6)
(One may be tempted to take c = 0, but if any λi = 0, then the corresponding li will
be zero as well, so that there is no need for hypothesis testing.) Other test statistics
(or really “guidance statistics”) can be easily derived, e.g., to see whether the average
of the k smallest eigenvalues is less than c², or the sum of the first p is greater than
some other cutoff.
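For instance, the criterion (13.6) is simple to apply to a vector of sample eigenvalues (a sketch; evals, nu, and c stand for the l_i's, ν, and the measurement-error scale):

ignorable <- function(evals, nu, c) {
  cutoff <- c^2 / (1 - 2 * sqrt(2) / sqrt(nu))   # the bound in (13.6)
  evals < cutoff                                 # TRUE for the ignorable l_i's
}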
Consider the model in which the eigenvalues of Σ come in K groups of equal values:
λ_1 = · · · = λ_{q_1} = α_1,
λ_{q_1+1} = · · · = λ_{q_1+q_2} = α_2,
...
λ_{q_1+···+q_{K−1}+1} = · · · = λ_q = α_K. (13.8)
Then the space is split into K orthogonal subspaces, of dimensions q1 , . . . , q K ,
where q = q1 + · · · + q K . The vector (q1 , . . . , q K ) is referred to as the pattern of equal-
ities among the eigenvalues. Let Γ be an orthogonal matrix containing eigenvectors
as in (13.5), and partition it as
Γ = Γ1 Γ2 · · · Γ K , Γ k is q × q k , (13.9)
so that Γ k contains the eigenvectors for the q k eigenvalues that equal αk . These are
not unique because Γ k J for any q k × q k orthogonal matrix J will also yield a set of
eigenvectors for those eigenvalues. The subspaces have corresponding projection
matrices P1 , . . . , PK , which are unique, and we can write
Σ = ∑_{k=1}^{K} α_k P_k, where P_k = Γ_k Γ_k′. (13.10)
With this structure, the principal components can be defined only in groups, i.e., the
first q1 of them represent one group, which have higher variance than the next group
of q2 components, etc., down to the final q K components. There is no distinction
within a group, so that one would take either the top q1 components, or the top
q1 + q2 , or the top q1 + q2 + q3 , etc.
Using the distributional assumption (13.4), we find the Bayes information criterion
to choose among the possible patterns (13.8) of equality. The best set can then be used
in plots such as in Figure 13.3, where the gaps will be either enhanced (if large) or
eliminated (if small). The model (13.8) will be denoted M( q1 ,...,qK ) . Anderson [1963]
(see also Section 12.5) shows the following.
Theorem 13.1. Suppose (13.4) holds, and S and Σ have spectral decompositions as in (13.5).
Then the MLE of Σ under the model M_{(q_1,...,q_K)} is given by Σ̂ = G Λ̂ G′, where the λ̂_i's are
found by averaging the relevant l_i's:
λ̂_1 = · · · = λ̂_{q_1} = α̂_1 = (1/q_1)(l_1 + · · · + l_{q_1}),
λ̂_{q_1+1} = · · · = λ̂_{q_1+q_2} = α̂_2 = (1/q_2)(l_{q_1+1} + · · · + l_{q_1+q_2}),
...
λ̂_{q_1+···+q_{K−1}+1} = · · · = λ̂_q = α̂_K = (1/q_K)(l_{q_1+···+q_{K−1}+1} + · · · + l_q). (13.11)
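The averaging in (13.11) is easy to carry out directly (a sketch; l is the vector of sample eigenvalues and pattern is (q_1, . . . , q_K)):

lambda.hat <- function(l, pattern) {
  groups <- rep(seq_along(pattern), pattern)   # which block each eigenvalue belongs to
  rep(tapply(l, groups, mean), pattern)        # replace each l_i by its block average
}
# applied to the eigenvalues in (13.2) with pattern c(1,1,3,3,2,1),
# this gives the lambda-hats shown in (13.15)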
For the automobile data, the MLE's of the eigenvalues under the pattern (1, 1, 3, 3, 2, 1) are
j      1      2      3      4      5
l_j    6.210  1.833  0.916  0.691  0.539
λ̂_j   6.210  1.833  0.716  0.716  0.716
j      6      7      8      9      10     11
l_j    0.279  0.221  0.138  0.081  0.061  0.030
λ̂_j   0.213  0.213  0.213  0.071  0.071  0.030        (13.15)
With ν = n − 1 = 95,
deviance( M_{(1,1,3,3,2,1)}(Σ̂) ; S ) = 95 ∑_j log(λ̂_j) = −1141.398. (13.16)
Table 13.1 contains a number of models, one each for K from 1 to 11. Each
pattern after the first was chosen to be the best that is obtained from the previous by
summing two consecutive q_k's. The estimated probabilities are among those in the
table. Clearly, the preferred model is the one with MLE in (13.15). Note that the assumption
(13.4) is far from holding here, both because the data are not normal, and because
we are using a correlation matrix rather than a covariance matrix. We are hoping,
though, that in any case, the BIC is a reasonable balance of the fit of the model on the
eigenvalues and the number of parameters.
Table 13.1: The BIC’s for the sequence of principal component models for the auto-
mobile data.
Figure 13.3: Plots of j versus the sample l_j's (left), and j versus the MLE's λ̂_j's for the
chosen model (right), both on the log scale.
Figure 13.3 shows the scree plots, using logs, of the sample eigenvalues and the
fitted ones from the best model. Note that the latter gives more aid in deciding how
many components to choose because the gaps are enhanced or eliminated. That is,
taking one or two components is reasonable, but because there is little distinction
among the next three, one may as well take all or none of those three. Similarly with
numbers six, seven and eight.
What about interpretations? Below we have the first five principal component loadings.
Figure 13.4: The first two principal component variables for the automobile data
(excluding sports cars and minivans), clustered into two groups. The horizontal axis
(“overall size”) is the first component; the vertical axis (“tall vs. wide”) is the second.
Using R
In Section 12.3.1 we created cars1, the reduced data set. To center and scale the data,
so that the means are zero and variances are one, use
xcars <− scale(cars1)
The following obtains eigenvalues and eigenvectors of S:
eg <− eigen(var(xcars))
The eigenvalues are in eg$values and the matrix of eigenvectors is in eg$vectors. To
find the deviance and BIC for the pattern (1, 1, 3, 3, 2, 1) seen in (13.15) and (13.17), we
use the function pcbic (detailed in Section A.5.1):
pcbic(eg$values,95,c(1,1,3,3,2,1))
In Section A.5.2 we present the function pcbic.stepwise, which uses the stepwise pro-
cedure to calculate the elements in Table 13.1:
pcbic.stepwise(eg$values,95)
For principal components, where we take the first p components, partition Γ and
Λ in (13.5) as
Γ = ( Γ_1 Γ_2 ) and Λ = ( Λ_1 0 ; 0 Λ_2 ) (block diagonal). (13.22)
Here, Γ1 is q × p, Γ2 is q × (q − p), Λ1 is p × p, and Λ2 is (q − p) × (q − p), the Λk ’s
being diagonal. The large eigenvalues are in Λ1 , the small ones are in Λ2 . Because
I_q = ΓΓ′ = Γ_1Γ_1′ + Γ_2Γ_2′, we can write
Y = YΓ_1Γ_1′ + YΓ_2Γ_2′ = XΓ_1′ + R, (13.23)
where
X = YΓ_1 and R = YΓ_2Γ_2′. (13.24)
Because Γ_1′Γ_2 = 0, X and R are again independent. We also have (Exercise 13.4.4)
Σ_X = Λ_1 and Σ_R = Γ_2 Λ_2 Γ_2′. (13.25)
Comparing these covariances to the factor analytic ones in (13.20), we see the follow-
ing:
                         Σ_X        Σ_R
Factor analysis          I_p        Ψ
Principal components     Λ_1        Γ_2 Λ_2 Γ_2′        (13.26)
The key difference is in the residuals. Factor analysis chooses the p-dimensional X so
that the residuals are uncorrelated, though not necessarily small. Thus the correlations
among the Y’s are explained by the factors X. Principal components chooses the p-
dimensional X so that the residuals are small (the variances sum to the sum of the
(q − p) smallest eigenvalues), but not necessarily uncorrelated. Much of the variance
of the Y is explained by the components X.
A popular model that fits into both frameworks is the factor analytic model (13.20)
with the restriction that
Ψ = σ2 Iq , σ2 “small.” (13.27)
The interpretation in principal components is that the X contains the important infor-
mation in Y, while the residuals R contain just random measurement error. For factor
analysis, we have that the X explains the correlations among the Y, and the residuals
happen to have the same variances. In this case, we have
Σ = β'Σ_XX β + σ²I_q. (13.28)
Apply the spectral decomposition (1.33) to the first term, which has rank at most p:
β'Σ_XX β = Γ \begin{pmatrix} Λ_1^* & 0 \\ 0 & 0 \end{pmatrix} Γ' (13.29)
for some orthogonal Γ, where Λ_1^* is the p × p diagonal matrix of its nonzero eigenvalues. But any orthogonal matrix contains eigenvectors for I_q = ΓΓ',
hence Γ is also an eigenvector matrix for Σ:
Σ = ΓΛΓ' = Γ \begin{pmatrix} Λ_1^* + σ²I_p & 0 \\ 0 & σ²I_{q−p} \end{pmatrix} Γ', (13.30)
and the eigenvectors for the first p eigenvalues are the columns of Γ1 . In this case the
factor space and the principal component space are the same. In fact, if the λ∗j are
distinct and positive, the eigenvalues (13.30) satisfy the structural model (13.8) with
pattern (1, 1, . . . , 1, q − p). A common approach to choosing p is to use hypothesis
testing on such models to find the smallest p for which the model fits. See Anderson
[1963] or Mardia, Kent, and Bibby [1979]. Of course, AIC or BIC could be used as
well.
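As a rough illustration (ours), the pcbic function of Section A.5.1 can be used to compare the patterns (1, . . . , 1, q − p) over p; the objects l (the sample eigenvalues) and nu (the degrees of freedom) are placeholders, and the components of pcbic's output are whatever Listing A.11 returns.
q <- length(l)
fits <- lapply(0:(q - 1), function(p) pcbic(l, nu, c(rep(1, p), q - p)))
# inspect the BIC reported for each pattern to choose p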
F = G'Γ, (13.34)
so that trace(Σ^{−1}S) = ∑_{i=1}^q ∑_{j=1}^q f_ij² l_i/λ_j. This sum can be written
∑_{i=1}^q ∑_{j=1}^q f_ij² l_i/λ_j = −2 ∑_{i=1}^q ∑_{j=1}^q f_ij² l_i β_j + (1/λ_q) ∑_{i=1}^q l_i, (13.35)
where
β_j = − 1/(2λ_j) + 1/(2λ_q). (13.36)
(The 1/(2λq ) is added because we need the β j ’s to be nonnegative in what follows.)
Note that the last term in (13.35) is independent of F. We do summation by parts by
letting
d_k = l_k − l_{k+1} and δ_m = β_m − β_{m+1} (with l_{q+1} = β_{q+1} = 0), (13.37)
so that
l_i = ∑_{k=i}^q d_k and β_j = ∑_{m=j}^q δ_m. (13.38)
Because the l_i's and λ_i's are nonincreasing in i, the l_i's are positive, and by (13.36)
the β_j's are also nonnegative, we have that the δ_i's and d_i's are all nonnegative. Using
(13.38) and interchanging the orders of summation, we have
∑_{i=1}^q ∑_{j=1}^q f_ij² l_i β_j = ∑_{i=1}^q ∑_{j=1}^q ∑_{k=i}^q ∑_{m=j}^q f_ij² d_k δ_m
                                  = ∑_{k=1}^q ∑_{m=1}^q d_k δ_m ∑_{i=1}^k ∑_{j=1}^m f_ij². (13.39)
∑_{i=1}^k ∑_{j=1}^m f_ij² ≤ min{k, m}. (13.40)
Γ = G. (13.41)
With Γ = G, the likelihood as a function of the distinct eigenvalues α_1, . . . , α_K is proportional to
∏_{k=1}^K α_k^{−q_k ν/2} e^{−(ν/2)(t_k/α_k)}, (13.43)
where
t_1 = ∑_{i=1}^{q_1} l_i and t_k = ∑_{i=q_1+···+q_{k−1}+1}^{q_1+···+q_k} l_i for 2 ≤ k ≤ K. (13.44)
It is easy to maximize over each αk in (13.43), which proves that (13.11) is indeed the
MLE of the eigenvalues. Thus with (13.41), we have the MLE of Σ as in Theorem 13.1.
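To make that maximization explicit (a worked step we add here, not in the original text), take logs of the kth factor in (13.43) and set the derivative with respect to α_k to zero:
\[
\frac{d}{d\alpha_k}\left(-\frac{q_k\nu}{2}\log\alpha_k - \frac{\nu t_k}{2\alpha_k}\right)
  = -\frac{q_k\nu}{2\alpha_k} + \frac{\nu t_k}{2\alpha_k^2} = 0
  \quad\Longrightarrow\quad \hat\alpha_k = \frac{t_k}{q_k},
\]
the average of the sample eigenvalues in the kth block.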
We give a heuristic explanation of the dimension (13.12) of the model M_(q_1,...,q_K) in
(13.8). To describe the model, we need the K distinct parameters among the λi ’s, as
well as the K orthogonal subspaces that correspond to the distinct values of λi . We
start by counting the number of free parameters needed to describe an s-dimensional
subspace of a t-dimensional space, s < t. Any such subspace can be described by a
t × s basis matrix B, that is, the columns of B comprise a basis for the subspace. (See
Section 5.2.) The basis is not unique, in that BA for any invertible s × s matrix A is
also a basis matrix, and in fact any basis matrix equals BA for some such A. Take A
to be the inverse of the top s × s submatrix of B, so that BA has Is as its top s × s part.
This matrix has (t − s) × s free parameters, represented in the bottom (t − s) × s part
of it, and is the only basis matrix with Is at the top. Thus the dimension is (t − s) × s.
(If the top part of B is not invertible, then we can find some other subset of s rows to
use.)
Now for model (13.8), we proceed stepwise. There are q1 (q2 + · · · + q K ) parame-
ters needed to specify the first q1 -dimensional subspace. Next, focus on the subspace
orthogonal to that first one. It is (q2 + · · · + q K )-dimensional, hence to describe the
second, q2 -dimensional, subspace within that, we need q2 × (q3 + · · · + q K ) parame-
ters. Continuing, the total number of parameters is
q_1(q_2 + · · · + q_K) + q_2(q_3 + · · · + q_K) + · · · + q_{K−1}q_K = (1/2)(q² − ∑_{k=1}^K q_k²). (13.45)
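A two-line R check of this count (ours; the name pattern.dim is hypothetical):
pattern.dim <- function(pattern) {
  q <- sum(pattern)                  # total number of eigenvalues
  (q^2 - sum(pattern^2)) / 2         # subspace parameters, as in (13.45)
}
# e.g., pattern.dim(c(1, 1, 3, 3, 2, 1)); adding K = length(pattern) for the
# distinct eigenvalues gives the model dimension discussed above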
13.2 Multidimensional scaling
Multidimensional scaling starts with dissimilarities d(o_i, o_j) between n objects o_1, . . . , o_n, and seeks points x̂_1, . . . , x̂_n in some R^p whose squared interpoint distances approximately match the squared dissimilarities:
Δ_ij = d²(o_i, o_j) ≈ ‖x̂_i − x̂_j‖². (13.46)
Then the x̂_i's are plotted in R^p, giving an approximate visual representation of the
original dissimilarities.
There are a number of approaches. Our presentation here follows that of Mar-
dia, Kent, and Bibby [1979], which provides more in-depth coverage. We will start
with the case that the original dissimilarities are themselves Euclidean distances, and
present the so-called classical solution. Next, we exhibit the classical solution when
the distances may not be Euclidean. Finally, we briefly mention the nonmetric ap-
proach.
d²(o_i, o_j) = ‖y_i − y_j‖². (13.47)
For any n × p matrix X with rows x_i, define Δ(X) to be the n × n matrix of the ‖x_i − x_j‖²'s,
so that (13.47) can be written Δ = Δ(Y).
The classical solution looks for x̂_i's in (13.46) that are based on rotations of the
y_i's, much like principal components. That is, suppose B is a q × p matrix with
Proposition 13.1. If Δ = Δ(Y), then the classical solution of the multidimensional scaling
problem for given p is X̂ = YG_1, where the columns of G_1 consist of the first p eigenvectors
of Y'H_nY.
Proof. Write
If one is interested in the distances between variables, so that the distances of in-
terest are in Δ(Y') (note the transpose), then the classical solution uses the first p
eigenvectors of YH_qY'.
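Here is an R sketch (ours) of Proposition 13.1 for the case that the data matrix y is available; the function name classical.mds is our own.
classical.mds <- function(y, p = 2) {
  yc <- scale(y, center = TRUE, scale = FALSE)   # H_n y
  g1 <- eigen(t(yc) %*% yc)$vectors[, 1:p]       # first p eigenvectors of y' H_n y
  y %*% g1                                       # the classical solution X-hat = Y G_1
}
When only the dissimilarities are on hand, base R's cmdscale(d, k = p) computes the classical solution directly (up to centering and the signs of the columns).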
‖x̂_{i_1} − x̂_{j_1}‖² ≤ ‖x̂_{i_2} − x̂_{j_2}‖² ≤ · · · ≤ ‖x̂_{i_t} − x̂_{j_t}‖². (13.64)
That might not be (actually probably is not) possible for given p, so instead one
finds the X̂ that comes as close as possible, where close is measured by some “stress”
function. A popular stress function is given by
where the d∗ij ’s are constants that have the same ordering as the original dissimilarities
d(oi , o j )’s in (13.63), and among such orderings minimize the stress. See Johnson and
Wichern [2007] for more details and some examples. The approach is “nonmetric”
because it does not depend on the actual d’s, but just their order.
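One readily available implementation of the nonmetric approach (not used in the text's examples) is isoMDS in the MASS package [Venables and Ripley, 2002]; a minimal sketch, assuming d is a dist object or matrix of dissimilarities:
library(MASS)
fit <- isoMDS(d, k = 2)   # Kruskal's nonmetric scaling
fit$points                # the n x 2 matrix of fitted x-hat's
fit$stress                # the achieved stress (reported as a percentage)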
[Figure 13.5 appears here: two multidimensional scaling plots, “Grades” (points labeled HW, Labs, InClass, Midterms, Final) and “Sports” (points labeled Jog, Cyc, Swim, Ten, BsktB, BaseB, FootB), each with Var 2 on the vertical axis.]
Figure 13.5: Multidimensional scaling plot of the grades’ variables (left) and the sports’ variables (right).
The first variable for the sports data appears to order the sports by how many people typically participate, i.e., jogging, swimming and cycling can
be done solo, tennis needs two to four people, basketball has five per team, baseball
nine, and football eleven. The second variable serves mainly to separate jogging from
the others.
13.3 Canonical correlations
Cov[Y_1α, Y_2β] = α'Σ_12β,
Var[Y_1α] = α'Σ_11α, and
Var[Y_2β] = β'Σ_22β, (13.67)
hence
Corr[Y_1α, Y_2β] = α'Σ_12β / √((α'Σ_11α)(β'Σ_22β)). (13.68)
Then δ_i ≡ α_i'Σ_12β_i is the ith canonical correlation, and α_i and β_i are the associated
canonical correlation loading vectors.
Recall that principal component analysis (Definition 1.2) led naturally to the spec-
tral decomposition theorem (Theorem 1.1). Similarly, canonical correlation analysis
will lead to the singular value decomposition (Theorem 13.2 below). We begin the
canonical correlation analysis with some simplifications. Let
γ_i = Σ_11^{1/2}α_i and ψ_i = Σ_22^{1/2}β_i (13.70)
for each i, so that the γi ’s and ψi ’s are sets of orthonormal vectors, and
δ_i = Corr[Y_1α_i, Y_2β_i] = γ_i'Ξψ_i, (13.71)
where
Ξ = Σ_11^{−1/2} Σ_12 Σ_22^{−1/2}. (13.72)
This matrix Ξ is a multivariate generalization of the correlation coefficient which is
useful here, but I don’t know exactly how it should be interpreted.
In what follows, we assume that q_1 ≥ q_2 = m. The q_1 < q_2 case can be handled
similarly. The matrix Ξ'Ξ is a q_2 × q_2 symmetric matrix, hence by the spectral decom-
position in (1.33), there is a q_2 × q_2 orthogonal matrix Γ and a q_2 × q_2 diagonal matrix
Λ with diagonal elements λ_1 ≥ λ_2 ≥ · · · ≥ λ_{q_2} such that
Γ'Ξ'ΞΓ = Λ. (13.73)
Let γ_1, . . . , γ_{q_2} denote the columns of Γ, so that the ith column of ΞΓ is Ξγ_i. Then
(13.73) shows these columns are orthogonal and have squared lengths equal to the
λ_i's, i.e.,
‖Ξγ_i‖² = λ_i and (Ξγ_i)'(Ξγ_j) = 0 if i ≠ j. (13.74)
Furthermore, because the γ_i's satisfy the equations for the principal components'
loading vectors in (1.28) with S = Ξ'Ξ,
‖Ξγ_i‖² = λ_i maximizes ‖Ξγ‖² over ‖γ‖ = 1, γ'γ_j = 0 for j < i. (13.75)
Now for the first canonical correlation, we wish to find unit vectors ψ and γ to
maximize ψ'Ξγ. By Corollary 8.2, for γ fixed, the maximum over ψ is when ψ is
proportional to Ξγ, hence
ψ'Ξγ ≤ ‖Ξγ‖, with equality achieved at ψ = Ξγ/‖Ξγ‖. (13.76)
Collecting these results gives ΞΓ = ΨΔ, where Ψ = (ψ_1, . . . , ψ_{q_2}) has orthonormal columns, and Δ is diagonal with δ_1, . . . , δ_{q_2}
on the diagonal. Shifting the Γ to the other side of the equation, we obtain the
following.
Theorem 13.2. Singular value decomposition. The q_1 × q_2 matrix Ξ can be written
Ξ = ΨΔΓ', (13.81)
where Ψ (q_1 × q_2) has orthonormal columns, Γ (q_2 × q_2) is orthogonal, and Δ is diagonal with nonnegative diagonal elements.
To summarize:
Corollary 13.1. Let (13.81) be the singular value decomposition of Ξ = Σ_11^{−1/2} Σ_12 Σ_22^{−1/2} for
model (13.66). Then for 1 ≤ i ≤ min{q_1, q_2}, the ith canonical correlation is δ_i, with loading
vectors α_i = Σ_11^{−1/2}ψ_i and β_i = Σ_22^{−1/2}γ_i.
For the example, the sample analog Ξ̂ = S_11^{−1/2} S_12 S_22^{−1/2} can be found in R using
symsqrtinv1 <- symsqrtinv(s[1:3,1:3])
symsqrtinv2 <- symsqrtinv(s[4:5,4:5])
xi <- symsqrtinv1%*%s[1:3,4:5]%*%symsqrtinv2
where
symsqrtinv <- function(x) {
  ee <- eigen(x)
  ee$vectors%*%diag(sqrt(1/ee$values))%*%t(ee$vectors)
}
calculates the inverse symmetric square root of an invertible symmetric matrix x. The
singular value decomposition function in R is called svd:
sv <- svd(xi)
a <- symsqrtinv1%*%sv$u
b <- symsqrtinv2%*%sv$v
The component sv$u is the estimate of Ψ and the component sv$v is the estimate of Γ
in (13.81). The matrices of loading vectors are obtained as in (13.79):
A = S_11^{−1/2} Ψ = \begin{pmatrix} −0.065 & 0.059 \\ −0.007 & −0.088 \\ −0.014 & 0.039 \end{pmatrix}, and B = S_22^{−1/2} Γ = \begin{pmatrix} −0.062 & −0.12 \\ −0.053 & 0.108 \end{pmatrix}. (13.84)
The estimated canonical correlations (singular values) are in the vector sv$d, which
are
d1 = 0.482 and d2 = 0.064. (13.85)
The d1 is fairly high, and d2 is practically negligible. (See the next section.) Thus it
is enough to look at the first columns of A and B. We can change signs, and take the
first loadings for the first set of variables to be (0.065, 0.007, 0.014), which is primarily
the homework score. For the second set of variables, the loadings are (0.062, 0.053),
essentially a straight sum of midterms and final. Thus the correlations among the two
sets of variables can be almost totally explained by the correlation between homework
and the sum of midterms and final, which correlation is 0.45, almost the optimum of
0.48.
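A quick check of that last claim (our own sketch, reusing the covariance matrix s from the code above): the correlation between a'Y_1 and b'Y_2 follows from (13.68) with S in place of Σ.
a <- c(1, 0, 0)   # homework only
b <- c(1, 1)      # midterms plus final
(t(a) %*% s[1:3, 4:5] %*% b) /
    sqrt((t(a) %*% s[1:3, 1:3] %*% a) * (t(b) %*% s[4:5, 4:5] %*% b))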
Let Ξ̂ = S_11^{−1/2} S_12 S_22^{−1/2} = PDG' be the sample analog of Ξ (on the left), and its singular value decomposition (on the
right).
We first obtain the MLE of Σ under model K. Note that Σ and (Σ11 , Σ22 , Ξ) are
in one-to-one correspondence. Thus it is enough to find the MLE of the latter set of
parameters. The next theorem is from Fujikoshi [1974].
Theorem 13.3. For the above setup, the MLE of (Σ11 , Σ22 , Ξ) under model MK in (13.87)
is given by
(S_11, S_22, PD^{(K)}G'), (13.89)
where D^{(K)} is D with d_{K+1}, . . . , d_m set to zero.
That is, the MLE is obtained by setting to zero the sample canonical correlations
that are set to zero in the model. One consequence of the theorem is that the natu-
ral sample canonical correlations and accompanying loading vectors are indeed the
MLE’s. The deviance, for comparing the models MK , can be expressed as
deviance(M_K) = ν ∑_{i=1}^K log(1 − d_i²). (13.90)
The dimension for the δi ’s is K. Only the first K of the ψi ’s enter into the equation.
Thus the dimension is the same as for principal components with K distinct eigen-
values, and the rest equal at 0, yielding pattern (1, 1, . . . , 1, q1 − K ), where there are
K ones. Similarly, the γi ’s dimension is as for pattern (1, 1, . . . , 1, q2 − K ). Then by
(13.45),
dim(Γ) + dim(Ψ) + dim(Δ^{(K)}) = (1/2)(q_1² − K − (q_1 − K)²)
                                 + (1/2)(q_2² − K − (q_2 − K)²) + K
                               = K(q − K). (13.92)
We can then take
BIC(M_K) = ν ∑_{k=1}^K log(1 − d_k²) + log(ν) K(q − K) (13.93)
because the q i (q i + 1)/2 parts are the same for each model.
In the example, we have three models: K = 0, 1, 2. K = 0 means the two sets
of variables are independent, which we already know is not true, and K = 2 is the
unrestricted model. The calculations, with ν = 105, d1 = 0.48226 and d2 = 0.064296:
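These follow directly from (13.93); a short R sketch (ours), with q = q_1 + q_2 = 5:
nu <- 105
d <- c(0.48226, 0.064296)
q <- 5
sapply(0:2, function(K) nu * sum(log(1 - d[seq_len(K)]^2)) + log(nu) * K * (q - K))
# BIC's for K = 0, 1, 2; the K = 0 value is zero since both pieces vanish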
The process is the same as for canonical correlations, but we use the singular value
decomposition of Σ12 instead of Ξ. The procedure is called partial least squares,
but it could have been called canonical covariances. It is an attractive alternative to
canonical correlations when there are many variables and not many observations, in
which case the estimates of Σ_11 and Σ_22 are not invertible.
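A sketch (ours) of these “canonical covariances” for the same example, reusing the covariance matrix s partitioned as before:
pls <- svd(s[1:3, 4:5])   # SVD of S_12 rather than of the sample Xi
pls$d                     # the canonical covariances (singular values)
pls$u                     # loading vectors for the first set of variables
pls$v                     # loading vectors for the second set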
13.4 Exercises
Exercise 13.4.1. In the model (13.4), find the approximate test for testing the null
hypothesis that the average of the last k (k < q) eigenvalues is less than the constant
c2 .
Exercise 13.4.3. Show that the deviance for the model in Theorem 13.1 is given by
(13.13). [Hint: Start with the likelihood as in (13.32). Show that
trace(Σ̂^{−1}S) = ∑_{i=1}^q l_i/λ̂_i = q. (13.96)
Argue you can then ignore the part of the deviance that comes from the exponent.]
Exercise 13.4.4. Verify (13.25). [Hint: First, show that Γ_1'Γ = (I_p 0) and Γ_2'Γ = (0 I_{q−p}).]
Exercise 13.4.5. Show that (13.30) follows from (13.28) and (13.29).
Exercise 13.4.6. Prove (13.40). [Hint: First, explain why ∑_{i=1}^k f_ij² ≤ 1 and ∑_{j=1}^m f_ij² ≤ 1.]
Exercise 13.4.7. Verify the equality in (13.42), and show that (13.11) does give the
maximizers of (13.43).
Exercise 13.4.12. For the canonical correlations situation in Corollary 13.1, let α =
(α1 , . . . , αm ) and β = ( β 1 , . . . , β m ) be matrices with columns being the loading vec-
tors. Find the covariance matrix of the transformation
(Y_1α  Y_2β) = (Y_1  Y_2) \begin{pmatrix} α & 0 \\ 0 & β \end{pmatrix}. (13.97)
−2 log(L(Σ; S)) = ν log(|Σ|) + ν trace(Σ^{−1}S) (13.98)
[Hint: The first equality uses part (a). The second equality might be easiest to show
by letting
H = \begin{pmatrix} I_{q_1} & −C_K \\ 0 & I_{q_2} \end{pmatrix}, (13.101)
and multiplying the two large matrices by H on the left and H' on the right. For the
third equality, using orthogonality of the columns of G, show that C_KC' = CC_K' =
C_KC_K'.] (c) Show that |Σ̂_K| = |S_11||S_22||I_{q_1} − C_KC_K'|, where C_K is given in part (b).
[Hint: Recall (5.83).] (d) Show that |I_{q_1} − C_KC_K'| = ∏_{i=1}^K (1 − d_i²). (e) Use parts (b)
through (d) to find an expression for (13.98), then argue that for comparing M_K's, we
can take the deviance as in (13.90).
Exercise 13.4.15. Verify the calculation in (13.92).
Exercise 13.4.16 (Painters). The biplot for the painters data set (in the MASS package)
was analyzed in Exercise 1.9.18. (a) Using the first four variables, without any scaling,
find the sample eigenvalues li . Which seem to be large, and which small? (b) Find
the pattern of the li ’s that has best BIC. What are the MLE’s of the λi ’s for the best
pattern? Does the result conflict with your answer to part (a)?
Exercise 13.4.17 (Spam). In Exercises 1.9.15 and 11.9.9, we found principal compo-
nents for the spam data. Here we look for the best pattern of eigenvalues. Note that
the data is far from multivariate normal, so the distributional aspects should not be
taken too seriously. (a) Using the unscaled spam explanatory variables (1 through
57), find the best pattern of eigenvalues based on the BIC criterion. Plot the sample
eigenvalues and their MLE's. Do the same, but for the logs. How many principal
components is it reasonable to take? (b) Repeat part (a), but using the scaled data,
scale(Spam[,1:57]). (c) Which approach yielded the more satisfactory answer? Was
the decision to use ten components in Exercise 11.9.9 reasonable, at least for the scaled
data?
Exercise 13.4.18 (Iris). This question concerns the relationships between the sepal
measurements and petal measurements in the iris data. Let S be the pooled covariance
matrix, so that the denominator is ν = 147. (a) Find the correlation between the
sepal length and petal length, and the correlation between the sepal width and petal
width. (b) Find the canonical correlation quantities for the two groups of variables
{Sepal Length, Sepal Width} and {Petal Length, Petal Width}. What do the loadings
show? Compare the di ’s to the correlations in part (a). (c) Find the BIC’s for the three
models K = 0, 1, 2, where K is the number of nonzero δi ’s. What do you conclude?
Exercise 13.4.19 (Exams). Recall the exams data set (Exercise 10.5.18) has the scores
of 191 students on four exams, the three midterms (variables 1, 2, and 3) and the final
exam. (a) Find the canonical correlations quantities, with the three midterms in one
group, and the final in its own group. Describe the relative weightings (loadings) of
the midterms. (b) Apply the regular multiple regression model with the final as the
Y and the three midterms as the X's. What is the correlation between the Y and the
fit, Ŷ? How does this correlation compare to d_1 in part (a)? What do you get if you
square this correlation? (c) Look at the ratios β_i/a_{i1} for i = 1, 2, 3, where β_i is the
regression coefficient for midterm i in part (b), and ai1 is the first canonical correlation
loading. What do you conclude? (d) Run the regression again, with the final still Y,
but use just the one explanatory variable Xa_1. Find the correlation of Y and the Ŷ for
this regression. How does it compare to that in part (b)? (e) Which (if either) yields
a linear combination of the midterms that best correlates with the final, canonical
correlation analysis or multiple regression? (f) Look at the three midterms' variances.
What do you see? Find the regular principal components (without scaling) for the
midterms. What are the loadings for the first principal component? Compare them
to the canonical correlations’ loadings in part (a). (g) Run the regression again, with
the final as the Y again, but with just the first principal component of the midterms as
the sole explanatory variable. Find the correlation between Y and Ŷ here. Compare
to the correlations in parts (b) and (d). What do you conclude?
Exercise 13.4.20 (States). This problems uses the matrix states, which contains several
demographic variables on the 50 United States, plus D.C. We are interested in the
relationship between crime variables and money variables:
Crime: Violent crimes per 100,000 people
Prisoners: Number of people in prison per 10,000 people.
Poverty: Percentage of people below the poverty line.
Employment: Percentage of people employed
Income: Median household income
Let the first two variables be Y1 , and the other three be Y2 . Scale them to have mean
zero and variance one:
y1 <- scale(states[,7:8])
y2 <- scale(states[,9:11])
Find the canonical correlations between the Y1 and Y2 . (a) What are the two canonical
correlations? How many of these would you keep? (b) Find the BIC’s for the K = 0, 1
and 2 canonical correlation models. Which is best? (c) Look at the loadings for the
first canonical correlation, i.e., a1 and b1 . How would you interpret these? (d) Plot
the first canonical variables: Y1 a1 versus Y2 b1 . Do they look correlated? Which
observations, if any, are outliers? (e) Plot the second canonical variables: Y1 a2 versus
Y_2b_2. Do they look correlated? (f) Find the correlation matrix of the four canonical
variables: (Y1 a1 , Y1 a2 , Y2 b1 , Y2 b2 ). What does it look like? (Compare it to the result
in Exercise 13.4.9.)
Appendix A
Extra R routines
These functions are very barebones. They do not perform any checks on the inputs,
and are not necessarily efficient. You are encouraged to robustify and enhance any of
them to your heart’s content.
g(x) = p_i/d if b_{i−1} < x ≤ b_i. (A.1)
From (2.102) in Exercise 2.7.16, we have that the negative entropy (1.46) is
Negent(g) = (1/2)(1 + log(2π(Var[I] + 1/12))) + ∑_{i=1}^K p_i log(p_i), (A.2)
See Section A.1.1 for the R function we use to calculate this estimate.
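A bare-bones sketch (ours, not the book's own function) of estimating (A.2) from a data vector x with K equal-width bins:
negent.hist <- function(x, K = 20) {
  p <- table(cut(x, K)) / length(x)          # the p_i's
  i <- seq_len(K)
  vI <- sum(p * i^2) - sum(p * i)^2          # Var[I]
  (1 + log(2 * pi * (vI + 1/12))) / 2 + sum(p[p > 0] * log(p[p > 0]))
}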
For projection pursuit, we have our n × q data matrix Y, and wish to find first the
q × 1 vector g1 with norm 1 that maximizes the estimated negentropy of Yg1 . Next
we look for the g2 with norm 1 orthogonal to g1 that maximizes the negentropy of
Yg2 , etc. Then our rotation is given by the orthogonal matrix G = (g1 , g2 , . . . , gq ).
G(θ_1, θ_2, θ_3) = E_3(θ_1, θ_2, θ_3) ≡ \begin{pmatrix} E_2(θ_3) & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & E_2(θ_2) \end{pmatrix} \begin{pmatrix} E_2(θ_1) & 0 \\ 0 & 1 \end{pmatrix}. (A.5)
See Anderson et al. [1987] for similar parametrizations when q > 3. The first step is to
find the G = (g1 , g2 , g3 ) whose first column, g1 , achieves the maximum negentropy
of Yg1 . Here it is enough to take θ3 = 0, so that the left-hand matrix is the identity.
Because our estimate of negentropy for Yg is not continuous in g, we use the simulated
annealing option in the R function optim to find the optimal g1 . The second step is to
find the best further rotation of the remaining variables, Y(g2 , g3 ), for which we can
use the two-dimensional procedure above. See Section A.1.3.
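A rough sketch (ours, not the book's negent3D) of that kind of optim call: parametrize a unit vector in R^3 by two angles (a parametrization of our own choosing, not necessarily the E_2 convention in (A.5)), and maximize the estimated negentropy of Yg over the angles using simulated annealing. Here y is an n × 3 data matrix and negent.hist is the sketch given after (A.2).
unit3 <- function(th)                         # a unit vector in R^3
  c(cos(th[1]) * cos(th[2]), sin(th[1]) * cos(th[2]), sin(th[2]))
neg.obj <- function(th, y) -negent.hist(y %*% unit3(th))
opt <- optim(c(0, 0), neg.obj, y = y, method = "SANN")
g1 <- unit3(opt$par)                          # estimated first projection direction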
Description: Searches for the rotation that maximizes the estimated negentropy of
the first column of the rotated data, for q = 2 dimensional data. See Listing A.2 for
the code.
Usage: negent2D(y,m=100)
Arguments:
y: The n × 2 data matrix.
m: The number of angles (between 0 and π) over which to search.
Description: Searches for the rotation that maximizes the estimated negentropy of
the first column of the rotated data, and of the second variable fixing the first, for
q = 3 dimensional data. The routine uses a random start for the function optim using
the simulated annealing option SANN, hence one may wish to increase the number
of attempts by setting nstart to an integer larger than 1. See Listing A.3 for the code.
Usage: negent3D(y,nstart=1,m=100,...)
Arguments:
y: The n × 3 data matrix.
nstart: The number of times to randomly start the search routine.
m: The number of angles (between 0 and π) over which to search to find the second
variable.
. . .: Further optional arguments to pass to the optim function to control the simulated
annealing algorithm.
x: An n × p design matrix.
y: The n × q matrix of observations.
z: A q × l design matrix.
A.3 Classification
Description: Finds the coefficients a_k and constants c_k for Fisher's linear discrimina-
tion function d_k in (11.31) and (11.32). See Listing A.6 for the code.
Usage: lda(x,y)
Arguments:
a: A p × K matrix, where column k contains the coefficients a_k for (11.31). The final
column is all zero.
Description: The function returns the elements needed to calculate the quadratic
discrimination in (11.48). Use the output from this function in predict.qda (Section
A.3.3) to find the predicted groups. See Listing A.7 for the code.
Usage: qda(x,y)
Arguments:
Mean: A K × p matrix, where row k contains the sample mean vector for group k.
Sigma: A K × p × p array, where Sigma[k,,] contains the sample covariance matrix
for group k, Σ̂_k.
Description: The function uses the output from the function qda (Section A.3.2) and
a p-vector x, and calculates the predicted group for this x. See Listing A.8 for the
code.
Usage: predict.qda(qd,newx)
Arguments:
newx: A p-vector x whose components match the variables used in the qda function.
Value: A K-vector of the discriminant values d_k^Q(x) in (11.48) for the given x.
centers: The K × p matrix of centers (means) for the K clusters, row k being the
center for cluster k.
Description: Sorts the silhouettes, first by group, then by value, preparatory to plot-
ting. See Listing A.10 for the code.
Usage: sort.silhouette(sil,clusters)
Arguments:
Description: Find the BIC and MLE from a set of observed eigenvalues for a specific
pattern. See Listing A.11 for the code.
Usage: pcbic(eigenvals,n,pattern)
Arguments:
eigenvals: The q-vector of eigenvalues of the covariance matrix, in order from largest
to smallest.
n: The degrees of freedom in the covariance matrix.
pattern: The pattern of equalities of the eigenvalues, given by the K-vector
(q1 , . . . , q K ) as in (13.8).
Description: Uses the stepwise procedure described in Section 13.1.4 to find a pattern
for a set of observed eigenvalues with good BIC value. See Listing A.11 for code.
Usage: pcbic.stepwise(eigenvals,n)
Arguments:
eigenvals: The q-vector of eigenvalues of the covariance matrix, in order from largest
to smallest.
n: The degrees of freedom in the covariance matrix.
ck <- NULL
ak <- NULL
vi <- solve(v)
for(k in 1:K) {
  c0 <- -(1/2)*(m[k,]%*%vi%*%m[k,]-m[K,]%*%vi%*%m[K,]) +
        log(phat[k]/phat[K])
  ck <- c(ck,c0)
  a0 <- vi%*%(m[k,]-m[K,])
  ak <- cbind(ak,a0)
}
list(a = ak, c = ck)
}
Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions
on Automatic Control, 19:716 – 723, 1974.
E. Anderson. The irises of the Gaspe Peninsula. Bulletin of the American Iris Society,
59:2–5, 1935.
E. Anderson. The species problem in iris. Annals of the Missouri Botanical Garden, 23:
457 – 509, 1936.
Steen Andersson. Invariant normal models. Annals of Statistics, 3:132 – 154, 1975.
Robert B. Ash. Basic Probability Theory. John Wiley and Sons Inc., http://www.math.
uiuc.edu/~r-ash/BPT.html, 1970.
Daniel Asimov. The grand tour: A tool for viewing multidimensional data. SIAM
Journal on Scientific and Statistical Computing, 6:128–143, 1985.
Alexander Basilevsky. Statistical Factor Analysis and Related Methods: Theory and Appli-
cations. John Wiley & Sons, 1994.
Peter J. Huber. Projection pursuit (C/R: P475-525). The Annals of Statistics, 13:435–475,
1985.
Clifford M. Hurvich and Chih-Ling Tsai. Regression and time series model selection
in small samples. Biometrika, 76:297–307, 1989.
Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis.
Wiley-Interscience, May 2001.
Richard Arnold Johnson and Dean W. Wichern. Applied Multivariate Statistical Analy-
sis. Pearson Prentice-Hall Inc, sixth edition, 2007.
Takeaki Kariya. Testing in the Multivariate General Linear Model. Kinokuniya, 1985.
Yann LeCun. Generalization and network design strategies. Technical Report CRG-
TR-89-4, Department of Computer Science, University of Toronto, 1989. URL http:
//yann.lecun.com/exdb/publis/pdf/lecun-89t.pdf.
Martin Maechler, Peter Rousseeuw, Anja Struyf, and Mia Hubert. Cluster analysis
basics and extensions, 2005. URL http://cran.r-project.org/web/packages/
cluster/index.html. Rousseeuw et al. provided the S original which has been
ported to R by Kurt Hornik and has since been enhanced by Martin Maechler.
TIBCO Software Inc. S-Plus. Palo Alto, CA, 2009. URL http://www.tibco.com.
R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data
set via the gap statistic. Journal of the Royal Statistical Society B, 63:411 – 423, 2001.
W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, New York,
fourth edition, 2002.
J H Ware and R E Bowden. Circadian rhythm analysis when output is collected at
intervals. Biometrics, 33(3):566–571, 1977.
Sanford Weisberg. Applied Linear Regression. John Wiley & Sons, third edition, 2005.
Wikipedia. List of breakfast cereals — Wikipedia, The Free Encyclopedia,
2011. URL http://en.wikipedia.org/w/index.php?title=List_of_breakfast_
cereals&oldid=405905817. [Online; accessed 7-January-2011].
Wikipedia. Pterygomaxillary fissure — Wikipedia, The Free Encyclopedia,
2010. URL http://en.wikipedia.org/w/index.php?title=Pterygomaxillary_
fissure&oldid=360046832. [Online; accessed 7-January-2011].
John Wishart. The generalised product moment distribution in samples from a nor-
mal multivariate population. Biometrika, 20A:32 – 52, 1928.
Peter Wolf and Uni Bielefeld. aplpack: Another Plot PACKage, 2010. URL http:
//cran.r-project.org/web/packages/aplpack/index.html. R package version
1.2.3.
John W. Wright, editor. The Universal Almanac. Andrews McMeel Publishing, Kansas
City, MO, 1997.
Gary O. Zerbe and Richard H. Jones. On application of growth curve techniques to
time series data. Journal of the American Statistical Association, 75:507–509, 1980.
Index
one, 7
Wilks’ Λ, 117
mouth size data, 119
Wishart distribution, 57–60, 171
and chi-squares, 59
Bartlett’s decomposition, 142
conditional property, 136–137
definition, 57
density, 143–144
expectation of inverse, 137–138
for sample covariance matrix, 58
Half-Wishart, 142, 146
likelihood, 171
linear transformations, 59
marginals, 60
mean, 59
sum of independent, 59