
Multivariate Statistical Analysis

Old School
John I. Marden
Department of Statistics
University of Illinois at Urbana-Champaign
© 2011 by John I. Marden
Preface
The goal of this text is to give the reader a thorough grounding in old-school mul-
tivariate statistical analysis. The emphasis is on multivariate normal modeling and
inference, both theory and implementation. Linear models form a central theme of
the book. Several chapters are devoted to developing the basic models, including
multivariate regression and analysis of variance, and especially the both-sides mod-
els (i.e., generalized multivariate analysis of variance models), which allow model-
ing relationships among individuals as well as variables. Growth curve and repeated
measure models are special cases.
The linear models are concerned with means. Inference on covariance matrices
covers testing equality of several covariance matrices, testing independence and con-
ditional independence of (blocks of) variables, factor analysis, and some symmetry
models. Principal components, though mainly a graphical/exploratory technique,
also lends itself to some modeling.
Classification and clustering are related areas. Both attempt to categorize indi-
viduals. Classification tries to classify individuals based upon a previous sample of
observed individuals and their categories. In clustering, there is no observed catego-
rization, nor often even knowledge of how many categories there are. These must be
estimated from the data.
Other useful multivariate techniques include biplots, multidimensional scaling,
and canonical correlations.
The bulk of the results here are mathematically justified, but I have tried to arrange
the material so that the reader can learn the basic concepts and techniques while
plunging as much or as little as desired into the details of the proofs.
Practically all the calculations and graphics in the examples are implemented
using the statistical computing environment R [R Development Core Team, 2010].
Throughout the notes we have scattered some of the actual R code we used. Many of
the data sets and original R functions can be found in the file
http://www.istics.net/r/multivariateOldSchool.r. For others we refer to available R packages.
Contents
Preface iii
Contents iv
1 A First Look at Multivariate Data 1
1.1 The data matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Example: Planets data . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Glyphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Scatter plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Example: Fisher-Anderson iris data . . . . . . . . . . . . . . . . . 5
1.4 Sample means, variances, and covariances . . . . . . . . . . . . . . . . . 6
1.5 Marginals and linear combinations . . . . . . . . . . . . . . . . . . . . . . 8
1.5.1 Rotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Principal components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6.1 Biplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6.2 Example: Sports data . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.7 Other projections to pursue . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.7.1 Example: Iris data . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.8 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 Multivariate Distributions 27
2.1 Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.1 Distribution functions . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.2 Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.1.3 Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.4 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . . 30
2.2 Expected values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3 Means, variances, and covariances . . . . . . . . . . . . . . . . . . . . . . 33
2.3.1 Vectors and matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.2 Moment generating functions . . . . . . . . . . . . . . . . . . . . 35
2.4 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5 Additional properties of conditional distributions . . . . . . . . . . . . . 37
2.6 Affine transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3 The Multivariate Normal Distribution 49
3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Some properties of the multivariate normal . . . . . . . . . . . . . . . . . 51
3.3 Multivariate normal data matrix . . . . . . . . . . . . . . . . . . . . . . . 52
3.4 Conditioning in the multivariate normal . . . . . . . . . . . . . . . . . . 55
3.5 The sample covariance matrix: Wishart distribution . . . . . . . . . . . . 57
3.6 Some properties of the Wishart . . . . . . . . . . . . . . . . . . . . . . . . 59
3.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4 Linear Models on Both Sides 67
4.1 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Multivariate regression and analysis of variance . . . . . . . . . . . . . . 69
4.2.1 Examples of multivariate regression . . . . . . . . . . . . . . . . . 70
4.3 Linear models on both sides . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.1 One individual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.2 IID observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.3 The both-sides model . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5 Linear Models: Least Squares and Projections 85
5.1 Linear subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3 Least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.1 Both-sides model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4 What is a linear model? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.5 Gram-Schmidt orthogonalization . . . . . . . . . . . . . . . . . . . . . . . 91
5.5.1 The QR and Cholesky decompositions . . . . . . . . . . . . . . . 93
5.5.2 Orthogonal polynomials . . . . . . . . . . . . . . . . . . . . . . . 95
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6 Both-Sides Models: Distribution of Estimator 103
6.1 Distribution of β̂ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2 Fits and residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.3 Standard errors and t-statistics . . . . . . . . . . . . . . . . . . . . . . . . 105
6.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.4.1 Mouth sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.4.2 Histamine in dogs . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7 Both-Sides Models: Hypothesis Tests on β 113
7.1 Approximate χ² test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.1.1 Example: Mouth sizes . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.2 Testing blocks of β are zero . . . . . . . . . . . . . . . . . . . . . . . 114
7.2.1 Just one column: F test . . . . . . . . . . . . . . . . . . . . . 116
7.2.2 Just one row: Hotelling's T² . . . . . . . . . . . . . . . . . . 116
7.2.3 General blocks . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.2.4 Additional test statistics . . . . . . . . . . . . . . . . . . . . . . . . 117
7.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.3.1 Mouth sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.3.2 Histamine in dogs . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.4 Testing linear restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.5 Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.5.1 Pseudo-covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.6 Model selection: Mallows' Cp . . . . . . . . . . . . . . . . . . . . . . 125
7.6.1 Example: Mouth sizes . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8 Some Technical Results 135
8.1 The Cauchy-Schwarz inequality . . . . . . . . . . . . . . . . . . . . . . . 135
8.2 Conditioning in a Wishart . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
8.3 Expectation of inverse Wishart . . . . . . . . . . . . . . . . . . . . . . . . 137
8.4 Distribution of Hotelling's T² . . . . . . . . . . . . . . . . . . . . . . 138
8.4.1 A motivation for Hotelling's T² . . . . . . . . . . . . . . . . 139
8.5 Density of the multivariate normal . . . . . . . . . . . . . . . . . . . . . . 140
8.6 The QR decomposition for the multivariate normal . . . . . . . . . . . . 141
8.7 Density of the Wishart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
9 Likelihood Methods 151
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
9.2 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . 151
9.2.1 The MLE in multivariate regression . . . . . . . . . . . . . . . . . 152
9.2.2 The MLE in the both-sides linear model . . . . . . . . . . . . . . 153
9.2.3 Proof of Lemma 9.1 . . . . . . . . . . . . . . . . . . . . . . . . . . 156
9.3 Likelihood ratio tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
9.3.1 The LRT in multivariate regression . . . . . . . . . . . . . . . . . 157
9.4 Model selection: AIC and BIC . . . . . . . . . . . . . . . . . . . . . . . . 157
9.4.1 BIC: Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
9.4.2 AIC: Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
9.4.3 AIC: Multivariate regression . . . . . . . . . . . . . . . . . . . . . 161
9.5 Example: Mouth sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
10 Models on Covariance Matrices 171
10.1 Testing equality of covariance matrices . . . . . . . . . . . . . . . . . . . 172
10.1.1 Example: Grades data . . . . . . . . . . . . . . . . . . . . . . . . . 173
10.1.2 Testing the equality of several covariance matrices . . . . . . . . 174
10.2 Testing independence of two blocks of variables . . . . . . . . . . . . . . 174
10.2.1 Example: Grades data . . . . . . . . . . . . . . . . . . . . . . . . . 175
10.2.2 Example: Testing conditional independence . . . . . . . . . . . . 176
10.3 Factor analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
10.3.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
10.3.2 Describing the factors . . . . . . . . . . . . . . . . . . . . . . . . . 181
10.3.3 Example: Grades data . . . . . . . . . . . . . . . . . . . . . . . . . 182
10.4 Some symmetry models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.4.1 Some types of symmetry . . . . . . . . . . . . . . . . . . . . . . . 187
10.4.2 Characterizing the structure . . . . . . . . . . . . . . . . . . . . . 189
10.4.3 Maximum likelihood estimates . . . . . . . . . . . . . . . . . . . 189
10.4.4 Hypothesis testing and model selection . . . . . . . . . . . . . . 191
10.4.5 Example: Mouth sizes . . . . . . . . . . . . . . . . . . . . . . . . . 191
10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
11 Classification 199
11.1 Mixture models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
11.2 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
11.3 Fisher's linear discrimination . . . . . . . . . . . . . . . . . . . . . . 203
11.4 Cross-validation estimate of error . . . . . . . . . . . . . . . . . . . . . . 205
11.4.1 Example: Iris data . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
11.5 Fisher's quadratic discrimination . . . . . . . . . . . . . . . . . . . . 210
11.5.1 Example: Iris data, continued . . . . . . . . . . . . . . . . . . . . 210
11.6 Modifications to Fisher's discrimination . . . . . . . . . . . . . . . . 211
11.7 Conditioning on X: Logistic regression . . . . . . . . . . . . . . . . . . . 212
11.7.1 Example: Iris data . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.7.2 Example: Spam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.8 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
11.8.1 CART . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
11.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
12 Clustering 229
12.1 K-Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
12.1.1 Example: Sports data . . . . . . . . . . . . . . . . . . . . . . . . . 230
12.1.2 Gap statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
12.1.3 Silhouettes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
12.1.4 Plotting clusters in one and two dimensions . . . . . . . . . . . . 233
12.1.5 Example: Sports data, using R . . . . . . . . . . . . . . . . . . . . 236
12.2 K-medoids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
12.3 Model-based clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
12.3.1 Example: Automobile data . . . . . . . . . . . . . . . . . . . . . . 240
12.3.2 Some of the models in mclust . . . . . . . . . . . . . . . . . . . . 243
12.4 An example of the EM algorithm . . . . . . . . . . . . . . . . . . . . . . . 245
12.5 Soft K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
12.5.1 Example: Sports data . . . . . . . . . . . . . . . . . . . . . . . . . 247
12.6 Hierarchical clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
12.6.1 Example: Grades data . . . . . . . . . . . . . . . . . . . . . . . . . 248
12.6.2 Example: Sports data . . . . . . . . . . . . . . . . . . . . . . . . . 249
12.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
13 Principal Components and Related Techniques 253
13.1 Principal components, redux . . . . . . . . . . . . . . . . . . . . . . . . . 253
13.1.1 Example: Iris data . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
13.1.2 Choosing the number of principal components . . . . . . . . . . 256
13.1.3 Estimating the structure of the component spaces . . . . . . . . . 257
13.1.4 Example: Automobile data . . . . . . . . . . . . . . . . . . . . . . 259
13.1.5 Principal components and factor analysis . . . . . . . . . . . . . 262
13.1.6 Justification of the principal component MLE, Theorem 13.1 . . 264
13.2 Multidimensional scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
13.2.1 Δ is Euclidean: The classical solution . . . . . . . . . . . . 266
13.2.2 Δ may not be Euclidean: The classical solution . . . . . . . 268
13.2.3 Nonmetric approach . . . . . . . . . . . . . . . . . . . . . . . . . . 269
13.2.4 Examples: Grades and sports . . . . . . . . . . . . . . . . . . . . 269
13.3 Canonical correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
13.3.1 Example: Grades . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
13.3.2 How many canonical correlations are positive? . . . . . . . . . . 274
13.3.3 Partial least squares . . . . . . . . . . . . . . . . . . . . . . . . . . 276
13.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
A Extra R routines 281
A.1 Estimating entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
A.1.1 negent: Estimating negative entropy . . . . . . . . . . . . . . . . . 282
A.1.2 negent2D: Maximizing negentropy for q = 2 dimensions . . . . . 282
A.1.3 negent3D: Maximizing negentropy for q = 3 dimensions . . . . . 283
A.2 Both-sides model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
A.2.1 bothsidesmodel: Calculate the estimates . . . . . . . . . . . . . . . 283
A.2.2 bothsidesmodel.test: Test blocks of β are zero . . . . . . . . 284
A.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
A.3.1 lda: Linear discrimination . . . . . . . . . . . . . . . . . . . . . . . 285
A.3.2 qda: Quadratic discrimination . . . . . . . . . . . . . . . . . . . . 285
A.3.3 predict.qda: Quadratic discrimination prediction . . . . . . . . . 286
A.4 Silhouettes for K-Means Clustering . . . . . . . . . . . . . . . . . . . . . 286
A.4.1 silhouette.km: Calculate the silhouettes . . . . . . . . . . . . . . . 286
A.4.2 sort.silhouette: Sort the silhouettes by group . . . . . . . . . . . . 286
A.5 Estimating the eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . 287
A.5.1 pcbic: BIC for a particular pattern . . . . . . . . . . . . . . . . . . 287
A.5.2 pcbic.stepwise: Choosing a good pattern . . . . . . . . . . . . . . 287
A.5.3 Helper functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
Bibliography 295
Index 301
Chapter 1
A First Look at Multivariate Data
In this chapter, we try to give a sense of what multivariate data sets look like, and
introduce some of the basic matrix manipulations needed throughout these notes.
Chapters 2 and 3 lay down the distributional theory. Linear models are probably the
most popular statistical models ever. With multivariate data, we can model relation-
ships between individuals or between variables, leading to what we call both-sides
models, which do both simultaneously. Chapters 4 through 8 present these models
in detail. The linear models are concerned with means. Before turning to models
on covariances, Chapter 9 briefly reviews likelihood methods, including maximum
likelihood estimation, likelihood ratio tests, and model selection criteria (Bayes and
Akaike). Chapter 10 looks at a number of models based on covariance matrices, in-
cluding equality of covariances, independence and conditional independence, factor
analysis, and other structural models. Chapter 11 deals with classification, in which
the goal is to find ways to classify individuals into categories, e.g., healthy or un-
healthy, based on a number of observed variables. Chapter 12 has a similar goal,
except that the categories are unknown and we seek groupings of individuals using
just the observed variables. Finally, Chapter 13 explores principal components, which
we first see in Section 1.6. It is an approach for reducing the number of variables,
or at least find a few interesting ones, by searching through linear combinations of
the observed variables. Multidimensional scaling has a similar objective, but tries to
exhibit the individual data points in a low-dimensional space while preserving the
original inter-point distances. Canonical correlations has two sets of variables, and
finds linear combinations of the two sets to explain the correlations between them.
On to the data.
1.1 The data matrix
Data generally will consist of a number of variables recorded on a number of indi-
viduals, e.g., heights, weights, ages, and sex of a sample of students. Also, generally,
there will be n individuals and q variables, and the data will be arranged in an n × q
data matrix, with rows denoting individuals and the columns denoting variables:

Y = \begin{array}{c|cccc}
      & \text{Var 1} & \text{Var 2} & \cdots & \text{Var } q \\ \hline
    \text{Individual 1} & y_{11} & y_{12} & \cdots & y_{1q} \\
    \text{Individual 2} & y_{21} & y_{22} & \cdots & y_{2q} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    \text{Individual } n & y_{n1} & y_{n2} & \cdots & y_{nq}
    \end{array}    (1.1)

Then y_{ij} is the value of the variable j for individual i. Much more complex data
structures exist, but this course concentrates on these straightforward data matrices.
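In R, such a data matrix is simply a numeric matrix (or data frame) with one row per individual and one column per variable. A small illustration (not from the text), using the built-in iris data that appears later in this chapter:

Y <- as.matrix(iris[, 1:4])   # n x q data matrix: 150 individuals, 4 variables
dim(Y)                        # n = 150, q = 4
Y[1:3, ]                      # y_ij = value of variable j for individual i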
1.1.1 Example: Planets data
Six astronomical variables are given on each of the historical nine planets (or eight
planets, plus Pluto). The variables are (average) distance in millions of miles from
the Sun, length of day in Earth days, length of year in Earth days, diameter in miles,
temperature in degrees Fahrenheit, and number of moons. The data matrix:
Dist Day Year Diam Temp Moons
Mercury 35.96 59.00 88.00 3030 332 0
Venus 67.20 243.00 224.70 7517 854 0
Earth 92.90 1.00 365.26 7921 59 1
Mars 141.50 1.00 687.00 4215 67 2
Jupiter 483.30 0.41 4332.60 88803 162 16
Saturn 886.70 0.44 10759.20 74520 208 18
Uranus 1782.00 0.70 30685.40 31600 344 15
Neptune 2793.00 0.67 60189.00 30200 261 8
Pluto 3664.00 6.39 90465.00 1423 355 1
(1.2)
The data can be found in Wright [1997], for example.
1.2 Glyphs
Graphical displays of univariate data, that is, data on one variable, are well-known:
histograms, stem-and-leaf plots, pie charts, box plots, etc. For two variables, scat-
ter plots are valuable. It is more of a challenge when dealing with three or more
variables.
Glyphs provide an option. A little picture is created for each individual, with char-
acteristics based on the values of the variables. Chernoff's faces [Chernoff, 1973] may
be the most famous glyphs. The idea is that people intuitively respond to character-
istics of faces, so that many variables can be summarized in a face.
Figure 1.1 exhibits faces for the nine planets. We use the faces routine by H. P. Wolf
in the R package aplpack, Wolf and Bielefeld [2010]. The distance the planet is from
the sun is represented by the height of the face (Pluto has a long face), the length of
the planet's day by the width of the face (Venus has a wide face), etc. One can then
cluster the planets. Mercury, Earth and Mars look similar, as do Saturn and Jupiter.
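The code behind Figure 1.1 is not shown here; a minimal sketch, assuming the planets matrix of (1.2) is available (as in Listing 1.1 below), would be along these lines, though the scaling and labeling options may need tweaking:

library(aplpack)   # provides the faces() routine of Wolf and Bielefeld [2010]
faces(planets)     # one face per planet (row), one facial feature per variable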
These face plots are more likely to be amusing than useful, especially if the number
of individuals is large. A star plot is similar. Each individual is represented by a
p-pointed star, where each point corresponds to a variable, and the distance of the
point from the center is based on the variable's value for that individual. See Figure 1.2.

Figure 1.1: Chernoff's faces for the planets. Each feature represents a variable. For
the first six variables, from the faces help file: 1-height of face, 2-width of face, 3-shape of
face, 4-height of mouth, 5-width of mouth, 6-curve of smile.
1.3 Scatter plots
Two-dimensional scatter plots can be enhanced by using different symbols for the
observations instead of plain dots. For example, different colors could be used for
different groups of points, or glyphs representing other variables could be plotted.
Figure 1.2 plots the planets with the logarithms of day length and year length as the
axes, where the stars created from the other four variables are the plotted symbols.
Note that the planets pair up in a reasonable way. Mercury and Venus are close, both
in terms of the scatter plot and in the look of their stars. Similarly, Earth and Mars
pair up, as do Jupiter and Saturn, and Uranus and Neptune. See Listing 1.1 for the R
code.
A scatter plot matrix arranges all possible two-way scatter plots in a q × q matrix.
These displays can be enhanced with brushing, in which individual or groups of
individual plots can be selected in one plot, and be simultaneously highlighted in the
other plots.
Listing 1.1: R code for the star plot of the planets, Figure 1.2. The data are in the
matrix planets. The first statement normalizes the variables to range from 0 to 1. The
ep matrix is used to place the names of the planets. Tweaking is necessary, depending
on the size of the plot.

p <- apply(planets,2,function(z) (z-min(z))/(max(z)-min(z)))
x <- log(planets[,2])
y <- log(planets[,3])
ep <- rbind(c(.3,.4),c(.5,.4),c(.5,0),c(.5,0),c(.6,1),c(.5,1.4),
    c(1,.6),c(1.3,.4),c(1,.5))
symbols(x,y,stars=p[,-(2:3)],xlab='log(day)',ylab='log(year)',inches=.4)
text(x+ep[,1],y+ep[,2],labels=rownames(planets),cex=.5)
Figure 1.2: Scatter plot of log(day) versus log(year) for the planets, with plotting
symbols being stars created from the other four variables: distance, diameter, temperature, moons.
Figure 1.3: A scatter plot matrix for the Fisher/Anderson iris data (Sepal.Length,
Sepal.Width, Petal.Length, and Petal.Width). In each plot, s indicates setosa plants,
v indicates versicolor, and g indicates virginica.
1.3.1 Example: Fisher-Anderson iris data

The most famous data set in multivariate analysis is the iris data analyzed by Fisher
[1936] based on data collected by Anderson [1935]. See also Anderson [1936]. There
are fifty specimens each of three species of iris: setosa, versicolor, and virginica. There
are four variables measured on each plant: sepal length, sepal width, petal length, and
petal width. Thus n = 150 and q = 4. Figure 1.3 contains the corresponding scatter
plot matrix, with species indicated by letter. We used the R function pairs. Note that
the three species separate fairly well, with setosa especially different from the other
two.
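A sketch of such a call (not the author's exact code), with each specimen plotted as s, v, or g according to its species:

sp <- c("s","v","g")[as.numeric(iris$Species)]   # plotting letter for each specimen
pairs(iris[,1:4], pch=sp)                        # scatter plot matrix as in Figure 1.3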
As a preview of classification (Chapter 11), Figure 1.4 uses faces to exhibit five
observations from each species, and five random observations without their species
label. Can you guess which species each is? See page 20. The setosas are not too
difficult, since they have small faces, but distinguishing the other two can be a
challenge.
Figure 1.4: Five specimens from each iris species, plus five from unspecified species.
Here, set indicates setosa plants, vers indicates versicolor, and virg indicates virginica.
1.4 Sample means, variances, and covariances
For univariate values x_1, . . . , x_n, the sample mean is

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,    (1.3)

and the sample variance is

s_x^2 = s_{xx} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2.    (1.4)

Note the two notations: The s_x^2 is most common when dealing with individual variables,
but the s_{xx} transfers better to multivariate data. Often one is tempted to divide
by n - 1 instead of n. That's fine, too. With a second set of values z_1, . . . , z_n, we have
the sample covariance between the x_i's and z_i's to be

s_{xz} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z}).    (1.5)

So the covariance between the x_i's and themselves is the variance, which is to say that
s_x^2 = s_{xx}. The sample correlation coefficient is a normalization of the covariance that
ranges between -1 and +1, defined by

r_{xz} = \frac{s_{xz}}{s_x s_z}    (1.6)

provided both variances are positive. (See Corollary 8.1.) In a scatter plot of x versus
z, the correlation coefficient is +1 if all the points lie on a line with positive slope,
and -1 if they all lie on a line with negative slope.

For a data matrix Y (1.1) with q variables, there are q means:

\bar{y}_j = \frac{1}{n} \sum_{i=1}^{n} y_{ij}.    (1.7)

Placing them in a row vector, we have

\bar{y} = (\bar{y}_1, \ldots, \bar{y}_q).    (1.8)

The n × 1 one vector is 1_n = (1, 1, . . . , 1)', the vector of all 1's. Then the mean vector
(1.8) can be written

\bar{y} = \frac{1}{n} 1_n' Y.    (1.9)

To find the variances and covariances, we first have to subtract the means from the
individual observations in Y: change y_{ij} to y_{ij} - \bar{y}_j for each i, j. That can be achieved
by subtracting the n × q matrix 1_n \bar{y} from Y to get the matrix of deviations. Using
(1.9), we can write

Y - 1_n \bar{y} = Y - \frac{1}{n} 1_n 1_n' Y = \left(I_n - \frac{1}{n} 1_n 1_n'\right) Y \equiv H_n Y.    (1.10)

There are two important matrices in that formula: The n × n identity matrix I_n,

I_n = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix},    (1.11)

and the n × n centering matrix H_n,

H_n = I_n - \frac{1}{n} 1_n 1_n' = \begin{pmatrix} 1 - \frac{1}{n} & -\frac{1}{n} & \cdots & -\frac{1}{n} \\ -\frac{1}{n} & 1 - \frac{1}{n} & \cdots & -\frac{1}{n} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{n} & -\frac{1}{n} & \cdots & 1 - \frac{1}{n} \end{pmatrix}.    (1.12)

The identity matrix leaves any vector or matrix alone, so if A is n × m, then A =
I_n A = A I_m, and the centering matrix subtracts the column mean from each element
in H_n A. Similarly, A H_m results in the row mean being subtracted from each element.

For an n × 1 vector x with mean \bar{x}, and n × 1 vector z with mean \bar{z}, we can write

\sum_{i=1}^{n} (x_i - \bar{x})^2 = (x - \bar{x} 1_n)'(x - \bar{x} 1_n)    (1.13)

and

\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z}) = (x - \bar{x} 1_n)'(z - \bar{z} 1_n).    (1.14)

Thus taking the deviations matrix in (1.10), (H_n Y)'(H_n Y) contains all the
\sum (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k)'s. We will call that matrix the sum of squares and cross-products matrix.
Notice that

(H_n Y)'(H_n Y) = Y' H_n' H_n Y = Y' H_n Y.    (1.15)

What happened to the H_n's? First, H_n is clearly symmetric, so that H_n' = H_n. Then
notice that H_n H_n = H_n. Such a matrix is called idempotent, that is, a square matrix

A is idempotent if AA = A.    (1.16)

Dividing the sum of squares and cross-products matrix by n gives the sample
variance-covariance matrix, or more simply sample covariance matrix:

S = \frac{1}{n} Y' H_n Y = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1q} \\ s_{21} & s_{22} & \cdots & s_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ s_{q1} & s_{q2} & \cdots & s_{qq} \end{pmatrix},    (1.17)

where s_{jj} is the sample variance of the j-th variable (column), and s_{jk} is the sample
covariance between the j-th and k-th variables. (When doing inference later, we may
divide by n - df instead of n for some degrees-of-freedom integer df.)
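These formulas are easy to verify numerically. The sketch below is not from the text; it builds H_n for an arbitrary data matrix (here the first three iris variables), checks idempotence (1.16), and compares (1.17) with R's cov(), which divides by n - 1 rather than n.

y <- as.matrix(iris[, 1:3])          # any n x q numeric data matrix
n <- nrow(y)
H <- diag(n) - matrix(1/n, n, n)     # centering matrix H_n = I_n - (1/n) 1_n 1_n'
all.equal(H %*% H, H)                # idempotent, as in (1.16)
S <- t(y) %*% H %*% y / n            # sample covariance matrix (1.17)
all.equal(S, cov(y) * (n-1)/n)       # cov() uses the n - 1 divisor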
1.5 Marginals and linear combinations
A natural first stab at looking at data with several variables is to look at the variables
one at a time, so with q variables, one would first make q histograms, or box plots, or
whatever suits one's fancy. Such techniques are based on marginals, that is, based on
subsets of the variables rather than all variables at once as in glyphs. One-dimensional
marginals are the individual variables, two-dimensional marginals are the pairs of
variables, three-dimensional marginals are the sets of three variables, etc.

Consider one-dimensional marginals. It is easy to construct the histograms, say.
But why be limited to the q variables? Functions of the variables can also be histogrammed,
e.g., weight/height. The number of possible functions one could imagine
is vast. One convenient class is the set of linear transformations, that is, for some
constants b_1, . . . , b_q, a new variable is W = b_1 Y_1 + \cdots + b_q Y_q, so the transformed data
consist of w_1, . . . , w_n, where

w_i = b_1 y_{i1} + \cdots + b_q y_{iq}.    (1.18)

Placing the coefficients into a column vector b = (b_1, . . . , b_q)', we can write

W \equiv (w_1, w_2, \ldots, w_n)' = Yb,    (1.19)

transforming the original data matrix to another one, albeit with only one variable.
Now there is a histogram for each vector b. A one-dimensional grand tour runs
through the vectors b, displaying the histogram for Yb as it goes. (See Asimov [1985]
and Buja and Asimov [1986] for general grand tour methodology.) Actually, one does
not need all b, e.g., the vectors b = (1, 2, 5)' and b = (2, 4, 10)' would give the same
histogram. Just the scale of the horizontal axis on the histograms would be different.
One simplification is to look at only the b's with norm 1. That is, the norm of a vector
x = (x_1, . . . , x_q)' is

\|x\| = \sqrt{x_1^2 + \cdots + x_q^2} = \sqrt{x'x},    (1.20)

so one would run through the b's with \|b\| = 1. Note that the one-dimensional
marginals are special cases: take

b' = (1, 0, \ldots, 0), (0, 1, 0, \ldots, 0), \ldots, \text{ or } (0, 0, \ldots, 1).    (1.21)

Scatter plots of two linear combinations are more common. That is, there are two
sets of coefficients (b_{1j}'s and b_{2j}'s), and two resulting variables:

w_{i1} = b_{11} y_{i1} + b_{21} y_{i2} + \cdots + b_{q1} y_{iq}, and
w_{i2} = b_{12} y_{i1} + b_{22} y_{i2} + \cdots + b_{q2} y_{iq}.    (1.22)

In general, the data matrix generated from p linear combinations can be written

W = YB,    (1.23)

where W is n × p, and B is q × p with column k containing the coefficients for the k-th
linear combination. As for one linear combination, the coefficient vectors are taken to
have norm 1, i.e., \|(b_{1k}, \ldots, b_{qk})\| = 1, which is equivalent to having all the diagonals
of B'B being 1.

Another common restriction is to have the linear combination vectors be orthogonal,
where two column vectors b and c are orthogonal if b'c = 0. Geometrically, orthogonality
means the vectors are perpendicular to each other. One benefit of restricting
to orthogonal linear combinations is that one avoids scatter plots that are highly
correlated but not meaningfully so, e.g., one might have w_1 be Height + Weight, and
w_2 be .99 Height + 1.01 Weight. Having those two highly correlated does not tell
us anything about the data set. If the columns of B are orthogonal to each other, as
well as having norm 1, then

B'B = I_p.    (1.24)

A set of norm 1 vectors that are mutually orthogonal are said to be orthonormal.
Return to two orthonormal linear combinations. A two-dimensional grand tour
plots the two variables as the q × 2 matrix B runs through all the matrices with a pair
of orthonormal columns.
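As a small illustration (not from the text), here are two orthonormal linear combinations of the first three iris variables; the chosen coefficient vectors are arbitrary, and the check confirms B'B = I_2 as in (1.24).

y <- as.matrix(iris[, 1:3])
B <- cbind(c(1,1,1)/sqrt(3), c(1,-1,0)/sqrt(2))   # two orthonormal coefficient vectors
round(t(B) %*% B, 10)                             # B'B = I_2, as in (1.24)
W <- y %*% B                                      # n x 2 matrix of linear combinations (1.23)
plot(W, xlab="First combination", ylab="Second combination")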
1.5.1 Rotations

If the B in (1.24) is q × q, i.e., there are as many orthonormal linear combinations as
variables, then B is an orthogonal matrix.

Definition 1.1. A q × q matrix G is orthogonal if

G'G = GG' = I_q.    (1.25)

Note that the definition says that the columns are orthonormal, and the rows are
orthonormal. In fact, the rows are orthonormal if and only if the columns are (if
the matrix is square), hence the middle equality in (1.25) is not strictly needed in the
definition.

Think of the data matrix Y as the set of n points in q-dimensional space. For an
orthogonal matrix G, what does the set of points W = YG look like? It looks exactly
like Y, but rotated or flipped. Think of a pinwheel turning, or a chicken on a rotisserie,
or the earth spinning around its axis or rotating about the sun. Figure 1.5 illustrates
a simple rotation of two variables. In particular, the norms of the points in Y are the
same as in W, so each point remains the same distance from 0.

Rotating point clouds for three variables work by first multiplying the n × 3 data
matrix by a 3 × 3 orthogonal matrix, then making a scatter plot of the first two resulting
variables. By running through the orthogonal matrices quickly, one gets the
illusion of three dimensions. See the discussion immediately above Exercise 1.9.21
for some suggestions on software for real-time rotations.
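A minimal sketch (not from the text) of a two-variable rotation, using an assumed 45-degree angle on the centered setosa sepal measurements (the same data as in Figure 1.5); the point-wise norms are unchanged.

theta <- pi/4                                        # assumed rotation angle
G <- cbind(c(cos(theta), sin(theta)),                # a 2 x 2 orthogonal (rotation) matrix
           c(-sin(theta), cos(theta)))
y <- scale(as.matrix(iris[1:50, 1:2]), scale=FALSE)  # centered setosa sepal length/width
w <- y %*% G                                         # rotated points
all.equal(rowSums(y^2), rowSums(w^2))                # norms are preserved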
1.6 Principal components
The grand tours and rotating point clouds described in the last two subsections do
not have mathematical objectives, that is, one just looks at them to see if anything
interesting pops up. In projection pursuit [Huber, 1985], one looks for a few (often
just one or two) (orthonormal) linear combinations that maximize a given objective
function. For example, if looking at just one linear combination, one may wish to find
the one that maximizes the variance of the data, or the skewness or kurtosis, or one
whose histogram is most bimodal. With two linear combinations, one may be after
clusters of points, high correlation, curvature, etc.

Principal components are the orthonormal combinations that maximize the variance.
They predate the term projection pursuit by decades [Pearson, 1901], and are the
most commonly used. The idea behind them is that variation is information, so if
one has several variables, one wishes the linear combinations that capture as much
of the variation in the data as possible. You have to decide in particular situations
whether variation is the important criterion. To find a column vector b to maximize
the sample variance of W = Yb, we could take b infinitely large, which yields infinite
variance. To keep the variance meaningful, we restrict to vectors b of unit norm.

For q variables, there are q principal components: The first has the maximal variance
any one linear combination (with norm 1) can have, the second has the maximal
variance among linear combinations orthogonal to the first, etc. The technical definition
for a data matrix is below. First, we note that for a given q × p matrix B, the
mean and variance of the elements in the linear transformation W = YB are easily
obtained from the mean and covariance matrix of Y using (1.8) and (1.15):

\bar{w} = \frac{1}{n} 1_n' W = \frac{1}{n} 1_n' YB = \bar{y} B,    (1.26)

by (1.9), and

S_W = \frac{1}{n} W' H_n W = \frac{1}{n} B' Y' H_n Y B = B'SB,    (1.27)

where S is the covariance matrix of Y in (1.17). In particular, for a column vector b,
the sample variance of Yb is b'Sb. Thus the principal components aim to maximize
g'Sg for g's of unit length.
Definition 1.2. Suppose S is the sample covariance matrix for the n × q data matrix Y. Let
g_1, . . . , g_q be an orthonormal set of q × 1 vectors such that

g_1 is any g that maximizes g'Sg over \|g\| = 1;
g_2 is any g that maximizes g'Sg over \|g\| = 1, g'g_1 = 0;
g_3 is any g that maximizes g'Sg over \|g\| = 1, g'g_1 = g'g_2 = 0;
  ...
g_q is any g that maximizes g'Sg over \|g\| = 1, g'g_1 = \cdots = g'g_{q-1} = 0.    (1.28)

Then Yg_i is the i-th sample principal component, g_i is its loading vector, and l_i \equiv g_i' S g_i
is its sample variance.

Because the function g'Sg is continuous in g, and the maximizations are over
compact sets, these principal components always exist. They may not be unique,
although for sample covariance matrices, if n ≥ q, they almost always are unique, up
to sign. See Section 13.1 for further discussion.

By the construction in (1.28), we have that the sample variances of the principal
components are ordered as

l_1 ≥ l_2 ≥ \cdots ≥ l_q.    (1.29)

What is not as obvious, but quite important, is that the principal components are
uncorrelated, as in the next lemma, proved in Section 1.8.

Lemma 1.1. The S and g_1, . . . , g_q in Definition 1.2 satisfy

g_i' S g_j = 0 for i ≠ j.    (1.30)

Now G \equiv (g_1, . . . , g_q) is an orthogonal matrix, and the matrix of principal components is

W = YG.    (1.31)

Equations (1.29) and (1.30) imply that the sample covariance matrix, say L, of W is
diagonal, with the l_i's on the diagonal. Hence by (1.27),

S_W = G'SG = L = \begin{pmatrix} l_1 & 0 & \cdots & 0 \\ 0 & l_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & l_q \end{pmatrix}.    (1.32)

Moving the G's to the L side of the equality, we obtain the following.
Figure 1.5: The sepal length and sepal width for the setosa iris data. The first plot is
the raw data, centered. The second shows the two principal components.
Theorem 1.1 (The spectral decomposition theorem for symmetric matrices). If S is a
symmetric q × q matrix, then there exists a q × q orthogonal (1.25) matrix G and a q × q
diagonal matrix L with diagonals l_1 ≥ l_2 ≥ \cdots ≥ l_q such that

S = GLG'.    (1.33)

Although we went through the derivation with S being a covariance matrix, all
we really needed for this theorem was that S is symmetric. The g_i's and l_i's have
mathematical names, too: eigenvectors and eigenvalues.

Definition 1.3 (Eigenvalues and eigenvectors). Suppose A is a q × q matrix. Then λ is
an eigenvalue of A if there exists a non-zero q × 1 vector u such that Au = λu. The vector
u is the corresponding eigenvector. Similarly, u ≠ 0 is an eigenvector if there exists an
eigenvalue to which it corresponds.

A little linear algebra shows that indeed, each g_i is an eigenvector of S corresponding
to l_i. Hence the following:

Symbol   Principal components   Spectral decomposition
l_i      Variance               Eigenvalue
g_i      Loadings               Eigenvector
                                                        (1.34)
Figure 1.5 plots the principal components for the q = 2 variables sepal length and
sepal width for the fifty iris observations of the species setosa. The data has been
centered, so that the means are zero. The variances of the two original variables are
0.124 and 0.144, respectively. The first graph shows the two variables are highly correlated,
with most of the points lining up near the 45° line. The principal component
loading matrix G rotates the points approximately 45° clockwise as in the second
graph, so that the data are now most spread out along the horizontal axis (variance is
0.234), and least along the vertical (variance is 0.034). The two principal components
are also, as it appears, uncorrelated.
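These calculations can be reproduced with eigen(). The sketch below is ours, not the author's code; the component variances should come out close to the 0.234 and 0.034 quoted above, and the off-diagonal of the components' covariance matrix is essentially zero.

y <- scale(as.matrix(iris[1:50, 1:2]), scale=FALSE)  # centered setosa sepal length/width
eg <- eigen(var(y))                                  # spectral decomposition of S (1.33)
G <- eg$vectors                                      # loading vectors g_1, g_2
w <- y %*% G                                         # the two principal components (1.31)
eg$values                                            # component variances l_1 >= l_2
round(var(w), 4)                                     # diagonal, as in (1.32)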
Best K components

In the process above, we found the principal components one by one. It may be
that we would like to find the rotation for which the first K variables, say, have the
maximal sum of variances. That is, we wish to find the orthonormal set of q × 1
vectors b_1, . . . , b_K to maximize

b_1' S b_1 + \cdots + b_K' S b_K.    (1.35)

Fortunately, the answer is the same, i.e., take b_i = g_i for each i, the principal components.
See Proposition 1.1 in Section 1.8. Section 13.1 explores principal components
further.
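A quick numerical spot-check of this claim (not from the text), using the standardized iris variables: the sum of variances (1.35) for the first K = 2 loading vectors is compared with that of a randomly chosen orthonormal q × 2 matrix B, obtained here via the QR decomposition; the principal component value is never smaller.

y <- scale(as.matrix(iris[, 1:3]))
S <- var(y)
G <- eigen(S)$vectors
K <- 2
sum(diag(t(G[,1:K]) %*% S %*% G[,1:K]))   # l_1 + l_2, the maximum in (1.52)
set.seed(1)
B <- qr.Q(qr(matrix(rnorm(3*K), 3, K)))   # a random orthonormal 3 x 2 matrix
sum(diag(t(B) %*% S %*% B))               # no larger than l_1 + l_2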
1.6.1 Biplots

When plotting observations using the first few principal component variables, the
relationship between the original variables and principal components is often lost.
An easy remedy is to rotate and plot the original axes as well. Imagine in the original
data space, in addition to the observed points, one plots arrows of length κ along the
axes. That is, the arrows are the line segments

a_i = \{(0, \ldots, 0, c, 0, \ldots, 0)' \mid 0 < c < κ\} (the c is in the i-th slot),    (1.36)

where an arrowhead is added at the non-origin end of the segment. If Y is the matrix
of observations, and G_1 the matrix containing the first p loading vectors, then

\hat{X} = Y G_1.    (1.37)

We also apply the transformation to the arrows:

\hat{A} = (a_1, \ldots, a_q) G_1.    (1.38)

The plot consisting of the points \hat{X} and the arrows \hat{A} is then called the biplot. See
Gabriel [1981]. The points of the arrows in \hat{A} are just

κ I_q G_1 = κ G_1,    (1.39)

so that in practice all we need to do is, for each axis, draw an arrow pointing from
the origin to κ times (the i-th row of G_1). The value of κ is chosen by trial-and-error, so
that the arrows are amidst the observations. Notice that the components of these
arrows are proportional to the loadings, so that the length of the arrows represents
the weight of the corresponding variables on the principal components.
1.6.2 Example: Sports data
Louis Roussos asked n = 130 people to rank seven sports, assigning #1 to the sport
they most wish to participate in, and #7 to the one they least wish to participate in.
The sports are baseball, football, basketball, tennis, cycling, swimming and jogging.
Here are a few of the observations:
Obs i   BaseB  FootB  BsktB  Ten  Cyc  Swim  Jog
1         1      3      7     2    4     5    6
2         1      3      2     5    4     7    6
3         1      3      2     5    4     7    6
...      ...    ...    ...   ...  ...   ...  ...
129       5      7      6     4    1     3    2
130       2      1      6     7    3     5    4
                                                  (1.40)

E.g., the first person likes baseball and tennis, but not basketball or jogging (too much
running?).

We find the principal components. The data is in the matrix sportsranks. We find it
easier to interpret the plot if we reverse the ranks, so that 7 is best and 1 is worst, then
center the variables. The function eigen calculates the eigenvectors and eigenvalues of
its argument, returning the results in the components vectors and values, respectively:

y <- 8 - sportsranks
y <- scale(y,scale=F) # Centers the columns
eg <- eigen(var(y))

The function prcomp can also be used. The eigenvalues (variances) are

j     1      2     3     4     5     6     7
l_j   10.32  4.28  3.98  3.3   2.74  2.25  0
                                              (1.41)

The first eigenvalue is 10.32, quite a bit larger than the second. The second through
sixth are fairly equal, so it may be reasonable to look at just the first component.
(The seventh eigenvalue is 0, but that follows because the rank vectors all sum to
1 + \cdots + 7 = 28, hence exist in a six-dimensional space.)

We create the biplot using the first two dimensions. We first plot the people:

ev <- eg$vectors
w <- y %*% ev # The principal components
lm <- range(w)
plot(w[,1:2],xlim=lm,ylim=lm)

The biplot adds in the original axes. Thus we want to plot the seven (q = 7) points as
in (1.39), where G_1 contains the first two eigenvectors. Plotting the arrows and labels:

arrows(0,0,5*ev[,1],5*ev[,2])
text(7*ev[,1:2],labels=colnames(y))

The constants 5 (which is the κ) and 7 were found by trial and error so that the
graph, Figure 1.6, looks good. We see two main clusters. The left-hand cluster of
people is associated with the team sports arrows (baseball, football and basketball),
and the right-hand cluster is associated with the individual sports arrows (cycling,
swimming, jogging). Tennis is a bit on its own, pointing south.
1.7 Other projections to pursue
Principal components can be very useful, but you do have to be careful. For one, they
depend crucially on the scaling of your variables. For example, suppose the data set
Figure 1.6: Biplot of the sports data, using the first two principal components.
has two variables, height and weight, measured on a number of adults. The variance
of height, in inches, is about 9, and the variance of weight, in pounds, is 900 (= 30²).
One would expect the first principal component to be close to the weight variable,
because that is where the variation is. On the other hand, if height were measured in
millimeters, and weight in tons, the variances would be more like 6000 (for height)
and 0.0002 (for weight), so the first principal component would be essentially the
height variable. In general, if the variables are not measured in the same units, it can
be problematic to decide what units to use for the variables. See Section 13.1.1. One
common approach is to divide each variable by its standard deviation \sqrt{s_{jj}}, so that
the resulting variables all have variance 1.
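A simulated illustration of this unit dependence (the data here are made up, not from the text); the variances roughly match the numbers quoted above:

set.seed(1)
height <- rnorm(100, 68, 3)                          # inches: variance about 9
weight <- 160 + 5*(height - 68) + rnorm(100, 0, 25)  # pounds: variance about 900
x <- cbind(height, weight)
eigen(var(x))$vectors[,1]                  # first loading: almost entirely weight
x2 <- cbind(height*25.4, weight/2000)      # millimeters and tons
eigen(var(x2))$vectors[,1]                 # first loading: almost entirely height
eigen(var(scale(x)))$vectors[,1]           # standardizing removes the unit dependence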
Another caution is that the linear combination with largest variance is not necessarily
the most interesting, e.g., you may want one which is maximally correlated
with another variable, or which distinguishes two populations best, or which shows
the most clustering.

Popular objective functions to maximize, other than variance, are skewness, kurtosis
and negative entropy. The idea is to find projections that are not normal (in the
sense of the normal distribution). The hope is that these will show clustering or some
other interesting feature.
Skewness measures a certain lack of symmetry, where one tail is longer than the
other. It is measured by the normalized sample third central (meaning subtract the
mean) moment:

Skewness = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3 / n}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2 / n\right)^{3/2}}.    (1.42)

Positive values indicate a longer tail to the right, and negative to the left. Kurtosis is
the normalized sample fourth central moment. For a sample x_1, . . . , x_n, it is

Kurtosis = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4 / n}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2 / n\right)^{2}} - 3.    (1.43)

The -3 is there so that exactly normal data will have kurtosis 0. A variable with
low kurtosis is more boxy than the normal. One with high kurtosis tends to have
thick tails and a pointy middle. (A variable with low kurtosis is platykurtic, and one
with high kurtosis is leptokurtic, from the Greek: kyrtos = curved, platys = flat, like a
platypus, and lepto = thin.) Bimodal distributions often have low kurtosis.
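For data, (1.42) and (1.43) translate directly into R; the functions below are a sketch (not from the text), applied to one linear combination of the standardized iris variables:

skewness <- function(x) mean((x-mean(x))^3) / mean((x-mean(x))^2)^(3/2)   # (1.42)
kurtosis <- function(x) mean((x-mean(x))^4) / mean((x-mean(x))^2)^2 - 3   # (1.43)
y <- scale(as.matrix(iris[, 1:3]))
b <- c(1, 0, 0)                    # a unit-norm coefficient vector
w <- y %*% b                       # one linear combination Yb
c(skewness(w), kurtosis(w))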
Entropy

(You may wish to look through Section 2.1 before reading this section.) The entropy
of a random variable Y with pdf f(y) is

Entropy(f) = -E_f[\log(f(Y))].    (1.44)

Entropy is supposed to measure lack of structure, so that the larger the entropy, the
more diffuse the distribution is. For the normal, we have that

Entropy(N(μ, σ²)) = E_f\left[\log(\sqrt{2πσ²}) + \frac{(Y - μ)^2}{2σ²}\right] = \frac{1}{2}(1 + \log(2πσ²)).    (1.45)

Note that it does not depend on the mean μ, and that it increases without bound as σ²
increases. Thus maximizing entropy unrestricted is not an interesting task. However,
one can imagine maximizing entropy for a given mean and variance, which leads to
the next lemma, to be proved in Section 1.8.

Lemma 1.2. The N(μ, σ²) uniquely maximizes the entropy among all pdfs with mean μ and
variance σ².

Thus a measure of nonnormality of g is its entropy subtracted from that of the
normal with the same variance. Since there is a negative sign in front of the entropy
of g, this difference is called negentropy, defined for any g as

Negent(g) = \frac{1}{2}(1 + \log(2πσ²)) - Entropy(g), where σ² = Var_g[Y].    (1.46)

With data, one does not know the pdf g, so one must estimate the negentropy. This
value is known as the Kullback-Leibler distance, or discrimination information, from
g to the normal density. See Kullback and Leibler [1951].
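The estimator actually used in these notes is the negent routine of Appendix A.1; the sketch below is a cruder histogram plug-in, included only to make (1.46) concrete. The bin count and the use of var() (divisor n - 1) are arbitrary choices.

negent.hist <- function(x, nbins=20) {
  h <- hist(x, breaks=nbins, plot=FALSE)          # equal-width bins (bin count is a suggestion)
  p <- h$counts / length(x)                       # bin probabilities
  delta <- diff(h$breaks)[1]                      # bin width
  ent <- -sum(p[p > 0] * log(p[p > 0] / delta))   # histogram estimate of Entropy(g)
  (1 + log(2*pi*var(x)))/2 - ent                  # negentropy (1.46) with the sample variance
}
negent.hist(rnorm(1000))   # near 0 for normal data
negent.hist(rexp(1000))    # noticeably positive for skewed data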
1.7.1 Example: Iris data

Consider the first three variables of the iris data (sepal length, sepal width, and petal
length), normalized so that each variable has mean zero and variance one. We find
the first two principal components, which maximize the variances, and the first two
components that maximize the estimated entropies, defined as in (1.28),
but with the estimated entropy of Yg substituted for the variance g'Sg. The table (1.47)
contains the loadings for the variables. Note that the two objective functions do
produce different projections. The first principal component weights equally on the
two length variables, while the first entropy variable is essentially petal length.

Figure 1.7: Projection pursuit for the iris data. The first plot is based on maximizing
the variances of the projections, i.e., principal components. The second plot maximizes
estimated entropies.
               Variance          Entropy
               g_1      g_2      g*_1     g*_2
Sepal length   0.63     0.43     0.08     0.74
Sepal width    0.36     0.90     0.00     0.68
Petal length   0.69     0.08     1.00     0.06
                                              (1.47)
Figure 1.7 graphs the results. The plots both show separation between setosa and
the other two species, but the principal components plot has the observations more
spread out, while the entropy plot shows the two groups much tighter.
The matrix iris has the iris data, with the first four columns containing the measurements,
and the fifth specifying the species. The observations are listed with the
fifty setosas first, then the fifty versicolors, then the fifty virginicas. To find the principal
components for the first three variables, we use the following:

y <- scale(as.matrix(iris[,1:3]))
g <- eigen(var(y))$vectors
pc <- y%*%g
The first statement centers and scales the variables. The plot of the first two columns
of pc is the first plot in Figure 1.7. The procedure we used for entropy is negent3D in
Listing A.3, explained in Appendix A.1. The code is

gstar <- negent3D(y,nstart=10)$vectors
ent <- y%*%gstar
To create plots like the ones in Figure 1.7, use
par(mfrow=c(1,2))
sp <- rep(c('s','v','g'),c(50,50,50))
plot(pc[,1:2],pch=sp) # pch specifies the characters to plot.
plot(ent[,1:2],pch=sp)
1.8 Proofs
Proof of the principal components result, Lemma 1.1
The idea here was taken from Davis and Uhl [1999]. Consider the g_1, . . . , g_q as defined in (1.28). Take i < j, and for angle θ, let

    h(θ) = g(θ)′ S g(θ), where g(θ) = cos(θ) g_i + sin(θ) g_j.    (1.48)

Because the g_i's are orthogonal,

    ‖g(θ)‖ = 1 and g(θ)′g_1 = · · · = g(θ)′g_{i−1} = 0.    (1.49)

According to the i-th stage in (1.28), h(θ) is maximized when g(θ) = g_i, i.e., when θ = 0. The function is differentiable, hence its derivative must be zero at θ = 0. To verify (1.30), differentiate:

    0 = (d/dθ) h(θ) |_{θ=0}
      = (d/dθ) ( cos²(θ) g_i′Sg_i + 2 sin(θ) cos(θ) g_i′Sg_j + sin²(θ) g_j′Sg_j ) |_{θ=0}
      = 2 g_i′Sg_j.    (1.50)    □
Best K components
We next consider finding the set b_1, . . . , b_K of orthonormal vectors to maximize the sum of variances, ∑_{i=1}^K b_i′Sb_i, as in (1.35). It is convenient here to have the next definition.
Definition 1.4 (Trace). The trace of an m × m matrix A is the sum of its diagonals, trace(A) = ∑_{i=1}^m a_ii.
Thus if we let B = (b_1, . . . , b_K), we have that

    ∑_{i=1}^K b_i′Sb_i = trace(B′SB).    (1.51)
Proposition 1.1 (Best K components). Suppose S is a q × q covariance matrix, and define B_K to be the set of q × K matrices with orthonormal columns, 1 ≤ K ≤ q. Then

    max_{B ∈ B_K} trace(B′SB) = l_1 + · · · + l_K,    (1.52)

which is achieved by taking B = (g_1, . . . , g_K), where g_i is the i-th principal component loading vector for S, and l_i is the corresponding variance.

The proposition follows directly from the next lemma, with S as in (1.33).
Lemma 1.3. Suppose S and B_K are as in Proposition 1.1, and S = GLG′ is its spectral decomposition. Then (1.52) holds.
Proof. Set A = G′B, so that A is also in B_K. Then B = GA, and

    trace(B′SB) = trace(A′G′SGA)
                = trace(A′LA)
                = ∑_{i=1}^q [ (∑_{j=1}^K a_ij²) l_i ]
                = ∑_{i=1}^q c_i l_i,    (1.53)

where the a_ij's are the elements of A, and c_i = ∑_{j=1}^K a_ij². Because the columns of A have norm one, and the rows of A have norms less than or equal to one,

    ∑_{i=1}^q c_i = ∑_{j=1}^K [ ∑_{i=1}^q a_ij² ] = K and c_i ≤ 1.    (1.54)

To maximize (1.53) under those constraints on the c_i's, we try to make the earlier c_i's as large as possible, which means that c_1 = · · · = c_K = 1 and c_{K+1} = · · · = c_q = 0. The resulting value is then l_1 + · · · + l_K. Note that taking A with a_ii = 1, i = 1, . . . , K, and 0 elsewhere (so that A consists of the first K columns of I_q), achieves that maximum. With that A, we have that B = (g_1, . . . , g_K).
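As a quick numerical check of Proposition 1.1, the following R sketch (the choice of S as the covariance of the four iris measurements, and K = 2, are arbitrary) verifies that the first K principal component loading vectors attain the bound l_1 + · · · + l_K.

S <- var(as.matrix(iris[, 1:4]))   # a 4 x 4 sample covariance matrix
eig <- eigen(S)
K <- 2
B <- eig$vectors[, 1:K]            # first K principal component loadings
sum(diag(t(B) %*% S %*% B))        # trace(B'SB) ...
sum(eig$values[1:K])               # ... equals l_1 + l_2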
Proof of the entropy result, Lemma 1.2
Let f be the N(μ, σ²) density, and g be any other pdf with mean μ and variance σ². Then

    Entropy(f) − Entropy(g)
      = − ∫ f(y) log(f(y)) dy + ∫ g(y) log(g(y)) dy
      = ∫ g(y) log(g(y)) dy − ∫ g(y) log(f(y)) dy + ∫ g(y) log(f(y)) dy − ∫ f(y) log(f(y)) dy
      = − ∫ g(y) log(f(y)/g(y)) dy
        + E_g[ log(√(2πσ²)) + (Y − μ)²/(2σ²) ] − E_f[ log(√(2πσ²)) + (Y − μ)²/(2σ²) ]    (1.55)
      = − E_g[ log(f(Y)/g(Y)) ].    (1.56)

The last two terms in (1.55) are equal, since Y has the same mean and variance under f and g.
At this point we need an important inequality about convexity; to wit, what follows is a definition and lemma.
Definition 1.5 (Convexity). The real-valued function h, defined on an interval I ⊂ R, is convex if for each x_0 ∈ I, there exists an a_0 and b_0 such that

    h(x_0) = a_0 + b_0 x_0 and h(x) ≥ a_0 + b_0 x for x ≠ x_0.    (1.57)

The function is strictly convex if the inequality is strict in (1.57).

The line a_0 + b_0 x is the tangent line to h at x_0. Convex functions have tangent lines that are below the curve, so that convex functions are bowl-shaped. The next lemma is proven in Exercise 1.9.13.
Lemma 1.4 (Jensen's inequality). Suppose W is a random variable with finite expected value. If h(w) is a convex function, then

    E[h(W)] ≥ h(E[W]),    (1.58)

where the left-hand expectation may be infinite. Furthermore, the inequality is strict if h(w) is strictly convex and W is not constant, that is, P[W = c] < 1 for any c.

One way to remember the direction of the inequality is to imagine h(w) = w², in which case (1.58) states that E[W²] ≥ E[W]², which we already know because Var[W] = E[W²] − E[W]² ≥ 0.
Now back to (1.56). The function h(w) = −log(w) is strictly convex, and if g is not equivalent to f, W = f(Y)/g(Y) is not constant. Jensen's inequality thus shows that

    −E_g[ log(f(Y)/g(Y)) ] > −log( E[ f(Y)/g(Y) ] )
                           = −log( ∫ (f(y)/g(y)) g(y) dy )
                           = −log( ∫ f(y) dy )
                           = −log(1) = 0.    (1.59)

Putting (1.56) and (1.59) together yields

    Entropy(N(0, σ²)) − Entropy(g) > 0,    (1.60)

which completes the proof of Lemma 1.2.    □
Answers: The question marks in Figure 1.4 are, respectively, virginica, setosa,
virginica, versicolor, and setosa.
1.9 Exercises
Exercise 1.9.1. Let H_n be the centering matrix in (1.12). (a) What is H_n 1_n? (b) Suppose x is an n × 1 vector whose elements sum to zero. What is H_n x? (c) Show that H_n is idempotent (1.16).
Exercise 1.9.2. Define the matrix J_n = (1/n) 1_n 1_n′, so that H_n = I_n − J_n. (a) What does J_n do to a vector? (That is, what is J_n a?) (b) Show that J_n is idempotent. (c) Find the spectral decomposition (1.33) for J_n explicitly when n = 3. [Hint: In G, the first column (eigenvector) is proportional to 1_3. The remaining two eigenvectors can be any other vectors such that the three eigenvectors are orthonormal. Once you have a G, you can find the L.] (d) Find the spectral decomposition for H_3. [Hint: Use the same eigenvectors as for J_3, but in a different order.] (e) What do you notice about the eigenvalues for these two matrices?
Exercise 1.9.3. A covariance matrix has intraclass correlation structure if all the variances are equal, and all the covariances are equal. So for n = 3, it would look like

    A = ⎡ a  b  b ⎤
        ⎢ b  a  b ⎥
        ⎣ b  b  a ⎦ .    (1.61)

Find the spectral decomposition for this type of matrix. [Hint: Use the G in Exercise 1.9.2, and look at G′AG.]
Exercise 1.9.4. Suppose Y is an n × q data matrix, and W = YG, where G is a q × q orthogonal matrix. Let y_1, . . . , y_n be the rows of Y, and similarly the w_i's be the rows of W. (a) Show that the corresponding points have the same length, ‖y_i‖ = ‖w_i‖. (b) Show that the distances between the points have not changed, ‖y_i − y_j‖ = ‖w_i − w_j‖, for any i, j.
Exercise 1.9.5. Suppose that the columns of G constitute the principal component loading vectors for the sample covariance matrix S. Show that g_i′Sg_i = l_i and g_i′Sg_j = 0 for i ≠ j, as in (1.30), implies (1.32): G′SG = L.
Exercise 1.9.6. Verify (1.49) and (1.50).
Exercise 1.9.7. In (1.53), show that trace(A′LA) = ∑_{i=1}^q [ (∑_{j=1}^K a_ij²) l_i ].
Exercise 1.9.8. This exercise is to show that the eigenvalue matrix of a covariance matrix S is unique. Suppose S has two spectral decompositions, S = GLG′ = HMH′, where G and H are orthogonal matrices, and L and M are diagonal matrices with nonincreasing diagonal elements. Use Lemma 1.3 on both decompositions of S to show that for each K = 1, . . . , q, l_1 + · · · + l_K = m_1 + · · · + m_K. Thus L = M.
Exercise 1.9.9. Suppose Y is a data matrix, and Z = YF for some orthogonal matrix F, so that Z is a rotated version of Y. Show that the variances of the principal components are the same for Y and Z. (This result should make intuitive sense.) [Hint: Find the spectral decomposition of the covariance of Z from that of Y, then note that these covariance matrices have the same eigenvalues.]
Exercise 1.9.10. Show that in the spectral decomposition (1.33), each l_i is an eigenvalue, with corresponding eigenvector g_i, i.e., Sg_i = l_i g_i.

Exercise 1.9.11. Suppose λ is an eigenvalue of the covariance matrix S. Show that λ must equal one of the l_i's in the spectral decomposition of S. [Hint: Let u be an eigenvector corresponding to λ. Show that λ is also an eigenvalue of L, with corresponding eigenvector v = G′u, hence l_i v_i = λ v_i for each i.]
Exercise 1.9.12. Verify the expression for −∫ g(y) log(f(y)) dy in (1.55).
Exercise 1.9.13. Consider the setup in Jensen's inequality, Lemma 1.4. (a) Show that if h is convex, E[h(W)] ≥ h(E[W]). [Hint: Set x_0 = E[W] in Definition 1.5.] (b) Suppose h is strictly convex. Give an example of a random variable W for which E[h(W)] = h(E[W]). (c) Show that if h is strictly convex and W is not constant, then E[h(W)] > h(E[W]).
Exercise 1.9.14 (Spam). In the Hewlett-Packard spam data, a set of n = 4601 emails were classified according to whether they were spam, where 0 means not spam, 1 means spam. Fifty-seven explanatory variables based on the content of the emails were recorded, including various word and symbol frequencies. The emails were sent to George Forman (not the boxer) at Hewlett-Packard labs, hence emails with the words "George" or "hp" would likely indicate non-spam, while "credit" or "!" would suggest spam. The data were collected by Hopkins et al. [1999], and are in the data matrix Spam. (They are also in the R data frame spam from the ElemStatLearn package [Halvorsen, 2009], as well as at the UCI Machine Learning Repository [Frank and Asuncion, 2010].)
Based on an email's content, is it possible to accurately guess whether it is spam or not? Here we use Chernoff's faces. Look at the faces of some emails known to be spam and some known to be non-spam (the training data). Then look at some randomly chosen faces (the test data). E.g., to have twenty observations known to be spam, twenty known to be non-spam, and twenty test observations, use the following R code:
x0 <- Spam[Spam[,'spam']==0,] # The non-spam
x1 <- Spam[Spam[,'spam']==1,] # The spam
train0 <- x0[1:20,]
train1 <- x1[1:20,]
test <- rbind(x0[-(1:20),],x1[-(1:20),])[sample(1:4561,20),]
Based on inspecting the training data, try to classify the test data. How accurate are your guesses? The faces program uses only the first fifteen variables of the input matrix, so you should try different sets of variables. For example, for each variable find the value of the t-statistic for testing equality of the spam and email groups, then choose the variables with the largest absolute t's.
Exercise 1.9.15 (Spam). Continue with the spam data from Exercise 1.9.14. (a) Plot the variances of the explanatory variables (the first 57 variables) versus the index (i.e., the x-axis has (1, 2, . . . , 57), and the y-axis has the corresponding variances.) You might not see much, so repeat the plot, but taking logs of the variances. What do you see? Which three variables have the largest variances? (b) Find the principal components using just the explanatory variables. Plot the eigenvalues versus the index. Plot the log of the eigenvalues versus the index. What do you see? (c) Look at the loadings for the first three principal components. (E.g., if spamload contains the loadings (eigenvectors), then you can try plotting them using matplot(1:57,spamload[,1:3]).) What is the main feature of the loadings? How do they relate to your answer in part (a)? (d) Now scale the explanatory variables so each has mean zero and variance one: spamscale <- scale(Spam[,1:57]). Find the principal components using this matrix. Plot the eigenvalues versus the index. What do you notice, especially compared to the results of part (b)? (e) Plot the loadings of the first three principal components obtained in part (d). How do they compare to those from part (c)? Why is there such a difference?
Exercise 1.9.16 (Sports data). Consider the Louis Roussos sports data described in Section 1.6.2. Use faces to cluster the observations. Use the raw variables, or the principal components, and try different orders of the variables (which maps the variables to different sets of facial features). After clustering some observations, look at how they ranked the sports. Do you see any pattern? Were you able to distinguish between people who like team sports versus individual sports? Those who like (dislike) tennis? Jogging?
Exercise 1.9.17 (Election). The data set election has the results of the first three US presidential races of the 2000s (2000, 2004, 2008). The observations are the 50 states plus the District of Columbia, and the values are the (D − R)/(D + R) for each state and each year, where D is the number of votes the Democrat received, and R is the number the Republican received. (a) Without scaling the variables, find the principal components. What are the first two principal component loadings measuring? What is the ratio of the standard deviation of the first component to the second's? (c) Plot the first versus second principal components, using the states' two-letter abbreviations as the plotting characters. (They are in the vector stateabb.) Make the plot so that the two axes cover the same range. (d) There is one prominent outlier. What is it, and for which variable is it mostly outlying? (e) Comparing how states are grouped according to the plot and how close they are geographically, can you make any general statements about the states and their voting profiles (at least for these three elections)?
Exercise 1.9.18 (Painters). The data set painters has ratings of 54 famous painters. It is in the MASS package [Venables and Ripley, 2002]. See Davenport and Studdert-Kennedy [1972] for a more in-depth discussion. The R help file says

    The subjective assessment, on a 0 to 20 integer scale, of 54 classical painters. The painters were assessed on four characteristics: composition, drawing, colour and expression. The data is due to the Eighteenth century art critic, de Piles.

The fifth variable gives the school of the painter, using the following coding:

    A: Renaissance; B: Mannerist; C: Seicento; D: Venetian; E: Lombard; F: Sixteenth Century; G: Seventeenth Century; H: French

Create the two-dimensional biplot for the data. Start by turning the data into a matrix, then centering both dimensions, then scaling:
x <- scale(as.matrix(painters[,1:4]),scale=F)
x <- t(scale(t(x),scale=F))
x <- scale(x)
Use the fifth variable, the painters' schools, as the plotting character, and the four rating variables as the arrows. Interpret the two principal component variables. Can you make any generalizations about which schools tend to rate high on which scores?
Exercise 1.9.19 (Cereal). Chakrapani and Ehrenberg [1981] analyzed people's attitudes towards a variety of breakfast cereals. The data matrix cereal is 8 × 11, with rows corresponding to eight cereals, and columns corresponding to potential attributes about cereals. The attributes: Return (a cereal one would come back to), tasty, popular (with the entire family), digestible, nourishing, natural flavor, affordable, good value, crispy (stays crispy in milk), fit (keeps one fit), and fun (for children). The original data consisted of the percentage of subjects who thought the given cereal possessed the given attribute. The present matrix has been doubly centered, so that the row means and column means are all zero. (The original data can be found in the S-Plus [TIBCO Software Inc., 2009] data set cereal.attitude.) Create the two-dimensional biplot for the data with the cereals as the points (observations), and the attitudes as the arrows (variables). What do you see? Are there certain cereals/attributes that tend to cluster together? (You might want to look at the Wikipedia entry [Wikipedia, 2011] on breakfast cereals.)
Exercise 1.9.20 (Decathlon). The decathlon data set has scores on the top 24 men in the decathlon (a set of ten events) at the 2008 Olympics. The scores are the numbers of points each participant received in each event, plus each person's total points. The data can be found at the NBC Olympic site [Olympics, 2008]. Create the biplot for these data based on the first ten variables (i.e., do not use their total scores). Doubly center, then scale, the data as in Exercise 1.9.18. The events should be the arrows. Do you see any clustering of the events? The athletes?
The remaining questions require software that will display rotating point clouds of three dimensions, and calculate some projection pursuit objective functions. The Spin program at http://stat.istics.net/MultivariateAnalysis is sufficient for our purposes. GGobi [Cook and Swayne, 2007] has an excellent array of graphical tools for interactively exploring multivariate data. See also the spin3R routine in the R package aplpack [Wolf and Bielefeld, 2010].
Exercise 1.9.21 (Iris). Consider the three variables X = Sepal Length, Y = Petal Length, and Z = Petal Width in the Fisher/Anderson iris data. (a) Look at the data while rotating. What is the main feature of these three variables? (b) Scale the data so that the variables all have the same sample variance. (The Spin program automatically performs the scaling.) For various objective functions (variance, skewness, kurtosis, negative kurtosis, negentropy), find the rotation that maximizes the function. (That is, the first component of the rotation maximizes the criterion over all rotations. The second then maximizes the criterion for components orthogonal to the first. The third component is then whatever is orthogonal to the first two.) Which criteria are most effective in yielding rotations that exhibit the main feature of the data? Which are least effective? (c) Which of the original variables are most prominently represented in the first two components of the most effective rotations?
Exercise 1.9.22 (Automobiles). The data set cars [Consumers Union, 1990] contains q = 11 size measurements on n = 111 models of automobile. The original data can be found in the S-Plus® [TIBCO Software Inc., 2009] data frame cu.dimensions. In cars, the variables have been normalized to have medians of 0 and median absolute deviations (MAD) of 1.4826 (the MAD for a N(0, 1)). Inspect the three-dimensional data set consisting of the variables length, width, and height. (In the Spin program, the data set is called "Cars.") (a) Find the linear combination with the largest variance. What is the best linear combination? (Can you interpret it?) What is its variance? Does the histogram look interesting? (b) Now find the linear combination to maximize negentropy. What is the best linear combination, and its entropy? What is the main feature of the histogram? (c) Find the best two linear combinations for entropy. What are they? What feature do you see in the scatter plot?
Exercise 1.9.23 (RANDU). RANDU [IBM, 1970] is a venerable, fast, efficient, and very flawed random number generator. See Dudewicz and Ralley [1981] for a thorough review of old-time random number generators. For given seed x_0, RANDU produces x_{i+1} from x_i via

    x_{i+1} = (65539 × x_i) mod 2³¹.    (1.62)

The random Uniform(0,1) values are then u_i = x_i / 2³¹. The R data set randu is based on a sequence generated using RANDU, where each of n = 400 rows is a set of p = 3 consecutive u_i's. Rotate the data, using objective criteria if you wish, to look for significant non-randomness in the data matrix. If the data are really random, the points should uniformly fill up the three-dimensional cube. What feature do you see that reveals the non-randomness?
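For readers who prefer to generate such a sequence directly rather than load the randu data frame, here is a small R sketch of the recursion (1.62). The function name randu_gen, the default seed, and the use of nonoverlapping triples are choices made here for illustration, not part of the exercise.

randu_gen <- function(n, seed = 1) {
  x <- numeric(n + 1)
  x[1] <- seed
  for (i in 1:n) x[i + 1] <- (65539 * x[i]) %% 2^31   # the RANDU recursion (1.62)
  u <- x[-1] / 2^31                                   # uniforms on (0, 1)
  matrix(u[1:(3 * floor(n / 3))], ncol = 3, byrow = TRUE)  # consecutive triples
}
pairs(randu_gen(1200))   # two-dimensional margins look fine; rotating the 3D cloud reveals the flaw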
The data sets Example 1, Example 2, . . ., Example 5 are artificial three-dimensional point clouds. The goal is to rotate the point clouds to reveal their structures.
Exercise 1.9.24. Consider the Example 1 data set. (a) Find the first two principal components for these data. What are their variances? (b) Rotate the data. Are the principal components unique? (c) Find the two-dimensional plots based on maximizing the skewness, kurtosis, negative kurtosis, and negentropy criteria. What do you see? What does the histogram for the linear combination with the largest kurtosis look like? Is it pointy? What does the histogram for the linear combination with the most negative kurtosis look like? Is it boxy? (d) Describe the three-dimensional structure of the data points. Do the two-dimensional plots in part (c) give a good idea of the three-dimensional structure?
Exercise 1.9.25. This question uses the Example 2 data set. (a) What does the histogram for the linear combination with the largest variance look like? (b) What does the histogram for the linear combination with the largest negentropy look like? (c) Describe the three-dimensional object.
Exercise 1.9.26. For each of Example 3, 4, and 5, try to guess the shape of the cloud
of data points based on just the 2-way scatter plots. Then rotate the points enough to
convince yourself of the actual shape.
Chapter 2
Multivariate Distributions
This chapter reviews the elements of distribution theory that we need, especially for
vectors and matrices. (Classical multivariate analysis is basically linear algebra, so
everything we do eventually gets translated into matrix equations.) See any good
mathematical statistics book such as Hogg, McKean, and Craig [2004], Bickel and
Doksum [2000], or Lehmann and Casella [1998] for a more comprehensive treatment.
2.1 Probability distributions
We will deal with random variables and finite collections of random variables. A random variable X has range or space 𝒳 ⊂ R, the real line. A collection of random variables is just a set of random variables. They could be arranged in any convenient way, such as a row or column vector, matrix, triangular array, or three-dimensional array, and will often be indexed to match the arrangement. The default arrangement will be to index the random variables by 1, . . . , N, so that the collection is X = (X_1, . . . , X_N), considered as a row vector. The space of X is 𝒳 ⊂ R^N, N-dimensional Euclidean space. A probability distribution P for a random variable or collection of random variables specifies the chance that the random object will fall in a particular subset of its space. That is, for A ⊂ 𝒳, P[A] is the probability that the random X is in A, also written P[X ∈ A]. In principle, to describe a probability distribution, one must specify P[A] for all subsets A. (Technically, all measurable subsets, but we will not worry about measurability.) Fortunately, there are easier ways. We will use densities, but the main method will be to use representations, by which we mean describing a collection of random variables Y in terms of another collection X for which we already know the distribution, usually through a function, i.e., Y = g(X).
2.1.1 Distribution functions
The distribution function for the probability distribution P for the collection X = (X_1, . . . , X_N) of random variables is the function

    F : R^N → [0, 1],
    F(x_1, x_2, . . . , x_N) = P[X_1 ≤ x_1, X_2 ≤ x_2, . . . , X_N ≤ x_N].    (2.1)

Note that it is defined on all of R^N, not just the space of X. It is nondecreasing, and continuous from the right, in each x_i. The limit as all x_i → −∞ is zero, and as all x_i → ∞, the limit is one. The distribution function uniquely defines the distribution, though we will not find much use for it.
2.1.2 Densities
A collection of random variables X is said to have a density with respect to Lebesgue measure on R^N if there is a nonnegative function f(x),

    f : 𝒳 → [0, ∞),    (2.2)

such that for any A ⊂ 𝒳,

    P[A] = ∫_A f(x) dx
         = ∫ · · · ∫_A f(x_1, . . . , x_N) dx_1 · · · dx_N.    (2.3)

The second line is there to emphasize that we have a multiple integral. (The Lebesgue measure of a subset A of R^N is the integral ∫_A dx, i.e., as if f(x) = 1 in (2.3). Thus if N = 1, the Lebesgue measure of a line segment is its length. In two dimensions, the Lebesgue measure of a set is its area. For N = 3, it is the volume.)

We will call a density f as in (2.3) the pdf, for probability density function. Because P[X ∈ 𝒳] = 1, the integral of the pdf over the entire space 𝒳 must be 1. Random variables or collections that have pdfs are continuous in the sense that the probability X equals a specific value x is 0. (There are continuous distributions that do not have pdfs, such as the uniform distribution on the unit circle.)

If X does have a pdf, then it can be obtained from the distribution function in (2.1) by differentiation:

    f(x_1, . . . , x_N) = ∂^N F(x_1, . . . , x_N) / (∂x_1 · · · ∂x_N).    (2.4)

If the space 𝒳 is a countable (which includes finite) set, then its probability can be given by specifying the probability of each individual point. The probability mass function f, or pmf, with

    f : 𝒳 → [0, 1],    (2.5)

is given by

    f(x) = P[X = x] = P[{x}].    (2.6)

The probability of any subset A is the sum of the probabilities of the individual points in A,

    P[A] = ∑_{x ∈ A} f(x).    (2.7)

Such an X is called discrete. (A pmf is also a density, but with respect to counting measure on 𝒳, not Lebesgue measure.)
Not all random variables are either discrete or continuous, and especially a collection of random variables could have some discrete and some continuous members. In such cases, the probability of a set is found by integrating over the continuous parts and summing over the discrete parts. For example, suppose our collection is a 1 × N vector combining two other collections, i.e.,

    W = (X, Y) has space 𝒲, X is 1 × N_x and Y is 1 × N_y, N = N_x + N_y.    (2.8)

For a subset A ⊂ 𝒲, define the marginal set by

    𝒳^A = { x ∈ R^{N_x} | (x, y) ∈ A for some y },    (2.9)

and the conditional set given X = x by

    𝒴^A_x = { y ∈ R^{N_y} | (x, y) ∈ A }.    (2.10)

Suppose X is discrete and Y is continuous. Then f(x, y) is a mixed-type density for the distribution of W if for any A ⊂ 𝒲,

    P[A] = ∑_{x ∈ 𝒳^A} ∫_{𝒴^A_x} f(x, y) dy.    (2.11)

We will use the generic term density to mean pdf, pmf, or the mixed type of density in (2.11). There are other types of densities, but we will not need to deal with them.
in (2.11). There are other types of densities, but we will not need to deal with them.
2.1.3 Representations
Representations are very useful, especially when no pdf exists. For example, suppose Y = (Y_1, Y_2) is uniform on the unit circle, by which we mean Y has space 𝒴 = { y ∈ R² | ‖y‖ = 1 }, and it is equally likely to be any point on that circle. There is no pdf, because the area of the circle in R² is zero, so the integral over any subset of 𝒴 of any function is zero. The distribution can be thought of in terms of the angle y makes with the x-axis, that is, y is equally likely to be at any angle. Thus we can let X ~ Uniform(0, 2π]: X has space (0, 2π] and pdf f_X(x) = 1/(2π). Then we can define

    Y = (cos(X), sin(X)).    (2.12)
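A short R sketch of this representation (the sample size of 1000 is an arbitrary choice): simulate the angle, push it through cosine and sine, and every point lands exactly on the unit circle.

x <- runif(1000, 0, 2 * pi)     # the angle X
y <- cbind(cos(x), sin(x))      # the representation (2.12)
plot(y, asp = 1, pch = ".")     # points trace out the circle
summary(rowSums(y^2))           # every squared norm equals 1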
In general, suppose we are given the distribution for X with space 𝒳 and function g,

    g : 𝒳 → 𝒴.    (2.13)

Then for any B ⊂ 𝒴, we can define the probability of Y by

    P[Y ∈ B] = P[g(X) ∈ B] = P[X ∈ g⁻¹(B)].    (2.14)

We know the final probability because g⁻¹(B) ⊂ 𝒳.

One special type of function yields marginal distributions, analogous to the marginals in Section 1.5, that picks off some of the components. Consider the setup in (2.8). The marginal function for X simply chooses the X components:

    g(x, y) = x.    (2.15)

The space of X is then given by (2.9) with A = 𝒲, i.e.,

    𝒳 ≡ 𝒳^𝒲 = { x ∈ R^{N_x} | (x, y) ∈ 𝒲 for some y }.    (2.16)
If f(x, y) is the density for (X, Y), then the density of X can be found by integrating (or summing) out the y. That is, if f is a pdf, then f_X(x) is the pdf for X, where

    f_X(x) = ∫_{𝒴_x} f(x, y) dy,    (2.17)

and

    𝒴_x ≡ 𝒴^𝒲_x = { y ∈ R^{N_y} | (x, y) ∈ 𝒲 }    (2.18)

is the conditional space (2.10) with A = 𝒲. If y has some discrete components, then they are summed in (2.17).

Note that we can find the marginals of any subset, not just sets of consecutive elements. E.g., if X = (X_1, X_2, X_3, X_4, X_5), we can find the marginal of (X_2, X_4, X_5) by integrating out the X_1 and X_3.

Probability distributions can also be represented through conditioning, discussed in the next section.
2.1.4 Conditional distributions
The conditional distribution of one or more variables given another set of variables, the relationship of cause to effect, is central to multivariate analysis. E.g., what is the distribution of health measures given diet, smoking, and ethnicity? We start with the two collections of variables Y and X, each of which may be a random variable, vector, matrix, etc. We want to make sense of the notion

    Conditional distribution of Y given X = x, written Y | X = x.    (2.19)

What this means is that for each fixed value x, there is a possibly different distribution for Y.

Very generally, such conditional distributions will exist, though they may be hard to figure out, even what they mean. In the discrete case, the concept is straightforward, and by analogy the case with densities follows. For more general situations, we will use properties of conditional distributions rather than necessarily specifying them.

We start with the (X, Y) as in (2.8), and assume we have their joint distribution P. The word joint is technically unnecessary, but helps to emphasize that we are considering the two collections together. The joint space is 𝒲, and let 𝒳 denote the marginal space of X as in (2.16), and for each x ∈ 𝒳, the conditional space of Y given X = x, 𝒴_x, is given in (2.18). For example, if the space 𝒲 = { (x, y) | 0 < x < y < 1 }, then 𝒳 = (0, 1), and for x ∈ 𝒳, 𝒴_x = (x, 1).

Next, given the joint distribution of (X, Y), we define the conditional distribution (2.19) in the discrete, then pdf, cases.
Discrete case

For sets A and B, the conditional probability of A given B is defined as

    P[A | B] = P[A ∩ B] / P[B] if B ≠ ∅.    (2.20)

If B is empty, then the conditional probability is not defined, since we would have 0/0. For a discrete pair (X, Y), let f(x, y) be the pmf. Then the conditional distribution of Y given X = x can be specified by

    P[Y = y | X = x], for x ∈ 𝒳, y ∈ 𝒴_x,    (2.21)

at least if P[X = x] > 0. The expression in (2.21) is, for fixed x, the conditional pmf for Y:

    f_{Y|X}(y | x) = P[Y = y | X = x]
                   = P[Y = y and X = x] / P[X = x]
                   = f(x, y) / f_X(x), y ∈ 𝒴_x,    (2.22)

if f_X(x) > 0, where f_X(x) is the marginal pmf of X from (2.17) with sums.
Pdf case

In the discrete case, the restriction that P[X = x] > 0 is not worrisome, since the chance is 0 that we will have an x with P[X = x] = 0. In the continuous case, we cannot follow the same procedure, since P[X = x] = 0 for all x ∈ 𝒳. However, if we have pdfs, or general densities, we can analogize (2.22) and declare that the conditional density of Y given X = x is

    f_{Y|X}(y | x) = f(x, y) / f_X(x), y ∈ 𝒴_x,    (2.23)

if f_X(x) > 0. In this case, as in the discrete one, the restriction that f_X(x) > 0 is not worrisome, since the set on which X has density zero has probability zero. It turns out that the definition (2.23) is mathematically legitimate.

The Y and X can be very general. Often, both will be functions of a collection of random variables, so that we may be interested in conditional distributions of the type

    g(Y) | h(Y) = z    (2.24)

for some functions g and h.
Reconstructing the joint distribution

Note that if we are given the marginal space and density for X, and the conditional spaces and densities for Y given X = x, then we can reconstruct the joint space and joint density:

    𝒲 = { (x, y) | y ∈ 𝒴_x, x ∈ 𝒳 } and f(x, y) = f_{Y|X}(y | x) f_X(x).    (2.25)

Thus another way to represent a distribution for Y is to specify the conditional distribution given some X = x, and the marginal of X. The marginal distribution of Y is then found by first finding the joint as in (2.25), then integrating out the x:

    f_Y(y) = ∫_{𝒳_y} f_{Y|X}(y | x) f_X(x) dx.    (2.26)
2.2 Expected values
Means, variances, and covariances (Section 1.4) are key sample quantities in describing data. Similarly, they are important for describing random variables. These are all expected values of some function, defined next.

Definition 2.1 (Expected value). Suppose X has space 𝒳, and consider the real-valued function g,

    g : 𝒳 → R.    (2.27)

If X has pdf f, then the expected value of g(X), E[g(X)], is

    E[g(X)] = ∫_𝒳 g(x) f(x) dx    (2.28)

if the integral converges. If X has pmf f, then

    E[g(X)] = ∑_{x ∈ 𝒳} g(x) f(x)    (2.29)

if the sum converges.

As in (2.11), if the collection is (X, Y), where X is discrete and Y is continuous, and f(x, y) is its mixed-type density, then for function g(x, y),

    E[g(X, Y)] = ∑_{x ∈ 𝒳} ∫_{𝒴_x} g(x, y) f(x, y) dy    (2.30)

if everything converges. (The spaces are defined in (2.16) and (2.18).)
Expected values for representations cohere in the proper way, that is, if Y is a collection of random variables such that Y = h(X), then for a function g,

    E[g(Y)] = E[g(h(X))],    (2.31)

if the latter exists. Thus we often can find the expected values of functions of Y based on the distribution of X.
Conditioning

If (X, Y) has a joint distribution, then we can define the conditional expectation of g(Y) given X = x to be the regular expected value of g(Y), but we use the conditional distribution Y | X = x. In the pdf case, we write

    E[g(Y) | X = x] = ∫_{𝒴_x} g(y) f_{Y|X}(y | x) dy ≡ e_g(x).    (2.32)

Note that the conditional expectation is a function of x. We can then take the expected value of that, using the marginal distribution of X. We end up with the same result (if we end up with anything) as taking the usual expected value of g(Y). That is,

    E[g(Y)] = E[ E[g(Y) | X = x] ].    (2.33)

There is a bit of a notational glitch in the formula, since the inner expected value is a function of x, a constant, and we really want to take the expected value over X. We cannot just replace x with X, however, because then we would have the undesired E[g(Y) | X = X]. So a more precise way to express the result is to use the e_g(x) in (2.32), so that

    E[g(Y)] = E[e_g(X)].    (2.34)

This result holds in general. It is not hard to see in the pdf case:

    E[e_g(X)] = ∫_𝒳 e_g(x) f_X(x) dx
              = ∫_𝒳 { ∫_{𝒴_x} g(y) f_{Y|X}(y | x) dy } f_X(x) dx    by (2.32)
              = ∫_𝒳 ∫_{𝒴_x} g(y) f(x, y) dy dx    by (2.25)
              = ∫_𝒲 g(y) f(x, y) dx dy    by (2.25)
              = E[g(Y)].    (2.35)

A useful corollary is the total probability formula: For B ⊂ 𝒴, if X has a pdf,

    P[Y ∈ B] = ∫_𝒳 P[Y ∈ B | X = x] f_X(x) dx.    (2.36)

If X has a pmf, then we sum. The formula follows by taking g to be the indicator function I_B, given as

    I_B(y) = 1 if y ∈ B, and I_B(y) = 0 if y ∉ B.    (2.37)
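A small simulation sketch of (2.34), with a toy model chosen here for illustration (it is not from the text): X ~ Uniform(0, 1), Y | X = x ~ N(x, 1), and g(y) = y², so that e_g(x) = 1 + x².

n <- 1e5
x <- runif(n)
y <- rnorm(n, mean = x, sd = 1)   # Y | X = x ~ N(x, 1)
mean(y^2)                         # direct estimate of E[g(Y)]
mean(1 + x^2)                     # estimate of E[e_g(X)]; the two agree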
2.3 Means, variances, and covariances
Means, variances, and covariances are particular expected values. For a collection of random variables X = (X_1, . . . , X_N), the mean of X_j is its expected value, E[X_j]. (Throughout this section, we will be acting as if the expected values exist. So if E[X_j] doesn't exist, then the mean of X_j doesn't exist, but we might not explicitly mention that.) Often the mean is denoted by μ, so that E[X_j] = μ_j.

The variance of X_j, often denoted σ_j² or σ_jj, is

    σ_jj = Var[X_j] = E[(X_j − μ_j)²].    (2.38)

The covariance between X_j and X_k is defined to be

    σ_jk = Cov[X_j, X_k] = E[(X_j − μ_j)(X_k − μ_k)].    (2.39)

Their correlation coefficient is

    Corr[X_j, X_k] = ρ_jk = σ_jk / √(σ_jj σ_kk),    (2.40)

if both variances are positive. Compare these definitions to those of the sample analogs, (1.3), (1.4), (1.5), and (1.6). So, e.g., Var[X_j] = Cov[X_j, X_j].

The mean of the collection X is the corresponding collection of means. That is,

    μ = E[X] = (E[X_1], . . . , E[X_N]).    (2.41)
2.3.1 Vectors and matrices
If a collection has a particular structure, then its mean has the same structure. That is, if X is a row vector as in (2.41), then E[X] = (E[X_1], . . . , E[X_N]). If X is a column vector, so is its mean. Similarly, if W is an n × p matrix, then so is its mean. That is,

    E[W] = E ⎡ W_11   W_12   · · ·  W_1p ⎤
             ⎢ W_21   W_22   · · ·  W_2p ⎥
             ⎢  ⋮      ⋮              ⋮  ⎥
             ⎣ W_n1   W_n2   · · ·  W_np ⎦

         =   ⎡ E[W_11]  E[W_12]  · · ·  E[W_1p] ⎤
             ⎢ E[W_21]  E[W_22]  · · ·  E[W_2p] ⎥
             ⎢   ⋮        ⋮                ⋮    ⎥
             ⎣ E[W_n1]  E[W_n2]  · · ·  E[W_np] ⎦ .    (2.42)
Turning to variances and covariances, first suppose that X is a vector (row or column). There are N variances and (N choose 2) covariances among the X_j's to consider, recognizing that Cov[X_j, X_k] = Cov[X_k, X_j]. By convention, we will arrange them into a matrix, the variance-covariance matrix, or simply covariance matrix of X:

    Σ = Cov[X]
      = ⎡ Var[X_1]        Cov[X_1, X_2]  · · ·  Cov[X_1, X_N] ⎤
        ⎢ Cov[X_2, X_1]   Var[X_2]       · · ·  Cov[X_2, X_N] ⎥
        ⎢   ⋮               ⋮                      ⋮          ⎥
        ⎣ Cov[X_N, X_1]   Cov[X_N, X_2]  · · ·  Var[X_N]      ⎦ ,    (2.43)

so that the elements of Σ are the σ_jk's. Compare this arrangement to that of the sample covariance matrix (1.17). If X is a row vector, and μ = E[X], a convenient expression for its covariance is

    Cov[X] = E[ (X − μ)′(X − μ) ].    (2.44)

Similarly, if X is a column vector, Cov[X] = E[(X − μ)(X − μ)′].

Now suppose X is a matrix as in (2.42). Notice that individual components have double subscripts: X_ij. We need to decide how to order the elements in order to describe its covariance matrix. We will use the convention that the elements are strung out by row, so that row(X) is the 1 × N vector, N = np, given by

    row(X) = (X_11, X_12, . . . , X_1p, X_21, X_22, . . . , X_2p, . . . , X_n1, X_n2, . . . , X_np).    (2.45)

Then Cov[X] is defined to be Cov[row(X)], which is an (np) × (np) matrix.

One more covariance: The covariance between two vectors is defined to be the matrix containing all the individual covariances of one variable from each vector. That is, if X is 1 × p and Y is 1 × q, then the p × q matrix of covariances is

    Cov[X, Y] = ⎡ Cov[X_1, Y_1]  Cov[X_1, Y_2]  · · ·  Cov[X_1, Y_q] ⎤
                ⎢ Cov[X_2, Y_1]  Cov[X_2, Y_2]  · · ·  Cov[X_2, Y_q] ⎥
                ⎢    ⋮              ⋮                     ⋮          ⎥
                ⎣ Cov[X_p, Y_1]  Cov[X_p, Y_2]  · · ·  Cov[X_p, Y_q] ⎦ .    (2.46)
2.3.2 Moment generating functions
The moment generating function (mgf for short) of X is a function from R^N to [0, ∞] given by

    M_X(t) = M_X(t_1, . . . , t_N) = E[ e^{t_1 X_1 + · · · + t_N X_N} ] = E[ e^{Xt′} ]    (2.47)

for t = (t_1, . . . , t_N). It is very useful in distribution theory, especially for convolutions (sums of independent random variables), asymptotics, and for generating moments. The main use we have is that the mgf determines the distribution:

Theorem 2.1 (Uniqueness of MGF). If for some ε > 0,

    M_X(t) < ∞ and M_X(t) = M_Y(t) for all t such that ‖t‖ < ε,    (2.48)

then X and Y have the same distribution.

See Ash [1970] for an approach to proving this result. The mgf does not always exist, that is, often the integral or sum defining the expected value diverges. That is ok, as long as it is finite for t in a neighborhood of 0. If one knows complex variables, the characteristic function is handy because it always exists. It is defined as φ_X(t) = E[exp(iXt′)].

If a distribution's mgf is finite when ‖t‖ < ε for some ε > 0, then all of its moments are finite, and can be calculated via differentiation:

    E[X_1^{k_1} · · · X_N^{k_N}] = ∂^K M_X(t) / (∂t_1^{k_1} · · · ∂t_N^{k_N}) |_{t=0},    (2.49)

where the k_i are nonnegative integers, and K = k_1 + · · · + k_N. See Exercise 2.7.21.
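As a quick illustration (a standard univariate fact, stated here without derivation): if X ~ N(μ, σ²), then M_X(t) = exp(μt + σ²t²/2), which is finite for all t. Differentiating once and setting t = 0 gives E[X] = μ; differentiating twice gives E[X²] = μ² + σ², so that Var[X] = σ², in line with (2.49).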
2.4 Independence
Two sets of random variables are independent if the values of one set do not affect the values of the other. More precisely, suppose the collection is (X, Y) as in (2.8), with space 𝒲. Let 𝒳 and 𝒴 be the marginal spaces (2.16) of X and Y, respectively. First, we need the following:

Definition 2.2. If A ⊂ R^K and B ⊂ R^L, then A × B is a rectangle, the subset of R^{K+L} given by

    A × B = { (y, z) ∈ R^{K+L} | y ∈ A and z ∈ B }.    (2.50)

Now for the main definition.
Definition 2.3. Given the setup above, the collections X and Y are independent if 𝒲 = 𝒳 × 𝒴, and for every A ⊂ 𝒳 and B ⊂ 𝒴,

    P[(X, Y) ∈ A × B] = P[X ∈ A] P[Y ∈ B].    (2.51)

In the definition, the left-hand side uses the joint probability distribution for (X, Y), and the right-hand side uses the marginal probabilities for X and Y, respectively.

If the joint collection (X, Y) has density f, then X and Y are independent if and only if 𝒲 = 𝒳 × 𝒴, and

    f(x, y) = f_X(x) f_Y(y) for all x ∈ 𝒳 and y ∈ 𝒴,    (2.52)

where f_X and f_Y are the marginal densities (2.17) of X and Y, respectively. (Technically, (2.52) only has to hold with probability one. Also, except for sets of probability zero, the requirements (2.51) or (2.52) imply that 𝒲 = 𝒳 × 𝒴, so that the requirement we place on the spaces is redundant. But we keep it for emphasis.)

A useful result is that X and Y are independent if and only if

    E[g(X) h(Y)] = E[g(X)] E[h(Y)]    (2.53)

for all functions g : 𝒳 → R and h : 𝒴 → R with finite expectation.

The last expression can be used to show that independent variables have covariance equal to 0. If X and Y are independent random variables with finite expectations, then

    Cov[X, Y] = E[(X − E[X])(Y − E[Y])]
              = E[(X − E[X])] E[(Y − E[Y])]
              = 0.    (2.54)

The second equality uses (2.53), and the final equality uses that E[X − E[X]] = E[X] − E[X] = 0. Be aware that the reverse is not true, that is, variables can have 0 covariance but still not be independent.
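A quick R illustration of that warning (the example, X standard normal and Y = X², is chosen here for convenience):

x <- rnorm(1e5)
y <- x^2
cov(x, y)     # essentially 0, since E[X^3] = 0
cor(x^2, y)   # yet Y is a deterministic function of X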
If the collections X and Y are independent, then Cov[X_k, Y_l] = 0 for all k, l, so that

    Cov[(X, Y)] = ⎡ Cov[X]    0     ⎤
                  ⎣   0     Cov[Y]  ⎦ ,    (2.55)

at least if the covariances exist. (Throughout this book, 0 represents a matrix of zeroes, its dimension implied by the context.)
Collections Y and X are independent if and only if the conditional distribution of Y given X = x does not depend on x. If (X, Y) has a pdf or pmf, this property is easy to see. If X and Y are independent, then 𝒴_x = 𝒴 since 𝒲 = 𝒳 × 𝒴, and by (2.23) and (2.52),

    f_{Y|X}(y | x) = f(x, y) / f_X(x) = f_Y(y) f_X(x) / f_X(x) = f_Y(y),    (2.56)

so that the conditional distribution does not depend on x. On the other hand, if the conditional distribution does not depend on x, then the conditional space and pdf cannot depend on x, in which case they are the marginal space and pdf, so that 𝒲 = 𝒳 × 𝒴 and

    f(x, y) / f_X(x) = f_Y(y) ⟹ f(x, y) = f_X(x) f_Y(y).    (2.57)
So far, we have treated independence of just two sets of variables. Everything can be easily extended to any finite number of sets. That is, suppose X_1, . . . , X_S are collections of random variables, with N_s and 𝒳_s being the dimension and space for X_s, and X = (X_1, . . . , X_S), with dimension N = N_1 + · · · + N_S and space 𝒳.

Definition 2.4. Given the setup above, the collections X_1, . . . , X_S are mutually independent if 𝒳 = 𝒳_1 × · · · × 𝒳_S, and for every set of subsets A_s ⊂ 𝒳_s,

    P[(X_1, . . . , X_S) ∈ A_1 × · · · × A_S] = P[X_1 ∈ A_1] · · · P[X_S ∈ A_S].    (2.58)

In particular, X_1, . . . , X_S being mutually independent implies that every pair X_i, X_j (i ≠ j) is independent. The reverse need not be true, however, that is, each pair could be independent without having all mutually independent. Analogs of the equivalences in (2.52) to (2.53) hold for this case, too. E.g., X_1, . . . , X_S are mutually independent if and only if

    E[g_1(X_1) · · · g_S(X_S)] = E[g_1(X_1)] · · · E[g_S(X_S)]    (2.59)

for all functions g_s : 𝒳_s → R, s = 1, . . . , S, with finite expectation.
A common situation is that the individual random variables X_i's in X are mutually independent. Then, e.g., if there are densities,

    f(x_1, . . . , x_N) = f_1(x_1) · · · f_N(x_N),    (2.60)

where f_j is the density of X_j. Also, if the variances exist, the covariance matrix is diagonal:

    Cov[X] = ⎡ Var[X_1]     0        · · ·     0      ⎤
             ⎢    0      Var[X_2]    · · ·     0      ⎥
             ⎢    ⋮         ⋮                  ⋮      ⎥
             ⎣    0         0        · · ·  Var[X_N]  ⎦ .    (2.61)
2.5 Additional properties of conditional distributions
The properties that follow are straightforward to prove in the discrete case. They still
hold for the continuous and more general cases, but are not always easy to prove.
See Exercises 2.7.6 to 2.7.15.
Plug-in formula

Suppose the collection of random variables is given by (X, Y), and we are interested in the conditional distribution of the function g(X, Y) given X = x. Then

    g(X, Y) | X = x =_D g(x, Y) | X = x.    (2.62)

That is, the conditional distribution of g(X, Y) given X = x is the same as that of g(x, Y) given X = x. (The =_D means "equal in distribution.") Furthermore, if Y and X are independent, we can take off the conditional part at the end of (2.62):

    X and Y independent ⟹ g(X, Y) | X = x =_D g(x, Y).    (2.63)

This property may at first seem so obvious as to be meaningless, but it can be very useful. For example, suppose X and Y are independent N(0, 1)'s, and g(X, Y) = X + Y, so we wish to find X + Y | X = x. The official way is to let W = X + Y and Z = X, and use the transformation of variables to find the space and pdf of (W, Z). One can then figure out 𝒲_z, and use the formula (2.23). Instead, using the plug-in formula with independence (2.63), we have that

    X + Y | X = x =_D x + Y,    (2.64)

which we immediately realize is N(x, 1).
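A small simulation sketch of (2.64), conditioning approximately by keeping only the draws with X near 1 (the tolerance 0.01 is an arbitrary choice):

n <- 1e6
x <- rnorm(n); y <- rnorm(n)
w <- (x + y)[abs(x - 1) < 0.01]   # values of X + Y given X is (nearly) 1
c(mean(w), var(w))                # roughly 1 and 1, as N(1, 1) predicts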
Conditional independence

Given a set of three collections, (X, Y, Z), X and Y are said to be conditionally independent given Z = z if

    P[(X, Y) ∈ A × B | Z = z] = P[X ∈ A | Z = z] P[Y ∈ B | Z = z],    (2.65)

for sets A ⊂ 𝒳_z and B ⊂ 𝒴_z, as in (2.51). If further X is independent of Z, then X is independent of the combined (Y, Z).
Dependence on x only through a function

If the conditional distribution of Y given X = x depends on x only through the function h(x), then that conditional distribution is the same as the conditional distribution given h(X) = h(x). Symbolically, if v = h(x),

    Y | X = x =_D Y | h(X) = v.    (2.66)

As an illustration, suppose (X, Y) is uniformly distributed over the unit disk, so that the pdf is f(x, y) = 1/π for x² + y² < 1. Then it can be shown that

    Y | X = x ~ Uniform(−√(1 − x²), √(1 − x²)).    (2.67)

Note that the distribution depends on x only through h(x) = x², so that, e.g., conditioning on X = 1/2 is the same as conditioning on X = −1/2. The statement (2.66) then yields

    Y | X² = v ~ Uniform(−√(1 − v), √(1 − v)).    (2.68)

That is, we have managed to turn a statement about conditioning on X to one about conditioning on X².
Variance decomposition

The formula (2.34) shows that the expected value of g(Y) is the expected value of the conditional expected value, e_g(X). A similar formula holds for the variance, but it is not simply that the variance is the expected value of the conditional variance. Using the well-known identity Var[Z] = E[Z²] − E[Z]² on Z = g(Y), as well as (2.34) on g(Y) and g(Y)², we have

    Var[g(Y)] = E[g(Y)²] − E[g(Y)]² = E[e_{g²}(X)] − E[e_g(X)]².    (2.69)

The identity holds conditionally as well, i.e.,

    v_g(x) = Var[g(Y) | X = x] = E[g(Y)² | X = x] − E[g(Y) | X = x]² = e_{g²}(x) − e_g(x)².    (2.70)

Taking the expected value over X in (2.70), we have

    E[v_g(X)] = E[e_{g²}(X)] − E[e_g(X)²].    (2.71)

Comparing (2.69) and (2.71), we see the difference lies in where the square is in the second terms. Thus

    Var[g(Y)] = E[v_g(X)] + E[e_g(X)²] − E[e_g(X)]²
              = E[v_g(X)] + Var[e_g(X)],    (2.72)

now using the identity on e_g(X). Thus the variance of g(Y) equals the variance of the conditional expected value plus the expected value of the conditional variance.

For a collection Y of random variables, with

    e_Y(x) = E[Y | X = x] and v_Y(x) = Cov[Y | X = x],    (2.73)

(2.72) extends to

    Cov[Y] = E[v_Y(X)] + Cov[e_Y(X)].    (2.74)

See Exercise 2.7.12.
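A simulation sketch of (2.72), with a toy model chosen here for illustration: X ~ Uniform(0, 1) and Y | X = x ~ N(x, 1), so that for g(y) = y we have E[v_g(X)] + Var[e_g(X)] = 1 + 1/12.

n <- 1e5
x <- runif(n)
y <- rnorm(n, mean = x, sd = 1)
var(y)       # direct estimate of Var[Y]
1 + 1/12     # E[conditional variance] + Var[conditional mean]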
Bayes theorem

Bayes formula reverses conditional distributions, that is, it takes the conditional distribution of Y given X, and the marginal of X, and returns the conditional distribution of X given Y. Bayesian inference is based on this formula, starting with the distribution of the data given the parameters, and a marginal ("prior") distribution of the parameters, and producing the conditional distribution ("posterior") of the parameters given the data. Inferences are then based on this posterior, which is the distribution one desires because the data are observed while the parameters are not.

Theorem 2.2 (Bayes). In the setup of (2.8), suppose that the conditional density of Y given X = x is f_{Y|X}(y | x), and the marginal density of X is f_X(x). Then for (x, y) ∈ 𝒲, the conditional density of X given Y = y is

    f_{X|Y}(x | y) = f_{Y|X}(y | x) f_X(x) / ∫_{𝒳_y} f_{Y|X}(y | z) f_X(z) dz.    (2.75)

Proof. From (2.23) and (2.25),

    f_{X|Y}(x | y) = f(x, y) / f_Y(y) = f_{Y|X}(y | x) f_X(x) / f_Y(y).    (2.76)

By (2.26), using z for x, to avoid confusion with the x in (2.76),

    f_Y(y) = ∫_{𝒳_y} f_{Y|X}(y | z) f_X(z) dz,    (2.77)

which, substituted in the denominator of (2.76), shows (2.75).
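Here is a grid-approximation sketch of (2.75) in R; the Beta(2, 2) prior for X, the Binomial(10, x) model for Y given X = x, and the observed value y = 7 are arbitrary choices for illustration.

xg <- seq(0.001, 0.999, length = 999)                  # grid over the space of X
prior <- dbeta(xg, 2, 2)                               # f_X(x)
lik <- dbinom(7, 10, xg)                               # f_{Y|X}(7 | x)
post <- lik * prior / sum(lik * prior * diff(xg)[1])   # numerator over the integral
plot(xg, post, type = "l")                             # conditional density of X given Y = 7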
2.6 Affine transformations
In Section 1.5, linear combinations of the data were used heavily. Here we consider the distributional analogs of linear functions, or their extensions, affine transformations. For a single random variable X, an affine transformation is a + bX for constants a and b. Equation (2.82) is an example of an affine transformation with two random variables.

More generally, an affine transformation of a collection of N random variables X is a collection of M random variables Y where

    Y_j = a_j + b_j1 X_1 + · · · + b_jN X_N, j = 1, . . . , M,    (2.78)

the a_j's and b_jk's being constants. Note that marginals are examples of affine transformations: the a_j's are 0, and most of the b_jk's are 0, and a few are 1. Depending on how the elements of X and Y are arranged, affine transformations can be written as a matrix equation. For example, if X and Y are row vectors, and B is M × N, then

    Y = a + XB′,    (2.79)

where B is the matrix of b_jk's, and a = (a_1, . . . , a_M). If X and Y are column vectors, then the equation is Y = a′ + BX. For an example using matrices, suppose X is n × p, C is m × n, D is q × p, and A is m × q, and

    Y = A + CXD′.    (2.80)

Then Y is an m × q matrix, each of whose elements is some affine transformation of the elements of X. The relationship between the b_jk's and the elements of C and D is somewhat complicated but could be made explicit, if desired. Look ahead to (3.32d), if interested.
Expectations are linear, that is, for any random variables (X, Y) and constant c,

    E[cX] = cE[X] and E[X + Y] = E[X] + E[Y],    (2.81)

which can be seen from (2.28) and (2.29) by the linearity of integrals and sums. Considering any constant a as a (nonrandom) random variable, with E[a] = a, (2.81) can be used to show, e.g.,

    E[a + bX + cY] = a + bE[X] + cE[Y].    (2.82)

The mean of an affine transformation is the affine transformation of the mean. This property follows from (2.81) as in (2.82), i.e., for (2.78),

    E[Y_j] = a_j + b_j1 E[X_1] + · · · + b_jN E[X_N], j = 1, . . . , M.    (2.83)

If the collections are arranged as vectors or matrices, then so are the means, so that for the row vector (2.79) and matrix (2.80) examples, one has, respectively,

    E[Y] = a + E[X]B′ and E[Y] = A + CE[X]D′.    (2.84)

The covariance matrix of Y can be obtained from that of X. It is a little more involved than for the means, but not too bad, at least in the vector case. Suppose X and Y are row vectors, and (2.79) holds. Then from (2.44),

    Cov[Y] = E[ (Y − E[Y])′(Y − E[Y]) ]
           = E[ (a + XB′ − (a + E[X]B′))′ (a + XB′ − (a + E[X]B′)) ]
           = E[ (XB′ − E[X]B′)′ (XB′ − E[X]B′) ]
           = E[ B(X − E[X])′(X − E[X])B′ ]
           = B E[ (X − E[X])′(X − E[X]) ] B′    by the second part of (2.84)
           = B Cov[X] B′.    (2.85)

Compare this formula to the sample version in (1.27). Though modest looking, the formula Cov[XB′] = B Cov[X] B′ is extremely useful. It is often called a sandwich formula, with the B's as the slices of bread. The formula for column vectors is the same. Compare this result to the familiar one from univariate analysis: Var[a + bX] = b² Var[X]. Also, we already saw the sample version of (2.85) in (1.27).
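A quick numerical sketch of the sandwich formula (the particular covariance matrix and B below are arbitrary choices): the sample covariance of Y = XB′ matches B applied to the sample covariance of X, just as in (1.27).

n <- 1e4
Sigma <- matrix(c(2, 1, 0,  1, 3, 1,  0, 1, 1), 3, 3)
X <- matrix(rnorm(3 * n), n, 3) %*% chol(Sigma)   # rows have covariance Sigma
B <- matrix(c(1, 0, 2, -1, 1, 1), 2, 3)           # a 2 x 3 coefficient matrix
var(X %*% t(B))                                   # sample covariance of Y = XB'
B %*% var(X) %*% t(B)                             # the sandwich formula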
For matrices, we again will wait. (We are waiting for Kronecker products as in Definition 3.5, in case you are wondering.)
2.7 Exercises
Exercise 2.7.1. Consider the pair of random variables (X, Y), where X is discrete and Y is continuous. Their space is

    𝒲 = { (x, y) | x ∈ {1, 2, 3} & 0 < y < x },    (2.86)

and their mixed-type density is

    f(x, y) = (x + y)/21.    (2.87)

Let A = { (x, y) ∈ 𝒲 | y ≤ x/2 }. (It is a good idea to sketch 𝒲 and A.) (a) Find 𝒳^A. (b) Find 𝒴^A_x for each x ∈ 𝒳^A. (c) Find P[A]. (d) Find the marginal density and space of X. (e) Find the marginal space of Y. (f) Find the conditional space of X given Y, 𝒳_y, for each y. (Do it separately for y ∈ (0, 1), y ∈ [1, 2) and y ∈ [2, 3).) (g) Find the marginal density of Y.
Exercise 2.7.2. Given the setup in (2.8) through (2.10), show that for A ⊂ 𝒲,

    A = { (x, y) | x ∈ 𝒳^A and y ∈ 𝒴^A_x } = { (x, y) | y ∈ 𝒴^A and x ∈ 𝒳^A_y }.    (2.88)

Exercise 2.7.3. Verify (2.17), that is, given B ⊂ 𝒳, show that

    P[X ∈ B] = ∫_B { ∫_{𝒴_x} f(x, y) dy } dx.    (2.89)

[Hint: Show that for A = { (x, y) | x ∈ B and y ∈ 𝒴_x }, x ∈ B if and only if (x, y) ∈ A, so that P[X ∈ B] = P[(X, Y) ∈ A]. Then note that the latter probability is ∫_A f(x, y) dx dy, which with some interchanging equals the right-hand side of (2.89).]
Exercise 2.7.4. Show that X and Y are independent if and only if E[g(X)h(Y)] = E[g(X)]E[h(Y)] as in (2.53) for all g and h with finite expectations. You can assume densities exist, i.e., (2.52). [Hint: To show independence implies (2.53), write out the sums/integrals. For the other direction, consider indicator functions for g and h as in (2.37).]
Exercise 2.7.5. Prove (2.31), E[g(Y)] = E[g(h(X))] for Y = h(X), in the discrete case. [Hint: Start by writing

    f_Y(y) = P[Y = y] = P[h(X) = y] = ∑_{x ∈ 𝒳_y} f_X(x),    (2.90)

where 𝒳_y = { x ∈ 𝒳 | h(x) = y }. Then

    E[g(Y)] = ∑_{y ∈ 𝒴} g(y) ∑_{x ∈ 𝒳_y} f_X(x) = ∑_{y ∈ 𝒴} ∑_{x ∈ 𝒳_y} g(y) f_X(x).    (2.91)

In the inner summation in the final expression, h(x) is always equal to y. (Why?) Substitute h(x) for y in the g, then. Now the summand is free of y. Argue that the double summation is the same as summing over x ∈ 𝒳, yielding ∑_{x ∈ 𝒳} g(h(x)) f_X(x) = E[g(h(X))].]
Exercise 2.7.6. (a) Prove the plug-in formula (2.62) in the discrete case. [Hint: For z in the range of g, write P[g(X, Y) = z | X = x] = P[g(X, Y) = z and X = x]/P[X = x], then note that in the numerator, the X can be replaced by x.] (b) Prove (2.63). [Hint: Follow the proof in part (a), then note the two events {g(x, Y) = z} and {X = x} are independent.]

Exercise 2.7.7. Suppose (X, Y, Z) has a discrete distribution, X and Y are conditionally independent given Z (as in (2.65)), and X and Z are independent. Show that (X, Y) is independent of Z. [Hint: Use the total probability formula (2.36) on P[X ∈ A and (Y, Z) ∈ B], conditioning on Z. Then argue that the summand can be written

    P[X ∈ A and (Y, Z) ∈ B | Z = z] = P[X ∈ A and (Y, z) ∈ B | Z = z]
                                    = P[X ∈ A | Z = z] P[(Y, z) ∈ B | Z = z].    (2.92)

Use the independence of X and Z on the first probability in the final expression, and bring it out of the summation.]
Exercise 2.7.8. Prove (2.67). [Hint: Find 𝒴_x and the marginal f_X(x).]
Exercise 2.7.9. Suppose Y = (Y_1, Y_2, Y_3, Y_4) is multinomial with parameters n and p = (p_1, p_2, p_3, p_4). Thus n is a positive integer, the p_i's are positive and sum to 1, and the Y_i's are nonnegative integers that sum to n. The pmf is

    f(y) = (n choose y_1, y_2, y_3, y_4) p_1^{y_1} · · · p_4^{y_4},    (2.93)

where (n choose y_1, y_2, y_3, y_4) = n!/(y_1! · · · y_4!). Consider the conditional distribution of (Y_1, Y_2) given (Y_3, Y_4) = (c, d). (a) What is the conditional space of (Y_1, Y_2) given (Y_3, Y_4) = (c, d)? Give Y_2 as a function of Y_1, c, and d. What is the conditional range of Y_1? (b) Write the conditional pmf of (Y_1, Y_2) given (Y_3, Y_4) = (c, d), and simplify noting that

    (n choose y_1, y_2, c, d) = (n choose n−c−d, c, d)(n−c−d choose y_1, y_2).    (2.94)

What is the conditional distribution of Y_1 | (Y_3, Y_4) = (c, d)? (c) What is the conditional distribution of Y_1 given Y_3 + Y_4 = a?
Exercise 2.7.10. Prove (2.44). [Hint: Write out the elements of the matrix (X − μ)′(X − μ), then use (2.42).]

Exercise 2.7.11. Suppose X, 1 × N, has finite covariance matrix. Show that Cov[X] = E[X′X] − E[X]′E[X].

Exercise 2.7.12. (a) Prove the variance decomposition holds for the 1 × q vector Y, as in (2.74). (b) Write Cov[Y_i, Y_j] as a function of the conditional quantities Cov[Y_i, Y_j | X = x], E[Y_i | X = x], and E[Y_j | X = x].
Exercise 2.7.13. The Beta-binomial(n, α, β) distribution is a mixture of binomial distributions. That is, suppose Y given X = x is Binomial(n, x) (f_Y(y) = (n choose y) x^y (1 − x)^{n−y} for y = 0, 1, . . . , n), and X is (marginally) Beta(α, β):

    f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}, x ∈ (0, 1),    (2.95)

where Γ is the gamma function,

    Γ(α) = ∫_0^∞ u^{α−1} e^{−u} du, α > 0.    (2.96)

(a) Find the marginal pdf of Y. (b) The conditional mean and variance of Y are nx and nx(1 − x). (Right?) The unconditional mean and variance of X are α/(α + β) and αβ/((α + β)²(α + β + 1)). What are the unconditional mean and variance of Y? (c) Compare the variance of a Binomial(n, p) to that of a Beta-binomial(n, α, β), where p = α/(α + β). (d) Find the joint density of (X, Y). (e) Find the pmf of the beta-binomial. [Hint: Notice that the part of the joint density depending on x looks like a Beta pdf, but without the constant. Thus integrating out x yields the reciprocal of the constant.]
Exercise 2.7.14 (Bayesian inference). This question develops Bayesian inference for a binomial. Suppose

    Y | P = p ~ Binomial(n, p) and P ~ Beta(α_0, β_0),    (2.97)

that is, the probability of success P has a beta prior. (a) Show that the posterior distribution is

    P | Y = y ~ Beta(α_0 + y, β_0 + n − y).    (2.98)

The beta prior is called the conjugate prior for the binomial p, meaning the posterior has the same form, but with updated parameters. [Hint: Exercise 2.7.13 (d) has the joint density of (P, Y).] (b) Find the posterior mean, E[P | Y = y]. Show that it can be written as a weighted mean of the sample proportion p̂ = y/n and the prior mean p_0 = α_0/(α_0 + β_0).
Exercise 2.7.15. Do the mean and variance formulas (2.33) and (2.72) work if g is a function of X and Y? [Hint: Consider the collection (X, W), where W = (X, Y).]
Exercise 2.7.16. Suppose h(y) is a histogram with K equal-sized bins. That is, we
have bins (b_{i−1}, b_i], i = 1, . . . , K, where b_i = b_0 + d·i, d being the width of each bin.
Then

  h(y) = p_i/d if b_{i−1} < y ≤ b_i, i = 1, . . . , K;  h(y) = 0 if y ∉ (b_0, b_K],   (2.99)

where the p_i's are probabilities that sum to 1. Suppose Y is a random variable with
pdf h. For y ∈ (b_0, b_K], let I(y) be y's bin, i.e., I(y) = i if b_{i−1} < y ≤ b_i. (a) What is
the distribution of the random variable I(Y)? Find its mean and variance. (b) Find
the mean and variance of b_{I(Y)} = b_0 + d·I(Y). (c) What is the conditional distribution
of Y given I(Y) = i, for each i = 1, . . . , K? [It is uniform. Over what range?] Find the
conditional mean and variance. (d) Show that unconditionally,

  E[Y] = b_0 + d(E[I] − 1/2)  and  Var[Y] = d^2(Var[I] + 1/12).   (2.100)

(e) Recall the entropy in (1.44). Note that for our pdf, h(Y) = p_{I(Y)}/d. Show that

  Entropy(h) = − Σ_{i=1}^K p_i log(p_i) + log(d),   (2.101)

and for the negentropy in (1.46),

  Negent(h) = (1/2)[1 + log(2π(Var[I] + 1/12))] + Σ_{i=1}^K p_i log(p_i).   (2.102)
Exercise 2.7.17. Suppose for random vector (X, Y), one observes X = x, and wishes
to guess the value of Y by h(x), say, using the least squares criterion: Choose h to
minimize E[q(X, Y)], where q(X, Y) = (Y − h(X))^2. This h is called the regression
function of Y on X. Assume all the relevant means and variances are finite. (a)
Write E[q(X, Y)] as the expected value of the conditional expected value conditioning
on X = x, e_q(x). For fixed x, note that h(x) is a scalar, hence one can minimize
e_q(x) over h(x) using differentiation. What h(x) achieves the minimum conditional
expected value of q? (b) Show that the h found in part (a) minimizes the unconditional
expected value E[q(X, Y)]. (c) Find the value of E[q(X, Y)] for the minimizing h.
Exercise 2.7.18. Continue with Exercise 2.7.17, but this time restrict h to be a linear
function, h(x) = α + βx. Thus we wish to find α and β to minimize E[(Y − α − βX)^2].
The minimizing function is the linear regression function of Y on X. (a) Find the
α and β to minimize E[(Y − α − βX)^2]. [You can differentiate that expected value
directly, without worrying about conditioning.] (b) Find the value of E[(Y − α − βX)^2]
for the minimizing α and β.
Exercise 2.7.19. Suppose Y is 1 × q and X is 1 × p, E[X] = 0, Cov[X] = I_p,
E[Y | X = x] = α + xβ for some p × q matrix β, and Cov[Y | X = x] = Ψ for some q × q
diagonal matrix Ψ. Thus the Y_i's are conditionally uncorrelated given X = x. Find the
unconditional E[Y] and Cov[Y]. The covariance matrix of Y has a factor-analytic
structure, which we will see in Section 10.3. The X_i's are factors that explain the
correlations among the Y_i's. Typically, the factors are not observed.
Exercise 2.7.20. Suppose Y_1, . . . , Y_q are independent 1 × p vectors, where Y_i has
moment generating function M_i(t), i = 1, . . . , q, all of which are finite for ‖t‖ < ε
for some ε > 0. Show that the moment generating function of Y_1 + ··· + Y_q is
M_1(t) ··· M_q(t). For which t is this moment generating function finite?
Exercise 2.7.21. Prove (2.49). It is legitimate to interchange the derivatives and ex-
pectation, and to set t = 0 within the expectation, when ‖t‖ < ε. [Extra credit: Prove
that those operations are legitimate.]
Exercise 2.7.22. The cumulant generating function of X is defined to be c_X(t) =
log(M_X(t)), and, if the function is finite for t in a neighborhood of zero, then the
(k_1, . . . , k_N)-th mixed cumulant is the corresponding mixed derivative of c_X(t) evalu-
ated at zero. (a) For N = 1, find the first four cumulants, κ_1, . . . , κ_4, where

  κ_i = ∂^i c_X(t)/∂t^i |_{t=0}.   (2.103)

Show that κ_3/κ_2^{3/2} is the population analog of skewness (1.42), and κ_4/κ_2^2 is the
population analog of kurtosis (1.43), i.e.,

  κ_3/κ_2^{3/2} = E[(X − μ)^3]/σ^3  and  κ_4/κ_2^2 = E[(X − μ)^4]/σ^4 − 3,   (2.104)

where μ = E[X] and σ^2 = Var[X]. [Write everything in terms of the E[X^k]'s by expanding
the E[(X − μ)^k]'s.] (b) For general N, find the second mixed cumulants, i.e.,

  ∂^2 c_X(t)/∂t_i ∂t_j |_{t=0},  i ≠ j.   (2.105)
Exercise 2.7.23. A study was conducted on people near Newcastle upon Tyne in 1972–
74 [Appleton et al., 1996], and followed up twenty years later. We will focus on 1314
women in the study. The three variables we will consider are Z: age group (three
values); X: whether they smoked or not (in 1974); and Y: whether they were still
alive in 1994. Here are the frequencies:

  Age group   Young (18–34)   Middle (35–64)   Old (65+)
  Smoker?      Yes     No      Yes     No      Yes    No
  Died           5      6       92     59       42   165
  Lived        174    213      262    261        7    28
                                                         (2.106)

(a) Treating proportions in the table as probabilities, find

  P[Y = Lived | X = Smoker]  and  P[Y = Lived | X = Non-smoker].   (2.107)

Who were more likely to live, smokers or non-smokers? (b) Find
P[X = Smoker | Z = z] for z = Young, Middle, and Old. What do you notice? (c) Find

  P[Y = Lived | X = Smoker & Z = z]   (2.108)

and

  P[Y = Lived | X = Non-smoker & Z = z]   (2.109)

for z = Young, Middle, and Old. Adjusting for age group, who were more likely to
live, smokers or non-smokers? (d) Conditionally on age, the relationship between
smoking and living is negative for each age group. Is it true that marginally (not
conditioning on age), the relationship between smoking and living is negative? What
is the explanation? (Simpson's paradox.)
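For readers following along in R, one way to enter the frequencies in (2.106) is as a
2 × 2 × 3 array, after which the marginal and age-specific proportions in parts (a)–(c)
can be obtained with prop.table. The object and dimension names below are of course
arbitrary choices.

  # Enter the frequencies in (2.106): survival x smoking x age group
  surv <- array(c( 5, 174,   6, 213,     # Young: Smoker, Non-smoker
                  92, 262,  59, 261,     # Middle
                  42,   7, 165,  28),    # Old
                dim = c(2, 2, 3),
                dimnames = list(Y = c("Died", "Lived"),
                                X = c("Smoker", "Nonsmoker"),
                                Age = c("Young", "Middle", "Old")))
  # Marginal over age: proportion Lived within each smoking group, as in (2.107)
  prop.table(apply(surv, c(1, 2), sum), margin = 2)
  # Conditional on age: proportion Lived within each smoking group and age group
  prop.table(surv, margin = c(2, 3))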
Exercise 2.7.24. Suppose in a large population, the proportion of people who are
infected with the HIV virus is π = 1/100,000. People can take a blood test to see
whether they have the virus. The test is 99% accurate: The chance the test is positive
given the person has the virus is 99%, and the chance the test is negative given the
person does not have the virus is also 99%. Suppose a randomly chosen person takes
the test. (a) What is the chance that this person does have the virus given that the test
is positive? Is this close to 99%? (b) What is the chance that this person does have the
virus given that the test is negative? Is this close to 1%? (c) Do the probabilities in (a)
and (b) sum to 1?
Exercise 2.7.25. Suppose Z_1, Z_2, Z_3 are iid with P[Z_i = −1] = P[Z_i = +1] = 1/2. Let

  X_1 = Z_1 Z_2,  X_2 = Z_1 Z_3,  X_3 = Z_2 Z_3.   (2.110)

(a) Find the conditional distribution of (X_1, X_2) | Z_1 = +1. Are X_1 and X_2 con-
ditionally independent given Z_1 = +1? (b) Find the conditional distribution of
(X_1, X_2) | Z_1 = −1. Are X_1 and X_2 conditionally independent given Z_1 = −1? (c) Is
(X_1, X_2) independent of Z_1? Are X_1 and X_2 independent (unconditionally)? (d) Are
X_1 and X_3 independent? Are X_2 and X_3 independent? Are X_1, X_2 and X_3 mutually
independent? (e) What is the space of (X_1, X_2, X_3)? (f) What is the distribution of
X_1 X_2 X_3?
Exercise 2.7.26. Yes/no questions: (a) Suppose X_1 and X_2 are independent, X_1 and
X_3 are independent, and X_2 and X_3 are independent. Are X_1, X_2 and X_3 mutually
independent? (b) Suppose X_1, X_2 and X_3 are mutually independent. Are X_1 and X_2
conditionally independent given X_3 = x_3?
Exercise 2.7.27. (a) Let U ~ Uniform(0, 1), so that it has space (0, 1) and pdf f_U(u) = 1.
Find its distribution function (2.1), F_U(u). (b) Suppose X is a random variable with
space (a, b) and pdf f_X(x), where f_X(x) > 0 for x ∈ (a, b). [Either or both of a and
b may be infinite.] Thus the inverse function F_X^{-1}(u) exists for u ∈ (0, 1). (Why?)
Show that the distribution of Y = F_X(X) is Uniform(0, 1). [Hint: For y ∈ (0, 1), write
P[Y ≤ y] = P[F_X(X) ≤ y] = P[X ≤ F_X^{-1}(y)], then use the definition of F_X.] (c)
Suppose U ~ Uniform(0, 1). For the X in part (b), show that F_X^{-1}(U) has the same
distribution as X. [Note: This fact provides a way of generating random variables X
from random uniforms.]
Exercise 2.7.28. Suppose Y is n × 2 with covariance matrix

  Σ = ( 2 1 ; 1 2 ).   (2.111)

Let W = YB′, for

  B = ( 1 1 ; 1 c )   (2.112)

for some c. Find c so that the covariance between the two variables in W is zero.
What are the variances of the resulting two variables?
Exercise 2.7.29. Let Y be a 1 × 4 vector with

  Y_j = μ_j + B + E_j,

where the μ_j are constants, B has mean zero and variance σ_B^2, the E_j's are independent,
each with mean zero and variance σ_E^2, and B is independent of the E_j's. (a) Find the
mean and covariance matrix of

  X ≡ ( B  E_1  E_2  E_3  E_4 ).   (2.113)

(b) Write Y as an affine transformation of X. (c) Find the mean and covariance matrix
of Y. (d) Cov[Y] can be written as

  Cov[Y] = a I_4 + b 1_4 1_4′.   (2.114)

Give a and b in terms of σ_B^2 and σ_E^2. (e) What are the mean and covariance matrix of
Ȳ = (Y_1 + ··· + Y_4)/4?
Exercise 2.7.30. Suppose Y is a 5 × 4 data matrix, and

  Y_ij = μ + B_i + δ + E_ij  for j = 1, 2,   (2.115)
  Y_ij = μ + B_i + E_ij      for j = 3, 4,   (2.116)

where the B_i's are independent, each with mean zero and variance σ_B^2, the E_ij's are
independent, each with mean zero and variance σ_E^2, and the B_i's are independent
of the E_ij's. (Thus each row of Y is distributed as the vector in Exercise 2.7.29, for some
particular values of the μ_j's.) [Note: This model is an example of a randomized block
model, where the rows of Y represent the blocks. For example, a farm might be
broken into 5 blocks, and each block split into four plots, where two of the plots
(Y_i1, Y_i2) get one fertilizer, and two of the plots (Y_i3, Y_i4) get another fertilizer.] (a)
E[Y] = xβz′. Give x, β, and z′. [The β contains the parameters μ and δ. The x and
z contain known constants.] (b) Are the rows of Y independent? (c) Find Cov[Y].
(d) Setting which parameter equal to zero guarantees that all elements of Y have the
same mean? (e) Setting which parameter equal to zero guarantees that all elements
of Y are uncorrelated?
Chapter 3

The Multivariate Normal Distribution

3.1 Definition

There are not very many commonly used multivariate distributions to model a data
matrix Y. The multivariate normal is by far the most common, at least for contin-
uous data. Which is not to say that all data are distributed normally, nor that all
techniques assume such. Rather, typically one either assumes normality, or makes
few assumptions at all and relies on asymptotic results.
The multivariate normal arises from the standard normal:

Definition 3.1. The random variable Z is standard normal, written Z ~ N(0, 1), if it has
space R and pdf

  φ(z) = (1/√(2π)) e^{−z^2/2}.   (3.1)

It is not hard to show that if Z ~ N(0, 1),

  E[Z] = 0,  Var[Z] = 1,  and  M_Z(t) = e^{t^2/2}.   (3.2)

Definition 3.2. The collection of random variables Z = (Z_1, . . . , Z_M) is a standard normal
collection if the Z_i's are mutually independent standard normal random variables.

Because the variables in a standard normal collection are independent, by (3.2),
(2.61) and (2.59),

  E[Z] = 0,  Cov[Z] = I_M,  and  M_Z(t) = e^{(t_1^2 + ··· + t_M^2)/2} = e^{‖t‖^2/2}.   (3.3)

The mgf is finite for all t.
A general multivariate normal distribution can have any (legitimate) mean and
covariance, achieved through the use of affine transformations. Here is the definition.

Definition 3.3. The collection X is multivariate normal if it is an affine transformation of
a standard normal collection.

The mean and covariance of a multivariate normal can be calculated from the
coefficients in the affine transformation. In particular, suppose Z is a standard normal
collection represented as a 1 × M row vector, and Y is a 1 × q row vector,

  Y = μ + ZB′,   (3.4)
where B is q × M and μ is 1 × q. From (3.3), (2.84) and (2.85),

  μ = E[Y]  and  Σ = Cov[Y] = BB′.   (3.5)

The mgf is calculated, for a 1 × q vector s, as

  M_Y(s) = E[exp(Ys′)]
         = E[exp((μ + ZB′)s′)]
         = exp(μs′) E[exp(Z(sB)′)]
         = exp(μs′) M_Z(sB)
         = exp(μs′) exp((1/2)‖sB‖^2)   by (3.3)
         = exp(sμ′ + (1/2) sBB′s′)
         = exp(sμ′ + (1/2) sΣs′).   (3.6)

The mgf depends on B only through Σ = BB′. Because the mgf determines the
distribution (Theorem 2.1), two different B's can produce the same distribution. That
is, as long as BB′ = CC′, the distributions of μ + ZB′ and μ + ZC′ are the same. Which
is to say that the distribution of the multivariate normal depends on only the mean
and covariance. Thus it is legitimate to write

  Y ~ N_q(μ, Σ),   (3.7)

which is read "Y has a q-dimensional multivariate normal distribution with mean μ
and covariance Σ."
For example, consider the two matrices

  B = ( 1 2 1 ; 0 3 4 )  and  C = ( √2 2 ; 0 5 ).   (3.8)

It is not hard to show that

  BB′ = CC′ = ( 6 10 ; 10 25 ).   (3.9)

Thus if the Z_i's are independent N(0, 1),

  (Z_1, Z_2, Z_3)B′ = (Z_1 + 2Z_2 + Z_3, 3Z_2 + 4Z_3),

which has the same distribution as

  (√2 Z_1 + 2Z_2, 5Z_2) = (Z_1, Z_2)C′,   (3.10)

i.e., both vectors are N(0, Σ). Note that the two expressions are based on differing
numbers of standard normals, not just different linear combinations.
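A small numerical check of (3.8)–(3.10) in R: the two candidate square roots give the
same Σ, and the two linear combinations of standard normals have matching sample
covariances (the simulation is only an illustration).

  # Verify BB' = CC' and compare sample covariances of the two representations
  B <- matrix(c(1, 2, 1,
                0, 3, 4), nrow = 2, byrow = TRUE)
  C <- matrix(c(sqrt(2), 2,
                0,       5), nrow = 2, byrow = TRUE)
  B %*% t(B)                 # both equal matrix(c(6, 10, 10, 25), 2, 2)
  C %*% t(C)
  set.seed(1)
  Z3 <- matrix(rnorm(3 * 1e5), ncol = 3)   # rows are 1 x 3 standard normal vectors
  Z2 <- matrix(rnorm(2 * 1e5), ncol = 2)
  cov(Z3 %*% t(B))           # both sample covariances approximate Sigma
  cov(Z2 %*% t(C))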
Which μ and Σ are legitimate parameters in (3.7)? Any μ ∈ R^q is. The covariance
matrix can be BB′ for any q × M matrix B. Any such matrix B is considered a
square root of Σ. Clearly, Σ must be symmetric, but we already knew that. It must
also be nonnegative definite, which we define now.

Definition 3.4. A symmetric q × q matrix A is nonnegative definite if

  bAb′ ≥ 0 for all 1 × q vectors b.   (3.11)

Also, A is positive definite if

  bAb′ > 0 for all 1 × q vectors b ≠ 0.   (3.12)

Note that bBB′b′ = ‖bB‖^2 ≥ 0, which means that Σ must be nonnegative definite.
But from (2.85),

  bΣb′ = Cov[Yb′] = Var[Yb′] ≥ 0,   (3.13)

because all variances are nonnegative. That is, any covariance matrix has to be non-
negative definite, not just multivariate normal ones.
So we know that Σ must be symmetric and nonnegative definite. Are there any
other restrictions, or for any symmetric nonnegative definite matrix is there a corre-
sponding B? In fact, there are potentially many square roots of Σ. These follow from
the spectral decomposition theorem, Theorem 1.1. Because Σ is symmetric, we can
write

  Σ = ΓΛΓ′,   (3.14)

where Γ is orthogonal, and Λ is diagonal with diagonal elements λ_1 ≥ λ_2 ≥ ··· ≥ λ_q.
Because Σ is nonnegative definite, the eigenvalues are nonnegative (Exercise 3.7.12),
hence they have square roots. Consider

  B = ΓΛ^{1/2},   (3.15)

where Λ^{1/2} is the diagonal matrix with diagonal elements the λ_j^{1/2}'s. Then, indeed,

  BB′ = ΓΛ^{1/2}Λ^{1/2}Γ′ = ΓΛΓ′ = Σ.   (3.16)

That is, in (3.7), μ is unrestricted, and Σ can be any symmetric nonnegative definite
matrix. Note that C = ΓΛ^{1/2}Ω for any q × q orthogonal matrix Ω is also a square root
of Σ. If we take Ω = Γ′, then we have the symmetric square root, ΓΛ^{1/2}Γ′.
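The construction in (3.14)–(3.16) translates directly into R via eigen(). The sketch below
computes both the ΓΛ^{1/2} square root and the symmetric square root ΓΛ^{1/2}Γ′ for
the Σ of (3.9), and checks that each reproduces Σ.

  # Square roots of a covariance matrix via the spectral decomposition (Theorem 1.1)
  Sigma <- matrix(c(6, 10, 10, 25), 2, 2)
  eig  <- eigen(Sigma, symmetric = TRUE)
  Gam  <- eig$vectors                      # Gamma: orthogonal eigenvector matrix
  Lam2 <- diag(sqrt(eig$values))           # Lambda^{1/2}
  B    <- Gam %*% Lam2                     # one square root
  Ssym <- Gam %*% Lam2 %*% t(Gam)          # the symmetric square root
  B %*% t(B)                               # both give Sigma back
  Ssym %*% Ssym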
If q = 1, then we have a normal random variable, say Y, and Y ~ N(μ, σ^2) signifies
that it has mean μ and variance σ^2. If Y is a multivariate normal collection represented
as an n × q matrix, we write

  Y ~ N_{n×q}(μ, Ω)  when  row(Y) ~ N_{nq}(row(μ), Ω).   (3.17)
3.2 Some properties of the multivariate normal
Affine transformations of multivariate normals are also multivariate normal, because
any affine transformation of a multivariate normal collection is an affine transfor-
mation of an affine transformation of a standard normal collection, and an affine
transformation of an affine transformation is also an affine transformation. That is,
suppose Y ~ N_q(μ, Σ), and W = c + YD′ for a p × q matrix D and 1 × p vector c. Then
we know that for some B with BB′ = Σ, Y = μ + ZB′, where Z is a standard normal
vector. Hence

  W = c + YD′ = c + (μ + ZB′)D′ = c + μD′ + ZB′D′,   (3.18)

and as in (3.4),

  W ~ N_p(c + μD′, DBB′D′) = N_p(c + μD′, DΣD′).   (3.19)

Of course, the mean and covariance result we already knew from (2.84) and (2.85).
Because marginals are special cases of affine transformations, marginals of multi-
variate normals are also multivariate normal. One needs just to pick off the appro-
priate means and covariances. So if Y = (Y_1, . . . , Y_5) is N_5(μ, Σ), and W = (Y_2, Y_5),
then

  W ~ N_2( (μ_2, μ_5), ( σ_22 σ_25 ; σ_52 σ_55 ) ).   (3.20)
In Section 2.4, we showed that independence of two random variables means that
their covariance is 0, but that a covariance of 0 does not imply independence. But,
with multivariate normals, it does. That is, if X is a multivariate normal collection,
and Cov[X_j, X_k] = 0, then X_j and X_k are independent. The next theorem generalizes
this independence to sets of variables.

Theorem 3.1. If W = (X, Y) is a multivariate normal collection, then Cov[X, Y] = 0 (see
Equation 2.46) implies that X and Y are independent.

Proof. For simplicity, we will assume the mean of W is 0. Let B (p × M_1) and C
(q × M_2) be matrices such that BB′ = Cov[X] and CC′ = Cov[Y], and let Z = (Z_1, Z_2)
be a standard normal collection of M_1 + M_2 variables, where Z_1 is 1 × M_1 and Z_2 is
1 × M_2. By assumption on the covariances between the X_k's and Y_l's, and properties
of B and C,

  Cov[W] = ( Cov[X] 0 ; 0 Cov[Y] ) = ( BB′ 0 ; 0 CC′ ) = AA′,   (3.21)

where

  A = ( B 0 ; 0 C ).   (3.22)

This shows that W has the same distribution as ZA′. With that representation, we
have that X = Z_1B′ and Y = Z_2C′. Because the Z_i's are mutually independent, and
the subsets Z_1 and Z_2 do not overlap, Z_1 and Z_2 are independent, which means that
X and Y are independent.

The theorem can also be proved using mgfs or pdfs. See Exercises 3.7.15 and
8.8.12.
3.3 Multivariate normal data matrix
Here we connect the n × q data matrix Y of (1.1) to the multivariate normal. Each row of
Y represents the values of q variables for an individual. Often, the data are modeled
considering the rows of Y as independent observations from a population. Letting Y_i
be the i-th row of Y, we would say that

  Y_1, . . . , Y_n are independent and identically distributed (iid).   (3.23)

In the iid case, the vectors all have the same mean μ and covariance matrix Σ. Thus
the mean of the entire matrix, M = E[Y], is

  M = ( μ ; μ ; ··· ; μ ),  each of the n rows equal to μ.   (3.24)

For the covariance of Y, we need to string all the elements out, as in (2.45),
as (Y_1, . . . , Y_n). By independence, the covariance between variables from different
individuals is 0, that is, Cov[Y_ij, Y_kl] = 0 if i ≠ k. Each group of q variables from a
single individual has covariance Σ, so that Cov[Y] is block diagonal:

  Ω = Cov[Y] = ( Σ 0 ··· 0 ; 0 Σ ··· 0 ; ··· ; 0 0 ··· Σ ).   (3.25)

Patterned matrices such as (3.24) and (3.25) can be more efficiently represented as
Kronecker products.
Definition 3.5. If A is a p × q matrix and B is an n × m matrix, then the Kronecker
product is the (np) × (mq) matrix A ⊗ B given by

  A ⊗ B = ( a_11 B  a_12 B  ···  a_1q B
            a_21 B  a_22 B  ···  a_2q B
              ···    ···    ···   ···
            a_p1 B  a_p2 B  ···  a_pq B ).   (3.26)

Thus the mean in (3.24) and covariance matrix in (3.25) can be written as follows:

  M = 1_n ⊗ μ  and  Ω = I_n ⊗ Σ.   (3.27)

Recall that 1_n is the n × 1 vector of all 1's, and I_n is the n × n identity matrix. Now if
the rows of Y are iid multivariate normal, we write

  Y ~ N_{n×q}(1_n ⊗ μ, I_n ⊗ Σ).   (3.28)
Often the rows are independent with common covariance Σ, but not necessarily hav-
ing the same means. Then we have

  Y ~ N_{n×q}(M, I_n ⊗ Σ).   (3.29)

We have already seen examples of linear combinations of elements in the data
matrix. In (1.9) and (1.10), we had combinations of the form CY, where the matrix
multiplied Y on the left. The linear combinations are of the individuals within the
variable, so that each variable is affected in the same way. In (1.23), and for principal
components, the matrix is on the right: YD′. In this case, the linear combinations are
of the variables, with the variables for each individual affected the same way. More
generally, we have affine transformations of the form (2.80),

  W = A + CYD′.   (3.30)

Because W is an affine transformation of Y, it is also multivariate normal. When
Cov[Y] has the form as in (3.29), then so does W.
Proposition 3.1. If Y ~ N_{n×q}(M, H ⊗ Σ) and W = A + CYD′, where C is m × n, D is
p × q, and A is m × p, then

  W ~ N_{m×p}(A + CMD′, CHC′ ⊗ DΣD′).   (3.31)
The mean part follows directly from the second part of (2.84). For the covari-
ance, we need some facts about Kronecker products, proofs of which are tedious but
straightforward. See Exercises 3.7.17 to 3.7.19.

Proposition 3.2. Presuming the matrix operations make sense and the inverses exist,

  (A ⊗ B)′ = A′ ⊗ B′   (3.32a)
  (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)   (3.32b)
  (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}   (3.32c)
  row(CYD′) = row(Y)(C ⊗ D)′   (3.32d)
  trace(A ⊗ B) = trace(A) trace(B)   (3.32e)
  |A ⊗ B| = |A|^b |B|^a,   (3.32f)

where in the final equation, A is a × a and B is b × b. If Cov[U] = A ⊗ B, then

  Var[U_ij] = a_ii b_jj, and more generally,   (3.33a)
  Cov[U_ij, U_kl] = a_ik b_jl;   (3.33b)
  Cov[i-th row of U] = a_ii B;   (3.33c)
  Cov[j-th column of U] = b_jj A.   (3.33d)
To prove the covariance result in Proposition 3.1, write

  Cov[CYD′] = Cov[row(Y)(C′ ⊗ D′)]                 by (3.32d)
            = (C′ ⊗ D′)′ Cov[row(Y)] (C′ ⊗ D′)     by (2.85)
            = (C ⊗ D)(H ⊗ Σ)(C′ ⊗ D′)              by (3.32a)
            = CHC′ ⊗ DΣD′                          by (3.32b), twice.   (3.34)
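The identity (3.32d) is easy to check numerically. In the sketch below, row(Y) stacks the
rows of Y, which in R is c(t(Y)) since R stores matrices by column; the matrices are
random and purely illustrative.

  # Numerical check of row(C Y D') = row(Y)(C (x) D)'
  set.seed(1)
  C <- matrix(rnorm(6), 2, 3)    # m x n
  Y <- matrix(rnorm(12), 3, 4)   # n x q
  D <- matrix(rnorm(8), 2, 4)    # p x q
  lhs <- c(t(C %*% Y %*% t(D)))            # row(C Y D')
  rhs <- c(t(Y)) %*% t(kronecker(C, D))    # row(Y) (C (x) D)'
  max(abs(lhs - rhs))                      # essentially zero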
One direct application of the proposition is the sample mean in the iid case (3.28),
so that Y ~ N_{n×q}(1_n ⊗ μ, I_n ⊗ Σ). Then from (1.9),

  Ȳ = (1/n) 1_n′ Y,   (3.35)

so we can use Proposition 3.1 with C = (1/n)1_n′, D′ = I_q, and A = 0. Thus

  Ȳ ~ N_q( ((1/n)1_n′1_n)μ, ((1/n)1_n′)I_n((1/n)1_n) ⊗ Σ ) = N_q(μ, (1/n)Σ),   (3.36)

since (1/n)1_n′1_n = 1, and c ⊗ A = cA if c is a scalar. This result should not be
surprising because it is the analog of the univariate result that Ȳ ~ N(μ, σ^2/n).
3.4 Conditioning in the multivariate normal

We start here with X being a 1 × p vector and Y being a 1 × q vector, then specialize
to the data matrix case at the end of this section. If (X, Y) is multivariate normal, then
the conditional distributions of Y given X = x are multivariate normal as well. Let

  (X, Y) ~ N_{p+q}( (μ_X, μ_Y), ( Σ_XX Σ_XY ; Σ_YX Σ_YY ) ).   (3.37)
Rather than diving in to joint densities, as in (2.23), we start by predicting the vector
Y from X with an affine transformation. That is, we wish to find α, 1 × q, and β, p × q,
so that

  Y ≈ α + Xβ.   (3.38)

We use the least squares criterion, which is to find (α, β) to minimize

  q(α, β) = E[‖Y − α − Xβ‖^2].   (3.39)

We start by noting that if the 1 × q vector W has finite covariance matrix, then
E[‖W − c‖^2] is uniquely minimized over c ∈ R^q by c = E[W]. See Exercise 3.7.21.
Letting W = Y − Xβ, we have that, fixing β, q(α, β) is minimized over α by taking

  α = E[Y − Xβ] = μ_Y − μ_X β.   (3.40)

Using that α in (3.39), we now want to minimize

  q(μ_Y − μ_X β, β) = E[‖Y − (μ_Y − μ_X β) − Xβ‖^2] = E[‖(Y − μ_Y) − (X − μ_X)β‖^2]   (3.41)

over β. Using the trick that for a row vector z, ‖z‖^2 = trace(z′z), and letting
X* = X − μ_X and Y* = Y − μ_Y, we can write (3.41) as

  E[trace((Y* − X*β)′(Y* − X*β))] = trace(E[(Y* − X*β)′(Y* − X*β)])
    = trace(Σ_YY − Σ_YXβ − β′Σ_XY + β′Σ_XXβ).   (3.42)

Now we complete the square. That is, we want to find β* so that

  Σ_YY − Σ_YXβ − β′Σ_XY + β′Σ_XXβ = (β − β*)′Σ_XX(β − β*) + Σ_YY − β*′Σ_XXβ*.   (3.43)

Matching, we must have that β′Σ_XXβ* = β′Σ_XY, so that if Σ_XX is invertible, we need
that β* = Σ_XX^{-1}Σ_XY. Then the trace of the expression in (3.43) is minimized by taking
β = β*, since that sets to 0 the part depending on β, and you can't do better than
that. Which means that (3.39) is minimized with

  β = Σ_XX^{-1} Σ_XY,   (3.44)

and α in (3.40). The minimum of (3.39) is the trace of

  Σ_YY − β*′ Σ_XX β* = Σ_YY − Σ_YX Σ_XX^{-1} Σ_XY.   (3.45)

The prediction of Y is then α + Xβ. Define the residual to be the error in the
prediction:

  R = Y − α − Xβ.   (3.46)
The next step is to find the joint distribution of (X, R). Because it is an affine transforma-
tion of (X, Y), the joint distribution is multivariate normal, hence we just need to find
the mean and covariance matrix. The mean of X we know is μ_X, and the mean of R
is 0, from (3.40). The transform is

  ( X  R ) = ( X  Y ) ( I_p  −β ; 0  I_q ) + ( 0  −α ),   (3.47)

hence

  Cov[( X  R )] = ( I_p  0 ; −β′  I_q ) ( Σ_XX  Σ_XY ; Σ_YX  Σ_YY ) ( I_p  −β ; 0  I_q )
                = ( Σ_XX  0 ; 0  Σ_{YY·X} ),   (3.48)

where

  Σ_{YY·X} = Σ_YY − Σ_YX Σ_XX^{-1} Σ_XY,   (3.49)

the matrix from the minimum in (3.45).
Note the zero in the covariance matrix. Because we have multivariate normality, X
and R are thus independent, and

  R ~ N_q(0, Σ_{YY·X}).   (3.50)

Using the plug-in formula with independence, (2.63), so that

  Y | X = x = α + xβ + R,   (3.51)

leads to the next result.
Proposition 3.3. If (X, Y) is multivariate normal as in (3.37), and Σ_XX is invertible, then

  Y | X = x ~ N(α + xβ, Σ_{YY·X}),   (3.52)

where α is given in (3.40), β is given in (3.44), and the conditional covariance matrix is given
in (3.49).

The conditional distribution is particularly nice:
• It is multivariate normal;
• The conditional mean is an affine transformation of x;
• It is homoskedastic, that is, the conditional covariance matrix does not depend on x.
These properties are the typical assumptions in linear regression.
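As a concrete illustration of (3.40), (3.44), and (3.49), the sketch below computes α, β,
and Σ_{YY·X} in R for a partitioned mean vector and covariance matrix whose numbers
are made up purely for illustration.

  # Conditional distribution of Y given X = x (Proposition 3.3); illustrative values
  mu_x <- c(1, 2);  mu_y <- c(0, 3)
  Sxx <- matrix(c(2, 1, 1, 3), 2, 2)
  Sxy <- matrix(c(1, 0, 0.5, 1), 2, 2)
  Syy <- matrix(c(4, 1, 1, 2), 2, 2)
  beta  <- solve(Sxx, Sxy)                   # Sigma_XX^{-1} Sigma_XY, as in (3.44)
  alpha <- mu_y - mu_x %*% beta              # as in (3.40)
  Syy_x <- Syy - t(Sxy) %*% solve(Sxx, Sxy)  # Sigma_YY.X, as in (3.49)
  x <- c(2, 1)
  alpha + x %*% beta                         # conditional mean E[Y | X = x]
  Syy_x                                      # conditional covariance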
Conditioning in a multivariate normal data matrix
So far we have looked at just one X/Y vector, whereas data will have a number of
such vectors. Stacking these vectors into a data matrix, we have the distribution as in
(3.29), but with an X matrix as well. That is, let X be n × p and Y be n × q, where

  ( X  Y ) ~ N_{n×(p+q)}( ( M_X  M_Y ), I_n ⊗ ( Σ_XX  Σ_XY ; Σ_YX  Σ_YY ) ).   (3.53)

The conditional distribution of Y given X = x can be obtained by applying Proposi-
tion 3.3 to (row(X), row(Y)), whose distribution can be written

  ( row(X)  row(Y) ) ~ N_{np+nq}( ( row(M_X)  row(M_Y) ),
          ( I_n ⊗ Σ_XX  I_n ⊗ Σ_XY ; I_n ⊗ Σ_YX  I_n ⊗ Σ_YY ) ).   (3.54)

See Exercise 3.7.23. We again have the same β, but α is a bit expanded (it is n × q):

  α = M_Y − M_X β,  β = Σ_XX^{-1} Σ_XY.   (3.55)

With R = Y − α − Xβ, we obtain that X and R are independent, and

  Y | X = x ~ N_{n×q}(α + xβ, I_n ⊗ Σ_{YY·X}).   (3.56)
3.5 The sample covariance matrix: Wishart distribution
Consider the iid case (3.28), Y ~ N_{n×q}(1_n ⊗ μ, I_n ⊗ Σ). The sample covariance matrix
is given in (1.17) and (1.15),

  S = (1/n) W,  W = Y′H_nY,  H_n = I_n − (1/n)1_n1_n′.   (3.57)

See (1.12) for the centering matrix, H_n. Here we find the joint distribution of the
sample mean Ȳ and W. The marginal distribution of the sample mean is given in
(3.36). Start by looking at the mean and the deviations together:

  ( Ȳ ; H_nY ) = ( (1/n)1_n′ ; H_n ) Y.   (3.58)

Thus they are jointly normal. The mean of the sample mean is μ, and the mean of the
deviations H_nY is H_n 1_n μ = 0. (Recall Exercise 1.9.1.) The covariance is given by

  Cov[( Ȳ ; H_nY )] = ( (1/n)1_n′ ; H_n ) ( (1/n)1_n  H_n ) ⊗ Σ
                    = ( (1/n) ⊗ Σ  0 ; 0  H_n ⊗ Σ ).   (3.59)

The zeroes in the covariance show that Ȳ and H_nY are independent (as they are in
the familiar univariate case), implying that Ȳ and W are independent. Also,

  U ≡ H_nY ~ N(0, H_n ⊗ Σ).   (3.60)

Because H_n is idempotent, W = Y′H_nY = U′U. At this point, instead of trying to
figure out the distribution of W, we define it to be what it is. Actually, Wishart [1928]
did this a while ago. Next is the formal definition.
Definition 3.6 (Wishart distribution). If Z ~ N_{ν×p}(0, I_ν ⊗ Σ), then Z′Z is Wishart on ν
degrees of freedom, with parameter Σ, written

  Z′Z ~ Wishart_p(ν, Σ).   (3.61)
The difference between the distribution of U and the Z in the definition is that the
former has H_n where we would prefer an identity matrix. We can deal with this issue
by rotating the H_n. We need its spectral decomposition. More generally, suppose J is
an n × n symmetric and idempotent matrix, with spectral decomposition (Theorem
1.1) J = ΓΛΓ′, where Γ is orthogonal and Λ is diagonal with nonincreasing diagonal
elements. Because it is idempotent, JJ = J, hence

  ΓΛΓ′ΓΛΓ′ = ΓΛΛΓ′ = ΓΛΓ′,   (3.62)

so that Λ = Λ^2, or λ_i = λ_i^2 for the eigenvalues i = 1, . . . , n. That means that each of
the eigenvalues is either 0 or 1. If matrices A and B have the same dimensions, then

  trace(AB′) = trace(BA′).   (3.63)

See Exercise 3.7.5. Thus

  trace(J) = trace(ΓΛΓ′) = trace(Λ) = λ_1 + ··· + λ_n,   (3.64)

which is the number of eigenvalues that are 1. Because the eigenvalues are ordered
from largest to smallest, λ_1 = ··· = λ_{trace(J)} = 1, and the rest are 0. Hence the
following result.

Lemma 3.1. Suppose J, n × n, is symmetric and idempotent. Then its spectral decomposition
is

  J = ( Γ_1  Γ_2 ) ( I_k  0 ; 0  0 ) ( Γ_1′ ; Γ_2′ ) = Γ_1 Γ_1′,   (3.65)

where k = trace(J), Γ_1 is n × k, and Γ_2 is n × (n − k).
Now suppose

  U ~ N(0, J ⊗ Σ),   (3.66)

for J as in the lemma. Letting Γ = (Γ_1, Γ_2) as in (3.65), we have E[Γ′U] = 0 and

  Cov[Γ′U] = Cov[( Γ_1′U ; Γ_2′U )] = ( I_k ⊗ Σ  0 ; 0  0 ).   (3.67)

Thus Γ_2′U has mean and covariance zero, hence it must be zero itself (with probability
one). That is,

  U′U = U′ΓΓ′U = U′Γ_1Γ_1′U + U′Γ_2Γ_2′U = U′Γ_1Γ_1′U.   (3.68)

By (3.66), and since J = Γ_1Γ_1′ in (3.65),

  Γ_1′U ~ N(0, Γ_1′Γ_1Γ_1′Γ_1 ⊗ Σ) = N(0, I_k ⊗ Σ).   (3.69)

Now we can apply the Wishart definition (3.61) to Γ_1′U, to obtain the next result.

Corollary 3.1. If U ~ N_{n×p}(0, J ⊗ Σ) for idempotent J, then U′U ~ Wishart_p(trace(J), Σ).
To apply the corollary to W = Y′H_nY in (3.57), by (3.60), we need only the trace of
H_n:

  trace(H_n) = trace(I_n) − (1/n) trace(1_n1_n′) = n − (1/n)(n) = n − 1.   (3.70)

Thus

  W ~ Wishart_q(n − 1, Σ).   (3.71)
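A quick simulation check of (3.71): when the rows of Y are iid N(μ, Σ), the average of
W = Y′H_nY over many replications should be close to (n − 1)Σ (see (3.73) below). The
values of n, μ, and Σ here are arbitrary illustrative choices.

  # Simulation check that E[Y' H_n Y] = (n - 1) Sigma for iid normal rows
  set.seed(1)
  n <- 10; mu <- c(1, 2); Sigma <- matrix(c(2, 1, 1, 3), 2, 2)
  Hn <- diag(n) - matrix(1, n, n) / n        # centering matrix
  A  <- chol(Sigma)                          # a square root of Sigma
  one_W <- function() {
    Y <- matrix(1, n, 1) %*% mu + matrix(rnorm(n * 2), n, 2) %*% A
    t(Y) %*% Hn %*% Y
  }
  Ws <- replicate(2000, one_W())
  apply(Ws, c(1, 2), mean)                   # approximately (n - 1) * Sigma
  (n - 1) * Sigma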
3.6 Some properties of the Wishart

In this section we present some useful properties of the Wishart. The density is
derived later in Section 8.7, and a conditional distribution is presented in Section 8.2.

Mean

Letting Z_1, . . . , Z_ν be the rows of Z in Definition 3.6, we have that

  W ≡ Z′Z = Z_1′Z_1 + ··· + Z_ν′Z_ν ~ Wishart_q(ν, Σ).   (3.72)

Each Z_i ~ N_{1×q}(0, Σ), so E[Z_i′Z_i] = Cov[Z_i] = Σ. Thus

  E[W] = νΣ.   (3.73)

In particular, for the S in (3.57), because ν = n − 1, E[S] = ((n − 1)/n)Σ, so that an
unbiased estimator of Σ is

  Σ̂ = (1/(n − 1)) Y′H_nY.   (3.74)
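Incidentally, R's cov() function uses the n − 1 divisor, so it returns the unbiased
estimator (3.74), i.e., Y′H_nY/(n − 1), rather than S = W/n from (3.57). A one-line check:

  # cov() in R matches the unbiased estimator (3.74)
  set.seed(1)
  Y  <- matrix(rnorm(10 * 3), 10, 3)
  Hn <- diag(10) - matrix(1, 10, 10) / 10
  max(abs(cov(Y) - t(Y) %*% Hn %*% Y / (10 - 1)))   # essentially zero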
Sum of independent Wisharts
If

  W_1 ~ Wishart_q(ν_1, Σ)  and  W_2 ~ Wishart_q(ν_2, Σ),   (3.75)

and W_1 and W_2 are independent, then W_1 + W_2 ~ Wishart_q(ν_1 + ν_2, Σ). This fact
can be easily shown by writing each as in (3.72), then summing.
Chi-squares
If Z_1, . . . , Z_ν are independent N(0, σ^2)'s, then

  W = Z_1^2 + ··· + Z_ν^2   (3.76)

is said to be chi-squared on ν degrees of freedom with scale σ^2, written W ~ σ^2 χ^2_ν.
(If σ^2 = 1, we call it just chi-squared on ν degrees of freedom.) If q = 1 in the
Wishart (3.72), the Z_i's in (3.72) are one-dimensional, i.e., N(0, σ^2)'s, hence

  Wishart_1(ν, σ^2) = σ^2 χ^2_ν.   (3.77)
Linear transformations
If Z ~ N_{ν×q}(0, I_ν ⊗ Σ), then for a p × q matrix A, ZA′ ~ N_{ν×p}(0, I_ν ⊗ AΣA′). Using
the definition of the Wishart (3.61),

  AZ′ZA′ ~ Wishart_p(ν, AΣA′),   (3.78)

i.e.,

  AWA′ ~ Wishart_p(ν, AΣA′).   (3.79)
Marginals

Because marginals are special cases of linear transformations, central blocks of a
Wishart are Wishart. E.g., if W_11 is the upper-left p × p block of W, then
W_11 ~ Wishart_p(ν, Σ_11), where Σ_11 is the upper-left block of Σ. See Exercise 3.7.9. A
special case of such a marginal is a diagonal element, W_ii, which is Wishart_1(ν, σ_ii),
i.e., σ_ii χ^2_ν. Furthermore, if Σ is diagonal, then the diagonals of W are independent
because the corresponding normals are.
3.7 Exercises
Exercise 3.7.1. Verify the calculations in (3.9).
Exercise 3.7.2. Find the matrix B for which W ≡ (Y_2, Y_5) = (Y_1, . . . , Y_5)B, and verify
(3.20).
Exercise 3.7.3. Verify (3.42).
Exercise 3.7.4. Verify the covariance calculation in (3.59).
Exercise 3.7.5. Suppose that A and B are both n × p matrices. Denote the elements
of A by a_ij, and of B by b_ij. (a) Give the following in terms of those elements: (AB′)_ii
(the i-th diagonal element of the matrix AB′), and (B′A)_jj (the j-th diagonal element of
the matrix B′A). (b) Using the above, show that trace(AB′) = trace(B′A).
Exercise 3.7.6. Show that in (3.69), Γ_1′Γ_1 = I_k.
Exercise 3.7.7. Explicitly write the sum of W_1 and W_2 as in (3.75) as a sum of Z_i′Z_i's
as in (3.72).
Exercise 3.7.8. Suppose W ~ σ^2 χ^2_ν from (3.76), that is, W = Z_1^2 + ··· + Z_ν^2, where the
Z_i's are independent N(0, σ^2)'s. This exercise shows that W has pdf

  f_W(w | ν, σ^2) = [1/(Γ(ν/2)(2σ^2)^{ν/2})] w^{ν/2 − 1} e^{−w/(2σ^2)},  w > 0.   (3.80)

It will help to know that U has the Gamma(α, λ) density if α > 0, λ > 0, and

  f_U(u | α, λ) = [λ^α/Γ(α)] u^{α−1} e^{−λu}  for u > 0.   (3.81)

The Γ function is defined in (2.96). (It is the constant needed to have the pdf integrate
to one.) We'll use moment generating functions. Working directly with convolutions
is another possibility. (a) Show that the moment generating function of U in (3.81) is
(1 − t/λ)^{−α} when it is finite. For which t is the mgf finite? (b) Let Z ~ N(0, σ^2), so that
Z^2 ~ σ^2 χ^2_1. Find the moment generating function for Z^2. [Hint: Write E[exp(tZ^2)] as
an integral using the pdf of Z, then note the exponential term in the integrand looks
like a normal with mean zero and some variance, but without the constant. Thus the
integral over that exponential is the reciprocal of the constant.] (c) Find the moment
generating function for W. (See Exercise 2.7.20.) (d) W has a gamma distribution.
What are the parameters? Does this gamma pdf coincide with (3.80)? (e) [Aside] The
density of Z^2 can be derived by writing

  P[Z^2 ≤ w] = ∫_{−√w}^{√w} f_Z(z) dz,   (3.82)

then taking the derivative. Match the result with the σ^2 χ^2_1 density found above. What
is Γ(1/2)?
Exercise 3.7.9. Suppose W ~ Wishart_{p+q}(ν, Σ), where W and Σ are partitioned as

  W = ( W_11  W_12 ; W_21  W_22 )  and  Σ = ( Σ_11  Σ_12 ; Σ_21  Σ_22 ),   (3.83)

where W_11 and Σ_11 are p × p, etc. (a) What matrix A in (3.79) is used to show
that W_11 ~ Wishart_p(ν, Σ_11)? (b) Argue that if Σ_12 = 0, then W_11 and W_22 are
independent.
Exercise 3.7.10. The balanced one-way random effects model in analysis of variance
has

  Y_ij = μ + A_i + e_ij,  i = 1, . . . , g;  j = 1, . . . , r,   (3.84)

where the A_i's are iid N(0, σ_A^2) and the e_ij's are iid N(0, σ_e^2), and the e_ij's are inde-
pendent of the A_i's. Let Y be the g × r matrix of the Y_ij's. Show that

  Y ~ N_{g×r}(M, I_g ⊗ Σ),   (3.85)

and give M and Σ in terms of μ, σ_A^2 and σ_e^2.
Exercise 3.7.11. The double exponential random variable U has density

  f(u) = (1/2) e^{−|u|},  u ∈ R.   (3.86)

It has mean 0, variance 2, and moment generating function M(t) = 1/(1 − t^2) for
|t| < 1. Suppose U and V are independent double exponentials, and let

  X_1 = 5U,  X_2 = 4U + 2V.   (3.87)

(a) Find the covariance matrix of X = (X_1, X_2). (b) Find the symmetric positive
definite square root of the covariance matrix. Call it A. Let Y = (Y_1, Y_2) = (U, V)A.
(c) Do X and Y have the same mean? (d) Do X and Y have the same covariance matrix?
(e) Are X and Y both linear combinations of independent double exponentials? (f) Do
X and Y have the same distribution? [Look at their moment generating functions.]
(g) [Extra credit] Derive the mgf of the double exponential.
Exercise 3.7.12. Suppose Σ is a q × q symmetric matrix with spectral decomposition
(Theorem 1.1) ΓΛΓ′. (a) Show that Σ is nonnegative definite if and only if λ_i ≥ 0 for
all i = 1, . . . , q. [Hint: Suppose it is nonnegative definite. Let γ_i be the i-th column of
Γ, and look at γ_i′Σγ_i. What can you say about λ_i? The other way, suppose all λ_i ≥ 0.
Consider bΣb′, and let w = bΓ. Write bΣb′ in terms of w and the λ_i.] (b) Show
that Σ is positive definite if and only if λ_i > 0 for all i = 1, . . . , q. (c) Show that Σ is
invertible if and only if λ_i ≠ 0 for all i = 1, . . . , q. What is the spectral decomposition
of Σ^{-1} if the inverse exists?
Exercise 3.7.13. Extend Theorem 3.1: Show that if W = (Y_1, . . . , Y_g) is a multivariate
normal collection, then Cov[Y_i, Y_j] = 0 for each i ≠ j implies that Y_1, . . . , Y_g are
mutually independent.
Exercise 3.7.14. Given the random vector (X, Y, Z), answer true or false to the follow-
ing questions: (a) Pairwise independence implies mutual independence. (b) Pairwise
independence and multivariate normality implies mutual independence. (c) Mutual
independence implies conditional independence of X and Y given Z. (d) Conditional
independence of X and Y given Z implies that X and Y are unconditionally indepen-
dent. (e) (X, Y, Z) multivariate normal implies (1, X, Y, Z) is multivariate normal.
Exercise 3.7.15. Let X be 1 × p and Y be 1 × q, where

  (X, Y) ~ N_{1×(p+q)}( (μ_X, μ_Y), ( Σ_XX  0 ; 0  Σ_YY ) ),   (3.88)

so that Cov(X) = Σ_XX, Cov(Y) = Σ_YY, and Cov(X, Y) = 0. Using moment generating
functions, show that X and Y are independent.
Exercise 3.7.16. True/false questions: (a) If A and B are identity matrices, then A ⊗ B
is an identity matrix. (b) If A and B are orthogonal, then A ⊗ B is orthogonal. (c)
If A is orthogonal and B is not orthogonal, then A ⊗ B is orthogonal. (d) If A and
B are diagonal, then A ⊗ B is diagonal. (e) If A and B are idempotent, then A ⊗ B
is idempotent. (f) If A and B are permutation matrices, then A ⊗ B is a permutation
matrix. (A permutation matrix is a square matrix with exactly one 1 in each row, one
1 in each column, and 0's elsewhere.) (g) If A and B are upper triangular, then A ⊗ B
is upper triangular. (An upper triangular matrix is a square matrix whose elements
below the diagonal are 0. I.e., if A is upper triangular, then a_ij = 0 if i > j.) (h) If A is
upper triangular and B is not upper triangular, then A ⊗ B is upper triangular. (i) If
A is not upper triangular and B is upper triangular, then A ⊗ B is upper triangular.
(j) If A and C have the same dimensions, and B and D have the same dimensions,
then A ⊗ B + C ⊗ D = (A + C) ⊗ (B + D). (k) If A and C have the same dimensions,
then A ⊗ B + C ⊗ B = (A + C) ⊗ B. (l) If B and D have the same dimensions, then
A ⊗ B + A ⊗ D = A ⊗ (B + D).
Exercise 3.7.17. Prove (3.32a), (3.32b) and (3.32c).
Exercise 3.7.18. Take C, Y and D to all be 2 × 2. Show (3.32d) explicitly.
Exercise 3.7.19. Suppose A is a × a and B is b × b. (a) Show that (3.32e) for the
trace of A ⊗ B holds. (b) Show that (3.32f) for the determinant of A ⊗ B holds. [Hint:
Write A ⊗ B = (A ⊗ I_b)(I_a ⊗ B). You can use the fact that the determinant of a product
is the product of the determinants. For |I_a ⊗ B|, permute the rows and columns so
it looks like |B ⊗ I_a|.]
Exercise 3.7.20. Suppose the spectral decompositions of A and B are A = GLG′ and
B = HKH′. Is the equation

  A ⊗ B = (G ⊗ H)(L ⊗ K)(G ⊗ H)′   (3.89)

the spectral decomposition of A ⊗ B? If not, what is wrong, and how can it be fixed?
Exercise 3.7.21. Suppose W is a 1 × q vector with finite covariance matrix. Show that
q(c) = E[‖W − c‖^2] is minimized over c ∈ R^q by c = E[W], and the minimum value is
q(E[W]) = trace(Cov[W]). [Hint: Write

  q(c) = E[‖(W − E[W]) + (E[W] − c)‖^2]
       = E[‖W − E[W]‖^2] + 2E[(W − E[W])(E[W] − c)′] + E[‖E[W] − c‖^2]   (3.90)
       = E[‖W − E[W]‖^2] + E[‖E[W] − c‖^2].   (3.91)

Show that the middle (cross-product) term in line (3.90) is zero (E[W] and c are
constants), and argue that the second term in line (3.91) is uniquely minimized by
c = E[W]. (No need to take derivatives.)]
Exercise 3.7.22. Verify the matrix multiplication in (3.48).
Exercise 3.7.23. Suppose (X, Y) is as in (3.53). (a) Show that (3.54) follows. [Be
careful about the covariance, since row(X, Y) ≠ (row(X), row(Y)) if n > 1.] (b)
Apply Proposition 3.3 to (3.54) to obtain

  row(Y) | row(X) = row(x) ~ N_{nq}(α* + row(x)β*, Σ*_{YY·X}),   (3.92)

where

  α* = row(M_Y) − row(M_X)β*,  β* = I_n ⊗ β,  Σ*_{YY·X} = I_n ⊗ Σ_{YY·X}.   (3.93)

What are β and Σ_{YY·X}? (c) Use Proposition 3.2 to derive (3.56) from part (b).
Exercise 3.7.24. Suppose (X, Y, Z) is multivariate normal with covariance matrix

  (X, Y, Z) ~ N( (0, 0, 0), ( 5 1 2 ; 1 5 2 ; 2 2 3 ) ).   (3.94)

(a) What is the correlation of X and Y? Consider the conditional distribution of
(X, Y) | Z = z. (b) Give the conditional covariance matrix, Cov[(X, Y) | Z = z]. (c)
The correlation from that matrix is the conditional correlation of X and Y given Z = z,
sometimes called the partial correlation. What is the conditional correlation in this
case? (d) If the conditional correlation between two variables given a third variable is
negative, is the marginal correlation between those two necessarily negative?
Exercise 3.7.25. Now suppose

  (X, Y, Z) ~ N( (0, 0, 0), ( 5 1 c ; 1 5 2 ; c 2 3 ) ).   (3.95)

Find c so that the conditional correlation between X and Y given Z = z is 0 (so that
X and Y are conditionally independent, because of their normality).
Exercise 3.7.26. Let Y | X = x ~ N(0, x^2) and X ~ N(2, 1). (a) Find E[Y] and Var[Y].
(b) Let Z = Y/X. What is the conditional distribution of Z | X = x? Is Z independent
of X? What is the marginal distribution of Z? (c) What is the conditional distribution
of Y | |X| = r?
Exercise 3.7.27. Suppose that conditionally, (Y_1, Y_2) | X = x are iid N(α + βx, 10),
and that marginally, E[X] = Var[X] = 1. (The X is not necessarily normal.) (a) Find
Var[Y_i], Cov[Y_1, Y_2], and the (unconditional) correlation between Y_1 and Y_2. (b) What
is the conditional distribution of Y_1 + Y_2 | X = x? Is Y_1 + Y_2 independent of X? (c)
What is the conditional distribution of Y_1 − Y_2 | X = x? Is Y_1 − Y_2 independent of X?
Exercise 3.7.28. This question reverses the conditional distribution in a multivariate
normal, without having to use Bayes formula. Suppose conditionally Y | X = x ~
N(α + xβ, Σ), and marginally X ~ N(μ_X, Σ_XX), where Y is 1 × q and X is 1 × p. (a)
Show that (X, Y) is multivariate normal, and find its mean vector, and show that

  Cov[(X, Y)] = ( Σ_XX  Σ_XXβ ; β′Σ_XX  Σ + β′Σ_XXβ ).   (3.96)

[Hint: Show that X and Y − Xβ are independent normals, and find the A so that
(X, Y) = (X, Y − Xβ)A.] (b) Show that the conditional distribution X | Y = y is
multivariate normal with mean

  E[X | Y = y] = μ_X + (y − α − μ_Xβ)(Σ + β′Σ_XXβ)^{-1} β′Σ_XX,   (3.97)

and

  Cov[X | Y = y] = Σ_XX − Σ_XXβ(Σ + β′Σ_XXβ)^{-1} β′Σ_XX.   (3.98)

(You can assume any covariance that needs to be invertible is invertible.)
Exercise 3.7.29 (Bayesian inference). A Bayesian approach to estimating the normal
mean vector, when the covariance matrix Σ is known, is to set

  Y | μ = m ~ N_{1×q}(m, Σ)  and  μ ~ N_{1×q}(μ_0, Σ_0),   (3.99)

where Σ, μ_0, and Σ_0 are known. That is, the mean vector μ is a random variable,
with a multivariate normal prior. (a) Use Exercise 3.7.28 to show that the posterior
distribution of μ, i.e., μ given Y = y, is multivariate normal with

  E[μ | Y = y] = (yΣ^{-1} + μ_0Σ_0^{-1})(Σ^{-1} + Σ_0^{-1})^{-1},   (3.100)

and

  Cov[μ | Y = y] = (Σ^{-1} + Σ_0^{-1})^{-1}.   (3.101)

Thus the posterior mean is a weighted average of the data y and the prior mean, with
weights inversely proportional to their respective covariance matrices. [Hint: What
are the α and β in this case? It takes some matrix manipulations to get the mean and
covariance in the given form.] (b) Show that the marginal distribution of Y is

  Y ~ N_{1×q}(μ_0, Σ + Σ_0).   (3.102)

[Hint: See (3.96).] [Note that the inverse of the posterior covariance is the sum of
the inverses of the conditional covariance of Y and the prior covariance, while the
marginal covariance of the Y is the sum of the conditional covariance of Y and the
prior covariance.] (c) Replace Y with Ȳ, the sample mean of n iid vectors, so that
Ȳ | μ = m ~ N(m, Σ/n). Keep the same prior on μ. Find the posterior distribution
of μ given Ȳ = ȳ. What are the posterior mean and covariance matrix, approxi-
mately, when n is large?
Exercise 3.7.30 (Bayesian inference). Consider a matrix version of Exercise 3.7.29, i.e.,

  Y | μ = m ~ N_{p×q}(m, K^{-1} ⊗ Σ)  and  μ ~ N_{p×q}(μ_0, K_0^{-1} ⊗ Σ),   (3.103)

where K, Σ, μ_0 and K_0 are known, and the covariance matrices are invertible. [So if Y
is a sample mean vector, K would be n, and if Y is the β̂ from multivariate regression, K
would be x′x.] Notice that the Σ is the same in the conditional distribution of Y and
in the prior. Show that the posterior distribution of μ is multivariate normal, with

  E[μ | Y = y] = (K + K_0)^{-1}(Ky + K_0 μ_0),   (3.104)

and

  Cov[μ | Y = y] = (K + K_0)^{-1} ⊗ Σ.   (3.105)

[Hint: Use (3.100) and (3.101) on row(Y) and row(μ), then use properties of Kro-
necker products, e.g., (3.32d) and Exercise 3.7.16 (l).]
Exercise 3.7.31. Suppose X is n × p, Y is n × q, and

  ( X  Y ) ~ N( ( M_X  M_Y ), I_n ⊗ ( Σ_XX  Σ_XY ; Σ_YX  Σ_YY ) ).   (3.106)

Let

  R = Y − XC′ − D   (3.107)

for some matrices C and D. Instead of using least squares as in Section 3.4, here
we try to find C and D so that the residuals have mean zero and are independent
of X. (a) What are the dimensions of R, C and D? (b) Show that (X, R) is an affine
transformation of (X, Y). That is, find A and B so that

  (X, R) = A + (X, Y)B′.   (3.108)

(c) Find the distribution of (X, R). (d) What must C be in order for X and R to be
independent? (You can assume Σ_XX is invertible.) (e) Using the C found in part (d),
find Cov[R]. (It should be I_n ⊗ Σ_{YY·X}.) (f) Sticking with the C from parts (d) and (e),
find D so that E[R] = 0. (g) Using the C and D from parts (d), (e), (f), what is the
distribution of R? The distribution of R′R?
Exercise 3.7.32. Let Y ~ N_{n×p}(M, I_n ⊗ Σ). Suppose K is an n × n symmetric idem-
potent matrix with trace(K) = k, and that KM = 0. Show that Y′KY is Wishart, and
give the parameters.
Exercise 3.7.33. Suppose Y ~ N(xβz′, I_n ⊗ Σ), where x is n × p, and

  z = ( 1 −1  1
        1  0 −2
        1  1  1 ).   (3.109)

(a) Find C so that E[YC′] = xβ. (b) Assuming that x′x is invertible, what is the dis-
tribution of Q_x YC′, where Q_x = I_n − x(x′x)^{-1}x′? (Is Q_x idempotent? Such matrices
will appear again in equation 5.19.) (c) What is the distribution of CY′Q_xYC′?
Exercise 3.7.34. Here, W ~ Wishart_p(n, Σ). (a) Is E[trace(W)] = n trace(Σ)? (b)
Are the diagonal elements of W independent? (c) Suppose Σ = σ^2 I_p. What is the
distribution of trace(W)?
Exercise 3.7.35. Suppose Z = (Z_1, Z_2) ~ N_{1×2}(0, I_2). Let (Θ, R) be the polar coordi-
nates, so that

  Z_1 = R cos(Θ)  and  Z_2 = R sin(Θ).   (3.110)

In order for the transformation to be one-to-one, remove 0 from the space of Z. Then
the space of (Θ, R) is [0, 2π) × (0, ∞). The question is to derive the distribution of
(Θ, R). (a) Write down the density of Z. (b) Show that the Jacobian of the transforma-
tion is r. (c) Find the density of (Θ, R). What is the marginal distribution of Θ? What
is the marginal density of R? Are R and Θ independent? (d) Find the distribution
function F_R(r) for R. (e) Find the inverse function of F_R. (f) Argue that if U_1 and U_2
are independent Uniform(0, 1) random variables, then

  √(−2 log(U_2)) ( cos(2πU_1)  sin(2πU_1) ) ~ N_{1×2}(0, I_2).   (3.111)

Thus we can generate two random normals from two random uniforms. Equation
(3.111) is called the Box–Muller transformation [Box and Muller, 1958]. [Hint: See
Exercise 2.7.27.] (g) Find the pdf of W = R^2. What is the distribution of W? Does it
check with (3.80)?
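Equation (3.111) translates directly into a tiny normal generator. The sketch below
implements it and compares sample moments with the N(0, 1) values; it is only an
illustration, not a replacement for rnorm().

  # Box-Muller generator based on (3.111): two uniforms -> two independent N(0,1)'s
  box_muller <- function(npairs) {
    u1 <- runif(npairs)
    u2 <- runif(npairs)
    r  <- sqrt(-2 * log(u2))
    cbind(r * cos(2 * pi * u1), r * sin(2 * pi * u1))
  }
  set.seed(1)
  z <- box_muller(1e5)
  colMeans(z)    # approximately (0, 0)
  cov(z)         # approximately I_2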
Chapter 4
Linear Models on Both Sides
This chapter presents some basic types of linear model. We start with the usual
linear model, with just one Y-variable. Multivariate regression extends the idea to
several variables, placing the same model on each variable. We then introduce linear
models that model the variables within the observations, basically reversing the roles
of observations and variables. Finally, we introduce the both-sides model, which
simultaneously models the observations and variables. Subsequent chapters present
estimation and hypothesis testing for these models.
4.1 Linear regression
Section 3.4 presented conditional distributions in the multivariate normal. Interest
was in the effect of one set of variables, X, on another set, Y. Conditional on X = x,
the distribution of Y was normal with the mean being a linear function of x, and
the covariance being independent of x. The normal linear regression model does
not assume that the joint distribution of (X, Y) is normal, but only that given x, Y is
multivariate normal. Analysis is carried out considering x to be fixed. In fact, x need
not be a realization of a random variable, but a quantity fixed by the researcher, such
as the dose of a drug or the amount of fertilizer.
The multiple regression model uses the data matrix (x, Y), where x is n × p and is
lower case to emphasize that those values are fixed, and Y is n × 1. That is, there are
p variables in x and a single variable in Y. In Section 4.2, we allow Y to contain more
than one variable.
The model is

  Y = xβ + R,  where β is p × 1 and R ~ N_{n×1}(0, σ_R^2 I_n).   (4.1)

Compare this to (3.51). The variance σ_R^2 plays the role of Σ_{YY·X}. The model (4.1)
assumes that the residuals R_i are iid N(0, σ_R^2).
Some examples follow. There are thousands of books on linear regression and
linear models. Scheffé [1999] is the classic theoretical reference, and Christensen
[2002] provides a more modern treatment. A fine applied reference is Weisberg [2005].
Simple linear regression
One may wish to assess the relation between height and weight, or between choles-
terol level and percentage of fat in the diet. A linear relation would be cholesterol =
α + β(fat) + residual, so one would typically want both an intercept α and a slope β.
Translating this model to (4.1), we would have p = 2, where the first column contains
all 1's. That is, if x_1, . . . , x_n are the values of the explanatory variable (fat), the model
would be

  ( Y_1          ( 1  x_1                   ( R_1
    Y_2      =     1  x_2    ( α         +    R_2
    ···            ···  ···     β )            ···
    Y_n )          1  x_n )                   R_n ).   (4.2)

Multiple regression would add more explanatory variables, e.g., age, blood pres-
sure, amount of exercise, etc., each one being represented by its own column in the x
matrix.
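In R, the design matrix in (4.2) is what lm() builds automatically when an intercept is
included. The toy data below are simulated just to show the correspondence; the
variable names are arbitrary.

  # Simple linear regression as in (4.2), with simulated data
  set.seed(1)
  fat  <- runif(20, 10, 40)
  chol <- 150 + 2 * fat + rnorm(20, sd = 10)
  fit  <- lm(chol ~ fat)       # x has a column of 1's and the fat values
  coef(fit)                    # estimated intercept and slope
  model.matrix(fit)[1:3, ]     # first rows of the n x 2 design matrix x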
Analysis of variance
In analysis of variance, observations are classified into different groups, and one
wishes to compare the means of the groups. If there are three groups, with two
observations in each group, the model could be

  ( Y_1          ( 1 0 0
    Y_2            1 0 0      ( μ_1
    Y_3      =     0 1 0        μ_2      + R.   (4.3)
    Y_4            0 1 0        μ_3 )
    Y_5            0 0 1
    Y_6 )          0 0 1 )

Other design matrices x yield the same model (see Section 5.4), e.g., we could just as
well write

  ( Y_1          ( 1  2 −1
    Y_2            1  2 −1      ( μ
    Y_3      =     1 −1  2        α      + R,   (4.4)
    Y_4            1 −1  2        β )
    Y_5            1 −1 −1
    Y_6 )          1 −1 −1 )

where μ is the grand mean, α is the effect for the first group, and β is the effect for
the second group. We could add the effect for group three, but that would lead to
a redundancy in the model. More complicated models arise when observations are
classified in multiple ways, e.g., sex, age, and ethnicity; see the sketch below for a
numerical check that alternative codings describe the same model.
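The point that different design matrices can encode the same model is easy to check
numerically: any two x's with the same column space give identical fitted values. The
sketch below compares the cell-means coding of (4.3) with R's default treatment coding;
the particular alternative coding here is whatever model.matrix() produces, used only
as an illustration, and the data are simulated.

  # Two parameterizations of the three-group ANOVA give the same fitted values
  set.seed(1)
  group <- factor(rep(1:3, each = 2))
  y <- rnorm(6, mean = c(5, 5, 7, 7, 6, 6))
  xa <- model.matrix(~ group - 1)     # cell-means coding, as in (4.3)
  xb <- model.matrix(~ group)         # intercept plus treatment contrasts
  fitted(lm(y ~ xa - 1))
  fitted(lm(y ~ xb - 1))              # identical fits, different coefficients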
Analysis of covariance
It may be that the main interest is in comparing the means of groups as in analysis
of variance, but there are other variables that potentially affect the Y. For example, in
a study comparing three drugs' effectiveness in treating leprosy, there were bacterial
measurements before and after treatment. The Y is the after measurement, and one
would expect the before measurement, in addition to the drugs, to affect the after
measurement. Letting the x_i's represent the before measurements, the model would be

  ( Y_1          ( 1 0 0 x_1
    Y_2            1 0 0 x_2
    Y_3      =     0 1 0 x_3     β + R.   (4.5)
    Y_4            0 1 0 x_4
    Y_5            0 0 1 x_5
    Y_6 )          0 0 1 x_6 )

The actual experiment had ten observations in each group. See Section 7.5.
Polynomial and cyclic models
The "linear" in linear models refers to the linearity of the mean of Y in the parameter β
for fixed values of x. Within the matrix x, there can be arbitrary nonlinear functions
of variables. For example, in growth curves, one may be looking at Y_i's over time
which grow as a quadratic in x_i, i.e., E[Y_i] = β_0 + β_1 x_i + β_2 x_i^2. Such a model is still
considered a linear model because the β_j's come in linearly. The full model would be

  ( Y_1          ( 1  x_1  x_1^2                     ( R_1
    Y_2      =     1  x_2  x_2^2    ( β_0         +    R_2
    ···            ···  ···  ···      β_1              ···
    Y_n )          1  x_n  x_n^2 )    β_2 )            R_n ).   (4.6)

Higher-order polynomials add on columns of x_i^3's, x_i^4's, etc.
Alternatively, the Y_i's might behave cyclically, such as temperature over the course
of a year, or the circadian (daily) rhythms of animals. If the cycle is over 24 hours,
and measurements are made at each hour, the model could be

  ( Y_1          ( 1  cos(2π·1/24)   sin(2π·1/24)                    ( R_1
    Y_2      =     1  cos(2π·2/24)   sin(2π·2/24)     ( β_0       +    R_2
    ···            ···      ···           ···           β_1            ···
    Y_24 )         1  cos(2π·24/24)  sin(2π·24/24) )    β_2 )          R_24 ).   (4.7)

Based on the data, typical objectives in linear regression are to estimate β, test
whether certain components of β are 0, or predict future values of Y based on their x's.
In Chapters 6 and 7, such formal inferences will be handled. In this chapter, we are
concentrating on setting up the models.
4.2 Multivariate regression and analysis of variance
Consider (x, Y) to be a data matrix where x is again n × p, but now Y is n × q. The
linear model analogous to the conditional model in (3.56) is

  Y = xβ + R,  where β is p × q and R ~ N_{n×q}(0, I_n ⊗ Σ_R).   (4.8)

This model looks very much like the linear regression model in (4.1), and it is. It is
actually just a concatenation of q linear models, one for each variable (column) of Y.
Note that (4.8) places the same model on each variable, in the sense of using the same
x's, but allows different coefficients represented by the different columns of β. That
is, (4.8) implies

  Y_1 = xβ_1 + R_1,  . . . ,  Y_q = xβ_q + R_q,   (4.9)

where the subscript i indicates the i-th column of the matrix.
The x matrix is the same as in the previous section, so rather than repeating
the examples, just imagine them with extra columns of Y and β, and prepend the
word "multivariate" to the models, e.g., multivariate analysis of variance, multivari-
ate polynomial regression, etc.
One might ask what the advantage is of doing all q regressions at once rather
than doing q separate ones. Good question. The main reason is to gather strength
from having several variables. For example, suppose one has an analysis of variance
comparing drugs on a number of health-related variables. It may be that no single
variable shows significant differences between drugs, but the variables together show
strong differences. Using the overall model can also help deal with multiple compar-
isons, e.g., when one has many variables, there is a good chance at least one shows
significance even when there is nothing going on.
These models are more compelling when they are expanded to model dependen-
cies among the means of the variables, which is the subject of Section 4.3.
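The "concatenation" point in (4.8)–(4.9) is easy to see in R: lm() with a matrix response
returns exactly the coefficients of the q separate univariate regressions. The data below
are simulated purely for illustration.

  # Multivariate regression (4.8) as q separate regressions (4.9)
  set.seed(1)
  n <- 30
  x <- cbind(1, rnorm(n))                     # n x 2 design matrix
  B <- matrix(c(1, 2, 0.5, -1), 2, 2)         # p x q coefficient matrix
  Y <- x %*% B + matrix(rnorm(n * 2), n, 2)
  coef(lm(Y ~ x - 1))                         # 2 x 2 matrix of estimates
  cbind(coef(lm(Y[, 1] ~ x - 1)), coef(lm(Y[, 2] ~ x - 1)))   # same columns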
4.2.1 Examples of multivariate regression
Example: Grades
The data are the grades (in the data set grades), and sex (0 = Male, 1 = Female), of 107
students, a portion of which is below:

  Obs i  Gender    HW   Labs  InClass  Midterms  Final  Total
      1       0  30.47   0.00       0     60.38     52  43.52
      2       1  37.72  20.56      75     69.84     62  59.34
      3       1  65.56  77.33      75     68.81     42  63.18
      4       0  65.50  75.83     100     58.88     56  64.04
      5       1  72.36  65.83      25     74.93     60  65.92
    ···     ···    ···    ···     ···       ···    ···    ···
    105       1  93.18  97.78     100     94.75     92  94.64
    106       1  97.54  99.17     100     91.23     96  94.69
    107       1  94.17  97.50     100     94.64     96  95.67
                                                            (4.10)

Consider predicting the midterms and final exam scores from gender, and the
homework, labs, and inclass scores. The model is Y = xβ + R, where Y is 107 × 2
(the Midterms and Finals), x is 107 × 5 (with Gender, HW, Labs, InClass, plus the first
column of 1_107), and β is 5 × 2:

  β = ( β_0M  β_0F
        β_GM  β_GF
        β_HM  β_HF
        β_LM  β_LF
        β_IM  β_IF ).   (4.11)

Chapter 6 shows how to estimate the β_ij's. In this case the estimates are

              Midterms    Final
  Intercept     56.472   43.002
  Gender        −3.044   −1.922
  HW             0.244    0.305
  Labs           0.052    0.005
  InClass        0.048    0.076
                                  (4.12)

Note that the largest slopes (not counting the intercepts) are the negative ones for
gender, but to truly assess the sizes of the coefficients, we will need to find their
standard errors, which we will do in Chapter 6.
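If the grades data are in a data frame with columns named as in (4.10) — the exact
spelling of the names is an assumption here — the fit behind (4.11)–(4.12) can be
sketched as follows.

  # Sketch of the fit for (4.11)-(4.12); `grades` and its column names are assumed
  fit <- lm(cbind(Midterms, Final) ~ Gender + HW + Labs + InClass, data = grades)
  coef(fit)    # 5 x 2 matrix of estimated coefficients, as in (4.12)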
Mouth sizes
Measurements were made on the size of the mouths of 27 children at four ages: 8, 10,
12, and 14. The measurement is the distance from the center of the pituitary to the
pteryomaxillary fissure^1 in millimeters. These data can be found in Potthoff and
Roy [1964]. There are 11 girls (Sex=1) and 16 boys (Sex=0). See Table 4.1. Figure 4.1
contains a plot of the mouth sizes over time. These curves are generally increasing.
There are some instances where the mouth sizes decrease over time. The measure-
ments are between two defined locations in the mouth, and as people age, the mouth
shape can change, so it is not that people's mouths are really getting smaller. Note that
generally the boys have bigger mouths than the girls, as they are generally bigger
overall.
For the linear model, code x where the first column is 1 = girl, 0 = boy, and the
second column is 0 = girl, 1 = boy:

  Y = xβ + R = ( 1_11  0_11 ; 0_16  1_16 ) ( β_11 β_12 β_13 β_14 ;
                                             β_21 β_22 β_23 β_24 ) + R.   (4.13)

Here, Y and R are 27 × 4. So now the first row of β has the (population) means of the
girls for the four ages, and the second row has the means for the boys. The sample
means are

         Age8   Age10  Age12  Age14
  Girls  21.18  22.23  23.09  24.09
  Boys   22.88  23.81  25.72  27.47
                                      (4.14)

The lower plot in Figure 4.1 shows the sample mean vectors. The boys' curve is
higher than the girls', and the two are reasonably parallel, and linear.
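If the mouth-size data are in a data frame `mouths` with columns Age8, Age10, Age12,
Age14, and Sex (1 = girl, 0 = boy) — the object and column names are assumptions for
illustration — the design matrix of (4.13) and the sample means in (4.14) can be obtained
as follows.

  # Sketch of (4.13)-(4.14); `mouths` and its column names are assumed
  x <- cbind(mouths$Sex, 1 - mouths$Sex)     # 27 x 2 design matrix of (4.13)
  Y <- as.matrix(mouths[, c("Age8", "Age10", "Age12", "Age14")])
  coef(lm(Y ~ x - 1))                        # rows: girls' and boys' means, as in (4.14)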
Histamine in dogs
Sixteen dogs were treated with drugs to see the effects on their blood histamine
levels. The dogs were split into four groups: Two groups received the drug morphine,
and two received the drug trimethaphan, both given intravenously. For one group
within each pair of drug groups, the dogs had their supply of histamine depleted
before treatment, while the other group had histamine intact. So this was a two-way

^1 Actually, I believe it is the pterygomaxillary fissure. See Wikipedia [2010] for an illustration and some
references.
Obs i Age8 Age10 Age12 Age14 Sex
1 21.0 20.0 21.5 23.0 1
2 21.0 21.5 24.0 25.5 1
3 20.5 24.0 24.5 26.0 1
4 23.5 24.5 25.0 26.5 1
5 21.5 23.0 22.5 23.5 1
6 20.0 21.0 21.0 22.5 1
7 21.5 22.5 23.0 25.0 1
8 23.0 23.0 23.5 24.0 1
9 20.0 21.0 22.0 21.5 1
10 16.5 19.0 19.0 19.5 1
11 24.5 25.0 28.0 28.0 1
12 26.0 25.0 29.0 31.0 0
13 21.5 22.5 23.0 26.5 0
14 23.0 22.5 24.0 27.5 0
15 25.5 27.5 26.5 27.0 0
16 20.0 23.5 22.5 26.0 0
17 24.5 25.5 27.0 28.5 0
18 22.0 22.0 24.5 26.5 0
19 24.0 21.5 24.5 25.5 0
20 23.0 20.5 31.0 26.0 0
21 27.5 28.0 31.0 31.5 0
22 23.0 23.0 23.5 25.0 0
23 21.5 23.5 24.0 28.0 0
24 17.0 24.5 26.0 29.5 0
25 22.5 25.5 25.5 26.0 0
26 23.0 24.5 26.0 30.0 0
27 22.0 21.5 23.5 25.0 0
Table 4.1: The mouth size data, from Potthoff and Roy [1964].
analysis of variance model, the factors being Drug (morphine or trimethaphan)
and Depletion (intact or depleted). These data are from a study by Morris and
Zeppa [1963], analyzed also in Cole and Grizzle [1966]. See Table 4.2.
Each dog had four measurements: histamine levels (in micrograms per milliliter
of blood) before the inoculation, and then at 1, 3, and 5 minutes after. (The value
0.10 marked with an asterisk was actually missing. I filled it in arbitrarily.)
Figure 4.2 has a plot of the 16 dogs' series of measurements. Most of the data is
close to zero, so it is hard to distinguish many of the individuals.
The model is a two-way multivariate analysis of variance one: Y = xβ + R, where
β contains the mean effect (μ), two main effects (α and β), and interaction effect (γ)
for each time point:

  Y = ( 1_4   1_4   1_4   1_4      ( μ_0  μ_1  μ_3  μ_5
        1_4   1_4  −1_4  −1_4        α_0  α_1  α_3  α_5
        1_4  −1_4   1_4  −1_4        β_0  β_1  β_3  β_5       + R.   (4.15)
        1_4  −1_4  −1_4   1_4 )      γ_0  γ_1  γ_3  γ_5 )
4.3. Both sides models 73
8 9 10 11 12 13 14
2
0
2
5
3
0
Age
S
i
z
e
8 9 10 11 12 13 14
2
0
2
5
3
0
Age
M
e
a
n

s
i
z
e
Figure 4.1: Mouth sizes over time. The boys are indicated by dashed lines, the girls
by solid lines. The top graphs has the individual graphs, the bottom the averages for
the boys and girls.
The estimate of is
Effect Before After1 After3 After5
Mean 0.077 0.533 0.364 0.260
Drug 0.003 0.212 0.201 0.140
Depletion 0.012 0.449 0.276 0.169
Interaction 0.007 0.213 0.202 0.144
(4.16)
See the second plot in Figure 4.2 for the means of the groups, and Figure 4.3 for
the effects, both plotted over time. Note that the mean and depletion effects are the
largest, particularly at time point 2, After1.
4.3 Linear models on both sides
The regression and multivariate regression models in the previous sections model
differences between the individuals: The rows of x are different for different indi-
74 Chapter 4. Linear Models
Obs i Before After1 After3 After5
Morphine 1 0.04 0.20 0.10 0.08
Intact 2 0.02 0.06 0.02 0.02
3 0.07 1.40 0.48 0.24
4 0.17 0.57 0.35 0.24
Morphine 5 0.10 0.09 0.13 0.14
Depleted 6 0.12 0.11 0.10 0.10
7 0.07 0.07 0.07 0.07
8 0.05 0.07 0.06 0.07
Trimethaphan 9 0.03 0.62 0.31 0.22
Intact 10 0.03 1.05 0.73 0.60
11 0.07 0.83 1.07 0.80
12 0.09 3.13 2.06 1.23
Trimethaphan 13 0.10 0.09 0.09 0.08
Depleted 14 0.08 0.09 0.09 0.10
15 0.13 0.10 0.12 0.12
16 0.06 0.05 0.05 0.05
Table 4.2: The data on histamine levels in dogs. The value with the asterisk is missing,
but for illustration purposes I lled it in. The dogs are classied according to the
drug administered (morphine or trimethaphan), and whether the dogs histamine
was articially depeleted.
viduals, but the same for each variable. Models on the variables switch the roles of
variable and individual.
4.3.1 One individual
Start with just one individual, so that Y = (Y
1
, . . . , Y
q
) is a 1 q row vector. A linear
model on the variables is
Y = z
/
+R, where is 1 l, R N
1q
(0,
R
), (4.17)
and z is a xed q l matrix. The model (4.17) looks like just a transpose of model
(4.1), but (4.17) does not have iid residuals, because the observations are all on the
same individual. Simple repeated measures models and growth curve models are special
cases. (Simple because there is only one individual. Actual models would have more
than one.)
A repeated measure model is used if the y
j
s represent replications of the same
measurement. E.g., one may measure blood pressure of the same person several
times, or take a sample of several leaves from the same tree. If no systematic differ-
ences are expected in the measurements, the model would have the same mean for
each variable:
Y = (1, . . . , 1) +R = 1
/
q
+R. (4.18)
It is common in this setting to assume
R
has the intraclass correlation structure, as
in (1.61), i.e., the variances are all equal, and the covariances are all equal.
Growth curve models are used when the measurements are made over time, and
growth (polynomial or otherwise) is expected. A quadratic model turns (4.6) on its
4.3. Both sides models 75
1.0 1.5 2.0 2.5 3.0 3.5 4.0
0
.
0
1
.
0
2
.
0
3
.
0
Time
H
i
s
t
a
m
i
n
e
1.0 1.5 2.0 2.5 3.0 3.5 4.0
0
.
0
1
.
0
2
.
0
3
.
0
Time
M
e
a
n

h
i
s
t
a
m
i
n
e
MI
MD
TI
TD
Figure 4.2: Plots of the dogs over time. The top plot has the individual dogs, the
bottom has the means of the groups. The groups: MI = Morphine, Intact; MD =
Morphine, Depleted; TI = Trimethaphan, Intact; TD = Trimethaphan, Depleted
side:
Y = (
0
,
1
,
2
)

1 1
x
1
x
2
x
q
x
2
1
x
2
2
x
2
q

+R. (4.19)
Similarly one can transpose cyclic models akin to (4.7).
4.3.2 IID observations
Now suppose we have a sample of n independent individuals, so that the n q data
matrix is distributed
Y N
nq
(1
q
, I
q

R
), (4.20)
which is the same as (3.28) with slightly different notation. Here, is 1 q, so the
model says that the rows of Y are independent with the same mean and covariance
matrix
R
. A repeated measure model assumes in addition that the elements of are
equal to , so that the linear model takes the mean in (4.20) and combines it with the
76 Chapter 4. Linear Models
1.0 1.5 2.0 2.5 3.0 3.5 4.0

0
.
4
0
.
0
0
.
2
0
.
4
Time
E
f
f
e
c
t
s
Mean
Drug
Depletion
Interaction
Figure 4.3: Plots of the effects in the analysis of variance for the dogs data, over time.
mean in (4.18) to obtain
Y = 1
n
1
/
q
+R, R N
nq
(0, I
n

R
). (4.21)
This model makes sense if one takes a random sample of n individuals, and makes
repeated measurements from each. More generally, a growth curve model as in (4.19),
but with n individuals measured, is
Y = 1
n
(
0
,
1
,
2
)

1 1
z
1
z
2
z
q
z
2
1
z
2
2
z
2
q

+R. (4.22)
Example: Births
The average births for each hour of the day for four different hospitals is given in
Table 4.3. The data matrix Y here is 4 24, with the rows representing the hospitals
and the columns the hours. Figure 4.4 plots the curves.
One might wish to t sine waves (Figure 4.5) to the four hospitals data, presuming
one day reects one complete cycle. The model is
Y = z
/
+R, (4.23)
where
=

10

11

12

20

21

22

30

31

32

40

41

42

(4.24)
4.3. Both sides models 77
1 2 3 4 5 6 7 8
Hosp1 13.56 14.39 14.63 14.97 15.13 14.25 14.14 13.71
Hosp2 19.24 18.68 18.89 20.27 20.54 21.38 20.37 19.95
Hosp3 20.52 20.37 20.83 21.14 20.98 21.77 20.66 21.17
Hosp4 21.14 21.14 21.79 22.54 21.66 22.32 22.47 20.88
9 10 11 12 13 14 15 16
Hosp1 14.93 14.21 13.89 13.60 12.81 13.27 13.15 12.29
Hosp2 20.62 20.86 20.15 19.54 19.52 18.89 18.41 17.55
Hosp3 21.21 21.68 20.37 20.49 19.70 18.36 18.87 17.32
Hosp4 22.14 21.86 22.38 20.71 20.54 20.66 20.32 19.36
17 18 19 20 21 22 23 24
Hosp1 12.92 13.64 13.04 13.00 12.77 12.37 13.45 13.53
Hosp2 18.84 17.18 17.20 17.09 18.19 18.41 17.58 18.19
Hosp3 18.79 18.55 18.19 17.38 18.41 19.10 19.49 19.10
Hosp4 20.02 18.84 20.40 18.44 20.83 21.00 19.57 21.35
Table 4.3: The data on average number of births for each hour of the day for four
hospitals.
5 10 15 20
1
2
1
4
1
6
1
8
2
0
2
2
Hour
#

B
i
r
t
h
s
Figure 4.4: Plots of the four hospitals births, over twenty-four hours.
78 Chapter 4. Linear Models
0 5 10 15 20

1
.
0
0
.
0
0
.
5
1
.
0
Hour
S
i
n
e
/
C
o
s
i
n
e
Figure 4.5: Sine and cosine waves, where one cycle spans twenty-four hours.
and
z
/
=

1 1 1
cos(1 2/24) cos(2 2/24) cos(24 2/24)
sin(1 2/24) sin(2 2/24) sin(24 2/24)

, (4.25)
the z here being the same as the x in (4.7).
The estimate of is
Mean Cosine Sine
Hosp1 13.65 0.03 0.93
Hosp2 19.06 0.69 1.46
Hosp3 19.77 0.22 1.70
Hosp4 20.93 0.12 1.29
(4.26)
Then the ts are

Y =

z
/
, which is also 4 24. See Figure 4.6.
Now try the model with same curve for each hospital, Y = x

z
/
+ R, where
x = 1
4
(the star on the

is to distinguish it from the previous ):


Y = x

z
/
+R =

1
1
1
1

2
_
z
/
+R. (4.27)
The estimates of the coefcients are now

= (18.35, 0.25, 1.34), which is the aver-


age of the rows of

. The t is graphed as the thick line in Figure 4.6
4.3. Both sides models 79
5 10 15 20
1
2
1
4
1
6
1
8
2
0
2
2
Hour
#

B
i
r
t
h
s
Figure 4.6: Plots of the four hospitals births, with the tted sign waves. The thick
line ts one curve to all four hospitals.
4.3.3 The both-sides model
Note that the last two models, (4.19) and (4.22) have means with xed matrices on
both sides of the parameter. Generalizing, we have the model
Y = xz
/
+R, (4.28)
where x is n p, is p l, and z is q l. The x models differences between indi-
viduals, and the z models relationships between the variables. This formulation is by
Potthoff and Roy [1964].
For example, consider the mouth size example in Section 4.2. A growth curve
model seems reasonable, but one would not expect the iid model to hold. In particu-
lar, the mouths of the eleven girls would likely be smaller on average than those of the
sixteen boys. An analysis of variance model, with two groups, models the differences
between the individuals, while a growth curve models the relationship among the
four time points. With Y being the 27 4 data matrix of measurements, the model is
Y =
_
1
11
0
11
0
16
1
16
__

g0

g1

g2

b0

b1

b2
_

1 1 1 1
8 10 12 14
8
2
10
2
12
2
14
2

+R. (4.29)
The 0
m
s are m 1 vectors of 0s. Thus (
g0
,
g1
,
g2
) contains the coefcients for
the girls growth curve, and (
b0
,
b1
,
b2
) the boys. Some questions which can be
addressed include
Does the model t, or are cubic terms necessary?
Are the quadratic terms necessary (is
g2
=
b2
= 0)?
Are the girls and boys curves the same (are
gj
=
bj
for j = 0, 1, 2)?
80 Chapter 4. Linear Models
Are the girls and boys curves parallel (are
g1
=
b1
and
g2
=
b2
, but maybe
not
g0
=
b0
)?
See also Ware and Bowden [1977] for a circadean application and Zerbe and Jones
[1980] for a time-series context. The model is often called the generalized multivari-
ate analysis of variance, or GMANOVA, model. Extensions are many. For examples,
see Gleser and Olkin [1970], Chinchilli and Elswick [1985], and the book by Kariya
[1985].
4.4 Exercises
Exercise 4.4.1 (Prostaglandin). Below are data from Ware and Bowden [1977] taken at
six four-hour intervals (labelled T1 to T6) over the course of a day for 10 individuals.
The measurements are prostaglandin contents in their urine.
Person T1 T2 T3 T4 T5 T6
1 146 280 285 215 218 161
2 140 265 289 231 188 69
3 288 281 271 227 272 150
4 121 150 101 139 99 103
5 116 132 150 125 100 86
6 143 172 175 222 180 126
7 174 276 317 306 139 120
8 177 313 237 135 257 152
9 294 193 306 204 207 148
10 76 151 333 144 135 99
(4.30)
(a) Write down the xz
/
" part of the model that ts a separate sine wave to each
person. (You dont have to calculate the estimates or anything. Just give the x, and
z matrices.) (b) Do the same but for the model that ts one sine wave to all people.
Exercise 4.4.2 (Skulls). The data concern the sizes of Egyptian skulls over time, from
Thomson and Randall-MacIver [1905]. There are 30 skulls from each of ve time
periods, so that n = 150 all together. There are four skull size measurements, all in
millimeters: maximum length, basibregmatic height, basialveolar length, and nasal
height. The model is a multivariate analysis of variance one, where x distinguishes
between the time periods, and we do not use a z. Use polynomials for the time
periods (code them as 1, 2, 3, 4, 5), so that x = w1
30
. Find w.
Exercise 4.4.3. Suppose Y
b
and Y
a
are n 1 with n = 4, and consider the model
(Y
b
Y
a
) N(x, I
n
), (4.31)
where
x =

1 1 1
1 1 1
1 1 1
1 1 1

. (4.32)
(a) What are the dimensions of and ? The conditional distribution of Y
a
given
Y
b
= (4, 2, 6, 3)
/
is
Y
a
[ Y
b
= (4, 2, 6, 3)
/
N(x

, I
n
) (4.33)
4.4. Exercises 81
for some xed matrix x

, parameter matrix

, and covariance matrix . (b) What are


the dimensions of

and ? (c) What is x

? (d) What is the most precise description


of the conditional model?
Exercise 4.4.4 (Caffeine). Henson et al. [1996] conducted an experiment to see whether
caffeine has a negative effect on short-term visual memory. High school students
were randomly chosen: 9 from eighth grade, 10 from tenth grade, and 9 from twelfth
grade. Each person was tested once after having caffeinated Coke, and once after
having decaffeinated Coke. After each drink, the person was given ten seconds to
try to memorize twenty small, common objects, then allowed a minute to write down
as many as could be remembered. The main question of interest is whether people
remembered more objects after the Coke without caffeine than after the Coke with
caffeine. The data are
Grade 8 Grade 10 Grade 12
Without With Without With Without With
5 6 6 3 7 7
9 8 9 11 8 6
6 5 4 4 9 6
8 9 7 6 11 7
7 6 6 8 5 5
6 6 7 6 9 4
8 6 6 8 9 7
6 8 9 8 11 8
6 7 10 7 10 9
10 6
(4.34)
Grade" is the grade in school, and the Without" and With" entries are the numbers
of items remembered after drinking Coke without or with caffeine. Consider the
model
Y = xz
/
+R, (4.35)
where the Y is 28 2, the rst column being the scores without caffeine, and the
second being the scores with caffeine. The x is 28 3, being a polynomial (quadratic)
matrix in the three grades. (a) The z has two columns. The rst column of z represents
the overall mean (of the number of objects a person remembers), and the second
column represents the difference between the number of objects remembered with
caffeine and without caffeine. Find z. (b) What is the dimension of ? (c) What
effects do the
ij
s represent? (Choices: overall mean, overall linear effect of grade,
overall quadratic effect of grade, overall difference in mean between caffeinated and
decaffeinated coke, linear effect of grade in the difference between caffeinated and
decaffeinated coke, quadratic effect of grade in the difference between caffeinated
and decaffeinated coke, interaction of linear and quadratic effects of grade.)
Exercise 4.4.5 (Histamine in dogs). In Table 4.2, we have the model
Y = xz
/
+R, (4.36)
where x (n 4) describes a balanced two-way analysis of variance. The columns
represent, respectively, the overall mean, the drug effect, the depletion effect, and the
drug depletion interaction. For the z, the rst column is the effect of the before"
82 Chapter 4. Linear Models
measurement (at time 0), and the last three columns represent polynomial effects
(constant, linear, and quadratic) for just the three after" time points (times 1, 3, 5).
(a) What is z? (b) What effects do the
ij
s represent? (Choices: overall drug effect
for the after measurements, overall drug effect for the before measurement, average
after" measurement, drug depletion interaction for the before" measurement,
linear effect in after" time points for the drug effect.)
Exercise 4.4.6 (Leprosy). Below are data on leprosy patients found in Snedecor and
Cochran [1989]. There were 30 patients, randomly allocated to three groups of 10.
The rst group received drug A, the second drug D, and the third group received a
placebo. Each person had their bacterial count taken before and after receiving the
treatment.
Drug A Drug D Placebo
Before After Before After Before After
11 6 6 0 16 13
8 0 6 2 13 10
5 2 7 3 11 18
14 8 8 1 9 5
19 11 18 18 21 23
6 4 8 4 16 12
10 13 19 14 12 5
6 1 8 9 12 16
11 8 5 1 7 1
3 0 15 9 12 20
(4.37)
(a) Consider the model Y = x+Rfor the multivariate analysis of variance with three
groups and two variables (so that Y is 30 2), where R N
302
(0, I
30

R
). The
x has vectors for the overall mean, the contrast between the drugs and the placebo,
and the contrast between Drug A and Drug D. Because there are ten people in each
group, x can be written as w1
10
. Find w. (b) Because the before measurements
were taken before any treatment, the means for the three groups on that variable
should be the same. Describe that constraint in terms of the . (c) With Y = (Y
b
Y
a
),
nd the model for the conditional distribution
Y
a
[ Y
b
= y
b
N(x

, I
n
). (4.38)
Give the x

in terms of x and y
b
, and give in terms of the elements of
R
. (Hint:
Write down what it would be with E[Y] = (
b

a
) using the conditional formula,
then see what you get when
b
= x
b
and
a
= x
a
.)
Exercise 4.4.7 (Parity). Johnson and Wichern [2007] present data (in their Exercise
6.17) on an experiment. Each of 32 subjects was given several sets of pairs of integers,
and had to say whether the two numbers had the same parity (i.e., both odd or both
even), or different parities. So (1, 3) have the same parity, while (4, 5) have differ-
ent parity. Some of the integer pairs were given numerically, like (2, 4), and some
were written out, i.e., (Two, Four). The time it took to decide the parity for each pair
was measured. Each person had a little two-way analysis of variance, where the two
factors are Parity, with levels different and same, and Format, with levels word and nu-
meric. The measurements were the median time for each Parity/Format combination
4.4. Exercises 83
for that person. Person i then had observation vector y
i
= (y
i1
, y
i2
, y
i3
, y
i4
), which in
the ANOVA could be arranged as
Format
Parity Word Numeric
Different y
i1
y
i2
Same y
i3
y
i4
(4.39)
The model is of the form
Y = xz
/
+R. (4.40)
(a) What are x and z for the model where each person has a possibly different
ANOVA, and each ANOVA has effects for overall mean, parity effect, format effect,
and parity/format interaction? How many, and which, elements of must be set to
zero to model no-interaction? (b) What are x and z for the model where each person
has the same mean vector, and that vector represents the ANOVA with effects for
overall mean, parity effect, format effect, and parity/format interaction? How many,
and which, elements of must be set to zero to model no-interaction?
Exercise 4.4.8 (Sine waves). Let be an angle running from 0 to 2, so that a sine/-
cosine wave with one cycle has the form
g() = A + B cos( + C) (4.41)
for parameters A, B, and C. Suppose we observe the wave at the q equally-spaced
points

j
=
2
q
j, j = 1, . . . , q, (4.42)
plus error, so that the model is
Y
j
= g(
j
) + R
j
= A + B cos
_
2
q
j + C
_
+ R
j
, j = 1, . . . , q, (4.43)
where the R
j
are the residuals. (a) Is the model linear in the parameters A, B, C? Why
or why not? (b) Show that the model can be rewritten as
Y
j
=
1
+
2
cos
_
2
q
j
_
+
3
sin
_
2
q
j
_
+ R
j
, j = 1, . . . , q, (4.44)
and give the
k
s in terms of A, B, C. [Hint: What is cos(a + b)?] (c) Write this model
as a linear model, Y = z
/
+R, where Y is 1 q. What is the z? (d) Waves with m 1
cycles can be added to the model by including cosine and sine terms with replaced
by m:
cos
_
2m
q
j
_
, sin
_
2m
q
j
_
. (4.45)
If q = 6, then with the constant term, we can t in the cosine and sign terms for
the wave with m = 1 cycle, and the cosine and sine terms for the wave with m = 2
cycles. The x cannot have more than 6 columns (or else it wont be invertible). Find
the cosine and sine terms for m = 3. What do you notice? Which one should you put
in the model?
Chapter 5
Linear Models: Least Squares and
Projections
In this chapter, we briey review linear subspaces and projections onto them. Most of
the chapter is abstract, in the sense of not necessarily tied to statistics. The main result
we need for the rest of the book is the least-squares estimate given in Theorem 5.2.
Further results can be found in Chapter 1 of Rao [1973], an excellent compendium of
facts on linear subspaces and matrices.
5.1 Linear subspaces
We start with the space R
M
. The elements y R
M
may be considered row vectors, or
column vectors, or matrices, or any other conguration. We will generically call them
vectors. This space could represent vectors for individuals, in which case M = q,
the number of variables, or it could represent vectors for variables, so M = n, the
number of individuals, or it could represent the entire data matrix, so that M = nq. A
linear subspace is one closed under addition and multiplication by a scalar. Because
will deal with Euclidean space, everyone knows what addition and multiplication
mean. Here is the denition.
Denition 5.1. A subset J R
M
is a linear subspace of R
M
if
x, y J =x +y J, and (5.1)
c R, x J =cx J. (5.2)
We often shorten linear subspace to subspace, or even space. Note that R
M
is itself a linear (sub)space, as is the set 0. Because c in (5.2) can be 0, any subspace
must contain 0. Any line through 0, or plane through 0, is a subspace. One convenient
representation of subspaces is the set of linear combinations of some elements:
Denition 5.2. The span of the set of vectors d
1
, . . . , d
K
R
M
is
spand
1
, . . . , d
K
= b
1
d
1
+ + b
K
d
K
[ b = (b
1
, . . . , b
K
) R
K
. (5.3)
85
86 Chapter 5. Least Squares
By convention, the span of the empty set is just 0. It is not hard to show that
any span is a linear subspace. Some examples: For K = 2, span(1, 1) is the set
of vectors of the form (a, a), that is, the equiangular line through 0. For K = 3,
span(1, 0, 0), (0, 1, 0) is the set of vectors of the form (a, b, 0), which is the x/y
plane, considering the axes to be x, y, z.
We will usually write the span in matrix form. Letting D be the M K matrix
with columns d
1
, . . . , d
K
. We have the following representations of subspace J:
J = spand
1
, . . . , d
K

= spancolumns of D
= Db[ b R
K
(b is K 1)
= spanrows of D
/

= bD
/
[ b R
K
(b is 1 K). (5.4)
Not only is any span a subspace, but any subspace is a span of some vectors. In
fact, any subspace of R
K
can be written as a span of at most K vectors, although not
in a unique way. For example, for K = 3,
span(1, 0, 0), (0, 1, 0) = span(1, 0, 0), (0, 1, 0), (1, 1, 0)
= span(1, 0, 0), (1, 1, 0)
= span(2, 0, 0), (0, 7, 0), (33, 2, 0). (5.5)
Any invertible transformation of the vectors yields the same span, as in the next
lemma. See Exercise 5.6.4 for the proof.
Lemma 5.1. Suppose J is the span of the columns of the MK matrix D as in (5.4), and
A is an invertible K K matrix. Then J is also the span of the columns of DA, i.e.,
spancolumns of D = spancolumns of DA. (5.6)
Note that the space in (5.5) can be a span of two or three vectors, or a span of any
number more than three as well. It cannot be written as a span of only one vector.
In the two sets of three vectors, there is a redundancy, that is, one of the vectors can
be written as a linear combination of the other two: (1, 1, 0) = (1, 0, 0) + (0, 1, 0) and
(2, 0, 0) = (4/(33 7))(0, 7, 0) + (2/33) (33, 2, 0). Such sets are called linearly
dependent. We rst dene the opposite.
Denition 5.3. The vectors d
1
, . . . , d
K
in R
K
are linearly independent if
b
1
d
1
+ + b
K
d
K
= 0 = b
1
= = b
K
= 0. (5.7)
Equivalently, the vectors are linearly independent if no one of them (as long as
it is not 0) can be written as a linear combination of the others. That is, there is no
d
i
,= 0 and set of coefcients b
j
such that
d
i
= b
1
d
1
+ + b
i1
d
i1
+ b
i+1
d
i+1
+ . . . + b
K
d
K
. (5.8)
The vectors are linearly dependent if and only if they are not linearly independent.
In (5.5), the sets with three vectors are linearly dependent, and those with two
vectors are linearly independent. To see that latter fact for (1, 0, 0), (1, 1, 0), suppose
that
1
(1, 0, 0) +
2
(1, 1, 0) = (0, 0, 0). Then
b
1
+ b
2
= 0 and b
2
= 0 = b
1
= b
2
= 0, (5.9)
5.2. Projections 87
which veries (5.7).
If a set of vectors is linearly dependent, then one can remove one of the redundant
vectors (5.8), and still have the same span. A basis is a set of vectors that has the same
span but no dependencies.
Denition 5.4. The set of vectors d
1
, . . . , d
K
is a basis for the subspace J if the vectors
are linearly independent and J = spand
1
, . . . , d
K
.
Although a (nontrivial) subspace has many bases, each basis has the same number
of elements, which is the dimension. See Exercise 5.6.34.
Denition 5.5. The dimension of a subspace is the number of vectors in any of its bases.
5.2 Projections
In linear models, the mean of the data matrix is presumed to lie in a linear subspace,
and an aspect of tting the model is to nd the point in the subspace closest to the
data. This closest point is called the projection. Before we get to the formal denition,
we need to dene orthogonality. Recall from Section 1.5 that two column vectors v
and w are orthogonal if v
/
w = 0 (or vw
/
= 0 if they are row vectors).
Denition 5.6. The vector v R
M
is orthogonal to the subspace J R
M
if v is orthogonal
to w for all w J. Also, subspace 1 R
M
is orthogonal to J if v and w are orthogonal
for all v 1 and w J.
Geometrically, two objects are orthogonal if they are perpendicular. For example,
in R
3
, the z-axis is orthogonal to the x/y-plane. Exercise 5.6.6 is to prove the next
result.
Lemma 5.2. Suppose J = spand
1
, . . . , d
K
. Then y is orthogonal to J if and only if y
is orthogonal to each d
j
.
Denition 5.7. The projection of y onto J is the y that satises
y J and y y is orthogonal to J. (5.10)
In statistical parlance, the projection y is the t and y y is the residual for the
model. Because of the orthogonality, we have the decomposition of squared norms,
|y|
2
= |y|
2
+|y y|
2
, (5.11)
which is Pythagoras Theorem. In a regression setting, the left-hand side is the total
sum-of-squares, and the right-hand side is the regression sum-of-squares (|y|
2
) plus the
residual sum-of-squares, although usually the sample mean of the y
i
s is subtracted
from y and y.
Exercise 5.6.8 proves the following useful result.
Theorem 5.1 (Projection). Suppose y R
K
and J is a subspace of R
K
, and y is the
projection of y onto J. Then
(a) The projection is unique: If y
1
and y
2
are both in J, and y y
1
and y y
2
are both
orthogonal to J, then y
1
= y
2
.
(b) If y J, then y = y.
88 Chapter 5. Least Squares
(c) If y is orthogonal to J, then y = 0.
(d) The projection y uniquely minimizes the Euclidean distance between y and J, that is,
|y y|
2
< |y w|
2
for all w J, w ,= y. (5.12)
5.3 Least squares
In this section, we explicitly nd the projection of y (1 M) onto J. Suppose
d
1
, . . . , d
K
, the (transposed) columns of the M K matrix D, form a basis for J,
so that the nal expression in (5.4) holds. Our objective is to nd a b, 1 K, so that
y bD
/
. (5.13)
In Section 5.3.1, we specialize to the both-sides model (4.28). Our rst objective is to
nd the best value of b, where we dene best by least squares.
Denition 5.8. A least-squares estimate of b in the equation (5.13) is any

b such that
|y

bD
/
|
2
= min
bR
K
|y bD
/
|
2
. (5.14)
Part (d) of Theorem 5.1 implies that a least squares estimate of b is any

b for which

bD
/
is the projection of y onto the subspace J. Thus y

bD
/
is orthogonal to J,
and by Lemma 5.2, is orthogonal to each d
j
. The result are the normal equations:
(y

bD
/
)d
j
= 0 for each j = 1, . . . , K. (5.15)
We then have
(5.15) = (y

bD
/
)D = 0
=

bD
/
D = yD (5.16)
=

b = yD(D
/
D)
1
, (5.17)
where the nal equation holds if D
/
D is invertible, which occurs if and only if the
columns of D constitute a basis of J. See Exercise 5.6.33. Summarizing:
Theorem 5.2 (Least squares). Any solution

b to the least-squares equation (5.14) satises
the normal equations (5.16). The solution is unique if and only if D
/
D is invertible, in which
case (5.17) holds.
If D
/
D is invertible, the projection of y onto J can be written
y =

bD
/
= yD(D
/
D)
1
D
/
yP
D
, (5.18)
where
P
D
= D(D
/
D)
1
D
/
. (5.19)
The matrix P
D
is called the projection matrix for J. The residuals are then
y y = y yP
D
= y(I
K
P
D
) = yQ
D
, (5.20)
5.3. Least squares 89
where
Q
D
= I
M
P
D
. (5.21)
The minimum value in (5.14) is then
|y y|
2
= yQ
D
y
/
, (5.22)
and the t and residuals are orthogonal:
y(y y)
/
= 0. (5.23)
These two facts are consequences of parts (a) and (c) of the following proposition.
See Exercises 5.6.10 to 5.6.12.
Proposition 5.1 (Projection matrices). Suppose P
D
is dened as in (5.18), where D
/
D is
invertible. Then the following hold.
(a) P
D
is symmetric and idempotent, with trace(P
D
) = K, the dimension of J;
(b) Any symmetric idempotent matrix is a projection matrix for some subspace J;
(c) Q
D
= I
M
P
D
is also a projection matrix, and is orthogonal to P
D
in the sense that
P
D
Q
D
= Q
D
P
D
= 0;
(d) P
D
D = D and Q
D
D = 0.
The matrix Q
D
is the projection matrix onto the orthogonal complement of J,
where the orthogonal complement contains all vectors in R
M
that are orthogonal to
J.
5.3.1 Both-sides model
We apply least squares to estimate in the both-sides model. Chapter 6 will derive
its distribution. Recall the model
Y = xz
/
+R, R N(0, I
n

R
) (5.24)
from (4.28), where Y is n q, x is n p, is p l, and z is q l. Vectorizing Y and
, by property (3.32d) of Kronecker products,
row(Y) = row()(x z)
/
+row(R). (5.25)
Then the least squares estimate of is found as in (5.17), where we make the identi-
cations y = row(Y), b = row() and D = x z (hence M = nq and K = pl):

row() = row(Y)(x z)[(x z)


/
(x z)]
1
= row(Y)(x z)(x
/
x z
/
z)
1
= row(Y)(x(x
/
x)
1
z(z
/
z)
1
). (5.26)
See Proposition 3.2. Now we need that x
/
x and z
/
z are invertible. Undoing as in
(3.32d), the estimate can be written

= (x
/
x)
1
x
/
Yz(z
/
z)
1
. (5.27)
90 Chapter 5. Least Squares
In multivariate regression of Section 4.2, z is non-existent (actually, z = I
q
), so that
the model and estimate are
Y = x +R and

= (x
/
x)
1
x
/
Y, (5.28)
the usual estimate for regression. The repeated measures and growth curve models
such as in (4.23) have x = I
n
, so that
Y = z
/
+R and

= Yz(z
/
z)
1
. (5.29)
Thus, indeed, the both-sides model has estimating matrices on both sides.
5.4 What is a linear model?
We have been working with linear models for a while, so perhaps it is time to formally
dene them. Basically, a linear model for Y is one for which the mean of Y lies in a
given linear subspace. A model itself is a set of distributions. The linear model does
not describe the entire distribution, thus the actual distribution, e.g., multivariate
normal with a particular covariance structure, needs to be specied as well. The
general model we have been using is the both-sides model (4.28) for given matrices x
and z. The Y, hence its mean, is an n q matrix, which resides in R
K
with K = nq.
As in (5.25),
E[row(Y)] = row()(x z)
/
. (5.30)
Letting range over all the p l matrices, we have the restriction
E[row(Y)] J row()(x z)
/
[ row() R
pl
. (5.31)
Is J a linear subspace? Indeed, as in Denition 5.2, it is the span of the transposed
columns of x z, the columns being x
i
z
j
, where x
i
and z
j
are the columns of x and
z, respectively. That is,
row()(x z)
/
=
p

i=1
l

j=1

ij
(x
i
z
j
)
/
. (5.32)
The linear model is then the set of distributions
/ = N(M, I
n
) [ M J and o
+
q
, (5.33)
denoting
o
+
q
= The set of q q positive denite symmetric matrices. (5.34)
Other linear models can have different distributional assumptions, e.g., covariance
restrictions, but do have to have the mean lie in a linear subspace.
There are many different parametrizations of a given linear model, for the same
reason that there are many different bases for the mean space J. For example, it
may not be obvious, but
x =
_
1 0
0 1
_
, z =

1 1 1
1 2 4
1 3 9

(5.35)
5.5. Gram-Schmidt orthogonalization 91
and
x

=
_
1 1
1 1
_
, z

1 1 1
1 0 2
1 1 1

(5.36)
lead to exactly the same model, though different interpretations of the parameters.
In fact, with x being n p and z being q l,
x

= xA and z

= zB, (5.37)
yields the same model as long as A (p p) and B (l l) are invertible:
xz
/
= x

z
/
with

= A
1
B
1
. (5.38)
The representation in (5.36) has the advantage that the columns of the x

are
orthogonal, which makes it easy to nd the least squares estimates as the D
/
D matrix
is diagonal, hence easy to invert. Note the z is the matrix for a quadratic. The z

is
the corresponding set of orthogonal polynomials, as discussed in Section 5.5.2.
5.5 Gram-Schmidt orthogonalization
We have seen polynomial models in (4.6), (4.22) and (4.29). Note that, especially
in the latter case, one can have a design matrix (x or z) whose entries have widely
varying magnitudes, as well as highly correlated vectors, which can lead to numerical
difculties in calculation. Orthogonalizing the vectors, without changing their span,
can help both numerically and for interpretation. Gram-Schmidt orthogonalization
is a well-known constructive approach. It is based on the following lemma.
Lemma 5.3. Suppose (D
1
, D
2
) is MK, where D
1
is MK
1
, D
2
is MK
2
, and J is
the span of the combined columns:
J = spancolumns of (D
1
, D
2
). (5.39)
Suppose D
/
1
D
1
is invertible, and let
D
21
= Q
D
1
D
2
, (5.40)
for Q
D
1
dened in (5.21) and (5.19). Then the columns of D
1
and D
21
are orthogonal,
D
/
21
D
1
= 0, (5.41)
and
J = spancolumns of (D
1
, D
21
). (5.42)
Proof. D
21
is the matrix of residuals for the least-squares model D
2
= D
1
+R, i.e.,
D
1
is the x and D
2
is the Y in the multivariate regression model (5.28). Equation
(5.41) then follows from part (d) of Proposition 5.1: D
/
21
D
1
= D
/
2
Q
D
1
D
1
= 0. For
(5.42),
_
D
1
D
21
_
=
_
D
1
D
2
_
_
I
K
1
(D
/
1
D
1
)
1
D
/
1
D
2
0 I
K
2
_
. (5.43)
The nal matrix is invertible, hence by Lemma 5.1, the spans of the columns of
(D
1
, D
2
) and (D
1
, D
21
) are the same.
92 Chapter 5. Least Squares
Now let d
1
, . . . d
K
be the columns of D (D
1
, D
2
), and J their span as in (5.39).
The Gram-Schmidt process starts by applying Lemma 5.3 with D
1
= d
1
and D
2
=
(d
2
, . . . , d
K
). Dotting out the 1 on the columns as well, we write the resulting
columns of D
21
as the vectors
d
21
, , d
K1
, where d
j1
= d
j

d
/
j
d
1
|d
1
|
2
d
1
. (5.44)
In other words, d
j1
is the residual of the projection of d
j
onto spand
1
. Thus d
1
is
orthogonal to all the d
j1
s in (5.44), and J = spand
1
, d
21
, . . . , d
K1
by (5.42).
Second step is to apply the lemma again, this time with D
1
= d
21
, and D
2
=
(d
31
, . . . , d
K1
), leaving aside the d
1
for the time being. Now write the columns of
the new D
21
dotting out the 1 and 2:
d
312
, , d
K12
, where d
j12
= d
j1

d
/
j1
d
21
|d
21
|
2
d
21
. (5.45)
Now d
21
, as well as d
1
, are orthogonal to the vectors in (5.45), and
J = spand
1
, d
21
, d
312
, . . . , d
K12
. (5.46)
We continue until we have the set of vectors
d
1
, d
21
, d
312
, . . . , d
K1:(K1)
, (5.47)
which are mutually orthogonal and span J. Here, we are using the R-based notation
a : b = a, a + 1, . . . , b for integers a < b. (5.48)
It is possible that one or more of the vectors we use for D
1
will be zero. In such
cases, we just leave the vectors in D
2
alone, i.e., D
21
= D
2
, because the projection
of any vector on the space 0 is 0, hence the residual equals the original vector.
We can describe the entire resulting process iteratively, for i = 1, . . . , K 1 and j =
i + 1, . . . , K, as setting
d
j1:i
=
_
d
j1:(i1)
b
ij
d
i1:(i1)
if d
i1:(i1)
,= 0
d
j1:(i1)
if d
i1:(i1)
= 0,
(5.49)
where
b
ij
=
d
/
j1:(i1)
d
i1:(i1)
|d
i1:(i1)
|
2
(5.50)
if its denominator is nonzero. Otherwise, set b
ij
= 0, although any value will do.
Optionally, one can multiply any of these vectors by a nonzero constant, e.g., so
that it has a norm of one, or for esthetics, so that the entries are small integers. Any
zero vectors left in the set can be eliminated without affecting the span.
Note that by the stepwise nature of the algorithm, we have the spans of the rst k
vectors from each set are equal, that is,
spand
1
, d
2
= spand
1
, d
21

spand
1
, d
2
, d
3
= spand
1
, d
21
, d
312

.
.
.
spand
1
, d
2
, . . . , d
K
= spand
1
, d
21
, d
312
, . . . , d
K1:(K1)
. (5.51)
5.5. Gram-Schmidt orthogonalization 93
The next section derives some important matrix decompositions based on the
Gram-Schmidt orthogonalization. Section 5.5.2 applies the orthogonalization to poly-
nomials.
5.5.1 The QR and Cholesky decompositions
We can write the Gram-Schmidt process in matrix form. The rst step is
_
d
1
d
21
d
K1
_
= D

1 b
12
b
1K
0 1 0
.
.
.
.
.
.
.
.
.
.
.
.
0 0 1

, (5.52)
The b
ij
s are dened in (5.50). Next,
_
d
1
d
21
d
312
d
K12
_
=
_
d
1
d
21
d
K1
_

1 0 0 0
0 1 b
23
b
2K
0 0 1 0
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
0 0 0 1

. (5.53)
We continue, so that the nal result is
D


_
d
1
d
21
d
312
d
K1:(K1)
_
= DB
(1)
B
(2)
B
(K1)
, (5.54)
where B
(k)
is the identity except for the elements kj, j > k:
B
(k)
ij
=

1 if i = j
b
kj
if j > k = i
0 otherwise.
(5.55)
These matrices are upper unitriangular, meaning they are upper triangular (i.e.,
all elements below the diagonal are zero), and all diagonal elements are one. We will
use the notation
T
1
q
= T[ T is q q, t
ii
= 1 for all i, t
ij
= 0 for i > j. (5.56)
Such matrices form an algebraic group. A group of matrices is a set ( of N N
invertible matrices g that is closed under multiplication and inverse:
g
1
, g
2
( g
1
g
2
(, (5.57)
g ( g
1
(. (5.58)
Thus we can write
D = D

B
1
, where B = B
(1)
B
(K1)
. (5.59)
94 Chapter 5. Least Squares
Exercise 5.6.19 shows that
B
1
=

1 b
12
b
13
b
1K
0 1 b
23
b
2K
0 0 1 b
3K
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
0 0 0 1

. (5.60)
Now suppose the columns of D are linearly independent, which means that all the
columns of D

are nonzero (See Exercise 5.6.21.) Then we can divide each column of
D

by its norm, so that the resulting vectors are orthonormal:


q
i
=
d
i1:(i1)
|d
i1:(i1)
|
, Q =
_
q
1
q
K
_
= D

1
, (5.61)
where is the diagonal matrix with the norms on the diagonal. Letting R = B
1
,
we have that
D = QR, (5.62)
where R is upper triangular with positive diagonal elements, the
ii
s. The set of
such matrices R is also group, denoted by
T
+
q
= T[ T is q q, t
ii
> 0 for all i, t
ij
= 0 for i > j. (5.63)
Hence we have the next result. The uniqueness for M = K is shown in Exercise 5.6.26.
Theorem 5.3 (QR-decomposition). Suppose the MK matrix D has linearly independent
columns (hence K M). Then there is a unique decomposition D = QR, where Q, MK,
has orthonormal columns and R T
+
K
.
Gram-Schmidt also has useful implications for the matrix S = D
/
D. From (5.43)
we have
S =
_
I
K
1
0
D
/
2
D
1
(D
/
1
D
1
)
1
I
K
2
__
D
/
1
D
1
0
0 D
/
21
D
21
__
I
K
1
(D
/
1
D
1
)
1
D
/
1
D
2
0 I
K
2
_
=
_
I
K
1
0
S
21
S
1
11
I
K
2
__
S
11
0
0 S
221
__
I
K
1
S
1
11
S
12
0 I
K
2
_
, (5.64)
where S
221
= S
22
S
21
S
1
11
S
12
as in (3.49). See Exercise 5.6.27. Then using steps as
in Gram-Schmidt, we have
S = (B
1
)
/

S
11
0 0 0
0 S
221
0 0
0 0 S
3312
0
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
0 0 0 S
KK1:(K1)

B
1
= R
/
R, (5.65)
because the inner matrix is
2
. Also, note that
b
ij
= S
ij1:(i1)
/S
ii1:(i1)
for j > i, (5.66)
5.5. Gram-Schmidt orthogonalization 95
and R is given by
R
ij
=

_
S
ii1 i1
if j = i,
S
ij1 i1
/
_
S
ii1 i1
if j > i,
0 if j < i.
(5.67)
Exercise 5.6.30 shows this decomposition works for any positive denite symmetric
matrix. It is then called the Cholesky decomposition:
Theorem 5.4 (Cholesky decomposition). If S o
+
q
(5.34), then there exists a unique
R T
+
q
such that S = R
/
R.
Note that this decomposition yields another square root of S.
5.5.2 Orthogonal polynomials
Turn to polynomials. We will illustrate with the example on mouth sizes in (4.29).
Here K = 4, and we will consider the cubic model, so that the vectors are
_
d
1
d
2
d
3
d
4
_
=

1 8 8
2
8
3
1 10 10
2
10
3
1 12 12
2
12
3
1 14 14
2
14
3

. (5.68)
Note that the ages (values 8, 10, 12, 14) are equally spaced. Thus we can just as well
code the ages as (0,1,2,3), so that we actually start with
_
d
1
d
2
d
3
d
4
_
=

1 0 0 0
1 1 1 1
1 2 4 8
1 3 9 27

. (5.69)
Dotting d
1
out of vector w is equivalent to subtracting the mean of the elements
of w for each element. Hence
d
21
=

3/2
1/2
1/2
3/2

3
1
1
3

, d
31
=

7/2
5/2
1/2
11/2

7
5
1
11

, d
41
=

9
8
1
18

.
(5.70)
where we multiplied the rst two vectors in (5.70) by 2 for simplicity. Next, we dot
d
21
out of the last two vectors. So for d
31
, we have
d
312
=

7
5
1
11

(7, 5, 1, 11)
/
(3, 1, 1, 3)
|(3, 1, 1, 3)|
2

3
1
1
3

2
2
2
2

1
1
1
1

,
(5.71)
and, similarly, d
412
= (4.2, 3.6, 5.4, 4.8)
/
(7, 6, 9, 8)
/
. Finally, we dot d
312
out of d
412
to obtain d
4123
= (1, 3, 3, 1)
/
. Then our nal orthogonal polynomial
96 Chapter 5. Least Squares
matrix is the very nice
_
d
1
d
21
d
312
d
4123
_
=

1 3 1 1
1 1 1 3
1 1 1 3
1 3 1 1

. (5.72)
Some older statistics books (e.g., Snedecor and Cochran [1989]) contain tables of or-
thogonal polynomials for small K, and statistical packages will calculate them for
you. In R, the function is poly.
A key advantage to using orthogonal polynomials over the original polynomial
vectors is that, by virtue of the sequence in (5.51), one can estimate the parameters
for models of all degrees at once. For example, consider the mean of the girls mouth
sizes in (4.14) as the y, and the matrix in (5.72) as the D, in the model (5.13):
(21.18, 22.23, 23.09, 24.09) (b
1
, b
2
, b
3
, b
4
)

1 1 1 1
3 1 1 3
1 1 1 1
1 3 3 1

. (5.73)
Because D
/
D is diagonal, the least-squares estimates of the coefcients are found via

b
j
=
yd
j1:(j1)
|d
j1:(j1)
|
2
, (5.74)
which here yields

b = (22.6475, 0.4795, 0.0125, 0.0165). (5.75)


These are the coefcients for the cubic model. The coefcients for the quadratic model
set

b
4
= 0, but the other three are as for the cubic. Likewise, the linear model has

b
equalling (22.6475, 0.4795, 0, 0), and the constant model has (22.6475, 0, 0, 0).
In contrast, if one uses the original vectors in either (5.68) or (5.69), one has to recal-
culate the coefcients separately for each model. Using (5.69), we have the following
estimates:
Model

b

1

b

2

b

3

b

4
Cubic 21.1800 1.2550 0.2600 0.0550
Quadratic 21.1965 0.9965 0.0125 0
Linear 21.2090 0.9590 0 0
Constant 22.6475 0 0 0
(5.76)
Note that the non-zero values in each column are not equal.
5.6 Exercises
Exercise 5.6.1. Show that the span in (5.3) is indeed a linear subspace.
Exercise 5.6.2. Verify that the four spans given in (5.5) are the same.
Exercise 5.6.3. Show that for matrices C (M J) and D (MK),
spancolumns of D spancolumns of C D = CA, (5.77)
for some J K matrix A. [Hint: Each column of D must be a linear combination of
the columns of C.]
5.6. Exercises 97
Exercise 5.6.4. Here, D is an MK matrix, and A is K L. (a) Show that
spancolumns of DA spancolumns of D. (5.78)
[Hint: Any vector in the left-hand space equals DAb for some L 1 vector b. For
what vector b

is DAb = Db

?] (b) Prove Lemma 5.1. [Use part (a) twice, once for
A and once for A
1
.] (c) Show that if the columns of D are linearly independent,
and A is K K and invertible, then the columns of DA are linearly independent.
[Hint: Suppose the columns of DA are linearly dependent, so that for some b ,= 0,
DAb = 0. Then there is a b

,= 0 with Db

= 0. What is it?]
Exercise 5.6.5. Let d
1
, . . . , d
K
be vectors in R
M
. (a) Suppose (5.8) holds. Show that
the vectors are linearly dependent. [That is, nd b
j
s, not all zero, so that

b
i
d
i
= 0.]
(b) Suppose the vectors are linearly dependent. Find an index i and constants b
j
so
that (5.8) holds.
Exercise 5.6.6. Prove Lemma 5.2.
Exercise 5.6.7. Suppose the set of M1 vectors
1
, . . . ,
K
are nonzero and mutually
orthogonal. Show that they are linearly independent. [Hint: Suppose they are linearly
dependent, and let
i
be the vector on the left-hand side in (5.8). Then take
/
i
times
each side of the equation, to arrive at a contradiction.]
Exercise 5.6.8. Prove part (a) of Theorem 5.1. [Hint: Show that the difference of yy
1
and y y
2
is orthogonal to J, as well as in J. Then show that such a vector must be
zero.] (b) Prove part (b) of Theorem 5.1. (c) Prove part (c) of Theorem 5.1. (d) Prove
part (d) of Theorem 5.1. [Hint: Start by writing |y w|
2
= |(y y) (wy)|
2
,
then expand. Explain why y y and wy are orthogonal.]
Exercise 5.6.9. Derive the normal equations (5.15) by differentiating |y bD
/
|
2
with
respect to the b
i
s.
Exercise 5.6.10. This Exercise proves part (a) of Proposition 5.1. Suppose J =
spancolumns of D, where D is M K and D
/
D is invertible. (a) Show that the
projection matrix P
D
= D(D
/
D)
1
D
/
as in (5.19) is symmetric and idempotent. (b)
Show that trace(P
D
) = K.
Exercise 5.6.11. This Exercise proves part (b) of Proposition 5.1. Suppose P is a
symmetric and idempotent M M matrix. Find a set of linearly independent vectors
d
1
, . . . , d
K
, where K = trace(P), so that P is the projection matrix for spand
1
, . . . , d
K
.
[Hint: Write P =
1

/
1
where
1
has orthonormal columns, as in Lemma 3.1. Show
that P is the projection matrix onto the span of the columns of the
1
, and use Exercise
5.6.7 to show that those columns are a basis. What is D, then?]
Exercise 5.6.12. (a) Prove part (c) of Proposition 5.1. (b) Prove part (d) of Proposition
5.1. (c) Prove (5.22). (d) Prove (5.23).
Exercise 5.6.13. Consider the projection of y R
K
onto span1
/
K
. (a) Find the
projection. (b) Find the residual. What does it contain? (c) Find the projection matrix
P. What is Q = I
K
P? Have we seen it before?
Exercise 5.6.14. Verify the steps in (5.26), detailing which parts of Proposition 3.2 are
used at each step.
98 Chapter 5. Least Squares
Exercise 5.6.15. Show that the equation for d
j1
in (5.44) does follow from the deriva-
tion of D
21
.
Exercise 5.6.16. Give an argument for why the set of equations in (5.51) follows from
the Gram-Schmidt algorithm.
Exercise 5.6.17. Given that a subspace is a span of a set of vectors, explain how one
would obtain an orthogonal basis for the space.
Exercise 5.6.18. Let Z
1
be a M K matrix with linearly independent columns. (a)
How would you nd a M(MK) matrix Z
2
so that (Z
1
, Z
2
) is an invertible MM
matrix, and Z
/
1
Z
2
= 0 (i.e., the columns of Z
1
are orthogonal to those of Z
2
). [Hint:
Start by using Lemma 5.3 with D
1
= Z
1
and D
2
= I
M
. (What is the span of the
columns of (Z
1
, I
M
)?) Then use Gram-Schmidt on D
21
to nd a set of vectors to use
as the Z
2
. Do you recognize D
21
?] (b) Suppose the columns of Z are orthonormal.
How would you modify the Z
2
in part (a) so that (Z
1
, Z
2
) is an orthogonal matrix?
Exercise 5.6.19. Consider the matrix B
(k)
dened in (5.55). (a) Show that the inverse
of B
(k)
is of the same form, but with the b
kj
s changed to b
kj
s. That is, the inverse
is the K K matrix C
(k)
, where
C
(k)
ij
=

1 if i = j
b
kj
if j > k = i
0 otherwise.
(5.79)
Thus C is the inverse of the B in (5.59), where C = C
(K1)
C
(1)
. (b) Show that
C is unitriangular, where the b
ij
s are in the upper triangular part, i.e, C
ij
= b
ij
for
j > i, as in (5.60). (c) The R in (5.62) is then C, where is the diagonal matrix with
diagonal elements being the norms of the columns of D

. Show that R is given by


R
ij
=

|d
i1:(i1)
| if j = i
d
/
j1:(i1)
d
i1:(i1)
/|d
i1:(i1)
| if j > i
0 if j < i.
(5.80)
Exercise 5.6.20. Verify (5.66).
Exercise 5.6.21. Suppose d
1
, . . . , d
K
are vectors in R
M
, and d

1
, . . . , d

K
are the corre-
sponding orthogonal vectors resulting from the Gram-Schmidt algorithm, i.e., d

1
=
d
1
, and for i > 1, d

i
= d
i1:(i1)
in (5.49). (a) Show that the d

1
, . . . , d

K
are linearly
independent if and only if they are all nonzero. Why? [Hint: Recall Exercise 5.6.7.]
(b) Show that d
1
, . . . , d
K
are linearly independent if and only if all the d

j
are nonzero.
Exercise 5.6.22. Suppose D is MK, with linearly independent columns, and D =
QR is its QR decomposition. Show that spancolumns of D = spancolumns of Q.
Exercise 5.6.23. Suppose D is an MM matrix whose columns are linearly indepen-
dent. Show that D is invertible. [Hint: Use the QR decomposition in Theorem 5.3.
What kind of a matrix is the Q here? Is it invertible?]
Exercise 5.6.24. (a) Show that spancolumns of Q = R
M
if Q is an M M orthog-
onal matrix. (b) Suppose the M1 vectors d
1
, . . . , d
M
are linearly independent, and
J = spand
1
, . . . , d
M
. Show that J = R
M
. [Hint: Use Theorem 5.3, Lemma 5.1,
and part (a).]
5.6. Exercises 99
Exercise 5.6.25. Show that if d
1
, . . . , d
K
are vectors in R
M
with K > M, that the d
i
s
are linearly dependent. (This fact should make sense, since there cannot be more axes
than there are dimensions in Euclidean space.) [Hint: Use Exercise 5.6.24 on the rst
M vectors, then show how d
M+1
is a linear combination of them.]
Exercise 5.6.26. Show that the QR decomposition in Theorem 5.3 is unique when
M = K. That is, suppose Q
1
and Q
2
are K K orthogonal matrices, and R
1
and
R
2
are K K upper triangular matrices with positive diagonals, and Q
1
R
1
= Q
2
R
2
.
Show that Q
1
= Q
2
and R
1
= R
2
. [Hint: Show that Q Q
/
2
Q
1
= R
2
R
1
1
R,
so that the orthogonal matrix Q equals the upper triangular matrix R with positive
diagonals. Show that therefore Q = R = I
K
.] [Extra credit: Show the uniqueness
when M > K.]
Exercise 5.6.27. Verify (5.64). In particular: (a) Show that
_
I
K
1
A
0 I
K
2
_
1
=
_
I
K
1
A
0 I
K
2
_
. (5.81)
(b) Argue that the 0s in the middle matrix on the left-hand side of (5.64) are correct.
(c) Show S
221
= D
/
21
D
21
.
Exercise 5.6.28. Suppose
S =
_
S
11
S
12
S
21
S
22
_
, (5.82)
where S
11
is K
1
K
1
and S
22
is K
2
K
2
, and S
11
is invertible. (a) Show that
[S[ = [S
11
[ [S
221
[. (5.83)
[Hint: Use (5.64).] (b) Show that
S
1
=

S
1
11
+S
1
11
S
12
S
1
221
S
21
S
1
11
S
1
11
S
12
S
1
221
S
1
221
S
21
S
1
11
S
1
221

. (5.84)
[Hint: Use (5.64) and (5.81).] (c) Use part (b) to show that
[S
1
]
22
= S
1
221
, (5.85)
where [S
1
]
22
is the lower-right K
2
K
2
block of S
1
. Under what condition on the
S
ij
s is [S
1
]
22
= S
1
22
?
Exercise 5.6.29. For S o
+
K
, show that
[S[ = S
11
S
221
S
3312
S
KK1 K1
. (5.86)
[Hint: Use (5.65)). What is the determinant of a unitriangular matrix?]
Exercise 5.6.30. Suppose S o
+
K
. Prove Theorem 5.4, i.e., show that we can write
S = R
/
R, where R is upper triangular with positive diagonal elements. [Hint: Use
the spectral decomposition S = GLG
/
from(1.33). Then let D = L
1/2
G
/
in (5.62). Are
the columns of this D linearly independent?]
100 Chapter 5. Least Squares
Exercise 5.6.31. Show that if W = R
/
R is the Cholesky decomposition of W (K K),
then
[W[ =
K

j=1
r
2
jj
. (5.87)
Exercise 5.6.32. Show that the Cholesky decomposition in Theorem 5.4 is unique.
That is, if R
1
and R
2
are K K upper triangular matrices with positive diagonals,
that R
/
1
R
1
= R
/
2
R
2
implies that R
1
= R
2
. [Hint: Let R = R
1
R
1
2
, and show that
R
/
R = I
K
. Then show that this R must be I
K
, just as in Exercise 5.6.26.]
Exercise 5.6.33. Show that the M K matrix D has linearly independent columns
if and only if D
/
D is invertible. [Hint: If D has linearly independent columns, then
D
/
D = R
/
R as Theorem 5.4, and R is invertible. If the columns are linearly dependent,
there is a b ,= 0 with D
/
Db = 0. Why does that equation imply D
/
D has no inverse?]
Exercise 5.6.34. Suppose D is MK and C is M J, K > J, and both matrices have
linearly independent columns. Furthermore, suppose
spancolumns of D = spancolumns of C. (5.88)
Thus this space has two bases with differing numbers of elements. (a) Let A be the
J K matrix such that D = CA, guaranteed by Exercise 5.6.3. Show that the columns
of A are linearly independent. [Hint: Note that Db ,= 0 for any K 1 vector b ,= 0.
Hence Ab ,= 0 for any b ,= 0.] (b) Use Exercise 5.6.25 to show that such an A cannot
exist. (c) What do you conclude?
Exercise 5.6.35. This exercise is to show that any linear subspace J in R
M
has a basis.
If J = 0, the basis is the empty set. So you can assume J has more than just the
zero vector. (a) Suppose d
1
, . . . , d
J
are linearly independent vectors in R
M
. Show that
d R
M
but d , spand
1
, . . . , d
J
implies that d
1
, . . . , d
J
, d are linearly independent.
[Hint: If they are not linearly independent, then some linear combination of them
equals zero. The coefcient of d in that linear combination must be nonzero. (Why?)
Thus d must be in the span of the others.] (b) Take d
1
J, d
1
,= 0. [I guess we are
assuming the Axiom of Choice.] If spand
1
= J, then we have the basis. If not,
there must be a d
2
J spand
1
. If spand
1
, d
2
= J, we are done. Explain
how to continue. (Also, explain why part (a) is important here.) How do you know
this process stops? (c) Argue that any linear subspace has a corresponding projection
matrix.
Exercise 5.6.36. Suppose P and P

are projection matrices for the linear subspace


J R
M
. Show that P = P

, i.e., the projection matrix is unique to the subspace.


[Hint: Because the projection of any vector is unique, Py = P

y for all y. Consider


the columns of I
M
.]
Exercise 5.6.37. Let D = (D
1
, D
2
), where D
1
is M K
1
and D
2
is M K
2
, and
suppose that the columns of D are linearly independent. Show that
P
D
= P
D
1
+P
D
21
and Q
D
1
Q
D
= P
D
21
, (5.89)
where D
21
= Q
D
1
D
2
. [Hint: Use Lemma 5.3 and the uniqueness in Exercise 5.6.36.]
5.6. Exercises 101
Exercise 5.6.38. Find the orthogonal polynomial matrix (up to cubic) for the four time
points 1, 2, 4, 5.
Exercise 5.6.39 (Skulls). For the model on skull measurements described in Exercise
4.4.2, replace the polynomial matrix w with that for orthogonal polynomials.
Exercise 5.6.40 (Caffeine). In Exercise 4.4.4, the x is a quadratic polynomial matrix
in grade (8, 10, 12). Replace it with the orthogonal polynomial matrix (also 28 3),
where the rst column is all ones, the second is is the linear vector (1
/
9
, 0
/
10
, 1
/
9
)
/
, and
third is the quadratic vector is (1
/
9
, c1
/
10
, 1
/
9
)
/
for some c. Find c.
Exercise 5.6.41 (Leprosy). Consider again the model for the leprosy data in Exercise
4.4.6. An alternate expression for x is w

1
10
, where the rst column of w

rep-
resents the overall mean, the second tells whether the treatment is one of the drugs,
and the third whether the treatment is Drug A, so that
w

1 1 1
1 1 0
1 0 0

. (5.90)
Use Gram-Schmidt to orthogonalize the columns of w

. How does this matrix differ


from w? How does the model using w

differ from that using w?


Chapter 6
Both-Sides Models: Distribution of
Estimator
6.1 Distribution of

The both-sides model as dened in (4.28) is


Y = xz
/
+R, R N(0, I
n

R
), (6.1)
where Y is n q, x is n p, is p l, and z is q l. Assuming that x
/
x and z
/
z are
invertible, the least-squares estimate of is given in (5.27) to be

= (x
/
x)
1
x
/
Yz(z
/
z)
1
. (6.2)
To nd the distribution of

, we rst look at the mean, which by (6.1) is
E[

] = (x
/
x)
1
x
/
E[Y]z(z
/
z)
1
= (x
/
x)
1
x
/
(xz
/
)z(z
/
z)
1
= . (6.3)
Thus

is an unbiased estimator of . For the variances and covariances, Proposition
3.2 helps:
Cov[

] = Cov[(x
/
x)
1
x
/
Yz(z
/
z)
1
]
= Cov[row(Y)(x(x
/
x)
1
z(z
/
z)
1
)]
= (x(x
/
x)
1
z(z
/
z)
1
)
/
Cov[row(Y)](x(x
/
x)
1
z(z
/
z)
1
)
= (x(x
/
x)
1
z(z
/
z)
1
)
/
(I
n

R
)(x(x
/
x)
1
z(z
/
z)
1
)
= ((x
/
x)
1
x
/
I
n
x(x
/
x)
1
) ((z
/
z)
1
z
/

R
z(z
/
z)
1
)
= C
x

z
, (6.4)
where we dene
C
x
= (x
/
x)
1
and
z
= (z
/
z)
1
z
/

R
z(z
/
z)
1
. (6.5)
103
104 Chapter 6. Both-Sides Models: Distribution
Because

is a linear transformation of the Y,

N
pl
(, C
x

z
). (6.6)
The variances of the individual

ij
s are the diagonals of C
x

z
, so that
Var(

ij
) = C
xii

zjj
. (6.7)
The x and z matrices are known, but
R
must be estimated. Before tackling that issue,
we will consider ts and residuals.
6.2 Fits and residuals
The observed matrix Y in (6.1) is a sum of its mean and the residuals R. Estimates of
these two quantities are called the ts and estimated residuals, respectively. The t
of the model to the data is the natural estimate

Y of E[Y] = xz
/
,

Y = x

z = x(x
/
x)
1
x
/
Yz(z
/
z)
1
z
/
= P
x
YP
z
, (6.8)
where for matrix u, P
u
= u(u
/
u)
1
u
/
, the projection matrix on the span of the
columns of u, as in (5.19). We estimate the residuals R by subtracting:

R = Y

Y = Y P
x
YP
z
. (6.9)
The joint distribution of

Y and

R is multivariate normal because the collection is a
linear transformation of Y. The means are straightforward:
E[

Y] = P
x
E[Y]P
z
= P
x
xz
/
P
z
= xz
/
(6.10)
by part (d) of the Proposition 5.1 and
E[

R] = E[Y] E[

Y] = xz
/
xz
/
= 0. (6.11)
The covariance matrix of the t is not hard to obtain, but the covariance of the
residuals, as well as the joint covariance of the t and residuals, are less obvious since
the residuals are not of the form AYB
/
. Instead, we break the residuals into two parts,
the residuals from the left-hand side of the model, and the residuals of the right-hand
part on the t of the left-hand part. That is, we write

R = Y P
x
YP
z
= Y P
x
Y +P
x
Y P
x
YP
z
= (I
n
P
x
)Y +P
x
Y(I
n
P
z
)
= Q
x
Y +P
x
YQ
z


R
1
+

R
2
, (6.12)
where Q
u
= I
n
P
u
as in part (c) of Proposition 5.1. Note that in the multivariate
regression model, where z = I
q
, or if z is q q and invertible, the P
z
= I
q
, hence
Q
z
= 0, and

R
2
= 0.
6.3. SEs and t-statistics 105
For the joint covariance of the t and two residual components, we use the row
function to write
row

R
1

R
2

=
_
row(

Y) row(

R
1
) row(

R
2
)
_
= row(Y)
_
P
x
P
z
Q
x
I
q
P
x
Q
z
_
. (6.13)
Using Proposition 3.1 on the covariance in (6.1), we have
Cov

R
1

R
2

P
x
P
z
Q
x
I
q
P
x
Q
z

(I
n

R
)

P
x
P
z
Q
x
I
q
P
x
Q
z

/
=

P
x
P
z

R
P
z
0 P
x
P
z

R
Q
z
0 Q
x

R
0
P
x
Q
z

R
P
z
0 P
x
Q
z

R
Q
z

, (6.14)
where we use Proposition 5.1, parts (a) and (c), on the projection matrices.
One y in the ointment is that the t and residuals are not in general independent,
due to the possible non-zero correlation between the t and

R
2
. The t is independent
of

R
1
, though. We obtain the distributions of the t and residuals to be

Y N(xz
/
, P
x
P
z

R
P
z
), (6.15)
and

R N(0, Q
x

R
+P
x
Q
z

R
Q
z
). (6.16)
6.3 Standard errors and t-statistics
Our rst goal is to estimate the variances of the

ij
s in (6.7). We need to estimate
z
in (6.5), so we start by estimating
R
. From (6.14) and above,

R
1
= Q
x
Y N(0, Q
x

R
). (6.17)
Because Q
x
is idempotent, Corollary 3.1 shows that

R
/
1

R
1
= Y
/
Q
x
Y Wishart
p
(n p,
R
), (6.18)
by Proposition 5.1 (a), since trace(Q
x
) = trace(I
n
P
x
) = n p. Thus an unbiased
estimate of
R
is

R
=
1
n p

R
/
1

R
1
. (6.19)
Now to estimate Cov(

) in (6.4), by (3.79),
(z
/
z)
1
z
/
Y
/
Q
x
Yz(z
/
z)
1
Wishart
q
(n p,
z
), (6.20)
so that an unbiased estimate of
z
is

z
= (z
/
z)
1
z
/

R
z(z
/
z)
1
. (6.21)
106 Chapter 6. Both-Sides Models: Distribution
The diagonals of

z
are chi-squareds:

zjj

1
n p

zjj

2
np
, (6.22)
and the estimate of the variance of

ij
in (6.7) is

Var(

ij
) = C
xii

zjj

1
n p
C
xii

zjj

2
np
. (6.23)
One more result we need is that

is independent of

z
. But because P
u
u = u, we
can write

= (x
/
x)
1
x
/
P
x
YP
z
z(z
/
z)
1
= (x
/
x)
1
x
/

Yz(z
/
z)
1
, (6.24)
which shows that

depends on Y through only

Y. Since

z
depends on Y through
only

R
1
, the independence of

Y and

R
1
in (6.14) implies independence of the two
estimators. We collect these results.
Theorem 6.1. In the both-sides model (6.1),

in (5.27) and Y
/
Q
x
Y are independent, with
distributions given in (6.6) and (6.18), respectively.
Recall the Students t distribution:
Denition 6.1. If Z N(0, 1) and U
2

, and Z and U are independent, then


T
Z

U/
, (6.25)
has the Student t distribution on degrees of freedom, written T t

.
Applying the denition to the

ij
s, from (6.6), (6.7), and (6.23),
Z =

ij

ij
_
C
xii

zjj
N(0, 1) and U = (n p)

Var(

ij
)
C
xii

zjj
, (6.26)
and Z and U are independent, hence
Z
_
U/(n p)
=
(

ij

ij
)/
_
C
xii

zjj
_

Var(

ij
)/C
xii

zjj
=

ij

ij
_

Var(

ij
)
t
np
. (6.27)
6.4 Examples
6.4.1 Mouth sizes
Recall the measurements on the size of mouths of 27 kids at four ages, where there
are 11 girls (Sex=1) and 16 boys (Sex=0) in Section 4.2.1. Heres the model where the x
6.4. Examples 107
matrix compares the boys and girls, and the z matrix species orthogonal polynomial
growth curves:
Y = xz
/
+R
=
_
1
11
1
11
1
16
0
16
__

0

1

2

3

0

1

2

3
_

1 1 1 1
3 1 1 3
1 1 1 1
1 3 3 1

+R, (6.28)
Compare this model to that in (4.13). Here the rst row of coefcients are the boys
coefcients, and the sum of the rows are the girls coefcients, hence the second row
is the girls minus the boys. We rst nd the estimated coefcients (using R), rst
creating the x and z matrices:
x < cbind(1,rep(c(1,0),c(11,16)))
z < cbind(c(1,1,1,1),c(3,1,1,3),c(1,1,1,1),c(1,3,3,1))
estx < solve(t(x)%%x,t(x))
estz < solve(t(z)%%z,t(z))
betahat < estx%%mouths[,1:4]%%t(estz)
The

is
Intercept Linear Quadratic Cubic
Boys 24.969 0.784 0.203 0.056
Girls Boys 2.321 0.305 0.214 0.072
(6.29)
Before trying to interpret the coefcients, we would like to estimate their standard
errors. We calculate

R
= Y
/
Q
x
Y/(n p), where n = 27 and p = 2, then

z
=
(z
/
z)
1
z
/

z(z
/
z)
1
and C
x
= (x
/
x)
1
of (6.5):
PxY < x%%estx%%mouths[,1:4]
QxY < mouths[,1:4]PxY
sigmaRhat < t(QxY)%%QxY/(272)
sigmazhat < estz%%sigmaRhat%%t(estz)
cx < solve(t(x)%%x)
We nd

z
=

3.7791 0.0681 0.0421 0.1555


0.0681 0.1183 0.0502 0.0091
0.0421 0.0502 0.2604 0.0057
0.1555 0.0091 0.0057 0.1258

(6.30)
and
C
x
=
_
0.0625 0.0625
0.0625 0.1534
_
. (6.31)
By (6.7), the standard errors of the

ij
s are estimated by multiplying the i
th
di-
agonal of C
x
and j
th
diagonal of

z
, then taking the square root. We can obtain the
matrix of standard errors using the command
se < sqrt(outer(diag(cx),diag(sigmazhat),""))
The t-statistics then divide the estimates by their standard errors, betahat/se:
108 Chapter 6. Both-Sides Models: Distribution
Standard Errors
Intercept Linear Quadratic Cubic
Boys 0.4860 0.0860 0.1276 0.0887
Girls Boys 0.7614 0.1347 0.1999 0.1389
t-statistics
Intercept Linear Quadratic Cubic
Boys 51.38 9.12 1.59 0.63
Girls Boys 3.05 2.26 1.07 0.52
(6.32)
(The function bothsidesmodel of Section A.2 performs these calculations as well.) It
looks like the quadratic and cubic terms are unnecessary, so that straight lines for each
sex t well. It is clear that the linear term for boys is necessary, and the intercepts for
the boys and girls are different (the two-sided p-value for 3.05 with 25 df is 0.005).
The p-value for the GirlsBoys slope is 0.033, which may or may not be signicant,
depending on whether you take into account multiple comparisons.
6.4.2 Histamine in dogs
Turn to the example in Section 4.2.1 where sixteen dogs were split into four groups,
and each had four histamine levels measured over the course of time.
In the model (4.15), the x part of the model is for a 2 2 ANOVA with interaction.
We add a matrix z to obtain a both-sides model. The z we will take has a separate
mean for the before measurements, because these are taken before any treatment,
and a quadratic model for the three after time points:
Y = xz
/
+R, (6.33)
where as in (4.15), but using a Kronecker product to express the x matrix,
x =

1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1

1
4
, (6.34)
=

b

0

1

2

b

0

1

2

b

0

1

2

b

0

1

2

(6.35)
and
z
/
=

1 0 0 0
0 1 1 1
0 1 0 1
0 1 2 1

. (6.36)
The s are for the overall mean of the groups, the s for the drug effects, the s
for the depletion effect, and the s for the interactions. The b subscript indicates
the before means, and the 0, 1 and 2 subscripts indicate the constant, linear, and
quadratic terms of the growth curves. We rst set up the design matrices:
6.5. Exercises 109
x < cbind(1,rep(c(1,1),c(8,8)),rep(c(1,1,1,1),c(4,4,4,4)))
x < cbind(x,x[,2]x[,3])
z < cbind(c(1,0,0,0),c(0,1,1,1),c(0,1,0,1),c(0,1,2,1))
The rest of the calculations follow as in the previous section, where here the y is
in the R matrix histamine. Because of the orthogonal columns for x, the matrix C
x
is diagonal, in fact is C
x
=
1
16
I
4
. The coefcients estimates, standard errors, and
t-statistics are below. Note the pattern in the standard errors.
Estimates
Before Intercept Linear Quadratic
Mean 0.0769 0.3858 0.1366 0.0107
Drug 0.0031 0.1842 0.0359 0.0082
Depletion 0.0119 0.2979 0.1403 0.0111
Interaction 0.0069 0.1863 0.0347 0.0078
Standard Errors
Before Intercept Linear Quadratic
Mean 0.0106 0.1020 0.0607 0.0099
Drug 0.0106 0.1020 0.0607 0.0099
Depletion 0.0106 0.1020 0.0607 0.0099
Interaction 0.0106 0.1020 0.0607 0.0099
t-statistics
Before Intercept Linear Quadratic
Mean 7.25 3.78 2.25 1.09
Drug 0.29 1.81 0.59 0.83
Depletion 1.12 2.92 2.31 1.13
Interaction 0.65 1.83 0.57 0.79
(6.37)
Here n = 16 and p = 4, so the degrees of freedom in the t-statistics are 12. It
looks like the quadratic terms are not needed, and that the basic assumption that the
treatment effects for the before measurements is 0 is reasonable. It looks also like the
drug and interaction effects are 0, so that the statistically signicant effects are the
intercept and linear effects for the mean and depletion effects. See Figure 4.3 for a
plot of these effects. Chapter 7 deals with testing blocks of
ij
s equal to zero, which
may be more appropriate for these data.
6.5 Exercises
Exercise 6.5.1. Justify the steps in (6.4) by refering to the appropriate parts of Propo-
sition 3.2.
Exercise 6.5.2. Verify the calculations in (6.12).
Exercise 6.5.3 (Bayesian inference). This exercise extends the Bayesian results in Exer-
cises 3.7.29 and 3.7.30 to the in multivariate regression. We start with the estimator

in (6.6), where the z = I


q
, hence
z
=
R
. The model is then

[ = b N
pq
(b, (x
/
x)
1

R
) and N
pq
(
0
, K
1
0

R
), (6.38)
110 Chapter 6. Both-Sides Models: Distribution
where
R
,
0
, and K
0
are known. Note that the
R
matrix appears in the prior,
which makes the posterior tractable. (a) Show that the posterior distribution of is
multivariate normal, with
E[ [

=

b] = (x
/
x +K
0
)
1
((x
/
x)

b +K
0

0
), (6.39)
and
Cov[ [

=

b] = (x
/
x +K
0
)
1

R
. (6.40)
[Hint: Same hint as in Exercise 3.7.30.] (b) Set the prior parameters
0
= 0 and
K
0
= k
0
I
p
for some k
0
> 0. Show that
E[ [

=

b] = (x
/
x + k
0
I
p
)
1
x
/
y. (6.41)
This conditional mean is the ridge regression estimator of . See Hoerl and Kennard
[1970]. This estimator can be better than the least squares estimator (a little biased, but
much less variable) when x
/
x is nearly singular, that is, one or more of its eigenvalues
are close to zero.
Exercise 6.5.4 (Prostaglandin). Continue with the data described in Exercise 4.4.1.
The data are in the R matrix prostaglandin. Consider the both-sides model (6.1),
where the ten people have the same mean, so that x = 1
10
, and z contains the cosine
and sine vectors for m = 1, 2 and 3, as in Exercise 4.4.8. (Thus z is 6 6.) (a) What is
z? (b) Are the columns of z orthogonal? What are the squared norms of the columns?
(c) Find

. (d) Find

z
. (e) Find the (estimated) standard errors of the

j
s. (f) Find
the t-statistics for the
j
s. (g) Based on the t-statistics, which model appears most
appropriate? Choose from the constant model; the one-cycle model (just m=1); the
model with one cycle and two cycles; the model with one, two and three cycles.
Exercise 6.5.5 (Skulls). This question continues with the data described in Exercise
4.4.2. The data are in the R matrix skulls, obtained from http://lib.stat.cmu.edu/
DASL/Datafiles/EgyptianSkulls.html at DASL Project [1996]. The Y N(x, I
m

R
), where the x represents the orthogonal polynomials over time periods (from
Exercise 5.6.39). (a) Find

. (b) Find (x
/
x)
1
. (c) Find

R
. What are the degrees
of freedom? (d) Find the standard errors of the

ij
s. (e) Which of the

ij
s have
t-statistic larger than 2 in absolute value? (Ignore the rst row, since those are the
overall means.) (f) Explain what the parameters with [t[ > 2 are measuring. (g)
There is a signicant linear trend for which measurements? (h) There is a signicant
quadratic trend for which measurements?
Exercise 6.5.6 (Caffeine). This question uses the caffeine data (in the R matrix caeine)
and the model from Exercise 4.4.4. (a) Fit the model, and nd the relevant estimates.
(b) Find the t-statistics for the

ij
s. (c) What do you conclude? (Choose as many
conclusions as appropriate from the following: On average the students do about the
same with or without caffeine; on average the students do signicantly better with-
out caffeine; on average the students do signicantly better with caffeine; the older
students do about the same as the younger ones on average; the older students do
signicantly better than the younger ones on average; the older students do signi-
cantly worse than the younger ones on average; the deleterious effects of caffeine are
not signicantly different for the older students than for the younger; the deleterious
6.5. Exercises 111
effects of caffeine are signicantly greater for the older students than for the younger;
the deleterious effects of caffeine are signicantly greater for the younger students
than for the older; the quadratic effects are not signicant.)
Exercise 6.5.7 (Grades). Consider the grades data in (4.10). Let Y be the 107 5
matrix consisting of the variables homework, labs, inclass, midterms, and nal. The
x matrix indicates gender. Let the rst column of x be 1
n
. There are 70 women and
37 men in the class, so let the second column have 0.37 for the women and 0.70
for the men. (That way, the columns of x are orthogonal.) For the z, we want the
overall mean score; a contrast between the exams (midterms and nal) and other
scores (homework, labs, inclass); a contrast between (homework, labs) and inclass; a
contrast between homework and labs; and a contrast between midterms and nal.
Thus
z
/
=

1 1 1 1 1
2 2 2 3 3
1 1 2 0 0
1 1 0 0 0
0 0 0 1 1

. (6.42)
Let
=
_

1

2

3

4

5

1

2

3

4

5
_
. (6.43)
(a) Briey describe what each of the parameters represents. (b) Find

. (c) Find the
standard errors of the

ij
s. (d) Which of the parameters have |t-statistic| over 2? (e)
Based on the results in part (d), discuss whether there is any difference between the
grade proles of the men and women.
Chapter 7
Both-Sides Models: Hypothesis Tests on

Testing a single
ij
= 0 is easy using the t-test based on (6.27). It is often informative
to test a set of
ij
s is 0, e.g., a row from , or a column, or a block, or some other
conguration. In Section 7.1, we present a general test statistic and its
2
approxima-
tion for testing any set of parameters equals zero. Section 7.2 renes the test statistic
when the set of
ij
s of interest is a block.
7.1 Approximate
2
test
Start by placing the parameters of interest in the 1 K vector . We assume we have
a vector of estimates

such that

N(, ), (7.1)
and we wish to test
H
0
: = 0. (7.2)
We could test whether the vector equals a xed non-zero vector, but then we can
subtract that hypothesized value from and return to the case (7.2). Assuming is
invertible, we have that under H
0
,

1/2
N(0, I
K
), (7.3)
hence

/

2
K
. (7.4)
Typically, will have to be estimated, in which case we use
T
2

/
, (7.5)
where under appropriate conditions (e.g., large sample size relative to K), under H
0
,
T
2

2
K
. (7.6)
113
114 Chapter 7. Both-Sides Models: Testing
7.1.1 Example: Mouth sizes
In the mouth size example in Section 7.1.1, consider testing the t of the model spec-
ifying parallel straight lines for the boys and girls, with possibly differing intercepts.
Then in the model (6.28), only
0
,
1
, and
0
would be nonzero, so we would be
testing whether the other ve are zero. Place those in the vector :
= (
2
,
3
,
1
,
2
,
3
). (7.7)
The estimate is

= (0.203, 0.056, 0.305, 0.214, 0.072). (7.8)


To nd the estimated covariance matrix

, we need to pick off the relevant elements
of the matrix C
x

z
using the values in (6.30). In terms of row(), we are interested
in elements 3, 4, 6, 7, and 8. Continuing the R work from Section 6.4.1, we have
omegahat < kronecker(cx,sigmazhat)[c(3,4,6,7,8),c(3,4,6,7,8)]
so that

0.01628 0.00036 0.00314 0.01628 0.00036


0.00036 0.00786 0.00057 0.00036 0.00786
0.00314 0.00057 0.01815 0.00770 0.00139
0.01628 0.00036 0.00770 0.03995 0.00088
0.00036 0.00786 0.00139 0.00088 0.01930

. (7.9)
The statistic (7.6) is
T
2
=

/
= 10.305. (7.10)
The degrees of freedom K = 5, which yields an approximate p-value of 0.067, border-
line signicant. Judging from the individual t-statistics, the

22
element, indicating a
difference in slopes, may be the reason for the almost-signicance.
7.2 Testing blocks of are zero
In this section, we focus on blocks in the both-sides model, for which we can nd a
better approximation to the distribution of T
2
, or in some cases the exact distribution.
A block

of the p l matrix is a p

rectangular (though not necessarily


contiguous) submatrix of . For example, if is 5 4, we might have

be the 3 2
submatrix

11

13

41

43

51

53

, (7.11)
which uses the rows 1, 4, and 5, and the columns 1 and 3. Consider testing the
hypothesis
H
0
:

= 0. (7.12)
The corresponding estimate of

has distribution

N
p

l
(

, C

z
), (7.13)
7.2. Tests for blocks 115
where C

x
and

z
are the appropriate p

and l

submatrices of, respectively,


C
x
and
z
. In the example (7.11),
C

x
=

C
x11
C
x14
C
x15
C
x41
C
x44
C
x45
C
x51
C
x54
C
x55

, and

z
=
_

z11

z13

z31

z33
_
. (7.14)
Also, letting

z
be the corresponding submatrix of (6.21), we have

z
Wishart
l
(,

z
), = n p. (7.15)
We take
= row(

) and = C

z
, (7.16)
and

,

as the obvious estimates. Then using (3.32c) and (3.32d), we have that
T
2
= row(

)(C

z
)
1
row(

)
/
= row(

)(C

x
1

z
1
) row(

)
/
= row(C

x
1

z
1
) row(

)
/
= trace(C

x
1

z
1

/
), (7.17)
where the nal equation results by noting that for p q matrices A and D,
row(A) row(D)
/
=
p

i=1
q

j=1
a
ij
d
ij
= trace(AD
/
). (7.18)
To clean up the notation a bit, we write
T
2
= trace(W
1
B), (7.19)
where
W =

z
, = n p, and B =

/
C

x
1

. (7.20)
Thus by Theorem 6.1, B and W are independent, and by (7.13) and (7.15), under H
0
,
B Wishart
l
(p

z
) and W Wishart
l
(,

z
). (7.21)
In multivariate analysis of variance, we usually call B the between-group sum of
squares and cross-products matrix, and W the within-group matrix. The test based
on T
2
in (7.19) is called the Lawley-Hotelling trace test, where the statistic is usually
dened to be T
2
/.
We could use the approximation (7.6) with K = p

, but we do a little better with


the approximation
F
l

+ 1
p

T
2
F
p

,l

+1
. (7.22)
Recall the following denition of F
,
.
Denition 7.1. If B
2

and W
2

, and B and W are independent, then


F =
1

B
1

W
F
,
, (7.23)
an F distribution with degrees of freedom and .
116 Chapter 7. Both-Sides Models: Testing
When p

= 1 or l

= 1, so that we are testing elements within a single row or


column, the distribution in (7.22) is exact. In fact, when we are testing just one
ij
(p

= l

= 1), the T
2
is the square of the usual t-statistic, hence distributed F
1,
. In
other cases, at least the mean of the test statistic matches that of the F. The rest of
this section veries these statements.
7.2.1 Just one column F test
Suppose l

= 1, so that B and W are independent scalars, and from (7.21),


B
2
z

2
p
and W
2
z

2

. (7.24)
Then in (7.22), the constant multiplying T
2
is simply /p

, so that
F =

p

B
W
F
p

,
. (7.25)
This is the classical problem in multiple (univariate) regression, and this test is the
regular F test.
7.2.2 Just one row Hotellings T
2
Now p

= 1, so that the B in (7.19) has just one degree of freedom. Thus we can write
B = Z
/
Z, Z N
1l
(0,

z
), (7.26)
where
Z =

/
_
C

x
. (7.27)
(Note that C

x
is a scalar.) From (7.19), T
2
can be variously written
T
2
= trace(W
1
Z
/
Z) = ZW
1
Z
/
=

1
z

/
C

x
. (7.28)
In this situation, the statistic is called Hotellings T
2
. The next proposition shows that
the distribution of the F version of Hotellings T
2
in (7.22) is exact, setting p

= 1.
The proof of the proposition is in Section 8.4.
Proposition 7.1. Suppose W and Z are independent, W Wishart
l
(, ) and Z
N
1l
(0, ), where l

and is invertible. Then


l

+ 1
l

ZW
1
Z
/
F
l

,l

+1
. (7.29)
7.2.3 General blocks
In this section we verify that the expected values of the two sides of (7.22) are the
same. It is not hard to show that E[
2

] = and E[1/
2

] = 1/( 2) if > 2. Thus


by the denition of F in (7.23), independence yields
E[F
,
] =

2
(7.30)
7.2. Tests for blocks 117
if > 2. Otherwise, the expected value is +. For T
2
in (7.19) and (7.21), again by
independence of B and W,
E[T
2
] = trace(E[W
1
] E[B]) = p

trace(E[W
1
]

z
), (7.31)
because E[B] = p

z
by (3.73). To nish, we need the following lemma, which
extends the results on E[1/
2

].
Lemma 7.1. If W Wishart
l
(, ), > l

+ 1, and is invertible,
E[W
1
] =
1
l

1

1
. (7.32)
The proof in in Section 8.3. Continuing from (7.31),
E[T
2
] =
p

1
trace(
1
z

z
) =
p

1
. (7.33)
Using (7.33) and (7.30) on (7.22), we have
l

+ 1
p

E[T
2
] =
l

+ 1
l

1
= E[F
p

,l

+1
]. (7.34)
7.2.4 Additional test statistics
In addition to the Lawley-Hotelling trace statistic T
2
(7.19), other popular test statis-
tics for testing blocks based on W and B in (7.19) include the following.
Wilks
The statistic is based on the likelihood ratio statistic (see Section 9.3.1), and is dened
as
=
[W[
[W+B[
. (7.35)
See Exercise 9.6.8. Its distribution under the null hypothesis has the Wilks distri-
bution, which is a generalization of the beta distribution.
Denition 7.2 (Wilks ). Suppose W and B are independent Wisharts, with distributions
as in (7.21). Then has the Wilks distribution with dimension l

and degrees of freedom


(p

, ), written
Wilks
l
(p

, ). (7.36)
Wilks can be represented as a product of independent beta random variables.
Bartlett [1954] has a number of approximations for multivariate statistics, including
one for Wilks :

_

l

+ 1
2
_
log()
2
p

. (7.37)
118 Chapter 7. Both-Sides Models: Testing
Pillai trace
This one is the locally most powerful invariant test. (Dont worry about what that
means exactly, but it has relatively good power if in the alternative the

is not far
from 0.) The statistic is
trace((W+B)
1
B). (7.38)
Asymptotically, as ,
trace((W+B)
1
B)
2
l

, (7.39)
which is the same limit as for the Lawley-Hotelling T
2
.
Roys maximum root
This test is based on the largest root, i.e., largest eigenvalue of (W+B)
1
B.
If p

= 1 or l

= 1, these statistics are all equivalent to T


2
. In general, Lawley-
Hotelling, Pillai, and Wilks have similar operating characteristics. Each of these four
tests is admissible in the sense that there is no other test of the same level that always
has better power. See Anderson [2003], for discussions of these statistics, including
some asymptotic approximations and tables.
7.3 Examples
In this section we further analyze the mouth size and histamine data. The book Hand
and Taylor [1987] contains a number of other nice examples.
7.3.1 Mouth sizes
We will use the model from (6.28) with orthogonal polynomials for the four age
variables and the second row of the representing the differences between the girls
and boys:
Y = xz
/
+R
=
_
1
11
1
11
1
16
0
16
__

0

1

2

3

0

1

2

3
_

1 1 1 1
3 1 1 3
1 1 1 1
1 3 3 1

+R.
(7.40)
See Section 6.4.1 for calculation of estimates of the parameters.
We start by testing equality of the boys and girls curves. Consider the last row of
:
H
0
: (
0
,
1
,
2
,
3
) = (0, 0, 0, 0). (7.41)
The estimate is
(

0
,

1
,

2
,

3
) = (2.321, 0.305, 0.214, 0.072). (7.42)
7.3. Examples 119
Because p

= 1, the T
2
is Hotellings T
2
from Section 7.2.2, where l

= l = 4 and
= n p = 27 2 = 25. Here C

x
= C
x22
= 0.1534 and

z
=

z
from (6.30). We
calculate T
2
= 16.5075, using
t2 < betahat[2,]%%solve(sigmazhat,betahat[2,])/cx[2,2]
By (7.22), under the null,
l

+1
l

T
2
F
l

,l

+1

22
100
16.5075 = 3.632, (7.43)
which, compared to a F
4,22
, has p-value 0.02. So we reject H
0
, showing there is a
difference in the sexes.
Next, consider testing that the two curves are actually linear, that is, the quadratic
and cubic terms are 0 for both curves:
H
0
:
_

2

3

2

3
_
= 0. (7.44)
Now p

= l

= 2, C

x
= C
x
, and

z
is the lower right 2 2 submatrix of

z
. Calcu-
lating:
sigmazstar < sigmazhat[3:4,3:4]
betastar < betahat[,3:4]
b < t(betastar)%%solve(cx)%%betastar % Note that solve(cx) = t(x)%%x
t2 < tr(solve(sigmazstar)%%b)
The function tr is a simple function that nds the trace of a square matrix dened by
tr < function(x) sum(diag(x))
This T
2
= 2.9032, and the F form in (7.22) is (24/100) T
2
= 0.697, which is not
at all signicant for an F
4,24
. The function bothsidesmodel.test in Section A.2.2 will
perform this test, as well as the Wilks test.
The other tests in Section 7.2.4 are also easy to implement, where here W = 25

z
.
Wilks (7.35) is
w < 25sigmazstar
lambda < det(w)/det(b+w)
The = 0.8959. For the large-sample approximation (7.37), the factor is 24.5, and the
statistic is 24.5 log() = 2.693, which is not signicant for a
2
4
. Pillais trace test
statistic (7.38) is
tr(solve(b+w)%%b)
which equals 0.1041, and the statistic (7.39 ) is 2.604, similar to Wilks . The nal
one is Roys maximum root test. The eigenvalues are found using
eigen(solve(b+u)%%b)$values
being 0.1036 and 0.0005. Thus the statistic here is 0.1036. Anderson [2003] has tables
and other information about these tests. For this situation, ( + p

)/p

times the
statistic, which is (27/2) 0.1036 = 1.40, has 0.05 cutoff point of 5.75.
The conclusion is that we need not worry about the quadratic or cubic terms. Just
for fun, go back to the original model, and test the equality of the boys and girls
120 Chapter 7. Both-Sides Models: Testing
curves presuming the quadratic and cubic terms are 0. The

= (2.321, 0.305),
p

= 1 and l

= 2. Hotellings T
2
= 13.1417, and the F = [(25 2 + 1)/(25 2)]
13.1417 = 6.308. Compared to an F
2,25
, the p-value is 0.006. Note that this is quite
a bit smaller than the p-value before (0.02), since we have narrowed the focus by
eliminating insignicant terms from the statistic.
Our conclusion that the boys and girls curves are linear but different appears
reasonable given the Figure 4.1.
7.3.2 Histamine in dogs
Consider again the two-way multivariate analysis of variance model Y = x +R from
(4.15), where
=

b

0

1

2

b

1

2

3

b

1

2

3

b

1

2

3

. (7.45)
Recall that the s are for the overall mean of the groups, the s for the drug effects,
the s for the depletion effect, and the s for the interactions. The b subscript
indicates the before means, and the 1, 2, 3s represent the means for the three after
time points.
We do an overall test of equality of the four groups based on the three after time
points. Thus
H
0
:

1

2

3

1

2

3

1

2

3

= 0. (7.46)
Section 6.4.2 contains the initial calculations. Here, C

x
is the lower right 3 3
submatrix of C
x
, i.e., C
x
= (1/16)I
3
,

z
is the lower right 3 3 submatrix of
z
,
p

= l

= 3, and n p = 16 4 = 12. We then have


t2 < 16tr(solve(sigmazhat[2:4,2:4])%%t(betahat[2:4,2:4])%%betahat[2:4,2:4])
f < (123+1)t2/(1233)
Here, T
2
= 41.5661 and F = 3.849, which has degrees of freedom (9,10). The p-value
is 0.024, which does indicate a difference in groups.
7.4 Testing linear restrictions
Instead of testing that some of the
ij
s are 0, one often wishes to test equalities among
them, or other linear restrictions. For example, consider the one-way multivariate
analysis of variance with three groups, with n
k
observations in group k, and q = 2
variables (such as the leprosy data below), written as
Y =

1
n
1
0 0
0 1
n
2
0
0 0 1
n
3

11

12

21

22

31

32

+R. (7.47)
The hypothesis that the groups have the same means is
H
0
:
11
=
21
=
31
and
12
=
22
=
32
. (7.48)
7.5. Covariates 121
That hypothesis can be expressed in matrix form as
_
1 1 0
1 0 1
_

11

12

21

22

31

32

=
_
0 0
0 0
_
. (7.49)
Or, if only the second column of Y is of interest, then one might wish to test
H
0
:
12
=
22
=
32
, (7.50)
which in matrix form can be expressed as
_
1 1 0
1 0 1
_

11

12

21

22

31

32

_
0
1
_
=
_
0
0
_
. (7.51)
Turning to the both-sides model, these hypotheses can be written as
H
0
: CD
/
= 0, (7.52)
where C (p

p) and D (l

l) are xed matrices that express the desired restrictions.


To test the hypothesis, we use
C

D
/
N(CD
/
, CC
x
C
/
D
z
D
/
), (7.53)
and
D

z
D
/

1
n p
Wishart(n p, D
z
D
/
). (7.54)
Then, assuming the appropriate matrices are invertible, we set
B = D

/
(CC
x
C
/
)
1
C

D
/
, = n p, W = D

z
D
/
, (7.55)
which puts us back at the distributions in (7.21). Thus T
2
or any of the other test
statistics can be used as above. In fact, the hypothesis

in (7.12) and (7.11) can be


written as

= CD
/
for C and D with 0s and 1s in the right places.
7.5 Covariates
A covariate is a variable that is of (possibly) secondary importance, but is recorded
because it might help adjust some estimates in order to make them more precise.
As a simple example, consider the Leprosy example described in Exercise 4.4.6. The
main interest is the effect of the treatments on the after-treatment measurements. The
before-treatment measurements constitute the covariate in this case. These measure-
ments are indications of health of the subjects before treatment, the higher the less
healthy. Because of the randomization, the before measurements have equal popula-
tion means for the three treatments. But even with a good randomization, the sample
means will not be exactly the same for the three groups, so this covariate can be used
to adjust the after comparisons for whatever inequities appear.
122 Chapter 7. Both-Sides Models: Testing
The multivariate analysis of variance model, with the before and after measure-
ments as the two Y variables, we use here is
Y =
_
Y
b
Y
a
_
= x +R
=

1 1 1
1 1 1
1 2 0

1
10

b

a

b

a

b

a

+R, (7.56)
where
Cov(Y) = I
30

_

bb

ba

ab

aa
_
. (7.57)
The treatment vectors in x are the contrasts Drugs versus Control and Drug A versus
Drug D. The design restriction that the before means are the same for the three groups
is represented by
_

b

b
_
=
_
0
0
_
. (7.58)
Interest centers on
a
and
a
. (We do not care about the s.)
The estimates and standard errors using the usual calculations for the before and
after s and s are given in the next table:
Before After
Estimate se Estimate se
Drug vs. Control 1.083 0.605 2.200 0.784
Drug A vs. Drug D 0.350 1.048 0.400 1.357
(7.59)
Looking at the after parameters, we see a signicant difference in the rst contrast,
showing that the drugs appear effective. The second contrast is not signicant, hence
we cannot say there is any reason to suspect difference between the drugs. Note,
though, that the rst contrast for the before measurements is somewhat close to sig-
nicance, that is, by chance the control group received on average less healthy people.
Thus we wonder whether the signicance of this contrast for the after measurements
is at least partly due to this fortuitous randomization.
To take into account the before measurements, we condition on them, that is, con-
sider the conditional distribution of the after measurements given the before mea-
surements (see equation 3.56 with X = Y
b
):
Y
a
[ Y
b
= y
b
N
301
( +y
b
, I
30

aab
), (7.60)
where =
ab
/
bb
, and recalling (7.58),
+y
b
= E[Y
a
] E[Y
b
] +y
b

= x

b
0
0

+y
b

= x

+y
b

= (x, y
b
)

, (7.61)
7.5. Covariates 123
where

=
a

b
. Thus conditionally we have another linear model, this one
with one Y vector and four X variables, instead of two Ys and three Xs. But the key
point is that the parameters of interest,
a
and
a
, are estimable in this model. Note
that
a
and
b
are not so estimable, but then we do not mind.
The estimates of the parameters from this conditional model have distribution

[ Y
b
= y
b
N
41

, C
(x,y
b
)

aab

, (7.62)
where
C
(x,y
b
)
= ((x, y
b
)
/
(x, y
b
))
1
(7.63)
as in (6.5).
The next tables have the comparisons of the original estimates and the covariate-
adjusted estimates for
a
and
a
:
Original
Estimate se t

a
2.20 0.784 2.81

a
0.40 1.357 0.29
Covariate-adjusted
Estimate se t

a
1.13 0.547 2.07

a
0.05 0.898 0.06
(7.64)
The covariates helped the precision of the estimates, lowering the standard errors by
about 30%. Also, the rst effect is signicant but somewhat less so than without the
covariate. This is due to the control group having somewhat higher before means
than the drug groups.
Whether the original or covariate-adjusted estimates are more precise depends on
a couple of terms. The covariance matrices for the two cases are
C
x

_

bb

ba

ab

aa
_
and C
(x,y
b
)

aab
. (7.65)
Because

aab
=
aa


2
ab

bb
=
aa
(1
2
), (7.66)
where is the correlation between the before and after measurements, the covariate-
adjusted estimates are relatively better the higher is. From the original model we
estimate = 0.79, which is quite high. The estimates from the two models are

aa
= 36.86 and
aab
= 16.05. On the other hand,
[C
x
]
2:3,2:3
< [C
(x,y
b
)
]
2:3,2:3
, (7.67)
where the subscripts are to indicate we are taking the second and third row/columns
from the matrices. The inequality holds unless x
/
y
b
= 0, and favors the original
estimates. Here the two matrices (7.67) are
_
0.0167 0
0 0.050
_
and
_
0.0186 0.0006
0.0006 0.0502
_
, (7.68)
respectively, which are not very different. Thus in this case, the covariate-adjusted
estimates are better because the gain from the terms is much larger than the loss
124 Chapter 7. Both-Sides Models: Testing
from the C
x
terms. Note that the covariate-adjusted estimate of the variances has lost
a degree of freedom.
The parameters in model (7.56) with the constraint (7.58) can be classied into
three types:
Those of interest:
a
and
a
;
Those assumed zero:
b
and
b
;
Those of no particular interest nor with any constraints:
b
and
a
.
The model can easily be generalized to the following:
Y =
_
Y
b
Y
a
_
= x = x
_

b

a

b

a
_
+R, where
b
= 0, (7.69)
Cov(Y) = I
n

_

bb

ba

ab

aa
_
, (7.70)
and
x is n p, Y
b
is n q
b
, Y
a
is n q
a
,
b
is p
1
q
b
,
a
is p
1
q
a
,

b
is p
2
q
b
,
a
is p
2
q
a
,
bb
is q
b
q
b
, and
aa
is q
a
q
a
. (7.71)
Here, Y
b
contains the q
b
covariate variables, the
a
contains the parameters of interest,

b
is assumed zero, and
b
and
a
are not of interest. For example, the covariates
might consist of a battery of measurements before treatment.
Conditioning on the covariates again, we have that
Y
a
[ Y
b
= y
b
N
nq
a
( +y
b
, I
n

aab
), (7.72)
where
aab
=
aa

ab

1
bb

ba
, =
1
bb

ba
, and
= E[Y
a
] E[Y
b
]

= x
_

a

a
_
x
_

b
0
_
= x
_

a

a
_
. (7.73)
Thus we can write the conditional model (conditional on Y
b
= y
b
),
Y
a
= x
_

a

a
_
+y
b
+R

= (x y
b
)

+R

= x

+R

, (7.74)
where

=
a

b
and R

N
nq
a
(0, I
n

aab
). In this model, the parameter of
interest,
a
, is still estimable, but the
a
and
b
are not, being combined in the

. The
model has turned from one with q = q
a
+ q
b
dependent vectors and p explanatory
vectors into one with q
a
dependent vectors and p + q
b
explanatory vectors. The esti-
mates and hypothesis tests on

then proceed as for general multivariate regression


models.
7.6. Mallows C
p
125
7.5.1 Pseudo-covariates
The covariates considered above were real in the sense that they were collected
purposely in such a way that their distribution was independent of the ANOVA
groupings. At times, one nds variables that act like covariates after transforming
the Ys. Continue the mouth size example from Section 7.3.1, with cubic model over
time as in (7.40). We will assume that the cubic and quadratic terms are zero, as
testing the hypothesis (7.44) suggests. Then is of the form (
a

b
) = (
a
0), where

a
=
_

0

1

0

1
_
, and
b
=
_

2

3

2

3
_
, (7.75)
as in (7.69) but with no
a
or
b
. (Plus the columns are in opposite order.) But the z
/
is in the way. Note though that z is square, and invertible, so that we can use the 1-1
function of Y, Y(z
/
)
1
:
Y
(z)
Y(z
/
)
1
= x(
a
0) +R
(z)
, (7.76)
where
R
(z)
N
nq
(0, I
n

z
),
z
= z
1

R
(z
/
)
1
. (7.77)
(This
z
is the same as the one in (6.5) because z itself is invertible.)
Now we are back in the covariate case (7.69), so we have Y
(z)
= (Y
(z)
a
Y
(z)
b
), and
have the conditional linear model
[Y
(z)
a
[ Y
(z)
b
= y
(z)
b
] = (x y
(z)
b
)
_

a

_
+R
(z)
a
, (7.78)
where =
1
z,bb

z,ba
, and R
(z)
a
N
nq
a
(0, I
n

z,aab
), and estimation proceeds as
in the general case (7.69). See Section 9.5 for the calculations.
When z is not square, so that z is q k for k < q, pseudo-covariates can be created
by lling out z, that is, nd a q (q k) matrix z
2
so that z
/
z
2
= 0 and (z z
2
) is
invertible. (Such z
2
can always be found. See Exercise 5.6.18.) Then the model is
Y = xz
/
+R
= x
_
0
_
_
z
/
z
/
2
_
+R, (7.79)
and we can proceed as in (7.76) with
Y
(z)
= Y
_
z
/
z
/
2
_
1
. (7.80)
7.6 Model selection: Mallows C
p
Hypothesis testing is a good method for deciding between two nested models, but
often in linear models, and in any statistical analysis, there are many models up for
consideration. For example, in a linear model with a p l matrix of parameters,
there are 2
pl
models attainable by setting subsets of the
ij
s to 0. There is also an
126 Chapter 7. Both-Sides Models: Testing
innite number of submodels obtained by setting linear (and nonlinear) restrictions
among the parameters. One approach to choosing among many models is to min-
imize some criterion that measures the efcacy of a model. In linear models, some
function of the residual sum of squares is an obvious choice. The drawback is that,
typically, the more parameters in the model, the lower the residual sum of squares,
hence the best model always ends up being the one with all the parameters, i.e., the
entire in the multivariate linear model. Thus one often assesses a penalty depend-
ing on the number of parameters in the model, the larger the number of parameters,
the higher the penalty, so that there is some balance between the residual sum of
squares and number of parameters.
In this section we present Mallows C
p
criterion, [Mallows, 1973]. Section 9.4 ex-
hibits the Bayes information criterion (BIC) and Akaike information criteria (AIC),
which are based on the likelihood. Mallows C
p
and the AIC are motivated by pre-
diction. We develop this idea for the both-sides model (4.28), where
Y = xz
/
+R, where R N
nq
(0, I
n

R
). (7.81)
(We actually do not use the normality assumption on R in what follows.) This ap-
proach is found in Hasegawa [1986] and Gupta and Kabe [2000].
The observed Y is dependent on the value of other variables represented by x and
z. The objective is to use the observed data to predict a new variable Y
New
based
on its x
New
and z
New
. For example, an insurance company may have data on Y, the
payouts the company has made to a number of people, and (x, z), the basic data (age,
sex, overall health, etc.) on these same people. But the company is really wondering
whether to insure new people, whose (x
New
, z
New
) they know, but Y
New
they must
predict. The prediction,

Y
New
, is a function of (x
New
, z
New
) and the observed Y. A
good predictor has

Y
New
close to Y
New
.
The model (7.81), with being p l, is the largest model under consideration, and
will be called the big model. The submodels we will look at will be
Y = x

z
/
+R, (7.82)
where x

is n p

, consisting of p

of the p columns of x, and similarly z

is q l

,
consisting of l

of the l columns of z (and

is p

). There are about 2


p+l
such
submodels (7.82).
We act as if we wish to predict observations Y
New
that have the same explanatory
variables x and z as the data, so that
Y
New
= xz
/
+R
New
, (7.83)
where R
New
has the same distribution as R, and Y
New
and Y are independent. It is
perfectly reasonable to want to predict Y
New
s for different x and z than in the data,
but the analysis is a little easier if they are the same, plus it is a good starting point.
For a given submodel (7.82), we predict Y
New
by

= x

z
/
, (7.84)
where

is the usual estimate based on the smaller model (7.82) and the observed Y,

= (x
/
x

)
1
x
/
Yz

(z
/
z

)
1
. (7.85)
7.6. Mallows C
p
127
It is convenient to use projection matrices as in (6.8), so that

= P
x
YP
z
. (7.86)
To assess how well a model predicts, we use the sum of squares between Y
New
and

Y

:
PredSS

= |Y
New

|
2
=
n

i=1
q

j=1
(y
New
ij
y

ij
)
2
= trace((Y
New

)
/
(Y
New

)). (7.87)
Of course, we cannot calculate PredSS

because the Y
New
is not observed (if it were,
we wouldnt need to predict it), so instead we look at the expected value:
EPredSS

= E[trace((Y
New

)
/
(Y
New

))]. (7.88)
The expected value is taken assuming the big model, (7.81) and (7.83). We cannot
observe EPredSS

either, because it is a function of the unknown parameters and

R
, but we can estimate it. So the program is to
1. Estimate EPredSS

for each submodel (7.82);


2. Find the submodel(s) with the smallest estimated EPredSS

s.
Whether prediction is the ultimate goal or not, the above is a popular way to choose
a submodel.
We will discuss Mallows C
p
as a method to estimate EPredSS

. Cross-validation
is another popular method that will come up in classication, Chapter 11. The temp-
tation is to use the observed Y in place of Y
New
to estimate EPredSS

, that is, to use


the observed residual sum of squares
ResidSS

= trace((Y

)
/
(Y

)). (7.89)
This estimate is likely to be too optimistic, because the prediction

Y

is based on the
observed Y. The ResidSS

is estimating its expected value,


EResidSS

= E[trace((Y

)
/
(Y

))]. (7.90)
Is ResidSS

a good estimate of PredSS

, or EPredSS

? We calculate and compare


EPredSS

and EResidSS

.
First note that for any n q matrix U, we have (Exercise 7.7.8)
E[trace(U
/
U)] = trace(Cov[row(U)]) + trace(E[U]
/
E[U]). (7.91)
We will use
for EPredSS

: U = Y
New

= Y
New
P
x
YP
z
;
and for EResidSS

: U = Y

= Y P
x

YP
z

. (7.92)
128 Chapter 7. Both-Sides Models: Testing
For the mean, note that by (7.81) and (7.83), Y and Y
New
both have mean xz
/
, hence,
E[Y
New
P
x

YP
z

] = E[YP
x

YP
z

] = xz
/
P
x

xz
/
P
z

. (7.93)
The covariance term for EPredSS

is easy because Y
New
and Y are independent.
Using (6.15),
Cov[Y
New
P
x

YP
z

] = Cov[Y
New
] + Cov[P
x

YP
z

]
= (I
n

R
) + (P
x

P
z

R
P
z

). (7.94)
See (6.14). The trace is then
trace(Cov[Y
New
P
x

YP
z

]) = n trace(
R
) + p

trace(
R
P
z

). (7.95)
The covariance for the residuals uses (6.16):
Cov[YP
x

YP
z

] = Q
x


R
+P
x

Q
z

R
Q
z

. (7.96)
Thus
trace(Cov[YP
x

YP
z

]) = (n p

) trace(
R
) + p

trace(Q
z

R
Q
z

)
= n trace(
R
) + p

trace(
R
P
z

). (7.97)
Applying (7.91) with (7.93), (7.95), and (7.97), we have the following result.
Lemma 7.2. For given in (7.93),
EPredSS

= trace(
/
) + n trace(
R
) + p

trace(
R
P
z

);
EResidSS

= trace(
/
) + n trace(
R
) p

trace(
R
P
z

). (7.98)
Note that both quantities can be decomposed into a bias part,
/
, and a covari-
ance part. They have the same bias, but the residuals underestimate the prediction
error by having a p

in place of the +p

:
EPredSS

EResidSS

= 2p

trace(
R
P
z

). (7.99)
So to use the residuals to estimate the prediction error unbiasedly, we need to add
an unbiased estimate of the term in (7.99). That is easy, because we have an unbiased
estimator of
R
.
Proposition 7.2. An unbiased estimator of EPredSS

is Mallows C
p
statistic,
C
p
(x

, z

) =

EPredSS

= ResidSS

+ 2p

trace(

R
P
z

), (7.100)
where

R
=
1
n p
Y
/
Q
x
Y. (7.101)
Some comments:
The ResidSS

is calculated from the submodel, while the


R
is calculated from
the big model.
7.6. Mallows C
p
129
The estimate of prediction error takes the residual error, and adds a penalty
depending (partly) on the number of parameters in the submodel. So the larger
the submodel, generally, the smaller the residuals and the larger the penalty. A
good model balances the two.
In univariate regression, l = 1, so there is no P
z
(it is 1), and
R
=
2
R
, so that
Mallows C
p
is
C
p
(x

) = ResidSS

+ 2p

2
R
. (7.102)
7.6.1 Example: Mouth sizes
Take the big model to be (6.28) again, where the x matrix distinguishes between girls
and boys, and the z matrix has the orthogonal polynomials for age. We will consider
2 4 = 8 submodels, depending on whether the GirlsBoys term is in the x, and
which degree polynomial (0, 1, 2, 3) we use of the z. In Section 6.4.1, we have the
R representations of x, z, and the estimate

R
. To nd the Mallows C
p
s (7.100),
we also need to nd the residuals and projection matrix P
z
for each model under
consideration. Let ii contain the indices of the columns of x in the model, and jj
contain the indices for the columns of z in the model. Then to nd ResidSS

, the
penalty term, and the C
p
statistics, we can use the following:
y < mouths[,1:4]
xstar < x[,ii]
zstar < z[,jj]
pzstar < zstar%%solve(t(zstar)%%zstar,t(zstar))
yhat < xstar%%solve(t(xstar)%%xstar,t(xstar))%%y%%pzstar
residss < sum((yyhat)^2)
pstar < length(ii)
penalty < 2pstartr(sigmaRhat%%pzstar)
cp < residss + penalty
So, for example, the full model takes ii < 1:2 and jj < 1:4, while the model with
no difference between boys and girls, and a quadratic for the growth curve, would
take ii < 1 and jj < 1:3.
Here are the results for the eight models of interest:
p

ResidSS

Penalty C
p
1 1 917.7 30.2 947.9
1 2 682.3 35.0 717.3
1 3 680.9 37.0 717.9
1 4 680.5 42.1 722.6
2 1 777.2 60.5 837.7
2 2 529.8 69.9 599.7
2 3 527.1 74.1 601.2
2 4 526.0 84.2 610.2
(7.103)
Note that in general, the larger the model, the smaller the ResidSS

but the larger


the penalty. The C
p
statistics aims to balance the t and complexity. The model
with the lowest C
p
is the (2, 2) model, which ts separate linear growth curves to the
boys and girls. We arrived at this model in Section 7.3.1 as well. The (2, 3) model
is essentially as good, but is a little more complicated. Generally, one looks for the
130 Chapter 7. Both-Sides Models: Testing
model with the smallest C
p
, but if several models have approximately the same low
C
p
, one chooses the simplest.
7.7 Exercises
Exercise 7.7.1. Show that when p

= l

= 1, the T
2
in (7.17) equals the square of the
t-statistic in (6.27), assuming
ij
= 0 there.
Exercise 7.7.2. Verify the equalities in (7.28).
Exercise 7.7.3. If A Gamma(, ) and B Gamma(, ), and A and B are indepen-
dent, then U = A/(A+ B) is distributed Beta(, ). Show that when l

= 1, Wilks
(Denition 7.2) is Beta(, ), and give the parameters in terms of p

and . [Hint: See


Exercise 3.7.8 for the Gamma distribution, whose pdf is given in (3.81). Also, the Beta
pdf is found in Exercise 2.7.13, in equation (2.95), though you do not need it here.]
Exercise 7.7.4. Suppose V F
,
as in Denition 7.1. Show that U = /(F + ) is
Beta(, ) from Exercise 7.7.3, and give the parameters in terms of and .
Exercise 7.7.5. Show that E[1/
2

] = 1/( 2) if 2, as used in (7.30). The pdf of


the chi-square is given in (3.80).
Exercise 7.7.6. Find the matrices C and D so that the hypothesis in (7.52) is the same
as that in (7.12), where is 5 4.
Exercise 7.7.7. Consider the model in (7.69) and (7.70). Let W = y
/
Q
x
y. Show that
W
aab
= y
/
a
Q
x
y
a
. Thus one has the same estimate of
aab
in the original model and
in the covariate-adjusted model. [Hint: Write out W
aab
the usual way, where the
blocks in W are Y
/
a
Q
x
Y
a
, etc. Note that the answer is a function of Q
x
y
a
and Q
x
y
b
.
Then use (5.89) with D
1
= x and D
2
= y
b
.]
Exercise 7.7.8. Prove (7.91). [Hint: Use (7.18) and Exercise 2.7.11 on row(U).]
Exercise 7.7.9. Show that in (7.93) is zero in the big model, i.e, x

= x and z

= z.
Exercise 7.7.10. Verify the second equality in (7.97). [Hint: Note that trace(QQ) =
trace(Q) if Q is idempotent. Why?]
Exercise 7.7.11 (Mouth sizes). In the mouth size data in Section 7.1.1, there are n
G
=
11 girls and n
B
= 16 boys, and q = 4 measurements on each. Thus Y is 27 4.
Assume that
Y N(x, I
n

R
), (7.104)
where this time
x =
_
1
11
0
11
0
16
1
16
_
, (7.105)
and
=
_

G

B
_
=
_

G1

G2

G3

G4

B1

B2

B3

B4
_
. (7.106)
The sample means for the two groups are
G
and
B
. Consider testing
H
0
:
G
=
B
. (7.107)
7.7. Exercises 131
(a) What is the constant c so that
B =

G

B
c
N(0,
R
)? (7.108)
(b) The unbiased estimate of
R
is

R
. Then
W =

R
Wishart(,
R
). (7.109)
What is ? (c) What is the value of Hotellings T
2
? (d) Find the constant d and the
degrees of freedom a, b so that F = d T
2
F
a,b
. (e) What is the value of F? What is
the resulting p-value? (f) What do you conclude? (g) Compare these results to those
in Section 7.1.1, equations (7.10) and below.
Exercise 7.7.12 (Skulls). This question continues Exercise 6.5.5 on Egyptian skulls. (a)
Consider testing that there is no difference among the ve time periods for all four
measurements at once. What are p

, l

, and T
2
in (7.22) for this hypothesis? What
is the F and its degrees of freedom? What is the p-value? What do you conclude?
(b) Now consider testing whether there is a non-linear effect on skull size over time,
that is, test whether the last three rows of the matrix are all zero. What are l

, , the
F-statistic obtained from T
2
, the degrees of freedom, and the p-value? What do you
conclude? (e) Finally, consider testing whether there is a linear effect on skull size
over time. Find the F-statistic obtained from T
2
. What do you conclude?
Exercise 7.7.13 (Prostaglandin). This question continues Exercise 6.5.4 on prostaglan-
din levels over time. The model is Y = 1
10
z
/
+R, where the i
th
row of z is
(1, cos(
i
), sin(
i
), cos(2
i
), sin(2
i
), cos(3
i
)) (7.110)
for
i
= 2i/6, i = 1, . . . , 6. (a) In Exercise 4.4.8, the one-cycle wave is given by the
equation A + B cos( + C). The null hypothesis that the model does not include that
wave is expressed by setting B = 0. What does this hypothesis translate to in terms
of the
ij
s? (b) Test whether the one-cycle wave is in the model. What is p

? (c)
Test whether the two-cycle wave is in the model. What is p

? (d) Test whether the


three-cycle wave is in the model. What is p

? (e) Test whether just the one-cycle wave


needs to be in the model. (I.e., test whether the two- and three-cycle waves have zero
coefcients.) (f) Using the results from parts (b) through (e), choose the best model
among the models with (1) No waves; (2) Just the one-cycle wave; (3) Just the one-
and two-cycle waves; (4) The one-, two-, and three-cycle waves. (g) Use Mallows C
p
to choose among the four models listed in part (f).
Exercise 7.7.14 (Histamine in dogs). Consider the model for the histamine in dogs
example in (4.15), i.e.,
Y = x +R =

1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1

1
4

b

1

2

3

b

1

2

3

b

1

2

3

b

1

2

3

+R.
(7.111)
132 Chapter 7. Both-Sides Models: Testing
For the following two null hypotheses, specify which parameters are set to zero, then
nd p

, l

, , the T
2
and its F version, the degrees of freedom for the F, the p-value,
and whether you accept or reject. Interpret the nding in terms of the groups and
variables. (a) The four groups have equal means (for all four time points). Compare
the results to that for the hypothesis in (7.46). (b) The four groups have equal before
means. (c) Now consider testing the null hypothesis that the after means are equal,
but using the before measurements as a covariate. (So we assume that
b
=
b
=

b
= 0.) What are the dimensions of the resulting Y
a
and the x matrix augmented
with the covariate? What are p

, l

, , and the degrees of freedom in the F for testing


the null hypothesis. (d) The x
/
x from the original model (not using the covariates)
is 16 I
4
, so that the [(x
/
x)
1
]

= (1/16)I
3
. Compare the diagonals (i.e., 1/16)
to the diagonals of the analogous matrix in the model using the covariate. How
much smaller or larger, percentagewise, are the covariate-based diagonals than the
original? (e) The diagonals of the

z
in the original model are 0.4280, 0.1511, and
0.0479. Compare these diagonals to the diagonals of the analogous matrix in the
model using the covariate. How much smaller or larger, percentagewise, are the
covariate-based diagonals than the original? (f) Find the T
2
, the F statistic, and the
p-value for testing the hypothesis using the covariate. What do you conclude? How
does this result compare to that without the covariates?
Exercise 7.7.15 (Histamine, cont.). Continue the previous question, using as a starting
point the model with the before measurements as the covariate, so that
Y

= x

1

2

3

z
/
+R

, (7.112)
where Y

has just the after measurements, x

is the x in (7.111) augmented with


the before measurements, and z represents orthogonal polynomials for the after time
points,
z =

1 1 1
1 0 2
1 1 1

. (7.113)
Now consider the equivalent model resulting from multiplying both sides of the
equation on the right by (z
/
)
1
. (a) Find the estimates and standard errors for
the quadratic terms, (

3
,

3
,

2
,

3
). Test the null hypothesis that (

3
,

3
,

3
,

3
) =
(0, 0, 0, 0). What is ? What is the p-value? Do you reject this null? (The answer
should be no.) (b) Now starting with the model from part (a), use the vector of
quadratic terms as the covariate. Find the estimates and standard errors of the rele-
vant parameters, i.e.,

. (7.114)
(c) Use Hotellings T
2
to test the interaction terms are zero, i.e., that (

1
,

2
) = (0, 0).
(What are l

and ?) Also, do the t-tests for the individual parameters. What do you
conclude?
7.7. Exercises 133
Exercise 7.7.16 (Caffeine). This question uses the data on he effects of caffeine on
memory described in Exercise 4.4.4. The model is as in (4.35), with x as described
there, and
z =
_
1 1
1 1
_
. (7.115)
The goal of this problem is to use Mallows C
p
to nd a good model, choosing among
the constant, linear and quadratic models for x, and the overall mean" and overall
mean + difference models" for the scores. Thus there are six models. (a) For each
of the 6 models, nd the p

, l

, residual sum of squares, penalty, and C


p
values. (b)
Which model is best in terms of C
p
? (c) Find the estimate of

for the best model.


(d) What do you conclude?
Chapter 8
Some Technical Results
This chapter contains a number of results useful for linear models and other models,
including the densities of the multivariate normal and Wishart. We collect them here
so as not to interrupt the ow of the narrative.
8.1 The Cauchy-Schwarz inequality
Lemma 8.1. Cauchy-Schwarz inequality. Suppose y and d are 1 K vectors. Then
(yd
/
)
2
|y|
2
|d|
2
, (8.1)
with equality if and only if d is zero, or
y = d (8.2)
for some constant .
Proof. If d is zero, the result is immediate. Suppose d ,= 0, and let y be the projection
of y onto spand. (See Denitions 5.2 and 5.7.) Then by least-squares, Theorem 5.2
(with D = d
/
), y = d, where here = yd
/
/|d|
2
. The sum-of-squares decomposition
in (5.11) implies that
|y|
2
|y|
2
= |y y|
2
0, (8.3)
which yields
|y|
2
|y|
2
=
(yd
/
)
2
|d|
2
, (8.4)
from which (8.1) follows. Equality in (8.1) holds if and only if y = y, which holds if
and only if (8.2).
If U and V are random variables, with E[[UV[] < , then the Cauchy-Schwarz
inequality becomes
E[UV]
2
E[U
2
]E[V
2
], (8.5)
with equality if and only if V is zero with probability one, or
U = bV (8.6)
135
136 Chapter 8. Technical Results
for constant b = E[UV]/E[V
2
]. See Exercise 8.8.2. The next result is well-known in
statistics.
Corollary 8.1 (Correlation inequality). Suppose Y and X are random variables with nite
positive variances. Then
1 Corr[Y, X] 1, (8.7)
with equality if and only if, for some constants a and b,
Y = a + bX. (8.8)
Proof. Apply (8.5) with U = Y E[Y] and V = X E[X] to obtain
Cov[Y, X]
2
Var[Y]Var[X], (8.9)
from which (8.7) follows. Then (8.8) follows from (8.6), with b = Cov[Y, X]/Var[X]
and a = E[Y] bE[X], so that (8.8) is the least squares t of X to Y.
This inequality for the sample correlation coefcient of n 1 vectors x and y fol-
lows either by using Lemma 8.1 on H
n
y and H
n
x, where H
n
is the centering matrix
(1.12), or by using Corollary 8.1 with X and Y having the empirical distributions
given by x and y, respectively, i.e.,
P[X = x] =
1
n
#x
i
= x and P[Y = y] =
1
n
#y
i
= y. (8.10)
The next result also follows from Cauchy-Schwarz. It will be useful for Hotellings
T
2
in Section 8.4.1, and for canonical correlations in Section 13.3.
Corollary 8.2. Suppose y and d are 1 K vectors, and |y| = 1. Then
(yd
/
)
2
|d|
2
, (8.11)
with equality if and only if d is zero, or d is nonzero and
y =
d
|d|
. (8.12)
8.2 Conditioning in a Wishart
We start with W Wishart
p+q
(, ) as in Denition 3.6, where is partitioned
=
_

XX

XY

YX

YY
_
, (8.13)

XX
is p p,
YY
is q q, and W is partitioned similarly. We are mainly interested
in the distribution of
W
YYX
= W
YY
W
YX
W
1
XX
W
XY
(8.14)
(see Equation 3.49), but some additional results will easily come along for the ride.
8.3. Expectation of inverse Wishart 137
Proposition 8.1. Consider the situation above, where
XX
is invertible and p. Then
(W
XX
, W
XY
) is independent of W
YYX
, (8.15)
W
YYX
Wishart
q
( p,
YYX
), (8.16)
and
W
XY
[ W
XX
= w
xx
N(w
xx

1
XX

XY
, w
xx

YYX
), (8.17)
W
XX
Wishart
p
(,
XX
). (8.18)
Proof. The nal equation is just the marginal of a Wishart, as in Section 3.6. By
Denition 3.6 of the Wishart,
W =
T
(X Y)
/
(X Y), (X Y) N(0, I
n
), (8.19)
where X is n p and Y is n q. Conditioning as in (3.56), we have
Y [ X = x N
nq
(x, I
n

YYX
), =
1
XX

XY
. (8.20)
(The = 0 because the means of X and Y are zero.) Note that (8.20) is the both-
sides model (6.1), with z = I
q
and
R
=
YYX
. Thus by Theorem 6.1 and the
plug-in property (2.62) of conditional distributions,

= (X
/
X)
1
X
/
Y and Y
/
Q
X
Y are
conditionally independent given X = x,

[ X = x N(, (x
/
x)
1

YYX
), (8.21)
and
Y
/
Q
X
Y [ X = x Wishart
q
(n p,
YYX
). (8.22)
The conditional distribution in (8.22) does not depend on x, hence Y
/
Q
X
Y is (uncon-
ditionally) independent of the pair (X,

), as in (2.65) and therebelow, hence
Y
/
Q
X
Y is independent of (X
/
X, X
/
Y). (8.23)
Property (2.66) implies that

[ X
/
X = x
/
x N(, (x
/
x)
1

YYX
), (8.24)
hence
X
/
Y = (X
/
X)

[ X
/
X = x
/
x N(x
/
x), (x
/
x)
YYX
). (8.25)
Translating to W using (8.19), noting that Y
/
Q
X
Y = W
YYX
, we have that (8.23) is
(8.15), (8.22) is (8.16), and (8.25) is (8.17).
8.3 Expectation of inverse Wishart
We rst prove Lemma 7.1 for
U Wishart
l
(, I
l
). (8.26)
138 Chapter 8. Technical Results
For any q q orthogonal matrix , U
/
has the same distribution as U, hence in
particular
E[U
1
] = E[(U
/
)
1
] = E[U
1
]
/
. (8.27)
Exercise 8.8.6 shows that any symmetric q q matrix A for which A = A
/
for all
orthogonal must be of the form a
11
I
q
. Thus
E[U
1
] = E[(U
1
)
11
] I
l
. (8.28)
Using (5.85),
E[(U
1
)
11
] = E
_
1
U
112:q
_
= E
_
1

2
l

+1
_
=
1
l

1
. (8.29)
Equations (8.28) and (8.29) show that
E[U
1
] =
1
l

1
I
l
. (8.30)
Next take W Wishart
q
(, ), with invertible. Then, W =
T

1/2
U
1/2
, and
E[W
1
] = E[(
1/2
U
1/2
)
1
]
=
1/2
1
l

1
I
l

1/2
=
1
l

1

1
, (8.31)
verifying (7.32).
8.4 Distribution of Hotellings T
2
Here we prove Proposition 7.1. Exercise 8.8.5 shows that we can assume = I
l
in
the proof, which we do. Divide and multiply the ZW
1
Z
/
by |Z|
2
:
ZW
1
W
/
=
ZW
1
Z
/
|Z|
2
|Z|
2
. (8.32)
Because Z is a vector of l

independent standard normals,


|Z|
2

2
l

. (8.33)
Consider the distribution of the ratio conditional on Z = z:
ZW
1
Z
/
|Z|
2
[ Z = z. (8.34)
Because Z and W are independent, we can use the plugin formula (2.63), so that
_
ZW
1
Z
/
|Z|
2
[ Z = z
_
=
T
zW
1
z
/
|z|
2
= g
1
W
1
g
/
1
, (8.35)
8.4. Distribution of Hotellings T
2
139
where g
1
= z/|z|. Note that on the right-hand side we have the unconditional
distribution for W. Let G be any l

orthogonal matrix with g


1
as its rst row.
(Exercise 5.6.18 guarantees there is one.) Then
g
1
W
1
g
/
1
= e
1
GW
1
G
/
e
/
1
, e
1
= (1, 0, . . . , 0). (8.36)
Because the covariance parameter in the Wishart distribution for W is I
l
, U
GWG
/
Wishart
l
(, I
l
). But
e
1
GW
1
G
/
e
/
1
= e
1
U
1
e
/
1
= [U
1
]
11
= U
1
112:l

(8.37)
by (5.85).
Note that the distribution of U, hence [U
1
]
11
, does not depend on z, which means
that
ZW
1
Z
/
|Z|
2
is independent of Z. (8.38)
Furthermore, by (8.16), where p = l

1 and q = 1,
U
112
Wishart
1
( l

+1, 1)
2
l

+1
. (8.39)
Now (8.32) can be expressed as
ZW
1
Z
/
=
|Z|
2
U
112:l

=
T

2
l

2
l

+1
, (8.40)
where the two
2
s are independent. Then (7.29) follows from Denition 7.1 for the
F.
8.4.1 A motivation for Hotellings T
2
Hotellings T
2
test can be motivated using the projection pursuit idea. Let a ,= 0 be
an 1 l

vector of constants, and look at


Za
/
N(0, aa
/
) and aWa
/
Wishart
1
(, aa
/
) = (aa
/
)
2

. (8.41)
Now we are basically in the univariate t (6.25) case, i.e.,
T
a
=
Za
/

aWa
/
/
t

, (8.42)
or, since t
2

= F
1,
,
T
2
a
=
(Za
/
)
2
aWa
/
/
F
1,
. (8.43)
For any a, we can do a regular F test. The projection pursuit approach is to nd the
a that gives the most signicant result. That is, we wish to nd
T
2
= max
a,=0
T
2
a
. (8.44)
To nd the best a, rst simplify the denominator by setting
b = aW
1/2
, so that a = bW
1/2
. (8.45)
140 Chapter 8. Technical Results
Then
T
2
= max
b,=0

(Vb
/
)
2
bb
/
, where V = ZW
1/2
. (8.46)
Letting g = b/|b|, so that |g| = 1, Corollary 8.2 of Cauchy-Schwarz shows that (see
Exercise 8.8.9)
T
2
= max
g [ |g|=1
(Vg
/
)
2
= VV
/
= ZW
1
Z
/
, (8.47)
which is indeed Hotellings T
2
of (7.28), a multivariate generalization of Students
t
2
. Even though T
2
a
has an F
1,
distribution, the T
2
does not have that distribution,
because it maximizes over many F
1,
s.
8.5 Density of the multivariate normal
Except for when using likelihood methods in Chapter 9, we do not need the density
of the multivariate normal, nor of the Wishart, for our main purposes, but present
them here because of their intrinsic interest. We start with the multivariate normal,
with positive denite covariance matrix.
Lemma 8.2. Suppose Z N
1N
(, ), where is positive denite. Then the pdf of Z is
f (z [ , ) =
1
(2)
N/2
1
[[
1/2
e

1
2
(z)
1
(z)
/
. (8.48)
Proof. Recall that a multivariate normal vector is an afne transform of a vector of
independent standard normals,
Y = ZA
/
+ , Z N
1N
(0, I
N
), AA
/
= . (8.49)
We will take A to be N N, so that being positive denite implies that A is
invertible. Then
Z = (Y )(A
/
)
1
, (8.50)
and the Jacobian is
[
z
y
[

z
1
/y
1
z
1
/y
2
z
1
/y
N
z
2
/y
1
z
2
/y
2
z
2
/y
N
.
.
.
.
.
.
.
.
.
.
.
.
z
N
/y
1
z
N
/y
2
z
N
/y
N

= [(A
/
)
1
[. (8.51)
The density of Z is
f (z [ 0, I
N
) =
1
(2)
N/2
e

1
2
z
2
1
+z
2
N
=
1
(2)
N/2
e

1
2
zz
/
, (8.52)
so that
f (y[ , ) =
1
(2)
N/2
abs[(A
/
)
1
[ e

1
2
(y)(A
/
)
1
A
1
(y)
/
=
1
(2)
N/2
[AA
/
[
1/2
e

1
2
(y)(AA
/
)
1
(y)
/
, (8.53)
from which (8.48) follows.
8.6. The QR decomposition for the multivariate normal 141
When Z can be written as a matrix with a Kronecker product for its covariance
matrix, as is often the case for us, the pdf can be compactied.
Corollary 8.3. Suppose Y N
nq
(M, C), where C (n n) and (q q) are positive
denite. Then
f (y [ M, C, ) =
1
(2)
nq/2
1
[C[
q/2
[[
n/2
e

1
2
trace(C
1
(yM)
1
(yM)
/
)
. (8.54)
See Exercise 8.8.15 for the proof.
8.6 The QR decomposition for the multivariate normal
Here we discuss the distributions of the Q and R matrices in the QR decomposition
of a multivariate normal matrix. From the distribution of the upper triangular R
we obtain Bartletts decomposition [Bartlett, 1939], useful for randomly generating
Wisharts, as well as derive the Wishart density in Section 8.7. Also, we see that Q
has a certain uniform distribution, which provides a method for generating random
orthogonal matrices from random normals. The results are found in Olkin and Roy
[1954], and this presentation is close to that of Kshirsagar [1959]. (Old school, indeed!)
We start with the data matrix
Z N
q
(0, I

I
q
), (8.55)
a matrix of independent N(0, 1)s, where q, and consider the QR decomposition
(Theorem 5.3)
Z = QR. (8.56)
We nd the distribution of the R. Let
S Z
/
Z = R
/
R Wishart
q
(, I
q
). (8.57)
Apply Proposition 8.1 with S
XX
being the single element S
11
. Because = I
q
,
(S
11
, S
12:q
) is independent of S
2:q2:q1
,
S
11
Wishart
1
(, I
1
) =
2

,
S
12:q
[ S
11
= s
11
N
1(q1)
(0, s
11
I
q1
),
and S
2:q2:q1
Wishart
q1
( 1, I
q1
). (8.58)
Note that S
12:q
/

S
11
, conditional on S
11
, is N(0, I
q1
), in particular, is independent
of S
11
. Thus the three quantities S
11
, S
12:q
/

S
11
, and S
2:q2:q1
are mutually
independent. Equation (5.67) shows that
R
11
=
_
S
11

_

and
_
R
12
R
1q
_
= S
12:q
/
_
S
11
N
1(q1)
(0, I
q1
). (8.59)
Next, work on the rst component of S
221
of S
2:q2:q1
. We nd that
R
22
=
_
S
221

_

2
1
and
_
R
23
R
2q
_
= S
23:q1
/
_
S
221
N
1(q2)
(0, I
q2
), (8.60)
142 Chapter 8. Technical Results
both independent of each other, and of S
3:q3:q12
. We continue, obtaining the
following result.
Lemma 8.3 (Bartletts decomposition). Suppose S Wishart
q
(, I
q
), where q, and
let its Cholesky decomposition be S = R
/
R. Then the elements of R are mutually independent,
where
R
2
ii

2
i+1
, i = 1, . . . , q, and R
ij
N(0, 1), 1 i < j q. (8.61)
Next, suppose Y N
q
(0, I

), where is invertible. Let A be the matrix


such that
= A
/
A, where A T
+
q
of (5.63), (8.62)
i.e., A
/
A is the Cholesky decomposition of . Thus we can take Y = ZA. Now
Y = QV, where V RA is also in T
+
q
, and Q still has orthonormal columns. By the
uniqueness of the QR decomposition, QV is the QR decomposition for Y. Then
Y
/
Y = V
/
V Wishart(0, ). (8.63)
We call the distribution of V the Half-Wishart
q
(, A).
To generate a random W Wishart
q
(, ) matrix, one can rst generate q(q 1)/2
N(0, 1)s, and q
2
s, all independently, then set the R
ij
s as in (8.61), then calculate
V = RA, and W = V
/
V. If is large, this process is more efcient than generating
the q normals in Z or Y. The next section derives the density of the Half-Wishart,
then that of the Wishart itself.
We end this section by completing description of the joint distribution of (Q, V).
Exercise 3.7.35 handled the case Z N
21
(0, I
2
).
Lemma 8.4. Suppose Y = QV as above. Then
(i) Q and V are independent;
(ii) The distribution of Q in does not depend on ;
(iii) The distribution of Q is invariant under orthogonal transforms: If O
n
, the group of
n n orthogonal matrices (see (5.58)), then
Q =
T
Q. (8.64)
Proof. From above, we see that Z and Y = ZA have the same Q. The distribution of Z
does not depend on , hence neither does the distribution of Q, proving part (ii). For
part (iii), consider Y, which has the same distribution as Y. We have Y = (Q)V.
Since Q also has orthonormal columns, the uniqueness of the QR decomposition
implies that Q is the Q for Y. Thus Q and Q have the same distribution.
Proving the independence result of part (i) takes some extra machinery from math-
ematical statistics. See, e.g., Lehmann and Casella [1998]. Rather than providing all
the details, we outline how one can go about the proof. First, V can be shown to be a
complete sufcient statistic for the model Y N(0, I

). Basus Lemma says that


any statistic whose distribution does not depend on the parameter, in this case , is
independent of the complete sufcient statistic. Thus by part (ii), Q is independent
of V.
8.7. Density of the Wishart 143
If n = q, the Q is an orthogonal matrix, and its distribution has the Haar probabil-
ity measure, or uniform distribution, over O

. It is the only probability distribution


that does have the above invariance property, although proving the fact is beyond
this book. See Halmos [1950]. Thus one can generate a random q q orthogonal
matrix by rst generating an q q matrix of independent N(0, 1)s, then performing
Gram-Schmidt orthogonalization on the columns, normalizing the results so that the
columns have norm 1.
8.7 Density of the Wishart
We derive the density of the Half-Wishart, then the Wishart. We need to be careful
with constants, and nd two Jacobians. Some details are found in Exercises 8.8.16 to
8.8.21.
We start by writing down the density of R Half-Wishart
q
(, I
q
), assuming n q,
as in (8.61), The density of U (> 0), where U
2

2
k
, is
f
k
(u) =
1
(k/2)2
(k/2)1
u
k1
e

1
2
u
2
. (8.65)
Thus that for R is
f
R
(r) =
1
c(, q)
r
1
11
r
2
22
r
nq
qq
e

1
2
trace(r
/
r)
, (8.66)
where
c(, q) =
q(q1)/4
2
(q/2)q
q

j=1

_
j + 1
2
_
. (8.67)
For V Half-Wishart
q
(, ), where is invertible, we set V = RA, where A
/
A is
the Cholesky decomposition of in (8.62). The Jacobian J is given by
1
J
=

v
r

= a
11
a
2
22
a
q
qq
. (8.68)
Thus, since v
jj
= a
jj
r
jj
, the density of V is
f
V
(v [ ) =
1
c(, q)
v
1
11
v
2
22
v
q
qq
a
1
11
a
2
22
a
q
qq
e

1
2
trace((A
/
)
1
v
/
vA
1
)
1
a
11
a
2
22
a
q
qq
=
1
c(, q)
1
[[
/2
v
1
11
v
2
22
v
q
qq
e

1
2
trace(
1
v
/
v)
, (8.69)
since [[ =

a
2
ii
. See Exercise 5.6.31.
Finally, suppose W Wishart
q
(, ), so that we can take W = V
/
V. The Jacobian
is
1
J

w
v

= 2
q
v
q
11
v
q1
22
v
qq
. (8.70)
144 Chapter 8. Technical Results
Thus from (8.69),
f
W
(w[ ) =
1
2
q
1
c(, q)
1
[[
/2
v
1
11
v
2
22
v
q
qq
v
q
11
v
q1
22
v
qq
e

1
2
trace(
1
w)
=
1
d(, q)
1
[[
n/2
[w[
(q1)/2
e

1
2
trace(
1
w)
, (8.71)
where
d(, q) =
q(q1)/4
2
q/2
q

j=1

_
j + 1
2
_
. (8.72)
8.8 Exercises
Exercise 8.8.1. Suppose y is 1 K, D is M K, and D
/
D is invertible. Let y be
the projection of y onto the span of the rows of D
/
, so that y = D
/
, where =
yD(D
/
D)
1
is the least-squares estimate as in (5.17). Show that
|y|
2
= yD(D
/
D)
1
D
/
y
/
. (8.73)
(Notice the projection matrix, from (5.19).) Show that in the case D = d
/
, i.e., M = 1,
(8.73) yields the equality in (8.4).
Exercise 8.8.2. Prove the Cauchy-Schwarz inequality for random variables U and V
given in (8.5) and (8.6), assuming that V is not zero with probability one. [Hint: Use
least squares, by nding b to minimize E[(U bV)
2
].]
Exercise 8.8.3. Prove Corollary 8.2. [Hint: Show that (8.11) follows from (8.1), and
that (8.2) implies that = 1/|d|, using the fact that |y| = 1.]
Exercise 8.8.4. For W in (8.19), verify that X
/
X = W
XX
, X
/
Y = W
XY
, and Y
/
Q
X
Y =
W
YYX
, where Q
X
= I
n
X(X
/
X)
1
X
/
.
Exercise 8.8.5. Suppose Z N
1l
(0, ) and W Wishart
l
(, ) are as in Proposi-
tion 7.1. (a) Show that for any l

invertible matrix A,
ZW
1
Z
/
= (ZA)(A
/
WA)
1
(ZA)
/
. (8.74)
(b) For what A do we have ZA N
1l
(0, I
l
) and A
/
WA Wishart
l
(, I
l
)?
Exercise 8.8.6. Let A be a q q symmetric matrix, and for q q orthogonal matrix ,
contemplate the equality
A = A
/
. (8.75)
(a) Suppose (8.75) holds for all permutation matrices (matrices with one 1 in each
row, and one 1 in each column, and zeroes elsewhere). Show that all the diagonals
of A must be equal (i.e., a
11
= a
22
= = a
qq
), and that all off-diagonals must be
equal (i.e., a
ij
= a
kl
if i ,= j and k ,= l). [Hint: You can use the permutation matrix
that switches the rst two rows and rst two columns,
=

0 1 0
1 0 0
0 0 I
q2

, (8.76)
8.8. Exercises 145
to show that a
11
= a
22
and a
1i
= a
2i
for i = 3, . . . , q. Similar equalities can be obtained
by switching other pairs of rows and columns.] (b) Suppose (8.75) holds for all that
are diagonal, with each diagonal element being either +1 or 1. (They neednt all be
the same sign.) Show that all off-diagonals must be 0. (c) Suppose (8.75) holds for all
orthogonal . Show that A must be of the form a
11
I
q
. [Hint: Use parts (a) and (b).]
Exercise 8.8.7. Verify the three equalities in (8.37).
Exercise 8.8.8. Show that t
2

= F
1,
.
Exercise 8.8.9. Find the g in (8.47) that maximizes (Vg
/
)
2
, and show that the maxi-
mum is indeed VV
/
. (Use Corollary 8.2.) What is a maximizing a in (8.44)?
Exercise 8.8.10. Suppose that z = yB, where z and y are 1 N, and B is N N and
invertible. Show that [z/y[ = [B[. (Recall (8.51)).
Exercise 8.8.11. Show that for N N matrix A, abs[(A
/
)
1
[ = [AA
/
[
1/2
.
Exercise 8.8.12. Let
(X, Y) N
p+q
_
(0, 0),
_

XX
0
0
YY
__
. (8.77)
where X is 1 p, Y is 1 q, Cov[X] =
XX
, and Cov[Y] =
YY
. By writing out the
density of (X, Y), show that X and Y are independent. (Assume the covariances are
invertible.)
Exercise 8.8.13. Take (X, Y) as in (8.77). Show that X and Y are independent by using
moment generating functions. Do you need that the covariances are invertible?
Exercise 8.8.14. With
(X, Y) N
12
_
(
X
,
Y
),
_

XX

XY

YX

YY
__
, (8.78)
derive the conditional distribution Y [ X = x explicitly using densities, assuming

XX
> 0. That is, show that f
Y[X
(y[x) = f (x, y)/f
X
(x).
Exercise 8.8.15. Prove Corollary 8.3. [Hint: Make the identications z = row(y),
= row(M), and = C in (8.48). Use (3.32f) for the determinant term in the
density. For the term in the exponent, use (3.32d) to help show that
trace(C
1
(y M)
1
(y M)
/
) = (z )(C
1

1
)(z )
/
.] (8.79)
Exercise 8.8.16. Show that U =

X, where X
2
k
, has density as in (8.65).
Exercise 8.8.17. Verify (8.66) and (8.67). [Hint: Collect the constants as in (8.65),
along with the constants from the normals (the R
ij
s, j > i). The trace in the exponent
collects all the r
2
ij
.]
Exercise 8.8.18. Verify (8.68). [Hint: Vectorize the matrices by row, leaving out the
structural zeroes, i.e., for q = 3, v (v
11
, v
12
, v
13
, v
22
, v
23
, v
33
). Then the matrix of
derivatives will be lower triangular.]
146 Chapter 8. Technical Results
Exercise 8.8.19. Verify (8.69). In particular, show that
trace((A
/
)
1
v
/
vA
1
) = trace(
1
v
/
v) (8.80)
and

a
jj
= [[
1/2
. [Recall (5.87).]
Exercise 8.8.20. Verify (8.70). [Hint: Vectorize the matrices as in Exercise 8.8.18, where
for w just take the elements in the upper triangular part.]
Exercise 8.8.21. Verify (8.71) and (8.72).
Exercise 8.8.22. Suppose V Half-Wishart
q
(, ) as in (8.63), where is positive
denite and p. Show that the diagonals V
jj
are independent, and
V
2
ii

jj1:(j1)

2
j+1
. (8.81)
[Hint: Show that with V = RA as in (8.61) and (8.62), V
jj
= a
jj
R
jj
. Apply (5.67) to the
A and .]
Exercise 8.8.23. For a covariance matrix , [[ is called the population generalized
variance. It is an overall measure of spread. Suppose W Wishart
q
(, ), where
is positive denite and q. Show that

[[ =
1
( 1) ( q + 1)
[W[ (8.82)
is an unbiased estimate of the generalized variance. [Hint: Find the Cholesky decom-
position W = V
/
V, then use (8.81) and (5.86).]
Exercise 8.8.24 (Bayesian inference). Consider Bayesian inference for the covariance
matrix. It turns out that the conjugate prior is an inverse Wishart on the covariance
matrix, which means
1
has a Wishart prior. Specically, let
=
1
and
0

0
=
1
0
, (8.83)
where
0
is the prior guess of , and
0
is the prior sample size. (The larger the
0
,
the more weight is placed on the prior vs. the data.) Then the model in terms of the
inverse covariance parameter matrices is
W[ = Wishart
q
(,
1
)
Wishart
q
(
0
,
0
) , (8.84)
where q,
0
q and
0
is positive denite, so that , hence , is invertible with
probability one. Note that the prior mean for is
1
0
. (a) Show that the joint density
of (W, ) is
f
W[
(w[ ) f

() = c(w)[[
(+
0
q1)/2
e

1
2
trace((w+
1
0
))
, (8.85)
where c(w) is some constant that does not depend on , though it does depend on

0
and
0
. (b) Without doing any calculations, show that the posterior distribution
of is
[ W = w Wishart
q
( +
0
, (w+
1
0
)
1
). (8.86)
8.8. Exercises 147
[Hint: Dividing the joint density in (8.85) by the marginal density of W, f
W
(w), yields
the posterior density just like the joint density, but with a different constant, say,
c

(w). With as the variable, the density is a Wishart one, with given parameters.]
(c) Letting S = W/ be the sample covariance matrix, show that the posterior mean
of is
E[[ W = w] =
1
+
0
q 1
(S +
0

0
), (8.87)
close to a weighted average of the prior guess and observed covariance matrices.
[Hint: Use Lemma 7.1 on , rather than trying to nd the distribution of .]
Exercise 8.8.25 (Bayesian inference). Exercise 3.7.30 considered Bayesian inference on
the normal mean when the covariance matrix is known, and the above Exercise 8.8.24
treated the covariance case with no mean apparent. Here we present a prior to deal
with the mean and covariance simultaneously. It is a two-stage prior:
[ = N
pq
(
0
, (K
0
)
1
),
Wishart
q
(
0
,
0
) . (8.88)
Here, K
0
,
0
,
0
and
0
are known, where K
0
and
0
are positive denite, and
0
q.
Show that unconditionally, E[] =
0
and, using (8.83),
Cov[] =
1

0
q 1
(K
0

0
)
1
=

0

0
q 1
K
1
0

0
. (8.89)
[Hint: Use the covariance decomposition in (2.74) on .]
Exercise 8.8.26. This exercise nds density of the marginal distribution of the in
(8.88). (a) Show that the joint density of and can be written
f
,
(m, ) =
1
(2)
pq/2
d(
0
, q)
[
0
[

0
/2
[K
0
[
q/2
[[
(
0
+pq1)/2
e

1
2
trace(((m
0
)
/
K
0
(m
0
)+
1
0
))
, (8.90)
for the Wishart constant d(
0
, q) given in (8.72). [Hint: Use the pdfs in (8.54) and
(8.69).] (b) Argue that the nal two terms in (8.90) (the [[ term and the exponential
term) look like the density of if
Wishart
q
(
0
+ p, ((m
0
)
/
K
0
(m
0
) +
1
0
)
1
), (8.91)
but without the constants, hence integrating over yields the inverse of those con-
stants. Then show that the marginal density of is
f

(m) =
_
f
,
(m, )d
=
d(
0
+ p, q)
(2)
pq/2
d(
0
, q)
[
0
[

0
/2
[K
0
[
q/2
[(m
0
)
/
K
0
(m
0
) +
1
0
[
(
0
+p)/2
=
1
c(
0
, p, q)
[
0
[
p/2
[K
0
[
q/2
[(m
0
)
/
K
0
(m
0
)
0
+I
q
[
(
0
+p)/2
,
(8.92)
148 Chapter 8. Technical Results
where
c(
0
, p, q) = (2)
pq/2
d(
0
, q)/d(
0
+ p, q). (8.93)
This density for is a type of multivariate t. Hotellings T
2
is another type. (c) Show
that if p = q = 1,
0
= 0, K
0
= 1/
0
and
0
= 1, that the pdf (8.92) is that of a
Students t on
0
degrees of freedom:
f (t [
0
) =
((
0
+ 1)/2)

(
0
/2)
1
(1 + t
2
/
0
)
(
0
+1)/2
. (8.94)
Exercise 8.8.27 (Bayesian inference). Now we add some data to the prior in Exercise
8.8.25. The conditional model for the data is
Y[ = m, = N
pq
(m, (K)
1
),
W[ = m, = Wishart
q
(,
1
), (8.95)
where Y and W are independent given and . Note that Ws distribution does not
depend on the . The conjugate prior is given in (8.88), with the conditions given
therebelow. The K is a xed positive denite matrix. A curious element is that prior
covariance of the mean and the conditional covariance of Y have the same , which
helps tractability (as in Exercise 6.5.3). (a) Justify the following equations:
f
Y,W,,
(y, w, m, ) = f
Y[ ,
(y [ m, ) f
W[
(w[ ) f
[
(m[ ) f

()
= f
[ Y,
(m[ y, ) f
Y[
(y [ ) f
W[
(w[ ) f

()
(8.96)
(b) Show that the conditional distribution of given Y and is multivariate normal
with
E[ [ Y = y, = ] = (K+K
0
)
1
(Ky +K
0

0
),
Cov[ [ Y = y, = ] = (K+K
0
)
1

1
. (8.97)
[Hint: Follows from Exercise 3.7.29, noting that is xed (conditioned upon) for this
calculation.] (c) Show that
Y[ = N
pq
(
0
, (K
1
+K
1
0
)
1
). (8.98)
[Hint: See (3.102).] (d) Let Z = (K
1
+K
1
0
)
1/2
(Y
0
), and show that the middle
two densities in the last line of (8.96) can be combined into the density of
U = W+Z
/
Z[ = Wishart
q
( + p,
1
), (8.99)
that is,
f
Y[
(y [ ) f
W[
(w[ ) = c

(u, w) f
U[
(u[ ) (8.100)
for some constant c

(u, w) that does not depend on . (e) Now use Exercise 8.8.24 to
show that
[ U = u Wishart
q
( +
0
+ p, (u +
1
0
)
1
). (8.101)
(f) Thus the posterior distribution of and in (8.97) and (8.101) are given in the
same two stages as the prior in (8.88). The only differences are in the parameters.
8.8. Exercises 149
The prior parameters are
0
, K
0
,
0
, and
0
. What are the corresponding posterior
parameters? (g) Using (8.83), show that the posterior means of and are
E[ [ Y = y, W = w] = (K+K
0
)
1
(Ky +K
0

0
),
E[[ Y = y, W = w] =
1
+
0
+ p q 1
(u +
0

0
), (8.102)
and the posterior covariance of is
Cov[ [ Y = y, W = w] =
1
+
0
+ p q 1
(K+K
0
)
1
(u +
0

0
). (8.103)
Chapter 9
Likelihood Methods
For the linear models, we derived estimators of using the least-squares principle,
and found estimators of
R
in an obvious manner. Likelihood provides another
general approach to deriving estimators, hypothesis tests, and model selection proce-
dures. We start with a very brief introduction, then apply the principle to the linear
models. Chapter 10 considers MLEs for models concerning the covariance matrix.
9.1 Introduction
Throughout this chapter, we assume we have a statistical model consisting of a ran-
dom object (usually a matrix or a set of matrices) Y with space }, and a set of
distributions P

[ , where is the parameter space. We assume that these


distributions have densities, with P

having associated density f (y[ ).


Denition 9.1. For a statistical model with densities, the likelihood function is dened for
each xed y } as the function L( ; y) : [0, ) given by
L(; y) = a(y) f (y[ ), (9.1)
for any positive a(y).
Likelihoods are to be interpreted in only relative fashion, that is, to say the likeli-
hood of a particular
1
is L(
1
; y) does not mean anything by itself. Rather, meaning
is attributed to saying that the relative likelihood of
1
to
2
(in light of the data y) is
L(
1
; y)/L(
2
; y). Which is why the a(y) in (9.1) is allowed. There is a great deal of
controversy over what exactly the relative likelihood means. We do not have to worry
about that particularly, since we are just using likelihood as a means to an end. The
general idea, though, is that the data supports s with relatively high likelihood.
The next few sections consider maximum likelihood estimation. Subsequent sec-
tions look at likelihood ratio tests, and two popular model selection techniques (AIC
and BIC). Our main applications are to multivariate normal parameters.
9.2 Maximum likelihood estimation
Given the data y, it is natural (at least it sounds natural terminologically) to take as
estimate of the value that is most likely. Indeed, that is the maximum likelihood
151
152 Chapter 9. Likelihood Methods
estimate.
Denition 9.2. The maximum likelihood estimate (MLE) of the parameter based on the
data y is the unique value, if it exists,

(y) that maximizes the likelihood L(; y).
It may very well be that the maximizer is not unique, or does not exist at all, in
which case there is no MLE for that particular y. The MLE of a function of , g(),
is dened to be the function of the MLE, that is,

g() = g(

). See Exercises 9.6.1 and


9.6.2 for justication.
9.2.1 The MLE in multivariate regression
Here the model is the multivariate regression model (4.8),
Y N
nq
(x, I
n

R
), (9.2)
where x is n p and is p q. We need to assume that
n p + q;
x
/
x is positive denite;

R
is positive denite.
The parameter = (,
R
) and the parameter space is = R
pq
o
+
q
, where o
+
q
is
the space of q q positive denite symmetric matrices as in (5.34).
To nd the MLE, we rst must nd the L, that is, the pdf of Y. We can use
Corollary 8.3. From (9.2) we see that the density in (8.54) has M = x and C = I
n
,
hence the likelihood is
L(,
R
; y) =
1
[
R
[
n/2
e

1
2
trace(
1
R
(yx)
/
(yx))
. (9.3)
(Note that the constant has been dropped.)
To maximize the likelihood, rst consider maximizing L over , which is equiva-
lent to minimizing trace(
1
R
(y x)
/
(y x)). Let

be the least-squares estimate,

= (x
/
x)
1
x
/
y. We show that this is in fact the MLE. Write
y x = y x

+x

x
= Q
x
y +x(

), (9.4)
where Q
x
= I
n
x(x
/
x)
1
x
/
(see Proposition 5.1), so that
(y x)
/
(y x) = (Q
x
y +x(

))
/
(Q
x
y +x(

))
= y
/
Q
x
y + (

)
/
x
/
x(

). (9.5)
Because
R
and x
/
x are positive denite, trace(
1
R
(

)
/
x
/
x(

)) > 0 unless
=

, in which case the trace is zero, which means that

uniquely maximizes the
likelihood over for xed
R
, and because it does not depend on
R
, it is the MLE.
Now consider
L(

,
R
; y) =
1
[
R
[
n/2
e

1
2
trace(
1
R
y
/
Q
x
y)
. (9.6)
We need to maximize that over
R
o
+
q
. We appeal to the following lemma, proved
in Section 9.2.3.
9.2. Maximum likelihood estimation 153
Lemma 9.1. Suppose a > 0 and U o
+
q
. Then
g() =
1
[[
a/2
e

1
2
trace(
1
U)
(9.7)
is uniquely maximized over o
+
q
by

=
1
a
U, (9.8)
and the maximum is
g(

) =
1
[

[
a/2
e

aq
2
. (9.9)
Applying this lemma to (9.6) yields

R
=
y
/
Q
x
y
n
. (9.10)
Note that Y
/
Q
x
Y Wishart
q
(n p,
R
), which is positive denite with probability
1 if n p q, i.e., if n p + q, which we have assumed. Also, note that the
denominator is n, compared to the n p we used in (6.19). Thus unless x = 0, this
estimate will be biased since E[Y
/
Q
x
Y] = (n p)
R
. Now we have that from (9.3) or
(9.9),
L(

R
; y) =
1
[

R
[
n/2
e

nq
2
. (9.11)
9.2.2 The MLE in the both-sides linear model
Here we add the z, that is, the model is
Y N
nq
(xz
/
, I
n

R
), (9.12)
where x is n p, z is q l, and is p l. We need one more assumption, i.e., we
need that
n p + q;
x
/
x is positive denite;
z
/
z is positive denite;

R
is positive denite.
The parameter is again = (,
R
), where the parameter space is = R
pl
o
+
q
.
There are two cases to consider: z is square, l = q, hence invertible, or l < q. In
the former case, we proceed as in (7.76), that is, let Y
(z)
= Y(z
/
)
1
, so that
Y
(z)
N
nq
(x, I
n

z
), (9.13)
where as in (6.5),

z
= z
1

R
(z
/
)
1
. (9.14)
154 Chapter 9. Likelihood Methods
The MLEs of and
z
are as in the previous Section 9.2.1:

= (x
/
x)
1
x
/
Y
(z)
= (x
/
x)
1
x
/
Y(z
/
)
1
(9.15)
and

z
=
1
n
Y
(z)
/
Q
x
Y
(z)
= z
1
Y
/
Q
z
Y(z
/
)
1
. (9.16)
Since (,
z
) and (,
R
) are in one-to-one correspondence, the MLE of (,
R
) is
given by (9.15) and (9.10), the latter because
R
= z
z
z
/
.
The case that l < q is a little more involved. As in Section 7.5.1 (and Exercise
5.6.18), we ll out z, that is, nd a q (q l) matrix z
2
so that z
/
z
2
= 0 and (z z
2
) is
invertible. Following (7.79), the model (9.12) is the same as
Y N
nq
_
x
_
0
_
_
z
/
z
/
2
_
, I
n

R
_
, (9.17)
or
Y
(z)
Y
_
z
/
z
/
2
_
1
N
nq
_
x
_
0
_
, I
n

z
)
_
, (9.18)
where

z
=
_
z z
2
_
1

R
_
z
/
z
/
2
_
1
. (9.19)
As before, (,
z
) and (,
R
) are in one-to-one correspondence, so it will be
sufcient to rst nd the MLE of the former. Because of the 0 in the mean of Y
(z)
,
the least-squares estimate of in (5.27) is not the MLE. Instead, we have to proceed
conditionally. That is, partition Y
(z)
similar to (7.69),
Y
(z)
= (Y
(z)
a
Y
(z)
b
), Y
(z)
a
is n l, Y
(z)
b
is n (q l), (9.20)
and
z
similar to (7.70),

z
=
_

z,aa

z,ab

z,ba

z,bb
_
. (9.21)
The density of Y
(z)
is a product of the conditional density of Y
(z)
a
given Y
(z)
b
= y
(z)
b
,
and the marginal density of Y
(z)
b
:
f (y
(z)
[ ,
z
) = f (y
(z)
a
[ y
(z)
b
, , ,
z,aab
) f (y
(z)
b
[
z,bb
), (9.22)
where
=
1
z,bb

z,ba
and
z,aab
=
z,aa

z,ab

1
z,bb

z,ba
. (9.23)
The notation in (9.22) makes explicit the facts that the conditional distribution of Y
(z)
a
given Y
(z)
b
= y
(z)
b
depends on (,
z
) through only (, ,
z,aab
), and the marginal
distribution of Y
(z)
b
depends on (,
z
) through only
z,bb
.
The set of parameters (, ,
z,aab
,
z,bb
) can be seen to be in one-to-one corre-
spondence with (,
z
), and has space R
pl
R
(ql)l
o
+
l
o
+
ql
. That is, the
parameters in the conditional density are functionally independent of those in the
marginal density, which means that we can nd the MLEs of these parameters sepa-
rately.
9.2. Maximum likelihood estimation 155
Conditional part
We know as in (7.72) (without the parts) and (7.73) that
Y
a
[ Y
b
= y
b
N
nl
_
_
x y
(z)
b
_
_

_
, I
n

z,aab
_
. (9.24)
We are in the multivariate regression case, that is, without a z, so the MLE of the
(
/
,
/
)
/
parameter is the least-squares estimate
_


_
=
_
x
/
x x
/
y
(z)
b
y
(z)
/
b
x y
(z)
/
b
y
(z)
b
_
1
_
x
/
y
(z)
/
b
_
Y
(z)
a
, (9.25)
and

z,aab
=
1
n
Y
(z)
a
/
Q
(x,y
(z)
b
)
Y
(z)
a
. (9.26)
Marginal part
From (9.19), we have that
Y
(z)
b
N
n(ql)
(0, I
n

z,bb
), (9.27)
hence it is easily seen that

z,bb
=
1
n
Y
(z)
b
/
Y
(z)
b
(9.28)
is the MLE.
The maximum of the likelihood from (9.22), ignoring the constants, is
1
[

z,aab
[
n/2
1
[

z,bb
[
n/2
e

nq
2
(9.29)
Putting things back together, we rst note that the MLE of is given in (9.25), and
that for
z,bb
is given in (9.28). By (9.23), the other parts of
z
have MLE

z,ba
=

z,bb
and

z,aa
=

z,aab
+
/

z,bb
. (9.30)
Finally, to get back to
R
, use

R
=
_
z z
2
_

z
_
z
/
z
/
2
_
. (9.31)
If one is mainly interested in , then the MLE can be found using the pseudo-
covariate approach in Section 7.5.1, and the estimation of
z,bb
and the reconstruction
of

R
are unnecessary.
We note that a similar approach can be used to nd the MLEs in the covariate
model (7.69), with just a little more complication to take care of the parts. Again, if
one is primarily interested in the
a
, then the MLE is found as in that section.
156 Chapter 9. Likelihood Methods
9.2.3 Proof of Lemma 9.1
Because U is positive denite and symmetric, it has an invertible symmetric square
root, U
1/2
. Let = U
1/2
U
1/2
, and from (9.7) write
g() = h(U
1/2
U
1/2
), where h()
1
[U[
a/2
1
[[
a/2
e

1
2
trace(
1
)
(9.32)
is a function of o
+
q
. Exercise 9.6.7 shows that (9.32) is maximized by

= (1/a)I
q
,
hence
h(

) =
1
[U[
a/2
[
1
a
I
q
[
a/2
e

1
2
atrace(I
q
)
, (9.33)
from which follows (9.9). Also,

= U
1/2

U
1/2


= U
1/2
1
a
I
q
U
1/2
=
1
a
U, (9.34)
which proves (9.8). 2
9.3 Likelihood ratio tests
Again our big model has Y with space } and a set of distributions P

[ with
associated densities f (y [ ). Testing problems we consider are of the form
H
0
:
0
versus H
A
:
A
, (9.35)
where

0

A
. (9.36)
Technically, the space in H
A
should be
A

0
, but we take that to be implicit.
The likelihood ratio statistic for problem (9.35) is dened to be
LR =
sup

A
L(; y)
sup

0
L(; y)
, (9.37)
where the likelihood L is given in Denition 9.1.
The idea is that the larger LR, the more likely the alternative H
A
is, relative to the
null H
0
. For testing, one would either use LR as a basis for calculating a p-value, or
nd a c

such that rejecting the null when LR > c

yields a level of (approximately) .


Either way, the null distribution of LR is needed, at least approximately. The general
result we use says that under certain conditions (satised by most of what follows),
under the null hypothesis,
2 log(LR)
T

2
d f
where d f = dim(H
A
) dim(H
0
) (9.38)
as n (which means there must be some n to go to innity). The dimension
of a hypothesis is the number of free parameters it takes to uniquely describe the
associated distributions. This denition is not very explicit, but in most examples the
dimension will be obvious.
9.4. Model selection 157
9.3.1 The LRT in multivariate regression
Consider the multivariate regression model in (9.2), with the x and partitioned:
Y N
nq
_
(x
1
x
2
)
_

1

2
_
, I
n

R
_
, (9.39)
where the x
i
are n p
i
and the
i
are p
i
q, p
1
+ p
2
= p. We wish to test
H
0
:
2
= 0 versus H
A
:
2
,= 0. (9.40)
The maximized likelihoods under the null and alternative are easy to nd using
(9.10) and (9.11). The MLEs of under H
0
and H
A
are, respectively,

0
=
1
n
y
/
Q
x
1
y and

A
=
1
n
y
/
Q
x
y, (9.41)
the former because under the null, the model is multivariate regression with mean
x
1

1
. Then the likelihood ratio from (9.37) is
LR =
_
[

0
[
[

A
[
_
n/2
=
_
[y
/
Q
x
1
y[
[y
/
Q
x
y[
_
n/2
. (9.42)
We can use the approximation (9.38) under the null, where here d f = p
2
q. It turns
out that the statistic is equivalent to Wilks in (7.2),
= (LR)
2/n
=
[W[
[W+B[
Wilks
q
(p
2
, n p), (9.43)
where
W = y
/
Q
x
y and B = y
/
(Q
x
1
Q
x
)y. (9.44)
See Exercise 9.6.8. Thus we can use Bartletts approximation in (7.37), with l

= q
and p

= p
2
.
9.4 Model selection: AIC and BIC
As in Section 7.6, we often have a number of models we wish to consider, rather
than just two as in hypothesis testing. (Note also that hypothesis testing may not be
appropriate even when choosing between two models, e.g., when there is no obvious
allocation to null and alternative models.) Using Mallows C
p
(Proposition 7.2)
is reasonable in linear models, but more general methods are available that are based
on likelihoods.
We assume there are K models under consideration, labelled M
1
, M
2
, . . . , M
K
.
Each model is based on the same data, Y, but has its own density and parameter
space:
Model M
k
Y f
k
(y [
k
),
k

k
. (9.45)
The densities need not have anything to do with each other, i.e., one could be normal,
another uniform, another logistic, etc., although often they will be of the same type.
It is possible that the models will overlap, so that several models might be correct at
once, e.g., when there are nested models.
158 Chapter 9. Likelihood Methods
Let
l
k
(
k
; y) = log(L
k
(
k
; y)) = log( f
k
(y [
k
)) + C(y), k = 1, . . . , K, (9.46)
be the loglikelihoods for the models. The constant C(y) is arbitrary, and as long as
it is the same for each k, it will not affect the outcome of the following procedures.
Dene the deviance of the model M
k
at parameter value
k
by
deviance(M
k
(
k
) ; y) = 2 l
k
(
k
; y). (9.47)
It is a measure of t of the model to the data; the smaller the deviance, the better the
t. The MLE of
k
for model M
k
minimizes this deviance, giving us the observed
deviance,
deviance(M
k
(

k
) ; y) = 2 l
k
(

k
; y) = 2 max

k
l
k
(
k
; y). (9.48)
Note that the likelihood ratio statistic in (9.38) is just the difference in observed
deviance of the two hypothesized models:
2 log(LR) = deviance(H
0
(

0
) ; y) deviance(H
A
(

A
) ; y). (9.49)
At rst blush one might decide the best model is the one with the smallest ob-
served deviance. The problem with that approach is that because the deviances are
based on minus the maximum of the likelihoods, the model with the best observed
deviance will be the largest model, i.e., one with highest dimension. Instead, we add
a penalty depending on the dimension of the parameter space, as for Mallows C
p
in
(7.102). The two most popular procedures are the Bayes information criterion (BIC)
of Schwarz [1978] and the Akaike information criterion (AIC) of Akaike [1974] (who
actually meant for the A to stand for An):
BIC(M
k
; y) = deviance(M
k
(

k
) ; y) + log(n)d
k
, and (9.50)
AIC(M
k
; y) = deviance(M
k
(

k
) ; y) + 2d
k
, (9.51)
where in both cases,
d
k
= dim(
k
). (9.52)
Whichever criterion is used, it is implemented by nding the value for each model,
then choosing the model with the smallest value of the criterion, or looking at the
models with the smallest values.
Note that the only difference between AIC and BIC is the factor multiplying the
dimension in the penalty component. The BIC penalizes each dimension more heav-
ily than does the AIC, at least if n > 7, so tends to choose more parsimonious models.
In more complex situations than we deal with here, the deviance information criterion
is useful, which uses more general denitions of the deviance. See Spiegelhalter et al.
[2002].
The next two sections present some further insight into the two criteria.
9.4.1 BIC: Motivation
The AIC and BIC have somewhat different motivations. The BIC, as hinted at by the
Bayes in the name, is an attempt to estimate the Bayes posterior probability of the
9.4. Model selection 159
models. More specically, if the prior probability that model M
k
is the true one is
k
,
then the BIC-based estimate of the posterior probability is

P
BIC
[M
k
[ y] =
e

1
2
BIC(M
k
; y)

k
e

1
2
BIC(M
1
; y)

1
+ + e

1
2
BIC(M
K
; y)

K
. (9.53)
If the prior probabilities are taken to be equal, then because each posterior probability
has the same denominator, the model that has the highest posterior probability is
indeed the model with the smallest value of BIC. The advantage of the posterior
probability form is that it is easy to assess which models are nearly as good as the
best, if there are any.
To see where the approximation arises, we rst need a prior on the parameter
space. In this case, there are several parameter spaces, one for each model under
consideration. Thus is it easier to nd conditional priors for each
k
, conditioning on
the model:

k
[ M
k

k
(
k
), (9.54)
for some density
k
on
k
. The marginal probability of each model is the prior
probability:

k
= P[M
k
]. (9.55)
The conditional density of (Y,
k
) given M
k
is
g
k
(y,
k
[ M
k
) = f
k
(y [
k
)
k
(
k
). (9.56)
To nd the density of Y given M
k
, we integrate out the
k
:
Y[ M
k
g
k
(y [ M
k
) =
_

k
g
k
(y,
k
[ M
k
)d
k
=
_

k
f
k
(y [
k
)
k
(
k
)d
k
. (9.57)
With the parameters hidden, it is straightforward to nd the posterior probabilities
of the models using Bayes theorem, Theorem 2.2:
P[M
k
[ y] =
g
k
(y [ M
k
)
k
g
1
(y [ M
1
)
1
+ + g
K
(y [ M
K
)
K
. (9.58)
Comparing (9.58) to (9.53), we see that the goal is to approximate g
k
(y [ M
k
) by
e

1
2
BIC(M
k
; y)
. To do this, we use the Laplace approximation, as in Schwarz [1978].
The following requires a number of regularity assumptions, not all of which we will
detail. One is that the data y consists of n iid observations, another that n is large.
Many of the standard likelihood-based assumptions needed can be found in Chapter
6 of Lehmann and Casella [1998], or any other good mathematical statistics text. For
convenience we drop the k, and from (9.57) consider
_

f (y[ )()d =
_

e
l(; y)
()d. (9.59)
The Laplace approximation expands l(; y) around its maximum, the maximum oc-
curing at the maximum likelihood estimator

. Then, assuming all the derivatives
exist,
l(; y) l(

; y) + (

)
/
(

) +
1
2
(

)
/
H(

)(

), (9.60)
160 Chapter 9. Likelihood Methods
where (

) is the d 1 ( is d 1) vector with

i
(

) =

i
l(; y) [
=

, (9.61)
and H is the d d matrix with
H
ij
=

2

j
l(; y) [
=

. (9.62)
Because

is the MLE, the derivative of the loglikelihood at the MLE is zero, i.e.,
(

) = 0. Also, let

F =
1
n
H(

), (9.63)
which is called the observed Fisher information contained in one observation. Then
(9.59) and (9.60) combine to show that
_

f (y[ )()d e
l(

; y)
_

1
2
(

)
/
n

F(

)(

)
()d. (9.64)
If n is large, the exponential term in the integrand drops off precipitously when is
not close to

, and assuming that the prior density () is fairly at for near

, we
have
_

1
2
(

)
/
n

F(

)(

)
()d
_

1
2
(

)
/
n

F(

)(

)
d(

). (9.65)
The integrand in the last term looks like the density (8.48) as if
N
d
(

, (n

F)
1
), (9.66)
but without the constant. Thus the integral is just the reciprocal of that constant, i.e.,
_

1
2
(

)
/
n

F(

)(

)
d = (

2)
d/2
[n

F[
1/2
= (

2)
d/2
[

F[
1/2
n
d/2
. (9.67)
Putting (9.64) and (9.67) together gives
log
_
_

f (y[ )()d
_
l(

; y)
d
2
log(n)
+ log((

)) +
d
2
log(2)
1
2
log([

F[)
l(

; y)
d
2
log(n)
=
1
2
BIC(M; y). (9.68)
Dropping the last three terms in the rst line is justied by noting that as n ,
l(

; y) is of order n (in the iid case), log(n)d/2 is clearly of order log(n), and the
other terms are bounded. (This step may be a bit questionable since n has to be
extremely large before log(n) starts to dwarf a constant.)
There are a number of approximations and heuristics in this derivation, and in-
deed the resulting approximation may not be especially good. See Berger, Ghosh,
and Mukhopadhyay [1999], for example. A nice property is that under conditions,
if one of the considered models is the correct one, then the BIC chooses the correct
model as n .
9.4. Model selection 161
9.4.2 AIC: Motivation
The Akaike information criterion can be thought of as a generalization of Mallows
C
p
in that it is an attempt to estimate the expected prediction error. In Section 7.6
it was reasonable to use least squares as a measure of prediction error. In general
models, it is not obvious which measure to use. Akaikes idea is to use deviance.
We wish to predict the unobserved Y
New
, which is independent of Y but has the
same distribution. We measure the distance between the new variable y
New
and the
prediction of its density given by the model M
k
via the prediction deviance:
deviance(M
k
(

k
) ; y
New
) = 2l
k
(

k
; y
New
), (9.69)
as in (9.48), except that here, while the MLE

k
is based on the data y, the loglike-
lihood is evaluated at the new variable y
New
. The expected prediction deviance is
then
EPredDeviance(M
k
) = E[deviance(M
k
(

k
) ; Y
New
)], (9.70)
where the expected value is over both the data Y (through the

k
) and the new vari-
able Y
New
. The best model is then the one that minimizes this value.
We need to estimate the expected prediction deviance, and the observed deviance
(9.48) is the obvious place to start. As for Mallows C
p
, the observed deviance is
likely to be an underestimate because the parameter is chosen to be optimal for the
particular data y. Thus we would like to nd how much of an underestimate it is,
i.e., nd
= EPredDeviance(M
k
) E[deviance(M
k
(

k
) ; Y)]. (9.71)
Akaike argues that for large n, the answer is = 2d
k
, i.e.,
EPredDeviance(M
k
) E[deviance(M
k
(

k
) ; Y)] + 2d
k
, (9.72)
from which the AIC in (9.51) arises. One glitch in the proceedings is that the approxi-
mation assumes that the true model is in fact M
k
(or a submodel thereof), rather than
the most general model, as in (7.83) for Mallows C
p
.
Rather than justify the result in full generality, we will show the exact value for
for multivariate regression, as Hurvich and Tsai [1989] did in the multiple regression
model.
9.4.3 AIC: Multivariate regression
Consider the multivariate regression model (4.8),
Model M : Y N
nq
(x, I
n

R
) , (9.73)
where is p q, x is np with x
/
x invertible, and
R
is a nonsingular q q covariance
matrix. Now from (9.3),
l(,
R
; y) =
n
2
log([
R
[)
1
2
trace(
1
R
(y x)
/
(y x)). (9.74)
The MLEs are then

= (x
/
x)
1
x
/
y and

R
=
1
n
y
/
Q
x
y, (9.75)
162 Chapter 9. Likelihood Methods
as in (9.10). Thus, as from (9.11) and (9.47), we have
deviance(M(

R
) ; y) = n log([

R
[) + nq, (9.76)
and the prediction deviance as in (9.69) is
deviance(M(

R
) ; Y
New
) = nlog([

R
[) + trace(

1
R
(Y
New
x

)
/
(Y
New
x

)).
(9.77)
To nd EPredDeviance, we rst look at
Y
New
x

= Y
New
P
x
Y N(0, (I
n
+P
x
)
R
), (9.78)
because Y
New
and Y are independent, and P
x
Y N(x, P
x

R
). Then (Exercise
9.6.10)
E[(Y
New
x

)
/
(Y
New
x

)] = trace(I
n
+P
x
)
R
= (n + p)
R
. (9.79)
Because Y
New
and P
x
Y are independent of

R
, we have from (9.77) and (9.79) that
EPredDeviance(M) = E[nlog([

R
[)] + (n + p)E[trace(

1
R

R
)]. (9.80)
Using the deviance from (9.76), the difference from (9.71) can be written
= (n + p) E[trace(

1
R

R
)] nq = n(n + p) E[trace(W
1

R
)] nq, (9.81)
the log terms cancelling, where
W = Y
/
Q
x
Y Wishart(n p,
R
). (9.82)
By (8.31), with = n p and l

= q,
E[W
1
] =
1
n p q 1

R
. (9.83)
Then from (9.81),
= n(n + p)
q
n p q 1
nq =
n
n p q 1
q(2p + q + 1). (9.84)
Thus
AIC

(M; y) = deviance(M(

R
) ; y) +
nq(2p + q + 1)
n p q 1
(9.85)
for the multivariate regression model. For large n, 2 dim(). See Exercise 9.6.11.
In univariate regression q = 1, and (9.85) is the value given in Hurvich and Tsai
[1989].
9.5. Example 163
9.5 Example: Mouth sizes
Return to the both-sides models (7.40) for the mouth size data. We consider the same
submodels as in Section 7.6.1, which we denote M
p

l
: the p

indicates the x-part of


the model, where p

= 1 means just use the constant, and p

= 2 uses the constant


and the sex indicator; the l

indicates the degree of polynomial represented in the z


matrix, where l

= degree + 1.
To nd the MLE as in Section 9.2.2 for these models, because the matrix z is
invertible, we start by multiplying y on the right by (z
/
)
1
:
y
(z)
= y(z
/
)
1
= y

0.25 0.15 0.25 0.05


0.25 0.05 0.25 0.15
0.25 0.05 0.25 0.15
0.25 0.15 0.25 0.05

. (9.86)
We look at two of the models in detail. The full model M
24
in (7.40) is actually just
multivariate regression, so there is no before variable y
(z)
b
. Thus

= (x
/
x)
1
x
/
y
(z)
=
_
24.969 0.784 0.203 0.056
2.321 0.305 0.214 0.072
_
, (9.87)
and (because n = 27),

R
=
1
27
(y
(z)
x

)
/
(y
(z)
x

)
=

3.499 0.063 0.039 0.144


0.063 0.110 0.046 0.008
0.039 0.046 0.241 0.005
0.144 0.008 0.005 0.117

. (9.88)
These coefcient estimates were also found in Section 7.1.1.
The observed deviance (9.48), given in (9.76), is
deviance(M
24
(

R
) ; y
(z)
) = n log([

R
[) + nq = 18.611, (9.89)
with n = 27 and q = 4.
The best model in (7.103) was M
22
, which t different linear equations to the boys
and girls. In this case, y
(z)
a
consists of the rst two columns of y
(z)
, and y
(z)
b
the nal
two columns. As in (9.25), to nd the MLE of the coefcients, we shift the y
(z)
b
to be
with the x, yielding
_


_
= ((x y
(z)
b
)
/
(x y
(z)
b
))
1
(x y
(z)
b
)
/
y
(z)
a
=

24.937 0.827
2.272 0.350
0.189 0.191
1.245 0.063

, (9.90)
164 Chapter 9. Likelihood Methods
where the top 2 2 submatrix contains the estimates of the
ij
s. Notice that they are
similar but not exactly the same as the estimates in (9.87) for the corresponding coef-
cients. The bottom submatrix contains coefcients relating the after and before
measurements, which are not of direct interest.
There are two covariance matrices needed, both 2 2:

z,aab
=
1
27
_
y
(z)
b
(x y
(z)
b
)
_


__
/
_
y
(z)
b
(x y
(z)
b
)
_


__
=
_
3.313 0.065
0.065 0.100
_
(9.91)
and

z,bb
=
1
27
y
z
/
b
y
(z)
b
=
_
0.266 0.012
0.012 0.118
_
. (9.92)
Then by (9.29),
deviance(M
22
(

) ; y
(z)
) = n log([

z,aab
[) + n log([

z,bb
[) + nq
= 27(1.116 3.463 + 4)
= 15.643. (9.93)
Consider testing the two models:
H
0
: M
22
versus H
A
: M
24
. (9.94)
The likelihood ratio statistic is from (9.49),
deviance(M
22
) deviance(M
24
) = 15.643 + 18.611 = 2.968. (9.95)
That value is obviously not signicant, but formally the chi-square test would com-
pare the statistic to the cutoff from a
2
d f
where d f = d
24
d
22
. The dimension for
model M
p

l
is
d
p

l
= p

+
_
q
2
_
= p

+ 10. (9.96)
because the model has p

non-zero
ij
s, and
R
is 4 4. Note that all the models
we are considering have the same dimension
R
. For the hypotheses (9.94), d
22
= 14
and d
24
= 18, hence the d f = 4 (rather obviously since we are setting four of the
ij
s
to 0). The test shows that there is no reason to reject the smaller model in favor of the
full one.
The AIC (9.50) and BIC (9.51) are easy to nd as well (log(27) 3.2958):
AIC BIC C
p
M
22
15.643 + 2(14) = 12.357 15.643 + log(27)(14) = 30.499 599.7
M
24
18.611 + 2(18) = 17.389 18.611 + log(27)(18) = 40.714 610.2
(9.97)
The Mallows C
p
s are from (7.103). Whichever criterion you use, it is clear the smaller
model optimizes it. It is also interesting to consider the BIC-based approximations to
9.6. Exercises 165
the posterior probabilities of these models in (9.53). With
22
=
24
=
1
2
, we have

P
BIC
[M
22
[ y
(z)
] =
e

1
2
BIC(M
22
)
e

1
2
BIC(M
22
)
+ e

1
2
BIC(M
24
)
= 0.994. (9.98)
That is, between these two models, the smaller one has an estimated probability of
99.4%, quite high.
We repeat the process for each of the models in (7.103) to obtain the following
table (the last column is explained below):
p

Deviance d
p

l
AIC BIC C
p

P
BIC
GOF
1 1 36.322 11 58.322 72.576 947.9 0.000 0.000
1 2 3.412 12 20.588 36.138 717.3 0.049 0.019
1 3 4.757 13 21.243 38.089 717.9 0.018 0.017
1 4 4.922 14 23.078 41.220 722.6 0.004 0.008
2 1 30.767 12 54.767 70.317 837.7 0.000 0.000
2 2 15.643 14 12.357 30.499 599.7 0.818 0.563
2 3 18.156 16 13.844 34.577 601.2 0.106 0.797
2 4 18.611 18 17.389 40.714 610.2 0.005 0.000
(9.99)
For AIC, BIC, and C
p
, the best model is the one with linear ts for each sex, M
22
.
The next best is the model with quadratic ts for each sex, M
23
. The penultimate
column has the BIC-based estimated posterior probabilities, taking the prior proba-
bilities equal. Model M
22
is the overwhelming favorite, with about 82% estimated
probability, and M
23
is next with about 11%, not too surprising considering the plots
in Figure 4.1. The only other models with estimated probability over 1% are the lin-
ear and quadratic ts with boys and girls equal. The probability that a model shows
differences between the boys and girls can be estimated by summing the last four
probabilities, obtaining 93%.
The table in (9.99) also contains a column GOF, which stands for goodness-
of-t. Perlman and Wu [2003] suggest in such model selection settings to nd the
p-value for each model when testing the model (as null) versus the big model. Thus
here, for model M
p

l
, we nd the p-value for testing
H
0
: M
p

l
versus H
A
: M
24
. (9.100)
As in (9.49), we use the difference in the models deviances, which under the null has
an approximate
2
distribution, with the degrees of freedom being the difference in
their dimensions. Thus
GOF(M
p

l
) = P[
2
d f
> deviance(M
p

l
) deviance(M
24
)], d f = d
24
d
p

l
.
(9.101)
A good model is one that ts well, i.e., has a large p-value. The two models that t
very well are M
22
and M
23
, as for the other criteria, though here M
23
ts best. Thus
either model seems reasonable.
9.6 Exercises
Exercise 9.6.1. Consider the statistical model with space } and densities f (y [ )
for . Suppose the function g : is one-to-one and onto, so that a
166 Chapter 9. Likelihood Methods
reparametrization of the model has densities f

(y [ ) for , where f

(y [ ) =
f (y[ g
1
()). (a) Show that

uniquely maximizes f (y [ ) over if and only if
g(

) uniquely maximizes f

(y [ ) over . [Hint: Show that f (y[

) > f (y [ )
for all
/
,=

implies f

(y [ ) > f

(y [ ) for all ,= , and vice versa.] (b) Argue


that if

is the MLE of , then g(

) is the MLE of .
Exercise 9.6.2. Again consider the statistical model with space } and densities f (y [ )
for , and suppose g : is just onto. Let g

be any function of such that


the joint function h() = (g(), g

()), h : , is one-to-one and onto, and set


the reparametrized density as f

(y [ ) = f (y[ h
1
()). Exercise 9.6.1 shows that if

uniquely maximizes f (y [ ) over , then



= h(

) uniquely maximizes f

(y [ )
over . Argue that if

is the MLE of , that it is legitimate to dene g(

) to be the
MLE of = g().
Exercise 9.6.3. Show that (9.5) holds. [What are Q
x
Q
x
and Q
x
x?]
Exercise 9.6.4. Show that if A (p p) and B (q q) are positive denite, and u is
p q, that
trace(Bu
/
Au) > 0 (9.102)
unless u = 0. [Hint: See (8.79).]
Exercise 9.6.5. From (9.18), (9.21), and (9.23), give (,
z
) as a function of (, ,
z,aab
,

z,bb
), and show that the latter set of parameters has space R
pl
R
(ql)l
o
l

o
ql
.
Exercise 9.6.6. Verify (9.32).
Exercise 9.6.7. Consider maximizing h() in (9.32) over o
q
. (a) Let =
/
be the spectral decomposition of , so that the diagonals of are the eigenvalues

1

2

q
0. (Recall Theorem 1.1.) Show that
1
[[
a/2
e

1
2
trace(
1
)
=
q

i=1
[
a/2
i
e

1
2

i
]. (9.103)
(b) Find

i
, the maximizer of
a/2
i
exp(
i
/2), for each i = 1, . . . , q. (c) Show that
these

i
s satisfy the conditions on the eigenvalues of . (d) Argue that then

=
(1/a)I
q
maximizes h().
Exercise 9.6.8. Suppose the null hypothesis in (9.40) holds, so that Y N
nq
(x
1

1
, I
n

R
). Exercise 5.6.37 shows that Q
x
1
Q
x
= P
x
21
, where x
21
= Q
x
1
x
2
. (a) Show that
P
x
21
x
1
= 0 and Q
x
P
x
21
= 0. [Hint: See Lemma 5.3.] (b) Show that Q
x
Y and P
x
21
Y
are independent, and nd their distributions. (c) Part (b) shows that W = Y
/
Q
x
Y
and B = Y
/
P
x
21
Y are independent. What are their distributions? (d) Verify the Wilks
distribution in (9.42) for [W[/[W+B[.
Exercise 9.6.9. Consider the multivariate regression model (9.2), where
R
is known.
(a) Use (9.5) to show that
l( ; y) l(

; y) =
1
2
trace(
1
R
(

)
/
x
/
x(

)). (9.104)
(b) Show that in this case, (9.60) is actually an equality, and give H, which is a function
of
R
and x
/
x.
9.6. Exercises 167
Exercise 9.6.10. Suppose V is n q, with mean zero and Cov[V] = AB, where A
is n n and B is q q. (a) Show that E[V
/
V] = trace(A)B. [Hint: Write E[V
/
V] =

E[V
/
i
V
i
], where the V
i
are the rows of V, then use (3.33).] (b) Use part (a) to prove
(9.79).
Exercise 9.6.11. (a) Show that for the model in (9.73), dim() = pq + q(q + 1)/2,
where is the joint space of and
R
. (b) Show that in (9.84), 2 dim() as
n .
Exercise 9.6.12 (Caffeine). This question continues the caffeine data in Exercises 4.4.4
and 6.5.6. Start with the both-sides model Y = xz
/
+ R, where as before the Y is
2 28, the rst column being the scores without caffeine, and the second being the
scores with caffeine. The x is a 28 3 ANOVA matrix for the three grades, with
orthogonal polynomials. The linear vector is (1
/
9
, 0
/
10
, 1
/
9
)
/
and the quadratic vector
is (1
/
9
, 1.81
/
10
, 1
/
9
)
/
. The z looks at the sum and difference of scores:
z =
_
1 1
1 1
_
. (9.105)
The goal of this problem is to use BIC to nd a good model, choosing among the
constant, linear and quadratic models for x, and the overall mean and overall
mean + difference models for the scores. Start by nding Y
(z)
= Y(z
/
)
1
. (a) For
each of the 6 models, nd the deviance (just the log-sigma parts), number of free
parameters, BIC, and estimated probability. (b) Which model has highest probability?
(c) What is the chance that the difference effect is in the model? (d) Find the MLE of
for the best model.
Exercise 9.6.13 (Leprosy, Part I). This question continues Exercises 4.4.6 and 5.6.41 on
the leprosy data. The model is
(Y
(b)
, Y
(a)
) = x +R =

1 1 1
1 1 1
1 2 0

1
10

b

a
0
a
0
a

+R, (9.106)
where
R N
_
0, I
n

_

bb

ba

ab

aa
__
. (9.107)
Because of the zeros in the , the MLE is not the usual one for multivariate regres-
sion. Instead, the problem has to be broken up into the conditional part (after
conditioning on before), and the marginal of the before measurements, as for the
both-sided model in Sections 9.2.2 and 9.5. The conditional is
Y
(a)
[ Y
(b)
= y
(b)
N

(x y
(b)
)

,
aab
I
n

, (9.108)
where

=
a

b
and =

ab

bb
. (9.109)
168 Chapter 9. Likelihood Methods
In this question, give the answers symbolically, not the actual numerical values. Those
come in the next exercise. (a) What is the marginal distribution of Y
(b)
? Write it as
a linear model, without any zeroes in the coefcient matrix. (Note that the design
matrix will not be the entire x.) (b) What are the MLEs for
b
and
bb
? (c) Give the
MLEs of
a
,
ab
, and
aa
in terms of the MLEs of

,
b
, ,
bb
and
aab
. (d) What is
the deviance of this model? How many free parameters (give the actual number) are
there? (e) Consider the model with
a
= 0. Is the MLE of
b
the same or different
than in the original model? What about the MLE of
aab
? Or of
bb
?
Exercise 9.6.14 (Leoprosy, Part II). Continue with the leprosy example from Part I,
Exercise 9.6.13. (a) For the original model in Part I, give the values of the MLEs of

a
,
a
,
aab
and
bb
. (Note that the MLE of
aab
will be different than the unbiased
estimate of 16.05.) (b) Now consider four models: The original model, the model
with
a
= 0, the model with
a
= 0, and the model with
a
=
a
= 0. For each, nd
the MLEs of
aab
,
bb
, the deviance (just using the log terms, not the nq), the number
of free parameters, the BIC, and the BIC-based estimate of the posterior probability
(in percent) of the model. Which model has the highest probability? (c) What is the
probability (in percent) that the drug vs. placebo effect is in the model? The Drug A
vs. Drug D effect?
Exercise 9.6.15 (Skulls). For the data on Egyptian skulls (Exercises 4.4.2, 6.5.5, and 7.7.12), consider the linear model over time, so that
$$x = \begin{pmatrix} 1 & -3\\ 1 & -1\\ 1 & 0\\ 1 & 1\\ 1 & 3\end{pmatrix}\otimes\mathbf{1}_{30}. \qquad (9.110)$$
Then the model is the multivariate regression model $Y \sim N(x\beta, I_n\otimes\Sigma_R)$, where Y is 150 × 4 and $\beta$ is 2 × 4. Thus we are contemplating a linear regression for each of the four measurement variables. The question here is which variables require a linear term. There are $2^4 = 16$ possible models, but we will look at just four: no linear terms are in ($\beta_{21} = \beta_{22} = \beta_{23} = \beta_{24} = 0$); just the first and third variables' (MaxBreadth and BasLength) linear terms are in ($\beta_{22} = \beta_{24} = 0$); the first three variables' linear terms are in ($\beta_{24} = 0$); or all four linear terms are in. For each of the four models, find the deviance, number of parameters, BIC, and estimated probabilities. Which model is best? How much better is it than the second best? [Note: To find the MLEs for the middle two models, you must condition on the variables without the linear terms. I.e., for the second model, find the MLEs conditionally for $(Y_1, Y_3)\,|\,(Y_2, Y_4) = (y_2, y_4)$, and also for $(Y_2, Y_4)$ marginally. Here, $Y_j$ is the $j$th column of Y.]
Exercise 9.6.16. (This is a discussion question, in that there is no exact answer. Your reasoning should be sound, though.) Suppose you are comparing a number of models using BIC, and the lowest BIC is $b_{min}$. How much larger than $b_{min}$ would a BIC have to be for you to consider the corresponding model ignorable? That is, what is the cutoff $c$ so that models with BIC > $b_{min} + c$ don't seem especially viable? Why?
Exercise 9.6.17. Often, in hypothesis testing, people misinterpret the p-value to be the probability that the null is true, given the data. We can approximately compare the two values using the ideas in this chapter. Consider two models, the null ($M_0$) and alternative ($M_A$), where the null is contained in the alternative. Let deviance$_0$ and deviance$_A$ be their deviances, and dim$_0$ and dim$_A$ be their dimensions, respectively. Supposing that the assumptions are reasonable, the p-value for testing the null is
$$\text{p-value} = P[\chi^2_\nu > \delta], \;\text{ where }\; \nu = \dim_A - \dim_0 \;\text{ and }\; \delta = \text{deviance}_0 - \text{deviance}_A.$$
(a) Give the BIC-based estimate of the probability of the null for a given $\delta$, $\nu$, and sample size n. (b) For each of various values of n and $\nu$ (e.g., n = 1, 5, 10, 25, 100, 1000 and $\nu$ = 1, 5, 10, 25), find the $\delta$ that gives a p-value of 5%, and find the corresponding estimate of the probability of the null. (c) Are the probabilities of the null close to 5%? What do you conclude?
Chapter 10

Models on Covariance Matrices

The models so far have been on the means of the variables. In this chapter, we look at some models for the covariance matrix. We start with testing the equality of covariance matrices, then move on to testing independence and conditional independence of sets of variables. Next is factor analysis, where the relationships among the variables are assumed to be determined by latent (unobserved) variables. Principal component analysis is sometimes thought of as a type of factor analysis, although it is more of a decomposition than actual factor analysis. See Section 13.1.5. We conclude with a particular class of structural models, called invariant normal models.

We will base our hypothesis tests on Wishart matrices (one, or several independent ones). In practice, these matrices will often arise from the residuals in linear models, especially the $Y'Q_xY$ as in (6.18). If $U \sim \text{Wishart}_q(\nu, \Sigma)$, where $\Sigma$ is invertible and $\nu \geq q$, then the likelihood is
$$L(\Sigma; U) = |\Sigma|^{-\nu/2}\, e^{-\frac{1}{2}\,\text{trace}(\Sigma^{-1}U)}. \qquad (10.1)$$
The likelihood follows from the density in (8.71). An alternative derivation is to note that by (8.54), $Z \sim N(0, I_\nu\otimes\Sigma)$ has likelihood $L^*(\Sigma; z) = L(\Sigma; z'z)$. Thus $z'z$ is a sufficient statistic, and there is a theorem that states that the likelihood for any X is the same as the likelihood for its sufficient statistic. Since $Z'Z$ has the same distribution as U, (10.1) is the likelihood for U.

Recall from (5.34) that $\mathcal{S}^+_q$ denotes the set of q × q positive definite symmetric matrices. Then Lemma 9.1 shows that the MLE of $\Sigma\in\mathcal{S}^+_q$ based on (10.1) is
$$\widehat{\Sigma} = \frac{U}{\nu}, \qquad (10.2)$$
and the maximum likelihood is
$$L(\widehat{\Sigma}; U) = \left|\frac{U}{\nu}\right|^{-\nu/2} e^{-\frac{1}{2}\nu q}. \qquad (10.3)$$
10.1 Testing equality of covariance matrices

We first suppose we have two groups, e.g., boys and girls, and wish to test whether their covariance matrices are equal. Let $U_1$ and $U_2$ be independent, with
$$U_i \sim \text{Wishart}_q(\nu_i, \Sigma_i),\quad i = 1, 2. \qquad (10.4)$$
The hypotheses are then
$$H_0: \Sigma_1 = \Sigma_2 \;\text{ versus }\; H_A: \Sigma_1 \neq \Sigma_2, \qquad (10.5)$$
where both $\Sigma_1$ and $\Sigma_2$ are in $\mathcal{S}^+_q$. (That is, we are not assuming any particular structure for the covariance matrices.) We need the likelihoods under the two hypotheses. Because the $U_i$'s are independent,
$$L(\Sigma_1, \Sigma_2; U_1, U_2) = |\Sigma_1|^{-\nu_1/2}\, e^{-\frac{1}{2}\text{trace}(\Sigma_1^{-1}U_1)}\; |\Sigma_2|^{-\nu_2/2}\, e^{-\frac{1}{2}\text{trace}(\Sigma_2^{-1}U_2)}, \qquad (10.6)$$
which, under the null hypothesis, becomes
$$L(\Sigma, \Sigma; U_1, U_2) = |\Sigma|^{-(\nu_1+\nu_2)/2}\, e^{-\frac{1}{2}\text{trace}(\Sigma^{-1}(U_1+U_2))}, \qquad (10.7)$$
where $\Sigma$ is the common value of $\Sigma_1$ and $\Sigma_2$. The MLE under the alternative hypothesis is found by maximizing (10.6), which results in two separate maximizations:
$$\text{Under } H_A:\quad \widehat{\Sigma}_{A1} = \frac{U_1}{\nu_1},\quad \widehat{\Sigma}_{A2} = \frac{U_2}{\nu_2}. \qquad (10.8)$$
Under the null, there is just one Wishart, $U_1 + U_2$, so that
$$\text{Under } H_0:\quad \widehat{\Sigma}_{01} = \widehat{\Sigma}_{02} = \frac{U_1 + U_2}{\nu_1 + \nu_2}. \qquad (10.9)$$
Thus
$$\sup_{H_A} L = \left|\frac{U_1}{\nu_1}\right|^{-\nu_1/2} e^{-\frac{1}{2}\nu_1 q}\, \left|\frac{U_2}{\nu_2}\right|^{-\nu_2/2} e^{-\frac{1}{2}\nu_2 q}, \qquad (10.10)$$
and
$$\sup_{H_0} L = \left|\frac{U_1+U_2}{\nu_1+\nu_2}\right|^{-(\nu_1+\nu_2)/2} e^{-\frac{1}{2}(\nu_1+\nu_2)q}. \qquad (10.11)$$
Taking the ratio, note that the parts in the exponent of the e cancel, hence
$$LR = \frac{\sup_{H_A} L}{\sup_{H_0} L} = \frac{|U_1/\nu_1|^{-\nu_1/2}\, |U_2/\nu_2|^{-\nu_2/2}}{|(U_1+U_2)/(\nu_1+\nu_2)|^{-(\nu_1+\nu_2)/2}}. \qquad (10.12)$$
And
$$2\log(LR) = (\nu_1+\nu_2)\log\left|\frac{U_1+U_2}{\nu_1+\nu_2}\right| - \nu_1\log\left|\frac{U_1}{\nu_1}\right| - \nu_2\log\left|\frac{U_2}{\nu_2}\right|. \qquad (10.13)$$
Under the null hypothesis, $2\log(LR)$ approaches a $\chi^2$ as in (9.37). To figure out the degrees of freedom, we have to find the number of free parameters under each hypothesis. A $\Sigma\in\mathcal{S}^+_q$, unrestricted, has q(q + 1)/2 free parameters, because of the symmetry. Under the alternative, there are two such sets of parameters. Thus,
$$\dim(H_0) = q(q+1)/2,\quad \dim(H_A) = q(q+1),\quad \dim(H_A) - \dim(H_0) = q(q+1)/2. \qquad (10.14)$$
Thus, under $H_0$,
$$2\log(LR) \longrightarrow \chi^2_{q(q+1)/2}. \qquad (10.15)$$
10.1.1 Example: Grades data

Using the grades data on n = 107 students in (4.10), we compare the covariance matrices of the men and women. There are 37 men and 70 women, so that the sample covariance matrices have degrees of freedom $\nu_1 = 37 - 1 = 36$ and $\nu_2 = 70 - 1 = 69$, respectively. Their estimates are:
$$\text{Men: }\;\frac{1}{\nu_1}U_1 = \begin{pmatrix} 166.33 & 205.41 & 106.24 & 51.69 & 62.20\\ 205.41 & 325.43 & 206.71 & 61.65 & 69.35\\ 106.24 & 206.71 & 816.44 & 41.33 & 41.85\\ 51.69 & 61.65 & 41.33 & 80.37 & 50.31\\ 62.20 & 69.35 & 41.85 & 50.31 & 97.08 \end{pmatrix}, \qquad (10.16)$$
and
$$\text{Women: }\;\frac{1}{\nu_2}U_2 = \begin{pmatrix} 121.76 & 113.31 & 58.33 & 40.79 & 40.91\\ 113.31 & 212.33 & 124.65 & 52.51 & 50.60\\ 58.33 & 124.65 & 373.84 & 56.29 & 74.49\\ 40.79 & 52.51 & 56.29 & 88.47 & 60.93\\ 40.91 & 50.60 & 74.49 & 60.93 & 112.88 \end{pmatrix}. \qquad (10.17)$$
These covariance matrices are clearly not equal, but are the differences significant? The pooled estimate, i.e., the common estimate under $H_0$, is
$$\frac{1}{\nu_1+\nu_2}(U_1+U_2) = \begin{pmatrix} 137.04 & 144.89 & 74.75 & 44.53 & 48.21\\ 144.89 & 251.11 & 152.79 & 55.64 & 57.03\\ 74.75 & 152.79 & 525.59 & 51.16 & 63.30\\ 44.53 & 55.64 & 51.16 & 85.69 & 57.29\\ 48.21 & 57.03 & 63.30 & 57.29 & 107.46 \end{pmatrix}. \qquad (10.18)$$
Then
$$2\log(LR) = (\nu_1+\nu_2)\log\left|\frac{U_1+U_2}{\nu_1+\nu_2}\right| - \nu_1\log\left|\frac{U_1}{\nu_1}\right| - \nu_2\log\left|\frac{U_2}{\nu_2}\right|$$
$$= 105\log(2.6090\times 10^{10}) - 36\log(2.9819\times 10^{10}) - 69\log(1.8149\times 10^{10}) = 20.2331. \qquad (10.19)$$
The degrees of freedom for the $\chi^2$ is $q(q+1)/2 = 5\cdot 6/2 = 15$. The p-value is 0.16, which shows that we have not found a significant difference between the covariance matrices.
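
The statistic (10.13) is simple to compute in R. Below is a minimal sketch, not from the text, assuming u1 and u2 hold the two Wishart (sum-of-squares-and-cross-products) matrices, with degrees of freedom nu1 and nu2; the function name and arguments are illustrative.

# Sketch: 2 log(LR) for testing equality of two covariance matrices, as in (10.13).
lr2 <- function(u1, u2, nu1, nu2) {
  nu <- nu1 + nu2
  stat <- nu * log(det((u1 + u2)/nu)) - nu1 * log(det(u1/nu1)) -
          nu2 * log(det(u2/nu2))
  q <- nrow(u1)
  df <- q * (q + 1)/2
  c(statistic = stat, df = df, p.value = 1 - pchisq(stat, df))
}
# e.g., lr2(u1, u2, 36, 69) once the men's and women's U matrices are formed.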
10.1.2 Testing the equality of several covariance matrices

It is not hard to extend the test to testing the equality of more than two covariance matrices. That is, we have $U_1, \ldots, U_m$, independent, $U_i \sim \text{Wishart}_q(\nu_i, \Sigma_i)$, and wish to test
$$H_0: \Sigma_1 = \cdots = \Sigma_m \;\text{ versus }\; H_A: \text{not}. \qquad (10.20)$$
Then
$$2\log(LR) = (\nu_1+\cdots+\nu_m)\log\left|\frac{U_1+\cdots+U_m}{\nu_1+\cdots+\nu_m}\right| - \nu_1\log\left|\frac{U_1}{\nu_1}\right| - \cdots - \nu_m\log\left|\frac{U_m}{\nu_m}\right|, \qquad (10.21)$$
and under the null,
$$2\log(LR) \longrightarrow \chi^2_{df},\quad df = (m-1)\,q(q+1)/2. \qquad (10.22)$$
This procedure is Bartlett's test for the equality of covariances.
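
A sketch of the general m-sample statistic (10.21) in R follows, assuming the Wishart matrices are collected in a list us with degrees of freedom in the vector nus; these object names are illustrative, not from the text.

# Sketch: Bartlett's test (10.21)-(10.22) for equality of m covariance matrices.
bartlett.cov <- function(us, nus) {
  upool <- Reduce(`+`, us)
  nu <- sum(nus)
  stat <- nu * log(det(upool/nu)) -
          sum(mapply(function(u, n) n * log(det(u/n)), us, nus))
  q <- nrow(us[[1]])
  df <- (length(us) - 1) * q * (q + 1)/2
  c(statistic = stat, df = df, p.value = 1 - pchisq(stat, df))
}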
10.2 Testing independence of two blocks of variables

In this section, we assume $U \sim \text{Wishart}_q(\nu, \Sigma)$, and partition the matrices:
$$U = \begin{pmatrix} U_{11} & U_{12}\\ U_{21} & U_{22}\end{pmatrix} \;\text{ and }\; \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}, \qquad (10.23)$$
where
$$U_{11} \text{ and } \Sigma_{11} \text{ are } q_1\times q_1, \text{ and } U_{22} \text{ and } \Sigma_{22} \text{ are } q_2\times q_2;\quad q = q_1 + q_2. \qquad (10.24)$$
Presuming the Wishart arises from multivariate normals, we wish to test whether the two blocks of variables are independent, which translates to testing
$$H_0: \Sigma_{12} = 0 \;\text{ versus }\; H_A: \Sigma_{12} \neq 0. \qquad (10.25)$$
Under the alternative, the likelihood is just the one in (10.1), hence
$$\sup_{H_A} L(\Sigma; U) = \left|\frac{U}{\nu}\right|^{-\nu/2} e^{-\frac{1}{2}\nu q}. \qquad (10.26)$$
Under the null, because $\Sigma$ is then block diagonal,
$$|\Sigma| = |\Sigma_{11}|\,|\Sigma_{22}| \;\text{ and }\; \text{trace}(\Sigma^{-1}U) = \text{trace}(\Sigma_{11}^{-1}U_{11}) + \text{trace}(\Sigma_{22}^{-1}U_{22}). \qquad (10.27)$$
Thus the likelihood under the null can be written
$$|\Sigma_{11}|^{-\nu/2}\, e^{-\frac{1}{2}\text{trace}(\Sigma_{11}^{-1}U_{11})}\; |\Sigma_{22}|^{-\nu/2}\, e^{-\frac{1}{2}\text{trace}(\Sigma_{22}^{-1}U_{22})}. \qquad (10.28)$$
The two factors can be maximized separately, so that
$$\sup_{H_0} L(\Sigma; U) = \left|\frac{U_{11}}{\nu}\right|^{-\nu/2} e^{-\frac{1}{2}\nu q_1}\, \left|\frac{U_{22}}{\nu}\right|^{-\nu/2} e^{-\frac{1}{2}\nu q_2}. \qquad (10.29)$$
Taking the ratio of (10.26) and (10.29), the parts in the exponent of the e again cancel, hence
$$2\log(LR) = \nu\,\big(\log(|U_{11}/\nu|) + \log(|U_{22}/\nu|) - \log(|U/\nu|)\big). \qquad (10.30)$$
(The $\nu$'s in the denominators of the determinants cancel, so they can be erased if desired.)

Section 13.3 considers canonical correlations, which are a way to summarize relationships between two sets of variables.
10.2.1 Example: Grades data

Continuing Example 10.1.1, we start with the pooled covariance matrix $\widehat{\Sigma} = (U_1 + U_2)/(\nu_1 + \nu_2)$, which has q = 5 and $\nu$ = 105. Here we test whether the first three variables (homework, labs, inclass) are independent of the last two (midterms, final), so that $q_1 = 3$ and $q_2 = 2$. Obviously, they should not be independent, but we will test it formally. Now
$$2\log(LR) = \nu\,\big(\log(|U_{11}/\nu|) + \log(|U_{22}/\nu|) - \log(|U/\nu|)\big) = 28.2299. \qquad (10.31)$$
Here the degrees of freedom in the $\chi^2$ are $q_1 q_2 = 6$, because that is the number of covariances we are setting to 0 in the null. Or you can count
$$\dim(H_A) = q(q+1)/2 = 15,\quad \dim(H_0) = q_1(q_1+1)/2 + q_2(q_2+1)/2 = 6 + 3 = 9, \qquad (10.32)$$
which has $\dim(H_A) - \dim(H_0) = 6$. In either case, the result is clearly significant (the p-value is less than 0.0001), hence indeed the two sets of scores are not independent.

Testing the independence of several blocks of variables is almost as easy. Consider the three variables homework, inclass, and midterms, which have covariance
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{13} & \Sigma_{14}\\ \Sigma_{31} & \Sigma_{33} & \Sigma_{34}\\ \Sigma_{41} & \Sigma_{43} & \Sigma_{44}\end{pmatrix}. \qquad (10.33)$$
We wish to test whether the three are mutually independent, so that
$$H_0: \Sigma_{13} = \Sigma_{14} = \Sigma_{34} = 0 \;\text{ versus }\; H_A: \text{not}. \qquad (10.34)$$
Under the alternative, the estimate of $\Sigma$ is just the usual one from (10.18), where we pick out the first, third, and fourth variables. Under the null, we have three independent variables, so $\widehat{\Sigma}_{ii} = U_{ii}/\nu$ is just the appropriate diagonal from $\widehat{\Sigma}$. Then the test statistic is
$$2\log(LR) = \nu\,\big(\log(|U_{11}/\nu|) + \log(|U_{33}/\nu|) + \log(|U_{44}/\nu|) - \log(|U^*/\nu|)\big) = 30.116, \qquad (10.35)$$
where $U^*$ contains just the variances and covariances of the variables 1, 3, and 4. We do not need the determinant notation for the $U_{ii}$'s, but leave it in for cases in which the three blocks of variables are not 1 × 1. The degrees of freedom for the $\chi^2$ is then 3, because we are setting three free parameters to 0 in the null. Clearly that is a significant result, i.e., these three variables are not mutually independent.
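
These block-independence statistics are easy to compute directly from the pooled matrix. Below is a minimal sketch, not from the text, assuming u is the sum-of-squares-and-cross-products matrix with nu degrees of freedom, and blocks is a list of column-index vectors defining the blocks; the names are illustrative.

# Sketch: 2 log(LR) for mutual independence of blocks, as in (10.30) and (10.35).
# The nu's in the determinants cancel, so the raw determinants suffice.
indep.blocks <- function(u, nu, blocks) {
  all.i <- unlist(blocks)
  stat <- nu * (sum(sapply(blocks, function(b) log(det(u[b, b, drop = FALSE])))) -
                log(det(u[all.i, all.i])))
  qs <- sapply(blocks, length)
  df <- (sum(qs)^2 - sum(qs^2))/2   # number of covariances set to zero
  c(statistic = stat, df = df, p.value = 1 - pchisq(stat, df))
}
# e.g., indep.blocks(u, 105, list(1:3, 4:5)) for (10.31), or
#       indep.blocks(u, 105, list(1, 3, 4)) for (10.35), once u is formed.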
10.2.2 Example: Testing conditional independence

Imagine that we have (at least) three blocks of variables, and wish to see whether the first two are conditionally independent given the third. The process is exactly the same as for testing independence, except that we use the conditional covariance matrix. That is, suppose
$$Y = (Y_1, Y_2, Y_3) \sim N(x\beta z', I_n\otimes\Sigma_R),\quad Y_i \text{ is } n\times q_i,\; q = q_1 + q_2 + q_3, \qquad (10.36)$$
where
$$\Sigma_R = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13}\\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23}\\ \Sigma_{31} & \Sigma_{32} & \Sigma_{33}\end{pmatrix}, \qquad (10.37)$$
so that $\Sigma_{ii}$ is $q_i\times q_i$. The null hypothesis is
$$H_0: Y_1 \text{ and } Y_2 \text{ are conditionally independent given } Y_3. \qquad (10.38)$$
The conditional covariance matrix is
$$\text{Cov}[(Y_1, Y_2)\,|\,Y_3 = y_3] = \begin{pmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix} - \begin{pmatrix} \Sigma_{13}\\ \Sigma_{23}\end{pmatrix}\Sigma_{33}^{-1}\begin{pmatrix} \Sigma_{31} & \Sigma_{32}\end{pmatrix}$$
$$= \begin{pmatrix} \Sigma_{11} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{31} & \Sigma_{12} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{32}\\ \Sigma_{21} - \Sigma_{23}\Sigma_{33}^{-1}\Sigma_{31} & \Sigma_{22} - \Sigma_{23}\Sigma_{33}^{-1}\Sigma_{32}\end{pmatrix} \equiv \begin{pmatrix} \Sigma_{11\cdot 3} & \Sigma_{12\cdot 3}\\ \Sigma_{21\cdot 3} & \Sigma_{22\cdot 3}\end{pmatrix}. \qquad (10.39)$$
Then the hypotheses are
$$H_0: \Sigma_{12\cdot 3} = 0 \;\text{ versus }\; H_A: \Sigma_{12\cdot 3} \neq 0. \qquad (10.40)$$
Letting $U/\nu$ be the usual estimator of $\Sigma_R$, where
$$U = Y'Q_xY \sim \text{Wishart}_{(q_1+q_2+q_3)}(\nu, \Sigma_R),\quad \nu = n - p, \qquad (10.41)$$
we know from Proposition 8.1 that the conditional covariance is also Wishart, but loses $q_3$ degrees of freedom:
$$U_{(1:2)(1:2)\cdot 3} \equiv \begin{pmatrix} U_{11\cdot 3} & U_{12\cdot 3}\\ U_{21\cdot 3} & U_{22\cdot 3}\end{pmatrix} \sim \text{Wishart}_{(q_1+q_2)}\left(\nu - q_3,\; \begin{pmatrix} \Sigma_{11\cdot 3} & \Sigma_{12\cdot 3}\\ \Sigma_{21\cdot 3} & \Sigma_{22\cdot 3}\end{pmatrix}\right). \qquad (10.42)$$
(The U is partitioned analogously to the $\Sigma_R$.) Then testing the hypothesis on $\Sigma_{12\cdot 3}$ here is the same as (10.30) but after dotting out 3:
$$2\log(LR) = (\nu-q_3)\,\big(\log(|U_{11\cdot 3}/(\nu-q_3)|) + \log(|U_{22\cdot 3}/(\nu-q_3)|) - \log(|U_{(1:2)(1:2)\cdot 3}/(\nu-q_3)|)\big), \qquad (10.43)$$
which is asymptotically $\chi^2_{q_1q_2}$ under the null.

An alternative (but equivalent) method for calculating the conditional covariance is to move the conditioning variables $Y_3$ to the x matrix, as we did for covariates. Thus, leaving out the z,
$$(Y_1, Y_2)\,|\,Y_3 = y_3 \sim N(x^*\beta^*, I_n\otimes\Sigma^*), \qquad (10.44)$$
where
$$x^* = (x, y_3) \;\text{ and }\; \Sigma^* = \begin{pmatrix} \Sigma_{11\cdot 3} & \Sigma_{12\cdot 3}\\ \Sigma_{21\cdot 3} & \Sigma_{22\cdot 3}\end{pmatrix}. \qquad (10.45)$$
Then
$$U_{(1:2)(1:2)\cdot 3} = (Y_1, Y_2)'Q_{x^*}(Y_1, Y_2). \qquad (10.46)$$
See Exercise 7.7.7.

We note that there appears to be an ambiguity in the denominators of the $U_{ii}$'s for the $2\log(LR)$. That is, if we base the likelihood on the original Y of (10.36), then the denominators will be n. If we use the original U in (10.41), the denominators will be n − p. And what we actually used, based on the conditional covariance matrix in (10.42), were $n - p - q_3$. All three possibilities are fine in that the asymptotics as n → ∞ are valid. We chose the one we did because it is the most focussed, i.e., there are no parameters involved (e.g., $\beta$) that are not directly related to the hypotheses.

Testing the independence of three or more blocks of variables, given another block, again uses the dotted-out Wishart matrix. For example, consider Example 10.2.1 with variables homework, inclass, and midterms, but test whether those three are conditionally independent given the other two variables, labs and final. The conditional U matrix is now denoted $U_{(1:3)(1:3)\cdot 4}$, and the degrees of freedom are $\nu - q_4 = 105 - 2 = 103$, so that the estimate of the conditional covariance matrix is
$$\begin{pmatrix} \widehat{\Sigma}_{11\cdot 4} & \widehat{\Sigma}_{12\cdot 4} & \widehat{\Sigma}_{13\cdot 4}\\ \widehat{\Sigma}_{21\cdot 4} & \widehat{\Sigma}_{22\cdot 4} & \widehat{\Sigma}_{23\cdot 4}\\ \widehat{\Sigma}_{31\cdot 4} & \widehat{\Sigma}_{32\cdot 4} & \widehat{\Sigma}_{33\cdot 4}\end{pmatrix} = \frac{1}{\nu-q_4}\,U_{(1:3)(1:3)\cdot 4} = \begin{pmatrix} 51.9536 & -18.3868 & 5.2905\\ -18.3868 & 432.1977 & 3.8627\\ 5.2905 & 3.8627 & 53.2762\end{pmatrix}. \qquad (10.47)$$
Then, to test
$$H_0: \Sigma_{12\cdot 4} = \Sigma_{13\cdot 4} = \Sigma_{23\cdot 4} = 0 \;\text{ versus }\; H_A: \text{not}, \qquad (10.48)$$
we use the statistic analogous to (10.35),
$$2\log(LR) = (\nu-q_4)\,\big(\log(|U_{11\cdot 4}/(\nu-q_4)|) + \log(|U_{22\cdot 4}/(\nu-q_4)|) + \log(|U_{33\cdot 4}/(\nu-q_4)|) - \log(|U_{(1:3)(1:3)\cdot 4}/(\nu-q_4)|)\big) = 2.76. \qquad (10.49)$$
The degrees of freedom for the $\chi^2$ is again 3, so we accept the null: There does not appear to be a significant relationship among these three variables given the labs and final scores. This implies, among other things, that once we know someone's labs and final scores, knowing the homework or inclass will not help in guessing the midterms score. We could also look at the sample correlations, unconditionally (from (10.18)) and conditionally:

Unconditional              HW    InClass  Midterms
HW                        1.00    0.28     0.41
InClass                   0.28    1.00     0.24
Midterms                  0.41    0.24     1.00

Conditional on Labs, Final  HW   InClass  Midterms     (10.50)
HW                        1.00   -0.12     0.10
InClass                  -0.12    1.00     0.03
Midterms                  0.10    0.03     1.00

Notice that the conditional correlations are much smaller than the unconditional ones, and the conditional correlation between homework and inclass scores is negative, though not significantly so. Thus it appears that the labs and final scores explain the relationships among the other variables.
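
The second formulation, moving the conditioning block into the design matrix, translates directly into R. Below is a minimal sketch, not from the text, assuming y1, y2, y3 are the three blocks of columns of Y and x is the original design matrix; the names are illustrative.

# Sketch: testing conditional independence of Y1 and Y2 given Y3, via (10.43)-(10.46).
cond.indep <- function(y1, y2, y3, x) {
  xstar <- cbind(x, y3)                           # move Y3 into the design matrix
  qx <- diag(nrow(xstar)) - xstar %*% solve(crossprod(xstar), t(xstar))
  y12 <- cbind(y1, y2)
  u <- t(y12) %*% qx %*% y12                      # U_{(1:2)(1:2).3}
  nu <- nrow(xstar) - ncol(xstar)                 # n - p - q3
  i1 <- 1:ncol(y1); i2 <- ncol(y1) + 1:ncol(y2)
  stat <- nu * (log(det(u[i1, i1, drop = FALSE]/nu)) +
                log(det(u[i2, i2, drop = FALSE]/nu)) - log(det(u/nu)))
  df <- ncol(y1) * ncol(y2)
  c(statistic = stat, df = df, p.value = 1 - pchisq(stat, df))
}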
10.3 Factor analysis

The example above suggested that the relationships among three variables could be explained by two other variables. The idea behind factor analysis is that the relationships (correlations, to be precise) of a set of variables can be explained by a number of other variables, called factors. The kicker here is that the factors are not observed. Spearman [1904] introduced the idea based on the idea of a general intelligence factor. This section gives the very basics of factor analysis. More details can be found in Lawley and Maxwell [1971], Harman [1976] and Basilevsky [1994], as well as many other books.

The model we consider sets Y to be the n × q matrix of observed variables, and X to be the n × p matrix of factor variables, which we do not observe. Assume
$$(X\;\; Y) \sim N\big(D(\alpha\;\;\gamma),\, I_n\otimes\Sigma\big),\quad \Sigma = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_{YY}\end{pmatrix}, \qquad (10.51)$$
where D is an n × k design matrix (e.g., to distinguish men from women), and $\alpha$ (k × p) and $\gamma$ (k × q) are the parameters for the means of X and Y, respectively. Factor analysis is not primarily concerned with the means (that is what the linear models are for), but with the covariances. The key assumption is that the variables in Y are conditionally independent given X, which means the conditional covariance is diagonal:
$$\text{Cov}[Y\,|\,X = x] = \Sigma_{YY\cdot X} = \Psi = \begin{pmatrix} \psi_{11} & 0 & \cdots & 0\\ 0 & \psi_{22} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \psi_{qq}\end{pmatrix}. \qquad (10.52)$$
Writing out the conditional covariance matrix, we have
$$\Psi = \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY} \;\Longrightarrow\; \Sigma_{YY} = \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY} + \Psi, \qquad (10.53)$$
so that marginally,
$$Y \sim N\big(D\gamma,\, I_n\otimes(\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY} + \Psi)\big). \qquad (10.54)$$
Because Y is all we observe, we cannot estimate $\Sigma_{XY}$ or $\Sigma_{XX}$ separately, but only the function $\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$. Note that if we replace X with $X^* = XA$ for some invertible p × p matrix A,
$$\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY} = \Sigma_{YX^*}\Sigma_{X^*X^*}^{-1}\Sigma_{X^*Y}, \qquad (10.55)$$
so that the distribution of Y is unchanged. See Exercise 10.5.5. Thus in order to estimate the parameters, we have to make some restrictions. Commonly it is assumed that $\Sigma_{XX} = I_p$, and the mean is zero:
$$X \sim N(0, I_n\otimes I_p). \qquad (10.56)$$
Then, letting $\beta = \Sigma_{XX}^{-1}\Sigma_{XY} = \Sigma_{XY}$,
$$Y \sim N\big(D\gamma,\, I_n\otimes(\beta'\beta + \Psi)\big). \qquad (10.57)$$
Or, we can write the model as
$$Y = D\gamma + X\beta + R,\quad X \sim N(0, I_n\otimes I_p),\quad R \sim N(0, I_n\otimes\Psi), \qquad (10.58)$$
where X and R are independent. The equation decomposes each variable (column) in Y into the fixed mean plus the part depending on the factors plus the parts unique to the individual variables. The element $\beta_{ij}$ is called the loading of factor i on the variable j. The variance $\psi_{jj}$ is the unique variance of variable j, i.e., the part not explained by the factors. Any measurement error is assumed to be part of the unique variance.

There is the statistical problem of estimating the model, meaning the $\beta'\beta$ and $\Psi$ (and $\gamma$, but we already know about that), and the interpretative problem of finding and defining the resulting factors. We will take these concerns up in the next two subsections.
10.3.1 Estimation

We estimate the $\gamma$ using least squares as usual, i.e.,
$$\widehat{\gamma} = (D'D)^{-1}D'Y. \qquad (10.59)$$
Then the residual sum of squares matrix is used to estimate the $\beta$ and $\Psi$:
$$U = Y'Q_DY \sim \text{Wishart}_q(\nu, \beta'\beta + \Psi),\quad \nu = n - k. \qquad (10.60)$$
The parameters are still not estimable, because for any p × p orthogonal matrix $\Gamma$, $(\Gamma\beta)'(\Gamma\beta)$ yields the same $\beta'\beta$. We can use the QR decomposition from Theorem 5.3. Our $\beta$ is p × q with p < q. Write $\beta = (\beta_1, \beta_2)$, where $\beta_1$ has the first p columns of $\beta$. We apply the QR decomposition to $\beta_1$, assuming the columns are linearly independent. Then $\beta_1 = QR$, where Q is orthogonal and R is upper triangular with positive diagonal elements. Thus we can write $Q'\beta_1 = R$, or
$$Q'\beta = Q'(\beta_1\;\;\beta_2) = (R\;\;R^*), \qquad (10.61)$$
where $R^*$ is some p × (q − p) matrix. E.g., with p = 3, q = 5,
$$Q'\beta = \begin{pmatrix} \beta_{11} & \beta_{12} & \beta_{13} & \beta_{14} & \beta_{15}\\ 0 & \beta_{22} & \beta_{23} & \beta_{24} & \beta_{25}\\ 0 & 0 & \beta_{33} & \beta_{34} & \beta_{35}\end{pmatrix}, \qquad (10.62)$$
where the $\beta_{ii}$'s are positive. If we require that $\beta$ satisfies constraints (10.62), then it is estimable. (Exercise 10.5.6.) Note that there are p(p − 1)/2 non-free parameters (since $\beta_{ij} = 0$ for i > j), which means the number of free parameters in the model is pq − p(p − 1)/2 for the $\beta$ part, and q for $\Psi$. Thus for the p-factor model $M_p$, the number of free parameters is
$$d_p \equiv \dim(M_p) = q(p+1) - \frac{p(p-1)}{2}. \qquad (10.63)$$
(We are ignoring the parameters in the $\gamma$, because they are the same for all the models we consider.) In order to have a hope of estimating the factors, the dimension of the factor model cannot exceed the dimension of the most general model, $\Sigma_{YY}\in\mathcal{S}^+_q$, which has q(q + 1)/2 parameters. Thus for identifiability we need
$$\frac{q(q+1)}{2} - d_p = \frac{(q-p)^2 - p - q}{2} \geq 0. \qquad (10.64)$$
E.g., if there are q = 10 variables, at most p = 6 factors can be estimated.
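
The bound (10.64) is easy to check numerically. A small sketch, not from the text, finding the largest p satisfying $(q-p)^2 \geq p + q$ for a given q:

# Sketch: largest number of estimable factors under the bound (10.64).
max.factors <- function(q) max(which(sapply(1:q, function(p) (q - p)^2 - p - q >= 0)))
max.factors(10)   # 6, as in the example above
max.factors(5)    # 2, used for the grades data below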
There are many methods for estimating $\beta$ and $\Psi$. As in (10.1), the maximum likelihood estimator maximizes
$$L(\beta, \Psi; U) = \frac{1}{|\beta'\beta + \Psi|^{\nu/2}}\; e^{-\frac{1}{2}\text{trace}((\beta'\beta+\Psi)^{-1}U)} \qquad (10.65)$$
over $\beta$ satisfying (10.62) and $\Psi$ being diagonal. There is not a closed form solution to the maximization, so it must be done numerically. There may be problems, too, such as having one or more of the $\psi_{jj}$'s being driven to 0. It is not obvious, but if $\widehat{\beta}$ and $\widehat{\Psi}$ are the MLEs, then the maximum of the likelihood is, similar to (10.3),
$$L(\widehat{\beta}, \widehat{\Psi}; U) = \frac{1}{|\widehat{\beta}'\widehat{\beta} + \widehat{\Psi}|^{\nu/2}}\; e^{-\frac{1}{2}\nu q}. \qquad (10.66)$$
See Section 9.4 of Mardia, Kent, and Bibby [1979].

Typically one is interested in finding the simplest model that fits. To test whether the p-factor model fits, we use the hypotheses
$$H_0: \Sigma_{YY} = \beta'\beta + \Psi,\; \beta \text{ is } p\times q \;\text{ versus }\; H_A: \Sigma_{YY}\in\mathcal{S}^+_q. \qquad (10.67)$$
The MLE for $H_A$ is $\widehat{\Sigma}_{YY} = U/\nu$, so that
$$LR = \left(\frac{|\widehat{\beta}'\widehat{\beta} + \widehat{\Psi}|}{|U/\nu|}\right)^{\nu/2}. \qquad (10.68)$$
Now
$$2\log(LR) = \nu\,\big(\log(|\widehat{\beta}'\widehat{\beta} + \widehat{\Psi}|) - \log(|U/\nu|)\big), \qquad (10.69)$$
which is asymptotically $\chi^2_{df}$ with df being the difference in (10.64). Bartlett suggests a slight adjustment to the factor $\nu$, similar to the Box approximation for Wilks' $\Lambda$, so that under the null,
$$2\log(LR)^* = \left(\nu - \frac{2q+5}{6} - \frac{2p}{3}\right)\big(\log(|\widehat{\beta}'\widehat{\beta} + \widehat{\Psi}|) - \log(|U/\nu|)\big) \longrightarrow \chi^2_{df}, \qquad (10.70)$$
where
$$df = \frac{(q-p)^2 - p - q}{2}. \qquad (10.71)$$
Alternatively, one can use AIC (9.50) or BIC (9.51) to assess $M_p$ for several p. Because q is the same for all models, we can take
$$\text{deviance}(M_p(\widehat{\beta}, \widehat{\Psi});\, y) = \nu\log(|\widehat{\beta}'\widehat{\beta} + \widehat{\Psi}|), \qquad (10.72)$$
so that
$$\text{BIC}(M_p) = \nu\log(|\widehat{\beta}'\widehat{\beta} + \widehat{\Psi}|) + \log(\nu)\,\big(q(p+1) - p(p-1)/2\big), \qquad (10.73)$$
$$\text{AIC}(M_p) = \nu\log(|\widehat{\beta}'\widehat{\beta} + \widehat{\Psi}|) + 2\,\big(q(p+1) - p(p-1)/2\big). \qquad (10.74)$$
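
These quantities can be computed from the output of R's factanal function, which fits the model by maximum likelihood on the correlation scale. A minimal sketch, not from the text, assuming S is the estimated covariance (or correlation) matrix with variable names, and nu its degrees of freedom; the function name is illustrative.

# Sketch: deviance, dimension, and BIC (10.72)-(10.73) for the p-factor model.
factor.bic <- function(S, nu, p) {
  q <- nrow(S)
  f <- factanal(covmat = S, factors = p, n.obs = nu + 1)
  fitted0 <- f$loadings %*% t(f$loadings) + diag(f$uniquenesses)  # fitted correlation
  dev <- nu * log(det(fitted0))
  d <- q * (p + 1) - p * (p - 1)/2
  c(deviance = dev, dim = d, BIC = dev + log(nu) * d)
}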
10.3.2 Describing the factors

Once you have decided on the number p of factors in the model and the estimate $\widehat{\beta}$, you have a choice of rotations. That is, since $\Gamma\widehat{\beta}$ for any p × p orthogonal matrix $\Gamma$ has exactly the same fit, you need to choose the $\Gamma$. There are a number of criteria. The varimax criterion tries to pick a rotation so that the loadings (the $\widehat{\beta}_{ij}$'s) are either large in magnitude, or close to 0. The hope is that it is then easy to interpret the factors by seeing which variables they load heavily upon. Formally, the varimax rotation is that which maximizes the sum of the variances of the squares of the elements in each column. That is, if F is the q × p matrix consisting of the squares of the elements in $(\Gamma\widehat{\beta})'$, then the varimax rotation is the $\Gamma$ that maximizes trace($F'H_qF$), $H_q$ being the centering matrix (1.12). There is nothing preventing you from trying as many $\Gamma$'s as you wish. It is an art to find a rotation and interpretation of the factors.
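
R's varimax function carries out this optimization, and factanal applies it to the loadings by default. A small sketch, where lambdahat is a placeholder for an estimated loadings matrix (variables in rows, factors in columns):

# Sketch: explicit varimax rotation of a loadings matrix.
v <- varimax(lambdahat)
v$loadings   # rotated loadings, equal to lambdahat %*% v$rotmat
v$rotmat     # the p x p orthogonal rotation matrix found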
The matrix X, which has the scores of the factors for the individuals, is unobserved, but can be estimated. The joint distribution is, from (10.51) with the assumptions (10.56),
$$(X\;\;Y) \sim N\big((0\;\;D\gamma),\, I_n\otimes\Sigma\big),\quad \Sigma = \begin{pmatrix} I_p & \beta\\ \beta' & \beta'\beta + \Psi\end{pmatrix}. \qquad (10.75)$$
Then given the observed Y:
$$X\,|\,Y = y \sim N(\mu + y\Theta,\, I_n\otimes\Sigma_{XX\cdot Y}), \qquad (10.76)$$
where
$$\Theta = (\beta'\beta + \Psi)^{-1}\beta' \;\text{ and }\; \mu = -D\gamma\Theta, \qquad (10.77)$$
and
$$\Sigma_{XX\cdot Y} = I_p - \beta(\beta'\beta + \Psi)^{-1}\beta'. \qquad (10.78)$$
An estimate of X is the estimate of E[X | Y = y]:
$$\widehat{X} = (y - D\widehat{\gamma})(\widehat{\beta}'\widehat{\beta} + \widehat{\Psi})^{-1}\widehat{\beta}'. \qquad (10.79)$$
10.3.3 Example: Grades data

Continue with the grades data in Section 10.2.2, where the D in
$$E(Y) = D\gamma \qquad (10.80)$$
is a 107 × 2 matrix that distinguishes men from women. The first step is to estimate $\Sigma_{YY}$:
$$\widehat{\Sigma}_{YY} = \frac{1}{\nu}\,Y'Q_DY, \qquad (10.81)$$
where here $\nu$ = 107 − 2 (since D has two columns), which is the pooled covariance matrix in (10.18).

We illustrate with the R program factanal. The input to the program can be a data matrix or a covariance matrix or a correlation matrix. In any case, the program will base its calculations on the correlation matrix. Unless D is just a column of 1's, you shouldn't give it Y, but S = Y'Q_DY/ν, where ν = n − k if D is n × k. You need to also specify how many factors you want, and the number of observations (actually, ν + 1 for us). We'll start with one factor. The sigmahat is the S, and covmat= indicates to R that you are giving it a covariance matrix. (Do the same if you are giving a correlation matrix.) In such cases, the program does not know what n or k is, so you should set the parameter n.obs. It assumes that D is $\mathbf{1}_n$, i.e., that k = 1, so to trick it into using another k, set n.obs to n − k + 1, which in our case is 106. Then the one-factor model is fit to the sigmahat in (10.18) using

f <- factanal(covmat=sigmahat,factors=1,n.obs=106)

The output includes the uniquenesses (diagonals of $\widehat{\Psi}$), f$uniquenesses, and the (transpose of the) loadings matrix, f$loadings. Here,

Diagonals of $\widehat{\Psi}$ (10.82):
  HW     Labs   InClass  Midterms  Final
  0.247  0.215  0.828    0.765     0.786

and

$\widehat{\beta}$ (10.83):
           HW     Labs   InClass  Midterms  Final
  Factor1  0.868  0.886  0.415    0.484     0.463

The given loadings and uniquenesses are based on the correlation matrix, so the fitted correlation matrix can be found using

corr0 <- f$loadings%*%t(f$loadings) + diag(f$uniquenesses)

The result is

One-factor model (10.84):
            HW    Labs  InClass  Midterms  Final
  HW        1.00  0.77  0.36     0.42      0.40
  Labs      0.77  1.00  0.37     0.43      0.41
  InClass   0.36  0.37  1.00     0.20      0.19
  Midterms  0.42  0.43  0.20     1.00      0.22
  Final     0.40  0.41  0.19     0.22      1.00

Compare that to the observed correlation matrix, which is in the matrix f$corr:

Unrestricted model (10.85):
            HW    Labs  InClass  Midterms  Final
  HW        1.00  0.78  0.28     0.41      0.40
  Labs      0.78  1.00  0.42     0.38      0.35
  InClass   0.28  0.42  1.00     0.24      0.27
  Midterms  0.41  0.38  0.24     1.00      0.60
  Final     0.40  0.35  0.27     0.60      1.00

The fitted correlations are reasonably close to the observed ones, except for the midterms/final correlation: The actual is 0.60, but the estimate from the one-factor model is only 0.22. It appears that this single factor is more focused on other correlations.

For a formal goodness-of-fit test, we have
$$H_0: \text{One-factor model} \;\text{ versus }\; H_A: \text{Unrestricted}. \qquad (10.86)$$
We can use either the correlation or covariance matrices, as long as we are consistent, and since factanal gives the correlation, we might as well use that. The MLE under $H_A$ is then corrA, the correlation matrix obtained from S, and under $H_0$ is corr0. Then
$$2\log(LR) = \nu\,\big(\log(|\widehat{\beta}'\widehat{\beta} + \widehat{\Psi}|) - \log(|S|)\big) \qquad (10.87)$$
is found in R using

105*log(det(corr0)/det(f$corr))

yielding the value 37.65. It is probably better to use Bartlett's refinement (10.70),

(105-(2*5+5)/6-2/3)*log(det(corr0)/det(f$corr))

which gives 36.51. This value can be found in f$STATISTIC, or by printing out f. The degrees of freedom for the statistic in (10.71) is $((q-p)^2 - p - q)/2 = 5$, since p = 1 and q = 5. Thus $H_0$ is rejected: The one-factor model does not fit.

Two factors

With small q, we have to be careful not to ask for too many factors. By (10.64), two is the maximum when q = 5. In R, we just need to set factors=2 in the factanal function. The $\chi^2$ for goodness-of-fit is 2.11, on one degree of freedom, hence the two-factor model fits fine. The estimated correlation matrix is now

Two-factor model (10.88):
            HW    Labs  InClass  Midterms  Final
  HW        1.00  0.78  0.35     0.40      0.40
  Labs      0.78  1.00  0.42     0.38      0.35
  InClass   0.35  0.42  1.00     0.24      0.25
  Midterms  0.40  0.38  0.24     1.00      0.60
  Final     0.40  0.35  0.25     0.60      1.00

which is quite close to the observed correlation matrix (10.85) above. Only the InClass/HW correlation is a bit off, but not by much.

The uniquenesses and loadings for this model are

Diagonals of $\widehat{\Psi}$ (10.89):
  HW    Labs  InClass  Midterms  Final
  0.36  0.01  0.80     0.48      0.30

and

$\widehat{\beta}$ (10.90):
            HW     Labs   InClass  Midterms  Final
  Factor 1  0.742  0.982  0.391    0.268     0.211
  Factor 2  0.299  0.173  0.208    0.672     0.807

The routine gives the loadings using the varimax criterion.

Looking at the uniquenesses, we notice that inclass's is quite large, which suggests that it has a factor unique to itself, e.g., being able to get to class. It has fairly low loadings on both factors. We see that the first factor loads highly on homework and labs, especially labs, and the second loads heavily on the exams, midterms and final. (These results are not surprising given the example in Section 10.2.2, where we see homework, inclass, and midterms are conditionally independent given labs and final.) So one could label the factors "Diligence" and "Test taking ability."

The exact same fit can be achieved by using other rotations $\Gamma\widehat{\beta}$, for a 2 × 2 orthogonal matrix $\Gamma$. Consider the rotation
$$\Gamma = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1\\ 1 & -1\end{pmatrix}. \qquad (10.91)$$
Then the loadings become

$\Gamma\widehat{\beta}$ (10.92):
             HW     Labs   InClass  Midterms  Final
  Factor* 1  0.736  0.817  0.424    0.665     0.720
  Factor* 2  0.314  0.572  0.129   -0.286    -0.421

Now Factor* 1 could be considered an overall ability factor, and Factor* 2 a contrast of HW+Lab and Midterms+Final.

Any rotation is fine; whichever you can interpret easiest is the one to take.

Using the BIC to select the number of factors

We have the three models: One-factor ($M_1$), two-factor ($M_2$), and unrestricted ($M_{Big}$). The deviances (10.72) are $\nu\log(|\widehat{\Sigma}|)$, where here we take the correlation form of the $\widehat{\Sigma}$'s. The relevant quantities are next:

Model    Deviance    d    BIC       BIC + 127.226   P_BIC     (10.93)
M_1      -156.994   10   -110.454   16.772          0
M_2      -192.382   14   -127.226    0              0.768
M_Big    -194.640   15   -124.831    2.395          0.232

The only difference between the two BIC columns is that the second one has 127.226 added to each element, making it easier to compare them. These results conform to what we had before. The one-factor model is untenable, and the two-factor model is fine, with 77% estimated probability. The full model has a decent probability as well.
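
The estimated probabilities are obtained by exponentiating minus half the BIC differences and normalizing. A minimal sketch, not from the text, where bics is assumed to be a vector of BIC values for the competing models:

# Sketch: BIC-based estimated posterior probabilities for a set of models.
bic.probs <- function(bics) {
  w <- exp(-(bics - min(bics))/2)
  round(w/sum(w), 3)
}
bic.probs(c(-110.454, -127.226, -124.831))   # roughly 0, 0.768, 0.232, as in (10.93)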
Estimating the score matrix X

The score matrix is estimated as in (10.79). You have to be careful, though, to use consistently the correlation form or covariance form. That is, if the $(\widehat{\beta}, \widehat{\Psi})$ is estimated from the correlation matrix, then the residuals $y - D\widehat{\gamma}$ must be rescaled so that the variances are 1. Or you can let R do it, by submitting the residuals and asking for the regression scores:

x <- cbind(1,grades[,1])
gammahat <- solve(t(x)%*%x,t(x)%*%grades[,2:6])
resids <- grades[,2:6] - x%*%gammahat
xhat <- factanal(resids,factors=2,scores="regression")$scores

The xhat is then 107 × 2:
$$\widehat{X} = \begin{pmatrix} -5.038 & 1.352\\ -4.083 & 0.479\\ -0.083 & 2.536\\ \vdots & \vdots\\ 0.472 & 1.765\end{pmatrix}. \qquad (10.94)$$
Now we can use the factor scores in scatter plots. For example, Figure 10.1 contains a scatter plot of the estimated factor scores for the two-factor model. They are by construction uncorrelated, but one can see how diligence has a much longer lower tail (lazy people?).

Figure 10.1: Plot of factor scores for two-factor model. (Horizontal axis: Diligence; vertical axis: Test taking ability.)

We also calculated box plots to compare the women's and men's distributions on the factors:

par(mfrow=c(1,2))
yl <- range(xhat) # To obtain the same y-scales
w <- (x[,2]==1) # Whether women (T) or not.
boxplot(list(Women=xhat[w,1],Men=xhat[!w,1]),main="Factor 1",ylim=yl)
boxplot(list(Women=xhat[w,2],Men=xhat[!w,2]),main="Factor 2",ylim=yl)

See Figure 10.2. There do not appear to be any large overall differences.
Figure 10.2: Box plots comparing the women and men on their factor scores (Diligence and Test taking ability).
10.4 Some symmetry models

Some structural models on covariance matrices, including testing independence, can be defined through group symmetries. The advantage of such models is that the likelihood estimates and tests are very easy to implement. The ones we present are called invariant normal models as defined in Andersson [1975]. We will be concerned with these models' restrictions on the covariance matrices. More generally, the models are defined on the means as well. Basically, the models are ones which specify certain linear constraints among the elements of the covariance matrix.

The model starts with
$$Y \sim N_{n\times q}(0, I_n\otimes\Sigma), \qquad (10.95)$$
and a q × q group $\mathcal{G}$ (see (5.58)), a subgroup of the group of q × q orthogonal matrices $\mathcal{O}_q$. The model demands that the distribution of Y be invariant under multiplication on the right by elements of $\mathcal{G}$, that is,
$$Yg \overset{\mathcal{D}}{=} Y \;\text{ for all } g\in\mathcal{G}. \qquad (10.96)$$
Now because Cov(Yg) = $I_n\otimes g'\Sigma g$, (10.95) and (10.96) imply that
$$\Sigma = g'\Sigma g \;\text{ for all } g\in\mathcal{G}. \qquad (10.97)$$
Thus we can define the mean zero invariant normal model based on $\mathcal{G}$ to be (10.95) with
$$\Sigma\in\mathcal{S}^+_q(\mathcal{G}) \equiv \{\Sigma\in\mathcal{S}^+_q\;|\;\Sigma = g'\Sigma g \text{ for all } g\in\mathcal{G}\}. \qquad (10.98)$$
A few examples are in order at this point. Typically, the groups are fairly simple groups.

10.4.1 Some types of symmetry

Independence and block independence

Partition the variables in Y so that
$$Y = (Y_1, \ldots, Y_K), \;\text{ where } Y_k \text{ is } n\times q_k, \qquad (10.99)$$
and
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \cdots & \Sigma_{1K}\\ \Sigma_{21} & \Sigma_{22} & \cdots & \Sigma_{2K}\\ \vdots & \vdots & \ddots & \vdots\\ \Sigma_{K1} & \Sigma_{K2} & \cdots & \Sigma_{KK}\end{pmatrix},\quad \Sigma_{kl} \text{ is } q_k\times q_l. \qquad (10.100)$$
Independence of one block of variables from the others entails setting the covariance to zero, that is, $\Sigma_{kl} = 0$ means $Y_k$ and $Y_l$ are independent. Invariant normal models can specify a block being independent of all the other blocks. For example, suppose K = 3. Then the model that $Y_1$ is independent of $(Y_2, Y_3)$ has
$$\Sigma = \begin{pmatrix} \Sigma_{11} & 0 & 0\\ 0 & \Sigma_{22} & \Sigma_{23}\\ 0 & \Sigma_{32} & \Sigma_{33}\end{pmatrix}. \qquad (10.101)$$
The group that gives rise to that model consists of two elements:
$$\mathcal{G} = \left\{\begin{pmatrix} I_{q_1} & 0 & 0\\ 0 & I_{q_2} & 0\\ 0 & 0 & I_{q_3}\end{pmatrix},\; \begin{pmatrix} -I_{q_1} & 0 & 0\\ 0 & I_{q_2} & 0\\ 0 & 0 & I_{q_3}\end{pmatrix}\right\}. \qquad (10.102)$$
(The first element is just $I_q$, of course.) It is easy to see that $\Sigma$ of (10.101) is invariant under $\mathcal{G}$ of (10.102). Lemma 10.1 below can be used to show any $\Sigma$ in $\mathcal{S}^+_q(\mathcal{G})$ is of the form (10.101).

If the three blocks $Y_1$, $Y_2$ and $Y_3$ are mutually independent, then $\Sigma$ is block diagonal,
$$\Sigma = \begin{pmatrix} \Sigma_{11} & 0 & 0\\ 0 & \Sigma_{22} & 0\\ 0 & 0 & \Sigma_{33}\end{pmatrix}, \qquad (10.103)$$
and the corresponding $\mathcal{G}$ consists of the eight matrices
$$\mathcal{G} = \left\{\begin{pmatrix} \pm I_{q_1} & 0 & 0\\ 0 & \pm I_{q_2} & 0\\ 0 & 0 & \pm I_{q_3}\end{pmatrix}\right\}. \qquad (10.104)$$
An extreme case is when all variables are mutually independent, so that $q_k$ = 1 for each k (and K = q), $\Sigma$ is diagonal, and $\mathcal{G}$ consists of all diagonal matrices with ±1's down the diagonal.

Intraclass correlation structure

The intraclass correlation structure arises when the variables are interchangeable. For example, the variables may be similar measurements (such as blood pressure) made several times, or scores on sections of an exam, where the sections are all measuring the same ability. In such cases, the covariance matrix would have equal variances, and equal covariances:
$$\Sigma = \sigma^2\begin{pmatrix} 1 & \rho & \cdots & \rho\\ \rho & 1 & \cdots & \rho\\ \vdots & \vdots & \ddots & \vdots\\ \rho & \rho & \cdots & 1\end{pmatrix}. \qquad (10.105)$$
The $\mathcal{G}$ for this model is the group of q × q permutation matrices $\mathcal{P}_q$. (A permutation matrix has exactly one 1 in each row and one 1 in each column, and zeroes elsewhere. If g is a permutation matrix and x is a q × 1 vector, then gx contains the same elements as x, but in a different order.)

Compound symmetry

Compound symmetry is an extension of intraclass symmetry, where there are groups of variables, and the variables within each group are interchangeable. Such models might arise, e.g., if students are given three interchangeable batteries of math questions, and two interchangeable batteries of verbal questions. The covariance matrix would then have the form
$$\Sigma = \begin{pmatrix} a & b & b & c & c\\ b & a & b & c & c\\ b & b & a & c & c\\ c & c & c & d & e\\ c & c & c & e & d\end{pmatrix}. \qquad (10.106)$$
In general, the group would consist of block diagonal matrices, with permutation matrices as the blocks. That is, with $\Sigma$ partitioned as in (10.100),
$$\mathcal{G} = \left\{\begin{pmatrix} G_1 & 0 & \cdots & 0\\ 0 & G_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & G_K\end{pmatrix}\;\middle|\; G_1\in\mathcal{P}_{q_1},\, G_2\in\mathcal{P}_{q_2},\, \ldots,\, G_K\in\mathcal{P}_{q_K}\right\}. \qquad (10.107)$$

IID, or spherical symmetry

Combining independence and intraclass correlation structure yields $\Sigma = \sigma^2 I_q$, so that the variables are independent and identically distributed. The group for this model is the set of permutation matrices augmented with ± signs on the 1's. (Recall Exercise 8.8.6.)

The largest group possible for these models is the group of q × q orthogonal matrices. When (10.96) holds for all orthogonal g, the distribution of Y is said to be spherically symmetric. It turns out that this choice also yields $\Sigma = \sigma^2 I_q$. This result is a reflection of the fact that iid and spherical symmetry are the same for the multivariate normal distribution. If Y has some other distribution, then the two models are distinct, although they still have the same covariance structure.
10.4.2 Characterizing the structure

It is not always obvious given a structure for the covariance matrix to find the corresponding $\mathcal{G}$, or even to decide whether there is a corresponding $\mathcal{G}$. But given the group, there is a straightforward method for finding the structure. We will consider just finite groups $\mathcal{G}$, but the idea extends to general groups, in which case we would need to introduce uniform (Haar) measure on these groups.

For given finite group $\mathcal{G}$ and general $\Sigma$, define the average of $\Sigma$ by
$$\overline{\Sigma} = \frac{\sum_{g\in\mathcal{G}} g'\Sigma g}{\#\mathcal{G}}. \qquad (10.108)$$
It should be clear that if $\Sigma\in\mathcal{S}^+_q(\mathcal{G})$, then $\overline{\Sigma} = \Sigma$. The next lemma shows that all averages are in $\mathcal{S}^+_q(\mathcal{G})$.

Lemma 10.1. For any $\Sigma\in\mathcal{S}^+_q$, $\overline{\Sigma}\in\mathcal{S}^+_q(\mathcal{G})$.

Proof. For any $h\in\mathcal{G}$,
$$h'\overline{\Sigma}h = \frac{\sum_{g\in\mathcal{G}} h'g'\Sigma gh}{\#\mathcal{G}} = \frac{\sum_{g^*\in\mathcal{G}} g^{*\prime}\Sigma g^*}{\#\mathcal{G}} = \overline{\Sigma}. \qquad (10.109)$$
The second equality follows by setting $g^* = gh$, and noting that as g runs over $\mathcal{G}$, so does $g^*$. (This is where the requirement that $\mathcal{G}$ is a group is needed.) But (10.109) implies that $\overline{\Sigma}\in\mathcal{S}^+_q(\mathcal{G})$.

The lemma shows that
$$\mathcal{S}^+_q(\mathcal{G}) = \{\overline{\Sigma}\;|\;\Sigma\in\mathcal{S}^+_q\}, \qquad (10.110)$$
so that one can discover the structure of covariance matrices invariant under a particular group by averaging a generic $\Sigma$. That is how one finds the structures in (10.101), (10.103), (10.106), and (10.107) from their respective groups.
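
The averaging operation (10.108) is simple to carry out numerically for a finite group stored as a list of matrices. A sketch, not from the text, with gs a list holding the elements of $\mathcal{G}$ (an illustrative name):

# Sketch: average a covariance matrix over a finite group, as in (10.108).
group.average <- function(sigma, gs) {
  Reduce(`+`, lapply(gs, function(g) t(g) %*% sigma %*% g))/length(gs)
}
# e.g., averaging a generic sigma over all q x q permutation matrices produces
# the intraclass correlation structure (10.105).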
10.4.3 Maximum likelihood estimates

The maximum likelihood estimate of $\Sigma$ in (10.98) is the $\widehat{\Sigma}\in\mathcal{S}^+_q(\mathcal{G})$ that maximizes
$$L(\Sigma; y) = \frac{1}{|\Sigma|^{n/2}}\; e^{-\frac{1}{2}\text{trace}(\Sigma^{-1}u)},\quad \Sigma\in\mathcal{S}^+_q(\mathcal{G}), \;\text{ where } u = y'y. \qquad (10.111)$$
The requirement $\Sigma\in\mathcal{S}^+_q(\mathcal{G})$ means that
$$\Sigma^{-1} = (g'\Sigma g)^{-1} = g'\Sigma^{-1}g \qquad (10.112)$$
for any $g\in\mathcal{G}$, that is, $\Sigma^{-1}\in\mathcal{S}^+_q(\mathcal{G})$, hence
$$\text{trace}(\Sigma^{-1}u) = \text{trace}\Big(\frac{\sum_{g\in\mathcal{G}} g'\Sigma^{-1}g}{\#\mathcal{G}}\,u\Big) = \text{trace}\Big(\frac{\sum_{g\in\mathcal{G}} g'\Sigma^{-1}gu}{\#\mathcal{G}}\Big) = \text{trace}\Big(\frac{\sum_{g\in\mathcal{G}} \Sigma^{-1}gug'}{\#\mathcal{G}}\Big) = \text{trace}\Big(\Sigma^{-1}\,\frac{\sum_{g\in\mathcal{G}} gug'}{\#\mathcal{G}}\Big) = \text{trace}(\Sigma^{-1}\overline{u}). \qquad (10.113)$$
Thus
$$L(\Sigma; y) = \frac{1}{|\Sigma|^{n/2}}\; e^{-\frac{1}{2}\text{trace}(\Sigma^{-1}\overline{u})}. \qquad (10.114)$$
We know from Lemma 9.1 that the maximizer of L in (10.114) over $\Sigma\in\mathcal{S}^+_q$ is
$$\widehat{\Sigma} = \frac{\overline{u}}{n}, \qquad (10.115)$$
but since that maximizer is in $\mathcal{S}^+_q(\mathcal{G})$ by Lemma 10.1, and $\mathcal{S}^+_q(\mathcal{G})\subset\mathcal{S}^+_q$, it must be the maximizer over $\mathcal{S}^+_q(\mathcal{G})$. That is, (10.115) is indeed the maximum likelihood estimate for (10.111).

To illustrate, let S = U/n. Then if $\mathcal{G}$ is as in (10.104), so that the model is that three sets of variables are independent (10.103), the maximum likelihood estimate is the sample analog
$$\widehat{\Sigma}(\mathcal{G}) = \begin{pmatrix} S_{11} & 0 & 0\\ 0 & S_{22} & 0\\ 0 & 0 & S_{33}\end{pmatrix}. \qquad (10.116)$$
In the intraclass correlation model (10.105), the group is the set of q × q permutation matrices, and the maximum likelihood estimate has the same form,
$$\widehat{\Sigma}(\mathcal{G}) = \widehat{\sigma}^2\begin{pmatrix} 1 & \widehat{\rho} & \cdots & \widehat{\rho}\\ \widehat{\rho} & 1 & \cdots & \widehat{\rho}\\ \vdots & \vdots & \ddots & \vdots\\ \widehat{\rho} & \widehat{\rho} & \cdots & 1\end{pmatrix}, \qquad (10.117)$$
where
$$\widehat{\sigma}^2 = \frac{1}{q}\sum_{i=1}^q s_{ii} \;\text{ and }\; \widehat{\sigma}^2\widehat{\rho} = \frac{\sum_{1\leq i<j\leq q} s_{ij}}{\binom{q}{2}}. \qquad (10.118)$$
That is, the common variance is the average of the original variances, and the common covariance is the average of the original covariances.
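
Equivalently, the estimates (10.117)-(10.118) can be computed directly from a sample covariance matrix S. A small sketch, not from the text:

# Sketch: MLE under the intraclass correlation structure, as in (10.117)-(10.118).
intraclass.mle <- function(S) {
  q <- nrow(S)
  sig2 <- mean(diag(S))                 # common variance: average of the variances
  sig2rho <- mean(S[upper.tri(S)])      # common covariance: average of the covariances
  (sig2 - sig2rho) * diag(q) + sig2rho * matrix(1, q, q)
}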
10.4.4 Hypothesis testing and model selection

The deviance for the model defined by the group $\mathcal{G}$ is, by (10.3),
$$\text{deviance}(M(\mathcal{G})) = n\log(|\widehat{\Sigma}(\mathcal{G})|), \qquad (10.119)$$
where we drop the exponential term since nq is the same for all models. We can then use this deviance in finding AICs or BICs for comparing such models, once we figure out the dimensions of the models, which is usually not too hard. E.g., if the model is that $\Sigma$ is unrestricted, so that $\mathcal{G}_A = \{I_q\}$, the trivial group, the dimension for $H_A$ is $\binom{q+1}{2}$. The dimension for the independence model in (10.103) and (10.104) sums the dimensions for the diagonal blocks: $\binom{q_1+1}{2} + \binom{q_2+1}{2} + \binom{q_3+1}{2}$. The dimension for the intraclass correlation model (10.105) is 2 (for the variance and covariance).

Also, the likelihood ratio statistic for testing two nested invariant normal models is easy to find. These testing problems use two nested groups, $\mathcal{G}_A\subset\mathcal{G}_0$, so that the hypotheses are
$$H_0: \Sigma\in\mathcal{S}^+_q(\mathcal{G}_0) \;\text{ versus }\; H_A: \Sigma\in\mathcal{S}^+_q(\mathcal{G}_A). \qquad (10.120)$$
Note that the larger $\mathcal{G}$, the smaller $\mathcal{S}^+_q(\mathcal{G})$, since fewer covariance matrices are invariant under a larger group. Then the likelihood ratio test statistic, 2 log(LR), is the difference of the deviances, as in (9.49).

The mean is not zero

So far this subsection assumed the mean is zero. In the more general case that $Y \sim N_{n\times q}(x\beta, I_n\otimes\Sigma)$, estimate $\Sigma$ restricted to $\mathcal{S}^+_q(\mathcal{G})$ by finding $U = Y'Q_xY$, then taking
$$\widehat{\Sigma} = \frac{U}{n} \;\text{ or }\; \frac{U}{n-p}, \qquad (10.121)$$
(where x is n × p), depending on whether you want the maximum likelihood estimate or an unbiased estimate. In testing, I would suggest taking the unbiased versions, then using
$$\text{deviance}(M(\mathcal{G})) = (n-p)\log(|\widehat{\Sigma}(\mathcal{G})|). \qquad (10.122)$$
10.4.5 Example: Mouth sizes

Continue from Section 7.3.1 with the mouth size data, using the model (7.40). Because the measurements within each subject are of the same mouth, a reasonable question to ask is whether the residuals within each subject are exchangeable, i.e., whether $\Sigma_R$ has the intraclass correlation structure (10.105). Let $U = Y'Q_xY$ and the unrestricted estimate be $\widehat{\Sigma}_A = U/\nu$ for $\nu = n - 2 = 25$. Then $\widehat{\Sigma}_A$ and the estimate under the intraclass correlation hypothesis $\widehat{\Sigma}_0$, given in (10.117) and (10.118), are
$$\widehat{\Sigma}_A = \begin{pmatrix} 5.415 & 2.717 & 3.910 & 2.710\\ 2.717 & 4.185 & 2.927 & 3.317\\ 3.910 & 2.927 & 6.456 & 4.131\\ 2.710 & 3.317 & 4.131 & 4.986\end{pmatrix} \qquad (10.123)$$
and
$$\widehat{\Sigma}_0 = \begin{pmatrix} 5.260 & 3.285 & 3.285 & 3.285\\ 3.285 & 5.260 & 3.285 & 3.285\\ 3.285 & 3.285 & 5.260 & 3.285\\ 3.285 & 3.285 & 3.285 & 5.260\end{pmatrix}. \qquad (10.124)$$
To test the null hypothesis that the intraclass correlation structure holds, versus the general model, we have from (10.119)
$$2\log(LR) = 25\,\big(\log(|\widehat{\Sigma}_0|) - \log(|\widehat{\Sigma}_A|)\big) = 9.374. \qquad (10.125)$$
The dimension for the general model is $d_A = q(q+1)/2 = 10$, and for the null is just $d_0 = 2$, thus the degrees of freedom for this statistic is $df = d_A - d_0 = 8$. The intraclass correlation structure appears to be plausible.

We can exploit this structure (10.105) on the $\Sigma_R$ to more easily test hypotheses about the $\beta$ in both-sides models like (7.40). First, we transform the matrix $\Sigma_R$ into a diagonal matrix with two distinct variances. Notice that we can write this covariance as
$$\Sigma_R = \sigma^2(1-\rho)I_q + \sigma^2\rho\,\mathbf{1}_q\mathbf{1}_q'. \qquad (10.126)$$
Let $\Gamma$ be any q × q orthogonal matrix whose first column is proportional to $\mathbf{1}_q$, i.e., $\mathbf{1}_q/\sqrt{q}$. Then
$$\Gamma'\Sigma_R\Gamma = \sigma^2(1-\rho)I_q + \sigma^2\rho\,\Gamma'\mathbf{1}_q\mathbf{1}_q'\Gamma = \sigma^2(1-\rho)I_q + \sigma^2\rho\begin{pmatrix}\sqrt{q}\\ 0\\ \vdots\\ 0\end{pmatrix}\begin{pmatrix}\sqrt{q} & 0 & \cdots & 0\end{pmatrix} = \sigma^2\begin{pmatrix} 1+(q-1)\rho & 0\\ 0 & (1-\rho)I_{q-1}\end{pmatrix} \equiv \Lambda. \qquad (10.127)$$
We used the fact that because all columns of $\Gamma$ except the first are orthogonal to $\mathbf{1}_q$, $\Gamma'\mathbf{1}_q = \sqrt{q}\,(1, 0, \ldots, 0)'$. As suggested by the notation, this $\Lambda$ is indeed the eigenvalue matrix for $\Sigma_R$, and $\Gamma$ contains a corresponding set of eigenvectors.

In the model (7.40), the z is almost an appropriate $\Gamma$:
$$z = \begin{pmatrix} 1 & -3 & 1 & -1\\ 1 & -1 & -1 & 3\\ 1 & 1 & -1 & -3\\ 1 & 3 & 1 & 1\end{pmatrix}. \qquad (10.128)$$
The columns are orthogonal, and the first is $\mathbf{1}_4$, so we just have to divide each column by its length to obtain orthonormal columns. The squared lengths of the columns are the diagonals of z'z: (4, 20, 4, 20). Let $\Delta$ be the square root of z'z,
$$\Delta = \begin{pmatrix} 2 & 0 & 0 & 0\\ 0 & \sqrt{20} & 0 & 0\\ 0 & 0 & 2 & 0\\ 0 & 0 & 0 & \sqrt{20}\end{pmatrix}, \qquad (10.129)$$
and set
$$\Gamma = z\Delta^{-1} \;\text{ and }\; \beta^* = \beta\Delta, \qquad (10.130)$$
so that the both-sides model can be written
$$Y = x\beta z' + R = x\beta^*\Gamma' + R. \qquad (10.131)$$
Multiplying everything on the right by $\Gamma$ yields
$$Y^* \equiv Y\Gamma = x\beta^* + R^*, \qquad (10.132)$$
where
$$R^* \equiv R\Gamma \sim N(0, I_n\otimes\Gamma'\Sigma_R\Gamma) = N(0, I_n\otimes\Lambda). \qquad (10.133)$$
This process is so far similar to that in Section 7.5.1. The estimate of $\beta^*$ is straightforward:
$$\widehat{\beta}^* = (x'x)^{-1}x'Y^* = \begin{pmatrix} 49.938 & 3.508 & 0.406 & 0.252\\ 4.642 & 1.363 & 0.429 & 0.323\end{pmatrix}. \qquad (10.134)$$
These estimates are the same as those for model (6.28), multiplied by $\Delta$ as in (10.130). The difference is in their covariance matrix:
$$\widehat{\beta}^* \sim N(\beta^*, C_x\otimes\Lambda). \qquad (10.135)$$
To estimate the standard errors of the estimates, we look at the sum of squares and cross products of the estimated residuals,
$$U^* = Y^{*\prime}Q_xY^* \sim \text{Wishart}_q(\nu, \Lambda), \qquad (10.136)$$
where $\nu = \text{trace}(Q_x) = n - p = 27 - 2 = 25$. Because the $\Lambda$ in (10.127) is diagonal, the diagonals of $U^*$ are independent scaled $\chi^2_\nu$'s:
$$U^*_{11} \sim \sigma_0^2\,\chi^2_\nu,\quad U^*_{jj} \sim \sigma_1^2\,\chi^2_\nu,\; j = 2, \ldots, q = 4, \qquad (10.137)$$
where $\sigma_0^2 = \sigma^2(1+(q-1)\rho)$ and $\sigma_1^2 = \sigma^2(1-\rho)$. Unbiased estimates are
$$\widehat{\sigma}_0^2 = \frac{U^*_{11}}{\nu} \sim \sigma_0^2\,\frac{\chi^2_\nu}{\nu} \;\text{ and }\; \widehat{\sigma}_1^2 = \frac{U^*_{22}+\cdots+U^*_{qq}}{(q-1)\nu} \sim \sigma_1^2\,\frac{\chi^2_{(q-1)\nu}}{(q-1)\nu}. \qquad (10.138)$$
For our data,
$$\widehat{\sigma}_0^2 = \frac{377.915}{25} = 15.117 \;\text{ and }\; \widehat{\sigma}_1^2 = \frac{59.167 + 26.041 + 62.919}{75} = 1.975. \qquad (10.139)$$
The estimated standard errors of the $\widehat{\beta}^*_{ij}$'s from (10.135) are found from
$$C_x\otimes\widehat{\Lambda} = \begin{pmatrix} 0.0625 & 0.0625\\ 0.0625 & 0.1534\end{pmatrix}\otimes\begin{pmatrix} 15.117 & 0 & 0 & 0\\ 0 & 1.975 & 0 & 0\\ 0 & 0 & 1.975 & 0\\ 0 & 0 & 0 & 1.975\end{pmatrix}, \qquad (10.140)$$
being the square roots of the diagonals:

Standard errors   Constant  Linear  Quadratic  Cubic     (10.141)
Boys              0.972     0.351   0.351      0.351
Girls-Boys        1.523     0.550   0.550      0.550

The t-statistics divide (10.134) by their standard errors:

t-statistics      Constant  Linear  Quadratic  Cubic     (10.142)
Boys              51.375    9.984   1.156      0.716
Girls-Boys        3.048     2.477   0.779      0.586

These statistics are not much different from what we found in Section 6.4.1, but the degrees of freedom for all but the first column are now 75, rather than 25. The main impact is in the significance of the difference between the girls' and boys' slopes. Previously, the p-value was 0.033 (the t = 2.26 on 25 df). Here, the p-value is 0.016, a bit stronger suggestion of a difference.
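
The intraclass test at the start of this example is short to carry out in R. Below is a minimal sketch, not from the text, assuming y is the 27 × 4 matrix of mouth sizes and x the 27 × 2 design matrix used in the earlier chapters; the object names are illustrative and the objects are not constructed here.

# Sketch: test of the intraclass correlation structure, as in (10.123)-(10.125).
qx <- diag(nrow(x)) - x %*% solve(crossprod(x), t(x))
nu <- nrow(x) - ncol(x)                      # 25
sigmaA <- t(y) %*% qx %*% y / nu             # unrestricted estimate (10.123)
q <- ncol(y)
sig2 <- mean(diag(sigmaA))
sig2rho <- mean(sigmaA[upper.tri(sigmaA)])
sigma0 <- (sig2 - sig2rho) * diag(q) + sig2rho * matrix(1, q, q)   # (10.124)
lr <- nu * (log(det(sigma0)) - log(det(sigmaA)))   # about 9.374
1 - pchisq(lr, q * (q + 1)/2 - 2)                  # d.f. = 8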
10.5 Exercises

Exercise 10.5.1. Verify the likelihood ratio statistic (10.21) for testing the equality of several covariance matrices as in (10.20).

Exercise 10.5.2. Verify that trace($\Sigma^{-1}U$) = trace($\Sigma_{11}^{-1}U_{11}$) + trace($\Sigma_{22}^{-1}U_{22}$), as in (10.27), for $\Sigma$ being block-diagonal, i.e., $\Sigma_{12} = 0$ in (10.23).

Exercise 10.5.3. Show that the value of 2 log(LR) of (10.31) does not change if the $\nu$'s in the denominators are erased.

Exercise 10.5.4. Suppose $U \sim \text{Wishart}_q(\nu, \Sigma)$, where $\Sigma$ is partitioned as
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \cdots & \Sigma_{1K}\\ \Sigma_{21} & \Sigma_{22} & \cdots & \Sigma_{2K}\\ \vdots & \vdots & \ddots & \vdots\\ \Sigma_{K1} & \Sigma_{K2} & \cdots & \Sigma_{KK}\end{pmatrix}, \qquad (10.143)$$
where $\Sigma_{ij}$ is $q_i\times q_j$, and the $q_i$'s sum to q. Consider testing the null hypothesis that the blocks are mutually independent, i.e.,
$$H_0: \Sigma_{ij} = 0 \;\text{ for } 1\leq i < j\leq K, \qquad (10.144)$$
versus the alternative that $\Sigma$ is unrestricted. (a) Find the 2 log(LR), and the degrees of freedom in the $\chi^2$ approximation. (The answer is analogous to that in (10.35).) (b) Let $U^* = AUA$ for some diagonal matrix A with positive diagonal elements. Replace the U in 2 log(LR) with $U^*$. Show that the value of the statistic remains the same. (c) Specialize to the case that $q_i$ = 1 for all i, so that we are testing the mutual independence of all the variables. Let C be the sample correlation matrix. Show that 2 log(LR) = $-\nu\log(|C|)$. [Hint: Find the appropriate A from part (b).]

Exercise 10.5.5. Show that (10.55) holds.

Exercise 10.5.6. Suppose $\beta$ and $\beta^*$ are both p × q, p < q, and let their decompositions from (10.61) and (10.62) be $\beta = Q\Lambda$ and $\beta^* = Q^*\Lambda^*$, where Q and $Q^*$ are orthogonal, $\Lambda_{ij} = \Lambda^*_{ij} = 0$ for i > j, and the $\Lambda_{ii}$'s and $\Lambda^*_{ii}$'s are positive. (We assume the first p columns of $\beta$, and of $\beta^*$, are linearly independent.) Show that $\beta'\beta = \beta^{*\prime}\beta^*$ if and only if $\Lambda = \Lambda^*$. [Hint: Use the uniqueness of the QR decomposition, in Theorem 5.3.]

Exercise 10.5.7. Show that the conditional parameters in (10.76) are as in (10.77) and (10.78).

Exercise 10.5.8. Show that if the factor analysis is fit using the correlation matrix, then the correlation between variable j and factor i is estimated to be $\widehat{\beta}_{ij}$, the loading of factor i on variable j.

Exercise 10.5.9. What is the factor analysis model with no factors (i.e., erase the $X\beta$ in (10.57))? Choose from the following: The covariance of Y is unrestricted; the mean of Y is 0; the Y variables are mutually independent; the covariance matrix of Y is a constant times the identity matrix.

Exercise 10.5.10. Show that if $\Sigma\in\mathcal{S}^+_q(\mathcal{G})$ of (10.98), then the $\overline{\Sigma}$ in (10.108) is in $\mathcal{S}^+_q(\mathcal{G})$.

Exercise 10.5.11. Verify the steps in (10.113).

Exercise 10.5.12. Show that if $\Sigma$ has intraclass correlation structure (10.105), then $\Sigma = \sigma^2(1-\rho)I_q + \sigma^2\rho\,\mathbf{1}_q\mathbf{1}_q'$ as in (10.126).

Exercise 10.5.13. Multivariate complex normals arise in spectral analysis of multiple time series. A q-dimensional complex normal is $Y_1 + i\,Y_2$, where $Y_1$ and $Y_2$ are 1 × q real normal vectors with joint covariance of the form
$$\Sigma = \text{Cov}\begin{pmatrix} Y_1\\ Y_2\end{pmatrix} = \begin{pmatrix} \Sigma_1 & F\\ -F & \Sigma_1\end{pmatrix}, \qquad (10.145)$$
i.e., Cov($Y_1$) = Cov($Y_2$). Here, i is the imaginary $i = \sqrt{-1}$. (a) Show that F = Cov($Y_1$, $Y_2$) is skew-symmetric, which means that $F' = -F$. (b) What is F when q = 1? (c) Show that the set of $\Sigma$'s as in (10.145) is the set $\mathcal{S}^+_{2q}(\mathcal{G})$ in (10.98) with
$$\mathcal{G} = \left\{I_{2q},\; \begin{pmatrix} 0 & I_q\\ -I_q & 0\end{pmatrix},\; \begin{pmatrix} 0 & -I_q\\ I_q & 0\end{pmatrix},\; -I_{2q}\right\}. \qquad (10.146)$$
Exercise 10.5.14 (Mouth sizes). For the boys' and girls' mouth size data in Table 4.1, let $\Sigma_B$ be the covariance matrix for the boys' mouth sizes, and $\Sigma_G$ be the covariance matrix for the girls' mouth sizes. Consider testing
$$H_0: \Sigma_B = \Sigma_G \;\text{ versus }\; H_A: \Sigma_B \neq \Sigma_G. \qquad (10.147)$$
(a) What are the degrees of freedom for the boys' and girls' sample covariance matrices? (b) Find $|\widehat{\Sigma}_B|$, $|\widehat{\Sigma}_G|$, and the pooled $|\widehat{\Sigma}|$. (Use the unbiased estimates of the $\Sigma_i$'s.) (c) Find 2 log(LR). What are the degrees of freedom for the $\chi^2$? What is the p-value? Do you reject the null hypothesis (if $\alpha$ = .05)? (d) Look at trace($\widehat{\Sigma}_B$) and trace($\widehat{\Sigma}_G$). Also, look at the correlation matrices for the girls and for the boys. What do you see?

Exercise 10.5.15 (Mouth sizes). Continue with the mouth size data from Exercise 10.5.14. (a) Test whether $\Sigma_B$ has the intraclass correlation structure (versus the general alternative). What are the degrees of freedom for the $\chi^2$? (b) Test whether $\Sigma_G$ has the intraclass correlation structure. (c) Now assume that both $\Sigma_B$ and $\Sigma_G$ have the intraclass correlation structure. Test whether the covariance matrices are equal. What are the degrees of freedom for this test? What is the p-value? Compare this p-value to that in Exercise 10.5.14, part (c). Why is it so much smaller (if it is)?

Exercise 10.5.16 (Grades). This problem considers the grades data. In what follows, use the pooled covariance matrix in (10.18), which has $\nu$ = 105. (a) Test the independence of the first three variables (homework, labs, inclass) from the fourth variable, the midterms score. (So leave out the final exams at this point.) Find $l_1$, $l_2$, $|\widehat{\Sigma}_{11}|$, $|\widehat{\Sigma}_{22}|$, and $|\widehat{\Sigma}|$. Also, find 2 log(LR) and the degrees of freedom for the $\chi^2$. Do you accept or reject the null hypothesis? (b) Now test the conditional independence of the set (homework, labs, inclass) from the midterms, conditioning on the final exam score. What is the $\nu$ for the estimated covariance matrix now? Find the new $l_1$, $l_2$, $|\widehat{\Sigma}_{11}|$, $|\widehat{\Sigma}_{22}|$, and $|\widehat{\Sigma}|$. Also, find 2 log(LR) and the degrees of freedom for the $\chi^2$. Do you accept or reject the null hypothesis? (c) Find the correlations between the homework, labs and inclass scores and the midterms scores, as well as the conditional correlations given the final exam. What do you notice?

Exercise 10.5.17 (Grades). The table in (10.93) has the BICs for the one-factor, two-factor, and unrestricted models for the grades data. Find the deviance, dimension, and BIC for the zero-factor model, $M_0$. (See Exercise 10.5.9.) Find the estimated probabilities of the four models. Compare the results to those without $M_0$.

Exercise 10.5.18 (Exams). The exams matrix has data on 191 statistics students, giving their scores (out of 100) on the three midterm exams, and the final exam. (a) What is the maximum number of factors that can be estimated? (b) Give the number of parameters in the covariance matrices for the 0, 1, 2, and 3 factor models (even if they are not estimable). (d) Plot the data. There are three obvious outliers. Which observations are they? What makes them outliers? For the remaining exercise, eliminate these outliers, so that there are n = 188 observations. (c) Test the null hypothesis that the four exams are mutually independent. What are the adjusted 2 log(LR)* (in (10.70)) and degrees of freedom for the $\chi^2$? What do you conclude? (d) Fit the one-factor model. What are the loadings? How do you interpret them? (e) Look at the residual matrix $C - (\widehat{\beta}'\widehat{\beta} + \widehat{\Psi})$, where C is the observed correlation matrix of the original variables. If the model fits exactly, what values would the off-diagonals of the residual matrix be? What is the largest off-diagonal in this observed matrix? Are the diagonals of this matrix the uniquenesses? (f) Does the one-factor model fit?

Exercise 10.5.19 (Exams). Continue with the exams data from Exercise 10.5.18. Again, do not use the outliers found in part (d). Consider the invariant normal model where the group $\mathcal{G}$ consists of 4 × 4 matrices of the form
$$G = \begin{pmatrix} G^* & \mathbf{0}_3\\ \mathbf{0}_3' & 1\end{pmatrix}, \qquad (10.148)$$
where $G^*$ is a 3 × 3 permutation matrix. Thus the model is an example of compound symmetry, from Section 10.4.1. The model assumes the three midterms are interchangeable. (a) Give the form of a covariance matrix which is invariant under that $\mathcal{G}$. (It should be like the upper-left 4 × 4 block of the matrix in (10.106).) How many free parameters are there? (b) For the exams data, give the MLE of the covariance matrix under the assumption that it is $\mathcal{G}$-invariant. (c) Test whether this symmetry assumption holds, versus the general model. What are the degrees of freedom? For which elements of $\Sigma$ is the null hypothesis least tenable? (d) Assuming $\Sigma$ is $\mathcal{G}$-invariant, test whether the first three variables are independent of the last. (That is, the null hypothesis is that $\Sigma$ is $\mathcal{G}$-invariant and $\Sigma_{14} = \Sigma_{24} = \Sigma_{34} = 0$, while the alternative is that $\Sigma$ is $\mathcal{G}$-invariant, but otherwise unrestricted.) What are the degrees of freedom for this test? What do you conclude?

Exercise 10.5.20 (South Africa heart disease). The data for this question come from a study of heart disease in adult males in South Africa from Rousseauw et al. [1983]. (We return to these data in Section 11.8.) The R data frame is SAheart, found in the ElemStatLearn package [Halvorsen, 2009]. The main variable of interest is chd, congestive heart disease, where 1 indicates the person has the disease, 0 he does not. Explanatory variables include sbc (measurements on blood pressure), tobacco use, ldl (bad cholesterol), adiposity (fat %), family history of heart disease (absent or present), type A personality, obesity, alcohol usage, and age. Here you are to find common factors among the explanatory variables excluding age and family history. Take logs of the variables sbc, ldl, and obesity, and cube roots of alcohol and tobacco, so that the data look more normal. Age is used as a covariate. Thus Y is n × 7, and D = ($\mathbf{1}_n$, $\mathbf{x}_{age}$). Here, n = 462. (a) What is there about the tobacco and alcohol variables that is distinctly non-normal? (b) Find the sample correlation matrix of the residuals from the Y = D$\gamma$ + R model. Which pairs of variables have correlations over 0.25, and what are their correlations? How would you group these variables? (c) What is the largest number of factors that can be fit for this Y? (d) Give the BIC-based probabilities of the p-factor models for p = 0 to the maximum found in part (c), and for the unrestricted model. Which model has the highest probability? Does this model fit, according to the $\chi^2$ goodness of fit test? (e) For the most probable model from part (d), which variables' loadings are highest (over 0.25) for each factor? (Use the varimax rotation for the loadings.) Give relevant names to the two factors. Compare the factors to what you found in part (b). (f) Keeping the same model, find the estimated factor scores. For each factor, find the two-sample t-statistic for comparing the people with heart disease to those without. (The statistics are not actually distributed as Student's t, but do give some measure of the difference.) (g) Based on the statistics in part (f), do any of the factors seem to be important factors in predicting heart disease in these men? If so, which one(s)? If not, what are the factors explaining?

Exercise 10.5.21 (Decathlon). Exercise 1.9.20 created a biplot for the decathlon data. The data consist of the scores (number of points) on each of ten events for the top 24 men in the decathlon at the 2008 Olympics. For convenience, rearrange the variables so that the running events come first, then the jumping, then throwing (ignoring the overall total):

y <- decathlon[,c(1,5,10,6,3,9,7,2,4,8)]

Fit the 1, 2, and 3 factor models. (The chi-squared approximations for the fit might not be very relevant, because the sample size is too small.) Based on the loadings, can you give an interpretation of the factors? Based on the uniquenesses, which events seem to be least correlated with the others?
Chapter 11

Classification

Multivariate analysis of variance seeks to determine whether there are differences among several groups, and what those differences are. Classification is a related area in which one uses observations whose group memberships are known in order to classify new observations whose group memberships are not known. This goal was the basic idea behind the gathering of the Fisher/Anderson iris data (Section 1.3.1). Based on only the petal and sepal measurements of a new iris, can one effectively classify it into one of the three species setosa, virginica and versicolor? See Figure 1.4 for an illustration of the challenge.

The task is prediction, as in Section 7.6, except that rather than predicting a continuous variable Y, we predict a categorical variable. We will concentrate mainly on linear methods arising from Fisher's methods, logistic regression, and trees. There is a vast array of additional approaches, including using neural networks, support vector machines, boosting, bagging, and a number of other flashy-sounding techniques. Related to classification is clustering (Chapter 12), in which one assumes that there are groups in the population, but which groups the observations reside in is unobserved, analogous to the factors in factor analysis. In the machine learning community, classification is supervised learning, because we know the groups and have some data on group membership, and clustering is unsupervised learning, because group membership itself must be estimated. See the book by Hastie, Tibshirani, and Friedman [2009] for a fine statistical treatment of machine learning.

The basic model is a mixture model, presented in the next section.
11.1 Mixture models
The mixture model we consider assumes there are K groups, numbered from 1 to K, and p predictor variables on which to base the classifications. The data then consist of n observations, each a 1 × (p + 1) vector,
\[
(X_i, Y_i),\quad i = 1, \ldots, n, \tag{11.1}
\]
where X_i is the 1 × p vector of predictors for observation i, and Y_i is the group number of observation i, so that Y_i ∈ {1, . . . , K}. Marginally, the proportion of the population in group k is
\[
P[Y = k] = \pi_k,\quad k = 1, \ldots, K. \tag{11.2}
\]
Figure 11.1: Three densities, plus a mixture of the three (the thick line).
Each group then has a conditional distribution P_k:
\[
X_i \mid Y_i = k \;\sim\; P_k, \tag{11.3}
\]
where the P_k will (almost always) depend on some unknown parameters. Assuming that P_k has density f_k(x), the joint density of (X_i, Y_i) is a mixed one as in (2.11), with
\[
f(x_i, y_i) = f_{y_i}(x_i)\,\pi_{y_i}. \tag{11.4}
\]
The marginal pdf of X_i is found by summing the joint density over the groups:
\[
f(x_i) = \pi_1 f_1(x_i) + \cdots + \pi_K f_K(x_i). \tag{11.5}
\]
For example, suppose that K = 3, π_1 = π_2 = π_3 = 1/3, and the three groups are, conditionally,
\[
X \mid Y = 1 \sim N(5, 1),\quad X \mid Y = 2 \sim N(5, 2^2),\quad X \mid Y = 3 \sim N(10, 1). \tag{11.6}
\]
Figure 11.1 exhibits the three pdfs, plus the mixture pdf, which is the thick black line.
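For readers who want to reproduce a plot like Figure 11.1, the following sketch (not from the text; the object names are arbitrary) evaluates the three densities in (11.6) and their equal-weight mixture:

x <- seq(0, 15, length = 500)
f1 <- dnorm(x, 5, 1)        # group 1: N(5, 1)
f2 <- dnorm(x, 5, 2)        # group 2: N(5, 2^2)
f3 <- dnorm(x, 10, 1)       # group 3: N(10, 1)
mix <- (f1 + f2 + f3)/3     # the mixture (11.5) with pi_k = 1/3
plot(x, mix, type='l', lwd=3, ylab='pdf')    # thick line: the mixture
lines(x, f1, lty=1); lines(x, f2, lty=2); lines(x, f3, lty=4)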
The data for classification include the group index, so that the joint distributions of the (X_i, Y_i) are operative, meaning we can estimate the individual densities. The overall density for the data is then
\[
\prod_{i=1}^n f(x_i, y_i) = \prod_{i=1}^n f_{y_i}(x_i)\,\pi_{y_i}
= \left[\pi_1^{N_1}\!\prod_{i \mid y_i = 1}\! f_1(x_i)\right] \cdots \left[\pi_K^{N_K}\!\prod_{i \mid y_i = K}\! f_K(x_i)\right]
= \left[\prod_{k=1}^K \pi_k^{N_k}\right] \prod_{k=1}^K \prod_{i \mid y_i = k} f_k(x_i), \tag{11.7}
\]
where N_k is the number of observations in group k:
\[
N_k = \#\{y_i = k\}. \tag{11.8}
\]
The classification task arises when a new observation, X^New, arrives without its group identification Y^New, so its density is that of the mixture. We have to guess what its group is.

In clustering, the data themselves are without group identification, so we have just the marginal distributions of the X_i. Thus the joint pdf for the data is
\[
\prod_{i=1}^n f(x_i) = \prod_{i=1}^n \big(\pi_1 f_1(x_i) + \cdots + \pi_K f_K(x_i)\big). \tag{11.9}
\]
Thus clustering is similar to classifying new observations, but without having any previous y data to help estimate the π_k's and f_k's. See Section 12.3.
11.2 Classifiers
A classifier is a function 𝒞 that takes the new observation and emits a guess at its group:
\[
\mathcal{C} : \mathcal{X} \longrightarrow \{1, \ldots, K\}, \tag{11.10}
\]
where 𝒳 is the space of X^New. The classifier may depend on previous data, as well as on the π_k's and f_k's, but not on Y^New. A good classifier is one that is unlikely to make a wrong classification. Thus a reasonable criterion for a classifier is the probability of an error:
\[
P[\mathcal{C}(X^{New}) \ne Y^{New}]. \tag{11.11}
\]
We would like to minimize that probability. (This criterion assumes that any type of misclassification is equally bad. If that is an untenable assumption, then one can use a weighted probability:
\[
\sum_{k=1}^K \sum_{l=1}^K w_{kl}\, P[\mathcal{C}(X^{New}) = k \text{ and } Y^{New} = l], \tag{11.12}
\]
where w_{kk} = 0.)
Under the (unrealistic) assumption that we know the π_k's and f_k's, the best guess of Y^New given X^New is the group that has the highest conditional probability.

Lemma 11.1. Define the Bayes classifier by
\[
\mathcal{C}_B(x) = k \quad \text{if} \quad P[Y = k \mid X = x] > P[Y = l \mid X = x] \ \text{for } l \ne k. \tag{11.13}
\]
Then 𝒞_B minimizes (11.11) over classifiers 𝒞.
Proof. Let I be the indicator function, so that
\[
I[\mathcal{C}(X^{New}) \ne Y^{New}] = \begin{cases} 1 & \text{if } \mathcal{C}(X^{New}) \ne Y^{New}\\ 0 & \text{if } \mathcal{C}(X^{New}) = Y^{New}\end{cases} \tag{11.14}
\]
and
\[
P[\mathcal{C}(X^{New}) \ne Y^{New}] = E\big[\,I[\mathcal{C}(X^{New}) \ne Y^{New}]\,\big]. \tag{11.15}
\]
As in (2.34), we have that
\[
E\big[\,I[\mathcal{C}(X^{New}) \ne Y^{New}]\,\big] = E[e_I(X^{New})], \tag{11.16}
\]
where
\[
e_I(x^{New}) = E\big[\,I[\mathcal{C}(X^{New}) \ne Y^{New}] \mid X^{New} = x^{New}\,\big]
= P[\mathcal{C}(x^{New}) \ne Y^{New} \mid X^{New} = x^{New}]
= 1 - P[\mathcal{C}(x^{New}) = Y^{New} \mid X^{New} = x^{New}]. \tag{11.17}
\]
Thus if we minimize the last expression in (11.17) for each x^New, we have minimized the expected value in (11.16). Minimizing (11.17) is the same as maximizing
\[
P[\mathcal{C}(x^{New}) = Y^{New} \mid X^{New} = x^{New}], \tag{11.18}
\]
but that conditional probability can be written
\[
\sum_{l=1}^K I[\mathcal{C}(x^{New}) = l]\; P[Y^{New} = l \mid X^{New} = x^{New}]. \tag{11.19}
\]
This sum equals P[Y^New = l | X^New = x^New] for whichever l the classifier 𝒞 chooses, so to maximize the sum, choose the l with the highest conditional probability, as in (11.13).
Now the conditional distribution of Y^New given X^New is obtained from (11.4) and (11.5) (it is Bayes theorem, Theorem 2.2):
\[
P[Y^{New} = k \mid X^{New} = x^{New}] = \frac{f_k(x^{New})\,\pi_k}{f_1(x^{New})\,\pi_1 + \cdots + f_K(x^{New})\,\pi_K}. \tag{11.20}
\]
Since, given x^New, the denominator is the same for each k, we just have to choose the k to maximize the numerator:
\[
\mathcal{C}_B(x) = k \quad\text{if}\quad f_k(x^{New})\,\pi_k > f_l(x^{New})\,\pi_l \ \text{for } l \ne k. \tag{11.21}
\]
We are assuming there is a unique maximum, which typically happens in practice with continuous variables. If there is a tie, any of the top categories will yield the optimum.
Consider the example in (11.6). Because the π_k's are equal, it is sufficient to look at the conditional pdfs. A given x is then classified into the group with the highest density, as given in Figure 11.2. Thus the classifications are
\[
\mathcal{C}_B(x) = \begin{cases} 1 & \text{if } 3.640 < x < 6.360\\ 2 & \text{if } x < 3.640 \text{ or } 6.360 < x < 8.067 \text{ or } x > 15.267\\ 3 & \text{if } 8.067 < x < 15.267\end{cases} \tag{11.22}
\]
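The cutoffs in (11.22) are the points where neighboring densities in (11.6) cross. As a quick numerical check (a sketch, not the book's code; the function names are my own), one can locate a crossing with uniroot, and classify an x by its largest density as in (11.21):

d12 <- function(x) dnorm(x, 5, 1) - dnorm(x, 5, 2)   # difference of densities 1 and 2
uniroot(d12, c(6, 8))$root      # about 6.360; by symmetry the other root is 3.640
# Bayes classifier with equal priors: pick the group with the largest density
cb <- function(x) which.max(c(dnorm(x, 5, 1), dnorm(x, 5, 2), dnorm(x, 10, 1)))
cb(7)    # 2
cb(12)   # 3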
In practice, the π_k's and f_k's are not known, but they can be estimated from the data. Consider the joint density of the data as in (11.7). The π_k's appear in only the first term. They can be estimated easily (as in a multinomial situation) by
\[
\hat{\pi}_k = \frac{N_k}{n}. \tag{11.23}
\]
Figure 11.2: Three densities, and the regions in which each is the highest. The densities are 1: N(5,1), solid line; 2: N(5,4), dashed line; 3: N(10,1), dashed/dotted line. Density 2 is also the highest for x > 15.267.
The parameters for the f_k's can be estimated using the x_i's that are associated with group k. These estimates are then plugged into the Bayes formula to obtain an approximate Bayes classifier. The next section shows what happens in the multivariate normal case.
11.3 Fisher's linear discrimination
Suppose that the individual f_k's are multivariate normal densities with different means but the same covariance, so that
\[
X_i \mid Y_i = k \;\sim\; N_{1\times p}(\mu_k, \Sigma). \tag{11.24}
\]
The pdfs (8.48) are then
\[
f_k(x \mid \mu_k, \Sigma) = c\,\frac{1}{|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_k)\Sigma^{-1}(x - \mu_k)'}
= c\,\frac{1}{|\Sigma|^{1/2}}\, e^{-\frac{1}{2}\, x \Sigma^{-1} x'}\, e^{x \Sigma^{-1}\mu_k' - \frac{1}{2}\mu_k \Sigma^{-1}\mu_k'}. \tag{11.25}
\]
We can ignore the factors that are the same for each group, i.e., those that do not depend on k, because we are in quest of the highest π_k f_k(x). Thus for a given x, we choose the k to maximize
\[
\pi_k\, e^{x \Sigma^{-1}\mu_k' - \frac{1}{2}\mu_k \Sigma^{-1}\mu_k'}, \tag{11.26}
\]
or, by taking logs, the k that maximizes
\[
d_k^*(x) \equiv x \Sigma^{-1}\mu_k' - \tfrac{1}{2}\,\mu_k \Sigma^{-1}\mu_k' + \log(\pi_k). \tag{11.27}
\]
These d_k^*'s are called the discriminant functions. Note that in this case they are linear in x, hence linear discriminant functions. It is often convenient to target one group (say the K-th) as a benchmark, and then use the functions
\[
d_k(x) = d_k^*(x) - d_K^*(x), \tag{11.28}
\]
so that the final function, d_K(x), is identically 0.
We still must estimate the parameters, but that is straightforward: take the \(\hat{\pi}_k = N_k/n\) as in (11.23), estimate the μ_k's by the obvious sample means:
\[
\hat{\mu}_k = \frac{1}{N_k}\sum_{i \mid y_i = k} x_i, \tag{11.29}
\]
and estimate Σ by the MLE, i.e., because we are assuming the covariances are equal, the pooled covariance:
\[
\hat{\Sigma} = \frac{1}{n}\sum_{k=1}^K \sum_{i \mid y_i = k} (x_i - \hat{\mu}_k)'(x_i - \hat{\mu}_k). \tag{11.30}
\]
(The numerator equals X'QX for Q being the projection matrix for the design matrix indicating which groups the observations are from. We could divide by n − K to obtain the unbiased estimator, but the classifications would still be essentially the same, and exactly the same if the \(\hat{\pi}_k\)'s are equal.) Thus the estimated discriminant functions are
\[
\hat{d}_k(x) = \hat{c}_k + x\,\hat{a}_k', \tag{11.31}
\]
where
\[
\hat{a}_k = (\hat{\mu}_k - \hat{\mu}_K)\hat{\Sigma}^{-1}
\quad\text{and}\quad
\hat{c}_k = -\tfrac{1}{2}\big(\hat{\mu}_k \hat{\Sigma}^{-1}\hat{\mu}_k' - \hat{\mu}_K \hat{\Sigma}^{-1}\hat{\mu}_K'\big) + \log(\hat{\pi}_k/\hat{\pi}_K). \tag{11.32}
\]
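The examples below offload these calculations to the book's function lda (Section A.3.1), which is not reproduced here. Purely as a sketch of what is involved, (11.29)–(11.32) could be computed directly along the following lines (the function and object names are my own, not the book's):

fisher.ld <- function(x, y) {
  K <- max(y); n <- nrow(x); p <- ncol(x)
  mu <- t(sapply(1:K, function(k) colMeans(x[y==k, , drop=FALSE])))   # (11.29)
  Nk <- tabulate(y, K)
  S <- matrix(0, p, p)
  for(k in 1:K)   # pooled MLE covariance, (11.30)
    S <- S + crossprod(scale(x[y==k, , drop=FALSE], center=mu[k,], scale=FALSE))
  Sinv <- solve(S/n)
  a <- t(sapply(1:K, function(k) (mu[k,] - mu[K,]) %*% Sinv))         # slopes in (11.32)
  cc <- sapply(1:K, function(k)                                       # intercepts in (11.32)
    -(mu[k,] %*% Sinv %*% mu[k,] - mu[K,] %*% Sinv %*% mu[K,])/2 + log(Nk[k]/Nk[K]))
  list(a = a, c = cc)
}

Here the rows of a are the \(\hat{a}_k\)'s (the last row being zero) and c holds the \(\hat{c}_k\)'s, with the K-th group as the benchmark.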
Now we can define the classifier based upon Fisher's linear discrimination function to be
\[
\widehat{\mathcal{C}}_{FLD}(x) = k \quad\text{if}\quad \hat{d}_k(x) > \hat{d}_l(x)\ \text{for } l \ne k. \tag{11.33}
\]
(The hat is there to emphasize the fact that the classifier is estimated from the data.) If p = 2, each set {x | \(\hat{d}_k(x) = \hat{d}_l(x)\)} defines a line in the x-space. These lines divide the space into a number of polygonal regions (some infinite). Each region has the same \(\widehat{\mathcal{C}}(x)\). Similarly, for general p, the regions are bounded by hyperplanes. Figure 11.3 illustrates for the iris data when using just the sepal length and width. The solid line is the line for which the discriminant functions for setosas and versicolors are equal. It is basically perfect for these data. The dashed line tries to separate setosas and virginicas. There is one misclassification. The dashed/dotted line tries to separate the versicolors and virginicas. It is not particularly successful. See Section 11.4.1 for a better result using all the variables.
Figure 11.3: Fisher's linear discrimination for the iris data using just sepal length and width. The solid line separates setosa (s) and versicolor (v); the dashed line separates setosa and virginica (g); and the dashed/dotted line separates versicolor and virginica.
Remark

Fisher's original derivation of the classifier (11.33), in Fisher [1936], did not start with the multivariate normal density. Rather, in the case of two groups, he obtained the 1 × p vector a that maximized the ratio of the squared difference between the two groups' means of the variable X_i a′ to its variance:
\[
\frac{\big((\mu_1 - \mu_2)\,a'\big)^2}{a\,\Sigma\,a'}. \tag{11.34}
\]
The optimal a is (anything proportional to)
\[
a = (\mu_1 - \mu_2)\,\Sigma^{-1}, \tag{11.35}
\]
which is the a_1 in (11.32). Even though our motivation leading to (11.27) is different from Fisher's, because we end up with his coefficients we will refer to (11.31) as Fisher's.
11.4 Cross-validation estimate of error
In classification, an error occurs if an observation is misclassified, so one often uses the criterion (11.11) to assess the efficacy of a classifier. This criterion depends on the distribution of the X as well as the Y, and needs to be estimated. To relate the error to the data at hand, we take the criterion to be the probability of error given the observed X_i's (c.f. the prediction error in (7.88)),
\[
ClassError = \frac{1}{n}\sum_{i=1}^n P[\widehat{\mathcal{C}}(X_i^{New}) \ne Y_i^{New} \mid X_i^{New} = x_i], \tag{11.36}
\]
where the (X_i^{New}, Y_i^{New}) are independent, and independent of the data, but with the same distribution as the data. So the criterion measures how well the classifier would work on a new data set with the same predictors x_i.
How does one estimate the error? The obvious approach is to try the classifier on the data, and count the number of misclassifications:
\[
ClassError_{Obs} = \frac{1}{n}\sum_{i=1}^n I[\widehat{\mathcal{C}}(x_i) \ne y_i]. \tag{11.37}
\]
As in prediction, this error will be an underestimate, because we are using the same data to estimate the classifier and to test it. A common approach to a fairer estimate is to initially set aside a random fraction of the observations (e.g., 10% to 25%) to be the test data, and use the remaining, so-called training data, to estimate the classifier. Then this estimated classifier is tested on the test data.
Cross-validation is a method that takes the idea one step further, by repeatedly separating the data into test and training data. Leave-one-out cross-validation uses single observations as the test data. It starts by setting aside the first observation, (x_1, y_1), and calculating the classifier using the data (x_2, y_2), . . . , (x_n, y_n). (That is, we find the sample means, covariances, etc., leaving out the first observation.) Call the resulting classifier \(\widehat{\mathcal{C}}^{(1)}\). Then determine whether this classifier classifies the first observation correctly:
\[
I[\widehat{\mathcal{C}}^{(1)}(x_1) \ne y_1]. \tag{11.38}
\]
The \(\widehat{\mathcal{C}}^{(1)}\) and Y_1 are independent, so the quantity in (11.38) is almost an unbiased estimate (conditionally on X_1 = x_1) of the error
\[
P[\widehat{\mathcal{C}}(X_1^{New}) \ne Y_1^{New} \mid X_1^{New} = x_1]; \tag{11.39}
\]
the only reason it is not exactly unbiased is that \(\widehat{\mathcal{C}}^{(1)}\) is based on n − 1 observations, rather than the n for \(\widehat{\mathcal{C}}\). This difference should be negligible.
Repeat the process, leaving out each observation in turn, so that \(\widehat{\mathcal{C}}^{(i)}\) is the classifier calculated without observation i. Then the almost unbiased estimate of ClassError in (11.36) is
\[
\widehat{ClassError}_{LOOCV} = \frac{1}{n}\sum_{i=1}^n I[\widehat{\mathcal{C}}^{(i)}(x_i) \ne y_i]. \tag{11.40}
\]
If n is large, and calculating the classifier is computationally challenging, then leave-one-out cross-validation can use up too much computer time (especially if one is trying a number of different classifiers). Also, the estimate, though nearly unbiased, might have a high variance. An alternative is to leave out more than one observation each time. E.g., 10% cross-validation would break the data set into 10 sets of size n/10, and for each set, use the other 90% to classify that set's observations. This approach is much more computationally efficient, and less variable, but does introduce more bias. Kshirsagar [1972] contains a number of other suggestions for estimating the classification error.
11.4.1 Example: Iris data
Turn again to the iris data. Figure 1.3 has the scatter plot matrix. Also see Figures 1.4 and 11.3. In R, the iris data is in the data frame iris. You may have to load the datasets package. The first four columns constitute the n × p matrix of x_i's, n = 150, p = 4. The fifth column has the species, 50 each of setosa, versicolor, and virginica. The basic variables are then

x.iris <- as.matrix(iris[,1:4])
y.iris <- rep(1:3,c(50,50,50)) # gets group vector (1,...,1,2,...,2,3,...,3)

We will offload many of the calculations to the function lda in Section A.3.1. The following statement calculates the \(\hat{a}_k\) and \(\hat{c}_k\) in (11.32):

ld.iris <- lda(x.iris,y.iris)
The \(\hat{a}_k\) are in the matrix ld.iris$a and the \(\hat{c}_k\) are in the vector ld.iris$c, given below:

  k                      a_k                       c_k
  1 (Setosa)       11.325  20.309  29.793  39.263  18.428
  2 (Versicolor)    3.319   3.456   7.709  14.944  32.159
  3 (Virginica)     0       0       0       0       0
                                                         (11.41)
Note that the final coefficients are zero, because of the way we normalize the functions in (11.28).

To see how well the classifier works on the data, we first have to calculate the \(\hat{d}_k(x_i)\). The following places these values in an n × K matrix disc:

disc <- x.iris%*%ld.iris$a
disc <- sweep(disc,2,ld.iris$c,'+')
The rows corresponding to the first observation from each species are

    i       k = 1       k = 2     k = 3
    1      97.703      47.400       0
   51     -32.305       9.296       0
  101    -120.122     -19.142       0
                                        (11.42)
The classifier (11.33) classifies each observation into the group corresponding to the column with the largest entry. Applied to the observations in (11.42), we have
\[
\widehat{\mathcal{C}}_{FLD}(x_1) = 1,\quad \widehat{\mathcal{C}}_{FLD}(x_{51}) = 2,\quad \widehat{\mathcal{C}}_{FLD}(x_{101}) = 3, \tag{11.43}
\]
that is, each of these observations is correctly classified into its group. To find the \(\widehat{\mathcal{C}}_{FLD}\)'s for all the observations, use

imax <- function(z) ((1:length(z))[z==max(z)])[1]
yhat <- apply(disc,1,imax)

where imax is a little function that gives the index of the largest value in a vector. To see how close the predictions are to the observed, use the table command:

table(yhat,y.iris)
which yields

         y.iris
  yhat    1    2    3
     1   50    0    0
     2    0   48    1
     3    0    2   49
                        (11.44)

Thus three observations were misclassified: two versicolors were classified as virginica, and one virginica was classified as versicolor. Not too bad. The observed misclassification rate is
\[
ClassError_{Obs} = \frac{\#\{\widehat{\mathcal{C}}_{FLD}(x_i) \ne y_i\}}{n} = \frac{3}{150} = 0.02. \tag{11.45}
\]
As noted above in Section 11.4, this value is likely to be optimistic (an underestimate) of ClassError in (11.36), because it uses the same data to find the classifier and to test it. We will find the leave-one-out cross-validation estimate (11.40) using the code below, where we set varin <- 1:4 to specify using all four variables.

yhat.cv <- NULL
n <- nrow(x.iris)
for(i in 1:n) {
  dcv <- lda(x.iris[-i,varin],y.iris[-i])
  dxi <- x.iris[i,varin]%*%dcv$a+dcv$c
  yhat.cv <- c(yhat.cv,imax(dxi))
}
sum(yhat.cv!=y.iris)/n
Here, for each i, we calculate the classifier without observation i, then apply it to that left-out observation i, collecting the predictions in the vector yhat.cv. We then count how many observations were misclassified. In this case, \(\widehat{ClassError}_{LOOCV} = 0.02\), just the same as the observed classification error. In fact, the same three observations were misclassified.
Subset selection

The above classifications used all four iris variables. We now see whether we can obtain equally good or better results using a subset of the variables. We use the same loop as above, setting varin to the vector of indices for the variables to be included. For example, varin <- c(1,3) will use just variables 1 and 3, sepal length and petal length. Below is a table giving the observed error and leave-one-out cross-validation error (in percent) for the 15 models, depending on which variables are included in the classification.
Figure 11.4: Boxplots of the petal widths for the three species of iris. The solid line separates the setosas from the versicolors, and the dashed line separates the versicolors from the virginicas.
Classification errors (in percent)

  Variables      Observed   Cross-validation
  1                25.3          25.3
  2                44.7          48.0
  3                 5.3           6.7
  4                 4.0           4.0
  1, 2             20.0          20.7
  1, 3              3.3           4.0
  1, 4              4.0           4.7
  2, 3              4.7           4.7
  2, 4              3.3           4.0
  3, 4              4.0           4.0
  1, 2, 3           3.3           4.0
  1, 2, 4           4.0           5.3
  1, 3, 4           2.7           2.7
  2, 3, 4           2.0           4.0
  1, 2, 3, 4        2.0           2.0
                                        (11.46)
Note that the cross-validation error estimates are either the same as, or a bit larger than, the observed error rates. The best classifier uses all four variables, with an estimated 2% error. Note, though, that variable 4 (petal width) alone has only a 4% error rate. Also, adding variable 1 to variable 4 actually worsens the prediction a little, showing that the extra variation is not worth it. Looking at just the observed error, the prediction stays the same.
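The observed-error column of (11.46) can be generated with a loop over the subsets, along the lines of the sketch below (it reuses the book's lda and the imax function from above; the object names are my own). The cross-validation column wraps the same idea inside the leave-one-out loop shown earlier.

subsets <- list(1, 2, 3, 4, c(1,2), c(1,3), c(1,4), c(2,3), c(2,4), c(3,4),
                c(1,2,3), c(1,2,4), c(1,3,4), c(2,3,4), 1:4)
obs.err <- sapply(subsets, function(varin) {
  ld <- lda(x.iris[, varin, drop=FALSE], y.iris)
  disc <- sweep(x.iris[, varin, drop=FALSE] %*% ld$a, 2, ld$c, '+')
  mean(apply(disc, 1, imax) != y.iris)
})
round(100*obs.err, 1)    # compare with the "Observed" column in (11.46)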
Figure 11.4 shows the classifications using just petal width. Because the sample sizes are equal, and the variances are assumed equal, the separating lines between two species are just the averages of their means. We did not plot the line for setosa versus virginica. There are six misclassifications: two versicolors and four virginicas. (Two of the latter had the same petal width, 1.5.)
11.5 Fisher's quadratic discrimination
When the equality of the covariance matrices is not tenable, we can use a slightly more complicated procedure. Here the conditional probabilities are proportional to
\[
\pi_k f_k(x \mid \mu_k, \Sigma_k) = c\,\pi_k \frac{1}{|\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_k)\Sigma_k^{-1}(x-\mu_k)'}
= c\, e^{-\frac{1}{2}(x-\mu_k)\Sigma_k^{-1}(x-\mu_k)' - \frac{1}{2}\log(|\Sigma_k|) + \log(\pi_k)}. \tag{11.47}
\]
Then the discriminant functions can be taken to be the terms in the exponents (times 2, for convenience), or rather their estimates:
\[
\hat{d}^Q_k(x) = -(x - \hat{\mu}_k)\widehat{\Sigma}_k^{-1}(x - \hat{\mu}_k)' + \hat{c}_k, \tag{11.48}
\]
where
\[
\hat{c}_k = -\log(|\widehat{\Sigma}_k|) + 2\log(N_k/n), \tag{11.49}
\]
and \(\widehat{\Sigma}_k\) is the sample covariance matrix from the k-th group. Now the boundaries between regions are quadratic rather than linear, hence Fisher's quadratic discrimination function is defined to be
\[
\widehat{\mathcal{C}}_{FQD}(x) = k \quad\text{if}\quad \hat{d}^Q_k(x) > \hat{d}^Q_l(x)\ \text{for } l \ne k. \tag{11.50}
\]
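The next example uses the book's functions qda and predict.qda (Sections A.3.2 and A.3.3), which are not shown here. As a rough sketch of what evaluating (11.48) involves (the names below are my own; mu is a K × p matrix of group means and Sigma a list of the K estimated covariance matrices):

quad.disc <- function(x, mu, Sigma, Nk, n) {
  K <- nrow(mu)
  sapply(1:K, function(k) {
    dev <- x - mu[k, ]
    -dev %*% solve(Sigma[[k]]) %*% dev - log(det(Sigma[[k]])) + 2*log(Nk[k]/n)
  })
}
# Classify a new x into the group with the largest value, as in (11.50):
#   which.max(quad.disc(xnew, muhat, Sigmahat, Nk, n))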
11.5.1 Example: Iris data, continued
We consider the iris data again, but as in Section 11.5 we estimate three separate covariance matrices. Sections A.3.2 and A.3.3 contain the functions qda and predict.qda for calculating the quadratic discriminant functions (11.48) and finding the predictions. Apply these to the iris data as follows:

qd.iris <- qda(x.iris,y.iris)
yhat.qd <- NULL
for (i in 1:n) {
  yhat.qd <- c(yhat.qd,imax(predict.qda(qd.iris,x.iris[i,])))
}
table(yhat.qd,y.iris)
The resulting table is (11.44), the same as for linear discrimination. The leave-one-out cross-validation estimate of classification error is 4/150 = 0.0267, which is slightly worse than that for linear discrimination. It does not appear that the extra complication of having three covariance matrices improves the classification rate.

Hypothesis testing, AIC, or BIC can also help decide between the model with equal covariance matrices and the model with three separate covariance matrices. Because we have already calculated the estimates, it is quite easy to proceed. The two models are
\[
M_{Same}:\ \Sigma_1 = \Sigma_2 = \Sigma_3;\qquad
M_{Diff}:\ (\Sigma_1, \Sigma_2, \Sigma_3)\ \text{unrestricted}. \tag{11.51}
\]
Both models have the same unrestricted means, and we can consider the \(\hat{\mu}_k\)'s fixed, so we can work with just the sample covariance matrices, as in Section 10.1. Let U_1, U_2, and U_3 be the sum of squares and cross-products matrices (1.15) for the three species, and U = U_1 + U_2 + U_3 be the pooled version. The degrees of freedom for each species is ν_k = 50 − 1 = 49. Thus from (10.10) and (10.11), we can find the deviances (9.47) to be
\[
\begin{aligned}
deviance(M_{Same}) &= (\nu_1 + \nu_2 + \nu_3)\,\log\big(|U/(\nu_1 + \nu_2 + \nu_3)|\big) = -1463.905,\\
deviance(M_{Diff}) &= \nu_1\log(|U_1/\nu_1|) + \nu_2\log(|U_2/\nu_2|) + \nu_3\log(|U_3/\nu_3|) = -1610.568.
\end{aligned} \tag{11.52}
\]
Each covariance matrix has \(\binom{q+1}{2} = 10\) parameters (here q = 4 variables), hence
\[
d_{Same} = 10 \quad\text{and}\quad d_{Diff} = 30. \tag{11.53}
\]
To test the null hypothesis M_Same versus the alternative M_Diff, as in (9.49),
\[
2\log(LR) = deviance(M_{Same}) - deviance(M_{Diff}) = -1463.905 + 1610.568 = 146.663, \tag{11.54}
\]
on d_Diff − d_Same = 20 degrees of freedom. The statistic is highly significant; we reject emphatically the hypothesis that the covariance matrices are the same. The AICs (9.50) and BICs (9.51) are found directly from (11.52) and (11.53):

               AIC         BIC
  M_Same    -1443.90    -1414.00
  M_Diff    -1550.57    -1460.86
                                  (11.55)
They, too, favor the separate-covariance model. Cross-validation above suggested that the equal-covariance model is slightly better, so there seems to be a conflict between AIC/BIC and cross-validation. The conflict can be explained by noting that AIC/BIC are trying to model the x_i's and y_i's jointly, while cross-validation tries to assess the conditional distribution of the y_i's given the x_i's. The latter does not really care about the distribution of the x_i's, except to the extent that it helps in predicting the y_i's.
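For completeness, here is a sketch of how the quantities in (11.52)–(11.55) can be computed in R from the iris variables defined earlier (my own code, not necessarily the author's):

nu <- rep(49, 3)                                        # 50 - 1 per species
U <- lapply(1:3, function(k)
  crossprod(scale(x.iris[y.iris==k,], scale=FALSE)))    # sum of squares and cross-products
dev.same <- sum(nu)*log(det((U[[1]] + U[[2]] + U[[3]])/sum(nu)))
dev.diff <- sum(sapply(1:3, function(k) nu[k]*log(det(U[[k]]/nu[k]))))
c(dev.same, dev.diff)                  # about -1463.9 and -1610.6, as in (11.52)
dev.same - dev.diff                    # 2 log(LR), about 146.7, on 30 - 10 = 20 df
c(dev.same + 2*10, dev.diff + 2*30)    # AICs, as in (11.55); BICs replace the 2 by log(n)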
11.6 Modifications to Fisher's discrimination
The key component in both the quadratic and linear discriminant functions is the quadratic form
\[
q(x;\, \mu_k, \Sigma_k) = \tfrac{1}{2}(x - \mu_k)\Sigma_k^{-1}(x - \mu_k)', \tag{11.56}
\]
where, in the case that the Σ_k's are equal, the xΣ^{-1}x' part is ignored. Without the ½, (11.56) is a measure of distance (called the Mahalanobis distance) between an x and the mean of the k-th group, so that it makes sense to classify an observation into the group to which it is closest (modulo an additive constant). The idea is plausible whether the data are normal or not, and whether the middle component is a general Σ_k or not. E.g., when taking the Σ's equal, we could take
\[
\Sigma = I_p \;\Longrightarrow\; q(x;\, \mu_k, I_p) = \tfrac{1}{2}\,\|x - \mu_k\|^2, \quad\text{or}\quad
\Sigma = \Delta,\ \text{diagonal} \;\Longrightarrow\; q(x;\, \mu_k, \Delta) = \tfrac{1}{2}\sum_i (x_i - \mu_{ki})^2/\delta_{ii}. \tag{11.57}
\]
The first case is regular Euclidean distance. In the second case, one would need to estimate the δ_ii's by the pooled sample variances. These alternatives may be better when there are not many observations per group, and a fairly large number of variables p, so that estimating a full Σ introduces enough extra random error into the classification to reduce its effectiveness.
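A sketch of the diagonal-Σ version of (11.57) (my own function names; v holds the pooled sample variances, and equal priors are assumed):

diag.classify <- function(x, mu, v) {
  # mu: K x p matrix of group means; v: length-p vector of pooled variances
  dists <- apply(mu, 1, function(m) sum((x - m)^2/v))   # the second case in (11.57)
  which.min(dists)
}
# With v <- rep(1, ncol(mu)), this is classification by ordinary Euclidean distance.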
Another modification is to use functions of the individual variables. E.g., in the iris data, one could generate quadratic boundaries by using the variables
\[
\text{Sepal Length},\ (\text{Sepal Length})^2,\ \text{Petal Length},\ (\text{Petal Length})^2 \tag{11.58}
\]
in the x. The resulting set of variables certainly would not be multivariate normal, but the classification based on them may still be reasonable. See the next section for another method of incorporating such functions.
11.7 Conditioning on X: Logistic regression
Based on the conditional densities of X given Y = k and priors π_k, Lemma 11.1 shows that the Bayes classifier in (11.13) is optimal. In Section 11.3, we saw that if the conditional distributions of the X are multivariate normal, with the same covariance matrix for each group, then the classifier devolves to one that is linear in x, as in (11.31). The linearity is not specific to the normal, but is a consequence of the normal being an exponential family density, which means the density has the form
\[
f(x \mid \theta) = a(x)\, e^{t(x)\theta' - \psi(\theta)} \tag{11.59}
\]
for some 1 × m parameter θ, 1 × m function t(x) (the sufficient statistic), and function a(x), where ψ(θ) is the normalizing constant.

Suppose that the conditional density of X given Y = k is f(x | θ_k), that is, each group has the same form of density, but a different parameter value. Then the analog of equations (11.27) and (11.28) yields discriminant functions like those in (11.31),
\[
d_k(x) = \alpha_k + t(x)\,\beta_k', \tag{11.60}
\]
a linear function of t(x), where β_k = θ_k − θ_K, and α_k is a constant depending on the parameters. (Note that d_K(x) = 0.) To implement the classifier, we need to estimate the parameters α_k and β_k, usually by finding the maximum likelihood estimates. (Note that Fisher's quadratic discrimination in Section 11.5 also has discriminant functions (11.48) of the form (11.60), where the t is a function of the x and its "square," x′x.) In such models the conditional distribution of Y given X is given by
\[
P[Y = k \mid X = x] = \frac{e^{d_k(x)}}{e^{d_1(x)} + \cdots + e^{d_{K-1}(x)} + 1} \tag{11.61}
\]
for the d_k's in (11.60). This conditional model is called the logistic regression model. Then an alternative method for estimating the α_k's and β_k's is to find the values that maximize the conditional likelihood,
\[
L\big((\alpha_1, \beta_1), \ldots, (\alpha_{K-1}, \beta_{K-1});\ (x_1, y_1), \ldots, (x_n, y_n)\big) = \prod_{i=1}^n P[Y = y_i \mid X_i = x_i]. \tag{11.62}
\]
(We know that α_K = 0 and β_K = 0.) There is no closed-form solution to the likelihood equations, so one must use some kind of numerical procedure like Newton-Raphson. Note that this approach estimates the slopes and intercepts of the discriminant functions directly, rather than (in the normal case) estimating the means, variances, and π_k's, then finding the slopes and intercepts as functions of those estimates.

Whether using the exponential family model unconditionally or the logistic model conditionally, it is important to realize that both lead to exactly the same classifier. The difference is in the way the slopes and intercepts in (11.60) are estimated. One question is then which gives the better estimates. Note that the joint distribution of the (X, Y) is the product of the conditional of Y given X in (11.61) and the marginal of X in (11.5), so that for the entire data set,
\[
\prod_{i=1}^n f(y_i, x_i \mid \theta, \pi) = \left[\prod_{i=1}^n P[Y = y_i \mid X_i = x_i, \theta, \pi]\right] \times \left[\prod_{i=1}^n \big(\pi_1 f(x_i \mid \theta_1) + \cdots + \pi_K f(x_i \mid \theta_K)\big)\right]. \tag{11.63}
\]
Thus using just the logistic likelihood (11.62), which is the first term on the right-hand side of (11.63), in place of the complete likelihood on the left, leaves out the information about the parameters that is contained in the mixture likelihood (the second term on the right). As we will see in Chapter 12, there is information in the mixture likelihood. One would then expect the complete likelihood to give better estimates in the sense of asymptotic efficiency. It is not clear whether that property always translates into better classification schemes, but maybe.

On the other hand, the conditional logistic model is more general in that it yields valid estimates even when the exponential family assumption does not hold. We can entertain the assumption that the conditional distributions in (11.61) hold for any statistics t(x) we wish to use, without trying to model the marginal distributions of the X's at all. This realization opens up a vast array of models; that is, we can contemplate any functions t we wish.
In what follows, we restrict ourselves to K = 2 groups, renumbered 0 and 1, so that Y is conditionally Bernoulli:
\[
Y_i \mid X_i = x_i \;\sim\; \text{Bernoulli}(\rho(x_i)), \tag{11.64}
\]
where
\[
\rho(x) = P[Y = 1 \mid X = x]. \tag{11.65}
\]
The modeling assumption from (11.61) can be translated to the logit (log odds) of ρ, logit(ρ) = log(ρ/(1 − ρ)). Then
\[
\text{logit}(\rho(x)) = \text{logit}(\rho(x \mid \alpha, \beta)) = \alpha + x\beta'. \tag{11.66}
\]
(We have dropped the t from the notation. You can always define x to be whatever functions of the data you wish.) The form (11.66) exhibits the reason for calling the model logistic regression. Letting
\[
\text{logit}(\rho) = \begin{pmatrix} \text{logit}(\rho(x_1 \mid \alpha, \beta))\\ \text{logit}(\rho(x_2 \mid \alpha, \beta))\\ \vdots\\ \text{logit}(\rho(x_n \mid \alpha, \beta)) \end{pmatrix}, \tag{11.67}
\]
we can set up the model to look like the regular linear model,
\[
\text{logit}(\rho) = \begin{pmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n \end{pmatrix}\begin{pmatrix} \alpha\\ \beta' \end{pmatrix}. \tag{11.68}
\]
We turn to examples.
11.7.1 Example: Iris data
Consider the iris data, restricting to classifying the virginicas versus the versicolors. The next table has estimates of the linear discrimination functions' intercepts and slopes using the multivariate normal model with equal covariances, and using the logistic regression model:

            Intercept   Sepal Length   Sepal Width   Petal Length   Petal Width
  Normal      17.00         3.63           5.69           7.11         12.64
  Logistic    42.64         2.47           6.68           9.43         18.29
                                                                          (11.69)

The two estimates are similar, the logistic giving more weight to the petal widths, and having a large intercept. It is interesting that the normal-based estimates have an observed error of 3/100, while the logistic has 2/100.
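The logistic row of (11.69) can be obtained with glm, along the lines of the sketch below (not necessarily the author's code; here virginica is coded as 1, and the signs of the coefficients depend on which species is coded 1):

xvv <- x.iris[51:150,]                   # the versicolor and virginica observations
yvv <- rep(c(0,1), c(50,50))             # 0 = versicolor, 1 = virginica
vv.df <- data.frame(y = yvv, xvv)
logit.vv <- glm(y ~ ., data = vv.df, family = binomial)
coef(logit.vv)                           # intercept and slopes, cf. (11.69)
mean(ifelse(predict(logit.vv) > 0, 1, 0) != yvv)   # observed error rate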
11.7.2 Example: Spam
The Hewlett-Packard spam data was introduced in Exercise 1.9.14. The n = 4601 observations are emails to George Forman, at Hewlett-Packard labs. The Y classifies each email as spam (Y = 1) or not spam (Y = 0). There are q = 57 explanatory variables based on the contents of the email. Most of the explanatory variables are frequency variables with many zeroes, hence are not at all normal, so Fisher's discrimination may not be appropriate. One could try to model the variables using Poissons or multinomials. Fortunately, if we use the logistic model, we do not need to model the explanatory variables at all, but only decide on the x_j's to use in modeling the logit in (11.68).

Here ρ(x), as in (11.65) and (11.66), is the probability that an email with message statistics x is spam. We start by throwing in all 57 explanatory variables linearly, so that the design matrix in (11.68) contains all the explanatory variables, plus the 1_n vector. This fit produces an observed misclassification error rate of 6.9%.
A number of the coefficients are not significant, hence it makes sense to try subset logistic regression, that is, to find a good subset of explanatory variables to use. It is computationally much more time consuming to fit a succession of logistic regression models than regular linear regression models, so it is often infeasible to do an all-subsets exploration. Stepwise procedures can help, though they are not guaranteed to find the best model. Start with a given criterion, e.g., AIC, and a given subset of explanatory variables, e.g., the full set or the empty set. At each step, one has an old model with some subset of the explanatory variables, and tries every possible model that either adds one variable to or removes one variable from that subset. The new model is then the one with the lowest AIC. The next step uses that new model as the old, and again adds and removes one variable at a time. This process continues until at some step the new model and the old model are the same.
The table in (11.70) shows the results when using AIC and BIC. (The R code is below.) The BIC has a stronger penalty, hence ends up with a smaller model, 30 variables (including the 1_n) versus 44 for the AIC. For those two best models, as well as the full model, the table also contains the 46-fold cross-validation estimate of the error, in percent. That is, we randomly cut the data set into 46 blocks of 100 observations, then predict each block of 100 from the remaining 4501. For the latter two models, cross-validation involves redoing the entire stepwise procedure for each reduced data set. A computationally simpler, but maybe not as defensible, approach would be to use cross-validation on the actual models chosen when applying stepwise to the full data set. Here, we found the estimated errors for the best AIC and BIC models to be 7.07% and 7.57%, respectively, approximately the same as for the more complicated procedure.
              p    Deviance     AIC      BIC    Obs. error   CV error   CV se
  Full       58     1815.8    1931.8   2305.0      6.87        7.35      0.34
  Best AIC   44     1824.9    1912.9   2196.0      6.78        7.15      0.35
  Best BIC   30     1901.7    1961.7   2154.7      7.28        7.59      0.37
                                                                           (11.70)
The table shows that all three models have essentially the same cross-validation error, with the best AIC's model being best. The standard errors are the standard deviations of the 46 errors divided by √46, so they give an idea of how variable the error estimates are. The differences between the three errors are not large relative to these standard errors, so one could arguably take either the best AIC or the best BIC model.

The best AIC model has p = 44 parameters, one of which is the intercept. The table (11.71) categorizes the 41 frequency variables (word or symbol) in this model according to the signs of their coefficients. The ones with positive coefficients tend to indicate spam, while the others indicate non-spam. Note that the latter tend to be words particular to someone named George who works at a lab at HP, while the spam indicators have words like "credit," "free," "money," and exciting symbols like "!" and "$". Also with positive coefficients are the variables that count the number of capital letters, and the length of the longest run of capitals, in the email.
  Positive: 3d, our, over, remove, internet, order, mail, addresses, free, business, you, credit, your, font, 000, money, 650, technology, !, $, #
  Negative: make, address, will, hp, hpl, george, lab, data, 85, parts, pm, cs, meeting, original, project, re, edu, table, conference, ;
                                                                           (11.71)
Computational details

In R, logistic regression models with two categories can be fit using the generalized linear model function, glm. The spam data is in the data frame Spam. The indicator variable, Y_i, for spam is called spam. We first must change the data matrix into a data frame for glm: Spamdf <- data.frame(Spam). The full logistic regression model is fit using

spamfull <- glm(spam ~ .,data=Spamdf,family=binomial)

The spam ~ . tells the program that the spam variable is the Y, and the dot means use all the variables except for spam in the X. The family = binomial tells the program to fit logistic regression. The summary command, summary(spamfull), will print out all the coefficients, which I will not reproduce here, and some other statistics, including

Null deviance: 6170.2 on 4600 degrees of freedom
Residual deviance: 1815.8 on 4543 degrees of freedom
AIC: 1931.8

The residual deviance is the regular deviance in (9.47). The full model uses 58 variables, hence
\[
AIC = deviance + 2p = 1815.8 + 2 \times 58 = 1931.8, \tag{11.72}
\]
which checks. The BIC is found by substituting log(4601) for the 2.
We can find the predicted classifications from this fit using the function predict, which returns the estimated linear predictor (the right-hand side of (11.68)) for the fitted model. The Ŷ_i's are then 1 or 0 according to whether the estimated ρ(x_i) is greater than or less than ½, i.e., whether the linear predictor is greater than or less than 0. Thus to find the predictions and overall error rate, do

yhat <- ifelse(predict(spamfull)>0,1,0)
sum(yhat!=Spamdf[,'spam'])/4601

We find the observed classification error to be 6.87%.
Cross-validation

We will use 46-fold cross-validation to estimate the classification error. We randomly divide the 4601 observations into 46 groups of 100, leaving one observation out entirely (it doesn't get to play). First, permute the indices from 1 to n:

o <- sample(1:4601)

Then the first hundred are the indices of the observations in the first leave-out block, the second hundred those in the second leave-out block, etc. The loop is next, where err collects the number of classification errors in each block of 100.

err <- NULL
for(i in 1:46) {
  oi <- o[(1:100)+(i-1)*100]
  yti <- glm(spam ~ ., family = binomial, data = Spamdf, subset = (1:4601)[-oi])
  dhati <- predict(yti, newdata = Spamdf[oi,])
  yhati <- ifelse(dhati>0,1,0)
  err <- c(err,sum(yhati!=Spamdf[oi,'spam']))
}
In the loop, oi is the vector of indices being left out. We then fit the model without those observations by using the keyword subset=(1:4601)[-oi], which indicates using all indices except those in oi. The dhati is then the vector of discriminant functions evaluated at the left-out observations (the newdata). The mean of err is the estimated error in percent, which for us is 7.35%. See the entry in the table in (11.70).
Stepwise

The command to use for stepwise regression is step. To have the program search through the entire set of variables, use one of the two statements

spamstepa <- step(spamfull,scope=list(upper= ~.,lower = ~1))
spamstepb <- step(spamfull,scope=list(upper= ~.,lower = ~1),k=log(4601))

The first statement searches on AIC, the second on BIC. The first argument in the step function is the return value of glm for the full data. The upper and lower inputs refer to the formulas of the largest and smallest models one wishes to entertain. In our case, we wish the smallest model to have just the 1_n vector (indicated by the ~1), and the largest model to contain all the vectors (indicated by the ~.).

These routines may take a while, and will spit out a lot of output. The end result is the best model found using the given criterion. (If using the BIC version, while calculating the steps the program will output the BIC values, though calling them AIC. The summary output will give the actual AIC, calling it AIC. Thus if you use just the summary output, you must calculate the BIC for yourself.)
To find the cross-validation estimate of classification error, we need to insert the stepwise procedure after fitting the model leaving out the observations, then predict those left out using the result of the stepwise procedure. So for the best BIC model, use the following:

errb <- NULL
for(i in 1:46) {
  oi <- o[(1:100)+(i-1)*100]
  yti <- glm(spam ~ ., family = binomial, data = Spamdf, subset = (1:4601)[-oi])
  stepi <- step(yti,scope=list(upper= ~.,lower = ~1),k=log(4501))
  dhati <- predict(stepi,newdata=Spamdf[oi,])
  yhati <- ifelse(dhati>0,1,0)
  errb <- c(errb,sum(yhati!=Spamdf[oi,'spam']))
}

The estimate for the best AIC model uses the same statements but with k = 2 in the step function. This routine will take a while, because each stepwise procedure is time consuming. Thus one might consider using cross-validation on the model chosen using the BIC (or AIC) criterion for the full data.

The neural networks R package nnet [Venables and Ripley, 2002] can be used to fit logistic regression models for K > 2.
11.8 Trees
The presentation here will also use just K = 2 groups, labeled 0 and 1, but it can be extended to any number of groups. In the logistic regression model (11.61), we modeled P[Y = 1 | X = x] ≡ ρ(x) using a particular parametric form. In this section we use a simpler, nonparametric form, where ρ(x) is constant over rectangular regions of the X-space.

Figure 11.5: Splitting on age and adiposity. The open triangles indicate no heart disease, the solid discs indicate heart disease. The percentages are the percentages of men with heart disease in each region of the plot.
To illustrate, we will use the South African heart disease data from Rousseauw et al. [1983], which was used in Exercise 10.5.20. The Y is coronary heart disease (chd), where 1 indicates the person has the disease and 0 that he does not. Explanatory variables include various health measures. Hastie et al. [2009] apply logistic regression to these data. Here we use trees. Figure 11.5 plots the chd variable for the age and adiposity (fat percentage) variables. Consider the vertical line. It splits the data according to whether age is less than 31.5 years. The splitting point 31.5 was chosen so that the proportions of heart disease in the two regions would be very different. Here, 10/117 = 8.55% of the men under age 31.5 had heart disease, while 150/345 = 43.48% of those above 31.5 had the disease.

The next step is to consider just the men over age 31.5, and split them on the adiposity variable. Taking the value 25, we have that 41/106 = 38.68% of the men over age 31.5 but with adiposity under 25 have heart disease, while 109/239 = 45.61% of the men over age 31.5 and with adiposity over 25 have the disease. We could further split the younger men on adiposity, or split them on age again. Subsequent steps split the resulting rectangles, each time with either a vertical or a horizontal segment. There are also the other variables we could split on. It becomes easier to represent the splits using a tree diagram, as in Figure 11.6. There we have made several splits, at the nodes. Each node needs a variable and a cutoff point, such that people for whom the variable is less than the cutoff are placed in the left branch, and the others go to the right.
Figure 11.6: A large tree, with proportions at the leaves.
The ends of the branches are terminal nodes, or leaves. This plot has 15 leaves. At each leaf, there are a certain number of observations. The plot shows the proportion of 0's (the top number) and 1's (the bottom number) at each leaf.

For classification, we place a 0 or 1 at each leaf, depending on whether the proportion of 1's is less than or greater than 1/2. Figure 11.7 shows the results. Note that for some splits, both leaves have the same classification, because although their proportions of 1's are quite different, they are both on the same side of 1/2. For classification purposes, we can snip some of the branches off. Further analysis (Section 11.8.1) leads us to the even simpler tree in Figure 11.8. The tree is very easy to interpret, hence popular among people (e.g., doctors) who need to use such classifiers. The tree also makes sense, showing that age, type A personality, tobacco use, and family history are important factors in predicting heart disease among these men. Trees are also flexible: they incorporate continuous or categorical variables, avoid having to consider transformations, and automatically incorporate interactions. E.g., the type A variable shows up only for people between the ages of 31.5 and 50.5, and family history and tobacco use show up only for people over 50.5.
Figure 11.7: A large tree, with classifications at the leaves.

Though simple to interpret, it is easy to imagine that finding the best tree is a rather daunting prospect, as there is close to an infinite number of possible trees in any large data set (at each stage one can split any variable at any of a number of points), and searching over all the possibilities is a very discrete (versus continuous) process. In the next section, we present a popular, and simple, algorithm for finding a good tree.
11.8.1 CART
Two popular commercial products for fitting trees are Classification and Regression Trees, CART, by Breiman et al. [1984], and C5.0, by Quinlan [1993]. We will take the CART approach, the main reason being the availability of an R version. It seems that CART would appeal more to statisticians, and C5.0 to data miners, but I do not think the results of the two methods would differ much.

We first need an objective function to measure the fit of a tree to the data. We will use the deviance, although other measures, such as the observed misclassification rate, are certainly reasonable. For a tree T with L leaves, each observation is placed in one of the leaves. If observation y_i is placed in leaf l, then that observation's ρ(x_i) is given by the parameter for leaf l, say p_l.
Figure 11.8: A smaller tree, chosen using BIC.
The likelihood for that Bernoulli observation is
\[
p_l^{y_i}(1 - p_l)^{1 - y_i}. \tag{11.73}
\]
Assuming the observations are independent, at leaf l there is a sample of iid Bernoulli random variables with parameter p_l, hence the overall likelihood of the sample is
\[
L(p_1, \ldots, p_L \mid y_1, \ldots, y_n) = \prod_{l=1}^L p_l^{w_l}(1 - p_l)^{n_l - w_l}, \tag{11.74}
\]
where
\[
n_l = \#\{i \text{ at leaf } l\},\quad w_l = \#\{y_i = 1 \text{ at leaf } l\}. \tag{11.75}
\]
This likelihood is maximized over the p_l's by taking \(\hat{p}_l = w_l/n_l\). Then the deviance (9.47) for this tree is
\[
deviance(\mathcal{T}) = -2\sum_{l=1}^L \big(w_l\log(\hat{p}_l) + (n_l - w_l)\log(1 - \hat{p}_l)\big). \tag{11.76}
\]
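As a concrete translation of (11.76) (my own code, not the tree package's internal routine): given the counts n_l and w_l at the leaves,

tree.deviance <- function(wl, nl) {
  phat <- wl/nl
  # leaves with phat = 0 or 1 contribute 0; guard against 0*log(0)
  terms <- ifelse(wl == 0, 0, wl*log(phat)) + ifelse(wl == nl, 0, (nl - wl)*log(1 - phat))
  -2*sum(terms)
}
tree.deviance(c(1, 0), c(81, 16))   # about 10.78: a leaf with 1 one out of 81, plus a pure leaf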
The CART method has two main steps: grow the tree, then prune the tree. The tree is grown in a stepwise, greedy fashion, at each stage trying to find the single split that maximally reduces the objective function. We start by finding the single split (variable plus cutoff point) that minimizes the deviance among all such splits. Then the observations at each resulting leaf are optimally split, again finding the variable/cutoff split with the lowest deviance. The process continues until the leaves have just a few observations, e.g., stopping when any split would result in a leaf with fewer than five observations.

To grow the tree for the South African heart disease data in R, we need to install the package called tree [Ripley, 2010]. A good explanation of it can be found in Venables and Ripley [2002]. We use the data frame SAheart in the ElemStatLearn package [Halvorsen, 2009]. The dependent variable is chd. To grow a tree, use
basetree <- tree(as.factor(chd)~.,data=SAheart)

The as.factor function indicates to the tree function that it should do classification. If the dependent variable is numeric, tree will fit a so-called regression tree, which is not what we want here. To plot the tree, use one of the two statements

plot(basetree); text(basetree,label='yprob',digits=1)
plot(basetree); text(basetree)

The first gives the proportions of 0's and 1's at each leaf, and the second gives the classifications of the leaves, yielding the trees in Figures 11.6 and 11.7, respectively.
This basetree is now our base tree, and we consider only subtrees, that is, trees obtainable by snipping branches off this tree. As usual, we would like to balance the observed deviance with the number of parameters in the model, in order to avoid overfitting. To wit, we add to the deviance a penalty depending on the number of leaves in the tree. To use AIC or BIC, we need to count the number of parameters for each subtree, conditioning on the structure of the base tree. That is, we assume that the nodes and the variable at each node are given, so that the only free parameters are the cutoff points and the p_l's. The task is one of subset selection, that is, deciding which nodes to snip away. If the subtree has L leaves, then there are L − 1 cutoff points (there are L − 1 nodes), and L p_l's, yielding 2L − 1 parameters. Thus the BIC criterion for a subtree T with L leaves is
\[
BIC(\mathcal{T}) = deviance(\mathcal{T}) + \log(n)(2L - 1). \tag{11.77}
\]
The prune.tree function can be used to find the subtree with the lowest BIC. It takes the base tree and a value k as inputs, then finds the subtree that minimizes
\[
obj_k(\mathcal{T}) = deviance(\mathcal{T}) + kL. \tag{11.78}
\]
Thus for the best AIC subtree we would take k = 4, and for BIC we would take k = 2 log(n):

aictree <- prune.tree(basetree,k=4)
bictree <- prune.tree(basetree,k=2*log(462)) # n = 462 here.

If the k is not specified, then the routine calculates the numbers of leaves and deviances of the best subtrees for all values of k. The best AIC subtree is in fact the full base tree, as in Figure 11.7. Figure 11.9 exhibits the best BIC subtree, which has eight leaves. There are also routines in the tree package that use cross-validation to choose a good factor k to use in pruning.
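One way to scan all the candidate subtree sizes at once is to call prune.tree without k, which returns the sizes and deviances of the best subtrees, and then apply (11.77) directly (a sketch):

pt <- prune.tree(basetree)                 # no k: the whole sequence of best subtrees
bic <- pt$dev + log(462)*(2*pt$size - 1)   # BIC (11.77) for each subtree size
cbind(leaves = pt$size, deviance = pt$dev, BIC = bic)
pt$size[which.min(bic)]                    # number of leaves in the best BIC subtree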
Note that the tree in Figure 11.9 has some redundant splits. Specifically, all leaves to the left of the first split (age < 31.5) lead to classification 0. To snip at that node, we need to determine its index in basetree. One approach is to print out the tree, resulting in the output in Listing 11.1. We see that node #2 is age < 31.5, which is where we wish to snip, hence we use

bictree.2 <- snip.tree(bictree,nodes=2)

Plotting the result yields Figure 11.8. It is reasonable to stick with the presnipped tree, in case one wishes to classify using a cutoff point for the p_l's other than ½.
Figure 11.9: The best subtree using the BIC criterion, before snipping redundant leaves.

Listing 11.1: Text representation of the output of tree for the tree in Figure 11.9

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 462 596.10 0 ( 0.65368 0.34632 )
   2) age < 31.5 117 68.31 0 ( 0.91453 0.08547 )
     4) tobacco < 0.51 81 10.78 0 ( 0.98765 0.01235 ) *
     5) tobacco > 0.51 36 40.49 0 ( 0.75000 0.25000 )
      10) alcohol < 11.105 16 0.00 0 ( 1.00000 0.00000 ) *
      11) alcohol > 11.105 20 27.53 0 ( 0.55000 0.45000 ) *
   3) age > 31.5 345 472.40 0 ( 0.56522 0.43478 )
     6) age < 50.5 173 214.80 0 ( 0.68786 0.31214 )
      12) typea < 68.5 161 188.90 0 ( 0.72671 0.27329 ) *
      13) typea > 68.5 12 10.81 1 ( 0.16667 0.83333 ) *
     7) age > 50.5 172 236.10 1 ( 0.44186 0.55814 )
      14) famhist: Absent 82 110.50 0 ( 0.59756 0.40244 )
        28) tobacco < 7.605 58 68.32 0 ( 0.72414 0.27586 ) *
        29) tobacco > 7.605 24 28.97 1 ( 0.29167 0.70833 ) *
      15) famhist: Present 90 110.00 1 ( 0.30000 0.70000 ) *

There are some drawbacks to this tree-fitting approach. Because of the stepwise nature of the growth, if we start with the wrong variable, it is difficult to recover. That is, even though the best single split may be on age, the best two-variable split may be on type A and alcohol. There is inherent instability, because having a different variable at a given node can completely change the further branches. Additionally, if there are several splits, the sample sizes for estimating the p_l's at the farther-out leaves can be quite small. Boosting, bagging, and random forests are among the techniques proposed to help ameliorate some of these problems and lead to better classifications. They are more black-box-like, though, losing some of the simplicity of the simple trees. See Hastie et al. [2009].
Estimating the misclassification rate

The observed misclassification rate for any tree is easily found using the summary command. Below we find the 10-fold cross-validation estimates of the classification error. The results are in (11.79). Note that the BIC tree had the lowest estimate, though by only about 0.01. The base tree was always chosen by AIC. It is interesting that the BIC trees were much smaller, averaging 5 leaves versus 22 for the AIC/base trees.

              Obs. error   CV error   CV se   Average L
  Base tree      0.208       0.328     0.057      22
  Best AIC       0.208       0.328     0.057      22
  Best BIC       0.229       0.317     0.063       5
                                                        (11.79)
The following finds the cross-validation estimate for the BIC-chosen tree:

o <- sample(1:462) # Reorder the indices
err <- NULL # To collect the errors
for(i in 1:10) {
  oi <- o[(1:46)+46*(i-1)] # Left-out indices
  basetreei <- tree(as.factor(chd)~.,data=SAheart,subset=(1:462)[-oi])
  bictreei <- prune.tree(basetreei,k=2*log(416)) # BIC tree w/o left-out data
  yhati <- predict(bictreei,newdata=SAheart[oi,],type='class')
  err <- c(err,sum(yhati!=SAheart[oi,'chd']))
}

For each of the left-out observations, the predict statement with type='class' gives the tree's classification of the left-out observations. The estimate of the error is then mean(err)/46, and the standard error is sd(err)/46.
11.9 Exercises
Exercise 11.9.1. Show that (11.19) follows from (11.18).

Exercise 11.9.2. Compare the statistic in (11.34), and its maximum using the a in (11.35), to the motivation for Hotelling's T² presented in Section 8.4.1.

Exercise 11.9.3. Write the α_k in (11.60) as a function of the θ_i's and π_i's.
Exercise 11.9.4 (Spam). Consider the spam data from Section 11.7.2 and Exercise 1.9.14. Here we simplify it a bit, and just look at four of the 0/1 predictors: whether or not the email contains the words "free" or "remove" or the symbols "!" or "$". The following table summarizes the data, where the first four columns indicate the presence (1) or absence (0) of the word or symbol, and the last two columns give the numbers of corresponding emails that are spam or not spam. E.g., there are 98 emails containing "remove" and "!", but not "free" nor "$"; 8 of these are not spam, and 90 are spam.

  free  remove  !  $   not spam   spam
   0      0     0  0      1742      92
   0      0     0  1       157      54
   0      0     1  0       554     161
   0      0     1  1        51     216
   0      1     0  0        15      28
   0      1     0  1         4      17
   0      1     1  0         8      90
   0      1     1  1         5     166
   1      0     0  0        94      42
   1      0     0  1        28      20
   1      0     1  0        81     159
   1      0     1  1        38     305
   1      1     0  0         1      16
   1      1     0  1         0      33
   1      1     1  0         2     116
   1      1     1  1         8     298
                                         (11.80)

Assuming a multinomial distribution for the 2⁵ = 32 possibilities, find the estimated Bayes classifier of an email as spam or not spam based on the other four variables in the table. What is the observed error rate?
Exercise 11.9.5 (Crabs). This problem uses data on 200 crabs, categorized into two species, orange and blue, and two sexes. The data are in the MASS R package [Venables and Ripley, 2002], in the data frame crabs. There are 50 crabs in each species-by-sex category: the first 50 are blue males, then 50 blue females, then 50 orange males, then 50 orange females. The five measurements are frontal lobe size, rear width, carapace length, carapace width, and body depth, all in millimeters. The goal here is to find linear discrimination procedures for classifying new crabs into species and sex categories. (a) The basic model is that Y ∼ N(xβ, I_200 ⊗ Σ), where x is any analysis of variance design matrix (n × 4) that distinguishes the four groups. Find the MLE of Σ, \(\widehat{\Sigma}\). (b) Find the \(\hat{c}_k\)'s and \(\hat{a}_k\)'s in Fisher's linear discrimination for classifying all four groups, i.e., classifying on species and sex simultaneously. (Take π_k = 1/4 for all four groups.) Use the version wherein d_K = 0. (c) Using the procedure in part (b) on the observed data, how many crabs had their species misclassified? How many had their sex misclassified? What was the overall observed misclassification rate (for simultaneous classification of color and sex)? (d) Use leave-one-out cross-validation to estimate the overall misclassification rate. What do you get? Is it higher than the observed rate in part (c)?
Exercise 11.9.6 (Crabs). Continue with the crabs data from Exercise 11.9.5, but use classification trees to classify the crabs by just species. (a) Find the base tree using the command

crabtree <- tree(sp ~ FL+RW+CL+CW+BD,data=crabs)

How many leaves does the tree have? Snip off redundant nodes. How many leaves does the snipped tree have? What is its observed misclassification rate? (b) Find the BIC for the subtrees found using prune.tree. Give the number of leaves, deviance, and dimension for the subtree with the best BIC. (c) Consider the subtree with the best BIC. What is its observed misclassification rate? Which two variables figure most prominently in the tree? Which variables do not appear? (d) Now find the leave-one-out cross-validation estimate of the misclassification error rate for the best model using BIC. How does this rate compare with the observed rate?
Exercise 11.9.7 (South African heart disease). This question uses the South African heart disease study discussed in Section 11.8. The objective is to use logistic regression to classify people on the presence of heart disease, variable chd. (a) Use the logistic model that includes all the explanatory variables to do the classification. (b) Find the best logistic model using the stepwise function, with BIC as the criterion. Which variables are included in the best model from the stepwise procedure? (c) Use the model with just the variables suggested by the factor analysis of Exercise 10.5.20: tobacco, ldl, adiposity, obesity, and alcohol. (d) Find the BIC, observed error rate, and leave-one-out cross-validation error rate for the three models in parts (a), (b), and (c). (e) True or false: (i) The full model has the lowest observed error rate; (ii) The factor-analysis-based model is generally best; (iii) The cross-validation-based error rates are somewhat larger than the corresponding observed error rates; (iv) The model with the best observed error rate has the best cv-based error rate as well; (v) The best model of these three is the one chosen by the stepwise procedure; (vi) Both adiposity and obesity seem to be important factors in classifying heart disease.
Exercise 11.9.8 (Zipcode). The objective here is to classify handwritten numerals (0, 1, . . . , 9), so that machines can read people's handwritten zipcodes. The data set consists of 16 × 16 grayscale images, that is, each numeral has been translated to a 16 × 16 matrix, where the elements of the matrix indicate the darkness (from −1 to 1) of the image at 256 grid points. The data set is from LeCun [1989], and can be found in the R package ElemStatLearn [Halvorsen, 2009]. This question will use just the 7's, 8's and 9's, for which there are n = 1831 observations. We put the data in three matrices, one for each digit, called train7, train8, and train9. Each row contains first the relevant digit, then the 256 grayscale values, for one image. The task is to use linear discrimination to distinguish between the digits, even though it is clear that the data are not multivariate normal. First, create the three matrices from the large zip.train matrix:

train7 <- zip.train[zip.train[,1]==7,-1]
train8 <- zip.train[zip.train[,1]==8,-1]
train9 <- zip.train[zip.train[,1]==9,-1]

(a) Using the image, contour, and matrix functions in R, reconstruct the images of some of the 7's, 8's and 9's from their grayscale values. (Or explore the zip2image function in the ElemStatLearn package.) (b) Use linear discrimination to classify the observations based on the 256 variables under the three scenarios below. In each case, find both the observed misclassification rate and the estimate using cross-validation. (i) Using Σ = I_256. (ii) Assuming Σ is diagonal, using the pooled estimates of the individual variances. (iii) Using the pooled covariance matrix as an estimate of Σ. (d) Which method had the best error rate, estimated by cross-validation? (e) Create a data set of digits (7's, 8's, and 9's, as well as 5's) to test classifiers as follows:

test5 <- zip.test[zip.test[,1]==5,-1]
test7 <- zip.test[zip.test[,1]==7,-1]
test8 <- zip.test[zip.test[,1]==8,-1]
test9 <- zip.test[zip.test[,1]==9,-1]

Using the discriminant functions from the original data for the best method from part (b), classify these new observations. What is the error rate for the 7's, 8's, and 9's? How does it compare with the cross-validation estimate? How are the 5's classified?
Exercise 11.9.9 (Spam). Use classification trees to classify the spam data. It is best to start as follows:

Spamdf <- data.frame(Spam)
spamtree <- tree(as.factor(spam)~., data=Spamdf)

Turning the matrix into a data frame makes the labeling on the plots simpler. (a) Find the BICs for the subtrees obtained using prune.tree. How many leaves in the best model? What is its BIC? What is its observed error rate? (b) You can obtain a cross-validation estimate of the error rate by using

cvt <- cv.tree(spamtree, method="misclass", K=46)

The 46 means use 46-fold cross-validation, which is the same as leaving 100 out. The vector cvt$dev contains the number of left-outs misclassified for the various models. The cv.tree function randomly splits the data, so you should run it a few times, and use the combined results to estimate the misclassification rates for the best model you chose in part (a). What do you see? (c) Repeat parts (a) and (b), but using the first ten principal components of the spam explanatory variables as the predictors. (Exercise 1.9.15 calculated the principal components.) Repeat again, but this time using the first ten principal components based on the scaled explanatory variables, scale(Spam[,1:57]). Compare the effectiveness of the three approaches.
Exercise 11.9.10. This question develops a Bayes classifier when there is a mix of normal and binomial explanatory variables. Consider the classification problem based on (Y, X, Z), where Y is the variable to be classified, with values 0 and 1, and X and Z are predictors. X is a 1 × 2 continuous vector, and Z takes the values 0 and 1. The model for (Y, X, Z) is given by

X | Y = y, Z = z ~ N(μ_yz, Σ),   (11.81)

and

P[Y = y & Z = z] = p_yz,   (11.82)

so that p_00 + p_01 + p_10 + p_11 = 1. (a) Find an expression for P[Y = y | X = x & Z = z]. (b) Find the 1 × 2 vector α_z and the constant γ_z (which depend on z and the parameters) so that

P[Y = 1 | X = x & Z = z] > P[Y = 0 | X = x & Z = z]  ⟺  x α_z′ + γ_z > 0.   (11.83)

(c) Suppose the data are (Y_i, X_i, Z_i), i = 1, . . . , n, iid, distributed as above. Find expressions for the MLEs of the parameters (the four μ_yz's, the four p_yz's, and Σ).
Exercise 11.9.11 (South African heart disease). Apply the classification method in Exercise 11.9.10 to the South African heart disease data, with Y indicating heart disease (chd), X containing the two variables age and type A, and Z being the family history of heart disease variable (history: 0 = absent, 1 = present). Randomly divide the data into two parts: the training data with n = 362, and the test data with n = 100. E.g., use

random.index <- sample(462,100)
sahd.train <- SAheart[-random.index,]
sahd.test <- SAheart[random.index,]

(a) Estimate the α_z and γ_z using the training data. Find the observed misclassification rate on the training data, where you classify an observation as Ŷ_i = 1 if x_i α̂_z′ + γ̂_z > 0, and Ŷ_i = 0 otherwise. What is the misclassification rate for the test data (using the estimates from the training data)? Give the 2 × 2 table showing true and predicted Y's for the test data. (b) Using the same training data, find the classification tree. You don't have to do any pruning. Just take the full tree from the tree program. Find the misclassification rates for the training data and the test data. Give the table showing true and predicted Y's for the test data. (c) Still using the training data, find the classification using logistic regression, with the X and Z as the explanatory variables. What are the coefficients for the explanatory variables? Find the misclassification rates for the training data and the test data. (d) What do you conclude?
Chapter 12
Clustering
The classification and prediction we have covered in previous chapters were cases of supervised learning. For example, in classification, we try to find a function that classifies individuals into groups using their x values, where in the training set we know what the proper groups are because we observe their y's. In clustering, we again wish to classify observations into groups using their x's, but do not know the correct groups even in the training set, i.e., we do not observe the y's, nor often even know how many groups there are. Clustering is a case of unsupervised learning.

There are many clustering algorithms. Most are reasonably easy to implement given the number K of clusters. The difficult part is deciding what K should be. Unlike in classification, there is no obvious cross-validation procedure to balance the number of clusters with the tightness of the clusters. Only in the model-based clustering do we have direct AIC or BIC criteria. Otherwise, a number of reasonable but ad hoc measures have been proposed. We will look at two: gap statistics, and silhouettes.

In some situations one is not necessarily assuming that there are underlying clusters, but rather is trying to divide the observations into a certain number of groups for other purposes. For example, a teacher in a class of 40 students might want to break up the class into four sections of about ten each based on general ability (to give more focused instruction to each group). The teacher does not necessarily think there will be wide gaps between the groups, but still wishes to divide for pedagogical purposes. In such cases K is fixed, so the task is a bit simpler.
In general, though, when clustering one is looking for groups that are well separated. There is often an underlying model, just as in Chapter 11 on model-based classification. That is, the data are

(Y_1, X_1), . . . , (Y_n, X_n), iid,   (12.1)

where y_i ∈ {1, . . . , K},

X | Y = k ~ f_k(x) = f(x | θ_k)  and  P[Y = k] = π_k,   (12.2)

as in (11.2) and (11.3). If the parameters are known, then the clustering proceeds exactly as for classification, where an observation x is placed into the group

C(x) = k that maximizes  f_k(x) π_k / ( f_1(x) π_1 + ··· + f_K(x) π_K ).   (12.3)
See (11.13). The fly in the ointment is that we do not observe the y_i's (neither in the training set nor for the new observations), nor do we necessarily know what K is, let alone the parameter values.
The following sections look at some approaches to clustering. The first, K-means, does not explicitly use a model, but has in the back of its mind the f_k's being N(μ_k, σ² I_p). Hierarchical clustering avoids the problems of number of clusters by creating a tree containing clusterings of all sizes, from K = 1 to n. Finally, the model-based clustering explicitly assumes the f_k's are multivariate normal (or some other given distribution), with various possibilities for the covariance matrices.
12.1 K-Means
For a given number K of groups, K-means assumes that each group has a mean vector μ_k. Observation x_i is assigned to the group with the closest mean. To estimate these means, we minimize the sum of the squared distances from the observations to their group means:

obj(μ_1, . . . , μ_K) = Σ_{i=1}^n min_{k=1,...,K} ‖x_i − μ_k‖².   (12.4)
An algorithm for finding the clusters starts with a random set of means μ_1, . . . , μ_K (e.g., randomly choose K observations from the data), then iterates the following two steps:

1. Having estimates of the means, assign observations to the group corresponding to the closest mean,

   C(x_i) = k that minimizes ‖x_i − μ_k‖² over k.   (12.5)

2. Having individuals assigned to groups, find the group means,

   μ_k = (1 / #{i : C(x_i) = k}) Σ_{i : C(x_i)=k} x_i.   (12.6)
The algorithm is guaranteed to converge, but not necessarily to the global minimum. It is a good idea to try several random starts, then take the one that yields the lowest obj in (12.4). The resulting means and assignments are the K-means and their clustering.
12.1.1 Example: Sports data
Recall the data on people ranking seven sports presented in Section 1.6.2. Using the
K-means algorithm for K = 1, . . . , 4, we find the following means (where K = 1 gives
the overall mean):
K = 1 BaseB FootB BsktB Ten Cyc Swim Jog
Group 1 3.79 4.29 3.74 3.86 3.59 3.78 4.95
K = 2 BaseB FootB BsktB Ten Cyc Swim Jog
Group 1 5.01 5.84 4.35 3.63 2.57 2.47 4.12
Group 2 2.45 2.60 3.06 4.11 4.71 5.21 5.85
K = 3 BaseB FootB BsktB Ten Cyc Swim Jog
Group 1 2.33 2.53 3.05 4.14 4.76 5.33 5.86
Group 2 4.94 5.97 5.00 3.71 2.90 3.35 2.13
Group 3 5.00 5.51 3.76 3.59 2.46 1.90 5.78
K = 4 BaseB FootB BsktB Ten Cyc Swim Jog
Group 1 5.10 5.47 3.75 3.60 2.40 1.90 5.78
Group 2 2.30 2.10 2.65 5.17 4.75 5.35 5.67
Group 3 2.40 3.75 3.90 1.85 4.85 5.20 6.05
Group 4 4.97 6.00 5.07 3.80 2.80 3.23 2.13
(12.7)
Look at the K = 2 means. Group 1 likes swimming and cycling, while group 2
likes the team sports, baseball, football, and basketball. If we compare these to the
K = 3 clustering, we see group 1 appears to be about the same as the team sports
group from K = 2, while groups 2 and 3 both like swimming and cycling. The
difference is that group 3 does not like jogging, while group 2 does. For K = 4, it
looks like the team-sports group has split into one that likes tennis (group 3), and
one that doesn't (group 2). At this point it may be more useful to try to decide
what number of clusters is good. (Being able to interpret the clusters is one good
characteristic.)
12.1.2 Gap statistics
Many measures of goodness for clusterings are based on the tightness of clusters. In K-means, an obvious measure of closeness is the within-group sum of squares. For group k, the within sum of squares is

SS_k = Σ_{i : C(x_i)=k} ‖x_i − μ̂_k‖²,   (12.8)

so that for all K clusters we have

SS(K) = Σ_{k=1}^K SS_k,   (12.9)

which is exactly the optimal value of the objective function in (12.4).

A good K will have a small SS(K), but SS(K) is a decreasing function of K. See the first plot in Figure 12.1. The lowest solid line is K versus log(SS(K)). In fact, taking K = n (one observation in each cluster) yields SS(n) = 0. We could balance SS(K) and K, e.g., by minimizing SS(K) + λK for some λ. (Cf. equation (11.78).) The question is how to choose λ. There does not appear to be an obvious AIC, BIC or cross-validation procedure, although in Section 12.5 we look at the model-based soft K-means procedure.
Figure 12.1: The first plot shows the log of the total sums of squares for cluster sizes from K = 1 to 10 for the data (solid line), and for 100 random uniform samples (the clump of curves). The second plot exhibits the gap statistics with SD lines.
Tibshirani, Walther, and Hastie [2001] take a different approach, proposing the gap statistic, which compares the observed log(SS(K))'s with what would be expected from a sample with no cluster structure. We are targeting the values

Gap(K) = E_0[log(SS(K))] − log(SS(K)),   (12.10)

where E_0[·] denotes expected value under some null distribution on the X_i's. Tibshirani et al. [2001] suggest taking a uniform distribution over the range of the data, possibly after rotating the data to the principal components. A large value of Gap(K) indicates that the observed clustering is substantially better than what would be expected if there were no clusters. Thus we look for a K with a large gap.

Because the sports data are rankings, it is natural to consider as a null distribution that the observations are independent and uniform over the permutations of (1, 2, . . . , 7). We cannot analytically determine the expected value in (12.10), so we use simulations. For each b = 1, . . . , B = 100, we generate n = 130 random rankings, perform K-means clustering for K = 1, . . . , 10, and find the corresponding SS_b(K)'s. These make up the dense clump of curves in the first plot of Figure 12.1.

The Gap(K) in (12.10) is then estimated by using the average of the random curves,

Ĝap(K) = (1/B) Σ_{b=1}^B log(SS_b(K)) − log(SS(K)).   (12.11)

The second plot in Figure 12.1 graphs this estimated curve, along with curves plus or minus one standard deviation of the SS_b(K)'s. Clearly K = 2 is much better than K = 1; K's larger than two do not appear to be better than two, so that the gap statistic suggests K = 2 to be appropriate. Even if K = 3 had a higher gap, unless it is higher by a standard deviation, one may wish to stick with the simpler K = 2. Of course, interpretability is a strong consideration as well.
12.1.3 Silhouettes
Another measure of clustering efficacy is Rousseeuw's [1987] notion of silhouettes. The silhouette of an observation i measures how well it fits in its own cluster versus how well it fits in its next closest cluster. Adapted to K-means, we have

a(i) = ‖x_i − μ̂_k‖²  and  b(i) = ‖x_i − μ̂_l‖²,   (12.12)

where observation i is assigned to group k, and group l has the next-closest group mean to x_i. Then its silhouette is

silhouette(i) = ( b(i) − a(i) ) / max{ a(i), b(i) }.   (12.13)

By construction, b(i) ≥ a(i), hence the denominator is b(i), and the silhouette takes values between 0 and 1. If the observation is equal to its group mean, its silhouette is 1. If it is halfway between the two group means, its silhouette is 0. For other clusterings (K-medoids, as in Section 12.2, for example), the silhouettes can range from −1 to 1, but usually stay above 0, or at least do not go much below.
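Section A.4.1 contains the silhouette.km function used in the examples below; as a rough illustration only, a minimal function implementing (12.12) and (12.13) could look like this (the name silhouette.km.sketch is made up, and the appendix version may differ in details):

# Sketch of the K-means silhouettes (12.12)-(12.13).
# x: n x p data matrix; centers: K x p matrix of cluster means.
silhouette.km.sketch <- function(x, centers) {
  x <- as.matrix(x)
  K <- nrow(centers)
  d2 <- sapply(1:K, function(k) rowSums(sweep(x, 2, centers[k, ])^2))  # squared distances
  dsort <- t(apply(d2, 1, sort))
  a <- dsort[, 1]                  # squared distance to own (closest) center, a(i)
  b <- dsort[, 2]                  # squared distance to next-closest center, b(i)
  (b - a) / pmax(a, b)             # the silhouettes (12.13)
}

For example, silhouette.km.sketch(sportsranks, kms[[2]]$centers) would return the 130 silhouette values for the two-group clustering, whose average is the kind of quantity reported in Figure 12.2.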
Figure 12.2 contains the silhouettes for K's from 2 to 5 for the sports data. The observations (along the horizontal axis) are arranged by group and, within group, by silhouettes. This arrangement allows one to compare the clusters. In the first plot (K = 2 groups), the two clusters have similar silhouettes, and the silhouettes are fairly full. High silhouettes are good, so that the average silhouette is a measure of goodness for the clustering. In this case, the average is 0.625. For K = 3, notice that the first silhouette is still full, while the two smaller clusters are a bit frail. The K = 4
and 5 silhouettes are not as full, either, as indicated by their averages.
Figure 12.3 plots the average silhouette versus K. It is clear that K = 2 has the
highest silhouette, hence we would (as when using the gap statistic) take K = 2 as
the best cluster size.
12.1.4 Plotting clusters in one and two dimensions
With two groups, we have two means in p(= 7)-dimensional space. To look at the data, we can project the observations to the line that runs through the means. This projection is where the clustering is taking place. Let

z = (μ̂_1 − μ̂_2) / ‖μ̂_1 − μ̂_2‖,   (12.14)

the unit vector pointing from μ̂_2 to μ̂_1. Then using z as an axis, the projections of the observations onto z have coordinates

w_i = x_i z′, i = 1, . . . , n.   (12.15)

Figure 12.4 is the histogram for the w_i's, where group 1 has w_i > 0 and group 2 has w_i < 0. We can see that the clusters are well-defined in that the bulk of each cluster is far from the center of the other cluster.
We have also plotted the sports, found by creating a pure ranking for each sport.
Thus the pure ranking for baseball would give baseball the rank of 1, and the other
Figure 12.2: The silhouettes for K = 2, . . . , 5 clusters (average silhouettes 0.625, 0.555, 0.508, and 0.534, respectively). The horizontal axis indexes the observations. The vertical axis exhibits the values of the silhouettes.
Figure 12.3: The average silhouettes for K = 2, . . . , 10 clusters.
Figure 12.4: The histogram for the observations along the line connecting the two means for K = 2 groups. (The labels for the seven sports are also plotted.)
sports the rank of 4.5, so that the sum of the ranks, 28, is the same as for the other
rankings. Adding these sports to the plot helps in interpreting the groups: team
sports on the left, individual sports on the right, with tennis on the individual-sport
side, but close to the border.
If K = 3, then the three means lie in a plane, hence we would like to project the observations onto that plane. One approach is to use principal components (Section 1.6) on the means. Because there are three, only the first two principal components will have positive variance, so that all the action will be in the first two. Letting

    ⎛ μ̂_1 ⎞
Z = ⎜ μ̂_2 ⎟,   (12.16)
    ⎝ μ̂_3 ⎠

we apply the spectral decomposition (1.33) in Theorem 1.1 to the sample covariance matrix of Z:

(1/3) Z′ H_3 Z = G L G′,   (12.17)

where G is orthogonal and L is diagonal. The diagonals of L here are 11.77, 4.07, and five zeros. We then rotate the data and the means using G,

W = Y G  and  W^(means) = Z G.   (12.18)

Figure 12.5 plots the first two variables for W and W^(means), along with the seven pure rankings. We see the people who like team sports to the right, and the people who like individual sports to the left, divided into those who can and those who cannot abide jogging. Compare this plot to the biplot that appears in Figure 1.6.
Figure 12.5: The scatter plot for the data projected onto the plane containing the means for K = 3. (The group means, numbered 1 to 3, and the labels for the seven sports are also shown.)
12.1.5 Example: Sports data, using R
The sports data is in the R matrix sportsranks. The K-means clustering uses the function kmeans. We create a list whose Kth component contains the results for K = 2, . . . , 10 groups:

kms <- vector("list",10)
for(K in 2:10) {
  kms[[K]] <- kmeans(sportsranks,centers=K,nstart=10)
}

The centers input specifies the number of groups desired, and nstart=10 means randomly start the algorithm ten times, then use the one with lowest within sum of squares. The output in kms[[K]] for the K-group clustering is a list with centers, the K × p matrix of estimated cluster means; cluster, an n-vector that assigns each observation to its cluster (i.e., the y_i's); withinss, the K-vector of SS_k's (so that SS(K) is found by sum(kms[[K]]$withinss)); and size, the K-vector giving the numbers of observations assigned to each group.
Gap statistic
To find the gap statistic, we first calculate the vector of SS(K)'s in (12.9) for K = 1, . . . , 10. For K = 1, there is just one large group, so that SS(1) is the sum of the sample variances of the variables, times n − 1. Thus

n <- nrow(sportsranks) # n=130
ss <- tr(var(sportsranks))*(n-1) # For K=1
for(K in 2:10) {
  ss <- c(ss, sum(kms[[K]]$withinss))
}

The solid line in the first plot in Figure 12.1 is K versus log(ss). (Or something like it; there is randomness in the results.)
For the summation term on the right-hand side of (12.11), we use uniformly distributed permutations of 1, . . . , 7, which uses the command sample(7). For non-rank statistics, one has to try some other randomization. For each b = 1, . . . , 100, we create n = 130 random permutations, then go through the K-means process for K = 1, . . . , 10. The xstar is the n × 7 random data set.

ssb <- NULL
for(b in 1:100) {
  xstar <- NULL
  for(i in 1:n) xstar <- rbind(xstar,sample(7))
  sstar <- tr(var(xstar))*(n-1)
  for(K in 2:10) {
    sstar <- c(sstar,sum(kmeans(xstar,centers=K,nstart=10)$withinss))
  }
  ssb <- rbind(ssb,sstar)
}
Now each column of ssb contains the SS_b(K)'s for one value of K. The gap statistics (12.11) and the two plots in Figure 12.1 are found using

par(mfrow=c(1,2)) # Set up two plots
matplot(1:10,log(cbind(ss,t(ssb))),type="l",xlab="K",ylab="log(SS)")
ssbm <- apply(log(ssb),2,mean) # Mean of the log(ssb[,K])'s
ssbsd <- sqrt(apply(log(ssb),2,var)) # SD of the log(ssb[,K])'s
gap <- ssbm - log(ss) # The vector of gap statistics
matplot(1:10,cbind(gap,gap+ssbsd,gap-ssbsd),type="l",xlab="K",ylab="Gap")
Silhouettes
Section A.4.1 contains a simple function for calculating the silhouettes in (12.13) for a given K-means clustering. The sort.silhouette function in Section A.4.2 sorts the silhouette values for plotting. The following statements produce Figure 12.2:

sil.ave <- NULL # To collect the silhouette means for each K
par(mfrow=c(3,3))
for(K in 2:10) {
  sil <- silhouette.km(sportsranks,kms[[K]]$centers)
  sil.ave <- c(sil.ave,mean(sil))
  ssil <- sort.silhouette(sil,kms[[K]]$cluster)
  plot(ssil,type="h",xlab="Observations",ylab="Silhouettes")
  title(paste("K =",K))
}
The sil.ave calculated above can then be used to obtain Figure 12.3:
plot(2:10,sil.ave,type="l",xlab="K",ylab="Average silhouette width")
Plotting the clusters
Finally, we make plots as in Figures 12.4 and 12.5. For K = 2, we have the one z as in (12.14) and the w_i's as in (12.15):

z <- kms[[2]]$centers[1,]-kms[[2]]$centers[2,]
z <- z/sqrt(sum(z^2))
w <- sportsranks%*%z
xl <- c(-6,6); yl <- c(0,13) # Fix the x- and y-ranges
hist(w[kms[[2]]$cluster==1],col=2,xlim=xl,ylim=yl,main="K=2",xlab="W")
par(new=TRUE) # To allow two histograms on the same plot
hist(w[kms[[2]]$cluster==2],col=3,xlim=xl,ylim=yl,main="",xlab="")

To add the sports names:

y <- matrix(4.5,7,7)-3.5*diag(7)
ws <- y%*%z
text(ws,c(10,11,12,8,9,10,11),labels=dimnames(sportsranks)[[2]])

The various placement numbers were found by trial and error.
For K = 3, or higher, we can use R's eigenvector/eigenvalue function, eigen, to find the G used in (12.18):

z <- kms[[3]]$centers
g <- eigen(var(z))$vectors[,1:2] # Just need the first two columns
w <- sportsranks%*%g # For the observations
ws <- y%*%g # For the sports names
wm <- z%*%g # For the group means
cl <- kms[[3]]$cluster
plot(w,xlab="Var 1",ylab="Var 2",pch=cl)
text(wm,labels=1:3)
text(ws,dimnames(sportsranks)[[2]])
12.2 K-medoids
Clustering with medoids [Kaufman and Rousseeuw, 1990] works directly on distances between objects. Suppose we have n objects, o_1, . . . , o_n, and a dissimilarity measure d(o_i, o_j) between pairs. This d satisfies

d(o_i, o_j) ≥ 0,  d(o_i, o_j) = d(o_j, o_i),  and  d(o_i, o_i) = 0,   (12.19)

but it may not be an actual metric in that it need not satisfy the triangle inequality. Note that one cannot necessarily impute distances between an object and another vector, e.g., a mean vector. Rather than clustering around means, the clusters are then
built around some of the objects. That is, K-medoids finds K of the objects (c_1, . . . , c_K) to act as centers (or medoids), the objective being to find the set that minimizes

obj(c_1, . . . , c_K) = Σ_{i=1}^n min_{k=1,...,K} d(o_i, c_k).   (12.20)
Silhouettes are defined as in (12.13), except that here, for each observation i,

a(i) = Σ_{j ∈ Group k} d(o_i, o_j)  and  b(i) = Σ_{j ∈ Group l} d(o_i, o_j),   (12.21)

where group k is object i's group, and group l is its next closest group.
In R, one can use the package cluster [Maechler et al., 2005], which implements K-medoids clustering in the function pam, which stands for "partitioning around medoids." Consider the grades data in Section 4.2.1. We will cluster the five variables, homework, labs, inclass, midterms, and final, not the 107 people. A natural measure of similarity between two variables is their correlation. Instead of using the usual Pearson coefficient, we will use Kendall's τ, which is more robust. For n × 1 vectors x and y, Kendall's τ is

T(x, y) = [ Σ_{1≤i<j≤n} Sign(x_i − x_j) Sign(y_i − y_j) ] / (n choose 2).   (12.22)
The numerator looks at the line segment connecting each pair of points (x_i, y_i) and (x_j, y_j), counting +1 if the slope is positive and −1 if it is negative. The denominator normalizes the statistic so that it is between ±1. Then T(x, y) = +1 means that the x_i's and y_i's are exactly monotonically increasingly related, and −1 means they are exactly monotonically decreasingly related, much as the correlation coefficient. The T's measure similarities, so we subtract each T from 1 to obtain the dissimilarity matrix:
HW Labs InClass Midterms Final
HW 0.00 0.56 0.86 0.71 0.69
Labs 0.56 0.00 0.80 0.68 0.71
InClass 0.86 0.80 0.00 0.81 0.81
Midterms 0.71 0.68 0.81 0.00 0.53
Final 0.69 0.71 0.81 0.53 0.00
(12.23)
Using R, we find the dissimilarity matrix:

x <- grades[,2:6]
dx <- matrix(nrow=5,ncol=5) # To hold the dissimilarities
for(i in 1:5)
  for(j in 1:5)
    dx[i,j] <- 1-cor.test(x[,i],x[,j],method="kendall")$est

This matrix is passed to the pam function, along with the desired number of groups K. Thus for K = 3, say, use

pam3 <- pam(as.dist(dx),k=3)

The average silhouette for this clustering is in pam3$silinfo$avg.width. The results for K = 2, 3 and 4 are
K 2 3 4
Average silhouette 0.108 0.174 0.088
(12.24)
We see that K = 3 has the best average silhouette. The assigned groups for this clus-
tering can be found in pam3$clustering, which is (1,1,2,3,3), meaning the groupings
are, reasonably enough,
{HW, Labs}  {InClass}  {Midterms, Final}.   (12.25)
The medoids, i.e., the objects chosen as centers, are in this case labs, inclass, and
midterms, respectively.
12.3 Model-based clustering
In model-based clustering [Fraley and Raftery, 2002], we assume that the model in (12.2) holds, just as for classification. We then estimate the parameters, which includes the θ_k's and the π_k's, and assign observations to clusters as in (12.3):

Ĉ(x_i) = k that maximizes  f(x_i | θ̂_k) π̂_k / ( f(x_i | θ̂_1) π̂_1 + ··· + f(x_i | θ̂_K) π̂_K ).   (12.26)
As opposed to classification situations, in clustering we do not observe the y_i's, hence cannot use the joint distribution of (Y, X) to estimate the parameters. Instead, we need to use the marginal of X, which is the denominator in the Ĉ:

f(x_i) = f(x_i | θ_1, . . . , θ_K, π_1, . . . , π_K)
       = f(x_i | θ_1) π_1 + ··· + f(x_i | θ_K) π_K.   (12.27)
The density is a mixture density, as in (11.5).
The likelihood for the data is then

L(θ_1, . . . , θ_K, π_1, . . . , π_K ; x_1, . . . , x_n) = Π_{i=1}^n ( f(x_i | θ_1) π_1 + ··· + f(x_i | θ_K) π_K ).   (12.28)

The likelihood can be maximized for any specific model (specifying the f's and θ_k's as well as K), and models can be compared using the BIC (or AIC). The likelihood (12.28) is not always easy to maximize due to its being a product of sums. Often the EM algorithm (see Section 12.4) is helpful.
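As a small illustration of (12.27) and (12.28) in the multivariate normal case, the sketch below (the name mix.loglik is made up, and this is not the code used by mclust) evaluates the mixture log-likelihood for given parameter estimates; a BIC or AIC can then be formed by adding the appropriate penalty for the number of free parameters.

# Sketch: log-likelihood (12.28) for a multivariate normal mixture, given estimates of the
# means (K x p matrix), covariances (p x p x K array), and probabilities pik (length K).
library(mvtnorm)   # for dmvnorm
mix.loglik <- function(x, mu, Sigma, pik) {
  x <- as.matrix(x)
  K <- length(pik)
  # n x K matrix of pik_k * f(x_i | theta_k), the terms in (12.27)
  fk <- sapply(1:K, function(k) pik[k] * dmvnorm(x, mu[k, ], Sigma[, , k]))
  sum(log(rowSums(fk)))
}

The Mclust function used in the next example automates both the maximization and the model comparison.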
We will present the multivariate normal case, as we did in (11.24) and (11.25) for classification. The general model assumes for each k that

X | Y = k ~ N_{1×p}(μ_k, Σ_k).   (12.29)

We will assume the μ_k's are free to vary, although models in which there are equalities among some of the elements are certainly reasonable. There are also a variety of structural and equality assumptions on the Σ_k's used.
12.3.1 Example: Automobile data
The R function we use is in the package mclust, Fraley and Raftery [2010]. Our data consists of size measurements on 111 automobiles; the variables include length, wheelbase, width, height, front and rear head room, front leg room, rear seating, front and rear shoulder room, and luggage area. The data are in the file cars, from Consumers Union [1990], and can be found in the S-Plus® [TIBCO Software Inc., 2009]
Figure 12.6: BICs for fitting the entire data set.
data frame cu.dimensions. The variables in cars have been normalized to have medians of 0 and median absolute deviations (MAD) of 1.4826 (the MAD for a N(0, 1)).

The routine we'll use is Mclust (be sure to capitalize the M). It will try various forms of the covariance matrices and group sizes, and pick the best based on the BIC. To use the default options and have the results placed in mcars, use

mcars <- Mclust(cars)

There are many options for plotting in the package. To see a plot of the BICs, use

plot(mcars,cars,what="BIC")

You have to click on the graphics window, or hit enter, to reveal the plot. The result is in Figure 12.6. The horizontal axis specifies the K, and the vertical axis gives the BIC values, although these are the negatives of our BICs. The symbols plotted on the graph are codes for various structural hypotheses on the covariances. See (12.35). In this example, the best model is Model "VVV" with K = 2, which means the covariance matrices are arbitrary and unequal.

Some pairwise plots (length versus height, width versus front head room, and rear head room versus luggage) are given in Figure 12.7. The plots include ellipses to illustrate the covariance matrices. Indeed we see that the two ellipses in each plot are arbitrary and unequal. To plot variable 1 (length) versus variable 4 (height), use

plot(mcars,cars,what="classification",dimens=c(1,4))

We also plot the first two principal components (Section 1.6). The matrix of eigenvectors, G in (1.33), is given by eigen(var(cars))$vectors:
Figure 12.7: Some two-variable plots of the clustering produced by Mclust. The solid triangles indicate group 1, and the open squares indicate group 2. The fourth graph plots the first two principal components of the data.
carspc <- cars%*%eigen(var(cars))$vectors # Principal components

To obtain the ellipses, we redid the clustering using the principal components as the data, and specifying G=2 groups in Mclust.

Look at the plots. The lower left graph shows that group 2 is almost constant on the luggage variable. In addition, the upper left and lower right graphs indicate that group 2 can be divided into two groups, although the BIC did not pick up the difference. Table 12.1 exhibits four of the variables for the 15 automobiles in group 2.

We have divided this group as suggested by the principal component plot. Note that the first group of five are all sports cars. They have no back seats or luggage areas, hence the values in the data set for the corresponding variables are coded somehow. The other ten automobiles are minivans. They do not have specific luggage areas, i.e., trunks, either, although in a sense the whole vehicle is a big luggage area. Thus this group really is a union of two smaller groups, both of which are quite a bit different than group 1.
Rear Head Rear Seating Rear Shoulder Luggage
Chevrolet Corvette 4.0 19.67 28.00 8.0
Honda Civic CRX 4.0 19.67 28.00 8.0
Mazda MX5 Miata 4.0 19.67 28.00 8.0
Mazda RX7 4.0 19.67 28.00 8.0
Nissan 300ZX 4.0 19.67 28.00 8.0
Chevrolet Astro 2.5 0.33 1.75 8.0
Chevrolet Lumina APV 2.0 3.33 4.00 8.0
Dodge Caravan 2.5 0.33 6.25 8.0
Dodge Grand Caravan 2.0 2.33 3.25 8.0
Ford Aerostar 1.5 1.67 4.25 8.0
Mazda MPV 3.5 0.00 5.50 8.0
Mitsubishi Wagon 2.5 19.00 2.50 8.0
Nissan Axxess 2.5 0.67 1.25 8.5
Nissan Van 3.0 19.00 2.25 8.0
Volkswagen Vanagon 7.0 6.33 7.25 8.0
Table 12.1: The automobiles in group 2 of the clustering of all the data.
We now redo the analysis on just the group 1 automobiles:

cars1 <- cars[mcars$classification==1,]
mcars1 <- Mclust(cars1)

The model chosen by BIC is "XXX with 1 components," which means the best clustering is one large group, where the Σ is arbitrary. See Figure 12.8 for the BIC plot. The EEE models (equal but arbitrary covariance matrices) appear to be quite good, and similar BIC-wise, for K from 1 to 4. To get the actual BIC values, look at the vector mcars1$BIC[,"EEE"]. The next table has the BICs and corresponding estimates of the posterior probabilities for the first five models, where we shift the BICs so that the best is 0:

K          1       2       3       4       5
BIC        0     28.54    9.53   22.09   44.81
P̂_BIC   99.15     0       0.84    0       0
                                                 (12.30)

Indeed, it looks like one group is best, although three groups may be worth looking at. It turns out the three groups are basically large, middle-sized, and small cars. Not profound, perhaps, but reasonable.
12.3.2 Some of the models in mclust
The mclust package considers several models for the covariance matrices. Suppose that the covariance matrices for the groups are Σ_1, . . . , Σ_K, where each has its spectral decomposition (1.33)

Σ_k = Γ_k Λ_k Γ_k′,   (12.31)

and the eigenvalue matrix is decomposed as

Λ_k = c_k Δ_k,  where |Δ_k| = 1 and c_k = [ Π_{j=1}^p λ_j ]^{1/p},   (12.32)
Figure 12.8: BICs for the data set without the sports cars or minivans.
the geometric mean of the eigenvalues. A covariance matrix is then described by shape, volume, and orientation:

Shape(Σ_k) = Δ_k;
Volume(Σ_k) = |Σ_k| = c_k^p;
Orientation(Σ_k) = Γ_k.   (12.33)

The covariance matrices are then classified into spherical, diagonal, and ellipsoidal:

Spherical:    Δ_k = I_p  ⟺  Σ_k = c_k I_p;
Diagonal:     Γ_k = I_p  ⟺  Σ_k = c_k D_k;
Ellipsoidal:  Σ_k is arbitrary.   (12.34)

The various models are defined by the type of covariances, and what equalities there are among them. I haven't been able to crack the code totally, but the descriptions tell the story. When K ≥ 2 and p ≥ 2, the following table may help translate the descriptions into restrictions on the covariance matrices through (12.33) and (12.34):
Code  Description                                           Σ_k
EII   spherical, equal volume                               σ² I_p
VII   spherical, unequal volume                             σ_k² I_p
EEI   diagonal, equal volume and shape                      Λ
VEI   diagonal, varying volume, equal shape                 c_k Δ
EVI   diagonal, equal volume, varying shape                 c Δ_k
VVI   diagonal, varying volume and shape                    Λ_k
EEE   ellipsoidal, equal volume, shape, and orientation     Σ
EEV   ellipsoidal, equal volume and equal shape             Γ_k Λ Γ_k′
VEV   ellipsoidal, equal shape                              c_k Γ_k Δ Γ_k′
VVV   ellipsoidal, varying volume, shape, and orientation   arbitrary
                                                                        (12.35)
Here, Λ's are diagonal matrices with positive diagonals, Δ's are diagonal matrices with positive diagonals whose product is 1 as in (12.32), Γ's are orthogonal matrices, Σ's are arbitrary nonnegative definite symmetric matrices, and c's are positive scalars. A subscript k on an element means the groups can have different values for that element. No subscript means that element is the same for each group.

If there is only one variable, but K ≥ 2, then the only two models are "E," meaning the variances of the groups are equal, and "V," meaning the variances can vary. If there is only one group, then the models are as follows:
Code  Description       Σ
X     one-dimensional   σ²
XII   spherical         σ² I_p
XXI   diagonal          Λ
XXX   ellipsoidal       arbitrary
                                    (12.36)
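To make the decomposition in (12.31) through (12.33) concrete, the volume, shape, and orientation of any covariance matrix can be computed with R's eigen function. The sketch below is only an illustration (the name cov.decomp is made up for this purpose):

# Sketch: volume, shape, and orientation of a covariance matrix, as in (12.31)-(12.33).
cov.decomp <- function(Sigma) {
  ev <- eigen(Sigma, symmetric = TRUE)
  Gamma <- ev$vectors                   # orientation (orthogonal matrix Gamma)
  lambda <- ev$values                   # eigenvalues
  p <- length(lambda)
  ck <- prod(lambda)^(1/p)              # geometric mean of the eigenvalues, (12.32)
  Delta <- diag(lambda / ck)            # shape matrix Delta, with |Delta| = 1
  list(volume = ck^p,                   # = |Sigma|, as in (12.33)
       shape = Delta,
       orientation = Gamma)
}

For example, it could be applied to one of the fitted covariance matrices from Mclust (the exact name of the output component holding the covariances depends on the mclust version).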
12.4 An example of the EM algorithm
The aim of this section is to give the flavor of an implementation of the EM algorithm. We assume K groups with the multivariate normal distribution as in (12.29), with different arbitrary Σ_k's. The idea is to iterate two steps:

1. Having estimates of the parameters, find estimates of the P[Y = k | X = x_i]'s.

2. Having estimates of the P[Y = k | X = x_i]'s, find estimates of the parameters.

Suppose we start with initial estimates of the μ_k's, Σ_k's, and π_k's. E.g., one could first perform a K-means procedure, then use the sample means and covariance matrices of the groups to estimate the means and covariances, and estimate the π_k's by the proportions of observations in the groups. Then, as in (12.26), for step 1 we use

P̂[Y = k | X = x_i] = f(x_i | μ̂_k, Σ̂_k) π̂_k / ( f(x_i | μ̂_1, Σ̂_1) π̂_1 + ··· + f(x_i | μ̂_K, Σ̂_K) π̂_K ) ≡ w_k^(i),   (12.37)

where θ̂_k = (μ̂_k, Σ̂_k).
Note that for each i, the w_k^(i) can be thought of as weights, because their sum over k is 1. Then in step 2, we find the weighted means and covariances of the x_i's:

μ̂_k = (1/n_k) Σ_{i=1}^n w_k^(i) x_i  and  Σ̂_k = (1/n_k) Σ_{i=1}^n w_k^(i) (x_i − μ̂_k)(x_i − μ̂_k)′,

where n_k = Σ_{i=1}^n w_k^(i). Also, π̂_k = n_k / n.   (12.38)

The two steps are iterated until convergence. The convergence may be slow, and it may not approach the global maximum likelihood, but it is guaranteed to increase the likelihood at each step. As in K-means, it is a good idea to try different starting points.

In the end, the observations are clustered using the conditional probabilities, because from (12.26),

Ĉ(x_i) = k that maximizes w_k^(i).   (12.39)
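A bare-bones R version of this iteration might look as follows. It is only a sketch of (12.37) and (12.38) for the general (VVV-type) model, with a made-up function name, a fixed number of iterations instead of a convergence check, and no safeguards against empty groups or nearly singular covariance estimates; in practice one would use the mclust package.

# Sketch of the EM iteration (12.37)-(12.38) for a K-group multivariate normal mixture.
library(mvtnorm)                          # for dmvnorm
em.sketch <- function(x, K, niter = 50) {
  x <- as.matrix(x); n <- nrow(x); p <- ncol(x)
  # Initialize from a K-means clustering
  km <- kmeans(x, centers = K, nstart = 10)
  mu <- km$centers
  Sigma <- array(0, c(p, p, K))
  for(k in 1:K) Sigma[, , k] <- var(x[km$cluster == k, , drop = FALSE])
  pik <- km$size / n
  for(iter in 1:niter) {
    # Step 1: the weights w_k^(i) in (12.37)
    fk <- sapply(1:K, function(k) pik[k] * dmvnorm(x, mu[k, ], Sigma[, , k]))
    w <- fk / rowSums(fk)
    # Step 2: weighted means, covariances, and probabilities, as in (12.38)
    nk <- colSums(w)
    for(k in 1:K) {
      mu[k, ] <- colSums(w[, k] * x) / nk[k]
      xc <- sweep(x, 2, mu[k, ])
      Sigma[, , k] <- t(xc) %*% (w[, k] * xc) / nk[k]
    }
    pik <- nk / n
  }
  list(mu = mu, Sigma = Sigma, pik = pik, w = w,
       cluster = apply(w, 1, which.max))  # the clustering (12.39)
}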
12.5 Soft K-means
We note that the K-means procedure in (12.5) and (12.6) is very similar to the EM procedure in (12.37) and (12.38) if we take a hard form of conditional probability, i.e., take

w_k^(i) = 1 if x_i is assigned to group k, and 0 otherwise.   (12.40)

Then the μ̂_k in (12.38) becomes the sample mean of the observations assigned to cluster k.

A model for which model-based clustering mimics K-means clustering assumes that in (12.29), the covariance matrices Σ_k = σ² I_p (model EII in (12.35)), so that

f_k(x_i) = c (1/σ^p) exp( −(1/(2σ²)) ‖x_i − μ_k‖² ).   (12.41)

If σ is fixed, then the EM algorithm proceeds as above, except that the covariance calculation in (12.38) is unnecessary. If we let σ → 0 in (12.37), fixing the means, we have that

P̂[Y = k | X = x_i] → w_k^(i)   (12.42)

for the w_k^(i) in (12.40), at least if all the π_k's are positive. Thus for small fixed σ, K-means and model-based clustering are practically the same.

Allowing σ to be estimated as well leads to what we call soft K-means, "soft" because we use a weighted mean, where the weights depend on the distances from the observations to the group means. In this case, the EM algorithm is as in (12.37) and
(12.38), but with the estimate of the covariance replaced with the pooled estimate of σ²,

σ̂² = (1/n) Σ_{k=1}^K Σ_{i=1}^n w_k^(i) ‖x_i − μ̂_k‖².   (12.43)
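In terms of the EM sketch of Section 12.4, soft K-means changes only two pieces: step 1 uses the spherical density (12.41) with a common σ², and step 2 replaces the covariance update by the pooled estimate (12.43). A self-contained sketch (again with a made-up name, and not the implementation used below, which is Mclust with the EII model):

# Sketch of soft K-means: EM with Sigma_k = sigma^2 I_p and pooled sigma^2 as in (12.43).
soft.kmeans.sketch <- function(x, K, niter = 50) {
  x <- as.matrix(x); n <- nrow(x); p <- ncol(x)
  km <- kmeans(x, centers = K, nstart = 10)        # initialize from K-means
  mu <- km$centers; pik <- km$size / n
  sigma2 <- sum(km$withinss) / n
  for(iter in 1:niter) {
    # Step 1: weights from the spherical density (12.41); the constant c cancels
    d2 <- sapply(1:K, function(k) rowSums(sweep(x, 2, mu[k, ])^2))
    fk <- sweep(exp(-d2 / (2 * sigma2)) / sigma2^(p / 2), 2, pik, "*")
    w <- fk / rowSums(fk)
    # Step 2: weighted means and probabilities as in (12.38), pooled variance (12.43)
    nk <- colSums(w)
    for(k in 1:K) mu[k, ] <- colSums(w[, k] * x) / nk[k]
    pik <- nk / n
    d2 <- sapply(1:K, function(k) rowSums(sweep(x, 2, mu[k, ])^2))
    sigma2 <- sum(w * d2) / n
  }
  list(mu = mu, pik = pik, sigma2 = sigma2, cluster = apply(w, 1, which.max))
}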
12.5.1 Example: Sports data
In Section 12.1.1, we used K-means to find clusters in the data on people's favorite sports. Here we use soft K-means. There are a couple of problems with using this model (12.41): (1) the data are discrete, not continuous as in the multivariate normal; (2) the dimension is actually 6, not 7, because each observation is a permutation of (1, . . . , 7), hence sums to 28. To fix the latter problem, we multiply the data matrix by any orthogonal matrix whose first column is constant, then throw away the first column of the result (since it is a constant). Orthogonal polynomials are easy in R:

h <- poly(1:7,6) # Gives all but the constant term.
x <- sportsranks%*%h

The clustering can be implemented in Mclust by specifying the "EII" model in (12.35):

skms <- Mclust(x,modelNames="EII")

The shifted BICs are

K       1       2      3       4       5
BIC   95.40     0    21.79   32.28   48.27
                                             (12.44)

Clearly K = 2 is best, which is what we found using K-means in Section 12.1.1. It turns out the observations are clustered exactly the same for K = 2 whether using K-means or soft K-means. When K = 3, the two methods differ on only three observations, but for K = 4, 35 are differently clustered.
12.6 Hierarchical clustering
A hierarchical clustering gives a sequence of clusterings, each one combining two clusters of the previous stage. We assume n objects and their dissimilarities d as in (12.19). To illustrate, consider the five grades variables in Section 12.2. A possible hierarchical sequence of clusterings starts with each object in its own group, then combines two of those elements, say midterms and final. The next step could combine two of the other singletons, or place one of them with the midterms/final group. Here we combine homework and labs, then combine all but inclass, then finally have one big group with all the objects:

{HW} {Labs} {InClass} {Midterms} {Final}
{HW} {Labs} {InClass} {Midterms, Final}
{HW, Labs} {InClass} {Midterms, Final}
{InClass} {HW, Labs, Midterms, Final}
{InClass, HW, Labs, Midterms, Final}   (12.45)

Reversing the steps and connecting, one obtains a tree diagram, or dendrogram, as in Figure 12.9.
Figure 12.9: Hierarchical clustering of the grades, using complete linkage.
For a set of objects, the question is which clusters to combine at each stage. At the first stage, we combine the two closest objects, that is, the pair (o_i, o_j) with the smallest d(o_i, o_j). At any further stage, we may wish to combine two individual objects, or a single object to a group, or two groups. Thus we need to decide how to measure the dissimilarity between any two groups of objects. There are many possibilities. Three popular ones look at the minimum, average, and maximum of the individuals' distances. That is, suppose A and B are subsets of objects. Then the three distances between the subsets are

Single linkage:    d(A, B) = min_{a∈A, b∈B} d(a, b);
Average linkage:   d(A, B) = (1/(#A #B)) Σ_{a∈A} Σ_{b∈B} d(a, b);
Complete linkage:  d(A, B) = max_{a∈A, b∈B} d(a, b).   (12.46)

In all cases, d({a}, {b}) = d(a, b). Complete linkage is an example of Hausdorff distance, at least when the d is a distance.
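As an illustration only (the hclust function used below handles the linkages internally), the three between-group dissimilarities in (12.46) could be computed from a dissimilarity matrix as in this made-up sketch:

# Sketch: dissimilarity between two groups of objects under the three linkages (12.46).
# d is an n x n matrix of pairwise dissimilarities; A and B are vectors of object indices.
linkage <- function(d, A, B, method = c("single", "average", "complete")) {
  method <- match.arg(method)
  dAB <- d[A, B, drop = FALSE]      # the #A x #B block of pairwise dissimilarities
  switch(method,
         single   = min(dAB),
         average  = mean(dAB),
         complete = max(dAB))
}

For instance, with the dx of (12.23), linkage(dx, c(4,5), 1, "complete") gives the 0.71 for d({Midterms, Final}, {HW}) that appears in (12.47) and (12.48) below.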
12.6.1 Example: Grades data
Consider the dissimilarities for the five variables of the grades data given in (12.23). The hierarchical clustering using these dissimilarities with complete linkage is given in Figure 12.9. This clustering is not surprising given the results of K-medoids in (12.25). As in (12.45), the hierarchical clustering starts with each object in its own cluster. Next we look for the smallest dissimilarity between two objects, which is the 0.53 between midterms and final. In the dendrogram, we see these two scores being connected at the height of 0.53.
We now have four clusters, with dissimilarity matrix

                    HW    Labs  InClass  {Midterms, Final}
HW                  0.00  0.56    0.86        0.71
Labs                0.56  0.00    0.80        0.71
InClass             0.86  0.80    0.00        0.81
{Midterms, Final}   0.71  0.71    0.81        0.00
                                                            (12.47)

(The dissimilarity between the cluster {Midterms, Final} and itself is not really zero, but we put zero there for convenience.) Because we are using complete linkage, the dissimilarity between a single object and the cluster with two objects is the maximum of the two individual dissimilarities. For example,

d({HW}, {Midterms, Final}) = max{ d(HW, Midterms), d(HW, Final) }
                           = max{ 0.71, 0.69 }
                           = 0.71.   (12.48)

The two closest clusters are now the singletons HW and Labs, with a dissimilarity of 0.56. The new dissimilarity matrix is then

                    {HW, Labs}  InClass  {Midterms, Final}
{HW, Labs}             0.00       0.86        0.71
InClass                0.86       0.00        0.81
{Midterms, Final}      0.71       0.81        0.00
                                                            (12.49)

The next step combines the two two-object clusters, and the final step places InClass with the rest.
To use R, we start with the dissimilarity matrix dx in (12.23). The routine hclust
creates the tree, and plclust plots it. We need the as.dist there to let the function know
we already have the dissimilarities. Then Figure 12.9 is created by the statement
plclust(hclust(as.dist(dx)))
12.6.2 Example: Sports data
Turn to the sports data from Section 12.1.1. Here we cluster the sports, using squared Euclidean distance as the dissimilarity, and complete linkage. To use squared Euclidean distances, use the dist function directly on the data matrix. Figure 12.10 is found using

plclust(hclust(dist(t(sportsranks))))

Compare this plot to the K-means plot in Figure 12.5. We see somewhat similar closenesses among the sports.

Figure 12.11 clusters the individuals using complete linkage and single linkage, created using

par(mfrow=c(2,1))
dxs <- dist(sportsranks) # Gets Euclidean distances
lbl <- rep("",130) # Prefer no labels for the individuals
plclust(hclust(dxs),xlab="Complete linkage",sub="",labels=lbl)
plclust(hclust(dxs,method="single"),xlab="Single linkage",sub="",labels=lbl)
Figure 12.10: Clustering the sports, using complete linkage.
Figure 12.11: Clustering the individuals in the sports data, using complete linkage (top) and single linkage (bottom).
Complete linkage tends to favor similar-sized clusters, because by using the maximum distance, it is easier for two small clusters to get together than for anything to attach itself to a large cluster. Single linkage tends to favor a few large clusters, and the rest small, because the larger the cluster, the more likely it will be close to small clusters. These ideas are borne out in the plot, where complete linkage yields a more treey-looking dendrogram.
12.7 Exercises
Exercise 12.7.1. Show that |Σ_k| = c_k^p in (12.33) follows from (12.31) and (12.32).
Exercise 12.7.2. (a) Show that the EM algorithm, where we use the w_k^(i)'s in (12.40) as the estimate of P̂[Y = k | X = x_i], rather than that in (12.37), is the K-means algorithm of Section 12.1. [Note: You have to worry only about the mean in (12.38).] (b) Show that the limit as σ → 0 of P̂[Y = k | X = x_i] is indeed given in (12.40), if we use the f_k in (12.41) in (12.37).
Exercise 12.7.3 (Grades). This problem is to cluster the students in the grades data based on variables 2 through 6: homework, labs, inclass, midterms, and final. (a) Use K-means clustering for K = 2. (Use nstart=100, which is a little high, but makes sure everyone gets similar answers.) Look at the centers, and briefly characterize the clusters. Compare the men and women (variable 1, 0=Male, 1=Female) on which clusters they are in. (Be sure to take into account that there are about twice as many women as men.) Any differences? (b) Same question, for K = 3. (c) Same question, for K = 4. (d) Find the average silhouettes for the K = 2, 3 and 4 clusterings from parts (a), (b) and (c). Which K has the highest average silhouette? (e) Use soft K-means to find the K = 1, 2, 3 and 4 clusterings. Which K is best according to the BICs? (Be aware that the BICs in Mclust are the negatives of what we use.) Is it the same as for the best K-means clustering (based on silhouettes) found in part (d)? (f) For each of K = 2, 3, 4, compare the classifications of the data using regular K-means to that of soft K-means. That is, match the clusters produced by both methods for a given K, and count how many observations were differently clustered.
Exercise 12.7.4 (Diabetes). The R package mclust contains the data set diabetes [Reaven and Miller, 1979]. There are n = 145 subjects and four variables. The first variable (class) is a categorical variable indicating whether the subject has overt diabetes (my interpretation: symptoms are obvious), chemical diabetes (my interpretation: can only be detected through chemical analysis of the blood), or is normal (no diabetes). The other three variables are blood measurements: glucose, insulin, sspg. (a) First, normalize the three blood measurement variables so that they have means zero and variances 1:

blood <- scale(diabetes[,2:4])

(a) Use K-means to cluster the observations on the three normalized blood measurement variables for K = 1, 2, . . . , 9. (b) Find the gap statistics for the clusterings in part (a). To generate a random observation, use three independent uniforms, where their ranges coincide with the ranges of the three variables. So to generate a random data set xstar:
n <- nrow(blood)
p <- ncol(blood)
ranges <- apply(blood,2,range) # Obtains the mins and maxes
xstar <- NULL # To contain the new data set
for(j in 1:p) {
  xstar <- cbind(xstar,runif(n,ranges[1,j],ranges[2,j]))
}

Which K would you choose based on this criterion? (b) Find the average silhouettes for the clusterings found in part (a), except for K = 1. Which K would you choose based on this criterion? (c) Use model-based clustering, again with K = 1, . . . , 9. Which model and K has the best BIC? (d) For each of the three best clusterings in parts (a), (b), and (c), plot each pair of variables, indicating which cluster each point was assigned, as in Figure 12.7. Compare these to the same plots that use the class variable as the indicator. What do you notice? (e) For each of the three best clusterings, find the table comparing the clusters with the class variable. Which clustering was closest to the class variable? Why do you suppose that clustering was closest? (Look at the plots.)
Exercise 12.7.5 (Iris). This question applies model-based clustering to the iris data,
pretending we do not know which observations are in which species. (a) Do the
model-based clustering without any restrictions (i.e., use the defaults). Which model
and number K was best, according to BIC? Compare the clustering for this best model
to the actual species. (b) Now look at the BICs for the model chosen in part (a), but
for the various K's from 1 to 9. Calculate the corresponding estimated posterior
probabilities. What do you see? (c) Fit the same model, but with K = 3. Now
compare the clustering to the true species.
Exercise 12.7.6 (Grades). Verify the dissimilarity matrices in (12.47) and (12.49).
Exercise 12.7.7 (Soft drinks). The data set softdrinks has 23 people's rankings of 8
soft drinks: Coke, Pepsi, Sprite, 7-up, and their diet equivalents. Do a hierarchical
clustering on the drinks, so that the command is
hclust(dist(t(softdrinks2)))
then plot the tree with the appropriate labels. Describe the tree. Does the clustering
make sense?
Exercise 12.7.8 (Cereal). Exercise 1.9.19 presented the cereal data (in the R data ma-
trix cereal), finding the biplot. Do hierarchical clustering on the cereals, and on the
attributes. Do the clusters make sense? What else would you like to know from these
data? Compare the clusterings to the biplot.
Chapter 13
Principal Components and Related
Techniques
Data reduction is a common goal in multivariate analysis: one has too many variables, and wishes to reduce the number of them without losing much information. How to approach the reduction depends of course on the goal of the analysis. For example, in linear models, there are clear dependent variables (in the Y matrix) that we are trying to explain or predict from the explanatory variables (in the x matrix, and possibly the z matrix). Then Mallows' C_p or cross-validation are reasonable approaches. If the correlations between the Y's are of interest, then factor analysis is appropriate, where the likelihood ratio test is a good measure of how many factors to take. In classification, using cross-validation is a good way to decide on the variables. In model-based clustering, and in fact any situation with a likelihood, one can balance the fit and complexity of the model using something like AIC or BIC.

There are other situations in which the goal is not so clear cut as in those above; one is more interested in exploring the data, using data reduction to get a better handle on the data, in the hope that something interesting will reveal itself. The reduced data may then be used in more formal models, although I recommend first considering targeted reductions as mentioned in the previous paragraph, rather than immediately jumping to principal components.

Below we discuss principal components in more depth, then present multidimensional scaling, and canonical correlations.
13.1 Principal components, redux
Recall way back in Section 1.6 that the objective in principal component analysis was to find linear combinations (with norm 1) with maximum variance. As an exploratory technique, principal components can be very useful, as are other projection pursuit methods. The conceit underlying principal components is that variance is associated with interestingness, which may or may not hold. As long as one is in an exploratory mood, though, if one finds the top principal components are not particularly interesting or interpretable, then one can go in a different direction.

But be careful not to shift over to the notion that components with low variance can be ignored. It could very well be that they are the most important, e.g., most
correlated with a separate variable of interest. Using principal components as the first step in a process, where one takes the first few principal components to use in another procedure such as clustering or classification, may or may not work out well. In particular, it makes little sense to use principal components to reduce the variables before using them in a linear process such as regression, canonical correlations, or Fisher's linear discrimination. For example, in regression, we are trying to find the linear combination of x's that best correlates with y. Using principal components first on the x's will give us a few new variables that are linear combinations of x's, which we then further take linear combinations of to correlate with the y. What we end up with is a worse correlation than if we just started with the original x's, since some parts of the x's are left behind. The same thinking goes when using linear discrimination: We want the linear combination of x's that best distinguishes the groups, not the best linear combination of a few linear combinations of the x's. Because factor analysis tries to account for correlations among the variables, if one transforms to principal components, which are uncorrelated, before applying factor analysis, then there will be no common factors. On the other hand, if one is using nonlinear techniques such as classification trees, first reducing by principal components may indeed help.

Even principal components are not unique. E.g., you must choose whether or not to take into account covariates or categorical factors before finding the sample covariance matrix. You also need to decide how to scale the variables, i.e., whether to leave them in their original units, or scale so that all variables have the same sample variance, or scale in some other way. The scaling will affect the principal components, unlike in factor analysis or linear regression.
13.1.1 Example: Iris data
Recall that the Fisher/Anderson iris data (Section 1.3.1) has n = 150 observations and q = 4 variables. The measurements of the petals and sepals are in centimeters, so it is reasonable to leave the data unscaled. On the other hand, the variances of the variables do differ, so scaling so that each has unit variance is also reasonable. Furthermore, we could either leave the data unadjusted in the sense of subtracting the overall mean when finding the covariance matrix, or adjust the data for species by subtracting from each observation the mean of its species. Thus we have four reasonable starting points for principal components, based on whether we adjust for species and whether we scale the variables. Figure 13.1 has plots of the first two principal components for each of these possibilities. Note that there is a stark difference between the plots based on adjusted and unadjusted data. The unadjusted plots show a clear separation based on species, while the adjusted plots have the species totally mixed, which would be expected because there are differences in means between the species. Adjusting hides those differences. There are less obvious differences between the scaled and unscaled plots within adjusted/unadjusted pairs. For the adjusted data, the unscaled plot seems to have fairly equal spreads for the three species, while the scaled data has the virginica observations more spread out than the other two species.

The table below shows the sample variances, s², and first principal component loadings (sample eigenvector), PC_1, for each of the four sets of principal components:
Figure 13.1: Plots of the first two principal components for the iris data, depending on whether adjusting for species and whether scaling the variables to unit variance (panels: unadjusted/unscaled, unadjusted/scaled, adjusted/unscaled, adjusted/scaled). For the individual points, "s" indicates setosa, "v" indicates versicolor, and "g" indicates virginica.
                       Unadjusted                 Adjusted
                  Unscaled      Scaled       Unscaled      Scaled
                  s²    PC1    s²    PC1     s²    PC1    s²    PC1
Sepal Length      0.69  0.36   1     0.52    0.26  0.74   1     0.54
Sepal Width       0.19  0.08   1     0.27    0.11  0.32   1     0.47
Petal Length      3.12  0.86   1     0.58    0.18  0.57   1     0.53
Petal Width       0.58  0.36   1     0.56    0.04  0.16   1     0.45
                                                                (13.1)
Note that whether adjusted or not, the relative variances of the variables affect the relative weighting they have in the principal component. For example, for the unadjusted data, petal length has the highest variance in the unscaled data, and receives the highest loading in the eigenvector. That is, the first principal component is primarily petal length. But for the scaled data, all variables are forced to have the same variance, and now the loadings of the variables are much more equal.
[Figure 13.2 appears about here: two panels plotting Index versus Eigenvalue and Index versus log(ratio).]

Figure 13.2: The left-hand plot is a scree plot (i versus l_i) of the eigenvalues for the automobile data. The right-hand plot shows i versus log(l_i/l_{i+1}), the successive log-proportional gaps.
The opposite holds for sepal width. A similar effect is seen for the adjusted data. The sepal length has the highest unscaled variance and highest loading in PC1, and petal width the lowest variance and loading. But scaled, the loadings are approximately equal.
Any of the four sets of principal components is reasonable. Which to use depends on what one is interested in, e.g., if wanting to distinguish between species, the unadjusted plots are likely more interesting, while when interested in relations within species, adjusting makes sense. We mention that in cases where the units are vastly different for the variables, e.g., population in thousands and areas in square miles of cities, leaving the data unscaled is less defensible.
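For completeness, here is a minimal sketch (not the code used for the book's figures) of how the four sets of first principal component loadings in (13.1) might be computed in R. Signs of the eigenvectors, and the exact divisor used for the species-adjusted covariance, may differ from the table; pc1 and adj are hypothetical names.

# First PC loadings for the four combinations of adjusting/scaling, iris data.
y <- as.matrix(iris[, 1:4])
species <- iris[, 5]
pc1 <- function(y, scale.) {
  s <- var(if (scale.) scale(y) else scale(y, scale = FALSE))
  eigen(s)$vectors[, 1]                       # first principal component loadings
}
adj <- y - apply(y, 2, function(v) ave(v, species))   # subtract species means
round(cbind(unadj.unscaled = pc1(y, FALSE),   unadj.scaled = pc1(y, TRUE),
            adj.unscaled   = pc1(adj, FALSE), adj.scaled   = pc1(adj, TRUE)), 2)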
13.1.2 Choosing the number of principal components
One obvious question is, "How does one choose p?" Unfortunately, there is not any very good answer. In fact, it is probably not even a good question, because the implication of the question is that once we have p, we can proceed using just the first p principal components, and throw away the remainder. Rather, we take a more modest approach and ask "Which principal components seem to be worth exploring further?" A key factor is whether the component has a reasonable interpretation. Of course, nothing prevents you from looking at as many as you have time for.
The most common graphical technique for deciding on p is the scree plot, in which the sample eigenvalues are plotted versus their indices. (A scree is a pile of small stones at the bottom of a cliff.) Consider Example 12.3.1 on the automobile data, here using the n = 96 autos with trunk space, and all q = 11 variables. Scaling the variables so that they have unit sample variance, we obtain the sample eigenvalues

6.210, 1.833, 0.916, 0.691, 0.539, 0.279, 0.221, 0.138, 0.081, 0.061, 0.030.   (13.2)
The scree plot is the first one in Figure 13.2. Note that there is a big drop from the first to the second eigenvalue. There is a smaller drop to the third, then the values seem to level off. Other simple plots can highlight the gaps. For example, the second plot in the figure shows the logarithms of the successive proportional drops via

log(ratio_i) ≡ log(l_i / l_{i+1}).   (13.3)

The biggest drops are again from #1 to #2, and #2 to #3, but there are almost as large proportional drops at the fifth and tenth stages.
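A sketch of the two diagnostic plots, assuming the sample eigenvalues of (13.2) are placed in a vector l:

l <- c(6.210, 1.833, 0.916, 0.691, 0.539, 0.279, 0.221, 0.138, 0.081, 0.061, 0.030)
par(mfrow = c(1, 2))
plot(l, xlab = "Index", ylab = "Eigenvalue")                        # scree plot
plot(log(l[-length(l)] / l[-1]), xlab = "Index", ylab = "log(ratio)")  # (13.3)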
One may have outside information or requirements that aid in choosing the components. For example, there may be a reason one wishes a certain number of components (say, three if the next step is a three-dimensional plot), or to have as few components as possible in order to achieve a certain percentage (e.g., 95%) of the total variance. If one has an idea that the measurement error for the observed variables is c, then it makes sense to take just the principal components that have eigenvalue significantly greater than c². Or, as in the iris data, all the data are accurate just to one decimal place, so that taking c = 0.05 is certainly defensible.
To assess significance, assume that

U ∼ Wishart_q(ν, Σ), and S = (1/ν) U,   (13.4)

where ν > q and Σ is invertible. Although we do not necessarily expect this distribution to hold in practice, it will help develop guidelines to use. Let the spectral decompositions of S and Σ be

S = G L G′ and Σ = Γ Λ Γ′,   (13.5)

where G and Γ are orthogonal, and L and Λ are diagonal with nonincreasing diagonal elements (the eigenvalues), as in Theorem 1.1. The eigenvalues of S will be distinct with probability 1. If we assume that the eigenvalues of Σ are also distinct, then Theorem 13.5.1 in Anderson [1963] shows that for large ν, the sample eigenvalues are approximately independent, and l_i ≈ N(λ_i, 2λ_i²/ν). If components with λ_i ≤ c² are ignorable, then it is reasonable to ignore the l_i for which

(l_i − c²) / (l_i √(2/ν)) < 2, equivalently, l_i < c² / (1 − 2√(2/ν)).   (13.6)

(One may be tempted to take c = 0, but if any λ_i = 0, then the corresponding l_i will be zero as well, so that there is no need for hypothesis testing.) Other test statistics (or really guidance statistics) can be easily derived, e.g., to see whether the average of the k smallest eigenvalues is less than c², or whether the sum of the first p is greater than some other cutoff.
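A hypothetical helper (keep.components is not a function from the book) implementing the guideline (13.6):

# Which sample eigenvalues l exceed a measurement-error variance c^2
# "significantly," given nu degrees of freedom, per (13.6).
keep.components <- function(l, nu, c) {
  which(l >= c^2 / (1 - 2 * sqrt(2 / nu)))
}
# E.g., for the iris measurement error c = 0.05 with nu = 149:
# keep.components(eigen(var(as.matrix(iris[, 1:4])))$values, 149, 0.05)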
13.1.3 Estimating the structure of the component spaces
If the eigenvalues of Σ are distinct, then the spectral decomposition (13.5) splits the q-dimensional space into q orthogonal one-dimensional spaces. If, say, the first two eigenvalues are equal, then the first two subspaces are merged into one two-dimensional subspace. That is, there is no way to distinguish between the top two dimensions. At the extreme, if all eigenvalues are equal, in which case Σ is proportional to I_q, there is no statistically legitimate reason to distinguish any principal components. More generally, suppose there are K distinct values among the λ_i's, say

α_1 > α_2 > ··· > α_K,   (13.7)

where q_k of the λ_i's are equal to α_k:

λ_1 = ··· = λ_{q_1} = α_1,
λ_{q_1+1} = ··· = λ_{q_1+q_2} = α_2,
  ⋮
λ_{q_1+···+q_{K−1}+1} = ··· = λ_q = α_K.   (13.8)
Then the space is split into K orthogonal subspaces, of dimensions q_1, . . . , q_K, where q = q_1 + ··· + q_K. The vector (q_1, . . . , q_K) is referred to as the pattern of equalities among the eigenvalues. Let Γ be an orthogonal matrix containing eigenvectors as in (13.5), and partition it as

Γ = ( Γ_1  Γ_2  ···  Γ_K ),  Γ_k is q × q_k,   (13.9)

so that Γ_k contains the eigenvectors for the q_k eigenvalues that equal α_k. These are not unique because Γ_kJ for any q_k × q_k orthogonal matrix J will also yield a set of eigenvectors for those eigenvalues. The subspaces have corresponding projection matrices P_1, . . . , P_K, which are unique, and we can write

Σ = Σ_{k=1}^K α_k P_k, where P_k = Γ_kΓ_k′.   (13.10)

With this structure, the principal components can be defined only in groups, i.e., the first q_1 of them represent one group, which have higher variance than the next group of q_2 components, etc., down to the final q_K components. There is no distinction within a group, so that one would take either the top q_1 components, or the top q_1 + q_2, or the top q_1 + q_2 + q_3, etc.
Using the distributional assumption (13.4), we find the Bayes information criterion to choose among the possible patterns (13.8) of equality. The best set can then be used in plots such as in Figure 13.3, where the gaps will be either enhanced (if large) or eliminated (if small). The model (13.8) will be denoted M_{(q_1,...,q_K)}. Anderson [1963] (see also Section 12.5) shows the following.
Theorem 13.1. Suppose (13.4) holds, and S and Σ have spectral decompositions as in (13.5). Then the MLE of Σ under the model M_{(q_1,...,q_K)} is given by Σ̂ = GΛ̂G′, where the λ̂_i's are found by averaging the relevant l_i's:

λ̂_1 = ··· = λ̂_{q_1} = α̂_1 = (1/q_1)(l_1 + ··· + l_{q_1}),
λ̂_{q_1+1} = ··· = λ̂_{q_1+q_2} = α̂_2 = (1/q_2)(l_{q_1+1} + ··· + l_{q_1+q_2}),
  ⋮
λ̂_{q_1+···+q_{K−1}+1} = ··· = λ̂_q = α̂_K = (1/q_K)(l_{q_1+···+q_{K−1}+1} + ··· + l_q).   (13.11)
The number of free parameters is

d(q_1, . . . , q_K) = (1/2)(q² − Σ_{k=1}^K q_k²) + K.   (13.12)

The deviance can then be taken to be

deviance(M_{(q_1,...,q_K)}(Σ̂); S) = ν Σ_{i=1}^q log(λ̂_i) = ν Σ_{k=1}^K q_k log(α̂_k).   (13.13)

See Exercise 13.4.3. Using (13.12), we have

BIC(M_{(q_1,...,q_K)}) = ν Σ_{k=1}^K q_k log(α̂_k) + log(ν) d(q_1, . . . , q_K).   (13.14)
13.1.4 Example: Automobile data
Let S be the scaled covariance matrix for the automobiles with trunks, described in Section 12.3.1. Equation (13.2) and Figure 13.2 exhibit the eigenvalues of S; they are denoted l_j in (13.15). We first illustrate the model (13.8) with pattern (1, 1, 3, 3, 2, 1). The MLEs of the eigenvalues are then found by averaging the third through fifth, the sixth through eighth, the ninth and tenth, and leaving the others alone, denoted below by the λ̂_j's:

j      1      2      3      4      5
l_j    6.210  1.833  0.916  0.691  0.539
λ̂_j    6.210  1.833  0.716  0.716  0.716

j      6      7      8      9      10     11
l_j    0.279  0.221  0.138  0.081  0.061  0.030
λ̂_j    0.213  0.213  0.213  0.071  0.071  0.030
                                                (13.15)

With ν = n − 1 = 95,

deviance(M_{(1,1,3,3,2,1)}(Σ̂); S) = 95 Σ_j log(λ̂_j) = −1141.398,   (13.16)

and d(1, 1, 3, 3, 2, 1) = 54, hence

BIC(M_{(1,1,3,3,2,1)}) = −1141.398 + log(95) × 54 = −895.489.   (13.17)
Table 13.1 contains a number of models, one each for K from 1 to 11. Each pattern after the first was chosen to be the best that is obtained from the previous by summing two consecutive q_k's. The estimated BIC-based probabilities are also given in the table. Clearly, the preferred model is the one with MLE in (13.15). Note that the assumption (13.4) is far from holding here, both because the data are not normal, and because we are using a correlation matrix rather than a covariance matrix. We are hoping, though, that in any case, the BIC is a reasonable balance of the fit of the model on the eigenvalues and the number of parameters.
Pattern                            d     BIC        ΔBIC      P̂_BIC
(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)  66   −861.196    34.292    0.000
(1, 1, 1, 1, 1, 2, 1, 1, 1, 1)     64   −869.010    26.479    0.000
(1, 1, 1, 2, 2, 1, 1, 1, 1)        62   −876.650    18.839    0.000
(1, 1, 3, 2, 1, 1, 1, 1)           59   −885.089    10.400    0.004
(1, 1, 3, 2, 1, 2, 1)              57   −892.223     3.266    0.159
(1, 1, 3, 3, 2, 1)                 54   −895.489     0.000    0.812
(1, 1, 3, 3, 3)                    51   −888.502     6.987    0.025
(1, 4, 3, 3)                       47   −870.824    24.665    0.000
(1, 4, 6)                          37   −801.385    94.104    0.000
(5, 6)                             32   −657.561   237.927    0.000
(11)                                1      4.554   900.042    0.000

Table 13.1: The BICs for the sequence of principal component models for the automobile data.
[Figure 13.3 appears about here: two panels, "Sample" and "MLE," each plotting Index versus log(eigenvalue).]

Figure 13.3: Plots of j versus the sample l_j's, and j versus the MLEs λ̂_j's for the chosen model.
Figure 13.3 shows the scree plots, using logs, of the sample eigenvalues and the fitted ones from the best model. Note that the latter gives more aid in deciding how many components to choose because the gaps are enhanced or eliminated. That is, taking one or two components is reasonable, but because there is little distinction among the next three, one may as well take all or none of those three. Similarly with numbers six, seven and eight.
What about interpretations? Below we have the first five principal component loadings, multiplied by 100 then rounded off:
              PC1   PC2   PC3   PC4   PC5
Length         36    23     3     7    26
Wheelbase      37    20    11     6    20
Width          35    29    19    11     1
Height         25    41    41    10    10
FrontHd        19    30    68    45    16
RearHd         25    47     1    28     6
FrtLegRoom     10    49    30    69    37
RearSeating    30    26    43     2    18
FrtShld        37    16    20    11    10
RearShld       38     1     7    13    12
Luggage        28     6     2    43    81
                                       (13.18)
The first principal component has fairly equal positive loadings for all variables, indicating an overall measure of bigness. The second component tends to have positive loadings for tallness (height, front headroom, rear headroom), and negative loadings for the length and width-type variables. This component then measures tallness relative to length and width. The next three may be harder to interpret. Numbers 3 and 4 could be front seat versus back seat measurements, and number 5 is mainly luggage space. But from the analysis in (13.15), we have that from a statistical significance point of view, there is no distinction among the third through fifth components, that is, any rotation of them is equally important. Thus we might try a varimax rotation on the three vectors, to aid in interpretation. (See Section 10.3.2 for a description of varimax.) The R function varimax will do the job. The results are below:
              PC3*  PC4*  PC5*
Length          4    20    18
Wheelbase      11     0    21
Width          13    17     6
Height         32    29     1
FrontHd        81    15     7
RearHd         11    18    19
FrtLegRoom      4    83     6
RearSeating    42    10    18
FrtShld        12    22     1
RearShld        0    19     3
Luggage         4     8    92
                           (13.19)
These three components are easy to interpret, weighting heavily on front headroom,
front legroom, and luggage space, respectively.
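As a sketch of how such a rotation might be obtained, assuming eg contains the eigen decomposition of the scaled covariance matrix as in the "Using R" code below (signs and column order may differ from (13.19)):

rot <- varimax(eg$vectors[, 3:5])   # rotate the third through fifth loading vectors
round(100 * rot$loadings, 0)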
Figure 13.4 plots the first two principal components. The horizontal axis represents the size, going from the largest at the left to the smallest at the right. The vertical axis has tall/narrow cars at the top, and short/wide at the bottom. We also performed model-based clustering (Section 12.3) using just these two variables. The best clustering has two groups, whose covariance matrices have the same eigenvalues but different eigenvectors (EEV in (12.35)), indicated by the two ellipses, which have the same size and shape, but have different orientations. These clusters are represented in the plot as well. We see the clustering is defined mainly by the tall/wide variable.
[Figure 13.4 appears about here: a scatterplot of "Overall size" versus "Tall vs. wide."]

Figure 13.4: The first two principal component variables for the automobile data (excluding sports cars and minivans), clustered into two groups.
Using R
In Section 12.3.1 we created cars1, the reduced data set. To center and scale the data, so that the means are zero and variances are one, use

xcars <- scale(cars1)

The following obtains the eigenvalues and eigenvectors of S:

eg <- eigen(var(xcars))

The eigenvalues are in eg$values and the matrix of eigenvectors is in eg$vectors. To find the deviance and BIC for the pattern (1, 1, 3, 3, 2, 1) seen in (13.15) and (13.17), we use the function pcbic (detailed in Section A.5.1):

pcbic(eg$values,95,c(1,1,3,3,2,1))

In Section A.5.2 we present the function pcbic.stepwise, which uses the stepwise procedure to calculate the elements in Table 13.1:

pcbic.stepwise(eg$values,95)
13.1.5 Principal components and factor analysis
Factor analysis and principal components have some similarities and some differences. Recall the factor analysis model with p factors in (10.58). Taking the mean to be 0 for simplicity, we have

Y = Xβ + R,   (13.20)

where X and R are independent, with

X ∼ N(0, I_n ⊗ I_p) and R ∼ N(0, I_n ⊗ Ψ), Ψ diagonal.   (13.21)

For principal components, where we take the first p components, partition Γ and Λ in (13.5) as

Γ = ( Γ_1  Γ_2 ) and Λ = [ Λ_1  0 ; 0  Λ_2 ].   (13.22)

Here, Γ_1 is q × p, Γ_2 is q × (q − p), Λ_1 is p × p, and Λ_2 is (q − p) × (q − p), the Λ_k's being diagonal. The large eigenvalues are in Λ_1, the small ones are in Λ_2. Because I_q = ΓΓ′ = Γ_1Γ_1′ + Γ_2Γ_2′, we can write

Y = YΓ_1Γ_1′ + YΓ_2Γ_2′ = Xβ + R,   (13.23)

where

X = YΓ_1 ∼ N(0, I_n ⊗ Σ_X), β = Γ_1′, and R = YΓ_2Γ_2′ ∼ N(0, I_n ⊗ Σ_R).   (13.24)

Because Γ_1′Γ_2 = 0, X and R are again independent. We also have (Exercise 13.4.4)

Σ_X = Γ_1′ΣΓ_1 = Λ_1, and Σ_R = Γ_2Γ_2′ΣΓ_2Γ_2′ = Γ_2Λ_2Γ_2′.   (13.25)

Comparing these covariances to the factor analytic ones in (13.20), we see the following:

                        Σ_X    Σ_R
Factor analysis         I_p    Ψ
Principal components    Λ_1    Γ_2Λ_2Γ_2′
                                           (13.26)

The key difference is in the residuals. Factor analysis chooses the p-dimensional X so that the residuals are uncorrelated, though not necessarily small. Thus the correlations among the Y's are explained by the factors X. Principal components chooses the p-dimensional X so that the residuals are small (the variances sum to the sum of the (q − p) smallest eigenvalues), but not necessarily uncorrelated. Much of the variance of the Y is explained by the components X.
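The contrast can be seen numerically. Here is a small sketch (not from the book) comparing the two residual covariance structures in (13.26) for the iris correlation matrix with p = 1; the factor-analytic residual covariance is diagonal but not small, the principal component one is small but not diagonal:

r <- cor(iris[, 1:4])
fa <- factanal(covmat = r, factors = 1, n.obs = 150)
psi <- diag(fa$uniquenesses)                  # factor analysis: Psi (diagonal)
e <- eigen(r)
g2 <- e$vectors[, -1]                         # Gamma_2: last q - p eigenvectors
sigmaR <- g2 %*% diag(e$values[-1]) %*% t(g2) # principal components: Gamma_2 Lambda_2 Gamma_2'
round(psi, 2); round(sigmaR, 2)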
A popular model that fits into both frameworks is the factor analytic model (13.20) with the restriction that

Ψ = σ²I_q, σ² small.   (13.27)

The interpretation in principal components is that the X contains the important information in Y, while the residuals R contain just random measurement error. For factor analysis, we have that the X explains the correlations among the Y, and the residuals happen to have the same variances. In this case, we have

Σ = β′Σ_XXβ + σ²I_q.   (13.28)

Because β′ is q × p, there are at most p positive eigenvalues for β′Σ_XXβ. Call these λ_1 ≥ ··· ≥ λ_p, and let Λ_1 be the p × p diagonal matrix with diagonals λ_i. Then the spectral decomposition is

β′Σ_XXβ = Γ [ Λ_1  0 ; 0  0 ] Γ′   (13.29)

for some orthogonal Γ. But any orthogonal matrix contains eigenvectors for I_q = ΓΓ′, hence Γ is also an eigenvector matrix for Σ:

Σ = ΓΛΓ′ = Γ [ Λ_1 + σ²I_p  0 ; 0  σ²I_{q−p} ] Γ′.   (13.30)

Thus the eigenvalues of Σ are

λ_1 + σ² ≥ λ_2 + σ² ≥ ··· ≥ λ_p + σ² ≥ σ² = ··· = σ²,   (13.31)

and the eigenvectors for the first p eigenvalues are the columns of Γ_1. In this case the factor space and the principal component space are the same. In fact, if the λ_j are distinct and positive, the eigenvalues (13.30) satisfy the structural model (13.8) with pattern (1, 1, . . . , 1, q − p). A common approach to choosing p is to use hypothesis testing on such models to find the smallest p for which the model fits. See Anderson [1963] or Mardia, Kent, and Bibby [1979]. Of course, AIC or BIC could be used as well.
13.1.6 Justification of the principal component MLE, Theorem 13.1
We first find the MLE of Λ, and the maximal value of the likelihood, for U as in (13.4), where the eigenvalues of Σ satisfy (13.8). We know from (10.1) that the likelihood for S is

L(Σ; S) = |Σ|^{−ν/2} e^{−(ν/2) trace(Σ^{−1}S)}.   (13.32)

For the general model, i.e., where there is no restriction among the eigenvalues of Σ (M_{(1,1,...,1)}), the MLE of Σ is S.
Suppose there are nontrivial restrictions (13.8). Write Σ and S in their spectral decomposition forms (13.5) to obtain

L(Σ; S) = |Λ|^{−ν/2} e^{−(ν/2) trace((ΓΛΓ′)^{−1}GLG′)} = (∏ λ_i)^{−ν/2} e^{−(ν/2) trace(Λ^{−1}Γ′GLG′Γ)}.   (13.33)

Because of the multiplicities in the eigenvalues, the Γ is not uniquely determined from Σ, but any orthogonal matrix that maximizes the likelihood is adequate.
We start by fixing Λ, and maximizing the likelihood over the Γ. Set

F = G′Γ,   (13.34)
which is also orthogonal, and note that

−(1/2) trace(Λ^{−1}F′LF) = −(1/2) Σ_{i=1}^q Σ_{j=1}^q f_{ij}² l_i / λ_j = Σ_{i=1}^q Σ_{j=1}^q f_{ij}² l_i θ_j − (1/(2λ_q)) Σ_{i=1}^q l_i,   (13.35)

where

θ_j = −1/(2λ_j) + 1/(2λ_q).   (13.36)
(The 1/(2λ_q) is added because we need the θ_j's to be nonnegative in what follows.) Note that the last term in (13.35) is independent of F. We do summation by parts by letting

δ_i = θ_i − θ_{i+1}, d_i = l_i − l_{i+1}, 1 ≤ i ≤ q − 1, δ_q = θ_q, d_q = l_q,   (13.37)

so that

l_i = Σ_{k=i}^q d_k and θ_j = Σ_{m=j}^q δ_m.   (13.38)

Because the l_i's and λ_i's are nonincreasing in i, the l_i's are positive, and by (13.36) the θ_j's are also nonnegative, we have that the δ_i's and d_i's are all nonnegative. Using
(13.38) and interchanging the orders of summation, we have
q

i=1
q

j=1
f
2
ij
l
i

j
=
q

i=1
q

j=1
q

k=i
q

m=j
f
2
ij
d
k

m
=
q

k=1
q

m=1
d
k

m
k

i=1
m

j=1
f
2
ij
. (13.39)
Because F is an orthogonal matrix,
k

i=1
m

j=1
f
2
ij
mink, m. (13.40)
Also, with F = I
q
, the f
ii
= 1, hence expression in (13.40) is an equality. By the
nonnegativity of the d
k
s and
m
s, the sum in (13.39), hence in (13.35), is maximized
(though not uniquely) by taking F = I
q
. Working back, from (13.34), the (not neces-
sarily unique) maximizer over of (13.35) is

= G. (13.41)
Thus the maximum over Γ in the likelihood (13.33) is

L(GΛG′; S) = (∏ λ_i)^{−ν/2} e^{−(ν/2) Σ (l_i/λ_i)}.   (13.42)

Break up the product according to the equalities (13.8):

∏_{k=1}^K α_k^{−νq_k/2} e^{−(ν/2)(t_k/α_k)},   (13.43)

where

t_1 = Σ_{i=1}^{q_1} l_i and t_k = Σ_{i=q_1+···+q_{k−1}+1}^{q_1+···+q_k} l_i for 2 ≤ k ≤ K.   (13.44)

It is easy to maximize over each α_k in (13.43), which proves that (13.11) is indeed the MLE of the eigenvalues. Thus with (13.41), we have the MLE of Σ as in Theorem 13.1.
We give a heuristic explanation of the dimension (13.12) of the model M_{(q_1,...,q_K)} in (13.8). To describe the model, we need the K distinct parameters among the λ_i's, as well as the K orthogonal subspaces that correspond to the distinct values of λ_i. We start by counting the number of free parameters needed to describe an s-dimensional subspace of a t-dimensional space, s < t. Any such subspace can be described by a t × s basis matrix B, that is, the columns of B comprise a basis for the subspace. (See Section 5.2.) The basis is not unique, in that BA for any invertible s × s matrix A is also a basis matrix, and in fact any basis matrix equals BA for some such A. Take A to be the inverse of the top s × s submatrix of B, so that BA has I_s as its top s × s part. This matrix has (t − s) × s free parameters, represented in the bottom (t − s) × s part of it, and is the only basis matrix with I_s at the top. Thus the dimension is (t − s) × s. (If the top part of B is not invertible, then we can find some other subset of s rows to use.)
Now for model (13.8), we proceed stepwise. There are q_1 × (q_2 + ··· + q_K) parameters needed to specify the first q_1-dimensional subspace. Next, focus on the subspace orthogonal to that first one. It is (q_2 + ··· + q_K)-dimensional, hence to describe the second, q_2-dimensional, subspace within that, we need q_2 × (q_3 + ··· + q_K) parameters. Continuing, the total number of parameters is

q_1(q_2 + ··· + q_K) + q_2(q_3 + ··· + q_K) + ··· + q_{K−1}q_K = (1/2)(q² − Σ_{k=1}^K q_k²).   (13.45)

Adding K for the distinct λ_i's, we obtain the dimension in (13.12).
13.2 Multidimensional scaling
Given n objects and defined distances, or dissimilarities, between them, multidimensional scaling tries to mimic the dissimilarities as closely as possible by representing the objects in p-dimensional Euclidean space, where p is fairly small. That is, suppose o_1, . . . , o_n are the objects, and d(o_i, o_j) is the dissimilarity between the i-th and j-th objects, as in (12.19). Let Δ be the n × n matrix of the d²(o_i, o_j)'s. The goal is to find 1 × p vectors x_1, . . . , x_n so that

Δ_ij = d²(o_i, o_j) ≈ ‖x_i − x_j‖².   (13.46)

Then the x_i's are plotted in R^p, giving an approximate visual representation of the original dissimilarities.
There are a number of approaches. Our presentation here follows that of Mardia, Kent, and Bibby [1979], which provides more in-depth coverage. We will start with the case that the original dissimilarities are themselves Euclidean distances, and present the so-called classical solution. Next, we exhibit the classical solution when the distances may not be Euclidean. Finally, we briefly mention the nonmetric approach.
13.2.1 Δ is Euclidean: The classical solution
Here we assume that object o_i has associated a 1 × q vector y_i, and let Y be the n × q matrix with y_i as the i-th row. Of course, now Y looks like a regular data matrix. It might be that the objects are observations (people), or they are variables, in which case this Y is really the transpose of the usual data matrix. Whatever the case,

d²(o_i, o_j) = ‖y_i − y_j‖².   (13.47)

For any n × p matrix X with rows x_i, define Δ(X) to be the n × n matrix of ‖x_i − x_j‖²'s, so that (13.47) can be written Δ = Δ(Y).
The classical solution looks for x_i's in (13.46) that are based on rotations of the y_i's, much like principal components. That is, suppose B is a q × p matrix with
orthonormal columns, and set

x_i = y_iB, and X̂ = YB.   (13.48)

The objective is then to choose the B that minimizes

Σ_{1≤i<j≤n} | ‖y_i − y_j‖² − ‖x_i − x_j‖² |   (13.49)

over B. Exercise 13.4.9 shows that ‖y_i − y_j‖² ≥ ‖x_i − x_j‖², which means that the absolute values in (13.49) can be removed. Also, note that the sum over the ‖y_i − y_j‖²'s is independent of B, and by symmetry, minimizing (13.49) is equivalent to maximizing

Σ_{i=1}^n Σ_{j=1}^n ‖x_i − x_j‖² = 1_n′ Δ(X̂) 1_n.   (13.50)
The next lemma is useful in relating the Δ(Y) to the deviations.

Lemma 13.1. Suppose X is n × p. Then

Δ(X) = a1_n′ + 1_na′ − 2H_nXX′H_n,   (13.51)

where a is the n × 1 vector with a_i = ‖x_i − x̄‖², x̄ is the mean of the x_i's, and H_n is the centering matrix (1.12).

Proof. Write

‖x_i − x_j‖² = ‖(x_i − x̄) − (x_j − x̄)‖² = a_i − 2(H_nX)_i(H_nX)_j′ + a_j,   (13.52)

from which we obtain (13.51).
By (13.50) and (13.51) we have

Σ_{i=1}^n Σ_{j=1}^n ‖x_i − x_j‖² = 1_n′Δ(X̂)1_n = 1_n′(a1_n′ + 1_na′ − 2H_nXX′H_n)1_n = 2n 1_n′a = 2n Σ_{i=1}^n ‖x_i − x̄‖²,   (13.53)

because H_n1_n = 0. But then

Σ_{i=1}^n ‖x_i − x̄‖² = trace(X̂′H_nX̂) = trace(B′Y′H_nYB).   (13.54)

Maximizing (13.54) over B is a principal components task. That is, as in Lemma 1.3, this trace is maximized by taking B = G_1, the first p eigenvectors of Y′H_nY. To summarize:
Proposition 13.1. If Δ = Δ(Y), then the classical solution of the multidimensional scaling problem for given p is X̂ = YG_1, where the columns of G_1 consist of the first p eigenvectors of Y′H_nY.

If one is interested in the distances between variables, so that the distances of interest are in Δ(Y′) (note the transpose), then the classical solution uses the first p eigenvectors of YH_qY′.
13.2.2 Δ may not be Euclidean: The classical solution
Here, we are given only the n × n dissimilarity matrix Δ. The dissimilarities may or may not arise from Euclidean distance on vectors y_i, but the solution acts as if they do. That is, we assume there is an n × q matrix Y such that Δ = Δ(Y), but we do not observe the Y, nor do we know the dimension q. The first step in the process is to derive the Y from the Δ(Y), then we apply Proposition 13.1 to find the X̂.
It turns out that we can assume any value of q as long as it is larger than the values of p we wish to entertain. Thus we are safe taking q = n. Also, note that using (13.51), we can see that Δ(H_nX) = Δ(X), which implies that the sample mean of Y is indeterminate. Thus we may as well assume the mean is zero, i.e.,

H_nY = Y,   (13.55)

so that (13.51) yields

Δ = Δ(Y) = a1_n′ + 1_na′ − 2YY′.   (13.56)

To eliminate the a's, we can pre- and post-multiply by H_n:

H_nΔH_n = H_n(a1_n′ + 1_na′ − 2YY′)H_n = −2YY′,   (13.57)

hence

YY′ = −(1/2) H_nΔH_n.   (13.58)
Now consider the spectral decomposition (1.33) of YY′,

YY′ = JLJ′,   (13.59)

where the orthogonal J = (j_1, . . . , j_n) contains the eigenvectors, and the diagonal L contains the eigenvalues. Separating the matrices, we can take

Y = JL^{1/2} = ( √l_1 j_1  √l_2 j_2  ···  √l_n j_n ).   (13.60)

Now we are in the setting of Section 13.2.1, hence by Proposition 13.1, we need G_1, the first p eigenvectors of

Y′H_nY = (JL^{1/2})′(JL^{1/2}) = L^{1/2}J′JL^{1/2} = L.   (13.61)

But the matrix of eigenvectors is just I_n, hence we take the first p columns of Y in (13.60):

X̂ = ( √l_1 j_1  √l_2 j_2  ···  √l_p j_p ).   (13.62)

It could be that Δ is not Euclidean, that is, there is no Y for which Δ = Δ(Y). In this case, the classical solution uses the same algorithm as in equations (13.58) to (13.62). A possible glitch is that some of the eigenvalues may be negative, but if p is small, the problem probably won't raise its ugly head.
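A sketch of the steps (13.58)–(13.62) starting from a matrix delta of squared dissimilarities; R's built-in cmdscale implements the same classical solution starting from the (unsquared) distances:

classical.mds.delta <- function(delta, p = 2) {
  n <- nrow(delta)
  h <- diag(n) - matrix(1 / n, n, n)               # centering matrix H_n
  e <- eigen(-h %*% delta %*% h / 2)               # YY' as in (13.58)
  e$vectors[, 1:p] %*% diag(sqrt(e$values[1:p]))   # Xhat as in (13.62)
}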
13.2.3 Nonmetric approach
The original approach to multidimensional scaling attempts to find the X̂ that gives the same ordering of the observed dissimilarities, rather than trying to match the actual values of the dissimilarities. See Kruskal [1964]. That is, take the t ≡ (n choose 2) pairwise dissimilarities and order them:

d(o_{i_1}, o_{j_1}) ≤ d(o_{i_2}, o_{j_2}) ≤ ··· ≤ d(o_{i_t}, o_{j_t}).   (13.63)

The ideal would be to find the x_i's so that

‖x_{i_1} − x_{j_1}‖² ≤ ‖x_{i_2} − x_{j_2}‖² ≤ ··· ≤ ‖x_{i_t} − x_{j_t}‖².   (13.64)

That might not be (actually probably is not) possible for given p, so instead one finds the X̂ that comes as close as possible, where close is measured by some stress function. A popular stress function is given by

Stress²(X̂) = Σ_{1≤i<j≤n} (‖x_i − x_j‖² − d̂_ij)² / Σ_{1≤i<j≤n} ‖x_i − x_j‖⁴,   (13.65)

where the d̂_ij's are constants that have the same ordering as the original dissimilarities d(o_i, o_j)'s in (13.63), and among such orderings minimize the stress. See Johnson and Wichern [2007] for more details and some examples. The approach is nonmetric because it does not depend on the actual d's, but just their order.
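One hedged way to carry out such a nonmetric fit in R is the isoMDS function in the MASS package (its stress criterion is Kruskal's, which differs in detail from (13.65)); here d is assumed to hold the dissimilarities:

library(MASS)
fit <- isoMDS(d, k = 2)    # nonmetric scaling in p = 2 dimensions
plot(fit$points)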
13.2.4 Examples: Grades and sports
The examples here all start with Euclidean distance matrices, and use the classical solution, so everything is done using principal components.
In Section 12.6.1 we clustered the five variables for the grades data, (homework, labs, inclass, midterms, final), for the n = 107 students. Here we find the multidimensional scaling plot. The distance between two variables is the sum of squares of the difference in scores the people obtained for them. So we take the transpose of the data. The following code finds the plot.
ty <- t(grades[,2:6])
ty <- scale(ty,scale=F)
ev <- eigen(var(ty))$vectors[,1:2]
tyhat <- ty%*%ev
lm <- range(tyhat)*1.1
plot(tyhat,xlim=lm,ylim=lm,xlab="Var 1",ylab="Var 2",type="n")
text(tyhat,labels=dimnames(ty)[[1]])
To plot the variables' names rather than points, we first create the plot with no plotting: type="n". Then text plots the characters in the labels parameter, which we give as the names of the first dimension of ty. The results are in Figure 13.5. Notice how the inclass variable is separated from the others, homeworks and labs are fairly close, and midterms and final are fairly close together, not surprisingly given the clustering.
Figure 13.5 also has the multidimensional scaling plot of the seven sports from the Louis Roussos sports data in Section 12.1.1, found by substituting sportsranks for grades[,2:6] in the R code.
[Figure 13.5 appears about here.]

Figure 13.5: Multidimensional scaling plot of the grades variables (left) and the sports variables (right).
Notice that the first variable orders the sports according to how many people typically participate, i.e., jogging, swimming and cycling can be done solo, tennis needs two to four people, basketball has five per team, baseball nine, and football eleven. The second variable serves mainly to separate jogging from the others.
13.3 Canonical correlations
Testing the independence of two sets of variables in the multivariate normal distribution is equivalent to testing their covariance is zero, as in Section 10.2. When there is no independence, one may wish to know where the lack of independence lies. A projection-pursuit approach is to find the linear combination of the first set which is most highly correlated with a linear combination of the second set, hoping to isolate the factors within each group that explain a substantial part of the overall correlations.
The distributional assumption is based on partitioning the 1 × q vector Y into Y_1 (1 × q_1) and Y_2 (1 × q_2), with

Y = ( Y_1  Y_2 ) ∼ N(μ, Σ), where Σ = [ Σ_11  Σ_12 ; Σ_21  Σ_22 ],   (13.66)

Σ_11 is q_1 × q_1 and Σ_22 is q_2 × q_2. If α (q_1 × 1) and β (q_2 × 1) are coefficient vectors, then

Cov[Y_1α, Y_2β] = α′Σ_12β,
Var[Y_1α] = α′Σ_11α, and Var[Y_2β] = β′Σ_22β,   (13.67)

hence

Corr[Y_1α, Y_2β] = α′Σ_12β / √(α′Σ_11α · β′Σ_22β).   (13.68)
Analogous to principal component analysis, the goal in canonical correlation analysis is to maximize the correlation (13.68) over α and β. Equivalently, we could maximize the covariance α′Σ_12β in (13.68) subject to the two variances equaling one. This pair of linear combination vectors may not explain all the correlations between Y_1 and Y_2, hence we next find the maximal correlation over linear combinations uncorrelated with the first combinations. We continue, with each pair maximizing the correlation subject to being uncorrelated with the previous.
The precise definition is below. Compare it to Definition 1.2 for principal components.
Definition 13.1. Canonical correlations. Assume (Y_1, Y_2) are as in (13.66), where Σ_11 and Σ_22 are invertible, and set m = min{q_1, q_2}. Let α_1, . . . , α_m be a set of q_1 × 1 vectors, and β_1, . . . , β_m be a set of q_2 × 1 vectors, such that

(α_1, β_1) is any (α, β) that maximizes α′Σ_12β over α′Σ_11α = β′Σ_22β = 1;
(α_2, β_2) is any (α, β) that maximizes α′Σ_12β over α′Σ_11α = β′Σ_22β = 1, α′Σ_11α_1 = β′Σ_22β_1 = 0;
  ⋮
(α_m, β_m) is any (α, β) that maximizes α′Σ_12β over α′Σ_11α = β′Σ_22β = 1, α′Σ_11α_i = β′Σ_22β_i = 0, i = 1, . . . , m − 1.   (13.69)

Then δ_i ≡ α_i′Σ_12β_i is the i-th canonical correlation, and α_i and β_i are the associated canonical correlation loading vectors.
Recall that principal component analysis (Definition 1.2) led naturally to the spectral decomposition theorem (Theorem 1.1). Similarly, canonical correlation analysis will lead to the singular value decomposition (Theorem 13.2 below). We begin the canonical correlation analysis with some simplifications. Let

γ_i = Σ_11^{1/2} α_i and θ_i = Σ_22^{1/2} β_i   (13.70)

for each i, so that the γ_i's and θ_i's are sets of orthonormal vectors, and

δ_i = Corr[Y_1α_i, Y_2β_i] = γ_i′ Ξ θ_i,   (13.71)

where

Ξ = Σ_11^{−1/2} Σ_12 Σ_22^{−1/2}.   (13.72)

This matrix is a multivariate generalization of the correlation coefficient which is useful here, but I don't know exactly how it should be interpreted.
In what follows, we assume that q_1 ≥ q_2 = m. The q_1 < q_2 case can be handled similarly. The matrix Ξ′Ξ is a q_2 × q_2 symmetric matrix, hence by the spectral decomposition in (1.33), there is a q_2 × q_2 orthogonal matrix Θ and a q_2 × q_2 diagonal matrix Λ with diagonal elements λ_1 ≥ λ_2 ≥ ··· ≥ λ_{q_2} such that

Ξ′Ξ = ΘΛΘ′.   (13.73)

Let θ_1, . . . , θ_{q_2} denote the columns of Θ, so that the i-th column of ΞΘ is Ξθ_i. Then (13.73) shows these columns are orthogonal and have squared lengths equal to the λ_i's, i.e.,

‖Ξθ_i‖² = λ_i and (Ξθ_i)′(Ξθ_j) = 0 if i ≠ j.   (13.74)
Furthermore, because the θ_i's satisfy the equations for the principal components loading vectors in (1.28) with S = Ξ′Ξ,

‖Ξθ_i‖ = √λ_i maximizes ‖Ξθ‖ over ‖θ‖ = 1, θ′θ_j = 0 for j < i.   (13.75)

Now for the first canonical correlation, we wish to find unit vectors γ and θ to maximize γ′Ξθ. By Corollary 8.2, for fixed θ, the maximum over γ is when γ is proportional to Ξθ, hence

γ′Ξθ ≤ ‖Ξθ‖, equality achieved with γ = Ξθ/‖Ξθ‖.   (13.76)

But by (13.75), ‖Ξθ‖ is maximized when θ = θ_1, hence

γ′Ξθ ≤ ‖Ξθ_1‖ = √λ_1, γ_1 = Ξθ_1/‖Ξθ_1‖.   (13.77)

Thus the first canonical correlation δ_1 is √λ_1. (The γ_1 is arbitrary if λ_1 = 0.)
We proceed for i = 2, . . . , k, where k is the index of the last positive eigenvalue, i.e., λ_k > 0 = λ_{k+1} = ··· = λ_{q_2}. For the i-th canonical correlation, we need to find unit vectors θ orthogonal to θ_1, . . . , θ_{i−1}, and γ orthogonal to γ_1, . . . , γ_{i−1}, that maximize γ′Ξθ. Again by (13.75), the best θ is θ_i, and the best γ is proportional to Ξθ_i, so that

γ′Ξθ ≤ ‖Ξθ_i‖ = √λ_i ≡ δ_i, γ_i = Ξθ_i/‖Ξθ_i‖.   (13.78)

That this γ_i is indeed orthogonal to previous γ_j's follows from the second equation in (13.74).
If λ_i = 0, then γ′Ξθ_i = 0 for any γ. Thus the canonical correlations for i > k are δ_i = 0, and the corresponding γ_i's, i > k, can be any set of vectors such that γ_1, . . . , γ_{q_2} are orthonormal.
Finally, to find the α_i and β_i in (13.69), we solve the equations in (13.70):

α_i = Σ_11^{−1/2} γ_i and β_i = Σ_22^{−1/2} θ_i, i = 1, . . . , m.   (13.79)
Backing up a bit, we have almost obtained the singular value decomposition of Ξ. Note that by (13.78), δ_i = ‖Ξθ_i‖, hence we can write

ΞΘ = ( Ξθ_1  ···  Ξθ_{q_2} ) = ( δ_1γ_1  ···  δ_{q_2}γ_{q_2} ) = ΓΔ,   (13.80)

where Γ = (γ_1, . . . , γ_{q_2}) has orthonormal columns, and Δ is diagonal with δ_1, . . . , δ_{q_2} on the diagonal. Shifting the Θ to the other side of the equation, we obtain the following.

Theorem 13.2. Singular value decomposition. The q_1 × q_2 matrix Ξ can be written

Ξ = ΓΔΘ′,   (13.81)

where Γ (q_1 × m) and Θ (q_2 × m) have orthonormal columns, and Δ is an m × m diagonal matrix with diagonals δ_1 ≥ δ_2 ≥ ··· ≥ δ_m ≥ 0, where m = min{q_1, q_2}.
To summarize:

Corollary 13.1. Let (13.81) be the singular value decomposition of Ξ = Σ_11^{−1/2}Σ_12Σ_22^{−1/2} for model (13.66). Then for 1 ≤ i ≤ min{q_1, q_2}, the i-th canonical correlation is δ_i, with loading vectors α_i and β_i given in (13.79), where γ_i (θ_i) is the i-th column of Γ (Θ).
Next we present an example, where we use the estimate of Ξ to estimate the canonical correlations. Theorem 13.3 guarantees that the estimates are the MLEs.
13.3.1 Example: Grades
Return to the grades data. In Section 10.3.3, we looked at factor analysis, finding two main factors: an overall ability factor, and a contrast of homework and labs versus midterms and final. Here we lump in inclass assignments with homework and labs, and find the canonical correlations between the sets (homework, labs, inclass) and (midterms, final), so that q_1 = 3 and q_2 = 2. The Y is the matrix of residuals from the model (10.80). In R,

x <- cbind(1,grades[,1])
y <- grades[,2:6] - x%*%solve(t(x)%*%x,t(x))%*%grades[,2:6]
s <- t(y)%*%y/(nrow(y)-2)
corr <- cov2cor(s)
The final statement calculates the correlation matrix from the S, yielding
HW Labs InClass Midterms Final
HW 1.00 0.78 0.28 0.41 0.40
Labs 0.78 1.00 0.42 0.38 0.35
InClass 0.28 0.42 1.00 0.24 0.27
Midterms 0.41 0.38 0.24 1.00 0.60
Final 0.40 0.35 0.27 0.60 1.00
(13.82)
There are q_1 × q_2 = 6 correlations between variables in the two sets. Canonical correlations aim to summarize the overall correlations by the two δ_i's. The estimate of the matrix Ξ in (13.72) is given by

Ξ̂ = S_11^{−1/2} S_12 S_22^{−1/2} = [ 0.236  0.254
                                    0.213  0.146
                                    0.126  0.185 ],   (13.83)
found in R using

symsqrtinv1 <- symsqrtinv(s[1:3,1:3])
symsqrtinv2 <- symsqrtinv(s[4:5,4:5])
xi <- symsqrtinv1%*%s[1:3,4:5]%*%symsqrtinv2

where

symsqrtinv <- function(x) {
  ee <- eigen(x)
  ee$vectors%*%diag(sqrt(1/ee$values))%*%t(ee$vectors)
}
calculates the inverse symmetric square root of an invertible symmetric matrix x. The singular value decomposition function in R is called svd:

sv <- svd(xi)
a <- symsqrtinv1%*%sv$u
b <- symsqrtinv2%*%sv$v

The component sv$u is the estimate of Γ and the component sv$v is the estimate of Θ in (13.81). The matrices of loading vectors are obtained as in (13.79):
A = S_11^{−1/2} Γ̂ = [ 0.065  0.059
                      0.007  0.088
                      0.014  0.039 ],
and B = S_22^{−1/2} Θ̂ = [ 0.062  0.12
                          0.053  0.108 ].   (13.84)
The estimated canonical correlations (singular values) are in the vector sv$d, which are

d_1 = 0.482 and d_2 = 0.064.   (13.85)

The d_1 is fairly high, and d_2 is practically negligible. (See the next section.) Thus it is enough to look at the first columns of A and B. We can change signs, and take the first loadings for the first set of variables to be (0.065, 0.007, 0.014), which is primarily the homework score. For the second set of variables, the loadings are (0.062, 0.053), essentially a straight sum of midterms and final. Thus the correlations among the two sets of variables can be almost totally explained by the correlation between homework and the sum of midterms and final, which correlation is 0.45, almost the optimum of 0.48.
13.3.2 How many canonical correlations are positive?
One might wonder how many of the δ_i's are nonzero. We can use BIC to get an idea. The model is based on the usual

S = (1/ν) U, where U ∼ Wishart_q(ν, Σ),   (13.86)

ν ≥ q, with Σ partitioned as in (13.66), and S partitioned similarly. Let Ξ = ΓΔΘ′ in (13.81) be the singular value decomposition of Ξ as in (13.72). Then Model K (1 ≤ K ≤ m ≡ min{q_1, q_2}) is given by

M_K : δ_1 > δ_2 > ··· > δ_K > δ_{K+1} = ··· = δ_m = 0,   (13.87)

where the δ_i's are the canonical correlations, i.e., the diagonals of Δ. Let

S_11^{−1/2} S_12 S_22^{−1/2} = PDG′   (13.88)

be the sample analog of Ξ (on the left), and its singular value decomposition (on the right).
We first obtain the MLE of Σ under model K. Note that Σ and (Σ_11, Σ_22, Ξ) are in one-to-one correspondence. Thus it is enough to find the MLE of the latter set of parameters. The next theorem is from Fujikoshi [1974].
Theorem 13.3. For the above setup, the MLE of (Σ_11, Σ_22, Ξ) under model M_K in (13.87) is given by

(S_11, S_22, PD^{(K)}G′),   (13.89)

where D^{(K)} is the diagonal matrix with diagonals (d_1, d_2, . . . , d_K, 0, 0, . . . , 0).
That is, the MLE is obtained by setting to zero the sample canonical correlations that are set to zero in the model. One consequence of the theorem is that the natural sample canonical correlations and accompanying loading vectors are indeed the MLEs. The deviance, for comparing the models M_K, can be expressed as

deviance(M_K) = ν Σ_{i=1}^K log(1 − d_i²).   (13.90)

See Exercise 13.4.14.
For the BIC, we need the dimension of the model. The number of parameters for the Σ_ii's we know to be q_i(q_i + 1)/2. For Ξ, we look at the singular value decomposition (13.81):

Ξ = ΓΔ^{(K)}Θ′ = Σ_{i=1}^K δ_i γ_i θ_i′.   (13.91)

The dimension for the δ_i's is K. Only the first K of the γ_i's enter into the equation. Thus the dimension is the same as for principal components with K distinct eigenvalues, and the rest equal at 0, yielding pattern (1, 1, . . . , 1, q_1 − K), where there are K ones. Similarly, the θ_i's dimension is as for pattern (1, 1, . . . , 1, q_2 − K). Then by (13.45),

dim(Γ) + dim(Θ) + dim(Δ^{(K)}) = (1/2)(q_1² − K − (q_1 − K)²) + (1/2)(q_2² − K − (q_2 − K)²) + K = K(q − K).   (13.92)
Finally, we can take

BIC(M_K) = ν Σ_{k=1}^K log(1 − d_k²) + log(ν) K(q − K)   (13.93)

because the q_i(q_i + 1)/2 parts are the same for each model.
In the example, we have three models: K = 0, 1, 2. K = 0 means the two sets of variables are independent, which we already know is not true, and K = 2 is the unrestricted model. The calculations, with ν = 105, d_1 = 0.48226 and d_2 = 0.064296:

K    Deviance    dim    BIC        P̂_BIC
0      0          0      0         0.0099
1    −27.7949     4     −9.1791    0.9785
2    −28.2299     6     −0.3061    0.0116
                                   (13.94)

Clearly K = 1 is best, which is what we figured above.
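A sketch reproducing (13.94) from the two sample canonical correlations, with the BIC-based probabilities computed via the usual exp(−BIC/2) normalization:

d <- c(0.48226, 0.064296); nu <- 105; q <- 5
bic <- sapply(0:2, function(K) {
  dev <- if (K == 0) 0 else nu * sum(log(1 - d[1:K]^2))   # deviance (13.90)
  dev + log(nu) * K * (q - K)                             # BIC (13.93)
})
pbic <- exp(-bic / 2) / sum(exp(-bic / 2))
round(cbind(K = 0:2, BIC = bic, PBIC = pbic), 4)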
13.3.3 Partial least squares
A similar idea is to find the linear combinations of the variables to maximize the covariance, rather than the correlation:

Cov(Y_1a, Y_2b) = a′Σ_12b.   (13.95)

The process is the same as for canonical correlations, but we use the singular value decomposition of Σ_12 instead of Ξ. The procedure is called partial least squares, but it could have been called canonical covariances. It is an attractive alternative to canonical correlations when there are many variables and not many observations, in which case the estimates of Σ_11 and Σ_22 are not invertible.
13.4 Exercises
Exercise 13.4.1. In the model (13.4), find the approximate test for testing the null hypothesis that the average of the last k (k < q) eigenvalues is less than the constant c².
Exercise 13.4.2. Verify the expression for Σ in (13.10).
Exercise 13.4.3. Show that the deviance for the model in Theorem 13.1 is given by (13.13). [Hint: Start with the likelihood as in (13.32). Show that

trace(Σ̂^{−1}S) = Σ_{i=1}^q l_i/λ̂_i = q.   (13.96)

Argue you can then ignore the part of the deviance that comes from the exponent.]
Exercise 13.4.4. Verify (13.25). [Hint: First, show that Γ_1′Γ = (I_p 0) and Γ_2′Γ = (0 I_{q−p}).]
Exercise 13.4.5. Show that (13.30) follows from (13.28) and (13.29).
Exercise 13.4.6. Prove (13.40). [Hint: First, explain why Σ_{i=1}^k f_{ij}² ≤ 1 and Σ_{j=1}^m f_{ij}² ≤ 1.]
Exercise 13.4.7. Verify the equality in (13.42), and show that (13.11) does give the
maximizers of (13.43).
Exercise 13.4.8. Verify the equality in (13.45).
Exercise 13.4.9. Show that ‖y_i − y_j‖² ≥ ‖x_i − x_j‖² for y_i and x_i in (13.48). [Hint: Start by letting B_2 be any q × (q − p) matrix such that (B, B_2) is an orthogonal matrix. Then ‖y_i − y_j‖² = ‖(y_i − y_j)(B, B_2)‖² (why?), and by expanding, equals ‖(y_i − y_j)B‖² + ‖(y_i − y_j)B_2‖².]
Exercise 13.4.10. Verify (13.52) by expanding the second expression.
Exercise 13.4.11. In (13.80), verify that Ξθ_i = δ_iγ_i.
Exercise 13.4.12. For the canonical correlations situation in Corollary 13.1, let A = (α_1, . . . , α_m) and B = (β_1, . . . , β_m) be matrices whose columns are the loading vectors. Find the covariance matrix of the transformation

( Y_1A  Y_2B ) = ( Y_1  Y_2 ) [ A  0 ; 0  B ].   (13.97)

[It should depend on the parameters only through the δ_i's.]
Exercise 13.4.13. Given the singular value decomposition of Ξ in (13.81), find the spectral decompositions of Ξ′Ξ and of ΞΞ′. What can you say about the two matrices' eigenvalues? How are these eigenvalues related to the singular values in Δ?
Exercise 13.4.14. This exercise derives the deviance for the canonical correlation model in (13.87). Start with

−2 log(L(Σ̂; S)) = ν log(|Σ̂|) + ν trace(Σ̂^{−1}S)   (13.98)

for the likelihood in (13.32), where Σ̂ is the estimate given in Theorem 13.3. (a) Show that

Σ = [ Σ_11^{1/2}  0 ; 0  Σ_22^{1/2} ] [ I_{q_1}  Ξ ; Ξ′  I_{q_2} ] [ Σ_11^{1/2}  0 ; 0  Σ_22^{1/2} ].   (13.99)

(b) Letting C_K = PD^{(K)}G′ and C = PDG′ (= PD^{(m)}G′), show that

trace(Σ̂^{−1}S) = trace( [ I_{q_1}  C_K ; C_K′  I_{q_2} ]^{−1} [ I_{q_1}  C ; C′  I_{q_2} ] )
              = trace((I_{q_1} − C_KC_K′)^{−1}(I_{q_1} − C_KC′ + CC_K′ − C_KC_K′)) + trace(I_{q_2})
              = trace((I_{q_1} − C_KC_K′)^{−1}(I_{q_1} − C_KC_K′)) + trace(I_{q_2})
              = q.   (13.100)

[Hint: The first equality uses part (a). The second equality might be easiest to show by letting

H = [ I_{q_1}  −C_K ; 0  I_{q_2} ],   (13.101)

and multiplying the two large matrices by H on the left and H′ on the right. For the third equality, using orthogonality of the columns of G, show that C_KC′ = CC_K′ = C_KC_K′.] (c) Show that |Σ̂| = |S_11||S_22||I_{q_1} − C_KC_K′|, where C_K is given in part (b). [Hint: Recall (5.83).] (d) Show that |I_{q_1} − C_KC_K′| = ∏_{i=1}^K (1 − d_i²). (e) Use parts (b) through (d) to find an expression for (13.98), then argue that for comparing M_K's, we can take the deviance as in (13.90).
Exercise 13.4.15. Verify the calculation in (13.92).
Exercise 13.4.16 (Painters). The biplot for the painters data set (in the MASS package) was analyzed in Exercise 1.9.18. (a) Using the first four variables, without any scaling, find the sample eigenvalues l_i. Which seem to be large, and which small? (b) Find the pattern of the l_i's that has best BIC. What are the MLEs of the λ_i's for the best pattern? Does the result conflict with your answer to part (a)?
Exercise 13.4.17 (Spam). In Exercises 1.9.15 and 11.9.9, we found principal components for the spam data. Here we look for the best pattern of eigenvalues. Note that the data are far from multivariate normal, so the distributional aspects should not be taken too seriously. (a) Using the unscaled spam explanatory variables (1 through 57), find the best pattern of eigenvalues based on the BIC criterion. Plot the sample eigenvalues and their MLEs. Do the same, but for the logs. How many principal components is it reasonable to take? (b) Repeat part (a), but using the scaled data, scale(Spam[,1:57]). (c) Which approach yielded the more satisfactory answer? Was the decision to use ten components in Exercise 11.9.9 reasonable, at least for the scaled data?
Exercise 13.4.18 (Iris). This question concerns the relationships between the sepal measurements and petal measurements in the iris data. Let S be the pooled covariance matrix, so that the denominator is ν = 147. (a) Find the correlation between the sepal length and petal length, and the correlation between the sepal width and petal width. (b) Find the canonical correlation quantities for the two groups of variables {Sepal Length, Sepal Width} and {Petal Length, Petal Width}. What do the loadings show? Compare the d_i's to the correlations in part (a). (c) Find the BICs for the three models K = 0, 1, 2, where K is the number of nonzero δ_i's. What do you conclude?
Exercise 13.4.19 (Exams). Recall the exams data set (Exercise 10.5.18) has the scores of 191 students on four exams, the three midterms (variables 1, 2, and 3) and the final exam. (a) Find the canonical correlation quantities, with the three midterms in one group, and the final in its own group. Describe the relative weightings (loadings) of the midterms. (b) Apply the regular multiple regression model with the final as the Y and the three midterms as the X's. What is the correlation between the Y and the fit, Ŷ? How does this correlation compare to d_1 in part (a)? What do you get if you square this correlation? (c) Look at the ratios β̂_i/a_{i1} for i = 1, 2, 3, where β̂_i is the regression coefficient for midterm i in part (b), and a_{i1} is the first canonical correlation loading. What do you conclude? (d) Run the regression again, with the final still Y, but use just the one explanatory variable Xa_1. Find the correlation of Y and the Ŷ for this regression. How does it compare to that in part (b)? (e) Which (if either) yields a linear combination of the midterms that best correlates with the final, canonical correlation analysis or multiple regression? (f) Look at the three midterms' variances. What do you see? Find the regular principal components (without scaling) for the midterms. What are the loadings for the first principal component? Compare them to the canonical correlation loadings in part (a). (g) Run the regression again, with the final as the Y again, but with just the first principal component of the midterms as the sole explanatory variable. Find the correlation between Y and Ŷ here. Compare to the correlations in parts (b) and (d). What do you conclude?
Exercise 13.4.20 (States). This problem uses the matrix states, which contains several demographic variables on the 50 United States, plus D.C. We are interested in the relationship between crime variables and money variables:
Crime: Violent crimes per 100,000 people.
Prisoners: Number of people in prison per 10,000 people.
Poverty: Percentage of people below the poverty line.
Employment: Percentage of people employed.
Income: Median household income.
Let the first two variables be Y_1, and the other three be Y_2. Scale them to have mean zero and variance one:

y1 <- scale(states[,7:8])
y2 <- scale(states[,9:11])

Find the canonical correlations between the Y_1 and Y_2. (a) What are the two canonical correlations? How many of these would you keep? (b) Find the BICs for the K = 0, 1 and 2 canonical correlation models. Which is best? (c) Look at the loadings for the first canonical correlation, i.e., a_1 and b_1. How would you interpret these? (d) Plot the first canonical variables: Y_1a_1 versus Y_2b_1. Do they look correlated? Which observations, if any, are outliers? (e) Plot the second canonical variables: Y_1a_2 versus Y_2b_2. Do they look correlated? (f) Find the correlation matrix of the four canonical variables: (Y_1a_1, Y_1a_2, Y_2b_1, Y_2b_2). What does it look like? (Compare it to the result in Exercise 13.4.12.)
Appendix A
Extra R routines
These functions are very barebones. They do not perform any checks on the inputs, and are not necessarily efficient. You are encouraged to robustify and enhance any of them to your heart's content.
A.1 Estimating entropy
We present a simple method for estimating the best entropy. See Hyvärinen et al. [2001] for a more sophisticated approach, which is implemented in the R package fastICA [Marchini et al., 2010]. First, we need to estimate the negentropy (1.46) for a given univariate sample of n observations. We use the histogram as the density, where we take K bins of equal width d, where K is the smallest integer larger than log2(n) + 1 [Sturges, 1926]. Thus bin i is (b_{i−1}, b_i], i = 1, . . . , K, where b_i = b_0 + d·i, and b_0 and d are chosen so that (b_0, b_K] covers the range of the data. Letting p̂_i be the proportion of observations in bin i, the histogram estimate of the density g is

ĝ(x) = p̂_i/d if b_{i−1} < x ≤ b_i.   (A.1)

From (2.102) in Exercise 2.7.16, we have that the negative entropy (1.46) is

Negent(ĝ) = (1/2)(1 + log(2π(Var[I] + 1/12))) + Σ_{i=1}^K p̂_i log(p̂_i),   (A.2)

where I is the random variable with P[I = i] = p̂_i, hence

Var[I] = Σ_{i=1}^K i² p̂_i − (Σ_{i=1}^K i p̂_i)².   (A.3)

See Section A.1.1 for the R function we use to calculate this estimate.
For projection pursuit, we have our n × q data matrix Y, and wish to find first the q × 1 vector g_1 with norm 1 that maximizes the estimated negentropy of Yg_1. Next we look for the g_2 with norm 1 orthogonal to g_1 that maximizes the negentropy of Yg_2, etc. Then our rotation is given by the orthogonal matrix G = (g_1, g_2, . . . , g_q).
We need to parametrize the orthogonal matrices somehow. For q = 2, we can set

G(θ) = c_2(θ) ≡ [ cos(θ)  sin(θ) ; −sin(θ)  cos(θ) ].   (A.4)

Clearly each such matrix is orthogonal. As θ ranges from 0 to 2π, c_2(θ) ranges through half of the orthogonal matrices (those with determinant equal to +1), the other half obtainable by switching the minus sign from the sine term to one of the cosine terms. For our purposes, we need only to take 0 ≤ θ < π, since the other G's are obtained from that set by changing the sign on one or both of the columns. Changing signs does not affect the negentropy, nor the graph except for reflection around an axis. To find the best G(θ), we perform a simple line search over θ. See Section A.1.2.
For q = 3 we use Euler angles θ_1, θ_2, and θ_3, so that

G(θ_1, θ_2, θ_3) = c_3(θ_1, θ_2, θ_3) ≡ [ c_2(θ_3)  0 ; 0  1 ] [ 1  0 ; 0  c_2(θ_2) ] [ c_2(θ_1)  0 ; 0  1 ].   (A.5)

See Anderson et al. [1987] for similar parametrizations when q > 3. The first step is to find the G = (g_1, g_2, g_3) whose first column, g_1, achieves the maximum negentropy of Yg_1. Here it is enough to take θ_3 = 0, so that the left-hand matrix is the identity. Because our estimate of negentropy for Yg is not continuous in g, we use the simulated annealing option in the R function optim to find the optimal g_1. The second step is to find the best further rotation of the remaining variables, Y(g_2, g_3), for which we can use the two-dimensional procedure above. See Section A.1.3.
A.1.1 negent: Estimating negative entropy
Description: Calculates the histogram-based estimate (A.2) of the negentropy (1.46)
for a vector of observations. See Listing A.1 for the code.
Usage: negent(x,K=log2(length(x))+1)
Arguments:
x: The n-vector of observations.
K: The number of bins to use in the histogram.
Value: The value of the estimated negentropy.
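Since Listing A.1 is not reproduced in this extract, here is a minimal sketch consistent with the description and formula (A.2); the book's own listing is the authoritative version, and may differ in how the bins are formed.

negent <- function(x, K = ceiling(log2(length(x)) + 1)) {
  p <- as.vector(table(cut(x, K)) / length(x))   # bin proportions phat_i
  i <- seq_along(p)
  vI <- sum(i^2 * p) - sum(i * p)^2              # Var[I] as in (A.3)
  (1 + log(2 * pi * (vI + 1/12))) / 2 + sum(p[p > 0] * log(p[p > 0]))   # (A.2)
}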
A.1.2 negent2D: Maximizing negentropy for q = 2 dimensions
Description: Searches for the rotation that maximizes the estimated negentropy of
the first column of the rotated data, for q = 2 dimensional data. See Listing A.2 for
the code.
Usage: negent2D(y,m=100)
Arguments:
y: The n × 2 data matrix.
m: The number of angles (between 0 and π) over which to search.
Value: A list with the following components:
vectors: The 2 × 2 orthogonal matrix G that optimizes the negentropy.
values: Estimated negentropies for the two rotated variables. The largest is first.
A.1.3 negent3D: Maximizing negentropy for q = 3 dimensions
Description: Searches for the rotation that maximizes the estimated negentropy of the first column of the rotated data, and of the second variable fixing the first, for q = 3 dimensional data. The routine uses a random start for the function optim using the simulated annealing option SANN, hence one may wish to increase the number of attempts by setting nstart to an integer larger than 1. See Listing A.3 for the code.
Usage: negent3D(y,nstart=1,m=100,...)
Arguments:
y: The n × 3 data matrix.
nstart: The number of times to randomly start the search routine.
m: The number of angles (between 0 and π) over which to search to find the second variable.
. . .: Further optional arguments to pass to the optim function to control the simulated annealing algorithm.
Value: A list with the following components:
vectors: The 3 × 3 orthogonal matrix G that optimizes the negentropy.
values: Estimated negentropies for the three rotated variables, from largest to smallest.
A.2 Both-sides model
A.2.1 bothsidesmodel: Calculate the estimates
Description: For the both-sides model,

Y = xβz′ + R, R ∼ N(0, I_n ⊗ Σ_R),   (A.6)

where Y is n × q, x is n × p, and z is q × l, the function finds the least-squares estimates of β; the standard errors and t-statistics (with degrees of freedom) for the individual β_ij's; and the matrices C_x = (x′x)^{−1} and Σ_z of (6.5). The function requires that n ≥ p, and x′x and z′z be invertible. See Listing A.4 for the code.
Usage: bothsidesmodel(x,y,z)
Arguments:
x: An n × p design matrix.
y: The n × q matrix of observations.
z: A q × l design matrix.
Value: A list with the following components:
Betahat: The least-squares estimate of β.
se: The p × l matrix whose ij-th element is the standard error of the estimate of β_ij.
T: The p × l matrix whose ij-th element is the t-statistic based on the estimate of β_ij.
nu: The degrees of freedom for the t-statistics, ν = n - p.
Cx: The p × p matrix C_x.
Sigmaz: The q × q matrix Σ_z.
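A sketch on simulated data (the sizes and variable names are ours), assuming bothsidesmodel (Listing A.4) has been sourced:

set.seed(1)
n <- 20
x <- cbind(1, rbinom(n, 1, 0.5))        # n x 2 design: intercept and a group indicator
z <- cbind(1, c(-3, -1, 1, 3))          # 4 x 2 design over four time points: constant and linear
y <- matrix(rnorm(n * 4), ncol = 4)     # n x 4 pure-noise responses, for illustration only
fit <- bothsidesmodel(x, y, z)
fit$Betahat   # 2 x 2 matrix of estimated coefficients
fit$T         # t-statistics, each on fit$nu = n - 2 degrees of freedom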
A.2.2 bothsidesmodel.test: Test blocks of β are zero
Description: Performs tests of the null hypothesis H_0: β* = 0, where β* is a
block submatrix of β as in Section 7.2. An example is given in (7.11). The in-
put consists of the output from the bothsidesmodel function, plus vectors giving the
rows and columns of β to be tested. In the example, we set rows <- c(1,4,5) and
cols <- c(1,3). See Listing A.5 for the code.
Usage: bothsidesmodel.test(bsm,rows,cols)
Arguments:
bsm: The output of the bothsidesmodel function.
rows: The vector of rows to be tested.
cols: The vector of columns to be tested.
Value: A list with the following components:
Hotelling: A list with the components of the Lawley-Hotelling T² test (7.22):
T2: The T² statistic (7.19).
F: The F version (7.22) of the T² statistic.
df: The degrees of freedom for the F.
pvalue: The p-value of the F.
Wilks: A list with the components of the Wilks' Λ test (7.37):
lambda: The Λ statistic (7.35).
Chisq: The χ² version (7.37) of the statistic, using Bartlett's correction.
df: The degrees of freedom for the χ².
pvalue: The p-value of the χ².
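Continuing the simulated fit sketched in Section A.2.1, the call below tests whether the entire second row of β (the group effects) is zero. The trace helper tr is our assumption, supplied in case it is not already defined in the session:

tr <- function(a) sum(diag(a))          # trace, used inside bothsidesmodel.test
test <- bothsidesmodel.test(fit, rows = 2, cols = 1:2)
test$Hotelling$pvalue
test$Wilks$pvalue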
A.3 Classification
A.3.1 lda: Linear discrimination
Description: Finds the coefficients a_k and constants c_k for Fisher's linear discrimination
function d_k in (11.31) and (11.32). See Listing A.6 for the code.
Usage: lda(x,y)
Arguments:
x: The n × p data matrix.
y: The n-vector of group identities, assumed to be given by the numbers 1, . . . , K for
K groups.
Value: A list with the following components:
a: A p × K matrix, where column k contains the coefficients a_k for (11.31). The final
column is all zero.
c: The K-vector of constants c_k for (11.31). The final value is zero.
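A sketch using R's built-in iris data, assuming the lda function of Listing A.6 has been sourced (the name clashes with MASS::lda, so avoid attaching MASS in the same session):

x <- as.matrix(iris[, 1:4])
y <- rep(1:3, each = 50)                        # species coded 1, 2, 3
ld <- lda(x, y)
disc <- x %*% ld$a + rep(1, nrow(x)) %o% ld$c   # n x K matrix of d_k(x)
table(apply(disc, 1, which.max), y)             # confusion matrix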
A.3.2 qda: Quadratic discrimination
Description: The function returns the elements needed to calculate the quadratic
discrimination in (11.48). Use the output from this function in predict.qda (Section
A.3.3) to find the predicted groups. See Listing A.7 for the code.
Usage: qda(x,y)
Arguments:
x: The n × p data matrix.
y: The n-vector of group identities, assumed to be given by the numbers 1, . . . , K for
K groups.
Value: A list with the following components:
Mean: A K × p matrix, where row k contains the sample mean vector for group k.
Sigma: A K × p × p array, where Sigma[k,,] contains the sample covariance matrix
for group k, Σ_k.
c: The K-vector of constants c_k for (11.48).
A.3.3 predict.qda: Quadratic discrimination prediction
Description: The function uses the output from the function qda (Section A.3.2) and
a p-vector x, and calculates the predicted group for this x. See Listing A.8 for the
code.
Usage: predict.qda(qd,newx)
Arguments:
qd: The output from qda.
newx: A p-vector x whose components match the variables used in the qda function.
Value: A K-vector of the discriminant values d_k^Q(x) in (11.48) for the given x.
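A sketch using the iris data again, assuming qda (Listing A.7) and predict.qda (Listing A.8) have been sourced (these names clash with MASS::qda and its predict method, so keep MASS detached):

x <- as.matrix(iris[, 1:4])
y <- rep(1:3, each = 50)
qd <- qda(x, y)
pred <- apply(x, 1, function(xi) which.max(predict.qda(qd, xi)))
table(pred, y)   # confusion matrix from the quadratic discriminant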
A.4 Silhouettes for K-Means Clustering
A.4.1 silhouette.km: Calculate the silhouettes
This function is a bit different from the silhouette function in the cluster package
[Maechler et al., 2005].
Description: Find the silhouettes (12.13) for K-means clustering from the data and
the groups' centers. See Listing A.9 for the code.
Usage: silhouette.km(x,centers)
Arguments:
x: The n × p data matrix.
centers: The K × p matrix of centers (means) for the K clusters, row k being the
center for cluster k.
Value: The n-vector of silhouettes, indexed by the observations' indices.
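A sketch combining R's kmeans with silhouette.km (Listing A.9) on simulated data (the two clusters and their separation are our choice):

set.seed(1)
x <- rbind(matrix(rnorm(100), ncol = 2), matrix(rnorm(100, mean = 3), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 10)
sil <- silhouette.km(x, km$centers)
mean(sil)   # average silhouette width for K = 2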
A.4.2 sort.silhouette: Sort the silhouettes by group
Description: Sorts the silhouettes, first by group, then by value, preparatory to
plotting. See Listing A.10 for the code.
Usage: sort.silhouette(sil,clusters)
Arguments:
sil: The n-vector of silhouette values.
clusters: The n-vector of cluster indices.
Value: The n-vector of sorted silhouettes.
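Continuing the K-means sketch above, the silhouettes can be sorted by cluster and plotted:

ss <- sort.silhouette(sil, km$cluster)
plot(ss, type = "h", xlab = "Observation (sorted within cluster)", ylab = "Silhouette")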
A.5 Estimating the eigenvalues
We have two main functions, pcbic to find the MLE and BIC for a particular pattern,
and pcbic.stepwise, which uses a stepwise search to find a good pattern. The functions
pcbic.unite and pcbic.subpatterns are used by the main functions, and probably not of
much interest on their own.
A.5.1 pcbic: BIC for a particular pattern
Description: Find the BIC and MLE from a set of observed eigenvalues for a specific
pattern. See Listing A.11 for the code.
Usage: pcbic(eigenvals,n,pattern)
Arguments:
eigenvals: The q-vector of eigenvalues of the covariance matrix, in order from largest
to smallest.
n: The degrees of freedom in the covariance matrix.
pattern: The pattern of equalities of the eigenvalues, given by the K-vector
(q_1, ..., q_K) as in (13.8).
Value: A list with the following components:
lambdaHat: A q-vector containing the MLEs for the eigenvalues.
Deviance: The deviance of the model, as in (13.13).
Dimension: The dimension of the model, as in (13.12).
BIC: The value of the BIC for the model, as in (13.14).
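A sketch, assuming pcbic (Listing A.11) has been sourced; it scores the pattern (1, 2), that is, the largest eigenvalue distinct and the remaining two equal, for three variables of one iris species:

x <- as.matrix(iris[1:50, 1:3])       # 50 observations from one species, three variables
eg <- eigen(var(x))$values            # sample eigenvalues, largest first
pcbic(eg, n = 49, pattern = c(1, 2))  # 49 degrees of freedom in the covariance matrix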
A.5.2 pcbic.stepwise: Choosing a good pattern
Description: Uses the stepwise procedure described in Section 13.1.4 to find a pattern
for a set of observed eigenvalues with good BIC value. See Listing A.12 for the code.
Usage: pcbic.stepwise(eigenvals,n)
Arguments:
eigenvals: The q-vector of eigenvalues of the covariance matrix, in order from largest
to smallest.
n: The degrees of freedom in the covariance matrix.
Value: A list with the following components:
Patterns: A list of patterns, one for each value of length K.
BICs: A vector of the BICs for the above patterns.
BestBIC: The best (smallest) value among the BICs in BICs.
BestPattern: The pattern with the best BIC.
lambdaHat: A q-vector containing the MLEs for the eigenvalues for the pattern with
the best BIC.
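Continuing the sketch in Section A.5.1 (and assuming pcbic, pcbic.unite, pcbic.subpatterns, and pcbic.stepwise of Listings A.11 to A.14 have all been sourced):

out <- pcbic.stepwise(eg, n = 49)
out$Patterns      # the patterns visited, of lengths 3, 2, and 1
out$BestPattern   # the pattern with the smallest BIC
out$BestBIC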
A.5.3 Helper functions
The function pcbic.unite takes as arguments a pattern (q_1, ..., q_K), called pattern, and
an index i, called index1, where 1 ≤ i < K. It returns the pattern obtained by summing
q_i and q_{i+1}. See Listing A.13. The function pcbic.subpatterns (Listing A.14) takes the
arguments eigenvals, n, and pattern0 (as for pcbic in Section A.5.1), and returns the
best pattern and its BIC among the patterns obtainable by summing two consecutive
terms in pattern0 via pcbic.unite.
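As a small illustration of pcbic.unite (the pattern here is arbitrary):

pcbic.unite(c(2, 1, 1, 3), 2)   # merges the second and third blocks: c(2, 2, 3)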
Listing A.1: The function negent
negent <- function(x,K=ceiling(log2(length(x))+1)) {
  p <- table(cut(x,breaks=K))/length(x)
  sigma2 <- sum((1:K)^2*p)-sum((1:K)*p)^2
  p <- p[(p>0)]
  (1+log(2*pi*(sigma2+1/12)))/2+sum(p*log(p))
}
Listing A.2: The function negent2D
negent2D <- function(y,m=100) {
  thetas <- (1:m)*pi/m
  ngnt <- NULL
  for(theta in thetas) {
    x <- y%*%c(cos(theta),sin(theta))
    ngnt <- c(ngnt,negent(x))
  }
  i <- imax(ngnt)  # imax (index of the maximum) is assumed defined elsewhere; which.max gives the same result
  g <- c(cos(thetas[i]),sin(thetas[i]))
  g <- cbind(g,c(-g[2],g[1]))
  list(vectors = g,values = c(ngnt[i],negent(y%*%g[,2])))
}
Listing A.3: The function negent3D
negent3D <- function(y,nstart=1,m=100,...) {
  f <- function(thetas) {
    cs <- cos(thetas)
    sn <- sin(thetas)
    negent(y%*%c(cs[1],sn[1]*c(cs[2],sn[2])))
  }
  tt <- NULL
  nn <- NULL
  for(i in 1:nstart) {
    thetas <- runif(3)*pi
    o <- optim(thetas,f,method="SANN",control=list(fnscale=-1),...)
    tt <- rbind(tt,o$par)
    nn <- c(nn,o$value)
  }
  i <- imax(nn)  # the index of the best negentropy (imax as in Listing A.2)
  cs <- cos(tt[i,])
  sn <- sin(tt[i,])
  g.opt <- c(cs[1],sn[1]*cs[2],sn[1]*sn[2])
  g.opt <- cbind(g.opt,c(-sn[1],cs[1]*cs[2],sn[2]*cs[1]))
  g.opt <- cbind(g.opt,c(0,-sn[2],cs[2]))
  x <- y%*%g.opt[,2:3]
  n2 <- negent2D(x,m=m)
  g.opt[,2:3] <- g.opt[,2:3]%*%n2$vectors
  list(vectors=g.opt,values = c(nn[i],n2$values))
}
Listing A.4: The function bothsidesmodel
bothsidesmodel <- function(x,y,z) {
  if(is.vector(x)) x <- matrix(x,ncol=1)
  xpxi <- solve(t(x)%*%x)
  xx <- xpxi%*%t(x)
  zz <- solve(t(z)%*%z,t(z))
  b <- xx%*%y
  yr <- y-x%*%b
  b <- b%*%t(zz)
  yr <- yr%*%t(zz)
  df <- nrow(x)-ncol(x)
  sigmaz <- t(yr)%*%yr/df
  se <- sqrt(outer(diag(xpxi),diag(sigmaz),"*"))
  tt <- b/se
  list(Betahat = b, Se = se, T = tt, Cx = xpxi, Sigmaz = sigmaz, nu = nrow(x)-ncol(x))
}
Listing A.5: The function bothsidesmodel.test
bothsidesmodel.test <- function(bsm,rows,cols) {
  lstar <- length(cols)
  pstar <- length(rows)
  nu <- bsm$nu
  bstar <- bsm$Betahat[rows,cols]
  if(lstar==1) bstar <- matrix(bstar,ncol=1)
  if(pstar==1) bstar <- matrix(bstar,nrow=1)
  W.nu <- bsm$Sigmaz[cols,cols]
  B <- t(bstar)%*%solve(bsm$Cx[rows,rows])%*%bstar
  t2 <- tr(solve(W.nu)%*%B)  # tr (trace) is assumed defined elsewhere, e.g. tr <- function(a) sum(diag(a))
  f <- (nu-lstar+1)*t2/(lstar*pstar*nu)
  df <- c(lstar*pstar,nu-lstar+1)
  W <- W.nu*nu
  lambda <- det(W)/det(W+B)
  chis <- -(nu-(lstar-pstar+1)/2)*log(lambda)
  Hotelling <- list(T2 = t2, F = f, df = df, pvalue = 1-pf(f,df[1],df[2]))
  Wilks <- list(Lambda = lambda, Chisq = chis, df = df[1], pvalue = 1-pchisq(chis,df[1]))
  list(Hotelling = Hotelling, Wilks = Wilks)
}
Listing A.6: The function lda
lda <- function(x,y) {
  if(is.vector(x)) {x <- matrix(x,ncol=1)}
  K <- max(y)
  p <- ncol(x)
  n <- nrow(x)
  m <- NULL
  v <- matrix(0,ncol=p,nrow=p)
  for(k in 1:K) {
    xk <- x[y==k,]
    if(is.vector(xk)) {xk <- matrix(xk,ncol=1)}
    m <- rbind(m,apply(xk,2,mean))
    v <- v + var(xk)*(nrow(xk)-1)
  }
  v <- v/n
  phat <- table(y)/n
  ck <- NULL
  ak <- NULL
  vi <- solve(v)
  for(k in 1:K) {
    c0 <- -(1/2)*(m[k,]%*%vi%*%m[k,]-m[K,]%*%vi%*%m[K,]) +
      log(phat[k]/phat[K])
    ck <- c(ck,c0)
    a0 <- vi%*%(m[k,]-m[K,])
    ak <- cbind(ak,a0)
  }
  list(a = ak, c = ck)
}
Listing A.7: The function qda
qda <- function(x,y) {
  K <- max(y)
  p <- ncol(x)
  n <- nrow(x)
  m <- NULL
  v <- array(0,c(K,p,p))
  ck <- NULL
  phat <- table(y)/n
  for(k in 1:K) {
    xk <- x[y==k,]
    m <- rbind(m,apply(xk,2,mean))
    nk <- nrow(xk)
    v[k,,] <- var(xk)*(nk-1)/nk
    ck <- c(ck,-log(det(v[k,,]))+2*log(phat[k]))
  }
  list(Mean = m, Sigma = v, c = ck)
}
Listing A.8: The function predict.qda
predict.qda <- function(qd,newx) {
  newx <- c(newx)
  disc <- NULL
  K <- length(qd$c)
  for(k in 1:K) {
    dk <- -t(newx-qd$Mean[k,])%*%
      solve(qd$Sigma[k,,],newx-qd$Mean[k,])+qd$c[k]
    disc <- c(disc,dk)
  }
  disc
}
Listing A.9: The function silhouette.km
silhouette.km <- function(x,centers) {
  dd <- NULL
  k <- nrow(centers)
  for(i in 1:k) {
    xr <- sweep(x,2,centers[i,],"-")
    dd <- cbind(dd,apply(xr^2,1,sum))
  }
  dd <- apply(dd,1,sort)[1:2,]
  (dd[2,]-dd[1,])/dd[2,]
}
Listing A.10: The function sort.silhouette
sort.silhouette <- function(sil,cluster) {
  ss <- NULL
  ks <- sort(unique(cluster))
  for(k in ks) {
    ss <- c(ss,sort(sil[cluster==k]))
  }
  ss
}
Listing A.11: The function pcbic
pcbic <- function(eigenvals,n,pattern) {
  p <- length(eigenvals)
  l <- eigenvals
  k <- length(pattern)
  istart <- 1
  for(i in 1:k) {
    iend <- istart+pattern[i]
    l[istart:(iend-1)] <- mean(l[istart:(iend-1)])
    istart <- iend
  }
  dev <- n*sum(log(l))
  dimen <- (p^2-sum(pattern^2))/2 + k
  bic <- dev + log(n)*dimen
  list(lambdaHat = l, Deviance = dev, Dimension = dimen, BIC = bic)
}
Listing A.12: The function pcbic.stepwise
pcbic.stepwise <- function(eigenvals,n) {
  k <- length(eigenvals)
  p0 <- rep(1,k)
  b <- rep(0,k)
  pb <- vector("list",k)
  pb[[1]] <- p0
  b[1] <- pcbic(eigenvals,n,p0)$BIC
  for(i in 2:k) {
    psb <- pcbic.subpatterns(eigenvals,n,pb[[i-1]])
    b[i] <- min(psb$bic)
    pb[[i]] <- psb$pattern[,psb$bic==b[i]]
  }
  ib <- (1:k)[b==min(b)]
  list(Patterns = pb, BICs = b,
       BestBIC = b[ib], BestPattern = pb[[ib]],
       LambdaHat = pcbic(eigenvals,n,pb[[ib]])$lambdaHat)
}
Listing A.13: The function pcbic.unite
pcbic.unite <- function(pattern,index1) {
  k <- length(pattern)
  if(k==1) return(pattern)
  if(k==2) return(sum(pattern))
  if(index1==1) return(c(pattern[1]+pattern[2],pattern[3:k]))
  if(index1==k-1) return(c(pattern[1:(k-2)],pattern[k-1]+pattern[k]))
  c(pattern[1:(index1-1)],pattern[index1]+pattern[index1+1],pattern[(index1+2):k])
}
Listing A.14: The function pcbic.subpatterns
pcbic.subpatterns <- function(eigenvals,n,pattern0) {
  b <- NULL
  pts <- NULL
  k <- length(pattern0)
  if(k==1) return(F)
  for(i in 1:(k-1)) {
    p1 <- pcbic.unite(pattern0,i)
    b2 <- pcbic(eigenvals,n,p1)
    b <- c(b,b2$BIC)
    pts <- cbind(pts,p1)
  }
  list(bic=b,pattern=pts)
}
Bibliography
Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions
on Automatic Control, 19:716-723, 1974.
E. Anderson. The irises of the Gaspé Peninsula. Bulletin of the American Iris Society,
59:25, 1935.
E. Anderson. The species problem in iris. Annals of the Missouri Botanical Garden, 23:
457 509, 1936.
T. W. Anderson. Asymptotic theory for principal component analysis. The Annals of
Mathematical Statistics, 34:122148, 1963.
T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, NY, third
edition, 2003.
T. W. Anderson, I. Olkin, and L. G. Underhill. Generation of random orthogonal
matrices. SIAM Journal on Scientific and Statistical Computing, 8:625-629, 1987.
Steen Andersson. Invariant normal models. Annals of Statistics, 3:132-154, 1975.
David R. Appleton, Joyce M. French, and Mark P. J. Vanderpump. Ignoring a covariate:
An example of Simpson's paradox. The American Statistician, 50(4):340-341, 1996.
Robert B. Ash. Basic Probability Theory. John Wiley and Sons Inc.,
http://www.math.uiuc.edu/~r-ash/BPT.html, 1970.
Daniel Asimov. The grand tour: A tool for viewing multidimensional data. SIAM
Journal on Scientific and Statistical Computing, 6:128-143, 1985.
M. S. Bartlett. A note on multiplying factors for various χ² approximations. Journal
of the Royal Statistical Society B, 16:296-298, 1954.
M. S. Bartlett. A note on tests of significance in multivariate analysis. Mathematical
Proceedings of the Cambridge Philosophical Society, 35:180-185, 1939.
Alexander Basilevsky. Statistical Factor Analysis and Related Methods: Theory and Appli-
cations. John Wiley & Sons, 1994.
J. Berger, J. Ghosh, and N. Mukhopadhyay. Approximations to the Bayes factor in
model selection problems and consistency issues. Technical report, 1999. URL
http://www.stat.duke.edu/~berger/papers/00-10.html.
Peter J. Bickel and Kjell A. Doksum. Mathematical Statistics: Basic Ideas and Selected
Topics, Volume I. Prentice Hall, second edition, 2000.
G. E. P. Box and Mervin E. Muller. A note on the generation of random normal
deviates. Annals of Mathematical Statistics, 29(2):610611, 1958.
Leo Breiman, Jerome Friedman, Charles J. Stone, and R. A. Olshen. Classification and
Regression Trees. CRC Press, 1984.
Andreas Buja and Daniel Asimov. Grand tour methods: An outline. In D. M. Allen,
editor, Computer Science and Statistics: Proceedings of the 17th Symposium on the Inter-
face, pages 6367, New York and Amsterdam, 1986. Elsevier/North-Holland.
T. K. Chakrapani and A. S. C. Ehrenberg. An alternative to factor analysis in market-
ing research part 2: Between group analysis. Professional Marketing Research Society
Journal, 1:3238, 1981.
Herman Chernoff. The use of faces to represent points in k-dimensional space graph-
ically. Journal of the American Statistical Association, 68:361368, 1973.
Vernon M. Chinchilli and Ronald K. Elswick. A mixture of the MANOVA and
GMANOVA models. Communications in Statistics: Theory and Methods, 14:3075
3089, 1985.
Ronald Christensen. Plane Answers to Complex Questions: The Theory of Linear Models.
Springer-Verlag Inc, third edition, 2002.
J. W. L. Cole and James E. Grizzle. Applications of multivariate analysis of variance
to repeated measurements experiments. Biometrics, 22:810828, 1966.
Consumers Union. Body dimensions. Consumer Reports, April 286 288, 1990.
D. Cook and D. F. Swayne. Interactive and Dynamic Graphics for Data Analysis. Springer,
2007.
DASL Project. Data and Story Library. Cornell University, Ithaca, NY, 1996. URL
http://lib.stat.cmu.edu/DASL.
M. Davenport and G. Studdert-Kennedy. Miscellanea: The statistical analysis of aes-
thetic judgment: An exploration. Applied Statistics, 21(3):324333, 1972.
Bill Davis and J. Jerry Uhl. Matrices, Geometry & Mathematica. Calculus & Mathemat-
ica. Math Everywhere, Inc., 1999.
Edward J. Dudewicz and Thomas G. Ralley. The Handbook of Random Number Genera-
tion and Testing with TESTRAND Computer Code. American Sciences Press, 1981.
R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7:179 188, 1936.
Chris Fraley and Adrian Raftery. mclust: Model-based clustering / normal mixture
modeling, 2010. URL http://cran.r-project.org/web/packages/mclust/index.html.
R package version 3.4.8.
Chris Fraley and Adrian E Raftery. Model-based clustering, discriminant analysis,
and density estimation. Journal of the American Statistical Association, 97(458):611
631, 2002.
A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL
http://archive.ics.uci.edu/ml/.
Yasunori Fujikoshi. The likelihood ratio tests for the dimensionality of regression
coefficients. J. Multivariate Anal., 4:327-340, 1974.
K. R. Gabriel. Biplot display of multivariate matrices for inspection of data and
diagnosis. In V. Barnett, editor, Interpreting Multivariate Data, pages 147-173. John
Wiley & Sons, London, 1981.
Leon J. Gleser and Ingram Olkin. Linear models in multivariate analysis. In Essays in
Probability and Statistics, pages 276292. Wiley, New York, 1970.
A. K. Gupta and D. G. Kabe. On Mallows' C_p for the GMANOVA model under
double linear restrictions on the regression parameter matrix. Journal of the Japan
Statistical Society, 30(2):253-257, 2000.
P. R. Halmos. Measure Theory. The University Series in Higher Mathematics. Van
Nostrand, 1950.
Kjetil Halvorsen. ElemStatLearn: Data sets, functions and examples from the
book: The Elements of Statistical Learning, Data Mining, Inference, and Prediction
by Trevor Hastie, Robert Tibshirani and Jerome Friedman., 2009. URL http://
cran.r-project.org/web/packages/ElemStatLearn/index.html. Material from
the book's webpage and R port and packaging by Kjetil Halvorsen; R package
version 0.1-7.
D. J. Hand and C. C. Taylor. Multivariate Analysis of Variance and Repeated Measures: A
Practical Approach for Behavioural Scientists. Chapman & Hall Ltd, 1987.
Harry Horace Harman. Modern Factor Analysis. University of Chicago Press, 1976.
Hikaru Hasegawa. On Mallows' C_p in the multivariate linear regression model under
the uniform mixed linear constraints. Journal of the Japan Statistical Society, 16:16,
1986.
Trevor Hastie, Robert Tibshirani, and J. H. Friedman. The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. Springer-Verlag Inc, second edition, 2009.
URL http://www-stat.stanford.edu/~tibs/ElemStatLearn/.
Claire Henson, Claire Rogers, and Nadia Reynolds. Always Coca-Cola. Technical
report, University Laboratory High School, Urbana, IL, 1996.
Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for
nonorthogonal problems. Technometrics, 12(1):pp. 5567, 1970.
Robert V. Hogg, Joseph W. McKean, and A. T. Craig. Introduction to Mathematical
Statistics. Prentice Hall, sixth edition, 2004.
Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. Spam data.
Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304, 1999.
Peter J. Huber. Projection pursuit (C/R: P475-525). The Annals of Statistics, 13:435475,
1985.
Clifford M. Hurvich and Chih-Ling Tsai. Regression and time series model selection
in small samples. Biometrika, 76:297307, 1989.
Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis.
Wiley-Interscience, May 2001.
IBM. System/360 scientific subroutine package, version III, programmer's manual,
program number 360A-CM-03X. In Manual GH20-0205-4, White Plains, NY, 1970.
International Business Machines Corporation.
Richard Arnold Johnson and Dean W. Wichern. Applied Multivariate Statistical Analy-
sis. Pearson Prentice-Hall Inc, sixth edition, 2007.
Takeaki Kariya. Testing in the Multivariate General Linear Model. Kinokuniya, 1985.
Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
J. B. Kruskal. Non-metric multidimensional scaling. Psychometrika, 29:1-27, 115-129,
1964.
A. M. Kshirsagar. Bartlett decomposition and Wishart distribution. Annals of Mathe-
matical Statistics, 30(1):239241, 1959.
Anant M. Kshirsagar. Multivariate Analysis. Marcel Dekker, 1972.
S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of
Mathematical Statistics, 22(1):79-86, 1951.
D. N. Lawley and A. E. Maxwell. Factor Analysis As a Statistical Method. Butterworth
and Co Ltd, 1971.
Yann LeCun. Generalization and network design strategies. Technical Report CRG-
TR-89-4, Department of Computer Science, University of Toronto, 1989. URL http:
//yann.lecun.com/exdb/publis/pdf/lecun-89t.pdf.
E. L. Lehmann and George Casella. Theory of Point Estimation. Springer-Verlag Inc,
second edition, 1998.
Martin Maechler, Peter Rousseeuw, Anja Struyf, and Mia Hubert. Cluster analysis
basics and extensions, 2005. URL
http://cran.r-project.org/web/packages/cluster/index.html. Rousseeuw et al.
provided the S original which has been ported to R by Kurt Hornik and has since
been enhanced by Martin Maechler.
C. L. Mallows. Some comments on C_p. Technometrics, 15(4):661-675, 1973.
J. L. Marchini, C. Heaton, and B. D. Ripley. fastICA: FastICA algorithms to perform
ICA and projection pursuit, 2010. URL
http://cran.r-project.org/web/packages/fastICA/index.html. R package version 1.1-13.
K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press,
London, 1979.
Kenny J. Morris and Robert Zeppa. Histamine-induced hypotension due to morphine
and arfonad in the dog. Journal of Surgical Research, 3(6):313317, 1963.
I. Olkin and S. N. Roy. On multivariate distribution theory. Annals of Mathematical
Statistics, 25(2):329339, 1954.
NBC Olympics. http://www.2008.nbcolympics.com, 2008.
K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical
Magazine, 2(6):559-572, 1901.
Michael D. Perlman and Lang Wu. On the validity of the likelihood ratio and maxi-
mum likelihood methods. J. Stat. Plann. Inference, 117(1):5981, 2003.
Richard F. Potthoff and S. N. Roy. A generalized multivariate analysis of variance
model useful especially for growth curve problems. Biometrika, 51:313326, 1964.
J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers
Inc., San Francisco, 1993.
R Development Core Team. The Comprehensive R Archive Network, 2010. URL
http://cran.r-project.org/.
C. R. Rao. Linear Statistical Inference and Its Applications. Wiley, New York, 1973.
G. M. Reaven and R. G. Miller. An attempt to define the nature of chemical diabetes
using a multidimensional analysis. Diabetologia, 16:17-24, 1979.
Brian Ripley. tree: Classification and regression trees, 2010. URL
http://cran.r-project.org/web/packages/tree/. R package version 1.0-28.
J. Rousseauw, J. Plessis, A. Benade, P. Jordaan, J. Kotze, P. Jooste, and J. Ferreira.
Coronary risk factor screening in three rural communities. South African Medical
Journal, 64:430 436, 1983.
Peter Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of
cluster analysis. J. Comput. Appl. Math., 20(1):5365, 1987.
Henry Scheffé. The Analysis of Variance. John Wiley & Sons, 1999.
Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:
461464, 1978.
George W. Snedecor and William G. Cochran. Statistical Methods. Iowa State Univer-
sity Press, Ames, Iowa, eighth edition, 1989.
C. Spearman. General intelligence, objectively determined and measured. American
Journal of Psychology, 15:201293, 1904.
David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin, and Angelika van der Linde.
Bayesian measures of model complexity and fit (Pkg: P583-639). Journal of the Royal
Statistical Society, Series B: Statistical Methodology, 64(4):583-616, 2002.
H. A. Sturges. The choice of a class interval. Journal of the American Statistical Associa-
tion, 21(153):6566, 1926.
A. Thomson and R. Randall-MacIver. Ancient Races of the Thebaid. Oxford University
Press, 1905.
TIBCO Software Inc. S-Plus. Palo Alto, CA, 2009. URL http://www.tibco.com.
R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data
set via the gap statistic. Journal of the Royal Statistical Society B, 63:411 423, 2001.
W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, New York,
fourth edition, 2002.
J H Ware and R E Bowden. Circadian rhythm analysis when output is collected at
intervals. Biometrics, 33(3):566571, 1977.
Sanford Weisberg. Applied Linear Regression. John Wiley & Sons, third edition, 2005.
Wikipedia. List of breakfast cereals, Wikipedia, The Free Encyclopedia, 2011. URL
http://en.wikipedia.org/w/index.php?title=List_of_breakfast_cereals&oldid=405905817.
[Online; accessed 7-January-2011].
Wikipedia. Pterygomaxillary fissure, Wikipedia, The Free Encyclopedia, 2010. URL
http://en.wikipedia.org/w/index.php?title=Pterygomaxillary_fissure&oldid=360046832.
[Online; accessed 7-January-2011].
John Wishart. The generalised product moment distribution in samples from a normal
multivariate population. Biometrika, 20A:32-52, 1928.
Peter Wolf and Uni Bielefeld. aplpack: Another Plot PACKage, 2010. URL http:
//cran.r-project.org/web/packages/aplpack/index.html. R package version
1.2.3.
John W. Wright, editor. The Universal Almanac. Andrews McMeel Publishing, Kansas
City, MO, 1997.
Gary O. Zerbe and Richard H. Jones. On application of growth curve techniques to
time series data. Journal of the American Statistical Association, 75:507509, 1980.
Index
The italic page numbers are references to Exercises.
afne transformation, 4041
covariance matrix, 40
denition, 40
mean, 40
multivariate normal distribution,
4952
conditional mean, 56
data matrix, 5354
Akaike information criterion (AIC), 126,
158
equality of covariance matrices
iris data, 210211
factor analysis, 181
motivation, 161
multivariate regression, 161162
Bartletts test, 174
Bayes information criterion (BIC), 158
160
as posterior probability, 159
both-sides model
caffeine data, 167
clustering, 240
equality of covariance matrices
iris data, 210211
factor analysis, 181
principal components
automobile data, 259262
Bayes theorem, 39, 159
multivariate normal distribution,
64
Bayesian inference, 39
binomial, 4344
conjugate prior, 43
multivariate normal distribution
mean, 6465
mean and covariance, 147149
regression parameter, 109110
Wishart distribution, 146147
biplot, 1314
cereal data, 2324
decathlon data, 24
painters data, 23
sports data, 1314
both-sides model, 7380
a.k.a. generalized multivariate anal-
ysis of variance (GMANOVA)
model, 80
caffeine data, 81
estimation of coefcients, 8990
bothsidesmodel (R routine), 283
284
caffeine data, 110111
covariance matrix, 103106
expected value, 103
grades data, 111
histamine in dogs data, 108
109
mouth size data, 106108
prostaglandin data, 110
Students t, 106
estimation of covariance matrix,
105
ts and residuals, 104105
covariance matrix, 104105
distribution, 104
estimation, 104
expected value, 104
multivariate normal distribution,
105
growth curve model
births data, 7678
mouth size data, 107
histamine in dogs data, 8182
hypothesis testing
approximate χ² tests, 113-114
bothsidesmodel.test (R routine),
284
histamine in dogs data, 120
Hotelling's T² and projection pursuit, 139-140
Hotelling's T² test, 116, 138-139
Lawley-Hotelling trace test, 114
117
linear restrictions, 120121
mouth size data, 114, 118120
Pillai trace test, 118
prostaglandin data, 131
Roys maximum root test, 118
Wilks' Λ, 117
iid model, 7578
independence of coefcients and
covariance estimator, 106
intraclass correlation structure
mouth size data, 191194
least squares, 8990
Mallows' C_p
caffeine data, 133
mouth size data, 129-130
mouth size data, 7980
parity data, 8283
prostaglandin data, 80
pseudo-covariates
histamine in dogs data, 132
133
canonical correlations, 175, 270275
Bayes information criterion, 274
275
grades data, 275
compared to partial least squares,
276
exams data, 278
grades data, 273274
iris data, 278
states data, 278279
Cauchy-Schwarz inequality, 135136
Chernoffs faces, 23
iris data, 5
planets data, 23
spam data, 22
sports data, 2223
classication, 199224, 253
Bayes classier, 201
spam data, 224225
classier, 201
estimation, 202
cross-validation, 205206
leave out more than one, 206
leave-one-out, 206, 208
discriminant function, 204
linear, 204
error, 201, 206
cross-validation estimate, 206
observed, 206, 208
Fishers linear discrimination, 203
210
crabs data, 225
denition, 204
heart disease data, 227228
iris data, 207210
lda (R routine), 285
modication, 211212
zipcode data, 226227
Fishers quadratic discrimination,
210211
denition, 210
iris data, 210211
predict.qda (R routine), 286
qda (R routine), 285
illustration, 202
logistic regression, 212217
heart disease data, 226
iris data, 214
spam data, 214217
new observation, 201
subset selection
iris data, 208210
trees, 217224
Akaike information criterion, 222
Bayes information criterion, 222
C5.0, 220
Categorization and Regression
Trees (CART), 220
crabs data, 225226
cross-validation, 224
deviance, 221
heart disease data, 218224
pruning, 222
snipping, 219
spam data, 227
tree (R routine), 221
clustering, 199, 229251
classier, 229
density of data, 201
hierarchical, 230, 247251
average linkage, 248
cereal data, 252
complete linkage, 248
dissimilarities, 239
grades data, 247249, 252
hclust (R routine), 249
plclust (R routine), 249
single linkage, 248
soft drink data, 252
sports data, 249251
K-means, 230238
algorithm, 230
diabetes data, 251252
gap statistics, 231232, 237
grades data, 251
kmeans (R routine), 236237
objective function, 230
relationship to EM algorithm,
246
silhouette.km (R routine), 286
silhouettes, 233235, 237238
sort.silhouette (R routine), 286
sports data, 230233, 236238
K-medoids, 238240
dissimilarities, 238
grades data, 239240
objective function, 239
pam (R routine), 239
silhouettes, 233, 239
model-based, 229, 230, 240246,
253
automobile data, 240243
Bayes information criterion, 240
241, 243
classier, 240
EM algorithm, 245246
iris data, 252
likelihood, 240
Mclust (R routine), 241
mixture density, 240
multivariate normal model, 240
plotting
sports data, 233235
soft K-means, 246247
sports data, 247
conditional probability, 30
smoking data, 4546
contrasts, 122
convexity, 20
correlation inequality, 136
covariance (variance-covariance) matrix
collection of random variables, 34
generalized variance, 146
multivariate normal distribution,
4951
data matrix, 53
of two vectors, 3435
principal components, 11
sample, 8
sum of squares and cross prod-
uct matrix, 11
covariance models, 171191
factor analysis, see separate entry
invariant normal models, see sym-
metry models
symmetry models, see separate en-
try
testing conditional independence
grades data, 196
testing equality, 172174
mouth size data, 195196
testing independence, 174175
grades data, 196
testing independence of blocks of
variables, 270
cross-validation, see classication
data examples
automobiles
entropy, 24
model-based clustering, 240243
principal components, 256257,
259262
births, 76
growth curve models, 7678
caffeine, 81
Bayes information criterion, 167
both-sides model, 81, 110111,
133
orthogonalization, 101
cereal, 23
biplot, 2324
cereals
hierarchical clustering, 252
crabs, 225
classication, 225226
decathlon, 24
biplot, 24
factor analysis, 197198
diabetes, 251
K-means clustering, 251252
elections, 23
principal components, 23
exams, 196
canonical correlations, 278
covariance models, 196197
factor analysis, 196
Fisher-Anderson iris data, 5
Akaike information criterion, 210
211
Bayes information criterion, 210
211
canonical correlations, 278
Chernoffs faces, 5
classication, 199
Fishers linear discrimination,
207210
Fishers quadratic discrimina-
tion, 210211
logistic regression, 214
model-based clustering, 252
principal components, 12, 254
256
projection pursuit, 1618
rotations, 10, 24
scatter plot matrix, 5
subset selection in classication,
208210
grades, 70
Bayes information criterion, 184
both-sides model, 111
canonical correlations, 273275
covariance models, 196
factor analysis, 182185, 196
hierarchical clustering, 247249,
252
K-means clustering, 251
K-medoids clustering, 239240
multidimensional scaling, 269
multivariate regression, 7071
testing conditional independence,
177178
testing equality of covariance
matrices, 173
testing independence, 175
Hewlett-Packard spam data, 22
Chernoffs faces, 22
classication, 224225, 227
logistic regression, 214217
principal components, 22, 278
histamine in dogs, 71
both-sides model, 8182, 108
109, 120, 132133
multivariate analysis of variance,
7173
multivariate regression, 131132
leprosy, 82
covariates, 121124
multivariate analysis of variance,
82
multivariate regression, 167168
orthogonalization, 101
Louis Roussos sports data, 13
biplot, 1314
Chernoffs faces, 2223
hierarchical clustering, 249251
K-means clustering, 230233, 236
238
multidimensional scaling, 269
270
plotting clusters, 233235
principal components, 1314
soft K-means clustering, 247
mouth sizes, 71
both-sides model, 7980, 106
108, 114, 118120, 191194
covariance models, 195196
intraclass correlations, 191194
Mallows' C_p, 129-130
model selection, 163165
multivariate regression, 71, 130
131
pseudo-covariates, 125
painters, 23
biplot, 23
principal components, 277278
parity, 82
both-sides model, 8283
planets, 2
Chernoffs faces, 23
star plot, 23
prostaglandin, 80
both-sides model, 80, 110, 131
RANDU
rotations, 2425
skulls, 80
multivariate regression, 80, 110,
131, 168
orthogonalization, 101
smoking, 45
conditional probability, 4546
soft drinks, 252
hierarchical clustering, 252
South African heart disease, 197
classication, 226228
factor analysis, 197
trees, 218224
states, 278
canonical correlations, 278279
zipcode, 226
classication, 226227
data matrix, 2
planets data, 2
data reduction, 253
density, see probability distributions
deviance, see likelihood
dissimilarities, 238, 239
distributions, see probability distribu-
tions
eigenvalues and eigenvectors, 12, see
also principal components
positive and nonnegative denite
matrices, 6162
uniqueness, 21
EM algorithm, 240, 245246
entropy and negentropy, see projection
pursuit
Euler angles, 282
expected value, 3233
correlation coefcient, 33
covariance, 33
covariance (variance-covariance) ma-
trix
of matrix, 34
of two vectors, 34
mean, 33
of collection, 33
of matrix, 34
of vector, 34
variance, 33
factor analysis, 171, 178185, 253
Akaike information criterion, 181
Bartletts renement, 183
Bayes information criterion, 181
grades data, 184, 196
compared to principal components,
262264
decathlon data, 197198
estimation, 179180
exams data, 196
factanal (R routine), 182
goodness-of-t, 183
grades data, 182185
heart disease data, 197
likelihood, 180
model, 178
model selection, 180181
rotation of factors, 181, 184
scores, 181
estimation, 184
structure of covariance matrix, 178
uniquenesses, 182
varimax, 181
Fishers linear discrimination, see clas-
sication
Fishers quadratic discrimination, see
classication
t, 87
gamma function, 43
generalized multivariate analysis of vari-
ance model (GMANOVA), see
both-sides model
glyphs, 23
Chernoffs faces, 23
star plot, 23
grand tour, 9
Hausdorff distance, 248
hypothesis testing, 156
conditional independence, 176178
grades data, 177178
likelihood ratio statistic, 176, 177
equality of covariance matrices, 172
174
Bartletts test, 174
grades data, 173
likelihood ratio statistic, 172, 174
F test, 116
Hotelling's T² test, 116
distribution, 138-139
null distribution, 116
independence of blocks of vari-
ables, 174175
as symmetry model, 187
grades data, 175
likelihood ratio statistic, 175
Lawley-Hotelling trace test, 114
117
likelihood ratio test, 156157
asymptotic null distribution, 156
Pillai trace test, 118
Roys maximum root test, 118
symmetry models, 191
likelihood ratio statistic, 191
Wilks' Λ, 117
Jensens inequality, 20, 2122
least squares, 8889
denition, 88
equation, 88
normal equations, 88
projection, 88
regression, 44
least squares estimation
both-sides model, 8990
distribution, 103109
likelihood, 151165
denition, 151
deviance
Akaike and Bayes information
criteria, 158
canonical correlations, 275
denition, 158
equality of covariance matrices,
211
factor analysis, 181
likelihood ratio statistic, 158
multivariate regrssion, 162
observed, 158
prediction, 161
principal components, 259, 262
symmetry models, 191
trees, 221
likelihood ratio test, 156157
asymptotic null distribution, 156
covariance models, 171
equality of covariance matrices,
172174
factor analysis, 180181
independence, 174175
multivariate regression, 157
statistic, 156, 158
symmetry models, 191
maximum likelihood estimation,
see separate entry
multivariate regression, 152
principal components, 264266
Wishart, 171
linear discrimination, Fishers, see clas-
sication
linear models, 6783, 253
both-sides model, see separate en-
try
covariates, 121125, 154
adjusted estimates, 123
general model, 124
leprosy data, 121124
denition, 9091
estimation, 103111
Gram-Schmidt orthogonalization,
9193
growth curve model, 74, 76
hypothesis testing, 113133
hypothesis tests, see also both-sides
model
least squares, see separate entry
least squares estimation, see sep-
arate entry
linear regression, 6769
analysis of covariance, 68
analysis of variance, 68
cyclic model, 69
model, 67
polynomial model, 69
simple, 68
model selection, 125130
multivariate analysis of variance,
121, see separate entry
multivariate regression, see sepa-
rate entry
orthogonal polynomials, 9596
prediction, 125130
big model, 126
estimator of prediction sum of
squares, 128
expected value of prediction sum
of squares, 128
Mallows' C_p, 127-130
Mallows' C_p in univariate regression, 129
Mallows' C_p, definition, 128
prediction sum of squares, 127
submodel, 126
pseudo-covariates, 125
mouth size data, 125
repeated measures, 74
ridge regression estimator, 110
linear subspace
basis
denition, 87
existence, 100
denition, 85
dimension, 87
least squares, see separate entry
linear independence, 86
orthogonal, 87
orthogonal complement, 89
projection
denition, 87
properties, 8788
projection matrix, see matrix: pro-
jection
span
denition, 85
matrix representation, 86
transformation, 86
linear transformations, 8
logit, 213
Mallows' C_p, 157, 161
both-sides model, 126130
caffeine data, 133
denition, 128
expected prediction sum of squares,
127
prediction sum of squares, 127
prostaglandin data, 131
linear regression, 129
matrix
centering, 7, 20
decompositions
Bartletts, 142
Cholesky, 9495
QR, 9394, 141143
singular value, 272273
spectral, see separate entry
determinant
Cholesky decomposition, 100
decomposing covariance, 99
group, 186
complex normal, 195
compound symmetry, 188
denition, 93
orthogonal, 142, 143, 186
permutation, 188
upper triangular, 94
upper unitriangular, 93
idempotent, 20
denition, 8
in Wishart distribution, 58, 65,
105
Kronecker product, 62
projection, 89, 97
spectral decomposition, 58
identity, 7
inverse
block, 99
Kronecker product
denition, 53
properties, 54, 62
orthogonal, 186
orthogonal polynomials, 9596
positive and nonnegative denite,
51
and eigenvalues, 6162
projection, 8788, 127, 258
idempotent, 89, 97
properties, 89
square root, 50
Cholesky decomposition, 95
symmetric, 51
maximum likelihood estimation, 151
156
both-sides model, 153155
covariance matrix, 155
denition, 152
multivariate normal covariance ma-
trix, 153
multivariate regression, 152153
coefcients, 152
covariance matrix, 153
symmetry models, 189190
mixture models, 199203
conditional probability of predic-
tor, 200
data, 199
density of data, 200
illustration, 200
marginal distribution of predic-
tor, 200
marginal probability of group, 200
model selection, 210211
Akaike information criterion, see
separate entry
Bayes information criterion, 126,
see separate entry
classication, 210211
factor analysis, 180181
linear models, 125130
Mallows' C_p, see separate entry
mouth size data
Akaike and Bayes information
criterion, 163165
penalty, 126, 158
Mallows' C_p, 129
moment generating function, 35
chi-square, 60
double exponential, 61
gamma, 60
multivariate normal, 50
standard normal, 49
standard normal collection, 49
sum of random variables, 45
uniqueness, 35
multidimensional scaling, 266270
classical solution, 266268
Euclidean distances, 266268
non-Euclidean distances, 268
grades data, 269
nonmetric, 269
stress function, 269
principal components, 267
sports data, 269270
multivariate analysis of variance, 121
histamine in dogs data, 7173
leprosy data, 82, 121
test statistics, 117118
multivariate normal distribution, 49
66
afne transformations, 51
conditional distributions, 5557
covariance matrix, 49
data matrix, 5254
afne transformation, 54
conditional distributions, 5657
denition, 49
density, 140141
Kronecker product covariance,
141
independence, 52
marginals, 52
mean, 49
moment generating function, 50
QR decomposition, 141143
standard normal
collection, 49
univariate, 49
multivariate regression, 6973
Akaike information criterion, 161
162
Bayes information criterion
leprosy data, 168
skulls data, 168
covariates
histamine in dogs data, 131
132
estimation
skulls data, 110
grades data, 7071
hypothesis testing
mouth size data, 130131
skulls data, 131
likelihood ratio test, 157
maximum likelihood estimation,
152153
leprosy data, 167168
mouth size data, 71
skulls data, 80
orthogonal, see also linear models: Gram-
Schmidt orthogonalization
matrix
denition, 10
orthonormal set of vectors, 9
to subspace, 87
vectors, 9
orthogonalization
caffeine data, 101
leprosy data, 101
skulls data, 101
partial least squares, 276
Pillai trace test, 118
mouth size data, 119
prediction, 199, see also linear models:
prediction
principal components, 1015, 171, 253
267
Bayes information criterion
automobile data, 259262
pcbic (R routine), 287
pcbic.stepwise (R routine), 287
288
best K, 13, 1819
biplot, 1314
choosing the number of, 256262
compared to factor analysis, 262
264
denition, 11
eigenvalues and eigenvectors, 12
election data, 23
iris data, 12, 254256
likelihood, 264266
painters data, 277278
scree plot, 256
automobile data, 256257
spam data, 22, 278
spectral decomposition theorem,
11
sports data, 1314
subspaces, 257262
uncorrelated property, 11, 18
varimax, 261
probability distributions, 2737
beta
and F, 130
defined via gammas, 130
density, 43
Wilks' Λ, 130
beta-binomial, 43
chi-square
and Wisharts, 59
denition, 59
density, 6061
moment generating function, 60
61
conditional distributions, 3031,
3739
conditional independence, 38
conditional space, 30
continuous, density, 31
covariates, 122, 124
dependence through a function,
38
discrete, 30
independence, 36
iterated expectation, 32
multivariate normal data ma-
trix, 5657
multivariate normal distribution,
5557
notation, 30
plug-in property, 3738, 42, 56,
137, 138
variance decomposition, 38
Wishart distribution, 136137
continuous, 28
density, 2829
mixed-type, 29
probability density function (pdf),
28
probability mass function (pmf),
28
discrete, 28
double exponential, 61
expected values, 3233
exponential family, 212
F, 115, 145
gamma, 60
Haar on orthogonal matrices, 143
Half-Wishart, 142, 146
Hotelling's T², 138-139
projection pursuit, 139140
independence, 3537
denition, 36
marginals, 2930
moment generating function, see
separate entry
multivariate normal, see separate
entry
multivariate t, 148
mutual independence
denition, 37
representations, 27
Students t, 106, 145
uniform, 29
Wilks' Λ, see separate entry
Wishart, see separate entry
projection, 8596, see also matrix: pro-
jection
projection pursuit, 10, 1518
entropy and negentropy, 1618
automobile data, 24
estimation, 281282
maximized by normal, 16, 19
20
negent (R routine), 282
negent2D (R routine), 282283
negent3D (R routine), 283
Hotelling's T², 139-140
iris data, 1618
kurtosis, 15
skewness, 15
R routines
both-sides model
bothsidesmodel, 283284
bothsidesmodel.test, 284
classication
lda, 285
predict.qda, 286
qda, 285
clustering
silhouette.km, 286
sort.silhouette, 286
entropy and negentropy
negent, 282
negent2D, 282283
negent3D, 283
principal components
helper functions, 288
pcbic, 287
pcbic.stepwise, 287288
random variable, 27
collection of, 27
rectangle, 35
residual, 55, 87
rotations, 10
example data sets, 25
iris data, 24
RANDU, 2425
Roys maximum root test, 118
mouth size data, 119
sample
correlation coefcient, 7
covariance, 7
covariance (variance-covariance) ma-
trix, 8
marginals, 8
mean, 6
mean vector, 7
variance, 6
scatter plot
matrix, 35
iris data, 5
stars, 3
spectral decomposition, 1112
eigenvalues and eigenvectors, 12,
21
intraclass correlation structure, 21
theorem, 12
spherical symmetry, 188
star plot, 2
planets data, 23
sum of squares and cross-products ma-
trix, 8
supervised learning, 199, 229
symmetry models, 186191
a.k.a. invariant normal models,
186
complex normal structure, 195
compound symmetry, 188
grades data, 196197
denition, 186
hypothesis testing, 191
iid, 188
independence of blocks of vari-
ables, 187
intraclass correlation structure, 187
188
mouth size data, 191194, 196
spectral decomposition, 21
likelihood ratio statistic, 191
maximum likelihood estimation,
189190
independence, 190
intraclass correlation, 190
spherical symmetry, 188
structure from group, 189
total probability formula, 33
trace of a matrix, 18
unsupervised learning, 199, 229
variance, 33, see also covariance (variance-
covariance) matrix
varimax, 181, 261
R function, 261
vector
norm of, 9
one, 7
Wilks' Λ, 117
mouth size data, 119
Wishart distribution, 5760, 171
and chi-squares, 59
Bartletts decomposition, 142
conditional property, 136137
denition, 57
density, 143144
expectation of inverse, 137138
for sample covariance matrix, 58
Half-Wishart, 142, 146
likelihood, 171
linear transformations, 59
marginals, 60
mean, 59
sum of independent, 59