
Multivariate Statistical Analysis

Old School

John I. Marden
Department of Statistics
University of Illinois at Urbana-Champaign
2011
© by John I. Marden
Preface

The goal of this text is to give the reader a thorough grounding in old-school mul-
tivariate statistical analysis. The emphasis is on multivariate normal modeling and
inference, both theory and implementation. Linear models form a central theme of
the book. Several chapters are devoted to developing the basic models, including
multivariate regression and analysis of variance, and especially the “both-sides mod-
els” (i.e., generalized multivariate analysis of variance models), which allow model-
ing relationships among individuals as well as variables. Growth curve and repeated
measure models are special cases.
The linear models are concerned with means. Inference on covariance matrices
covers testing equality of several covariance matrices, testing independence and con-
ditional independence of (blocks of) variables, factor analysis, and some symmetry
models. Principal components, though mainly a graphical/exploratory technique,
also lends itself to some modeling.
Classification and clustering are related areas. Both attempt to categorize indi-
viduals. Classification tries to classify individuals based upon a previous sample of
observed individuals and their categories. In clustering, there is no observed catego-
rization, nor often even knowledge of how many categories there are. These must be
estimated from the data.
Other useful multivariate techniques include biplots, multidimensional scaling,
and canonical correlations.
The bulk of the results here are mathematically justified, but I have tried to arrange
the material so that the reader can learn the basic concepts and techniques while
plunging as much or as little as desired into the details of the proofs.
Practically all the calculations and graphics in the examples are implemented
using the statistical computing environment R [R Development Core Team, 2010].
Throughout the notes we have scattered some of the actual R code we used. Many of
the data sets and original R functions can be found in the file http://www.istics.net/r/multivariateOldSchool.r. For others we refer to available R packages.

Contents

Preface iii

Contents iv

1 A First Look at Multivariate Data 1


1.1 The data matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Example: Planets data . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Glyphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Scatter plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Example: Fisher-Anderson iris data . . . . . . . . . . . . . . . . . 5
1.4 Sample means, variances, and covariances . . . . . . . . . . . . . . . . . 6
1.5 Marginals and linear combinations . . . . . . . . . . . . . . . . . . . . . . 8
1.5.1 Rotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Principal components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6.1 Biplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6.2 Example: Sports data . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.7 Other projections to pursue . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.7.1 Example: Iris data . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.8 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2 Multivariate Distributions 27
2.1 Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.1 Distribution functions . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.2 Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.1.3 Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.4 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . . 30
2.2 Expected values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3 Means, variances, and covariances . . . . . . . . . . . . . . . . . . . . . . 33
2.3.1 Vectors and matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.2 Moment generating functions . . . . . . . . . . . . . . . . . . . . 35
2.4 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5 Additional properties of conditional distributions . . . . . . . . . . . . . 37
2.6 Affine transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40


2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3 The Multivariate Normal Distribution 49


3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Some properties of the multivariate normal . . . . . . . . . . . . . . . . . 51
3.3 Multivariate normal data matrix . . . . . . . . . . . . . . . . . . . . . . . 52
3.4 Conditioning in the multivariate normal . . . . . . . . . . . . . . . . . . 55
3.5 The sample covariance matrix: Wishart distribution . . . . . . . . . . . . 57
3.6 Some properties of the Wishart . . . . . . . . . . . . . . . . . . . . . . . . 59
3.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4 Linear Models on Both Sides 67


4.1 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Multivariate regression and analysis of variance . . . . . . . . . . . . . . 69
4.2.1 Examples of multivariate regression . . . . . . . . . . . . . . . . . 70
4.3 Linear models on both sides . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.1 One individual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.2 IID observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.3 The both-sides model . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5 Linear Models: Least Squares and Projections 85


5.1 Linear subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3 Least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.1 Both-sides model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4 What is a linear model? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.5 Gram-Schmidt orthogonalization . . . . . . . . . . . . . . . . . . . . . . . 91
5.5.1 The QR and Cholesky decompositions . . . . . . . . . . . . . . . 93
5.5.2 Orthogonal polynomials . . . . . . . . . . . . . . . . . . . . . . . 95
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

6 Both-Sides Models: Distribution of Estimator 103


6.1 Distribution of β . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2 Fits and residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.3 Standard errors and t-statistics . . . . . . . . . . . . . . . . . . . . . . . . 105
6.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.4.1 Mouth sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.4.2 Histamine in dogs . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

7 Both-Sides Models: Hypothesis Tests on β 113


7.1 Approximate χ2 test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.1.1 Example: Mouth sizes . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.2 Testing blocks of β are zero . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.2.1 Just one column – F test . . . . . . . . . . . . . . . . . . . . . . . . 116
7.2.2 Just one row – Hotelling’s T 2 . . . . . . . . . . . . . . . . . . . . . 116
7.2.3 General blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.2.4 Additional test statistics . . . . . . . . . . . . . . . . . . . . . . . . 117
7.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

7.3.1 Mouth sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118


7.3.2 Histamine in dogs . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.4 Testing linear restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.5 Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.5.1 Pseudo-covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.6 Model selection: Mallows’ C p . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.6.1 Example: Mouth sizes . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

8 Some Technical Results 135


8.1 The Cauchy-Schwarz inequality . . . . . . . . . . . . . . . . . . . . . . . 135
8.2 Conditioning in a Wishart . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
8.3 Expectation of inverse Wishart . . . . . . . . . . . . . . . . . . . . . . . . 137
8.4 Distribution of Hotelling’s T 2 . . . . . . . . . . . . . . . . . . . . . . . . . 138
8.4.1 A motivation for Hotelling’s T 2 . . . . . . . . . . . . . . . . . . . 139
8.5 Density of the multivariate normal . . . . . . . . . . . . . . . . . . . . . . 140
8.6 The QR decomposition for the multivariate normal . . . . . . . . . . . . 141
8.7 Density of the Wishart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

9 Likelihood Methods 151


9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
9.2 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . 151
9.2.1 The MLE in multivariate regression . . . . . . . . . . . . . . . . . 152
9.2.2 The MLE in the both-sides linear model . . . . . . . . . . . . . . 153
9.2.3 Proof of Lemma 9.1 . . . . . . . . . . . . . . . . . . . . . . . . . . 156
9.3 Likelihood ratio tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
9.3.1 The LRT in multivariate regression . . . . . . . . . . . . . . . . . 157
9.4 Model selection: AIC and BIC . . . . . . . . . . . . . . . . . . . . . . . . 157
9.4.1 BIC: Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
9.4.2 AIC: Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
9.4.3 AIC: Multivariate regression . . . . . . . . . . . . . . . . . . . . . 161
9.5 Example: Mouth sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

10 Models on Covariance Matrices 171


10.1 Testing equality of covariance matrices . . . . . . . . . . . . . . . . . . . 172
10.1.1 Example: Grades data . . . . . . . . . . . . . . . . . . . . . . . . . 173
10.1.2 Testing the equality of several covariance matrices . . . . . . . . 174
10.2 Testing independence of two blocks of variables . . . . . . . . . . . . . . 174
10.2.1 Example: Grades data . . . . . . . . . . . . . . . . . . . . . . . . . 175
10.2.2 Example: Testing conditional independence . . . . . . . . . . . . 176
10.3 Factor analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
10.3.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
10.3.2 Describing the factors . . . . . . . . . . . . . . . . . . . . . . . . . 181
10.3.3 Example: Grades data . . . . . . . . . . . . . . . . . . . . . . . . . 182
10.4 Some symmetry models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.4.1 Some types of symmetry . . . . . . . . . . . . . . . . . . . . . . . 187
10.4.2 Characterizing the structure . . . . . . . . . . . . . . . . . . . . . 189
10.4.3 Maximum likelihood estimates . . . . . . . . . . . . . . . . . . . 189

10.4.4 Hypothesis testing and model selection . . . . . . . . . . . . . . 191


10.4.5 Example: Mouth sizes . . . . . . . . . . . . . . . . . . . . . . . . . 191
10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

11 Classification 199
11.1 Mixture models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
11.2 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
11.3 Fisher’s linear discrimination . . . . . . . . . . . . . . . . . . . . . . . . . 203
11.4 Cross-validation estimate of error . . . . . . . . . . . . . . . . . . . . . . 205
11.4.1 Example: Iris data . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
11.5 Fisher’s quadratic discrimination . . . . . . . . . . . . . . . . . . . . . . . 210
11.5.1 Example: Iris data, continued . . . . . . . . . . . . . . . . . . . . 210
11.6 Modifications to Fisher’s discrimination . . . . . . . . . . . . . . . . . . . 211
11.7 Conditioning on X: Logistic regression . . . . . . . . . . . . . . . . . . . 212
11.7.1 Example: Iris data . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.7.2 Example: Spam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.8 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
11.8.1 CART . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
11.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224

12 Clustering 229
12.1 K-Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
12.1.1 Example: Sports data . . . . . . . . . . . . . . . . . . . . . . . . . 230
12.1.2 Gap statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
12.1.3 Silhouettes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
12.1.4 Plotting clusters in one and two dimensions . . . . . . . . . . . . 233
12.1.5 Example: Sports data, using R . . . . . . . . . . . . . . . . . . . . 236
12.2 K-medoids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
12.3 Model-based clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
12.3.1 Example: Automobile data . . . . . . . . . . . . . . . . . . . . . . 240
12.3.2 Some of the models in mclust . . . . . . . . . . . . . . . . . . . . 243
12.4 An example of the EM algorithm . . . . . . . . . . . . . . . . . . . . . . . 245
12.5 Soft K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
12.5.1 Example: Sports data . . . . . . . . . . . . . . . . . . . . . . . . . 247
12.6 Hierarchical clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
12.6.1 Example: Grades data . . . . . . . . . . . . . . . . . . . . . . . . . 248
12.6.2 Example: Sports data . . . . . . . . . . . . . . . . . . . . . . . . . 249
12.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

13 Principal Components and Related Techniques 253


13.1 Principal components, redux . . . . . . . . . . . . . . . . . . . . . . . . . 253
13.1.1 Example: Iris data . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
13.1.2 Choosing the number of principal components . . . . . . . . . . 256
13.1.3 Estimating the structure of the component spaces . . . . . . . . . 257
13.1.4 Example: Automobile data . . . . . . . . . . . . . . . . . . . . . . 259
13.1.5 Principal components and factor analysis . . . . . . . . . . . . . 262
13.1.6 Justification of the principal component MLE, Theorem 13.1 . . 264
13.2 Multidimensional scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
13.2.1 Δ is Euclidean: The classical solution . . . . . . . . . . . . . . . . 266
13.2.2 Δ may not be Euclidean: The classical solution . . . . . . . . . . 268

13.2.3 Nonmetric approach . . . . . . . . . . . . . . . . . . . . . . . . . . 269


13.2.4 Examples: Grades and sports . . . . . . . . . . . . . . . . . . . . 269
13.3 Canonical correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
13.3.1 Example: Grades . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
13.3.2 How many canonical correlations are positive? . . . . . . . . . . 274
13.3.3 Partial least squares . . . . . . . . . . . . . . . . . . . . . . . . . . 276
13.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276

A Extra R routines 281


A.1 Estimating entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
A.1.1 negent: Estimating negative entropy . . . . . . . . . . . . . . . . . 282
A.1.2 negent2D: Maximizing negentropy for q = 2 dimensions . . . . . 282
A.1.3 negent3D: Maximizing negentropy for q = 3 dimensions . . . . . 283
A.2 Both-sides model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
A.2.1 bothsidesmodel: Calculate the estimates . . . . . . . . . . . . . . . 283
A.2.2 bothsidesmodel.test: Test blocks of β are zero . . . . . . . . . . . 284
A.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
A.3.1 lda: Linear discrimination . . . . . . . . . . . . . . . . . . . . . . . 285
A.3.2 qda: Quadratic discrimination . . . . . . . . . . . . . . . . . . . . 285
A.3.3 predict.qda: Quadratic discrimination prediction . . . . . . . . . 286
A.4 Silhouettes for K-Means Clustering . . . . . . . . . . . . . . . . . . . . . 286
A.4.1 silhouette.km: Calculate the silhouettes . . . . . . . . . . . . . . . 286
A.4.2 sort.silhouette: Sort the silhouettes by group . . . . . . . . . . . . 286
A.5 Estimating the eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . 287
A.5.1 pcbic: BIC for a particular pattern . . . . . . . . . . . . . . . . . . 287
A.5.2 pcbic.stepwise: Choosing a good pattern . . . . . . . . . . . . . . 287
A.5.3 Helper functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288

Bibliography 295

Index 301
Chapter 1

A First Look at Multivariate Data

In this chapter, we try to give a sense of what multivariate data sets look like, and
introduce some of the basic matrix manipulations needed throughout these notes.
Chapters 2 and 3 lay down the distributional theory. Linear models are probably the
most popular statistical models ever. With multivariate data, we can model relation-
ships between individuals or between variables, leading to what we call “both-sides
models,” which do both simultaneously. Chapters 4 through 8 present these models
in detail. The linear models are concerned with means. Before turning to models
on covariances, Chapter 9 briefly reviews likelihood methods, including maximum
likelihood estimation, likelihood ratio tests, and model selection criteria (Bayes and
Akaike). Chapter 10 looks at a number of models based on covariance matrices, in-
cluding equality of covariances, independence and conditional independence, factor
analysis, and other structural models. Chapter 11 deals with classification, in which
the goal is to find ways to classify individuals into categories, e.g., healthy or un-
healthy, based on a number of observed variables. Chapter 12 has a similar goal,
except that the categories are unknown and we seek groupings of individuals using
just the observed variables. Finally, Chapter 13 explores principal components, which
we first see in Section 1.6. It is an approach for reducing the number of variables,
or at least finding a few interesting ones, by searching through linear combinations of
the observed variables. Multidimensional scaling has a similar objective, but tries to
exhibit the individual data points in a low-dimensional space while preserving the
original inter-point distances. Canonical correlation analysis involves two sets of variables, and finds linear combinations of the two sets to explain the correlations between them.
On to the data.

1.1 The data matrix

Data generally will consist of a number of variables recorded on a number of individuals, e.g., heights, weights, ages, and sex of a sample of students. Also, generally,
there will be n individuals and q variables, and the data will be arranged in an n × q


data matrix, with rows denoting individuals and the columns denoting variables:

Y = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1q} \\ y_{21} & y_{22} & \cdots & y_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nq} \end{pmatrix}.    (1.1)

Then y_{ij} is the value of variable j for individual i. Much more complex data
structures exist, but this course concentrates on these straightforward data matrices.

1.1.1 Example: Planets data


Six astronomical variables are given on each of the historical nine planets (or eight
planets, plus Pluto). The variables are (average) distance in millions of miles from
the Sun, length of day in Earth days, length of year in Earth days, diameter in miles,
temperature in degrees Fahrenheit, and number of moons. The data matrix:

          Dist     Day       Year    Diam  Temp  Moons
Mercury    35.96   59.00      88.00   3030   332    0
Venus      67.20  243.00     224.70   7517   854    0
Earth      92.90    1.00     365.26   7921    59    1
Mars      141.50    1.00     687.00   4215   -67    2
Jupiter   483.30    0.41    4332.60  88803  -162   16
Saturn    886.70    0.44   10759.20  74520  -208   18
Uranus   1782.00    0.70   30685.40  31600  -344   15
Neptune  2793.00    0.67   60189.00  30200  -261    8
Pluto    3664.00    6.39   90465.00   1423  -355    1      (1.2)
The data can be found in Wright [1997], for example.
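For convenience in running the snippets below, here is a minimal sketch (not from the text) of one way to enter the data in R; the values are copied from (1.2), and the object name planets matches the matrix used in Listing 1.1:

planets <- matrix(c(
    35.96,  59.00,    88.00,  3030,  332,  0,
    67.20, 243.00,   224.70,  7517,  854,  0,
    92.90,   1.00,   365.26,  7921,   59,  1,
   141.50,   1.00,   687.00,  4215,  -67,  2,
   483.30,   0.41,  4332.60, 88803, -162, 16,
   886.70,   0.44, 10759.20, 74520, -208, 18,
  1782.00,   0.70, 30685.40, 31600, -344, 15,
  2793.00,   0.67, 60189.00, 30200, -261,  8,
  3664.00,   6.39, 90465.00,  1423, -355,  1),
  ncol = 6, byrow = TRUE,
  dimnames = list(c("Mercury","Venus","Earth","Mars","Jupiter","Saturn",
                    "Uranus","Neptune","Pluto"),
                  c("Dist","Day","Year","Diam","Temp","Moons")))
# Each row is a planet; each column is one of the six variables in (1.2).

The same data are also in the file mentioned in the preface.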

1.2 Glyphs
Graphical displays of univariate data, that is, data on one variable, are well-known:
histograms, stem-and-leaf plots, pie charts, box plots, etc. For two variables, scat-
ter plots are valuable. It is more of a challenge when dealing with three or more
variables.
Glyphs provide an option. A little picture is created for each individual, with char-
acteristics based on the values of the variables. Chernoff’s faces [Chernoff, 1973] may
be the most famous glyphs. The idea is that people intuitively respond to character-
istics of faces, so that many variables can be summarized in a face.
Figure 1.1 exhibits faces for the nine planets. We use the faces routine by H. P. Wolf
in the R package aplpack, Wolf and Bielefeld [2010]. The distance the planet is from
the sun is represented by the height of the face (Pluto has a long face), the length of
the planet’s day by the width of the face (Venus has a wide face), etc. One can then
cluster the planets. Mercury, Earth and Mars look similar, as do Saturn and Jupiter.
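A minimal sketch (not from the text) of producing a plot along the lines of Figure 1.1, assuming the planets matrix from Section 1.1.1 is available and the aplpack package is installed:

library(aplpack)   # provides the faces() routine by H. P. Wolf
faces(planets)     # one Chernoff face per planet, features driven by the six variables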
These face plots are more likely to be amusing than useful, especially if the number
of individuals is large. A star plot is similar. Each individual is represented by a
p-pointed star, where each point corresponds to a variable, and the distance of the point from the center is based on the variable's value for that individual. See Figure 1.2.

Figure 1.1: Chernoff's faces for the planets. Each feature represents a variable. For the first six variables, from the faces help file: 1 - height of face, 2 - width of face, 3 - shape of face, 4 - height of mouth, 5 - width of mouth, 6 - curve of smile.

1.3 Scatter plots

Two-dimensional scatter plots can be enhanced by using different symbols for the
observations instead of plain dots. For example, different colors could be used for
different groups of points, or glyphs representing other variables could be plotted.
Figure 1.2 plots the planets with the logarithms of day length and year length as the
axes, where the stars created from the other four variables are the plotted symbols.
Note that the planets pair up in a reasonable way. Mercury and Venus are close, both
in terms of the scatter plot and in the look of their stars. Similarly, Earth and Mars
pair up, as do Jupiter and Saturn, and Uranus and Neptune. See Listing 1.1 for the R
code.
A scatter plot matrix arranges all possible two-way scatter plots in a q × q matrix.
These displays can be enhanced with brushing, in which individual points or groups of points can be selected in one plot, and be simultaneously highlighted in the other plots.

Listing 1.1: R code for the star plot of the planets, Figure 1.2. The data are in the
matrix planets. The first statement normalizes the variables to range from 0 to 1. The
ep matrix is used to place the names of the planets. Tweaking is necessary, depending
on the size of the plot.
p <- apply(planets, 2, function(z) (z - min(z))/(max(z) - min(z)))
x <- log(planets[,2])
y <- log(planets[,3])
ep <- rbind(c(-.3,.4), c(-.5,.4), c(.5,0), c(.5,0), c(.6,-1), c(-.5,1.4),
  c(1,-.6), c(1.3,.4), c(1,-.5))   # one row of label offsets per planet
symbols(x, y, stars=p[,-(2:3)], xlab='log(day)', ylab='log(year)', inches=.4)
text(x+ep[,1], y+ep[,2], labels=rownames(planets), cex=.5)

Figure 1.2: Scatter plot of log(day) versus log(year) for the planets, with plotting
symbols being stars created from the other four variables, distance, diameter, tem-
perature, moons.


Figure 1.3: A scatter plot matrix for the Fisher/Anderson iris data. In each plot, "s" indicates setosa plants, "v" indicates versicolor, and "g" indicates virginica.

1.3.1 Example: Fisher-Anderson iris data

The most famous data set in multivariate analysis is the iris data analyzed by Fisher
[1936] based on data collected by Anderson [1935]. See also Anderson [1936]. There
are fifty specimens each of three species of iris: setosa, versicolor, and virginica. There
are four variables measured on each plant: sepal length, sepal width, petal length, and petal width. Thus n = 150 and q = 4. Figure 1.3 contains the corresponding scatter
plot matrix, with species indicated by letter. We used the R function pairs. Note that
the three species separate fairly well, with setosa especially different from the other
two.
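A minimal sketch (not from the text) of a pairs call that produces a plot along the lines of Figure 1.3, with the iris data frame built into R:

sp <- rep(c("s", "v", "g"), each = 50)   # plotting letters for the three species
pairs(iris[, 1:4], pch = sp)             # all two-way scatter plots of the four measurements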
As a preview of classification (Chapter 11), Figure 1.4 uses faces to exhibit five
observations from each species, and five random observations without their species
label. Can you guess which species each is? See page 20. The setosas are not too
difficult, since they have small faces, but distinguishing the other two can be a chal-
lenge.


Figure 1.4: Five specimens from each iris species, plus five from unspecified species.
Here, “set” indicates setosa plants, “vers” indicates versicolor, and “virg” indicates
virginica.

1.4 Sample means, variances, and covariances


For univariate values x_1, ..., x_n, the sample mean is

\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i,    (1.3)

and the sample variance is

s_x^2 = s_{xx} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2.    (1.4)

Note the two notations: the s_x^2 is most common when dealing with individual variables, but the s_{xx} transfers better to multivariate data. Often one is tempted to divide by n - 1 instead of n. That's fine, too. With a second set of values z_1, ..., z_n, we have the sample covariance between the x_i's and z_i's to be

s_{xz} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(z_i - \bar{z}).    (1.5)

So the covariance between the x_i's and themselves is the variance, which is to say that s_x^2 = s_{xx}. The sample correlation coefficient is a normalization of the covariance that ranges between -1 and +1, defined by

r_{xz} = \frac{s_{xz}}{s_x s_z},    (1.6)

provided both variances are positive. (See Corollary 8.1.) In a scatter plot of x versus z, the correlation coefficient is +1 if all the points lie on a line with positive slope, and -1 if they all lie on a line with negative slope.

For a data matrix Y (1.1) with q variables, there are q means:

\bar{y}_j = \frac{1}{n} \sum_{i=1}^n y_{ij}.    (1.7)

Placing them in a row vector, we have

\bar{y} = (\bar{y}_1, \ldots, \bar{y}_q).    (1.8)

The n × 1 one vector is 1_n = (1, 1, ..., 1)', the vector of all 1's. Then the mean vector (1.8) can be written

\bar{y} = \frac{1}{n} 1_n' Y.    (1.9)

To find the variances and covariances, we first have to subtract the means from the individual observations in Y: change y_{ij} to y_{ij} - \bar{y}_j for each i, j. That can be achieved by subtracting the n × q matrix 1_n \bar{y} from Y to get the matrix of deviations. Using (1.9), we can write

Y - 1_n \bar{y} = Y - \frac{1}{n} 1_n 1_n' Y = \left(I_n - \frac{1}{n} 1_n 1_n'\right) Y \equiv H_n Y.    (1.10)

There are two important matrices in that formula: the n × n identity matrix I_n,

I_n = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix},    (1.11)

and the n × n centering matrix H_n,

H_n = I_n - \frac{1}{n} 1_n 1_n' = \begin{pmatrix} 1 - \frac{1}{n} & -\frac{1}{n} & \cdots & -\frac{1}{n} \\ -\frac{1}{n} & 1 - \frac{1}{n} & \cdots & -\frac{1}{n} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{n} & -\frac{1}{n} & \cdots & 1 - \frac{1}{n} \end{pmatrix}.    (1.12)

The identity matrix leaves any vector or matrix alone, so if A is n × m, then A = I_n A = A I_m, and the centering matrix subtracts the column mean from each element in H_n A. Similarly, A H_m results in the row mean being subtracted from each element. For an n × 1 vector x with mean \bar{x}, and n × 1 vector z with mean \bar{z}, we can write

\sum_{i=1}^n (x_i - \bar{x})^2 = (x - \bar{x} 1_n)'(x - \bar{x} 1_n)    (1.13)

and

\sum_{i=1}^n (x_i - \bar{x})(z_i - \bar{z}) = (x - \bar{x} 1_n)'(z - \bar{z} 1_n).    (1.14)

Thus taking the deviations matrix in (1.10), (H_n Y)'(H_n Y) contains all the \sum_i (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k)'s. We will call that matrix the sum of squares and cross-products matrix. Notice that

(H_n Y)'(H_n Y) = Y' H_n' H_n Y = Y' H_n Y.    (1.15)

What happened to the H_n's? First, H_n is clearly symmetric, so that H_n' = H_n. Then notice that H_n H_n = H_n. Such a matrix is called idempotent; that is, a square matrix

A is idempotent if AA = A.    (1.16)

Dividing the sum of squares and cross-products matrix by n gives the sample variance-covariance matrix, or more simply sample covariance matrix:

S = \frac{1}{n} Y' H_n Y = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1q} \\ s_{21} & s_{22} & \cdots & s_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ s_{q1} & s_{q2} & \cdots & s_{qq} \end{pmatrix},    (1.17)

where s_{jj} is the sample variance of the jth variable (column), and s_{jk} is the sample covariance between the jth and kth variables. (When doing inference later, we may divide by n - df instead of n for some "degrees-of-freedom" integer df.)
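A minimal sketch (not from the text) verifying (1.17) numerically in R, with the iris measurements playing the role of Y:

Y <- as.matrix(iris[, 1:4])
n <- nrow(Y)
H <- diag(n) - matrix(1/n, n, n)       # the centering matrix H_n of (1.12)
S <- t(Y) %*% H %*% Y / n              # sample covariance matrix, as in (1.17)
all.equal(S, var(Y) * (n - 1)/n)       # agrees with R's var() up to the n versus n-1 divisor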

1.5 Marginals and linear combinations


A natural first stab at looking at data with several variables is to look at the variables
one at a time, so with q variables, one would first make q histograms, or box plots, or
whatever suits one’s fancy. Such techniques are based on marginals, that is, based on
subsets of the variables rather than all variables at once as in glyphs. One-dimensional
marginals are the individual variables, two-dimensional marginals are the pairs of
variables, three-dimensional marginals are the sets of three variables, etc.
Consider one-dimensional marginals. It is easy to construct the histograms, say.
But why be limited to the q variables? Functions of the variables can also be his-
togrammed, e.g., weight/height. The number of possible functions one could imag-
ine is vast. One convenient class is the set of linear transformations, that is, for some
constants b_1, ..., b_q, a new variable is W = b_1 Y_1 + ··· + b_q Y_q, so the transformed data consist of w_1, ..., w_n, where

w_i = b_1 y_{i1} + \cdots + b_q y_{iq}.    (1.18)

Placing the coefficients into a column vector b = (b_1, ..., b_q)', we can write

W \equiv \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix} = Yb,    (1.19)

transforming the original data matrix to another one, albeit with only one variable. Now there is a histogram for each vector b. A one-dimensional grand tour runs through the vectors b, displaying the histogram for Yb as it goes. (See Asimov [1985] and Buja and Asimov [1986] for general grand tour methodology.) Actually, one does not need all b, e.g., the vectors b = (1, 2, 5)' and b = (2, 4, 10)' would give the same histogram. Just the scale of the horizontal axis on the histograms would be different. One simplification is to look at only the b's with norm 1. That is, the norm of a vector x = (x_1, ..., x_q)' is

\|x\| = \sqrt{x_1^2 + \cdots + x_q^2} = \sqrt{x'x},    (1.20)

so one would run through the b's with \|b\| = 1. Note that the one-dimensional marginals are special cases: take

b = (1, 0, ..., 0)', (0, 1, 0, ..., 0)', ..., or (0, 0, ..., 1)'.    (1.21)

Scatter plots of two linear combinations are more common. That is, there are two sets of coefficients (the b_{j1}'s and the b_{j2}'s), and two resulting variables:

w_{i1} = b_{11} y_{i1} + b_{21} y_{i2} + \cdots + b_{q1} y_{iq}, and
w_{i2} = b_{12} y_{i1} + b_{22} y_{i2} + \cdots + b_{q2} y_{iq}.    (1.22)

In general, the data matrix generated from p linear combinations can be written

W = YB,    (1.23)

where W is n × p, and B is q × p with column k containing the coefficients for the kth linear combination. As for one linear combination, the coefficient vectors are taken to have norm 1, i.e., \|(b_{1k}, ..., b_{qk})'\| = 1, which is equivalent to having all the diagonals of B'B being 1.

Another common restriction is to have the linear combination vectors be orthogonal, where two column vectors b and c are orthogonal if b'c = 0. Geometrically, orthogonality means the vectors are perpendicular to each other. One benefit of restricting to orthogonal linear combinations is that one avoids scatter plots that are highly correlated but not meaningfully so, e.g., one might have w_1 be Height + Weight, and w_2 be .99 × Height + 1.01 × Weight. Having those two highly correlated does not tell us anything about the data set. If the columns of B are orthogonal to each other, as well as having norm 1, then

B'B = I_p.    (1.24)

A set of norm 1 vectors that are mutually orthogonal is said to be orthonormal.

Return to p = 2 orthonormal linear combinations. A two-dimensional grand tour plots the two variables as the q × 2 matrix B runs through all the matrices with a pair of orthonormal columns.
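As a concrete illustration (not from the text), a minimal R sketch of forming two orthonormal linear combinations of the four iris measurements and checking (1.24):

Y <- as.matrix(iris[, 1:4])
B <- cbind(c(1, 1, 1, 1)/2, c(1, -1, 1, -1)/2)   # q x 2 matrix with orthonormal columns
crossprod(B)                                     # B'B = I_2, as in (1.24)
W <- Y %*% B                                     # the two linear combinations, as in (1.23)
plot(W)                                          # scatter plot of the two new variables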

1.5.1 Rotations
If the B in (1.24) is q × q, i.e., there are as many orthonormal linear combinations as variables, then B is an orthogonal matrix.

Definition 1.1. A q × q matrix G is orthogonal if

G'G = GG' = I_q.    (1.25)

Note that the definition says that the columns are orthonormal, and the rows are orthonormal. In fact, the rows are orthonormal if and only if the columns are (if the matrix is square), hence the middle equality in (1.25) is not strictly needed in the definition.
Think of the data matrix Y being the set of n points in q-dimensional space. For
orthogonal matrix G, what does the set of points W = YG look like? It looks exactly
like Y, but rotated or flipped. Think of a pinwheel turning, or a chicken on a rotisserie,
or the earth spinning around its axis or rotating about the sun. Figure 1.5 illustrates
a simple rotation of two variables. In particular, the norms of the points in Y are the
same as in W, so each point remains the same distance from 0.
Rotating point clouds for three variables works by first multiplying the n × 3 data
matrix by a 3 × 3 orthogonal matrix, then making a scatter plot of the first two re-
sulting variables. By running through the orthogonal matrices quickly, one gets the
illusion of three dimensions. See the discussion immediately above Exercise 1.9.21
for some suggestions on software for real-time rotations.
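A minimal sketch (not from the text) of a planar rotation in R, checking that an orthogonal G leaves the norms of the points unchanged; the angle theta is arbitrary:

theta <- pi/6                                   # an arbitrary rotation angle
G <- matrix(c(cos(theta), -sin(theta),
              sin(theta),  cos(theta)), 2, 2)   # 2 x 2 rotation (orthogonal) matrix
Y <- as.matrix(iris[1:50, 1:2])                 # any n x 2 data matrix will do
W <- Y %*% G                                    # the rotated point cloud
all.equal(crossprod(G), diag(2))                # G'G = I_2, so G is orthogonal
all.equal(rowSums(Y^2), rowSums(W^2))           # each point keeps its distance from 0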

1.6 Principal components


The grand tours and rotating point clouds described in the last two subsections do
not have mathematical objectives, that is, one just looks at them to see if anything
interesting pops up. In projection pursuit [Huber, 1985], one looks for a few (often
just one or two) (orthonormal) linear combinations that maximize a given objective
function. For example, if looking at just one linear combination, one may wish to find
the one that maximizes the variance of the data, or the skewness or kurtosis, or one
whose histogram is most bimodal. With two linear combinations, one may be after
clusters of points, high correlation, curvature, etc.
Principal components are the orthonormal combinations that maximize the vari-
ance. They predate the term projection pursuit by decades [Pearson, 1901], and are the
most commonly used. The idea behind them is that variation is information, so if
one has several variables, one wishes the linear combinations that capture as much
of the variation in the data as possible. You have to decide in particular situations
whether variation is the important criterion. To find a column vector b to maximize
the sample variance of W = Yb, we could take b infinitely large, which yields infinite
variance. To keep the variance meaningful, we restrict to vectors b of unit norm.
For q variables, there are q principal components: The first has the maximal vari-
ance any one linear combination (with norm 1) can have, the second has the maximal
variance among linear combinations orthogonal to the first, etc. The technical defi-
nition for a data matrix is below. First, we note that for a given q × p matrix B, the
mean and variance of the elements in the linear transformation W = YB are easily

obtained from the mean and covariance matrix of Y using (1.8) and (1.15):

\bar{w} = \frac{1}{n} 1_n' W = \frac{1}{n} 1_n' Y B = \bar{y} B,    (1.26)

by (1.9), and

S_W = \frac{1}{n} W' H_n W = \frac{1}{n} B' Y' H_n Y B = B' S B,    (1.27)

where S is the covariance matrix of Y in (1.17). In particular, for a column vector b, the sample variance of Yb is b'Sb. Thus the principal components aim to maximize g'Sg for g's of unit length.

Definition 1.2. Suppose S is the sample covariance matrix for the n × q data matrix Y. Let g_1, ..., g_q be an orthonormal set of q × 1 vectors such that

g_1 is any g that maximizes g'Sg over \|g\| = 1;
g_2 is any g that maximizes g'Sg over \|g\| = 1, g'g_1 = 0;
g_3 is any g that maximizes g'Sg over \|g\| = 1, g'g_1 = g'g_2 = 0;
  ...
g_q is any g that maximizes g'Sg over \|g\| = 1, g'g_1 = \cdots = g'g_{q-1} = 0.    (1.28)

Then Yg_i is the ith sample principal component, g_i is its loading vector, and l_i ≡ g_i' S g_i is its sample variance.

Because the function g'Sg is continuous in g, and the maximizations are over compact sets, these principal components always exist. They may not be unique, although for sample covariance matrices, if n ≥ q, they almost always are unique, up to sign. See Section 13.1 for further discussion.

By the construction in (1.28), we have that the sample variances of the principal components are ordered as

l_1 ≥ l_2 ≥ \cdots ≥ l_q.    (1.29)

What is not as obvious, but quite important, is that the principal components are uncorrelated, as in the next lemma, proved in Section 1.8.

Lemma 1.1. The S and g_1, ..., g_q in Definition 1.2 satisfy

g_i' S g_j = 0 for i ≠ j.    (1.30)

Now G ≡ (g_1, ..., g_q) is an orthogonal matrix, and the matrix of principal components is

W = YG.    (1.31)

Equations (1.29) and (1.30) imply that the sample covariance matrix, say L, of W is diagonal, with the l_i's on the diagonal. Hence by (1.27),

S_W = G' S G = L = \begin{pmatrix} l_1 & 0 & \cdots & 0 \\ 0 & l_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & l_q \end{pmatrix}.    (1.32)

Moving the G's to the L side of the equality, we obtain the following.


Figure 1.5: The sepal length and sepal width for the setosa iris data. The first plot is
the raw data, centered. The second shows the two principal components.

Theorem 1.1 (The spectral decomposition theorem for symmetric matrices). If S is a symmetric q × q matrix, then there exists a q × q orthogonal (1.25) matrix G and a q × q diagonal matrix L with diagonals l_1 ≥ l_2 ≥ ··· ≥ l_q such that

S = GLG'.    (1.33)

Although we went through the derivation with S being a covariance matrix, all we really needed for this theorem was that S is symmetric. The g_i's and l_i's have mathematical names, too: eigenvectors and eigenvalues.

Definition 1.3 (Eigenvalues and eigenvectors). Suppose A is a q × q matrix. Then λ is an eigenvalue of A if there exists a non-zero q × 1 vector u such that Au = λu. The vector u is the corresponding eigenvector. Similarly, u ≠ 0 is an eigenvector if there exists an eigenvalue to which it corresponds.

A little linear algebra shows that indeed, each g_i is an eigenvector of S corresponding to l_i. Hence the following:

Symbol   Principal components   Spectral decomposition
l_i      Variance               Eigenvalue
g_i      Loadings               Eigenvector              (1.34)
Figure 1.5 plots the principal components for the q = 2 variables sepal length and
sepal width for the fifty iris observations of the species setosa. The data has been
centered, so that the means are zero. The variances of the two original variables are
0.124 and 0.144, respectively. The first graph shows the two variables are highly cor-
related, with most of the points lining up near the 45° line. The principal component loading matrix G rotates the points approximately 45° clockwise as in the second
graph, so that the data are now most spread out along the horizontal axis (variance is
0.234), and least along the vertical (variance is 0.034). The two principal components
are also, as it appears, uncorrelated.
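A minimal sketch (not from the text) of how the rotation in Figure 1.5 might be computed, using the setosa rows of R's built-in iris data:

y <- scale(as.matrix(iris[1:50, 1:2]), scale = FALSE)  # center sepal length and width
eg <- eigen(var(y))         # loadings (eigenvectors) and variances (eigenvalues)
w  <- y %*% eg$vectors      # the two principal components
diag(var(w))                # most of the spread is along the first component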

Best K components
In the process above, we found the principal components one by one. It may be
that we would like to find the rotation for which the first K variables, say, have the
maximal sum of variances. That is, we wish to find the orthonormal set of q × 1
vectors b1 , . . . , bK to maximize

b_1' S b_1 + \cdots + b_K' S b_K.    (1.35)

Fortunately, the answer is the same, i.e., take bi = gi for each i, the principal compo-
nents. See Proposition 1.1 in Section 1.8. Section 13.1 explores principal components
further.

1.6.1 Biplots
When plotting observations using the first few principal component variables, the
relationship between the original variables and principal components is often lost.
An easy remedy is to rotate and plot the original axes as well. Imagine in the original
data space, in addition to the observed points, one plots arrows of length λ along the
axes. That is, the arrows are the line segments

ai = {(0, . . . , 0, c, 0, . . . , 0) | 0 < c < λ} (the c is in the i th slot), (1.36)

where an arrowhead is added at the non-origin end of the segment. If Y is the matrix of observations, and G_1 the matrix containing the first p loading vectors, then

\widehat{X} = Y G_1.    (1.37)

We also apply the transformation to the arrows:

\widehat{A} = (a_1, \ldots, a_q) G_1.    (1.38)

The plot consisting of the points \widehat{X} and the arrows \widehat{A} is then called the biplot. See Gabriel [1981]. The points of the arrows in \widehat{A} are just

\lambda I_q G_1 = \lambda G_1,    (1.39)

so that in practice all we need to do is for each axis, draw an arrow pointing from
the origin to λ× (the i th row of G1 ). The value of λ is chosen by trial-and-error, so
that the arrows are amidst the observations. Notice that the components of these
arrows are proportional to the loadings, so that the length of the arrows represents
the weight of the corresponding variables on the principal components.
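As an aside (not from the text): for principal components computed with R's prcomp, the built-in biplot method draws this kind of plot directly.

biplot(prcomp(iris[, 1:4]))   # observations as points, variable axes as arrows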

1.6.2 Example: Sports data


Louis Roussos asked n = 130 people to rank seven sports, assigning #1 to the sport
they most wish to participate in, and #7 to the one they least wish to participate in.
The sports are baseball, football, basketball, tennis, cycling, swimming and jogging.
Here are a few of the observations:

Obs i   BaseB  FootB  BsktB  Ten  Cyc  Swim  Jog
    1     1      3      7     2    4     5    6
    2     1      3      2     5    4     7    6
    3     1      3      2     5    4     7    6
  ...    ...    ...    ...   ...  ...   ...  ...
  129     5      7      6     4    1     3    2
  130     2      1      6     7    3     5    4        (1.40)
E.g., the first person likes baseball and tennis, but not basketball or jogging (too much
running?).
We find the principal components. The data is in the matrix sportsranks. We find it
easier to interpret the plot if we reverse the ranks, so that 7 is best and 1 is worst, then
center the variables. The function eigen calculates the eigenvectors and eigenvalues of
its argument, returning the results in the components vectors and values, respectively:
y <- 8 - sportsranks
y <- scale(y, scale=F) # Centers the columns
eg <- eigen(var(y))
The function prcomp can also be used. The eigenvalues (variances) are

j     1      2     3     4    5     6     7
l_j   10.32  4.28  3.98  3.3  2.74  2.25  0        (1.41)

The first eigenvalue is 10.32, quite a bit larger than the second. The second through
sixth are fairly equal, so it may be reasonable to look at just the first component.
(The seventh eigenvalue is 0, but that follows because the rank vectors all sum to
1 + · · · + 7 = 28, hence exist in a six-dimensional space.)
We create the biplot using the first two dimensions. We first plot the people:
ev <- eg$vectors
w <- y %*% ev # The principal components
lm <- range(w)
plot(w[,1:2], xlim=lm, ylim=lm)
The biplot adds in the original axes. Thus we want to plot the seven (q = 7) points as in (1.39), where G_1 contains the first two eigenvectors. Plotting the arrows and labels:
arrows(0, 0, 5*ev[,1], 5*ev[,2])
text(7*ev[,1:2], labels=colnames(y))
The constants “5” (which is the λ) and “7” were found by trial and error so that the
graph, Figure 1.6, looks good. We see two main clusters. The left-hand cluster of
people is associated with the team sports' arrows (baseball, football and basketball), and the right-hand cluster is associated with the individual sports' arrows (cycling, swimming, jogging). Tennis is a bit on its own, pointing south.

Figure 1.6: Biplot of the sports data, using the first two principal components.

1.7 Other projections to pursue

Principal components can be very useful, but you do have to be careful. For one, they depend crucially on the scaling of your variables. For example, suppose the data set has two variables, height and weight, measured on a number of adults. The variance
of height, in inches, is about 9, and the variance of weight, in pounds, is 900 (= 30^2).
One would expect the first principal component to be close to the weight variable,
because that is where the variation is. On the other hand, if height were measured in
millimeters, and weight in tons, the variances would be more like 6000 (for height)
and 0.0002 (for weight), so the first principal component would be essentially the
height variable. In general, if the variables are not measured in the same units, it can
be problematic to decide what units to use for the variables. See Section 13.1.1. One common approach is to divide each variable by its standard deviation \sqrt{s_{jj}}, so that the resulting variables all have variance 1.
Another caution is that the linear combination with largest variance is not neces-
sarily the most interesting, e.g., you may want one which is maximally correlated
with another variable, or which distinguishes two populations best, or which shows
the most clustering.
Popular objective functions to maximize, other than variance, are skewness, kur-
tosis and negative entropy. The idea is to find projections that are not normal (in the
sense of the normal distribution). The hope is that these will show clustering or some
other interesting feature.
Skewness measures a certain lack of symmetry, where one tail is longer than the other. It is measured by the normalized sample third central (meaning subtract the mean) moment:

\text{Skewness} = \frac{\sum_{i=1}^n (x_i - \bar{x})^3 / n}{\left(\sum_{i=1}^n (x_i - \bar{x})^2 / n\right)^{3/2}}.    (1.42)

Positive values indicate a longer tail to the right, and negative to the left. Kurtosis is the normalized sample fourth central moment. For a sample x_1, ..., x_n, it is

\text{Kurtosis} = \frac{\sum_{i=1}^n (x_i - \bar{x})^4 / n}{\left(\sum_{i=1}^n (x_i - \bar{x})^2 / n\right)^2} - 3.    (1.43)

The “−3” is there so that exactly normal data will have kurtosis 0. A variable with
low kurtosis is more “boxy” than the normal. One with high kurtosis tends to have
thick tails and a pointy middle. (A variable with low kurtosis is platykurtic, and one
with high kurtosis is leptokurtic, from the Greek: kyrtos = curved, platys = flat, like a
platypus, and lepto = thin.) Bimodal distributions often have low kurtosis.
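Minimal R sketches (not from the text) of (1.42) and (1.43); the function names skewness and kurtosis are chosen here just for illustration:

skewness <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^(3/2)   # (1.42)
kurtosis <- function(x) mean((x - mean(x))^4) / mean((x - mean(x))^2)^2 - 3   # (1.43)
skewness(rexp(1000))   # positive: long right tail
kurtosis(runif(1000))  # negative: boxy, no tails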

Entropy
(You may wish to look through Section 2.1 before reading this section.) The entropy
of a random variable Y with pdf f (y) is

Entropy( f ) = − E f [log( f (Y ))]. (1.44)

Entropy is supposed to measure lack of structure, so that the larger the entropy, the
more diffuse the distribution is. For the normal, we have that


\text{Entropy}(N(\mu, \sigma^2)) = E_f\left[\log(\sqrt{2\pi\sigma^2}\,) + \frac{(Y - \mu)^2}{2\sigma^2}\right] = \frac{1}{2}\left(1 + \log(2\pi\sigma^2)\right).    (1.45)

Note that it does not depend on the mean μ, and that it increases without bound as σ2
increases. Thus maximizing entropy unrestricted is not an interesting task. However,
one can imagine maximizing entropy for a given mean and variance, which leads to
the next lemma, to be proved in Section 1.8.

Lemma 1.2. The N (μ, σ2 ) uniquely maximizes the entropy among all pdf’s with mean μ and
variance σ2 .
Thus a measure of nonnormality of g is its entropy subtracted from that of the
normal with the same variance. Since there is a negative sign in front of the entropy
of g, this difference is called negentropy, defined for any g as

\text{Negent}(g) = \frac{1}{2}\left(1 + \log(2\pi\sigma^2)\right) - \text{Entropy}(g), \quad \text{where } \sigma^2 = \text{Var}_g[Y].    (1.46)
With data, one does not know the pdf g, so one must estimate the negentropy. This
value is known as the Kullback-Leibler distance, or discrimination information, from
g to the normal density. See Kullback and Leibler [1951].
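As a quick numerical illustration (not from the text) of (1.45), a Monte Carlo estimate of the entropy of a normal with sigma = 2 can be compared to the exact value:

sigma <- 2
y <- rnorm(1e5, mean = 3, sd = sigma)
mean(-dnorm(y, mean = 3, sd = sigma, log = TRUE))   # estimated entropy of N(3, sigma^2)
(1 + log(2 * pi * sigma^2)) / 2                     # exact value from (1.45)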

1.7.1 Example: Iris data


Consider the first three variables of the iris data (sepal length, sepal width, and petal length), normalized so that each variable has mean zero and variance one. We find the first two principal components, which maximize the variances, and the first two components that maximize the estimated entropies, defined as in (1.28) but with the estimated entropy of Yg substituted for the variance g'Sg. The table (1.47) contains the loadings for the variables. Note that the two objective functions do produce different projections. The first principal component weights equally on the two length variables, while the first entropy variable is essentially petal length.

                 Variance          Entropy
                 g1      g2        g1*     g2*
Sepal length     0.63    0.43      0.08    0.74
Sepal width     -0.36    0.90      0.00   -0.68
Petal length     0.69    0.08     -1.00    0.06       (1.47)

Figure 1.7: Projection pursuit for the iris data. The first plot is based on maximizing the variances of the projections, i.e., principal components. The second plot maximizes estimated entropies.

Figure 1.7 graphs the results. The plots both show separation between setosa and the other two species, but the principal components plot has the observations more spread out, while the entropy plot shows the two groups much tighter.
The matrix iris has the iris data, with the first four columns containing the mea-
surements, and the fifth specifying the species. The observations are listed with the
fifty setosas first, then the fifty versicolors, then the fifty virginicas. To find the prin-
cipal components for the first three variables, we use the following:
y <- scale(as.matrix(iris[,1:3]))
g <- eigen(var(y))$vectors
pc <- y %*% g
The first statement centers and scales the variables. The plot of the first two columns
of pc is the first plot in Figure 1.7. The procedure we used for entropy is negent3D in
Listing A.3, explained in Appendix A.1. The code is
gstar <- negent3D(y, nstart=10)$vectors
ent <- y %*% gstar
To create plots like the ones in Figure 1.7, use
par(mfrow=c(1,2))
sp <- rep(c('s','v','g'), c(50,50,50))
plot(pc[,1:2], pch=sp) # pch specifies the characters to plot.
plot(ent[,1:2], pch=sp)

1.8 Proofs
Proof of the principal components result, Lemma 1.1
The idea here was taken from Davis and Uhl [1999]. Consider the g_1, ..., g_q as defined in (1.28). Take i < j, and for angle θ, let

h(\theta) = g(\theta)' S g(\theta), \quad \text{where } g(\theta) = \cos(\theta)\, g_i + \sin(\theta)\, g_j.    (1.48)

Because the g_i's are orthonormal,

\|g(\theta)\| = 1 \quad \text{and} \quad g(\theta)' g_1 = \cdots = g(\theta)' g_{i-1} = 0.    (1.49)

According to the ith stage in (1.28), h(θ) is maximized when g(θ) = g_i, i.e., when θ = 0. The function is differentiable, hence its derivative must be zero at θ = 0. To verify (1.30), differentiate:

0 = \frac{d}{d\theta} h(\theta) \Big|_{\theta = 0}
  = \frac{d}{d\theta} \left( \cos^2(\theta)\, g_i' S g_i + 2 \sin(\theta) \cos(\theta)\, g_i' S g_j + \sin^2(\theta)\, g_j' S g_j \right) \Big|_{\theta = 0}
  = 2\, g_i' S g_j.    (1.50)

Best K components
We next consider finding the set b1 , . . . , bK orthonormal vectors to maximize the sum
of variances, ∑K 
i =1 bi Sbi , as in (1.35). It is convenient here to have the next definition.

Definition 1.4 (Trace). The trace of an m × m matrix A is the sum of its diagonals,
trace(A) = ∑ m
i =1 a ii .

Thus if we let B = (b1 , . . . , bK ), we have that


K
∑ bi Sbi = trace(B SB). (1.51)
i =1

Proposition 1.1. Best K components. Suppose S is a q × q covariance matrix, and define
BK to be the set of q × K matrices with orthonormal columns, 1 ≤ K ≤ q. Then

max_{B∈BK} trace(B′SB) = l1 + · · · + lK,    (1.52)

which is achieved by taking B = (g1, . . . , gK), where gi is the ith principal component loading
vector for S, and li is the corresponding variance.
The proposition follows directly from the next lemma, with S as in (1.33).

Lemma 1.3. Suppose S and BK are as in Proposition 1.1, and S = GLG′ is its spectral
decomposition. Then (1.52) holds.

Proof. Set A = G′B, so that A is also in BK. Then B = GA, and

trace(B′SB) = trace(A′G′SGA)
            = trace(A′LA)
            = ∑_{i=1}^q [(∑_{j=1}^K aij²) li]
            = ∑_{i=1}^q ci li,    (1.53)

where the aij’s are the elements of A, and ci = ∑_{j=1}^K aij². Because the columns of A
have norm one, and the rows of A have norms less than or equal to one,

∑_{i=1}^q ci = ∑_{j=1}^K [∑_{i=1}^q aij²] = K and ci ≤ 1.    (1.54)

To maximize (1.53) under those constraints on the ci ’s, we try to make the earlier
ci ’s as large as possible, which means that c1 = · · · = cK = 1 and cK +1 = · · · =
cq = 0. The resulting value is then l1 + · · · + lK . Note that taking A with aii = 1,
i = 1, . . . , K, and 0 elsewhere (so that A consists of the first K columns of Iq ), achieves
that maximum. With that A, we have that B = (g1 , . . . , gK ).
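Proposition 1.1 is easy to check numerically; for instance, with the scaled iris variables y from the code in Section 1.7 (a quick sketch with K = 2):
S <- var(y) # y <- scale(as.matrix(iris[,1:3])), as above
eig <- eigen(S)
K <- 2
B <- eig$vectors[,1:K] # first K principal component loading vectors
sum(diag(t(B)%*%S%*%B)) # trace(B'SB)
sum(eig$values[1:K]) # l1 + ... + lK; the two values agree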

Proof of the entropy result, Lemma 1.2


Let f be the N (μ, σ2 ) density, and g be any other pdf with mean μ and variance σ2 .
Then

Entropy( f ) − Entropy( g) = −∫ f(y) log( f(y))dy + ∫ g(y) log( g(y))dy
  = ∫ g(y) log( g(y))dy − ∫ g(y) log( f(y))dy
    + ∫ g(y) log( f(y))dy − ∫ f(y) log( f(y))dy
  = −∫ g(y) log( f(y)/g(y))dy
    + Eg[ log(√(2πσ²)) + (Y − μ)²/(2σ²) ]
    − Ef[ log(√(2πσ²)) + (Y − μ)²/(2σ²) ]    (1.55)
  = Eg[− log( f(Y)/g(Y))].    (1.56)

The last two terms in (1.55) are equal, since Y has the same mean and variance under
f and g.
At this point we need an important inequality about convexity; to wit, what
follows is a definition and lemma.

Definition 1.5 (Convexity). The real-valued function h, defined on an interval I ⊂ R, is
convex if for each x0 ∈ I, there exist a0 and b0 such that

h(x0) = a0 + b0 x0 and h(x) ≥ a0 + b0 x for x ≠ x0.    (1.57)

The function is strictly convex if the inequality is strict in (1.57).
The line a0 + b0 x is the tangent line to h at x0 . Convex functions have tangent
lines that are below the curve, so that convex functions are “bowl-shaped.” The next
lemma is proven in Exercise 1.9.13.
Lemma 1.4 (Jensen’s inequality). Suppose W is a random variable with finite expected
value. If h(w) is a convex function, then
E [ h(W )] ≥ h( E [W ]), (1.58)
where the left-hand expectation may be infinite. Furthermore, the inequality is strict if h(w)
is strictly convex and W is not constant, that is, P[W = c] < 1 for any c.
One way to remember the direction of the inequality is to imagine h(w) = w²,
in which case (1.58) states that E[W²] ≥ E[W]², which we already know because
Var[W] = E[W²] − E[W]² ≥ 0.
Now back to (1.56). The function h(w) = − log(w) is strictly convex, and if g is
not equivalent to f, W = f(Y)/g(Y) is not constant. Jensen’s inequality thus shows
that

Eg[− log( f(Y)/g(Y))] > − log( Eg[ f(Y)/g(Y)] )
  = − log( ∫ ( f(y)/g(y)) g(y)dy )
  = − log( ∫ f(y)dy ) = − log(1) = 0.    (1.59)

Putting (1.56) and (1.59) together yields


Entropy( N (0, σ2 )) − Entropy( g) > 0, (1.60)
which completes the proof of Lemma 1.2. 2
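For a small numerical illustration: a Uniform(0, 1) density has variance 1/12 and entropy 0, while the normal with the same mean and variance has the larger entropy (1/2)(1 + log(2π/12)) ≈ 0.18, as Lemma 1.2 guarantees. In R,
sigma2 <- 1/12 # the variance of a Uniform(0,1), whose entropy is 0
(1 + log(2*pi*sigma2))/2 # entropy of the normal with the same variance, about 0.18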
Answers: The question marks in Figure 1.4 are, respectively, virginica, setosa,
virginica, versicolor, and setosa.

1.9 Exercises
Exercise 1.9.1. Let Hn be the centering matrix in (1.12). (a) What is Hn 1n ? (b) Suppose
x is an n × 1 vector whose elements sum to zero. What is Hn x? (c) Show that Hn is
idempotent (1.16).

Exercise 1.9.2. Define the matrix Jn = (1/n)1n1n′, so that Hn = In − Jn. (a) What
does Jn do to a vector? (That is, what is Jn a?) (b) Show that Jn is idempotent. (c) Find
the spectral decomposition (1.33) for Jn explicitly when n = 3. [Hint: In G, the first
column (eigenvector) is proportional to 13 . The remaining two eigenvectors can be
any other vectors such that the three eigenvectors are orthonormal. Once you have a
G, you can find the L.] (d) Find the spectral decomposition for H3 . [Hint: Use the
same eigenvectors as for J3 , but in a different order.] (e) What do you notice about
the eigenvalues for these two matrices?

Exercise 1.9.3. A covariance matrix has intraclass correlation structure if all the vari-
ances are equal, and all the covariances are equal. So for n = 3, it would look like
        ⎛ a  b  b ⎞
    A = ⎜ b  a  b ⎟ .    (1.61)
        ⎝ b  b  a ⎠

Find the spectral decomposition for this type of matrix. [Hint: Use the G in Exercise
1.9.2, and look at G′AG.]

Exercise 1.9.4. Suppose Y is an n × q data matrix, and W = YG, where G is a q × q


orthogonal matrix. Let y1 , . . . , yn be the rows of Y, and similarly wi ’s be the rows of
W. (a) Show that the corresponding points have the same length, ‖yi‖ = ‖wi‖. (b)
Show that the distances between the points have not changed, ‖yi − yj‖ = ‖wi − wj‖,
for any i, j.
Exercise 1.9.5. Suppose that the columns of G constitute the principal component
loading vectors for the sample covariance matrix S. Show that gi′Sgi = li and gi′Sgj = 0
for i ≠ j, as in (1.30), implies (1.32): G′SG = L.
Exercise 1.9.6. Verify (1.49) and (1.50).

Exercise 1.9.7. In (1.53), show that trace(A′LA) = ∑_{i=1}^q [(∑_{j=1}^K aij²) li].

Exercise 1.9.8. This exercise is to show that the eigenvalue matrix of a covariance
matrix S is unique. Suppose S has two spectral decompositions, S = GLG′ = HMH′,
where G and H are orthogonal matrices, and L and M are diagonal matrices with
nonincreasing diagonal elements. Use Lemma 1.3 on both decompositions of S to
show that for each K = 1, . . . , q, l1 + · · · + lK = m1 + · · · + mK. Thus L = M.

Exercise 1.9.9. Suppose Y is a data matrix, and Z = YF for some orthogonal matrix
F, so that Z is a rotated version of Y. Show that the variances of the principal com-
ponents are the same for Y and Z. (This result should make intuitive sense.) [Hint:
Find the spectral decomposition of the covariance of Z from that of Y, then note that
these covariance matrices have the same eigenvalues.]

Exercise 1.9.10. Show that in the spectral decomposition (1.33), each li is an eigen-
value, with corresponding eigenvector gi , i.e., Sgi = li gi .
Exercise 1.9.11. Suppose λ is an eigenvalue of the covariance matrix S. Show that
λ must equal one of the li ’s in the spectral decomposition of S. [Hint: Let u be
an eigenvector corresponding to λ. Show that λ is also an eigenvalue of L, with
corresponding eigenvector v = G′u, hence li vi = λvi for each i.]

Exercise 1.9.12. Verify the expression for ∫ g(y) log( f(y))dy in (1.55).

Exercise 1.9.13. Consider the setup in Jensen’s inequality, Lemma 1.4. (a) Show that if
h is convex, E[h(W)] ≥ h(E[W]). [Hint: Set x0 = E[W] in Definition 1.5.] (b) Suppose
h is strictly convex. Give an example of a random variable W for which E[h(W)] =
h(E[W]). (c) Show that if h is strictly convex and W is not constant, then E[h(W)] > h(E[W]).

Exercise 1.9.14 (Spam). In the Hewlett-Packard spam data, a set of n = 4601 emails
were classified according to whether they were spam, where “0” means not spam, “1”
means spam. Fifty-seven explanatory variables based on the content of the emails
were recorded, including various word and symbol frequencies. The emails were
sent to George Forman (not the boxer) at Hewlett-Packard labs, hence emails with
the words “George” or “hp” would likely indicate non-spam, while “credit” or “!”
would suggest spam. The data were collected by Hopkins et al. [1999], and are in the
data matrix Spam. ( They are also in the R data frame spam from the ElemStatLearn
package [Halvorsen, 2009], as well as at the UCI Machine Learning Repository [Frank
and Asuncion, 2010].)
Based on an email’s content, is it possible to accurately guess whether it is spam
or not? Here we use Chernoff’s faces. Look at the faces of some emails known to
be spam and some known to be non-spam (the “training data”). Then look at some
randomly chosen faces (the “test data”). E.g., to have twenty observations known
to be spam, twenty known to be non-spam, and twenty test observations, use the
following R code:
x0 <- Spam[Spam[,'spam']==0,] # The non-spam
x1 <- Spam[Spam[,'spam']==1,] # The spam
train0 <- x0[1:20,]
train1 <- x1[1:20,]
test <- rbind(x0[-(1:20),],x1[-(1:20),])[sample(1:4561,20),]
Based on inspecting the training data, try to classify the test data. How accurate are
your guesses? The faces program uses only the first fifteen variables of the input
matrix, so you should try different sets of variables. For example, for each variable
find the value of the t-statistic for testing equality of the spam and email groups, then
choose the variables with the largest absolute t’s.

Exercise 1.9.15 (Spam). Continue with the spam data from Exercise 1.9.14. (a) Plot the
variances of the explanatory variables (the first 57 variables) versus the index (i.e., the
x-axis has (1, 2, . . . , 57), and the y-axis has the corresponding variances.) You might
not see much, so repeat the plot, but taking logs of the variances. What do you see?
Which three variables have the largest variances? (b) Find the principal components
using just the explanatory variables. Plot the eigenvalues versus the index. Plot the
log of the eigenvalues versus the index. What do you see? (c) Look at the loadings for
the first three principal components. (E.g., if spamload contains the loadings (eigen-
vectors), then you can try plotting them using matplot(1:57,spamload[,1:3]).) What
is the main feature of the loadings? How do they relate to your answer in part (a)?
(d) Now scale the explanatory variables so each has mean zero and variance one:
spamscale <- scale(Spam[,1:57]). Find the principal components using this matrix.
Plot the eigenvalues versus the index. What do you notice, especially compared to
the results of part (b)? (e) Plot the loadings of the first three principal components
obtained in part (d). How do they compare to those from part (c)? Why is there such
a difference?

Exercise 1.9.16 (Sports data). Consider the Louis Roussos sports data described in
Section 1.6.2. Use faces to cluster the observations. Use the raw variables, or the prin-
cipal components, and try different orders of the variables (which maps the variables
to different sets of facial features). After clustering some observations, look at how

they ranked the sports. Do you see any pattern? Were you able to distinguish be-
tween people who like team sports versus individual sports? Those who like (dislike)
tennis? Jogging?

Exercise 1.9.17 (Election). The data set election has the results of the first three US
presidential races of the 2000’s (2000, 2004, 2008). The observations are the 50 states
plus the District of Columbia, and the values are the ( D − R)/( D + R) for each state
and each year, where D is the number of votes the Democrat received, and R is the
number the Republican received. (a) Without scaling the variables, find the principal
components. What are the first two principal component loadings measuring? What
is the ratio of the standard deviation of the first component to the second’s? (c) Plot
the first versus second principal components, using the states’ two-letter abbrevia-
tions as the plotting characters. (They are in the vector stateabb.) Make the plot so
that the two axes cover the same range. (d) There is one prominent outlier. What
is it, and for which variable is it mostly outlying? (e) Comparing how states are
grouped according to the plot and how close they are geographically, can you make
any general statements about the states and their voting profiles (at least for these
three elections)?

Exercise 1.9.18 (Painters). The data set painters has ratings of 54 famous painters. It
is in the MASS package [Venables and Ripley, 2002]. See Davenport and Studdert-
Kennedy [1972] for a more in-depth discussion. The R help file says
The subjective assessment, on a 0 to 20 integer scale, of 54 classical painters.
The painters were assessed on four characteristics: composition, drawing,
colour and expression. The data is due to the Eighteenth century art critic,
de Piles.
The fifth variable gives the school of the painter, using the following coding:
A: Renaissance; B: Mannerist; C: Seicento; D: Venetian; E: Lombard; F:
Sixteenth Century; G: Seventeenth Century; H: French
Create the two-dimensional biplot for the data. Start by turning the data into a matrix,
then centering both dimensions, then scaling:
x <- scale(as.matrix(painters[,1:4]),scale=F) # center the columns
x <- t(scale(t(x),scale=F)) # center the rows
x <- scale(x) # scale the columns
Use the fifth variable, the painters’ schools, as the plotting character, and the four
rating variables as the arrows. Interpret the two principal component variables. Can
you make any generalizations about which schools tend to rate high on which scores?

Exercise 1.9.19 (Cereal). Chakrapani and Ehrenberg [1981] analyzed people’s atti-
tudes towards a variety of breakfast cereals. The data matrix cereal is 8 × 11, with
rows corresponding to eight cereals, and columns corresponding to potential at-
tributes about cereals. The attributes: Return (a cereal one would come back to),
tasty, popular (with the entire family), digestible, nourishing, natural flavor, afford-
able, good value, crispy (stays crispy in milk), fit (keeps one fit), and fun (for children).
The original data consisted of the percentage of subjects who thought the given ce-
real possessed the given attribute. The present matrix has been doubly centered, so
that the row means and column means are all zero. (The original data can be found
in the S-Plus [TIBCO Software Inc., 2009] data set cereal.attitude.) Create the two-
dimensional biplot for the data with the cereals as the points (observations), and the
attitudes as the arrows (variables). What do you see? Are there certain cereals/at-
tributes that tend to cluster together? (You might want to look at the Wikipedia entry
[Wikipedia, 2011] on breakfast cereals.)

Exercise 1.9.20 (Decathlon). The decathlon data set has scores on the top 24 men in
the decathlon (a set of ten events) at the 2008 Olympics. The scores are the numbers
of points each participant received in each event, plus each person’s total points. The
data can be found at the NBC Olympic site [Olympics, 2008]. Create the biplot for
these data based on the first ten variables (i.e., do not use their total scores). Doubly
center, then scale, the data as in Exercise 1.9.18. The events should be the arrows. Do
you see any clustering of the events? The athletes?

The remaining questions require software that will display rotating point clouds of
three dimensions, and calculate some projection pursuit objective functions. The
Spin program at http://stat.istics.net/MultivariateAnalysis is sufficient for
our purposes. GGobi [Cook and Swayne, 2007] has an excellent array of graphical
tools for interactively exploring multivariate data. See also the spin3R routine in the
R package aplpack [Wolf and Bielefeld, 2010].

Exercise 1.9.21 (Iris). Consider the three variables X = Sepal Length, Y = Petal Length,
and Z = Petal Width in the Fisher/Anderson iris data. (a) Look at the data while
rotating. What is the main feature of these three variables? (b) Scale the data so that
the variables all have the same sample variance. (The Spin program automatically
performs the scaling.) For various objective functions (variance, skewness, kurtosis,
negative kurtosis, negentropy), find the rotation that maximizes the function. (That
is, the first component of the rotation maximizes the criterion over all rotations. The
second then maximizes the criterion for components orthogonal to the first. The third
component is then whatever is orthogonal to the first two.) Which criteria are most
effective in yielding rotations that exhibit the main feature of the data? Which are
least effective? (c) Which of the original variables are most prominently represented
in the first two components of the most effective rotations?

Exercise 1.9.22 (Automobiles). The data set cars [Consumers’ Union, 1990] contains
q = 11 size measurements on n = 111 models of automobile. The original data can be
found in the S-Plus [TIBCO Software Inc., 2009] data frame cu.dimensions. In cars,
the variables have been normalized to have medians of 0 and median absolute devi-
ations (MAD) of 1.4826 (the MAD for a N (0, 1)). Inspect the three-dimensional data
set consisting of the variables length, width, and height. (In the Spin program, the
data set is called “Cars.”) (a) Find the linear combination with the largest variance.
What is the best linear combination? (Can you interpret it?) What is its variance?
Does the histogram look interesting? (b) Now find the linear combination to maxi-
mize negentropy. What is the best linear combination, and its entropy? What is the
main feature of the histogram? (c) Find the best two linear combinations for entropy.
What are they? What feature do you see in the scatter plot?

Exercise 1.9.23 (RANDU). RANDU [IBM, 1970] is a venerable, fast, efficient, and very
flawed random number generator. See Dudewicz and Ralley [1981] for a thorough re-
view of old-time random number generators. For given “seed” x0 , RANDU produces
xi+1 from xi via
xi+1 = (65539 xi) mod 2^31.    (1.62)
The “random” Uniform(0,1) values are then ui = xi/2^31. The R data set randu is
based on a sequence generated using RANDU, where each of n = 400 rows is a set
of p = 3 consecutive u i ’s. Rotate the data, using objective criteria if you wish, to look
for significant non-randomness in the data matrix. If the data are really random, the
points should uniformly fill up the three-dimensional cube. What feature do you see
that reveals the non-randomness?

The data sets Example 1, Example 2, . . ., Example 5 are artificial three-dimensional


point clouds. The goal is to rotate the point clouds to reveal their structures.
Exercise 1.9.24. Consider the Example 1 data set. (a) Find the first two principal
components for these data. What are their variances? (b) Rotate the data. Are the
principal components unique? (c) Find the two-dimensional plots based on maximiz-
ing the skewness, kurtosis, negative kurtosis, and negentropy criteria. What do you
see? What does the histogram for the linear combination with the largest kurtosis
look like? Is it “pointy”? What does the histogram for the linear combination with
the most negative kurtosis look like? Is it “boxy”? (d) Describe the three-dimensional
structure of the data points. Do the two-dimensional plots in part (c) give a good
idea of the three-dimensional structure?
Exercise 1.9.25. This question uses the Example 2 data set. (a) What does the his-
togram for the linear combination with the largest variance look like? (b) What does
the histogram for the linear combination with the largest negentropy look like? (c)
Describe the three-dimensional object.
Exercise 1.9.26. For each of Example 3, 4, and 5, try to guess the shape of the cloud
of data points based on just the 2-way scatter plots. Then rotate the points enough to
convince yourself of the actual shape.
Chapter 2

Multivariate Distributions

This chapter reviews the elements of distribution theory that we need, especially for
vectors and matrices. (Classical multivariate analysis is basically linear algebra, so
everything we do eventually gets translated into matrix equations.) See any good
mathematical statistics book such as Hogg, McKean, and Craig [2004], Bickel and
Doksum [2000], or Lehmann and Casella [1998] for a more comprehensive treatment.

2.1 Probability distributions


We will deal with random variables and finite collections of random variables. A ran-
dom variable X has range or space X ⊂ R, the real line. A collection of random vari-
ables is just a set of random variables. They could be arranged in any convenient way,
such as a row or column vector, matrix, triangular array, or three-dimensional array,
and will often be indexed to match the arrangement. The default arrangement will be
to index the random variables by 1, . . . , N, so that the collection is X = ( X1 , . . . , X N ),
considered as a row vector. The space of X is X ⊂ R N , N-dimensional Euclidean
space. A probability distribution P for a random variable or collection of random
variables specifies the chance that the random object will fall in a particular subset
of its space. That is, for A ⊂ X , P [ A] is the probability that the random X is in A,
also written P [X ∈ A]. In principle, to describe a probability distribution, one must
specify P [ A] for all subsets A. (Technically, all “measurable” subsets, but we will not
worry about measurability.) Fortunately, there are easier ways. We will use densities,
but the main method will be to use representations, by which we mean describing a
collection of random variables Y in terms of another collection X for which we already
know the distribution, usually through a function, i.e., Y = g(X).

2.1.1 Distribution functions


The distribution function for the probability distribution P for the collection X =
( X1 , . . . , X N ) of random variables is the function

F : R N → [0, 1]
F ( x 1 , x 2 , . . . , x N ) = P [ X1 ≤ x 1 , X2 ≤ x 2 , . . . , X N ≤ x N ] . (2.1)


Note that it is defined on all of R N , not just the space of X. It is nondecreasing, and
continuous from the right, in each xi . The limit as all xi → −∞ is zero, and as all
xi → ∞, the limit is one. The distribution function uniquely defines the distribution,
though we will not find much use for it.

2.1.2 Densities
A collection of random variables X is said to have a density with respect to Lebesgue
measure on R N , if there is a nonnegative function f (x),

f : X −→ [0, ∞ ), (2.2)

such that for any A ⊂ X ,

P[A] = ∫_A f(x)dx
     = ∫ · · · ∫_A f(x1, . . . , xN)dx1 · · · dxN.    (2.3)

The second line is there to emphasize that we have a multiple integral. (The Lebesgue
measure of a subset A of R N is the integral ∫_A dx, i.e., as if f(x) = 1 in (2.3). Thus if
N = 1, the Lebesgue measure of a line segment is its length. In two dimensions, the
Lebesgue measure of a set is its area. For N = 3, it is the volume.)
We will call a density f as in (2.3) the “pdf,” for “probability density function.”
Because P [X ∈ X ] = 1, the integral of the pdf over the entire space X must be 1.
Random variables or collections that have pdf’s are continuous in the sense that the
probability X equals a specific value x is 0. (There are continuous distributions that
do not have pdf’s, such as the uniform distribution on the unit circle.)
If X does have a pdf, then it can be obtained from the distribution function in (2.1)
by differentiation:

f(x1, . . . , xN) = ∂^N F(x1, . . . , xN) / (∂x1 · · · ∂xN).    (2.4)

If the space X is a countable (which includes finite) set, then its probability can be
given by specifying the probability of each individual point. The probability mass
function f , or “pmf,” with
f : X −→ [0, 1], (2.5)
is given by
f (x) = P [X = x] = P [{x}]. (2.6)
The probability of any subset A is the sum of the probabilities of the individual points
in A,
P[A] = ∑_{x∈A} f(x).    (2.7)

Such an X is called discrete. (A pmf is also a density, but with respect to counting
measure on X , not Lebesgue measure.)
Not all random variables are either discrete or continuous, and especially a collec-
tion of random variables could have some discrete and some continuous members. In
such cases, the probability of a set is found by integrating over the continuous parts
and summing over the discrete parts. For example, suppose our collection is a 1 × N
vector combining two other collections, i.e.,

W = (X, Y) has space W , X is 1 × Nx and Y is 1 × Ny , N = Nx + Ny . (2.8)

For a subset A ⊂ W , define the marginal set by

X A = {x ∈ R Nx | (x, y) ∈ A for some y}, (2.9)

and the conditional set given X = x by

YxA = {y ∈ R Ny | (x, y) ∈ A}. (2.10)

Suppose X is discrete and Y is continuous. Then f (x, y) is a mixed-type density for


the distribution of W if for any A ⊂ W ,

P[A] = ∑_{x∈X^A} ∫_{Y_x^A} f(x, y)dy.    (2.11)

We will use the generic term “density” to mean pdf, pmf, or the mixed type of density
in (2.11). There are other types of densities, but we will not need to deal with them.

2.1.3 Representations
Representations are very useful, especially when no pdf exists. For example, suppose
Y = (Y1, Y2) is uniform on the unit circle, by which we mean Y has space Y = {y ∈
R2 | ‖y‖ = 1}, and it is equally likely to be any point on that circle. There is no
pdf, because the area of the circle in R2 is zero, so the integral over any subset of
Y of any function is zero. The distribution can be thought of in terms of the angle
y makes with the x-axis, that is, y is equally likely to be at any angle. Thus we can
let X ∼ Uniform(0, 2π ]: X has space (0, 2π ] and pdf f X ( x ) = 1/(2π ). Then we can
define
Y = (cos( X ), sin( X )). (2.12)
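For example, this representation makes it easy to generate the distribution in R (a small sketch using only the construction in (2.12)):
x <- runif(1000, 0, 2*pi) # X ~ Uniform(0, 2*pi]
y <- cbind(cos(x), sin(x)) # Y = (cos(X), sin(X))
plot(y, asp=1) # the points lie on the unit circle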
In general, suppose we are given the distribution for X with space X and function
g,
g : X −→ Y . (2.13)
Then for any B ⊂ Y , we can define the probability of Y by

P [Y ∈ B ] = P [ g(X) ∈ B ] = P [X ∈ g−1 ( B )]. (2.14)

We know the final probability because g−1 ( B ) ⊂ X .


One special type of function yields marginal distributions, analogous to the mar-
ginals in Section 1.5, that picks off some of the components. Consider the setup in
(2.8). The marginal function for X simply chooses the X components:

g(x, y) = x. (2.15)

The space of X is then given by (2.9) with A = W , i.e.,

X ≡ X W = {x ∈ R Nx | (x, y) ∈ W for some y}. (2.16)



If f (x, y) is the density for (X, Y), then the density of X can be found by “integrating
(or summing) out” the y. That is, if f is a pdf, then f X (x) is the pdf for X, where

fX(x) = ∫_{Y_x} f(x, y)dy,    (2.17)

and
Yx = YxW = {y ∈ R Ny | (x, y) ∈ W } (2.18)
is the conditional space (2.10) with A = W . If y has some discrete components, then
they are summed in (2.17).
Note that we can find the marginals of any subset, not just sets of consecutive
elements. E.g., if X = ( X1 , X2 , X3 , X4 , X5 ), we can find the marginal of ( X2 , X4 , X5 )
by integrating out the X1 and X3 .
Probability distributions can also be represented through conditioning, discussed
in the next section.

2.1.4 Conditional distributions


The conditional distribution of one or more variables given another set of variables,
the relationship of “cause” to “effect,” is central to multivariate analysis. E.g., what is
the distribution of health measures given diet, smoking, and ethnicity? We start with
the two collections of variables Y and X, each of which may be a random variable,
vector, matrix, etc. We want to make sense of the notion

Conditional distribution of Y given X = x, written Y | X = x. (2.19)

What this means is that for each fixed value x, there is a possibly different distribution
for Y.
Very generally, such conditional distributions will exist, though they may be hard
to figure out, even what they mean. In the discrete case, the concept is straightfor-
ward, and by analogy the case with densities follows. For more general situations,
we will use properties of conditional distributions rather than necessarily specifying
them.
We start with the (X, Y) as in (2.8), and assume we have their joint distribution
P. The word “joint” is technically unnecessary, but helps to emphasize that we are
considering the two collections together. The joint space is W , and let X denote the
marginal space of X as in (2.16), and for each x ∈ X , the conditional space of Y given
X = x, Yx , is given in (2.18). For example, if the space W = {( x, y) | 0 < x < y < 1},
then X = (0, 1), and for x ∈ X , Y x = ( x, 1).
Next, given the joint distribution of (X, Y), we define the conditional distribution
(2.19) in the discrete, then pdf, cases.

Discrete case
For sets A and B, the conditional probability of A given B is defined as

P[A | B] = P[A ∩ B] / P[B] if B ≠ ∅.    (2.20)

If B is empty, then the conditional probability is not defined since we would have 0/0.
For a discrete pair (X, Y), let f (x, y) be the pmf. Then the conditional distribution of
Y given X = x can be specified by

P [Y = y | X = x], for x ∈ X , y ∈ Yx . (2.21)

at least if P [X = x] > 0. The expression in (2.21) is, for fixed x, the conditional pmf
for Y:

fY|X(y | x) = P[Y = y | X = x]
            = P[Y = y and X = x] / P[X = x]
            = f(x, y) / fX(x),   y ∈ Yx,    (2.22)

if f X (x) > 0, where f X (x) is the marginal pmf of X from (2.17) with sums.

Pdf case
In the discrete case, the restriction that P [X = x] > 0 is not worrisome, since the
chance is 0 we will have a x with P [X = x] = 0. In the continuous case, we cannot
follow the same procedure, since P [X = x] = 0 for all x ∈ X . However, if we have
pdf’s, or general densities, we can analogize (2.22) and declare that the conditional
density of Y given X = x is

fY|X(y | x) = f(x, y) / fX(x),   y ∈ Yx,    (2.23)

if f X (x) > 0. In this case, as in the discrete one, the restriction that f X (x) > 0 is not
worrisome, since the set on which X has density zero has probability zero. It turns
out that the definition (2.23) is mathematically legitimate.
The Y and X can be very general. Often, both will be functions of a collection
of random variables, so that we may be interested in conditional distributions of the
type
g (Y ) | h (Y ) = z (2.24)
for some functions g and h.

Reconstructing the joint distribution


Note that if we are given the marginal space and density for X, and the conditional
spaces and densities for Y given X = x, then we can reconstruct the joint space and
joint density:

W = {(x, y) | y ∈ Yx , x ∈ X } and f (x, y) = f Y|X (y | x) f X (x). (2.25)

Thus another way to represent a distribution for Y is to specify the conditional


distribution given some X = x, and the marginal of X. The marginal distribution of
Y is then found by first finding the joint as in (2.25), then integrating out the x:

fY(y) = ∫_{X_y} fY|X(y | x) fX(x)dx.    (2.26)

2.2 Expected values


Means, variances, and covariances (Section 1.4) are key sample quantities in describ-
ing data. Similarly, they are important for describing random variables. These are all
expected values of some function, defined next.
Definition 2.1 (Expected value). Suppose X has space X , and consider the real-valued
function g,
g : X −→ R. (2.27)
If X has pdf f , then the expected value of g(X), E [ g(X)], is

E[g(X)] = ∫_X g(x) f(x)dx    (2.28)

if the integral converges. If X has pmf f , then

E[g(X)] = ∑_{x∈X} g(x) f(x)    (2.29)

if the sum converges.


As in (2.11), if the collection is (X, Y), where X is discrete and Y is continuous, and
f (x, y) is its mixed-type density, then for function g(x, y),

E[g(X, Y)] = ∑_{x∈X} ∫_{Y_x} g(x, y) f(x, y)dy.    (2.30)

if everything converges. (The spaces are defined in (2.16) and (2.18).)


Expected values for representations cohere in the proper way, that is, if Y is a
collection of random variables such that Y = h(X), then for a function g,

E [ g(Y)] = E [ g(h(X))], (2.31)

if the latter exists. Thus we often can find the expected values of functions of Y based
on the distribution of X.

Conditioning
If (X, Y) has a joint distribution, then we can define the conditional expectation of
g(Y) given X = x to be the regular expected value of g(Y), but we use the conditional
distribution Y | X = x. In the pdf case, we write

E[g(Y) | X = x] = ∫_{Y_x} g(y) fY|X(y|x)dy ≡ eg(x).    (2.32)

Note that the conditional expectation is a function of x. We can then take the expected
value of that, using the marginal distribution of X. We end up with the same result
(if we end up with anything) as taking the usual expected value of g(Y). That is

E [ g(Y)] = E [ E [ g(Y) | X = x] ]. (2.33)

There is a bit of a notational glitch in the formula, since the inner expected value is a
function of x, a constant, and we really want to take the expected value over X. We
cannot just replace x with X, however, because then we would have the undesired
E [ g(Y) | X = X]. So a more precise way to express the result is to use the e g (x) in
(2.32), so that
E [ g(Y)] = E [ e g (X)]. (2.34)
This result holds in general. It is not hard to see in the pdf case:

E[eg(X)] = ∫_X eg(x) fX(x)dx
         = ∫_X [ ∫_{Y_x} g(y) fY|X(y|x)dy ] fX(x)dx    by (2.32)
         = ∫_X ∫_{Y_x} g(y) f(x, y)dydx    by (2.25)
         = ∫∫_W g(y) f(x, y)dxdy    by (2.25)
         = E[g(Y)].    (2.35)

A useful corollary is the total probability formula: For B ⊂ Y , if X has a pdf,



P[Y ∈ B] = ∫_X P[Y ∈ B | X = x] fX(x)dx.    (2.36)

If X has a pmf, then we sum. The formula follows by taking g to be the indicator
function IB, given as

IB(y) = { 1 if y ∈ B;  0 if y ∉ B }.    (2.37)

2.3 Means, variances, and covariances


Means, variances, and covariances are particular expected values. For a collection
of random variables X = ( X1 , . . . , X N ), the mean of X j is its expected value, E [ X j ].
(Throughout this section, we will be acting as if the expected values exist. So if E [ X j ]
doesn’t exist, then the mean of X j doesn’t exist, but we might not explicitly mention
that.) Often the mean is denoted by μ, so that E [ X j ] = μ j .
The variance of X j , often denoted σj2 or σjj , is

σjj = Var [ X j ] = E [( X j − μ j )2 ]. (2.38)

The covariance between X j and Xk is defined to be

σjk = Cov[ X j , Xk ] = E [( X j − μ j )( Xk − μ k )]. (2.39)

Their correlation coefficient is


Corr[Xj, Xk] = ρjk = σjk / √(σjj σkk),    (2.40)

if both variances are positive. Compare these definitions to those of the sample
analogs, (1.3), (1.4), (1.5), and (1.6). So, e.g., Var [ X j ] = Cov[ X j , X j ].
The mean of the collection X is the corresponding collection of means. That is,

μ = E [X] = ( E [ X1 ], . . . , E [ X N ]). (2.41)



2.3.1 Vectors and matrices


If a collection has a particular structure, then its mean has the same structure. That
is, if X is a row vector as in (2.41), then E [X] = ( E [ X1 ], . . . , E [ X N ]). If X is a column
vector, so is its mean. Similarly, if W is an n × p matrix, then so is its mean. That is,
⎡⎛ ⎞⎤
W11 W12 ··· W1p
⎢⎜ W21 W22 ··· W2p ⎟⎥
⎢⎜ ⎟⎥
E [W] = E ⎢⎜ .. .. .. .. ⎟⎥
⎣⎝ . . . . ⎠⎦
Wn1 Wn2 ··· Wnp
⎛ ⎞
E [W11 ] E [W12 ] ··· E [W1p ]
⎜ E [W21 ] E [W22 ] ··· E [W2p ] ⎟
⎜ ⎟
=⎜ .. .. .. .. ⎟. (2.42)
⎝ . . . . ⎠
E [Wn1 ] E [Wn2 ] ··· E [Wnp ]

Turning to variances and covariances, first suppose that X is a vector (row or


column). There are N variances and (N choose 2) covariances among the Xj’s to consider,
recognizing that Cov[ X j , Xk ] = Cov[ Xk , X j ]. By convention, we will arrange them into
a matrix, the variance-covariance matrix, or simply covariance matrix of X:

Σ = Cov[X]
⎛ ⎞
Var [ X1 ] Cov[ X1 , X2 ] ··· Cov[ X1 , X N ]
⎜ Cov[ X2 , X1 ] Var [ X2 ] ··· Cov[ X2 , X N ] ⎟
⎜ ⎟
=⎜ .. .. .. .. ⎟, (2.43)
⎝ . . . . ⎠
Cov[ X N , X1 ] Cov[ X N , X2 ] ··· Var [ X N ]

so that the elements of Σ are the σjk ’s. Compare this arrangement to that of the
sample covariance matrix (1.17). If X is a row vector, and μ = E [X], a convenient
expression for its covariance is
 
Cov[X] = E[(X − μ)′(X − μ)].    (2.44)

Similarly, if X is a column vector, Cov[X] = E[(X − μ)(X − μ)′].


Now suppose X is a matrix as in (2.42). Notice that individual components have
double subscripts: Xij . We need to decide how to order the elements in order to
describe its covariance matrix. We will use the convention that the elements are
strung out by row, so that row(X) is the 1 × N vector, N = np, given by

row(X) =( X11 , X12 , · · · , X1p ,


X21 , X22 , · · · , X2p ,
···
Xn1 , Xn2 , · · · , Xnp ). (2.45)

Then Cov[X] is defined to be Cov[row(X)], which is an (np) × (np) matrix.


One more covariance: The covariance between two vectors is defined to be the
matrix containing all the individual covariances of one variable from each vector.

That is, if X is 1 × p and Y is 1 × q , then the p × q matrix of covariances is


⎛ ⎞
Cov[ X1 , Y1 ] Cov[ X1 , Y2 ] ··· Cov[ X1 , Yq ]
⎜ Cov[ X2 , Y1 ] Cov[ X2 , Y2 ] ··· Cov[ X2 , Yq ] ⎟
⎜ ⎟
Cov[X, Y] = ⎜ .. .. .. .. ⎟. (2.46)
⎝ . . . . ⎠
Cov[ X p , Y1 ] Cov[ X p , Y2 ] ··· Cov[ X p , Yq ]

2.3.2 Moment generating functions


The moment generating function (mgf for short) of X is a function from R N → [0, ∞ ]
given by
   
MX(t) = MX(t1, . . . , tN) = E[e^{t1X1+···+tNXN}] = E[e^{Xt′}]    (2.47)

for t = (t1 , . . . , t N ). It is very useful in distribution theory, especially convolutions


(sums of independent random variables), asymptotics, and for generating moments.
The main use we have is that the mgf determines the distribution:

Theorem 2.1 (Uniqueness of MGF). If for some ε > 0,

MX(t) < ∞ and MX(t) = MY(t) for all t such that ‖t‖ < ε,    (2.48)

then X and Y have the same distribution.

See Ash [1970] for an approach to proving this result. The mgf does not always
exist, that is, often the integral or sum defining the expected value diverges. That
is ok, as long as it is finite for t in a neighborhood of 0. If one knows complex
variables, the characteristic function is handy because it always exists. It is defined
as φX(t) = E[exp(iXt′)].
If a distribution’s mgf is finite when ‖t‖ < ε for some ε > 0, then all of its moments
are finite, and can be calculated via differentiation:

E[X1^{k1} · · · XN^{kN}] = ∂^K MX(t) / (∂t1^{k1} · · · ∂tN^{kN}) |_{t=0},    (2.49)

where the ki are nonnegative integers, and K = k1 + · · · + kN. See Exercise 2.7.21.
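For instance, the mgf of a N(0, 1) is exp(t²/2) (a standard fact, not derived here), and differentiating it twice at t = 0 recovers E[X²] = 1. A quick symbolic check in R:
m <- expression(exp(t^2/2)) # mgf of a N(0,1)
d2 <- D(D(m, "t"), "t") # second derivative with respect to t
eval(d2, list(t=0)) # equals E[X^2] = 1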

2.4 Independence
Two sets of random variables are independent if the values of one set do not affect
the values of the other. More precisely, suppose the collection is (X, Y) as in (2.8),
with space W . Let X and Y be the marginal spaces (2.16) of X and Y, respectively.
First, we need the following:

Definition 2.2. If A ⊂ R^K and B ⊂ R^L, then A × B is a rectangle, the subset of R^(K+L)
given by

A × B = {(y, z) ∈ R^(K+L) | y ∈ A and z ∈ B}.    (2.50)

Now for the main definition.



Definition 2.3. Given the setup above, the collections X and Y are independent if W =
X × Y , and for every A ⊂ X and B ⊂ Y ,

P [(X, Y) ∈ A × B ] = P [X ∈ A] P [Y ∈ B ]. (2.51)

In the definition, the left-hand side uses the joint probability distribution for (X, Y),
and the right-hand side uses the marginal probabilities for X and Y, respectively.
If the joint collection (X, Y) has density f , then X and Y are independent if and
only if W = X × Y , and

f (x, y) = f X (x) f Y (y) for all x ∈ X and y ∈ Y , (2.52)

where f X and f Y are the marginal densities (2.17) of X and Y, respectively. (Techni-
cally, (2.52) only has to hold with probability one. Also, except for sets of probability
zero, the requirements (2.51) or (2.52) imply that W = X × Y , so that the requirement
we place on the spaces is redundant. But we keep it for emphasis.)
A useful result is that X and Y are independent if and only if

E [ g(X)h(Y)] = E [ g(X)] E [ h(Y)] (2.53)

for all functions g : X → R and h : Y → R with finite expectation.


The last expression can be used to show that independent variables have covari-
ance equal to 0. If X and Y are independent random variables with finite expectations,
then

Cov[ X, Y ] = E [( X − E [ X ])(Y − E [Y ])]


= E [( X − E [ X ])] E [(Y − E [Y ])]
= 0. (2.54)

The second equality uses (2.53), and the final equality uses that E [ X − E [ X ]] = E [ X ] −
E [ X ] = 0. Be aware that the reverse is not true, that is, variables can have 0 covariance
but still not be independent.
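For a standard example: if X ∼ N(0, 1) and Y = X², then Cov[X, Y] = E[X³] = 0, even though Y is completely determined by X. A quick simulation in R:
n <- 10^6
x <- rnorm(n)
cov(x, x^2) # near 0, yet x^2 is a function of x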
If the collections X and Y are independent, then Cov[ Xk , Yl ] = 0 for all k, l, so that

Cov[(X, Y)] = ⎛ Cov[X]      0     ⎞ ,    (2.55)
              ⎝    0      Cov[Y]  ⎠

at least if the covariances exist. (Throughout this book, “0” represents a matrix of
zeroes, its dimension implied by the context.)
Collections Y and X are independent if and only if the conditional distribution of
Y given X = x does not depend on x. If (X, Y) has a pdf or pmf, this property is easy
to see. If X and Y are independent, then Yx = Y since W = X × Y , and by (2.23) and
(2.52),
fY|X(y | x) = f(x, y)/fX(x) = fY(y) fX(x)/fX(x) = fY(y),    (2.56)
so that the conditional distribution does not depend on x. On the other hand, if
the conditional distribution does not depend on x, then the conditional space and
pdf cannot depend on x, in which case they are the marginal space and pdf, so that
W = X × Y and
f(x, y)/fX(x) = fY(y) =⇒ f(x, y) = fX(x) fY(y).    (2.57)

So far, we have treated independence of just two sets of variables. Everything


can be easily extended to any finite number of sets. That is, suppose X1 , . . . , XS are
collections of random variables, with Ns and X s being the dimension and space for
Xs , and X = (X1 , . . . , XS ), with dimension N = N1 + · · · + NS and space X .
Definition 2.4. Given the setup above, the collections X1 , . . . , XS are mutually independent
if X = X1 × · · · × X S , and for every set of subsets As ⊂ X s ,

P [(X1 , . . . , XS ) ∈ A1 × · · · × AS ] = P [X1 ∈ A1 ] · · · P [XS ∈ AS ]. (2.58)

In particular, X1, . . . , XS being mutually independent implies that every pair Xi, Xj
(i ≠ j) is independent. The reverse need not be true, however, that is, each pair could
be independent without having all mutually independent. Analogs of the equiva-
lences in (2.52) to (2.53) hold for this case, too. E.g., X1 , . . . , XS are mutually indepen-
dent if and only if

E [ g1 (X1 ) · · · gS (XS )] = E [ g1 (X1 )] · · · E [ gS (XS )] (2.59)

for all functions gs : X s → R, s = 1, . . . , S, with finite expectation.


A common situation is that the individual random variables Xi ’s in X are mutually
independent. Then, e.g., if there are densities,

f ( x1 , . . . , x N ) = f 1 ( x1 ) · · · f N ( x N ), (2.60)

where f j is the density of X j . Also, if the variances exist, the covariance matrix is
diagonal:
⎛ ⎞
Var [ X1 ] 0 ··· 0
⎜ 0 Var [ X2 ] · · · 0 ⎟
⎜ ⎟
Cov[X] = ⎜ .. .. . .. ⎟. (2.61)
⎝ . . . . . ⎠
0 ··· 0 Var [ X N ]

2.5 Additional properties of conditional distributions


The properties that follow are straightforward to prove in the discrete case. They still
hold for the continuous and more general cases, but are not always easy to prove.
See Exercises 2.7.6 to 2.7.15.

Plug-in formula
Suppose the collection of random variables is given by (X, Y), and we are interested
in the conditional distribution of the function g(X, Y) given X = x. Then

g(X, Y) | X = x = D g(x, Y) | X = x. (2.62)

That is, the conditional distribution of g(X, Y) given X = x is the same as that of
g(x, Y) given X = x. (The “= D ” means “equal in distribution.”) Furthermore, if Y
and X are independent, we can take off the conditional part at the end of (2.62):

X and Y independent =⇒ g(X, Y) | X = x = D g(x, Y). (2.63)



This property may at first seem so obvious to be meaningless, but it can be very
useful. For example, suppose X and Y are independent N (0, 1)’s, and g( X, Y ) =
X + Y, so we wish to find X + Y | X = x. The official way is to let W = X + Y, and
Z = X, and use the transformation of variables to find the space and pdf of (W, Z ).
One can then figure out Wz , and use the formula (2.23). Instead, using the plug-in
formula with independence (2.63), we have that

X + Y | X = x = D x + Y, (2.64)

which we immediately realize is N ( x, 1).
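A rough simulation sketch of (2.64) in R, approximating the conditioning by keeping only pairs whose X falls within a narrow window around x = 1:
n <- 10^6
x <- rnorm(n); y <- rnorm(n)
w <- (x + y)[abs(x - 1) < 0.01] # approximately conditioning on X = 1
mean(w); var(w) # both close to 1, as for a N(1, 1)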

Conditional independence
Given a set of three collections, (X, Y, Z), X and Y are said to be conditionally independent
given Z = z if

P [(X, Y) ∈ A × B | Z = z] = P [X ∈ A | Z = z] P [Y ∈ B | Z = z], (2.65)

for sets A ⊂ Xz and B ⊂ Yz as in (2.51). If further X is independent of Z, then X is


independent of the combined (Y, Z).

Dependence on x only through a function


If the conditional distribution of Y given X = x depends on x only through the func-
tion h(x), then that conditional distribution is the same as the conditional distribution
given h(X) = h(x). Symbolically, if v = h(x),

Y | X = x = D Y | h(X) = v. (2.66)

As an illustration, suppose ( X, Y ) is uniformly distributed over the unit disk, so


that the pdf is f(x, y) = 1/π for x² + y² < 1. Then it can be shown that

Y | X = x ∼ Uniform(−√(1 − x²), √(1 − x²)).    (2.67)

Note that the distribution depends on x only through h(x) = x², so that, e.g., conditioning
on X = 1/2 is the same as conditioning on X = −1/2. The statement (2.66) then yields

Y | X² = v ∼ Uniform(−√(1 − v), √(1 − v)).    (2.68)
That is, we have managed to turn a statement about conditioning on X to one about
conditioning on X2 .

Variance decomposition
The formula (2.34) shows that the expected value of g(Y) is the expected value of the
conditional expected value, e g (X). A similar formula holds for the variance, but it is
not simply that the variance is the expected value of the conditional variance. Using
the well-known identity Var [ Z ] = E [ Z2 ] − E [ Z ]2 on Z = g(Y), as well as (2.34) on
g(Y) and g(Y)2 , we have

Var[g(Y)] = E[g(Y)²] − E[g(Y)]²
          = E[e_{g²}(X)] − E[eg(X)]².    (2.69)

The identity holds conditionally as well, i.e.,


vg(x) = Var[g(Y) | X = x] = E[g(Y)² | X = x] − E[g(Y) | X = x]²
      = e_{g²}(x) − eg(x)².    (2.70)
Taking expected value over X in (2.70), we have
E [ v g (X)] = E [ e g2 (X)] − E [ e g (X)2 ]. (2.71)
Comparing (2.69) and (2.71), we see the difference lies in where the square is in the
second terms. Thus
Var [ g(Y)] = E [ v g (X)] + E [ e g (X)2 ] − E [ e g (X)]2
= E [ v g (X)] + Var [ e g (X)], (2.72)
now using the identity on e g (X). Thus the variance of g(Y) equals the variance of the
conditional expected value plus the expected value of the conditional variance.
For a collection Y of random variables,
eY (x) = E [Y | X = x] and vY (x) = Cov[Y | X = x], (2.73)
(2.72) extends to
Cov[Y] = E [ vY (X)] + Cov[ eY (X)]. (2.74)
See Exercise 2.7.12.
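A quick simulation sketch of (2.72), taking X ∼ N(0, 1) and Y | X = x ∼ N(x, 1), so that eg(X) = X, vg(X) = 1, and Var[g(Y)] = E[1] + Var[X] = 2 (here g is the identity):
n <- 10^6
x <- rnorm(n)
y <- rnorm(n, mean=x, sd=1)
var(y) # close to 2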

Bayes theorem
Bayes formula reverses conditional distributions, that is, it takes the conditional distri-
bution of Y given X, and the marginal of X, and returns the conditional distribution of
X given Y. Bayesian inference is based on this formula, starting with the distribution
of the data given the parameters, and a marginal (“prior”) distribution of the pa-
rameters, and producing the conditional distribution (“posterior”) of the parameters
given the data. Inferences are then based on this posterior, which is the distribution
one desires because the data are observed while the parameters are not.
Theorem 2.2 (Bayes). In the setup of (2.8), suppose that the conditional density of Y given
X = x is f Y|X (y | x), and the marginal density of X is f X (x). Then for (x, y) ∈ W , the
conditional density of X given Y = y is
fX|Y(x | y) = fY|X(y | x) fX(x) / ∫_{X_y} fY|X(y | z) fX(z)dz.    (2.75)
Proof. From (2.23) and (2.25),

fX|Y(x | y) = f(x, y)/fY(y)
            = fY|X(y | x) fX(x)/fY(y).    (2.76)
By (2.26), using z for x, to avoid confusion with the x in (2.76),

fY(y) = ∫_{X_y} fY|X(y | z) fX(z)dz,    (2.77)
which, substituted in the denominator of (2.76), shows (2.75).

2.6 Affine transformations


In Section 1.5, linear combinations of the data were used heavily. Here we consider
the distributional analogs of linear functions, or their extensions, affine transforma-
tions. For a single random variable X, an affine transformation is a + bX for constants
a and b. Equation (2.82) is an example of an affine transformation with two random
variables.
More generally, an affine transformation of a collection of N random variables X
is a collection of M random variables Y where

Yj = a j + b j1 X1 + · · · + b jN X N , j = 1, . . . , M, (2.78)

the a j ’s and b jk ’s being constants. Note that marginals are examples of affine trans-
formations: the a j ’s are 0, and most of the b jk ’s are 0, and a few are 1. Depending on
how the elements of X and Y are arranged, affine transformations can be written as a
matrix equation. For example, if X and Y are row vectors, and B is M × N, then

Y = a + XB′,    (2.79)

where B is the matrix of b jk ’s, and a = ( a1 , . . . , a M ). If X and Y are column vectors,


then the equation is Y = a + BX. For an example using matrices, suppose X is n × p,
C is m × n, D is q × p, and A is m × q, and

Y = A + CXD′.    (2.80)

Then Y is an m × q matrix, each of whose elements is some affine transformation of


the elements of X. The relationship between the b jk ’s and the elements of C and D is
somewhat complicated but could be made explicit, if desired. Look ahead to (3.32d),
if interested.
Expectations are linear, that is, for any random variables ( X, Y ), and constant c,

E [ cX ] = cE [ X ] and E [ X + Y ] = E [ X ] + E [Y ], (2.81)

which can be seen from (2.28) and (2.29) by the linearity of integrals and sums. Con-
sidering any constant a as a (nonrandom) random variable, with E [ a] = a, (2.81) can
be used to show, e.g.,

E [ a + bX + cY ] = a + bE [ X ] + cE [Y ]. (2.82)

The mean of an affine transformation is the affine transformation of the mean.


This property follows from (2.81) as in (2.82), i.e., for (2.78),

E [Yj ] = a j + b j1 E [ X1 ] + · · · + b jN E [ X N ], j = 1, . . . , M. (2.83)

If the collections are arranged as vectors or matrices, then so are the means, so that
for the row vector (2.79) and matrix (2.80) examples, one has, respectively,

E[Y] = a + E[X]B′ and E[Y] = A + CE[X]D′.    (2.84)

The covariance matrix of Y can be obtained from that of X. It is a little more


involved than for the means, but not too bad, at least in the vector case. Suppose X
and Y are row vectors, and (2.79) holds. Then from (2.44),

Cov[Y] = E[(Y − E[Y])′(Y − E[Y])]
       = E[(a + XB′ − (a + E[X]B′))′(a + XB′ − (a + E[X]B′))]
       = E[(XB′ − E[X]B′)′(XB′ − E[X]B′)]
       = E[B(X − E[X])′(X − E[X])B′]
       = B E[(X − E[X])′(X − E[X])] B′    by second part of (2.84)
       = B Cov[X]B′.    (2.85)

Compare this formula to the sample version in (1.27). Though modest looking, the
formula Cov[XB′] = BCov[X]B′ is extremely useful. It is often called a “sandwich”
formula, with the B as the slices of bread. The formula for column vectors is the
same. Compare this result to the familiar one from univariate analysis: Var[a + bX] =
b²Var[X].
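The sandwich formula is also easy to check by simulation in R; the covariance matrix and the matrix B below are arbitrary choices for illustration:
n <- 10^5
Sig <- matrix(c(2,1,0, 1,2,1, 0,1,2), 3, 3) # an arbitrary covariance matrix
X <- matrix(rnorm(n*3), n, 3) %*% chol(Sig) # rows are 1 x 3 vectors with Cov = Sig
B <- matrix(c(1,0,-1, 2,1,0), 2, 3, byrow=TRUE) # an arbitrary 2 x 3 matrix B
var(X %*% t(B)) # sample version of Cov[XB']
B %*% var(X) %*% t(B) # B Cov[X] B'; the two matrices nearly agree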
For matrices, we again will wait. (We are waiting for Kronecker products as in
Definition 3.5, in case you are wondering.)

2.7 Exercises
Exercise 2.7.1. Consider the pair of random variables ( X, Y ), where X is discrete and
Y is continuous. Their space is

W = {( x, y) | x ∈ {1, 2, 3} & 0 < y < x }, (2.86)

and their mixed-type density is

f(x, y) = (x + y)/21.    (2.87)

Let A = {( x, y) ∈ W | y ≤ x/2}. (It is a good idea to sketch W and A.) (a) Find X A
(b) Find Y xA for each x ∈ X A . (c) Find P [ A]. (d) Find the marginal density and space
of X. (e) Find the marginal space of Y. (f) Find the conditional space of X given Y,
X y , for each y. (Do it separately for y ∈ (0, 1), y ∈ [1, 2) and y ∈ [2, 3).) (g) Find the
marginal density of Y.

Exercise 2.7.2. Given the setup in (2.8) through (2.10), show that for A ∈ W ,

A = {(x, y) | x ∈ X A and y ∈ YxA } = {(x, y) | y ∈ Y A and x ∈ XyA }. (2.88)

Exercise 2.7.3. Verify (2.17), that is, given B ⊂ X , show that




P[X ∈ B] = ∫_B [ ∫_{Y_x} f(x, y)dy ] dx.    (2.89)

[Hint: Show that for A = {(x, y) | x ∈ B and y ∈ Yx}, x ∈ B if and only if (x, y) ∈ A, so
that P[X ∈ B] = P[(X, Y) ∈ A]. Then note that the latter probability is ∫_A f(x, y)dxdy,
which with some interchanging equals the right-hand side of (2.89).]

Exercise 2.7.4. Show that X and Y are independent if and only if E [ g(X)h(Y)] =
E [ g(X)] E [ h(Y)] as in (2.53) for all g and h with finite expectations. You can assume
densities exist, i.e., (2.52). [Hint: To show independence implies (2.53), write out the
sums/integrals. For the other direction, consider indicator functions for g and h as in
(2.37).]
Exercise 2.7.5. Prove (2.31), E [ g(Y)] = E [ g(h(X))] for Y = h(X), in the discrete case.
[Hint: Start by writing

fY(y) = P[Y = y] = P[h(X) = y] = ∑_{x∈X_y} fX(x),    (2.90)

where Xy = {x ∈ X | h(x) = y}. Then

E[g(Y)] = ∑_{y∈Y} g(y) ∑_{x∈X_y} fX(x) = ∑_{y∈Y} ∑_{x∈X_y} g(y) fX(x).    (2.91)

In the inner summation in the final expression, h(x) is always equal to y. (Why?) Sub-
stitute h(x) for y in the g, then. Now the summand is free of y. Argue that the dou-
ble summation is the same as summing over x ∈ X , yielding ∑x∈X g(h( x )) f X (x) =
E [ g(h(X))].]

Exercise 2.7.6. (a) Prove the plugin formula (2.62) in the discrete case. [Hint: For z
in the range of g, write P [ g(X, Y) = z | X = x] = P [ g(X, Y) = z and X = x] /P [X = x],
then note that in the numerator, the X can be replaced by x.] (b) Prove (2.63). [Hint:
Follow the proof in part (a), then note the two events g(x, Y) = z and X = x are
independent.]

Exercise 2.7.7. Suppose (X, Y, Z) has a discrete distribution, X and Y are condi-
tionally independent given Z (as in (2.65)), and X and Z are independent. Show
that (X, Y) is independent of Z. [Hint: Use the total probability formula (2.36) on
P [X ∈ A and (Y, Z) ∈ B ], conditioning on Z. Then argue that the summand can be
written

P [X ∈ A and (Y, Z) ∈ B | Z = z] = P [X ∈ A and (Y, z) ∈ B | Z = z]


= P [X ∈ A | Z = z] P [(Y, z) ∈ B | Z = z]. (2.92)

Use the independence of X and Z on the first probability in the final expression, and
bring it out of the summation.]
Exercise 2.7.8. Prove (2.67). [Hint: Find Y x and the marginal f X ( x ).]
Exercise 2.7.9. Suppose Y = (Y1 , Y2 , Y3 , Y4 ) is multinomial with parameters n and
p = ( p1 , p2 , p3 , p4 ). Thus n is a positive integer, the pi ’s are positive and sum to 1,
and the Yi’s are nonnegative integers that sum to n. The pmf is

f(y) = ( n choose y1, y2, y3, y4 ) p1^{y1} · · · p4^{y4},    (2.93)

where ( n choose y1, y2, y3, y4 ) = n!/(y1! · · · y4!). Consider the conditional distribution of (Y1, Y2)
given (Y3 , Y4 ) = (c, d). (a) What is the conditional space of (Y1 , Y2 ) given (Y3 , Y4 ) =
(c, d)? Give Y2 as a function of Y1 , c, and d. What is the conditional range of Y1 ? (b)
Write the conditional pmf of (Y1 , Y2 ) given (Y3 , Y4 ) = (c, d), and simplify noting that
  
( n choose y1, y2, c, d ) = ( n choose n−c−d, c, d ) ( n−c−d choose y1, y2 )    (2.94)
What is the conditional distribution of Y1 | (Y3 , Y4 ) = (c, d)? (c) What is the condi-
tional distribution of Y1 given Y3 + Y4 = a?
Exercise 2.7.10. Prove (2.44). [Hint: Write out the elements of the matrix (X − μ)′(X − μ),
then use (2.42).]
Exercise 2.7.11. Suppose X, 1 × N, has finite covariance matrix. Show that Cov[X] =
E[X′X] − E[X]′E[X].
Exercise 2.7.12. (a) Prove the variance decomposition holds for the 1 × q vector Y, as
in (2.74). (b) Write Cov[Yi , Yj ] as a function of the conditional quantities Cov[Yi , Yj | X =
x], E [Yi | X = x], and E [Yj | X = x].

Exercise 2.7.13. The beta-binomial(n, α, β) distribution is a mixture of binomial distributions.
That is, suppose Y given X = x is Binomial(n, x) (fY(y) = ( n choose y ) x^y (1 − x)^{n−y}
for y = 0, 1, . . . , n), and X is (marginally) Beta(α, β):

fX(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1}(1 − x)^{β−1}, x ∈ (0, 1),    (2.95)

where Γ is the gamma function,

Γ(α) = ∫_0^∞ u^{α−1} e^{−u} du, α > 0.    (2.96)

(a) Find the marginal pdf of Y. (b) The conditional mean and variance of Y are nx
and nx (1 − x ). (Right?) The unconditional mean and variance of X are α/(α + β)
and αβ/(α + β)2 (α + β + 1). What are the unconditional mean and variance of Y? (c)
Compare the variance of a Binomial(n, p) to that of a Beta-binomial(n, α, β), where
p = α/(α + β). (d) Find the joint density of ( X, Y ). (e) Find the pmf of the beta-
binomial. [Hint: Notice that the part of the joint density depending on x looks like a
Beta pdf, but without the constant. Thus integrating out x yields the reciprocal of the
constant.]

Exercise 2.7.14 (Bayesian inference). This question develops Bayesian inference for a
binomial. Suppose
Y | P = p ∼ Binomial(n, p) and P ∼ Beta(α0 , β0 ), (2.97)
that is, the probability of success P has a beta prior. (a) Show that the posterior
distribution is
P | Y = y ∼ Beta(α0 + y, β0 + n − y). (2.98)
The beta prior is called the conjugate prior for the binomial p, meaning the posterior
has the same form, but with updated parameters. [Hint: Exercise 2.7.13 (d) has the
joint density of ( P, Y ).] (b) Find the posterior mean, E [ P | Y = y]. Show that it can be
written as a weighted mean of the sample proportion p̂ = y/n and the prior mean p₀ = α₀/(α₀ + β₀).

Exercise 2.7.15. Do the mean and variance formulas (2.33) and (2.72) work if g is a
function of X and Y? [Hint: Consider the collection (X, W), where W = (X, Y).]
Exercise 2.7.16. Suppose h(y) is a histogram with K equal-sized bins. That is, we
have bins (bi−1, bi ], i = 1, . . . , K, where bi = b0 + d × i, d being the width of each bin.
Then
\[
h(y) = \begin{cases} p_i/d & \text{if } b_{i-1} < y \le b_i,\ i = 1, \ldots, K, \\ 0 & \text{if } y \notin (b_0, b_K], \end{cases} \tag{2.99}
\]
where the pi ’s are probabilities that sum to 1. Suppose Y is a random variable with
pdf h. For y ∈ (b0 , bK ], let I(y) be y’s bin, i.e., I(y) = i if bi−1 < y ≤ bi . (a) What is
the distribution of the random variable I(Y )? Find its mean and variance. (b) Find
the mean and variance of bI(Y ) = b0 + dI(Y ). (c) What is the conditional distribution
of Y given I(Y ) = i, for each i = 1, . . . , K? [It is uniform. Over what range?] Find the
conditional mean and variance. (d) Show that unconditionally,
\[
E[Y] = b_0 + d\big(E[I] - \tfrac{1}{2}\big) \quad\text{and}\quad Var[Y] = d^2\big(Var[I] + \tfrac{1}{12}\big). \tag{2.100}
\]
(e) Recall the entropy in (1.44). Note that for our pdf, h(Y) = p_{I(Y)}/d. Show that
\[
\mathrm{Entropy}(h) = -\sum_{i=1}^K p_i \log(p_i) + \log(d), \tag{2.101}
\]
and for the negentropy in (1.46),
\[
\mathrm{Negent}(h) = \frac{1}{2}\Big(1 + \log\big(2\pi\big(Var[I] + \tfrac{1}{12}\big)\big)\Big) + \sum_{i=1}^K p_i \log(p_i). \tag{2.102}
\]

Exercise 2.7.17. Suppose for random vector ( X, Y ), one observes X = x, and wishes
to guess the value of Y by h( x ), say, using the least squares criterion: Choose h to
minimize E [ q ( X, Y )], where q ( X, Y ) = (Y − h( X ))2 . This h is called the regression
function of Y on X. Assume all the relevant means and variances are finite. (a)
Write E [ q ( X, Y )] as the expected value of the conditional expected value conditioning
on X = x, eq ( x ). For fixed x, note that h( x ) is a scalar, hence one can minimize
eq ( x ) over h( x ) using differentiation. What h( x ) achieves the minimum conditional
expected value of q? (b) Show that the h found in part (a) minimizes the unconditional
expected value E [ q ( X, Y )]. (c) Find the value of E [ q ( X, Y )] for the minimizing h.
Exercise 2.7.18. Continue with Exercise 2.7.17, but this time restrict h to be a linear
function, h( x ) = α + βx. Thus we wish to find α and β to minimize E [(Y − α − βX )2 ].
The minimizing function is the linear regression function of Y on X. (a) Find the
α and β to minimize E [(Y − α − βX )2 ]. [You can differentiate that expected value
directly, without worrying about conditioning.] (b) Find the value of E [(Y − α − βX )2 ]
for the minimizing α and β.

Exercise 2.7.19. Suppose Y is 1 × q and X is 1 × p, E [X] = 0, Cov[X] = I p , E [Y | X =


x] = μ + xβ for some p × q matrix β, and Cov[Y | X = x] = Ψ for some q × q diagonal
matrix Ψ. Thus the Yi ’s are conditionally uncorrelated given X = x. Find the uncon-
ditional E [Y] and Cov[Y]. The covariance matrix of Y has a factor-analytic structure,
which we will see in Section 10.3. The Xi ’s are factors that explain the correlations
among the Yi ’s. Typically, the factors are not observed.

Exercise 2.7.20. Suppose Y1 , . . . , Yq are independent 1 × p vectors, where Yi has


moment generating function Mi(t), i = 1, . . . , q, all of which are finite for ‖t‖ < ε for some ε > 0. Show that the moment generating function of Y1 + · · · + Yq is
M1 (t) · · · Mq (t). For which t is this moment generating function finite?
Exercise 2.7.21. Prove (2.49). It is legitimate to interchange the derivatives and ex-
pectation, and to set t = 0 within the expectation, when ‖t‖ < ε. [Extra credit: Prove
that those operations are legitimate.]
Exercise 2.7.22. The cumulant generating function of X is defined to be cX (t) =
log( MX (t)), and, if the function is finite for t in a neighborhood of zero, then the
(k1 , . . . , k N )th mixed cumulant is the corresponding mixed derivative of cX (t) evalu-
ated at zero. (a) For N = 1, find the first four cumulants, κ1, . . . , κ4, where
\[
\kappa_i = \frac{\partial^i}{\partial t^i}\, c_X(t)\Big|_{t=0}. \tag{2.103}
\]
Show that κ3/κ2^{3/2} is the population analog of skewness (1.42), and κ4/κ2² is the population analog of kurtosis (1.43), i.e.,
\[
\frac{\kappa_3}{\kappa_2^{3/2}} = \frac{E[(X - \mu)^3]}{\sigma^3} \quad\text{and}\quad \frac{\kappa_4}{\kappa_2^2} = \frac{E[(X - \mu)^4]}{\sigma^4} - 3, \tag{2.104}
\]
where μ = E[X] and σ² = Var[X]. [Write everything in terms of E[X^k]’s by expanding the E[(X − μ)^k]’s.] (b) For general N, find the second mixed cumulants, i.e.,
\[
\frac{\partial^2}{\partial t_i\, \partial t_j}\, c_{X}(t)\Big|_{t = 0}, \quad i \ne j. \tag{2.105}
\]

Exercise 2.7.23. A study was conducted on people near Newcastle on Tyne in 1972-
74 [Appleton et al., 1996], and followed up twenty years later. We will focus on 1314
women in the study. The three variables we will consider are Z: age group (three
values); X: whether they smoked or not (in 1974); and Y: whether they were still
alive in 1994. Here are the frequencies:

Age group    Young (18–34)     Middle (35–64)     Old (65+)
Smoker?       Yes     No         Yes     No         Yes     No
Died            5      6          92     59          42    165          (2.106)
Lived         174    213         262    261           7     28

(a) Treating proportions in the table as probabilities, find

P [Y = Lived | X = Smoker] and P [Y = Lived | X = Non-smoker]. (2.107)

Who were more likely to live, smokers or non-smokers? (b) Find P [ X = Smoker | Z =
z] for z= Young, Middle, and Old. What do you notice? (c) Find

P [Y = Lived | X = Smoker & Z = z] (2.108)

and
P[Y = Lived | X = Non-smoker & Z = z] (2.109)

for z= Young, Middle, and Old. Adjusting for age group, who were more likely to
live, smokers or non-smokers? (d) Conditionally on age, the relationship between
smoking and living is negative for each age group. Is it true that marginally (not
conditioning on age), the relationship between smoking and living is negative? What
is the explanation? (Simpson’s Paradox.)
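For readers following along in R (the computing environment the text uses), here is a small sketch — not part of the original exercise — that enters the frequencies of (2.106) and computes the proportions asked for in parts (a) and (c). The object names are my own.

  # Frequencies from (2.106); dimensions are status x smoker x age group.
  counts <- array(c(5, 174, 6, 213,     # Young: smoker (Died, Lived), non-smoker (Died, Lived)
                    92, 262, 59, 261,   # Middle
                    42, 7, 165, 28),    # Old
    dim = c(2, 2, 3),
    dimnames = list(status = c("Died", "Lived"),
                    smoker = c("Yes", "No"),
                    age = c("Young", "Middle", "Old")))
  # (a) Marginal survival proportions for smokers and non-smokers:
  marg <- apply(counts, c(1, 2), sum)
  marg["Lived", ] / colSums(marg)
  # (c) Survival proportions within each age group:
  counts["Lived", , ] / apply(counts, c(2, 3), sum)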

Exercise 2.7.24. Suppose in a large population, the proportion of people who are
infected with the HIV virus is 1/100,000. People can take a blood test to see
whether they have the virus. The test is 99% accurate: The chance the test is positive
given the person has the virus is 99%, and the chance the test is negative given the
person does not have the virus is also 99%. Suppose a randomly chosen person takes
the test. (a) What is the chance that this person does have the virus given that the test
is positive? Is this close to 99%? (b) What is the chance that this person does have the
virus given that the test is negative? Is this close to 1%? (c) Do the probabilities in (a)
and (b) sum to 1?
Exercise 2.7.25. Suppose Z1, Z2, Z3 are iid with P[Zi = −1] = P[Zi = +1] = 1/2. Let
X1 = Z1 Z2 , X2 = Z1 Z3 , X3 = Z2 Z3 . (2.110)
(a) Find the conditional distribution of ( X1 , X2 ) | Z1 = +1. Are X1 and X2 con-
ditionally independent given Z1 = +1? (b) Find the conditional distribution of
( X1 , X2 ) | Z1 = −1. Are X1 and X2 conditionally independent given Z1 = −1? (c) Is
( X1 , X2 ) independent of Z1 ? Are X1 and X2 independent (unconditionally)? (d) Are
X1 and X3 independent? Are X2 and X3 independent? Are X1 , X2 and X3 mutually
independent? (e) What is the space of ( X1 , X2 , X3 )? (f) What is the distribution of
X1 X2 X3 ?
Exercise 2.7.26. Yes/no questions: (a) Suppose X1 and X2 are independent, X1 and
X3 are independent, and X2 and X3 are independent. Are X1 , X2 and X3 mutually
independent? (b) Suppose X1 , X2 and X3 are mutually independent. Are X1 and X2
conditionally independent given X3 = x3 ?
Exercise 2.7.27. (a) Let U ∼Uniform(0, 1), so that it has space (0, 1) and pdf f U (u ) =
1. Find its distribution function (2.1), FU (u ). (b) Suppose X is a random variable with
space ( a, b ) and pdf f X ( x ), where f X ( x ) > 0 for x ∈ ( a, b ). [Either or both of a and
b may be infinite.] Thus the inverse function FX−1 (u ) exists for u ∈ (0, 1). (Why?)
Show that the distribution of Y = FX ( X ) is Uniform(0, 1). [Hint: For y ∈ (0, 1), write
P [Y ≤ y] = P [ FX ( X ) ≤ y] = P [ X ≤ FX−1 (y)], then use the definition of FX .] (c)
Suppose U ∼ Uniform(0, 1). For the X in part (b), show that FX−1 (U ) has the same
distribution as X. [Note: This fact provides a way of generating random variables X
from random uniforms.]
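Part (c) is the basis of the inverse-CDF method for generating random variables. The following R lines are my own sketch, using the Exponential(1) distribution (F_X(x) = 1 − e^{−x}) as the illustration:

  set.seed(1)
  u <- runif(10000)                 # U ~ Uniform(0, 1)
  x <- -log(1 - u)                  # F_X^{-1}(u) for the Exponential(1)
  c(mean(x), var(x))                # both should be near 1
  ks.test(x, "pexp")                # compare the sample to the exponential cdf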
Exercise 2.7.28. Suppose Y is n × 2 with covariance matrix

\[
\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}. \tag{2.111}
\]
Let W = YB′, for
\[
B = \begin{pmatrix} 1 & 1 \\ 1 & c \end{pmatrix} \tag{2.112}
\]
for some c. Find c so that the covariance between the two variables in W is zero.
What are the variances of the resulting two variables?

Exercise 2.7.29. Let Y be a 1 × 4 vector with

Yj = μ j + B + E j ,

where the μ j are constants, B has mean zero and variance σB2 , the E j ’s are independent,
each with mean zero and variance σE2 , and B is independent of the E j ’s. (a) Find the
mean and covariance matrix of
 
X ≡ ( B  E1  E2  E3  E4 ). (2.113)

(b) Write Y as an affine transformation of X. (c) Find the mean and covariance matrix
of Y. (d) Cov[Y] can be written as

Cov[Y] = a I₄ + b 1₄1₄′. (2.114)

Give a and b in terms of σ_B² and σ_E². (e) What are the mean and covariance matrix of Ȳ = (Y1 + · · · + Y4)/4?
Exercise 2.7.30. Suppose Y is a 5 × 4 data matrix, and

Yij = μ + Bi + γ + Eij for j = 1, 2, (2.115)

Yij = μ + Bi − γ + Eij for j = 3, 4, (2.116)

where the Bi’s are independent, each with mean zero and variance σ_B², the Eij’s are independent, each with mean zero and variance σ_E², and the Bi’s are independent of the Eij’s. (Thus each row of Y is distributed as the vector in Exercise 2.7.29, for some
particular values of μ j ’s.) [Note: This model is an example of a randomized block
model, where the rows of Y represent the blocks. For example, a farm might be
broken into 5 blocks, and each block split into four plots, where two of the plots
(Yi1 , Yi2 ) get one fertilizer, and two of the plots (Yi3 , Yi4 ) get another fertilizer.] (a)
E [Y] = xβz . Give x, β, and z . [The β contains the parameters μ and γ. The x and
z contain known constants.] (b) Are the rows of Y independent? (c) Find Cov[Y].
(d) Setting which parameter equal to zero guarantees that all elements of Y have the
same mean? (e) Setting which parameter equal to zero guarantees that all elements
of Y are uncorrelated?
Chapter 3

The Multivariate Normal Distribution

3.1 Definition
There are not very many commonly used multivariate distributions to model a data
matrix Y. The multivariate normal is by far the most common, at least for contin-
uous data. Which is not to say that all data are distributed normally, nor that all
techniques assume such. Rather, typically one either assumes normality, or makes
few assumptions at all and relies on asymptotic results.
The multivariate normal arises from the standard normal:
Definition 3.1. The random variable Z is standard normal, written Z ∼ N (0, 1), if it has
space R and pdf
\[
\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}. \tag{3.1}
\]

It is not hard to show that if Z ∼ N (0, 1),
\[
E[Z] = 0, \qquad Var[Z] = 1, \qquad\text{and}\qquad M_Z(t) = e^{\frac{1}{2}t^2}. \tag{3.2}
\]
Definition 3.2. The collection of random variables Z = ( Z1 , . . . , Z M ) is a standard normal
collection if the Zi ’s are mutually independent standard normal random variables.
Because the variables in a standard normal collection are independent, by (3.2),
(2.61) and (2.59),

\[
E[Z] = 0, \qquad Cov[Z] = I_M, \qquad\text{and}\qquad M_{Z}(t) = e^{\frac{1}{2}(t_1^2+\cdots+t_M^2)} = e^{\frac{1}{2}\|t\|^2}. \tag{3.3}
\]
The mgf is finite for all t.
A general multivariate normal distribution can have any (legitimate) mean and
covariance, achieved through the use of affine transformations. Here is the definition.
Definition 3.3. The collection X is multivariate normal if it is an affine transformation of
a standard normal collection.
The mean and covariance of a multivariate normal can be calculated from the
coefficients in the affine transformation. In particular, suppose Z is a standard normal
collection represented as an 1 × M row vector, and Y is a 1 × q row vector
Y = μ + ZB′, (3.4)


where B is q × M and μ is 1 × q. From (3.3), (2.84) and (2.85),

μ = E[Y] and Σ = Cov[Y] = BB′. (3.5)

The mgf is calculated, for 1 × q vector s, as

\[
\begin{aligned}
M_Y(s) &= E[\exp(Ys')] \\
&= E[\exp((\mu + ZB')s')] \\
&= \exp(\mu s')\, E[\exp(Z(sB)')] \\
&= \exp(\mu s')\, M_Z(sB) \\
&= \exp(\mu s')\exp(\tfrac{1}{2}\|sB\|^2) \qquad\text{by (3.3)} \\
&= \exp(\mu s' + \tfrac{1}{2}\, sBB's') \\
&= \exp(\mu s' + \tfrac{1}{2}\, s\Sigma s').
\end{aligned}
\tag{3.6}
\]
2

The mgf depends on B through only Σ = BB′. Because the mgf determines the distribution (Theorem 2.1), two different B’s can produce the same distribution. That is, as long as BB′ = CC′, the distributions of μ + ZB′ and μ + ZC′ are the same. Which
is to say that the distribution of the multivariate normal depends on only the mean
and covariance. Thus it is legitimate to write

Y ∼ Nq (μ, Σ), (3.7)

which is read “Y has q-dimensional multivariate normal distribution with mean μ


and covariance Σ.”
For example, consider the two matrices
\[
B = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 3 & 4 \end{pmatrix}
\quad\text{and}\quad
C = \begin{pmatrix} \sqrt{2} & 2 \\ 0 & 5 \end{pmatrix}. \tag{3.8}
\]
It is not hard to show that
\[
BB' = CC' = \begin{pmatrix} 6 & 10 \\ 10 & 25 \end{pmatrix} \equiv \Sigma. \tag{3.9}
\]
Thus if the Zi’s are independent N(0, 1),
\[
(Z_1, Z_2, Z_3)B' = (Z_1 + 2Z_2 + Z_3,\ 3Z_2 + 4Z_3)
\ \overset{D}{=}\ (\sqrt{2}\,Z_1 + 2Z_2,\ 5Z_2)
= (Z_1, Z_2)C', \tag{3.10}
\]

i.e., both vectors are N (0, Σ). Note that the two expressions are based on differing
numbers of standard normals, not just different linear combinations.
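A quick numerical check of (3.9) and (3.10) in R (my own sketch; the simulation size is arbitrary):

  B <- matrix(c(1, 2, 1,
                0, 3, 4), nrow = 2, byrow = TRUE)
  C <- matrix(c(sqrt(2), 2,
                0,       5), nrow = 2, byrow = TRUE)
  B %*% t(B)                    # both equal matrix(c(6, 10, 10, 25), 2, 2)
  C %*% t(C)
  set.seed(1)
  Z3 <- matrix(rnorm(3 * 10000), ncol = 3)   # standard normal collections
  Z2 <- matrix(rnorm(2 * 10000), ncol = 2)
  var(Z3 %*% t(B))              # both sample covariances approximate Sigma
  var(Z2 %*% t(C))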
Which μ and Σ are legitimate parameters in (3.7)? Any μ ∈ Rq is. The covariance
matrix Σ can be BB′ for any q × M matrix B. Any such matrix B is considered a
square root of Σ. Clearly, Σ must be symmetric, but we already knew that. It must
also be nonnegative definite, which we define now.

Definition 3.4. A symmetric q × q matrix A is nonnegative definite if

bAb′ ≥ 0 for all 1 × q vectors b. (3.11)

Also, A is positive definite if

bAb′ > 0 for all 1 × q vectors b ≠ 0. (3.12)

Note that bBB′b′ = ‖bB‖² ≥ 0, which means that Σ must be nonnegative definite. But from (2.85),
bΣb′ = Cov[Yb′] = Var[Yb′] ≥ 0, (3.13)
because all variances are nonnegative. That is, any covariance matrix has to be non-
negative definite, not just multivariate normal ones.
So we know that Σ must be symmetric and nonnegative definite. Are there any
other restrictions, or for any symmetric nonnegative definite matrix is there a corre-
sponding B? In fact, there are potentially many square roots of Σ. These follow from
the spectral decomposition theorem, Theorem 1.1. Because Σ is symmetric, we can
write
Σ = ΓΛΓ′, (3.14)
where Γ is orthogonal, and Λ is diagonal with diagonal elements λ1 ≥ λ2 ≥ · · · ≥ λq. Because Σ is nonnegative definite, the eigenvalues are nonnegative (Exercise 3.7.12), hence they have square roots. Consider
B = ΓΛ^{1/2}, (3.15)
where Λ^{1/2} is the diagonal matrix with diagonal elements the λ_j^{1/2}’s. Then, indeed,
BB′ = ΓΛ^{1/2}Λ^{1/2}Γ′ = ΓΛΓ′ = Σ. (3.16)
That is, in (3.7), μ is unrestricted, and Σ can be any symmetric nonnegative definite matrix. Note that C = ΓΛ^{1/2}Ψ for any q × q orthogonal matrix Ψ is also a square root of Σ. If we take Ψ = Γ′, then we have the symmetric square root, ΓΛ^{1/2}Γ′.
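In R, the square root in (3.15) and the symmetric square root can be computed from eigen(); a sketch (mine) for the Σ of (3.9):

  Sigma <- matrix(c(6, 10, 10, 25), 2, 2)
  e <- eigen(Sigma)                        # spectral decomposition: Gamma, Lambda
  Gamma <- e$vectors
  Lambda.half <- diag(sqrt(e$values))
  B <- Gamma %*% Lambda.half               # the square root in (3.15)
  B %*% t(B)                               # recovers Sigma
  Sig.half <- Gamma %*% Lambda.half %*% t(Gamma)   # symmetric square root
  Sig.half %*% Sig.half                    # also recovers Sigma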
If q = 1, then we have a normal random variable, say Y, and Y ∼ N (μ, σ2 ) signifies
that it has mean μ and variance σ2 . If Y is a multivariate normal collection represented
as an n × q matrix, we write

Y ∼ Nn×q (μ, Σ) ⇔ row(Y) ∼ Nnq (row(μ), Σ). (3.17)

3.2 Some properties of the multivariate normal


Affine transformations of multivariate normals are also multivariate normal, because
any affine transformation of a multivariate normal collection is an affine transfor-
mation of an affine transformation of a standard normal collection, and an affine
transformation of an affine transformation is also an affine transformation. That is,
suppose Y ∼ Nq(μ, Σ), and W = c + YD′ for p × q matrix D and 1 × p vector c. Then we know that for some B with BB′ = Σ, Y = μ + ZB′, where Z is a standard normal vector. Hence
W = c + YD′ = c + (μ + ZB′)D′ = μD′ + c + Z(DB)′, (3.18)



and as in (3.4),

W ∼ Np(c + μD′, DBB′D′) = Np(c + μD′, DΣD′). (3.19)

Of course, the mean and covariance result we already knew from (2.84) and (2.85).
Because marginals are special cases of affine transformations, marginals of multi-
variate normals are also multivariate normal. One needs just to pick off the appro-
priate means and covariances. So if Y = (Y1 , . . . , Y5 ) is N5 (μ, Σ), and W = (Y2 , Y5 ),
then
\[
W \sim N_2\left((\mu_2, \mu_5),\ \begin{pmatrix} \sigma_{22} & \sigma_{25} \\ \sigma_{52} & \sigma_{55} \end{pmatrix}\right). \tag{3.20}
\]
In Section 2.4, we showed that independence of two random variables means that
their covariance is 0, but that a covariance of 0 does not imply independence. But,
with multivariate normals, it does. That is, if X is a multivariate normal collection,
and Cov[ X j , Xk ] = 0, then X j and Xk are independent. The next theorem generalizes
this independence to sets of variables.

Theorem 3.1. If W = (X, Y) is a multivariate normal collection, then Cov[X, Y] = 0 (see


Equation 2.46) implies that X and Y are independent.

Proof. For simplicity, we will assume the mean of W is 0. Let B (p × M1 ) and C


(q × M2) be matrices such that BB′ = Cov[X] and CC′ = Cov[Y], and Z = (Z1, Z2) be a standard normal collection of M1 + M2 variables, where Z1 is 1 × M1 and Z2 is 1 × M2. By assumption on the covariances between the Xk’s and Yl’s, and properties of B and C,
\[
Cov[W] = \begin{pmatrix} Cov[X] & 0 \\ 0 & Cov[Y] \end{pmatrix} = \begin{pmatrix} BB' & 0 \\ 0 & CC' \end{pmatrix} = AA', \tag{3.21}
\]
where
\[
A = \begin{pmatrix} B & 0 \\ 0 & C \end{pmatrix}. \tag{3.22}
\]
Which shows that W has distribution given by ZA′. With that representation, we have that X = Z1B′ and Y = Z2C′. Because the Zi’s are mutually independent, and
the subsets Z1 and Z2 do not overlap, Z1 and Z2 are independent, which means that
X and Y are independent.

The theorem can also be proved using mgf’s or pdf’s. See Exercises 3.7.15 and
8.8.12.

3.3 Multivariate normal data matrix


Here we connect the n × q data matrix Y (1.1) to the multivariate normal. Each row of
Y represents the values of q variables for an individual. Often, the data are modeled
considering the rows of Y as independent observations from a population. Letting Yi
be the i th row of Y, we would say that

Y1 , . . . , Yn are independent and identically distributed (iid). (3.23)



In the iid case, the vectors all have the same mean μ and covariance matrix Σ. Thus
the mean of the entire matrix M = E [Y] is
\[
M = \begin{pmatrix} \mu \\ \mu \\ \vdots \\ \mu \end{pmatrix}. \tag{3.24}
\]
For the covariance of the Y, we need to string all the elements out, as in (2.45),
as (Y1 , . . . , Yn ). By independence, the covariance between variables from different
individuals is 0, that is, Cov[Yij, Ykl] = 0 if i ≠ k. Each group of q variables from a
single individual has covariance Σ, so that Cov[Y] is block diagonal:
\[
\Omega = Cov[Y] = \begin{pmatrix} \Sigma & 0 & \cdots & 0 \\ 0 & \Sigma & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma \end{pmatrix}. \tag{3.25}
\]
Patterned matrices such as (3.24) and (3.25) can be more efficiently represented as
Kronecker products.
Definition 3.5. If A is a p × q matrix and B is an n × m matrix, then the Kronecker
product is the (np) × (mq ) matrix A ⊗ B given by
\[
A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1q}B \\ a_{21}B & a_{22}B & \cdots & a_{2q}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1}B & a_{p2}B & \cdots & a_{pq}B \end{pmatrix}. \tag{3.26}
\]
Thus the mean in (3.24) and covariance matrix in (3.25) can be written as follows:
M = 1n ⊗ μ and Ω = In ⊗ Σ. (3.27)
Recall that 1n is the n × 1 vector of all 1’s, and In is the n × n identity matrix. Now if
the rows of Y are iid multivariate normal, we write
Y ∼ Nn×q (1n ⊗ μ, In ⊗ Σ). (3.28)
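R has the Kronecker product built in (the %x% operator), so the patterned mean and covariance in (3.27) can be constructed directly. A small sketch with my own illustrative n, μ, and Σ:

  n <- 4
  mu <- c(1, 3)                            # 1 x q mean vector, q = 2
  Sigma <- matrix(c(2, 1, 1, 2), 2, 2)     # q x q covariance
  M <- matrix(1, n, 1) %x% matrix(mu, nrow = 1)   # 1_n (x) mu: each row is mu
  Omega <- diag(n) %x% Sigma               # I_n (x) Sigma: block diagonal, nq x nq
  M
  Omega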
Often the rows are independent with common covariance Σ, but not necessarily hav-
ing the same means. Then we have
Y ∼ Nn×q (M, In ⊗ Σ). (3.29)
We have already seen examples of linear combinations of elements in the data
matrix. In (1.9) and (1.10), we had combinations of the form CY, where the matrix
multiplied Y on the left. The linear combinations are of the individuals within the
variable, so that each variable is affected in the same way. In (1.23), and for principal
components, the matrix is on the right: YD . In this case, the linear combinations are
of the variables, with the variables for each individual affected the same way. More
generally, we have affine transformations of the form (2.80),
W = A + CYD . (3.30)
Because W is an affine transformation of Y, it is also multivariate normal. When
Cov[Y] has the form as in (3.29), then so does W.

Proposition 3.1. If Y ∼ Nn×q(M, H ⊗ Σ) and W = A + CYD′, where C is m × n, D is p × q, and A is m × p, then
W ∼ Nm×p(A + CMD′, CHC′ ⊗ DΣD′). (3.31)

The mean part follows directly from the second part of (2.84). For the covari-
ance, we need some facts about Kronecker products, proofs of which are tedious but
straightforward. See Exercises 3.7.17 to 3.7.18.

Proposition 3.2. Presuming the matrix operations make sense and the inverses exist,

(A ⊗ B)′ = A′ ⊗ B′ (3.32a)
(A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD) (3.32b)
(A ⊗ B)^{−1} = A^{−1} ⊗ B^{−1} (3.32c)
row(CYD′) = row(Y)(C ⊗ D)′ (3.32d)
trace(A ⊗ B) = trace(A) trace(B) (3.32e)
|A ⊗ B| = |A|^b |B|^a, (3.32f)

where in the final equation, A is a × a and B is b × b. If Cov[U] = A ⊗ B, then

Var [Uij ] = aii b jj , more generally, (3.33a)


Cov[Uij , Ukl ] = aik b jl (3.33b)
Cov[i th row of U] = aii B (3.33c)
Cov[ jth column of U] = b jj A. (3.33d)

To prove the covariance result in Proposition 3.1, write

\[
\begin{aligned}
Cov[CYD'] &= Cov[\mathrm{row}(Y)(C \otimes D)'] && \text{by (3.32d)} \\
&= (C \otimes D)\, Cov[\mathrm{row}(Y)]\,(C \otimes D)' && \text{by (2.85)} \\
&= (C \otimes D)(H \otimes \Sigma)(C' \otimes D') && \text{by (3.32a)} \\
&= CHC' \otimes D\Sigma D' && \text{by (3.32b), twice.}
\end{aligned}
\tag{3.34}
\]

One direct application of the proposition is the sample mean in the iid case (3.28),
so that Y ∼ Nn×q (1n ⊗ μ, In ⊗ Σ). Then from (1.9),

\[
\bar{Y} = \frac{1}{n}\,\mathbf{1}_n' Y, \tag{3.35}
\]
so we can use Proposition 3.1 with C = (1/n)1_n′, D = I_q, and A = 0. Thus
\[
\bar{Y} \sim N_q\Big(\big(\tfrac{1}{n}\mathbf{1}_n'\mathbf{1}_n\big) \otimes \mu,\ \big(\tfrac{1}{n}\mathbf{1}_n'\big)\, I_n\, \big(\tfrac{1}{n}\mathbf{1}_n\big) \otimes \Sigma\Big) = N_q\big(\mu,\ \tfrac{1}{n}\Sigma\big), \tag{3.36}
\]
since (1/n)1_n′1_n = 1, and c ⊗ A = cA if c is a scalar. This result should not be surprising, because it is the analog of the univariate result that Ȳ ∼ N(μ, σ²/n).

3.4 Conditioning in the multivariate normal


We start here with X being a 1 × p vector and Y being a 1 × q vector, then specialize
to the data matrix case at the end of this section. If (X, Y) is multivariate normal, then
the conditional distributions of Y given X = x are multivariate normal as well. Let

\[
(X, Y) \sim N_{p+q}\left((\mu_X, \mu_Y),\ \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}\right). \tag{3.37}
\]

Rather than diving in to joint densities, as in (2.23), we start by predicting the vector
Y from X with an affine transformation. That is, we wish to find α, 1 × q, and β, p × q,
so that
Y ≈ α + Xβ. (3.38)
We use the least squares criterion, which is to find (α, β) to minimize
\[
q(\alpha, \beta) = E[\|Y - \alpha - X\beta\|^2]. \tag{3.39}
\]

We start by noting that if the 1 × q vector W has finite covariance matrix, then
E[‖W − c‖²] is uniquely minimized over c ∈ R^q by c = E[W]. See Exercise 3.7.21.
Letting W = Y − Xβ, we have that fixing β, q (α, β) is minimized over α by taking

α = E [Y − Xβ] = μY − μ X β. (3.40)

Using that α in (3.39), we now want to minimize
\[
q(\mu_Y - \mu_X\beta, \beta) = E[\|Y - (\mu_Y - \mu_X\beta) - X\beta\|^2] = E[\|(Y - \mu_Y) - (X - \mu_X)\beta\|^2] \tag{3.41}
\]
over β. Using the trick that for a row vector z, ‖z‖² = trace(z′z), and letting X∗ = X − μ_X and Y∗ = Y − μ_Y, we can write (3.41) as
\[
E[\mathrm{trace}((Y^* - X^*\beta)'(Y^* - X^*\beta))] = \mathrm{trace}(E[(Y^* - X^*\beta)'(Y^* - X^*\beta)]) = \mathrm{trace}(\Sigma_{YY} - \Sigma_{YX}\beta - \beta'\Sigma_{XY} + \beta'\Sigma_{XX}\beta). \tag{3.42}
\]
Now we complete the square. That is, we want to find β∗ so that
\[
\Sigma_{YY} - \Sigma_{YX}\beta - \beta'\Sigma_{XY} + \beta'\Sigma_{XX}\beta = (\beta - \beta^*)'\Sigma_{XX}(\beta - \beta^*) + \Sigma_{YY} - \beta^{*\prime}\Sigma_{XX}\beta^*. \tag{3.43}
\]

Matching, we must have that β′Σ_XXβ∗ = β′Σ_XY, so that if Σ_XX is invertible, we need that β∗ = Σ_XX^{−1}Σ_XY. Then the trace of the expression in (3.43) is minimized by taking β = β∗, since that sets to 0 the part depending on β, and you can’t do better than that. Which means that (3.39) is minimized with
\[
\beta = \Sigma_{XX}^{-1}\Sigma_{XY}, \tag{3.44}
\]
and α in (3.40). The minimum of (3.39) is the trace of
\[
\Sigma_{YY} - \beta'\Sigma_{XX}\beta = \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}. \tag{3.45}
\]

The prediction of Y is then α + Xβ. Define the residual to be the error in the
prediction:
R = Y − α − Xβ. (3.46)

Next step is to find the joint distribution of (X, R). Because it is an affine transforma-
tion of (X, Y), the joint distribution is multivariate normal, hence we just need to find
the mean and covariance matrix. The mean of X we know is μ X , and the mean of R
is 0, from (3.40). The transform is

\[
\begin{pmatrix} X & R \end{pmatrix} = \begin{pmatrix} X & Y \end{pmatrix}\begin{pmatrix} I_p & -\beta \\ 0 & I_q \end{pmatrix} + \begin{pmatrix} 0 & -\alpha \end{pmatrix}, \tag{3.47}
\]
hence
\[
Cov[(X\ \ R)] = \begin{pmatrix} I_p & 0 \\ -\beta' & I_q \end{pmatrix}\begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}\begin{pmatrix} I_p & -\beta \\ 0 & I_q \end{pmatrix} = \begin{pmatrix} \Sigma_{XX} & 0 \\ 0 & \Sigma_{YY\cdot X} \end{pmatrix} \tag{3.48}
\]
where
\[
\Sigma_{YY\cdot X} = \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}, \tag{3.49}
\]
the minimizer in (3.45).
Note the zero in the covariance matrix. Because we have multivariate normality, X
and R are thus independent, and

R ∼ Nq (0, ΣYY · X ). (3.50)

Using the plug-in formula with independence, (2.63), so that

Y | X = x = α + xβ + R, (3.51)

leads to the next result.


Proposition 3.3. If (X, Y) is multivariate normal as in (3.37), and Σ XX invertible, then

Y | X = x ∼ N (α + xβ, ΣYY · X ), (3.52)

where α is given in (3.40), β is given in (3.44), and the conditional covariance matrix is given
in (3.49).
The conditional distribution is particularly nice:
• It is multivariate normal;
• The conditional mean is an affine transformation of x;
• It is homoskedastic, that is, the conditional covariance matrix does not depend on
x.
These properties are the typical assumptions in linear regression.
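Given a particular partitioned mean and covariance, the quantities in Proposition 3.3 are simple matrix computations; here is an R sketch with made-up numbers (not from the text):

  mu.x <- c(0, 0); mu.y <- c(1, 2)               # X is 1 x 2, Y is 1 x 2
  Sxx <- matrix(c(4, 1, 1, 2), 2, 2)
  Sxy <- matrix(c(1, 0.5, 0.2, 1), 2, 2)
  Syy <- matrix(c(3, 1, 1, 2), 2, 2)
  beta  <- solve(Sxx) %*% Sxy                    # (3.44)
  alpha <- mu.y - mu.x %*% beta                  # (3.40)
  Syy.x <- Syy - t(Sxy) %*% solve(Sxx) %*% Sxy   # (3.49)
  x <- c(1, -1)                                  # a particular value of X
  alpha + x %*% beta                             # conditional mean E[Y | X = x]
  Syy.x                                          # conditional covariance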

Conditioning in a multivariate normal data matrix


So far we have looked at just one X/Y vector, whereas data will have a number of
such vectors. Stacking these vectors into a data matrix, we have the distribution as in
(3.29), but with an X matrix as well. That is, let X be n × p and Y be n × q, where

\[
\begin{pmatrix} X & Y \end{pmatrix} \sim N_{n\times(p+q)}\left(\begin{pmatrix} M_X & M_Y \end{pmatrix},\ I_n \otimes \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}\right). \tag{3.53}
\]

The conditional distribution of Y given X = x can be obtained by applying Proposi-


tion 3.3 to (row(X), row(Y)), whose distribution can be written
   
\[
\begin{pmatrix} \mathrm{row}(X) & \mathrm{row}(Y) \end{pmatrix} \sim N_{np+nq}\left(\begin{pmatrix} \mathrm{row}(M_X) & \mathrm{row}(M_Y) \end{pmatrix},\ \begin{pmatrix} I_n \otimes \Sigma_{XX} & I_n \otimes \Sigma_{XY} \\ I_n \otimes \Sigma_{YX} & I_n \otimes \Sigma_{YY} \end{pmatrix}\right). \tag{3.54}
\]

See Exercise 3.7.23. We again have the same β, but α is a bit expanded (it is n × q):

α = M_Y − M_X β,  β = Σ_XX^{−1} Σ_XY. (3.55)
With R = Y − α − Xβ, we obtain that X and R are independent, and
Y | X = x ∼ Nn×q (α + xβ, In ⊗ ΣYY · X ). (3.56)

3.5 The sample covariance matrix: Wishart distribution


Consider the iid case (3.28), Y ∼ Nn×q (1n ⊗ μ, In ⊗ Σ). The sample covariance matrix
is given in (1.17) and (1.15),
\[
S = \frac{1}{n}\, W, \qquad W = Y' H_n Y, \qquad H_n = I_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n'. \tag{3.57}
\]
See (1.12) for the centering matrix, Hn . Here we find the joint distribution of the
sample mean Y and W. The marginal distribution of the sample mean is given in
(3.36). Start by looking at the mean and the deviations together:
\[
\begin{pmatrix} \bar{Y} \\ H_n Y \end{pmatrix} = \begin{pmatrix} \frac{1}{n}\mathbf{1}_n' \\ H_n \end{pmatrix} Y. \tag{3.58}
\]
Thus they are jointly normal. The mean of the sample mean is μ, and the mean of the
deviations Hn Y is Hn 1n ⊗ μ = 0. (Recall Exercise 1.9.1.) The covariance is given by

\[
Cov\begin{pmatrix} \bar{Y} \\ H_n Y \end{pmatrix} = \begin{pmatrix} \frac{1}{n}\mathbf{1}_n' \\ H_n \end{pmatrix}\begin{pmatrix} \frac{1}{n}\mathbf{1}_n & H_n \end{pmatrix} \otimes \Sigma = \begin{pmatrix} \frac{1}{n} & 0 \\ 0 & H_n \end{pmatrix} \otimes \Sigma. \tag{3.59}
\]

The zeroes in the covariance show that Y and Hn Y are independent (as they are in
the familiar univariate case), implying that Y and W are independent. Also,
U ≡ Hn Y ∼ N (0, Hn ⊗ Σ). (3.60)
Because Hn is idempotent, W = Y′HnY = U′U. At this point, instead of trying to
figure out the distribution of W, we define it to be what it is. Actually, Wishart [1928]
did this a while ago. Next is the formal definition.
Definition 3.6 (Wishart distribution). If Z ∼ Nν×p(0, Iν ⊗ Σ), then Z′Z is Wishart on ν degrees of freedom, with parameter Σ, written
Z′Z ∼ Wishart_p(ν, Σ). (3.61)

The difference between the distribution of U and the Z in the definition is the
former has Hn where we would prefer an identity matrix. We can deal with this issue
by rotating the Hn. We need its spectral decomposition. More generally, suppose J is an n × n symmetric and idempotent matrix, with spectral decomposition (Theorem 1.1) J = ΓΛΓ′, where Γ is orthogonal and Λ is diagonal with nonincreasing diagonal elements. Because it is idempotent, JJ = J, hence
ΓΛΓ′ = ΓΛΓ′ΓΛΓ′ = ΓΛΛΓ′, (3.62)
so that Λ = Λ², or λi = λi², for the eigenvalues i = 1, . . . , n. That means that each of
the eigenvalues is either 0 or 1. If matrices A and B have the same dimensions, then
trace(AB′) = trace(B′A). (3.63)
See Exercise 3.7.5. Thus
trace(J) = trace(ΛΓ′Γ) = trace(Λ) = λ1 + · · · + λn, (3.64)
which is the number of eigenvalues that are 1. Because the eigenvalues are ordered
from largest to smallest, λ1 = · · · = λtrace(J) = 1, and the rest are 0. Hence the
following result.
Lemma 3.1. Suppose J, n × n, is symmetric and idempotent. Then its spectral decomposition
is
\[
J = \begin{pmatrix} \Gamma_1 & \Gamma_2 \end{pmatrix}\begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} \Gamma_1' \\ \Gamma_2' \end{pmatrix} = \Gamma_1\Gamma_1', \tag{3.65}
\]
where k = trace(J), Γ1 is n × k, and Γ2 is n × (n − k).
Now suppose
U ∼ N (0, J ⊗ Σ), (3.66)
for J as in the lemma. Letting Γ = (Γ1, Γ2) in (3.65), we have E[Γ′U] = 0 and
\[
Cov[\Gamma' U] = Cov\begin{pmatrix} \Gamma_1' U \\ \Gamma_2' U \end{pmatrix} = \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} \otimes \Sigma. \tag{3.67}
\]
Thus Γ2′U has mean and covariance zero, hence it must be zero itself (with probability one). That is,
U′U = U′ΓΓ′U = U′Γ1Γ1′U + U′Γ2Γ2′U = U′Γ1Γ1′U. (3.68)
By (3.66), and since J = Γ1Γ1′ in (3.65),
Γ1′U ∼ N(0, Γ1′Γ1Γ1′Γ1 ⊗ Σ) = N(0, Ik ⊗ Σ). (3.69)


Now we can apply the Wishart definition (3.61) to Γ1′U, to obtain the next result.
Corollary 3.1. If U ∼ Nn×p(0, J ⊗ Σ) for idempotent J, then U′U ∼ Wishart_p(trace(J), Σ).
To apply the corollary to W = Y′HnY in (3.57), by (3.60), we need only the trace of Hn:
\[
\mathrm{trace}(H_n) = \mathrm{trace}(I_n) - \frac{1}{n}\,\mathrm{trace}(\mathbf{1}_n\mathbf{1}_n') = n - \frac{1}{n}(n) = n - 1. \tag{3.70}
\]
Thus
W ∼ Wishartq (n − 1, Σ). (3.71)
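A small simulation sketch in R (mine) of (3.71), together with the mean formula (3.73) given in the next section: generate Y with iid N(0, Σ) rows, form W = Y′HnY repeatedly, and compare the average to (n − 1)Σ.

  set.seed(1)
  n <- 10; q <- 2
  Sigma <- matrix(c(2, 1, 1, 3), 2, 2)
  A <- chol(Sigma)                          # A'A = Sigma, so rows of Z %*% A have covariance Sigma
  Hn <- diag(n) - matrix(1 / n, n, n)       # the centering matrix
  one.W <- function() {
    Y <- matrix(rnorm(n * q), n, q) %*% A   # n iid N(0, Sigma) rows
    t(Y) %*% Hn %*% Y                       # W = Y' Hn Y
  }
  W.bar <- Reduce("+", replicate(2000, one.W(), simplify = FALSE)) / 2000
  W.bar                                     # should be close to the next line
  (n - 1) * Sigma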

3.6 Some properties of the Wishart


In this section we present some useful properties of the Wishart. The density is
derived later in Section 8.7, and a conditional distribution is presented in Section 8.2.

Mean
Letting Z1, . . . , Zν be the rows of Z in Definition 3.6, we have that
Z′Z = Z1′Z1 + · · · + Zν′Zν ∼ Wishart_q(ν, Σ). (3.72)
Each Zi ∼ N1×q(0, Σ), so E[Zi′Zi] = Cov[Zi] = Σ. Thus

E [W] = νΣ. (3.73)

In particular, for the S in (3.57), because ν = n − 1, E [S] = ((n − 1)/n )Σ, so that an
unbiased estimator of Σ is
\[
\widehat{\Sigma} = \frac{1}{n-1}\, Y' H_n Y. \tag{3.74}
\]

Sum of independent Wisharts


If
W1 ∼ Wishartq (ν1 , Σ) and W2 ∼ Wishartq (ν2 , Σ), (3.75)
and W1 and W2 are independent, then W1 + W2 ∼ Wishartq (ν1 + ν2 , Σ). This fact
can be easily shown by writing each as in (3.72), then summing.

Chi-squares
If Z1 , . . . , Zν are independent N (0, σ2 )’s, then

W = Z12 + · · · + Zν2 (3.76)

is said to be “chi-squared on ν degrees of freedom with scale σ2 ,” written W ∼ σ2 χ2ν .


(If σ2 = 1, we call it just “chi-squared on ν degrees of freedom.”) If q = 1 in the
Wishart (3.72), the Zi ’s in (3.72) are one-dimensional, i.e., N (0, σ2 )’s, hence

Wishart1 (ν, σ2 ) = σ2 χ2ν . (3.77)

Linear transformations
If Z ∼ Nν×q(0, Iν ⊗ Σ), then for p × q matrix A, ZA′ ∼ Nν×p(0, Iν ⊗ AΣA′). Using the definition of Wishart (3.61),
AZ′ZA′ ∼ Wishart_p(ν, AΣA′), (3.78)
i.e.,
AWA′ ∼ Wishart_p(ν, AΣA′). (3.79)

Marginals
Because marginals are special cases of linear transformations, central blocks of a
Wishart are Wishart. E.g., if W11 is the upper-left p × p block of W, then W11 ∼
Wishart p (ν, Σ11 ), where Σ11 is the upper-left block of Σ. See Exercise 3.7.9. A special
case of such marginal is a diagonal element, Wii , which is Wishart1 (ν, σii ), i.e., σii χ2ν .
Furthermore, if Σ is diagonal, then the diagonals of W are independent because the
corresponding normals are.

3.7 Exercises
Exercise 3.7.1. Verify the calculations in (3.9).

Exercise 3.7.2. Find the matrix B for which W ≡ (Y2 , Y5 ) = (Y1 , . . . , Y5 )B, and verify
(3.20).

Exercise 3.7.3. Verify (3.42).

Exercise 3.7.4. Verify the covariance calculation in (3.59).

Exercise 3.7.5. Suppose that A and B are both n × p matrices. Denote the elements
of A by aij , and of B by bij . (a) Give the following in terms of those elements: (AB )ii
(the i th diagonal element of the matrix AB ); and (B A) jj (the jth diagonal element of
the matrix B A). (c) Using the above, show that trace(AB ) = trace (B A).

Exercise 3.7.6. Show that in (3.69), Γ1 Γ1 = Ik .

Exercise 3.7.7. Explicitly write the sum of W1 and W2 as in (3.75) as a sum of Zi Zi ’s
as in (3.72).

Exercise 3.7.8. Suppose W ∼ σ2 χ2ν from (3.76), that is, W = Z12 + · · · + Zν2 , where the
Zi ’s are independent N (0, σ2 )’s. This exercise shows that W has pdf

\[
f_W(w \mid \nu, \sigma^2) = \frac{1}{\Gamma(\frac{\nu}{2})\,(2\sigma^2)^{\nu/2}}\; w^{\frac{\nu}{2} - 1}\, e^{-w/(2\sigma^2)}, \quad w > 0. \tag{3.80}
\]

It will help to know that U has the Gamma(α, λ) density if α > 0, λ > 0, and

\[
f_U(u \mid \alpha, \lambda) = \frac{1}{\Gamma(\alpha)\,\lambda^{\alpha}}\; u^{\alpha - 1}\, e^{-u/\lambda} \quad\text{for } u > 0. \tag{3.81}
\]

The Γ function is defined in (2.96). (It is the constant needed to have the pdf integrate
to one.) We’ll use moment generating functions. Working directly with convolutions
is another possibility. (a) Show that the moment generating function of U in (3.81) is
(1 − λt)−α when it is finite. For which t is the mgf finite? (b) Let Z ∼ N (0, σ2 ), so that
Z2 ∼ σ2 χ21 . Find the moment generating function for Z2 . [Hint: Write E [exp(tZ2 )] as
an integral using the pdf of Z, then note the exponential term in the integrand looks
like a normal with mean zero and some variance, but without the constant. Thus the
integral over that exponential is the reciprocal of the constant.] (c) Find the moment
generating function for W. (See Exercise 2.7.20.) (d) W has a gamma distribution.

What are the parameters? Does this gamma pdf coincide with (3.80)? (e) [Aside] The
density of Z2 can be derived by writing
\[
P[Z^2 \le w] = \int_{-\sqrt{w}}^{\sqrt{w}} f_Z(z)\, dz, \tag{3.82}
\]
then taking the derivative. Match the result with the σ²χ²₁ density found above. What is Γ(1/2)?

Exercise 3.7.9. Suppose W ∼ Wishart p+q (ν, Σ), where W and Σ are partitioned as
 
\[
W = \begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix} \quad\text{and}\quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \tag{3.83}
\]
where W11 and Σ11 are p × p, etc. (a) What matrix A in (3.79) is used to show
that W11 ∼ Wishart p (ν, Σ11 )? (b) Argue that if Σ12 = 0, then W11 and W22 are
independent.

Exercise 3.7.10. The balanced one-way random effects model in analysis of variance
has
Yij = μ + Ai + eij , i = 1, . . . , g; j = 1, . . . , r, (3.84)
where the Ai ’s are iid N (0, σA
2 ) and the e ’s are iid N (0, σ2 ), and the e ’s are inde-
ij e ij
pendent of the Ai ’s. Let Y be the g × r matrix of the Yij ’s. Show that
Y ∼ Ng×r (M, Ig ⊗ Σ), (3.85)
and give M and Σ in terms of the μ, σA
2 and σ2 .
e

Exercise 3.7.11. The double exponential random variable U has density


\[
f(u) = \tfrac{1}{2}\, e^{-|u|}, \quad u \in \mathbb{R}. \tag{3.86}
\]
It has mean 0, variance 2, and moment generating function M (t) = 1/(1 − t2 ) for
| t| < 1. Suppose U and V are independent double exponentials, and let
X1 = 5U, X2 = 4U + 2V. (3.87)
(a) Find the covariance matrix of X = ( X1 , X2 ). (b) Find the symmetric positive
definite square root of the covariance matrix. Call it A. Let Y = (Y1 , Y2 ) = (U, V )A.
(c) Do X and Y have the same mean? (d) Do X and Y have the same covariance matrix?
(e) Are X and Y both linear combinations of independent double exponentials? (f) Do
X and Y have the same distribution? [Look at their moment generating functions.]
(g) [Extra credit] Derive the mgf of the double exponential.

Exercise 3.7.12. Suppose Ω is a q × q symmetric matrix with spectral decomposition


(Theorem 1.1) ΓΛΓ′. (a) Show that Ω is nonnegative definite if and only if λi ≥ 0 for all i = 1, . . . , q. [Hint: Suppose it is nonnegative definite. Let γi be the i th column of Γ, and look at γi′Ωγi. What can you say about λi? The other way, suppose all λi ≥ 0. Consider bΩb′, and let w = bΓ. Write bΩb′ in terms of w and the λi.] (b) Show that Ω is positive definite if and only if λi > 0 for all i = 1, . . . , q. (c) Show that Ω is invertible if and only if λi ≠ 0 for all i = 1, . . . , q. What is the spectral decomposition
of Ω−1 if the inverse exists?

Exercise 3.7.13. Extend Theorem 3.1: Show that if W = (Y1 , . . . , Yg ) is a multivariate


normal collection, then Cov[Yi, Yj] = 0 for each i ≠ j implies that Y1, . . . , Yg are
mutually independent.
Exercise 3.7.14. Given the random vector ( X, Y, Z ), answer true or false to the follow-
ing questions: (a) Pairwise independence implies mutual independence. (b) Pairwise
independence and multivariate normality implies mutual independence. (c) Mutual
independence implies conditional independence of X and Y given Z. (d) Conditional
independence of X and Y given Z implies that X and Y are unconditionally indepen-
dent. (e) ( X, Y, Z ) multivariate normal implies (1, X, Y, Z ) is multivariate normal.
Exercise 3.7.15. Let X be 1 × p and Y be 1 × q, where

\[
(X, Y) \sim N_{1\times(p+q)}\left((\mu_X, \mu_Y),\ \begin{pmatrix} \Sigma_{XX} & 0 \\ 0 & \Sigma_{YY} \end{pmatrix}\right), \tag{3.88}
\]

so that Cov(X) = Σ XX , Cov(Y) = ΣYY , and Cov(X, Y) = 0. Using moment generating


functions, show that X and Y are independent.

Exercise 3.7.16. True/false questions: (a) If A and B are identity matrices, then A ⊗ B
is an identity matrix. (b) If A and B are orthogonal, then A ⊗ B is orthogonal. (c)
If A is orthogonal and B is not orthogonal, then A ⊗ B is orthogonal. (d) If A and
B are diagonal, then A ⊗ B is diagonal. (e) If A and B are idempotent, then A ⊗ B
is idempotent. (f) If A and B are permutation matrices, then A ⊗ B is a permutation
matrix. (A permutation matrix is a square matrix with exactly one 1 in each row, one
1 in each column, and 0’s elsewhere.) (g) If A and B are upper triangular, then A ⊗ B
is upper triangular. (An upper triangular matrix is a square matrix whose elements
below the diagonal are 0. I.e., if A is upper triangular, then aij = 0 if i > j.) (h) If A is
upper triangular and B is not upper triangular, then A ⊗ B is upper triangular. (i) If
A is not upper triangular and B is upper triangular, then A ⊗ B is upper triangular.
(j) If A and C have the same dimensions, and B and D have the same dimensions,
then A ⊗ B + C ⊗ D = (A + C) ⊗ (B + D). (k) If A and C have the same dimensions,
then A ⊗ B + C ⊗ B = (A + C) ⊗ B. (l) If B and D have the same dimensions, then
A ⊗ B + A ⊗ D = A ⊗ (B + D).
Exercise 3.7.17. Prove (3.32a), (3.32b) and (3.32c).
Exercise 3.7.18. Take C, Y and D to all be 2 × 2. Show (3.32d) explicitly.
Exercise 3.7.19. Suppose A is a × a and B is b × b. (a) Show that (3.32e) for the
trace of A ⊗ B holds. (b) Show that (3.32f) determinant of A ⊗ B holds. [Hint: Write
A ⊗ B = (A ⊗ Ib)(Ia ⊗ B). You can use the fact that the determinant of a product is the product of the determinants. For |Ia ⊗ B|, permute the rows and columns so
it looks like |B ⊗ Ia |.]
Exercise 3.7.20. Suppose the spectral decompositions of A and B are A = GLG′ and B = HKH′. Is the equation
A ⊗ B = (G ⊗ H)(L ⊗ K)(G ⊗ H)′ (3.89)

the spectral decomposition of A ⊗ B? If not, what is wrong, and how can it be fixed?

Exercise 3.7.21. Suppose W is a 1 × q vector with finite covariance matrix. Show that
q(c) = E[‖W − c‖²] is minimized over c ∈ R^q by c = E[W], and the minimum value is q(E[W]) = trace(Cov[W]). [Hint: Write
\[
\begin{aligned}
q(c) &= E[\|(W - E[W]) + (E[W] - c)\|^2] \\
 &= E[\|W - E[W]\|^2] + 2E[(W - E[W])(E[W] - c)'] + E[\|E[W] - c\|^2] \qquad (3.90)\\
 &= E[\|W - E[W]\|^2] + E[\|E[W] - c\|^2]. \qquad (3.91)
\end{aligned}
\]

Show that the middle (cross-product) term in line (3.90) is zero (E [W] and c are
constants), and argue that the second term in line (3.91) is uniquely minimized by
c = E [W]. (No need to take derivatives.)]
Exercise 3.7.22. Verify the matrix multiplication in (3.48).
Exercise 3.7.23. Suppose (X, Y) is as in (3.53). (a) Show that (3.54) follows. [Be
careful about the covariance, since row(X, Y) = (row(X), row(Y)) if n > 1.] (b)
Apply Proposition 3.3 to (3.54) to obtain

\[
\mathrm{row}(Y) \mid \mathrm{row}(X) = \mathrm{row}(x) \sim N_{nq}(\alpha^* + \mathrm{row}(x)\,\beta^*,\ \Sigma^*_{YY\cdot X}), \tag{3.92}
\]
where
\[
\alpha^* = \mathrm{row}(\mu_Y) - \mathrm{row}(\mu_X)\,\beta^*, \qquad \beta^* = I_n \otimes \beta, \qquad \Sigma^*_{YY\cdot X} = I_n \otimes \Sigma_{YY\cdot X}. \tag{3.93}
\]

What are β and ΣYY · X ? (c) Use Proposition 3.2 to derive (3.56) from part (b).
Exercise 3.7.24. Suppose ( X, Y, Z ) is multivariate normal with covariance matrix
\[
(X, Y, Z) \sim N\left((0, 0, 0),\ \begin{pmatrix} 5 & 1 & 2 \\ 1 & 5 & 2 \\ 2 & 2 & 3 \end{pmatrix}\right). \tag{3.94}
\]

(a) What is the correlation of X and Y? Consider the conditional distribution of


( X, Y )| Z = z. (b) Give the conditional covariance matrix, Cov[( X, Y )| Z = z]. (c)
The correlation from that matrix is the conditional correlation of X and Y given Z = z,
sometimes called the partial correlation. What is the conditional correlation in this
case? (d) If the conditional correlation between two variables given a third variable is
negative, is the marginal correlation between those two necessarily negative?
Exercise 3.7.25. Now suppose
\[
(X, Y, Z) \sim N\left((0, 0, 0),\ \begin{pmatrix} 5 & 1 & c \\ 1 & 5 & 2 \\ c & 2 & 3 \end{pmatrix}\right). \tag{3.95}
\]

Find c so that the conditional correlation between X and Y given Z = z is 0 (so that
X and Y are conditionally independent, because of their normality).

Exercise 3.7.26. Let Y | X = x ∼ N (0, x2 ) and X ∼ N (2, 1). (a) Find E [Y ] and Var [Y ].
(b) Let Z = Y/X. What is the conditional distribution of Z | X = x? Is Z independent
of X? What is the marginal distribution of Z? (c) What is the conditional distribution
of Y | | X | = r?

Exercise 3.7.27. Suppose that conditionally, (Y1 , Y2 ) | X = x are iid N (α + βx, 10),
and that marginally, E [ X ] = Var [ X ] = 1. (The X is not necessarily normal.) (a) Find
Var [Yi ], Cov[Y1, Y2 ], and the (unconditional) correlation between Y1 and Y2 . (b) What
is the conditional distribution of Y1 + Y2 | X = x? Is Y1 + Y2 independent of X? (c)
What is the conditional distribution of Y1 − Y2 | X = x? Is Y1 − Y2 independent of X?

Exercise 3.7.28. This question reverses the conditional distribution in a multivariate


normal, without having to use Bayes’ formula. Suppose conditionally Y | X = x ∼
N (α + xβ, Σ), and marginally X ∼ N (μ X , Σ XX ), where Y is 1 × q and X is 1 × p. (a)
Show that (X, Y) is multivariate normal, and find its mean vector, and show that

\[
Cov[(X\ \ Y)] = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XX}\beta \\ \beta'\Sigma_{XX} & \Sigma + \beta'\Sigma_{XX}\beta \end{pmatrix}. \tag{3.96}
\]

[Hint: Show that X and Y − α − Xβ are independent normals, and find the A so that
(X, Y) = (X, Y − α − Xβ)A.] (b) Show that the conditional distribution X | Y = y is
multivariate normal with mean

E[X | Y = y] = μ_X + (y − α − μ_X β)(Σ + β′Σ_XX β)^{−1} β′Σ_XX, (3.97)
and
Cov[X | Y = y] = Σ_XX − Σ_XX β (Σ + β′Σ_XX β)^{−1} β′Σ_XX. (3.98)
(You can assume any covariance that needs to be invertible is invertible.)

Exercise 3.7.29 (Bayesian inference). A Bayesian approach to estimating the normal


mean vector, when the covariance matrix is known, is to set

Y | μ = m ∼ N1×q (m, Σ) and μ ∼ N1×q (μ0 , Σ0 ), (3.99)

where Σ, μ0 , and Σ0 are known. That is, the mean vector μ is a random variable,
with a multivariate normal prior. (a) Use Exercise 3.7.28 to show that the posterior
distribution of μ, i.e., μ given Y = y, is multivariate normal with

E [ μ | Y = y] = (yΣ−1 + μ0 Σ0−1 )(Σ−1 + Σ0−1 )−1 , (3.100)

and
Cov[ μ | Y = y] = (Σ−1 + Σ0−1 )−1 . (3.101)
Thus the posterior mean is a weighted average of the data y and the prior mean, with
weights inversely proportional to their respective covariance matrices. [Hint: What
are the α and β in this case? It takes some matrix manipulations to get the mean and
covariance in the given form.] (b) Show that the marginal distribution of Y is

Y ∼ N1×q (μ0 , Σ + Σ0 ). (3.102)

[Hint: See (3.96).][Note that the inverse of the posterior covariance is the sum of
the inverses of the conditional covariance of Y and the prior covariance, while the
marginal covariance of the Y is the sum of the conditional covariance of Y and the
prior covariance.] (c) Replace Y with Y, the sample mean of n iid vectors, so that
Y | μ = m ∼ N (m, Σ/n ). Keep the same prior on μ. Find the posterior distribution
of μ given the Y = y. What are the posterior mean and covariance matrix, approxi-
mately, when n is large?

Exercise 3.7.30 (Bayesian inference). Consider a matrix version of Exercise 3.7.29, i.e.,

Y | μ = m ∼ Np×q (m, K−1 ⊗ Σ) and μ ∼ Np×q (μ0 , K0−1 ⊗ Σ), (3.103)

where K, Σ, μ0 and K0 are known, and the covariance matrices are invertible. [So if Y
is a sample mean vector, K would be n, and if Y is β from multivariate regression, K
would be x x.] Notice that the Σ is the same in the conditional distribution of Y and
in the prior. Show that the posterior distribution of μ is multivariate normal, with

E [ μ | Y = y] = (K + K0 )−1 (Ky + K0 μ0 ), (3.104)

and
Cov[ μ | Y = y] = (K + K0 )−1 ⊗ Σ. (3.105)
[Hint: Use (3.100) and (3.101) on row(Y) and row(μ), then use properties of Kro-
necker products, e.g., (3.32d) and Exercise 3.7.16 (l).]

Exercise 3.7.31. Suppose X is n × p, Y is n × q, and



\[
\begin{pmatrix} X & Y \end{pmatrix} \sim N\left(\begin{pmatrix} M_X & M_Y \end{pmatrix},\ I_n \otimes \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}\right). \tag{3.106}
\]

Let
R = Y − XC − D (3.107)
for some matrices C and D. Instead of using least squares as in Section 3.4, here
we try to find C and D so that the residuals have mean zero and are independent
of X. (a) What are the dimensions of R, C and D? (b) Show that (X, R) is an affine
transformation of (X, Y). That is, find A and B so that

(X, R) = A + (X, Y)B . (3.108)

(c) Find the distribution of (X, R). (d) What must C be in order for X and R to be
independent? (You can assume Σ XX is invertible.) (e) Using the C found in part (d),
find Cov[R]. (It should be In ⊗ ΣYY · X .) (f) Sticking with the C from parts (d) and (e),
find D so that E [R] = 0. (g) Using the C and D from parts (d), (e), (f), what is the
distribution of R? The distribution of R R?

Exercise 3.7.32. Let Y ∼ Nn× p (M, In ⊗ Σ). Suppose K is an n × n symmetric idem-


potent matrix with trace(K) = k, and that KM = 0. Show that Y′KY is Wishart, and
give the parameters.
Exercise 3.7.33. Suppose Y ∼ N(xβz′, In ⊗ Σ), where x is n × p, and
\[
z = \begin{pmatrix} 1 & -1 & 1 \\ 1 & 0 & -2 \\ 1 & 1 & 1 \end{pmatrix}. \tag{3.109}
\]
(a) Find C so that E[YC′] = xβ. (b) Assuming that x′x is invertible, what is the distribution of Q_x YC′, where Q_x = I_n − x(x′x)^{−1}x′? (Is Q_x idempotent? Such matrices will appear again in equation 5.19.) (c) What is the distribution of CY′Q_x YC′?

Exercise 3.7.34. Here, W ∼ Wishart p (n, Σ). (a) Is E [trace(W)] = n trace(Σ)? (b)
Are the diagonal elements of W independent? (c) Suppose Σ = σ2 I p . What is the
distribution of trace(W)?
Exercise 3.7.35. Suppose Z = ( Z1 , Z2 ) ∼ N1×2 (0, I2 ). Let (θ, R) be the polar coordi-
nates, so that
Z1 = R cos(θ ) and Z2 = R sin(θ ). (3.110)
In order for the transformation to be one-to-one, remove 0 from the space of Z. Then
the space of (θ, R) is [0, 2π ) × (0, ∞ ). The question is to derive the distribution of
(θ, R). (a) Write down the density of Z. (b) Show that the Jacobian of the transforma-
tion is r. (c) Find the density of (θ, R). What is the marginal distribution of θ? What
is the marginal density of R? Are R and θ independent? (d) Find the distribution
function FR (r ) for R. (e) Find the inverse function of FR . (f) Argue that if U1 and U2
are independent Uniform(0, 1) random variables, then
 
\[
\sqrt{-2\log(U_2)} \times \begin{pmatrix} \cos(2\pi U_1) & \sin(2\pi U_1) \end{pmatrix} \sim N_{1\times 2}(0, I_2). \tag{3.111}
\]

Thus we can generate two random normals from two random uniforms. Equation
(3.111) is called the Box-Muller transformation [Box and Muller, 1958] [Hint: See
Exercise 2.7.27.] (g) Find the pdf of W = R2 . What is the distribution of W? Does it
check with (3.80)?
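Part (f) is easy to try out numerically; the following R lines are my own sketch of the Box–Muller transformation (3.111):

  set.seed(1)
  u1 <- runif(10000); u2 <- runif(10000)
  r  <- sqrt(-2 * log(u2))
  z1 <- r * cos(2 * pi * u1)
  z2 <- r * sin(2 * pi * u1)
  c(mean(z1), var(z1), mean(z2), var(z2), cor(z1, z2))   # approximately 0, 1, 0, 1, 0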
Chapter 4

Linear Models on Both Sides

This chapter presents some basic types of linear model. We start with the usual
linear model, with just one Y-variable. Multivariate regression extends the idea to
several variables, placing the same model on each variable. We then introduce linear
models that model the variables within the observations, basically reversing the roles
of observations and variables. Finally, we introduce the both-sides model, which
simultaneously models the observations and variables. Subsequent chapters present
estimation and hypothesis testing for these models.

4.1 Linear regression

Section 3.4 presented conditional distributions in the multivariate normal. Interest


was in the effect of one set of variables, X, on another set, Y. Conditional on X = x,
the distribution of Y was normal with the mean being a linear function of x, and
the covariance being independent of x. The normal linear regression model does
not assume that the joint distribution of (X, Y) is normal, but only that given x, Y is
multivariate normal. Analysis is carried out considering x to be fixed. In fact, x need
not be a realization of a random variable, but a quantity fixed by the researcher, such
as the dose of a drug or the amount of fertilizer.
The multiple regression model uses the data matrix (x, Y), where x is n × p and is
lower case to emphasize that those values are fixed, and Y is n × 1. That is, there are
p variables in x and a single variable in Y. In Section 4.2, we allow Y to contain more
than one variable.
The model is

Y = xβ + R, where β is p × 1 and R ∼ Nn×1 (0, σR2 In ). (4.1)

Compare this to (3.51). The variance σR2 plays the role of σYY · X . The model (4.1)
assumes that the residuals Ri are iid N (0, σR2 ).
Some examples follow. There are thousands of books on linear regression and
linear models. Scheffé [1999] is the classic theoretical reference, and Christensen
[2002] provides a more modern treatment. A fine applied reference is Weisberg [2005].


Simple linear regression


One may wish to assess the relation between height and weight, or between choles-
terol level and percentage of fat in the diet. A linear relation would be cholesterol =
α + β( f at) + residual, so one would typically want both an intercept α and a slope β.
Translating this model to (4.1), we would have p = 2, where the first column contains
all 1’s. That is, if x1 , . . . , xn are the values of the explanatory variable (fat), the model
would be
\[
\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} =
\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
\begin{pmatrix} \alpha \\ \beta \end{pmatrix} +
\begin{pmatrix} R_1 \\ R_2 \\ \vdots \\ R_n \end{pmatrix}. \tag{4.2}
\]
Multiple regression would add more explanatory variables, e.g., age, blood pres-
sure, amount of exercise, etc., each one being represented by its own column in the x
matrix.
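In R, the x of (4.2) is just a column of ones bound to the explanatory variable. A sketch with made-up fat and cholesterol values (the data here are hypothetical, and the least squares formula is previewed from Chapter 6):

  set.seed(1)
  fat  <- runif(20, 10, 40)                     # hypothetical percent fat
  chol <- 150 + 3 * fat + rnorm(20, sd = 15)    # hypothetical cholesterol
  x <- cbind(1, fat)                            # the n x 2 matrix in (4.2)
  solve(t(x) %*% x, t(x) %*% chol)              # least squares estimate of (alpha, beta)
  coef(lm(chol ~ fat))                          # the same, via lm()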

Analysis of variance
In analysis of variance, observations are classified into different groups, and one
wishes to compare the means of the groups. If there are three groups, with two
observations in each group, the model could be
\[
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \end{pmatrix} =
\begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix} + R. \tag{4.3}
\]

Other design matrices x yield the same model (See Section 5.4), e.g., we could just as
well write
\[
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \end{pmatrix} =
\begin{pmatrix} 1 & 2 & -1 \\ 1 & 2 & -1 \\ 1 & -1 & 2 \\ 1 & -1 & 2 \\ 1 & -1 & -1 \\ 1 & -1 & -1 \end{pmatrix}
\begin{pmatrix} \mu \\ \alpha \\ \beta \end{pmatrix} + R, \tag{4.4}
\]
where μ is the grand mean, α is the effect for the first group, and β is the effect for
the second group. We could add the effect for group three, but that would lead to
a redundancy in the model. More complicated models arise when observations are
classified in multiple ways, e.g., sex, age, and ethnicity.
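The two design matrices (4.3) and (4.4) are easy to write down in R; a sketch (mine):

  x1 <- kronecker(diag(3), rep(1, 2))           # the x in (4.3): group indicators
  x2 <- cbind(1,
              rep(c(2, -1, -1), each = 2),      # effect column for group 1
              rep(c(-1, 2, -1), each = 2))      # effect column for group 2
  x1
  x2                                            # the x in (4.4)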

Analysis of covariance
It may be that the main interest is in comparing the means of groups as in analysis
of variance, but there are other variables that potentially affect the Y. For example, in
a study comparing three drugs’ effectiveness in treating leprosy, there were bacterial
measurements before and after treatment. The Y is the “after” measurement, and one
would expect the “before” measurement, in addition to the drugs, to affect the after

measurement. Letting xi ’s represent the before measurements, the model would be


\[
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \end{pmatrix} =
\begin{pmatrix} 1 & 0 & 0 & x_1 \\ 1 & 0 & 0 & x_2 \\ 0 & 1 & 0 & x_3 \\ 0 & 1 & 0 & x_4 \\ 0 & 0 & 1 & x_5 \\ 0 & 0 & 1 & x_6 \end{pmatrix}
\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \beta \end{pmatrix} + R. \tag{4.5}
\]

The actual experiment had ten observations in each group. See Section 7.5.

Polynomial and cyclic models


The “linear” in linear models refers to the linearity of the mean of Y in the parameter
β for fixed values of x. Within the matrix x, there can be arbitrary nonlinear functions
of variables. For example, in growth curves, one may be looking at Yi ’s over time
which grow as a quadratic in xi , i.e., E [Yi ] = β0 + β1 xi + β2 x2i . Such a model is still
considered a linear model because the β j ’s come in linearly. The full model would be
\[
\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} =
\begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} +
\begin{pmatrix} R_1 \\ R_2 \\ \vdots \\ R_n \end{pmatrix}. \tag{4.6}
\]

Higher-order polynomials add on columns of x3i ’s, x4i ’s, etc.


Alternatively, the Yi ’s might behave cyclically, such as temperature over the course
of a year, or the circadian (daily) rhythms of animals. If the cycle is over 24 hours,
and measurements are made at each hour, the model could be
\[
\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{24} \end{pmatrix} =
\begin{pmatrix} 1 & \cos(2\pi\cdot 1/24) & \sin(2\pi\cdot 1/24) \\ 1 & \cos(2\pi\cdot 2/24) & \sin(2\pi\cdot 2/24) \\ \vdots & \vdots & \vdots \\ 1 & \cos(2\pi\cdot 24/24) & \sin(2\pi\cdot 24/24) \end{pmatrix}
\begin{pmatrix} \alpha \\ \gamma_1 \\ \gamma_2 \end{pmatrix} +
\begin{pmatrix} R_1 \\ R_2 \\ \vdots \\ R_{24} \end{pmatrix}. \tag{4.7}
\]
Based on the data, typical objectives in linear regression are to estimate β, test
whether certain components of β are 0, or predict future values of Y based on its x’s.
In Chapters 6 and 7, such formal inferences will be handled. In this chapter, we are
concentrating on setting up the models.

4.2 Multivariate regression and analysis of variance


Consider (x, Y) to be a data matrix where x is again n × p, but now Y is n × q. The
linear model analogous to the conditional model in (3.56) is

Y = xβ + R, where β is p × q and R ∼ Nn×q (0, In ⊗ Σ R ). (4.8)

This model looks very much like the linear regression model in (4.1), and it is. It is
actually just a concatenation of q linear models, one for each variable (column) of Y.
Note that (4.8) places the same model on each variable, in the sense of using the same

x’s, but allows different coefficients represented by the different columns of β. That
is, (4.8) implies
Y1 = xβ 1 + R1 , . . . , Yq = xβ q + Rq , (4.9)

where the subscript i indicates the i th column of the matrix.


The x matrix is the same as in the previous section, so rather than repeating
the examples, just imagine them with extra columns of Y and β, and prepend the
word “multivariate” to the models, e.g., multivariate analysis of variance, multivari-
ate polynomial regression, etc.
One might ask what the advantage is of doing all q regressions at once rather
than doing q separate ones. Good question. The main reason is to gather strength
from having several variables. For example, suppose one has an analysis of variance
comparing drugs on a number of health-related variables. It may be that no single
variable shows significant differences between drugs, but the variables together show
strong differences. Using the overall model can also help deal with multiple compar-
isons, e.g., when one has many variables, there is a good chance at least one shows
significance even when there is nothing going on.
These models are more compelling when they are expanded to model dependen-
cies among the means of the variables, which is the subject of Section 4.3.

4.2.1 Examples of multivariate regression


Example: Grades
The data are the grades (in the data set grades), and sex (0=Male, 1=Female), of 107
students, a portion of which is below:

Obs i Gender HW Labs InClass Midterms Final Total


1 0 30.47 0.00 0 60.38 52 43.52
2 1 37.72 20.56 75 69.84 62 59.34
3 1 65.56 77.33 75 68.81 42 63.18
4 0 65.50 75.83 100 58.88 56 64.04
5 1 72.36 65.83 25 74.93 60 65.92 (4.10)
.. .. .. .. .. .. .. ..
. . . . . . . .
105 1 93.18 97.78 100 94.75 92 94.64
106 1 97.54 99.17 100 91.23 96 94.69
107 1 94.17 97.50 100 94.64 96 95.67

Consider predicting the midterms and final exam scores from gender, and the
homework, labs, and inclass scores. The model is Y = xβ + R, where Y is 107 × 2
(the Midterms and Finals), x is 107 × 5 (with Gender, HW, Labs, InClass, plus the first
column of 1107 ), and β is 5 × 2:
\[
\beta = \begin{pmatrix} \beta_{0M} & \beta_{0F} \\ \beta_{GM} & \beta_{GF} \\ \beta_{HM} & \beta_{HF} \\ \beta_{LM} & \beta_{LF} \\ \beta_{IM} & \beta_{IF} \end{pmatrix}. \tag{4.11}
\]
Chapter 6 shows how to estimate the β ij ’s. In this case the estimates are

             Midterms     Final
Intercept      56.472    43.002
Gender         −3.044    −1.922
HW              0.244     0.305          (4.12)
Labs            0.052     0.005
InClass         0.048     0.076
Note that the largest slopes (not counting the intercepts) are the negative ones for
gender, but to truly assess the sizes of the coefficients, we will need to find their
standard errors, which we will do in Chapter 6.
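In R, lm() accepts a multicolumn response, so the two regressions can be fit at once. The sketch below assumes the grades data have been read into a data frame called grades with the column names shown in (4.10); that name and those column names are assumptions on my part.

  # Sketch: assumes a data frame `grades` with columns Gender, HW, Labs, InClass, Midterms, Final.
  fit <- lm(cbind(Midterms, Final) ~ Gender + HW + Labs + InClass, data = grades)
  coef(fit)        # a 5 x 2 matrix of estimates, as in (4.12)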

Mouth sizes
Measurements were made on the size of mouths of 27 children at four ages: 8, 10,
12, and 14. The measurement is the distance from the “center of the pituitary to the
pteryomaxillary fissure”¹ in millimeters. These data can be found in Potthoff and
Roy [1964]. There are 11 girls (Sex=1) and 16 boys (Sex=0). See Table 4.1. Figure 4.1
contains a plot of the mouth sizes over time. These curves are generally increasing.
There are some instances where the mouth sizes decrease over time. The measure-
ments are between two defined locations in the mouth, and as people age, the mouth
shape can change, so it is not that people’s mouths are really getting smaller. Note that
generally the boys have bigger mouths than the girls, as they are generally bigger
overall.
For the linear model, code x where the first column is 1 = girl, 0 = boy, and the
second column is 0 = girl, 1 = boy:
 
$$
Y = x\beta + R = \begin{pmatrix} 1_{11} & 0_{11} \\ 0_{16} & 1_{16} \end{pmatrix}
\begin{pmatrix} \beta_{11} & \beta_{12} & \beta_{13} & \beta_{14} \\ \beta_{21} & \beta_{22} & \beta_{23} & \beta_{24} \end{pmatrix} + R. \qquad (4.13)
$$

Here, Y and R are 27 × 4. So now the first row of β has the (population) means of the
girls for the four ages, and the second row has the means for the boys. The sample
means are

Age8 Age10 Age12 Age14


Girls 21.18 22.23 23.09 24.09 (4.14)
Boys 22.88 23.81 25.72 27.47
The lower plot in Figure 4.1 shows the sample mean vectors. The boys’ curve is
higher than the girls’, and the two are reasonably parallel, and linear.

Histamine in dogs
Sixteen dogs were treated with drugs to see the effects on their blood histamine
levels. The dogs were split into four groups: Two groups received the drug morphine,
and two received the drug trimethaphan, both given intravenously. For one group
within each pair of drug groups, the dogs had their supply of histamine depleted
before treatment, while the other group had histamine intact. So this was a two-way
1. Actually, I believe it is the pterygomaxillary fissure. See Wikipedia [2010] for an illustration and some references.

Obs i Age8 Age10 Age12 Age14 Sex


1 21.0 20.0 21.5 23.0 1
2 21.0 21.5 24.0 25.5 1
3 20.5 24.0 24.5 26.0 1
4 23.5 24.5 25.0 26.5 1
5 21.5 23.0 22.5 23.5 1
6 20.0 21.0 21.0 22.5 1
7 21.5 22.5 23.0 25.0 1
8 23.0 23.0 23.5 24.0 1
9 20.0 21.0 22.0 21.5 1
10 16.5 19.0 19.0 19.5 1
11 24.5 25.0 28.0 28.0 1
12 26.0 25.0 29.0 31.0 0
13 21.5 22.5 23.0 26.5 0
14 23.0 22.5 24.0 27.5 0
15 25.5 27.5 26.5 27.0 0
16 20.0 23.5 22.5 26.0 0
17 24.5 25.5 27.0 28.5 0
18 22.0 22.0 24.5 26.5 0
19 24.0 21.5 24.5 25.5 0
20 23.0 20.5 31.0 26.0 0
21 27.5 28.0 31.0 31.5 0
22 23.0 23.0 23.5 25.0 0
23 21.5 23.5 24.0 28.0 0
24 17.0 24.5 26.0 29.5 0
25 22.5 25.5 25.5 26.0 0
26 23.0 24.5 26.0 30.0 0
27 22.0 21.5 23.5 25.0 0

Table 4.1: The mouth size data, from Potthoff and Roy [1964].

analysis of variance model, the factors being “Drug” (Morphine or Trimethaphan)


and “Depletion” (Intact or Depleted). These data are from a study by Morris and
Zeppa [1963], analyzed also in Cole and Grizzle [1966]. See Table 4.2.
Each dog had four measurements: Histamine levels (in micrograms per milliliter
of blood) before the inoculation, and then at 1, 3, and 5 minutes after. (The value
“0.10” marked with an asterisk was actually missing. I filled it in arbitrarily.)
Figure 4.2 has a plot of the 16 dogs’ series of measurements. Most of the data is
close to zero, so it is hard to distinguish many of the individuals.
The model is a two-way multivariate analysis of variance one: Y = xβ + R, where
β contains the mean effect (μ), two main effects (α and β), and interaction effect (γ)
for each time point:

$$
Y = \begin{pmatrix} 1_4 & -1_4 & -1_4 & 1_4 \\ 1_4 & -1_4 & 1_4 & -1_4 \\ 1_4 & 1_4 & -1_4 & -1_4 \\ 1_4 & 1_4 & 1_4 & 1_4 \end{pmatrix}
\begin{pmatrix} \mu_0 & \mu_1 & \mu_3 & \mu_5 \\ \alpha_0 & \alpha_1 & \alpha_3 & \alpha_5 \\ \beta_0 & \beta_1 & \beta_3 & \beta_5 \\ \gamma_0 & \gamma_1 & \gamma_3 & \gamma_5 \end{pmatrix} + R. \qquad (4.15)
$$


Figure 4.1: Mouth sizes over time. The boys are indicated by dashed lines, the girls
by solid lines. The top graph has the individual curves, the bottom the averages for
the boys and girls.

The estimate of β is

Effect Before After1 After3 After5


Mean 0.077 0.533 0.364 0.260
Drug −0.003 0.212 0.201 0.140 (4.16)
Depletion 0.012 −0.449 −0.276 −0.169
Interaction 0.007 −0.213 −0.202 −0.144
See the second plot in Figure 4.2 for the means of the groups, and Figure 4.3 for
the effects, both plotted over time. Note that the mean and depletion effects are the
largest, particularly at time point 2, After1.
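A short R sketch of how the design matrix in (4.15) can be built with Kronecker products follows; the group order and the ±1 coding are assumed to match Table 4.2 (MI, MD, TI, TD down the rows), and Y is assumed to be the 16 × 4 matrix of histamine measurements.

# Assumed coding: drug (−1 = morphine, +1 = trimethaphan), depletion (−1 = intact, +1 = depleted)
w <- cbind(mean = 1, drug = c(-1, -1, 1, 1), depletion = c(-1, 1, -1, 1))
w <- cbind(w, interaction = w[, "drug"] * w[, "depletion"])
x <- kronecker(w, matrix(1, 4, 1))        # 16 x 4: four dogs per group
# betahat <- solve(t(x) %*% x, t(x) %*% Y)  # reproduces the effects in (4.16)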

4.3 Linear models on both sides


The regression and multivariate regression models in the previous sections model
differences between the individuals: The rows of x are different for different indi-

Obs i Before After1 After3 After5


Morphine 1 0.04 0.20 0.10 0.08
Intact 2 0.02 0.06 0.02 0.02
3 0.07 1.40 0.48 0.24
4 0.17 0.57 0.35 0.24
Morphine 5 0.10 0.09 0.13 0.14
Depleted 6 0.12 0.11 0.10 ∗0.10
7 0.07 0.07 0.07 0.07
8 0.05 0.07 0.06 0.07
Trimethaphan 9 0.03 0.62 0.31 0.22
Intact 10 0.03 1.05 0.73 0.60
11 0.07 0.83 1.07 0.80
12 0.09 3.13 2.06 1.23
Trimethaphan 13 0.10 0.09 0.09 0.08
Depleted 14 0.08 0.09 0.09 0.10
15 0.13 0.10 0.12 0.12
16 0.06 0.05 0.05 0.05

Table 4.2: The data on histamine levels in dogs. The value with the asterisk is missing,
but for illustration purposes I filled it in. The dogs are classified according to the
drug administered (morphine or trimethaphan), and whether the dog’s histamine
was artificially depleted.

viduals, but the same for each variable. Models on the variables switch the roles of
variable and individual.

4.3.1 One individual


Start with just one individual, so that Y = (Y1 , . . . , Yq ) is a 1 × q row vector. A linear
model on the variables is

Y = βz′ + R, where β is 1 × l, R ∼ N1×q (0, Σ R ), (4.17)

and z is a fixed q × l matrix. The model (4.17) looks like just a transpose of model
(4.1), but (4.17) does not have iid residuals, because the observations are all on the
same individual. Simple repeated measures models and growth curve models are special
cases. (Simple because there is only one individual. Actual models would have more
than one.)
A repeated measure model is used if the y j ’s represent replications of the same
measurement. E.g., one may measure blood pressure of the same person several
times, or take a sample of several leaves from the same tree. If no systematic differ-
ences are expected in the measurements, the model would have the same mean μ for
each variable:
Y = μ (1, . . . , 1) + R = μ1′q + R. (4.18)

It is common in this setting to assume Σ R has the intraclass correlation structure, as


in (1.61), i.e., the variances are all equal, and the covariances are all equal.
Growth curve models are used when the measurements are made over time, and
growth (polynomial or otherwise) is expected. A quadratic model turns (4.6) on its


Figure 4.2: Plots of the dogs over time. The top plot has the individual dogs, the
bottom has the means of the groups. The groups: MI = Morphine, Intact; MD =
Morphine, Depleted; TI = Trimethaphan, Intact; TD = Trimethaphan, Depleted

side:
$$
Y = (\beta_0, \beta_1, \beta_2) \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_q \\ x_1^2 & x_2^2 & \cdots & x_q^2 \end{pmatrix} + R. \qquad (4.19)
$$

Similarly one can transpose cyclic models akin to (4.7).

4.3.2 IID observations


Now suppose we have a sample of n independent individuals, so that the n × q data
matrix is distributed
Y ∼ Nn×q (1n ⊗ μ, In ⊗ Σ R ), (4.20)
which is the same as (3.28) with slightly different notation. Here, μ is 1 × q, so the
model says that the rows of Y are independent with the same mean μ and covariance
matrix Σ R . A repeated measure model assumes in addition that the elements of μ are
equal to μ, so that the linear model takes the mean in (4.20) and combines it with the


Figure 4.3: Plots of the effects in the analysis of variance for the dogs data, over time.

mean in (4.18) to obtain

Y = 1n μ1′q + R, R ∼ Nn×q (0, In ⊗ Σ R ). (4.21)

This model makes sense if one takes a random sample of n individuals, and makes
repeated measurements from each. More generally, a growth curve model as in (4.19),
but with n individuals measured, is
$$
Y = 1_n (\beta_0, \beta_1, \beta_2) \begin{pmatrix} 1 & 1 & \cdots & 1 \\ z_1 & z_2 & \cdots & z_q \\ z_1^2 & z_2^2 & \cdots & z_q^2 \end{pmatrix} + R. \qquad (4.22)
$$

Example: Births
The average births for each hour of the day for four different hospitals is given in
Table 4.3. The data matrix Y here is 4 × 24, with the rows representing the hospitals
and the columns the hours. Figure 4.4 plots the curves.
One might wish to fit sine waves (Figure 4.5) to the four hospitals’ data, presuming
one day reflects one complete cycle. The model is

Y = βz′ + R, (4.23)

where
$$
\beta = \begin{pmatrix} \beta_{10} & \beta_{11} & \beta_{12} \\ \beta_{20} & \beta_{21} & \beta_{22} \\ \beta_{30} & \beta_{31} & \beta_{32} \\ \beta_{40} & \beta_{41} & \beta_{42} \end{pmatrix} \qquad (4.24)
$$

1 2 3 4 5 6 7 8
Hosp1 13.56 14.39 14.63 14.97 15.13 14.25 14.14 13.71
Hosp2 19.24 18.68 18.89 20.27 20.54 21.38 20.37 19.95
Hosp3 20.52 20.37 20.83 21.14 20.98 21.77 20.66 21.17
Hosp4 21.14 21.14 21.79 22.54 21.66 22.32 22.47 20.88

9 10 11 12 13 14 15 16
Hosp1 14.93 14.21 13.89 13.60 12.81 13.27 13.15 12.29
Hosp2 20.62 20.86 20.15 19.54 19.52 18.89 18.41 17.55
Hosp3 21.21 21.68 20.37 20.49 19.70 18.36 18.87 17.32
Hosp4 22.14 21.86 22.38 20.71 20.54 20.66 20.32 19.36

17 18 19 20 21 22 23 24
Hosp1 12.92 13.64 13.04 13.00 12.77 12.37 13.45 13.53
Hosp2 18.84 17.18 17.20 17.09 18.19 18.41 17.58 18.19
Hosp3 18.79 18.55 18.19 17.38 18.41 19.10 19.49 19.10
Hosp4 20.02 18.84 20.40 18.44 20.83 21.00 19.57 21.35

Table 4.3: The data on average number of births for each hour of the day for four
hospitals.

Figure 4.4: Plots of the four hospitals’ births over twenty-four hours.


Figure 4.5: Sine and cosine waves, where one cycle spans twenty-four hours.

and
$$
z' = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ \cos(1 \cdot 2\pi/24) & \cos(2 \cdot 2\pi/24) & \cdots & \cos(24 \cdot 2\pi/24) \\ \sin(1 \cdot 2\pi/24) & \sin(2 \cdot 2\pi/24) & \cdots & \sin(24 \cdot 2\pi/24) \end{pmatrix}, \qquad (4.25)
$$

the z here being the same as the x in (4.7).


The estimate of β is

Mean Cosine Sine


Hosp1 13.65 0.03 0.93
Hosp2 19.06 −0.69 1.46 (4.26)
Hosp3 19.77 −0.22 1.70
Hosp4 20.93 −0.12 1.29

Then the “fits” are Ŷ = β̂z′, which is also 4 × 24. See Figure 4.6.
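Here is a small R sketch of this fit, assuming the 4 × 24 matrix of hourly birth averages from Table 4.3 is stored as births (rows = hospitals, columns = hours 1–24); the estimate formula Yz(z′z)−1 anticipates (5.29).

hour <- 1:24
z <- cbind(1, cos(hour * 2 * pi / 24), sin(hour * 2 * pi / 24))   # 24 x 3, as in (4.25)
betahat <- births %*% z %*% solve(t(z) %*% z)   # reproduces the estimates in (4.26)
fits <- betahat %*% t(z)                        # betahat z', the 4 x 24 fitted sine waves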
Now try the model with the same curve for each hospital, Y = xβ∗z′ + R, where
x = 14 (the star on the β∗ is to distinguish it from the previous β):
$$
Y = x\beta^* z' + R = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} \begin{pmatrix} \beta_0^* & \beta_1^* & \beta_2^* \end{pmatrix} z' + R. \qquad (4.27)
$$

The estimates of the coefficients are now β̂∗ = (18.35, −0.25, 1.34), which is the average of the rows of β̂. The fit is graphed as the thick line in Figure 4.6.


Figure 4.6: Plots of the four hospitals’ births, with the fitted sine waves. The thick
line fits one curve to all four hospitals.

4.3.3 The both-sides model


Note that the last two models, (4.19) and (4.22) have means with fixed matrices on
both sides of the parameter. Generalizing, we have the model

Y = xβz′ + R, (4.28)

where x is n × p, β is p × l, and z is q × l. The x models differences between indi-


viduals, and the z models relationships between the variables. This formulation is by
Potthoff and Roy [1964].
For example, consider the mouth size example in Section 4.2. A growth curve
model seems reasonable, but one would not expect the iid model to hold. In particu-
lar, the mouths of the eleven girls would likely be smaller on average than those of the
sixteen boys. An analysis of variance model, with two groups, models the differences
between the individuals, while a growth curve models the relationship among the
four time points. With Y being the 27 × 4 data matrix of measurements, the model is
$$
Y = \begin{pmatrix} 1_{11} & 0_{11} \\ 0_{16} & 1_{16} \end{pmatrix}
\begin{pmatrix} \beta_{g0} & \beta_{g1} & \beta_{g2} \\ \beta_{b0} & \beta_{b1} & \beta_{b2} \end{pmatrix}
\begin{pmatrix} 1 & 1 & 1 & 1 \\ 8 & 10 & 12 & 14 \\ 8^2 & 10^2 & 12^2 & 14^2 \end{pmatrix} + R. \qquad (4.29)
$$

The “0m ”’s are m × 1 vectors of 0’s. Thus ( β g0 , β g1 , β g2 ) contains the coefficients for
the girls’ growth curve, and ( β b0 , β b1 , β b2 ) the boys’. Some questions which can be
addressed include

• Does the model fit, or are cubic terms necessary?


• Are the quadratic terms necessary (is β g2 = β b2 = 0)?
• Are the girls’ and boys’ curves the same (are β gj = β bj for j = 0, 1, 2)?

• Are the girls’ and boys’ curves parallel (are β g1 = β b1 and β g2 = β b2 , but maybe
not β g0 = β b0 )?
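A small R sketch of fitting the both-sides model (4.29) follows, using the least-squares estimate derived in Chapter 5 (equation (5.27)); it assumes the 27 × 4 matrix of mouth sizes is called mouths, with the 11 girls in the first 11 rows (the layout is assumed, matching Table 4.1).

x <- cbind(girls = rep(c(1, 0), c(11, 16)), boys = rep(c(0, 1), c(11, 16)))  # 27 x 2
ages <- c(8, 10, 12, 14)
z <- cbind(1, ages, ages^2)                                                  # 4 x 3
betahat <- solve(t(x) %*% x) %*% t(x) %*% mouths %*% z %*% solve(t(z) %*% z)
betahat   # row 1: girls' growth-curve coefficients; row 2: boys'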
See also Ware and Bowden [1977] for a circadian application and Zerbe and Jones
[1980] for a time-series context. The model is often called the generalized multivari-
ate analysis of variance, or GMANOVA, model. Extensions are many. For examples,
see Gleser and Olkin [1970], Chinchilli and Elswick [1985], and the book by Kariya
[1985].

4.4 Exercises
Exercise 4.4.1 (Prostaglandin). Below are data from Ware and Bowden [1977] taken at
six four-hour intervals (labelled T1 to T6) over the course of a day for 10 individuals.
The measurements are prostaglandin contents in their urine.
Person T1 T2 T3 T4 T5 T6
1 146 280 285 215 218 161
2 140 265 289 231 188 69
3 288 281 271 227 272 150
4 121 150 101 139 99 103
5 116 132 150 125 100 86 (4.30)
6 143 172 175 222 180 126
7 174 276 317 306 139 120
8 177 313 237 135 257 152
9 294 193 306 204 207 148
10 76 151 333 144 135 99

(a) Write down the “xβz " part of the model that fits a separate sine wave to each
person. (You don’t have to calculate the estimates or anything. Just give the x, β and
z matrices.) (b) Do the same but for the model that fits one sine wave to all people.

Exercise 4.4.2 (Skulls). The data concern the sizes of Egyptian skulls over time, from
Thomson and Randall-MacIver [1905]. There are 30 skulls from each of five time
periods, so that n = 150 all together. There are four skull size measurements, all in
millimeters: maximum length, basibregmatic height, basialveolar length, and nasal
height. The model is a multivariate analysis of variance one, where x distinguishes
between the time periods, and we do not use a z. Use polynomials for the time
periods (code them as 1, 2, 3, 4, 5), so that x = w ⊗ 130 . Find w.

Exercise 4.4.3. Suppose Yb and Ya are n × 1 with n = 4, and consider the model
(Yb Ya ) ∼ N (xβ, In ⊗ Σ), (4.31)
where
$$
x = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & -1 \\ 1 & -1 & 1 \\ 1 & -1 & -1 \end{pmatrix}. \qquad (4.32)
$$
(a) What are the dimensions of β and Σ? The conditional distribution of Ya given
Yb = (4, 2, 6, 3) is
Ya | Yb = (4, 2, 6, 3) ∼ N (x∗ β∗ , In ⊗ Ω) (4.33)

for some fixed matrix x∗ , parameter matrix β∗ , and covariance matrix Ω. (b) What are
the dimensions of β∗ and Ω? (c) What is x∗ ? (d) What is the most precise description
of the conditional model?

Exercise 4.4.4 (Caffeine). Henson et al. [1996] conducted an experiment to see whether
caffeine has a negative effect on short-term visual memory. High school students
were randomly chosen: 9 from eighth grade, 10 from tenth grade, and 9 from twelfth
grade. Each person was tested once after having caffeinated Coke, and once after
having decaffeinated Coke. After each drink, the person was given ten seconds to
try to memorize twenty small, common objects, then allowed a minute to write down
as many as could be remembered. The main question of interest is whether people
remembered more objects after the Coke without caffeine than after the Coke with
caffeine. The data are
Grade 8 Grade 10 Grade 12
Without With Without With Without With
5 6 6 3 7 7
9 8 9 11 8 6
6 5 4 4 9 6
8 9 7 6 11 7
(4.34)
7 6 6 8 5 5
6 6 7 6 9 4
8 6 6 8 9 7
6 8 9 8 11 8
6 7 10 7 10 9
10 6

“Grade" is the grade in school, and the “Without" and “With" entries are the numbers
of items remembered after drinking Coke without or with caffeine. Consider the
model
Y = xβz + R, (4.35)
where the Y is 28 × 2, the first column being the scores without caffeine, and the
second being the scores with caffeine. The x is 28 × 3, being a polynomial (quadratic)
matrix in the three grades. (a) The z has two columns. The first column of z represents
the overall mean (of the number of objects a person remembers), and the second
column represents the difference between the number of objects remembered with
caffeine and without caffeine. Find z. (b) What is the dimension of β? (c) What
effects do the β ij ’s represent? (Choices: overall mean, overall linear effect of grade,
overall quadratic effect of grade, overall difference in mean between caffeinated and
decaffeinated coke, linear effect of grade in the difference between caffeinated and
decaffeinated coke, quadratic effect of grade in the difference between caffeinated
and decaffeinated coke, interaction of linear and quadratic effects of grade.)

Exercise 4.4.5 (Histamine in dogs). In Table 4.2, we have the model

Y = xβz + R, (4.36)

where x (n × 4) describes a balanced two-way analysis of variance. The columns


represent, respectively, the overall mean, the drug effect, the depletion effect, and the
drug × depletion interaction. For the z, the first column is the effect of the “before"

measurement (at time 0), and the last three columns represent polynomial effects
(constant, linear, and quadratic) for just the three “after" time points (times 1, 3, 5).
(a) What is z? (b) What effects do the β ij ’s represent? (Choices: overall drug effect
for the after measurements, overall drug effect for the before measurement, average
“after" measurement, drug × depletion interaction for the “before" measurement,
linear effect in “after" time points for the drug effect.)

Exercise 4.4.6 (Leprosy). Below are data on leprosy patients found in Snedecor and
Cochran [1989]. There were 30 patients, randomly allocated to three groups of 10.
The first group received drug A, the second drug D, and the third group received a
placebo. Each person had their bacterial count taken before and after receiving the
treatment.
Drug A Drug D Placebo
Before After Before After Before After
11 6 6 0 16 13
8 0 6 2 13 10
5 2 7 3 11 18
14 8 8 1 9 5
(4.37)
19 11 18 18 21 23
6 4 8 4 16 12
10 13 19 14 12 5
6 1 8 9 12 16
11 8 5 1 7 1
3 0 15 9 12 20

(a) Consider the model Y = xβ + R for the multivariate analysis of variance with three
groups and two variables (so that Y is 30 × 2), where R ∼ N30×2 (0, I30 ⊗ Σ R ). The
x has vectors for the overall mean, the contrast between the drugs and the placebo,
and the contrast between Drug A and Drug D. Because there are ten people in each
group, x can be written as w ⊗ 110 . Find w. (b) Because the before measurements
were taken before any treatment, the means for the three groups on that variable
should be the same. Describe that constraint in terms of the β. (c) With Y = (Yb Ya ),
find the model for the conditional distribution

Ya | Yb = yb ∼ N (x∗ β∗ , In ⊗ Ω). (4.38)

Give the x∗ in terms of x and yb , and give Ω in terms of the elements of Σ R . (Hint:
Write down what it would be with E [Y] = (μb μ a ) using the conditional formula,
then see what you get when μb = xβ b and μ a = xβ a .)

Exercise 4.4.7 (Parity). Johnson and Wichern [2007] present data (in their Exercise
6.17) on an experiment. Each of 32 subjects was given several sets of pairs of integers,
and had to say whether the two numbers had the same parity (i.e., both odd or both
even), or different parities. So (1, 3) have the same parity, while (4, 5) have differ-
ent parity. Some of the integer pairs were given numerically, like (2, 4), and some
were written out, i.e., ( Two, Four ). The time it took to decide the parity for each pair
was measured. Each person had a little two-way analysis of variance, where the two
factors are Parity, with levels different and same, and Format, with levels word and nu-
meric. The measurements were the median time for each Parity/Format combination

for that person. Person i then had observation vector yi = (yi1 , yi2 , yi3 , yi4 ), which in
the ANOVA could be arranged as

Format
Parity Word Numeric
(4.39)
Different yi1 yi2
Same yi3 yi4

The model is of the form


Y = xβz + R. (4.40)
(a) What are x and z for the model where each person has a possibly different
ANOVA, and each ANOVA has effects for overall mean, parity effect, format effect,
and parity/format interaction? How many, and which, elements of β must be set to
zero to model no-interaction? (b) What are x and z for the model where each person
has the same mean vector, and that vector represents the ANOVA with effects for
overall mean, parity effect, format effect, and parity/format interaction? How many,
and which, elements of β must be set to zero to model no-interaction?

Exercise 4.4.8 (Sine waves). Let θ be an angle running from 0 to 2π, so that a sine/-
cosine wave with one cycle has the form

g(θ ) = A + B cos(θ + C ) (4.41)

for parameters A, B, and C. Suppose we observe the wave at the q equally-spaced


points

θj = (2π/q) j, j = 1, . . . , q, (4.42)
plus error, so that the model is


Yj = g(θj ) + Rj = A + B cos((2π/q) j + C) + Rj , j = 1, . . . , q, (4.43)

where the R j are the residuals. (a) Is the model linear in the parameters A, B, C? Why
or why not? (b) Show that the model can be rewritten as
 
Yj = β1 + β2 cos((2π/q) j) + β3 sin((2π/q) j) + Rj , j = 1, . . . , q, (4.44)

and give the β k ’s in terms of A, B, C. [Hint: What is cos( a + b )?] (c) Write this model
as a linear model, Y = βz + R, where Y is 1 × q. What is the z? (d) Waves with m ≥ 1
cycles can be added to the model by including cosine and sine terms with θ replaced
by mθ:

cos((2πm/q) j), sin((2πm/q) j). (4.45)
If q = 6, then with the constant term, we can fit in the cosine and sine terms for
the wave with m = 1 cycle, and the cosine and sine terms for the wave with m = 2
cycles. The x cannot have more than 6 columns (or else it won’t be invertible). Find
the cosine and sine terms for m = 3. What do you notice? Which one should you put
in the model?
Chapter 5

Linear Models: Least Squares and Projections

In this chapter, we briefly review linear subspaces and projections onto them. Most of
the chapter is abstract, in the sense of not necessarily tied to statistics. The main result
we need for the rest of the book is the least-squares estimate given in Theorem 5.2.
Further results can be found in Chapter 1 of Rao [1973], an excellent compendium of
facts on linear subspaces and matrices.

5.1 Linear subspaces


We start with the space R M . The elements y ∈ R M may be considered row vectors, or
column vectors, or matrices, or any other configuration. We will generically call them
“vectors.” This space could represent vectors for individuals, in which case M = q,
the number of variables, or it could represent vectors for variables, so M = n, the
number of individuals, or it could represent the entire data matrix, so that M = nq. A
linear subspace is one closed under addition and multiplication by a scalar. Because we
will deal with Euclidean space, everyone knows what addition and multiplication
mean. Here is the definition.

Definition 5.1. A subset W ⊂ R M is a linear subspace of R M if

x, y ∈ W =⇒ x + y ∈ W , and (5.1)
c ∈ R, x ∈ W =⇒ cx ∈ W . (5.2)

We often shorten “linear subspace” to “subspace,” or even “space.” Note that R M


is itself a linear (sub)space, as is the set {0}. Because c in (5.2) can be 0, any subspace
must contain 0. Any line through 0, or plane through 0, is a subspace. One convenient
representation of subspaces is the set of linear combinations of some elements:

Definition 5.2. The span of the set of vectors {d1 , . . . , dK } ⊂ R M is

span{d1 , . . . , dK } = {b1 d1 + · · · + bK dK | b = (b1 , . . . , bK ) ∈ RK }. (5.3)


By convention, the span of the empty set is just {0}. It is not hard to show that
any span is a linear subspace. Some examples: For K = 2, span{(1, 1)} is the set
of vectors of the form ( a, a), that is, the equiangular line through 0. For K = 3,
span{(1, 0, 0), (0, 1, 0)} is the set of vectors of the form ( a, b, 0), which is the x/y
plane, considering the axes to be x, y, z.
We will usually write the span in matrix form. Letting D be the M × K matrix with columns d1 , . . . , dK , we have the following representations of the subspace W :
W = span{d1 , . . . , dK }
= span{columns of D}
= {Db | b ∈ RK (b is K × 1)}
= span{rows of D′}
= {bD′ | b ∈ RK (b is 1 × K)}. (5.4)
Not only is any span a subspace, but any subspace is a span of some vectors. In
fact, any subspace of RK can be written as a span of at most K vectors, although not
in a unique way. For example, for K = 3,
span{(1, 0, 0), (0, 1, 0)} = span{(1, 0, 0), (0, 1, 0), (1, 1, 0)}
= span{(1, 0, 0), (1, 1, 0)}
= span{(2, 0, 0), (0, −7, 0), (33, 2, 0)}. (5.5)
Any invertible transformation of the vectors yields the same span, as in the next
lemma. See Exercise 5.6.4 for the proof.
Lemma 5.1. Suppose W is the span of the columns of the M × K matrix D as in (5.4), and
A is an invertible K × K matrix. Then W is also the span of the columns of DA, i.e.,
span{columns of D} = span{columns of DA}. (5.6)
Note that the space in (5.5) can be a span of two or three vectors, or a span of any
number more than three as well. It cannot be written as a span of only one vector.
In the two sets of three vectors, there is a redundancy, that is, one of the vectors can
be written as a linear combination of the other two: (1, 1, 0) = (1, 0, 0) + (0, 1, 0) and
(2, 0, 0) = (4/(33 × 7))(0, −7, 0) + (2/33) × (33, 2, 0). Such sets are called linearly
dependent. We first define the opposite.
Definition 5.3. The vectors d1 , . . . , dK in RK are linearly independent if
b1 d1 + · · · + bK dK = 0 =⇒ b1 = · · · = bK = 0. (5.7)
Equivalently, the vectors are linearly independent if no one of them (as long as
it is not 0) can be written as a linear combination of the others. That is, there is no
di ≠ 0 and set of coefficients bj such that
d i = b1 d1 + · · · + b i − 1 d i − 1 + b i + 1 d i + 1 + . . . + b K d K . (5.8)
The vectors are linearly dependent if and only if they are not linearly independent.
In (5.5), the sets with three vectors are linearly dependent, and those with two
vectors are linearly independent. To see the latter fact for {(1, 0, 0), (1, 1, 0)}, suppose
that b1 (1, 0, 0) + b2 (1, 1, 0) = (0, 0, 0). Then
b1 + b2 = 0 and b2 = 0 =⇒ b1 = b2 = 0, (5.9)

which verifies (5.7).


If a set of vectors is linearly dependent, then one can remove one of the redundant
vectors (5.8), and still have the same span. A basis is a set of vectors that has the same
span but no dependencies.
Definition 5.4. The set of vectors {d1 , . . . , dK } is a basis for the subspace W if the vectors
are linearly independent and W = span{d1 , . . . , dK }.
Although a (nontrivial) subspace has many bases, each basis has the same number
of elements, which is the dimension. See Exercise 5.6.34.
Definition 5.5. The dimension of a subspace is the number of vectors in any of its bases.

5.2 Projections
In linear models, the mean of the data matrix is presumed to lie in a linear subspace,
and an aspect of fitting the model is to find the point in the subspace closest to the
data. This closest point is called the projection. Before we get to the formal definition,
we need to define orthogonality. Recall from Section 1.5 that two column vectors v
and w are orthogonal if v w = 0 (or vw = 0 if they are row vectors).
Definition 5.6. The vector v ∈ R M is orthogonal to the subspace W ⊂ R M if v is orthogonal
to w for all w ∈ W . Also, subspace V ⊂ R M is orthogonal to W if v and w are orthogonal
for all v ∈ V and w ∈ W .
Geometrically, two objects are orthogonal if they are perpendicular. For example,
in R3 , the z-axis is orthogonal to the x/y-plane. Exercise 5.6.6 is to prove the next
result.
Lemma 5.2. Suppose W = span{d1 , . . . , dK }. Then y is orthogonal to W if and only if y
is orthogonal to each d j .
Definition 5.7. The projection of y onto W is the ŷ that satisfies

ŷ ∈ W and y − ŷ is orthogonal to W . (5.10)

In statistical parlance, the projection ŷ is the fit and y − ŷ is the residual for the model. Because of the orthogonality, we have the decomposition of squared norms,

‖y‖² = ‖ŷ‖² + ‖y − ŷ‖², (5.11)

which is Pythagoras’ Theorem. In a regression setting, the left-hand side is the total sum-of-squares, and the right-hand side is the regression sum-of-squares (‖ŷ‖²) plus the residual sum-of-squares, although usually the sample mean of the yi’s is subtracted from y and ŷ.
Exercise 5.6.8 proves the following useful result.
Theorem 5.1 (Projection). Suppose y ∈ RK and W is a subspace of RK , and ŷ is the projection of y onto W . Then

(a) The projection is unique: If ŷ1 and ŷ2 are both in W , and y − ŷ1 and y − ŷ2 are both orthogonal to W , then ŷ1 = ŷ2 .

(b) If y ∈ W , then ŷ = y.

(c) If y is orthogonal to W , then ŷ = 0.

(d) The projection ŷ uniquely minimizes the Euclidean distance between y and W , that is,

‖y − ŷ‖² < ‖y − w‖² for all w ∈ W , w ≠ ŷ. (5.12)

5.3 Least squares


In this section, we explicitly find the projection of y (1 × M) onto W . Suppose
d1 , . . . , dK , the (transposed) columns of the M × K matrix D, form a basis for W ,
so that the final expression in (5.4) holds. Our objective is to find a b, 1 × K, so that

y ≈ bD′. (5.13)

In Section 5.3.1, we specialize to the both-sides model (4.28). Our first objective is to
find the best value of b, where we define “best” by least squares.
Definition 5.8. A least-squares estimate of b in the equation (5.13) is any b̂ such that

‖y − b̂D′‖² = min b∈RK ‖y − bD′‖². (5.14)

Part (d) of Theorem 5.1 implies that a least squares estimate of b is any b̂ for which b̂D′ is the projection of y onto the subspace W . Thus y − b̂D′ is orthogonal to W , and by Lemma 5.2, is orthogonal to each dj . This gives the normal equations:

(y − b̂D′)dj = 0 for each j = 1, . . . , K. (5.15)

We then have

(5.15) =⇒ (y − b̂D′)D = 0
       =⇒ b̂D′D = yD (5.16)
       =⇒ b̂ = yD(D′D)−1 , (5.17)

where the final equation holds if D′D is invertible, which occurs if and only if the
columns of D constitute a basis of W . See Exercise 5.6.33. Summarizing:

Theorem 5.2 (Least squares). Any solution b̂ to the least-squares equation (5.14) satisfies the normal equations (5.16). The solution is unique if and only if D′D is invertible, in which case (5.17) holds.

If D′D is invertible, the projection of y onto W can be written

ŷ = b̂D′ = yD(D′D)−1 D′ ≡ yPD , (5.18)

where

PD = D(D′D)−1 D′. (5.19)
The matrix PD is called the projection matrix for W . The residuals are then

y − ŷ = y − yPD = y(IM − PD ) = yQD , (5.20)

where
QD = I M − PD . (5.21)
The minimum value in (5.14) is then

‖y − ŷ‖² = yQD y′, (5.22)

and the fit and residuals are orthogonal:

ŷ(y − ŷ)′ = 0. (5.23)

These two facts are consequences of parts (a) and (c) of the following proposition.
See Exercises 5.6.10 to 5.6.12.
Proposition 5.1 (Projection matrices). Suppose PD is defined as in (5.18), where D D is
invertible. Then the following hold.

(a) PD is symmetric and idempotent, with trace(PD ) = K, the dimension of W ;


(b) Any symmetric idempotent matrix is a projection matrix for some subspace W ;
(c) QD = I M − PD is also a projection matrix, and is orthogonal to PD in the sense that
PD QD = QD PD = 0;
(d) PD D = D and QD D = 0.

The matrix QD is the projection matrix onto the orthogonal complement of W ,


where the orthogonal complement contains all vectors in R M that are orthogonal to
W.

5.3.1 Both-sides model


We apply least squares to estimate β in the both-sides model. Chapter 6 will derive
its distribution. Recall the model

Y = xβz′ + R, R ∼ N (0, In ⊗ Σ R ) (5.24)

from (4.28), where Y is n × q, x is n × p, β is p × l, and z is q × l. Vectorizing Y and


β, by property (3.32d) of Kronecker products,

row(Y) = row( β)(x ⊗ z)′ + row(R). (5.25)

Then the least squares estimate of β is found as in (5.17), where we make the identi-
fications y = row(Y), b = row( β) and D = x ⊗ z (hence M = nq and K = pl):


row( β̂) = row(Y)(x ⊗ z)[(x ⊗ z)′(x ⊗ z)]−1
        = row(Y)(x ⊗ z)(x′x ⊗ z′z)−1
        = row(Y)(x(x′x)−1 ⊗ z(z′z)−1 ). (5.26)

See Proposition 3.2. Now we need that x′x and z′z are invertible. Undoing as in (3.32d), the estimate can be written

β̂ = (x′x)−1 x′Yz(z′z)−1 . (5.27)



In multivariate regression of Section 4.2, z is non-existent (actually, z = Iq ), so that the model and estimate are

Y = xβ + R and β̂ = (x′x)−1 x′Y, (5.28)

the usual estimate for regression. The repeated measures and growth curve models such as in (4.23) have x = In , so that

Y = βz′ + R and β̂ = Yz(z′z)−1 . (5.29)

Thus, indeed, the both-sides model has estimating matrices on both sides.
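A minimal R version of (5.27)–(5.29), written as a small helper function (the name is just for illustration):

bothsides_ls <- function(Y, x, z) {
  # (x'x)^{-1} x' Y z (z'z)^{-1}, the least-squares estimate (5.27)
  solve(t(x) %*% x) %*% t(x) %*% Y %*% z %*% solve(t(z) %*% z)
}
# Multivariate regression (5.28) is the special case z = diag(ncol(Y));
# the models with x = I_n, as in (5.29), take x = diag(nrow(Y)).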

5.4 What is a linear model?


We have been working with linear models for a while, so perhaps it is time to formally
define them. Basically, a linear model for Y is one for which the mean of Y lies in a
given linear subspace. A model itself is a set of distributions. The linear model does
not describe the entire distribution, thus the actual distribution, e.g., multivariate
normal with a particular covariance structure, needs to be specified as well. The
general model we have been using is the both-sides model (4.28) for given matrices x
and z. The Y, hence its mean, is an n × q matrix, which resides in RK with K = nq.
As in (5.25),
E [row(Y)] = row( β)(x ⊗ z)′. (5.30)
Letting β range over all the p × l matrices, we have the restriction

E [row(Y)] ∈ W ≡ {row( β)(x ⊗ z)′ | row( β) ∈ R pl }. (5.31)

Is W a linear subspace? Indeed, as in Definition 5.2, it is the span of the transposed


columns of x ⊗ z, the columns being xi ⊗ z j , where xi and z j are the columns of x and
z, respectively. That is,
$$
\operatorname{row}(\beta)(x \otimes z)' = \sum_{i=1}^{p} \sum_{j=1}^{l} \beta_{ij}\, (x_i \otimes z_j)'. \qquad (5.32)
$$

The linear model is then the set of distributions

M = { N (M, In ⊗ Σ) | M ∈ W and Σ ∈ Sq+ }, (5.33)

denoting

Sq+ = The set of q × q positive definite symmetric matrices. (5.34)

Other linear models can have different distributional assumptions, e.g., covariance
restrictions, but do have to have the mean lie in a linear subspace.
There are many different parametrizations of a given linear model, for the same
reason that there are many different bases for the mean space W . For example, it
may not be obvious, but
$$
x = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad z = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \end{pmatrix} \qquad (5.35)
$$

and
$$
x^* = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}, \qquad z^* = \begin{pmatrix} 1 & -1 & 1 \\ 1 & 0 & -2 \\ 1 & 1 & 1 \end{pmatrix} \qquad (5.36)
$$
lead to exactly the same model, though different interpretations of the parameters.
In fact, with x being n × p and z being q × l,

x∗ = xA and z∗ = zB, (5.37)

yields the same model as long as A (p × p) and B (l × l) are invertible:

xβz′ = x∗ β∗ z∗′ with β∗ = A−1 β(B′)−1 . (5.38)

The representation in (5.36) has the advantage that the columns of the x∗ ⊗ z∗ are
orthogonal, which makes it easy to find the least squares estimates as the D D matrix
is diagonal, hence easy to invert. Note the z is the matrix for a quadratic. The z∗ is
the corresponding set of orthogonal polynomials, as discussed in Section 5.5.2.

5.5 Gram-Schmidt orthogonalization


We have seen polynomial models in (4.6), (4.22) and (4.29). Note that, especially
in the latter case, one can have a design matrix (x or z) whose entries have widely
varying magnitudes, as well as highly correlated vectors, which can lead to numerical
difficulties in calculation. Orthogonalizing the vectors, without changing their span,
can help both numerically and for interpretation. Gram-Schmidt orthogonalization
is a well-known constructive approach. It is based on the following lemma.
Lemma 5.3. Suppose (D1 , D2 ) is M × K, where D1 is M × K1 , D2 is M × K2 , and W is
the span of the combined columns:

W = span{columns of (D1 , D2 )}. (5.39)

Suppose D1 D1 is invertible, and let

D2·1 = QD1 D2 , (5.40)

for QD1 defined in (5.21) and (5.19). Then the columns of D1 and D2·1 are orthogonal,

D2 ·1 D1 = 0, (5.41)

and
W = span{columns of (D1 , D2·1 )}. (5.42)

Proof. D2·1 is the matrix of residuals for the least-squares model D2 = D1 β + R, i.e.,
D1 is the x and D2 is the Y in the multivariate regression model (5.28). Equation
(5.41) then follows from part (d) of Proposition 5.1: D2 ·1 D1 = D2 QD1 D1 = 0. For
(5.42),
$$
\begin{pmatrix} D_1 & D_{2\cdot 1} \end{pmatrix} = \begin{pmatrix} D_1 & D_2 \end{pmatrix}
\begin{pmatrix} I_{K_1} & -(D_1'D_1)^{-1} D_1'D_2 \\ 0 & I_{K_2} \end{pmatrix}. \qquad (5.43)
$$
The final matrix is invertible, hence by Lemma 5.1, the spans of the columns of
(D1 , D2 ) and (D1 , D2·1 ) are the same.

Now let d1 , . . . dK be the columns of D ≡ (D1 , D2 ), and W their span as in (5.39).


The Gram-Schmidt process starts by applying Lemma 5.3 with D1 = d1 and D2 =
(d2 , . . . , dK ). Dotting out the “1” on the columns as well, we write the resulting
columns of D2·1 as the vectors
$$
d_{2\cdot 1},\ \ldots,\ d_{K\cdot 1}, \qquad \text{where}\quad d_{j\cdot 1} = d_j - \frac{d_j' d_1}{\|d_1\|^2}\, d_1. \qquad (5.44)
$$
In other words, d j·1 is the residual of the projection of d j onto span{d1 }. Thus d1 is
orthogonal to all the d j·1 ’s in (5.44), and W = span{d1 , d2·1 , . . . , dK ·1 } by (5.42).
Second step is to apply the lemma again, this time with D1 = d2·1 , and D2 =
(d3·1 , . . . , dK ·1 ), leaving aside the d1 for the time being. Now write the columns of
the new D2·1 dotting out the “1” and “2”:
$$
d_{3\cdot 12},\ \ldots,\ d_{K\cdot 12}, \qquad \text{where}\quad d_{j\cdot 12} = d_{j\cdot 1} - \frac{d_{j\cdot 1}' d_{2\cdot 1}}{\|d_{2\cdot 1}\|^2}\, d_{2\cdot 1}. \qquad (5.45)
$$
Now d2·1 , as well as d1 , are orthogonal to the vectors in (5.45), and
W = span{d1 , d2·1 , d3·12 , . . . , dK ·12 }. (5.46)
We continue until we have the set of vectors
d1 , d2·1 , d3·12 , . . . , dK ·{1: ( K −1)}, (5.47)
which are mutually orthogonal and span W . Here, we are using the R-based notation
{ a : b } = { a, a + 1, . . . , b } for integers a < b. (5.48)
It is possible that one or more of the vectors we use for D1 will be zero. In such
cases, we just leave the vectors in D2 alone, i.e., D2·1 = D2 , because the projection
of any vector on the space {0} is 0, hence the residual equals the original vector.
We can describe the entire resulting process iteratively, for i = 1, . . . , K − 1 and j =
i + 1, . . . , K, as setting

d j·{1: ( i−1)} − bij di·{1: ( i−1)} if di·{1: ( i−1)} = 0
d j·{1:i} = (5.49)
d j·{1: ( i−1)} if di·{1: ( i−1)} = 0,
where
dj·{1: ( i−1)}di·{1: ( i−1)}
bij = (5.50)
di·{1: ( i−1)}2
if its denominator is nonzero. Otherwise, set bij = 0, although any value will do.
Optionally, one can multiply any of these vectors by a nonzero constant, e.g., so
that it has a norm of one, or for esthetics, so that the entries are small integers. Any
zero vectors left in the set can be eliminated without affecting the span.
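Here is a sketch in R of the steps (5.49)–(5.50); the columns of D are orthogonalized in place, and exactly-zero columns are skipped as described above.

gram_schmidt <- function(D) {
  K <- ncol(D)
  for (i in 1:(K - 1)) {
    di <- D[, i]
    ss <- sum(di^2)
    if (ss > 0) {
      for (j in (i + 1):K) {
        bij <- sum(D[, j] * di) / ss    # the coefficient in (5.50)
        D[, j] <- D[, j] - bij * di     # the update in (5.49)
      }
    }
  }
  D
}
# For example, gram_schmidt(cbind(1, 0:3, (0:3)^2, (0:3)^3)) gives columns
# proportional to those of the orthogonal polynomial matrix in (5.72).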
Note that by the stepwise nature of the algorithm, we have the spans of the first k
vectors from each set are equal, that is,
span{d1 , d2 } = span{d1 , d2·1 }
span{d1 , d2 , d3 } = span{d1 , d2·1 , d3·12 }
..
.
span{d1 , d2 , . . . , dK } = span{d1 , d2·1 , d3·12 , . . . , dK ·{1: ( K −1)}}. (5.51)

The next section derives some important matrix decompositions based on the
Gram-Schmidt orthogonalization. Section 5.5.2 applies the orthogonalization to poly-
nomials.

5.5.1 The QR and Cholesky decompositions


We can write the Gram-Schmidt process in matrix form. The first step is
⎛ ⎞
1 − b12 ··· − b1K
  ⎜ 0 1 ··· 0 ⎟
⎜ ⎟
d1 d2 · 1 ··· dK ·1 = D⎜ .. .. .. .. ⎟, (5.52)
⎝ . . . . ⎠
0 0 ··· 1

The bij ’s are defined in (5.50). Next,


 
d1 d2 · 1 d3·12 ··· dK ·12
⎛ ⎞
1 0 0 ··· 0
⎜ 0 1 − b23 ··· − b2K ⎟
 ⎜
⎜ 0 0 1 ··· 0


= d1 d2 · 1 ··· dK ·1 ⎜ ⎟. (5.53)
⎜ .. .. .. .. .. ⎟
⎝ . . . . . ⎠
0 0 0 ··· 1

We continue, so that the final result is


 
D∗ ≡ d1 d2 · 1 d3·12 · · · dK ·{1: ( K −1)} = DB(1) B(2) · · · B( K −1), (5.54)

where B( k) is the identity except for the elements kj, j > k:




⎨1 if i = j
(k)
Bij = − bkj if j > k = i (5.55)

⎩0 otherwise.

These matrices are upper unitriangular, meaning they are upper triangular (i.e.,
all elements below the diagonal are zero), and all diagonal elements are one. We will
use the notation

Tq1 = {T | T is q × q, tii = 1 for all i, tij = 0 for i > j}. (5.56)

Such matrices form an algebraic group. A group of matrices is a set G of N × N


invertible matrices g that is closed under multiplication and inverse:

g1 , g2 ∈ G ⇒ g1 g2 ∈ G , (5.57)
−1
g∈G ⇒ g ∈ G. (5.58)

Thus we can write

D = D∗ B−1 , where B = B(1) · · · B( K −1) . (5.59)



Exercise 5.6.19 shows that


⎛ ⎞
1 b12 b13 ··· b1K
⎜ 0 1 b23 ··· b2K ⎟
⎜ ⎟
⎜ ··· b3K ⎟
B −1 =⎜ 0 0 1 ⎟. (5.60)
⎜ .. .. .. .. .. ⎟
⎝ . . . . . ⎠
0 0 0 ··· 1

Now suppose the columns of D are linearly independent, which means that all the
columns of D∗ are nonzero (See Exercise 5.6.21.) Then we can divide each column of
D∗ by its norm, so that the resulting vectors are orthonormal:
$$
q_i = \frac{d_{i\cdot\{1:(i-1)\}}}{\|d_{i\cdot\{1:(i-1)\}}\|}, \qquad Q = \begin{pmatrix} q_1 & \cdots & q_K \end{pmatrix} = D^* \Delta^{-1}, \qquad (5.61)
$$

where Δ is the diagonal matrix with the norms on the diagonal. Letting R = ΔB−1 ,
we have that
D = QR, (5.62)
where R is upper triangular with positive diagonal elements, the Δii ’s. The set of
such matrices R is also group, denoted by

Tq+ = {T | T is q × q, tii > 0 for all i, tij = 0 for i > j}. (5.63)

Hence we have the next result. The uniqueness for M = K is shown in Exercise 5.6.26.
Theorem 5.3 (QR-decomposition). Suppose the M × K matrix D has linearly independent
columns (hence K ≤ M). Then there is a unique decomposition D = QR, where Q, M × K,
has orthonormal columns and R ∈ TK+ .
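In R, qr() computes a QR decomposition; flipping signs where needed gives the version of Theorem 5.3 with positive diagonal elements in R. A short illustration:

D <- cbind(1, 1:4, (1:4)^2)
qrD <- qr(D)
Q <- qr.Q(qrD); R <- qr.R(qrD)
s <- sign(diag(R))
Q <- Q %*% diag(s); R <- diag(s) %*% R     # now diag(R) > 0, and still D = QR
all.equal(Q %*% R, D)                      # TRUE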
Gram-Schmidt also has useful implications for the matrix S = D D. From (5.43)
we have
   
IK 1 0 D1 D1 0 IK1 (D1 D1 )−1 D1 D2
S=
D2 D1 (D1 D1 )−1 IK2 0 D2 ·1 D2·1 0 IK 2
  − 1

IK 1 0 S11 0 IK1 S11 S12
= −1 , (5.64)
S21 S11 IK 2 0 S22·1 0 IK 2
−1
where S22·1 = S22 − S21 S11 S12 as in (3.49). See Exercise 5.6.27. Then using steps as
in Gram-Schmidt, we have
⎛ ⎞
S11 0 0 ··· 0
⎜ 0 S22·1 0 ··· 0 ⎟
⎜ ⎟
− 1 ⎜ 0 0 S · · · · 0 ⎟ −1
S = (B ) ⎜ 33 12 ⎟B
⎜ .. .. .. .. .. ⎟
⎝ . . . . . ⎠
0 0 0 · · · SKK ·{1: ( K −1)}
= R R, (5.65)

because the inner matrix is Δ2 . Also, note that

bij = Sij·{1: ( i−1)}/Sii·{1: ( i−1)} for j > i, (5.66)



and R is given by


$$
R_{ij} = \begin{cases} \sqrt{S_{ii\cdot\{1\cdots i-1\}}} & \text{if } j = i, \\ S_{ij\cdot\{1\cdots i-1\}} / \sqrt{S_{ii\cdot\{1\cdots i-1\}}} & \text{if } j > i, \\ 0 & \text{if } j < i. \end{cases} \qquad (5.67)
$$

Exercise 5.6.30 shows this decomposition works for any positive definite symmetric
matrix. It is then called the Cholesky decomposition:
Theorem 5.4 (Cholesky decomposition). If S ∈ Sq+ (5.34), then there exists a unique
R ∈ Tq+ such that S = R′R.

Note that this decomposition yields another square root of S.
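In R, chol() returns exactly the upper-triangular R of Theorem 5.4, so that t(R) %*% R recovers S. A quick check:

S <- matrix(c(4, 2, 2, 3), 2, 2)   # a small positive definite matrix
R <- chol(S)
all.equal(t(R) %*% R, S)           # TRUE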

5.5.2 Orthogonal polynomials


Turn to polynomials. We will illustrate with the example on mouth sizes in (4.29).
Here K = 4, and we will consider the cubic model, so that the vectors are
⎛ ⎞
1 8 82 83
  ⎜ 1 10 102 103 ⎟
d1 d2 d3 d4 = ⎜ ⎝ 1 12 122 123 ⎠ .
⎟ (5.68)
1 14 14 2 14 3

Note that the ages (values 8, 10, 12, 14) are equally spaced. Thus we can just as well
code the ages as (0,1,2,3), so that we actually start with
⎛ ⎞
1 0 0 0
  ⎜ 1 1 1 1 ⎟
d1 d2 d3 d4 = ⎜ ⎝ 1 2 4 8 ⎠.
⎟ (5.69)
1 3 9 27

Dotting d1 out of vector w is equivalent to subtracting the mean of the elements


of w for each element. Hence
⎛ ⎞ ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ ⎛ ⎞
−3/2 −3 −7/2 −7 −9
⎜ −1/2 ⎟ ⎜ −1 ⎟ ⎜ −5/2 ⎟ ⎜ −5 ⎟ ⎜ −8 ⎟
d2 · 1 = ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜
⎝ 1/2 ⎠ → ⎝ 1 ⎠ , d3·1 = ⎝ 1/2 ⎠ → ⎝ 1 ⎠ , d4·1 = ⎝ −1 ⎠ .
⎟ ⎜ ⎟
3/2 3 11/2 11 18
(5.70)
where we multiplied the first two vectors in (5.70) by 2 for simplicity. Next, we dot
d2·1 out of the last two vectors. So for d3·1 , we have
⎛ ⎞ ⎛ ⎞ ⎛ ⎞ ⎛ ⎞
−7 −3 2 1

⎜ −5 ⎟ (−7, −5, 1, 11) (−3, −1, 1, 3) ⎜ −1 ⎟ ⎜ −2 ⎟ ⎜ −1 ⎟
d3·12 = ⎜ ⎝ 1 ⎠−
⎟ ⎜ ⎟ ⎜ ⎟
⎝ 1 ⎠ = ⎝ −2 ⎠ → ⎝ −1 ⎠ ,
⎜ ⎟
(−3, −1, 1, 3)2
11 3 2 1
(5.71)
and, similarly, d4·12 = (4.2, −3.6, −5.4, 4.8) → (7, −6, −9, 8) . Finally, we dot d3·12
out of d4·12 to obtain d4·123 = (−1, 3, −3, 1) . Then our final orthogonal polynomial

matrix is the very nice


$$
\begin{pmatrix} d_1 & d_{2\cdot 1} & d_{3\cdot 12} & d_{4\cdot 123} \end{pmatrix} =
\begin{pmatrix} 1 & -3 & 1 & -1 \\ 1 & -1 & -1 & 3 \\ 1 & 1 & -1 & -3 \\ 1 & 3 & 1 & 1 \end{pmatrix}. \qquad (5.72)
$$
Some older statistics books (e.g., Snedecor and Cochran [1989]) contain tables of or-
thogonal polynomials for small K, and statistical packages will calculate them for
you. In R, the function is poly.
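For example, the call below (for the equally-spaced ages of the mouth size data) returns the orthogonal polynomial columns of degrees 1, 2, and 3; poly() scales each column to have norm one and omits the constant column, so rescaling by suitable constants recovers the small-integer entries in (5.72).

round(poly(c(8, 10, 12, 14), degree = 3), 4)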
A key advantage to using orthogonal polynomials over the original polynomial
vectors is that, by virtue of the sequence in (5.51), one can estimate the parameters
for models of all degrees at once. For example, consider the mean of the girls’ mouth
sizes in (4.14) as the y, and the matrix in (5.72) as the D, in the model (5.13):
$$
(21.18,\ 22.23,\ 23.09,\ 24.09) \approx (b_1, b_2, b_3, b_4)
\begin{pmatrix} 1 & 1 & 1 & 1 \\ -3 & -1 & 1 & 3 \\ 1 & -1 & -1 & 1 \\ -1 & 3 & -3 & 1 \end{pmatrix}. \qquad (5.73)
$$

Because D′D is diagonal, the least-squares estimates of the coefficients are found via
$$
\hat{b}_j = \frac{y\, d_{j\cdot\{1:(j-1)\}}}{\|d_{j\cdot\{1:(j-1)\}}\|^2}, \qquad (5.74)
$$
which here yields

b̂ = (22.6475, 0.4795, −0.0125, 0.0165). (5.75)
These are the coefficients for the cubic model. The coefficients for the quadratic model set b̂4 = 0, but the other three are as for the cubic. Likewise, the linear model has b̂ equalling (22.6475, 0.4795, 0, 0), and the constant model has (22.6475, 0, 0, 0).
In contrast, if one uses the original vectors in either (5.68) or (5.69), one has to recal-
culate the coefficients separately for each model. Using (5.69), we have the following
estimates:
Model        b̂∗1       b̂∗2       b̂∗3       b̂∗4
Cubic        21.1800   1.2550   −0.2600   0.0550
Quadratic    21.1965   0.9965   −0.0125   0
Linear       21.2090   0.9590   0         0
Constant     22.6475   0         0         0          (5.76)
Note that the non-zero values in each column are not equal.

5.6 Exercises
Exercise 5.6.1. Show that the span in (5.3) is indeed a linear subspace.
Exercise 5.6.2. Verify that the four spans given in (5.5) are the same.
Exercise 5.6.3. Show that for matrices C (M × J) and D (M × K),
span{columns of D} ⊂ span{columns of C} ⇒ D = CA, (5.77)
for some J × K matrix A. [Hint: Each column of D must be a linear combination of
the columns of C.]

Exercise 5.6.4. Here, D is an M × K matrix, and A is K × L. (a) Show that

span{columns of DA} ⊂ span{columns of D}. (5.78)

[Hint: Any vector in the left-hand space equals DAb for some L × 1 vector b. For
what vector b∗ is DAb = Db∗ ?] (b) Prove Lemma 5.1. [Use part (a) twice, once for
A and once for A−1 .] (c) Show that if the columns of D are linearly independent,
and A is K × K and invertible, then the columns of DA are linearly independent.
[Hint: Suppose the columns of DA are linearly dependent, so that for some b = 0,
DAb = 0. Then there is a b∗ = 0 with Db∗ = 0. What is it?]
Exercise 5.6.5. Let d1 , . . . , dK be vectors in R M . (a) Suppose (5.8) holds. Show that
the vectors are linearly dependent. [That is, find b j ’s, not all zero, so that ∑ bi di = 0.]
(b) Suppose the vectors are linearly dependent. Find an index i and constants b j so
that (5.8) holds.
Exercise 5.6.6. Prove Lemma 5.2.
Exercise 5.6.7. Suppose the set of M × 1 vectors γ1 , . . . , γK are nonzero and mutually
orthogonal. Show that they are linearly independent. [Hint: Suppose they are linearly
dependent, and let γi be the vector on the left-hand side in (5.8). Then take γi times
each side of the equation, to arrive at a contradiction.]
Exercise 5.6.8. Prove part (a) of Theorem 5.1. [Hint: Show that the difference of y − 
y1
and y − y2 is orthogonal to W , as well as in W . Then show that such a vector must be
zero.] (b) Prove part (b) of Theorem 5.1. (c) Prove part (c) of Theorem 5.1. (d) Prove
part (d) of Theorem 5.1. [Hint: Start by writing y − w2 = (y −  y ) − (w − 
y)2 ,
then expand. Explain why y −  y and w − y are orthogonal.]
Exercise 5.6.9. Derive the normal equations (5.15) by differentiating y − bD 2 with
respect to the bi ’s.

Exercise 5.6.10. This Exercise proves part (a) of Proposition 5.1. Suppose W =
span{columns of D}, where D is M × K and D D is invertible. (a) Show that the
projection matrix PD = D(D D)−1 D as in (5.19) is symmetric and idempotent. (b)
Show that trace(PD ) = K.
Exercise 5.6.11. This Exercise proves part (b) of Proposition 5.1. Suppose P is a
symmetric and idempotent M × M matrix. Find a set of linearly independent vectors
d1 , . . . , dK , where K = trace(P), so that P is the projection matrix for span{d1 , . . . , dK }.
[Hint: Write P = Γ1 Γ1 where Γ1 has orthonormal columns, as in Lemma 3.1. Show
that P is the projection matrix onto the span of the columns of the Γ1 , and use Exercise
5.6.7 to show that those columns are a basis. What is D, then?]

Exercise 5.6.12. (a) Prove part (c) of Proposition 5.1. (b) Prove part (d) of Proposition
5.1. (c) Prove (5.22). (d) Prove (5.23).
Exercise 5.6.13. Consider the projection of y ∈ RK onto span{1K }. (a) Find the
projection. (b) Find the residual. What does it contain? (c) Find the projection matrix
P. What is Q = IK − P? Have we seen it before?
Exercise 5.6.14. Verify the steps in (5.26), detailing which parts of Proposition 3.2 are
used at each step.

Exercise 5.6.15. Show that the equation for d j·1 in (5.44) does follow from the deriva-
tion of D2·1 .
Exercise 5.6.16. Give an argument for why the set of equations in (5.51) follows from
the Gram-Schmidt algorithm.
Exercise 5.6.17. Given that a subspace is a span of a set of vectors, explain how one
would obtain an orthogonal basis for the space.
Exercise 5.6.18. Let Z1 be a M × K matrix with linearly independent columns. (a)
How would you find a M × ( M − K ) matrix Z2 so that (Z1 , Z2 ) is an invertible M × M
matrix, and Z1 Z2 = 0 (i.e., the columns of Z1 are orthogonal to those of Z2 ). [Hint:
Start by using Lemma 5.3 with D1 = Z1 and D2 = I M . (What is the span of the
columns of (Z1 , I M )?) Then use Gram-Schmidt on D2·1 to find a set of vectors to use
as the Z2 . Do you recognize D2·1 ?] (b) Suppose the columns of Z are orthonormal.
How would you modify the Z2 in part (a) so that (Z1 , Z2 ) is an orthogonal matrix?
Exercise 5.6.19. Consider the matrix B( k) defined in (5.55). (a) Show that the inverse
of B( k) is of the same form, but with the − bkj ’s changed to bkj ’s. That is, the inverse
is the K × K matrix C( k) , where


⎨1 if i = j
(k)
Cij = bkj if j > k = i (5.79)

⎩0 otherwise.

Thus C is the inverse of the B in (5.59), where C = C( K −1) · · · C(1) . (b) Show that
C is unitriangular, where the bij ’s are in the upper triangular part, i.e, Cij = bij for
j > i, as in (5.60). (c) The R in (5.62) is then ΔC, where Δ is the diagonal matrix with
diagonal elements being the norms of the columns of D∗ . Show that R is given by

⎨di·{1: ( i−1)}
⎪ if j = i
Rij = dj·{1: ( i−1)}di·{1: ( i−1)}/di·{1: ( i−1)} if j > i (5.80)


0 if j < i.
Exercise 5.6.20. Verify (5.66).
Exercise 5.6.21. Suppose d1 , . . . , dK are vectors in R M , and d1∗ , . . . , d∗K are the corre-
sponding orthogonal vectors resulting from the Gram-Schmidt algorithm, i.e., d1∗ =
d1 , and for i > 1, d∗i = di·{1: ( i−1)} in (5.49). (a) Show that the d1∗ , . . . , d∗K are linearly
independent if and only if they are all nonzero. Why? [Hint: Recall Exercise 5.6.7.]
(b) Show that d1 , . . . , dK are linearly independent if and only if all the d∗j are nonzero.

Exercise 5.6.22. Suppose D is M × K, with linearly independent columns, and D =


QR is its QR decomposition. Show that span{columns of D} = span{columns of Q}.
Exercise 5.6.23. Suppose D is an M × M matrix whose columns are linearly indepen-
dent. Show that D is invertible. [Hint: Use the QR decomposition in Theorem 5.3.
What kind of a matrix is the Q here? Is it invertible?]
Exercise 5.6.24. (a) Show that span{columns of Q} = R M if Q is an M × M orthog-
onal matrix. (b) Suppose the M × 1 vectors d1 , . . . , d M are linearly independent, and
W = span{d1 , . . . , d M }. Show that W = R M . [Hint: Use Theorem 5.3, Lemma 5.1,
and part (a).]

Exercise 5.6.25. Show that if d1 , . . . , dK are vectors in R M with K > M, that the di ’s
are linearly dependent. (This fact should make sense, since there cannot be more axes
than there are dimensions in Euclidean space.) [Hint: Use Exercise 5.6.24 on the first
M vectors, then show how d M+1 is a linear combination of them.]
Exercise 5.6.26. Show that the QR decomposition in Theorem 5.3 is unique when
M = K. That is, suppose Q1 and Q2 are K × K orthogonal matrices, and R1 and
R2 are K × K upper triangular matrices with positive diagonals, and Q1 R1 = Q2 R2 .
Show that Q1 = Q2 and R1 = R2 . [Hint: Show that Q ≡ Q2 Q1 = R2 R1−1 ≡ R,
so that the orthogonal matrix Q equals the upper triangular matrix R with positive
diagonals. Show that therefore Q = R = IK .] [Extra credit: Show the uniqueness
when M > K.]
Exercise 5.6.27. Verify (5.64). In particular: (a) Show that
−1 
IK 1 A IK 1 −A
= . (5.81)
0 IK 2 0 IK 2

(b) Argue that the 0s in the middle matrix on the left-hand side of (5.64) are correct.
(c) Show S22·1 = D2 ·1 D2·1 .

Exercise 5.6.28. Suppose 


S11 S12
S= , (5.82)
S21 S22
where S11 is K1 × K1 and S22 is K2 × K2 , and S11 is invertible. (a) Show that

|S| = |S11 | |S22·1 |. (5.83)

[Hint: Use (5.64).] (b) Show that


⎛ −1 −1 −1 −1 −1 −1 ⎞
S11 + S11 S12 S22 ·1 S21 S11 −S11 S12 S22 ·1
−1
S =⎝ ⎠. (5.84)
−S22−1 S S −1 −1
S22
·1 21 11 ·1

[Hint: Use (5.64) and (5.81).] (c) Use part (b) to show that

[S−1 ]22 = S22


−1
·1 , (5.85)

where [S−1 ]22 is the lower-right K2 × K2 block of S−1 . Under what condition on the
−1
Sij ’s is [S−1 ]22 = S22 ?

Exercise 5.6.29. For S ∈ SK+ , show that

|S| = S11 S22·1 S33·12 · · · SKK ·{1··· K −1} . (5.86)

[Hint: Use (5.65)). What is the determinant of a unitriangular matrix?]

Exercise 5.6.30. Suppose S ∈ SK+ . Prove Theorem 5.4, i.e., show that we can write
S = R R, where R is upper triangular with positive diagonal elements. [Hint: Use
the spectral decomposition S = GLG from(1.33). Then let D = L1/2 G in (5.62). Are
the columns of this D linearly independent?]

Exercise 5.6.31. Show that if W = R R is the Cholesky decomposition of W (K × K),


then
K
|W| = ∏ r2jj . (5.87)
j =1

Exercise 5.6.32. Show that the Cholesky decomposition in Theorem 5.4 is unique.
That is, if R1 and R2 are K × K upper triangular matrices with positive diagonals,
that R1 R1 = R2 R2 implies that R1 = R2 . [Hint: Let R = R1 R2−1 , and show that
R R = IK . Then show that this R must be IK , just as in Exercise 5.6.26.]
Exercise 5.6.33. Show that the M × K matrix D has linearly independent columns
if and only if D D is invertible. [Hint: If D has linearly independent columns, then
D D = R R as Theorem 5.4, and R is invertible. If the columns are linearly dependent,
there is a b = 0 with D Db = 0. Why does that equation imply D D has no inverse?]
Exercise 5.6.34. Suppose D is M × K and C is M × J, K > J, and both matrices have
linearly independent columns. Furthermore, suppose

span{columns of D} = span{columns of C}. (5.88)

Thus this space has two bases with differing numbers of elements. (a) Let A be the
J × K matrix such that D = CA, guaranteed by Exercise 5.6.3. Show that the columns
of A are linearly independent. [Hint: Note that Db = 0 for any K × 1 vector b = 0.
Hence Ab = 0 for any b = 0.] (b) Use Exercise 5.6.25 to show that such an A cannot
exist. (c) What do you conclude?

Exercise 5.6.35. This exercise is to show that any linear subspace W in R M has a basis.
If W = {0}, the basis is the empty set. So you can assume W has more than just the
zero vector. (a) Suppose d1 , . . . , d J are linearly independent vectors in R M . Show that
d ∈ R M but d ∈ span{d1 , . . . , d J } implies that d1 , . . . , d J , d are linearly independent.
[Hint: If they are not linearly independent, then some linear combination of them
equals zero. The coefficient of d in that linear combination must be nonzero. (Why?)
Thus d must be in the span of the others.] (b) Take d1 ∈ W , d1 = 0. [I guess we are
assuming the Axiom of Choice.] If span{d1 } = W , then we have the basis. If not,
there must be a d2 ∈ W − span{d1 }. If span{d1 , d2 } = W , we are done. Explain
how to continue. (Also, explain why part (a) is important here.) How do you know
this process stops? (c) Argue that any linear subspace has a corresponding projection
matrix.

Exercise 5.6.36. Suppose P and P∗ are projection matrices for the linear subspace
W ⊂ R M . Show that P = P∗ , i.e., the projection matrix is unique to the subspace.
[Hint: Because the projection of any vector is unique, Py = P∗ y for all y. Consider
the columns of I M .]
Exercise 5.6.37. Let D = (D1 , D2 ), where D1 is M × K1 and D2 is M × K2 , and
suppose that the columns of D are linearly independent. Show that

PD = PD1 + PD2·1 and QD1 − QD = PD2·1 , (5.89)

where D2·1 = QD1 D2 . [Hint: Use Lemma 5.3 and the uniqueness in Exercise 5.6.36.]

Exercise 5.6.38. Find the orthogonal polynomial matrix (up to cubic) for the four time
points 1, 2, 4, 5.

Exercise 5.6.39 (Skulls). For the model on skull measurements described in Exercise
4.4.2, replace the polynomial matrix w with that for orthogonal polynomials.

Exercise 5.6.40 (Caffeine). In Exercise 4.4.4, the x is a quadratic polynomial matrix


in grade (8, 10, 12). Replace it with the orthogonal polynomial matrix (also 28 × 3),
where the first column is all ones, the second is the linear vector (−1′9 , 0′10 , 1′9 )′, and the third is the quadratic vector (1′9 , −c 1′10 , 1′9 )′ for some c. Find c.

Exercise 5.6.41 (Leprosy). Consider again the model for the leprosy data in Exercise
4.4.6. An alternate expression for x is w∗ ⊗ 110 , where the first column of w∗ rep-
resents the overall mean, the second tells whether the treatment is one of the drugs,
and the third whether the treatment is Drug A, so that
       ⎛ 1 1 1 ⎞
  w∗ = ⎜ 1 1 0 ⎟ .   (5.90)
       ⎝ 1 0 0 ⎠

Use Gram-Schmidt to orthogonalize the columns of w∗ . How does this matrix differ
from w? How does the model using w∗ differ from that using w?
Chapter 6

Both-Sides Models: Distribution of the Estimator

6.1 Distribution of β̂
The both-sides model as defined in (4.28) is

Y = xβz′ + R,  R ∼ N(0, In ⊗ ΣR),   (6.1)

where Y is n × q, x is n × p, β is p × l, and z is q × l. Assuming that x′x and z′z are
invertible, the least-squares estimate of β is given in (5.27) to be

β̂ = (x′x)−1 x′ Y z (z′z)−1.   (6.2)


To find the distribution of β̂, we first look at the mean, which by (6.1) is

E[β̂] = (x′x)−1 x′ E[Y] z (z′z)−1
     = (x′x)−1 x′ (xβz′) z (z′z)−1
     = β.   (6.3)

Thus β̂ is an unbiased estimator of β. For the variances and covariances, Proposition
3.2 helps:

Cov[β̂] = Cov[(x′x)−1 x′ Y z (z′z)−1]
       = Cov[row(Y)(x(x′x)−1 ⊗ z(z′z)−1)]
       = (x(x′x)−1 ⊗ z(z′z)−1)′ Cov[row(Y)] (x(x′x)−1 ⊗ z(z′z)−1)
       = (x(x′x)−1 ⊗ z(z′z)−1)′ (In ⊗ ΣR) (x(x′x)−1 ⊗ z(z′z)−1)
       = ((x′x)−1 x′ In x(x′x)−1) ⊗ ((z′z)−1 z′ ΣR z(z′z)−1)
       = Cx ⊗ Σz,   (6.4)

where we define

Cx = (x′x)−1 and Σz = (z′z)−1 z′ ΣR z (z′z)−1.   (6.5)


Because β̂ is a linear transformation of the Y,

β̂ ∼ Np×l(β, Cx ⊗ Σz).   (6.6)

The variances of the individual β̂ij's are the diagonals of Cx ⊗ Σz, so that

Var(β̂ij) = Cxii × σzjj.   (6.7)

The x and z matrices are known, but Σ R must be estimated. Before tackling that issue,
we will consider fits and residuals.

6.2 Fits and residuals


The observed matrix Y in (6.1) is a sum of its mean and the residuals R. Estimates of
these two quantities are called the fits and estimated residuals, respectively. The fit
of the model to the data is the natural estimate Ŷ of E[Y] = xβz′,

Ŷ = x β̂ z′ = x(x′x)−1 x′ Y z (z′z)−1 z′ = Px Y Pz,   (6.8)

where for a matrix u, Pu = u(u′u)−1 u′, the projection matrix on the span of the
columns of u, as in (5.19). We estimate the residuals R by subtracting:

R̂ = Y − Ŷ = Y − Px Y Pz.   (6.9)
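In R, the fit and residuals can be computed directly from the projection matrices. The following is a minimal sketch, assuming the matrices x, z, and Y are already in the workspace:

Px <- x %*% solve(t(x) %*% x) %*% t(x)  # projection onto the span of the columns of x
Pz <- z %*% solve(t(z) %*% z) %*% t(z)  # projection onto the span of the columns of z
Yhat <- Px %*% Y %*% Pz                 # the fit, as in (6.8)
Rhat <- Y - Yhat                        # the estimated residuals, as in (6.9)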

The joint distribution of Ŷ and R̂ is multivariate normal because the collection is a
linear transformation of Y. The means are straightforward:

E[Ŷ] = Px E[Y] Pz = Px xβz′ Pz = xβz′   (6.10)

by part (d) of Proposition 5.1, and

E[R̂] = E[Y] − E[Ŷ] = xβz′ − xβz′ = 0.   (6.11)

The covariance matrix of the fit is not hard to obtain, but the covariance of the
residuals, as well as the joint covariance of the fit and residuals, are less obvious since
the residuals are not of the form AYB . Instead, we break the residuals into two parts,
the residuals from the left-hand side of the model, and the residuals of the right-hand
part on the fit of the left-hand part. That is, we write

R̂ = Y − Px Y Pz
  = Y − Px Y + Px Y − Px Y Pz
  = (In − Px) Y + Px Y (In − Pz)
  = Qx Y + Px Y Qz
  ≡ R̂1 + R̂2,   (6.12)

where Qu = In − Pu as in part (c) of Proposition 5.1. Note that in the multivariate
regression model, where z = Iq, or if z is q × q and invertible, we have Pz = Iq, hence
Qz = 0 and R̂2 = 0.

For the joint covariance of the fit and two residual components, we use the row
function to write

      ⎛ Ŷ  ⎞
  row ⎜ R̂1 ⎟ = ( row(Ŷ)  row(R̂1)  row(R̂2) )
      ⎝ R̂2 ⎠
             = row(Y) ( Px ⊗ Pz   Qx ⊗ Iq   Px ⊗ Qz ).   (6.13)

Using Proposition 3.1 on the covariance in (6.1), we have

      ⎡ Ŷ  ⎤   ⎛ Px ⊗ Pz ⎞            ⎛ Px ⊗ Pz ⎞′
  Cov ⎢ R̂1 ⎥ = ⎜ Qx ⊗ Iq ⎟ (In ⊗ ΣR) ⎜ Qx ⊗ Iq ⎟
      ⎣ R̂2 ⎦   ⎝ Px ⊗ Qz ⎠            ⎝ Px ⊗ Qz ⎠

               ⎛ Px ⊗ Pz ΣR Pz        0         Px ⊗ Pz ΣR Qz ⎞
             = ⎜       0          Qx ⊗ ΣR            0        ⎟ ,   (6.14)
               ⎝ Px ⊗ Qz ΣR Pz        0         Px ⊗ Qz ΣR Qz ⎠

where we use Proposition 5.1, parts (a) and (c), on the projection matrices.

where we use Proposition 5.1, parts (a) and (c), on the projection matrices.
One fly in the ointment is that the fit and residuals are not in general independent,
due to the possible non-zero correlation between the fit and R̂2. The fit is independent
of R̂1, though. We obtain the distributions of the fit and residuals to be

Ŷ ∼ N(xβz′, Px ⊗ Pz ΣR Pz),   (6.15)

and

R̂ ∼ N(0, Qx ⊗ ΣR + Px ⊗ Qz ΣR Qz).   (6.16)

6.3 Standard errors and t-statistics


Our first goal is to estimate the variances of the βij ’s in (6.7). We need to estimate Σz
in (6.5), so we start by estimating Σ R . From (6.14) and above,

 1 = Qx Y ∼ N (0, Qx ⊗ Σ R ).
R (6.17)

Because Qx is idempotent, Corollary 3.1 shows that

 R
 
R 1 1 = Y Qx Y ∼ Wishart p (n − p, Σ R ), (6.18)

by Proposition 5.1 (a), since trace(Qx ) = trace(In − Px ) = n − p. Thus an unbiased


estimate of Σ R is
R = 1 R
Σ  R
 . (6.19)
n−p 1 1
Now to estimate Cov( β) in (6.4), by (3.79),

(z z)−1 z Y Qx Yz(z z)−1 ∼ Wishartq (n − p, Σz ), (6.20)

so that an unbiased estimate of Σz is


 z = (z z) − 1 z Σ
Σ  R z(z z) − 1 . (6.21)

The diagonals of Σ̂z are chi-squareds:

σ̂zjj ∼ [σzjj/(n − p)] χ²ₙ₋ₚ,   (6.22)

and the estimate of the variance of β̂ij in (6.7) is

V̂ar(β̂ij) = Cxii σ̂zjj ∼ [Cxii σzjj/(n − p)] χ²ₙ₋ₚ.   (6.23)

One more result we need is that β̂ is independent of Σ̂z. But because Pu u = u, we
can write

β̂ = (x′x)−1 x′ Px Y Pz z (z′z)−1
  = (x′x)−1 x′ Ŷ z (z′z)−1,   (6.24)

which shows that β̂ depends on Y only through Ŷ. Since Σ̂z depends on Y only
through R̂1, the independence of Ŷ and R̂1 in (6.14) implies independence of the two
estimators. We collect these results.
Theorem 6.1. In the both-sides model (6.1), β̂ in (5.27) and Y′Qx Y are independent, with
distributions given in (6.6) and (6.18), respectively.
Recall the Student’s t distribution:
Definition 6.1. If Z ∼ N(0, 1) and U ∼ χ²ν, and Z and U are independent, then

T ≡ Z/√(U/ν)   (6.25)

has the Student t distribution on ν degrees of freedom, written T ∼ tν.

Applying the definition to the β̂ij's, from (6.6), (6.7), and (6.23),

Z = (β̂ij − βij)/√(Cxii σzjj) ∼ N(0, 1)  and  U = (n − p) V̂ar(β̂ij)/(Cxii σzjj),   (6.26)

and Z and U are independent, hence

Z/√(U/(n − p)) = (β̂ij − βij)/√(V̂ar(β̂ij)) ∼ tₙ₋ₚ.   (6.27)

6.4 Examples
6.4.1 Mouth sizes
Recall the measurements on the size of mouths of 27 kids at four ages, where there
are 11 girls (Sex=1) and 16 boys (Sex=0) in Section 4.2.1. Here’s the model where the x

matrix compares the boys and girls, and the z matrix specifies orthogonal polynomial
growth curves:

Y = xβz′ + R

  = ⎛ 1₁₁  1₁₁ ⎞ ⎛ β0 β1 β2 β3 ⎞ ⎛  1  1  1  1 ⎞
    ⎝ 1₁₆  0₁₆ ⎠ ⎝ δ0 δ1 δ2 δ3 ⎠ ⎜ −3 −1  1  3 ⎟ + R,   (6.28)
                                 ⎜  1 −1 −1  1 ⎟
                                 ⎝ −1  3 −3  1 ⎠

Compare this model to that in (4.13). Here the first row of coefficients are the boys’
coefficients, and the sum of the rows are the girls’ coefficients, hence the second row
is the girls’ minus the boys’. We first find the estimated coefficients (using R), first
creating the x and z matrices:
x <− cbind(1,rep(c(1,0),c(11,16)))
z <− cbind(c(1,1,1,1),c(−3,−1,1,3),c(1,−1,−1,1),c(−1,3,−3,1))
estx <− solve(t(x)%∗%x,t(x))
estz <− solve(t(z)%∗%z,t(z))
betahat <− estx%∗%mouths[,1:4]%∗%t(estz)

The estimate β̂ is
Intercept Linear Quadratic Cubic
Boys 24.969 0.784 0.203 −0.056 (6.29)
Girls − Boys −2.321 −0.305 −0.214 0.072
Before trying to interpret the coefficients, we would like to estimate their standard
errors. We calculate Σ̂R = Y′QxY/(n − p), where n = 27 and p = 2, then
Σ̂z = (z′z)−1 z′ Σ̂R z (z′z)−1 and Cx = (x′x)−1 of (6.5):

PxY <− x%∗%estx%∗%mouths[,1:4]


QxY <− mouths[,1:4]− PxY
sigmaRhat <− t(QxY)%∗%QxY/(27−2)
sigmazhat <− estz%∗%sigmaRhat%∗%t(estz)
cx <− solve(t(x)%∗%x)
We find

       ⎛  3.7791  0.0681 −0.0421 −0.1555 ⎞
  Σ̂z = ⎜  0.0681  0.1183 −0.0502  0.0091 ⎟   (6.30)
       ⎜ −0.0421 −0.0502  0.2604 −0.0057 ⎟
       ⎝ −0.1555  0.0091 −0.0057  0.1258 ⎠

and

  Cx = ⎛  0.0625 −0.0625 ⎞ .   (6.31)
       ⎝ −0.0625  0.1534 ⎠

By (6.7), the standard errors of the β̂ij's are estimated by multiplying the ith
diagonal of Cx and the jth diagonal of Σ̂z, then taking the square root. We can obtain the
matrix of standard errors using the command
se <− sqrt(outer(diag(cx),diag(sigmazhat),"∗"))
The t-statistics then divide the estimates by their standard errors, betahat/se:

Standard Errors
Intercept Linear Quadratic Cubic
Boys 0.4860 0.0860 0.1276 0.0887
Girls − Boys 0.7614 0.1347 0.1999 0.1389
(6.32)
t-statistics
Intercept Linear Quadratic Cubic
Boys 51.38 9.12 1.59 −0.63
Girls − Boys −3.05 −2.26 −1.07 0.52
(The function bothsidesmodel of Section A.2 performs these calculations as well.) It
looks like the quadratic and cubic terms are unnecessary, so that straight lines for each
sex fit well. It is clear that the linear term for boys is necessary, and the intercepts for
the boys and girls are different (the two-sided p-value for 3.05 with 25 df is 0.005).
The p-value for the Girls−Boys slope is 0.033, which may or may not be significant,
depending on whether you take into account multiple comparisons.
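The p-values quoted here come from the t distribution with n − p = 25 degrees of freedom, e.g.,

2*pt(-abs(-3.05), 25)   # two-sided p-value for the Girls-Boys intercept, about 0.005
2*pt(-abs(-2.26), 25)   # two-sided p-value for the Girls-Boys slope, about 0.033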

6.4.2 Histamine in dogs


Turn to the example in Section 4.2.1 where sixteen dogs were split into four groups,
and each had four histamine levels measured over the course of time.
In the model (4.15), the x part of the model is for a 2 × 2 ANOVA with interaction.
We add a matrix z to obtain a both-sides model. The z we will take has a separate
mean for the “before” measurements, because these are taken before any treatment,
and a quadratic model for the three “after” time points:

Y = xβz′ + R,   (6.33)

where, as in (4.15), but using a Kronecker product to express the x matrix,

      ⎛ 1 −1 −1  1 ⎞
  x = ⎜ 1 −1  1 −1 ⎟ ⊗ 1₄,   (6.34)
      ⎜ 1  1 −1 −1 ⎟
      ⎝ 1  1  1  1 ⎠

      ⎛ μb μ0 μ1 μ2 ⎞
  β = ⎜ αb α0 α1 α2 ⎟   (6.35)
      ⎜ βb β0 β1 β2 ⎟
      ⎝ γb γ0 γ1 γ2 ⎠

and

       ⎛ 1  0  0  0 ⎞
  z′ = ⎜ 0  1  1  1 ⎟ .   (6.36)
       ⎜ 0 −1  0  1 ⎟
       ⎝ 0  1 −2  1 ⎠
The μ’s are for the overall mean of the groups, the α’s for the drug effects, the β’s
for the depletion effect, and the γ’s for the interactions. The “b” subscript indicates
the before means, and the 0, 1 and 2 subscripts indicate the constant, linear, and
quadratic terms of the growth curves. We first set up the design matrices:

x <− cbind(1,rep(c(−1,1),c(8,8)),rep(c(−1,1,−1,1),c(4,4,4,4)))
x <− cbind(x,x[,2]∗x[,3])
z <− cbind(c(1,0,0,0),c(0,1,1,1),c(0,−1,0,1),c(0,1,−2,1))
The rest of the calculations follow as in the previous section, where here the y is
in the R matrix histamine. Because of the orthogonal columns for x, the matrix Cx
is diagonal; in fact Cx = (1/16) I4. The coefficients' estimates, standard errors, and
t-statistics are below. Note the pattern in the standard errors. (A code sketch carrying
out these calculations appears at the end of this section.)

Estimates
Before Intercept Linear Quadratic
Mean 0.0769 0.3858 −0.1366 0.0107
Drug −0.0031 0.1842 −0.0359 −0.0082
Depletion 0.0119 −0.2979 0.1403 −0.0111
Interaction 0.0069 −0.1863 0.0347 0.0078

Standard Errors
Before Intercept Linear Quadratic
Mean 0.0106 0.1020 0.0607 0.0099
(6.37)
Drug 0.0106 0.1020 0.0607 0.0099
Depletion 0.0106 0.1020 0.0607 0.0099
Interaction 0.0106 0.1020 0.0607 0.0099

t-statistics
Before Intercept Linear Quadratic
Mean 7.25 3.78 −2.25 1.09
Drug −0.29 1.81 −0.59 −0.83
Depletion 1.12 −2.92 2.31 −1.13
Interaction 0.65 −1.83 0.57 0.79
Here n = 16 and p = 4, so the degrees of freedom in the t-statistics are 12. It
looks like the quadratic terms are not needed, and that the basic assumption that the
treatment effects for the before measurements is 0 is reasonable. It looks also like the
drug and interaction effects are 0, so that the statistically significant effects are the
intercept and linear effects for the mean and depletion effects. See Figure 4.3 for a
plot of these effects. Chapter 7 deals with testing blocks of β ij ’s equal to zero, which
may be more appropriate for these data.
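Here is a sketch of the calculations referred to above, paralleling Section 6.4.1 and assuming histamine holds the 16 × 4 matrix of measurements:

y <- histamine
estx <- solve(t(x) %*% x, t(x))
estz <- solve(t(z) %*% z, t(z))
betahat <- estx %*% y %*% t(estz)            # estimates in (6.37)
QxY <- y - x %*% estx %*% y
sigmaRhat <- t(QxY) %*% QxY / (16 - 4)       # unbiased estimate of Sigma_R
sigmazhat <- estz %*% sigmaRhat %*% t(estz)  # estimate of Sigma_z
cx <- solve(t(x) %*% x)                      # here (1/16) I_4
se <- sqrt(outer(diag(cx), diag(sigmazhat), "*"))
tstat <- betahat / se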

6.5 Exercises
Exercise 6.5.1. Justify the steps in (6.4) by referring to the appropriate parts of
Proposition 3.2.
Exercise 6.5.2. Verify the calculations in (6.12).

Exercise 6.5.3 (Bayesian inference). This exercise extends the Bayesian results in Exercises
3.7.29 and 3.7.30 to the β in multivariate regression. We start with the estimator
β̂ in (6.6), where z = Iq, hence Σz = ΣR. The model is then

β̂ | β = b ∼ Np×q(b, (x′x)−1 ⊗ ΣR)  and  β ∼ Np×q(β0, K0−1 ⊗ ΣR),   (6.38)



where ΣR, β0, and K0 are known. Note that the ΣR matrix appears in the prior,
which makes the posterior tractable. (a) Show that the posterior distribution of β is
multivariate normal, with

E[β | β̂ = b̂] = (x′x + K0)−1 ((x′x)b̂ + K0 β0),   (6.39)

and

Cov[β | β̂ = b̂] = (x′x + K0)−1 ⊗ ΣR.   (6.40)

[Hint: Same hint as in Exercise 3.7.30.] (b) Set the prior parameters β0 = 0 and
K0 = k0 Ip for some k0 > 0. Show that

E[β | β̂ = b̂] = (x′x + k0 Ip)−1 x′ y.   (6.41)

This conditional mean is the ridge regression estimator of β. See Hoerl and Kennard
[1970]. This estimator can be better than the least squares estimator (a little biased, but
much less variable) when x x is nearly singular, that is, one or more of its eigenvalues
are close to zero.

Exercise 6.5.4 (Prostaglandin). Continue with the data described in Exercise 4.4.1.
The data are in the R matrix prostaglandin. Consider the both-sides model (6.1),
where the ten people have the same mean, so that x = 110 , and z contains the cosine
and sine vectors for m = 1, 2 and 3, as in Exercise 4.4.8. (Thus z is 6 × 6.) (a) What is
z? (b) Are the columns of z orthogonal? What are the squared norms of the columns?
 (d) Find Σ
(c) Find β.  z . (e) Find the (estimated) standard errors of the βj ’s. (f) Find
the t-statistics for the β j ’s. (g) Based on the t-statistics, which model appears most
appropriate? Choose from the constant model; the one-cycle model (just m=1); the
model with one cycle and two cycles; the model with one, two and three cycles.

Exercise 6.5.5 (Skulls). This question continues with the data described in Exercise
4.4.2. The data are in the R matrix skulls, obtained from http://lib.stat.cmu.edu/
DASL/Datafiles/EgyptianSkulls.html at DASL Project [1996]. The Y ∼ N (xβ, Im ⊗
Σ R ), where the x represents the orthogonal polynomials over time periods (from
Exercise 5.6.39). (a) Find β. (b) Find (x x)−1 . (c) Find Σ
 R . What are the degrees
of freedom? (d) Find the standard errors of the βij ’s. (e) Which of the β ij ’s have
t-statistic larger than 2 in absolute value? (Ignore the first row, since those are the
overall means.) (f) Explain what the parameters with | t| > 2 are measuring. (g)
There is a significant linear trend for which measurements? (h) There is a significant
quadratic trend for which measurements?

Exercise 6.5.6 (Caffeine). This question uses the caffeine data (in the R matrix caffeine)
and the model from Exercise 4.4.4. (a) Fit the model, and find the relevant estimates.
(b) Find the t-statistics for the β̂ij's. (c) What do you conclude? (Choose as many
conclusions as appropriate from the following: On average the students do about the
same with or without caffeine; on average the students do significantly better with-
out caffeine; on average the students do significantly better with caffeine; the older
students do about the same as the younger ones on average; the older students do
significantly better than the younger ones on average; the older students do signifi-
cantly worse than the younger ones on average; the deleterious effects of caffeine are
not significantly different for the older students than for the younger; the deleterious

effects of caffeine are significantly greater for the older students than for the younger;
the deleterious effects of caffeine are significantly greater for the younger students
than for the older; the quadratic effects are not significant.)

Exercise 6.5.7 (Grades). Consider the grades data in (4.10). Let Y be the 107 × 5
matrix consisting of the variables homework, labs, inclass, midterms, and final. The
x matrix indicates gender. Let the first column of x be 1n . There are 70 women and
37 men in the class, so let the second column have 0.37 for the women and −0.70
for the men. (That way, the columns of x are orthogonal.) For the z, we want the
overall mean score; a contrast between the exams (midterms and final) and other
scores (homework, labs, inclass); a contrast between (homework, labs) and inclass; a
contrast between homework and labs; and a contrast between midterms and final.
Thus

       ⎛ 1  1  1  1  1 ⎞
       ⎜ 2  2  2 −3 −3 ⎟
  z′ = ⎜ 1  1 −2  0  0 ⎟ .   (6.42)
       ⎜ 1 −1  0  0  0 ⎟
       ⎝ 0  0  0  1 −1 ⎠

Let

  β = ⎛ β1 β2 β3 β4 β5 ⎞ .   (6.43)
      ⎝ δ1 δ2 δ3 δ4 δ5 ⎠

(a) Briefly describe what each of the parameters represents. (b) Find β.  (c) Find the

standard errors of the β ij ’s. (d) Which of the parameters have |t-statistic| over 2? (e)
Based on the results in part (d), discuss whether there is any difference between the
grade profiles of the men and women.
Chapter 7

Both-Sides Models: Hypothesis Tests on β

Testing a single βij = 0 is easy using the t-test based on (6.27). It is often informative
to test whether a set of βij's is 0, e.g., a row of β, or a column, or a block, or some other
configuration. In Section 7.1, we present a general test statistic and its χ2 approxima-
tion for testing any set of parameters equals zero. Section 7.2 refines the test statistic
when the set of β ij ’s of interest is a block.

7.1 Approximate χ2 test


Start by placing the parameters of interest in the 1 × K vector θ. We assume we have
a vector of estimates θ̂ such that

θ̂ ∼ N(θ, Ω),   (7.1)

and we wish to test


H0 : θ = 0. (7.2)
We could test whether the vector equals a fixed non-zero vector, but then we can
subtract that hypothesized value from θ and return to the case (7.2). Assuming Ω is
invertible, we have that under H0 ,

θ̂ Ω−1/2 ∼ N(0, IK),   (7.3)

hence

θ̂ Ω−1 θ̂′ ∼ χ²K.   (7.4)

Typically, Ω will have to be estimated, in which case we use

T² ≡ θ̂ Ω̂−1 θ̂′,   (7.5)

where under appropriate conditions (e.g., large sample size relative to K), under H0,

T² ≈ χ²K.   (7.6)


7.1.1 Example: Mouth sizes


In the mouth size example of Section 6.4.1, consider testing the fit of the model spec-
ifying parallel straight lines for the boys and girls, with possibly differing intercepts.
Then in the model (6.28), only β0 , β1 , and δ0 would be nonzero, so we would be
testing whether the other five are zero. Place those in the vector θ:

θ = ( β2 , β3 , δ1 , δ2 , δ3 ). (7.7)

The estimate is

θ̂ = (0.203, −0.056, −0.305, −0.214, 0.072).   (7.8)

To find the estimated covariance matrix Ω̂, we need to pick off the relevant elements
of the matrix Cx ⊗ Σ̂z using the values in (6.30). In terms of row(β̂), we are interested
in elements 3, 4, 6, 7, and 8. Continuing the R work from Section 6.4.1, we have
omegahat <− kronecker(cx,sigmazhat)[c(3,4,6,7,8),c(3,4,6,7,8)]
so that
      ⎛  0.01628 −0.00036  0.00314 −0.01628  0.00036 ⎞
      ⎜ −0.00036  0.00786 −0.00057  0.00036 −0.00786 ⎟
  Ω̂ = ⎜  0.00314 −0.00057  0.01815 −0.00770  0.00139 ⎟ .   (7.9)
      ⎜ −0.01628  0.00036 −0.00770  0.03995 −0.00088 ⎟
      ⎝  0.00036 −0.00786  0.00139 −0.00088  0.01930 ⎠

The statistic (7.6) is

T² = θ̂ Ω̂−1 θ̂′ = 10.305.   (7.10)
The degrees of freedom K = 5, which yields an approximate p-value of 0.067, border-
line significant. Judging from the individual t-statistics, the β22 element, indicating a
difference in slopes, may be the reason for the almost-significance.
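Continuing the R work, with omegahat as above, the statistic and its approximate p-value are (the element ordering follows (7.7)):

theta <- c(betahat[1,3:4], betahat[2,2:4])   # (beta2, beta3, delta1, delta2, delta3)
t2 <- c(theta %*% solve(omegahat, theta))    # T^2 = 10.305
1 - pchisq(t2, 5)                            # approximate p-value, about 0.067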

7.2 Testing blocks of β are zero


In this section, we focus on blocks in the both-sides model, for which we can find a
better approximation to the distribution of T 2 , or in some cases the exact distribution.
A block β∗ of the p × l matrix β is a p∗ × l ∗ rectangular (though not necessarily
contiguous) submatrix of β. For example, if β is 5 × 4, we might have β∗ be the 3 × 2
submatrix

       ⎛ β11 β13 ⎞
  β∗ = ⎜ β41 β43 ⎟ ,   (7.11)
       ⎝ β51 β53 ⎠

which uses the rows 1, 4, and 5, and the columns 1 and 3. Consider testing the
hypothesis
H0 : β∗ = 0. (7.12)
The corresponding estimate β̂∗ of β∗ has distribution

β̂∗ ∼ Np∗×l∗(β∗, Cx∗ ⊗ Σz∗),   (7.13)



where Cx∗ and Σz∗ are the appropriate p∗ × p∗ and l ∗ × l ∗ submatrices of, respectively,
Cx and Σz . In the example (7.11),
        ⎛ Cx11 Cx14 Cx15 ⎞                ⎛ σz11 σz13 ⎞
  Cx∗ = ⎜ Cx41 Cx44 Cx45 ⎟ ,  and  Σz∗ =  ⎝ σz31 σz33 ⎠ .   (7.14)
        ⎝ Cx51 Cx54 Cx55 ⎠

Also, letting Σ̂z∗ be the corresponding submatrix of (6.21), we have

ν Σ̂z∗ ∼ Wishartl∗(ν, Σz∗),  ν = n − p.   (7.15)
We take
θ = row(β∗)  and  Ω = Cx∗ ⊗ Σz∗,   (7.16)

and θ̂, Ω̂ as the obvious estimates. Then using (3.32c) and (3.32d), we have that

T² = row(β̂∗)(Cx∗ ⊗ Σ̂z∗)−1 row(β̂∗)′
   = row(β̂∗)(Cx∗−1 ⊗ Σ̂z∗−1) row(β̂∗)′
   = row(Cx∗−1 β̂∗ Σ̂z∗−1) row(β̂∗)′
   = trace(Cx∗−1 β̂∗ Σ̂z∗−1 β̂∗′),   (7.17)

where the final equation results by noting that for p × q matrices A and D,

row(A) row(D)′ = ∑i=1..p ∑j=1..q aij dij = trace(AD′).   (7.18)

To clean up the notation a bit, we write

T 2 = ν trace(W−1 B), (7.19)


where

W = ν Σ̂z∗,  ν = n − p,  and  B = β̂∗′ Cx∗−1 β̂∗.   (7.20)
Thus by Theorem 6.1, B and W are independent, and by (7.13) and (7.15), under H0 ,
B ∼ Wishartl ∗ ( p∗ , Σz∗ ) and W ∼ Wishartl ∗ (ν, Σz∗ ). (7.21)
In multivariate analysis of variance, we usually call B the “between-group” sum of
squares and cross-products matrix, and W the “within-group” matrix. The test based
on T 2 in (7.19) is called the Lawley-Hotelling trace test, where the statistic is usually
defined to be T 2 /ν.
We could use the approximation (7.6) with K = p∗l∗, but we do a little better with
the approximation

F ≡ [(ν − l∗ + 1)/(νp∗l∗)] T² ≈ Fp∗l∗, ν−l∗+1.   (7.22)
Recall the following definition of Fμ,ν.
Definition 7.1. If B ∼ χ²μ and W ∼ χ²ν, and B and W are independent, then

F = (B/μ)/(W/ν) ∼ Fμ,ν,   (7.23)

an F distribution with degrees of freedom μ and ν.

When p∗ = 1 or l ∗ = 1, so that we are testing elements within a single row or


column, the distribution in (7.22) is exact. In fact, when we are testing just one β ij
(p∗ = l ∗ = 1), the T 2 is the square of the usual t-statistic, hence distributed F1,ν . In
other cases, at least the mean of the test statistic matches that of the F. The rest of
this section verifies these statements.

7.2.1 Just one column – F test


Suppose l ∗ = 1, so that B and W are independent scalars, and from (7.21),

B ∼ σz∗² χ²p∗  and  W ∼ σz∗² χ²ν.   (7.24)

Then in (7.22), the constant multiplying T² is simply 1/p∗, so that

F = (ν/p∗)(B/W) ∼ Fp∗,ν.   (7.25)

This is the classical problem in multiple (univariate) regression, and this test is the
regular F test.

7.2.2 Just one row – Hotelling’s T 2


Now p∗ = 1, so that the B in (7.19) has just one degree of freedom. Thus we can write

B = Z′Z,  Z ∼ N1×l∗(0, Σz∗),   (7.26)

where

Z = β̂∗/√Cx∗.   (7.27)

(Note that Cx∗ is a scalar.) From (7.19), T² can be variously written

T² = ν trace(W−1 Z′Z) = ν Z W−1 Z′ = β̂∗ Σ̂z∗−1 β̂∗′ / Cx∗.   (7.28)

In this situation, the statistic is called Hotelling’s T 2 . The next proposition shows that
the distribution of the F version of Hotelling’s T 2 in (7.22) is exact, setting p∗ = 1.
The proof of the proposition is in Section 8.4.
Proposition 7.1. Suppose W and Z are independent, W ∼ Wishartl ∗ (ν, Σ) and Z ∼
N1×l ∗ (0, Σ), where ν ≥ l ∗ and Σ is invertible. Then

[(ν − l∗ + 1)/l∗] Z W−1 Z′ ∼ Fl∗, ν−l∗+1.   (7.29)
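Proposition 7.1 is easy to put to work. Here is a small hypothetical helper (not one of the book's functions) computing Hotelling's T² and its exact F version (7.22) for the p∗ = 1 case:

hotelling <- function(betastar, cxstar, sigmazstar, nu) {
  # betastar: 1 x l* vector of estimates; cxstar: scalar; sigmazstar: l* x l* matrix
  lstar <- length(betastar)
  t2 <- c(betastar %*% solve(sigmazstar, betastar)) / cxstar   # as in (7.28)
  f <- (nu - lstar + 1) / (nu * lstar) * t2                    # as in (7.22)
  list(T2 = t2, F = f, df = c(lstar, nu - lstar + 1),
       pvalue = 1 - pf(f, lstar, nu - lstar + 1))
}

For the mouth size example of Section 7.3.1, hotelling(betahat[2,], cx[2,2], sigmazhat, 25) reproduces T² = 16.5075 and F = 3.632.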

7.2.3 General blocks


In this section we verify that the expected values of the two sides of (7.22) are the
same. It is not hard to show that E [ χ2μ ] = μ and E [1/χ2ν ] = 1/(ν − 2) if ν > 2. Thus
by the definition of F in (7.23), independence yields
E[Fμ,ν] = ν/(ν − 2)   (7.30)

if ν > 2. Otherwise, the expected value is + ∞. For T 2 in (7.19) and (7.21), again by
independence of B and W,

E [ T 2 ] = ν trace( E [W−1 ] E [B]) = νp∗ trace( E [W−1 ] Σz∗ ), (7.31)

because E [B] = p∗ Σz∗ by (3.73). To finish, we need the following lemma, which
extends the results on E [1/χ2ν ].

Lemma 7.1. If W ∼ Wishartl ∗ (ν, Σ), ν > l ∗ + 1, and Σ is invertible,

E[W−1] = [1/(ν − l∗ − 1)] Σ−1.   (7.32)

The proof is in Section 8.3. Continuing from (7.31),

E[T²] = [νp∗/(ν − l∗ − 1)] trace(Σz∗−1 Σz∗) = νp∗l∗/(ν − l∗ − 1).   (7.33)

Using (7.33) and (7.30) on (7.22), we have

[(ν − l∗ + 1)/(νp∗l∗)] E[T²] = (ν − l∗ + 1)/(ν − l∗ − 1) = E[Fp∗l∗, ν−l∗+1].   (7.34)

7.2.4 Additional test statistics


In addition to the Lawley-Hotelling trace statistic T 2 (7.19), other popular test statis-
tics for testing blocks based on W and B in (7.19) include the following.

Wilks’ Λ
The statistic is based on the likelihood ratio statistic (see Section 9.3.1), and is defined
as

Λ = |W| / |W + B|.   (7.35)
See Exercise 9.6.8. Its distribution under the null hypothesis has the Wilk’s Λ distri-
bution, which is a generalization of the beta distribution.

Definition 7.2 (Wilks’ Λ). Suppose W and B are independent Wishart’s, with distributions
as in (7.21). Then Λ has the Wilks’ Λ distribution with dimension l ∗ and degrees of freedom
( p∗ , ν), written
Λ ∼ Wilksl ∗ ( p∗ , ν). (7.36)

Wilks’ Λ can be represented as a product of independent beta random variables.


Bartlett [1954] has a number of approximations for multivariate statistics, including
one for Wilks' Λ:

−[ν − (l∗ − p∗ + 1)/2] log(Λ) ≈ χ²p∗l∗.   (7.37)

Pillai trace
This one is the locally most powerful invariant test. (Don’t worry about what that
means exactly, but it has relatively good power if in the alternative the β∗ is not far
from 0.) The statistic is
trace((W + B)−1 B). (7.38)
Asymptotically, as ν → ∞,

ν trace((W + B)−1 B) → χ2l ∗ p∗ , (7.39)

which is the same limit as for the Lawley-Hotelling T2 .

Roy’s maximum root


This test is based on the largest root, i.e., largest eigenvalue of (W + B)−1 B.
If p∗ = 1 or l ∗ = 1, these statistics are all equivalent to T 2 . In general, Lawley-
Hotelling, Pillai, and Wilks’ have similar operating characteristics. Each of these four
tests is admissible in the sense that there is no other test of the same level that always
has better power. See Anderson [2003], for discussions of these statistics, including
some asymptotic approximations and tables.
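As a quick reference, all four statistics can be computed from B and W. A small hypothetical helper (not one of the book's functions) might look like the following; it reproduces the quantities calculated for the mouth size data in Section 7.3.1:

blockstats <- function(B, W, nu) {
  lh <- nu * sum(diag(solve(W) %*% B))           # Lawley-Hotelling T^2, (7.19)
  wilks <- det(W) / det(W + B)                   # Wilks' Lambda, (7.35)
  pillai <- sum(diag(solve(W + B) %*% B))        # Pillai trace, (7.38)
  roy <- max(eigen(solve(W + B) %*% B)$values)   # Roy's maximum root
  c(lawley.hotelling = lh, wilks = wilks, pillai = pillai, roy = roy)
}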

7.3 Examples
In this section we further analyze the mouth size and histamine data. The book Hand
and Taylor [1987] contains a number of other nice examples.

7.3.1 Mouth sizes


We will use the model from (6.28) with orthogonal polynomials for the four age
variables and the second row of the β representing the differences between the girls
and boys:

Y = xβz′ + R

  = ⎛ 1₁₁  1₁₁ ⎞ ⎛ β0 β1 β2 β3 ⎞ ⎛  1  1  1  1 ⎞
    ⎝ 1₁₆  0₁₆ ⎠ ⎝ δ0 δ1 δ2 δ3 ⎠ ⎜ −3 −1  1  3 ⎟ + R.   (7.40)
                                 ⎜  1 −1 −1  1 ⎟
                                 ⎝ −1  3 −3  1 ⎠

See Section 6.4.1 for calculation of estimates of the parameters.


We start by testing equality of the boys’ and girls’ curves. Consider the last row of
β:

H0 : (δ0 , δ1 , δ2 , δ3 ) = (0, 0, 0, 0). (7.41)


The estimate is

(δ̂0, δ̂1, δ̂2, δ̂3) = (−2.321, −0.305, −0.214, 0.072).   (7.42)



Because p∗ = 1, the T² is Hotelling's T² from Section 7.2.2, where l∗ = l = 4 and
ν = n − p = 27 − 2 = 25. Here Cx∗ = Cx22 = 0.1534 and Σ̂z∗ = Σ̂z from (6.30). We
calculate T² = 16.5075, using
t2 <− betahat[2,]%∗%solve(sigmazhat,betahat[2,])/cx[2,2]
By (7.22), under the null,

[(ν − l∗ + 1)/(νl∗)] T² ∼ Fl∗, ν−l∗+1  −→  (22/100) × 16.5075 = 3.632,   (7.43)
which, compared to a F4,22 , has p-value 0.02. So we reject H0 , showing there is a
difference in the sexes.
Next, consider testing that the two curves are actually linear, that is, the quadratic
and cubic terms are 0 for both curves:

     ⎛ β2 β3 ⎞
H0 : ⎝ δ2 δ3 ⎠ = 0.   (7.44)

Now p∗ = l∗ = 2, Cx∗ = Cx, and Σ̂z∗ is the lower right 2 × 2 submatrix of Σ̂z.
Calculating:
sigmazstar <− sigmazhat[3:4,3:4]
betastar <− betahat[,3:4]
b <− t(betastar)%∗%solve(cx)%∗%betastar # Note that solve(cx) = t(x)%∗%x
t2 <− tr(solve(sigmazstar)%∗%b)
The function tr is a simple function that finds the trace of a square matrix defined by
tr <− function(x) sum(diag(x))
This T 2 = 2.9032, and the F form in (7.22) is (24/100) × T 2 = 0.697, which is not
at all significant for an F4,24. The function bothsidesmodel.test in Section A.2.2 will
perform this test, as well as the Wilk’s test.
The other tests in Section 7.2.4 are also easy to implement, where here W = 25 Σ̂z∗.
Wilk’s Λ (7.35) is
w <− 25∗sigmazstar
lambda <− det(w)/det(b+w)
The Λ = 0.8959. For the large-sample approximation (7.37), the factor is 24.5, and the
statistic is −24.5 log(Λ) = 2.693, which is not significant for a χ24 . Pillai’s trace test
statistic (7.38) is
tr(solve(b+w)%∗%b)
which equals 0.1041, and the statistic (7.39 ) is 2.604, similar to Wilk’s Λ. The final
one is Roy’s maximum root test. The eigenvalues are found using
eigen(solve(b+w)%∗%b)$values
being 0.1036 and 0.0005. Thus the statistic here is 0.1036. Anderson [2003] has tables
and other information about these tests. For this situation, (ν + p∗ )/p∗ times the
statistic, which is (27/2) × 0.1036 = 1.40, has 0.05 cutoff point of 5.75.
The conclusion is that we need not worry about the quadratic or cubic terms. Just
for fun, go back to the original model, and test the equality of the boys’ and girls’

curves presuming the quadratic and cubic terms are 0. The β∗ = (−2.321, −0.305),
p∗ = 1 and l ∗ = 2. Hotelling’s T 2 = 13.1417, and the F = [(25 − 2 + 1)/(25 × 2)] ×
13.1417 = 6.308. Compared to an F2,25, the p-value is 0.006. Note that this is quite
a bit smaller than the p-value before (0.02), since we have narrowed the focus by
eliminating insignificant terms from the statistic.
Our conclusion that the boys’ and girls’ curves are linear but different appears
reasonable given the Figure 4.1.

7.3.2 Histamine in dogs


Consider again the two-way multivariate analysis of variance model Y = xβ + R from
(4.15), where

      ⎛ μb μ1 μ2 μ3 ⎞
  β = ⎜ αb α1 α2 α3 ⎟ .   (7.45)
      ⎜ βb β1 β2 β3 ⎟
      ⎝ γb γ1 γ2 γ3 ⎠
Recall that the μ’s are for the overall mean of the groups, the α’s for the drug effects,
the β’s for the depletion effect, and the γ’s for the interactions. The “b” subscript
indicates the before means, and the 1, 2, 3’s represent the means for the three after
time points.
We do an overall test of equality of the four groups based on the three after time
points. Thus

     ⎛ α1 α2 α3 ⎞
H0 : ⎜ β1 β2 β3 ⎟ = 0.   (7.46)
     ⎝ γ1 γ2 γ3 ⎠
Section 6.4.2 contains the initial calculations. Here, Cx∗ is the lower right 3 × 3
submatrix of Cx, i.e., Cx∗ = (1/16)I3, Σ̂z∗ is the lower right 3 × 3 submatrix of Σ̂z,
p∗ = l∗ = 3, and ν = n − p = 16 − 4 = 12. We then have
t2 <− 16∗tr(solve(sigmazhat[2:4,2:4])%∗%t(betahat[2:4,2:4])%∗%betahat[2:4,2:4])
f <− (12−3+1)∗t2/(12∗3∗3)

Here, T 2 = 41.5661 and F = 3.849, which has degrees of freedom (9,10). The p-value
is 0.024, which does indicate a difference in groups.

7.4 Testing linear restrictions


Instead of testing that some of the β ij ’s are 0, one often wishes to test equalities among
them, or other linear restrictions. For example, consider the one-way multivariate
analysis of variance with three groups, with n k observations in group k, and q = 2
variables (such as the leprosy data below), written as
      ⎛ 1n1  0    0  ⎞ ⎛ μ11 μ12 ⎞
  Y = ⎜  0  1n2   0  ⎟ ⎜ μ21 μ22 ⎟ + R.   (7.47)
      ⎝  0   0   1n3 ⎠ ⎝ μ31 μ32 ⎠

The hypothesis that the groups have the same means is

H0 : μ11 = μ21 = μ31 and μ12 = μ22 = μ32 . (7.48)



That hypothesis can be expressed in matrix form as


  ⎛ 1 −1  0 ⎞ ⎛ μ11 μ12 ⎞   ⎛ 0 0 ⎞
  ⎝ 1  0 −1 ⎠ ⎜ μ21 μ22 ⎟ = ⎝ 0 0 ⎠ .   (7.49)
              ⎝ μ31 μ32 ⎠

Or, if only the second column of Y is of interest, then one might wish to test

H0 : μ12 = μ22 = μ32 , (7.50)

which in matrix form can be expressed as


  ⎛ 1 −1  0 ⎞ ⎛ μ11 μ12 ⎞ ⎛ 0 ⎞   ⎛ 0 ⎞
  ⎝ 1  0 −1 ⎠ ⎜ μ21 μ22 ⎟ ⎝ 1 ⎠ = ⎝ 0 ⎠ .   (7.51)
              ⎝ μ31 μ32 ⎠

Turning to the both-sides model, these hypotheses can be written as

H0 : CβD′ = 0,   (7.52)

where C (p∗ × p) and D (l∗ × l) are fixed matrices that express the desired restrictions.
To test the hypothesis, we use

C β̂ D′ ∼ N(CβD′, CCxC′ ⊗ DΣzD′),   (7.53)

and

D Σ̂z D′ ∼ [1/(n − p)] Wishart(n − p, DΣzD′).   (7.54)

Then, assuming the appropriate matrices are invertible, we set

B = D β̂′ C′ (CCxC′)−1 C β̂ D′,  ν = n − p,  W = ν D Σ̂z D′,   (7.55)

which puts us back at the distributions in (7.21). Thus T² or any of the other test
statistics can be used as above. In fact, the hypothesis on β∗ in (7.12) and (7.11) can be
written this way, since β∗ = CβD′ for C and D with 0's and 1's in the right places.
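In R, given the restriction matrices and the estimates from the big model, a minimal sketch of the calculation (assuming C, D, betahat, cx, sigmazhat, and nu are already defined) is:

cbd <- C %*% betahat %*% t(D)                      # estimate of C beta D'
B <- t(cbd) %*% solve(C %*% cx %*% t(C)) %*% cbd   # as in (7.55)
W <- nu * D %*% sigmazhat %*% t(D)
t2 <- nu * sum(diag(solve(W) %*% B))               # Lawley-Hotelling T^2, (7.19)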

7.5 Covariates
A covariate is a variable that is of (possibly) secondary importance, but is recorded
because it might help adjust some estimates in order to make them more precise.
As a simple example, consider the Leprosy example described in Exercise 4.4.6. The
main interest is the effect of the treatments on the after-treatment measurements. The
before-treatment measurements constitute the covariate in this case. These measure-
ments are indications of health of the subjects before treatment, the higher the less
healthy. Because of the randomization, the before measurements have equal popula-
tion means for the three treatments. But even with a good randomization, the sample
means will not be exactly the same for the three groups, so this covariate can be used
to adjust the after comparisons for whatever inequities appear.

The multivariate analysis of variance model we use here, with the before and after
measurements as the two Y variables, is

  Y = (Yb  Ya) = xβ + R
              ⎡⎛ 1  1  1 ⎞       ⎤ ⎛ μb μa ⎞
            = ⎢⎜ 1  1 −1 ⎟ ⊗ 1₁₀ ⎥ ⎜ αb αa ⎟ + R,   (7.56)
              ⎣⎝ 1 −2  0 ⎠       ⎦ ⎝ βb βa ⎠

where

  Cov(Y) = I30 ⊗ ⎛ σbb σba ⎞ .   (7.57)
                 ⎝ σab σaa ⎠
The treatment vectors in x are the contrasts Drugs versus Control and Drug A versus
Drug D. The design restriction that the before means are the same for the three groups
is represented by  
  ⎛ αb ⎞   ⎛ 0 ⎞
  ⎝ βb ⎠ = ⎝ 0 ⎠ .   (7.58)
Interest centers on α a and β a . (We do not care about the μ’s.)
The estimates and standard errors using the usual calculations for the before and
after α’s and β’s are given in the next table:
                         Before              After
                     Estimate    se     Estimate    se
  Drug vs. Control    −1.083   0.605     −2.200   0.784     (7.59)
  Drug A vs. Drug D   −0.350   1.048     −0.400   1.357
Looking at the after parameters, we see a significant difference in the first contrast,
showing that the drugs appear effective. The second contrast is not significant, hence
we cannot say there is any reason to suspect difference between the drugs. Note,
though, that the first contrast for the before measurements is somewhat close to sig-
nificance, that is, by chance the control group received on average less healthy people.
Thus we wonder whether the significance of this contrast for the after measurements
is at least partly due to this fortuitous randomization.
To take into account the before measurements, we condition on them, that is, con-
sider the conditional distribution of the after measurements given the before mea-
surements (see equation 3.56 with X = Yb ):
Ya | Yb = yb ∼ N30×1 (α + yb γ, I30 × σaa·b ), (7.60)
where γ = σab/σbb, and recalling (7.58),

  α + yb γ = E[Ya] − E[Yb] γ + yb γ
           = x (μa, αa, βa)′ − x (μb, 0, 0)′ γ + yb γ
           = x (μ∗, αa, βa)′ + yb γ
           = (x, yb) (μ∗, αa, βa, γ)′,   (7.61)

where μ ∗ = μ a − μ b γ. Thus conditionally we have another linear model, this one


with one Y vector and four X variables, instead of two Y’s and three X’s. But the key
point is that the parameters of interest, α a and β a , are estimable in this model. Note
that μ a and μ b are not so estimable, but then we do not mind.
The estimates of the parameters from this conditional model have distribution

  (μ̂∗, α̂∗a, β̂∗a, γ̂)′ | Yb = yb ∼ N4×1((μ∗, αa, βa, γ)′, C(x,yb) ⊗ σaa·b),   (7.62)

where

  C(x,yb) = ((x, yb)′(x, yb))−1   (7.63)
as in (6.5).
The next table has the comparisons of the original estimates and the covariate-adjusted
estimates for αa and βa:

               Original                  Covariate-adjusted
          Estimate    se      t           Estimate    se      t
  α̂a       −2.20    0.784   −2.81   α̂∗a    −1.13    0.547   −2.07    (7.64)
  β̂a       −0.40    1.357   −0.29   β̂∗a    −0.05    0.898   −0.06
The covariates helped the precision of the estimates, lowering the standard errors by
about 30%. Also, the first effect is significant but somewhat less so than without the
covariate. This is due to the control group having somewhat higher before means
than the drug groups.
Whether the original or covariate-adjusted estimates are more precise depends on
a couple of terms. The covariance matrices for the two cases are

  Cx ⊗ ⎛ σbb σba ⎞   and   C(x,yb) ⊗ σaa·b.   (7.65)
       ⎝ σab σaa ⎠

Because

  σaa·b = σaa − σab²/σbb = σaa(1 − ρ²),   (7.66)

where ρ is the correlation between the before and after measurements, the covariate-
adjusted estimates are relatively better the higher ρ is. From the original model we
estimate ρ̂ = 0.79, which is quite high. The estimates from the two models are
σ̂aa = 36.86 and σ̂aa·b = 16.05. On the other hand,
  [Cx]2:3,2:3 “<” [C(x,yb)]2:3,2:3,   (7.67)

where the subscripts are to indicate we are taking the second and third rows/columns
from the matrices. The inequality holds unless x′yb = 0, and favors the original
estimates. Here the two matrices in (7.67) are

  ⎛ 0.0167    0    ⎞   and   ⎛ 0.0186  0.0006 ⎞ ,   (7.68)
  ⎝   0     0.050  ⎠         ⎝ 0.0006  0.0502 ⎠

respectively, which are not very different. Thus in this case, the covariate-adjusted
estimates are better because the gain from the σ terms is much larger than the loss

from the Cx terms. Note that the covariate-adjusted estimate of the variances has lost
a degree of freedom.
The parameters in model (7.56) with the constraint (7.58) can be classified into
three types:

• Those of interest: α a and β a ;

• Those assumed zero: αb and β b ;

• Those of no particular interest nor with any constraints: μ b and μ a .

The model can easily be generalized to the following:



  Y = (Yb  Ya) = xβ + R = x ⎛ μb  μa ⎞ + R,  where βb = 0,   (7.69)
                            ⎝ βb  βa ⎠

  Cov(Y) = In ⊗ ⎛ Σbb  Σba ⎞ ,   (7.70)
                ⎝ Σab  Σaa ⎠

and

  x is n × p, Yb is n × qb, Ya is n × qa, μb is p1 × qb, μa is p1 × qa,
  βb is p2 × qb, βa is p2 × qa, Σbb is qb × qb, and Σaa is qa × qa.   (7.71)

Here, Yb contains the q b covariate variables, the β a contains the parameters of interest,
β b is assumed zero, and μb and μ a are not of interest. For example, the covariates
might consist of a battery of measurements before treatment.
Conditioning on the covariates again, we have that

Ya | Yb = yb ∼ Nn×q a (α + yb γ, In ⊗ Σ aa·b ), (7.72)

where Σaa·b = Σaa − Σab Σbb−1 Σba, γ = Σbb−1 Σba, and

  α = E[Ya] − E[Yb] γ = x ⎛ μa ⎞ − x ⎛ μb ⎞ γ = x ⎛ μa − μb γ ⎞ .   (7.73)
                          ⎝ βa ⎠     ⎝ 0  ⎠       ⎝    βa     ⎠

Thus we can write the conditional model (conditional on Yb = yb),

  Ya = x ⎛ μa − μb γ ⎞ + yb γ + R∗ = (x  yb) ⎛ μ∗ ⎞ + R∗ = x∗ β∗ + R∗,   (7.74)
         ⎝    βa     ⎠                       ⎜ βa ⎟
                                             ⎝ γ  ⎠

where μ∗ = μ a − μb γ and R∗ ∼ Nn×q a (0, In ⊗ Σ aa·b ). In this model, the parameter of


interest, β a , is still estimable, but the μ a and μb are not, being combined in the μ∗ . The
model has turned from one with q = q a + q b dependent vectors and p explanatory
vectors into one with q a dependent vectors and p + q b explanatory vectors. The esti-
mates and hypothesis tests on β∗ then proceed as for general multivariate regression
models.
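Computationally, the covariate-adjusted analysis is just a multivariate regression of Ya on the augmented design matrix (x, yb). A minimal sketch, assuming ya, yb, and x are in R:

xstar <- cbind(x, yb)                                      # the (x, y_b) matrix in (7.74)
betastar <- solve(t(xstar) %*% xstar, t(xstar)) %*% ya     # estimates of mu*, beta_a, gamma
resid <- ya - xstar %*% betastar
sigmahat <- t(resid) %*% resid / (nrow(ya) - ncol(xstar))  # estimate of Sigma_{aa.b}
cxstar <- solve(t(xstar) %*% xstar)
se <- sqrt(outer(diag(cxstar), diag(sigmahat), "*"))       # standard errors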

7.5.1 Pseudo-covariates
The covariates considered above were “real” in the sense that they were collected
purposely in such a way that their distribution was independent of the ANOVA
groupings. At times, one finds variables that act like covariates after transforming
the Y’s. Continue the mouth size example from Section 7.3.1, with cubic model over
time as in (7.40). We will assume that the cubic and quadratic terms are zero, as
testing the hypothesis (7.44) suggests. Then β is of the form (βa  βb) = (βa  0), where

  βa = ⎛ β0 β1 ⎞   and   βb = ⎛ β2 β3 ⎞ ,   (7.75)
       ⎝ δ0 δ1 ⎠              ⎝ δ2 δ3 ⎠

as in (7.69) but with no μa or μb. (Plus the columns are in opposite order.) But the z′
is in the way. Note though that z is square and invertible, so that we can use the
one-to-one function of Y, Y(z′)−1:

  Y(z) ≡ Y(z′)−1 = x(βa  0) + R(z),   (7.76)

where

  R(z) ∼ Nn×q(0, In ⊗ Σz),  Σz = z−1 ΣR (z′)−1.   (7.77)
(This Σz is the same as the one in (6.5) because z itself is invertible.)
Now we are back in the covariate case (7.69), so we have Y(z) = (Ya(z)  Yb(z)), and
have the conditional linear model

  [Ya(z) | Yb(z) = yb(z)] = (x  yb(z)) ⎛ βa ⎞ + Ra(z),   (7.78)
                                       ⎝ γ  ⎠

where γ = Σz,bb−1 Σz,ba, and Ra(z) ∼ Nn×qa(0, In ⊗ Σz,aa·b), and estimation proceeds as
in the general case (7.69). See Section 9.5 for the calculations.
When z is not square, so that z is q × k for k < q, pseudo-covariates can be created
by filling out z, that is, find a q × (q − k) matrix z2 so that z′z2 = 0 and (z  z2) is
invertible. (Such a z2 can always be found. See Exercise 5.6.18.) Then the model is

  Y = xβz′ + R
    = x (β  0) (z  z2)′ + R,   (7.79)

and we can proceed as in (7.76) with

  Y(z) = Y [(z  z2)′]−1.   (7.80)
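A sketch of the transformation in R, assuming a suitable z2 has already been found so that cbind(z, z2) is invertible:

zfull <- cbind(z, z2)            # q x q, invertible
yz <- Y %*% solve(t(zfull))      # Y^(z) = Y ((z z2)')^{-1}, as in (7.80)
ya <- yz[, 1:ncol(z)]            # part modeled by beta_a
yb <- yz[, -(1:ncol(z))]         # the pseudo-covariates

One then fits the conditional model as in (7.78), regressing ya on cbind(x, yb).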

7.6 Model selection: Mallows’ C p


Hypothesis testing is a good method for deciding between two nested models, but
often in linear models, and in any statistical analysis, there are many models up for
consideration. For example, in a linear model with a p × l matrix β of parameters,
there are 2^{pl} models attainable by setting subsets of the βij's to 0. There is also an

infinite number of submodels obtained by setting linear (and nonlinear) restrictions


among the parameters. One approach to choosing among many models is to min-
imize some criterion that measures the efficacy of a model. In linear models, some
function of the residual sum of squares is an obvious choice. The drawback is that,
typically, the more parameters in the model, the lower the residual sum of squares,
hence the best model always ends up being the one with all the parameters, i.e., the
entire β in the multivariate linear model. Thus one often assesses a penalty depend-
ing on the number of parameters in the model, the larger the number of parameters,
the higher the penalty, so that there is some balance between the residual sum of
squares and number of parameters.
In this section we present Mallows’ C p criterion, [Mallows, 1973]. Section 9.4 ex-
hibits the Bayes information criterion (BIC) and Akaike information criteria (AIC),
which are based on the likelihood. Mallows’ C p and the AIC are motivated by pre-
diction. We develop this idea for the both-sides model (4.28), where

Y = xβz′ + R, where R ∼ Nn×q(0, In ⊗ ΣR).   (7.81)

(We actually do not use the normality assumption on R in what follows.) This ap-
proach is found in Hasegawa [1986] and Gupta and Kabe [2000].
The observed Y is dependent on the value of other variables represented by x and
z. The objective is to use the observed data to predict a new variable Y New based
on its x New and z New . For example, an insurance company may have data on Y, the
payouts the company has made to a number of people, and (x, z), the basic data (age,
sex, overall health, etc.) on these same people. But the company is really wondering
whether to insure new people, whose (x New , z New ) they know, but Y New they must
predict. The prediction, Y  New , is a function of (x New , z New ) and the observed Y. A
good predictor has Y  New close to Y New .
The model (7.81), with β being p × l, is the largest model under consideration, and
will be called the “big” model. The submodels we will look at will be

Y = x∗ β∗ z∗ + R, (7.82)

where x∗ is n × p∗ , consisting of p∗ of the p columns of x, and similarly z∗ is q × l ∗ ,


consisting of l∗ of the l columns of z (and β∗ is p∗ × l∗). There are about 2^{p+l} such
submodels (7.82).
We act as if we wish to predict observations Y New that have the same explanatory
variables x and z as the data, so that

YNew = xβz′ + RNew,   (7.83)

where R New has the same distribution as R, and Y New and Y are independent. It is
perfectly reasonable to want to predict Y New ’s for different x and z than in the data,
but the analysis is a little easier if they are the same, plus it is a good starting point.
For a given submodel (7.82), we predict Y New by

Ŷ∗ = x∗ β̂∗ z∗′,   (7.84)

where β̂∗ is the usual estimate based on the smaller model (7.82) and the observed Y,

β̂∗ = (x∗′x∗)−1 x∗′ Y z∗ (z∗′z∗)−1.   (7.85)



It is convenient to use projection matrices as in (6.8), so that

Ŷ∗ = Px∗ Y Pz∗.   (7.86)

To assess how well a model predicts, we use the sum of squares between YNew
and Ŷ∗:

  PredSS∗ = ‖YNew − Ŷ∗‖²
          = ∑i=1..n ∑j=1..q (yijNew − ŷ∗ij)²
          = trace((YNew − Ŷ∗)′(YNew − Ŷ∗)).   (7.87)

Of course, we cannot calculate PredSS ∗ because the Y New is not observed (if it were,
we wouldn’t need to predict it), so instead we look at the expected value:

EPredSS∗ = E[trace((YNew − Ŷ∗)′(YNew − Ŷ∗))].   (7.88)

The expected value is taken assuming the big model, (7.81) and (7.83). We cannot
observe EPredSS ∗ either, because it is a function of the unknown parameters β and
Σ R , but we can estimate it. So the program is to
1. Estimate EPredSS ∗ for each submodel (7.82);
2. Find the submodel(s) with the smallest estimated EPredSS ∗ ’s.
Whether prediction is the ultimate goal or not, the above is a popular way to choose
a submodel.
We will discuss Mallows’ C p as a method to estimate EPredSS ∗ . Cross-validation
is another popular method that will come up in classification, Chapter 11. The temp-
tation is to use the observed Y in place of Y New to estimate EPredSS ∗ , that is, to use
the observed residual sum of squares

ResidSS∗ = trace((Y − Ŷ∗)′(Y − Ŷ∗)).   (7.89)

This estimate is likely to be too optimistic, because the prediction Ŷ∗ is based on the
observed Y. The ResidSS∗ is estimating its expected value,

EResidSS∗ = E[trace((Y − Ŷ∗)′(Y − Ŷ∗))].   (7.90)

Is ResidSS∗ a good estimate of PredSS∗, or EPredSS∗? We calculate and compare
EPredSS∗ and EResidSS∗.

First note that for any n × q matrix U, we have (Exercise 7.7.8)

  E[trace(U′U)] = trace(Cov[row(U)]) + trace(E[U]′E[U]).   (7.91)

We will use

  for EPredSS∗:    U = YNew − Ŷ∗ = YNew − Px∗ Y Pz∗;
  and for EResidSS∗:    U = Y − Ŷ∗ = Y − Px∗ Y Pz∗.   (7.92)

For the mean, note that by (7.81) and (7.83), Y and YNew both have mean xβz′, hence

  E[YNew − Px∗ Y Pz∗] = E[Y − Px∗ Y Pz∗] = xβz′ − Px∗ xβz′ Pz∗ ≡ Δ.   (7.93)

The covariance term for EPredSS∗ is easy because YNew and Y are independent.
Using (6.15),

  Cov[YNew − Px∗ Y Pz∗] = Cov[YNew] + Cov[Px∗ Y Pz∗]
                        = (In ⊗ ΣR) + (Px∗ ⊗ Pz∗ ΣR Pz∗).   (7.94)

See (6.14). The trace is then

  trace(Cov[YNew − Px∗ Y Pz∗]) = n trace(ΣR) + p∗ trace(ΣR Pz∗).   (7.95)

The covariance for the residuals uses (6.16):

Cov[Y − Px∗ YPz∗ ] = Qx∗ ⊗ Σ R + Px∗ ⊗ Qz∗ Σ R Qz∗ . (7.96)

Thus

  trace(Cov[Y − Px∗ Y Pz∗]) = (n − p∗) trace(ΣR) + p∗ trace(Qz∗ ΣR Qz∗)
                            = n trace(ΣR) − p∗ trace(ΣR Pz∗).   (7.97)

Applying (7.91) with (7.93), (7.95), and (7.97), we have the following result.
Lemma 7.2. For Δ given in (7.93),

  EPredSS∗ = trace(Δ′Δ) + n trace(ΣR) + p∗ trace(ΣR Pz∗);
  EResidSS∗ = trace(Δ′Δ) + n trace(ΣR) − p∗ trace(ΣR Pz∗).   (7.98)

Note that both quantities can be decomposed into a bias part, Δ Δ, and a covari-
ance part. They have the same bias, but the residuals underestimate the prediction
error by having a “− p∗ ” in place of the “+ p∗ ”:

EPredSS ∗ − EResidSS ∗ = 2p∗ trace (Σ R Pz∗ ). (7.99)

So to use the residuals to estimate the prediction error unbiasedly, we need to add
an unbiased estimate of the term in (7.99). That is easy, because we have an unbiased
estimator of Σ R .
Proposition 7.2. An unbiased estimator of EPredSS ∗ is Mallows’ C p statistic,

  Cp(x∗, z∗) = ÊPredSS∗ = ResidSS∗ + 2p∗ trace(Σ̂R Pz∗),   (7.100)

where

  Σ̂R = [1/(n − p)] Y′ Qx Y.   (7.101)
Some comments:
• The ResidSS∗ is calculated from the submodel, while the Σ̂R is calculated from
the big model.

• The estimate of prediction error takes the residual error, and adds a penalty
depending (partly) on the number of parameters in the submodel. So the larger
the submodel, generally, the smaller the residuals and the larger the penalty. A
good model balances the two.

• In univariate regression, l = 1, so there is no Pz∗ (it is 1), and ΣR = σR², so that
Mallows' Cp is

  Cp(x∗) = ResidSS∗ + 2p∗ σ̂R².   (7.102)

7.6.1 Example: Mouth sizes


Take the big model to be (6.28) again, where the x matrix distinguishes between girls
and boys, and the z matrix has the orthogonal polynomials for age. We will consider
2 × 4 = 8 submodels, depending on whether the Girls−Boys term is in the x, and
which degree polynomial (0, 1, 2, 3) we use of the z. In Section 6.4.1, we have the
R representations of x, z, and the estimate Σ  R . To find the Mallows C p ’s (7.100),
we also need to find the residuals and projection matrix Pz∗ for each model under
consideration. Let ii contain the indices of the columns of x in the model, and jj
contain the indices for the columns of z in the model. Then to find ResidSS ∗ , the
penalty term, and the C p statistics, we can use the following:

y <− mouths[,1:4]
xstar <− x[,ii]
zstar <− z[,jj]
pzstar <− zstar%∗%solve(t(zstar)%∗%zstar,t(zstar))
yhat <− xstar%∗%solve(t(xstar)%∗%xstar,t(xstar))%∗%y%∗%pzstar
residss <− sum((y−yhat)^2)
pstar <− length(ii)
penalty <− 2∗pstar∗tr(sigmaRhat%∗%pzstar)
cp <− residss + penalty
So, for example, the full model takes ii <− 1:2 and jj <− 1:4, while the model with
no difference between boys and girls, and a quadratic for the growth curve, would
take ii <− 1 and jj <− 1:3.
Here are the results for the eight models of interest:

p∗ l∗ ResidSS ∗ Penalty Cp
1 1 917.7 30.2 947.9
1 2 682.3 35.0 717.3
1 3 680.9 37.0 717.9
1 4 680.5 42.1 722.6 (7.103)
2 1 777.2 60.5 837.7
2 2 529.8 69.9 599.7
2 3 527.1 74.1 601.2
2 4 526.0 84.2 610.2
Note that in general, the larger the model, the smaller the ResidSS ∗ but the larger
the penalty. The C p statistics aims to balance the fit and complexity. The model
with the lowest C p is the (2, 2) model, which fits separate linear growth curves to the
boys and girls. We arrived at this model in Section 7.3.1 as well. The (2, 3) model
is essentially as good, but is a little more complicated. Generally, one looks for the

model with the smallest C p , but if several models have approximately the same low
C p , one chooses the simplest.

7.7 Exercises
Exercise 7.7.1. Show that when p∗ = l ∗ = 1, the T 2 in (7.17) equals the square of the
t-statistic in (6.27), assuming β ij = 0 there.
Exercise 7.7.2. Verify the equalities in (7.28).
Exercise 7.7.3. If A ∼ Gamma(α, λ) and B ∼ Gamma( β, λ), and A and B are indepen-
dent, then U = A/( A + B ) is distributed Beta(α, β). Show that when l ∗ = 1, Wilks’ Λ
(Definition 7.2) is Beta(α, β), and give the parameters in terms of p∗ and ν. [Hint: See
Exercise 3.7.8 for the Gamma distribution, whose pdf is given in (3.81). Also, the Beta
pdf is found in Exercise 2.7.13, in equation (2.95), though you do not need it here.]

Exercise 7.7.4. Suppose V ∼ Fμ,ν as in Definition 7.1. Show that U = ν/(μF + ν) is


Beta(α, β) from Exercise 7.7.3, and give the parameters in terms of μ and ν.

Exercise 7.7.5. Show that E [1/χ2ν ] = 1/(ν − 2) if ν ≥ 2, as used in (7.30). The pdf of
the chi-square is given in (3.80).
Exercise 7.7.6. Find the matrices C and D so that the hypothesis in (7.52) is the same
as that in (7.12), where β is 5 × 4.
Exercise 7.7.7. Consider the model in (7.69) and (7.70). Let W = y Qx y. Show that
W aa·b = ya Qx∗ y a . Thus one has the same estimate of Σ aa·b in the original model and
in the covariate-adjusted model. [Hint: Write out Waa·b the usual way, where the
blocks in W are Ya Qx Ya , etc. Note that the answer is a function of Qx y a and Qx yb .
Then use (5.89) with D1 = x and D2 = yb .]
Exercise 7.7.8. Prove (7.91). [Hint: Use (7.18) and Exercise 2.7.11 on row(U).]
Exercise 7.7.9. Show that Δ in (7.93) is zero in the big model, i.e, x∗ = x and z∗ = z.
Exercise 7.7.10. Verify the second equality in (7.97). [Hint: Note that trace(QΣQ) =
trace(ΣQ) if Q is idempotent. Why?]

Exercise 7.7.11 (Mouth sizes). In the mouth size data in Section 7.1.1, there are n G =
11 girls and n B = 16 boys, and q = 4 measurements on each. Thus Y is 27 × 4.
Assume that
Y ∼ N (xβ, In ⊗ Σ R ), (7.104)
where this time

  x = ⎛ 1₁₁  0₁₁ ⎞ ,   (7.105)
      ⎝ 0₁₆  1₁₆ ⎠

and

  β = ⎛ μG ⎞ = ⎛ μG1 μG2 μG3 μG4 ⎞ .   (7.106)
      ⎝ μB ⎠   ⎝ μB1 μB2 μB3 μB4 ⎠

The sample means for the two groups are μ̂G and μ̂B. Consider testing

H0 : μ G = μ B . (7.107)

(a) What is the constant c so that

  B = (μ̂G − μ̂B)/c ∼ N(0, ΣR)?   (7.108)

(b) The unbiased estimate of ΣR is Σ̂R. Then

  W = ν Σ̂R ∼ Wishart(ν, ΣR).   (7.109)

What is ν? (c) What is the value of Hotelling’s T 2 ? (d) Find the constant d and the
degrees of freedom a, b so that F = d T 2 ∼ Fa,b. (e) What is the value of F? What is
the resulting p-value? (f) What do you conclude? (g) Compare these results to those
in Section 7.1.1, equations (7.10) and below.

Exercise 7.7.12 (Skulls). This question continues Exercise 6.5.5 on Egyptian skulls. (a)
Consider testing that there is no difference among the five time periods for all four
measurements at once. What are p∗ , l ∗ , ν and T 2 in (7.22) for this hypothesis? What
is the F and its degrees of freedom? What is the p-value? What do you conclude?
(b) Now consider testing whether there is a non-linear effect on skull size over time,
that is, test whether the last three rows of the β matrix are all zero. What are l ∗ , ν, the
F-statistic obtained from T 2 , the degrees of freedom, and the p-value? What do you
conclude? (e) Finally, consider testing whether there is a linear effect on skull size
over time. Find the F-statistic obtained from T 2 . What do you conclude?

Exercise 7.7.13 (Prostaglandin). This question continues Exercise 6.5.4 on prostaglan-


din levels over time. The model is Y = 1₁₀ βz′ + R, where the ith row of z is

(1, cos(θi ), sin(θi ), cos(2θi ), sin(2θi ), cos(3θi )) (7.110)

for θi = 2πi/6, i = 1, . . . , 6. (a) In Exercise 4.4.8, the one-cycle wave is given by the
equation A + B cos(θ + C ). The null hypothesis that the model does not include that
wave is expressed by setting B = 0. What does this hypothesis translate to in terms
of the β ij ’s? (b) Test whether the one-cycle wave is in the model. What is p∗ ? (c)
Test whether the two-cycle wave is in the model. What is p∗ ? (d) Test whether the
three-cycle wave is in the model. What is p∗ ? (e) Test whether just the one-cycle wave
needs to be in the model. (I.e., test whether the two- and three-cycle waves have zero
coefficients.) (f) Using the results from parts (b) through (e), choose the best model
among the models with (1) No waves; (2) Just the one-cycle wave; (3) Just the one-
and two-cycle waves; (4) The one-, two-, and three-cycle waves. (g) Use Mallows’ C p
to choose among the four models listed in part (f).

Exercise 7.7.14 (Histamine in dogs). Consider the model for the histamine in dogs
example in (4.15), i.e.,

                 ⎛⎛ 1 −1 −1  1 ⎞      ⎞ ⎛ μb μ1 μ2 μ3 ⎞
  Y = xβ + R =   ⎜⎜ 1 −1  1 −1 ⎟ ⊗ 1₄ ⎟ ⎜ αb α1 α2 α3 ⎟ + R.   (7.111)
                 ⎜⎜ 1  1 −1 −1 ⎟      ⎟ ⎜ βb β1 β2 β3 ⎟
                 ⎝⎝ 1  1  1  1 ⎠      ⎠ ⎝ γb γ1 γ2 γ3 ⎠

For the following two null hypotheses, specify which parameters are set to zero, then
find p∗ , l ∗ , ν, the T 2 and its F version, the degrees of freedom for the F, the p-value,
and whether you accept or reject. Interpret the finding in terms of the groups and
variables. (a) The four groups have equal means (for all four time points). Compare
the results to that for the hypothesis in (7.46). (b) The four groups have equal before
means. (c) Now consider testing the null hypothesis that the after means are equal,
but using the before measurements as a covariate. (So we assume that αb = β b =
γb = 0.) What are the dimensions of the resulting Ya and the x matrix augmented
with the covariate? What are p∗ , l ∗ , ν, and the degrees of freedom in the F for testing
the null hypothesis. (d) The x x from the original model (not using the covariates)
is 16 × I4 , so that the [(x x)−1 ] ∗ = (1/16)I3 . Compare the diagonals (i.e., 1/16)
to the diagonals of the analogous matrix in the model using the covariate. How
much smaller or larger, percentagewise, are the covariate-based diagonals than the
original? (e) The diagonals of the Σ  ∗z in the original model are 0.4280, 0.1511, and
0.0479. Compare these diagonals to the diagonals of the analogous matrix in the
model using the covariate. How much smaller or larger, percentagewise, are the
covariate-based diagonals than the original? (f) Find the T 2 , the F statistic, and the
p-value for testing the hypothesis using the covariate. What do you conclude? How
does this result compare to that without the covariates?

Exercise 7.7.15 (Histamine, cont.). Continue the previous question, using as a starting
point the model with the before measurements as the covariate, so that
            ⎛ μ1∗ μ2∗ μ3∗ ⎞
            ⎜ α1∗ α2∗ α3∗ ⎟
  Y∗ = x∗   ⎜ β1∗ β2∗ β3∗ ⎟ z′ + R∗,   (7.112)
            ⎜ γ1∗ γ2∗ γ3∗ ⎟
            ⎝ δ1  δ2  δ3  ⎠
where Y∗ has just the after measurements, x∗ is the x in (7.111) augmented with
the before measurements, and z represents orthogonal polynomials for the after time
points,

      ⎛ 1 −1  1 ⎞
  z = ⎜ 1  0 −2 ⎟ .   (7.113)
      ⎝ 1  1  1 ⎠
Now consider the equivalent model resulting from multiplying both sides of the
equation on the right by (z′)−1. (a) Find the estimates and standard errors for
the quadratic terms, (μ3∗, α3∗, β3∗, γ3∗). Test the null hypothesis that (μ3∗, α3∗, β3∗, γ3∗) =
(0, 0, 0, 0). What is ν? What is the p-value? Do you reject this null? (The answer
should be no.) (b) Now starting with the model from part (a), use the vector of
quadratic terms as the covariate. Find the estimates and standard errors of the rele-
vant parameters, i.e., ⎛ ∗ ⎞
μ1 μ2∗
⎜ α∗ α∗ ⎟
⎜ 1∗ 2 ⎟. (7.114)
⎝ β
1 β∗2 ⎠

γ1 γ2 ∗

(c) Use Hotelling’s T 2 to test the interaction terms are zero, i.e., that (γ1∗ , γ2∗ ) = (0, 0).
(What are l ∗ and ν?) Also, do the t-tests for the individual parameters. What do you
conclude?
7.7. Exercises 133

Exercise 7.7.16 (Caffeine). This question uses the data on he effects of caffeine on
memory described in Exercise 4.4.4. The model is as in (4.35), with x as described
there, and 
1 −1
z= . (7.115)
1 1
The goal of this problem is to use Mallows’ C p to find a good model, choosing among
the constant, linear and quadratic models for x, and the “overall mean" and “overall
mean + difference models" for the scores. Thus there are six models. (a) For each
of the 6 models, find the p∗ , l ∗ , residual sum of squares, penalty, and C p values. (b)
Which model is best in terms of C p ? (c) Find the estimate of β∗ for the best model.
(d) What do you conclude?
Chapter 8

Some Technical Results

This chapter contains a number of results useful for linear models and other models,
including the densities of the multivariate normal and Wishart. We collect them here
so as not to interrupt the flow of the narrative.

8.1 The Cauchy-Schwarz inequality


Lemma 8.1. Cauchy-Schwarz inequality. Suppose y and d are 1 × K vectors. Then

(yd )2 ≤ y2 d2 , (8.1)

with equality if and only if d is zero, or

y=γ
d (8.2)

.
for some constant γ

Proof. If d is zero, the result is immediate. Suppose d = 0, and let 


y be the projection
of y onto span{d}. (See Definitions 5.2 and 5.7.) Then by least-squares, Theorem 5.2
(with D = d ), y=γ d, where here γ = yd /d2 . The sum-of-squares decomposition
in (5.11) implies that
y 2 − 
y 2 =  y − 
y2 ≥ 0, (8.3)
which yields
(yd )2
y 2 ≥ 
y 2 = , (8.4)
d2
from which (8.1) follows. Equality in (8.1) holds if and only if y = 
y, which holds if
and only if (8.2).

If U and V are random variables, with E [|UV |] < ∞, then the Cauchy-Schwarz
inequality becomes
E [UV ]2 ≤ E [U 2 ] E [V 2 ], (8.5)
with equality if and only if V is zero with probability one, or

U = bV (8.6)

135
136 Chapter 8. Technical Results

for constant b = E [UV ] /E [V 2 ]. See Exercise 8.8.2. The next result is well-known in
statistics.

Corollary 8.1 (Correlation inequality). Suppose Y and X are random variables with finite
positive variances. Then
−1 ≤ Corr [Y, X ] ≤ 1, (8.7)
with equality if and only if, for some constants a and b,

Y = a + bX. (8.8)

Proof. Apply (8.5) with U = Y − E [Y ] and V = X − E [ X ] to obtain

Cov[Y, X ]2 ≤ Var [Y ]Var [ X ], (8.9)

from which (8.7) follows. Then (8.8) follows from (8.6), with b = Cov[Y, X ] /Var [ X ]
and a = E [Y ] − bE [ X ], so that (8.8) is the least squares fit of X to Y.

This inequality for the sample correlation coefficient of n × 1 vectors x and y fol-
lows either by using Lemma 8.1 on Hn y and Hn x, where Hn is the centering matrix
(1.12), or by using Corollary 8.1 with X and Y having the empirical distributions
given by x and y, respectively, i.e.,

1 1
P[ X = x] = #{ xi = x } and P [Y = y] = #{yi = y}. (8.10)
n n
The next result also follows from Cauchy-Schwarz. It will be useful for Hotelling’s
T 2 in Section 8.4.1, and for canonical correlations in Section 13.3.

Corollary 8.2. Suppose y and d are 1 × K vectors, and y = 1. Then

(yd )2 ≤ d2 , (8.11)

with equality if and only if d is zero, or d is nonzero and

d
y=± . (8.12)
d

8.2 Conditioning in a Wishart


We start with W ∼ Wishart p+q (ν, Σ) as in Definition 3.6, where Σ is partitioned

Σ XX Σ XY
Σ= , (8.13)
ΣYX ΣYY

Σ XX is p × p, ΣYY is q × q, and W is partitioned similarly. We are mainly interested


in the distribution of
WYY · X = WYY − WYX W− 1
XX W XY (8.14)
(see Equation 3.49), but some additional results will easily come along for the ride.
8.3. Expectation of inverse Wishart 137

Proposition 8.1. Consider the situation above, where Σ XX is invertible and ν ≥ p. Then

(WXX , WXY ) is independent of WYY · X , (8.15)

WYY · X ∼ Wishartq (ν − p, ΣYY · X ), (8.16)


and

W XY | W XX = wxx ∼ N (wxx Σ−
XX Σ XY , w xx ⊗ ΣYY · X ),
1
(8.17)
W XX ∼ Wishart p (ν, Σ XX ). (8.18)

Proof. The final equation is just the marginal of a Wishart, as in Section 3.6. By
Definition 3.6 of the Wishart,

W = D (X Y) (X Y), (X Y) ∼ N (0, In ⊗ Σ), (8.19)

where X is n × p and Y is n × q. Conditioning as in (3.56), we have

Y | X = x ∼ Nn×q (xβ, In ⊗ ΣYY · X ), β = Σ− 1


XX Σ XY . (8.20)

(The α = 0 because the means of X and Y are zero.) Note that (8.20) is the both-
sides model (6.1), with z = Iq and Σ R = ΣYY · X . Thus by Theorem 6.1 and the
plug-in property (2.62) of conditional distributions, β = (X X)−1 X Y and Y QX Y are
conditionally independent given X = x,

β | X = x ∼ N ( β, (x x)−1 ⊗ ΣYY · X ), (8.21)

and
Y QX Y | X = x ∼ Wishartq (n − p, ΣYY · X ). (8.22)
The conditional distribution in (8.22) does not depend on x, hence Y QX Y
is (uncon-
ditionally) independent of the pair (X, β), as in (2.65) and therebelow, hence

Y QX Y is independent of (X X, X Y). (8.23)

Property (2.66) implies that

β | X X = x x ∼ N ( β, (x x)−1 ⊗ ΣYY · X ), (8.24)

hence
X Y = (X X) β | X X = x x ∼ N (x x) β, (x x) ⊗ ΣYY · X ). (8.25)
Translating to W using (8.19), noting that Y QX Y = WYY · X , we have that (8.23) is
(8.15), (8.22) is (8.16), and (8.25) is (8.17).

8.3 Expectation of inverse Wishart


We first prove Lemma 7.1 for

U ∼ Wishartl ∗ (ν, Il ∗ ). (8.26)


138 Chapter 8. Technical Results

For any q × q orthogonal matrix Γ, ΓUΓ  has the same distribution as U, hence in
particular
E [U−1 ] = E [(ΓUΓ  )−1 ] = ΓE [U−1 ] Γ  . (8.27)
Exercise 8.8.6 shows that any symmetric q × q matrix A for which A = ΓAΓ  for all
orthogonal Γ must be of the form a11 Iq . Thus

E [U−1 ] = E [(U−1 )11 ] Il ∗ . (8.28)

Using (5.85),
% & % &
1 1 1
E [(U−1 )11 ] = E =E = . (8.29)
U11·{2:q} χ2ν−l ∗ +1 ν − l∗ − 1

Equations (8.28) and (8.29) show that

1
E [ U−1 ] = I ∗. (8.30)
ν − l∗ − 1 l

Next take W ∼ Wishartq (ν, Σ), with Σ invertible. Then, W = D Σ1/2 UΣ1/2 , and

E [W−1 ] = E [(Σ1/2 UΣ1/2 )−1 ]


1
= Σ−1/2 I ∗ Σ−1/2
ν − l∗ − 1 l
1
= Σ −1 , (8.31)
ν − l∗ − 1
verifying (7.32).

8.4 Distribution of Hotelling’s T 2


Here we prove Proposition 7.1. Exercise 8.8.5 shows that we can assume Σ = Il ∗ in
the proof, which we do. Divide and multiply the ZW−1 Z by Z2 :

ZW−1 Z
ZW−1 W = Z2 . (8.32)
Z2

Because Z is a vector of l ∗ independent standard normals,

Z2 ∼ χ2l ∗ . (8.33)

Consider the distribution of the ratio conditional on Z = z:

ZW−1 Z
| Z = z. (8.34)
Z2

Because Z and W are independent, we can use the plugin formula (2.63), so that
% &
ZW−1 Z zW−1 z
| Z = z =D = g1 W−1 g1 , (8.35)
Z 2  z 2
8.4. Distribution of Hotelling’s T 2 139

where g1 = z/z. Note that on the right-hand side we have the unconditional
distribution for W. Let G be any l ∗ × l ∗ orthogonal matrix with g1 as its first row.
(Exercise 5.6.18 guarantees there is one.) Then

g1 W−1 g1 = e1 GW−1 G e1 , e1 = (1, 0, . . . , 0). (8.36)


Because the covariance parameter in the Wishart distribution for W is Il ∗ , U ≡
GWG ∼ Wishartl ∗ (ν, Il ∗ ). But

e1 GW−1 G e1 = e1 U−1 e1 = [U−1 ]11 = U11


−1
·{2:l ∗ }
(8.37)

by (5.85).
Note that the distribution of U, hence [U−1 ]11 , does not depend on z, which means
that
ZW−1 Z
is independent of Z. (8.38)
Z2
Furthermore, by (8.16), where p = l ∗ − 1 and q = 1,

U11·2 ∼ Wishart1 (ν − l ∗ + 1, 1) ≡ χ2ν−l ∗ +1 . (8.39)


Now (8.32) can be expressed as

Z2 χ2l ∗
ZW−1 Z = =D , (8.40)
U11·{2:l ∗ } χ2ν−l ∗ +1

where the two χ2 ’s are independent. Then (7.29) follows from Definition 7.1 for the
F.

8.4.1 A motivation for Hotelling’s T 2


Hotelling’s T 2 test can be motivated using the projection pursuit idea. Let a = 0 be
an 1 × l ∗ vector of constants, and look at
Za ∼ N (0, aΣa ) and aWa ∼ Wishart1 (ν, aΣa ) = (aΣa ) χ2ν . (8.41)
Now we are basically in the univariate t (6.25) case, i.e.,

Za
Ta = √ ∼ tν , (8.42)
aWa /ν
or, since t2ν = F1,ν ,
(Za )2
Ta2 = ∼ F1,ν . (8.43)
aWa /ν
For any a, we can do a regular F test. The projection pursuit approach is to find the
a that gives the most significant result. That is, we wish to find

T 2 = max Ta2 . (8.44)


a =0

To find the best a, first simplify the denominator by setting

b = aW1/2 , so that a = bW−1/2. (8.45)


140 Chapter 8. Technical Results

Then
(Vb )2
T 2 = max ν , where V = ZW−1/2 . (8.46)
b  =0 bb
Letting g = b/b, so that g = 1, Corollary 8.2 of Cauchy-Schwarz shows that (see
Exercise 8.8.9)
T 2 = ν max (Vg )2 = νVV = νZW−1 Z , (8.47)
g | g=1

which is indeed Hotelling’s T 2 of (7.28), a multivariate generalization of Student’s


t2 . Even though Ta2 has an F1,ν distribution, the T 2 does not have that distribution,
because it maximizes over many F1,ν ’s.

8.5 Density of the multivariate normal


Except for when using likelihood methods in Chapter 9, we do not need the density
of the multivariate normal, nor of the Wishart, for our main purposes, but present
them here because of their intrinsic interest. We start with the multivariate normal,
with positive definite covariance matrix.
Lemma 8.2. Suppose Z ∼ N1× N (μ, Ω), where Ω is positive definite. Then the pdf of Z is
1 1 ( z − μ ) Ω − 1( z − μ ) 
e− 2
1
f (z | μ, Ω) = . (8.48)
(2π ) N/2 | Ω|1/2
Proof. Recall that a multivariate normal vector is an affine transform of a vector of
independent standard normals,
Y = ZA + μ, Z ∼ N1× N (0, I N ), AA = Ω. (8.49)
We will take A to be N × N, so that Ω being positive definite implies that A is
invertible. Then
Z = (Y − μ)(A )−1 , (8.50)
and the Jacobian is
⎛ ⎞
 ∂z1 /∂y1 ∂z1 /∂y2 ··· ∂z1 /∂y N 
 
 ⎜ ··· ⎟
∂z ⎜ ∂z2 /∂y1 ∂z2 /∂y2 ∂z2 /∂y N ⎟
| | ≡ ⎜ .. .. .. .. ⎟ = |(A )−1 |. (8.51)
∂y ⎝ . . . . ⎠
 
 ∂z /∂y ∂z N /∂y2 ··· ∂z N /∂y N 
N 1

The density of Z is
1 1 zz
e− 2 z21 +···z2N
e− 2
1 1
f (z | 0, I N ) = = , (8.52)
(2π ) N/2 (2π ) N/2
so that
1  −1 −1 
abs |(A )−1 | e− 2 (y−μ)(A ) A (y−μ)
1
f (y | μ, Ω) =
(2π ) N/2
1  −1 
|AA | −1/2 e− 2 (y−μ)(AA ) (y−μ) ,
1
= (8.53)
(2π ) N/2
from which (8.48) follows.
8.6. The QR decomposition for the multivariate normal 141

When Z can be written as a matrix with a Kronecker product for its covariance
matrix, as is often the case for us, the pdf can be compactified.
Corollary 8.3. Suppose Y ∼ Nn×q (M, C ⊗ Σ), where C (n × n) and Σ (q × q) are positive
definite. Then
1 1 −1 −1 
e− 2 trace(C (y−M) Σ (y−M) ) .
1
f (y | M, C, Σ) = (8.54)
(2π )nq/2 |C| q/2 | Σ| n/2
See Exercise 8.8.15 for the proof.

8.6 The QR decomposition for the multivariate normal


Here we discuss the distributions of the Q and R matrices in the QR decomposition
of a multivariate normal matrix. From the distribution of the upper triangular R
we obtain Bartlett’s decomposition [Bartlett, 1939], useful for randomly generating
Wisharts, as well as derive the Wishart density in Section 8.7. Also, we see that Q
has a certain uniform distribution, which provides a method for generating random
orthogonal matrices from random normals. The results are found in Olkin and Roy
[1954], and this presentation is close to that of Kshirsagar [1959]. (Old school, indeed!)
We start with the data matrix
Z ∼ Nν×q (0, Iν ⊗ Iq ), (8.55)
a matrix of independent N (0, 1)’s, where ν ≥ q, and consider the QR decomposition
(Theorem 5.3)
Z = QR. (8.56)
We find the distribution of the R. Let
S ≡ Z Z = R R ∼ Wishartq (ν, Iq ). (8.57)
Apply Proposition 8.1 with S XX being the single element S11 . Because Σ = Iq ,
(S11 , S1{2:q} ) is independent of S{2:q}{2:q}·1,

S11 ∼ Wishart1 (ν, I1 ) = χ2ν ,


S1{2:q} | S11 = s11 ∼ N1×( q−1) (0, s11 Iq−1 ),
and S{2:q}{2:q}·1 ∼ Wishartq−1 (ν − 1, Iq−1 ). (8.58)

Note that S1{2:q} / S11 , conditional on S11 , is N (0, Iq−1 ), in particular, is independent

of S11 . Thus the three quantities S11 , S1{2:q} / S11 , and S{2:q}{2:q}·1 are mutually
independent. Equation (5.67) shows that

R11 = S11 ∼ χ2ν
  
and R12 · · · R1q = S1{2:q} / S11 ∼ N1×( q−1) (0, Iq−1 ). (8.59)

Next, work on the first component of S22·1 of S{2:q}{2:q}·1. We find that



R22 = S22·1 ∼ χ2ν−1
  
and R23 · · · R2q = S2{3:q}·1/ S22·1 ∼ N1×( q−2) (0, Iq−2 ), (8.60)
142 Chapter 8. Technical Results

both independent of each other, and of S{3:q}{3:q}·12. We continue, obtaining the


following result.

Lemma 8.3 (Bartlett’s decomposition). Suppose S ∼ Wishartq (ν, Iq ), where ν ≥ q, and


let its Cholesky decomposition be S = R R. Then the elements of R are mutually independent,
where
R2ii ∼ χ2ν−i+1, i = 1, . . . , q, and Rij ∼ N (0, 1), 1 ≤ i < j ≤ q. (8.61)

Next, suppose Y ∼ Nν×q (0, Iν ⊗ Σ), where Σ is invertible. Let A be the matrix
such that
Σ = A A, where A ∈ Tq+ of (5.63), (8.62)

i.e., A A is the Cholesky decomposition of Σ. Thus we can take Y = ZA. Now


Y = QV, where V ≡ RA is also in Tq+ , and Q still has orthonormal columns. By the
uniqueness of the QR decomposition, QV is the QR decomposition for Y. Then

Y Y = V V ∼ Wishart(0, Σ). (8.63)

We call the distribution of V the Half-Wishartq (ν, A).


To generate a random W ∼ Wishartq (ν, Σ) matrix, one can first generate q (q − 1)/2
N (0, 1)’s, and q χ2 ’s, all independently, then set the Rij ’s as in (8.61), then calculate
V = RA, and W = V V. If ν is large, this process is more efficient than generating
the νq normals in Z or Y. The next section derives the density of the Half-Wishart,
then that of the Wishart itself.
We end this section by completing description of the joint distribution of (Q, V).
Exercise 3.7.35 handled the case Z ∼ N2×1 (0, I2 ).

Lemma 8.4. Suppose Y = QV as above. Then

(i) Q and V are independent;

(ii) The distribution of Q in does not depend on Σ;

(iii) The distribution of Q is invariant under orthogonal transforms: If Γ ∈ On , the group of


n × n orthogonal matrices (see (5.58)), then

Q = D ΓQ. (8.64)

Proof. From above, we see that Z and Y = ZA have the same Q. The distribution of Z
does not depend on Σ, hence neither does the distribution of Q, proving part (ii). For
part (iii), consider ΓY, which has the same distribution as Y. We have ΓY = (ΓQ)V.
Since ΓQ also has orthonormal columns, the uniqueness of the QR decomposition
implies that ΓQ is the “Q” for ΓY. Thus Q and ΓQ have the same distribution.
Proving the independence result of part (i) takes some extra machinery from math-
ematical statistics. See, e.g., Lehmann and Casella [1998]. Rather than providing all
the details, we outline how one can go about the proof. First, V can be shown to be a
complete sufficient statistic for the model Y ∼ N (0, Iν ⊗ Σ). Basu’s Lemma says that
any statistic whose distribution does not depend on the parameter, in this case Σ, is
independent of the complete sufficient statistic. Thus by part (ii), Q is independent
of V.
8.7. Density of the Wishart 143

If n = q, the Q is an orthogonal matrix, and its distribution has the Haar probabil-
ity measure, or uniform distribution, over Oν . It is the only probability distribution
that does have the above invariance property, although proving the fact is beyond
this book. See Halmos [1950]. Thus one can generate a random q × q orthogonal
matrix by first generating an q × q matrix of independent N (0, 1)’s, then performing
Gram-Schmidt orthogonalization on the columns, normalizing the results so that the
columns have norm 1.

8.7 Density of the Wishart


We derive the density of the Half-Wishart, then the Wishart. We need to be careful
with constants, and find two Jacobians. Some details are found in Exercises 8.8.16 to
8.8.21.
We start by writing down the density of R ∼ Half-Wishartq (ν, Iq ), assuming n ≥ q,
as in (8.61), The density of U (> 0), where U 2 ∼ χ2k , is

1
u k −1 e − 2 u .
1 2
f k (u ) = (8.65)
Γ (k/2)2( k/2)−1

Thus that for R is

1 n−q 
r ν−1 r ν−2 · · · rqq e− 2 trace(r r) ,
1
f R (r ) = (8.66)
c(ν, q ) 11 22

where
q 
ν−j+1
c(ν, q ) = π q( q−1) /4 2( νq/2)−q ∏Γ . (8.67)
j =1
2

For V ∼ Half-Wishartq (ν, Σ), where Σ is invertible, we set V = RA, where A A is


the Cholesky decomposition of Σ in (8.62). The Jacobian J is given by
 
1  ∂v 
=   = a11 a222 · · · aqq .
q
(8.68)
J ∂r

Thus, since v jj = a jj r jj , the density of V is

ν −1 ν −2 ν−q
1 v11 v22 · · · vqq − 1 trace((A ) −1v vA−1 ) 1
f V (v | Σ ) = ν−q e
2
c(ν, q ) a ν − 1
a ν −2
· · · a qq a 11 a 2 · · · aq
22 qq
11 22
1 1 ν−q −1 
vν−1 vν−2 · · · vqq e− 2 trace( Σ v v) ,
1
= (8.69)
c(ν, q ) | Σ| ν/2 11 22

since | Σ| = ∏ a2ii . See Exercise 5.6.31.


Finally, suppose W ∼ Wishartq (ν, Σ), so that we can take W = V V. The Jacobian
is
 
1  ∂w 
=   = 2q vq vq−1 · · · vqq . (8.70)
J∗ ∂v  11 22
144 Chapter 8. Technical Results

Thus from (8.69),


ν −1 ν −2 ν−q
1 1 1 v11 v22 · · · vqq − 1w )
e− 2 trace( Σ
1
f W (w | Σ ) = q
2 c(ν, q ) | Σ| ν/2 vq vq−1 · · · vqq
11 22
1 1 −1
|w| ( ν−q−1) /2 e− 2 trace( Σ w) ,
1
= (8.71)
d(ν, q ) | Σ| n/2

where 
q
ν−j+1
d(ν, q ) = π q( q−1) /4 2νq/2 ∏Γ . (8.72)
j =1
2

8.8 Exercises
Exercise 8.8.1. Suppose y is 1 × K, D is M × K, and D D is invertible. Let y  be
the projection of y onto the span of the rows of D , so that y D , where γ
 = γ  =
yD(D D)−1 is the least-squares estimate as in (5.17). Show that

y2 = yD(D D)−1 D y .


 (8.73)

(Notice the projection matrix, from (5.19).) Show that in the case D = d , i.e., M = 1,
(8.73) yields the equality in (8.4).
Exercise 8.8.2. Prove the Cauchy-Schwarz inequality for random variables U and V
given in (8.5) and (8.6), assuming that V is not zero with probability one. [Hint: Use
least squares, by finding b to minimize E [(U − bV )2 ].]
Exercise 8.8.3. Prove Corollary 8.2. [Hint: Show that (8.11) follows from (8.1), and
 = ±1/d, using the fact that y = 1.]
that (8.2) implies that γ
Exercise 8.8.4. For W in (8.19), verify that X X = W XX , X Y = W XY , and Y QX Y =
WYY · X , where QX = In − X(X X)−1 X .
Exercise 8.8.5. Suppose Z ∼ N1×l ∗ (0, Σ) and W ∼ Wishartl ∗ (ν, Σ) are as in Proposi-
tion 7.1. (a) Show that for any l ∗ × l ∗ invertible matrix A,

ZW−1 Z = (ZA)(A WA)−1 (ZA) . (8.74)

(b) For what A do we have ZA ∼ N1×l ∗ (0, Il ∗ ) and A WA ∼ Wishartl ∗ (ν, Il ∗ )?


Exercise 8.8.6. Let A be a q × q symmetric matrix, and for q × q orthogonal matrix Γ,
contemplate the equality
A = ΓAΓ  . (8.75)
(a) Suppose (8.75) holds for all permutation matrices (matrices with one “1” in each
row, and one “1” in each column, and zeroes elsewhere). Show that all the diagonals
of A must be equal (i.e., a11 = a22 = · · · = aqq ), and that all off-diagonals must be
equal (i.e., aij = akl if i = j and k = l). [Hint: You can use the permutation matrix
that switches the first two rows and first two columns,
⎛ ⎞
0 1 0
Γ=⎝ 1 0 0 ⎠, (8.76)
0 0 Iq −2
8.8. Exercises 145

to show that a11 = a22 and a1i = a2i for i = 3, . . . , q. Similar equalities can be obtained
by switching other pairs of rows and columns.] (b) Suppose (8.75) holds for all Γ that
are diagonal, with each diagonal element being either +1 or −1. (They needn’t all be
the same sign.) Show that all off-diagonals must be 0. (c) Suppose (8.75) holds for all
orthogonal Γ. Show that A must be of the form a11 Iq . [Hint: Use parts (a) and (b).]
Exercise 8.8.7. Verify the three equalities in (8.37).

Exercise 8.8.8. Show that t2ν = F1,ν .

Exercise 8.8.9. Find the g in (8.47) that maximizes (Vg )2 , and show that the maxi-
mum is indeed VV . (Use Corollary 8.2.) What is a maximizing a in (8.44)?
Exercise 8.8.10. Suppose that z = yB, where z and y are 1 × N, and B is N × N and
invertible. Show that | ∂z/∂y| = |B|. (Recall (8.51)).

Exercise 8.8.11. Show that for N × N matrix A, abs |(A )−1 | = |AA | −1/2 .
Exercise 8.8.12. Let

Σ XX 0
(X, Y) ∼ Np+q (0, 0), . (8.77)
0 ΣYY

where X is 1 × p, Y is 1 × q, Cov[X] = Σ XX , and Cov[Y] = ΣYY . By writing out the


density of (X, Y), show that X and Y are independent. (Assume the covariances are
invertible.)
Exercise 8.8.13. Take (X, Y) as in (8.77). Show that X and Y are independent by using
moment generating functions. Do you need that the covariances are invertible?
Exercise 8.8.14. With

σXX σXY
( X, Y ) ∼ N1×2 (μ X , μY ), , (8.78)
σYX σYY

derive the conditional distribution Y | X = x explicitly using densities, assuming


σXX > 0. That is, show that f Y | X (y| x ) = f ( x, y)/ f X ( x ).

Exercise 8.8.15. Prove Corollary 8.3. [Hint: Make the identifications z = row(y),
μ = row(M), and Ω = C ⊗ Σ in (8.48). Use (3.32f) for the determinant term in the
density. For the term in the exponent, use (3.32d) to help show that

trace (C−1 (y − M)Σ−1 (y − M) ) = (z − μ)(C−1 ⊗ Σ−1 )(z − μ) .] (8.79)



Exercise 8.8.16. Show that U = X, where X ∼ χ2k , has density as in (8.65).
Exercise 8.8.17. Verify (8.66) and (8.67). [Hint: Collect the constants as in (8.65),
along with the constants from the normals (the Rij ’s, j > i). The trace in the exponent
collects all the r2ij .]

Exercise 8.8.18. Verify (8.68). [Hint: Vectorize the matrices by row, leaving out the
structural zeroes, i.e., for q = 3, v → (v11 , v12 , v13 , v22 , v23 , v33 ). Then the matrix of
derivatives will be lower triangular.]
146 Chapter 8. Technical Results

Exercise 8.8.19. Verify (8.69). In particular, show that

trace((A )−1 v vA−1 ) = trace(Σ−1 v v) (8.80)

and ∏ a jj = | Σ|1/2. [Recall (5.87).]


Exercise 8.8.20. Verify (8.70). [Hint: Vectorize the matrices as in Exercise 8.8.18, where
for w just take the elements in the upper triangular part.]
Exercise 8.8.21. Verify (8.71) and (8.72).

Exercise 8.8.22. Suppose V ∼ Half-Wishartq (ν, Σ) as in (8.63), where Σ is positive


definite and ν ≥ p. Show that the diagonals Vjj are independent, and

Vii2 ∼ σjj·{1: ( j−1)} χ2ν− j+1 . (8.81)

[Hint: Show that with V = RA as in (8.61) and (8.62), Vjj = a jj R jj . Apply (5.67) to the
A and Σ.]

Exercise 8.8.23. For a covariance matrix Σ, | Σ| is called the population generalized


variance. It is an overall measure of spread. Suppose W ∼ Wishartq (ν, Σ), where Σ
is positive definite and ν ≥ q. Show that
1
|'
Σ| = |W| (8.82)
ν ( ν − 1) · · · ( ν − q + 1)
is an unbiased estimate of the generalized variance. [Hint: Find the Cholesky decom-
position W = V V, then use (8.81) and (5.86).]

Exercise 8.8.24 (Bayesian inference). Consider Bayesian inference for the covariance
matrix. It turns out that the conjugate prior is an inverse Wishart on the covariance
matrix, which means Σ−1 has a Wishart prior. Specifically, let

Ψ = Σ−1 and ν0 Ψ0 = Σ0−1 , (8.83)

where Σ0 is the prior guess of Σ, and ν0 is the “prior sample size.” (The larger the ν0 ,
the more weight is placed on the prior vs. the data.) Then the model in terms of the
inverse covariance parameter matrices is

W | Ψ = ψ ∼ Wishartq (ν, ψ −1 )
Ψ ∼ Wishartq (ν0 , Ψ0 ) , (8.84)
where ν ≥ q, ν0 ≥ q and Ψ0 is positive definite, so that Ψ, hence Σ, is invertible with
probability one. Note that the prior mean for Ψ is Σ0−1 . (a) Show that the joint density
of (W, Ψ) is
−1
f W | Ψ (w | ψ ) f Ψ (ψ ) = c(w)| ψ | ( ν+ν0 −q−1) /2 e− 2 trace((w+Ψ0 ) ψ) ,
1
(8.85)

where c(w) is some constant that does not depend on ψ, though it does depend on
Ψ0 and ν0 . (b) Without doing any calculations, show that the posterior distribution
of Ψ is
Ψ | W = w ∼ Wishartq (ν + ν0 , (w + Ψ0−1 )−1 ). (8.86)
8.8. Exercises 147

[Hint: Dividing the joint density in (8.85) by the marginal density of W, f W (w), yields
the posterior density just like the joint density, but with a different constant, say,
c∗ (w). With ψ as the variable, the density is a Wishart one, with given parameters.]
(c) Letting S = W/ν be the sample covariance matrix, show that the posterior mean
of Σ is
1
E [ Σ | W = w] = (νS + ν0 Σ0 ), (8.87)
ν + ν0 − q − 1
close to a weighted average of the prior guess and observed covariance matrices.
[Hint: Use Lemma 7.1 on Ψ, rather than trying to find the distribution of Σ.]

Exercise 8.8.25 (Bayesian inference). Exercise 3.7.30 considered Bayesian inference on


the normal mean when the covariance matrix is known, and the above Exercise 8.8.24
treated the covariance case with no mean apparent. Here we present a prior to deal
with the mean and covariance simultaneously. It is a two-stage prior:

μ | Ψ = ψ ∼ Np×q (μ0 , (K0 ⊗ ψ )−1 ),


Ψ ∼ Wishartq (ν0 , Ψ0 ) . (8.88)

Here, K0 , μ0 , Ψ0 and ν0 are known, where K0 and Ψ0 are positive definite, and ν0 ≥ q.
Show that unconditionally, E [ μ] = μ0 and, using (8.83),
1 ν0
Cov[ μ] = (K0 ⊗ Ψ0 )−1 = K−1 ⊗ Σ 0 . (8.89)
ν0 − q − 1 ν0 − q − 1 0

[Hint: Use the covariance decomposition in (2.74) on Ψ.]


Exercise 8.8.26. This exercise finds density of the marginal distribution of the μ in
(8.88). (a) Show that the joint density of μ and Ψ can be written
1
f μ,Ψ (m, ψ ) = | Ψ0 | −ν0 /2 |K0 | q/2 | ψ | ( ν0+ p−q−1) /2
(2π ) pq/2 d(ν0 , q )
 −1
e− 2 trace(((m−μ0) K0 (m−μ0 )+Ψ0 ) ψ) ,
1
(8.90)

for the Wishart constant d(ν0 , q ) given in (8.72). [Hint: Use the pdfs in (8.54) and
(8.69).] (b) Argue that the final two terms in (8.90) (the | ψ | term and the exponential
term) look like the density of Ψ if

Ψ ∼ Wishartq (ν0 + p, ((m − μ0 ) K0 (m − μ0 ) + Ψ0−1 )−1 ), (8.91)

but without the constants, hence integrating over ψ yields the inverse of those con-
stants. Then show that the marginal density of μ is

f μ (m) = f μ,Ψ (m, ψ )dψ
d(ν0 + p, q )
= | Ψ0 | −ν0 /2 |K0 | q/2 |(m − μ0 ) K0 (m − μ0 ) + Ψ0−1 | −( ν0 + p) /2
(2π ) pq/2 d(ν0 , q )
1 | Ψ0 | p/2 |K0 | q/2
= ,
c(ν0 , p, q ) |(m − μ0 ) K0 (m − μ0 )Ψ0 + Iq | ( ν0 + p) /2


(8.92)
148 Chapter 8. Technical Results

where
c(ν0 , p, q ) = (2π ) pq/2 d(ν0 , q )/d(ν0 + p, q ). (8.93)
This density for μ is a type of multivariate t. Hotellings T 2 is another type. (c) Show
that if p = q = 1, μ0 = 0, K0 = 1/ν0 and Ψ0 = 1, that the pdf (8.92) is that of a
Student’s t on ν0 degrees of freedom:

Γ ((ν + 1)/2) 1
f (t | ν0 ) = √ 0 . (8.94)
νπ Γ (ν0 /2) (1 + t /ν0 )( ν0 +1) /2
2

Exercise 8.8.27 (Bayesian inference). Now we add some data to the prior in Exercise
8.8.25. The conditional model for the data is

Y | μ = m, Ψ = ψ ∼ Np×q (m, (K ⊗ ψ )−1 ),


W | μ = m, Ψ = ψ ∼ Wishartq (ν, ψ −1 ), (8.95)

where Y and W are independent given μ and Ψ. Note that W’s distribution does not
depend on the μ. The conjugate prior is given in (8.88), with the conditions given
therebelow. The K is a fixed positive definite matrix. A curious element is that prior
covariance of the mean and the conditional covariance of Y have the same ψ, which
helps tractability (as in Exercise 6.5.3). (a) Justify the following equations:

f Y,W,μ,Ψ (y, w, m, ψ ) = f Y | μ,Ψ (y | m, ψ ) f W | Ψ (w | ψ ) f μ | Ψ (m | ψ ) f Ψ (ψ )


= f μ | Y,Ψ (m | y, ψ ) f Y | Ψ (y | ψ ) f W | Ψ (w | ψ ) f Ψ (ψ )
(8.96)

(b) Show that the conditional distribution of μ given Y and Ψ is multivariate normal
with

E [ μ | Y = y, Ψ = ψ ] = (K + K0 )−1 (Ky + K0 μ0 ),
Cov[ μ | Y = y, Ψ = ψ ] = (K + K0 )−1 ⊗ ψ −1 . (8.97)

[Hint: Follows from Exercise 3.7.29, noting that ψ is fixed (conditioned upon) for this
calculation.] (c) Show that

Y | Ψ = ψ ∼ Np×q (μ0 , (K−1 + K0−1 ) ⊗ ψ −1 ). (8.98)

[Hint: See (3.102).] (d) Let Z = (K−1 + K0−1 )−1/2 (Y − μ0 ), and show that the middle
two densities in the last line of (8.96) can be combined into the density of

U = W + Z Z | Ψ = ψ ∼ Wishartq (ν + p, ψ −1 ), (8.99)

that is,
f Y | Ψ (y | ψ ) f W | Ψ (w | ψ ) = c∗ (u, w) f U | Ψ (u | ψ ) (8.100)
for some constant c∗ (u, w) that does not depend on ψ. (e) Now use Exercise 8.8.24 to
show that
Ψ | U = u ∼ Wishartq (ν + ν0 + p, (u + Ψ0−1 )−1 ). (8.101)
(f) Thus the posterior distribution of μ and Ψ in (8.97) and (8.101) are given in the
same two stages as the prior in (8.88). The only differences are in the parameters.
8.8. Exercises 149

The prior parameters are μ0 , K0 , ν0 , and Ψ0 . What are the corresponding posterior
parameters? (g) Using (8.83), show that the posterior means of μ and Σ are

E [ μ | Y = y, W = w] = (K + K0 )−1 (Ky + K0 μ0 ),
1
E [ Σ | Y = y, W = w] = (u + ν0 Σ0 ), (8.102)
ν + ν0 + p − q − 1

and the posterior covariance of μ is

1
Cov[ μ | Y = y, W = w] = (K + K0 )−1 ⊗ (u + ν0 Σ0 ). (8.103)
ν + ν0 + p − q − 1
Chapter 9

Likelihood Methods

For the linear models, we derived estimators of β using the least-squares principle,
and found estimators of Σ R in an obvious manner. Likelihood provides another
general approach to deriving estimators, hypothesis tests, and model selection proce-
dures. We start with a very brief introduction, then apply the principle to the linear
models. Chapter 10 considers MLE’s for models concerning the covariance matrix.

9.1 Introduction
Throughout this chapter, we assume we have a statistical model consisting of a ran-
dom object (usually a matrix or a set of matrices) Y with space Y , and a set of
distributions { Pθ | θ ∈ Θ }, where Θ is the parameter space. We assume that these
distributions have densities, with Pθ having associated density f (y | θ).
Definition 9.1. For a statistical model with densities, the likelihood function is defined for
each fixed y ∈ Y as the function L (· ; y) : θ → [0, ∞ ) given by
L (θ ; y) = a(y) f (y | θ), (9.1)
for any positive a(y).
Likelihoods are to be interpreted in only relative fashion, that is, to say the likeli-
hood of a particular θ1 is L (θ1 ; y) does not mean anything by itself. Rather, meaning
is attributed to saying that the relative likelihood of θ1 to θ2 (in light of the data y) is
L (θ1 ; y)/L (θ2 ; y). Which is why the “a(y)” in (9.1) is allowed. There is a great deal of
controversy over what exactly the relative likelihood means. We do not have to worry
about that particularly, since we are just using likelihood as a means to an end. The
general idea, though, is that the data supports θ’s with relatively high likelihood.
The next few sections consider maximum likelihood estimation. Subsequent sec-
tions look at likelihood ratio tests, and two popular model selection techniques (AIC
and BIC). Our main applications are to multivariate normal parameters.

9.2 Maximum likelihood estimation


Given the data y, it is natural (at least it sounds natural terminologically) to take as
estimate of θ the value that is most likely. Indeed, that is the maximum likelihood

151
152 Chapter 9. Likelihood Methods

estimate.
Definition 9.2. The maximum likelihood estimate (MLE) of the parameter θ based on the
(y) ∈ Θ that maximizes the likelihood L (θ; y).
data y is the unique value, if it exists, θ
It may very well be that the maximizer is not unique, or does not exist at all, in
which case there is no MLE for that particular y. The MLE of a function of θ, g(θ),
is defined to be the function of the MLE, that is, g(
(θ) = g(θ). See Exercises 9.6.1 and
9.6.2 for justification.

9.2.1 The MLE in multivariate regression


Here the model is the multivariate regression model (4.8),
Y ∼ Nn×q (xβ, In ⊗ Σ R ), (9.2)
where x is n × p and β is p × q. We need to assume that
• n ≥ p + q;
• x x is positive definite;
• Σ R is positive definite.
The parameter θ = ( β, Σ R ) and the parameter space is Θ = R pq × Sq+ , where Sq+ is
the space of q × q positive definite symmetric matrices as in (5.34).
To find the MLE, we first must find the L, that is, the pdf of Y. We can use
Corollary 8.3. From (9.2) we see that the density in (8.54) has M = xβ and C = In ,
hence the likelihood is
1 −1 
e− 2 trace( Σ R (y−xβ) (y−xβ)) .
1
L ( β, Σ R ; y) = (9.3)
| Σ R | n/2
(Note that the constant has been dropped.)
To maximize the likelihood, first consider maximizing L over β, which is equiva-
lent to minimizing trace(Σ−  
R (y − xβ ) (y − xβ )). Let β be the least-squares estimate,
1

  − 
β = (x x) x y. We show that this is in fact the MLE. Write
1

y − xβ = y − x β + x β − xβ
= Qx y + x( β − β), (9.4)
where Qx = In − x(x x)−1 x (see Proposition 5.1), so that

(y − xβ) (y − xβ) = (Qx y + x( β − β)) (Qx y + x( β − β))


= y Qx y + ( β − β) x x( β − β). (9.5)
1    
Because Σ R and x x are positive definite, trace(Σ− R ( β − β ) x x( β − β )) > 0 unless
β = β, in which case the trace is zero, which means that β uniquely maximizes the
likelihood over β for fixed Σ R , and because it does not depend on Σ R , it is the MLE.
Now consider
 Σ R ; y) = 1 −1 
e− 2 trace( Σ R y Qx y) .
1
L ( β, (9.6)
| Σ R | n/2
We need to maximize that over Σ R ∈ Sq+ . We appeal to the following lemma, proved
in Section 9.2.3.
9.2. Maximum likelihood estimation 153

Lemma 9.1. Suppose a > 0 and U ∈ Sq+ . Then

1 −1
e− 2 trace( Σ U)
1
g(Σ ) = (9.7)
| Σ| a/2

is uniquely maximized over Σ ∈ Sq+ by

 = 1 U,
Σ (9.8)
a
and the maximum is
1 aq
) =
g(Σ e− 2 . (9.9)
 | a/2

Applying this lemma to (9.6) yields

 R = y Qx y .
Σ (9.10)
n

Note that Y Qx Y ∼ Wishartq (n − p, Σ R ), which is positive definite with probability


1 if n − p ≥ q, i.e., if n ≥ p + q, which we have assumed. Also, note that the
denominator is n, compared to the n − p we used in (6.19). Thus unless x = 0, this
estimate will be biased since E [Y Qx Y] = (n − p)Σ R . Now we have that from (9.3) or
(9.9),
Σ 1 nq
L ( β,  R ; y) = e− 2 . (9.11)
 R | n/2

9.2.2 The MLE in the both-sides linear model


Here we add the z, that is, the model is

Y ∼ Nn×q (xβz , In ⊗ Σ R ), (9.12)

where x is n × p, z is q × l, and β is p × l. We need one more assumption, i.e., we


need that
• n ≥ p + q;
• x x is positive definite;
• z z is positive definite;
• Σ R is positive definite.

The parameter is again θ = ( β, Σ R ), where the parameter space is Θ = R pl × Sq+ .


There are two cases to consider: z is square, l = q, hence invertible, or l < q. In
the former case, we proceed as in (7.76), that is, let Y(z) = Y(z )−1 , so that

Y(z) ∼ Nn×q (xβ, In ⊗ Σz ), (9.13)

where as in (6.5),
Σ z = z− 1 Σ R (z ) − 1 . (9.14)
154 Chapter 9. Likelihood Methods

The MLE’s of β and Σz are as in the previous Section 9.2.1:

β = (x x)−1 x Y(z) = (x x)−1 x Y(z )−1 (9.15)


and
 z = 1 Y(z)  Qx Y(z) = z−1 Y Qz Y(z )−1 .
Σ (9.16)
n
Since ( β, Σz ) and ( β, Σ R ) are in one-to-one correspondence, the MLE of ( β, Σ R ) is
given by (9.15) and (9.10), the latter because Σ R = zΣz z .
The case that l < q is a little more involved. As in Section 7.5.1 (and Exercise
5.6.18), we fill out z, that is, find a q × (q − l ) matrix z2 so that z z2 = 0 and (z z2 ) is
invertible. Following (7.79), the model (9.12) is the same as
  
  z
Y ∼ Nn×q x β 0 , I ⊗ Σ R , (9.17)
z2 n

or  −1
z    
Y( z ) ≡ Y ∼ Nn×q x β 0 , In ⊗ Σz ) , (9.18)
z2
where  −1
  −1 z
Σz = z z2 ΣR . (9.19)
z2
As before, ( β, Σz ) and ( β, Σ R ) are in one-to-one correspondence, so it will be
sufficient to first find the MLE of the former. Because of the “0” in the mean of Y(z) ,
the least-squares estimate of β in (5.27) is not the MLE. Instead, we have to proceed
conditionally. That is, partition Y(z) similar to (7.69),
(z) (z) (z) (z)
Y ( z ) = (Y a Y b ), Y a is n × l, Yb is n × (q − l ), (9.20)
and Σz similar to (7.70), 
Σz,aa Σz,ab
Σz = . (9.21)
Σz,ba Σz,bb
( z) ( z) ( z)
The density of Y(z) is a product of the conditional density of Ya given Yb = yb ,
( z)
and the marginal density of Yb :
( z) ( z) ( z)
f (y(z) | β, Σz ) = f (y a | yb , β, γ, Σz,aa·b ) × f (yb | Σz,bb ), (9.22)
where
−1 −1
γ = Σz,bb Σz,ba and Σz,aa·b = Σz,aa − Σz,ab Σz,bb Σz,ba . (9.23)
( z)
The notation in (9.22) makes explicit the facts that the conditional distribution of Ya
( z) ( z)
given Yb = yb depends on ( β, Σz ) through only ( β, γ, Σz,aa·b), and the marginal
( z)
distribution of Yb depends on ( β, Σz ) through only Σz,bb .
The set of parameters ( β, γ, Σz,aa·b, Σz,bb ) can be seen to be in one-to-one corre-
spondence with ( β, Σz ), and has space R p×l × R( q−l )×l × Sl+ × Sq+−l . That is, the
parameters in the conditional density are functionally independent of those in the
marginal density, which means that we can find the MLE’s of these parameters sepa-
rately.
9.2. Maximum likelihood estimation 155

Conditional part
We know as in (7.72) (without the “μ” parts) and (7.73) that
) *  
( z) β
Ya | Yb = yb ∼ Nn×l x yb , In ⊗ Σz,aa·b . (9.24)
γ

We are in the multivariate regression case, that is, without a z, so the MLE of the
( β , γ ) parameter is the least-squares estimate

 + , −1 + ,
( z) x
β x x x y b ( z)
= ( z)  ( z)  ( z) ( z)  Ya , (9.25)
γ yb x yb yb yb

and
 z,aa·b = 1 Y(az)  Q (z) Y(az) .
Σ (9.26)
n (x,yb )

Marginal part
From (9.19), we have that

( z)
Yb ∼ Nn×( q−l )(0, In ⊗ Σz,bb ), (9.27)

hence it is easily seen that


 z,bb = 1 Y(z)  Y(z)
Σ (9.28)
n b b

is the MLE.
The maximum of the likelihood from (9.22), ignoring the constants, is

1 1 nq
e− 2 (9.29)

| Σz,aa·b| n/2 
| Σz,bb | n/2

Putting things back together, we first note that the MLE of β is given in (9.25), and
that for Σz,bb is given in (9.28). By (9.23), the other parts of Σz have MLE

Σ  z,bb γ
 z,ba = Σ  and Σ  Σ
 z,aa·b + γ
 z,aa = Σ  z,bb γ
. (9.30)

Finally, to get back to Σ R , use



  z
R =
Σ z z2 z
Σ . (9.31)
z2

If one is mainly interested in β, then the MLE can be found using the pseudo-
covariate approach in Section 7.5.1, and the estimation of Σz,bb and the reconstruction
 R are unnecessary.
of Σ
We note that a similar approach can be used to find the MLE’s in the covariate
model (7.69), with just a little more complication to take care of the μ parts. Again, if
one is primarily interested in the β a , then the MLE is found as in that section.
156 Chapter 9. Likelihood Methods

9.2.3 Proof of Lemma 9.1


Because U is positive definite and symmetric, it has an invertible symmetric square
root, U1/2 . Let Ψ = U−1/2 ΣU−1/2 , and from (9.7) write

1 1 −1
g(Σ) = h(U−1/2 ΣU−1/2 ), where h(Ψ) ≡ e− 2 trace( Ψ )
1
(9.32)
| U| a/2 | Ψ| a/2

 = (1/a)Iq ,
is a function of Ψ ∈ Sq+ . Exercise 9.6.7 shows that (9.32) is maximized by Ψ
hence
1
) = e− 2 a·trace(Iq ) ,
1
h(Ψ (9.33)
| U| | a Iq |
a/2 1 a/2

from which follows (9.9). Also,

 = U−1/2 ΣU
Ψ  = U1/2 1 Iq U1/2 = 1 U,
 −1/2 ⇒ Σ (9.34)
a a
which proves (9.8). 2

9.3 Likelihood ratio tests


Again our big model has Y with space Y and a set of distributions { Pθ | θ ∈ Θ } with
associated densities f (y | θ). Testing problems we consider are of the form

H0 : θ ∈ Θ0 versus H A : θ ∈ Θ A , (9.35)

where
Θ0 ⊂ Θ A ⊂ Θ. (9.36)
Technically, the space in H A should be Θ A − Θ0 , but we take that to be implicit.
The likelihood ratio statistic for problem (9.35) is defined to be

supθ∈Θ A L (θ; y)
LR = , (9.37)
supθ∈Θ0 L (θ; y)

where the likelihood L is given in Definition 9.1.


The idea is that the larger LR, the more likely the alternative H A is, relative to the
null H0 . For testing, one would either use LR as a basis for calculating a p-value, or
find a cα such that rejecting the null when LR > cα yields a level of (approximately) α.
Either way, the null distribution of LR is needed, at least approximately. The general
result we use says that under certain conditions (satisfied by most of what follows),
under the null hypothesis,

2 log( LR) −→D χ2d f where d f = dim( H A ) − dim( H0 ) (9.38)

as n → ∞ (which means there must be some n to go to infinity). The dimension


of a hypothesis is the number of free parameters it takes to uniquely describe the
associated distributions. This definition is not very explicit, but in most examples the
dimension will be “obvious.”
9.4. Model selection 157

9.3.1 The LRT in multivariate regression


Consider the multivariate regression model in (9.2), with the x and β partitioned:
 
β1
Y ∼ Nn×q (x1 x2 ) , In ⊗ Σ R , (9.39)
β2

where the xi are n × pi and the β i are pi × q, p1 + p2 = p. We wish to test

H0 : β 2 = 0 versus H A : β 2 = 0. (9.40)

The maximized likelihoods under the null and alternative are easy to find using
(9.10) and (9.11). The MLE’s of Σ under H0 and H A are, respectively,

 0 = 1 y Qx1 y and Σ
Σ  A = 1 y Qx y, (9.41)
n n
the former because under the null, the model is multivariate regression with mean
x1 β 1 . Then the likelihood ratio from (9.37) is
+ ,n/2 n/2
0|
|Σ |y Qx1 y|
LR = = . (9.42)
 A|
|Σ |y Qx y|

We can use the approximation (9.38) under the null, where here d f = p2 q. It turns
out that the statistic is equivalent to Wilk’s Λ in (7.2),

|W|
Λ = ( LR)−2/n = ∼ Wilksq ( p2 , n − p), (9.43)
|W + B|
where
W = y Qx y and B = y (Qx1 − Qx )y. (9.44)
See Exercise 9.6.8. Thus we can use Bartlett’s approximation in (7.37), with l∗ = q
and p∗ = p2 .

9.4 Model selection: AIC and BIC


As in Section 7.6, we often have a number of models we wish to consider, rather
than just two as in hypothesis testing. (Note also that hypothesis testing may not be
appropriate even when choosing between two models, e.g., when there is no obvious
allocation to “null” and “alternative” models.) Using Mallows’ C p (Proposition 7.2)
is reasonable in linear models, but more general methods are available that are based
on likelihoods.
We assume there are K models under consideration, labelled M1 , M2 , . . . , MK .
Each model is based on the same data, Y, but has its own density and parameter
space:
Model Mk ⇒ Y ∼ f k (y | θk ), θk ∈ Θ k . (9.45)
The densities need not have anything to do with each other, i.e., one could be normal,
another uniform, another logistic, etc., although often they will be of the same type.
It is possible that the models will overlap, so that several models might be correct at
once, e.g., when there are nested models.
158 Chapter 9. Likelihood Methods

Let

lk (θk ; y) = log( L k (θk ; y)) = log( f k (y | θk )) + C (y), k = 1, . . . , K, (9.46)

be the loglikelihoods for the models. The constant C (y) is arbitrary, and as long as
it is the same for each k, it will not affect the outcome of the following procedures.
Define the deviance of the model Mk at parameter value θk by

deviance( Mk (θk ) ; y) = −2 lk (θk ; y). (9.47)

It is a measure of fit of the model to the data; the smaller the deviance, the better the
fit. The MLE of θk for model Mk minimizes this deviance, giving us the observed
deviance,
k ) ; y) = −2 lk (θ
deviance( Mk (θ k ; y) = −2 max lk (θk ; y). (9.48)
θk ∈ Θ k

Note that the likelihood ratio statistic in (9.38) is just the difference in observed
deviance of the two hypothesized models:

0 ) ; y) − deviance( H A (θ
2 log( LR) = deviance( H0 (θ A ) ; y ) . (9.49)

At first blush one might decide the best model is the one with the smallest ob-
served deviance. The problem with that approach is that because the deviances are
based on minus the maximum of the likelihoods, the model with the best observed
deviance will be the largest model, i.e., one with highest dimension. Instead, we add
a penalty depending on the dimension of the parameter space, as for Mallows’ C p in
(7.102). The two most popular procedures are the Bayes information criterion (BIC)
of Schwarz [1978] and the Akaike information criterion (AIC) of Akaike [1974] (who
actually meant for the “A” to stand for “An”):

k ) ; y) + log(n )dk , and


BIC( Mk ; y) = deviance( Mk (θ (9.50)
k ) ; y) + 2dk ,
AIC( Mk ; y) = deviance( Mk (θ (9.51)

where in both cases,


dk = dim(Θ k ). (9.52)
Whichever criterion is used, it is implemented by finding the value for each model,
then choosing the model with the smallest value of the criterion, or looking at the
models with the smallest values.
Note that the only difference between AIC and BIC is the factor multiplying the
dimension in the penalty component. The BIC penalizes each dimension more heav-
ily than does the AIC, at least if n > 7, so tends to choose more parsimonious models.
In more complex situations than we deal with here, the deviance information criterion
is useful, which uses more general definitions of the deviance. See Spiegelhalter et al.
[2002].
The next two sections present some further insight into the two criteria.

9.4.1 BIC: Motivation


The AIC and BIC have somewhat different motivations. The BIC, as hinted at by the
“Bayes” in the name, is an attempt to estimate the Bayes posterior probability of the
9.4. Model selection 159

models. More specifically, if the prior probability that model Mk is the true one is π k ,
then the BIC-based estimate of the posterior probability is

e− 2 BIC( Mk ; y) π
1

PBIC [ Mk | y] = k
. (9.53)
e− 2 BIC( M1 ; y) π − 12 BIC( MK ; y) π
1
1 + · · · + e K
If the prior probabilities are taken to be equal, then because each posterior probability
has the same denominator, the model that has the highest posterior probability is
indeed the model with the smallest value of BIC. The advantage of the posterior
probability form is that it is easy to assess which models are nearly as good as the
best, if there are any.
To see where the approximation arises, we first need a prior on the parameter
space. In this case, there are several parameter spaces, one for each model under
consideration. Thus is it easier to find conditional priors for each θk , conditioning on
the model:
θk | M k ∼ ρ k ( θk ) , (9.54)
for some density ρk on Θ k . The marginal probability of each model is the prior
probability:
π k = P [ Mk ]. (9.55)
The conditional density of (Y, θk ) given Mk is

gk (y, θk | Mk ) = f k (y | θk )ρk (θk ). (9.56)


To find the density of Y given Mk , we integrate out the θk :

Y | M k ∼ gk (y | M k ) = gk (y, θk | Mk )dθk
Θk

= f k (y | θk )ρk (θk )dθk . (9.57)
Θk

With the parameters hidden, it is straightforward to find the posterior probabilities


of the models using Bayes theorem, Theorem 2.2:

gk (y | M k ) π k
P [ Mk | y] = . (9.58)
g1 (y | M1 )π1 + · · · + gK (y | MK )π K

Comparing (9.58) to (9.53), we see that the goal is to approximate gk (y | Mk ) by


e− 2 BIC ( Mk ; y) . To do this, we use the Laplace approximation, as in Schwarz [1978].
1

The following requires a number of regularity assumptions, not all of which we will
detail. One is that the data y consists of n iid observations, another that n is large.
Many of the standard likelihood-based assumptions needed can be found in Chapter
6 of Lehmann and Casella [1998], or any other good mathematical statistics text. For
convenience we drop the “k”, and from (9.57) consider

f (y | θ)ρ(θ)dθ = el ( θ ; y) ρ(θ)dθ. (9.59)
Θ Θ

The Laplace approximation expands l (θ ; y) around its maximum, the maximum oc-
curing at the maximum likelihood estimator θ.  Then, assuming all the derivatives
exist,
 ; y) + (θ − θ
l (θ ; y) ≈ l (θ ) + 1 (θ − θ
) ∇(θ ) H(θ
)(θ − θ
), (9.60)
2
160 Chapter 9. Likelihood Methods

) is the d × 1 (θ is d × 1) vector with


where ∇(θ

∇i (θ) = l (θ ; y) | θ=θ, (9.61)
∂θi
and H is the d × d matrix with
∂2
Hij = l (θ ; y) | θ=θ . (9.62)
∂θi ∂θ j
 is the MLE, the derivative of the loglikelihood at the MLE is zero, i.e.,
Because θ
∇(θ) = 0. Also, let
 1 ),
F = − H( θ (9.63)
n
which is called the observed Fisher information contained in one observation. Then
(9.59) and (9.60) combine to show that

     
f (y | θ)ρ(θ)dθ ≈ el ( θ ; y) e− 2 ( θ−θ) nF( θ)( θ−θ) ρ(θ)dθ.
1
(9.64)
Θ Θ
If n is large, the exponential term in the integrand drops off precipitously when θ is
 and assuming that the prior density ρ(θ) is fairly flat for θ near θ,
not close to θ,  we
have
       
e− 2 ( θ−θ) nF( θ)( θ−θ) ρ(θ)dθ ≈ e− 2 ( θ−θ) nF( θ)( θ−θ) dθρ(θ
).
1 1
(9.65)
Θ Θ
The integrand in the last term looks like the density (8.48) as if

θ ∼ Nd (θ, F) − 1 ) ,
 (n (9.66)
but without the constant. Thus the integral is just the reciprocal of that constant, i.e.,
√ √
   
e− 2 ( θ−θ) nF( θ)( θ−θ) dθ = ( 2π )d/2 | n
F| −1/2 = ( 2π )d/2 |
F| −1/2 n −d/2.
1
(9.67)
Θ
Putting (9.64) and (9.67) together gives

log f (y | θ)ρ(θ)dθ ≈ l (θ  ; y) − d log(n )
Θ 2
d 1
+ log(ρ(θ)) + log(2π ) − log(|
F|)
2 2
d
≈ l (θ ; y) − log(n )
2
1
= − BIC( M ; y). (9.68)
2
Dropping the last three terms in the first line is justified by noting that as n → ∞,
 ; y) is of order n (in the iid case), log(n )d/2 is clearly of order log(n ), and the
l (θ
other terms are bounded. (This step may be a bit questionable since n has to be
extremely large before log(n ) starts to dwarf a constant.)
There are a number of approximations and heuristics in this derivation, and in-
deed the resulting approximation may not be especially good. See Berger, Ghosh,
and Mukhopadhyay [1999], for example. A nice property is that under conditions,
if one of the considered models is the correct one, then the BIC chooses the correct
model as n → ∞.
9.4. Model selection 161

9.4.2 AIC: Motivation


The Akaike information criterion can be thought of as a generalization of Mallows’
C p in that it is an attempt to estimate the expected prediction error. In Section 7.6
it was reasonable to use least squares as a measure of prediction error. In general
models, it is not obvious which measure to use. Akaike’s idea is to use deviance.
We wish to predict the unobserved Y New , which is independent of Y but has the
same distribution. We measure the distance between the new variable y New and the
prediction of its density given by the model Mk via the prediction deviance:

k ) ; y New ) = −2lk (θ
deviance( Mk (θ k ; y New ), (9.69)

as in (9.48), except that here, while the MLE θk is based on the data y, the loglike-
lihood is evaluated at the new variable y New . The expected prediction deviance is
then
k ) ; Y New )],
EPredDeviance( Mk ) = E [deviance( Mk (θ (9.70)
where the expected value is over both the data Y (through the θ k ) and the new vari-
able Y New . The best model is then the one that minimizes this value.
We need to estimate the expected prediction deviance, and the observed deviance
(9.48) is the obvious place to start. As for Mallows’ C p , the observed deviance is
likely to be an underestimate because the parameter is chosen to be optimal for the
particular data y. Thus we would like to find how much of an underestimate it is,
i.e., find
k ) ; Y)].
Δ = EPredDeviance( Mk ) − E [deviance( Mk (θ (9.71)
Akaike argues that for large n, the answer is Δ = 2dk , i.e.,

k ) ; Y)] + 2dk ,
EPredDeviance( Mk ) ≈ E [deviance( Mk (θ (9.72)

from which the AIC in (9.51) arises. One glitch in the proceedings is that the approxi-
mation assumes that the true model is in fact Mk (or a submodel thereof), rather than
the most general model, as in (7.83) for Mallows’ C p .
Rather than justify the result in full generality, we will show the exact value for Δ
for multivariate regression, as Hurvich and Tsai [1989] did in the multiple regression
model.

9.4.3 AIC: Multivariate regression


Consider the multivariate regression model (4.8),

Model M : Y ∼ Nn×q (xβ, In ⊗ Σ R ) , (9.73)

where β is p × q, x is n × p with x x invertible, and Σ R is a nonsingular q × q covariance


matrix. Now from (9.3),

n 1
l ( β, Σ R ; y) = − log(| Σ R |) − trace(Σ− 
R (y − xβ ) (y − xβ )).
1
(9.74)
2 2
The MLE’s are then

 R = 1 y Qx y,
β = (x x)−1 x y and Σ (9.75)
n
162 Chapter 9. Likelihood Methods

as in (9.10). Thus, as from (9.11) and (9.47), we have


deviance( M ( β,  R |) + nq,
 R ) ; y) = n log(| Σ (9.76)

and the prediction deviance as in (9.69) is


deviance( M ( β,  −1 (Y New − x β) (Y New − x β)).
 R |) + trace(Σ
 R ) ; Y New ) = n log(| Σ
R
(9.77)

To find EPredDeviance, we first look at

Y New − x β = Y New − Px Y ∼ N (0, (In + Px ) ⊗ Σ R ), (9.78)

because Y New and Y are independent, and Px Y ∼ N (xβ, Px ⊗ Σ R ). Then (Exercise


9.6.10)

E [(Y New − x β) (Y New − x β)] = trace(In + Px ) Σ R = (n + p) Σ R . (9.79)

 R , we have from (9.77) and (9.79) that


Because Y New and Px Y are independent of Σ

EPredDeviance( M ) = E [ n log(| Σ  −1 Σ R )].


 R |)] + (n + p) E [trace(Σ (9.80)
R

Using the deviance from (9.76), the difference Δ from (9.71) can be written

 −1 Σ R )] − nq = n (n + p) E [trace(W−1 Σ R )] − nq,
Δ = (n + p) E [trace(Σ (9.81)
R

the log terms cancelling, where

W = Y Qx Y ∼ Wishart(n − p, Σ R ). (9.82)

By (8.31), with ν = n − p and l ∗ = q,

1
E [ W −1 ] = Σ . (9.83)
n− p−q−1 R

Then from (9.81),

q n
Δ = n (n + p) − nq = q (2p + q + 1). (9.84)
n− p−q−1 n− p−q−1

Thus

AIC∗ ( M ; y) = deviance( M ( β,  R ) ; y) + nq (2p + q + 1) (9.85)
n− p−q−1

for the multivariate regression model. For large n, Δ ≈ 2 dim(Θ ). See Exercise 9.6.11.
In univariate regression q = 1, and (9.85) is the value given in Hurvich and Tsai
[1989].
9.5. Example 163

9.5 Example: Mouth sizes


Return to the both-sides models (7.40) for the mouth size data. We consider the same
submodels as in Section 7.6.1, which we denote M p∗ l ∗ : the p∗ indicates the x-part of
the model, where p∗ = 1 means just use the constant, and p∗ = 2 uses the constant
and the sex indicator; the l ∗ indicates the degree of polynomial represented in the z
matrix, where l ∗ = degree + 1.
To find the MLE as in Section 9.2.2 for these models, because the matrix z is
invertible, we start by multiplying y on the right by (z )−1 :
⎛ ⎞
0.25 −0.15 0.25 −0.05
⎜ 0.25 −0.05 −0.25 0.15 ⎟
( z)  −1 ⎜
y = y (z ) = y ⎝ ⎟. (9.86)
0.25 0.05 −0.25 −0.15 ⎠
0.25 0.15 0.25 0.05

We look at two of the models in detail. The full model M24 in (7.40) is actually just
( z)
multivariate regression, so there is no “before” variable yb . Thus

24.969 0.784 0.203 −0.056
β = (x x)−1 x y(z) = , (9.87)
−2.321 −0.305 −0.214 0.072

and (because n = 27),

 R = 1 (y(z) − x β) (y(z) − x β)


Σ
⎛27 ⎞
3.499 0.063 −0.039 −0.144
⎜ 0.063 0.110 −0.046 0.008 ⎟

=⎝ ⎟. (9.88)
−0.039 −0.046 0.241 −0.005 ⎠
−0.144 0.008 −0.005 0.117

These coefficient estimates were also found in Section 7.1.1.


The observed deviance (9.48), given in (9.76), is


deviance( M24 ( β,  R ) ; y(z) ) = n log(| Σ
 R |) + nq = −18.611, (9.89)

with n = 27 and q = 4.
The best model in (7.103) was M22 , which fit different linear equations to the boys
( z) ( z)
and girls. In this case, y a consists of the first two columns of y(z) , and yb the final
( z)
two columns. As in (9.25), to find the MLE of the coefficients, we shift the yb to be
with the x, yielding

β ( z) ( z) ( z) ( z)
= ((x yb ) (x yb ))−1 (x yb ) y a
γ
⎛ ⎞
24.937 0.827
⎜ −2.272 −0.350 ⎟
⎜ ⎟
=⎜ ⎟, (9.90)
⎝ −0.189 −0.191 ⎠
−1.245 0.063
164 Chapter 9. Likelihood Methods

where the top 2 × 2 submatrix contains the estimates of the β ij ’s. Notice that they are
similar but not exactly the same as the estimates in (9.87) for the corresponding coef-
ficients. The bottom submatrix contains coefficients relating the “after” and “before”
measurements, which are not of direct interest.
There are two covariance matrices needed, both 2 × 2:
 
 z,aa·b = 1 ( z) ( z) β ( z) ( z) β
Σ y b − (x y b ) y b − (x y b )
27 γ γ


3.313 0.065
= (9.91)
0.065 0.100

and

 z,bb = 1 yz y(z)


Σ
27 b b

0.266 −0.012
= . (9.92)
−0.012 0.118

Then by (9.29),


deviance( M22 ( β,  ) ; y(z) ) = n log(| Σ
 z,aa·b|) + n log(| Σ
 z,bb |) + nq
= 27(−1.116 − 3.463 + 4)
= −15.643. (9.93)

Consider testing the two models:

H0 : M22 versus H A : M24 . (9.94)

The likelihood ratio statistic is from (9.49),

deviance( M22 ) − deviance( M24 ) = −15.643 + 18.611 = 2.968. (9.95)

That value is obviously not significant, but formally the chi-square test would com-
pare the statistic to the cutoff from a χ2d f where d f = d24 − d22 . The dimension for
model M p∗ l ∗ is

∗ ∗ q
d p∗ l ∗ = p l + = p∗ l ∗ + 10. (9.96)
2
because the model has p∗ l ∗ non-zero β ij ’s, and Σ R is 4 × 4. Note that all the models
we are considering have the same dimension Σ R . For the hypotheses (9.94), d22 = 14
and d24 = 18, hence the d f = 4 (rather obviously since we are setting four of the β ij ’s
to 0). The test shows that there is no reason to reject the smaller model in favor of the
full one.
The AIC (9.50) and BIC (9.51) are easy to find as well (log(27) ≈ 3.2958):

AIC BIC Cp
M22 −15.643 + 2(14) = 12.357 −15.643 + log(27)(14) = 30.499 599.7 (9.97)
M24 −18.611 + 2(18) = 17.389 −18.611 + log(27)(18) = 40.714 610.2

The Mallows’ C p ’s are from (7.103). Whichever criterion you use, it is clear the smaller
model optimizes it. It is also interesting to consider the BIC-based approximations to
9.6. Exercises 165

the posterior probabilities of these models in (9.53). With π22 = π24 = 12 , we have

e− 2 BIC ( M22)
1

PBIC [ M22 | y(z) ] = = 0.994. (9.98)


e − 12 BIC ( M22) + e− 2 BIC ( M24)
1

That is, between these two models, the smaller one has an estimated probability of
99.4%, quite high.
We repeat the process for each of the models in (7.103) to obtain the following
table (the last column is explained below):

p∗ l∗ Deviance d p∗ l ∗ AIC BIC Cp PBIC GOF


1 1 36.322 11 58.322 72.576 947.9 0.000 0.000
1 2 −3.412 12 20.588 36.138 717.3 0.049 0.019
1 3 −4.757 13 21.243 38.089 717.9 0.018 0.017
1 4 −4.922 14 23.078 41.220 722.6 0.004 0.008 (9.99)
2 1 30.767 12 54.767 70.317 837.7 0.000 0.000
2 2 −15.643 14 12.357 30.499 599.7 0.818 0.563
2 3 −18.156 16 13.844 34.577 601.2 0.106 0.797
2 4 −18.611 18 17.389 40.714 610.2 0.005 0.000

For AIC, BIC, and C p , the best model is the one with linear fits for each sex, M22 .
The next best is the model with quadratic fits for each sex, M23 . The penultimate
column has the BIC-based estimated posterior probabilities, taking the prior proba-
bilities equal. Model M22 is the overwhelming favorite, with about 82% estimated
probability, and M23 is next with about 11%, not too surprising considering the plots
in Figure 4.1. The only other models with estimated probability over 1% are the lin-
ear and quadratic fits with boys and girls equal. The probability that a model shows
differences between the boys and girls can be estimated by summing the last four
probabilities, obtaining 93%.
The table in (9.99) also contains a column “GOF,” which stands for “goodness-
of-fit.” Perlman and Wu [2003] suggest in such model selection settings to find the
p-value for each model when testing the model (as null) versus the big model. Thus
here, for model M p∗ l ∗ , we find the p-value for testing

H0 : M p∗ l ∗ versus H A : M24 . (9.100)

As in (9.49), we use the difference in the models’ deviances, which under the null has
an approximate χ2 distribution, with the degrees of freedom being the difference in
their dimensions. Thus

GOF(M_{p*l*}) = P[ χ²_{df} > deviance(M_{p*l*}) − deviance(M24) ],   df = d24 − d_{p*l*}.   (9.101)
A good model is one that fits well, i.e., has a large p-value. The two models that fit
very well are M22 and M23 , as for the other criteria, though here M23 fits best. Thus
either model seems reasonable.
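The GOF column is easy to obtain in R. The following sketch uses the deviances and dimensions from the table in (9.99); again the object names are ours.

dev <- c(36.322, -3.412, -4.757, -4.922, 30.767, -15.643, -18.156, -18.611)   # deviances from (9.99)
d <- c(11, 12, 13, 14, 12, 14, 16, 18)                                        # dimensions
gof <- pchisq(dev - dev[8], df = d[8] - d, lower.tail = FALSE)                # p-values as in (9.101)
round(gof, 3)

Here dev[8] and d[8] belong to the big model M24, so the final entry compares M24 to itself.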

9.6 Exercises
Exercise 9.6.1. Consider the statistical model with space Y and densities f(y | θ) for θ ∈ Θ. Suppose the function g : Θ → Ω is one-to-one and onto, so that a reparametrization of the model has densities f*(y | ω) for ω ∈ Ω, where f*(y | ω) = f(y | g^{−1}(ω)). (a) Show that θ̂ uniquely maximizes f(y | θ) over θ if and only if ω̂ ≡ g(θ̂) uniquely maximizes f*(y | ω) over ω. [Hint: Show that f(y | θ̂) > f(y | θ) for all θ ≠ θ̂ implies f*(y | ω̂) > f*(y | ω) for all ω ≠ ω̂, and vice versa.] (b) Argue that if θ̂ is the MLE of θ, then g(θ̂) is the MLE of ω.

Exercise 9.6.2. Again consider the statistical model with space Y and densities f (y | θ)
for θ ∈ Θ, and suppose g : Θ → Ω is just onto. Let g∗ be any function of θ such that
the joint function h(θ) = ( g(θ), g∗ (θ)), h : Θ → Λ, is one-to-one and onto, and set
the reparametrized density as f*(y | λ) = f(y | h^{−1}(λ)). Exercise 9.6.1 shows that if θ̂ uniquely maximizes f(y | θ) over Θ, then λ̂ = h(θ̂) uniquely maximizes f*(y | λ) over Λ. Argue that if θ̂ is the MLE of θ, then it is legitimate to define g(θ̂) to be the MLE of ω = g(θ).
Exercise 9.6.3. Show that (9.5) holds. [What are Qx Qx and Qx x?]
Exercise 9.6.4. Show that if A (p × p) and B (q × q) are positive definite, and u is p × q, that

trace(Bu′Au) > 0   (9.102)

unless u = 0. [Hint: See (8.79).]
Exercise 9.6.5. From (9.18), (9.21), and (9.23), give ( β, Σz ) as a function of ( β, γ, Σz,aa·b,
Σz,bb ), and show that the latter set of parameters has space R p×l × R( q−l )×l × Sl ×
Sq−l .
Exercise 9.6.6. Verify (9.32).
Exercise 9.6.7. Consider maximizing h(Ψ) in (9.32) over Ψ ∈ Sq . (a) Let Ψ = ΓΛΓ 
be the spectral decomposition of Ψ, so that the diagonals of Λ are the eigenvalues
λ1 ≥ λ2 ≥ · · · ≥ λq ≥ 0. (Recall Theorem 1.1.) Show that
(1/|Ψ|^{a/2}) e^{−trace(Ψ^{−1})/2} = ∏_{i=1}^{q} [ λ_i^{−a/2} e^{−1/(2λ_i)} ].   (9.103)

(b) Find λ̂_i, the maximizer of λ_i^{−a/2} exp(−1/(2λ_i)), for each i = 1, . . . , q. (c) Show that these λ̂_i's satisfy the conditions on the eigenvalues of Λ. (d) Argue that then Ψ̂ = (1/a)I_q maximizes h(Ψ).
Exercise 9.6.8. Suppose the null hypothesis in (9.40) holds, so that Y ∼ Nn×q (x1 β 1 , In ⊗
Σ R ). Exercise 5.6.37 shows that Qx1 − Qx = Px2·1 , where x2·1 = Qx1 x2 . (a) Show that
Px2·1 x1 = 0 and Qx Px2·1 = 0. [Hint: See Lemma 5.3.] (b) Show that Qx Y and Px2·1 Y
are independent, and find their distributions. (c) Part (b) shows that W = Y Qx Y
and B = Y Px2·1 Y are independent. What are their distributions? (d) Verify the Wilks’
distribution in (9.42) for |W| /|W + B|.
Exercise 9.6.9. Consider the multivariate regression model (9.2), where Σ R is known.
(a) Use (9.5) to show that
l(β; y) − l(β̂; y) = −(1/2) trace(Σ_R^{−1} (β̂ − β)′ x′x (β̂ − β)).   (9.104)
(b) Show that in this case, (9.60) is actually an equality, and give H, which is a function
of Σ R and x x.

Exercise 9.6.10. Suppose V is n × q, with mean zero and Cov[V] = A ⊗ B, where A


is n × n and B is q × q. (a) Show that E [V V] = trace(A)B. [Hint: Write E [V V] =
∑ E [Vi Vi ], where the Vi are the rows of V, then use (3.33).] (b) Use part (a) to prove
(9.79).

Exercise 9.6.11. (a) Show that for the model in (9.73), dim(Θ ) = pq + q (q + 1)/2,
where Θ is the joint space of β and Σ R . (b) Show that in (9.84), Δ → 2 dim(Θ ) as
n → ∞.

Exercise 9.6.12 (Caffeine). This question continues the caffeine data in Exercises 4.4.4
and 6.5.6. Start with the both-sides model Y = xβz + R, where as before the Y is
2 × 28, the first column being the scores without caffeine, and the second being the
scores with caffeine. The x is a 28 × 3 ANOVA matrix for the three grades, with
orthogonal polynomials. The linear vector is (−1_9′, 0_10′, 1_9′)′ and the quadratic vector is (1_9′, −1.8·1_10′, 1_9′)′. The z looks at the sum and difference of scores:

z = ( 1  −1
      1   1 ).   (9.105)

The goal of this problem is to use BIC to find a good model, choosing among the
constant, linear and quadratic models for x, and the “overall mean” and “overall
mean + difference models” for the scores. Start by finding Y( z) = Y(z )−1 . (a) For
each of the 6 models, find the deviance (just the log-sigma parts), number of free
parameters, BIC, and estimated probability. (b) Which model has highest probability?
(c) What is the chance that the difference effect is in the model? (d) Find the MLE of
β for the best model.

Exercise 9.6.13 (Leprosy, Part I). This question continues Exercises 4.4.6 and 5.6.41 on
the leprosy data. The model is
                          ( 1   1   1 )            ( μ_b  μ_a )
(Y^(b), Y^(a)) = xβ + R = [ ( 1   1  −1 ) ⊗ 1_10 ]  (  0   α_a ) + R,   (9.106)
                          ( 1  −2   0 )            (  0   β_a )

where

R ∼ N( 0, I_n ⊗ ( σ_bb  σ_ba
                  σ_ab  σ_aa ) ).   (9.107)

Because of the zeros in the β, the MLE is not the usual one for multivariate regression. Instead, the problem has to be broken up into the conditional part (“after” conditioning on “before”), and the marginal of the before measurements, as for the both-sides model in Sections 9.2.2 and 9.5. The conditional is

Y^(a) | Y^(b) = y^(b) ∼ N( (x  y^(b)) ( μ*
                                        α_a
                                        β_a
                                        γ  ),  σ_{aa·b} I_n ),   (9.108)

where

μ* = μ_a − γ μ_b   and   γ = σ_ab / σ_bb.   (9.109)

In this question, give the answers symbolically, not the actual numerical values. Those
come in the next exercise. (a) What is the marginal distribution of Y( b) ? Write it as
a linear model, without any zeroes in the coefficient matrix. (Note that the design
matrix will not be the entire x.) (b) What are the MLE’s for μ b and σbb ? (c) Give the
MLE’s of μ a , σab , and σaa in terms of the MLE’s of μ ∗ , μ b , γ, σbb and σaa·b. (d) What is
the deviance of this model? How many free parameters (give the actual number) are
there? (e) Consider the model with β a = 0. Is the MLE of α a the same or different
than in the original model? What about the MLE of σaa·b ? Or of σbb ?

Exercise 9.6.14 (Leprosy, Part II). Continue with the leprosy example from Part I,
Exercise 9.6.13. (a) For the original model in Part I, give the values of the MLE’s of
α a , β a , σaa·b and σbb . (Note that the MLE of σaa·b will be different than the unbiased
estimate of 16.05.) (b) Now consider four models: The original model, the model
with β a = 0, the model with α a = 0, and the model with α a = β a = 0. For each, find
the MLE’s of σaa·b , σbb , the deviance (just using the log terms, not the nq), the number
of free parameters, the BIC, and the BIC-based estimate of the posterior probability
(in percent) of the model. Which model has the highest probability? (c) What is the
probability (in percent) that the drug vs. placebo effect is in the model? The Drug A
vs. Drug D effect?

Exercise 9.6.15 (Skulls). For the data on Egyptian skulls (Exercises 4.4.2, 6.5.5, and
7.7.12), consider the linear model over time, so that
x = ( 1  −3
      1  −1
      1   0
      1   1
      1   3 ) ⊗ 1_30.   (9.110)

Then the model is the multivariate regression model Y ∼ N (xβ, In ⊗ Σ R ), where Y


is 150 × 4 and β is 2 × 4. Thus we are contemplating a linear regression for each of
the four measurement variables. The question here is which variables require a linear
term. Since there are 2^4 = 16 possible models, we will look at just four models: no
linear terms are in (β21 = β22 = β23 = β24 = 0), just the first and third variables’
(MaxBreadth and BasLength) linear terms are in (β22 = β24 = 0), the first three
variables’ are in (β24 = 0), or all four linear terms are in. For each of the four models,
find the deviance, number of parameters, BIC, and estimated probabilities. Which
model is best? How much better is it than the second best? [Note: To find the MLE’s
for the middle two models, you must condition on the variables without the linear
terms. I.e., for the second model, find the MLE’s conditionally for (Y1 , Y3 ) | (Y2 , Y4 ) =
(y2 , y4 ), and also for (Y2 , Y4 ) marginally. Here, Y j is the jth column of Y.]

Exercise 9.6.16. (This is a discussion question, in that there is no exact answer. Your
reasoning should be sound, though.) Suppose you are comparing a number of mod-
els using BIC, and the lowest BIC is bmin . How much larger than bmin would a BIC
have to be for you to consider the corresponding model ignorable? That is, what is δ
so that models with BIC > bmin + δ don’t seem especially viable. Why?
Exercise 9.6.17. Often, in hypothesis testing, people misinterpret the p-value to be the
probability that the null is true, given the data. We can approximately compare the

two values using the ideas in this chapter. Consider two models, the null (M0 ) and
alternative (M A ), where the null is contained in the alternative. Let deviance0 and
deviance A be their deviances, and dim0 and dim A be their dimensions, respectively.
Supposing that the assumptions are reasonable, the p-value for testing the null is
p-value = P [ χ2ν > δ], where ν = dim A − dim0 and δ = deviance0 − deviance A . (a)
Give the BIC-based estimate of the probability of the null for a given ν, δ and sample
size n. (b) For each of various values of n and ν (e.g, n = 1, 5, 10, 25, 100, 1000 and
ν = 1, 5, 10, 25), find the δ that gives a p-value of 5%, and find the corresponding
estimate of the probability of the null. (c) Are the probabilities of the null close to
5%? What do you conclude?
Chapter 10

Models on Covariance Matrices

The models so far have been on the means of the variables. In this chapter, we
look at some models for the covariance matrix. We start with testing the equality of
covariance matrices, then move on to testing independence and conditional indepen-
dence of sets of variables. Next is factor analysis, where the relationships among the
variables are assumed to be determined by latent (unobserved) variables. Principal
component analysis is sometimes thought of as a type of factor analysis, although it is
more of a decomposition than actual factor analysis. See Section 13.1.5. We conclude
with a particular class of structural models, called invariant normal models.
We will base our hypothesis tests on Wishart matrices (one, or several independent
ones). In practice, these matrices will often arise from the residuals in linear models,
especially the Y Qx Y as in (6.18). If U ∼ Wishartq (ν, Σ), where Σ is invertible and
ν ≥ q, then the likelihood is

L(Σ; U) = |Σ|^{−ν/2} e^{−trace(Σ^{−1}U)/2}.   (10.1)

The likelihood follows from the density in (8.71). An alternative derivation is to note
that by (8.54), Z ∼ N (0, Iν ⊗ Σ) has likelihood L ∗ (Σ; z) = L (Σ; z z). Thus z z is a
sufficient statistic, and there is a theorem that states that the likelihood for any X is
the same as the likelihood for its sufficient statistic. Since Z Z = D U, (10.1) is the
likelihood for U.
Recall from (5.34) that Sq+ denotes the set of q × q positive definite symmetric
matrices. Then Lemma 9.1 shows that the MLE of Σ ∈ S_q^+ based on (10.1) is

Σ̂ = U/ν,   (10.2)

and the maximum likelihood is

L(Σ̂; U) = |U/ν|^{−ν/2} e^{−νq/2}.   (10.3)


10.1 Testing equality of covariance matrices


We first suppose we have two groups, e.g., boys and girls, and wish to test whether
their covariance matrices are equal. Let U1 and U2 be independent, with

Ui ∼ Wishartq (νi , Σi ), i = 1, 2. (10.4)


The hypotheses are then

H0 : Σ1 = Σ2 versus HA : Σ1 ≠ Σ2,   (10.5)

where both Σ1 and Σ2 are in Sq+ . (That is, we are not assuming any particular struc-
ture for the covariance matrices.) We need the likelihoods under the two hypotheses.
Because the Ui ’s are independent,
−1 −1
L (Σ1 , Σ2 ; U1 , U2 ) = | Σ1 | −ν1 /2 e− 2 trace( Σ1 U1 ) | Σ2 | −ν2 /2 e− 2 trace( Σ2 U2 ) ,
1 1
(10.6)

which, under the null hypothesis, becomes


−1
L (Σ, Σ; U1 , U2 ) = | Σ| −( ν1+ν2 ) /2 e− 2 trace( Σ (U1 +U2 )) ,
1
(10.7)

where Σ is the common value of Σ1 and Σ2. The MLE under the alternative hypothesis is found by maximizing (10.6), which results in two separate maximizations:

Under HA:  Σ̂_A1 = U1/ν1,   Σ̂_A2 = U2/ν2.   (10.8)

Under the null, there is just one Wishart, U1 + U2, so that

Under H0:  Σ̂_01 = Σ̂_02 = (U1 + U2)/(ν1 + ν2).   (10.9)

Thus

sup_{HA} L = |U1/ν1|^{−ν1/2} e^{−ν1 q/2} |U2/ν2|^{−ν2/2} e^{−ν2 q/2},   (10.10)

and

sup_{H0} L = |(U1 + U2)/(ν1 + ν2)|^{−(ν1+ν2)/2} e^{−(ν1+ν2) q/2}.   (10.11)

Taking the ratio, note that the parts in the e cancel, hence

LR = sup_{HA} L / sup_{H0} L = |U1/ν1|^{−ν1/2} |U2/ν2|^{−ν2/2} / |(U1 + U2)/(ν1 + ν2)|^{−(ν1+ν2)/2}.   (10.12)

And

2 log(LR) = (ν1 + ν2) log|(U1 + U2)/(ν1 + ν2)| − ν1 log|U1/ν1| − ν2 log|U2/ν2|.   (10.13)

Under the null hypothesis, 2 log( LR) approaches a χ2 as in (9.37). To figure out
the degrees of freedom, we have to find the number of free parameters under each

hypothesis. A Σ ∈ Sq+ , unrestricted, has q (q + 1)/2 free parameters, because of the


symmetry. Under the alternative, there are two such sets of parameters. Thus,

dim( H0 ) = q (q + 1)/2, dim( H A ) = q (q + 1)


⇒ dim( H A ) − dim( H0 ) = q (q + 1)/2. (10.14)

Thus, under H0 ,
2 log( LR) −→ χ2q( q+1) /2. (10.15)

10.1.1 Example: Grades data


Using the grades data on n = 107 students in (4.10), we compare the covariance
matrices of the men and women. There are 37 men and 70 women, so that the sample
covariance matrices have degrees of freedom ν1 = 37 − 1 = 36 and ν2 = 70 − 1 = 69,
respectively. Their estimates are:
Men:  (1/ν1) U1 = ( 166.33  205.41  106.24   51.69   62.20
                    205.41  325.43  206.71   61.65   69.35
                    106.24  206.71  816.44   41.33   41.85     (10.16)
                     51.69   61.65   41.33   80.37   50.31
                     62.20   69.35   41.85   50.31   97.08 ),

and
Women:  (1/ν2) U2 = ( 121.76  113.31   58.33   40.79   40.91
                      113.31  212.33  124.65   52.51   50.60
                       58.33  124.65  373.84   56.29   74.49     (10.17)
                       40.79   52.51   56.29   88.47   60.93
                       40.91   50.60   74.49   60.93  112.88 ).

These covariance matrices are clearly not equal, but are the differences significant?
The pooled estimate, i.e., the common estimate under H0 , is
(1/(ν1 + ν2)) (U1 + U2) = ( 137.04  144.89   74.75   44.53   48.21
                            144.89  251.11  152.79   55.64   57.03
                             74.75  152.79  525.59   51.16   63.30     (10.18)
                             44.53   55.64   51.16   85.69   57.29
                             48.21   57.03   63.30   57.29  107.46 ).

Then

2 log(LR) = (ν1 + ν2) log|(U1 + U2)/(ν1 + ν2)| − ν1 log|U1/ν1| − ν2 log|U2/ν2|
          = 105 log(2.6090 × 10^10) − 36 log(2.9819 × 10^10) − 69 log(1.8149 × 10^10)
          = 20.2331.   (10.19)

The degrees of freedom for the χ2 is q (q + 1)/2 = 5 × 6/2 = 15. The p-value is 0.16,
which shows that we have not found a significant difference between the covariance
matrices.
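This calculation is easy to carry out in R. As a sketch, suppose u1 and u2 hold the two sum of squares and cross products matrices, so that u1/36 and u2/69 are the estimates in (10.16) and (10.17):

nu1 <- 36; nu2 <- 69; q <- 5
lr2 <- (nu1+nu2)*log(det((u1+u2)/(nu1+nu2))) -
   nu1*log(det(u1/nu1)) - nu2*log(det(u2/nu2))   # 2 log(LR) as in (10.13), about 20.23
pchisq(lr2, q*(q+1)/2, lower.tail=FALSE)         # p-value, about 0.16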

10.1.2 Testing the equality of several covariance matrices


It is not hard to extend the test to testing the equality of more than two covariance
matrices. That is, we have U1, . . . , Um, independent, Ui ∼ Wishart_q(νi, Σi), and wish to test

H0 : Σ1 = · · · = Σm versus HA : not.   (10.20)

Then

2 log(LR) = (ν1 + · · · + νm) log|(U1 + · · · + Um)/(ν1 + · · · + νm)| − ν1 log|U1/ν1| − · · · − νm log|Um/νm|,   (10.21)

and under the null,

2 log( LR) −→ χ2d f , d f = (m − 1)q (q + 1)/2. (10.22)

This procedure is Bartlett’s test for the equality of covariances.

10.2 Testing independence of two blocks of variables


In this section, we assume U ∼ Wishart_q(ν, Σ), and partition the matrices:

U = ( U11  U12 )    and    Σ = ( Σ11  Σ12 )
    ( U21  U22 )               ( Σ21  Σ22 ),   (10.23)

where

U11 and Σ11 are q1 × q1 , and U22 and Σ22 are q2 × q2 ; q = q1 + q2 . (10.24)

Presuming the Wishart arises from multivariate normals, we wish to test whether the
two blocks of variables are independent, which translates to testing

H0 : Σ12 = 0 versus HA : Σ12 ≠ 0.   (10.25)

Under the alternative, the likelihood is just the one in (10.1), hence

sup_{HA} L(Σ; U) = |U/ν|^{−ν/2} e^{−νq/2}.   (10.26)

Under the null, because Σ is then block diagonal,

|Σ| = |Σ11| |Σ22|   and   trace(Σ^{−1}U) = trace(Σ11^{−1}U11) + trace(Σ22^{−1}U22).   (10.27)

Thus the likelihood under the null can be written

|Σ11|^{−ν/2} e^{−trace(Σ11^{−1}U11)/2} × |Σ22|^{−ν/2} e^{−trace(Σ22^{−1}U22)/2}.   (10.28)

The two factors can be maximized separately, so that

sup_{H0} L(Σ; U) = |U11/ν|^{−ν/2} e^{−νq1/2} × |U22/ν|^{−ν/2} e^{−νq2/2}.   (10.29)

Taking the ratio of (10.26) and (10.29), the parts in the exponent of the e again
cancel, hence
2 log( LR) = ν (log(|U11 /ν|) + log(|U22 /ν|) − log(|U/ν|)). (10.30)
(The ν’s in the denominators of the determinants cancel, so they can be erased if
desired.)
Section 13.3 considers canonical correlations, which are a way to summarize rela-
tionships between two sets of variables.

10.2.1 Example: Grades data


Continuing Example 10.1.1, we start with the pooled covariance matrix Σ  = (U1 +
U2 )/(ν1 + ν2 ), which has q = 5 and ν = 105. Here we test whether the first three
variables (homework, labs, inclass) are independent of the last two (midterms, final),
so that q1 = 3 and q2 = 2. Obviously, they should not be independent, but we will
test it formally. Now
2 log( LR) = ν (log(|U11 /ν|) + log(|U22 /ν|) − log(|U/ν|))
= 28.2299. (10.31)

Here the degrees of freedom in the χ2 are q1 × q2 = 6, because that is the number of
covariances we are setting to 0 in the null. Or you can count
dim( H A ) = q (q + 1)/2 = 15,
dim( H0 ) = q1 (q1 + 1)/2 + q2 (q2 + 1)/2 = 6 + 3 = 9, (10.32)
which has dim( H A ) − dim( H0 ) = 6. In either case, the result is clearly significant (the
p-value is less than 0.0001), hence indeed the two sets of scores are not independent.
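A sketch of that calculation in R, assuming sigmahat holds the pooled 5 × 5 covariance matrix of (10.18) (the ν's cancel, as noted below (10.30), so the covariance matrix itself can be used):

nu <- 105
lr2 <- nu*(log(det(sigmahat[1:3,1:3])) + log(det(sigmahat[4:5,4:5])) -
   log(det(sigmahat)))                    # 2 log(LR), about 28.23
pchisq(lr2, 3*2, lower.tail=FALSE)        # p-value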
Testing the independence of several blocks of variables is almost as easy. Consider the three variables homework, inclass, and midterms, which have covariance

Σ = ( σ11  σ13  σ14
      σ31  σ33  σ34
      σ41  σ43  σ44 ).   (10.33)

We wish to test whether the three are mutually independent, so that

H0 : σ13 = σ14 = σ34 = 0 versus HA : not.   (10.34)

Under the alternative, the estimate of Σ is just the usual one from (10.18), where we pick out the first, third, and fourth variables. Under the null, we have three independent variables, so σ̂ii = Uii/ν is just the appropriate diagonal from Σ̂. Then the test statistic is

2 log(LR) = ν (log(|U11/ν|) + log(|U33/ν|) + log(|U44/ν|) − log(|U*/ν|))
          = 30.116,   (10.35)
where U∗ contains just the variances and covariances of the variables 1, 3, and 4. We
do not need the determinant notation for the Uii , but leave it in for cases in which
the three blocks of variables are not 1 × 1. The degrees of freedom for the χ2 is then
3, because we are setting three free parameters to 0 in the null. Clearly that is a
significant result, i.e., these three variables are not mutually independent.

10.2.2 Example: Testing conditional independence


Imagine that we have (at least) three blocks of variables, and wish to see whether
the first two are conditionally independent given the third. The process is exactly
the same as for testing independence, except that we use the conditional covariance
matrix. That is, suppose

Y = (Y1 , Y2 , Y3 ) ∼ N (xβz , In ⊗ Σ R ), Yi is n × q i , q = q1 + q2 + q3 , (10.36)

where

Σ_R = ( Σ11  Σ12  Σ13
        Σ21  Σ22  Σ23
        Σ31  Σ32  Σ33 ),   (10.37)
so that Σii is q i × q i . The null hypothesis is

H0 : Y1 and Y2 are conditionally independent given Y3 . (10.38)

The conditional covariance matrix is

Cov[(Y1, Y2) | Y3 = y3] = ( Σ11  Σ12 ) − ( Σ13 ) Σ33^{−1} ( Σ31  Σ32 )
                          ( Σ21  Σ22 )   ( Σ23 )

                        = ( Σ11 − Σ13 Σ33^{−1} Σ31    Σ12 − Σ13 Σ33^{−1} Σ32 )
                          ( Σ21 − Σ23 Σ33^{−1} Σ31    Σ22 − Σ23 Σ33^{−1} Σ32 )

                        ≡ ( Σ11·3  Σ12·3 )
                          ( Σ21·3  Σ22·3 ).   (10.39)

Then the hypotheses are

H0 : Σ12·3 = 0 versus HA : Σ12·3 ≠ 0.   (10.40)

Letting U/ν be the usual estimator of Σ R , where

U = Y Qx Y ∼ Wishart( q1 +q2 +q3 ) (ν, Σ R ), ν = n − p, (10.41)

we know from Proposition 8.1 that the conditional covariance is also Wishart, but loses q3 degrees of freedom:

U_(1:2)(1:2)·3 ≡ ( U11·3  U12·3 )
                 ( U21·3  U22·3 )

               ∼ Wishart_(q1+q2)( ν − q3,  ( Σ11·3  Σ12·3 )
                                           ( Σ21·3  Σ22·3 ) ).   (10.42)

(The U is partitioned analogously to the Σ R .) Then testing the hypothesis Σ12·3 here
is the same as (10.30) but after dotting out 3:

2 log(LR) = (ν − q3) ( log(|U11·3/(ν − q3)|) + log(|U22·3/(ν − q3)|) − log(|U_(1:2)(1:2)·3/(ν − q3)|) ),   (10.43)

which is asymptotically χ2q1 q2 under the null.



An alternative (but equivalent) method for calculating the conditional covariance


is to move the conditioning variables Y3 to the x matrix, as we did for covariates.
Thus, leaving out the z,

( Y1 , Y2 ) | Y3 = y 3 ∼ N ( x ∗ β ∗ , I n ⊗ Σ ∗ ) , (10.44)

where

x* = (x, y3)   and   Σ* = ( Σ11·3  Σ12·3
                            Σ21·3  Σ22·3 ).   (10.45)

Then
U(1:2)(1:2)·3 = (Y1 , Y2 ) Qx∗ (Y1 , Y2 ). (10.46)
See Exercise 7.7.7.
We note that there appears to be an ambiguity in the denominators of the Ui ’s for
the 2 log( LR). That is, if we base the likelihood on the original Y of (10.36), then the
denominators will be n. If we use the original U in (10.41), the denominators will
be n − p. And what we actually used, based on the conditional covariance matrix
in (10.42), were n − p − q3 . All three possibilities are fine in that the asymptotics as
n → ∞ are valid. We chose the one we did because it is the most focussed, i.e., there
are no parameters involved (e.g., β) that are not directly related to the hypotheses.
Testing the independence of three or more blocks of variables, given another
block, again uses the dotted-out Wishart matrix. For example, consider Example
10.2.1 with variables homework, inclass, and midterms, but test whether those three
are conditionally independent given the “block 4” variables, labs and final. The
conditional U matrix is now denoted U(1:3)(1:3)·4, and the degrees of freedom are
ν − q4 = 105 − 2 = 103, so that the estimate of the conditional covariance matrix is
( σ̂11·4  σ̂12·4  σ̂13·4 )
( σ̂21·4  σ̂22·4  σ̂23·4 ) = (1/(ν − q4)) U_(1:3)(1:3)·4
( σ̂31·4  σ̂32·4  σ̂33·4 )

                        = (  51.9536  −18.3868    5.2905
                            −18.3868  432.1977    3.8627     (10.47)
                              5.2905    3.8627   53.2762 ).

Then, to test
H0 : σ12·4 = σ13·4 = σ23·4 = 0 versus H A : not, (10.48)
we use the statistic analogous to (10.35),

2 log(LR) = (ν − q4) ( log(|U11·4/(ν − q4)|) + log(|U22·4/(ν − q4)|) + log(|U33·4/(ν − q4)|) − log(|U_(1:3)(1:3)·4/(ν − q4)|) )
          = 2.76.   (10.49)

The degrees of freedom for the χ2 is again 3, so we accept the null: There does not
appear to be a significant relationship among these three variables given the labs and
final scores. This implies, among other things, that once we know someone’s labs and
final scores, knowing the homework or inclass will not help in guessing the midterms
score. We could also look at the sample correlations, unconditionally (from (10.18))

and conditionally:
Unconditional Conditional on Labs, Final
HW InClass Midterms HW InClass Midterms
HW 1.00 0.28 0.41 1.00 −0.12 0.10
InClass 0.28 1.00 0.24 −0.12 1.00 0.03
Midterms 0.41 0.24 1.00 0.10 0.03 1.00
(10.50)
Notice that the conditional correlations are much smaller than the unconditional
ones, and the conditional correlation between homework and inclass scores is nega-
tive, though not significantly so. Thus it appears that the labs and final scores explain
the relationships among the other variables.
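The conditional covariance matrix and these correlations can be computed directly in R. A sketch, assuming sigmahat is the pooled covariance matrix of (10.18), with columns ordered homework, labs, inclass, midterms, final:

a <- c(1,3,4)   # homework, inclass, midterms
b <- c(2,5)     # labs, final
condcov <- sigmahat[a,a] - sigmahat[a,b]%*%solve(sigmahat[b,b])%*%sigmahat[b,a]   # as in (10.39)
cov2cor(condcov)          # conditional correlations in (10.50)
cov2cor(sigmahat[a,a])    # unconditional correlations in (10.50)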

10.3 Factor analysis


The example above suggested that the relationship among three variables could be
explained by two other variables. The idea behind factor analysis is that the relation-
ships (correlations, to be precise) of a set of variables can be explained by a number
of other variables, called factors. The kicker here is that the factors are not observed.
Spearman [1904] introduced the idea based on the idea of a “general intelligence”
factor. This section gives the very basics of factor analysis. More details can be found
in Lawley and Maxwell [1971], Harman [1976] and Basilevsky [1994], as well as many
other books.
The model we consider sets Y to be the n × q matrix of observed variables, and X
to be the n × p matrix of factor variables, which we do not observe. Assume

(X  Y) ∼ N( D(δ  γ), I_n ⊗ Σ ),   Σ = ( Σ_XX  Σ_XY
                                        Σ_YX  Σ_YY ),   (10.51)
where D is an n × k design matrix (e.g., to distinguish men from women), and δ
(k × p) and γ (k × q) are the parameters for the means of X and Y, respectively. Factor
analysis is not primarily concerned with the means (that is what the linear models
are for), but with the covariances. The key assumption is that the variables in Y
are conditionally independent given X, which means the conditional covariance is
diagonal:
Cov[Y | X = x] = Σ_YY·X = Ψ = ( ψ11   0   · · ·   0
                                 0   ψ22  · · ·   0
                                          ...
                                 0    0   · · ·  ψqq ).   (10.52)
Writing out the conditional covariance matrix, we have

Ψ = Σ_YY − Σ_YX Σ_XX^{−1} Σ_XY  ⇒  Σ_YY = Σ_YX Σ_XX^{−1} Σ_XY + Ψ,   (10.53)

so that marginally,

Y ∼ N( Dγ, I_n ⊗ (Σ_YX Σ_XX^{−1} Σ_XY + Ψ) ).   (10.54)
Because Y is all we observe, we cannot estimate Σ_XY or Σ_XX separately, but only the function Σ_YX Σ_XX^{−1} Σ_XY. Note that if we replace X with X* = AX for some invertible matrix A,

Σ_YX Σ_XX^{−1} Σ_XY = Σ_YX* Σ_X*X*^{−1} Σ_X*Y,   (10.55)

so that the distribution of Y is unchanged. See Exercise 10.5.5. Thus in order to


estimate the parameters, we have to make some restrictions. Commonly it is assumed
that Σ XX = I p , and the mean is zero:

X ∼ N (0, In ⊗ I p ). (10.56)

Then, letting β = Σ_XX^{−1} Σ_XY = Σ_XY,

Y ∼ N (Dγ, In ⊗ ( β β + Ψ)). (10.57)

Or, we can write the model as

Y = Dγ + Xβ + R, X ∼ N (0, In ⊗ I p ), R ∼ N (0, In ⊗ Ψ), (10.58)

where X and R are independent. The equation decomposes each variable (column)
in Y into the fixed mean plus the part depending on the factors plus the parts unique
to the individual variables. The element β ij is called the loading of factor i on the
variable j. The variance ψ j is the unique variance of variable j, i.e., the part not
explained by the factors. Any measurement error is assumed to be part of the unique
variance.
There is the statistical problem of estimating the model, meaning the β β and Ψ
(and γ, but we already know about that), and the interpretative problem of finding
and defining the resulting factors. We will take these concerns up in the next two
subsections.

10.3.1 Estimation
We estimate the γ using least squares as usual, i.e.,

γ̂ = (D′D)^{−1} D′Y.   (10.59)

Then the residual sum of squares matrix is used to estimate the β and Ψ:

U = Y QD Y ∼ Wishartq (ν, β β + Ψ), ν = n − k. (10.60)

The parameters are still not estimable, because for any p × p orthogonal matrix
Γ, (Γβ) (Γβ) yields the same β β. We can use the QR decomposition from Theorem
5.3. Our β is p × q with p < q. Write β = ( β 1 , β 2 ), where β 1 has the first p columns
of β. We apply the QR decomposition to β 1 , assuming the columns are linearly
independent. Then β 1 = QR, where Q is orthogonal and R is upper triangular with
positive diagonal elements. Thus we can write Q′β1 = R, or

Q′β = Q′( β1  β2 ) = ( R  R* ) ≡ β*,   (10.61)

where R* is some p × (q − p) matrix. E.g., with p = 3, q = 5,

β* = ( β*11  β*12  β*13  β*14  β*15
        0    β*22  β*23  β*24  β*25     (10.62)
        0     0    β*33  β*34  β*35 ),

where the β∗ii ’s are positive. If we require that β satisfies constraints (10.62), then it
is estimable. (Exercise 10.5.6.) Note that there are p( p − 1)/2 non-free parameters

(since β ij = 0 for i > j), which means the number of free parameters in the model
is pq − p( p − 1)/2 for the β part, and q for Ψ. Thus for the p-factor model M p , the
number of free parameters is
d_p ≡ dim(M_p) = q(p + 1) − p(p − 1)/2.   (10.63)
(We are ignoring the parameters in the γ, because they are the same for all the models
we consider.) In order to have a hope of estimating the factors, the dimension of the
factor model cannot exceed the dimension of the most general model, ΣYY ∈ Sq+ ,
which has q (q + 1)/2 parameters. Thus for identifiability we need

q(q + 1)/2 − d_p = ((q − p)² − p − q)/2 ≥ 0.   (10.64)
E.g., if there are q = 10 variables, at most p = 6 factors can be estimated.
There are many methods for estimating β and Ψ. As in (10.1), the maximum
likelihood estimator maximizes
L(β, Ψ; U) = |β′β + Ψ|^{−ν/2} e^{−trace((β′β + Ψ)^{−1}U)/2}   (10.65)
over β satisfying (10.62) and Ψ being diagonal. There is not a closed form solution to
the maximization, so it must be done numerically. There may be problems, too, such
as having one or more of the ψ_j's being driven to 0. It is not obvious, but if β̂ and Ψ̂ are the MLE's, then the maximum of the likelihood is, similar to (10.3),

L(β̂, Ψ̂; U) = |β̂′β̂ + Ψ̂|^{−ν/2} e^{−νq/2}.   (10.66)

See Section 9.4 of Mardia, Kent, and Bibby [1979].


Typically one is interested in finding the simplest model that fits. To test whether
the p-factor model fits, we use the hypotheses

H0 : Σ_YY = β′β + Ψ, β is p × q   versus   HA : Σ_YY ∈ S_q^+.   (10.67)

The MLE for HA is Σ̂_YY = U/ν, so that

LR = ( |β̂′β̂ + Ψ̂| / |U/ν| )^{ν/2}.   (10.68)

Now

2 log(LR) = ν (log(|β̂′β̂ + Ψ̂|) − log(|U/ν|)),   (10.69)

which is asymptotically χ²_df with df being the difference in (10.64). Bartlett suggests a slight adjustment to the factor ν, similar to the Box approximation for Wilks’ Λ, so that under the null,

2 log(LR)* = (ν − (2q + 5)/6 − 2p/3) (log(|β̂′β̂ + Ψ̂|) − log(|U/ν|)) −→ χ²_df,   (10.70)

where

df = ((q − p)² − p − q)/2.   (10.71)

Alternatively, one can use AIC (9.50) or BIC (9.51) to assess M p for several p.
Because νq is the same for all models, we can take


deviance(M_p(β̂, Ψ̂) ; y) = ν log(|β̂′β̂ + Ψ̂|),   (10.72)

so that

BIC(M_p) = ν log(|β̂′β̂ + Ψ̂|) + log(ν) (q(p + 1) − p(p − 1)/2),   (10.73)
AIC(M_p) = ν log(|β̂′β̂ + Ψ̂|) + 2 (q(p + 1) − p(p − 1)/2).   (10.74)

10.3.2 Describing the factors


Once you have decided on the number p of factors in the model and the estimate β, 

you have a choice of rotations. That is, since Γ β for any p × p orthogonal matrix Γ
has exactly the same fit, you need to choose the Γ. There are a number of criteria. The
varimax criterion tries to pick a rotation so that the loadings ( βij ’s) are either large
in magnitude, or close to 0. The hope is that it is then easy to interpret the factors
by seeing which variables they load heavily upon. Formally, the varimax rotation is
that which maximizes the sum of the variances of the squares of the elements in each
column. That is, if F is the q × p matrix consisting of the squares of the elements
in Γβ, then the varimax rotation is the Γ that maximizes trace(F Hq F), Hq being the
centering matrix (1.12). There is nothing preventing you from trying as many Γ’s as
you wish. It is an art to find a rotation and interpretation of the factors.
The matrix X, which has the scores of the factors for the individuals, is unobserved,
but can be estimated. The joint distribution is, from (10.51) with the assumptions
(10.56),

     Ip β
X Y ∼N 0 Dγ , In ⊗ Σ , Σ= . (10.75)
β β β + Ψ

Then given the observed Y:

X | Y = y ∼ N (α∗ + yβ∗ , In ⊗ Σ XX ·Y ), (10.76)

where
β∗ = ( β β + Ψ)−1 β , α∗ = −Dγβ∗ , (10.77)

and
Σ XX ·Y = I p − β( β β + Ψ)−1 β . (10.78)

An estimate of X is the estimate of E [X | Y = y]:

)( β β + Ψ
 = (y − Dγ
X  )−1 β . (10.79)

10.3.3 Example: Grades data


Continue with the grades data in Section 10.2.2, where the D in

E (Y) = Dγ (10.80)

is a 107 × 2 matrix that distinguishes men from women. The first step is to estimate
Σ_YY:

Σ̂_YY = (1/ν) Y′Q_D Y,   (10.81)
where here ν = 107 − 2 (since D has two columns), which is the pooled covariance
matrix in (10.18).
We illustrate with the R program factanal. The input to the program can be a data
matrix or a covariance matrix or a correlation matrix. In any case, the program will
base its calculations on the correlation matrix. Unless D is just a column of 1’s, you
shouldn’t give it Y, but S = Y QD Y/ν, where ν = n − k if D is n × k. You need to also
specify how many factors you want, and the number of observations (actually, ν + 1
for us). We’ll start with one factor. The sigmahat is the S, and covmat= indicates to R
that you are giving it a covariance matrix. (Do the same if you are giving a correlation
matrix.) In such cases, the program does not know what n or k is, so you should set
the parameter n.obs. It assumes that D is 1n , i.e., that k = 1, so to trick it into using
another k, set n.obs to n − k + 1, which in our case is 106. Then the one-factor model
is fit to the sigmahat in (10.18) using

f <- factanal(covmat=sigmahat,factors=1,n.obs=106)

The output includes the uniquenesses (diagonals of Ψ̂), f$uniquenesses, and the (transpose of the) loadings matrix, f$loadings. Here,

diagonals of Ψ̂:   HW     Labs   InClass  Midterms  Final
                  0.247  0.215  0.828    0.765     0.786     (10.82)

and

β̂:          HW     Labs   InClass  Midterms  Final
  Factor1   0.868  0.886  0.415    0.484     0.463           (10.83)

The given loadings and uniquenesses are based on the correlation matrix, so the fitted
correlation matrix can be found using

corr0 <- f$loadings%*%t(f$loadings) + diag(f$uniquenesses)

The result is

One-factor model HW Labs InClass Midterms Final


HW 1.00 0.77 0.36 0.42 0.40
Labs 0.77 1.00 0.37 0.43 0.41
(10.84)
InClass 0.36 0.37 1.00 0.20 0.19
Midterms 0.42 0.43 0.20 1.00 ∗0.22
Final 0.40 0.41 0.19 0.22 1.00

Compare that to the observed correlation matrix, which is in the matrix f$corr:

Unrestricted model HW Labs InClass Midterms Final


HW 1.00 0.78 0.28 0.41 0.40
Labs 0.78 1.00 0.42 0.38 0.35
(10.85)
InClass 0.28 0.42 1.00 0.24 0.27
Midterms 0.41 0.38 0.24 1.00 ∗0.60
Final 0.40 0.35 0.27 0.60 1.00

The fitted correlations are reasonably close to the observed ones, except for the
midterms/final correlation: The actual is 0.60, but the estimate from the one-factor
model is only 0.22. It appears that this single factor is more focused on other correla-
tions.
For a formal goodness-of-fit test, we have

H0 : One-factor model versus H A : Unrestricted. (10.86)

We can use either the correlation or covariance matrices, as long as we are consistent,
and since factanal gives the correlation, we might as well use that. The MLE under
H A is then corrA, the correlation matrix obtained from S, and under H0 is corr0. Then

2 log(LR) = ν (log(|β̂′β̂ + Ψ̂|) − log(|S|))   (10.87)

is found in R using
105*log(det(corr0)/det(f$corr))
yielding the value 37.65. It is probably better to use Bartlett’s refinement (10.70),
(105 - (2*5+5)/6 - 2/3)*log(det(corr0)/det(f$corr))
which gives 36.51. This value can be found in f$STATISTIC, or by printing out f. The
degrees of freedom for the statistic in (10.71) is ((q − p)2 − p − q )/2 = 5, since p = 1
and q = 5. Thus H0 is rejected: The one-factor model does not fit.

Two factors
With small q, we have to be careful not to ask for too many factors. By (10.64), two is
the maximum when q = 5. In R, we just need to set factors=2 in the factanal function.
The χ2 for goodness-of-fit is 2.11, on one degree of freedom, hence the two-factor
model fits fine. The estimated correlation matrix is now
Two-factor model HW Labs InClass Midterms Final
HW 1.00 0.78 0.35 0.40 0.40
Labs 0.78 1.00 0.42 0.38 0.35
(10.88)
InClass 0.35 0.42 1.00 0.24 0.25
Midterms 0.40 0.38 0.24 1.00 0.60
Final 0.40 0.35 0.25 0.60 1.00

which is quite close to the observed correlation matrix (10.85) above. Only the In-
Class/HW correlation is a bit off, but not by much.
The uniquenesses and loadings for this model are

diagonals of Ψ̂:   HW    Labs   InClass  Midterms  Final
                  0.36  0.01   0.80     0.48      0.30     (10.89)

and
β̂:           HW     Labs   InClass  Midterms  Final
  Factor 1   0.742  0.982  0.391    0.268     0.211     (10.90)
  Factor 2   0.299  0.173  0.208    0.672     0.807
The routine gives the loadings using the varimax criterion.
Looking at the uniquenesses, we notice that inclass’s is quite large, which suggests
that it has a factor unique to itself, e.g., being able to get to class. It has fairly low
loadings on both factors. We see that the first factor loads highly on homework
and labs, especially labs, and the second loads heavily on the exams, midterms and
final. (These results are not surprising given the example in Section 10.2.2, where we
see homework, inclass, and midterms are conditionally independent given labs and
final.) So one could label the factors “Diligence” and “Test taking ability”.
The exact same fit can be achieved by using other rotations Γβ, for a 2 × 2 orthog-
onal matrix Γ. Consider the rotation

Γ = (1/√2) ( 1   1
             1  −1 ).   (10.91)

Then the loadings become

Γβ̂:            HW     Labs   InClass  Midterms  Final
  Factor* 1   0.736  0.817  0.424    0.665     0.720     (10.92)
  Factor* 2   0.314  0.572  0.129   −0.286    −0.421

Now Factor∗ 1 could be considered an overall ability factor, and Factor∗ 2 a contrast
of HW+Lab and Midterms+Final.
Any rotation is fine — whichever you can interpret easiest is the one to take.
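For instance, a sketch of applying the rotation (10.91) in R, where f is assumed to be the two-factor fit from factanal. R stores the loadings as a q × p matrix (our β̂ transposed), so the rotation multiplies on the right by Γ′:

gamma <- matrix(c(1,1,1,-1),2,2)/sqrt(2)   # the Gamma in (10.91)
lstar <- f$loadings%*%t(gamma)             # rotated loadings, as in (10.92)
round(lstar, 3)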

Using the BIC to select the number of factors


We have the three models: One-factor (M1), two-factor (M2), and unrestricted (M_Big). The deviances (10.72) are ν log(|Σ̂|), where here we take the correlation form of the Σ̂'s. The relevant quantities are next:

Model    Deviance     d     BIC        BIC      PBIC
M1       −156.994    10    −110.454   16.772   0
M2       −192.382    14    −127.226    0       0.768     (10.93)
M_Big    −194.640    15    −124.831    2.395   0.232

The only difference between the two BIC columns is that the second one has 127.226
added to each element, making it easier to compare them. These results conform to
what we had before. The one-factor model is untenable, and the two-factor model is
fine, with 77% estimated probability. The full model has a decent probability as well.

Estimating the score matrix X


The score matrix is estimated as in (10.79). You have to be careful, though, to use

consistently the correlation form or covariance form. That is, if the (β̂, Ψ̂) is estimated from the correlation matrix, then the residuals y − Dγ̂ must be rescaled so that the
variances are 1. Or you can let R do it, by submitting the residuals and asking for the
“regression” scores:
[Figure 10.1: Plot of factor scores for the two-factor model; horizontal axis “Diligence,” vertical axis “Test taking ability.”]

x <- cbind(1,grades[,1])
gammahat <- solve(t(x)%*%x,t(x)%*%grades[,2:6])
resids <- grades[,2:6] - x%*%gammahat
xhat <- factanal(resids,factors=2,scores='regression')$scores

The xhat is then 107 × 2:


X̂ = ( −5.038  −1.352
      −4.083  −0.479
      −0.083  −2.536     (10.94)
        ...     ...
       0.472   1.765 ).

Now we can use the factor scores in scatter plots. For example, Figure 10.1 contains
a scatter plot of the estimated factor scores for the two-factor model. They are by
construction uncorrelated, but one can see how diligence has a much longer lower
tail (lazy people?).
We also calculated box plots to compare the women’s and men’s distribution on
the factors:

par(mfrow=c(1,2))
yl <- range(xhat)   # To obtain the same y-scales
w <- (x[,2]==1)     # Whether women (T) or not.
boxplot(list(Women=xhat[w,1],Men=xhat[!w,1]),main='Factor 1',ylim=yl)
boxplot(list(Women=xhat[w,2],Men=xhat[!w,2]),main='Factor 2',ylim=yl)

See Figure 10.2. There do not appear to be any large overall differences.
[Figure 10.2: Box plots comparing the women and men on their factor scores (left panel: Diligence; right panel: Test taking ability).]

10.4 Some symmetry models


Some structural models on covariance matrices, including testing independence, can
be defined through group symmetries. The advantage of such models is that the
likelihood estimates and tests are very easy to implement. The ones we present
are called invariant normal models as defined in Andersson [1975]. We will be
concerned with these models’ restrictions on the covariance matrices. More generally,
the models are defined on the means as well. Basically, the models are ones which
specify certain linear constraints among the elements of the covariance matrix.
The model starts with
Y ∼ Nn×q (0, In ⊗ Σ), (10.95)
and a q × q group G (see (5.58)), a subgroup of the group of q × q orthogonal matrices
Oq . The model demands that the distribution of Y be invariant under multiplication
on the right by elements of G , that is,

Yg =_D Y (equality in distribution) for all g ∈ G.   (10.96)

Now because Cov(Yg) = In ⊗ g Σg, (10.95) and (10.96) imply that

Σ = g Σg for all g ∈ G . (10.97)

Thus we can define the mean zero invariant normal model based on G to be (10.95)
with
Σ ∈ Sq+ (G) ≡ {Σ ∈ Sq+ | Σ = g Σg for all g ∈ G}. (10.98)

A few examples are in order at this point. Typically, the groups are fairly simple
groups.

10.4.1 Some types of symmetry


Independence and block independence
Partition the variables in Y so that

Y = (Y1 , . . . , YK ), where Yk is n × q k , (10.99)

and

Σ = ( Σ11  Σ12  · · ·  Σ1K
      Σ21  Σ22  · · ·  Σ2K
              ...
      ΣK1  ΣK2  · · ·  ΣKK ),   Σkl is qk × ql.   (10.100)
Independence of one block of variables from the others entails setting the covari-
ance to zero, that is, Σkl = 0 means Yk and Yl are independent. Invariant normal
models can specify a block being independent of all the other blocks. For example,
suppose K = 3. Then the model that Y1 is independent of (Y2 , Y3 ) has
Σ = ( Σ11   0    0
       0   Σ22  Σ23
       0   Σ32  Σ33 ).   (10.101)

The group that gives rise to that model consists of two elements:
G = { ( Iq1  0    0         ( −Iq1  0    0
         0  Iq2   0     ,      0   Iq2   0
         0   0   Iq3 )         0    0   Iq3 ) }.   (10.102)

(The first element is just Iq , of course.) It is easy to see that Σ of (10.101) is invariant
under G of (10.102). Lemma 10.1 below can be used to show any Σ in S + (G) is of the
form (10.101).
If the three blocks Y1 , Y2 and Y3 are mutually independent, then Σ is block diago-
nal,

Σ = ( Σ11   0    0
       0   Σ22   0
       0    0   Σ33 ),   (10.103)

and the corresponding G consists of the eight matrices

G = { ( ±Iq1   0     0
         0   ±Iq2    0
         0     0   ±Iq3 ) }.   (10.104)

An extreme case is when all variables are mutually independent, so that q k = 1


for each k (and K = q), Σ is diagonal, and G consists of all diagonal matrices with
±1’s down the diagonal.

Intraclass correlation structure


The intraclass correlation structure arises when the variables are interchangeable.
For example, the variables may be similar measurements (such as blood pressure)
made several times, or scores on sections of an exam, where the sections are all

measuring the same ability. In such cases, the covariance matrix would have equal
variances, and equal covariances:
Σ = σ² ( 1  ρ  · · ·  ρ
         ρ  1  · · ·  ρ
              ...
         ρ  · · ·  ρ  1 ).   (10.105)

The G for this model is the group of q × q permutation matrices Pq . (A permutation


matrix has exactly one 1 in each row and one 1 in each column, and zeroes elsewhere.
If g is a permutation matrix and x is a q × 1 vector, then gx contains the same elements
as x, but in a different order.)

Compound symmetry
Compound symmetry is an extension of intraclass symmetry, where there are groups
of variables, and the variables within each group are interchangeable. Such models
might arise, e.g., if students are given three interchangeable batteries of math ques-
tions, and two interchangeable batteries of verbal questions. The covariance matrix
would then have the form
Σ = ( a  b  b  c  c
      b  a  b  c  c
      b  b  a  c  c     (10.106)
      c  c  c  d  e
      c  c  c  e  d ).

In general, the group would consist of block diagonal matrices, with permutation
matrices as the blocks. That is, with Σ partitioned as in (10.100),
G = { ( G1  0   · · ·  0
         0  G2  · · ·  0
               ...
         0  0   · · ·  GK )  |  G1 ∈ Pq1, G2 ∈ Pq2, . . . , GK ∈ PqK }.   (10.107)

IID, or spherical symmetry


Combining independence and intraclass correlation structure yields Σ = σ2 Iq , so that
the variables are independent and identically distributed. The group for this model is
the set of permutation matrices augmented with ± signs on the 1’s. (Recall Exercise
8.8.6.)
The largest group possible for these models is the group of q × q orthogonal ma-
trices. When (10.96) holds for all orthogonal g, the distribution of Y is said to be
spherically symmetric. It turns out that this choice also yields Σ = σ2 Iq . This result is a
reflection of the fact that iid and spherical symmetry are the same for the multivari-
ate normal distribution. If Y has some other distribution, then the two models are
distinct, although they still have the same covariance structure.

10.4.2 Characterizing the structure


It is not always obvious given a structure for the covariance matrix to find the cor-
responding G , or even to decide whether there is a corresponding G . But given the
group, there is a straightforward method for finding the structure. We will consider
just finite groups G , but the idea extends to general groups, in which case we would
need to introduce uniform (Haar) measure on these groups.
For given finite group G and general Σ, define the average of Σ by

Σ̄ = (1/#G) ∑_{g∈G} g′Σg.   (10.108)

It should be clear that if Σ ∈ S_q^+(G), then Σ̄ = Σ. The next lemma shows that all averages are in S_q^+(G).

Lemma 10.1. For any Σ ∈ S_q^+, Σ̄ ∈ S_q^+(G).

Proof. For any h ∈ G,

h′Σ̄h = (1/#G) ∑_{g∈G} h′g′Σgh
      = (1/#G) ∑_{g*∈G} g*′Σg*
      = Σ̄.   (10.109)

The second line follows by setting g* = gh, and noting that as g runs over G, so does g*. (This is where the requirement that G is a group is needed.) But (10.109) implies that Σ̄ ∈ S_q^+(G).

The lemma shows that

S_q^+(G) = { Σ̄ | Σ ∈ S_q^+ },   (10.110)

so that one can discover the structure of covariance matrices invariant under a partic-
ular group by averaging a generic Σ. That is how one finds the structures in (10.101),
(10.103), (10.106), and (10.107) from their respective groups.
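This averaging is easy to try numerically. The following R sketch (ours, not from the text) averages an arbitrary symmetric Σ over the group of q × q permutation matrices, and the result has the intraclass correlation form (10.105):

perms <- function(v) {   # all permutations of a vector (a small ad hoc helper)
   if (length(v)==1) return(list(v))
   out <- NULL
   for (i in 1:length(v)) for (p in perms(v[-i])) out <- c(out, list(c(v[i],p)))
   out
}
q <- 3
sigma <- matrix(c(4,1,2, 1,3,0, 2,0,5), q, q)          # a generic symmetric Sigma
gs <- lapply(perms(1:q), function(p) diag(q)[p,])      # the permutation matrices
sigmabar <- Reduce('+', lapply(gs, function(g) t(g)%*%sigma%*%g))/length(gs)   # (10.108)
sigmabar   # equal diagonals (4), equal off-diagonals (1)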

10.4.3 Maximum likelihood estimates

The maximum likelihood estimate of Σ in (10.98) is the Σ̂ ∈ S_q^+(G) that maximizes

L(Σ; y) = |Σ|^{−n/2} e^{−trace(Σ^{−1}u)/2},   Σ ∈ S_q^+(G),  where u = y′y.   (10.111)
| Σ| n/2

The requirement Σ ∈ Sq+ (G) means that

Σ−1 = (g Σg)−1 = g Σ−1 g (10.112)



for any g ∈ G, that is, Σ^{−1} ∈ S_q^+(G), hence

trace(Σ^{−1}u) = trace( ( (1/#G) ∑_{g∈G} g′Σ^{−1}g ) u )
             = trace( (1/#G) ∑_{g∈G} g′Σ^{−1}g u )
             = trace( (1/#G) ∑_{g∈G} Σ^{−1}g u g′ )
             = trace( Σ^{−1} (1/#G) ∑_{g∈G} g u g′ )
             = trace(Σ^{−1} ū).   (10.113)

Thus

L(Σ; y) = |Σ|^{−n/2} e^{−trace(Σ^{−1}ū)/2}.   (10.114)
We know from Lemma 9.1 that the maximizer of L in (10.114) over Σ ∈ S_q^+ is

Σ̂ = ū/n,   (10.115)

but since that maximizer is in Sq+ (G) by Lemma 10.1, and Sq+ (G) ⊂ Sq+ , it must
be the maximizer over Sq+ (G). That is, (10.115) is indeed the maximum likelihood
estimate for (10.111).
To illustrate, let S = U/n. Then if G is as in (10.104), so that the model is that
three sets of variables are independent (10.103), the maximum likelihood estimate is
the sample analog
Σ̂(G) = ( S11   0    0
          0   S22   0
          0    0   S33 ).   (10.116)
In the intraclass correlation model (10.105), the group is the set of q × q permutation
matrices, and the maximum likelihood estimate has the same form,
Σ̂(G) = σ̂² ( 1   ρ̂  · · ·  ρ̂
             ρ̂   1  · · ·  ρ̂
                  ...
             ρ̂  · · ·  ρ̂   1 ),   (10.117)

where

σ̂² = (1/q) ∑_{i=1}^{q} s_ii,   and   ρ̂ σ̂² = ∑_{1≤i<j≤q} s_ij / (q(q − 1)/2).   (10.118)

That is, the common variance is the average of the original variances, and the common
covariance is the average of the original covariances.
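A minimal R sketch of (10.117) and (10.118), assuming s is the q × q sample covariance matrix (e.g., u/n):

intraclass.mle <- function(s) {
   q <- nrow(s)
   sigma2 <- mean(diag(s))                             # average variance
   rho <- (sum(s[upper.tri(s)])/(q*(q-1)/2))/sigma2    # average covariance divided by sigma2
   list(sigma2=sigma2, rho=rho,
      sigmahat=sigma2*((1-rho)*diag(q)+rho*matrix(1,q,q)))   # see (10.126) below
}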

10.4.4 Hypothesis testing and model selection


The deviance for the model defined by the group G is, by (10.3),

deviance(M(G)) = n log(|Σ̂(G)|),   (10.119)

where we drop the exponential term since nq is the same for all models. We can
then use this deviance in finding AIC’s or BIC’s for comparing such models, once we
figure out the dimensions of the models, which is usually not too hard. E.g., if the
model is that Σ is unrestricted, so that G A = {Iq }, the trivial group, the dimension
for HA is q(q + 1)/2. The dimension for the independence model in (10.103) and (10.104) sums the dimensions for the diagonal blocks: q1(q1 + 1)/2 + q2(q2 + 1)/2 + q3(q3 + 1)/2. The dimension
for the intraclass correlation model (10.105) is 2 (for the variance and covariance).
Also, the likelihood ratio statistic for testing two nested invariant normal models
is easy to find. These testing problems use two nested groups, G A ⊂ G0 , so that the
hypotheses are
H0 : Σ ∈ Sq+ (G0 ) versus H A : Σ ∈ Sq+ (G A ), (10.120)

Note that the larger G , the smaller Sq+ (G), since fewer covariance matrices are in-
variant under a larger group. Then the likelihood ratio test statistic, 2 log( LR), is the
difference of the deviances, as in (9.49).

The mean is not zero


So far this subsection assumed the mean is zero. In the more general case that Y ∼
N_{n×q}(xβ, I_n ⊗ Σ), estimate Σ restricted to S_q^+(G) by finding U = Y′Q_xY, then taking

Σ̂ = U/n   or   U/(n − p),   (10.121)

(where x is n × p), depending on whether you want the maximum likelihood estimate
or an unbiased estimate. In testing, I would suggest taking the unbiased versions,
then using

deviance(M(G)) = (n − p) log(|Σ̂(G)|).   (10.122)

10.4.5 Example: Mouth sizes


Continue from Section 7.3.1 with the mouth size data, using the model (7.40). Because
the measurements within each subject are of the same mouth, a reasonable question
to ask is whether the residuals within each subject are exchangeable, i.e., whether Σ R
has the intraclass correlation structure (10.105). Let U = Y Qx Y and the unrestricted
estimate be Σ A = U/ν for ν = n − 2 = 25. Then Σ  A and the estimate under the
intraclass correlation hypothesis Σ 0 , given in (10.117) and (10.118), are
Σ̂_A = ( 5.415  2.717  3.910  2.710
        2.717  4.185  2.927  3.317
        3.910  2.927  6.456  4.131     (10.123)
        2.710  3.317  4.131  4.986 )

and

Σ̂_0 = ( 5.260  3.285  3.285  3.285
        3.285  5.260  3.285  3.285
        3.285  3.285  5.260  3.285     (10.124)
        3.285  3.285  3.285  5.260 ).

To test the null hypothesis that the intraclass correlation structure holds, versus
the general model, we have from (10.119)

2 log(LR) = 25 (log(|Σ̂_0|) − log(|Σ̂_A|)) = 9.374.   (10.125)

The dimension for the general model is d A = q (q + 1)/2 = 10, and for the null is
just d0 = 2, thus the degrees of freedom for this statistic is d f = d A − d0 = 8. The
intraclass correlation structure appears to be plausible.
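A sketch of this test in R, where sigmaA is assumed to hold the unrestricted estimate (10.123), and intraclass.mle is the ad hoc helper sketched after (10.118):

nu <- 25
sigma0 <- intraclass.mle(sigmaA)$sigmahat          # the estimate in (10.124)
lr2 <- nu*(log(det(sigma0)) - log(det(sigmaA)))    # about 9.374
pchisq(lr2, 10-2, lower.tail=FALSE)                # p-value, about 0.31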
We can exploit this structure (10.105) on the Σ R to more easily test hypotheses
about the β in both-sides models like (7.40). First, we transform the matrix Σ R into a
diagonal matrix with two distinct variances. Notice that we can write this covariance
as
Σ R = σ2 (1 − ρ)Iq + σ2 ρ1q 1q . (10.126)

Let Γ be any q × q orthogonal matrix whose first column is proportional to 1_q, i.e., 1_q/√q. Then

Γ′Σ_R Γ = σ²(1 − ρ)Γ′Γ + σ²ρ Γ′1_q 1_q′Γ
        = σ²(1 − ρ)I_q + σ²ρ (√q, 0, . . . , 0)′(√q, 0, . . . , 0)

        = σ² ( 1 + (q − 1)ρ        0
                    0         (1 − ρ)I_{q−1} )  ≡  Λ.   (10.127)

We used the fact that because all columns of Γ except the first are orthogonal to 1_q, Γ′1_q = √q (1, 0, . . . , 0)′. As suggested by the notation, this Λ is indeed the eigenvalue matrix for Σ_R, and Γ contains a corresponding set of eigenvectors.
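One way to construct such a Γ in R, and to check the diagonalization (10.127) numerically, is sketched below; the values of q, σ² and ρ are only for illustration:

q <- 4; sigma2 <- 5.26; rho <- 3.285/5.26
gam <- qr.Q(qr(cbind(rep(1,q), diag(q)[,-1])))         # first column proportional to 1_q
sigmaR <- sigma2*((1-rho)*diag(q)+rho*matrix(1,q,q))   # the form (10.126)
round(t(gam)%*%sigmaR%*%gam, 6)   # diagonal: sigma2*(1+(q-1)*rho), then sigma2*(1-rho)'s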
In the model (7.40), the z is almost an appropriate Γ:
z = ( 1  −3   1  −3
      1  −1  −1   1
      1   1  −1  −1     (10.128)
      1   3   1   3 ).

The columns are orthogonal, and the first is 14 , so we just have to divide each column
by its length to obtain orthonormal columns. The squared lengths of the columns are
the diagonals of z z : (4, 20, 4, 20). Let Δ be the square root of z z,
Δ = ( 2    0   0    0
      0  √20   0    0
      0    0   2    0      (10.129)
      0    0   0  √20 ),

and set
Γ = zΔ−1 and β∗ = βΔ, (10.130)
so that the both-sides model can be written

Y = xβΔΔ−1 z + R = xβ∗ Γ  + R. (10.131)

Multiplying everything on the right by Γ yields

Y∗ ≡ YΓ = xβ∗ + R∗ , (10.132)

where
R∗ ≡ RΓ ∼ N (0, In ⊗ Γ  Σ R Γ ) = N (0, In ⊗ Λ). (10.133)
This process is so far similar to that in Section 7.5.1. The estimate of β∗ is straightfor-
ward:

β̂* = (x′x)^{−1} x′Y* = ( 49.938   3.508   0.406  −0.252
                         −4.642  −1.363  −0.429   0.323 ).   (10.134)

These estimates are the same as those for model (6.28), multiplied by Δ as in (10.130).
The difference is in their covariance matrix:

β∗ ∼ N ( β∗ , Cx ⊗ Λ). (10.135)

To estimate the standard errors of the estimates, we look at the sum of squares
and cross products of the estimated residuals,

U∗ = Y∗ Qx Y∗ ∼ Wishartq (ν, Λ), (10.136)

where ν = trace (Qx ) = n − p = 27 − 2 = 25. Because the Λ in (10.127) is diagonal,


the diagonals of U∗ are independent scaled χ2ν ’s:


U*_11 ∼ τ0² χ²_ν,   U*_jj ∼ τ1² χ²_ν,   j = 2, . . . , q = 4.   (10.137)

Unbiased estimates are

τ̂0² = U*_11/ν ∼ (τ0²/ν) χ²_ν   and   τ̂1² = (U*_22 + · · · + U*_qq)/(ν(q − 1)) ∼ (τ1²/(ν(q − 1))) χ²_{ν(q−1)}.   (10.138)

For our data,

τ̂0² = 377.915/25 = 15.117   and   τ̂1² = (59.167 + 26.041 + 62.919)/75 = 1.975.   (10.139)

The estimated standard errors of the β̂*_ij's from (10.135) are found from

Cx ⊗ Λ̂ = (  0.0625  −0.0625 ) ⊗ ( 15.117   0      0      0
            −0.0625   0.1534 )   (  0      1.975   0      0
                                     0      0      1.975   0
                                     0      0       0     1.975 ),   (10.140)

being the square roots of the diagonals:

Standard errors
Constant Linear Quadratic Cubic
(10.141)
Boys 0.972 0.351 0.351 0.351
Girls−Boys 1.523 0.550 0.550 0.550

The t-statistics divide (10.134) by their standard errors:

t-statistics
Constant Linear Quadratic Cubic
(10.142)
Boys 51.375 9.984 1.156 −0.716
Girls−Boys −3.048 −2.477 −0.779 0.586

These statistics are not much different from what we found in Section 6.4.1, but the
degrees of freedom for all but the first column are now 75, rather than 25. The main
impact is in the significance of δ1 , the difference between the girls’ and boys’ slopes.
Previously, the p-value was 0.033 (the t = −2.26 on 25 df). Here, the p-value is 0.016,
a bit stronger suggestion of a difference.

10.5 Exercises
Exercise 10.5.1. Verify the likelihood ratio statistic (10.21) for testing the equality of
several covariance matrices as in (10.20).
Exercise 10.5.2. Verify that trace(Σ^{−1}U) = trace(Σ11^{−1}U11) + trace(Σ22^{−1}U22), as in (10.27), for Σ being block-diagonal, i.e., Σ12 = 0 in (10.23).
Exercise 10.5.3. Show that the value of 2 log( LR) of (10.31) does not change if the ν’s
in the denominators are erased.
Exercise 10.5.4. Suppose U ∼ Wishartq (ν, Σ), where Σ is partitioned as
Σ = ( Σ11  Σ12  · · ·  Σ1K
      Σ21  Σ22  · · ·  Σ2K
              ...
      ΣK1  ΣK2  · · ·  ΣKK ),   (10.143)

where Σij is q i × q j , and the q i ’s sum to q. Consider testing the null hypothesis that
the blocks are mutually independent, i.e.,

H0 : Σij = 0 for 1 ≤ i < j ≤ K, (10.144)

versus the alternative that Σ is unrestricted. (a) Find the 2 log( LR), and the degrees
of freedom in the χ2 approximation. (The answer is analogous to that in (10.35).) (b)
Let U∗ = AUA for some diagonal matrix A with positive diagonal elements. Replace
the U in 2 log( LR) with U∗ . Show that the value of the statistic remains the same.
(c) Specialize to the case that all q i = 1 for all i, so that we are testing the mutual
independence of all the variables. Let C be the sample correlation matrix. Show that
2 log(LR) = −ν log(|C|). [Hint: Find the appropriate A from part (b).]

Exercise 10.5.5. Show that (10.55) holds.


Exercise 10.5.6. Suppose β and α are both p × q, p < q, and let their decompositions
from (10.61) and (10.62) be β = Qβ∗ and α = Γα∗ , where Q and Γ are orthogonal,
α∗ij = β∗ij = 0 for i > j, and α∗ii ’s and β ii ’s are positive. (We assume the first p columns
of α, and of β, are linearly independent.) Show that α α = β β if and only if α∗ = β∗ .
[Hint: Use the uniqueness of the QR decomposition, in Theorem 5.3.]
Exercise 10.5.7. Show that the conditional parameters in (10.76) are as in (10.77) and
(10.78).
Exercise 10.5.8. Show that if the factor analysis is fit using the correlation matrix,
then the correlation between variable j and factor i is estimated to be β̂_{ij}, the loading
of factor i on variable j.
Exercise 10.5.9. What is the factor analysis model with no factors (i.e., erase the β in
(10.57))? Choose from the following: The covariance of Y is unrestricted; the mean
of Y is 0; the Y variables are mutually independent; the covariance matrix of Y is a
constant times the identity matrix.
Exercise 10.5.10. Show that if Σ ∈ Sq+ (G) of (10.98), that Σ in (10.108) is in Sq+ (G).
Exercise 10.5.11. Verify the steps in (10.113).
Exercise 10.5.12. Show that if Σ has intraclass correlation structure (10.105), then
Σ = σ2 (1 − ρ)Iq + σ2 ρ1q 1q as in (10.126).
Exercise 10.5.13. Multivariate complex normals arise in spectral analysis of multiple
time series. A q-dimensional complex normal is Y1 + i Y2 , where Y1 and Y2 are 1 × q
real normal vectors with joint covariance of the form

Σ = Cov[(Y_1  Y_2)] = (  Σ_1   F  ),    (10.145)
                      ( −F    Σ_1 )

i.e., Cov(Y_1) = Cov(Y_2). Here, “i” is the imaginary i = √−1. (a) Show that F =
Cov(Y_1, Y_2) is skew-symmetric, which means that F′ = −F. (b) What is F when q = 1?
(c) Show that the set of Σ’s as in (10.145) is the set S^+_{2q}(G) in (10.98) with

G = { I_{2q},  ( 0  −I_q ; I_q  0 ),  ( 0  I_q ; −I_q  0 ),  −I_{2q} }.    (10.146)

Exercise 10.5.14 (Mouth sizes). For the boys’ and girls’ mouth size data in Table 4.1,
let Σ B be the covariance matrix for the boys’ mouth sizes, and Σ G be the covariance
matrix for the girls’ mouth sizes. Consider testing

H_0 : Σ_B = Σ_G   versus   H_A : Σ_B ≠ Σ_G.    (10.147)

(a) What are the degrees of freedom for the boys’ and girls’ sample covariance matrices?
(b) Find |Σ̂_B|, |Σ̂_G|, and the pooled |Σ̂|. (Use the unbiased estimates of the Σ_i’s.)
(c) Find 2 log(LR). What are the degrees of freedom for the χ^2? What is the p-value?
Do you reject the null hypothesis (if α = .05)? (d) Look at trace(Σ̂_B), trace(Σ̂_G). Also,
look at the correlation matrices for the girls and for the boys. What do you see?

Exercise 10.5.15 (Mouth sizes). Continue with the mouth size data from Exercise
10.5.14. (a) Test whether Σ B has the intraclass correlation structure (versus the gen-
eral alternative). What are the degrees of freedom for the χ2 ? (b) Test whether Σ G
has the intraclass correlation structure. (c) Now assume that both Σ B and Σ G have
the intraclass correlation structure. Test whether the covariances matrices are equal.
What are the degrees of freedom for this test? What is the p-value? Compare this
p-value to that in Exercise 10.5.14, part (c). Why is it so much smaller (if it is)?

Exercise 10.5.16 (Grades). This problem considers the grades data. In what follows,
use the pooled covariance matrix in (10.18), which has ν = 105. (a) Test the indepen-
dence of the first three variables (homework, labs, inclass) from the fourth variable,
the midterms score. (So leave out the final exams at this point.) Find l_1, l_2, |Σ̂_{11}|, |Σ̂_{22}|,
and |Σ̂|. Also, find 2 log(LR) and the degrees of freedom for the χ^2. Do you accept or
reject the null hypothesis? (b) Now test the conditional independence of the set (home-
work, labs, inclass) from the midterms, conditioning on the final exam score. What
is the ν for the estimated covariance matrix now? Find the new l_1, l_2, |Σ̂_{11}|, |Σ̂_{22}|, and
|Σ̂|. Also, find 2 log(LR) and the degrees of freedom for the χ^2. Do you accept or
reject the null hypothesis? (c) Find the correlations between the homework, labs and
inclass scores and the midterms scores, as well as the conditional correlations given
the final exam. What do you notice?

Exercise 10.5.17 (Grades). The table in (10.93) has the BIC’s for the one-factor, two-
factor, and unrestricted models for the Grades data. Find the deviance, dimension,
and BIC for the zero-factor model, M0 . (See Exercise 10.5.9.) Find the estimated
probabilities of the four models. Compare the results to those without M0 .

Exercise 10.5.18 (Exams). The exams matrix has data on 191 statistics students, giving
their scores (out of 100) on the three midterm exams, and the final exam. (a) What
is the maximum number of factors that can be estimated? (b) Give the number of
parameters in the covariance matrices for the 0, 1, 2, and 3 factor models (even if they
are not estimable). (c) Plot the data. There are three obvious outliers. Which obser-
vations are they? What makes them outliers? For the remaining exercise, eliminate
these outliers, so that there are n = 188 observations. (d) Test the null hypothesis
that the four exams are mutually independent. What are the adjusted 2 log(LR)^∗
(in (10.70)) and degrees of freedom for the χ^2? What do you conclude? (e) Fit the
one-factor model. What are the loadings? How do you interpret them? (f) Look at
the residual matrix C − β̂′β̂, where C is the observed correlation matrix of the orig-
inal variables. If the model fits exactly, what values would the off-diagonals of the
residual matrix be? What is the largest off-diagonal in this observed matrix? Are the
diagonals of this matrix the uniquenesses? (g) Does the one-factor model fit?

Exercise 10.5.19 (Exams). Continue with the Exams data from Exercise 10.5.18. Again,
do not use the outliers found in part (c). Consider the invariant normal model where
the group G consists of 4 × 4 matrices of the form
G = ( G^∗   0_3 ; 0_3   1 ),    (10.148)
where G∗ is a 3 × 3 permutation matrix. Thus the model is an example of com-
pound symmetry, from Section 10.4.1. The model assumes the three midterms are

interchangeable. (a) Give the form of a covariance matrix Σ which is invariant un-
der that G . (It should be like the upper-left 4 × 4 block of the matrix in (10.106).)
How many free parameters are there? (b) For the exams data, give the MLE of the
covariance matrix under the assumption that it is G -invariant. (c) Test whether this
symmetry assumption holds, versus the general model. What are the degrees of free-
dom? For which elements of Σ is the null hypothesis least tenable? (d) Assuming Σ
is G-invariant, test whether the first three variables are independent of the last. (That
is, the null hypothesis is that Σ is G -invariant and σ14 = σ24 = σ34 = 0, while the
alternative is that Σ is G -invariant, but otherwise unrestricted.) What are the degrees
of freedom for this test? What do you conclude?

Exercise 10.5.20 (South Africa heart disease). The data for this question comes from
a study of heart disease in adult males in South Africa from Rousseauw et al. [1983].
(We return to these data in Section 11.8.) The R data frame is SAheart, found in
the ElemStatLearn package [Halvorsen, 2009]. The main variable of interest is “chd”,
congestive heart disease, where 1 indicates the person has the disease, 0 he does
not. Explanatory variables include sbp (measurements on blood pressure), tobacco
use, ldl (bad cholesterol), adiposity (fat %), family history of heart disease (absent
or present), type A personality, obesity, alcohol usage, and age. Here you are to
find common factors among the explanatory variables excluding age and family
history. Take logs of the variables sbp, ldl, and obesity, and cube roots of alcohol
and tobacco, so that the data look more normal. Age is used as a covariate. Thus
Y is n × 7, and D = (1n xage ). Here, n = 462. (a) What is there about the tobacco
and alcohol variables that is distinctly non-normal? (b) Find the sample correlation
matrix of the residuals from the Y = Dγ + R model. Which pairs of variables have
correlations over 0.25, and what are their correlations? How would you group these
variables? (c) What is the largest number of factors that can be fit for this Y? (d) Give
the BIC-based probabilities of the p-factor models for p = 0 to the maximum found
in part (c), and for the unrestricted model. Which model has the highest probability?
Does this model fit, according to the χ2 goodness of fit test? (e) For the most probable
model from part (d), which variables’ loadings are highest (over 0.25) for each factor?
(Use the varimax rotation for the loadings.) Give relevant names to the two factors.
Compare the factors to what you found in part (b). (f) Keeping the same model,
find the estimated factor scores. For each factor, find the two-sample t-statistic for
comparing the people with heart disease to those without. (The statistics are not
actually distributed as Student’s t, but do give some measure of the difference.) (g)
Based on the statistics in part (f), do any of the factors seem to be important factors in
predicting heart disease in these men? If so, which one(s). If not, what are the factors
explaining?

Exercise 10.5.21 (Decathlon). Exercise 1.9.20 created a biplot for the decathlon data.
The data consist of the scores (number of points) on each of ten events for the top 24
men in the decathlon at the 2008 Olympics. For convenience, rearrange the variables
so that the running events come first, then the jumping, then throwing (ignoring the
overall total):
y <- decathlon[,c(1,5,10,6,3,9,7,2,4,8)]
Fit the 1, 2, and 3 factor models. (The chi-squared approximations for the fit might
not be very relevant, because the sample size is too small.) Based on the loadings, can

you give an interpretation of the factors? Based on the uniquenesses, which events
seem to be least correlated with the others?
Chapter 11

Classification

Multivariate analysis of variance seeks to determine whether there are differences


among several groups, and what those differences are. Classification is a related area
in which one uses observations whose group memberships are known in order to
classify new observations whose group memberships are not known. This goal was
the basic idea behind the gathering of the Fisher/Anderson iris data (Section 1.3.1).
Based on only the petal and sepal measurements of a new iris, can one effectively
classify it into one of the three species setosa, virginica and versicolor? See Figure 1.4
for an illustration of the challenge.
The task is prediction, as in Section 7.6, except that rather than predicting a con-
tinuous variable Y, we predict a categorical variable. We will concentrate mainly on
linear methods arising from Fisher’s methods, logistic regression, and trees. There is
a vast array of additional approaches, including using neural networks, support vec-
tor machines, boosting, bagging, and a number of other flashy-sounding techniques.
Related to classification is clustering (Chapter 12), in which one assumes that there
are groups in the population, but which groups the observations reside in is unob-
served, analogous to the factors in factor analysis. In the machine learning com-
munity, classification is supervised learning, because we know the groups and have
some data on group membership, and clustering is unsupervised learning, because
group membership itself must be estimated. See the book by Hastie, Tibshirani, and
Friedman [2009] for a fine statistical treatment of machine learning.
The basic model is a mixture model, presented in the next section.

11.1 Mixture models


The mixture model we consider assumes there are K groups, numbered from 1 to K,
and p predictor variables on which to base the classifications. The data then consist
of n observations, each a 1 × ( p + 1) vector,

(Xi , Yi ), i = 1, . . . , n, (11.1)

where Xi is the 1 × p vector of predictors for observation i, and Yi is the group number
of observation i, so that Yi ∈ {1, . . . , K }. Marginally, the proportion of the population
in group k is
P [Y = k] = π k , k = 1, . . . , K. (11.2)



Figure 11.1: Three densities, plus a mixture of the three (the thick line).

Each group then has a conditional distribution Pk :


Xi | Yi = k ∼ Pk , (11.3)
where the Pk will (almost always) depend on some unknown parameters. Assuming
that Pk has density f k (x), the joint density of (Xi , Yi ) is a mixed one as in (2.11), with
f (xi , y i ) = f y i (xi ) π y i . (11.4)
The marginal pdf of Xi is found by summing the joint density over the groups:
f (xi ) = π 1 f 1 (xi ) + · · · + π K f K (xi ). (11.5)
For example, suppose that K = 3, π1 = π2 = π3 = 1/3, and the three groups are,
conditionally,
X | Y = 1 ∼ N (5, 1), X | Y = 2 ∼ N (5, 22 ), X | Y = 3 ∼ N (10, 1). (11.6)
Figure 11.1 exhibits the three pdfs, plus the mixture pdf, which is the thick black line.
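A plot along the lines of Figure 11.1 takes only a few lines in R; a small sketch for the
densities in (11.6), each weighted by 1/3:
x <- seq(0,15,by=0.01)
f1 <- dnorm(x,5,1)                # N(5,1)
f2 <- dnorm(x,5,2)                # N(5,2^2)
f3 <- dnorm(x,10,1)               # N(10,1)
mix <- (f1+f2+f3)/3               # the mixture density, as in (11.5)
matplot(x,cbind(f1,f2,f3,mix),type='l',lty=c(2,3,4,1),lwd=c(1,1,1,3),col=1,ylab='pdf')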
The data for classification includes the group index, so that the joint distributions
of (Xi , Yi ) are operative, meaning we can estimate the individual densities. The over-
all density for the data is then
∏_{i=1}^n f(x_i, y_i) = ∏_{i=1}^n f_{y_i}(x_i) π_{y_i}

                      = [ π_1^{N_1} ∏_{i | y_i = 1} f_1(x_i) ] · · · [ π_K^{N_K} ∏_{i | y_i = K} f_K(x_i) ]

                      = [ ∏_{k=1}^K π_k^{N_k} ] × [ ∏_{k=1}^K ∏_{i | y_i = k} f_k(x_i) ],    (11.7)

where Nk is the number of observations in group k:

Nk = #{yi = k}. (11.8)

The classification task arises when a new observation, X New , arrives without its group
identification Y New, so its density is that of the mixture. We have to guess what the
group is.
In clustering, the data themselves are without group identification, so we have just
the marginal distributions of the Xi . Thus the joint pdf for the data is

Πni=1 f (xi ) = Πni=1 (π1 f 1 (xi ) + · · · + π K f K (xi )). (11.9)

Thus clustering is similar to classifying new observations, but without having any
previous y data to help estimate the π k ’s and f k ’s. See Section 12.3.

11.2 Classifiers
A classifier is a function C that takes the new observation, and emits a guess at its
group:
C : X −→ {1, . . . , K }, (11.10)
where X is the space of X New . The classifier may depend on previous data, as well
as on the π k ’s and f k ’s, but not on the Y New . A good classifier is one that is unlikely
to make a wrong classification. Thus a reasonable criterion for a classifier is the
probability of an error:
P[C(X^{New}) ≠ Y^{New}].    (11.11)
We would like to minimize that probability. (This criterion assumes that any type of
misclassification is equally bad. If that is an untenable assumption, then one can use
a weighted probability:

∑_{k=1}^K ∑_{l=1}^K w_{kl} P[C(X^{New}) = k and Y^{New} = l],    (11.12)

where wkk = 0.)


Under the (unrealistic) assumption that we know the π k ’s and f k ’s, the best guess
of Y New given X New is the group that has the highest conditional probability.
Lemma 11.1. Define the Bayes classifier by

C_B(x) = k   if   P[Y = k | X = x] > P[Y = l | X = x]   for l ≠ k.    (11.13)

Then C B minimizes (11.11) over classifiers C .

Proof. Let I be the indicator function, so that



I[C(X^{New}) ≠ Y^{New}] = { 1 if C(X^{New}) ≠ Y^{New}
                            0 if C(X^{New}) = Y^{New}    (11.14)

and

P[C(X^{New}) ≠ Y^{New}] = E[I[C(X^{New}) ≠ Y^{New}]].    (11.15)

As in (2.34), we have that

E[I[C(X^{New}) ≠ Y^{New}]] = E[e_I(X^{New})],    (11.16)

where

e_I(x^{New}) = E[I[C(X^{New}) ≠ Y^{New}] | X^{New} = x^{New}]
             = P[C(x^{New}) ≠ Y^{New} | X^{New} = x^{New}]
             = 1 − P[C(x^{New}) = Y^{New} | X^{New} = x^{New}].    (11.17)

Thus if we minimize the last expression in (11.17) for each x New , we have minimized
the expected value in (11.16). Minimizing (11.17) is the same as maximizing

P [C(x New ) = Y New | X New = x New ], (11.18)


but that conditional probability can be written
∑_{l=1}^K I[C(x^{New}) = l] P[Y^{New} = l | X^{New} = x^{New}].    (11.19)

This sum equals P[Y^{New} = k | X^{New} = x^{New}] for whichever k C chooses, so to maxi-
mize the sum, choose the k with the highest conditional probability, as in (11.13).

Now the conditional distribution of Y New given X New is obtained from (11.4) and
(11.5) (it is Bayes theorem, Theorem 2.2):

P[Y^{New} = k | X^{New} = x^{New}] = f_k(x^{New}) π_k / ( f_1(x^{New}) π_1 + · · · + f_K(x^{New}) π_K ).    (11.20)

Since, given x New , the denominator is the same for each k, we just have to choose the
k to maximize the numerator:
C_B(x^{New}) = k   if   f_k(x^{New}) π_k > f_l(x^{New}) π_l   for l ≠ k.    (11.21)
We are assuming there is a unique maximum, which typically happens in practice
with continuous variables. If there is a tie, any of the top categories will yield the
optimum.
Consider the example in (11.6). Because the π k ’s are equal, it is sufficient to look at
the conditional pdfs. A given x is then classified into the group with highest density,
as given in Figure 11.2.
Thus the classifications are

         { 1 if 3.640 < x < 6.360
C_B(x) = { 2 if x < 3.640 or 6.360 < x < 8.067 or x > 15.267    (11.22)
         { 3 if 8.067 < x < 15.267
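The cutoffs in (11.22) can be found numerically by evaluating the three densities on
a fine grid and noting where the largest one changes; a quick sketch:
x <- seq(0,20,by=0.001)
fx <- cbind(dnorm(x,5,1),dnorm(x,5,2),dnorm(x,10,1))
cb <- apply(fx,1,which.max)       # Bayes classifications; the pi_k's are equal
x[which(diff(cb)!=0)]             # approximately 3.640, 6.360, 8.067, 15.267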

In practice, the π k ’s and f k ’s are not known, but can be estimated from the data.
Consider the joint density of the data as in (11.7). The π k ’s appear in only the first
term. They can be estimated easily (as in a multinomial situation) by
π̂_k = N_k / n.    (11.23)


Figure 11.2: Three densities, and the regions in which each is the highest. The den-
sities are 1: N(5,1), solid line; 2: N(5,4), dashed line; 3: N(10,1), dashed/dotted line.
Density 2 is also the highest for x > 15.267.

The parameters for the f k ’s can be estimated using the xi ’s that are associated with
group k. These estimates are then plugged into the Bayes formula to obtain an ap-
proximate Bayes classifier. The next section shows what happens in the multivariate
normal case.

11.3 Fisher’s linear discrimination


Suppose that the individual f k ’s are multivariate normal densities with different
means but the same covariance, so that

Xi | Yi = k ∼ N1× p (μk , Σ). (11.24)

The pdf’s (8.48) are then

f_k(x | μ_k, Σ) = c (1/|Σ|^{1/2}) e^{−(1/2)(x−μ_k) Σ^{−1} (x−μ_k)′}

               = c (1/|Σ|^{1/2}) e^{−(1/2) x Σ^{−1} x′} e^{x Σ^{−1} μ_k′ − (1/2) μ_k Σ^{−1} μ_k′}.    (11.25)
We can ignore the factors that are the same for each group, i.e., that do not depend
on k, because we are in quest of the highest pdf× π k . Thus for a given x, we choose
the k to maximize

π_k e^{x Σ^{−1} μ_k′ − (1/2) μ_k Σ^{−1} μ_k′},    (11.26)

or, by taking logs, the k that maximizes

d^∗_k(x) ≡ x Σ^{−1} μ_k′ − (1/2) μ_k Σ^{−1} μ_k′ + log(π_k).    (11.27)

These d∗k ’s are called the discriminant functions. Note that in this case, they are
linear in x, hence linear discriminant functions. It is often convenient to target one
group (say the K th ) as a benchmark, then use the functions

dk (x) = d∗k (x) − d∗K (x), (11.28)

so that the final function is 0.


k =
We still must estimate the parameters, but that is straightforward: take the π
NK /n as in (11.23), estimate the μk ’s by the obvious sample means:

1
Nk {i|y∑=k} i
k =
μ x, (11.29)
i

and estimate Σ by the MLE, i.e., because we are assuming the covariances are equal,
the pooled covariance:

K
 = 1 ∑ ∑ (xi − μ
Σ k ) (xi − μ
k ). (11.30)
n k =1 { i | y = k }
i

(The numerator equals X QX for Q being the projection matrix for the design matrix
indicating which groups the observations are from. We could divide by n − K to
obtain the unbiased estimator, but the classifications would still be essentially the
same, exactly so if the π k ’s are equal.) Thus the estimated discriminant functions are

dk (x) = ck + xak , (11.31)

where

â_k = (μ̂_k − μ̂_K) Σ̂^{−1}   and

ĉ_k = −(1/2) (μ̂_k Σ̂^{−1} μ̂_k′ − μ̂_K Σ̂^{−1} μ̂_K′) + log(π̂_k / π̂_K).    (11.32)
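The examples below use the function lda of Section A.3.1 for these calculations. The
following is not that function, just a minimal sketch of how (11.29), (11.30), and
(11.32) could be coded directly:
fisher.ld <- function(x,y) {
  K <- max(y); n <- nrow(x); p <- ncol(x)
  phat <- as.vector(table(y))/n                                   # pi-hats, as in (11.23)
  mu <- t(sapply(1:K,function(k) colMeans(x[y==k,,drop=FALSE])))  # K x p means, (11.29)
  v <- matrix(0,p,p)
  for(k in 1:K) {
    xc <- sweep(x[y==k,,drop=FALSE],2,mu[k,])                     # center group k
    v <- v + t(xc)%*%xc
  }
  sigmainv <- solve(v/n)                                          # inverse of the pooled MLE (11.30)
  a <- sigmainv%*%t(sweep(mu,2,mu[K,]))                           # column k holds a-hat_k of (11.32)
  q <- diag(mu%*%sigmainv%*%t(mu))
  cc <- -(q-q[K])/2+log(phat/phat[K])                             # the c-hat_k of (11.32)
  list(a=a,c=cc)
}
Applied to the iris data of Section 11.4.1, fisher.ld(x.iris,y.iris) should give coefficients
like those displayed in (11.41).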

Now we can define the classifier based upon Fisher’s linear discrimination func-
tion to be
Ĉ_{FLD}(x) = k   if   d̂_k(x) > d̂_l(x)   for l ≠ k.    (11.33)

(The hat is there to emphasize the fact that the classifier is estimated from the data.) If
p = 2, each set {x | dk (x) = dl (x)} defines a line in the x-space. These lines divide the
space into a number of polygonal regions (some infinite). Each region has the same
Ĉ(x). Similarly, for general p, the regions are bounded by hyperplanes. Figure 11.3
illustrates for the iris data when using just the sepal length and width. The solid line
is the line for which the discriminant functions for setosas and versicolors are equal.
It is basically perfect for these data. The dashed line tries to separate setosas and
virginicas. There is one misclassification. The dashed/dotted line tries to separate
the versicolors and virginicas. It is not particularly successful. See Section 11.4.1 for
a better result using all the variables.


Figure 11.3: Fisher’s linear discrimination for the iris data using just sepal length
and width. The solid line separates setosa (s) and versicolor (v); the dashed line
separates setosa and virginica (r); and the dashed/dotted line separates versicolor
and virginica.

Remark
Fisher’s original derivation in Fisher [1936] of the classifier (11.33) did not start with
the multivariate normal density. Rather, in the case of two groups, he obtained the
p × 1 vector a that maximized the ratio of the squared difference of means of the
variable Xi a for the two groups to the variance:

((μ̂_1 − μ̂_2) a)^2 / (a′ Σ̂ a).    (11.34)
The optimal a is (anything proportional to)

â = (μ̂_1 − μ̂_2) Σ̂^{−1},    (11.35)

which is the a1 in (11.32). Even though our motivation leading to (11.27) is different
than Fisher’s, because we end up with his coefficients, we will refer to (11.31) as
Fisher’s.

11.4 Cross-validation estimate of error


In classification, an error occurs if an observation is misclassified, so one often uses
the criterion (11.11) to assess the efficacy of a classifier. This criterion depends on
the distribution of the X as well as the Y, and needs to be estimated. To relate the

error to the data at hand, we take the criterion to be the probability of error given the
observed Xi ’s, (c.f. the prediction error in (7.88)),

ClassError = (1/n) ∑_{i=1}^n P[Ĉ(X_i^{New}) ≠ Y_i^{New} | X_i^{New} = x_i],    (11.36)

where the (X_i^{New}, Y_i^{New}) are independent, and independent of the data, but with the
same distribution as the data. So the criterion is measuring how well the classifier
would work on a new data set with the same predictors xi .
How does one estimate the error? The obvious approach is to try the classifier on
the data, and count the number of misclassifications:
ClassError_Obs = (1/n) ∑_{i=1}^n I[Ĉ(x_i) ≠ y_i].    (11.37)

As in prediction, this error will be an underestimate because we are using the same
data to estimate the classifier and test it out. A common approach to a fair estimate is
to initially set aside a random fraction of the observations (e.g., 10% to 25%) to be test
data, and use the remaining so-called training data to estimate the classifier. Then
this estimated classifier is tested on the test data.
Cross-validation is a method that takes the idea one step further, by repeatedly
separating the data into test and training data. The “leave-one-out” cross-validation
uses single observations as the test data. It starts by setting aside the first observation,
(x1 , y1 ), and calculating the classifier using the data (x2 , y2 ), . . . , (xn , yn ). (That is, we
find the sample means, covariances, etc., leaving out the first observation.) Call the
resulting classifier C(−1) . Then determine whether this classifier classifies the first
observation correctly:
I[Ĉ^{(−1)}(x_1) ≠ y_1].    (11.38)
The C(−1) and Y1 are independent, so the function in (11.38) is almost an unbiased
estimate (conditionally on X_1 = x_1) of the error

P[Ĉ(X_1^{New}) ≠ Y_1^{New} | X_1^{New} = x_1],    (11.39)

the only reason it is not exactly unbiased is that C(−1) is based on n − 1 observations,
rather than the n for C. This difference should be negligible.
Repeat the process, leaving out each observation in turn, so that C(−i) is the classi-
fier calculated without observation i. Then the almost unbiased estimate of ClassError
in (11.36) is
ClassError_LOOCV = (1/n) ∑_{i=1}^n I[Ĉ^{(−i)}(x_i) ≠ y_i].    (11.40)
If n is large, and calculating the classifier is computationally challenging, then leave-
one-out cross-validation can use up too much computer time (especially if one is
trying a number of different classifiers). Also, the estimate, though nearly unbiased,
might have a high variance. An alternative is to leave out more than one observation
each time, e.g., the 10% cross-validation would break the data set into 10 sets of size ≈
n/10, and for each set, use the other 90% to classify the observations. This approach
is much more computationally efficient, and less variable, but does introduce more
bias. Kshirsagar [1972] contains a number of other suggestions for estimating the
classification error.

11.4.1 Example: Iris data


Turn again to the iris data. Figure 1.3 has the scatter plot matrix. Also see Figures
1.4 and 11.3. In R, the iris data is in the data frame iris. You may have to load the
datasets package. The first four columns constitute the n × p matrix of xi ’s, n = 150,
p = 4. The fifth column has the species, 50 each of setosa, versicolor, and virginica.
The basic variables are then
x.iris <- as.matrix(iris[,1:4])
y.iris <- rep(1:3,c(50,50,50)) # gets group vector (1,...,1,2,...,2,3,...,3)
We will offload many of the calculations to the function lda in Section A.3.1. The
following statement calculates the ak and ck in (11.32):
ld.iris <- lda(x.iris,y.iris)
The ak are in the matrix ld.iris$a and the ck are in the vector ld.iris$c, given below:

k                              â_k                          ĉ_k
1 (Setosa)        11.325  20.309  −29.793  −39.263        18.428
2 (Versicolor)     3.319   3.456   −7.709  −14.944        32.159    (11.41)
3 (Virginica)      0       0        0        0             0

Note that the final coefficients are zero, because of the way we normalize the functions
in (11.28).
To see how well the classifier works on the data, we have to first calculate the
dk (xi ). The following places these values in an n × K matrix disc:
disc <- x.iris%*%ld.iris$a
disc <- sweep(disc,2,ld.iris$c,'+')
The rows corresponding to the first observation from each species are

                   k
  i           1         2       3
  1          97.703    47.400   0
  51        −32.305     9.296   0    (11.42)
  101      −120.122   −19.142   0

The classifier (11.33) classifies each observation into the group corresponding to the
column with the largest entry. Applied to the observations in (11.42), we have

CFLD (x1 ) = 1, CFLD (x51 ) = 2, CFLD (x101 ) = 3, (11.43)

that is, each of these observations is correctly classified into its group. To find the
CFLD ’s for all the observations, use
imax <- function(z) ((1:length(z))[z==max(z)])[1]
yhat <- apply(disc,1,imax)
where imax is a little function to give the index of the largest value in a vector. To see
how close the predictions are to the observed, use the table command:
table(yhat,y.iris)

which yields

              y.iris
  yhat      1    2    3
     1     50    0    0
     2      0   48    1    (11.44)
     3      0    2   49

Thus there were 3 observations misclassified — two versicolors were classified as


virginica, and one virginica was classified as versicolor. Not too bad. The observed
misclassification rate is

ClassError_Obs = #{Ĉ_{FLD}(x_i) ≠ y_i} / n = 3/150 = 0.02.    (11.45)

As noted above in Section 11.4, this value is likely to be an optimistic (underestimate)


of ClassError in (11.36), because it uses the same data to find the classifier and to test
it out. We will find the leave-one-out cross-validation estimate (11.40) using the code
below, where we set varin=1:4 to specify using all four variables.

yhat.cv <- NULL
n <- nrow(x.iris)
for(i in 1:n) {
  dcv <- lda(x.iris[-i,varin],y.iris[-i])
  dxi <- x.iris[i,varin]%*%dcv$a+dcv$c
  yhat.cv <- c(yhat.cv,imax(dxi))
}
sum(yhat.cv!=y.iris)/n

Here, for each i, we calculate the classifier without observation i, then apply it to
that left-out observation i, the predictions placed in the vector yhat.cv. We then count
how many observations were misclassified. In this case, ClassError LOOCV = 0.02,
just the same as the observed classification error. In fact, the same three observations
were misclassified.

Subset selection

The above classifications used all four iris variables. We now see if we can obtain
equally good or better results using a subset of the variables. We use the same loop
as above, setting varin to the vector of indices for the variables to be included. For
example, varin = c(1,3) will use just variables 1 and 3, sepal length and petal length.
Below is a table giving the observed error and leave-one-out cross-validation error
(in percentage) for 15 models, depending on which variables are included in the
classification.


Figure 11.4: Boxplots of the petal widths for the three species of iris. The solid
line separates the setosas from the versicolors, and the dashed line separates the
versicolors from the virginicas.

                Classification errors
  Variables      Observed   Cross-validation
  1                 25.3         25.3
  2                 44.7         48.0
  3                  5.3          6.7
  4                  4.0          4.0
  1, 2              20.0         20.7
  1, 3               3.3          4.0
  1, 4               4.0          4.7
  2, 3               4.7          4.7              (11.46)
  2, 4               3.3          4.0
  3, 4               4.0          4.0
  1, 2, 3            3.3          4.0
  1, 2, 4            4.0          5.3
  1, 3, 4            2.7          2.7
  2, 3, 4            2.0          4.0
  1, 2, 3, 4         2.0          2.0
Note that the cross-validation error estimates are either the same, or a bit larger,
than the observed error rates. The best classifier uses all 4 variables, with an estimated
2% error. Note, though, that Variable 4 (Petal Width) alone has only a 4% error rate.
Also, adding Variable 1 to Variable 4 actually worsens the prediction a little, showing
that adding the extra variation is not worth it. Looking at just the observed error, the
prediction stays the same.
Figure 11.4 shows the classifications using just petal widths. Because the sample
sizes are equal, and the variances are assumed equal, the separating lines between
two species are just the average of their means. We did not plot the line for setosa
versus virginica. There are six misclassifications, two versicolors and four virginicas.
(Two of the latter had the same petal width, 1.5.)
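As a check on this description, the two cutoffs in Figure 11.4 are just midpoints of
adjacent species’ mean petal widths; a small sketch:
pw <- x.iris[,4]                  # the petal widths
means <- tapply(pw,y.iris,mean)   # setosa, versicolor, virginica means
(means[1]+means[2])/2             # setosa/versicolor cutoff
(means[2]+means[3])/2             # versicolor/virginica cutoff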

11.5 Fisher’s quadratic discrimination


When the equality of the covariance matrices is not tenable, we can use a slightly
more complicated procedure. Here the conditional probabilities are proportional to
π_k f_k(x | μ_k, Σ_k) = c π_k (1/|Σ_k|^{1/2}) e^{−(1/2)(x−μ_k) Σ_k^{−1} (x−μ_k)′}

                      = c e^{−(1/2)(x−μ_k) Σ_k^{−1} (x−μ_k)′ − (1/2) log(|Σ_k|) + log(π_k)}.    (11.47)
Then the discriminant functions can be taken to be the terms in the exponents (times
2, for convenience), or their estimates:

d̂_k^Q(x) = −(x − μ̂_k) Σ̂_k^{−1} (x − μ̂_k)′ + ĉ_k,    (11.48)

where

ĉ_k = −log(|Σ̂_k|) + 2 log(N_k/n),    (11.49)
and Σ k is the sample covariance matrix from the kth group. Now the boundaries
between regions are quadratic rather than linear, hence Fisher’s quadratic discrimi-
nation function is defined to be
Ĉ_{FQD}(x) = k   if   d̂_k^Q(x) > d̂_l^Q(x)   for l ≠ k.    (11.50)
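The next section uses the functions qda and predict.qda from the appendix for these
calculations. Purely as an illustration, a sketch of evaluating (11.48) and (11.49) at a
single observation x might look like the following, where mu, sigma, and nk are
assumed to hold the group means, covariance MLEs, and counts:
quad.disc <- function(x,mu,sigma,nk,n) {
  sapply(seq_along(mu),function(k) {
    d <- x-mu[[k]]
    as.numeric(-d%*%solve(sigma[[k]])%*%d)-log(det(sigma[[k]]))+2*log(nk[k]/n)
  })
}
The observation is then classified into the group with the largest value, as in (11.50).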

11.5.1 Example: Iris data, continued


We consider the iris data again, but as in Section 11.5 we estimate three separate co-
variance matrices. Sections A.3.2 and A.3.3 contain the functions qda and predict.qda
for calculating the quadratic discriminant functions (11.48) and finding the predic-
tions. Apply these to the iris data as follows:
qd.iris <- qda(x.iris,y.iris)
yhat.qd <- NULL
for (i in 1:n) {
  yhat.qd <- c(yhat.qd,imax(predict.qda(qd.iris,x.iris[i,])))
}
table(yhat.qd,y.iris)
The resulting table is (11.44), the same as for linear discrimination. The leave-one-out
cross-validation estimate of classification error is 4/150 = 0.0267, which is slightly
worse than that for linear discrimination. It does not appear that the extra complica-
tion of having three covariance matrices improves the classification rate.
Hypothesis testing, AIC, or BIC can also help decide between the model with equal
covariance matrices and the model with three separate covariance matrices. Because
we have already calculated the estimates, it is quite easy to proceed. The two models
are then
MSame ⇒ Σ1 = Σ2 = Σ3 ≡ Σ;
MDiff ⇒ (Σ1 , Σ2 , Σ3 ) unrestricted. (11.51)

Both models have the same unrestricted means, and we can consider the π k ’s fixed,
so we can work with just the sample covariance matrices, as in Section 10.1. Let
U1 , U2 , and U3 be the sum of squares and cross-products matrices (1.15) for the three
species, and U = U1 + U2 + U3 be the pooled version. The degrees of freedom for
each species is νk = 50 − 1 = 49. Thus from (10.10) and (10.11), we can find the
deviances (9.47) to be

deviance( MSame ) = (ν1 + ν2 + ν3 ) log(|U/(ν1 + ν2 + ν3 )|)


= −1463.905,
deviance( MDiff ) = ν1 log(|U1 /ν1 |) + ν2 log(|U2 /ν2 |) + ν3 log(|U3 /ν3 |)
= −1610.568 (11.52)

Each covariance matrix has (q+1 choose 2) = 10 parameters, hence

d_Same = 10   and   d_Diff = 30.    (11.53)

To test the null hypothesis MSame versus the alternative MDiff , as in (9.49),

2 log( LR) = deviance( MSame ) − deviance( MDiff )


= −1463.905 + 1610.568
= 146.663, (11.54)

on dSame − dDiff = 20 degrees of freedom. The statistic is highly significant; we reject


emphatically the hypothesis that the covariance matrices are the same. The AIC’s
(9.50) and BIC’s (9.51) are found directly from (11.54):

AIC BIC
MSame −1443.90 −1414.00 (11.55)
MDiff −1550.57 −1460.86

They, too, favor the separate covariance model. Cross-validation above suggests that
the equal-covariance model is slightly better. Thus there seems to be a conflict be-
tween AIC/BIC and cross-validation. The conflict can be explained by noting that
AIC/BIC are trying to model the xi ’s and yi ’s jointly, while cross-validation tries to
model the conditional distribution of the yi ’s given the xi ’s. The latter does not really
care about the distribution of the xi ’s, except to the extent it helps in predicting the
yi ’s.
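A sketch of these calculations, assuming u1, u2, and u3 hold the three species’ sum
of squares and cross products matrices, each with ν_k = 49:
nu <- 49
dev.same <- 3*nu*log(det((u1+u2+u3)/(3*nu)))                       # as in (11.52)
dev.diff <- nu*(log(det(u1/nu))+log(det(u2/nu))+log(det(u3/nu)))
dev.same-dev.diff                                                  # 2 log(LR) of (11.54)
c(dev.same,dev.diff)+2*c(10,30)                                    # AIC's
c(dev.same,dev.diff)+log(150)*c(10,30)                             # BIC's, compare (11.55)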

11.6 Modifications to Fisher’s discrimination


The key component in both the quadratic and linear discriminant functions is the
quadratic form,
q(x; μ_k, Σ_k) = −(1/2) (x − μ_k) Σ_k^{−1} (x − μ_k)′,    (11.56)

where in the case the Σ_k’s are equal, the “x Σ^{−1} x′” part is ignored. Without the −1/2,
(11.56) is a measure of distance (called the Mahalanobis distance) between an x and
the mean of the kth group, so that it makes sense to classify an observation into the

group to which it is closest (modulo an additive constant). The idea is plausible


whether the data are normal or not, and whether the middle component is a general
Σk or not. E.g., when taking the Σ’s equal, we could take

Σ = I_p  =⇒  q(x; μ_k, I_p) = −(1/2) ‖x − μ_k‖^2,  or

Σ = Δ, diagonal  =⇒  q(x; μ_k, Δ) = −(1/2) ∑_i (x_i − μ_{ki})^2 / δ_{ii}.    (11.57)
The first case is regular Euclidean distance. In the second case, one would need to
estimate the δii ’s by the pooled sample variances. These alternatives may be better
when there are not many observations per group, and a fairly large number of vari-
ables p, so that estimating a full Σ introduces enough extra random error into the
classification to reduce its effectiveness.
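The quadratic form (11.56) is, apart from the −1/2, the squared Mahalanobis distance,
which base R computes directly; a sketch, assuming x is a matrix of observations and
muk and sigmak hold one group’s estimated mean and covariance:
q.mahal  <- -mahalanobis(x,center=muk,cov=sigmak)/2     # as in (11.56)
q.euclid <- -mahalanobis(x,muk,diag(ncol(x)))/2         # Sigma = I_p in (11.57)
q.diag   <- -mahalanobis(x,muk,diag(diag(sigmak)))/2    # Sigma = Delta, diagonal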
Another modification is to use functions of the individual variables. E.g., in the
iris data, one could generate quadratic boundaries by using the variables

Sepal Length, (Sepal Length)2 , Petal Length, ( Petal Length)2 (11.58)

in the x. The resulting set of variables certainly would not be multivariate normal,
but the classification based on them may still be reasonable. See the next section for
another method of incorporating such functions.

11.7 Conditioning on X: Logistic regression


Based on the conditional densities of X given Y = k and priors π k , Lemma 11.1
shows that the Bayes classifier in (11.13) is optimal. In Section 11.3, we saw that if the
conditional distributions of the X are multivariate normal, with the same covariance
matrix for each group, then the classifier devolved to a linear one (11.31) in x. The
linearity is not specific to the normal, but is a consequence of the normal being an
exponential family density, which means the density has the form

f(x | θ) = a(x) e^{t(x) θ′ − ψ(θ)}    (11.59)

for some 1 × m parameter θ, 1 × m function t(x) (the sufficient statistic), and function
a(x), where ψ (θ) is the normalizing constant.
Suppose that the conditional density of X given Y = k is f (x | θk ), that is, each
group has the same form of the density, but a different parameter value. Then the
analog to equations (11.27) and (11.28) yields discriminant functions like those in
(11.31),
dk (x) = γk + t(x)αk , (11.60)
a linear function of t(x), where αk = θk − θK , and γk is a constant depending on the
parameters. (Note that dK (x) = 0.) To implement the classifier, we need to estimate
the parameters θk and π k , usually by finding the maximum likelihood estimates.
(Note that Fisher’s quadratic discrimination in Section 11.5 also has discriminant
functions (11.48) of the form (11.60), where the t is a function of the x and its square,
xx .) In such models the conditional distribution of Y given X is given by

P[Y = k | X = x] = e^{d_k(x)} / ( e^{d_1(x)} + · · · + e^{d_{K−1}(x)} + 1 )    (11.61)

for the dk ’s in (11.60). This conditional model is called the logistic regression model.
Then an alternative method for estimating the γk ’s and αk ’s is to find the values that
maximize the conditional likelihood,
L((γ_1, α_1), . . . , (γ_{K−1}, α_{K−1}) ; (x_1, y_1), . . . , (x_n, y_n)) = ∏_{i=1}^n P[Y = y_i | X_i = x_i].    (11.62)

(We know that αK = 0 and γK = 0.) There is no closed-form solution for solving
the likelihood equations, so one must use some kind of numerical procedure like
Newton-Raphson. Note that this approach estimates the slopes and intercepts of the
discriminant functions directly, rather than (in the normal case) estimating the means
and variances, and the π k ’s, then finding the slopes and intercepts as functions of
those estimates.
Whether using the exponential family model unconditionally or the logistic model
conditionally, it is important to realize that both lead to the exact same classifier.
The difference is in the way the slopes and intercepts are estimated in (11.60). One
question is then which gives the better estimates. Note that the joint distribution of
the (X, Y ) is the product of the conditional of Y given X in (11.61) and the marginal
of X in (11.5), so that for the entire data set,
∏_{i=1}^n f(x_i | θ_{y_i}) π_{y_i} = [ ∏_{i=1}^n P[Y = y_i | X_i = x_i] ]
                                   × [ ∏_{i=1}^n ( π_1 f(x_i | θ_1) + · · · + π_K f(x_i | θ_K) ) ].
                                                                                    (11.63)
Thus using just the logistic likelihood (11.62), which is the first term on the right-
hand side in (11.63), in place of the complete likelihood on the left, leaves out the
information about the parameters that is contained in the mixture likelihood (the
second term on the right). As we will see in Chapter 12, there is information in the
mixture likelihood. One would then expect that the complete likelihood gives better
estimates in the sense of asymptotic efficiency of the estimates. It is not clear whether
that property always translates to yielding better classification schemes, but maybe.
On the other hand, the conditional logistic model is more general in that it yields
valid estimates even when the exponential family assumption does not hold. We can
entertain the assumption that the conditional distributions in (11.61) hold for any
statistics t(x) we wish to use, without trying to model the marginal distributions of
the X’s at all. This realization opens up a vast array of models to use, that is, we can
contemplate any functions t we wish.
In what follows, we restrict ourselves to having K = 2 groups, and renumber the
groups {0, 1}, so that Y is conditionally Bernoulli:
Yi | Xi = xi ∼ Bernoulli(ρ(xi )), (11.64)
where
ρ (x) = P [ Y = 1 | X = x] . (11.65)
The modeling assumption from (11.61) can be translated to the logit (log odds) of ρ,
logit(ρ) = log(ρ/(1 − ρ)). Then
logit(ρ(x)) = logit(ρ(x | γ, α)) = γ + xα . (11.66)

(We have dropped the t from the notation. You can always define x to be whatever
functions of the data you wish.) The form (11.66) exhibits the reason for calling the
model “logistic regression.” Letting
           ⎛ logit(ρ(x_1 | γ, α)) ⎞
           ⎜ logit(ρ(x_2 | γ, α)) ⎟
logit(ρ) = ⎜          ⋮           ⎟ ,    (11.67)
           ⎝ logit(ρ(x_n | γ, α)) ⎠

we can set up the model to look like the regular linear model,
           ⎛ 1  x_1 ⎞
           ⎜ 1  x_2 ⎟ ⎛ γ  ⎞
logit(ρ) = ⎜ ⋮    ⋮ ⎟ ⎝ α′ ⎠ = Xβ.    (11.68)
           ⎝ 1  x_n ⎠

We turn to examples.

11.7.1 Example: Iris data


Consider the iris data, restricting to classifying the virginicas versus versicolors. The
next table has estimates of the linear discrimination functions’ intercepts and slopes
using the multivariate normal with equal covariances, and the logistic regression
model:
Intercept Sepal Length Sepal Width Petal Length Petal Width
Normal 17.00 3.63 5.69 −7.11 −12.64
Logistic 42.64 2.47 6.68 −9.43 −18.29
(11.69)
The two estimates are similar, the logistic giving more weight to the petal widths,
and having a large intercept. It is interesting that the normal-based estimates have an
observed error of 3/100, while the logistic has 2/100.
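The logistic fit above can be obtained with glm, as detailed for the spam data in
Section 11.7.2. A sketch for this two-species example, assuming the coding 1 =
versicolor and 0 = virginica (the signs in (11.69) suggest this orientation, but it is a
guess):
vv <- 51:150                                  # versicolor and virginica rows
df.vv <- data.frame(y=rep(1:0,c(50,50)),x.iris[vv,])
fit.vv <- glm(y~.,data=df.vv,family=binomial)
coef(fit.vv)                                  # compare to the logistic row of (11.69)
yhat.vv <- ifelse(predict(fit.vv)>0,1,0)
sum(yhat.vv!=df.vv$y)                         # observed misclassifications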

11.7.2 Example: Spam


The Hewlett-Packard spam data was introduced in Exercise 1.9.14. The n = 4601
observations are emails to George Forman, at Hewlett-Packard labs. The Y clas-
sifies each email as spam (Y = 1) or not spam, Y = 0. There are q = 57 explana-
tory variables based on the contents of the email. Most of the explanatory variables
are frequency variables with many zeroes, hence are not at all normal, so Fisher’s
discrimination may not be appropriate. One could try to model the variables using
Poissons or multinomials. Fortunately, if we use the logistic model, we do not need to
model the explanatory variables at all, but only decide on the x j ’s to use in modeling
the logit in (11.68).
The ρ(x | γ, α) is the probability an email with message statistics x is spam. We
start by throwing in all 57 explanatory variables linearly, so that in (11.68), the design
matrix contains all the explanatory variables, plus the 1n vector. This fit produces an
observed misclassification error rate of 6.9%.
A number of the coefficients are not significant, hence it makes sense to try subset
logistic regression, that is, find a good subset of explanatory variables to use. It is

computationally much more time consuming to fit a succession of logistic regression


models than regular linear regression models, so that it is often infeasible to do an
all-subsets exploration. Stepwise procedures can help, though are not guaranteed
to find the best model. Start with a given criterion, e.g., AIC, and a given subset of
explanatory variables, e.g., the full set or the empty set. At each step, one has an
“old” model with some subset of the explanatory variables, and tries every possible
model that either adds one variable or removes one variable from that subset. Then
the “new” model is the one with the lowest AIC. The next step uses that new model
as the old, and adds and removes one variable from that. This process continues until
at some step the new model and old model are the same.
The table in (11.70) shows the results when using AIC and BIC. (The R code is
below.) The BIC has a stronger penalty, hence ends up with a smaller model, 30
variables (including the 1n ) versus 44 for the AIC. For those two “best” models as
well as the full model, the table also contains the 46-fold cross-validation estimate
of the error, in percent. That is, we randomly cut the data set into 46 blocks of 100
observations, then predict each block of 100 from the remaining 4501. For the latter
two models, cross-validation involves redoing the entire stepwise procedure for each
reduced data set. A computationally simpler, but maybe not as defensible, approach
would be to use cross-validation on the actual models chosen when applying stepwise
to the full data set. Here, we found the estimated errors for the best AIC and BIC
models were 7.07% and 7.57%, respectively, approximately the same as for the more
complicated procedure.

p Deviance AIC BIC Obs. error CV error CV se


Full 58 1815.8 1931.8 2305.0 6.87 7.35 0.34
Best AIC 44 1824.9 1912.9 2196.0 6.78 7.15 0.35
Best BIC 30 1901.7 1961.7 2154.7 7.28 7.59 0.37
(11.70)
The table shows that all three models have essentially the same cross-validation
error, with the best AIC’s model being best. The standard errors are the standard
deviations of the 46 errors divided by √46, so give an idea of how variable the error
estimates are. The differences between the three errors are not large relative to these
standard errors, so one could arguably take either the best AIC or best BIC model.
The best AIC model has p = 44 parameters, one of which is the intercept. The
table (11.71) categorizes the 41 frequency variables (word or symbol) in this model,
according to the signs of their coefficients. The ones with positive coefficients tend
to indicate spam, while the others indicate non-spam. Note that the latter tend to be
words particular to someone named “George” who works at a lab at HP, while the
spam indicators have words like “credit”, “free”, “money”, and exciting symbols like
“!” and “$”. Also with positive coefficients are the variables that count the number
of capital letters, and the length of the longest run of capitals, in the email.

Positive:  3d our over remove internet order mail addresses free business you
           credit your font 000 money 650 technology ! $ #
Negative:  make address will hp hpl george lab data 85 parts pm cs meeting
           original project re edu table conference ;
                                                                         (11.71)

Computational details
In R, logistic regression models with two categories can be fit using the generalized
linear model function, glm. The spam data is in the data frame Spam. The indicator
variable, Yi , for spam is called spam. We first must change the data matrix into a data
frame for glm: Spamdf <- data.frame(Spam). The full logistic regression model is fit
using
spamfull <- glm(spam ~ .,data=Spamdf,family=binomial)
The “spam ~ .” tells the program that the spam variable is the Y, and the dot means
use all the variables except for spam in the X. The “family = binomial” tells the pro-
gram to fit logistic regression. The summary command, summary(spamfull), will print
out all the coefficients, which I will not reproduce here, and some other statistics, in-
cluding
Null deviance: 6170.2 on 4600 degrees of freedom
Residual deviance: 1815.8 on 4543 degrees of freedom
AIC: 1931.8
The “residual deviance” is the regular deviance in (9.47). The full model uses 58
variables, hence
AIC = deviance +2p = 1815.8 + 2 × 58 = 1931.8, (11.72)
which checks. The BIC is found by substituting log(4601) for the 2.
We can find the predicted classifications from this fit using the function predict,
which returns the estimated linear Xβ̂ from (11.68) for the fitted model. The Ŷ_i’s
are then 1 or 0 as the ρ(x_i | γ̂, α̂) is greater than or less than 1/2. Thus to find the
predictions and overall error rate, do
yhat <- ifelse(predict(spamfull)>0,1,0)
sum(yhat!=Spamdf[,'spam'])/4601
We find the observed classification error to be 6.87%.

Cross-validation
We will use 46-fold cross-validation to estimate the classification error. We randomly
divide the 4601 observations into 46 groups of 100, leaving one observation who
doesn’t get to play. First, permute the indices from 1 to n:
o <- sample(1:4601)
Then the first hundred are the indices for the observations in the first leave-out-block,
the second hundred in the second leave-out-block, etc. The loop is next, where the
err collects the number of classification errors in each block of 100.
err <- NULL
for(i in 1:46) {
  oi <- o[(1:100)+(i-1)*100]
  yfiti <- glm(spam ~ ., family = binomial, data = Spamdf, subset = (1:4601)[-oi])
  dhati <- predict(yfiti, newdata = Spamdf[oi,])
  yhati <- ifelse(dhati>0,1,0)
  err <- c(err, sum(yhati!=Spamdf[oi,'spam']))
}

In the loop for cross-validation, the oi is the vector of indices being left out. We then fit
the model without those by using the keyword subset=(1:4601)[− oi], which indicates
using all indices except those in oi. The dhati is then the vector of discriminant
functions evaluated for the left out observations (the newdata). The mean of err is the
estimated error, which for us is 7.35%. See the entry in table in (11.70).

Stepwise
The command to use for stepwise regression is step. To have the program search
through the entire set of variables, use one of the two statements
spamstepa <- step(spamfull,scope=list(upper= ~.,lower = ~1))
spamstepb <- step(spamfull,scope=list(upper= ~.,lower = ~1),k=log(4601))
The first statement searches on AIC, the second on BIC. The first argument in the step
function is the return value of glm for the full data. The upper and lower inputs refer
to the formulas of the largest and smallest models one wishes to entertain. In our
case, we wish the smallest model to have just the 1n vector (indicated by the “~1”),
and the largest model to contain all the vectors (indicated by the “~.”).
These routines may take a while, and will spit out a lot of output. The end result
is the best model found using the given criterion. (If using the BIC version, while
calculating the steps, the program will output the BIC values, though calling them
“AIC.” The summary output will give the AIC, calling it “AIC.” Thus if you use just
the summary output, you must calculate the BIC for yourself. )
To find the cross-validation estimate of classification error, we need to insert the
stepwise procedure after fitting the model leaving out the observations, then predict
those left out using the result of the stepwise procedure. So for the best BIC model,
use the following:
errb <- NULL
for(i in 1:46) {
  oi <- o[(1:100)+(i-1)*100]
  yfiti <- glm(spam ~ ., family = binomial, data = Spamdf, subset = (1:4601)[-oi])
  stepi <- step(yfiti,scope=list(upper= ~.,lower = ~1),k=log(4501))
  dhati <- predict(stepi, newdata = Spamdf[oi,])
  yhati <- ifelse(dhati>0,1,0)
  errb <- c(errb, sum(yhati!=Spamdf[oi,'spam']))
}
The estimate for the best AIC model uses the same statements but with k = 2 in
the step function. This routine will take a while, because each stepwise procedure
is time consuming. Thus one might consider using cross-validation on the model
chosen using the BIC (or AIC) criterion for the full data.
The neural networks R package nnet [Venables and Ripley, 2002] can be used to fit
logistic regression models for K > 2.

11.8 Trees
The presentation here will also use just K = 2 groups, labeled 0 and 1, but can be
extended to any number of groups. In the logistic regression model (11.61), we mod-
eled P [Y = 1 | X = x] ≡ ρ(x) using a particular parametric form. In this section we


Figure 11.5: Splitting on age and adiposity. The open triangles indicate no heart
disease, the solid discs indicate heart disease. The percentages are the percentages of
men with heart disease in each region of the plot.

use a simpler, nonparametric form, where ρ(x) is constant over rectangular regions
of the X-space.
To illustrate, we will use the South African heart disease data from Rousseauw
et al. [1983], which was used in Exercise 10.5.20. The Y is congestive heart disease
(chd), where 1 indicates the person has the disease, 0 he does not. Explanatory
variables include various health measures. Hastie et al. [2009] apply logistic regres-
sion to these data. Here we use trees. Figure 11.5 plots the chd variable for the
age and adiposity (fat percentage) variables. Consider the vertical line. It splits the
data according to whether age is less than 31.5 years. The splitting point 31.5 was
chosen so that the proportions of heart disease in each region would be very dif-
ferent. Here, 10/117 = 8.55% of the men under age 31.5 had heart disease, while
150/345 = 43.48% of those above 31.5 had the disease.
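These proportions are easy to verify; a quick sketch, assuming the SAheart data
frame from the ElemStatLearn package is loaded as in Section 11.8.1:
with(SAheart,table(age<31.5,chd))             # counts in the two age groups
with(SAheart,tapply(chd,age<31.5,mean))       # proportions with heart disease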
The next step is to consider just the men over age 31.5, and split them on the
adiposity variable. Taking the value 25, we have that 41/106 = 38.68% of the men
over age 31.5 but with adiposity under 25 have heart disease; 109/239 = 45.61% of
the men over age 31.5 and with adiposity over 25 have the disease. We could further
split the younger men on adiposity, or split them on age again. Subsequent steps
split the resulting rectangles, each time with either a vertical or horizontal segment.
There are also the other variables we could split on. It becomes easier to represent
the splits using a tree diagram, as in Figure 11.6. There we have made several splits,
at the nodes. Each node needs a variable and a cutoff point, such that people for
which the variable is less than the cutoff are placed in the left branch, and the others


Figure 11.6: A large tree, with proportions at the leaves

go to the right. The ends of the branches are terminal nodes or leaves. This plot has 15
leaves. At each leaf, there are a certain number of observations. The plot shows the
proportion of 0’s (the top number) and 1’s (the bottom number) at each leaf.
For classification, we place a 0 or 1 at each leaf, depending on whether the pro-
portion of 1’s is less than or greater than 1/2. Figure 11.7 shows the results. Note
that for some splits, both leaves have the same classification, because although their
proportions of 1’s are quite different, they are both on the same side of 1/2. For
classification purposes, we can snip some of the branches off. Further analysis (Sec-
tion 11.8.1) leads us to the even simpler tree in Figure 11.8. The tree is very easy to
interpret, hence popular among people (e.g., doctors) who need to use them. The
tree also makes sense, showing age, type A personality, tobacco use, and family his-
tory are important factors in predicting heart disease among these men. The trees
also are flexible, incorporating continuous or categorical variables, avoiding having
to consider transformations, and automatically incorporating interactions. E.g., the
type A variable shows up only for people between the ages of 31.5 and 50.5, and
family history and tobacco use show up only for people over 50.5.
Though simple to interpret, it is easy to imagine that finding the “best” tree is a


Figure 11.7: A large tree, with classifications at the leaves.

rather daunting prospect, as there is close to an infinite number of possible trees in


any large data set (at each stage one can split any variable at any of a number of
points), and searching over all the possibilities is a very discrete (versus continuous)
process. In the next section, we present a popular, and simple, algorithm to find a
good tree.

11.8.1 CART
Two popular commercial products for fitting trees are Classification and Regression
Trees, CART®, by Breiman et al. [1984], and C5.0, by Quinlan [1993]. We will take
the CART approach, the main reason being the availability of an R version. It seems
that CART would appeal more to statisticians, and C5.0 to data-miners, but I do not
think the results of the two methods would differ much.
We first need an objective function to measure the fit of a tree to the data. We will
use deviance, although other measures such as the observed misclassification rate are
certainly reasonable. For a tree T with L leaves, each observation is placed in one of
the leaves. If observation yi is placed in leaf l, then that observation’s ρ(xi ) is given


Figure 11.8: A smaller tree, chosen using BIC.

by the parameter for leaf l, say pl . The likelihood for that Bernoulli observation is

p_l^{y_i} (1 − p_l)^{1−y_i}.    (11.73)
Assuming the observations are independent, at leaf l there is a sample of iid Bernoulli
random variables with parameter pl , hence the overall likelihood of the sample is
L(p_1, . . . , p_L | y_1, . . . , y_n) = ∏_{l=1}^L p_l^{w_l} (1 − p_l)^{n_l − w_l},    (11.74)

where
n l = #{i at leaf l }, wl = #{yi = 1 at leaf l }. (11.75)
This likelihood is maximized over the p_l’s by taking p̂_l = w_l/n_l. Then the deviance
(9.47) for this tree is

deviance(T) = −2 ∑_{l=1}^L ( w_l log(p̂_l) + (n_l − w_l) log(1 − p̂_l) ).    (11.76)
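A sketch of this calculation, assuming w and nl are vectors holding the w_l and n_l
over the L leaves (terms with p̂_l equal to 0 or 1 contribute zero, which the na.rm
guard handles):
phat <- w/nl
dev <- -2*sum(w*log(phat)+(nl-w)*log(1-phat),na.rm=TRUE)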

The CART method has two main steps: grow the tree, then prune the tree. The
tree is grown in a stepwise, greedy fashion, at each stage trying to find the next
split that maximally reduces the objective function. We start by finding the single
split (variable plus cutoff point) that minimizes the deviance among all such splits.
Then the observations at each resulting leaf are optimally split, again finding the
variable/cutoff split with the lowest deviance. The process continues until the leaves
have just a few observations, e.g., stopping when any split would result in a leaf with
fewer than five observations.
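To see what one greedy step involves, here is a toy version of the search over cutoffs for a single predictor. This is only a sketch with our own function name, not how the tree package is actually implemented: given a numeric predictor xj and the 0/1 response y, it returns the cutoff minimizing the total deviance of the two resulting leaves.

bestsplit <- function(xj, y) {
  cuts <- sort(unique(xj))
  cuts <- (cuts[-1] + cuts[-length(cuts)])/2      # midpoints between adjacent values
  dev1 <- function(yy) {                          # deviance of a single leaf
    p <- mean(yy)
    if(p == 0 || p == 1) return(0)
    -2*(sum(yy)*log(p) + sum(1 - yy)*log(1 - p))
  }
  devs <- sapply(cuts, function(cc) dev1(y[xj < cc]) + dev1(y[xj >= cc]))
  c(cutoff = cuts[which.min(devs)], deviance = min(devs))
}

Applied to age and chd in the heart disease data, it should (up to how midpoints between tied values are chosen) recover the age < 31.5 cutoff of the first split.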
To grow the tree for the South African heart disease data in R, we need to install
the package called tree [Ripley, 2010]. A good explanation of it can be found in
Venables and Ripley [2002]. We use the data frame SAheart in the ElemStatLearn
package [Halvorsen, 2009]. The dependent variable is chd. To grow a tree, use

basetree <- tree(as.factor(chd)~.,data=SAheart)


The as.factor function indicates to the tree function that it should do classification. If
the dependent variable is numeric, tree will fit a so-called regression tree, not what
we want here. To plot the tree, use one of the two statements
plot(basetree); text(basetree,label='yprob',digits=1)
plot(basetree); text(basetree)
The first gives the proportions of 0’s and 1’s at each leaf, and the second gives the
classifications of the leaves, yielding the trees in Figures 11.6 and 11.7, respectively.
This basetree is now our base tree, and we consider only subtrees, that is, trees
obtainable by snipping branches off this tree. As usual, we would like to balance
observed deviance with the number of parameters in the model, in order to avoid
overfitting. To wit, we add a penalty to the deviance depending on the number of
leaves in the tree. To use AIC or BIC, we need to count the number of parameters for
each subtree, conditioning on the structure of the base tree. That is, we assume that
the nodes and the variable at each node are given, so that the only free parameters
are the cutoff points and the pl ’s. The task is one of subset selection, that is, deciding
which nodes to snip away. If the subtree has L leaves, then there are L − 1 cutoff
points (there are L − 1 nodes), and L pl ’s, yielding 2L − 1 parameters. Thus the BIC
criterion for a subtree T with L leaves is

BIC(T ) = deviance(T ) + log(n )(2L − 1). (11.77)

The prune.tree function can be used to find the subtree with the lowest BIC. It takes
the base tree and a value k as inputs, then finds the subtree that minimizes

    obj_k(T) = deviance(T) + kL.    (11.78)

Thus for the best AIC subtree we would take k = 4, and for BIC we would take
k = 2 log(n ):
aictree <- prune.tree(basetree,k=4)
bictree <- prune.tree(basetree,k=2*log(462)) # n = 462 here.
If the k is not specified, then the routine calculates the numbers of leaves and de-
viances of best subtrees for all values of k. The best AIC subtree is in fact the full
base tree, as in Figure 11.7. Figure 11.9 exhibits the best BIC subtree, which has eight
leaves. There are also routines in the tree package that use cross-validation to choose
a good factor k to use in pruning.
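Equivalently, since prune.tree with no k returns the sizes and deviances of the whole sequence of best subtrees, the BIC's in (11.77) can be computed in one pass. A sketch:

pt <- prune.tree(basetree)                   # pt$size = numbers of leaves, pt$dev = deviances
bic <- pt$dev + log(462)*(2*pt$size - 1)     # BIC(T) = deviance(T) + log(n)(2L-1), as in (11.77)
pt$size[which.min(bic)]                      # number of leaves in the best BIC subtree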
Note that the tree in Figure 11.9 has some redundant splits. Specifically, all leaves
to the left of the first split (age < 31.5) lead to classification “0.” To snip at that node,
we need to determine its index in basetree. One approach is to print out the tree,
resulting in the output in Listing 11.1. We see that node #2 is “age < 31.5,” which is
where we wish to snip, hence we use
bictree.2 <- snip.tree(bictree,nodes=2)
Plotting the result yields Figure 11.8. It is reasonable to stick with the presnipped
tree, in case one wished to classify using a cutoff point for the p_l's other than 1/2.
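For example, the leaf probabilities can be pulled out with predict (its default type gives the matrix of estimated class probabilities), and then any cutoff applied; the 0.3 below is just an arbitrary illustration:

phat <- predict(bictree, newdata = SAheart)[,'1']  # estimated P[chd = 1] for each observation
yhat <- ifelse(phat > 0.3, 1, 0)                   # classify with cutoff 0.3 instead of 1/2
table(yhat, SAheart$chd)                           # resulting classification table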
There are some drawbacks to this tree-fitting approach. Because of the stepwise
nature of the growth, if we start with the wrong variable, it is difficult to recover. That

Figure 11.9: The best subtree using the BIC criterion, before snipping redundant
leaves.

Listing 11.1: Text representation of the output of tree for the tree in Figure 11.9
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 462 596.10 0 ( 0.65368 0.34632 )
   2) age < 31.5 117 68.31 0 ( 0.91453 0.08547 )
     4) tobacco < 0.51 81 10.78 0 ( 0.98765 0.01235 ) *
     5) tobacco > 0.51 36 40.49 0 ( 0.75000 0.25000 )
      10) alcohol < 11.105 16 0.00 0 ( 1.00000 0.00000 ) *
      11) alcohol > 11.105 20 27.53 0 ( 0.55000 0.45000 ) *
   3) age > 31.5 345 472.40 0 ( 0.56522 0.43478 )
     6) age < 50.5 173 214.80 0 ( 0.68786 0.31214 )
      12) typea < 68.5 161 188.90 0 ( 0.72671 0.27329 ) *
      13) typea > 68.5 12 10.81 1 ( 0.16667 0.83333 ) *
     7) age > 50.5 172 236.10 1 ( 0.44186 0.55814 )
      14) famhist: Absent 82 110.50 0 ( 0.59756 0.40244 )
        28) tobacco < 7.605 58 68.32 0 ( 0.72414 0.27586 ) *
        29) tobacco > 7.605 24 28.97 1 ( 0.29167 0.70833 ) *
      15) famhist: Present 90 110.00 1 ( 0.30000 0.70000 ) *

is, even though the best single split may be on age, the best two-variable split may
be on type A and alcohol. There is inherent instability, because having a different
variable at a given node can completely change the further branches. Additionally,
if there are several splits, the sample sizes for estimating the pl ’s at the farther-out
leaves can be quite small. Boosting, bagging, and random forests are among the tech-
niques proposed that can help ameliorate some of these problems and lead to better
classifications. They are more black-box-like, though, losing some of the simplicity of
the simple trees. See Hastie et al. [2009].

Estimating misclassification rate


The observed misclassification rate for any tree is easily found using the summary
command. Below we find the 10-fold cross-validation estimates of the classification
error. The results are in (11.79). Note that the BIC had the lowest estimate, though
by only about 0.01. The base tree was always chosen by AIC. It is interesting that the
BIC trees were much smaller, averaging 5 leaves versus 22 for the AIC/base trees.

                Obs. error   CV error   CV se   Average L
   Base tree       0.208       0.328    0.057       22
   Best AIC        0.208       0.328    0.057       22          (11.79)
   Best BIC        0.229       0.317    0.063        5
The following finds the cross-validation estimate for the BIC chosen tree:
o <- sample(1:462) # Reorder the indices
err <- NULL # To collect the errors
for(i in 1:10) {
  oi <- o[(1:46)+46*(i-1)] # Left-out indices
  basetreei <- tree(as.factor(chd)~.,data=SAheart,subset=(1:462)[-oi])
  bictreei <- prune.tree(basetreei,k=2*log(416)) # BIC tree w/o left-out data
  yhati <- predict(bictreei,newdata=SAheart[oi,],type='class')
  err <- c(err,sum(yhati!=SAheart[oi,'chd']))
}
For each of the left-out observations, the predict statement with type=’class’ gives
the tree’s classification of the left-out observations. The estimate of the error is then
mean(err)/46, and the standard error is sd(err)/46.

11.9 Exercises
Exercise 11.9.1. Show that (11.19) follows from (11.18).
Exercise 11.9.2. Compare the statistic in (11.34) and its maximum using the â in (11.35) to the motivation for Hotelling's T² presented in Section 8.4.1.
Exercise 11.9.3. Write the γk in (11.60) as a function of the θi ’s and π i ’s.

Exercise 11.9.4 (Spam). Consider the spam data from Section 11.7.2 and Exercise
1.9.14. Here we simplify it a bit, and just look at four of the 0/1 predictors: Whether
or not the email contains the words “free” or “remove” or the symbols “!” or “$”.
The following table summarizes the data, where the first four columns indicate the
presence (1) or absence (0) of the word or symbol, and the last two columns give the

numbers of corresponding emails that are spam or not spam. E.g., there are 98 emails
containing “remove” and “!”, but not “free” nor “$”, 8 of which are not spam, 90 are
spam.
free remove ! $ not spam spam
0 0 0 0 1742 92
0 0 0 1 157 54
0 0 1 0 554 161
0 0 1 1 51 216
0 1 0 0 15 28
0 1 0 1 4 17
0 1 1 0 8 90
0 1 1 1 5 166 (11.80)
1 0 0 0 94 42
1 0 0 1 28 20
1 0 1 0 81 159
1 0 1 1 38 305
1 1 0 0 1 16
1 1 0 1 0 33
1 1 1 0 2 116
1 1 1 1 8 298
Assuming a multinomial distribution for the 2^5 = 32 possibilities, find the estimated
Bayes classifier of email as “spam” or “not spam” based on the other four variables
in the table. What is the observed error rate?

Exercise 11.9.5 (Crabs). This problem uses data on 200 crabs, categorized into two
species, Orange and Blue, and two sexes. It is in the MASS R package [Venables
and Ripley, 2002]. The data is in the data frame crabs. There are 50 crabs in each
species×sex category; the first 50 are blue males, then 50 blue females, then 50 orange
males, then 50 orange females. The five measurements are frontal lobe size, rear
width, carapace length, carapace width, and body depth, all in millimeters. The goal
here is to find linear discrimination procedures for classifying new crabs into species
and sex categories. (a) The basic model is that Y ∼ N (xβ, I200 ⊗ Σ), where x is any
analysis of variance design matrix (n × 4) that distinguishes the four groups. Find the
MLE of Σ, Σ.  (b) Find the ck ’s and ak ’s in Fisher’s linear discrimination for classifying
all four groups, i.e., classifying on species and sex simultaneously. (Take π k = 1/4 for
all four groups.) Use the version wherein dK = 0. (c) Using the procedure in part (b)
on the observed data, how many crabs had their species misclassified? How many
had their sex misclassified? What was the overall observed misclassification rate (for
simultaneous classification of color and sex)? (d) Use leave-one-out cross-validation
to estimate the overall misclassification rate. What do you get? Is it higher than the
observed rate in part (c)?

Exercise 11.9.6 (Crabs). Continue with the crabs data from Exercise 11.9.5, but use
classification trees to classify the crabs by just species. (a) Find the base tree using the
command
crabtree <- tree(sp ~ FL+RW+CL+CW+BD,data=crabs)
How many leaves does the tree have? Snip off redundant nodes. How many leaves
does the snipped tree have? What is its observed misclassification rate? (b) Find the

BIC for the subtrees found using prune.tree. Give the number of leaves, deviance,
and dimension for the subtree with the best BIC. (c) Consider the subtree with the
best BIC. What is its observed misclassification rate? What two variables figure most
prominently in the tree? Which variables do not appear? (d) Now find the leave-
one-out cross-validation estimate of the misclassification error rate for the best model
using BIC. How does this rate compare with the observed rate?

Exercise 11.9.7 (South African heart disease). This question uses the South African
heart disease study discussed in Section 11.8. The objective is to use logistic regres-
sion to classify people on the presence of heart disease, variable chd. (a) Use the
logistic model that includes all the explanatory variables to do the classification. (b)
Find the best logistic model using the stepwise function, with BIC as the criterion.
Which variables are included in the best model from the stepwise procedure? (c) Use
the model with just the variables suggested by the factor analysis of Exercise 10.5.20:
tobacco, ldl, adiposity, obesity, and alcohol. (d) Find the BIC, observed error rate, and
leave-one-out cross-validation rate for the three models in parts (a), (b) and (c). (e)
True or false: (i) The full model has the lowest observed error rate; (ii) The factor-
analysis based model is generally best; (iii) The cross-validation-based error rates are
somewhat larger than the corresponding observed error rates; (iv) The model with
the best observed error rate has the best cv-based error rate as well; (v) The best
model of these three is the one chosen by the stepwise procedure; (vi) Both adiposity
and obesity seem to be important factors in classifying heart disease.

Exercise 11.9.8 (Zipcode). The objective here is to classify handwritten numerals


(0, 1, . . . , 9), so that machines can read people’s handwritten zipcodes. The data set
consists of 16 × 16 grayscale images, that is, each numeral has been translated to a
16 × 16 matrix, where the elements of the matrix indicate the darkness (from −1 to
1) of the image at 256 grid points. The data set is from LeCun [1989], and can be
found in the R package ElemStatLearn [Halvorsen, 2009]. This question will use just
the 7’s, 8’s and 9’s, for which there are n = 1831 observations. We put the data in
three matrices, one for each digit, called train7, train8, and train9. Each row contains
first the relevant digit, then the 256 grayscale values, for one image. The task is to use
linear discrimination to distinguish between the digits, even though it is clear that
the data are not multivariate normal. First, create the three matrices from the large
zip.train matrix:
train7 <- zip.train[zip.train[,1]==7,-1]
train8 <- zip.train[zip.train[,1]==8,-1]
train9 <- zip.train[zip.train[,1]==9,-1]
(a) Using the image, contour, and matrix functions in R, reconstruct the images of
some of the 7’s, 8’s and 9’s from their grayscale values. (Or explore the zip2image
function in the ElemStatLearn package.) (b) Use linear discrimination to classify the
observation based on the 256 variables under the three scenarios below. In each case,
find both the observed misclassification rate and the estimate using cross-validation.
(i) Using Σ = I256 . (ii) Assuming Σ is diagonal, using the pooled estimates of the
individual variances. (iii) Using the pooled covariance matrix as an estimate of Σ.
(d) Which method had the best error rate, estimated by cross-validation? (e) Create a
data set of digits (7's, 8's, and 9's, as well as 5's) to test classifiers as follows:
test5 <- zip.test[zip.test[,1]==5,-1]
test7 <- zip.test[zip.test[,1]==7,-1]
test8 <- zip.test[zip.test[,1]==8,-1]
test9 <- zip.test[zip.test[,1]==9,-1]
Using the discriminant functions from the original data for the best method from part
(b), classify these new observations. What is the error rate for the 7’s, 8’s, and 9’s?
How does it compare with the cross-validation estimate? How are the 5’s classified?

Exercise 11.9.9 (Spam). Use classification trees to classify the spam data. It is best to
start as follows:
Spamdf <- data.frame(Spam)
spamtree <- tree(as.factor(spam)~.,data=Spamdf)
Turning the matrix into a data frame makes the labeling on the plots simpler. (a)
Find the BIC’s for the subtrees obtained using prune.tree. How many leaves in the
best model? What is its BIC? What is its observed error rate? (b) You can obtain a
cross-validation estimate of the error rate by using
cvt <- cv.tree(spamtree,method='misclass',K=46)
The “46” means use 46-fold cross-validation, which is the same as leaving 100 out.
The vector cvt$dev contains the number of left-outs misclassified for the various mod-
els. The cv.tree function randomly splits the data, so you should run it a few times,
and use the combined results to estimate the misclassification rates for the best model
you chose in part (a). What do you see? (c) Repeat parts (a) and (b), but using the
first ten principal components of the spam explanatory variables as the predictors.
(Exercise 1.9.15 calculated the principal components.) Repeat again, but this time
using the first ten principal components based on the scaled explanatory variables,
scale(Spam[,1:57]). Compare the effectiveness of the three approaches.

Exercise 11.9.10. This questions develops a Bayes classifier when there is a mix of nor-
mal and binomial explanatory variables. Consider the classification problem based
on (Y, X, Z ), where Y is the variable to be classified, with values 0 and 1, and X and
Z are predictors. X is a 1 × 2 continuous vector, and Z takes the values 0 and 1. The
model for (Y, X, Z ) is given by

X | Y = y, Z = z ∼ N (μyz , Σ), (11.81)

and
P [Y = y & Z = z] = pyz , (11.82)
so that p00 + p01 + p10 + p11 = 1. (a) Find an expression for P [Y = y | X = x & Z =
z]. (b) Find the 1 × 2 vector αz and the constant γz (which depend on z and the
parameters) so that

P [Y = 1 | X = x & Z = z] > P [Y = 0 | X = x & Z = z] ⇔ xαz + γz > 0. (11.83)

(c) Suppose the data are (Yi , Xi , Zi ), i = 1, . . . , n, iid, distributed as above. Find
expressions for the MLE’s of the parameters (the four μyz ’s, the four pyz ’s, and Σ).

Exercise 11.9.11 (South African heart disease). Apply the classification method in Ex-
ercise 11.9.10 to the South African heart disease data, with Y indicating heart disease
(chd), X containing the two variables age and type A, and Z being the family history
of heart disease variable (history: 0 = absent, 1 = present). Randomly divide the data
into two parts: The training data with n = 362, and the test data with n = 100. E.g.,
use
random.index <- sample(462,100)
sahd.train <- SAheart[-random.index,]
sahd.test <- SAheart[random.index,]
(a) Estimate the αz and γz using the training data. Find the observed misclassification
rate on the training data, where you classify an observation as Ŷ_i = 1 if x_i α̂_z + γ̂_z > 0, and Ŷ_i = 0 otherwise. What is the misclassification rate for the test data (using the
estimates from the training data)? Give the 2 × 2 table showing true and predicted
Y’s for the test data. (b) Using the same training data, find the classification tree. You
don’t have to do any pruning. Just take the full tree from the tree program. Find the
misclassification rates for the training data and the test data. Give the table showing
true and predicted Y’s for the test data. (c) Still using the training data, find the
classification using logistic regression, with the X and Z as the explanatory variables.
What are the coefficients for the explanatory variables? Find the misclassification
rates for the training data and the test data. (d) What do you conclude?
Chapter 12

Clustering

The classification and prediction we have covered in previous chapters were cases
of supervised learning. For example, in classification, we try to find a function that
classifies individuals into groups using their x values, where in the training set we
know what the proper groups are because we observe their y’s. In clustering, we
again wish to classify observations into groups using their x’s, but do not know the
correct groups even in the training set, i.e., we do not observe the y’s, nor often even
know how many groups there are. Clustering is a case of unsupervised learning.
There are many clustering algorithms. Most are reasonably easy to implement
given the number K of clusters. The difficult part is deciding what K should be.
Unlike in classification, there is no obvious cross-validation procedure to balance
the number of clusters with the tightness of the clusters. Only in the model-based
clustering do we have direct AIC or BIC criteria. Otherwise, a number of reasonable
but ad hoc measures have been proposed. We will look at two: gap statistics, and
silhouettes.
In some situations one is not necessarily assuming that there are underlying clus-
ters, but rather is trying to divide the observations into a certain number of groups
for other purposes. For example, a teacher in a class of 40 students might want to
break up the class into four sections of about ten each based on general ability (to
give more focused instruction to each group). The teacher does not necessarily think
there will be wide gaps between the groups, but still wishes to divide for pedagogical
purposes. In such cases K is fixed, so the task is a bit simpler.
In general, though, when clustering one is looking for groups that are well sep-
arated. There is often an underlying model, just as in Chapter 11 on model-based
classification. That is, the data are
(Y1 , X1 ), . . . , (Yn , Xn ), iid, (12.1)
where yi ∈ {1, . . . , K },
X | Y = k ∼ f k (x) = f (x | θk ) and P [Y = k] = π k , (12.2)
as in (11.2) and (11.3). If the parameters are known, then the clustering proceeds
exactly as for classification, where an observation x is placed into the group
    C(x) = k that maximizes f_k(x) π_k / ( f_1(x) π_1 + · · · + f_K(x) π_K ).    (12.3)

See (11.13). The fly in the ointment is that we do not observe the yi ’s (neither in the
training set nor for the new observations), nor do we necessarily know what K is, let
alone the parameter values.
The following sections look at some approaches to clustering. The first, K-means,
does not explicitly use a model, but has in the back of its mind f k ’s being N (μk , σ2 I p ).
Hierarchical clustering avoids the problems of number of clusters by creating a tree
containing clusterings of all sizes, from K = 1 to n. Finally, the model-based cluster-
ing explicitly assumes the f k ’s are multivariate normal (or some other given distribu-
tion), with various possibilities for the covariance matrices.

12.1 K-Means
For a given number K of groups, K-means assumes that each group has a mean vector
μk . Observation xi is assigned to the group with the closest mean. To estimate these
means, we minimize the sum of the squared distances from the observations to their
group means:
    obj(μ_1, . . . , μ_K) = ∑_{i=1}^{n} min_{μ_k ∈ {μ_1,...,μ_K}} ||x_i − μ_k||².    (12.4)

An algorithm for finding the clusters starts with a random set of means μ 1 , . . . , μK
(e.g., randomly choose K observations from the data), then iterate the following two
steps:

1. Having estimates of the means, assign observations to the group corresponding to the closest mean,

       Ĉ(x_i) = k that minimizes ||x_i − μ̂_k||² over k.    (12.5)

2. Having individuals assigned to groups, find the group means,

       μ̂_k = (1/#{Ĉ(x_i) = k}) ∑_{i | Ĉ(x_i) = k} x_i.    (12.6)

The algorithm is guaranteed to converge, but not necessarily to the global mini-
mum. It is a good idea to try several random starts, then take the one that yields the
lowest obj in (12.4). The resulting means and assignments are the K-means and their
clustering.
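To make the two steps concrete, here is a bare-bones version of the iteration. It is only a sketch (the function name is ours, and it ignores details such as empty clusters); in practice we use the kmeans function, as in Section 12.1.5. Here x is the n × p data matrix and mu is a K × p matrix of starting means:

kmeans.iterate <- function(x, mu, iters = 20) {
  for(it in 1:iters) {
    d2 <- apply(mu, 1, function(m) rowSums(sweep(x, 2, m)^2))   # n x K squared distances
    cl <- apply(d2, 1, which.min)                               # step 1: assign to closest mean
    for(k in 1:nrow(mu))                                        # step 2: update the means
      mu[k,] <- colMeans(x[cl == k, , drop = FALSE])
  }
  list(centers = mu, cluster = cl)
}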

12.1.1 Example: Sports data


Recall the data on people ranking seven sports presented in Section 1.6.2. Using the
K-means algorithm for K = 1, . . . , 4, we find the following means (where K = 1 gives
the overall mean):

K=1       BaseB  FootB  BsktB  Ten   Cyc   Swim  Jog
Group 1    3.79   4.29   3.74  3.86  3.59  3.78  4.95

K=2       BaseB  FootB  BsktB  Ten   Cyc   Swim  Jog
Group 1    5.01   5.84   4.35  3.63  2.57  2.47  4.12
Group 2    2.45   2.60   3.06  4.11  4.71  5.21  5.85

K=3       BaseB  FootB  BsktB  Ten   Cyc   Swim  Jog
Group 1    2.33   2.53   3.05  4.14  4.76  5.33  5.86
Group 2    4.94   5.97   5.00  3.71  2.90  3.35  2.13          (12.7)
Group 3    5.00   5.51   3.76  3.59  2.46  1.90  5.78

K=4       BaseB  FootB  BsktB  Ten   Cyc   Swim  Jog
Group 1    5.10   5.47   3.75  3.60  2.40  1.90  5.78
Group 2    2.30   2.10   2.65  5.17  4.75  5.35  5.67
Group 3    2.40   3.75   3.90  1.85  4.85  5.20  6.05
Group 4    4.97   6.00   5.07  3.80  2.80  3.23  2.13
Look at the K = 2 means. Group 1 likes swimming and cycling, while group 2
likes the team sports, baseball, football, and basketball. If we compare these to the
K = 3 clustering, we see group 1 appears to be about the same as the team sports
group from K = 2, while groups 2 and 3 both like swimming and cycling. The
difference is that group 3 does not like jogging, while group 2 does. For K = 4, it
looks like the team-sports group has split into one that likes tennis (group 3), and
one that doesn’t (group 2). At this point it may be more useful to try to decide
what number of clusters is “good.” (Being able to interpret the clusters is one good
characteristic.)

12.1.2 Gap statistics


Many measures of goodness for clusterings are based on the tightness of clusters. In
K-means, an obvious measure of closeness is the within-group sum of squares. For
group k, the within sum of squares is

    SS_k = ∑_{i | Ĉ(x_i) = k} ||x_i − μ̂_k||²,    (12.8)

so that for all K clusters we have

    SS(K) = ∑_{k=1}^{K} SS_k,    (12.9)

which is exactly the optimal value of the objective function in (12.4).


A good K will have a small SS (K ), but SS (K ) is a decreasing function of K. See
the first plot in Figure 12.1. The lowest solid line is K versus log(SS (K )). In fact,
taking K = n (one observation in each cluster) yields SS (n ) = 0. We could balance
SS (K ) and K, e.g., by minimizing SS (K ) + λK for some λ. (Cf. equation (11.78).)
The question is how to choose λ. There does not appear to be an obvious AIC, BIC
or cross-validation procedure, although in Section 12.5 we look at the model-based
“soft” K-means procedure.

Figure 12.1: The first plot shows the log of the total sums of squares for cluster sizes
from K = 1 to 10 for the data (solid line), and for 100 random uniform samples (the
clump of curves). The second plot exhibits the gap statistics with ± SD lines.

Tibshirani, Walther, and Hastie [2001] take a different approach, proposing the gap
statistic, which compares the observed log(SS (K ))’s with what would be expected
from a sample with no cluster structure. We are targeting the values

Gap(K ) = E0 [log(SS (K ))] − log(SS (K )), (12.10)

where E0 [·] denotes expected value under some null distribution on the Xi ’s. Tib-
shirani et al. [2001] suggest taking a uniform distribution over the range of the data,
possibly after rotating the data to the principal components. A large value of Gap(K )
indicates that the observed clustering is substantially better than what would be ex-
pected if there were no clusters. Thus we look for a K with a large gap.
Because the sports data are rankings, it is natural to consider as a null distribu-
tion that the observations are independent and uniform over the permutations of
{1, 2, . . . , 7}. We cannot analytically determine the expected value in (12.10), so we
use simulations. For each b = 1, . . . , B = 100, we generate n = 130 random rankings,
perform K-means clustering for K = 1, . . . , 10, and find the corresponding SSb (K )’s.
These make up the dense clump of curves in the first plot of Figure 12.1.
The Gap(K ) in (12.10) is then estimated by using the average of the random curves,

    Ĝap(K) = (1/B) ∑_{b=1}^{B} log(SS_b(K)) − log(SS(K)).    (12.11)

The second plot in Figure 12.1 graphs this estimated curve, along with curves plus
or minus one standard deviation of the SSb (K )’s. Clearly K = 2 is much better than
K = 1; K’s larger than two do not appear to be better than two, so that the gap statistic
suggests K = 2 to be appropriate. Even if K = 3 had a higher gap, unless it is higher
by a standard deviation, one may wish to stick with the simpler K = 2. Of course,
interpretability is a strong consideration as well.

12.1.3 Silhouettes
Another measure of clustering efficacy is Rousseeuw’s [1987] notion of silhouettes.
The silhouette of an observation i measures how well it fits in its own cluster versus
how well it fits in its next closest cluster. Adapted to K-means, we have

    a(i) = ||x_i − μ̂_k||²  and  b(i) = ||x_i − μ̂_l||²,    (12.12)

where observation i is assigned to group k, and group l has the next-closest group mean to x_i. Then its silhouette is

    silhouette(i) = (b(i) − a(i)) / max{a(i), b(i)}.    (12.13)

By construction, b (i ) ≥ a(i ), hence the denominator is b (i ), and the silhouette takes


values between 0 and 1. If the observation is equal to its group mean, its silhouette
is 1. If it is halfway between the two group means, its silhouette is 0. For other
clusterings (K-medoids, as in Section 12.2, for example), the silhouettes can range
from -1 to 1, but usually stay above 0, or at least do not go much below.
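The calculation in (12.12) and (12.13) takes only a few lines. The following is a rough sketch (not the silhouette.km function of Section A.4.1, though it does the same sort of thing); x is the data matrix and centers the K × p matrix of group means:

silhouette.sketch <- function(x, centers) {
  d2 <- apply(centers, 1, function(m) rowSums(sweep(x, 2, m)^2))  # n x K squared distances
  dd <- t(apply(d2, 1, sort))         # row i: distances to closest, next closest, ...
  (dd[,2] - dd[,1])/dd[,2]            # (b(i) - a(i))/max{a(i), b(i)}, since b(i) >= a(i)
}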
Figure 12.2 contains the silhouettes for K’s from 2 to 5 for the sports data. The
observations (along the horizontal axis) are arranged by group and, within group,
by silhouettes. This arrangement allows one to compare the clusters. In the first
plot (K = 2 groups), the two clusters have similar silhouettes, and the silhouettes are
fairly “full.” High silhouettes are good, so that the average silhouette is a measure of
goodness for the clustering. In this case, the average is 0.625. For K = 3, notice that
the first silhouette is still full, while the two smaller clusters are a bit frail. The K = 4
and 5 silhouettes are not as full, either, as indicated by their averages.
Figure 12.3 plots the average silhouette versus K. It is clear that K = 2 has the
highest silhouette, hence we would (as when using the gap statistic) take K = 2 as
the best cluster size.

12.1.4 Plotting clusters in one and two dimensions


With two groups, we have two means in p(= 7)-dimensional space. To look at the
data, we can project the observations to the line that runs through the means. This
projection is where the clustering is taking place. Let
    z = (μ̂_1 − μ̂_2) / ||μ̂_1 − μ̂_2||,    (12.14)

the unit vector pointing from μ̂_2 to μ̂_1. Then using z as an axis, the projections of the observations onto z have coordinates

    w_i = x_i z′, i = 1, . . . , N.    (12.15)

Figure 12.4 is the histogram for the wi ’s, where group 1 has wi > 0 and group 2 has
wi < 0. We can see that the clusters are well-defined in that the bulk of each cluster
is far from the center of the other cluster.
We have also plotted the sports, found by creating a “pure” ranking for each sport.
Thus the pure ranking for baseball would give baseball the rank of 1, and the other
[Silhouette plots for K = 2, 3, 4, and 5; the average silhouettes are 0.625, 0.555, 0.508, and 0.534, respectively.]

Figure 12.2: The silhouettes for K = 2, . . . , 5 clusters. The horizontal axis indexes the
observations. The vertical axis exhibits the values of the silhouettes.

Figure 12.3: The average silhouettes for K = 2, . . . , 10 clusters.



Figure 12.4: The histogram for the observations along the line connecting the two
means for K = 2 groups.

sports the rank of 4.5, so that the sum of the ranks, 28, is the same as for the other
rankings. Adding these sports to the plot aids in interpreting the groups: team
sports on the left, individual sports on the right, with tennis on the individual-sport
side, but close to the border.
If K = 3, then the three means lie in a plane, hence we would like to project the
observations onto that plane. One approach is to use principal components (Section
1.6) on the means. Because there are three, only the first two principal components
will have positive variance, so that all the action will be in the first two. Letting
    Z = ⎛ μ̂_1 ⎞
        ⎜ μ̂_2 ⎟ ,    (12.16)
        ⎝ μ̂_3 ⎠

we apply the spectral decomposition (1.33) in Theorem 1.1 to the sample covariance
matrix of Z:
    (1/3) Z′ H_3 Z = G L G′,    (12.17)
where G is orthogonal and L is diagonal. The diagonals of L here are 11.77, 4.07, and
five zeros. We then rotate the data and the means using G,

    W = YG  and  W^(means) = ZG.    (12.18)

Figure 12.5 plots the first two variables for W and W^(means), along with the seven pure
rankings. We see the people who like team sports to the right, and the people who
like individual sports to the left, divided into those who can and those who cannot
abide jogging. Compare this plot to the biplot that appears in Figure 1.6.

Figure 12.5: The scatter plot for the data projected onto the plane containing the
means for K = 3.

12.1.5 Example: Sports data, using R


The sports data is in the R matrix sportsranks. The K-means clustering uses the
function kmeans. We create a list whose K th component contains the results for
K = 2, . . . , 10 groups:

kms <- vector('list',10)
for(K in 2:10) {
  kms[[K]] <- kmeans(sportsranks,centers=K,nstart=10)
}

The centers input specifies the number of groups desired, and nstart=10 means ran-
domly start the algorithm ten times, then use the one with lowest within sum of
squares. The output in kms[[K]] for the K-group clustering is a list with centers, the
K × p of estimated cluster means; cluster, an n-vector that assigns each observation
to its cluster (i.e., the yi ’s); withinss, the K-vector of SSk ’s (so that SS (K ) is found
by sum(kms[[K]]$withinss)); and size, the K-vector giving the numbers of observations
assigned to each group.

Gap statistic
To find the gap statistic, we first calculate the vector of SS (K )’s in (12.9) for K =
1, . . . , 10. For K = 1, there is just one large group, so that SS (1) is the sum of the
sample variances of the variables, times n − 1. Thus
n <- nrow(sportsranks) # n=130
ss <- tr(var(sportsranks))*(n-1) # For K=1
for(K in 2:10) {
  ss <- c(ss, sum(kms[[K]]$withinss))
}
The solid line in the first plot in Figure 12.1 is K versus log(ss). (Or something like it;
there is randomness in the results.)
For the summation term on the right-hand side of (12.11), we use uniformly dis-
tributed permutations of {1, . . . , 7}, which uses the command sample(7). For non-
rank statistics, one has to try some other randomization. For each b = 1, . . . , 100,
we create n = 130 random permutations, then go through the K-means process for
K = 1, . . . , 10. The xstar is the n × 7 random data set.
ssb <- NULL
for(b in 1:100) {
  xstar <- NULL
  for(i in 1:n) xstar <- rbind(xstar,sample(7))
  sstar <- tr(var(xstar))*(n-1)
  for(K in 2:10) {
    sstar <- c(sstar,sum(kmeans(xstar,centers=K,nstart=10)$withinss))
  }
  ssb <- rbind(ssb,sstar)
}
Now each column of ssb contains the SS_b(K)'s for a given K, one for each of the random samples. The gap statistics (12.11)
and the two plots in Figure 12.1 are found using
par(mfrow=c(1,2)) # Set up two plots
matplot(1:10,log(cbind(ss,t(ssb))),type='l',xlab='K',ylab='log(SS)')
ssbm <- apply(log(ssb),2,mean) # Mean of the log(ssb[,K])'s
ssbsd <- sqrt(apply(log(ssb),2,var)) # SD of the log(ssb[,K])'s
gap <- ssbm - log(ss) # The vector of gap statistics
matplot(1:10,cbind(gap,gap+ssbsd,gap-ssbsd),type='l',xlab='K',ylab='Gap')

Silhouettes
Section A.4.1 contains a simple function for calculating the silhouettes in (12.13) for
a given K-means clustering. The sort.silhouette function in Section A.4.2 sorts the
silhouette values for plotting. The following statements produce Figure 12.2:
sil.ave <- NULL # To collect the silhouette means for each K
par(mfrow=c(3,3))
for(K in 2:10) {
  sil <- silhouette.km(sportsranks,kms[[K]]$centers)
  sil.ave <- c(sil.ave,mean(sil))
  ssil <- sort.silhouette(sil,kms[[K]]$cluster)
  plot(ssil,type='h',xlab='Observations',ylab='Silhouettes')
  title(paste('K =',K))
}
The sil.ave calculated above can then be used to obtain Figure 12.3:
plot(2:10,sil.ave,type='l',xlab='K',ylab='Average silhouette width')

Plotting the clusters


Finally, we make plots as in Figures 12.4 and 12.5. For K = 2, we have the one z as in
(12.14) and the wi ’s as in (12.15):
z <- kms[[2]]$centers[1,]-kms[[2]]$centers[2,]
z <- z/sqrt(sum(z^2))
w <- sportsranks%*%z
xl <- c(-6,6); yl <- c(0,13) # Fix the x- and y-ranges
hist(w[kms[[2]]$cluster==1],col=2,xlim=xl,ylim=yl,main='K=2',xlab='W')
par(new=TRUE) # To allow two histograms on the same plot
hist(w[kms[[2]]$cluster==2],col=3,xlim=xl,ylim=yl,main=' ',xlab=' ')
To add the sports’ names:
y <- matrix(4.5,7,7)-3.5*diag(7)
ws <- y%*%z
text(ws,c(10,11,12,8,9,10,11),labels=dimnames(sportsranks)[[2]])
The various placement numbers were found by trial and error.
For K = 3, or higher, we can use R’s eigenvector/eigenvalue function, eigen, to
find the G used in (12.18):
z <- kms[[3]]$centers
g <- eigen(var(z))$vectors[,1:2] # Just need the first two columns
w <- sportsranks%*%g # For the observations
ws <- y%*%g # For the sports' names
wm <- z%*%g # For the groups' means
cl <- kms[[3]]$cluster
plot(w,xlab='Var 1',ylab='Var 2',pch=cl)
text(wm,labels=1:3)
text(ws,dimnames(sportsranks)[[2]])

12.2 K-medoids
Clustering with medoids [Kaufman and Rousseeuw, 1990] works directly on dis-
tances between objects. Suppose we have n objects, o1 , . . . , on , and a dissimilarity
measure d(oi , o j ) between pairs. This d satisfies

d(oi , o j ) ≥ 0, d(oi , o j ) = d(o j , oi ), and d(oi , oi ) = 0, (12.19)

but it may not be an actual metric in that it need not satisfy the triangle inequality.
Note that one cannot necessarily impute distances between an object and another
vector, e.g., a mean vector. Rather than clustering around means, the clusters are then

built around some of the objects. That is, K-medoids finds K of the objects (c1 , . . . , cK )
to act as centers (or medoids), the objective being to find the set that minimizes
    obj(c_1, . . . , c_K) = ∑_{i=1}^{N} min_{c_k ∈ {c_1,...,c_K}} d(o_i, c_k).    (12.20)

Silhouettes are defined as in (12.13), except that here, for each observation i,
    a(i) = ∑_{j ∈ Group k} d(o_i, o_j)  and  b(i) = ∑_{j ∈ Group l} d(o_i, o_j),    (12.21)

where group k is object i’s group, and group l is its next closest group.
In R, one can use the package cluster, [Maechler et al., 2005], which implements
K-medoids clustering in the function pam, which stands for partitioning around
medoids. Consider the grades data in Section 4.2.1. We will cluster the five vari-
ables, homework, labs, inclass, midterms, and final, not the 107 people. A natural
measure of similarity between two variables is their correlation. Instead of using the
usual Pearson coefficient, we will use Kendall’s τ, which is more robust. For n × 1
vectors x and y, Kendall’s τ is

∑1≤i< j≤n Sign( xi − x j )Sign(yi − y j )


T (x, y) = . (12.22)
(n2 )
The numerator looks at the line segment connecting each pair of points ( xi , yi ) and
( x j , y j ), counting +1 if the slope is positive and −1 if it is negative. The denominator
normalizes the statistic so that it is between ±1. Then T (x, y) = +1 means that the
xi ’s and yi ’s are exactly monotonically increasingly related, and −1 means they are
exactly monotonically decreasingly related, much as the correlation coefficient. The
T’s measure similarities, so we subtract each T from 1 to obtain the dissimilarity
matrix:
HW Labs InClass Midterms Final
HW 0.00 0.56 0.86 0.71 0.69
Labs 0.56 0.00 0.80 0.68 0.71
(12.23)
InClass 0.86 0.80 0.00 0.81 0.81
Midterms 0.71 0.68 0.81 0.00 0.53
Final 0.69 0.71 0.81 0.53 0.00
Using R, we find the dissimilarity matrix:
x <- grades[,2:6]
dx <- matrix(nrow=5,ncol=5) # To hold the dissimilarities
for(i in 1:5)
  for(j in 1:5)
    dx[i,j] <- 1-cor.test(x[,i],x[,j],method='kendall')$est
This matrix is passed to the pam function, along with the desired number of groups
K. Thus for K = 3, say, use
pam3 <- pam(as.dist(dx),k=3)
The average silhouette for this clustering is in pam3$silinfo$avg.width. The results for
K = 2, 3 and 4 are
K 2 3 4
(12.24)
Average silhouette 0.108 0.174 0.088

We see that K = 3 has the best average silhouette. The assigned groups for this clus-
tering can be found in pam3$clustering, which is (1,1,2,3,3), meaning the groupings
are, reasonably enough,

{HW, Labs} {InClass} {Midterms, Final}. (12.25)

The medoids, i.e., the objects chosen as centers, are in this case labs, inclass, and
midterms, respectively.

12.3 Model-based clustering


In model-based clustering [Fraley and Raftery, 2002], we assume that the model in
(12.2) holds, just as for classification. We then estimate the parameters, which include the θ_k's and the π_k's, and assign observations to clusters as in (12.3):

    Ĉ(x_i) = k that maximizes f(x_i | θ̂_k) π̂_k / ( f(x_i | θ̂_1) π̂_1 + · · · + f(x_i | θ̂_K) π̂_K ).    (12.26)

As opposed to classification situations, in clustering we do not observe the yi ’s, hence


cannot use the joint distribution of (Y, X) to estimate the parameters. Instead, we
need to use the marginal of X, which is the denominator in the Ĉ:

f (xi ) = f (xi | θ1 , . . . , θK , π1 , . . . , π K )
= f (xi | θ1 )π1 + · · · + f (xi | θK )π K . (12.27)

The density is a mixture density, as in (11.5).


The likelihood for the data is then
    L(θ_1, . . . , θ_K, π_1, . . . , π_K; x_1, . . . , x_n) = ∏_{i=1}^{n} ( f(x_i | θ_1)π_1 + · · · + f(x_i | θ_K)π_K ).    (12.28)
The likelihood can be maximized for any specific model (specifying the f ’s and θk ’s
as well as K), and models can be compared using the BIC (or AIC). The likelihood
(12.28) is not always easy to maximize due to its being a product of sums. Often the
EM algorithm (see Section 12.4) is helpful.
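To see what is involved, the log of the likelihood (12.28) for normal components is easy to compute for given parameter values. This is only a sketch (the names are ours; mu is a K × p matrix of means, Sigma a list of K covariance matrices, pihat the vector of π_k's, and dmvnorm comes from the mvtnorm package); Mclust does the maximization and the BIC bookkeeping itself.

library(mvtnorm)
mixloglike <- function(x, mu, Sigma, pihat) {
  f <- sapply(1:length(pihat),
              function(k) dmvnorm(x, mu[k,], Sigma[[k]])*pihat[k])   # n x K matrix of f(x_i | theta_k) pi_k
  sum(log(rowSums(f)))                                               # log of the product in (12.28)
}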
We will present the multivariate normal case, as we did in (11.24) and (11.25) for
classification. The general model assumes for each k that

X | Y = k ∼ N1× p (μk , Σk ). (12.29)

We will assume the μk ’s are free to vary, although models in which there are equalities
among some of the elements are certainly reasonable. There are also a variety of
structural and equality assumptions on the Σk ’s used.

12.3.1 Example: Automobile data


The R function we use is in the package mclust, Fraley and Raftery [2010]. Our
data consists of size measurements on 111 automobiles; the variables include length,
wheelbase, width, height, front and rear head room, front leg room, rear seating, front
and rear shoulder room, and luggage area. The data are in the file cars, from Con-
sumers' Union [1990], and can be found in the S-Plus® [TIBCO Software Inc., 2009]

Figure 12.6: −BIC’s for fitting the entire data set.

data frame cu.dimensions. The variables in cars have been normalized to have medians
of 0 and median absolute deviations (MAD) of 1.4826 (the MAD for a N (0, 1)).
The routine we’ll use is Mclust (be sure to capitalize the M). It will try various
forms of the covariance matrices and group sizes, and pick the best based on the BIC.
To use the default options and have the results placed in mcars, use
mcars <- Mclust(cars)
There are many options for plotting in the package. To see a plot of the BIC’s, use
plot(mcars,cars,what='BIC')
You have to click on the graphics window, or hit enter, to reveal the plot. The result
is in Figure 12.6. The horizontal axis specifies the K, and the vertical axis gives the
BIC values, although these are the negatives of our BIC's. The symbols plotted
on the graph are codes for various structural hypotheses on the covariances. See
(12.35). In this example, the best model is Model “VVV” with K = 2, which means
the covariance matrices are arbitrary and unequal.
Some pairwise plots (length versus height, width versus front head room, and
rear head room versus luggage) are given in Figure 12.7. The plots include ellipses to
illustrate the covariance matrices. Indeed we see that the two ellipses in each plot are
arbitrary and unequal. To plot variable 1 (length) versus variable 4 (height), use
plot(mcars,cars,what='classification',dimens=c(1,4))
We also plot the first two principal components (Section 1.6). The matrix of eigenvec-
tors, G in (1.33), is given by eigen(var(cars))$vectors:

Figure 12.7: Some two-variable plots of the clustering produced by Mclust. The solid
triangles indicate group 1, and the open squares indicate group 2. The fourth graph
plots the first two principal components of the data.

carspc <- cars%*%eigen(var(cars))$vectors # Principal components

To obtain the ellipses, we redid the clustering using the principal components as the
data, and specifying G=2 groups in Mclust.
Look at the plots. The lower left graph shows that group 2 is almost constant
on the luggage variable. In addition, the upper left and lower right graphs indicate
that group 2 can be divided into two groups, although the BIC did not pick up the
difference. Table 12.1 exhibits four of the variables for the 15 automobiles in
group 2.
We have divided this group as suggested by the principal component plot. Note
that the first group of five are all sports cars. They have no back seats or luggage areas,
hence the values in the data set for the corresponding variables are coded somehow.
The other ten automobiles are minivans. They do not have specific luggage areas, i.e.,
trunks, either, although in a sense the whole vehicle is a big luggage area. Thus this
group really is a union of two smaller groups, both of which are quite a bit different
than group 1.

Rear Head Rear Seating Rear Shoulder Luggage


Chevrolet Corvette −4.0 −19.67 −28.00 −8.0
Honda Civic CRX −4.0 −19.67 −28.00 −8.0
Mazda MX5 Miata −4.0 −19.67 −28.00 −8.0
Mazda RX7 −4.0 −19.67 −28.00 −8.0
Nissan 300ZX −4.0 −19.67 −28.00 −8.0
Chevrolet Astro 2.5 0.33 −1.75 −8.0
Chevrolet Lumina APV 2.0 3.33 4.00 −8.0
Dodge Caravan 2.5 −0.33 −6.25 −8.0
Dodge Grand Caravan 2.0 2.33 3.25 −8.0
Ford Aerostar 1.5 1.67 4.25 −8.0
Mazda MPV 3.5 0.00 −5.50 −8.0
Mitsubishi Wagon 2.5 −19.00 2.50 −8.0
Nissan Axxess 2.5 0.67 1.25 −8.5
Nissan Van 3.0 −19.00 2.25 −8.0
Volkswagen Vanagon 7.0 6.33 −7.25 −8.0

Table 12.1: The automobiles in group 2 of the clustering of all the data.

We now redo the analysis on just the group 1 automobiles:


cars1 <- cars[mcars$classification==1,]
mcars1 <- Mclust(cars1)
The model chosen by BIC is “XXX with 1 components” which means the best cluster-
ing is one large group, where the Σ is arbitrary. See Figure 12.8 for the BIC plot. The
EEE models (equal but arbitrary covariance matrices) appear to be quite good, and
similar BIC-wise, for K from 1 to 4. To get the actual BIC values, look at the vector
mcars1$BIC[,'EEE']. The next table has the BIC's and corresponding estimates of the
posterior probabilities for the first five models, where we shift the BIC's so that the
best is 0:
K 1 2 3 4 5
BIC 0 28.54 9.53 22.09 44.81 (12.30)
PBIC 99.15 0 0.84 0 0
Indeed, it looks like one group is best, although three groups may be worth looking
at. It turns out the three groups are basically large, middle-sized, and small cars. Not
profound, perhaps, but reasonable.

12.3.2 Some of the models in mclust


The mclust package considers several models for the covariance matrices. Suppose
that the covariance matrices for the groups are Σ1 , . . . , ΣK , where each has its spectral
decomposition (1.33)
    Σ_k = Γ_k Λ_k Γ_k′,    (12.31)
and the eigenvalue matrix is decomposed as
    Λ_k = c_k Δ_k where |Δ_k| = 1 and c_k = [∏_{j=1}^{p} λ_j]^{1/p},    (12.32)

Figure 12.8: −BIC’s for the data set without the sports cars or minivans.

the geometric mean of the eigenvalues. A covariance matrix is then described by


shape, volume, and orientation:

    Shape(Σ_k) = Δ_k;
    Volume(Σ_k) = |Σ_k| = c_k^p;
    Orientation(Σ_k) = Γ_k.    (12.33)

The covariance matrices are then classified into spherical, diagonal, and ellipsoidal:

Spherical ⇒ Δk = I p ⇒ Σk = ck I p ;
Diagonal ⇒ Γ k = I p ⇒ Σk = ck Dk ;
Ellipsoidal ⇒ Σk is arbitrary. (12.34)

The various models are defined by the type of covariances, and what equalities
there are among them. I haven’t been able to crack the code totally, but the descrip-
tions tell the story. When K ≥ 2 and p ≥ 2, the following table may help translate the
descriptions into restrictions on the covariance matrices through (12.33) and (12.34):

   Code   Description                                            Σ_k
   EII    spherical, equal volume                                σ²I_p
   VII    spherical, unequal volume                              σ_k²I_p
   EEI    diagonal, equal volume and shape                       Λ
   VEI    diagonal, varying volume, equal shape                  c_kΔ
   EVI    diagonal, equal volume, varying shape                  cΔ_k            (12.35)
   VVI    diagonal, varying volume and shape                     Λ_k
   EEE    ellipsoidal, equal volume, shape, and orientation      Σ
   EEV    ellipsoidal, equal volume and equal shape              Γ_kΛΓ_k′
   VEV    ellipsoidal, equal shape                               c_kΓ_kΔΓ_k′
   VVV    ellipsoidal, varying volume, shape, and orientation    arbitrary

Here, Λ’s are diagonal matrices with positive diagonals, Δ’s are diagonal matrices
with positive diagonals whose product is 1 as in (12.32), Γ’s are orthogonal matrices,
Σ’s are arbitrary nonnegative definite symmetric matrices, and c’s are positive scalars.
A subscript k on an element means the groups can have different values for that
element. No subscript means that element is the same for each group.
If there is only one variable, but K ≥ 2, then the only two models are “E,” meaning
the variances of the groups are equal, and “V,” meaning the variances can vary. If
there is only one group, then the models are as follows:

Code Description Σ
X one-dimensional σ2
XII spherical σ2 I p (12.36)
XXI diagonal Λ
XXX ellipsoidal arbitrary
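If one wishes to entertain only some of these models, or particular numbers of groups, the modelNames and G arguments of Mclust can be given explicitly. A small sketch (mcsub is just our name for the result):

mcsub <- Mclust(cars, G = 1:8, modelNames = c('EII','EEE','VVV'))
mcsub$BIC       # the BIC values (negatives of ours) for each model and each G
summary(mcsub)  # the chosen model and number of groups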

12.4 An example of the EM algorithm


The aim of this section is to give the flavor of an implementation of the EM algorithm.
We assume K groups with the multivariate normal distribution as in (12.29), with
different arbitrary Σk ’s. The idea is to iterate two steps:

1. Having estimates of the parameters, find estimates of P [Y = k | X = xi ]’s.

2. Having estimates of P [Y = k | X = xi ]’s, find estimates of the parameters.

Suppose we start with initial estimates of the π k ’s, μk ’s, and Σk ’s. E.g., one could
first perform a K-means procedure, then use the sample means and covariance ma-
trices of the groups to estimate the means and covariances, and estimate the π k ’s by
the proportions of observations in the groups. Then, as in (12.26), for step 1 we use

    P[Y = k | X = x_i] = f(x_i | μ̂_k, Σ̂_k) π̂_k / ( f(x_i | μ̂_1, Σ̂_1) π̂_1 + · · · + f(x_i | μ̂_K, Σ̂_K) π̂_K )
                       ≡ w_k^{(i)},    (12.37)

where θ̂_k = (μ̂_k, Σ̂_k).

Note that for each i, the w_k^{(i)} can be thought of as weights, because their sum over k is 1. Then in Step 2, we find the weighted means and covariances of the x_i's:

    μ̂_k = (1/n̂_k) ∑_{i=1}^{n} w_k^{(i)} x_i
    and Σ̂_k = (1/n̂_k) ∑_{i=1}^{n} w_k^{(i)} (x_i − μ̂_k)′(x_i − μ̂_k),
    where n̂_k = ∑_{i=1}^{n} w_k^{(i)}.
    Also, π̂_k = n̂_k/n.    (12.38)
The two steps are iterated until convergence. The convergence may be slow, and
it may not approach the global maximum likelihood, but it is guaranteed to increase
the likelihood at each step. As in K-means, it is a good idea to try different starting
points.
In the end, the observations are clustered using the conditional probabilities, be-
cause from (12.26),
    Ĉ(x_i) = k that maximizes w_k^{(i)}.    (12.39)
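A bare-bones version of one iteration of (12.37) and (12.38) might look as follows. This is a sketch for illustration only (Mclust handles all of this internally); mu is K × p, Sigma is a list of K covariance matrices, pihat is the vector of π̂_k's, and dmvnorm is from the mvtnorm package:

library(mvtnorm)
em.step <- function(x, mu, Sigma, pihat) {
  K <- length(pihat); n <- nrow(x)
  f <- sapply(1:K, function(k) dmvnorm(x, mu[k,], Sigma[[k]])*pihat[k])
  w <- f/rowSums(f)                         # step 1: the w_k^(i)'s of (12.37)
  nk <- colSums(w)
  for(k in 1:K) {                           # step 2: weighted means and covariances, (12.38)
    mu[k,] <- colSums(w[,k]*x)/nk[k]
    xc <- sweep(x, 2, mu[k,])
    Sigma[[k]] <- t(xc*w[,k])%*%xc/nk[k]
  }
  list(mu = mu, Sigma = Sigma, pihat = nk/n, w = w)
}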

12.5 Soft K-means


We note that the K-means procedure in (12.5) and (12.6) is very similar to the EM
procedure in (12.37) and (12.38) if we take a hard form of conditional probability, i.e.,
take

    w_k^{(i)} = { 1  if x_i is assigned to group k
                { 0  otherwise.                              (12.40)
Then the μ̂_k in (12.38) becomes the sample mean of the observations assigned to cluster k.
A model for which model-based clustering mimics K-means clustering assumes
that in (12.29), the covariance matrices Σk = σ2 I p (model “EII” in (12.35)), so that

    f_k(x_i) = (c/σ^p) e^{−||x_i − μ_k||²/(2σ²)}.    (12.41)
If σ is fixed, then the EM algorithm proceeds as above, except that the covariance
calculation in (12.38) is unnecessary. If we let σ → 0 in (12.37), fixing the means, we
have that
    P[Y = k | X = x_i] −→ w_k^{(i)}    (12.42)

for the w_k^{(i)} in (12.40), at least if all the π̂_k's are positive. Thus for small fixed σ,
K-means and model-based clustering are practically the same.
Allowing σ to be estimated as well leads to what we call soft K-means, soft be-
cause we use a weighted mean, where the weights depend on the distances from the
observations to the group means. In this case, the EM algorithm is as in (12.37) and

(12.38), but with the estimate of the covariance replaced with the pooled estimate of
σ2 ,
1 K n ( i)
n k∑ ∑ wk xi − μk 2 .
σ2 =
 (12.43)
=1 i =1

12.5.1 Example: Sports data


In Section 12.1.1, we used K-means to find clusters in the data on people's favorite
sports. Here we use soft K-means. There are a couple of problems with using this
model (12.41): (1) The data are discrete, not continuous as in the multivariate normal;
(2) The dimension is actually 6, not 7, because each observation is a permutation of
1, . . . , 7, hence sums to 28. To fix the latter problem, we multiply the data matrix
by any orthogonal matrix whose first column is constant, then throw away the first
column of the result (since it is a constant). Orthogonal polynomials are easy in R:
h <- poly(1:7,6) # Gives all but the constant term.
x <- sportsranks%*%h
The clustering can be implemented in Mclust by specifying the “EII” model in
(12.35):
skms <- Mclust(x,modelNames='EII')
The shifted BIC’s are
K 1 2 3 4 5
(12.44)
BIC 95.40 0 21.79 32.28 48.27
Clearly K = 2 is best, which is what we found using K-means in Section 12.1.1. It
turns out the observations are clustered exactly the same for K = 2 whether using
K-means or soft K-means. When K = 3, the two methods differ on only three obser-
vations, but for K = 4, 35 are differently clustered.

12.6 Hierarchical clustering


A hierarchical clustering gives a sequence of clusterings, each one combining two
clusters of the previous stage. We assume n objects and their dissimilarities d as in
(12.19). To illustrate, consider the five grades’ variables in Section 12.2. A possible
hierarchical sequence of clusterings starts with each object in its own group, then
combines two of those elements, say midterms and final. The next step could combine
two of the other singletons, or place one of them with the midterms/final group. Here
we combine homework and labs, then combine all but inclass, then finally have one
big group with all the objects:
{HW} {Labs} {InClass} {Midterms} {Final}
→{HW} {Labs} {InClass} {Midterms, Final}
→{HW, Labs} {InClass} {Midterms, Final}
→{InClass} {HW, Labs, Midterms, Final}
→{InClass, HW, Labs, Midterms, Final} (12.45)
Reversing the steps and connecting, one obtains a tree diagram, or dendrogram, as in
Figure 12.9.
Figure 12.9: Hierarchical clustering of the grades, using complete linkage.

For a set of objects, the question is which clusters to combine at each stage. At
the first stage, we combine the two closest objects, that is, the pair (oi , o j ) with the
smallest d(o_i, o_j). At any further stage, we may wish to combine two individual
objects, or a single object to a group, or two groups. Thus we need to decide how
to measure the dissimilarity between any two groups of objects. There are many
possibilities. Three popular ones look at the minimum, average, and maximum of
the individuals’ distances. That is, suppose A and B are subsets of objects. Then the
three distances between the subsets are

    Single linkage:     d(A, B) = min_{a∈A, b∈B} d(a, b)

    Average linkage:    d(A, B) = (1/(#A × #B)) ∑_{a∈A} ∑_{b∈B} d(a, b)

    Complete linkage:   d(A, B) = max_{a∈A, b∈B} d(a, b)    (12.46)

In all cases, d({a}, {b}) = d(a, b). Complete linkage is an example of Hausdorff
distance, at least when the d is a distance.
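In terms of a dissimilarity matrix such as the dx of (12.23), the three linkages just take the min, mean, or max over the relevant submatrix. A sketch, with A and B vectors of indices for the two clusters:

d.single   <- function(d, A, B) min(d[A,B])
d.average  <- function(d, A, B) mean(d[A,B])   # same as the sum divided by #A x #B
d.complete <- function(d, A, B) max(d[A,B])

For example, d.complete(dx, 1, c(4,5)) gives (up to rounding) the 0.71 of (12.48).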

12.6.1 Example: Grades data


Consider the dissimilarities for the five variables of the grades data given in (12.23).
The hierarchical clustering using these dissimilarities with complete linkage is given
in Figure 12.9. This clustering is not surprising given the results of K-medoids in
(12.25). As in (12.45), the hierarchical clustering starts with each object in its own
cluster. Next we look for the smallest dissimilarity between two objects, which is the
0.53 between midterms and final. In the dendrogram, we see these two scores being
connected at the height of 0.53.

We now have four clusters, with dissimilarity matrix


HW Labs InClass {Midterms, Final}
HW 0.00 0.56 0.86 0.71
Labs 0.56 0.00 0.80 0.71 (12.47)
InClass 0.86 0.80 0.00 0.81
{Midterms, Final} 0.71 0.71 0.81 0.00
(The dissimilarity between the cluster {Midterms, Final} and itself is not really zero,
but we put zero there for convenience.) Because we are using complete linkage, the
dissimilarity between a single object and the cluster with two objects is the maximum
of the two individual dissimilarities. For example,
d({HW}, {Midterms, Final}) = max{d(HW, Midterms), d(HW, Final)}
= max{0.71, 0.69}
= 0.71. (12.48)
The two closest clusters are now the singletons HW and Labs, with a dissimilarity of
0.56. The new dissimilarity matrix is then
{HW, Labs} InClass {Midterms, Final}
{HW, Labs} 0.00 0.86 0.71
(12.49)
InClass 0.86 0.00 0.81
{Midterms, Final} 0.71 0.81 0.00
The next step combines the two two-object clusters, and the final step places InClass
with the rest.
To use R, we start with the dissimilarity matrix dx in (12.23). The routine hclust
creates the tree, and plclust plots it. We need the as.dist there to let the function know
we already have the dissimilarities. Then Figure 12.9 is created by the statement
plclust(hclust(as.dist(dx)))

12.6.2 Example: Sports data


Turn to the sports data from Section 12.1.1. Here we cluster the sports, using squared
Euclidean distance as the dissimilarity, and complete linkage. To use squared Eu-
clidean distances, use the dist function directly on the data matrix. Figure 12.10 is
found using
plclust(hclust(dist(t(sportsranks))))
Compare this plot to the K-means plot in Figure 12.5. We see somewhat similar
closenesses among the sports.
Figure 12.11 clusters the individuals using complete linkage and single linkage,
created using
par(mfrow=c(2,1))
dxs <- dist(sportsranks) # Gets Euclidean distances
lbl <- rep(' ',130) # Prefer no labels for the individuals
plclust(hclust(dxs),xlab='Complete linkage',sub=' ',labels=lbl)
plclust(hclust(dxs,method='single'),xlab='Single linkage',sub=' ',labels=lbl)
Figure 12.10: Clustering the sports, using complete linkage.


Figure 12.11: Clustering the individuals in the sports data, using complete linkage (top) and single linkage (bottom).

Complete linkage tends to favor similar-sized clusters: because the maximum distance is used, it is easier for two small clusters to merge than for anything to attach itself to a large cluster. Single linkage tends to favor a few large clusters, with the rest small, because the larger a cluster is, the more likely it is to be close to the small clusters. These ideas are borne out in the plot, where complete linkage yields a more tree-like dendrogram.

12.7 Exercises
Exercise 12.7.1. Show that |Σk| = ck^p in (12.33) follows from (12.31) and (12.32).
Exercise 12.7.2. (a) Show that the EM algorithm, where we use the wk^(i)’s in (12.40) as the estimate of P[Y = k | X = xi], rather than that in (12.37), is the K-means algorithm of Section 12.1. [Note: You have to worry only about the mean in (12.38).] (b) Show that the limit as σ → 0 of P[Y = k | X = xi] is indeed given in (12.40), if we use the fk in (12.41) in (12.37).

Exercise 12.7.3 (Grades). This problem is to cluster the students in the grades data
based on variables 2–6: homework, labs, inclass, midterms, and final. (a) Use K-
means clustering for K = 2. (Use nstart=100, which is a little high, but makes sure
everyone gets similar answers.) Look at the centers, and briefly characterize the
clusters. Compare the men and women (variable 1, 0=Male, 1=Female) on which
clusters they are in. (Be sure to take into account that there are about twice as many
women as men.) Any differences? (b) Same question, for K = 3. (c) Same question,
for K = 4. (d) Find the average silhouettes for the K = 2, 3 and 4 clusterings from
parts (a), (b) and (c). Which K has the highest average silhouette? (e) Use soft K-
means to find the K = 1, 2, 3 and 4 clusterings. Which K is best according to the
BIC’s? (Be aware that the BIC’s in Mclust are the negatives of the ones we use.) Is it the same as
for the best K-means clustering (based on silhouettes) found in part (d)? (f) For each
of K = 2, 3, 4, compare the classifications of the data using regular K-means to that of
soft K-means. That is, match the clusters produced by both methods for given K, and
count how many observations were differently clustered.

Exercise 12.7.4 (Diabetes). The R package mclust contains the data set diabetes [Reaven
and Miller, 1979]. There are n = 145 subjects and four variables. The first variable
(class) is a categorical variable indicating whether the subject has overt diabetes (my
interpretation: symptoms are obvious), chemical diabetes (my interpretation: can
only be detected through chemical analysis of the blood), or is normal (no diabetes).
The other three variables are blood measurements: glucose, insulin, sspg. (a) First, normalize the three blood measurement variables so that they have means zero and variances one:
blood <- scale(diabetes[,2:4])
Then use K-means to cluster the observations on the three normalized blood measurement variables for K = 1, 2, . . . , 9. (b) Find the gap statistics for the clusterings in part (a). To generate a random observation, use three independent uniforms, whose ranges coincide with the ranges of the three variables. So to generate a random data set xstar:
n <- nrow(blood)
p <- ncol(blood)
ranges <- apply(blood,2,range) # Obtains the mins and maxes
xstar <- NULL # To contain the new data set
for(j in 1:p) {
xstar <- cbind(xstar,runif(n,ranges[1,j],ranges[2,j]))
}
Which K would you choose based on this criterion? (c) Find the average silhouettes for the clusterings found in part (a), except for K = 1. Which K would you choose based on this criterion? (d) Use model-based clustering, again with K = 1, . . . , 9. Which model and K have the best BIC? (e) For each of the three “best” clusterings in parts (b), (c), and (d), plot each pair of variables, indicating which cluster each point was assigned to, as in Figure 12.7. Compare these to the same plots that use the class variable as the indicator. What do you notice? (f) For each of the three best clusterings, find the table comparing the clusters with the class variable. Which clustering was closest to the class variable? Why do you suppose that clustering was closest? (Look at the plots.)

Exercise 12.7.5 (Iris). This question applies model-based clustering to the iris data,
pretending we do not know which observations are in which species. (a) Do the
model-based clustering without any restrictions (i.e., use the defaults). Which model
and number K was best, according to BIC? Compare the clustering for this best model
to the actual species. (b) Now look at the BIC’s for the model chosen in part (a), but
for the various K’s from 1 to 9. Calculate the corresponding estimated posterior
probabilities. What do you see? (c) Fit the same model, but with K = 3. Now
compare the clustering to the true species.

Exercise 12.7.6 (Grades). Verify the dissimilarity matrices in (12.47) and (12.49).

Exercise 12.7.7 (Soft drinks). The data set softdrinks has 23 people’s rankings of eight soft drinks: Coke, Pepsi, Sprite, 7-up, and their diet equivalents. Do a hierarchical
clustering on the drinks, so that the command is
hclust(dist(t(softdrinks2)))
then plot the tree with the appropriate labels. Describe the tree. Does the clustering
make sense?

Exercise 12.7.8 (Cereal). Exercise 1.9.19 presented the cereal data (in the R data ma-
trix cereal), finding the biplot. Do hierarchical clustering on the cereals, and on the
attributes. Do the clusters make sense? What else would you like to know from these
data? Compare the clusterings to the biplot.
Chapter 13

Principal Components and Related Techniques

Data reduction is a common goal in multivariate analysis — one has too many vari-
ables, and wishes to reduce the number of them without losing much information.
How to approach the reduction depends of course on the goal of the analysis. For
example, in linear models, there are clear dependent variables (in the Y matrix) that
we are trying to explain or predict from the explanatory variables (in the x matrix,
and possibly the z matrix). Then Mallows’ C p or cross-validation are reasonable ap-
proaches. If the correlations between the Y’s are of interest, then factor analysis is
appropriate, where the likelihood ratio test is a good measure of how many factors
to take. In classification, using cross-validation is a good way to decide on the vari-
ables. In model-based clustering, and in fact any situation with a likelihood, one can
balance the fit and complexity of the model using something like AIC or BIC.
There are other situations in which the goal is not so clear cut as in those above;
one is more interested in exploring the data, using data reduction to get a better
handle on the data, in the hope that something interesting will reveal itself. The
reduced data may then be used in more formal models, although I recommend first
considering targeted reductions as mentioned in the previous paragraph, rather than
immediately jumping to principal components.
Below we discuss principal components in more depth, then present multidimen-
sional scaling, and canonical correlations.

13.1 Principal components, redux


Recall way back in Section 1.6 that the objective in principal component analysis was
to find linear combinations (with norm 1) with maximum variance. As an exploratory
technique, principal components can be very useful, as are other projection pursuit
methods. The conceit underlying principal components is that variance is associated
with interestingness, which may or may not hold. As long as in an exploratory mood,
though, if one finds the top principal components are not particularly interesting or
interpretable, then one can go in a different direction.
But be careful not to shift over to the notion that components with low variance
can be ignored. It could very well be that they are the most important, e.g., most


correlated with a separate variable of interest. Using principal components as the


first step in a process, where one takes the first few principal components to use in
another procedure such as clustering or classification, may or may not work out well.
In particular, it makes little sense to use principal components to reduce the variables
before using them in a linear process such as regression, canonical correlations, or
Fisher’s linear discrimination. For example, in regression, we are trying to find the
linear combination of x’s that best correlates with y. Using principal components first
on the x’s will give us a few new variables that are linear combinations of x’s, which
we then further take linear combinations of to correlate with the y. What we end up
with is a worse correlation than if we just started with the original x’s, since some
parts of the x’s are left behind. The same thinking goes when using linear discrimina-
tion: We want the linear combination of x’s that best distinguishes the groups, not the
best linear combination of a few linear combinations of the x’s. Because factor analy-
sis tries to account for correlations among the variables, if one transforms to principal
components, which are uncorrelated, before applying factor analysis, then there will
be no common factors. On the other hand, if one is using nonlinear techniques such
as classification trees, first reducing by principal components may indeed help.
Even principal components are not unique. E.g., you must choose whether or
not to take into account covariates or categorical factors before finding the sample
covariance matrix. You also need to decide how to scale the variables, i.e., whether to
leave them in their original units, or scale so that all variables have the same sample
variance, or scale in some other way. The scaling will affect the principal components,
unlike in factor analysis or linear regression.

13.1.1 Example: Iris data

Recall that the Fisher/Anderson iris data (Section 1.3.1) has n = 150 observations
and q = 4 variables. The measurements of the petals and sepals are in centimeters,
so it is reasonable to leave the data unscaled. On the other hand, the variances of
the variables do differ, so scaling so that each has unit variance is also reasonable.
Furthermore, we could either leave the data unadjusted in the sense of subtracting
the overall mean when finding the covariance matrix, or adjust the data for species by
subtracting from each observation the mean of its species. Thus we have four reason-
able starting points for principal components, based on whether we adjust for species
and whether we scale the variables. Figure 13.1 has plots of the first two principal
components for each of these possibilities. Note that there is a stark difference be-
tween the plots based on adjusted and unadjusted data. The unadjusted plots show
a clear separation based on species, while the adjusted plots have the species totally
mixed, which would be expected because there are differences in means between
the species. Adjusting hides those differences. There are less obvious differences
between the scaled and unscaled plots within adjusted/unadjusted pairs. For the ad-
justed data, the unscaled plot seems to have fairly equal spreads for the three species,
while the scaled data has the virginica observations more spread out than the other
two species.
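As a sketch of how these four versions can be computed in R (our own code, using the iris data frame built into R; the object and function names are ours, and the adjusted version simply takes the covariance of the species-centered data):
x <- as.matrix(iris[,1:4])                  # sepal/petal measurements
species <- iris[,5]
xadj <- x - apply(x, 2, function(col) ave(col, species))  # subtract species means
pc2 <- function(z, scl=FALSE) {             # first two principal components
  z <- scale(z, center=TRUE, scale=scl)     # scl=TRUE scales to unit variance
  z %*% eigen(var(z))$vectors[,1:2]
}
plot(pc2(x), pch=c('s','v','g')[species])   # unadjusted, unscaled; similarly:
# pc2(x,TRUE); pc2(xadj); pc2(xadj,TRUE)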
The table below shows the sample variances, s2 , and first principal component’s
loadings (sample eigenvector), PC1 , for each of the four sets of principal components:
Figure 13.1: Plots of the first two principal components for the iris data, depending on whether adjusting for species and whether scaling the variables to unit variance. For the individual points, “s” indicates setosa, “v” indicates versicolor, and “g” indicates virginica.

Unadjusted Adjusted
Unscaled Scaled Unscaled Scaled
s2 PC1 s2 PC1 s2 PC1 s2 PC1
Sepal Length 0.69 0.36 1 0.52 0.26 0.74 1 −0.54 (13.1)
Sepal Width 0.19 −0.08 1 −0.27 0.11 0.32 1 −0.47
Petal Length 3.12 0.86 1 0.58 0.18 0.57 1 −0.53
Petal Width 0.58 0.36 1 0.56 0.04 0.16 1 −0.45

Note that whether adjusted or not, the relative variances of the variables affect
the relative weighting they have in the principal component. For example, for the
unadjusted data, petal length has the highest variance in the unscaled data, and
receives the highest loading in the eigenvector. That is, the first principal component is primarily petal length. But for the scaled data, all variables are forced to have
the same variance, and now the loadings of the variables are much more equal. The
opposite holds for sepal width. A similar effect is seen for the adjusted data. The sepal length has the highest unscaled variance and highest loading in PC1, and petal width the lowest variance and loading. But scaled, the loadings are approximately equal.

Figure 13.2: The left-hand plot is a scree plot (i versus li) of the eigenvalues for the automobile data. The right-hand plot shows i versus log(li/li+1), the successive log-proportional gaps.
Any of the four sets of principal components is reasonable. Which to use depends
on what one is interested in, e.g., if wanting to distinguish between species, the
unadjusted plots are likely more interesting, while when interested in relations within
species, adjusting makes sense. We mention that in cases where the units are vastly
different for the variables, e.g., population in thousands and areas in square miles of
cities, leaving the data unscaled is less defensible.

13.1.2 Choosing the number of principal components


One obvious question is, “How does one choose p?” Unfortunately, there is not
any very good answer. In fact, it is probably not even a good question, because the
implication of the question is that once we have p, we can proceed using just the
first p principal components, and throw away the remainder. Rather, we take a more
modest approach and ask “Which principal components seem to be worth exploring
further?” A key factor is whether the component has a reasonable interpretation. Of
course, nothing prevents you from looking at as many as you have time for.
The most common graphical technique for deciding on p is the scree plot, in
which the sample eigenvalues are plotted versus their indices. (A scree is a pile of
small stones at the bottom of a cliff.) Consider Example 12.3.1 on the automobile data,
here using the n = 96 autos with trunk space, and all q = 11 variables. Scaling the
variables so that they have unit sample variance, we obtain the sample eigenvalues

6.210, 1.833, 0.916, 0.691, 0.539, 0.279, 0.221, 0.138, 0.081, 0.061, 0.030. (13.2)

The scree plot is the first one in Figure 13.2. Note that there is a big drop from the
first to the second eigenvalue. There is a smaller drop to the third, then the values

seem to level off. Other simple plots can highlight the gaps. For example, the second
plot in the figure shows the logarithms of the successive proportional drops via

log(ratioi) ≡ log(li / li+1).   (13.3)

The biggest drops are again from #1 to #2, and #2 to #3, but there are almost as large
proportional drops at the fifth and tenth stages.
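A quick sketch of how these two plots can be produced in R, typing in the eigenvalues from (13.2) (the object names are ours):
evals <- c(6.210,1.833,0.916,0.691,0.539,0.279,0.221,0.138,0.081,0.061,0.030)
par(mfrow=c(1,2))
plot(evals, xlab='Index', ylab='Eigenvalue', type='b')     # scree plot
logratio <- log(evals[-length(evals)]/evals[-1])           # log(li/li+1)
plot(logratio, xlab='Index', ylab='log(ratio)', type='b')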
One may have outside information or requirements that aid in choosing the com-
ponents. For example, there may be a reason one wishes a certain number of com-
ponents (say, three if the next step is a three-dimensional plot), or to have as few
components as possible in order to achieve a certain percentage (e.g., 95%) of the to-
tal variance. If one has an idea that the measurement error for the observed variables
is c, then it makes sense to take just the principal components that have eigenvalue
significantly greater than c2 . Or, as in the iris data, all the data is accurate just to one
decimal place, so that taking c = 0.05 is certainly defensible.
To assess significance, assume that

U ∼ Wishartq(ν, Σ), and S = (1/ν) U,   (13.4)
where ν > q and Σ is invertible. Although we do not necessarily expect this dis-
tribution to hold in practice, it will help develop guidelines to use. Let the spectral
decompositions of S and Σ be

S = GLG and Σ = ΓΛΓ  , (13.5)

where G and Γ are orthogonal, and L and Λ are diagonal with nonincreasing diagonal
elements (the eigenvalues), as in Theorem 1.1. The eigenvalues of S will be distinct
with probability 1. If we assume that the eigenvalues of Σ are also distinct, then
Theorem 13.5.1 in Anderson [1963] shows that for large ν, the sample eigenvalues are
approximately independent, and li ≈ N (λi , 2λ2i /ν). If components with λi ≤ c2 are
ignorable, then it is reasonable to ignore the li for which

√ν (li − c²)/(√2 li) < 2,   equivalently,   li < c²/(1 − 2√2/√ν).   (13.6)

(One may be tempted to take c = 0, but if any λi = 0, then the corresponding li will
be zero as well, so that there is no need for hypothesis testing.) Other test statistics
(or really “guidance statistics”) can be easily derived, e.g., to see whether the average
of the k smallest eigenvalues is less than c2, or the sum of the first p is greater than
some other cutoff.
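For example, a small sketch of the rule in (13.6) (our own code; note the cutoff requires ν > 8 so that the denominator is positive):
# Indices of sample eigenvalues deemed larger than c^2, per (13.6)
keep.components <- function(l, nu, c) {
  cutoff <- c^2/(1 - 2*sqrt(2)/sqrt(nu))
  which(l >= cutoff)
}
For the iris data, with measurements accurate to one decimal place, one might take c = 0.05.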

13.1.3 Estimating the structure of the component spaces


If the eigenvalues of Σ are distinct, then the spectral decomposition (13.5) splits
the q-dimensional space into q orthogonal one-dimensional spaces. If, say, the first
two eigenvalues are equal, then the first two subspaces are merged into one two-
dimensional subspace. That is, there is no way to distinguish between the top two
dimensions. At the extreme, if all eigenvalues are equal, in which case Σ = λIq , there

is no statistically legitimate reason to distinguish any principal components. More


generally, suppose there are K distinct values among the λi ’s, say
α1 > α2 > · · · > α K , (13.7)
where q k of the λi ’s are equal to αk :

λ 1 = · · · = λ q 1 = α1 ,
λ q 1 + 1 = · · · = λ q 1 + q 2 = α2 ,
..
.
λq1 +···+qK −1+1 = · · · = λq = αK . (13.8)
Then the space is split into K orthogonal subspaces, of dimensions q1 , . . . , q K ,
where q = q1 + · · · + q K . The vector (q1 , . . . , q K ) is referred to as the pattern of equal-
ities among the eigenvalues. Let Γ be an orthogonal matrix containing eigenvectors
as in (13.5), and partition it as
 
Γ = ( Γ1 Γ2 · · · ΓK ),   Γk is q × qk,   (13.9)
so that Γ k contains the eigenvectors for the q k eigenvalues that equal αk . These are
not unique because Γ k J for any q k × q k orthogonal matrix J will also yield a set of
eigenvectors for those eigenvalues. The subspaces have corresponding projection
matrices P1 , . . . , PK , which are unique, and we can write
Σ = ∑_{k=1}^K αk Pk,   where Pk = Γk Γk′.   (13.10)

With this structure, the principal components can be defined only in groups, i.e., the
first q1 of them represent one group, which have higher variance than the next group
of q2 components, etc., down to the final q K components. There is no distinction
within a group, so that one would take either the top q1 components, or the top
q1 + q2 , or the top q1 + q2 + q3 , etc.
Using the distributional assumption (13.4), we find the Bayes information criterion
to choose among the possible patterns (13.8) of equality. The best set can then be used
in plots such as in Figure 13.3, where the gaps will be either enhanced (if large) or
eliminated (if small). The model (13.8) will be denoted M( q1 ,...,qK ) . Anderson [1963]
(see also Section 12.5) shows the following.
Theorem 13.1. Suppose (13.4) holds, and S and Σ have spectral decompositions as in (13.5). Then the MLE of Σ under the model M(q1,...,qK) is given by Σ̂ = G Λ̂ G′, where the λ̂i’s are found by averaging the relevant li’s:

λ̂1 = · · · = λ̂q1 = α̂1 = (l1 + · · · + lq1)/q1,
λ̂q1+1 = · · · = λ̂q1+q2 = α̂2 = (lq1+1 + · · · + lq1+q2)/q2,
   ⋮
λ̂q1+···+q(K−1)+1 = · · · = λ̂q = α̂K = (lq1+···+q(K−1)+1 + · · · + lq)/qK.   (13.11)

The number of free parameters is


d(q1, . . . , qK) = (1/2)(q² − ∑_{k=1}^K qk²) + K.   (13.12)

The deviance can then be taken to be


deviance(M(q1,...,qK)(Σ̂); S) = ν ∑_{i=1}^q log(λ̂i) = ν ∑_{k=1}^K qk log(α̂k).   (13.13)

See Exercise 13.4.3. Using (13.12), we have


BIC(M(q1,...,qK)) = ν ∑_{k=1}^K qk log(α̂k) + log(ν) d(q1, . . . , qK).   (13.14)

13.1.4 Example: Automobile data


Let S be the scaled covariance matrix for the automobiles with trunks, described in
Section 12.3.1. Equation (13.2) and Figure 13.2 exhibit the eigenvalues of S; they are denoted lj in (13.15). We first illustrate the model (13.8) with pattern (1, 1, 3, 3, 2, 1). The MLE’s of the eigenvalues are then found by averaging the third through fifth, the sixth through eighth, and the ninth and tenth, and leaving the others alone, denoted below by the λ̂j’s:

j     1     2     3     4     5
lj    6.210 1.833 0.916 0.691 0.539
λ̂j    6.210 1.833 0.716 0.716 0.716
                                            (13.15)
j     6     7     8     9     10    11
lj    0.279 0.221 0.138 0.081 0.061 0.030
λ̂j    0.213 0.213 0.213 0.071 0.071 0.030

With ν = n − 1 = 95,

deviance(M(1,1,3,3,2,1)(Σ̂); S) = 95 ∑_j log(λ̂j) = −1141.398,   (13.16)

and d(1, 1, 3, 3, 2, 1) = 54, hence

BIC(M(1,1,3,3,2,1)) = −1141.398 + log(95) × 54 = −895.489.   (13.17)
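The book’s function pcbic (Section A.5.1) carries out this calculation; the following is merely a minimal sketch of the same idea, with our own function name, to show where the numbers come from:
# Deviance, dimension, and BIC for a pattern of equal eigenvalues, as in (13.11)-(13.14)
pattern.bic <- function(evals, nu, pattern) {
  groups <- rep(seq_along(pattern), pattern)   # block membership of each eigenvalue
  lambdahat <- ave(evals, groups)              # averages within blocks, as in (13.11)
  dev <- nu*sum(log(lambdahat))                # (13.13)
  d <- (length(evals)^2 - sum(pattern^2))/2 + length(pattern)   # (13.12)
  c(deviance=dev, dim=d, bic=dev + log(nu)*d)  # (13.14)
}
pattern.bic(evals, 95, c(1,1,3,3,2,1))   # evals as in (13.2); compare (13.16) and (13.17)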

Table 13.1 contains a number of models, one each for K from 1 to 11. Each pattern after the first was chosen to be the best that can be obtained from the previous one by summing two consecutive qk’s. The BIC-based estimated probabilities are given in the table. Clearly, the preferred model is the one whose MLE is given in (13.15). Note that the assumption
(13.4) is far from holding here, both because the data are not normal, and because
we are using a correlation matrix rather than a covariance matrix. We are hoping,
though, that in any case, the BIC is a reasonable balance of the fit of the model on the
eigenvalues and the number of parameters.

Pattern d BIC BIC − min(BIC) PBIC


(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) 66 −861.196 34.292 0.000
(1, 1, 1, 1, 1, 2, 1, 1, 1, 1) 64 −869.010 26.479 0.000
(1, 1, 1, 2, 2, 1, 1, 1, 1) 62 −876.650 18.839 0.000
(1, 1, 3, 2, 1, 1, 1, 1) 59 −885.089 10.400 0.004
(1, 1, 3, 2, 1, 2, 1) 57 −892.223 3.266 0.159
(1, 1, 3, 3, 2, 1) 54 −895.489 0.000 0.812
(1, 1, 3, 3, 3) 51 −888.502 6.987 0.025
(1, 4, 3, 3) 47 −870.824 24.665 0.000
(1, 4, 6) 37 −801.385 94.104 0.000
(5, 6) 32 −657.561 237.927 0.000
(11) 1 4.554 900.042 0.000

Table 13.1: The BIC’s for the sequence of principal component models for the auto-
mobile data.

Figure 13.3: Plots of j versus the sample lj’s, and j versus the MLE’s λ̂j’s for the chosen model.

Figure 13.3 shows the scree plots, using logs, of the sample eigenvalues and the
fitted ones from the best model. Note that the latter gives more aid in deciding how
many components to choose because the gaps are enhanced or eliminated. That is,
taking one or two components is reasonable, but because there is little distinction
among the next three, one may as well take all or none of those three. Similarly with
numbers six, seven and eight.
What about interpretations? Below we have the first five principal component

loadings, multiplied by 100 then rounded off:


PC1 PC2 PC3 PC4 PC5
Length 36 −23 3 −7 26
Wheelbase 37 −20 −11 6 20
Width 35 −29 19 −11 1
Height 25 41 −41 10 −10
FrontHd 19 30 68 45 −16
(13.18)
RearHd 25 47 1 28 6
FrtLegRoom 10 −49 −30 69 −37
RearSeating 30 26 −43 2 18
FrtShld 37 −16 20 −11 10
RearShld 38 −1 7 −13 12
Luggage 28 6 −2 −43 −81
The first principal component has fairly equal positive loadings for all variables, in-
dicating an overall measure of bigness. The second component tends to have positive
loadings for tallness (height, front headroom, rear headroom), and negative loadings
for the length and width-type variables. This component then measures tallness rela-
tive to length and width. The next three may be harder to interpret. Numbers 3 and 4
could be front seat versus back seat measurements, and number 5 is mainly luggage
space. But from the analysis in (13.15), we have that from a statistical significance
point of view, there is no distinction among the third through fifth components, that
is, any rotation of them is equally important. Thus we might try a varimax rotation
on the three vectors, to aid in interpretation. (See Section 10.3.2 for a description of
varimax.) The R function varimax will do the job. The results are below:
PC3∗ PC4∗ PC5∗
Length −4 −20 18
Wheelbase −11 0 21
Width 13 −17 −6
Height −32 29 −1
FrontHd 81 15 7
(13.19)
RearHd 11 18 19
FrtLegRoom 4 83 6
RearSeating −42 10 18
FrtShld 12 −22 1
RearShld 0 −19 3
Luggage −4 8 −92
These three components are easy to interpret, weighting heavily on front headroom,
front legroom, and luggage space, respectively.
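A sketch of the call (using the eg object constructed in the Using R subsection below; the exact numbers depend on sign conventions and on varimax’s Kaiser normalization, so they may differ slightly from the table):
rot <- varimax(eg$vectors[,3:5])     # rotate the third through fifth loading vectors
round(100*unclass(rot$loadings))     # compare with (13.19)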
Figure 13.4 plots the first two principal components. The horizontal axis repre-
sents the size, going from the largest at the left to the smallest at the right. The
vertical axis has tall/narrow cars at the top, and short/wide at the bottom. We also
performed model-based clustering (Section 12.3) using just these two variables. The
best clustering has two groups, whose covariance matrices have the same eigenval-
ues but different eigenvectors (“EEV” in (12.35)), indicated by the two ellipses, which
have the same size and shape, but have different orientations. These clusters are rep-
resented in the plot as well. We see the clustering is defined mainly by the tall/wide
variable.
Figure 13.4: The first two principal component variables for the automobile data (excluding sports cars and minivans), clustered into two groups.

Using R
In Section 12.3.1 we created cars1, the reduced data set. To center and scale the data, so that the means are zero and the variances are one, use
xcars <- scale(cars1)
The following obtains the eigenvalues and eigenvectors of S:
eg <- eigen(var(xcars))
The eigenvalues are in eg$values, and the matrix of eigenvectors is in eg$vectors. To find the deviance and BIC for the pattern (1, 1, 3, 3, 2, 1) seen in (13.15) and (13.17), we use the function pcbic (detailed in Section A.5.1):
pcbic(eg$values,95,c(1,1,3,3,2,1))
In Section A.5.2 we present the function pcbic.stepwise, which uses the stepwise procedure to calculate the elements in Table 13.1:
pcbic.stepwise(eg$values,95)

13.1.5 Principal components and factor analysis


Factor analysis and principal components have some similarities and some differ-
ences. Recall the factor analysis model with p factors in (10.58). Taking the mean to
be 0 for simplicity, we have
Y = Xβ + R, (13.20)
where X and R are independent, with

X ∼ N (0, In ⊗ I p ) and R ∼ N (0, In ⊗ Ψ), Ψ diagonal. (13.21)



For principal components, where we take the first p components, partition Γ and
Λ in (13.5) as

Γ = ( Γ1  Γ2 )   and   Λ = [ Λ1  0 ; 0  Λ2 ].   (13.22)
Here, Γ1 is q × p, Γ2 is q × (q − p), Λ1 is p × p, and Λ2 is (q − p) × (q − p), the Λk ’s
being diagonal. The large eigenvalues are in Λ1, the small ones are in Λ2. Because Iq = ΓΓ′ = Γ1Γ1′ + Γ2Γ2′, we can write

Y = YΓ1Γ1′ + YΓ2Γ2′ = Xβ + R,   (13.23)

where

X = YΓ1 ∼ N(0, In ⊗ ΣX), β = Γ1′, and R = YΓ2Γ2′ ∼ N(0, In ⊗ ΣR).   (13.24)

Because Γ1′Γ2 = 0, X and R are again independent. We also have (Exercise 13.4.4)

ΣX = Γ1′ ΓΛΓ′ Γ1 = Λ1, and ΣR = Γ2Γ2′ ΓΛΓ′ Γ2Γ2′ = Γ2Λ2Γ2′.   (13.25)

Comparing these covariances to the factor analytic ones in (13.20), we see the follow-
ing:
                        ΣX      ΣR
Factor analysis         Ip      Ψ              (13.26)
Principal components    Λ1      Γ2Λ2Γ2′
The key difference is in the residuals. Factor analysis chooses the p-dimensional X so
that the residuals are uncorrelated, though not necessarily small. Thus the correlations
among the Y’s are explained by the factors X. Principal components chooses the p-
dimensional X so that the residuals are small (the variances sum to the sum of the
(q − p) smallest eigenvalues), but not necessarily uncorrelated. Much of the variance
of the Y is explained by the components X.
A popular model that fits into both frameworks is the factor analytic model (13.20)
with the restriction that
Ψ = σ2 Iq , σ2 “small.” (13.27)
The interpretation in principal components is that the X contains the important infor-
mation in Y, while the residuals R contain just random measurement error. For factor
analysis, we have that the X explains the correlations among the Y, and the residuals
happen to have the same variances. In this case, we have

Σ = β′ΣXX β + σ²Iq.   (13.28)

Because β is p × q, there are at most p positive eigenvalues for β′ΣXX β. Call these λ1* ≥ λ2* ≥ · · · ≥ λp*, and let Λ1* be the p × p diagonal matrix with diagonals λi*. Then the spectral decomposition is

β′ΣXX β = Γ [ Λ1*  0 ; 0  0 ] Γ′   (13.29)

for some orthogonal Γ. But any orthogonal matrix contains eigenvectors for Iq = ΓΓ′, hence Γ is also an eigenvector matrix for Σ:

Σ = ΓΛΓ′ = Γ [ Λ1* + σ²Ip  0 ; 0  σ²Iq−p ] Γ′.   (13.30)

Thus the eigenvalues of Σ are

λ1∗ + σ2 ≥ λ2∗ + σ2 ≥ · · · ≥ λ∗p + σ2 ≥ σ2 = · · · = σ2 , (13.31)

and the eigenvectors for the first p eigenvalues are the columns of Γ1 . In this case the
factor space and the principal component space are the same. In fact, if the λ∗j are
distinct and positive, the eigenvalues (13.30) satisfy the structural model (13.8) with
pattern (1, 1, . . . , 1, q − p). A common approach to choosing p is to use hypothesis
testing on such models to find the smallest p for which the model fits. See Anderson
[1963] or Mardia, Kent, and Bibby [1979]. Of course, AIC or BIC could be used as
well.

13.1.6 Justification of the principal component MLE, Theorem 13.1


We first find the MLE of Σ, and the maximal value of the likelihood, for U as in (13.4),
where the eigenvalues of Σ satisfy (13.8). We know from (10.1) that the likelihood for
S is
L(Σ; S) = |Σ|^{−ν/2} e^{−(ν/2) trace(Σ^{−1} S)}.   (13.32)
For the general model, i.e., where there is no restriction among the eigenvalues of Σ
(M(1,1,...,1) ), the MLE of Σ is S.
Suppose there are nontrivial restrictions (13.8). Write Σ and S in their spectral
decomposition forms (13.5) to obtain
L(Σ; S) = |Λ|^{−ν/2} e^{−(ν/2) trace((ΓΛΓ′)^{−1} GLG′)}
        = (∏ λi)^{−ν/2} e^{−(ν/2) trace(Λ^{−1} Γ′GLG′Γ)}.   (13.33)

Because of the multiplicities in the eigenvalues, the Γ is not uniquely determined


from Σ, but any orthogonal matrix that maximizes the likelihood is adequate.
We start by fixing Λ, and maximizing the likelihood over the Γ. Set

F = G′Γ,   (13.34)

which is also orthogonal, and note that


−(1/2) trace(Λ^{−1} F′LF) = −(1/2) ∑_{i=1}^q ∑_{j=1}^q fij² (li/λj)
                          = ∑_{i=1}^q ∑_{j=1}^q fij² li βj − (1/(2λq)) ∑_{i=1}^q li,   (13.35)

where

βj = −1/(2λj) + 1/(2λq).   (13.36)
(The 1/(2λq ) is added because we need the β j ’s to be nonnegative in what follows.)
Note that the last term in (13.35) is independent of F. We do summation by parts by
letting

δi = β i − β i+1 , di = li − li+1 , 1 ≤ i ≤ q − 1, δq = β q , dq = lq , (13.37)



so that
q q
li = ∑ dk and β j = ∑ δm . (13.38)
k=i m= j

Because the li’s and λi’s are nonincreasing in i, the li’s are positive, and by (13.36)
the β j ’s are also nonnegative, we have that the δi ’s and di ’s are all nonnegative. Using
(13.38) and interchanging the orders of summation, we have
∑_{i=1}^q ∑_{j=1}^q fij² li βj = ∑_{i=1}^q ∑_{j=1}^q ∑_{k=i}^q ∑_{m=j}^q fij² dk δm
                              = ∑_{k=1}^q ∑_{m=1}^q dk δm ∑_{i=1}^k ∑_{j=1}^m fij².   (13.39)

Because F is an orthogonal matrix,

∑_{i=1}^k ∑_{j=1}^m fij² ≤ min{k, m}.   (13.40)

Also, with F = Iq, the fii = 1, hence the expression in (13.40) is an equality. By the nonnegativity of the dk’s and δm’s, the sum in (13.39), hence in (13.35), is maximized (though not uniquely) by taking F = Iq. Working back from (13.34), the (not necessarily unique) maximizer over Γ of (13.35) is

Γ̂ = G.   (13.41)

Thus the maximum over Γ in the likelihood (13.33) is


L(GΛG′; S) = (∏ λi)^{−ν/2} e^{−(ν/2) ∑(li/λi)}.   (13.42)

Break up the product according to the equalities (13.8):

∏_{k=1}^K αk^{−qk ν/2} e^{−(ν/2)(tk/αk)},   (13.43)

where
t1 = ∑_{i=1}^{q1} li   and   tk = ∑_{i=q1+···+q(k−1)+1}^{q1+···+qk} li  for 2 ≤ k ≤ K.   (13.44)

It is easy to maximize over each αk in (13.43), which proves that (13.11) is indeed the
MLE of the eigenvalues. Thus with (13.41), we have the MLE of Σ as in Theorem 13.1.
We give a heuristic explanation of the dimension (13.12) of the model M( q1 ,...qK ) in
(13.8). To describe the model, we need the K distinct parameters among the λi ’s, as
well as the K orthogonal subspaces that correspond to the distinct values of λi . We
start by counting the number of free parameters needed to describe an s-dimensional
subspace of a t-dimensional space, s < t. Any such subspace can be described by a
t × s basis matrix B, that is, the columns of B comprise a basis for the subspace. (See
Section 5.2.) The basis is not unique, in that BA for any invertible s × s matrix A is
also a basis matrix, and in fact any basis matrix equals BA for some such A. Take A

to be the inverse of the top s × s submatrix of B, so that BA has Is as its top s × s part.
This matrix has (t − s) × s free parameters, represented in the bottom (t − s) × s part
of it, and is the only basis matrix with Is at the top. Thus the dimension is (t − s) × s.
(If the top part of B is not invertible, then we can find some other subset of s rows to
use.)
Now for model (13.8), we proceed stepwise. There are q1 (q2 + · · · + q K ) parame-
ters needed to specify the first q1 -dimensional subspace. Next, focus on the subspace
orthogonal to that first one. It is (q2 + · · · + q K )-dimensional, hence to describe the
second, q2 -dimensional, subspace within that, we need q2 × (q3 + · · · + q K ) parame-
ters. Continuing, the total number of parameters is

q1(q2 + · · · + qK) + q2(q3 + · · · + qK) + · · · + q(K−1) qK = (1/2)(q² − ∑_{k=1}^K qk²).   (13.45)

Adding K for the distinct λi ’s, we obtain the dimension in (13.12).

13.2 Multidimensional scaling


Given n objects and defined distances, or dissimilarities, between them, multidimensional scaling tries to mimic the dissimilarities as closely as possible by representing the objects in p-dimensional Euclidean space, where p is fairly small. That is, suppose o1, . . . , on are the objects, and d(oi, oj) is the dissimilarity between the i-th and j-th objects, as in (12.19). Let Δ be the n × n matrix of d²(oi, oj)’s. The goal is to find 1 × p vectors x̂1, . . . , x̂n so that

Δij = d²(oi, oj) ≈ ‖x̂i − x̂j‖².   (13.46)

Then the x̂i’s are plotted in R^p, giving an approximate visual representation of the original dissimilarities.
There are a number of approaches. Our presentation here follows that of Mar-
dia, Kent, and Bibby [1979], which provides more in-depth coverage. We will start
with the case that the original dissimilarities are themselves Euclidean distances, and
present the so-called classical solution. Next, we exhibit the classical solution when
the distances may not be Euclidean. Finally, we briefly mention the nonmetric ap-
proach.

13.2.1 Δ is Euclidean: The classical solution


Here we assume that object oi has associated a 1 × q vector yi , and let Y be the n × q
matrix with yi as the i th row. Of course, now Y looks like a regular data matrix. It
might be that the objects are observations (people), or they are variables, in which
case this Y is really the transpose of the usual data matrix. Whatever the case,

d²(oi, oj) = ‖yi − yj‖².   (13.47)

For any n × p matrix X with rows xi, define Δ(X) to be the n × n matrix of ‖xi − xj‖²’s, so that (13.47) can be written Δ = Δ(Y).
The classical solution looks for x̂i’s in (13.46) that are based on rotations of the yi’s, much like principal components. That is, suppose B is a q × p matrix with

orthonormal columns, and set

x̂i = yi B, and X̂ = YB.   (13.48)

The objective is then to choose the B that minimizes

∑∑_{1≤i<j≤n} | ‖yi − yj‖² − ‖x̂i − x̂j‖² |   (13.49)

over B. Exercise 13.4.9 shows that ‖yi − yj‖² > ‖x̂i − x̂j‖², which means that the absolute values in (13.49) can be removed. Also, note that the sum of the ‖yi − yj‖²’s is independent of B, and by symmetry, minimizing (13.49) is equivalent to maximizing

∑_{i=1}^n ∑_{j=1}^n ‖x̂i − x̂j‖² ≡ 1n′ Δ(X̂) 1n.   (13.50)

The next lemma is useful in relating the Δ(Y) to the deviations.


Lemma 13.1. Suppose X is n × p. Then

Δ(X) = a 1n′ + 1n a′ − 2 Hn X X′ Hn,   (13.51)

where a is the n × 1 vector with ai = ‖xi − x̄‖², x̄ is the mean of the xi’s, and Hn is the centering matrix (1.12).

Proof. Write

‖xi − xj‖² = ‖(xi − x̄) − (xj − x̄)‖²
           = ai − 2(Hn X)i (Hn X)j′ + aj,   (13.52)

from which we obtain (13.51).

By (13.50) and (13.51) we have

∑_{i=1}^n ∑_{j=1}^n ‖x̂i − x̂j‖² = 1n′ Δ(X̂) 1n
  = 1n′ (â 1n′ + 1n â′ − 2 Hn X̂X̂′ Hn) 1n
  = 2n 1n′ â = 2n ∑_{i=1}^n ‖x̂i − x̂̄‖²,   (13.53)

because Hn 1n = 0. But then

∑_{i=1}^n ‖x̂i − x̂̄‖² = trace(X̂′ Hn X̂)
                     = trace(B′ Y′ Hn Y B).   (13.54)

Maximizing (13.54) over B is a principal components task. That is, as in Lemma 1.3, this trace is maximized by taking B = G1, the first p eigenvectors of Y′Hn Y. To summarize:

Proposition 13.1. If Δ = Δ(Y), then the classical solution of the multidimensional scaling problem for given p is X̂ = YG1, where the columns of G1 consist of the first p eigenvectors of Y′Hn Y.
If one is interested in the distances between variables, so that the distances of interest are in Δ(Y′) (note the transpose), then the classical solution uses the first p eigenvectors of Y Hq Y′.

13.2.2 Δ may not be Euclidean: The classical solution


Here, we are given only the n × n dissimilarity matrix Δ. The dissimilarities may or
may not arise from Euclidean distance on vectors yi , but the solution acts as if they
do. That is, we assume there is an n × q matrix Y such that Δ = Δ(Y), but we do not
observe the Y, nor do we know the dimension q. The first step in the process is to

derive the Y from the Δ(Y), then we apply Proposition 13.1 to find the X̂.
It turns out that we can assume any value of q as long as it is larger than the
values of p we wish to entertain. Thus we are safe taking q = n. Also, note that using
(13.51), we can see that Δ(Hn X) = Δ(X), which implies that the sample mean of Y is
indeterminate. Thus we may as well assume the mean is zero, i.e.,
Hn Y = Y, (13.55)
so that (13.51) yields
Δ = Δ(Y) = a 1n′ + 1n a′ − 2YY′.   (13.56)
To eliminate the a’s, we can pre- and post-multiply by Hn :
Hn Δ Hn = Hn (a 1n′ + 1n a′ − 2YY′) Hn = −2YY′,   (13.57)

hence

YY′ = −(1/2) Hn Δ Hn.   (13.58)
Now consider the spectral decomposition (1.33) of YY′,

YY′ = JLJ′,   (13.59)

where the orthogonal J = (j1, . . . , jn) contains the eigenvectors, and the diagonal L contains the eigenvalues. Separating the matrices, we can take

Y = JL^{1/2} = ( √l1 j1  √l2 j2  · · ·  √ln jn ).   (13.60)
Now we are in the setting of Section 13.2.1, hence by Proposition 13.1, we need G1, the first p eigenvectors of

Y′Hn Y = (JL^{1/2})′(JL^{1/2}) = L^{1/2} J′J L^{1/2} = L.   (13.61)

But the matrix of eigenvectors is just In, hence we take the first p columns of Y in (13.60):

X̂ = ( √l1 j1  √l2 j2  · · ·  √lp jp ).   (13.62)
It could be that Δ is not Euclidean, that is, there is no Y for which Δ = Δ(Y).
In this case, the classical solution uses the same algorithm as in equations (13.58) to
(13.62). A possible glitch is that some of the eigenvalues may be negative, but if p is
small, the problem probably won’t raise its ugly head.
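A sketch of the algorithm in (13.58) through (13.62), starting from a matrix Delta of squared dissimilarities (our own code; R’s built-in cmdscale performs classical scaling starting from the unsquared distances):
classical.mds <- function(Delta, p=2) {
  n <- nrow(Delta)
  H <- diag(n) - matrix(1/n, n, n)        # the centering matrix Hn
  B <- -H %*% Delta %*% H/2               # YY' as in (13.58)
  ee <- eigen(B, symmetric=TRUE)
  lp <- pmax(ee$values[1:p], 0)           # guard against small negative eigenvalues
  ee$vectors[,1:p] %*% diag(sqrt(lp), nrow=p)   # Xhat as in (13.62)
}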

13.2.3 Nonmetric approach


The original approach to multidimensional scaling attempts to find the X̂ that gives the same ordering of the observed dissimilarities, rather than trying to match the actual values of the dissimilarities. See Kruskal [1964]. That is, take the t ≡ (n choose 2) pairwise dissimilarities and order them:

d(oi1, oj1) ≤ d(oi2, oj2) ≤ · · · ≤ d(oit, ojt).   (13.63)

The ideal would be to find the x̂i’s so that

‖x̂i1 − x̂j1‖² ≤ ‖x̂i2 − x̂j2‖² ≤ · · · ≤ ‖x̂it − x̂jt‖².   (13.64)

That might not be (actually probably is not) possible for given p, so instead one finds the X̂ that comes as close as possible, where close is measured by some “stress” function. A popular stress function is given by

Stress²(X̂) = ∑∑_{1≤i<j≤n} (‖x̂i − x̂j‖² − d*ij)² / ∑∑_{1≤i<j≤n} ‖x̂i − x̂j‖⁴,   (13.65)

where the d∗ij ’s are constants that have the same ordering as the original dissimilarities
d(oi , o j )’s in (13.63), and among such orderings minimize the stress. See Johnson and
Wichern [2007] for more details and some examples. The approach is “nonmetric”
because it does not depend on the actual d’s, but just their order.

13.2.4 Examples: Grades and sports


The examples here all start with Euclidean distance matrices, and use the classical
solution, so everything is done using principal components.
In Section 12.6.1 we clustered the five variables for the grades data, (homework,
labs, inclass, midterms, final), for the n = 107 students. Here we find the multidi-
mensional scaling plot. The distance between two variables is the sum of squares of
the difference in scores the people obtained for them. So we take the transpose of the
data. The following code finds the plot.
ty <- t(grades[,2:6])
ty <- scale(ty,scale=F)
ev <- eigen(var(ty))$vectors[,1:2]
tyhat <- ty%*%ev
lm <- range(tyhat)*1.1
plot(tyhat,xlim=lm,ylim=lm,xlab='Var 1',ylab='Var 2',type='n')
text(tyhat,labels=dimnames(ty)[[1]])
To plot the variables’ names rather than points, we first create the plot with no plot-
ting: type='n'. Then text plots the characters in the labels parameter, which we give
as the names of the first dimension of ty. The results are in Figure 13.5. Notice how
the inclass variable is separated from the others, homeworks and labs are fairly close,
and midterms and final are fairly close together, not surprisingly given the clustering.
Figure 13.5 also has the multidimensional scaling plot of the seven sports from
the Louis Roussos sports data in Section 12.1.1, found by substituting sportsranks for
grades[,2:6] in the R code. Notice that the first variable orders the sports according
Figure 13.5: Multidimensional scaling plot of the grades’ variables (left) and the sports’ variables (right).

to how many people typically participate, i.e., jogging, swimming and cycling can
be done solo, tennis needs two to four people, basketball has five per team, baseball
nine, and football eleven. The second variable serves mainly to separate jogging from
the others.

13.3 Canonical correlations


Testing the independence of two sets of variables in the multivariate normal dis-
tribution is equivalent to testing their covariance is zero, as in Section 10.2. When
there is no independence, one may wish to know where the lack of independence
lies. A projection-pursuit approach is to find the linear combination of the first set
which is most highly correlated with a linear combination of the second set, hoping
to isolate the factors within each group that explain a substantial part of the overall
correlations.
The distributional assumption is based on partitioning the 1 × q vector Y into Y1 (1 × q1) and Y2 (1 × q2), with

Y = ( Y1  Y2 ) ∼ N(μ, Σ),   where Σ = [ Σ11  Σ12 ; Σ21  Σ22 ],   (13.66)

Σ11 is q1 × q1 and Σ22 is q2 × q2. If α (q1 × 1) and β (q2 × 1) are coefficient vectors, then

Cov[Y1α, Y2β] = α′Σ12β,
Var[Y1α] = α′Σ11α, and
Var[Y2β] = β′Σ22β,   (13.67)

hence

Corr[Y1α, Y2β] = α′Σ12β / √(α′Σ11α · β′Σ22β).   (13.68)

Analogous to principal component analysis, the goal in canonical correlation analysis


is to maximize the correlation (13.68) over α and β. Equivalently, we could maximize
the covariance α′Σ12β in (13.68) subject to the two variances equaling one. This pair
of linear combination vectors may not explain all the correlations between Y1 and Y2 ,
hence we next find the maximal correlation over linear combinations uncorrelated
with the first combinations. We continue, with each pair maximizing the correlation
subject to being uncorrelated with the previous.
The precise definition is below. Compare it to Definition 1.2 for principal compo-
nents.
Definition 13.1. Canonical correlations. Assume (Y1, Y2) are as in (13.66), where Σ11 and Σ22 are invertible, and set m = min{q1, q2}. Let α1, . . . , αm be a set of q1 × 1 vectors, and β1, . . . , βm be a set of q2 × 1 vectors, such that

(α1, β1) is any (α, β) that maximizes α′Σ12β over α′Σ11α = β′Σ22β = 1;
(α2, β2) is any (α, β) that maximizes α′Σ12β over α′Σ11α = β′Σ22β = 1,
    α′Σ11α1 = β′Σ22β1 = 0;
    ⋮
(αm, βm) is any (α, β) that maximizes α′Σ12β over α′Σ11α = β′Σ22β = 1,
    α′Σ11αi = β′Σ22βi = 0, i = 1, . . . , m − 1.   (13.69)

Then δi ≡ αi′Σ12βi is the i-th canonical correlation, and αi and βi are the associated canonical correlation loading vectors.
Recall that principal component analysis (Definition 1.2) led naturally to the spec-
tral decomposition theorem (Theorem 1.1). Similarly, canonical correlation analysis
will lead to the singular value decomposition (Theorem 13.2 below). We begin the
canonical correlation analysis with some simplifications. Let

ψi = Σ11^{1/2} αi and γi = Σ22^{1/2} βi   (13.70)

for each i, so that the ψi’s and γi’s are sets of orthonormal vectors, and

δi = Corr[Y1αi, Y2βi] = ψi′ Ξ γi,   (13.71)

where

Ξ = Σ11^{−1/2} Σ12 Σ22^{−1/2}.   (13.72)

This matrix Ξ is a multivariate generalization of the correlation coefficient which is useful here, but I don’t know exactly how it should be interpreted.
In what follows, we assume that q1 ≥ q2 = m. The q1 < q2 case can be handled similarly. The matrix Ξ′Ξ is a q2 × q2 symmetric matrix, hence by the spectral decomposition in (1.33), there is a q2 × q2 orthogonal matrix Γ and a q2 × q2 diagonal matrix Λ with diagonal elements λ1 ≥ λ2 ≥ · · · ≥ λq2 such that

Γ′Ξ′ΞΓ = Λ.   (13.73)

Let γ1, . . . , γq2 denote the columns of Γ, so that the i-th column of ΞΓ is Ξγi. Then (13.73) shows these columns are orthogonal and have squared lengths equal to the λi’s, i.e.,

‖Ξγi‖² = λi and (Ξγi)′(Ξγj) = 0 if i ≠ j.   (13.74)

Furthermore, because the γi’s satisfy the equations for the principal components’ loading vectors in (1.28) with S = Ξ′Ξ,

‖Ξγi‖ = √λi maximizes ‖Ξγ‖ over ‖γ‖ = 1, γ′γj = 0 for j < i.   (13.75)

Now for the first canonical correlation, we wish to find unit vectors ψ and γ to maximize ψ′Ξγ. By Corollary 8.2, for γ fixed, the maximum over ψ is when ψ is proportional to Ξγ, hence

ψ′Ξγ ≤ ‖Ξγ‖, equality achieved with ψ = Ξγ/‖Ξγ‖.   (13.76)

But by (13.75), ‖Ξγ‖ is maximized when γ = γ1, hence

ψ′Ξγ ≤ ψ1′Ξγ1 = √λ1, ψ1 = Ξγ1/‖Ξγ1‖.   (13.77)

Thus the first canonical correlation δ1 is √λ1. (The ψ1 is arbitrary if λ1 = 0.)
We proceed for i = 2, . . . , k, where k is the index of the last positive eigenvalue, i.e., λk > 0 = λk+1 = · · · = λq2. For the i-th canonical correlation, we need to find unit vectors ψ orthogonal to ψ1, . . . , ψi−1, and γ orthogonal to γ1, . . . , γi−1, that maximize ψ′Ξγ. Again by (13.75), the best γ is γi, and the best ψ is proportional to Ξγi, so that

ψ′Ξγ ≤ ψi′Ξγi = √λi ≡ δi, ψi = Ξγi/‖Ξγi‖.   (13.78)

That this ψi is indeed orthogonal to the previous ψj’s follows from the second equation in (13.74).
If λi = 0, then ψ′Ξγi = 0 for any ψ. Thus the canonical correlations for i > k are δi = 0, and the corresponding ψi’s, i > k, can be any set of vectors such that ψ1, . . . , ψq2 are orthonormal.
Finally, to find the αi and βi in (13.69), we solve the equations in (13.70):

αi = Σ11^{−1/2} ψi and βi = Σ22^{−1/2} γi, i = 1, . . . , m.   (13.79)

Backing up a bit, we have almost obtained the singular value decomposition of Ξ. Note that by (13.78), δi = ‖Ξγi‖, hence we can write

ΞΓ = ( Ξγ1 · · · Ξγq2 ) = ( ψ1δ1 · · · ψq2δq2 ) = ΨΔ,   (13.80)

where Ψ = (ψ1, . . . , ψq2) has orthonormal columns, and Δ is diagonal with δ1, . . . , δq2 on the diagonal. Shifting the Γ to the other side of the equation, we obtain the following.
Theorem 13.2. Singular value decomposition. The q1 × q2 matrix Ξ can be written

Ξ = ΨΔΓ′   (13.81)

where Ψ (q1 × m) and Γ (q2 × m) have orthonormal columns, and Δ is an m × m diagonal


matrix with diagonals δ1 ≥ δ2 ≥ · · · ≥ δm ≥ 0, where m = min{q1 , q2 }.

To summarize:

Corollary 13.1. Let (13.81) be the singular value decomposition of Ξ = Σ11^{−1/2} Σ12 Σ22^{−1/2} for model (13.66). Then for 1 ≤ i ≤ min{q1, q2}, the i-th canonical correlation is δi, with loading vectors αi and βi given in (13.79), where ψi (γi) is the i-th column of Ψ (Γ).


Next we present an example, where we use the estimate of Σ to estimate the
canonical correlations. Theorem 13.3 guarantees that the estimates are the MLE’s.

13.3.1 Example: Grades


Return to the grades data. In Section 10.3.3, we looked at factor analysis, finding two
main factors: An overall ability factor, and a contrast of homework and labs versus
midterms and final. Here we lump in inclass assignments with homework and labs,
and find the canonical correlations between the sets (homework, labs, inclass) and
(midterms, final), so that q1 = 3 and q2 = 2. The Y is the matrix of residuals from the
model (10.80). In R,
x <- cbind(1,grades[,1])
y <- grades[,2:6] - x%*%solve(t(x)%*%x,t(x))%*%grades[,2:6]
s <- t(y)%*%y/(nrow(y)-2)
corr <- cov2cor(s)
The final statement calculates the correlation matrix from the S, yielding

HW Labs InClass Midterms Final


HW 1.00 0.78 0.28 0.41 0.40
Labs 0.78 1.00 0.42 0.38 0.35
(13.82)
InClass 0.28 0.42 1.00 0.24 0.27
Midterms 0.41 0.38 0.24 1.00 0.60
Final 0.40 0.35 0.27 0.60 1.00

There are q1 × q2 = 6 correlations between variables in the two sets. Canonical correlations aim to summarize the overall correlations by the two δi’s. The estimate of the Ξ matrix in (13.72) is given by

Ξ̂ = S11^{−1/2} S12 S22^{−1/2} = [ 0.236  0.254
                                   0.213  0.146
                                   0.126  0.185 ],   (13.83)

found in R using
symsqrtinv1 <- symsqrtinv(s[1:3,1:3])
symsqrtinv2 <- symsqrtinv(s[4:5,4:5])
xi <- symsqrtinv1%*%s[1:3,4:5]%*%symsqrtinv2
where
symsqrtinv <- function(x) {
ee <- eigen(x)
ee$vectors%*%diag(sqrt(1/ee$values))%*%t(ee$vectors)
}

calculates the inverse symmetric square root of an invertible symmetric matrix x. The
singular value decomposition function in R is called svd:
sv <- svd(xi)
a <- symsqrtinv1%*%sv$u
b <- symsqrtinv2%*%sv$v
The component sv$u is the estimate of Ψ and the component sv$v is the estimate of Γ
in (13.81). The matrices of loading vectors are obtained as in (13.79):
A = S11^{−1/2} Ψ̂ = [ −0.065   0.059
                      −0.007  −0.088
                      −0.014   0.039 ],

and B = S22^{−1/2} Γ̂ = [ −0.062  −0.12
                          −0.053   0.108 ].   (13.84)

The estimated canonical correlations (singular values) are in the vector sv$d, which
are
d1 = 0.482 and d2 = 0.064. (13.85)
The d1 is fairly high, and d2 is practically negligible. (See the next section.) Thus it
is enough to look at the first columns of A and B. We can change signs, and take the
first loadings for the first set of variables to be (0.065, 0.007, 0.014), which is primarily
the homework score. For the second set of variables, the loadings are (0.062, 0.053),
essentially a straight sum of midterms and final. Thus the correlations among the two
sets of variables can be almost totally explained by the correlation between homework
and the sum of midterms and final, which correlation is 0.45, almost the optimum of
0.48.
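This is easy to check directly from the residual matrix y computed above (a one-line sketch; the value may differ slightly from 0.45 depending on rounding and on whether midterms and final are standardized before summing):
cor(y[,1], y[,4] + y[,5])   # homework versus midterms + final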

13.3.2 How many canonical correlations are positive?


One might wonder how many of the δi ’s are nonzero. We can use BIC to get an idea.
The model is based on the usual
S = (1/ν) U, where U ∼ Wishartq(ν, Σ),   (13.86)
ν ≥ q, with Σ partitioned as in (13.66), and S is partitioned similarly. Let ΨΔΓ 
in (13.81) be the singular value decomposition of Ξ as in (13.72). Then Model K
(1 ≤ K ≤ m ≡ min{q1 , q2 }) is given by

MK : δ1 > δ2 > · · · > δK > δK +1 = · · · = δm = 0, (13.87)

where the δi ’s are the canonical correlations, i.e., diagonals of Δ. Let


S11^{−1/2} S12 S22^{−1/2} = PDG′   (13.88)

be the sample analog of Ξ (on the left), and its singular value decomposition (on the
right).
We first obtain the MLE of Σ under model K. Note that Σ and (Σ11 , Σ22 , Ξ) are
in one-to-one correspondence. Thus it is enough to find the MLE of the latter set of
parameters. The next theorem is from Fujikoshi [1974].

Theorem 13.3. For the above setup, the MLE of (Σ11 , Σ22 , Ξ) under model MK in (13.87)
is given by
(S11, S22, PD^{(K)} G′),   (13.89)

where D( K ) is the diagonal matrix with diagonals (d1 , d2 , . . . , dK , 0, 0, . . . , 0).

That is, the MLE is obtained by setting to zero the sample canonical correlations
that are set to zero in the model. One consequence of the theorem is that the natu-
ral sample canonical correlations and accompanying loading vectors are indeed the
MLE’s. The deviance, for comparing the models MK , can be expressed as

deviance(MK) = ν ∑_{i=1}^K log(1 − di²).   (13.90)

See Exercise 13.4.14.


For the BIC, we need the dimension of the model. The number of parameters for
the Σii ’s we know to be q i (q i + 1)/2. For Ξ, we look at the singular value decompo-
sition (13.81):
K
Ξ = ΨΔ( K ) Γ  = ∑ δi ψi γi . (13.91)
i =1

The dimension for the δi ’s is K. Only the first K of the ψi ’s enter into the equation.
Thus the dimension is the same as for principal components with K distinct eigen-
values, and the rest equal at 0, yielding pattern (1, 1, . . . , 1, q1 − K ), where there are
K ones. Similarly, the γi ’s dimension is as for pattern (1, 1, . . . , 1, q2 − K ). Then by
(13.45),

dim(Γ) + dim(Ψ) + dim(Δ^{(K)}) = (1/2)(q1² − K − (q1 − K)²) + (1/2)(q2² − K − (q2 − K)²) + K
                               = K(q − K).   (13.92)

Finally, we can take

BIC(MK) = ν ∑_{k=1}^K log(1 − dk²) + log(ν) K(q − K)   (13.93)

because the q i (q i + 1)/2 parts are the same for each model.
In the example, we have three models: K = 0, 1, 2. K = 0 means the two sets
of variables are independent, which we already know is not true, and K = 2 is the
unrestricted model. The calculations, with ν = 105, d1 = 0.48226 and d2 = 0.064296:

      K     Deviance    dim(Ξ)       BIC       PBIC
      0        0           0         0         0.0099
      1     −27.7949       4       −9.1791     0.9785          (13.94)
      2     −28.2299       6       −0.3061     0.0116

Clearly K = 1 is best, which is what we figured above.
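
A short R sketch of these calculations (a sketch, not the text's code; the PBIC column normalizes exp(−BIC/2) over the three models, and the d_i's are the rounded values above):

nu <- 105
d <- c(0.48226, 0.064296)
q1 <- 3; q2 <- 2; q <- q1 + q2
K <- 0:2
dev <- sapply(K, function(k) nu * sum(log(1 - d[seq_len(k)]^2)))   # deviance (13.90)
dims <- K * (q - K)                                                # dimension (13.92)
bic <- dev + log(nu) * dims                                        # BIC (13.93)
pbic <- exp(-bic/2) / sum(exp(-bic/2))                             # the PBIC column
cbind(K, dev, dims, bic, pbic)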



13.3.3 Partial least squares


A similar idea is to find the linear combinations of the variables to maximize the
covariance, rather than correlation:

\[
\text{Cov}(Y_1 a, Y_2 b) = a' \Sigma_{12}\, b. \tag{13.95}
\]

The process is the same as for canonical correlations, but we use the singular value
decomposition of Σ_12 instead of Ξ. The procedure is called partial least squares,
but it could have been called canonical covariances. It is an attractive alternative to
canonical correlations when there are many variables and not many observations, in
which case the estimates of Σ_11 and Σ_22 are not invertible.
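
A minimal sketch in R, assuming S12 is the q_1 × q_2 sample covariance matrix between the two sets of variables (the object name S12 is an assumption):

sv12 <- svd(S12)        # singular value decomposition of S12 itself
a1 <- sv12$u[, 1]       # first partial least squares loadings for the Y1 variables
b1 <- sv12$v[, 1]       # first partial least squares loadings for the Y2 variables
sv12$d[1]               # the maximized sample covariance a1' S12 b1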

13.4 Exercises
Exercise 13.4.1. In the model (13.4), find the approximate test for testing the null
hypothesis that the average of the last k (k < q) eigenvalues is less than the constant
c2 .

Exercise 13.4.2. Verify the expression for Σ in (13.10).

Exercise 13.4.3. Show that the deviance for the model in Theorem 13.1 is given by
(13.13). [Hint: Start with the likelihood as in (13.32). Show that
\[
\operatorname{trace}(\widehat{\Sigma}^{-1} S) = \sum_{i=1}^{q} \frac{l_i}{\widehat{\lambda}_i} = q. \tag{13.96}
\]

Argue you can then ignore the part of the deviance that comes from the exponent.]

Exercise 13.4.4. Verify (13.25). [Hint: First, show that Γ_1′Γ = (I_p 0) and Γ_2′Γ = (0 I_{q−p}).]

Exercise 13.4.5. Show that (13.30) follows from (13.28) and (13.29).

Exercise 13.4.6. Prove (13.40). [Hint: First, explain why ∑_{i=1}^{k} f_{ij}² ≤ 1 and ∑_{j=1}^{m} f_{ij}² ≤ 1.]

Exercise 13.4.7. Verify the equality in (13.42), and show that (13.11) does give the
maximizers of (13.43).

Exercise 13.4.8. Verify the equality in (13.45).

Exercise 13.4.9. Show that ‖y_i − y_j‖² ≥ ‖x̂_i − x̂_j‖² for the y_i and x̂_i in (13.48). [Hint: Start
by letting B_2 be any (q − p) × q matrix such that (B′, B_2′) is an orthogonal matrix. Then
‖y_i − y_j‖² = ‖(y_i − y_j)(B′, B_2′)‖² (why?), which by expanding equals ‖(y_i − y_j)B′‖² +
‖(y_i − y_j)B_2′‖².]

Exercise 13.4.10. Verify (13.52) by expanding the second expression.

Exercise 13.4.11. In (13.80), verify that Ξγ_i = δ_i ψ_i.



Exercise 13.4.12. For the canonical correlations situation in Corollary 13.1, let α =
(α_1, . . . , α_m) and β = (β_1, . . . , β_m) be matrices with columns being the loading vectors.
Find the covariance matrix of the transformation
\[
\begin{pmatrix} Y_1\alpha & Y_2\beta \end{pmatrix} = \begin{pmatrix} Y_1 & Y_2 \end{pmatrix}
\begin{pmatrix} \alpha & 0 \\ 0 & \beta \end{pmatrix}. \tag{13.97}
\]
[It should depend on the parameters only through the δ_i's.]


Exercise 13.4.13. Given the singular value decomposition of Ξ in (13.81), find the spectral
decompositions of ΞΞ′ and of Ξ′Ξ. What can you say about the two matrices'
eigenvalues? How are these eigenvalues related to the singular values in Δ?
Exercise 13.4.14. This exercise derives the deviance for the canonical correlation
model in (13.87). Start with
\[
-2\log(L(\widehat{\Sigma}\,;\,S)) = \nu \log(|\widehat{\Sigma}|) + \nu\,\operatorname{trace}(\widehat{\Sigma}^{-1} S) \tag{13.98}
\]
for the likelihood in (13.32), where Σ̂ is the estimate given in Theorem 13.3. (a) Show
that
\[
\Sigma = \begin{pmatrix} \Sigma_{11}^{1/2} & 0 \\ 0 & \Sigma_{22}^{1/2} \end{pmatrix}
\begin{pmatrix} I_{q_1} & \Xi \\ \Xi' & I_{q_2} \end{pmatrix}
\begin{pmatrix} \Sigma_{11}^{1/2} & 0 \\ 0 & \Sigma_{22}^{1/2} \end{pmatrix}. \tag{13.99}
\]
(b) Letting C_K = PD^{(K)}G′ and C = PDG′ (= PD^{(m)}G′), show that
\[
\begin{aligned}
\operatorname{trace}(\widehat{\Sigma}^{-1} S)
&= \operatorname{trace}\left(\begin{pmatrix} I_{q_1} & C_K \\ C_K' & I_{q_2} \end{pmatrix}^{-1}
\begin{pmatrix} I_{q_1} & C \\ C' & I_{q_2} \end{pmatrix}\right) \\
&= \operatorname{trace}\bigl((I_{q_1} - C_K C_K')^{-1}(I_{q_1} - C_K C' - C C_K' + C_K C_K')\bigr) + \operatorname{trace}(I_{q_2}) \\
&= \operatorname{trace}\bigl((I_{q_1} - C_K C_K')^{-1}(I_{q_1} - C_K C_K')\bigr) + \operatorname{trace}(I_{q_2}) \\
&= q.
\end{aligned} \tag{13.100}
\]
[Hint: The first equality uses part (a). The second equality might be easiest to show
by letting
\[
H = \begin{pmatrix} I_{q_1} & -C_K \\ 0 & I_{q_2} \end{pmatrix}, \tag{13.101}
\]
and multiplying the two large matrices by H on the left and H′ on the right. For the
third equality, using orthogonality of the columns of G, show that C_K C′ = C C_K′ =
C_K C_K′.] (c) Show that |Σ̂| = |S_{11}| |S_{22}| |I_{q_1} − C_K C_K′|, where C_K is given in part (b).
[Hint: Recall (5.83).] (d) Show that |I_{q_1} − C_K C_K′| = ∏_{i=1}^{K} (1 − d_i²). (e) Use parts (b)
through (d) to find an expression for (13.98), then argue that for comparing the M_K's, we
can take the deviance as in (13.90).
Exercise 13.4.15. Verify the calculation in (13.92).

Exercise 13.4.16 (Painters). The biplot for the painters data set (in the MASS package)
was analyzed in Exercise 1.9.18. (a) Using the first four variables, without any scaling,
find the sample eigenvalues li . Which seem to be large, and which small? (b) Find
the pattern of the li ’s that has best BIC. What are the MLE’s of the λi ’s for the best
pattern? Does the result conflict with your answer to part (a)?

Exercise 13.4.17 (Spam). In Exercises 1.9.15 and 11.9.9, we found principal compo-
nents for the spam data. Here we look for the best pattern of eigenvalues. Note that
the data is far from multivariate normal, so the distributional aspects should not be
taken too seriously. (a) Using the unscaled spam explanatory variables (1 through
57), find the best pattern of eigenvalues based on the BIC criterion. Plot the sample
eigenvalues and their MLE’s. Do the same, but for the logs. How many principal
components is it reasonable to take? (b) Repeat part (a), but using the scaled data,
scale(Spam[,1:57]). (c) Which approach yielded the more satisfactory answer? Was
the decision to use ten components in Exercise 11.9.9 reasonable, at least for the scaled
data?

Exercise 13.4.18 (Iris). This question concerns the relationships between the sepal
measurements and petal measurements in the iris data. Let S be the pooled covariance
matrix, so that the denominator is ν = 147. (a) Find the correlation between the
sepal length and petal length, and the correlation between the sepal width and petal
width. (b) Find the canonical correlation quantities for the two groups of variables
{Sepal Length, Sepal Width} and {Petal Length, Petal Width}. What do the loadings
show? Compare the di ’s to the correlations in part (a). (c) Find the BIC’s for the three
models K = 0, 1, 2, where K is the number of nonzero δi ’s. What do you conclude?

Exercise 13.4.19 (Exams). Recall the exams data set (Exercise 10.5.18) has the scores
of 191 students on four exams, the three midterms (variables 1, 2, and 3) and the final
exam. (a) Find the canonical correlation quantities, with the three midterms in one
group, and the final in its own group. Describe the relative weightings (loadings) of
the midterms. (b) Apply the regular multiple regression model with the final as the
Y and the three midterms as the X's. What is the correlation between the Y and the
fit, Ŷ? How does this correlation compare to d_1 in part (a)? What do you get if you
square this correlation? (c) Look at the ratios β_i/a_{i1} for i = 1, 2, 3, where β_i is the
regression coefficient for midterm i in part (b), and ai1 is the first canonical correlation
loading. What do you conclude? (d) Run the regression again, with the final still Y,
but use just the one explanatory variable Xa_1. Find the correlation of Y and the Ŷ for
this regression. How does it compare to that in part (b)? (e) Which (if either) yields
a linear combination of the midterms that best correlates with the final, canonical
correlation analysis or multiple regression? (f) Look at the three midterms' variances.
What do you see? Find the regular principal components (without scaling) for the
midterms. What are the loadings for the first principal component? Compare them
to the canonical correlations’ loadings in part (a). (g) Run the regression again, with
the final as the Y again, but with just the first principal component of the midterms as
the sole explanatory variable. Find the correlation between Y and Ŷ here. Compare
to the correlations in parts (b) and (d). What do you conclude?

Exercise 13.4.20 (States). This problem uses the matrix states, which contains several
demographic variables on the 50 United States, plus D.C. We are interested in the
relationship between crime variables and money variables:
Crime: Violent crimes per 100,000 people
Prisoners: Number of people in prison per 10,000 people.
Poverty: Percentage of people below the poverty line.
Employment: Percentage of people employed
Income: Median household income

Let the first two variables be Y1 , and the other three be Y2 . Scale them to have mean
zero and variance one:
y1 <- scale(states[,7:8])
y2 <- scale(states[,9:11])
Find the canonical correlations between the Y1 and Y2 . (a) What are the two canonical
correlations? How many of these would you keep? (b) Find the BIC’s for the K = 0, 1
and 2 canonical correlation models. Which is best? (c) Look at the loadings for the
first canonical correlation, i.e., a1 and b1 . How would you interpret these? (d) Plot
the first canonical variables: Y1 a1 versus Y2 b1 . Do they look correlated? Which
observations, if any, are outliers? (e) Plot the second canonical variables: Y1 a2 versus
Y2 b2. Do they look correlated? (f) Find the correlation matrix of the four canonical
variables: (Y1 a1 , Y1 a2 , Y2 b1 , Y2 b2 ). What does it look like? (Compare it to the result
in Exercise 13.4.12.)
Appendix A

Extra R routines

These functions are very barebones. They do not perform any checks on the inputs,
and are not necessarily efficient. You are encouraged to robustify and enhance any of
them to your heart’s content.

A.1 Estimating entropy


We present a simple method for estimating negentropy. See Hyvärinen et al.
[2001] for a more sophisticated approach, which is implemented in the R package
fastICA [Marchini et al., 2010]. First, we need to estimate the negentropy (1.46) for
a given univariate sample of n observations. We use the histogram as the density,
where we take K bins of equal width d, where K is the smallest integer larger than
log2 (n ) + 1 [Sturges, 1926]. Thus bin i is (bi−1 , bi ], i = 1, . . . , K, where bi = b0 + d × i,
and b0 and d are chosen so that (b0 , bK ] covers the range of the data. Letting pi be the
proportion of observations in bin i, the histogram estimate of the density g is

\[
g(x) = \frac{p_i}{d} \quad \text{if } b_{i-1} < x \leq b_i. \tag{A.1}
\]
From (2.102) in Exercise 2.7.16, we have that the negative entropy (1.46) is
\[
\text{Negent}(g) = \frac{1}{2}\left(1 + \log\left(2\pi\left(\text{Var}[I] + \tfrac{1}{12}\right)\right)\right) + \sum_{i=1}^{K} p_i \log(p_i), \tag{A.2}
\]

where I is the random variable with P [I = i ] = pi , hence


\[
\text{Var}[I] = \sum_{i=1}^{K} i^2\, p_i - \left( \sum_{i=1}^{K} i\, p_i \right)^2. \tag{A.3}
\]

See Section A.1.1 for the R function we use to calculate this estimate.
For projection pursuit, we have our n × q data matrix Y, and wish to find first the
q × 1 vector g1 with norm 1 that maximizes the estimated negentropy of Yg1 . Next
we look for the g2 with norm 1 orthogonal to g1 that maximizes the negentropy of
Yg2 , etc. Then our rotation is given by the orthogonal matrix G = (g1 , g2 , . . . , gq ).


We need to parametrize the orthogonal matrices somehow. For q = 2, we can set



\[
G(\theta) = E_2(\theta) \equiv \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \phantom{-}\cos(\theta) \end{pmatrix}. \tag{A.4}
\]

Clearly each such matrix is orthogonal. As θ ranges from 0 to 2π, E2 (θ ) ranges


through half of the orthogonal matrices (those with determinant equal to +1), the
other half obtainable by switching the minus sign from the sine term to one of the
cosine terms. For our purposes, we need only to take 0 ≤ θ < π, since the other
G’s are obtained from that set by changing the sign on one or both of the columns.
Changing signs does not affect the negentropy, nor the graph except for reflection
around an axis. To find the best G(θ ), we perform a simple line search over θ. See
Section A.1.2.
For q = 3 we use Euler angles θ1 , θ2 , and θ3 , so that

\[
G(\theta_1, \theta_2, \theta_3) = E_3(\theta_1, \theta_2, \theta_3) \equiv
\begin{pmatrix} E_2(\theta_3) & 0 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & E_2(\theta_2) \end{pmatrix}
\begin{pmatrix} E_2(\theta_1) & 0 \\ 0 & 1 \end{pmatrix}. \tag{A.5}
\]

See Anderson et al. [1987] for similar parametrizations when q > 3. The first step is to
find the G = (g1 , g2 , g3 ) whose first column, g1 , achieves the maximum negentropy
of Yg1 . Here it is enough to take θ3 = 0, so that the left-hand matrix is the identity.
Because our estimate of negentropy for Yg is not continuous in g, we use the simulated
annealing option in the R function optim to find the optimal g1 . The second step is to
find the best further rotation of the remaining variables, Y(g2 , g3 ), for which we can
use the two-dimensional procedure above. See Section A.1.3.

A.1.1 negent: Estimating negative entropy

Description: Calculates the histogram-based estimate (A.2) of the negentropy (1.46)


for a vector of observations. See Listing A.1 for the code.
Usage: negent(x,K=ceiling(log2(length(x))+1))
Arguments:
x: The n-vector of observations.
K: The number of bins to use in the histogram.

Value: The value of the estimated negentropy.
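
For instance, a quick check on simulated data (a hypothetical usage, not from the text):

set.seed(1)
negent(rnorm(1000))   # roughly zero for (approximately) normal data
negent(rexp(1000))    # noticeably larger for a skewed sample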

A.1.2 negent2D: Maximizing negentropy for q = 2 dimensions

Description: Searches for the rotation that maximizes the estimated negentropy of
the first column of the rotated data, for q = 2 dimensional data. See Listing A.2 for
the code.
Usage: negent2D(y,m=100)
Arguments:

y: The n × 2 data matrix.


m: The number of angles (between 0 and π) over which to search.

Value: A list with the following components:


vectors: The 2 × 2 orthogonal matrix G that optimizes the negentropy.
values: Estimated negentropies for the two rotated variables. The largest is first.

A.1.3 negent3D: Maximizing negentropy for q = 3 dimensions

Description: Searches for the rotation that maximizes the estimated negentropy of
the first column of the rotated data, and of the second variable fixing the first, for
q = 3 dimensional data. The routine uses a random start for the function optim using
the simulated annealing option SANN, hence one may wish to increase the number
of attempts by setting nstart to an integer larger than 1. See Listing A.3 for the code.
Usage: negent3D(y,nstart=1,m=100,...)
Arguments:
y: The n × 3 data matrix.
nstart: The number of times to randomly start the search routine.
m: The number of angles (between 0 and π) over which to search to find the second
variables.
. . .: Further optional arguments to pass to the optim function to control the simulated
annealing algorithm.

Value: A list with the following components:


vectors: The 3 × 3 orthogonal matrix G that optimizes the negentropy.
values: Estimated negentropies for the three rotated variables, from largest to small-
est.

A.2 Both-sides model


A.2.1 bothsidesmodel: Calculate the estimates

Description: For the both-sides model,


\[
Y = x\beta z' + R, \qquad R \sim N(0,\, I_n \otimes \Sigma_R), \tag{A.6}
\]
where Y is n × q, x is n × p, and z is q × l, the function finds the least-squares estimates
of β; the standard errors and t-statistics (with degrees of freedom) for the individual
β_ij's; and the matrices C_x = (x′x)^{-1} and Σ_z of (6.5). The function requires that n ≥ p,
and that x′x and z′z be invertible. See Listing A.4 for the code.
Usage: bothsidesmodel(x,y,z)
Arguments:

x: An n × p design matrix.
y: The n × q matrix of observations.
z: A q × l design matrix.

Value: A list with the following components:


Betahat: The least-squares estimate of β.
se: The p × l matrix with the ijth element being the standard error of β̂_ij.
T: The p × l matrix with the ijth element being the t-statistic based on β̂_ij.
nu: The degrees of freedom for the t-statistics, ν = n − p.
Cx: The p × p matrix Cx .
Sigmaz: The q × q matrix Σz .

A.2.2 bothsidesmodel.test: Test blocks of β are zero

Description: Performs tests of the null hypothesis H0 : β∗ = 0, where β∗ is a


block submatrix of β as in Section 7.2. An example is given in (7.11). The in-
put consists of the output from the bothsidesmodel function, plus vectors giving the
rows and columns of β to be tested. In the example, we set rows <- c(1,4,5) and
cols <- c(1,3). See Listing A.5 for the code.
Usage: bothsidesmodel.test(bsm,rows,cols)
Arguments:
bsm: The output of the bothsidesmodel function.
rows: The vector of rows to be tested.
cols: The vector of columns to be tested.

Value: A list with the following components:

Hotelling: A list with the components of the Lawley-Hotelling T 2 test (7.22):

T2: The T 2 statistic (7.19).


F: The F version (7.22) of the T 2 statistic.
df: The degrees of freedom for the F.
pvalue: The p-value of the F.
Wilks: A list with the components of the Wilks Λ test (7.37):
lambda: The Λ statistic (7.35).
Chisq: The χ2 version (7.37) of the Λ statistic, using Bartlett’s correction.
df: The degrees of freedom for the χ2 .
pvalue: The p-value of the χ2 .
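
Continuing the hypothetical simulated example above (a sketch only), a test that the second row of β is zero for both columns:

bothsidesmodel.test(bsm, rows = 2, cols = 1:2)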

A.3 Classification

A.3.1 lda: Linear discrimination

Description: Finds the coefficients a_k and constants c_k for Fisher's linear discrimination
function d_k in (11.31) and (11.32). See Listing A.6 for the code.

Usage: lda(x,y)

Arguments:

x: The n × p data matrix.

y: The n-vector of group identities, assumed to be given by the numbers 1, . . . , K for


K groups.

Value: A list with the following components:

a: A p × K matrix, where column k contains the coefficients a_k for (11.31). The final
column is all zero.

c: The K-vector of constants c_k for (11.31). The final value is zero.
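
A hypothetical usage with R's built-in iris data, classifying each observation to the group with the largest discriminant function (a sketch, not the text's code):

x <- as.matrix(iris[, 1:4])
y <- as.numeric(iris$Species)                 # group labels 1, 2, 3
ld <- lda(x, y)
d <- x %*% ld$a + rep(1, nrow(x)) %o% ld$c    # d_k(x) for each observation and group
yhat <- apply(d, 1, which.max)                # predicted groups
table(y, yhat)                                # observed confusion matrix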

A.3.2 qda: Quadratic discrimination

Description: The function returns the elements needed to calculate the quadratic
discrimination in (11.48). Use the output from this function in predict.qda (Section
A.3.3) to find the predicted groups. See Listing A.7 for the code.

Usage: qda(x,y)

Arguments:

x: The n × p data matrix.

y: The n-vector of group identities, assumed to be given by the numbers 1, . . . , K for


K groups.

Value: A list with the following components:

Mean: A K × p matrix, where row k contains the sample mean vector for group k.

Sigma: A K × p × p array, where Sigma[k,,] contains the sample covariance matrix
for group k, Σ̂_k.

c: The K-vector of constants c_k for (11.48).



A.3.3 predict.qda: Quadratic discrimination prediction

Description: The function uses the output from the function qda (Section A.3.2) and
a p-vector x, and calculates the predicted group for this x. See Listing A.8 for the
code.
Usage: predict.qda(qd,newx)
Arguments:

qd: The output from qda.

newx: A p-vector x whose components match the variables used in the qda function.

Value: A K-vector of the discriminant values d_k^Q(x) in (11.48) for the given x.
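
A hypothetical usage, again with the iris data (a sketch only):

x <- as.matrix(iris[, 1:4])
y <- as.numeric(iris$Species)
qd <- qda(x, y)
yhat <- apply(x, 1, function(xi) which.max(predict.qda(qd, xi)))
table(y, yhat)   # observed confusion matrix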

A.4 Silhouettes for K-Means Clustering


A.4.1 silhouette.km: Calculate the silhouettes
This function is a bit different from the silhouette function in the cluster package
[Maechler et al., 2005].
Description: Find the silhouettes (12.13) for K-means clustering from the data and
the groups' centers. See Listing A.9 for the code.
Usage: silhouette.km(x,centers)
Arguments:

x: The n × p data matrix.

centers: The K × p matrix of centers (means) for the K clusters, row k being the
center for cluster k.

Value: The n-vector of silhouettes, indexed by the observations’ indices.

A.4.2 sort.silhouette: Sort the silhouettes by group

Description: Sorts the silhouettes, first by group, then by value, preparatory to plot-
ting. See Listing A.10 for the code.
Usage: sort.silhouette(sil,cluster)
Arguments:

sil: The n-vector of silhouette values.

cluster: The n-vector of cluster indices.

Value: The n-vector of sorted silhouettes.
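
A hypothetical usage with K-means (via R's kmeans) on the iris measurements (a sketch only):

x <- as.matrix(iris[, 1:4])
km <- kmeans(x, centers = 3, nstart = 10)
sil <- silhouette.km(x, km$centers)
ss <- sort.silhouette(sil, km$cluster)
plot(ss, type = "h", ylab = "Silhouette")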



A.5 Estimating the eigenvalues


We have two main functions, pcbic to find the MLE and BIC for a particular pattern,
and pcbic.stepwise, which uses a stepwise search to find a good pattern. The functions
pcbic.unite and pcbic.subpatterns are used by the main functions, and are probably not of
much interest on their own.

A.5.1 pcbic: BIC for a particular pattern

Description: Find the BIC and MLE from a set of observed eigenvalues for a specific
pattern. See Listing A.11 for the code.
Usage: pcbic(eigenvals,n,pattern)
Arguments:
eigenvals: The q-vector of eigenvalues of the covariance matrix, in order from largest
to smallest.
n: The degrees of freedom in the covariance matrix.
pattern: The pattern of equalities of the eigenvalues, given by the K-vector
(q1 , . . . , q K ) as in (13.8).

Value: A list with the following components:

lambdaHat: A q-vector containing the MLE’s for the eigenvalues.


Deviance: The deviance of the model, as in (13.13).
Dimension: The dimension of the model, as in (13.12).
BIC: The value of the BIC for the model, as in (13.14).
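
A hypothetical usage, fitting the pattern (1, 1, 2) to the eigenvalues of the iris sample covariance matrix (a sketch only):

x <- as.matrix(iris[, 1:4])
ev <- eigen(var(x))$values               # sample eigenvalues, largest to smallest
pcbic(ev, n = nrow(x) - 1, pattern = c(1, 1, 2))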

A.5.2 pcbic.stepwise: Choosing a good pattern

Description: Uses the stepwise procedure described in Section 13.1.4 to find a pattern
for a set of observed eigenvalues with good BIC value. See Listing A.12 for the code.
Usage: pcbic.stepwise(eigenvals,n)
Arguments:
eigenvals: The q-vector of eigenvalues of the covariance matrix, in order from largest
to smallest.
n: The degrees of freedom in the covariance matrix.

Value: A list with the following components:

Patterns: A list of patterns, one of each length K = q, q − 1, . . . , 1.


BICs: A vector of the BIC’s for the above patterns.

BestBIC: The best (smallest) value among the BIC’s in BICs.


BestPattern: The pattern with the best BIC.
lambdaHat: A q-vector containing the MLE’s for the eigenvalues for the pattern with
the best BIC.
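
Continuing the hypothetical iris example above (a sketch only):

pcbic.stepwise(ev, n = nrow(x) - 1)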

A.5.3 Helper functions


The function pcbic.unite takes as arguments a pattern (q_1, . . . , q_K), called pattern, and
an index i, called index1, where 1 ≤ i < K. It returns the pattern obtained by summing
q_i and q_{i+1}. See Listing A.13. The function pcbic.subpatterns (Listing A.14) takes the
arguments eigenvals, n, and pattern0 (as for pcbic in Section A.5.1), and returns the
patterns obtainable from pattern0 by summing two consecutive terms via pcbic.unite,
together with their BIC's.
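
For example (a hypothetical call), uniting the second and third groups of the pattern (2, 1, 3):

pcbic.unite(c(2, 1, 3), 2)   # returns c(2, 4)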

Listing A.1: The function negent


negent <- function(x, K=ceiling(log2(length(x))+1)) {
  # Histogram-based estimate of the negentropy (A.2)
  p <- table(cut(x, breaks=K))/length(x)         # bin proportions p_i
  sigma2 <- sum((1:K)^2*p) - sum((1:K)*p)^2      # Var[I] as in (A.3)
  p <- p[(p>0)]
  (1+log(2*pi*(sigma2+1/12)))/2 + sum(p*log(p))  # estimated negentropy (A.2)
}

Listing A.2: The function negent2D


negent2D <- function(y, m=100) {
  thetas <- (1:m)*pi/m              # grid of rotation angles in (0, pi]
  ngnt <- NULL
  for(theta in thetas) {
    x <- y%*%c(cos(theta), sin(theta))
    ngnt <- c(ngnt, negent(x))      # negentropy of the rotated first variable
  }
  i <- which.max(ngnt)              # angle with the largest negentropy
  g <- c(cos(thetas[i]), sin(thetas[i]))
  g <- cbind(g, c(-g[2], g[1]))     # complete to an orthogonal matrix
  list(vectors = g, values = c(ngnt[i], negent(y%*%g[,2])))
}

Listing A.3: The function negent3D


negent3D <- function(y, nstart=1, m=100, ...) {
  f <- function(thetas) {
    # negentropy of the first variable for the rotation given by the angles
    cs <- cos(thetas)
    sn <- sin(thetas)
    negent(y%*%c(cs[1], sn[1]*c(cs[2], sn[2])))
  }
  tt <- NULL
  nn <- NULL
  for(i in 1:nstart) {
    thetas <- runif(3)*pi
    o <- optim(thetas, f, method='SANN', control=list(fnscale=-1), ...)
    tt <- rbind(tt, o$par)
    nn <- c(nn, o$value)
  }
  i <- which.max(nn) # the index of the best negentropy
  cs <- cos(tt[i,])
  sn <- sin(tt[i,])
  g.opt <- c(cs[1], sn[1]*cs[2], sn[1]*sn[2])
  g.opt <- cbind(g.opt, c(-sn[1], cs[1]*cs[2], sn[2]*cs[1]))
  g.opt <- cbind(g.opt, c(0, -sn[2], cs[2]))
  x <- y%*%g.opt[,2:3]
  n2 <- negent2D(x, m=m)            # rotate the remaining two variables
  g.opt[,2:3] <- g.opt[,2:3]%*%n2$vectors
  list(vectors=g.opt, values = c(nn[i], n2$values))
}

Listing A.4: The function bothsidesmodel


bothsidesmodel <- function(x, y, z) {
  if(is.vector(x)) x <- matrix(x, ncol=1)
  xpxi <- solve(t(x)%*%x)               # (x'x)^{-1}, i.e., Cx
  xx <- xpxi%*%t(x)
  zz <- solve(t(z)%*%z, t(z))
  b <- xx%*%y                           # regress y on x
  yr <- y - x%*%b                       # residuals
  b <- b%*%t(zz)                        # then on z, giving betahat
  yr <- yr%*%t(zz)
  df <- nrow(x) - ncol(x)               # nu = n - p
  sigmaz <- t(yr)%*%yr/df
  se <- sqrt(outer(diag(xpxi), diag(sigmaz), "*"))
  tt <- b/se
  list(Betahat = b, se = se, T = tt, Cx = xpxi, Sigmaz = sigmaz, nu = df)
}

Listing A.5: The function bothsidesmodel.test


bothsidesmodel.test <- function(bsm, rows, cols) {
  lstar <- length(cols)
  pstar <- length(rows)
  nu <- bsm$nu
  bstar <- bsm$Betahat[rows, cols]
  if(lstar==1) bstar <- matrix(bstar, ncol=1)
  if(pstar==1) bstar <- matrix(bstar, nrow=1)
  W.nu <- bsm$Sigmaz[cols, cols]
  B <- t(bstar)%*%solve(bsm$Cx[rows, rows])%*%bstar
  t2 <- sum(diag(solve(W.nu)%*%B))       # Lawley-Hotelling trace T^2 (7.19)
  f <- (nu-lstar+1)*t2/(lstar*pstar*nu)
  df <- c(lstar*pstar, nu-lstar+1)
  W <- W.nu*nu
  lambda <- det(W)/det(W+B)              # Wilks' Lambda (7.35)
  chis <- -(nu-(lstar-pstar+1)/2)*log(lambda)
  Hotelling <- list(T2 = t2, F = f, df = df, pvalue = 1-pf(f, df[1], df[2]))
  Wilks <- list(Lambda = lambda, Chisq = chis, df = df[1],
                pvalue = 1-pchisq(chis, df[1]))
  list(Hotelling = Hotelling, Wilks = Wilks)
}

Listing A.6: The function lda


lda <- function(x, y) {
  if(is.vector(x)) {x <- matrix(x, ncol=1)}
  K <- max(y)
  p <- ncol(x)
  n <- nrow(x)
  m <- NULL
  v <- matrix(0, ncol=p, nrow=p)
  for(k in 1:K) {
    xk <- x[y==k,]
    if(is.vector(xk)) {xk <- matrix(xk, ncol=1)}
    m <- rbind(m, apply(xk, 2, mean))    # group sample means
    v <- v + var(xk)*(nrow(xk)-1)
  }
  v <- v/n                               # pooled covariance matrix
  phat <- table(y)/n                     # sample proportions

  ck <- NULL
  ak <- NULL
  vi <- solve(v)
  for(k in 1:K) {
    c0 <- -(1/2)*(m[k,]%*%vi%*%m[k,] - m[K,]%*%vi%*%m[K,]) +
      log(phat[k]/phat[K])               # constant c_k in (11.31)
    ck <- c(ck, c0)
    a0 <- vi%*%(m[k,]-m[K,])             # coefficients a_k in (11.31)
    ak <- cbind(ak, a0)
  }
  list(a = ak, c = ck)
}

Listing A.7: The function qda


qda <- function(x, y) {
  K <- max(y)
  p <- ncol(x)
  n <- nrow(x)
  m <- NULL
  v <- array(0, c(K, p, p))
  ck <- NULL
  phat <- table(y)/n
  for(k in 1:K) {
    xk <- x[y==k,]
    m <- rbind(m, apply(xk, 2, mean))       # group sample mean
    nk <- nrow(xk)
    v[k,,] <- var(xk)*(nk-1)/nk             # group sample covariance (MLE)
    ck <- c(ck, -log(det(v[k,,])) + 2*log(phat[k]))
  }

  list(Mean = m, Sigma = v, c = ck)
}

Listing A.8: The function predict.qda


predict.qda <- function(qd, newx) {
  newx <- c(newx)
  disc <- NULL
  K <- length(qd$c)
  for(k in 1:K) {
    # quadratic discriminant value d_k^Q(newx) as in (11.48)
    dk <- -t(newx - qd$Mean[k,])%*%
      solve(qd$Sigma[k,,], newx - qd$Mean[k,]) + qd$c[k]
    disc <- c(disc, dk)
  }
  disc
}

Listing A.9: The function silhouette.km


silhouette.km <- function(x, centers) {
  dd <- NULL
  k <- nrow(centers)
  for(i in 1:k) {
    xr <- sweep(x, 2, centers[i,], '-')     # deviations from center i
    dd <- cbind(dd, apply(xr^2, 1, sum))    # squared distances to center i
  }
  dd <- apply(dd, 1, sort)[1:2,]            # two smallest squared distances
  (dd[2,] - dd[1,])/dd[2,]                  # silhouette as in (12.13)
}

Listing A.10: The function sort.silhouette


sort.silhouette <- function(sil, cluster) {
  ss <- NULL
  ks <- sort(unique(cluster))
  for(k in ks) {
    ss <- c(ss, sort(sil[cluster==k]))
  }
  ss
}

Listing A.11: The function pcbic


pcbic <- function(eigenvals, n, pattern) {
  p <- length(eigenvals)
  l <- eigenvals
  k <- length(pattern)
  istart <- 1
  for(i in 1:k) {
    # average the eigenvalues within the ith block of the pattern
    iend <- istart + pattern[i]
    l[istart:(iend-1)] <- mean(l[istart:(iend-1)])
    istart <- iend
  }
  dev <- n*sum(log(l))                       # deviance, as in (13.13)
  dimen <- (p^2 - sum(pattern^2))/2 + k      # dimension, as in (13.12)
  bic <- dev + log(n)*dimen                  # BIC, as in (13.14)
  list(lambdaHat = l, Deviance = dev, Dimension = dimen, BIC = bic)
}

Listing A.12: The function pcbic.stepwise


pcbic.stepwise <- function(eigenvals, n) {
  k <- length(eigenvals)
  p0 <- rep(1, k)                   # start with all eigenvalues distinct
  b <- rep(0, k)
  pb <- vector('list', k)
  pb[[1]] <- p0
  b[1] <- pcbic(eigenvals, n, p0)$BIC
  for(i in 2:k) {
    # best pattern obtainable by uniting two consecutive groups
    psb <- pcbic.subpatterns(eigenvals, n, pb[[i-1]])
    b[i] <- min(psb$bic)
    pb[[i]] <- psb$pattern[, psb$bic==b[i]]
  }
  ib <- (1:k)[b==min(b)]
  list(Patterns = pb, BICs = b,
       BestBIC = b[ib], BestPattern = pb[[ib]],
       lambdaHat = pcbic(eigenvals, n, pb[[ib]])$lambdaHat)
}

Listing A.13: The function pcbic.unite


pcbic.unite <- function(pattern, index1) {
  k <- length(pattern)
  if(k==1) return(pattern)
  if(k==2) return(sum(pattern))
  if(index1==1) return(c(pattern[1]+pattern[2], pattern[3:k]))
  if(index1==k-1) return(c(pattern[1:(k-2)], pattern[k-1]+pattern[k]))
  c(pattern[1:(index1-1)], pattern[index1]+pattern[index1+1], pattern[(index1+2):k])
}

Listing A.14: The function pcbic.subpatterns


pcbic.subpatterns <- function(eigenvals, n, pattern0) {
  b <- NULL
  pts <- NULL
  k <- length(pattern0)
  if(k==1) return(F)
  for(i in 1:(k-1)) {
    # unite groups i and i+1, then evaluate the BIC of the resulting pattern
    p1 <- pcbic.unite(pattern0, i)
    b2 <- pcbic(eigenvals, n, p1)
    b <- c(b, b2$BIC)
    pts <- cbind(pts, p1)
  }
  list(bic=b, pattern=pts)
}
Bibliography

Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions
on Automatic Control, 19:716 – 723, 1974.

E. Anderson. The irises of the Gaspe Peninsula. Bulletin of the American Iris Society,
59:2–5, 1935.

E. Anderson. The species problem in iris. Annals of the Missouri Botanical Garden, 23:
457 – 509, 1936.

T. W. Anderson. Asymptotic theory for principal component analysis. The Annals of


Mathematical Statistics, 34:122–148, 1963.

T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, NY, third


edition, 2003.

T. W. Anderson, I. Olkin, and L. G. Underhill. Generation of random orthogonal


matrices. SIAM Journal on Scientific and Statistical Computing, 8:625–629, 1987.

Steen Andersson. Invariant normal models. Annals of Statistics, 3:132 – 154, 1975.

David R. Appleton, Joyce M. French, and Mark P. J. Vanderpump. Ignoring a covari-


ate: An example of Simpson’s paradox. The American Statistician, 50(4):340–341,
1996.

Robert B. Ash. Basic Probability Theory. John Wiley and Sons Inc., http://www.math.uiuc.edu/~r-ash/BPT.html, 1970.

Daniel Asimov. The grand tour: A tool for viewing multidimensional data. SIAM
Journal on Scientific and Statistical Computing, 6:128–143, 1985.

M. S. Bartlett. A note on multiplying factors for various χ2 approximations. Journal


of the Royal Statistical Society B, 16:296 – 298, 1954.

M. S. Bartlett. A note on tests of significance in multivariate analysis. Mathematical


Proceedings of the Cambridge Philosophical Society, 35:180–185, 1939.

Alexander Basilevsky. Statistical Factor Analysis and Related Methods: Theory and Appli-
cations. John Wiley & Sons, 1994.


J. Berger, J. Ghosh, and N. Mukhopadhyay. Approximations to the Bayes factor in


model selection problems and consistency issues. Technical report, 1999. URL
http://www.stat.duke.edu/~berger/papers/00-10.html.
Peter J. Bickel and Kjell A. Doksum. Mathematical Statistics: Basic Ideas and Selected
Topics, Volume I. Prentice Hall, second edition, 2000.
G. E. P. Box and Mervin E. Muller. A note on the generation of random normal
deviates. Annals of Mathematical Statistics, 29(2):610–611, 1958.
Leo Breiman, Jerome Friedman, Charles J. Stone, and R. A. Olshen. Classification and
Regression Trees. CRC Press, 1984.
Andreas Buja and Daniel Asimov. Grand tour methods: An outline. In D. M. Allen,
editor, Computer Science and Statistics: Proceedings of the 17th Symposium on the Inter-
face, pages 63–67, New York and Amsterdam, 1986. Elsevier/North-Holland.
T. K. Chakrapani and A. S. C. Ehrenberg. An alternative to factor analysis in market-
ing research part 2: Between group analysis. Professional Marketing Research Society
Journal, 1:32–38, 1981.
Herman Chernoff. The use of faces to represent points in k-dimensional space graph-
ically. Journal of the American Statistical Association, 68:361–368, 1973.
Vernon M. Chinchilli and Ronald K. Elswick. A mixture of the MANOVA and
GMANOVA models. Communications in Statistics: Theory and Methods, 14:3075–
3089, 1985.
Ronald Christensen. Plane Answers to Complex Questions: The Theory of Linear Models.
Springer-Verlag Inc, third edition, 2002.
J. W. L. Cole and James E. Grizzle. Applications of multivariate analysis of variance
to repeated measurements experiments. Biometrics, 22:810–828, 1966.
Consumers’ Union. Body dimensions. Consumer Reports, April 286 – 288, 1990.
D. Cook and D. F. Swayne. Interactive and Dynamic Graphics for Data Analysis. Springer,
2007.
DASL Project. Data and Story Library. Cornell University, Ithaca, NY, 1996. URL
http://lib.stat.cmu.edu/DASL.
M. Davenport and G. Studdert-Kennedy. Miscellanea: The statistical analysis of aes-
thetic judgment: An exploration. Applied Statistics, 21(3):324–333, 1972.
Bill Davis and J. Jerry Uhl. Matrices, Geometry & Mathematica. Calculus & Mathemat-
ica. Math Everywhere, Inc., 1999.
Edward J. Dudewicz and Thomas G. Ralley. The Handbook of Random Number Genera-
tion and Testing with TESTRAND Computer Code. American Sciences Press, 1981.
R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7:179 – 188, 1936.
Chris Fraley and Adrian Raftery. mclust: Model-based clustering / normal mix-
ture modeling, 2010. URL http://cran.r-project.org/web/packages/mclust/index.html. R package version 3.4.8.

Chris Fraley and Adrian E Raftery. Model-based clustering, discriminant analysis,


and density estimation. Journal of the American Statistical Association, 97(458):611–
631, 2002.
A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml/.
Yasunori Fujikoshi. The likelihood ratio tests for the dimensionality of regression
coefficients. J. Multivariate Anal., 4:327–340, 1974.
K. R. Gabriel. Biplot display of multivariate matrices for inspection of data and
diagnois. In V. Barnett, editor, Interpreting Multivariate Data, pages 147 – 173. John
Wiley & Sons, London, 1981.
Leon J. Gleser and Ingram Olkin. Linear models in multivariate analysis. In Essays in
Probability and Statistics, pages 276–292. Wiley, New York, 1970.
A. K. Gupta and D. G. Kabe. On Mallows’ C p for the GMANOVA model under
double linear restrictions on the regression parameter matrix. Journal of the Japan
Statistical Society, 30(2):253–257, 2000.
P. R. Halmos. Measure Theory. The University Series in Higher Mathematics. Van
Nostrand, 1950.
Kjetil Halvorsen. ElemStatLearn: Data sets, functions and examples from the
book: The Elements of Statistical Learning, Data Mining, Inference, and Prediction
by Trevor Hastie, Robert Tibshirani and Jerome Friedman, 2009. URL http://cran.r-project.org/web/packages/ElemStatLearn/index.html. Material from
the book’s webpage and R port and packaging by Kjetil Halvorsen; R package
version 0.1-7.
D. J. Hand and C. C. Taylor. Multivariate Analysis of Variance and Repeated Measures: A
Practical Approach for Behavioural Scientists. Chapman & Hall Ltd, 1987.
Harry Horace Harman. Modern Factor Analysis. University of Chicago Press, 1976.
Hikaru Hasegawa. On Mallows’ C p in the multivariate linear regression model under
the uniform mixed linear constraints. Journal of the Japan Statistical Society, 16:1–6,
1986.
Trevor Hastie, Robert Tibshirani, and J. H. Friedman. The Elements of Statistical Learn-
ing: Data Mining, Inference, and Prediction. Springer-Verlag Inc, second edition, 2009.
URL http://www-stat.stanford.edu/~tibs/ElemStatLearn/.
Claire Henson, Claire Rogers, and Nadia Reynolds. Always Coca-Cola. Technical
report, University Laboratory High School, Urbana, IL, 1996.
Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for
nonorthogonal problems. Technometrics, 12(1):pp. 55–67, 1970.
Robert V. Hogg, Joseph W. McKean, and A. T. Craig. Introduction to Mathematical
Statistics. Prentice Hall, sixth edition, 2004.
Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. Spam data.
Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304, 1999.

Peter J. Huber. Projection pursuit (C/R: P475-525). The Annals of Statistics, 13:435–475,
1985.

Clifford M. Hurvich and Chih-Ling Tsai. Regression and time series model selection
in small samples. Biometrika, 76:297–307, 1989.

Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis.
Wiley-Interscience, May 2001.

IBM. System/360 scientific subroutine package, version III, programmer’s manual,


program number 360A-CM-03X. In Manual GH20-0205-4, White Plains, NY, 1970.
International Business Machines Corporation.

Richard Arnold Johnson and Dean W. Wichern. Applied Multivariate Statistical Analy-
sis. Pearson Prentice-Hall Inc, sixth edition, 2007.

Takeaki Kariya. Testing in the Multivariate General Linear Model. Kinokuniya, 1985.

Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to


Cluster Analysis. John Wiley & Sons, 1990.

J. B. Kruskal. Non-metric multidimensional scaling. Psychometrika, 29:1 – 27, 115 –


129, 1964.

A. M. Kshirsagar. Bartlett decomposition and Wishart distribution. Annals of Mathe-


matical Statistics, 30(1):239–241, 1959.

Anant M. Kshirsagar. Multivariate Analysis. Marcel Dekker, 1972.

S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathe-


matical Statistics, 22(1):79–86, 1951.

D. N. Lawley and A. E. Maxwell. Factor Analysis As a Statistical Method. Butterworth


and Co Ltd, 1971.

Yann LeCun. Generalization and network design strategies. Technical Report CRG-
TR-89-4, Department of Computer Science, University of Toronto, 1989. URL http://yann.lecun.com/exdb/publis/pdf/lecun-89t.pdf.

E. L. Lehmann and George Casella. Theory of Point Estimation. Springer-Verlag Inc,


second edition, 1998.

Martin Maechler, Peter Rousseeuw, Anja Struyf, and Mia Hubert. Cluster analysis
basics and extensions, 2005. URL http://cran.r-project.org/web/packages/cluster/index.html. Rousseeuw et al. provided the S original which has been
ported to R by Kurt Hornik and has since been enhanced by Martin Maechler.

C. L. Mallows. Some comments on C p . Technometrics, 15(4):661–675, 1973.

J. L. Marchini, C. Heaton, and B. D. Ripley. fastICA: FastICA algorithms to per-


form ICA and projection pursuit, 2010. URL http://cran.r-project.org/web/packages/fastICA/index.html. R package version 1.1-13.

K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press,


London, 1979.

Kenny J. Morris and Robert Zeppa. Histamine-induced hypotension due to morphine


and arfonad in the dog. Journal of Surgical Research, 3(6):313–317, 1963.
I. Olkin and S. N. Roy. On multivariate distribution theory. Annals of Mathematical
Statistics, 25(2):329–339, 1954.
NBC Olympics. http://www.2008.nbcolympics.com, 2008.
K. Pearson. On lines and planes of closest fit to systems of points in space. Philosoph-
ical Magazine, 2(6):559–572, 1901.
Michael D. Perlman and Lang Wu. On the validity of the likelihood ratio and maxi-
mum likelihood methods. J. Stat. Plann. Inference, 117(1):59–81, 2003.
Richard F. Potthoff and S. N. Roy. A generalized multivariate analysis of variance
model useful especially for growth curve problems. Biometrika, 51:313–326, 1964.
J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers
Inc., San Francisco, 1993.
R Development Core Team. The Comprehensive R Archive Network, 2010. URL
http://cran.r-project.org/.
C. R. Rao. Linear Statistical Inference and Its Applications. Wiley, New York, 1973.
G. M. Reaven and R. G. Miller. An attempt to define the nature of chemical diabetes
using a multidimensional analysis. Diabetologia, 16:17–24, 1979.
Brian Ripley. tree: Classification and regression trees, 2010. URL http://cran.r-project.org/web/packages/tree/. R package version 1.0-28.
J. Rousseauw, J. Plessis, A. Benade, P. Jordaan, J. Kotze, P. Jooste, and J. Ferreira.
Coronary risk factor screening in three rural communities. South African Medical
Journal, 64:430 – 436, 1983.
Peter Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of
cluster analysis. J. Comput. Appl. Math., 20(1):53–65, 1987.
Henry Scheffé. The Analysis of Variance. John Wiley & Sons, 1999.
Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:
461–464, 1978.
George W. Snedecor and William G. Cochran. Statistical Methods. Iowa State Univer-
sity Press, Ames, Iowa, eighth edition, 1989.
C. Spearman. “General intelligence,” objectively determined and measured. American
Journal of Psychology, 15:201–293, 1904.
David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin, and Angelika van der Linde.
Bayesian measures of model complexity and fit (Pkg: P583-639). Journal of the Royal
Statistical Society, Series B: Statistical Methodology, 64(4):583–616, 2002.
H. A. Sturges. The choice of a class interval. Journal of the American Statistical Associa-
tion, 21(153):65–66, 1926.
A. Thomson and R. Randall-MacIver. Ancient Races of the Thebaid. Oxford University
Press, 1905.

TIBCO Software Inc. S-Plus. Palo Alto, CA, 2009. URL http://www.tibco.com.
R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data
set via the gap statistic. Journal of the Royal Statistical Society B, 63:411 – 423, 2001.
W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, New York,
fourth edition, 2002.
J H Ware and R E Bowden. Circadian rhythm analysis when output is collected at
intervals. Biometrics, 33(3):566–571, 1977.
Sanford Weisberg. Applied Linear Regression. John Wiley & Sons, third edition, 2005.
Wikipedia. List of breakfast cereals — Wikipedia, The Free Encyclopedia,
2011. URL http://en.wikipedia.org/w/index.php?title=List_of_breakfast_cereals&oldid=405905817. [Online; accessed 7-January-2011].
Wikipedia. Pterygomaxillary fissure — Wikipedia, The Free Encyclopedia,
2010. URL http://en.wikipedia.org/w/index.php?title=Pterygomaxillary_fissure&oldid=360046832. [Online; accessed 7-January-2011].
John Wishart. The generalised product moment distribution in samples from a nor-
mal multivariate population. Biometrika, 20A:32 – 52, 1928.
Peter Wolf and Uni Bielefeld. aplpack: Another Plot PACKage, 2010. URL http://cran.r-project.org/web/packages/aplpack/index.html. R package version
1.2.3.
John W. Wright, editor. The Universal Almanac. Andrews McMeel Publishing, Kansas
City, MO, 1997.
Gary O. Zerbe and Richard H. Jones. On application of growth curve techniques to
time series data. Journal of the American Statistical Association, 75:507–509, 1980.
Index

The italic page numbers are references to Exercises.

affine transformation, 40–41 conjugate prior, 43


covariance matrix, 40 multivariate normal distribution
definition, 40 mean, 64–65
mean, 40 mean and covariance, 147–149
multivariate normal distribution, regression parameter, 109–110
49–52 Wishart distribution, 146–147
conditional mean, 56 biplot, 13–14
data matrix, 53–54 cereal data, 23–24
Akaike information criterion (AIC), 126, decathlon data, 24
158 painters data, 23
equality of covariance matrices sports data, 13–14
iris data, 210–211 both-sides model, 73–80
factor analysis, 181
a.k.a. generalized multivariate anal-
motivation, 161
ysis of variance (GMANOVA)
multivariate regression, 161–162
model, 80
caffeine data, 81
Bartlett’s test, 174
Bayes information criterion (BIC), 158– estimation of coefficients, 89–90
160 bothsidesmodel (R routine), 283–
as posterior probability, 159 284
both-sides model caffeine data, 110–111
caffeine data, 167 covariance matrix, 103–106
clustering, 240 expected value, 103
equality of covariance matrices grades data, 111
iris data, 210–211 histamine in dogs data, 108–
factor analysis, 181 109
principal components mouth size data, 106–108
automobile data, 259–262 prostaglandin data, 110
Bayes theorem, 39, 159 Student’s t, 106
multivariate normal distribution, estimation of covariance matrix,
64 105
Bayesian inference, 39 fits and residuals, 104–105
binomial, 43–44 covariance matrix, 104–105


distribution, 104 iris data, 278


estimation, 104 states data, 278–279
expected value, 104 Cauchy-Schwarz inequality, 135–136
multivariate normal distribution, Chernoff’s faces, 2–3
105 iris data, 5
growth curve model planets data, 2–3
births data, 76–78 spam data, 22
mouth size data, 107 sports data, 22–23
histamine in dogs data, 81–82 classification, 199–224, 253
hypothesis testing Bayes classifier, 201
approximate χ2 tests, 113–114 spam data, 224–225
bothsidesmodel.test (R routine), classifier, 201
284 estimation, 202
histamine in dogs data, 120 cross-validation, 205–206
Hotelling’s T 2 and projection leave out more than one, 206
pursuit, 139–140 leave-one-out, 206, 208
Hotelling’s T 2 test, 116, 138– discriminant function, 204
139 linear, 204
Lawley-Hotelling trace test, 114– error, 201, 206
117 cross-validation estimate, 206
linear restrictions, 120–121 observed, 206, 208
mouth size data, 114, 118–120 Fisher’s linear discrimination, 203–
Pillai trace test, 118 210
prostaglandin data, 131 crabs data, 225
Roy’s maximum root test, 118 definition, 204
Wilks’ Λ, 117 heart disease data, 227–228
iid model, 75–78 iris data, 207–210
independence of coefficients and lda (R routine), 285
covariance estimator, 106 modification, 211–212
intraclass correlation structure zipcode data, 226–227
mouth size data, 191–194 Fisher’s quadratic discrimination,
least squares, 89–90 210–211
Mallows’ C p definition, 210
caffeine data, 133 iris data, 210–211
mouth size data, 129–130 predict.qda (R routine), 286
mouth size data, 79–80 qda (R routine), 285
parity data, 82–83 illustration, 202
prostaglandin data, 80 logistic regression, 212–217
pseudo-covariates heart disease data, 226
histamine in dogs data, 132– iris data, 214
133 spam data, 214–217
new observation, 201
canonical correlations, 175, 270–275 subset selection
Bayes information criterion, 274– iris data, 208–210
275 trees, 217–224
grades data, 275 Akaike information criterion, 222
compared to partial least squares, Bayes information criterion, 222
276 C5.0, 220
exams data, 278 Categorization and Regression
grades data, 273–274 Trees (CART), 220

crabs data, 225–226 Mclust (R routine), 241


cross-validation, 224 mixture density, 240
deviance, 221 multivariate normal model, 240
heart disease data, 218–224 plotting
pruning, 222 sports data, 233–235
snipping, 219 soft K-means, 246–247
spam data, 227 sports data, 247
tree (R routine), 221 conditional probability, 30
clustering, 199, 229–251 smoking data, 45–46
classifier, 229 contrasts, 122
density of data, 201 convexity, 20
hierarchical, 230, 247–251 correlation inequality, 136
average linkage, 248 covariance (variance-covariance) matrix
cereal data, 252 collection of random variables, 34
complete linkage, 248 generalized variance, 146
dissimilarities, 239 multivariate normal distribution,
grades data, 247–249, 252 49–51
hclust (R routine), 249 data matrix, 53
plclust (R routine), 249 of two vectors, 34–35
single linkage, 248 principal components, 11
soft drink data, 252 sample, 8
sports data, 249–251 sum of squares and cross prod-
K-means, 230–238 uct matrix, 11
algorithm, 230 covariance models, 171–191
diabetes data, 251–252 factor analysis, see separate entry
gap statistics, 231–232, 237 invariant normal models, see sym-
grades data, 251 metry models
kmeans (R routine), 236–237 symmetry models, see separate en-
objective function, 230 try
relationship to EM algorithm, testing conditional independence
246 grades data, 196
silhouette.km (R routine), 286 testing equality, 172–174
silhouettes, 233–235, 237–238
mouth size data, 195–196
sort.silhouette (R routine), 286
testing independence, 174–175
sports data, 230–233, 236–238
grades data, 196
K-medoids, 238–240
testing independence of blocks of
dissimilarities, 238
variables, 270
grades data, 239–240
cross-validation, see classification
objective function, 239
pam (R routine), 239
silhouettes, 233, 239 data examples
model-based, 229, 230, 240–246, automobiles
253 entropy, 24
automobile data, 240–243 model-based clustering, 240–243
Bayes information criterion, 240– principal components, 256–257,
241, 243 259–262
classifier, 240 births, 76
EM algorithm, 245–246 growth curve models, 76–78
iris data, 252 caffeine, 81
likelihood, 240 Bayes information criterion, 167

both-sides model, 81, 110–111, K-medoids clustering, 239–240


133 multidimensional scaling, 269
orthogonalization, 101 multivariate regression, 70–71
cereal, 23 testing conditional independence,
biplot, 23–24 177–178
cereals testing equality of covariance
hierarchical clustering, 252 matrices, 173
crabs, 225 testing independence, 175
classification, 225–226 Hewlett-Packard spam data, 22
decathlon, 24 Chernoff’s faces, 22
biplot, 24 classification, 224–225, 227
factor analysis, 197–198 logistic regression, 214–217
diabetes, 251 principal components, 22, 278
K-means clustering, 251–252 histamine in dogs, 71
elections, 23 both-sides model, 81–82, 108–
principal components, 23 109, 120, 132–133
exams, 196 multivariate analysis of variance,
canonical correlations, 278 71–73
covariance models, 196–197 multivariate regression, 131–132
factor analysis, 196 leprosy, 82
Fisher-Anderson iris data, 5 covariates, 121–124
Akaike information criterion, 210– multivariate analysis of variance,
211 82
Bayes information criterion, 210– multivariate regression, 167–168
211 orthogonalization, 101
canonical correlations, 278 Louis Roussos sports data, 13
Chernoff’s faces, 5 biplot, 13–14
classification, 199 Chernoff’s faces, 22–23
Fisher’s linear discrimination, hierarchical clustering, 249–251
207–210 K-means clustering, 230–233, 236–
Fisher’s quadratic discrimina- 238
tion, 210–211 multidimensional scaling, 269–
logistic regression, 214 270
model-based clustering, 252 plotting clusters, 233–235
principal components, 12, 254– principal components, 13–14
256 soft K-means clustering, 247
projection pursuit, 16–18 mouth sizes, 71
rotations, 10, 24 both-sides model, 79–80, 106–
scatter plot matrix, 5 108, 114, 118–120, 191–194
subset selection in classification, covariance models, 195–196
208–210 intraclass correlations, 191–194
grades, 70 Mallows’ C p , 129–130
Bayes information criterion, 184 model selection, 163–165
both-sides model, 111 multivariate regression, 71, 130–
canonical correlations, 273–275 131
covariance models, 196 pseudo-covariates, 125
factor analysis, 182–185, 196 painters, 23
hierarchical clustering, 247–249, biplot, 23
252 principal components, 277–278
K-means clustering, 251 parity, 82

both-sides model, 82–83 of collection, 33


planets, 2 of matrix, 34
Chernoff’s faces, 2–3 of vector, 34
star plot, 2–3 variance, 33
prostaglandin, 80
both-sides model, 80, 110, 131 factor analysis, 171, 178–185, 253
RANDU Akaike information criterion, 181
rotations, 24–25 Bartlett’s refinement, 183
skulls, 80 Bayes information criterion, 181
multivariate regression, 80, 110, grades data, 184, 196
131, 168 compared to principal components,
orthogonalization, 101 262–264
smoking, 45 decathlon data, 197–198
conditional probability, 45–46 estimation, 179–180
soft drinks, 252 exams data, 196
hierarchical clustering, 252 factanal (R routine), 182
South African heart disease, 197 goodness-of-fit, 183
classification, 226–228 grades data, 182–185
factor analysis, 197 heart disease data, 197
trees, 218–224 likelihood, 180
states, 278 model, 178
canonical correlations, 278–279 model selection, 180–181
zipcode, 226 rotation of factors, 181, 184
classification, 226–227 scores, 181
data matrix, 2 estimation, 184
planets data, 2 structure of covariance matrix, 178
data reduction, 253 uniquenesses, 182
density, see probability distributions varimax, 181
deviance, see likelihood Fisher’s linear discrimination, see clas-
dissimilarities, 238, 239 sification
distributions, see probability distribu- Fisher’s quadratic discrimination, see
tions classification
fit, 87
eigenvalues and eigenvectors, 12, see
also principal components gamma function, 43
positive and nonnegative definite generalized multivariate analysis of vari-
matrices, 61–62 ance model (GMANOVA), see
uniqueness, 21 both-sides model
EM algorithm, 240, 245–246 glyphs, 2–3
entropy and negentropy, see projection Chernoff’s faces, 2–3
pursuit star plot, 2–3
Euler angles, 282 grand tour, 9
expected value, 32–33
correlation coefficient, 33 Hausdorff distance, 248
covariance, 33 hypothesis testing, 156
covariance (variance-covariance) ma- conditional independence, 176–178
trix grades data, 177–178
of matrix, 34 likelihood ratio statistic, 176, 177
of two vectors, 34 equality of covariance matrices, 172–
mean, 33 174

Bartlett’s test, 174 likelihood ratio test, 156–157


grades data, 173 asymptotic null distribution, 156
likelihood ratio statistic, 172, 174 covariance models, 171
F test, 116 equality of covariance matrices,
Hotelling’s T 2 test, 116 172–174
distribution, 138–139 factor analysis, 180–181
null distribution, 116 independence, 174–175
independence of blocks of vari- multivariate regression, 157
ables, 174–175 statistic, 156, 158
as symmetry model, 187 symmetry models, 191
grades data, 175 maximum likelihood estimation,
likelihood ratio statistic, 175 see separate entry
Lawley-Hotelling trace test, 114– multivariate regression, 152
117 principal components, 264–266
likelihood ratio test, 156–157 Wishart, 171
asymptotic null distribution, 156 linear discrimination, Fisher’s, see clas-
Pillai trace test, 118 sification
Roy’s maximum root test, 118 linear models, 67–83, 253
symmetry models, 191 both-sides model, see separate en-
likelihood ratio statistic, 191 try
Wilks’ Λ, 117 covariates, 121–125, 154
adjusted estimates, 123
Jensen’s inequality, 20, 21–22 general model, 124
leprosy data, 121–124
least squares, 88–89 definition, 90–91
definition, 88 estimation, 103–111
equation, 88 Gram-Schmidt orthogonalization,
normal equations, 88 91–93
projection, 88 growth curve model, 74, 76
regression, 44 hypothesis testing, 113–133
least squares estimation hypothesis tests, see also both-sides
both-sides model, 89–90 model
distribution, 103–109 least squares, see separate entry
likelihood, 151–165 least squares estimation, see sep-
definition, 151 arate entry
deviance linear regression, 67–69
Akaike and Bayes information analysis of covariance, 68
criteria, 158 analysis of variance, 68
canonical correlations, 275 cyclic model, 69
definition, 158 model, 67
equality of covariance matrices, polynomial model, 69
211 simple, 68
factor analysis, 181 model selection, 125–130
likelihood ratio statistic, 158 multivariate analysis of variance,
multivariate regrssion, 162 121, see separate entry
observed, 158 multivariate regression, see sepa-
prediction, 161 rate entry
principal components, 259, 262 orthogonal polynomials, 95–96
symmetry models, 191 prediction, 125–130
trees, 221 big model, 126

estimator of prediction sum of QR, 93–94, 141–143


squares, 128 singular value, 272–273
expected value of prediction sum spectral, see separate entry
of squares, 128 determinant
Mallows’ C p , 127–130 Cholesky decomposition, 100
Mallows’ C p in univariate re- decomposing covariance, 99
gression, 129 group, 186
Mallows’ C p , definition, 128 complex normal, 195
prediction sum of squares, 127 compound symmetry, 188
submodel, 126 definition, 93
pseudo-covariates, 125 orthogonal, 142, 143, 186
mouth size data, 125 permutation, 188
repeated measures, 74 upper triangular, 94
ridge regression estimator, 110 upper unitriangular, 93
linear subspace idempotent, 20
basis definition, 8
definition, 87 in Wishart distribution, 58, 65,
existence, 100 105
definition, 85 Kronecker product, 62
dimension, 87 projection, 89, 97
least squares, see separate entry spectral decomposition, 58
linear independence, 86 identity, 7
orthogonal, 87 inverse
orthogonal complement, 89 block, 99
projection Kronecker product
definition, 87 definition, 53
properties, 87–88 properties, 54, 62
projection matrix, see matrix: pro- orthogonal, 186
jection orthogonal polynomials, 95–96
span positive and nonnegative definite,
definition, 85 51
matrix representation, 86 and eigenvalues, 61–62
transformation, 86 projection, 87–88, 127, 258
linear transformations, 8 idempotent, 89, 97
logit, 213 properties, 89
square root, 50
Mallows’ C p , 157, 161 Cholesky decomposition, 95
both-sides model, 126–130 symmetric, 51
caffeine data, 133 maximum likelihood estimation, 151–
definition, 128 156
expected prediction sum of squares, both-sides model, 153–155
127 covariance matrix, 155
prediction sum of squares, 127 definition, 152
prostaglandin data, 131 multivariate normal covariance ma-
linear regression, 129 trix, 153
matrix multivariate regression, 152–153
centering, 7, 20 coefficients, 152
decompositions covariance matrix, 153
Bartlett’s, 142 symmetry models, 189–190
Cholesky, 94–95 mixture models, 199–203

    conditional probability of predictor, 200
    data, 199
    density of data, 200
    illustration, 200
    marginal distribution of predictor, 200
    marginal probability of group, 200
model selection, 210–211
    Akaike information criterion, see separate entry
    Bayes information criterion, 126, see separate entry
    classification, 210–211
    factor analysis, 180–181
    linear models, 125–130
    Mallows’ Cp, see separate entry
    mouth size data
        Akaike and Bayes information criterion, 163–165
    penalty, 126, 158
        Mallows’ Cp, 129
moment generating function, 35
    chi-square, 60
    double exponential, 61
    gamma, 60
    multivariate normal, 50
    standard normal, 49
    standard normal collection, 49
    sum of random variables, 45
    uniqueness, 35
multidimensional scaling, 266–270
    classical solution, 266–268
        Euclidean distances, 266–268
        non-Euclidean distances, 268
    grades data, 269
    nonmetric, 269
        stress function, 269
    principal components, 267
    sports data, 269–270
multivariate analysis of variance, 121
    histamine in dogs data, 71–73
    leprosy data, 82, 121
    test statistics, 117–118
multivariate normal distribution, 49–66
    affine transformations, 51
    conditional distributions, 55–57
    covariance matrix, 49
    data matrix, 52–54
        affine transformation, 54
        conditional distributions, 56–57
    definition, 49
    density, 140–141
        Kronecker product covariance, 141
    independence, 52
    marginals, 52
    mean, 49
    moment generating function, 50
    QR decomposition, 141–143
    standard normal
        collection, 49
        univariate, 49
multivariate regression, 69–73
    Akaike information criterion, 161–162
    Bayes information criterion
        leprosy data, 168
        skulls data, 168
    covariates
        histamine in dogs data, 131–132
    estimation
        skulls data, 110
    grades data, 70–71
    hypothesis testing
        mouth size data, 130–131
        skulls data, 131
    likelihood ratio test, 157
    maximum likelihood estimation, 152–153
        leprosy data, 167–168
    mouth size data, 71
    skulls data, 80

orthogonal, see also linear models: Gram–Schmidt orthogonalization
    matrix
        definition, 10
    orthonormal set of vectors, 9
    to subspace, 87
    vectors, 9
orthogonalization
    caffeine data, 101
    leprosy data, 101
    skulls data, 101

partial least squares, 276
Pillai trace test, 118
    mouth size data, 119
prediction, 199, see also linear models: prediction
principal components, 10–15, 171, 253–267
    Bayes information criterion
        automobile data, 259–262
        pcbic (R routine), 287
        pcbic.stepwise (R routine), 287–288
    best K, 13, 18–19
    biplot, 13–14
    choosing the number of, 256–262
    compared to factor analysis, 262–264
    definition, 11
    eigenvalues and eigenvectors, 12
    election data, 23
    iris data, 12, 254–256
    likelihood, 264–266
    painters data, 277–278
    scree plot, 256
        automobile data, 256–257
    spam data, 22, 278
    spectral decomposition theorem, 11
    sports data, 13–14
    subspaces, 257–262
    uncorrelated property, 11, 18
    varimax, 261
probability distributions, 27–37
    beta
        and F, 130
        defined via gammas, 130
        density, 43
        Wilks’ Λ, 130
    beta-binomial, 43
    chi-square
        and Wisharts, 59
        definition, 59
        density, 60–61
        moment generating function, 60–61
    conditional distributions, 30–31, 37–39
        conditional independence, 38
        conditional space, 30
        continuous, density, 31
        covariates, 122, 124
        dependence through a function, 38
        discrete, 30
        independence, 36
        iterated expectation, 32
        multivariate normal data matrix, 56–57
        multivariate normal distribution, 55–57
        notation, 30
        plug-in property, 37–38, 42, 56, 137, 138
        variance decomposition, 38
        Wishart distribution, 136–137
    continuous, 28
        density, 28–29
        mixed-type, 29
        probability density function (pdf), 28
        probability mass function (pmf), 28
    discrete, 28
    double exponential, 61
    expected values, 32–33
    exponential family, 212
    F, 115, 145
    gamma, 60
    Haar on orthogonal matrices, 143
    Half-Wishart, 142, 146
    Hotelling’s T^2, 138–139
        projection pursuit, 139–140
    independence, 35–37
        definition, 36
    marginals, 29–30
    moment generating function, see separate entry
    multivariate normal, see separate entry
    multivariate t, 148
    mutual independence
        definition, 37
    representations, 27
    Student’s t, 106, 145
    uniform, 29
    Wilks’ Λ, see separate entry
    Wishart, see separate entry
projection, 85–96, see also matrix: projection
projection pursuit, 10, 15–18
    entropy and negentropy, 16–18
        automobile data, 24
        estimation, 281–282
        maximized by normal, 16, 19–20
        negent (R routine), 282
        negent2D (R routine), 282–283
        negent3D (R routine), 283
    Hotelling’s T^2, 139–140
    iris data, 16–18
    kurtosis, 15
    skewness, 15

R routines
    both-sides model
        bothsidesmodel, 283–284
        bothsidesmodel.test, 284
    classification
        lda, 285
        predict.qda, 286
        qda, 285
    clustering
        silhouette.km, 286
        sort.silhouette, 286
    entropy and negentropy
        negent, 282
        negent2D, 282–283
        negent3D, 283
    principal components
        helper functions, 288
        pcbic, 287
        pcbic.stepwise, 287–288
random variable, 27
    collection of, 27
rectangle, 35
residual, 55, 87
rotations, 10
    example data sets, 25
    iris data, 24
    RANDU, 24–25
Roy’s maximum root test, 118
    mouth size data, 119

sample
    correlation coefficient, 7
    covariance, 7
    covariance (variance-covariance) matrix, 8
        marginals, 8
    mean, 6
    mean vector, 7
    variance, 6
scatter plot
    matrix, 3–5
        iris data, 5
    stars, 3
spectral decomposition, 11–12
    eigenvalues and eigenvectors, 12, 21
    intraclass correlation structure, 21
    theorem, 12
spherical symmetry, 188
star plot, 2
    planets data, 2–3
sum of squares and cross-products matrix, 8
supervised learning, 199, 229
symmetry models, 186–191
    a.k.a. invariant normal models, 186
    complex normal structure, 195
    compound symmetry, 188
        grades data, 196–197
    definition, 186
    hypothesis testing, 191
    iid, 188
    independence of blocks of variables, 187
    intraclass correlation structure, 187–188
        mouth size data, 191–194, 196
        spectral decomposition, 21
    likelihood ratio statistic, 191
    maximum likelihood estimation, 189–190
        independence, 190
        intraclass correlation, 190
    spherical symmetry, 188
    structure from group, 189

total probability formula, 33
trace of a matrix, 18

unsupervised learning, 199, 229

variance, 33, see also covariance (variance-covariance) matrix
varimax, 181, 261
    R function, 261
vector
    norm of, 9
    one, 7

Wilks’ Λ, 117
    mouth size data, 119
Wishart distribution, 57–60, 171
    and chi-squares, 59
    Bartlett’s decomposition, 142
    conditional property, 136–137
    definition, 57
    density, 143–144
    expectation of inverse, 137–138
    for sample covariance matrix, 58
    Half-Wishart, 142, 146
    likelihood, 171
    linear transformations, 59
    marginals, 60
    mean, 59
    sum of independent, 59
