
The Geometry of the General Linear Model

by

Erik D. Jacobson

(Under the direction of Malcolm R. Adams, Edward Azoff, and Theodore Shifrin.)

Abstract

Two complementary geometric interpretations of data are used to discuss topics from elementary

statistics including random variables, vectors of random variables, expectation, mean, variance,

and the normal, F , and t probability distributions. The geometry of the general linear model and

the associated hypothesis testing is developed, followed by a geometrically oriented discussion

of the analysis of variance, simple regression, and multiple regression using examples. Geometry

affords a rich discussion of orthogonality, multicollinearity, and suppressor variables, as well

as multiple, partial, and semi-partial correlation. The last chapter describes the mathematical

application of homogeneous coordinates and perspective projections in the computer program

used to generate the representations of data vectors for several figures in this text.

Index words: Projections, geometry, homogeneous coordinates, statistics, linear models,


regression
The Geometry of the General Linear Model

by

Erik D. Jacobson

B.A., Dartmouth College, 2004

A Thesis Submitted to the Graduate Faculty

of The University of Georgia in Partial Fulfillment

of the

Requirements for the Degree

Master of Arts

Athens, Georgia

2011
© 2011

Erik D. Jacobson

All Rights Reserved


The Geometry of the General Linear Model

by

Erik D. Jacobson

Approved:

Major Professor: Theodore Shifrin

Committee: Malcolm R. Adams


Edward Azoff

Electronic Version Approved:

Maureen Grasso
Dean of the Graduate School
The University of Georgia
August 2011
Acknowledgments

I recognize the insight, ideas, and encouragement offered by my thesis committee: Theodore

Shifrin, Malcolm R. Adams, and Edward Azoff and by Jonathan Templin. All four gamely dug

into novel material, entertained ideas from other disciplines, and showed admirable forbearance

as deadlines slipped past and the project expanded. I cannot thank them enough.

Contents

Acknowledgments ii

List of Figures v

List of Tables viii

Introduction 1

1 Statistical Foundations and Geometry 4

1.1 Two geometric interpretations of data . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2 Random variables and expectation . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.3 Centered vectors, sample mean, and sample variance . . . . . . . . . . . . . . . . 10

1.4 An illustrative example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2 The General Linear Model 23

2.1 Statistical models and the general linear model . . . . . . . . . . . . . . . . . . . 23

2.2 Linear combinations of random variables . . . . . . . . . . . . . . . . . . . . . . . 31

2.3 Testing the estimated model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3 Geometry and Analysis of Variance 41

3.1 One-way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.2 Factorial designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4 The Geometry of Simple Regression and Correlation 57

4.1 Simple regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.2 Correlation analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5 The Geometry of Multiple Regression 69

5.1 Multiple correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.2 Regression coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.3 Orthogonality, multicollinearity, and suppressor variables . . . . . . . . . . . . . 76

5.4 Partial and semi-partial correlation . . . . . . . . . . . . . . . . . . . . . . . . . . 85

6 Probability Distributions 90

6.1 The normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

6.2 The χ2 distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

6.3 The F -distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

6.4 Student’s t-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

7 Manipulating Space: Homogeneous Coordinates, Perspective Projections,

and the Graphics Pipeline 99

7.1 Homogeneous coordinates for points in Rn . . . . . . . . . . . . . . . . . . . . . . 100

7.2 The perspective projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

7.3 The graphics pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

References 109

List of Figures

1.1 Strip plots (a one-dimensional scatterplot) can illustrate observations of a single

variable, but histograms convey the variable’s distribution more clearly. . . . . . 5

1.2 Scatterplots can illustrate multivariate data. . . . . . . . . . . . . . . . . . . . . . 6

1.3 Vector diagram representations of variable vectors that span two and three di-

mensional subspaces of individual space (Rn ). . . . . . . . . . . . . . . . . . . . . 7

1.4 The centered vector yc is the difference of the observation vector y and the mean

vector y1, and the subspace spanned by centered vectors is orthogonal to the

mean vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.5 In variable space, the standard deviation s is a convenient, distribution-related

unit for many variables; in this figure, the origin of each axis is shifted to the

mean of the associated variable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.6 (a) The vector y is plotted in individual space; one must decide if (b) the vector

y is more likely a sample from a distribution centered at the origin of individual

space or instead (c) a sample from a distribution centered away from the origin

on the line spanned by (1,1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

1.7 The vector y can be understood as the sum of y1 and a vector e that is orthogonal

to 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

1.8 The distribution of the random variable Y has different centers and relies on

different estimates for the standard deviation under (a) the null hypothesis and

(b) the alternative hypothesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

1.9 The t-ratio of ‖ȳ1‖ to ‖e‖ under the t-distribution provides the same probability

information about the likelihood of the observation under the null hypothesis as

the F -ratio of ‖ȳ1‖² to ‖e‖² under the F -distribution. . . . . . . . . . . 22

2.1 The vector ŷ, the projection of y onto V = C(X), is seen to be the unique vector

in V that is closest to y. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.2 The vector ŷ, the projection of the vector y into C([1 x]), is equal to b0 1 + b1 x

and also is equal to b00 1 + b01 xc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.1 The vector ŷ is the sum of two components: y 1 and ŷc . . . . . . . . . . . . . . . 61

4.2 A scatterplot for the simple regression example showing the residuals, the differ-

ence between the observed and predicted values for each individual. . . . . . . . 63

4.3 Least-squares estimate and residuals for the transformed and untransformed data. 64

4.4 (a) Panels of scatter plots give an idealized image of correlation, but in practice,

(b) plots with the same correlation can vary quite widely. . . . . . . . . . . . . . 66

4.5 The vector diagram illustrates that rxc yc = cos(θxc yc ) and that rxc yc0 = − cos(π −

θxc yc0 ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.1 The data are illustrated with a 3D scatter plot that also shows the regression

plane and the error component for the prediction of district mean total fourth

grade achievement score (e2 ) in the Acton, MA. . . . . . . . . . . . . . . . . . . . 71

5.2 The geometric relationships among the vectors yc , xc1 , and xc2 . . . . . . . . . . . 73

5.3 The vector diagrams of VXc1 and VXc3 suggest why the value of the coefficient b2

varies between models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.4 The vectors xc1 , xc2 , and xc3 are moderately pairwise correlated but nearly

collinear. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.5 The generalized volume of the parallelepiped formed by the set of vectors {ui :

0 < i ≤ n} is equivalent to length in one dimension, area in two dimensions, and

volume in three dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.6 The linear combination of xreg and xall equal to ŷ0 (the projection of y0 into

V[xall xreg ] ) must include a term with a positive sign. . . . . . . . . . . . . . . . . . 83

5.7 The arcs in the vector diagram indicate angles for three kinds of correlation

between y and x2 : the angle θp corresponds to the partial correlation conditioning

for x1 ; the angle θs corresponds to the semi-partial correlation with x1 after

controlling for x2 , and the angle θyx1 corresponds to Pearson’s correlation, ryx1 . 85

6.1 The normal distribution with three different standard deviations. . . . . . . . . . 91

6.2 The χ2 distribution with three different degrees of freedom. . . . . . . . . . . . . 93

6.3 The F -distribution centers around 1 as the maximum degrees-of-freedom param-

eter becomes large. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

6.4 The t-distribution approaches the normal distribution as the degrees-of-freedom

parameter increases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

7.1 The x- and y-coordinates of the perspective projection are proportional to k/(k − z). . 105

7.2 The perspective space transformation takes the viewing frustum to the paral-

lelepiped [−w/2, w/2] × [−h/2, h/2] × [−1, 1] in perspective space. . . . . . . . . 108

List of Tables

1.1 A generic data set with one dependent variable, m − 1 independent variables, and

n observations for each variable. . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

3.1 Data for a 3-level factor recording tutoring treatment. . . . . . . . . . . . . . . . 43

3.2 Data for a 2-factor experiment recording observed gain-scores for tutoring and

lecture treatments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.1 Simulated score data for 4 tutors employed by the tutoring company. . . . . . . . 59

4.2 Modified data for 4 tutors and log-transformed data. . . . . . . . . . . . . . . . . 64

5.1 Sample data for Massachusetts school districts in the 1997-1998 school year.

Source: Massachusetts Comprehensive Assessment System and the 1990 U.S.

Census. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5.2 The value of the regression coefficient for per capita income and corresponding

F -ratios in three different models of mean total fourth grade achievement. . . . . 78

5.3 The effect of multicollinearity on the stability of regression coefficients. . . . . . . 84

5.4 Suppressor variables increase the predictive power of a model although they them-

selves are uncorrelated with the criterion. . . . . . . . . . . . . . . . . . . . . . . 85

Introduction

Many of the most popular and useful techniques for developing statistical models are subsumed

by the general linear model. This thesis presents the general linear model and the accompanying

strategy of hypothesis testing from a primarily geometric point of view, detailing both the

standard view of data as points in a space defined by the variables (data-as-points) and the less

common perspective of data as vectors in a space defined by the individuals (data-as-vectors). I

also develop the relevant statistical ideas geometrically and use geometric arguments to relate the

general linear model to analysis of variance (ANOVA) models, and to correlation and regression

models.

My approach to this material is original, although the central ideas are not new. The standard

treatment of this material is predominantly algebraic with a minimal (in the case of regression) or

nonexistent (in the case of ANOVA) discussion of the data-as-points geometry and no mention of

the data-as-vectors approach. In addition, these models are often taught separately in different

courses and never related. Only a very few texts present the data-as-vectors approach (although

historically this was the geometry emphasized by early statisticians), and almost all of these

texts are written for graduate students in statistics and presume sophisticated understandings

of many statistical and mathematical concepts. Another major limitation of these texts is the

quality of the drawings, which are schematic and not based on examples of real data. By contrast,

I emphasize geometry, particularly the data-as-vectors perspective, in order to introduce common

statistical models and to explain how they are in fact closely related to one another. I am able to

use precise drawings of examples with real data because of the DataVectors computer program

I developed to generate interactive representations of vector geometry (available upon request:

[email protected]). I am not aware of any other program that can generate these kinds of

representations. Although it is unlikely they would be useful representations for reports of

original research, the drawings produced by the program (and the interactivity of the program

itself) have potential as powerful pedagogical tools for those learning to think about linear

models using the data-as-vectors geometric perspective.

I have drawn on many sources in preparing this manuscript. In particular, Thomas D.

Wickens’ (1995) text The Geometry of Multivariate Statistics and David J. Saville’s and Gra-

ham R. Wood’s (1991) Statistical Methods: The Geometric Approach introduced me to the

data-as-vectors geometry. Other statistical texts that were particularly helpful include Ronald

Christensen’s (1996) Plane Answers to Complex Questions; S. R. Searle’s (1971) Linear Mod-

els; Michael H. Kutner and colleagues’ (2005) Applied Linear Statistical Models; and George

Casella’s and Roger L. Berger’s (2002) text titled Statistical Inference. Using R to work out

examples was made possible by Julian J. Faraway’s (2004) Linear Models with R. It is worth

mentioning that Elazar J. Pedhazur’s (1997) Multiple Regression in Behavioral Research: Ex-

planation and Prediction led me to initially ask the questions that this manuscript addresses

and informed my understanding of applied multiple regression techniques. The discussion of

projections and homogeneous coordinates in Theodore Shifrin and Malcolm R. Adams’ (2002)

Linear Algebra: A Geometric Approach and Ken Shoemake’s landmark 1992 talk on the arcball

control at the Graphics Interface conference in Vancouver, Canada were invaluable for writing

the DataVectors program. In the work that follows, all of the discussion, examples, and figures

are my own. Except where noted, the proofs are my own work or my adaptation of standard

proofs that are used without citation in two or more standard references.

The statistical foundations of the subsequent chapters are developed in Chapter 1. The

topics addressed include random variables, vectors of random variables, expectation, mean,

variance, and the normal, F -, and t-probability distributions. Two complementary geometric

interpretations of data are presented and contrasted throughout. In Chapter 2, we develop

the geometry of the general linear model and its hypothesis testing by examining a simple experiment. In

Chapter 3, we turn to several examples of analyses from the ANOVA framework and illustrate

the geometric meaning of dummy variables and contrasts. The relationships between these

contrasts, the F -ratio, and hypothesis testing are explored.

Chapter 4 describes simple regression and correlation analysis using two geometric interpre-

tations of data. Variable transformations provide a way to make simple regression models more

general and enable the development of models for non-linear data. In Chapter 5, we discuss

multiple regression from a geometric point of view. The geometric perspectives developed in

the previous chapters afford a rich discussion of orthogonality, multicollinearity, and suppressor

variables, as well as multiple, partial, and semi-partial correlation. Chapter 6 takes us through

a tour of four probability distributions that are referenced in the text and complements the

statistical foundations presented in the first chapter.

The last chapter describes the mathematical basis of the DataVectors program used to gen-

erate some of the figures in this text. Points in R3 can be represented using homogeneous

coordinates which facilitate affine translations of these points via matrix multiplication. Per-

spective projections of R3 to an arbitrary plane can also be realized via matrix multiplication

directly from Euclidean or homogeneous coordinates. Computer systems producing perspective

representations of 3-dimensional objects often perform an intermediate transform of the viewing

frustum in R3 to an appropriately scaled parallelepiped in perspective space, retaining some

information about relative depth.

Chapter 1

Statistical Foundations and


Geometry

Applied statistics is fundamentally concerned with finding quantitative relationships among

variables, the observed features of individuals in a population. Usually, data from an entire

population are not available and instead these relationships must be inferred from a sample, a

subset of the population. We denote the size of the sample by n. Variables are categorized as

independent or dependent; the independent variables are used to predict or explain the depen-

dent variables. Much of the work of applied statistics is finding appropriate models that relate

independent and dependent variables and making disciplined decisions about the hypothesized

parameters of these models. After discussing some foundational ideas of statistics from two

different geometric perspectives (including random variables, expectation, mean, and variance),

statistical models and hypothesis testing are introduced by means of an illustrative example.

Variables
Dependent Independent
Individuals Var1 Var2 Var3 ··· Varm
Individual1 Obs1,1 Obs1,2 Obs1,3 · · · Obs1,m
Individual2 Obs2,1 Obs2,2 Obs2,3 · · · Obs2,m
.. .. .. .. .. ..
. . . . . .
Individualn Obsn,1 Obsn,2 Obsn,3 · · · Obsn,m

Table 1.1: A generic data set with one dependent variable, m − 1 independent variables, and n
observations for each variable.

1.1 Two geometric interpretations of data

One canonical representation of a data set is a table of observations (see Table 1.1). The rows of

the table correspond to individuals and the columns of the table correspond to the variables of

interest. Each entry in the table is a single observation (a real number) of a particular variable

for a particular individual. This representation can be taken as an n × m matrix over R where

n is the number of individuals and m is the number of variables comprising the data set. Then

the rows and columns of the data matrix can be treated as vectors in Rm and Rn , respectively.

[Figure 1.1 panels: a histogram (frequency) and a strip plot of MA District Per Capita Income ($K).]

Figure 1.1: Strip plots (a one-dimensional scatterplot) can illustrate observations of a single
variable, but histograms convey the variable’s distribution more clearly.

Data are usually interpreted geometrically by considering each row vector as a point in

Euclidean space with coordinate axes that correspond to the variables in the data set. Thus,

each individual can be located in Rm with the ith coordinate given by that individual’s ith

variable observation. This space is called variable space. When only one variable is involved, the

space is one dimensional and a strip chart (and more commonly the histogram which more clearly

conveys the variable’s distribution) illustrates this representation (see Figure 1.1). Scatterplots

can illustrate data sets with up to three variables (see Figure 1.2).

There is a second geometric interpretation. The column vectors of the data matrix can

be understood as vectors in Rn and used to generate useful vector diagrams (see Figure 1.3).

Vector diagrams represent (subspaces of) individual space, and offer a complementary geometric

[Figure 1.2: a 3D scatterplot of MA School Districts (1998-1999) plotting Per Capita Income (thousands), Teacher-Student Ratio, and District Mean 4th Grade Total Score.]
Figure 1.2: Scatterplots can illustrate multivariate data.

interpretation of the data. We use boldface letters to denote the vectors, and it is customary for y

to denote the dependent variable(s) (called Var1 in Table 1.1), and for xi where i ranges between

1 and m − 1, to denote the independent variables (called Var2 , Var3 , etc. in Table 1.1). The

reader likely notices one immediate hurdle we face when interpreting the data matrix as vectors in

Rn —the dimension of the individual space is equal to the number of individuals, and, for almost

every useful data set, this far exceeds that which we visualize, let alone reasonably illustrate

on a two-dimensional page or computer screen. In practice, we are limited to illustrating up to

3-dimensional subspaces of individual space. In cases where the number of variables (and hence

the dimension of the smallest-dimensioned subspace of interest) is greater than three, planes and

lines must represent higher-dimensional subspaces and vector diagrams become schematic.
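The two interpretations can be made concrete with a small sketch in R (the software used for this thesis's worked examples); the tiny data matrix below is invented purely for illustration. Each row of the matrix is a point in variable space Rm, and each column is a vector in individual space Rn.

    # A made-up data matrix: n = 4 individuals, m = 2 variables (y and x).
    D <- matrix(c(3, 1,
                  5, 2,
                  4, 2,
                  6, 3),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("Individual", 1:4), c("y", "x")))

    D[1, ]        # variable-space view: the first individual as a point in R^m
    y <- D[, "y"] # individual-space view: the variable y as a vector in R^n
    x <- D[, "x"] # likewise for x
    length(y)     # 4 = n, the dimension of individual space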


Figure 1.3: Vector diagram representations of variable vectors that span two and three dimen-
sional subspaces of individual space (Rn ).

1.2 Random variables and expectation

In statistics, a random experiment is any process with an outcome that cannot be predicted

before the experiment occurs. The set of outcomes for a random experiment is called the sample

space and denoted Ω. The subsets of the sample space are called events and this collection

is denoted S. (Technical considerations may prohibit the inclusion of all subsets of Ω; the

collection of events S must be a σ-algebra.) A probability measure P is a real-valued function

defined on S, such that P (A) ≥ 0 for all A ∈ S and P (Ω) = 1. In addition, P satisfies the

countable additivity axiom: If {An : n ∈ N} is a countable, pairwise disjoint collection of events

then

P ( ⋃_{n∈N} An ) = ∑_{n∈N} P (An ).

In other words, (Ω, S, P ) is a measure space with probability measure P .

A real random variable, often denoted with capital Roman letter, is a real-valued function

defined on Ω, e.g., Y : Ω → R. Each random variable is associated with its cumulative dis-

tribution function, FY : R → R where FY (y) = P (Y ≤ y), where Y ≤ y indicates the set

{ω ∈ Ω : Y (ω) ≤ y}. The cumulative distribution function allows the computation of the prob-

ability of any set in S. The function Y is a continuous random variable if there is a function
fY : R → R satisfying FY (y) = ∫_{−∞}^{y} fY (x) dx; fY is called the probability density function. In an

analogous way, a discrete random variable has an associated density function mY : R → R that

is non-zero only at countably many points and satisfies FY (y) = ∑_{x≤y} mY (x).
One of the most important concepts in statistics is that of the expected value E(Y ) of a ran-

dom variable. In the case where Y is discrete, the expected value can be defined as a weighted

average, E(Y ) = ∑_{y∈R} y mY (y). For example, the expected value of the random variable that

assigns each die roll event to its face-value is E(Y ) = ∑_{y∈{1,2,...,6}} y/6 = 21/6. If Y is continuous, then

we define E(Y ) = ∫_{−∞}^{∞} y fY (y) dy.
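As a quick numerical check of the die-roll example (an R sketch; the simulation size is an arbitrary choice), the weighted average gives 21/6 = 3.5, and the empirical mean of many simulated rolls is close to it:

    faces <- 1:6
    sum(faces * (1/6))                    # exact expected value: 21/6 = 3.5

    set.seed(1)
    rolls <- sample(faces, 1e5, replace = TRUE)
    mean(rolls)                           # empirical average, close to 3.5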

Expectation has the following three properties, stated without proof.

Theorem 1.2.1. For any random variables X and Y , any a in R, and any function g : R → R,

• E(X + Y ) = E(X) + E(Y )

• E(aY ) = aE(Y )
• E(g ◦ Y ) = ∫_{−∞}^{∞} g(y) fY (y) dy

In general, the expectation of a linear combination of random variables is the corresponding

linear combination of the respective expected values, but for our purposes we only need the

weaker result involving a single random variable that is stated next.

Corollary 1.2.2. For any random variable Y, any a in R, and any functions g1 and g2

• E(g1 ◦ Y + g2 ◦ Y ) = E(g1 ◦ Y ) + E(g2 ◦ Y ), and

• E(ag1 ◦ Y ) = aE(g1 ◦ Y ).

Expectation is useful for understanding several other important concepts. Suppose X :

ΩX → R and Y : ΩY → R are two random variables with associated probability measures PX

and PY . The joint probability of the event (A and B), where A ⊂ ΩX and B ⊂ ΩY , is defined

PX ×PY (A and B) = PX (A)PY (B), and the variables have a joint probability distribution defined

FXY (x, y) = P1 × P2 (X ≤ x and Y ≤ y). The expected value of a joint probability distribution

is related to the expected value of the component variables. Two random variables are said to

be independent if FXY (A and B) = PX (A)PY (B). Whenever two random variables, X and Y ,

are independent, the following statement holds:

E(XY ) = E(X)E(Y )

Note that independent is used with several different meanings in this text. Independent variables

in models are those used to predict or explain the dependent variable, but we will see that our

methods require us to assume that the sample of n observations of the dependent variable are the

realized values of the independent random variables Y1 , . . . , Yn . We also will discuss independent

hypothesis tests, a usage which follows from the probabilistic definition just provided for random

variables. If two hypothesis tests are independent, then the outcome of one test does not affect

the likelihood of the possible outcomes of the second.

Expectation is also used to define the covariance of two random variables. The covariance

of the random variables X and Y is

Cov(X, Y ) = E[(X − E(X))(Y − E(Y ))].




It is a useful fact that the covariance of independent random variables is 0.
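The sample analogue of this fact is easy to check by simulation; the sketch below, with arbitrarily chosen distributions and sample size, draws two samples that are generated independently and shows that their sample covariance is close to, though not exactly, zero.

    set.seed(2)
    x <- rnorm(1e5)    # realizations of X
    y <- runif(1e5)    # realizations of Y, generated independently of X
    cov(x, y)          # near 0; the population covariance is exactly 0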

Probability distributions are families of functions indexed by parameters. Parameters are

features of a distribution that can be adjusted to improve the correspondence of a mathematical

probability model and the real-world situation it describes. They are often denoted by Greek

letters, and many are defined using expectation. For example, the mean, µY , of a random

variable Y is a parameter for the normal distribution (see Section 6.1) and is defined as the

expected value of Y :

µY = E(Y ). (1.1)

Later sections of this chapter provide more discussion of this definition and more examples of

parameters.

Problems in inferential statistics often require one to estimate parameters using information

from a sample data vector, a finite set of values that the random variable Y takes during an

experiment with n trials. The sample is often denoted by a bolded, lowercase Roman letter. This
notation is used because a sample can be understood as an n-tuple or a vector in ∏_{i=1}^{n} VY ⊂ Rn .
For example, using the definition of Y in the previous paragraph (the face-value of a die roll),

the experiment in which we roll a die 4 times might yield the sample y = (1, 3, 5, 2)T ∈ R4 . Each

value yi is a realized value of the random variable Y . The sample data vector is geometrically

understood as a vector in individual space.

Equivalently, samples of n observations can be understood as the values of n independent,

identically distributed random variables Yi . The sample y is the realized value of a vector of

random variables, called a random variable vector and denoted with a boldface, capital Roman

letter: Y = (Y1 , Y2 , · · · , Yn )T . Because the roll of a die does not change the likelihood of later

rolls, we say the 4 consecutive rolls are independent. In our example, we might perform the

same experiment just as easily by rolling 4 identical dice at the same time and recording the

4 resulting face-values. The second view of samples, as the realization of random variables, is

more common and will be used from now on in this document. Note that the same notation (a

boldface, capital Roman letter) is used for matrices in standard presentations of linear algebra

and in this text. In the following chapters, the context will indicate whether such a symbol

represents a vector of random variables or a matrix of constants.

Functions of sample data are called statistics, and one important class of statistics are es-

timates of parameters called estimators. They are conceptually distinct from the (possibly un-

known) parameters and are usually denoted by roman letters. For example, the sample mean,

an estimator for the parameter µY , is often symbolized y. Another common notation for esti-

mators, including vectors of estimators, is the hat notation. In the next chapter, for example,

we will use the symbol ŷ to represent the least-squares estimator for y.

1.3 Centered vectors, sample mean, and sample variance

In the previous section, we used expectation to define the mean of a random variable (see

equation (1.1)) and claimed that samples can be used to estimate these parameters. In this

section we define the statistics y and s2 , provide geometric descriptions of both, and show they

are estimators for the mean (µ) and the variance (σ 2 ), respectively. We begin with the useful

concept of a centered vector.
1.3.1 Centered vectors

The centered (data) vector vc of a vector v = (v1 , v2 , · · · , vn )T is defined to be the vector

vc = (vc1 , vc2 , · · · , vcn )T = v − ((1 · v)/n) 1,        (1.2)

where 1 denotes the vector (1, 1, · · · , 1)T ∈ Rn .

1.3.2 The sample mean

The sample mean, y, gives an indication of the central tendency of the sample y, and is defined

as the average value obtained by summing the observations and dividing by n, the number of

observations (see equation 1.3). It is usually denoted by placing a bar over the lowercase letter

denoting the vector, but it is not bolded because it is a scalar. One can gain intuition about

the sample mean by imagining the observations as uniform weights glued to a weightless ruler

according to their value. In this sense, the mean can be understood as the center of mass of the

sample distribution. In variable space, the sample mean can be represented as a point on the

axis of the variable.

ȳ = (1/n) ∑_{i=1}^{n} yi        (1.3)

Since the sample mean is a scalar, it has no direct vector representation in individual space.

However, the mean vector is a useful individual-space object and is written y1. The definition

for a centered vector (see equation 1.2) can be written more parsimoniously as the difference of

two vectors, yc = y − y1. This relationship can also be illustrated geometrically with vector

diagrams (Figure 1.4).

A few facts suggested by Figure 1.4 are worth demonstrating in generality. First, the mean

vector y1 can be understood as (and obtained by) the orthogonal projection of y on the line in

Rn spanned by the vector 1. We have

Proj1 y = ((y · 1)/(1 · 1)) 1 = ((∑_{i=1}^{n} yi )/n) 1 = ȳ1.        (1.4)


Figure 1.4: The centered vector yc is the difference of the observation vector y and the mean
vector y1, and the subspace spanned by centered vectors is orthogonal to the mean vectors.

It follows that a centered vector and the corresponding mean vector are orthogonal.
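Both facts can be verified numerically for the die-roll sample y = (1, 3, 5, 2)T used earlier; the R sketch below computes the projection of y onto the line spanned by 1, confirms that it equals ȳ1, and checks that the centered vector is orthogonal to it.

    y   <- c(1, 3, 5, 2)
    one <- rep(1, length(y))

    proj <- as.numeric((y %*% one) / (one %*% one)) * one  # projection of y onto span(1)
    all.equal(proj, rep(mean(y), length(y)))               # TRUE: the projection is ybar * 1

    yc <- y - mean(y) * one                                # centered vector
    sum(yc * proj)                                         # 0: yc is orthogonal to the mean vector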

In the last section, we called estimators those statistics that can be used to estimate pa-

rameters. A special class of estimators are called unbiased because the expected value of the

estimator is equal to the parameter. For example, we can write E(Y ) = µY if Y is an unbiased

estimator of µY . To make sense of the statement E(Y ) we must think of Y as a linear combina-

tion of the n random variables associated with the sample y, and consider y to be a realization

of this random variable. We use the following definition of Y as a random variable:

Ȳ = (1/n) ∑_{i=1}^{n} Yi ,        (1.5)

where Yi is the random variable realized by the ith observation in the sample y. The definition

of Y as a random variable demonstrates that random variables can be constructed as linear

combinations of other random variables. We will see in later sections that, given distributional

information about the component random variables Yi , we can estimate the mean and variance

of the composite random variables such as Y .

It remains to show that Y is an unbiased estimator of µ, and the computation is straight-

forward:

E(Ȳ) = E( (1/n) ∑_{i=1}^{n} Yi )
     = (1/n) ∑_{i=1}^{n} E(Yi )
     = (1/n) ∑_{i=1}^{n} µ = µ

Notice that this proof relies on the assumption that the random variables Yi are identically

distributed (in particular, they must all have the same mean, µ).

1.3.3 The variance and sample variance

The variance of a random variable Y , σY2 , indicates its variability and helps answer the question:

How much do the observed values of the variable vary from one to another? Returning to the

physical model for the values in a distribution imagined as uniform weights glued to a massless

ruler according to their value, the variance can be understood as the rotational inertia of the

variable’s distribution about the mean. The greater the variance, the greater the force would be

required to change the rotation rate of the distribution by some fixed amount. A random variable

with low variance has realized values that cluster tightly around the mean of the variable.

The variance is defined to be the expected value of the squared deviation of a variable from

the mean:

σY² = Var(Y ) = E[(Y − µ)²].        (1.6)




If the variable’s values are known for the entire population (of size n), then y = µ and the

variance can be computed as the mean squared deviation from the mean:

sn² = (1/n) ∑_{i=1}^{n} (yi − ȳ)².        (1.7)

In most analyses, however, only a sample of observations are available, and the formula (1.7) sys-

tematically underestimates the true population variance: E[(1/n) ∑_{i=1}^{n} (Yi − Ȳ)²] < σ². This phenomenon

is more noticeable with small samples.
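A small simulation makes the bias visible (a sketch with arbitrary choices: normal data with σ² = 1 and samples of size n = 5); dividing by n gives an average well below 1, while dividing by n − 1 does not.

    set.seed(3)
    n <- 5
    sims <- replicate(20000, {
      y <- rnorm(n)                                  # true variance is 1
      c(divide_by_n  = sum((y - mean(y))^2) / n,
        divide_by_n1 = sum((y - mean(y))^2) / (n - 1))
    })
    rowMeans(sims)   # about 0.8 = (n - 1)/n for the first, about 1 for the second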

An unbiased estimator, s2 , of the variance is obtained using n − 1 instead of n in the

denominator:
s² = (1/(n − 1)) ∑_{i=1}^{n} (yi − ȳ)² = (1/(n − 1)) ∑_{i=1}^{n} yci² = ‖yc ‖²/(n − 1).        (1.8)

The proof that E(sn²) ≠ σ² = E(s²) relies on the independence of the observations in the sample,

one of the characteristics of a random sample and a common assumption made of empirical

samples by researchers who perform statistical analyses. It is important to notice the similarity

between equation 1.8 and the numerator and denominator of the F -ratio (see equation 2.15). In

fact, we shall see that the estimate of the sample variance s2 can be understood geometrically as

the per-dimension squared length of the centered vector yc . This will be more fully explained in Section

2.1.3, but the key idea is that yc lives in the orthogonal complement of 1, and this space is

(n − 1)-dimensional. The information in one dimension of the observation vector can be used to

center the vector (and estimate the mean µ), and the information in each of the remaining n − 1

dimensions provides an estimate of the variance σ 2 . The best estimate is therefore the mean of

these n − 1 values which conveniently sum to ‖yc ‖².
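The identity is easy to confirm for the die-roll sample (an R sketch): the built-in sample variance, the defining sum of squared deviations over n − 1, and the squared length of the centered vector over n − 1 all agree.

    y  <- c(1, 3, 5, 2)
    yc <- y - mean(y)                  # centered vector
    n  <- length(y)

    c(var(y),                          # built-in sample variance
      sum((y - mean(y))^2) / (n - 1),  # definition (1.8)
      crossprod(yc) / (n - 1))         # squared length of yc per remaining dimension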

In a manner analogous to y, the sample variance s2 can also be understood as the realization

of the random variable S 2 , which is defined as a function of other random variables:

S² = (1/(n − 1)) ∑_{i=1}^{n} (Yi − Ȳ)².        (1.9)

In Chapter 2, we prove that S 2 is an unbiased estimator after a prerequisite result about the vari-

ance of random variables formed from linear combinations of random variables. Next, we briefly

consider another statistic related to the sample variance that is frequently used in statistical

analyses.

1.3.4 The standard deviation and Chebyshev’s inequality

Sample variance is denoted s2 because it is equal to the square of the sample standard deviation,

s. Both the variance and the standard deviation address the variability of a random variable.

However, the standard deviation is more easily interpreted than variance because the units

for variance are the squared units of the variable. By taking the square root, the variance is

transformed to the metric of the variable and can be more easily compared with values in a data

set. The standard deviation is also useful for estimating the probability that the variable will

take a value in a particular interval of the domain.

The most general result of this kind is Chebyshev’s inequality, which states that

P (|Y − µ| ≥ kσ) ≤ 1/k²,        (1.10)

regardless of the distribution of Y . For sufficiently large samples, Y can be used to estimate µ

and s can be used to estimate σ. For example, suppose that for some sample of 30 observations,

Y = 1 and s = 2. Then the probability that the next observation of Y deviates from the mean
by more than 4 = 2s is at most 1/2² = 1/4, or 25%.
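The bound is distribution-free, so it is typically conservative. As an illustration (an R sketch; the normal distribution is chosen only as an example), for k = 2 the Chebyshev bound is 1/4, while the actual two-sided tail probability of a normal variable beyond two standard deviations is under 5%.

    k <- 2
    c(chebyshev_bound = 1 / k^2,       # 0.25, valid for any distribution
      normal_tail     = 2 * pnorm(-k)) # about 0.046 for a normal variable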
The proof of this inequality follows easily from a more general statement called Markov’s

inequality: If Ỹ ≥ 0, then for any a > 0, we have:

P (Ỹ ≥ a) ≤ E(Ỹ )/a        (1.11)

The proof of Markov’s inequality in the continuous case follows:

E(Y ) = ∫_{VY} y fY (y) dy
      = ∫_{Y<a} y fY (y) dy + ∫_{Y≥a} y fY (y) dy
      ≥ ∫_{Y≥a} y fY (y) dy
      ≥ a ∫_{Y≥a} fY (y) dy
      = a P (Y ≥ a)

Chebyshev’s inequality follows from Markov’s inequality by replacing Ỹ with (Y − µ)2 ≥ 0

and taking a to be k 2 σ 2 . Since |Y − µ| ≥ kσ if and only if Ỹ ≥ k 2 σ 2 , we have

P (|Y − µ| ≥ kσ) = P (Ỹ ≥ a)
                ≤ E(Ỹ )/a
                = σ²/(k²σ²) = 1/k²

Chebyshev’s inequality illustrates the utility of standard deviations as a tool for describing

the distribution of variables in variable space (Rm ). The standard deviation can be understood

as the length of an interval on the axis of a variable that follows a normal distribution (see Section
6.1) and used to help represent the distribution (Figure 1.5). Nevertheless, one limitation of
variable space representations of data is that standard deviations are hard to see in a scatterplot
without calculating s and appropriately labeling the variable axes.

[Figure 1.5 panels: Histogram of X; Scatterplot of (X, Y), CorrXY = 0.7; axes in units of sX and sY.]

Figure 1.5: In variable space, the standard deviation s is a convenient, distribution-related unit
for many variables; in this figure, the origin of each axis is shifted to the mean of the associated
variable.
Standard deviations also have a useful interpretation in vector drawings of individual space
(Rn ). Unlike scatterplots in variable space, vector drawings in individual space already represent
the standard deviation by nature of their construction. From equation (1.8), it is clear that the
variance of a variable can be expressed using the dot product of the centered variable with itself.
It follows from the definition of the Euclidean norm (‖v‖ = √(v · v)) that the standard deviation
of a variable is proportional to the length of the centered vector.

s² = (yc · yc )/(n − 1)   ⟺   s = ‖yc ‖ · (1/√(n − 1))

Since the constant of proportionality 1/√(n − 1) depends only on the dimension of individual space,

all centered variable vectors are scaled proportionally. Assuming comparable units, this means

the ratio of the length of two centered vectors in individual space is equal to the ratio of the

standard deviations of these variables. For example, in the left panel of Figure 1.3, the vector

y has a smaller standard deviation than the vector x because it is shorter. Histograms of these

two variables would show the yi bunched more tightly around their mean.

1.4 An illustrative example

We conclude this chapter with an example to introduce statistical models and hypothesis test-

ing.1 The geometry of individual space is essential here; variable space does not afford a simple

or compelling treatment of these ideas and their standard algebraic treatment at the elementary

level masks rather than illuminates meaning. In fact, the presentation here is closer to the

original formulation of the ideas by R.A. Fisher in the early twentieth century (Herr, 1980).

Suppose a tutor would like to discover if her tutoring improves her students’ scores on a

standardized test. She picks two of her students at random and for each calculates the difference

in test score before and after a one month period of tutoring: g1 = 7, g2 = 9. Then she plots

these gain-scores in individual space as the vector y = (g1 , g2 ). (See Figure 1.6a)

To proceed, we must make a few reasonable assumptions about the situation. First, we

assume that the students’ scores are independent random variables. Among other things, this

means that the gain-score of either student does not affect the gain-score of the other. Next,

we assume that the gain-score for both students follows the same distribution. In particular, we

want to assume that if we could somehow go back in time and repeat the month of tutoring,

then the average gain-score over many repetitions would be the same for each student. Neither

student is predisposed to a greater benefit from tutoring than the other. This assumption lets
1
This example is inspired by something similar in Saville & Wood (1991). I have changed the context and
values, used my own figures, and considerably developed the discussion.


Figure 1.6: (a) The vector y is plotted in individual space; one must decide if (b) the vector y
is more likely a sample from a distribution centered at the origin of individual space or instead
(c) a sample from a distribution centered away from the origin on the line spanned by (1,1).

us postulate a true mean gain-score µ for the population of students, and our goal is to use the

data we have to estimate it. Finally, we make the assumption that the common distribution of

gain-scores is normal with a mean of 0. This implies that the signed length of the vector follows

a normal distribution with a mean of 0 (where the sign of the length is given by the sign of
s
the mean gain-score) and standard deviation √ = s, where s is the standard deviation of
2−1
gain-scores. Moreover, all directions are equally likely because of the assumed independence.

The tutor’s question has two possible answers: The tutoring makes little difference in stu-

dents’ gain-scores or there is some effect. In the first case, we would expect many repetitions of

her procedure to look like Figure 1.6b, and in the second case, many repetitions might look like

Figure 1.6c, with the center of the distribution displaced from the origin. In both figures, the

standard deviation of the length of the vector is indicated by a dashed circle.

1.4.1 The geometry of hypotheses

We call the first possibility the null hypothesis and write H0 : µ = 0, where µ is the mean

gain-score resulting from tutoring. The second possibility is called the alternative hypothesis

H1 : µ ≠ 0. Certainly after plotting 100 repetitions of the tutor’s procedure it would likely

be easy to decide which hypothesis was most plausible; the challenge is to pick the most likely

hypothesis based only on a single trial.

The center of the distribution for the vector y must lie along the line spanned by the vector

1 = (1, 1). Geometrically, we can understand the vector y as the sum of the vector y1 and a

second vector orthogonal to the first which can be written e = y − ȳ1, where ȳ1 is
the orthogonal projection of y on 1. This relationship is shown in Figure 1.7.


Figure 1.7: The vector y can be understood as the sum of y1 and a vector e that is orthogonal
to 1.

The idea for testing the null hypothesis is to compare the lengths of y1 and e using the ratio

t = sgn(ȳ) ‖ȳ1‖ / ‖e‖.

To make sense of the hypothesis test geometrically, consider both parts of Figure 1.8. In both,

the shaded region indicates the cone in individual space where the t-ratio is large. In Figure 1.8a,

the vector y gives an estimate of the variance of the distribution of gain-scores under the null

hypothesis, and the corresponding standard deviation is indicated by the dashed circle centered

at the origin of individual space. In Figure 1.8b, it is instead the vector e that gives an estimate

of the variance of the distribution of gain-scores relative to y1, and the corresponding standard

deviation of this distribution is indicated by the radius of the dashed circle centered at (y, y).

If the t-ratio is large, then the vector y is ‘close’ in some sense to the line spanned by 1.


Figure 1.8: The distribution of the random variable Y has different centers and relies on different
estimates for the standard deviation under (a) the null hypothesis and (b) the alternative
hypothesis.

In this case, we can see geometrically why the null hypothesis is unlikely. If y usually lands

anywhere within the dashed circle in Figure 1.8a, then it is rare that y will land in the shaded cone.

The observation vector y is unusual under the null hypothesis, and thus we can reject the

null hypothesis in favor of the more plausible alternative hypothesis. Notice that under the

alternative hypothesis, the t-ratio will usually be large. Geometrically, we can see that the

dashed circle in Figure 1.8b is entirely within the shaded cone. On the other hand, whenever

the t-ratio is small we have no evidence with which we might reject the null hypothesis.

It can be shown that this ratio follows the t-distribution (see Section 6.4). Integrating the

probability distribution function for the t-distribution between the values of −8 and 8 gives the

quantity 0.921. This suggests that under the assumption of the null hypothesis, we can expect

a t-ratio with an absolute value as high or higher than the one we observed only 8% of the time.

Equivalently, we can expect a sample vector in individual space as close or closer to the line

spanned by 1 only 8% of the time under the assumption of the null hypothesis.
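The whole calculation can be reproduced in a few lines of R (a sketch of this section's example, not part of the original program): the t-ratio for the sample y = (7, 9) is 8, the central probability quoted above is 0.921, and the two-sided tail probability is about 8%. The error space here has a single dimension, so the t-distribution has one degree of freedom.

    y    <- c(7, 9)                      # observed gain-scores
    one  <- rep(1, 2)
    ybar <- mean(y)                      # 8

    m <- ybar * one                      # mean vector ybar * 1 = (8, 8)
    e <- y - m                           # error vector (-1, 1)

    t_ratio <- sign(ybar) * sqrt(sum(m^2)) / sqrt(sum(e^2))   # 8
    2 * pt(abs(t_ratio), df = 1) - 1     # 0.921: probability between -8 and 8
    2 * pt(-abs(t_ratio), df = 1)        # about 0.079: the "8% of the time"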

1.4.2 The F-ratio and the t-test

Like the t-ratio, the F -ratio is used for testing hypotheses and follows the F -distribution (see

Section 6.3). This example illustrates the relationship between the t-ratio and the F -ratio. The

F -ratio is more versatile and will be used almost exclusively from here on for hypothesis testing.

We begin by introducing the linear algebraic notation for describing the vector relationships

depicted in Figure 1.7. This equation is called the model for the experiment. Models will be

defined and elaborated in much more generality in the next chapter. Under this model, the

observation vector y is assumed to be equal to the sum of the error vector e and the matrix

product of the vector 1 considered as a 2 × 1 matrix and the constant ȳ considered as a 1 × 1

matrix.

(7, 9)T = (1, 1)T · 8 + (−1, 1)T ,   that is,   y = Xb + e.        (1.12)

This model explains each sample as the sum of a constant mean gain-score (the product Xb)

plus a small amount of random error (the vector e) that is different for each sample. Another

way to understand the hypothesis test is to reframe our goal as the search for the estimate vector

b of the true vector β. The later describes the true relationship between Y and X. The null

hypothesis states that the vector we are trying to estimate β is zero, whereas the alternative

hypothesis states it is non-zero. It is noteworthy that throughout this thesis, the alternative

hypothesis is a negation of the null hypothesis instead of a hypothesis that specifies a particular

value of β.

The F -ratio is a comparison of the per-dimension squared lengths of (1) the projection y1

of the observation vector y onto the model vector 1 and (2) the error vector e = y − y1. In

this case, the F -ratio is simply the square of the t-ratio (see Section 6.4) and it follows the

F -distribution (see Section 6.3). The F -ratio can be written

F = (‖ȳ1‖²/1) / (‖e‖²/1).

(We mean by per-dimension a factor that is the inverse of the number of dimensions of the sub-

space containing the vector, after individual space has been partitioned into the model subspace

and error subspace. More on this later; see Section 2.1.3.)


Figure 1.9: The t-ratio of ‖ȳ1‖ to ‖e‖ under the t-distribution provides the same probability
information about the likelihood of the observation under the null hypothesis as the F -ratio of
‖ȳ1‖² to ‖e‖² under the F -distribution.

The values ‖ȳ1‖² and ‖e‖² are illustrated in Figure 1.9. Again it is quite evident from the

geometry that this ratio is much greater than one. The F -distribution tells us how unusual such

an observation would be, agreeing with our prior result. Under the null hypothesis, we would

expect a sample with an F statistic this large or larger only 8% of the time.
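Continuing the R sketch from the previous section, the agreement between the two tests can be checked directly: F = t² = 64, with one numerator and one denominator degree of freedom, and the F - and t-distributions assign it the same tail probability.

    t_ratio <- 8                          # from the tutoring example
    F_ratio <- t_ratio^2                  # 64 = (8 * sqrt(2))^2 / (sqrt(2))^2

    pf(F_ratio, df1 = 1, df2 = 1, lower.tail = FALSE)  # about 0.079
    2 * pt(-t_ratio, df = 1)                           # the same tail probability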

Chapter 2

The General Linear Model

2.1 Statistical models and the general linear model

A statistical model is analogous to a mathematical function. It expresses the dependent variable

Y as a function of the independent variables xi . One difference between functions and models

is that models incorporate random error. For a fixed set of input values, the model does not

always return the same value because each output contains a random component.

A model is called the true model when it is assumed to express the real relationship between

the variables. However, if the data collected are only a sample from the population, the

true model can never be discovered with certainty. Only approximations of the true model are

possible. To make the problem of finding an approximation tractable, a family of models is

defined using a small number of parameters, θ1 , θ2 , · · · , θk (see equation 2.1). The true model

can then be expressed with fixed (but unknown) parameter values, and the sample data can be

used to estimate them.

Y = fθ1 ,θ2 ,··· ,θk (x1 , x2 , · · · , xp ) (2.1)

We define the design matrix X of a linear model to be the matrix [1 x1 x2 · · · xp ]. With

this notation and a general framework for statistical models in mind, we are prepared to state

the general linear model:

Y = Xβ + E, (2.2)

where Y, E ∈ Rn and X is an n × (p + 1) matrix over R. In the general linear model, the

vector of parameters β = (β0 , β1 , · · · , βp )T is analogous to the parameters θ1 , θ2 , · · · , θk used

in equation (2.1). The vector E = (E1 , E2 , · · · , En )T ∈ Rn is the random error component of

the model and is the difference between the observed values and those predicted by the product

Xβ.

The example model from the first chapter (see equation 1.12) is a simple case of the general

linear model. The design matrix for this model is simply the vector X = 1, Xb is the vector

y1, and the vector e is the centered observation vector yc . In general, the matrix product Xb

is the projection of y onto the column space of X (which in this case is the line spanned by

1). Recall that the use of a lower-case, bold y and e indicates vectors of observed values from

a particular sample. The capital, bold Y and E in equation (2.2) indicate the corresponding

random variable. Just as Greek characters refer to individual population parameters and Roman

characters refer to corresponding sample statistics (e.g., µ and x), we use β for the vector of

model parameters but b for the vector of corresponding sample statistics.

An equivalent statement of the general linear model is a system of equations for each depen-

dent random variable Yi . Thus, for all i, we have:

Yi = Xi β + Ei = β0 + β1 xi,1 + · · · + βp xi,p + Ei ,        (2.3)

where Xi is the ith row of the design matrix X. As demonstrated in later chapters, the general

linear model subsumes many of the most popular statistical techniques including analysis of

variance (ANOVA) models in which the design matrix columns xi are categorical variables

(for example, gender or teacher certification status), simple regression and correlation models

comparing two continuous variables, and multiple regression models in which there are two or

more continuous, independent variables (e.g., teacher salary and years of experience).
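In R, a design matrix X = [1 x1 x2 · · · xp ] for any of these special cases can be produced with model.matrix(); the sketch below uses invented data with one categorical and one continuous independent variable (the variable names are hypothetical). The result has an intercept column of 1s, a dummy-coded column for the factor, and the continuous predictor as given.

    d <- data.frame(
      gain   = c(7, 9, 4, 6, 8, 5),
      group  = factor(c("tutor", "tutor", "control", "control", "tutor", "control")),
      salary = c(41, 38, 35, 40, 42, 37)
    )

    X <- model.matrix(~ group + salary, data = d)
    X   # columns: (Intercept), grouptutor (a dummy variable), salary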

The formulation of the general linear model presented here treats the observed values of the

independent variables as fixed constants instead of as realized random variables. In experimen-

tal data sets, this is often appropriate because researchers control the values of the independent

variables (e.g., dosage in drug trials). In observational data sets, it often makes more sense

to treat the observations of independent variables as realized random variables, because these

observations are determined by choice of the sample rather than experimental design. Under

certain conditions, the general linear model can be used when both independent and depen-

dent variables are treated as random variables. For example, the requirement that all random

variables have a multivariate normal distribution is sufficient but not strictly necessary. Even

though the results hold in greater generality, all independent variables are treated as fixed in

order to simplify the presentation.

2.1.1 Fitting the general linear model

Given a data set and a model, the next step of statistical analysis is to use the sample data to

estimate the parameters of the model. This process is called fitting the model. Our goal is that

the vector of parameter estimates, b = (b0 , b1 , · · · , bp )T , be as close as possible to β, the vector

of putative true parameter values. Such a comparison is unfortunately impossible because β is

unknown. However, since the model separates realized values of each random variable Yi into a

systematic component Xi β (where Xi is the ith row of the matrix X) and a random component

Ei , a feasible goal is to find a vector b so that Xi b is as close as possible to Yi , for each i,

1 ≤ i ≤ n. To accomplish this, we need a number that can summarize the model deviation over

all the observed values in the sample. A natural choice for this number is the length of the error

vector E, because the Euclidean norm depends on the value of each coordinate. Therefore, it

suffices to find a vector b which minimizes the expected length of E.

With a sample of observations y in hand, the best estimate available for Y is simply the

sample y. Restating the goal identified above in terms of the sample, we say we are looking

for a vector b that will minimize the length of the difference between Xb and y. This difference is the best

available estimate for E and is denoted e. It follows that the fitted model can be written

y = Xb + e, (2.4)

and similarly for each observed yi we can write

yi = Xi b + ei .

Figure 2.1: The vector ŷ, the projection of y onto V = C(X), is seen to be the unique vector in
V that is closest to y.

The strategy for finding b can be understood geometrically by imagining y in n-dimensional

Euclidean space. The vector β is assumed to lie in the (p + 1)-dimensional subspace spanned

by 1 and the vectors xi . We are looking for a vector ŷ = Xb in the column space of X, C(X),

such that the length of e = y − ŷ is minimized. (The hat notation indicates an estimate; ŷ is an

estimate for the observed vector of y-values y.) The situation is represented in Figure 2.1 and

the proof of Lemma 2.1.1 follows easily.

Lemma 2.1.1. Let V ⊂ Rn be a subspace and suppose y ∈ Rn. For each vector v ∈ V, let ev = y − v. Then there is a unique vector ŷ ∈ V such that 0 ≤ ‖eŷ‖ < ‖ev‖ for all v ≠ ŷ.

Proof. We can write Rn = V ⊕ V⊥, and claim that ŷ is the projection of y onto V, the unique vector in V such that eŷ ∈ V⊥. Let v ∈ V with v ≠ ŷ. By the Pythagorean Theorem, ‖ev‖² = ‖eŷ‖² + ‖ŷ − v‖², which implies 0 ≤ ‖eŷ‖ < ‖ev‖.

The projection of y onto C(X), the vector ŷ = Xb, gives the desired estimate for y. It

remains to state and prove the general method for obtaining b using the sample data vector y

and the matrix X. The trivial case in which y ∈ C(X) does not require estimation because the

model (equation 2.4) can then be solved directly for b—it is not addressed here.

Theorem 2.1.2. Let X be an n × k matrix with linearly independent columns, let y ∈ Rn

be in the complement of C(X), and let b be a vector in Rk . Suppose that Xb = ŷ and that

y − ŷ ∈ C(X)⊥ . We claim that b = (XT X)−1 XT y.

Proof. C(X)⊥ is the null-space of the matrix XT , and so

XT (y − Xb) = 0.

Consequently,

(XT X)b = XT y.

The result follows as long as XT X is nonsingular, for in this case

b = (XT X)−1 XT y.

To see that XT X is nonsingular, let v ∈ Rk and suppose XT Xv = 0. It follows that 0 = (XT Xv) · v = (Xv) · (Xv) = ‖Xv‖², and therefore that Xv = 0. We conclude that v = 0 since Xv = 0 and X has rank k.

This method produces a vector b of parameter estimates, called the least squares estimate

because it minimizes the sum of the squared components of e. Another method of finding a

formula for b is to use calculus to find critical points of the function S(b) = ‖e‖² = ‖y − Xb‖². Setting the gradient of S to zero yields the normal equations

(XT X)b = XT y,

which is also an intermediate step in the proof of Theorem 2.1.2.
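As an illustration of the preceding computation, the following sketch (not part of the thesis) carries out the least squares estimate b = (XT X)−1 XT y with the numpy library; the small design matrix and observation vector are invented for the example.

    import numpy as np

    # Minimal sketch: solve the normal equations (X^T X) b = X^T y
    # for a small, made-up design matrix with a column of ones.
    X = np.array([[1.0, 2.0],
                  [1.0, 5.0],
                  [1.0, 7.0],
                  [1.0, 8.0]])
    y = np.array([3.0, 6.0, 10.0, 11.0])

    b = np.linalg.solve(X.T @ X, X.T @ y)   # b = (X^T X)^{-1} X^T y
    y_hat = X @ b                           # projection of y onto C(X)
    e = y - y_hat                           # error vector, orthogonal to C(X)

    print(b)
    print(X.T @ e)    # numerically ~0: e lies in the null space of X^T

In practice one would call np.linalg.lstsq rather than forming the inverse explicitly, but the computation above mirrors the formula derived in the text.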

2.1.2 Centering independent variable vectors

The least squares solution b can be obtained from the orthogonal projection of the observed

dependent variable vector y onto the column space of the design matrix X = [1 x1 x2 · · · xp ].

In fact, the same solution can be obtained from the projection of y onto the subspace of Rn

spanned by the vector 1 and the centered vectors xci = xi − x̄i 1, where x̄i is the mean of the entries of xi. Let X0 denote the block matrix [1 Xc] with Xc = [xc1 xc2 · · · xcp]. Centering a vector entails subtracting its projection onto the vector 1; because each xci differs from xi by a multiple of 1 and 1 is a column of both X and X0, the subspaces C(X) and C(X0) are equal.

From Theorem 2.1.2, we know that the least squares solution for y = Xb + e is

b = (XT X)−1 XT y,

and that the least squares solution for y = X0 b0 + e is

b0 = ((X0 )T X0 )−1 (X0 )T y.

The following theorem relates these solutions.

Theorem 2.1.3. Whenever b = (b0, . . . , bp)T is the least squares solution for the linear model of y with the design matrix X and b0 = (b00, . . . , b0p)T is the least squares solution for the linear model of y with the corresponding centered design matrix X0 = [1 Xc], then bi = b0i for all 1 ≤ i ≤ p. Moreover, the parameter estimate b0 can also be obtained from b0; in particular, b0 = b00 − (b01 x̄1 + · · · + b0p x̄p).

Proof. Since C(X) = C(X0) we know that Xb = X0 b0. The result follows immediately from the equation

        [ 1   −x̄1 · · · −x̄p ]
    X · [                    ] = X0,
        [ 0         Ip       ]

where Ip is the p × p identity matrix. Indeed, writing M for the matrix above, X0 b0 = X(Mb0); because X has linearly independent columns, Xb = X(Mb0) implies b = Mb0, which gives the stated relations between b and b0.

The geometry of the result is instructive and readily apparent in the case where there is only one independent variable, that is, when X = [1 x]. It follows from Theorem 2.1.2 that ŷ = b0 1 + b1 x. Let X0 = [1 xc]. Now the vectors x, xc, and x̄1 form a right triangle in C(X) = C(X0) that is similar to the triangle formed by b1 x, b01 xc, and b1 x − b01 xc = z1 for some scalar z (see Figure 2.2). It follows that b1 = b01 and that b0 = ȳ − z. Certainly b00 = ȳ, and by similarity we have that z = b01 x̄. Thus, b0 = b00 − b01 x̄ and we can therefore write ŷ = (b00 − b01 x̄)1 + b01 xc, which is in terms of b0 as desired.
Figure 2.2: The vector ŷ, the projection of the vector y into C([1 x]), is equal to b0 1 + b1 x and
also is equal to b00 1 + b01 xc .
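A quick numerical check of Theorem 2.1.3 can be carried out along the following lines; this is a sketch using numpy with made-up data, not part of the thesis.

    import numpy as np

    # Sketch: the slope estimates from the raw and centered design matrices agree,
    # and the raw intercept equals b'_0 minus the slope times the predictor mean.
    x = np.array([2.0, 5.0, 7.0, 8.0])
    y = np.array([3.0, 6.0, 10.0, 11.0])
    ones = np.ones_like(x)

    X  = np.column_stack([ones, x])              # [1 x]
    Xc = np.column_stack([ones, x - x.mean()])   # [1 x^c]

    b,  *_ = np.linalg.lstsq(X,  y, rcond=None)
    bp, *_ = np.linalg.lstsq(Xc, y, rcond=None)

    print(b[1], bp[1])                      # identical slopes
    print(b[0], bp[0] - bp[1] * x.mean())   # identical intercepts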

2.1.3 Subspaces of individual space

It is often convenient to think of individual space (Rn ) partitioned into several subspaces that

correspond to different parts of the assumed linear model. One important subspace in individual

space is spanned by the vector 1. We denote this line V1 .

As we have already seen, the least squares solution b is derived by projecting the observation

vector y into the column space of X, C(X). For analogous notation, we also use VX to denote

this subspace of Rn . Certainly V1 ⊂ VX and moreover, we can write VX = V1 ⊕ VXc , where

VXc denotes the orthogonal complement of V1 in VX . Finally, we let Ve denote the orthogonal

complement of VX in Rn . Putting these statements together gives the equation

Rn = V1 ⊕ VXc ⊕ Ve , (all orthogonal). (2.5)

From linear algebra, we know there is corresponding equation relating the dimensions of these

subspaces to the dimension of individual space:

n = 1 + p + (n − p − 1). (2.6)

The vectors that arise in fitting a linear model (e.g., ȳ1, ŷc, and e) are each contained in precisely one of these subspaces. The (ambient) dimension of such a vector is defined to be the dimension of its associated (smallest) ambient space. Of course, each vector is one-dimensional in the

traditional sense. A basic technique for estimating models and testing hypotheses is finding a

convenient basis for individual space based on the vectors that make up the model. This new

definition for dimension is directly related to this implicit basis imposed on Rn by the linear

model. The vector 1 is 1-dimensional and is also (almost always) taken to be the first of the

implicit basis vectors. Next we choose a set of p orthogonal basis vectors for VXc . If the vectors

of centered predictors {xci : 1 ≤ i ≤ p} are all orthogonal, so much the better. The vector

ŷc is contained in VXc and therefore has (ambient) dimension p. Finally, we can pick any set

of n − p − 1 vectors that span Ve to complete the basis for Rn . The vector e, naturally, has

the ambient space Ve and therefore has dimension n − p − 1. The observation vector y is an

n-dimensional vector in individual space. It is rarely necessary to specify these vectors explicitly

but several arguments depend on their existence and orthogonality.

We already have used the name individual space for Rn . The subspaces of individual space

imposed by a linear model also have convenient names. The space V1 is called mean space, the

space VXc is called the effect space or the model space, and the space Ve is called the error space.
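The decomposition in equation (2.5) can be made concrete with a short computation. The sketch below (numpy, with invented data; not from the thesis) splits an observation vector into its mean-space, effect-space, and error-space components and checks that they are mutually orthogonal.

    import numpy as np

    # Sketch: decompose y into components in V_1, V_Xc, and V_e (equation 2.5).
    x = np.array([2.0, 5.0, 7.0, 8.0])
    y = np.array([3.0, 6.0, 10.0, 11.0])
    n = len(y)
    ones = np.ones(n)

    X = np.column_stack([ones, x])
    P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto V_X = C(X)

    y_mean  = ones * y.mean()              # component in mean space V_1
    y_hat_c = P @ y - y_mean               # component in effect space V_Xc
    e       = y - P @ y                    # component in error space V_e

    # The three pieces are mutually orthogonal and add back to y.
    print(np.allclose(y, y_mean + y_hat_c + e))
    print(y_mean @ y_hat_c, y_mean @ e, y_hat_c @ e)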

2.1.4 The assumptions of the general linear model

If we adopt the general linear model for a particular data set, then there are three assumptions

we are required to make. The logic of fitting the model and testing hypotheses about estimated

parameters depends on these assumptions. First, we assume that the sample y ∈ Rn is a set of

n observations of n independent random variables Yi (see Section 1.2), and that each Yi follows

the normal distribution (see Section 6.1). Further we assume that the expected value of each Yi

is a linear combination of the variables xi specified by the parameters in the vector β, that is

µYi = β0 + β1 xi,1 + · · · + βp xi,p . Finally, we assume that the variables Yi have a common variance

σ². In summary, for all i, we assume E[(Yi − µYi)(Yj − µYj)] = 0 for all j ≠ i (a consequence of independence), and that the random variable Yi follows the normal distribution with a mean

of β0 + β1 xi,1 + · · · + βp xi,p and a variance of σ 2 .

The three assumptions about the random variables Yi can be reframed as assumptions about

the error component of the general linear model. From equation (2.3), we can write

Ei = Yi − (β0 + β1 xi,1 + · · · + βp xi,p),   (2.7)

which illustrates the dependence between the random variable Ei and the random variable Yi ,

for each 1 ≤ i ≤ n. If the variance of Yi is known, it is clear that the variance of Ei must be

the same. Because the true parameter vector β minimizes E(‖E‖), we know that the expected value of each Ei must be zero. It follows that the following three assumptions about the random variables Ei are equivalent to the first set of assumptions concerning the random variables Yi.

Therefore, by adopting a linear model we have assumed:

1. All Ei are independent random variables with normal distributions.

2. E(Ei ) = 0 for all i.

3. Var(Ei) = σ² for all i.

The assumptions about the random error variables play a central role in justifying hypothesis

tests of the parameter estimates, b.

2.2 Linear combinations of random variables

Testing hypotheses about the parameters of the general linear model requires that we understand the distributions of statistics such as Y and S², which are functions of the random variables Y1, · · · , Yn. Assuming that the Yi are all random variables with the same normal distribution, it is reasonable to ask for the distributions of Y and S².

We saw in Section 1.3.2 that, given a sample y, the statistic y is an unbiased estimator for

µ. In this section, we prove that S 2 is an unbiased estimator of σ 2 (see equation 1.9) and find

the distributions of Y and S 2 . These results (and the methods developed to obtain them) afford

a rigorous geometric foundation for the F -ratio developed in the next section.

We begin with the following lemma:

Lemma 2.2.1. Let Y be a random variable with Var(Y) < ∞. Then for any a and b in R, Var(aY + b) = a² Var(Y).

Proof. Using the definition of variance (see equation 1.9) and the linearity of expectation (see Corollary 1.2.2), we have

Var(aY + b) = E[((aY + b) − E(aY + b))²]
            = E[(aY + b − aE(Y) − b)²]
            = E[(aY − aE(Y))²]
            = a² E[(Y − E(Y))²]
            = a² Var(Y)

Next, we consider the variance of sums of random variables.

Lemma 2.2.2. Let Y1, · · · , Yn denote n random variables and suppose the random variables are independent (i.e. E[(Yi − µYi)(Yj − µYj)] = 0 for all i ≠ j). Then

Var( Σ_{i=1}^n Yi ) = Σ_{i=1}^n Var(Yi).

Proof. We apply the definition of variance, the linearity of expectation, and the hypothesis of independence:

Var( Σ_{i=1}^n Yi ) = E[ ( Σ_{i=1}^n Yi − Σ_{i=1}^n µi )² ] = E[ ( Σ_{i=1}^n (Yi − µi) )² ]

= Σ_{i=1}^n Σ_{j=1}^n E[ (Yi − µi)(Yj − µj) ]

= Σ_{i=1}^n E[ (Yi − µi)² ] + Σ_{i=1}^n Σ_{j≠i} E[ (Yi − µi)(Yj − µj) ]

= Σ_{i=1}^n Var(Yi)

The main result follows.


Theorem 2.2.3. Let W be a random variable and suppose W = Σ_{i=1}^n ai Yi, where ai ∈ R for all i, and Y1, · · · , Yn are independent random variables with finite variance. Then

Var(W) = Σ_{i=1}^n ai² Var(Yi).

Proof. The result is a straightforward consequence of Lemma 2.2.1 and Lemma 2.2.2.

2.2.1 The variance of the sample mean

The fact that the variance of a sum of independent random variables is the sum of their variances

(Lemma 2.2.2) yields another useful fact. We claim that the variance of the sample mean, Var(Y), is σ²/n. This follows easily by applying the definition of Y and recalling that the observations of a sample are assumed to be independent. Since Y = (1/n) Σ_{i=1}^n Yi, we have:

Var(Y) = E[ (Y − E(Y))² ]                                 (2.8)
       = E[ ( (1/n)(Σ_{i=1}^n Yi − nµ) )² ]               (2.9)
       = (1/n²) Var( Σ_{i=1}^n Yi )                       (2.10)
       = (1/n²) Σ_{i=1}^n Var(Yi) = σ²/n                  (2.11)

Recall that the variance of a variable is the square of the standard deviation. Considering

the variance of Y , we conclude that the standard deviation of Y is

σY = σ/√n,   (2.12)

where σ is the common standard deviation of the random variables Yi (see Section 2.1.4).
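A simulation along the following lines (a numpy sketch, not part of the thesis) illustrates equation (2.12): the empirical variance of the sample mean is close to σ²/n.

    import numpy as np

    # Sketch: simulate many samples of size n and compare the variance of the
    # sample mean with sigma^2 / n.
    rng = np.random.default_rng(0)
    n, sigma, reps = 25, 2.0, 100_000

    means = rng.normal(loc=0.0, scale=sigma, size=(reps, n)).mean(axis=1)
    print(means.var())      # close to ...
    print(sigma**2 / n)     # ... sigma^2 / n = 0.16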

2.2.2 Sample variance is an unbiased estimator

Whenever the variance of a random variable is calculated from a random sample y = (y1, · · · , yn), the unbiased estimator stated in equation (1.8) is used. Recall that this differed from the naive estimate s²n (see equation 1.7) by a correction factor of n/(n − 1). It remains to show that the random variable S² corresponding to the estimate s² is in fact unbiased. We wish to show that E(S²) = σ² and in the process demonstrate that s²n calculated from sample data is a biased estimate of variance.

We start with a standard algebraic argument that S² is unbiased. As with the proof of Lemma 2.2.2, this argument relies on partitioning a sum of squares.

E(S²) = (1/(n − 1)) E[ Σ_{i=1}^n (Yi − Y)² ]
      = (1/(n − 1)) E[ Σ_{i=1}^n (Yi − µ + µ − Y)² ]
      = (1/(n − 1)) E[ Σ_{i=1}^n (Yi − µ)² − 2n(Y − µ)² + n(Y − µ)² ]
      = (1/(n − 1)) [ Σ_{i=1}^n E((Yi − µ)²) − n E((Y − µ)²) ]
      = (1/(n − 1)) [ Σ_{i=1}^n Var(Yi) − n Var(Y) ]
      = (1/(n − 1)) (nσ² − σ²) = σ²

Although this algebraic argument that S 2 is an unbiased estimator does not support a

geometric interpretation, a geometric argument is possible. Given a sample y, to find s2 we

would like to calculate the per-dimension squared length of y − µ1, a vector with ambient space Rn. Thus, when the common mean µ = µYi (1 ≤ i ≤ n) is known, the statistic (1/n) Σ_{i=1}^n (yi − µ)² gives an unbiased estimate of σ². In the present case, however, the common mean, µ, is unknown.

Instead of using the true parameter µ we use the sample mean y. The key idea here is that one

dimension of individual space Rn is used to estimate the mean by calculating y. The true

population parameter µ is independent of the sample y, so y − µ1 has ambient dimension of n.

The situation is different for y which instead depends directly on the sample y. Once y has been

decomposed as the sum of the mean vector y1 and the centered vector yc = y − y1, the centered

vector no longer has ambient dimension of n. Instead, the vector yc has ambient dimension

n − 1. The desired result follows from precisely the same concept—the statistic s2 is the average

per-dimension squared length of the centered data vector yc . In order to demonstrate this

rigorously, we use an approach that will be useful for geometrically justifying hypothesis testing

using the F -ratio: we find a convenient orthonormal basis for Rn .

We have seen that the sample mean y can be obtained by projecting the observation vector y onto the line spanned by 1 (see equation 1.4). Let {u1, u2, · · · , un} be an orthonormal basis for Rn chosen so that u1 = (1/√n) 1 and with the other basis vectors fixed but unspecified. We can write each basis vector in terms of the original coordinates as ui = (ai1, · · · , ain)T. Since {ui} is a basis, y = Σ_{i=1}^n ci ui for some c1, . . . , cn ∈ R. We can find each ci by taking the (signed) length of the projection of y on ui, where ci is negative if the projection of y on ui has the opposite direction as ui. We know that the coefficient ci = Σ_{j=1}^n aij yj for each i, and in particular that c1 = √n · y. Since each ui is a unit vector, we also have Σ_{j=1}^n aij² = 1, for all i.
Next we consider each ci, 1 ≤ i ≤ n, as the realized value of the corresponding linear combination of random variables Ci = Σ_{j=1}^n aij Yj. Now if E(Yj) = 0 for all 1 ≤ j ≤ n, we have that E(Ci) = 0. (This assumption is analogous to the assumption that each coordinate of the error vector E in the general linear model has an expected value of 0; that is, E(Ei) = 0.) By Theorem 2.2.3, we have that

Var(Ci) = Σ_{j=1}^n aij² Var(Yj) = Σ_{j=1}^n aij² σ² = σ² Σ_{j=1}^n aij² = σ².

We have proved

Lemma 2.2.4. Let Yi, 1 ≤ i ≤ n, be independent random variables such that Yi follows a normal distribution with a mean of 0 and a variance of σ² for all i, and suppose that u ∈ Rn is a unit vector. If Cu is the (signed) length of the projection of the random variable vector Y = (Y1, . . . , Yn) onto u, then Var(Cu) = σ².

By the definition of variance, we see that

Var(Ci) = E[ (Ci − E(Ci))² ] = E(Ci²).

It is clear now that

E(Ci²) = σ²,   (2.13)

and it follows that, with a particular vector of observations y (the realized values of the random variable vector Y), the mean of the squared lengths of the vectors ci ui for 1 < i ≤ n will give the best available estimate of σ². It almost goes without saying that Σ_{i≠1} ci ui = yc, because c1 u1 = y1 (see equation 1.2). We now see the utility of the per-dimension squared length of yc for estimating σ². To summarize, we have

s² = ( Σ_{i=2}^n ci² ) / (n − 1) = ( Σ_{i=2}^n ‖ci ui‖² ) / (n − 1) = ‖yc‖² / (n − 1).

We are happy to observe that this method of estimating variance agrees with the definition of sample variance (see equation 1.8).
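The identity s² = ‖yc‖²/(n − 1) is easy to confirm numerically; the sketch below (numpy, with made-up data, not from the thesis) compares the per-dimension squared length of the centered vector with the usual unbiased sample variance.

    import numpy as np

    # Sketch: the squared length of the centered data vector, divided by n - 1,
    # equals the unbiased sample variance of equation (1.8).
    y = np.array([3.1, 4.7, 5.2, 6.0, 7.4])
    n = len(y)

    yc = y - y.mean()            # centered observation vector
    print(yc @ yc / (n - 1))     # ||y_c||^2 / (n - 1)
    print(y.var(ddof=1))         # numpy's unbiased sample variance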

2.3 Testing the estimated model

Statistical analyses, in addition to making estimates of the parameters defining putative true

models for variables in a data set, often provide statistical tests for these estimates. Hypothesis

testing entails a comparison of the estimated model with a simpler model that is described by

the null hypothesis. Such a test, for example, can provide evidence that the true model is more

similar to the estimated model than to a model, say, in which the dependent variable has no

systematic relationship with the independent variables.

The general strategy for hypothesis testing with the general linear model is to compare a

restricted model obtained from a hypothesis that constrains one or more parameters with the

unrestricted model, which corresponds to an alternative hypothesis without these constraints.

The null hypothesis is often the most restricted model (the parameter vector β is set to the

zero vector so there is no systematic portion in the model). On the other hand, if parameter

estimates can vary, the estimated model is the best choice because it has the smallest squared

error vector. If the observed sample under the distribution implied by the restricted model is

so unusual that this null hypothesis is untenable, then the null hypothesis is rejected in favor of

the alternative hypothesis and the estimated model.

The foremost concern, given the sample y and a vector of parameter estimates b, is to

determine how close the estimated model is to the true model. Recall that the general linear

model decomposes each observed yi as the sum of (1) a linear combination of the xi s (this is the

systematic portion of the model) and (2) the term ei , which is the random portion of the model.

Furthermore, by adopting the linear model, we assume that, for all 1 ≤ i ≤ n, the expected

value of Ei is zero and the variance of Ei is σ 2 > 0 (see Section 2.1.4). We cannot compare b

with β without knowing the true model. Instead, we will compare the estimated model with

the model that has no systematic portion.

The model with no systematic portion is precisely the model in which the parameters are

each constrained to zero: β = 0. This is the same as saying that we expect all of the random

variables Yi to behave like the random variables Ei , for if β = 0 then Y = Xβ + E certainly

implies that Y = E. It follows that for all 1 ≤ i ≤ n, the expected value of Yi is zero and the

variance of Yi is σ 2 > 0.

Under any orthonormal basis for Rn, the squared coordinates of Y each have expected value σ² under the null model (see the discussion in the previous section, for example), and so the expected squared length of the random vector Y is nσ². Conversely, if the observed squared length of y is very unlikely when the expected squared length of Y is nσ², then we are able to conclude that β is unlikely to be the zero vector. In this case, b is the best estimate for β in the column space of X (i.e. when the parameters are allowed to vary), and so it is reasonable to conclude that b is close to β. This logic is used in every hypothesis test of the estimated model. The crux of the argument is using the observed data to show that we can (or cannot) reasonably expect the squared length of Y to be about nσ² given the evidence from the sample y.

Let us call the model with no systematic portion the null model. We want to know how

likely the sample y is if reality actually corresponds to the null hypothesis that the linear model

is no better than chance at predicting or explaining the observed data y. In other words, the

observed data are due merely to chance and have no systematic relationship with the independent

variables.

For the sake of argument, we first assume the null model holds and therefore hypothesize that the expected squared length of Y is nσ². Next we suppose that {u1, u2, · · · , un} is an orthonormal basis for Rn chosen so that {u1, u2, · · · , up+1} span C(X). If Y is written as Σ_{i=1}^n Ci ui, where Ci is a random variable constructed by the appropriate linear combination of the random variables
Yi, 1 ≤ i ≤ n, then the expected value of Ci² is σ² for all i because we have assumed that E(Yi²) = σ² for all i. Thus, we expect the per-dimension squared length of Y to be σ²:

E( (1/n) ‖Y‖² ) = E( (1/n) ‖Σ_{i=1}^n Ci ui‖² ) = (1/n) Σ_{i=1}^n E(Ci²) = σ².

Furthermore, by the least-squares estimate for b, we can express Y as the sum of the vector Ŷ ∈ C(X) and the random variable vector E ∈ C(X)⊥. It follows that the expected per-dimension squared length of each term of Y will be σ²:

E( (1/(p + 1)) ‖Ŷ‖² ) = E( (1/(p + 1)) ‖Σ_{i=1}^{p+1} Ci ui‖² ) = σ²

and that

E( (1/(n − p − 1)) ‖E‖² ) = E( (1/(n − p − 1)) ‖Σ_{i=p+2}^{n} Ci ui‖² ) = σ².

In fact, we expect the per-dimension squared lengths of Ŷ and E to be the same:

E( ‖Ŷ‖² / (p + 1) ) = σ² = E( ‖E‖² / (n − p − 1) ).   (2.14)

This justifies why the F-ratio,

F = ( ‖ŷ‖² / (p + 1) ) / ( ‖e‖² / (n − p − 1) ),   (2.15)

is an appropriate statistic for evaluating the likelihood of the observed data under the assumption of the null model.

If the null hypothesis is correct, then we would expect the per-dimension squared length of the sample least squares estimate ŷ to be similar to the per-dimension squared length of the sample error vector e because of equation (2.14). In this case, we would expect the value of F to be close to 1. On the other hand, if the F-ratio is large, then the average squared length of the projections of ŷ onto the basis vectors {ui} is greater than the estimate of σ² provided by e, suggesting that the random variables Yi do not all have a mean of zero. Moreover, if F is sufficiently large, then

the sample is unlikely under the assumption of the null hypothesis. For this reason, when F

is sufficiently large, we have some confidence in rejecting the null hypothesis and accepting the

alternative hypothesis that β is not the zero vector. It is important to stress that this procedure

does not ‘prove’ that β is close to b. Having decided that in all likelihood β ≠ 0, the estimate

b provides the best guess we can make of the unknown parameter β given the sample y.

For example, another common model (also called the null model in some texts) is the model where β0 is left free to be estimated but the rest of the parameters are constrained to 0. The null hypothesis for a test of the estimated model against this model is written

H0 : β = (β0 0)T,   (2.16)

where 0 is a p-dimensional row vector. Under this hypothesis, the restricted model can be written

Y = 1β0 + E.

The alternative hypothesis can be expressed Ha : βi ≠ 0 for at least one 1 ≤ i ≤ p, and the corresponding unrestricted

model is the full general linear model expressed in equation (2.2). Once the null and alterna-

tive hypotheses have been articulated, the corresponding models are fit using the least-squares

method and compared using the F -ratio:

F = ( ‖ŷc‖² / p ) / ( ‖ec‖² / (n − p − 1) ).

Because one dimension of individual space is spanned by the vector 1, the vector ŷc has an

ambient dimension of p, and e has an ambient dimension of (n − p − 1).
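The following sketch (numpy and scipy, with invented data; not part of the thesis) assembles the centered F-ratio just described and converts it to a p-value. The function scipy.stats.f.sf gives the upper-tail probability of the F distribution.

    import numpy as np
    from scipy import stats

    # Sketch: fit a linear model and test H0: beta_1 = ... = beta_p = 0
    # with the centered F-ratio of this section (p = 1 predictor here).
    x = np.array([2.0, 5.0, 7.0, 8.0, 9.0, 12.0])
    y = np.array([3.0, 6.0, 10.0, 11.0, 11.5, 15.0])
    n, p = len(y), 1

    X = np.column_stack([np.ones(n), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ b
    e = y - y_hat
    y_hat_c = y_hat - y.mean()              # centered estimate, ambient dimension p

    F = (y_hat_c @ y_hat_c / p) / (e @ e / (n - p - 1))
    p_value = stats.f.sf(F, p, n - p - 1)   # P(F_{p, n-p-1} >= F)
    print(F, p_value)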

Chapter 3

Geometry and Analysis of Variance

Analysis of variance (ANOVA) is a term that describes a wide range of statistical models appropriate for answering many different kinds of questions about experimental and observational data sets. What unites these techniques is that the independent variable(s) in

ANOVA are always categorical variables (for example, gender or teacher certification status) and

the dependent variable(s) are continuous. Since ANOVA techniques were developed separately

from the regression techniques that will be discussed in the next chapter, different vocabulary

is used for characteristics of both kinds of models that are actually very similar or identical.

For example, the independent variables in ANOVA models are usually called factors, whereas

the independent variables in regression models are called predictors. In a similar way, the coef-

ficients for the independent variables in an ANOVA model are called effects but are often called

parameters in regression analyses. The hypotheses one can test using ANOVA generally concern

the differences of means in the dependent variable at different levels of a factor, the finite set

of discrete values attained by the factor. When a single independent variable or factor is used

in the model, then the analysis is called one-way ANOVA, and if two factors are used, then the

analysis is two-way ANOVA.

3.1 One-way ANOVA

In the example from section 1.4, we were interested in whether or not the mean gain-score on a

standardized test after one month of tutoring was significantly different from 0. We were only

interested in a single population, namely those students who had been tutored for one month.

In many kinds of research, comparisons between two or more treatment groups are required. For

example, it is quite plausible that this particular tutor has no effect over and above the effect of

studying alone for an extra month. One-way ANOVA models allow us to compare the means of

different groups. When individuals are assigned to treatment groups randomly, it is defensible

to conclude that membership in a particular treatment group is responsible for differences in

outcomes.

Suppose a tutoring company wanted to research the efficacy of private tutoring to generate

data for an advertising campaign. Because individuals who have already elected private tutoring

may differ systematically from individuals who have not, they focus on the population of 42

tutees who participate in group tutoring sessions. Of these, three are randomly selected for 1

hour of private tutoring, three are randomly selected for 2 hours of private tutoring, and three

are randomly selected as controls (they continue to participate in the group tutoring session).

In this design, the treatment factor has three levels:

1. Weekly group tutoring session

2. Weekly 1 hour private tutoring session

3. Weekly 2 hour private tutoring session

The scores before and after one month of tutoring are used to calculate the gain-scores as before.

Simulated data for this study are presented in Table 3.1.

3.1.1 Dummy variables

The factors (i.e., independent variables) in ANOVA models specify the factor level for each

observation (instead of measurement data) and are called dummy variables. These vectors are

Table 3.1: Data for a 3-level factor recording tutoring treatment.

Levels                       Observed gain-scores
Group tutoring               6.93, 6.13, 4.25
1 hour private tutoring      11.94, 7.43, 9.43
2 hours private tutoring     12.44, 14.64, 9.17

more easily interpreted if they are orthogonal and when they can be chosen to encode particular

hypotheses of interest.

The simplest method of creating dummy variables is to first sort y by factor level so that all

observations of the same level are consecutive. For example, the three gain-scores of students

who attended group tutoring might be in the first three slots of y, the three gain-scores of students who received 1 hour of private tutoring might be in the fourth, fifth, and sixth slots, and the scores of the final level might be in the seventh, eighth, and ninth slots. We then create a dummy variable Xi to represent the ith factor level under the convention that Xij = 1 if yj is an observation of the ith factor level and Xij = 0 otherwise. The dummy variables constructed in this

manner for the tutoring example are presented in the following fitted model:

y = Xb + e,

[ 6.93]   [1 0 0]              [ 1.16]
[ 6.13]   [1 0 0]              [ 0.36]
[ 4.25]   [1 0 0]   [ 5.77]    [-1.52]
[11.94]   [0 1 0]   [ 9.60]    [ 2.34]
[ 7.43] = [0 1 0] · [12.08]  + [-2.17]
[ 9.43]   [0 1 0]              [-0.17]
[12.44]   [0 0 1]              [ 0.36]
[14.64]   [0 0 1]              [ 2.56]
[ 9.17]   [0 0 1]              [-2.91]

Geometrically, this means that we are considering an orthogonal basis of individual space composed of the columns of X and any 6 other arbitrary vectors that span the error space. Notice that with this simple form of dummy coding, the design matrix X does not include the mean vector 1. Moreover, if the mean vector were added as another column of X, then XT X would be singular (X would no longer have full column rank), since 1 = X1 + X2 + X3.
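A short computation (a numpy sketch, not part of the thesis) reproduces this fitted model from the data in Table 3.1; the least squares estimate recovers the factor-level means reported below.

    import numpy as np

    # Sketch: build the simple dummy-variable design matrix for Table 3.1
    # and recover the factor-level means.
    y = np.array([6.93, 6.13, 4.25,       # group tutoring
                  11.94, 7.43, 9.43,      # 1 hour private tutoring
                  12.44, 14.64, 9.17])    # 2 hours private tutoring
    X = np.kron(np.eye(3), np.ones((3, 1)))   # columns X1, X2, X3

    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    print(b)           # approximately (5.77, 9.60, 12.08), the level means
    print(e @ e / 6)   # error mean square, approximately 4.86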

3.1.2 Hypothesis tests with dummy variables

Estimating this model using least squares gives the vector b = (5.77, 9.60, 12.08)T , which can

be interpreted as the vector of factor level means. An overall test of the hypothesis that these

means are significantly different from 0 can be accomplished by calculating the F -ratio with 3

and 6 degrees of freedom:


F = ( ‖Xb‖² / 3 ) / ( ‖e‖² / 6 ) ≈ 271.45 / 4.8584 ≈ 55.875

This value is so large that, under the assumption that all of the factor level means are 0 (H0 :

µ1 = µ2 = µ3 = 0), we would expect an F value this large or larger only 0.009 % of the time.

We can conclude that at least one factor level mean is significantly different from zero.

In addition, since the dummy variables are orthogonal, each can be used to test a hypothesis

that is independent of the rest of the model. In particular, we can test the hypothesis that each

factor level mean is different than zero (H0 : µ1 = 0; H0 : µ2 = 0; H0 : µ3 = 0). The F -ratio for

each test is presented below, along with the corresponding p-value. The p-value of a hypothesis

test is the probability of obtaining an F -ratio as large or larger under the assumption of the

corresponding null hypothesis. Notice that the numerator degrees of freedom for each of these

tests is 1 because each Xi is a vector in the chosen orthogonal basis for individual space.

F_{µ1=0} = ( ‖X(b1, 0, 0)T‖² / 1 ) / ( ‖e‖² / 6 ) ≈ 99.88 / 4.8584 ≈ 20.558;  p = 0.003958
F_{µ2=0} = ( ‖X(0, b2, 0)T‖² / 1 ) / ( ‖e‖² / 6 ) ≈ 276.48 / 4.8584 ≈ 56.908;  p = 0.000281
F_{µ3=0} = ( ‖X(0, 0, b3)T‖² / 1 ) / ( ‖e‖² / 6 ) ≈ 438.02 / 4.8584 ≈ 90.158;  p = 0.000078

The results of these hypothesis tests suggest that we can reject the null hypotheses that any

one of the factor level means is zero; in each case, the gain-score is significantly different from

0 because the p-values are smaller than 0.05. (Other common significance levels are 0.1 and

0.01—researcher judgment and consensus in an academic field guide the choice.)
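The same F-ratios and p-values can be computed directly from the projections of y onto the dummy variables, as in the following sketch (numpy and scipy; not part of the thesis).

    import numpy as np
    from scipy import stats

    # Sketch: single-degree-of-freedom F-ratios for each (orthogonal) dummy variable.
    y = np.array([6.93, 6.13, 4.25, 11.94, 7.43, 9.43, 12.44, 14.64, 9.17])
    X = np.kron(np.eye(3), np.ones((3, 1)))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    mse = ((y - X @ b) @ (y - X @ b)) / 6    # ||e||^2 / 6

    for i in range(3):
        bi = np.zeros(3)
        bi[i] = b[i]
        num = (X @ bi) @ (X @ bi)            # ||X(0, .., b_i, .., 0)^T||^2 / 1
        F = num / mse
        print(F, stats.f.sf(F, 1, 6))        # F-ratio and p-value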

3.1.3 Dummy variables and contrasts

In spite of the hypothesis tests we were able to perform with the simplest kind of dummy

variables, we have not yet answered the most important question we sought to address with

the tutoring experiment: Does private tutoring have a different effect on gain-scores than group

tutoring? In addition, we might also like to answer the question: Are there differences in the

effect on gain-scores between 1 hour and 2 hours of private tutoring? These questions can be

answered by using a more clever strategy of constructing dummy variables so that they encode

hypotheses of interest.

These more elaborate dummy variables are often called contrasts. Researchers will often

select contrasts that are orthogonal to each other so the contrasts are independent and the

hypotheses they encode can be tested separately. Although not strictly necessary, most designs

include the vector 1 in order to test the null hypothesis H0 : µi = 0 where i ranges over all

factor levels. In the following discussion, we let X01 indicate the vector 1.

First we write down the questions we seek to answer and their translation as null hypotheses.

Question 1: Does private tutoring have a different effect on gain-scores than group tutoring?
Null hypothesis: H0 : (µ2 + µ3)/2 − µ1 = 0.

Question 2: Are there differences in the effect on gain-scores between 1 hour and 2 hours of private tutoring?
Null hypothesis: H0 : µ2 − µ3 = 0.

The next step is to find a dummy variable for each question so that when the F-ratio is large,

we have evidence to reject the associated null hypothesis. Geometrically, we want to construct a

vector in the column space of X so that the squared length of the projection of y on this vector

can be compared with the average squared length of the projection of y on arbitrary vectors

spanning the error space. In particular, we want relatively large projections on this vector to

be inconsistent with the hypothesis that average gain-score for private tutoring is the same as

the gain-score for group tutoring. This can be accomplished with (any multiple of) the vector

2 X2
1
+ 21 X3 − X1 , where the Xi indicate the simple dummy variables from the previous section.
µ2 +µ3
Essentially, we are checking whether our hypothesis about the group means (H0 : 2 −µ1 = 0)

is a feasible description of the relationship between observed values in these groups, across the

whole data set. Usually, it is convenient to pick dummy values that are integers to ease data-

entry, so let X02 = X2 + X3 − 2X1 . This vector works for testing the first hypothesis because

if the squared length of the projection of y on this vector is large then the observed data are

unlikely to have come from a population that is described by the null hypothesis (1): Private

tutoring (in 1 or 2 hour sessions) has no different effect on gain-scores than group tutoring.

Similarly, we can construct a vector for the null hypothesis corresponding to the second

question, Are there differences in the effect on gain-scores between 1 hour and 2 hours of private

tutoring? We take X03 to be X2 − X3 , and reason that large squared lengths of the projection of

y on this vector are inconsistent with the null hypothesis for the second question. Furthermore,

X01 · X02 = 0, X01 · X03 = 0, and X02 · X03 = 0, so the new design matrix X0 = [ X01 X02 X03 ] is

full-rank. The fitted model is

y = X0 b0 + e,

[ 6.93]   [1 -2  0]               [ 1.16]
[ 6.13]   [1 -2  0]               [ 0.36]
[ 4.25]   [1 -2  0]   [ 9.15]     [-1.52]
[11.94]   [1  1  1]   [ 1.69]     [ 2.34]
[ 7.43] = [1  1  1] · [-1.24]   + [-2.17]
[ 9.43]   [1  1  1]               [-0.17]
[12.44]   [1  1 -1]               [ 0.36]
[14.64]   [1  1 -1]               [ 2.56]
[ 9.17]   [1  1 -1]               [-2.91]

Notice that the error vector is the same as in the previous fitted model; this makes sense because

the column space of X is the same as the column space of X0 .

Comparing these two models, we find that only the values of the estimate b0 are different

and this is because they have different interpretations. The value b01 ≈ 9.15 can be interpreted

as the mean gain-score over all the students. A hypothesis test of this value can allow one to

reject the null hypothesis that all three tutoring treatments have average gain-scores of 0.

The value b02 is related by a constant to an estimate d̂1 for the average difference in gain-scores between group and private tutoring treatments, d1 = (µ2 + µ3)/2 − µ1. This constant depends on k, the number of observations at each factor level (in this example k = 3), and on the particular dummy variable selected (we chose 2 · ((1/2)X2 + (1/2)X3 − X1)). When the numbers of observations at each factor level are not equal, the computation is more complicated but still possible—this case is not discussed here. We have

E(y · X02) = −2E(y11 + y12 + . . . + y1k) + E(y21 + y22 + . . . + y2k) + E(y31 + y32 + . . . + y3k)
           = −2kµ1 + kµ2 + kµ3 = 2kd1,

and so if we assume

y · X02 = 2k d̂1,

then we can compute

d̂1 = ( ‖X02‖² / (2k) ) b02 = (18/6) b02 = 3b02 = 5.07.

Thus, we estimate the difference in gain-scores between group and private tutoring to be about

5 points.

In the same way, b03 is related by a constant to an estimate dˆ2 for the average difference in

gain-score between the 1 hour and 2 hours private tutoring treatments d2 = µ2 − µ3 . Following

an argument similar to the one above, we compute:

d̂2 = ( ‖X03‖² / k ) b03   (3.1)

d̂2 = (6/3) b03 = 2b03 = −2.48   (3.2)

Thus, we estimate that 1 hour of tutoring yields gain-scores that are a little more than 2 points

lower than the gain-scores after 2 hours of tutoring.
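The sketch below (numpy, not part of the thesis) fits the contrast-coded model and applies the rescaling just derived, recovering d̂1 ≈ 5.07 and d̂2 ≈ −2.48.

    import numpy as np

    # Sketch: contrast-coded design for the tutoring data and the rescaled
    # mean-difference estimates d1_hat and d2_hat.
    y = np.array([6.93, 6.13, 4.25, 11.94, 7.43, 9.43, 12.44, 14.64, 9.17])
    k = 3                                      # observations per level
    X1 = np.ones(9)
    X2 = np.repeat([-2.0, 1.0, 1.0], k)        # private vs. group contrast
    X3 = np.repeat([0.0, 1.0, -1.0], k)        # 1 hour vs. 2 hours contrast
    Xp = np.column_stack([X1, X2, X3])

    b, *_ = np.linalg.lstsq(Xp, y, rcond=None)
    d1_hat = (X2 @ X2) / (2 * k) * b[1]        # = 3 * b'_2, about 5.07
    d2_hat = (X3 @ X3) / k * b[2]              # = 2 * b'_3, about -2.48
    print(b, d1_hat, d2_hat)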

Next we want to check to see if these values are significantly different from 0. As before, we

can compute the F -ratio for the model overall (which tests the null hypothesis H0 : b0 = 0) and

compute F -ratios for each estimate b0i . An overall test of the null hypothesis that b0 = 0 can be

accomplished by calculating the F -ratio with 3 and 6 degrees of freedom:

F = ( ‖X0 b0‖² / 3 ) / ( ‖e‖² / 6 ) ≈ 271.45 / 4.8584 ≈ 55.875

This value is exactly the same as the F-ratio for the hypothesis test that b = 0 because ŷ =

Xb = X0 b0 . As before, the F value is so large that we would expect an F value this large or

larger only 0.009 % of the time. We can conclude that at least one of the parameters estimated by b0 is not zero.

Whenever the vector 1 is included in the design matrix, a more sensitive test of the model

fit is possible. The general model y = X0 b0 + e can be written

y = b01 X01 + b02 X02 + b03 X03 + e,

which is equivalent to

y − y1 = b02 X02 + b03 X03 + e.

This new model equates the centered vector yc with the sum of two (orthogonal) vectors in the

effect space and the error vector. All of these vectors (and those spanning the error space) are

orthogonal to 1, and so analysis can be restricted to the 8-dimensional subspace of individual

space orthogonal to 1. The null hypothesis becomes H0 : b02 = b03 = 0 and the corresponding

reduced model has a design matrix made up entirely of the vector 1. The corresponding F -ratio

has only 2 degrees of freedom for the centered estimate ŷc but retains 6 degrees of freedom for

the centered error vector ec = e.

F = ( ‖ŷc‖² / 2 ) / ( ‖ec‖² / 6 ) ≈ 30.347 / 4.8584 ≈ 6.246.

This result has a p-value of 0.03416, so we can reject the hypothesis that b02 = b03 = 0. This test

is more sensitive because we will not reject the overall null hypothesis in those cases where only

the intercept of the model b01 is significantly different from 0.

It remains to see if each of the values estimated by b0 is significantly different from zero.

As with the first set of dummy variables, we can accomplish this by means of three F -ratios:

F_{b01=0} = ( ‖X0(b01, 0, 0)T‖² / 1 ) / ( ‖e‖² / 6 ) ≈ 753.69 / 4.8584 ≈ 155.131;  p = 0.00002
F_{b02=0} = ( ‖X0(0, b02, 0)T‖² / 1 ) / ( ‖e‖² / 6 ) ≈ 51.44 / 4.8584 ≈ 10.589;  p = 0.01738
F_{b03=0} = ( ‖X0(0, 0, b03)T‖² / 1 ) / ( ‖e‖² / 6 ) ≈ 9.25 / 4.8584 ≈ 1.904;  p = 0.21685

From these ratios we can reject the first two null hypotheses: the average overall gain score and the average difference between private and group tutoring are significantly different from 0 (both p-values are below the common threshold of 0.05). However, there is a relatively high chance (p-value = 22%) of obtaining an estimate at least as extreme as the observed estimate for b03 under the null hypothesis H0 : b03 = 0

and we cannot reject this hypothesis. We say that the difference in gain scores between the

students who received 1 hour and 2 hours of tutoring is not significant.

In fact, the data for the 1 hour and 2 hour treatment gain scores were simulated from normal distributions with different means, but there are not enough scores to separate the pattern from chance. This is a problem of insufficient statistical power: low power means a high probability of failing to reject a false null hypothesis. In linear models, the primary determinant of power is

the size of the sample. Techniques are available to find the minimum sample size for obtaining

a sufficiently powerful test so that the probability of failing to reject a false null hypothesis is

guaranteed to be less than some predetermined threshold. Further discussion of statistical power

is beyond the scope of this thesis.

We conclude this section by observing that the number of independent hypotheses that can be

simultaneously tested is constrained by the need to invert the matrix XT X in order to estimate

β using least squares. When conducting one-way ANOVA, a model of a factor that has k levels

can be constructed with k − 1 dummy variables and the vector 1. Any more, and the matrix

XT X will be singular and a different technique for finding something analogous to (XT X)−1 is

required: the generalized inverse. This extension of the least-squares method is beyond the

scope of this thesis.

3.2 Factorial designs

We discuss one other kind of ANOVA design, factorial designs. These models are quite flexible,

and, although we only discuss an example of two-way ANOVA, the methods can easily be

generalized to any number of factors.

Consider an extension to the tutoring study discussed in the previous section in which the

researchers would like to discover if a short content lecture affects the gain-scores experienced by

the students who are being tutored. In this new experiment, there are two factors: lecture and

the tutoring treatment. Each factor has two levels: students are randomly assigned to attend

(or not attend) the lectures, and students are randomly assigned to participate in group tutoring

or in private tutoring for a one month period. Simulated data for this example are presented in

Table 3.2.

Table 3.2: Data for a 2-factor experiment recording observed gain-scores for tutoring and lecture
treatments.

                     No lectures          Lectures
Group tutoring       5.00, 5.58, 2.60     10.78, 4.83, 6.05
Private tutoring     10.83, 5.17, 6.68    12.53, 9.20, 10.39

3.2.1 Dummy variables with factorial designs

One way we can think of this problem is as a one-way ANOVA of a factor with four levels.

The effect or regression space is then the subspace of individual space spanned by the simple

dummy variables seen in the last section, Xi where Xij = 1 whenever yj is an observation of

factor level i. However, to aid the reader, we instead use subscripts that denote the group: Xgn

(group tutoring and no lectures), Xgl (group tutoring and lectures), Xpn (private tutoring and

no lectures), and Xpl (private tutoring and lectures). The fitted model is

y = Xb + e,

[ 5.00]   [1 0 0 0]                [ 0.61]
[ 5.58]   [1 0 0 0]                [ 1.19]
[ 2.60]   [1 0 0 0]                [-1.80]
[10.78]   [0 1 0 0]   [ 4.393]     [ 3.56]
[ 4.83]   [0 1 0 0]   [ 7.220]     [-2.39]
[ 6.05] = [0 1 0 0] · [ 7.560]   + [-1.17]
[10.83]   [0 0 1 0]   [10.707]     [ 3.27]
[ 5.17]   [0 0 1 0]                [-2.39]
[ 6.68]   [0 0 1 0]                [-0.88]
[12.53]   [0 0 0 1]                [ 1.82]
[ 9.20]   [0 0 0 1]                [-1.51]
[10.39]   [0 0 0 1]                [-0.32]

As in the last example, the coefficients for these dummy variables can be interpreted as the

means of each group in the population: group tutoring and no lectures (µ̂gn = 4.393), group

tutoring and lectures (µ̂gl = 7.220), private tutoring and no lectures (µ̂pn = 7.560), and private

tutoring and lectures (µ̂pl = 10.707). As before, F tests can be used to show that each one of

these estimated means is significantly different from zero.

By using these simple dummy variables, each group mean is estimated using only 3 of the

12 data points; the rest of the data are ignored. We are not as interested in each of these four

groups, however, as much as we are interested in the overall effect of each factor on the outcome

measure. The real strength of factorial designs are the contrasts that can be constructed to make

use of all of the data in the experiment, effectively increasing the sample size for the factors of

interest. This is beneficial because it increases statistical power without the expense of collecting

more data.

3.2.2 Constructing factorial contrasts

To be explicit, consider the question of the effect of lectures on gain-score. We would like to

compare all of the individuals in the experiment who attended lectures with those who did not.

The null hypothesis for this question might be written H0 : (1/2)(µgl + µpl) − (1/2)(µgn + µpn) = 0. The

appropriate contrast can be formed as a linear combination of the corresponding simple dummy

variables: X02 = −Xgn + Xgl − Xpn + Xpl . This contrast helps answer the question, Do lectures

affect gain scores?

In a similar way, we form the contrast X03 = −Xgn − Xgl + Xpn + Xpl to test the null

hypothesis H0 : (1/2)(µpn + µpl) − (1/2)(µgn + µgl) = 0. This null hypothesis corresponds to the question,

Does private tutoring affect gain scores?

A final kind of contrast important in factorial ANOVA models is called the interaction

contrast. This contrast helps to answer the question, Is the increase of gain-scores due to

lectures with group tutoring the same as the increase of gain-scores due to lectures with private

tutoring? The null hypothesis for this question says there is no difference in the increase of gain-

score due to lectures between the two tutoring conditions: H0 : (µgn − µgl ) − (µpn − µpl ) = 0.

Constructing the corresponding contrast is straightforward: X04 = Xgn − Xgl − Xpn + Xpl . As in

the one-way ANOVA example, the first column in the modified design matrix X0 is the vector

1.

y = X0 b0 + e,

[ 5.00]   [1 -1 -1  1]                [ 0.61]
[ 5.58]   [1 -1 -1  1]                [ 1.19]
[ 2.60]   [1 -1 -1  1]                [-1.80]
[10.78]   [1  1 -1 -1]   [7.470]      [ 3.56]
[ 4.83]   [1  1 -1 -1]   [1.493]      [-2.39]
[ 6.05] = [1  1 -1 -1] · [1.663]    + [-1.17]
[10.83]   [1 -1  1 -1]   [0.080]      [ 3.27]
[ 5.17]   [1 -1  1 -1]                [-2.39]
[ 6.68]   [1 -1  1 -1]                [-0.88]
[12.53]   [1  1  1  1]                [ 1.82]
[ 9.20]   [1  1  1  1]                [-1.51]
[10.39]   [1  1  1  1]                [-0.32]

3.2.3 Interpreting hypothesis tests of factorial designs

Since these contrasts are orthogonal, they can each be tested independently for significance using

F -ratios and the F -distribution to obtain the p-values for the corresponding null hypotheses.

F_{b01=0} = ( ‖X0(b01, 0, 0, 0)T‖² / 1 ) / ( ‖e‖² / 8 ) ≈ 669.61 / 5.953 ≈ 112.490;  p = 0.00000
F_{b02=0} = ( ‖X0(0, b02, 0, 0)T‖² / 1 ) / ( ‖e‖² / 8 ) ≈ 26.76 / 5.953 ≈ 4.496;  p = 0.06680
F_{b03=0} = ( ‖X0(0, 0, b03, 0)T‖² / 1 ) / ( ‖e‖² / 8 ) ≈ 33.20 / 5.953 ≈ 5.577;  p = 0.04584
F_{b04=0} = ( ‖X0(0, 0, 0, b04)T‖² / 1 ) / ( ‖e‖² / 8 ) ≈ 0.08 / 5.953 ≈ 0.013;  p = 0.91236

When studying the results of tests of factorial designs, the first thing to check is the interac-

tion term. As indicated by the high p-value of 0.91, the estimated interaction parameter is quite

likely under the null hypothesis of no interaction. This is the desired result, because when there

is a significantly non-zero interaction, we can no longer interpret the parameter estimates b02 and

b03 as estimates of the main effect for lectures and private tutoring, respectively. The reason for

this is that X02 averages the increase due to lectures in the case of group and of private tutoring.

If there is an interaction, then this average includes not just the main effect of lectures (the

increase in gain score due to lectures) but also half of the interaction effect (the increase in gain

score over and above the increase due to lectures and the increase due to private tutoring).

In this case, the interaction is very small and also statistically non-significant. This means

that we are free to interpret the values b02 and b03 as functions only of the increase in gain-scores

due to lectures and private tutoring, respectively, as long as these values are significant. To

answer the question of statistical significance, we return to the results of the F -ratios. The

p-value for the Fb02=0 ratio tells us there is a 6.7% chance of obtaining an estimate of the gain score due to lectures this large or larger under the null hypothesis, which is greater than the common threshold of 5%.

However, depending on the consequences of rejecting a true null hypothesis, many researchers

might still consider this a useful estimate of the effect. The p-value for the hypothesis test of

the main effect of private tutoring is 4.6%, below the common threshold. We can conclude that

the increase in gain-scores is significantly different from 0.

However, we must be careful when interpreting the estimates obtained from b02 and b03 . As

we saw in the last section, we must compute the constant that relates the estimated difference

dˆ1 in gain scores due to lectures to the value b02 , and find the constant relating b03 to an estimate

dˆ2 of the difference in gain scores due to private tutoring. In both cases the squared length of

the column vector is 12, we multiplied the contrasts by a constant of 2 to clear fractions, and

the number of people in each group is 3. We compute

d̂1 = ( 12 / (3 · 2) ) b02 = 2b02 ≈ 2.99,

and

d̂2 = ( 12 / (3 · 2) ) b03 = 2b03 ≈ 3.33.

We can conclude (since b02 and b03 are significantly different from 0) that private tutoring and

short lectures each independently increase gain-scores by about 3 points.
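For readers who wish to reproduce the factorial analysis, the following sketch (numpy and scipy; not part of the thesis) builds the contrast-coded design matrix from Table 3.2, fits the model, and computes the single-degree-of-freedom F-ratios discussed above.

    import numpy as np
    from scipy import stats

    # Sketch: two-way factorial analysis with contrast coding
    # (intercept, lecture main effect, tutoring main effect, interaction).
    y = np.array([5.00, 5.58, 2.60,       # group tutoring, no lectures
                  10.78, 4.83, 6.05,      # group tutoring, lectures
                  10.83, 5.17, 6.68,      # private tutoring, no lectures
                  12.53, 9.20, 10.39])    # private tutoring, lectures
    lec = np.repeat([-1.0, 1.0, -1.0, 1.0], 3)
    tut = np.repeat([-1.0, -1.0, 1.0, 1.0], 3)
    Xp = np.column_stack([np.ones(12), lec, tut, lec * tut])

    b, *_ = np.linalg.lstsq(Xp, y, rcond=None)
    e = y - Xp @ b
    mse = e @ e / 8
    for i in range(4):
        F = (Xp[:, i] @ Xp[:, i]) * b[i] ** 2 / mse    # 1-df F-ratio
        print(b[i], F, stats.f.sf(F, 1, 8))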

Chapter 4

The Geometry of Simple Regression


and Correlation

Regression analysis seeks to characterize the relationship between two continuous variables in

order to describe, predict, or explain phenomena. The nature of the data plays an important

role in interpreting the results of regression analysis. For example, when regression analysis is

applied to experimental data, only the dependent variable is a random variable. The independent

variables are fixed by the researcher and therefore not random. In the context of medical

research, this kind of data can be used to explain how independent variables such as dosage are

related to continuous outcome variables such as blood pressure or the level of glucose in blood.

The experimental design and the scientific theory explaining the mechanism by which the drug

affects the dependent variable together support causal claims concerning the effect of changes

in dosage.

With observational data, regression analysis supports predictions of unknown quantities but

the assumption of causality may not be justified or causality may go in the opposite direction. For

example, vocabulary and height among children are correlated but this is likely caused by another

variable that causes both: the child’s age. One can imagine using data from several observatories

to estimate the trajectory of a comet, but in fact it is the actual trajectory that causes the data

collected by the observatories, not vice-versa. Many businesses and other institutions rely on

regression analyses to make predictions. For example, colleges and universities solicit students’

scores on standardized tests such as the SAT in order to make enrollment decisions because

these scores can partially predict student success.

In the social sciences and economics, experimental data are rare. Regression analysis can be

applied to observational data sets as long as appropriate conditions are met (in particular, the

independent variables must not be correlated with estimation errors), and regression analysis

can be used to analyze data sets that are composed of random, independent variables. In

these cases, care must be taken that the presumed regression makes sense from a logical and

theoretical point of view. Regressing incidents of lung cancer on tobacco sales by congressional

district makes sense because smoking may cause cancer. Interpreting the results of such a study

would allow one to make statements such as, “a decrease in tobacco sales by x amount will

result in a decrease in cancer incidence by y amount.” Indeed, this reasoning might motivate tax

policy. However, to speak of increasing the incidence of cancer does not make sense (and were

it somehow possible to do so, it is still doubtful this would then cause more people to smoke).

Regressing tobacco sales on cancer incidence does not make sense because cancer incidence is not

the kind of variable that can be manipulated directly by researchers or society. In certain areas

of the social sciences, a dearth of appropriate theoretical explanations may not allow researchers

to make causal claims at all, although predictions and descriptions are well warranted and useful.

When regression analysis is applied to observational data for the purpose of prediction, the

independent variables are called predictors and the dependent variable is called the criterion.

The theoretical assumption of causality is relaxed in this case, but the regression equation still

has a meaningful interpretation in the context of prediction. For example, economists regressing

income on the number of books owned might conclude that it is reasonable to predict an increase

of x dollars above average income for those who own y more books than average. However, it

is likely in analyses like these that the number of books is a proxy for other factors that might

be more difficult or costly to measure and that presumably cause both book ownership and

income. No one would propose giving out books as policy to eradicate poverty (especially if the

population was illiterate), but the information about books in the home can be used to adjust

predictions about future income.

Regression analysis is very flexible. One flexibility is that the dependent variable in a re-

gression model can be transformed so that the relationship between independent and dependent

variables is more nearly linear. For example, population growth is often an exponential phe-

nomenon and if population is the dependent variable for a model that is linear in time, then using

the logarithm of the population will likely provide a better-fitting model. Regression analysis is

closely related to correlation analysis in which both the independent and dependent variables

are assumed to be random. This variation is discussed in Section 4.2.

4.1 Simple regression

Before taking up an example of multiple regression, it is worthwhile to consider an example of

a simple regression model, regression of a single independent variable on a dependent variable.

We take as our example a variation of the tutoring study discussed in the last chapter. Suppose

a tutoring company wanted to research the efficacy of particular tutors in order to generate data

for hiring decisions. The analysis will use the scores of tutors on the standardized test as the

independent variable and the average gain-scores of tutees on the same tests as the dependent

variable. To illustrate this example we use the simulated data presented in Table 4.1.

Table 4.1: Simulated score data for 4 tutors employed by the tutoring company.

            Tutor score   Average tutee gain-score
  Tutor A   620           7.33
  Tutor B   690           8.16
  Tutor C   720           11.07
  Tutor D   770           11.94

To proceed with regression analysis, we must assume that the three conditions discussed in

Section 2.1.4 hold for these data. First, we suppose the model of the true relationship between

tutor scores and tutee gain-scores to be of the form

Y = Xβ + E,

and we assume that all Eᵢ are independent random variables with normal distributions, where E(Eᵢ) = 0 and Var(Eᵢ) = σ² for all i. This is equivalent to assuming that Y is a vector of independent normal random variables Yᵢ, each with common variance σ² and mean µᵢ = (Xβ)ᵢ. Recall that β = (β₀, β₁)ᵀ is estimated by b = (b₀, b₁)ᵀ using the method of least squares.

There are two conventions for defining the design matrix X for linear regression. One option

is to set X = [ 1 x ], where x is the vector of observations of the independent variable, in this

case the vector of tutor scores, x = (620, 690, 720, 770)T . The other option is to use the centered

vector xc = x − x̄1 instead of x; this yields the design matrix X′ = [1 xc]. We saw in Section 2.1.2 that these matrices produce equivalent results in general, and so we are free to adopt the centered design matrix X′ for the discussion of regression analyses. Using this design matrix

is convenient for producing vector drawings of relevant subspaces of individual space because

the subspace of individual space spanned by xc (and by the centered columns xci in general) is

orthogonal to V1 .

The fitted model for the tutoring study can now be expressed explicitly:

y = X′b′ + e,

(7.33, 8.16, 11.07, 11.94)ᵀ = [1 xc] (9.625, 0.033)ᵀ + (0.344, −1.135, 0.785, 0.006)ᵀ,

where 1 = (1, 1, 1, 1)ᵀ and xc = (−80, −10, 20, 70)ᵀ.

We can interpret this fitted model by providing meaning for the estimated parameters in the vector b′ = (b′₀, b′₁)ᵀ from the given context. In particular, this fitted model gives the overall mean gain-score of b′₀ = 9.625 and says that for every point increase in the score of tutors above the mean of 700, the gain-score of the tutees increases by b′₁ = 0.033 points. It remains

to determine whether the results are statistically significant. Before answering this important

question using the familiar F statistic, we briefly discuss the geometry of the fitted regression

model.
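Before turning to the geometry, a quick numerical check of these estimates may be helpful. The following sketch is not part of the original thesis (whose analyses and figures were produced with R and a custom drawing program); it simply refits the centered model from Table 4.1 with NumPy.

    import numpy as np

    # Simulated tutor data from Table 4.1
    x = np.array([620.0, 690.0, 720.0, 770.0])   # tutor scores
    y = np.array([7.33, 8.16, 11.07, 11.94])     # average tutee gain-scores

    # Centered design matrix X' = [1 xc]
    xc = x - x.mean()
    X = np.column_stack([np.ones_like(x), xc])

    # Least-squares estimate b' and the residual (error) vector e
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b

    print(b)   # approximately [9.625, 0.033]
    print(e)   # approximately [0.344, -1.135, 0.785, 0.006]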

Since there are only 4 observations of each variable, individual space for this study is R4 .

The model is fitted by projecting y onto the mean space V1 and the model space VXc . Since each

of these spaces is a line, there are 2 remaining dimensions in the error space Ve . Understood

geometrically, the vector b′ indicates that ŷ, the projection of y into the column space of X′, is the sum of the component in mean space (the vector 9.625(1, 1, 1, 1)ᵀ) and the component in model space (the vector 0.033(−80, −10, 20, 70)ᵀ). This is illustrated by the vector diagram in Figure 4.1.
Figure 4.1: The vector ŷ is the sum of two components: ȳ1 and ŷc.

These figures give some indication that y and ŷ are close and thus it is plausible that the

fitted model may provide useful information about the relative performance of the tutors. Recall

that the least-squares method for obtaining b′ guarantees the ŷ that minimizes the length of the error vector e.

(Note that in regression contexts, the error vector is often described as the vector of residuals.) As

in the case of ANOVA, we rely on the F statistic to provide a rigorous determination of closeness.

The important hypothesis to test here is simply whether b′₁ is different from zero (null hypothesis H₀: β′₁ = 0 against the alternative H₁: β′₁ ≠ 0). We

wish to know if the average per-dimension squared length of ŷ is significantly greater than the

average per-dimension squared length of e, or alternatively if the average per-dimension squared

length of ŷc is significantly greater than the average per-dimension squared length of ec . These

are two different tests and will have different results. The first compares the error between the

regression model and the model that assumes E(Yᵢ) = 0 for all i. The second is a more sensitive test, analogous to the F test introduced in Section 1.4.2. In this second test we are comparing

the full regression model with the model that allows Y to have a non-zero mean. In the first

test, ŷ has 2 degrees of freedom, but in the second test ŷc has only 1 degree of freedom. In

both cases the error vector has 2 degrees of freedom. In either case, the general procedure is the

same: if the F statistic is sufficiently large then we can reject the null hypothesis in favor of the alternative hypothesis (in the second test, H₁: β′₁ ≠ 0). The results of these tests follow.

F_{b′₀=b′₁=0} = (‖ŷ‖²/2) / (‖e‖²/2) ≈ 191.6986/1.0117 ≈ 189.482,   p = 0.00525

F_{b′₁=0} = (‖ŷc‖²/1) / (‖e‖²/2) ≈ 12.837/1.0117 ≈ 12.688,   p = 0.07057

The results of these hypothesis tests are clearly different. The low p-value for the test of the fitted model against the null model provides support for rejecting the hypothesis that β′₀ = β′₁ = 0. However, the second test tells a slightly different story. The hypothesis that β′₁ = 0 cannot be rejected at the traditionally accepted level of risk of rejecting a true null hypothesis (5%). However, another often-used level is 10%, and the p-value for the second F test is below this threshold. In some cases, an analyst may decide to reject this null hypothesis in favor of the hypothesis that the coefficient β′₁ is not 0.

It is worth noting that because the two columns of the design matrix are orthogonal (1 · xc = 0), the coefficients b′₀ and b′₁ can be tested independently. Thus, we can independently test the hypothesis that µ_Y is zero by comparing the squared length of the other component of ŷ, namely ŷ − ŷc = ȳ1 = b′₀1 (see Figure 4.1), with the squared length of the error vector e:

F_{b′₀=0} = (‖ŷ − ŷc‖²/1) / (‖e‖²/2) ≈ 370.563/1.0117 ≈ 366.260,   p = 0.00272.

We will see that in multiple regression analyses the columns of X are not always orthogonal and

thus do not afford independent tests of each predictor. This is one of the primary differences

between the columns of ANOVA design matrices, which almost always are orthogonal, and

design matrices for multiple regression, which are rarely orthogonal.
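As a numerical check (again not part of the thesis), the three F-ratios and p-values above can be reproduced with NumPy and SciPy from the projections of y onto the mean, model, and error spaces.

    import numpy as np
    from scipy import stats

    y = np.array([7.33, 8.16, 11.07, 11.94])
    xc = np.array([-80.0, -10.0, 20.0, 70.0])

    ybar1 = np.full(4, y.mean())                   # projection onto the mean space V1
    yc_hat = (xc @ y / (xc @ xc)) * xc             # projection onto the model space Vxc
    y_hat = ybar1 + yc_hat
    e = y - y_hat                                  # error (residual) vector

    F_full = (y_hat @ y_hat / 2) / (e @ e / 2)     # test of b'0 = b'1 = 0
    F_slope = (yc_hat @ yc_hat / 1) / (e @ e / 2)  # test of b'1 = 0
    F_mean = (ybar1 @ ybar1 / 1) / (e @ e / 2)     # test of b'0 = 0

    print(F_full, stats.f.sf(F_full, 2, 2))        # ~189.5, p ~ 0.0053
    print(F_slope, stats.f.sf(F_slope, 1, 2))      # ~12.7,  p ~ 0.071
    print(F_mean, stats.f.sf(F_mean, 1, 2))        # ~366.3, p ~ 0.0027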

To conclude our discussion of simple regression, we consider the contribution of the geometry

of variable space and compare this with the geometry of individual space presented above.

[Figure 4.2 is a scatterplot titled “Scatterplot of Tutor and Tutee Scores”; horizontal axis: Tutor score, vertical axis: Average tutee gain-score.]

Figure 4.2: A scatterplot for the simple regression example showing the residuals, the difference
between the observed and predicted values for each individual.

Recall that scatterplots can be used to represent data sets in variable space; in the case of

simple regression, there are just two variables and so scatterplots often provide convenient

representations of the data. The scatterplot corresponding to Table 4.1 is provided in Figure

4.2, and shows the residuals, the (vertical) differences between each data point and the line

representing the least-squares estimates of gain-scores for all tutor scores. The error vector in

Figure 4.1 under the natural coordinate system of individual space is the vector of residuals,

(0.344, −1.135, 0.785, 0.006)T .

The simple regression example also provides an opportunity to discuss how linear models can

Table 4.2: Modified data for 4 tutors and log-transformed data.

            Tutor score   Average tutee gain-score   Log-transformed ave. tutee gain-score
  Tutor A   620           7.33                       1.992
  Tutor B   690           8.16                       2.099
  Tutor C   720           9.77                       2.280
  Tutor D   770           11.94                      2.480

be extended with link functions. For example, suppose Tutor C had an average tutee gain-score

of 9.77 instead of 11.07. In this case, transforming the dependent variable by taking the natural

logarithm produces a data set that is more nearly linear than the untransformed data (see

Table 4.2). The scatterplot with the least-squares estimation line (as well as the corresponding,

untransformed analog) is shown in Figure 4.3.

[Figure 4.3 shows two scatterplots with least-squares lines: “Tutor & Log−transformed Tutee Scores” (vertical axis: Log of average tutee gain-score) and “Tutor & Tutee Scores” (vertical axis: Average tutee gain-score); the horizontal axis in both is Tutor score.]
Figure 4.3: Least-squares estimate and residuals for the transformed and untransformed data.
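A sketch of this transformation in code (not from the thesis): the dependent variable in Table 4.2 is fitted on the log scale, and the residual sums of squares on the original scale offer one simple comparison of the two fits.

    import numpy as np

    # Modified data from Table 4.2
    x = np.array([620.0, 690.0, 720.0, 770.0])
    y = np.array([7.33, 8.16, 9.77, 11.94])

    X = np.column_stack([np.ones_like(x), x - x.mean()])

    # Fit the untransformed and the log-transformed dependent variable
    b_raw, *_ = np.linalg.lstsq(X, y, rcond=None)
    b_log, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

    # Residual sums of squares on the original scale
    rss_raw = np.sum((y - X @ b_raw) ** 2)
    rss_log = np.sum((y - np.exp(X @ b_log)) ** 2)
    print(rss_raw, rss_log)   # for these four points the back-transformed log fit is closer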

4.2 Correlation analysis

In the most basic sense, correlation is a measure of the linear association of two variables. The

conceptual distinction between cause-effect relationships and mere association did not long pre-

cede the development of a statistical measure of this association. In the middle of the nineteenth

century, the philosopher John Stuart Mill recognized the associated occurrence of events as a

necessary but insufficient criterion for causation (Cook and Campbell, 1979). His philosophical

work set the stage for Sir Francis Galton, who defined correlation conceptually, worked out a

mathematical theory for the bivariate normal distribution by 1885, and also observed that corre-

lation must always be less than 1. In 1895, Karl Pearson built on Galton’s work, developing the

most common formula for the correlation coefficient used today. Called the product-moment

correlation coefficient or Pearson’s r, this index of correlation can be understood as the dot

product of the normalized, centered variables:

r_xy = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² Σ(yᵢ − ȳ)²] = Σ xcᵢ ycᵢ / √[Σ xcᵢ² Σ ycᵢ²] = (xc/‖xc‖) · (yc/‖yc‖).    (4.1)
It is straightforward to establish that |r| ≤ 1 using the Cauchy-Schwarz Inequality.
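A short check of equation 4.1 in code (not part of the thesis): the dot product of the centered, normalized vectors matches the usual sample correlation.

    import numpy as np

    def pearson_r(x, y):
        # Correlation as the dot product of the centered, normalized vectors (equation 4.1)
        xc = x - x.mean()
        yc = y - y.mean()
        return (xc / np.linalg.norm(xc)) @ (yc / np.linalg.norm(yc))

    x = np.array([620.0, 690.0, 720.0, 770.0])
    y = np.array([7.33, 8.16, 11.07, 11.94])

    print(pearson_r(x, y))            # geometric definition
    print(np.corrcoef(x, y)[0, 1])    # NumPy's built-in estimate agrees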

In variable space, the correlation coefficient can be understood as an indication of the linearity

of the relationship between two variables. It answers the question: How well can one variable

be approximated as a linear function of the other? In introductory statistics texts, panels

of scatterplots are frequently presented to illustrate what various values of correlation might

look like (Figure 4.4). These diagrams may lead to the misconception that the correlation tells

more about a bivariate relationship than in fact it does (for example, see Anscombe, 1973). In

practice, it is often difficult to estimate the correlation by looking at a scatterplot or given the

correlation, to obtain a clear sense of what the scatter plot might look like.

In individual space, however, the correlation between two variables has the simple interpreta-

tion of being the cosine of the angle between the centered variable vectors. Much of the power of

the vector approach is derived from this straightforward geometric interpretation of correlation.

This relationship is clear when we consider the 2-dimensional subspace spanned by the centered

variable vectors (Figure 4.5). Given the centered vectors xc and yc , a right triangle is formed

[Figure 4.4 contains two panels of scatterplots.]
(a) Typical correlation examples (r = −1, 0, 1, 0.3, −0.6, 0.9).
(b) All have correlation of 0.816 (Anscombe, 1973).

Figure 4.4: (a) Panels of scatter plots give an idealized image of correlation, but in practice, (b)
plots with the same correlation can vary quite widely.

by yc , the projection of yc onto xc (called ŷc ), and the difference between these vectors, yc − ŷc .

The cosine of an angle in a right triangle is the ratio of the adjacent side and the hypotenuse.

When the lengths of xc and yc are 1, the cosine of the angle between them is simply the length

of the projection ŷc , the quantity xc · yc . More generally, the cosine of the angle between two

centered vectors is the dot product of the centered, normalized vectors:

xc ·yc
kProjyc xc k kyc k2
kyc k x c · yc
cos(θxc yc ) = = = =r (4.2)
kxc k kxc k kxc k kyc k

When ŷc lies in the opposite direction as xc , the correlation is negative.

Correlation analysis usually involves using r to estimate the parameter called Pearson's ρ, which is defined by ρ = σ_XY / (σ_X σ_Y), where σ_XY is the covariance of the random variables X and Y.

Both variables are assumed to be normal and to follow a bivariate normal distribution. The

Figure 4.5: The vector diagram illustrates that r_{xc,yc} = cos(θ_{xc,yc}) and that r_{xc,yc′} = −cos(π − θ_{xc,yc′}).

most common hypothesis test is whether or not the correlation of two variables is significantly

different from zero. The test statistic used for this test follows Student’s t distribution (see

Section 6.4) which when squared is equivalent to the F-ratio for the same hypothesis test with

1 degree of freedom in the numerator (see Section 1.4.2 and 6.3).

For example, consider the variable S defined as the average total scores on state assessments

per district of eighth graders in Massachusetts at the end of the 1997-1998 school year and the

variable I defined as the per capita income in the same school districts during the tax year 1997.

A random sample of 21 districts yields a sample correlation between S and I of rSI = 0.7834.

This suggests there may be a moderate positive linear association between the variables. We

want to test whether this correlation is significantly different from 0 because a correlation of

0 means there is no association. As in the last section, first we establish the null hypothesis,

H₀: ρ = 0, and the alternative hypothesis, H₁: ρ ≠ 0. We assume the null hypothesis is true and compute the probability of obtaining a correlation at least as extreme as the one observed. If this probability is low, we reject the null hypothesis and conclude that the correlation of the variables is different from zero.

The test statistic can be computed:

t = r √(n − 2) / √(1 − r²) = 0.7834 √(21 − 2) / √(1 − 0.7834²) = 5.4942.

This number is the t value corresponding to the observed correlation. The probability of obtaining a test statistic at least this high can be computed as ∫_{5.4942}^{∞} f(t) dt ≈ 10⁻⁵, where f(t) is the probability density function of Student's t-distribution with 21 − 2 = 19 degrees of freedom. Because

the probability is well below the standard threshold of 0.05, we reject the null hypothesis in favor

of the alternative hypothesis, concluding that there is a positive correlation between districts’

per capita income and the average total score on the eighth grade state exam.

Although correlation plays an important role in research, it frequently does not give the

most useful information about a data set. Fisher (1958) wrote, “The regression coefficients are

of interest and scientific importance in many classes of data where the correlation coefficient, if

used at all, is an artificial concept of no real utility” (p. 129). Correlations are easy to compute

but often hard to interpret. Even correct interpretations might not answer the instrumental

question at the heart of most scientific research, How do manipulations of one variable affect

another? In many cases, social scientists and market analysts are content to avoid addressing

causality and instead answer the different question: How do variations in some variables predict

the variation in others?

Chapter 5

The Geometry of Multiple Regression

In the last chapter, we saw that one flexibility of regression analysis is that variables can be

transformed via (not necessarily linear) functions and in this way used to model non-linear phe-

nomena. Another flexibility is the option to use more than one continuous predictor. Regression

models that include more than one independent variable are called multiple regression models.

We begin with a two-predictor multiple regression model using data from the Massachusetts

Comprehensive Assessment System and the 1990 U.S. Census on Massachusetts school districts

in the academic year 1997-1998. (This data set is included with the statistical software package

R.)

Given data from all 220 Massachusetts school districts, suppose Y denotes the per-district

average total score of fourth graders on the Massachusetts state achievement test, and suppose

X1 and X2 denote the per-capita income and the student-teacher ratio, respectively. We want to

predict the district average total score (Y ) given the per-capita income and the student-teacher

ratio (X1 and X2 ). We hypothesize that higher student-teacher ratio will be predictive of a

lower average total score and that greater per-capita income will be predictive of higher average

total scores. The first 5 data points are shown in Table 5.1.

The model for multiple regression with two predictors is similar to a two-way ANOVA model;

the only difference is that the vectors x1 and x2 contain measurement data instead of numerical

Table 5.1: Sample data for Massachusetts school districts in the 1997-1998 school year. Source: Massachusetts Comprehensive Assessment System and the 1990 U.S. Census.

  District    Average total        Ratio of       Per capita     Percent free    Percent
  name        achievement score    students       income         or reduced      English
              (fourth grade)       per teacher    (thousands)    lunch           learners
  Abington    714                  19.0           16.379         11.8            0
  Acton       731                  22.6           25.792         2.5             1.246
  Acushnet    704                  19.3           14.040         14.1            0
  Agawam      704                  17.9           16.111         12.1            0.323
  Amesbury    701                  17.5           15.423         17.4            0
  ...         ...                  ...            ...            ...             ...
  N = 220     ȳ = 709.827          x̄₁ = 17.344    x̄₂ = 18.747    x̄₃ = 15.316     x̄₄ = 1.118
              s_y = 15.126         s_x1 = 2.277   s_x2 = 5.808   s_x3 = 15.060   s_x4 = 2.901

tags for specifying factor levels and contrasts. The design matrix has three columns: X =

[1 x1 x2 ]. The first 5 data points from Table 5.1 are shown below.

y = Xb + e,

(714, 731, 704, 704, 701, …)ᵀ = [1 x₁ x₂] (699.657, −1.096, 1.557)ᵀ + (9.673, 15.967, 3.642, −1.116, −3.483, …)ᵀ,

where the first five rows of the design matrix [1 x₁ x₂] are (1, 19.0, 16.379), (1, 22.6, 25.792), (1, 19.3, 14.040), (1, 17.9, 16.111), and (1, 17.5, 15.423).

As always, the goal of the analysis is to find the b = (b0 , b1 , b2 )T so that the squared length

of the vector e = [e1 , e2 , · · · ]T is minimized. One geometric interpretation of the values ei is as

the vertical distances between the ith data point and the regression plane in a three-dimensional

scatterplot. This interpretation corresponds to the two-dimensional interpretation of residuals

(see Figure 4.2) and is illustrated in Figure 5.1. The measure of closeness is the sum of the

squared lengths of the vertical error components, and the scatterplot provides a rough sense of

how well the model fits the data.

[Figure 5.1 is a 3D scatterplot titled “MA School Districts (1998−1999)” with axes Per Capita Income (thousands), Student−Teacher Ratio, and District mean 4th grade Total Score; the Acton district and its error component are highlighted.]
Figure 5.1: The data are illustrated with a 3D scatter plot that also shows the regression plane and the error component (e₂) for the prediction of the district mean total fourth grade achievement score for Acton, MA.

To make use of the second geometric interpretation (a diagram of vectors in individual space),

it is necessary to center the data vectors x1 and x2. The new design matrix is X′ = [1 xc1 xc2], and the corresponding model is y = X′b′ + e. As noted before, this does not change the model space because C(X) is equal to C(X′). In general, b′ᵢ = bᵢ when i ≠ 0, but the first coefficients in the two models are not necessarily equal. The value b₀ denotes the intercept, the expected value of Y when X₁ = X₂ = 0. The value of b′₀ instead denotes the mean value ȳ, which is also the expected value of Y when X₁ = x̄₁ and X₂ = x̄₂.

y = X′b′ + e,

(714, 731, 704, 704, 701, …)ᵀ = [1 xc1 xc2] (709.827, −1.096, 1.557)ᵀ + (9.673, 15.967, 3.642, −1.116, −3.483, …)ᵀ,

where the first five rows of [1 xc1 xc2] are (1.000, 1.656, −2.368), (1.000, 5.256, 7.045), (1.000, 1.956, −4.707), (1.000, 0.556, −2.636), and (1.000, 0.156, −3.324).

We can check by inspection that the error vectors in both models are the same (recall that this

follows from the orthogonality of mean space and error space). By considering the geometric

representation of the models in individual space, we can see that the models are simply two

different ways to write the explanatory portion of the model, Xb = X′b′. These two ways of writing the model essentially correspond to a different choice of basis vectors for model space. In fact, we see that:

y = ȳ1 + xc1 b1 + xc2 b2 + e,

y − ȳ1 = yc = xc1 b1 + xc2 b2 + e.

Because the vectors xc1 and xc2 are both orthogonal to 1, by choosing the centered representa-

tion, it is possible to view only the vectors orthogonal to 1 in the vector diagram, leaving enough

dimensions to show the relationship among yc and the vectors xc1 and xc2 under the constraints

of a two-dimensional figure (see Figure 5.2). We can create similar diagrams schematically with

a greater number of predictors by representing hyperplanes using planes or lines. In vector

diagrams, the measure of fit is the squared length of the error vector. The representation in

a single entity, the error vector e, of all of the error across the entire data set is one of the

strengths of vector diagrams and the geometric interpretation afforded by individual space.

Figure 5.2: The geometric relationships among the vectors yc, xc1, and xc2.

After representing the multiple regression solution in these two ways, it remains to determine

if the vector ŷ = X0 b0 actually provides a better prediction of y than chance. As before, there

are two competing hypotheses to consider. On the one hand, the null hypothesis states that

the population parameters β′₁ = β′₂ = 0. The alternative hypothesis is that ŷ is indeed close to

E(Y) and that it is consequently acceptable to use X1 and X2 to predict Y.

As before, we use the F -ratio to compare the estimate of the variance of Y obtained from the

average per-dimension squared length of the error vector with the estimate of the variance of Y

obtained from the average per-dimension squared length of ŷ. Under the null hypothesis, these

estimates should be fairly close and the F -ratio should be small. On the other hand, we reject

the null hypothesis if the F -ratio is large—if the estimates of the variance of Y by projecting y

into model space and into error space are significantly different. In the case of this model, we

73
have
kŷc k2 /1 10402.5
Fb01 =b02 =0 = ≈ ≈ 77.03, p = 0.00000.
kec k2 /217 135.042

Based on this analysis, we can reject the null hypothesis and tentatively conclude that ŷ is

close to Y. Note that the very small (and likely non-zero) p-value is due to the high degrees

of freedom in the denominator (see Section 6.3). It is important to stress, however, that one

cannot rule out the possibility that other models provide even better predictions of Y. In the

following sections, we will see extensions of this example that illustrate this point.
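The following sketch (not from the thesis, whose computations were done in R) shows how this overall F test and the fit could be reproduced, assuming score, stratio, and percap are length-220 NumPy arrays holding the district values of Y, X1, and X2; how they are loaded is left open.

    import numpy as np
    from scipy import stats

    def overall_f_test(score, stratio, percap):
        n = len(score)
        Xc = np.column_stack([np.ones(n), stratio - stratio.mean(), percap - percap.mean()])

        b, *_ = np.linalg.lstsq(Xc, score, rcond=None)
        y_hat = Xc @ b
        e = score - y_hat
        yc_hat = y_hat - score.mean()      # projection onto the centered model space

        df_model, df_error = 2, n - 3
        F = (yc_hat @ yc_hat / df_model) / (e @ e / df_error)
        R2 = (yc_hat @ yc_hat) / ((score - score.mean()) @ (score - score.mean()))
        return b, F, stats.f.sf(F, df_model, df_error), R2

With the full data set, the thesis reports F ≈ 77.03 and (in the next section) R² ≈ 0.415.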

5.1 Multiple correlation

One way to measure overall model fit is the generalization of the correlation coefficient, r, called

the multiple correlation coefficient; it is denoted R. In terms of the geometry of individual space,

we saw that the notation rxy indicates the cosine of the angle between the (centered) vectors x

and y. Multiple correlation is the correlation between the criterion variable and the projection

of this variable onto the span of the predictor variables. We thus define

Ry.x1 ...xn = ryc ŷc ,

where ŷc is the projection of yc onto the space C([xc1 , . . . , xcn ]).

A simple geometric argument is sufficient to justify the role of the squared multiple correlation

coefficient, R2 , as the measure of the proportion of the variance of the criterion yc that is

explained by the model in question. Consider the triangle formed by yc , ŷc , and the error vector

e. Since ŷc and e are orthogonal, the Pythagorean theorem provides the following equation:

‖yc‖² = ‖ŷc‖² + ‖e‖².

The squared correlation is the squared cosine and must therefore be ‖ŷc‖²/‖yc‖². Now a frequently-used estimator for the variance of a random variable vector Y is the per-dimension squared length of yc (see equation 1.3.4). Thus, the ratio of the variance explained by the model (‖ŷc‖²/(n − 1)) to the sample variance (‖yc‖²/(n − 1)) is simply the ratio ‖ŷc‖²/‖yc‖². This equality demonstrates

that R2 can be used to assess model fit. For example, if two models are proposed, then the one

with the greater R2 value is said to be the model with the better fit.

In the present example, the squared length of the vector ŷc is 20805.1 and the squared length

of yc is 50109.44. Therefore, we have that R²_{y.x1x2} = 0.4152, and we can say that this model,

which uses the student-teacher ratio and the district average per-capita income to predict district

average total MCAS scores in the fourth grade, explains a little more than 40% of the observed

variance among these district average scores.

5.2 Regression coefficients

We turn now to the task of interpreting the model coefficient vector b. The first coefficient is

the easiest to deal with; it is simply the value of y when the predictors are both 0. The meaning

of the first coefficient is slightly different for the centered and non-centered models. One must

decide if the data and the meaning of the variables allow a prediction of y when both predictors

are zero (i.e., when x1 = x2 = 0). However, in centered models, xc1 = xc2 = 0 when x1 = x̄₁1 and x2 = x̄₂1. It follows that b′₀ is the model prediction of Y in the case that both predictors attain their mean values.

One might be tempted to interpret the regression coefficients b01 , and b02 in the last example

as the respective changes produced in the mean total score per unit change in the student-

teacher ratio and per capita income. However, it is quite possible that other variables are

actually responsible for the change in average test score and only happen to be associated with

the variables in the model as well. For example, the cock’s crow precedes and is highly correlated

with sunrise but the spinning earth causes both phenomena.

In ANOVA designs with 2 or more factors, the orthogonality of the factors means we can

interpret each model coefficient independently. In regression analyses, however, the predictors

are often correlated. This means that regression coefficients cannot be interpreted without

considering the whole model. Suppose, for example, that the predictors x1 and x2 are correlated, so that the relationship between them can be expressed x2 = x1 + z for some vector z. Then in the model

y = b1 x1 + b2 x2,

if every entry of x1 is increased by 1 (so that x2 = x1 + z changes with it), we have

y′ = b1 (x1 + 1) + b2 (x1 + 1 + z) = y + (b1 + b2)1.

If the predictors were independent, then one would expect

y′ = y + b1 1,

instead.

The best interpretation of the regression coefficient bi when predictors are correlated is as

the increase in the criterion variable per unit increase in the predictor variable, holding all other

variables constant. In applied contexts, however, this might not actually make much sense.

For example, it might not be feasible for a school district to hire more teachers (decreasing the

student-teacher ratio) without a larger per capita income and concomitantly higher tax revenue

because of budgetary constraints.

Without careful design and theoretical support for the variables included in the model and

reasonable confidence that these variables are the only relevant ones, multiple regression offers

at best prediction with regression coefficients that have little value for generating explanations

or identifying causal effects.

5.3 Orthogonality, multicollinearity, and suppressor variables

Multiple regression analyses can have features that seem counter-intuitive or paradoxical on

the first take and the explanation of many of these is greatly facilitated by vector diagrams of

individual space and the associated geometric orientation. This section elucidates four of these

features: orthogonality, multicollinearity, partial correlation, and suppressor variables.

We have already discussed the caution required when interpreting regression coefficients

in a model that has non-orthogonal predictors. When predictors in a model are orthogonal,

the estimates of corresponding regression coefficients can be interpreted independently of one

another. Moreover, as we saw in the section on factorial contrasts (see Section 3.2.2), the

associated regression coefficients can be tested independently for significance. Experimental

design allows the researcher to ensure that predictors are orthogonal, but this is rarely the

case in observational research. What is more important than whether or not two predictors

are orthogonal is how close the predictors are to being orthogonal. Correlation, especially

understood in relation to the measure of the angle formed between two centered vectors in

individual space, is especially helpful for quantifying this relationship.

To examine the impact of near orthogonality and its absence, consider three models of the

mean total fourth grade scores for Massachusetts districts, summarized in Table 5.1. The first

model is

y = [ 1 xc1 xc2 ]b0 + e

and was presented above: the variables of student-teacher ratio and per capita income predict

mean total fourth grade achievement. The second model is simpler, using only per capita income

variable to predict achievement and can be expressed

y = [ 1 xc2 ]b0 + e.

The third model is like the first, but uses the percentage of students eligible for free or reduced

lunch, x3 , instead of the student-teacher ratio. It can be expressed

y = [ 1 xc2 xc3 ]b0 + e.

The estimates of the regression coefficient b02 from each model are presented in Table 5.2 along

with the squared length of the corresponding projection of y onto xc2 , the associated degrees of

freedom, and the average per dimension length.

One important observation to be made about the regression coefficient for the per capita

income variable is that it seems quite similar (yet not identical) in the first two models and very different in the third model. It is outside the scope of this thesis to describe how one

decides whether differences in these estimates are significant. However, calculating the p-values

assures us that all three F -ratios are significantly different than 0 and, using statistical methods

Table 5.2: The value of the regression coefficient for per capita income and corresponding F-ratios in three different models of mean total fourth grade achievement.

  Design matrix           Source of     Squared    Degrees of   Ave. length    F-ratio    Estimate
                          variation     length     freedom      per dim.                  of b′₂
  X′₁ = [1 xc1 xc2]       b′₂ xc2       19475.2    1            19475.2        144.215    1.557
                          e             29304.3    217          135.0
  X′₂ = [1 xc2]           b′₂ xc2       19475.2    1            19475.2        138.59     1.624
                          e             30634      218          140.5
  X′₃ = [1 xc2 xc3]       b′₂ xc2       19475      1            19475.2        250.64     0.694
                          e             16861      217          77.7

outside the scope of the present discussion, the estimates from the first and second model are not

significantly different but the third estimate is significantly different from the first two. Recall

that the independence of predictors in an orthogonal design implies that the inclusion of other

variables in the model would not affect the estimation of the model. In three different orthogonal

models containing x2 and different other variables as predictors, all three of the estimates for b02

would be identical.

The correlations among the variables x1 , x2 , and x3 cause differences in the estimates for b02

reported in Table 5.2. The reason that the first two estimates are similar is that student-teacher

ratio and per capita income are not highly correlated (r = −0.157). This implies that the

corresponding centered vectors for these variables, xc1 and xc2 , form a 99◦ angle in individual

space. Geometrically, we can see that they are fairly close to orthogonal. On the other hand,

the per capita income and the percentage of students eligible for free or reduced lunch are more

highly correlated variables with r = −0.563, as one might expect. The angle formed between

the (centered) vectors xc2 and xc3 in individual space is 124◦ .

The vector diagrams of the three model subspaces in Figure 5.3 illustrate these ideas clearly.

The Pythagorean theorem guarantees that whenever the predictors are mutually orthogonal,

the sum of the squared lengths of the error vector and each projection of y onto the subspaces

(a) The vector yc and model space V_{Xc1}.   (b) The vector yc and model space V_{Xc3}.

Figure 5.3: The vector diagrams of VXc1 and VXc3 suggest why the value of the coefficient b2
varies between models.

corresponding to individual predictors is equal to the squared length of the observation vector.

When predictors are nearly orthogonal, then this additivity and independence are to some

extent maintained. The inclusion of the teacher-student ratio variable did not significantly

change the estimate of b02 . Whenever the predictors are not orthogonal, the additivity property

fails and individual projections are no longer the same as the contribution of a predictor to the

overall model. For this reason, interpreting the regression coefficients for variables in models

of observational data is difficult; the inclusion or exclusion of a correlated variable can have

large consequences for the estimation of these coefficients. The safer approach is to use the

whole model rather than attempting to interpret the regression coefficients. This is especially

appropriate when the model is being used for prediction rather than explanation.

At the other extreme from orthogonality lies collinearity or multicollinearity, the feature of

linear dependence among a set of predictors. Multicollinearity is easy to define (there exists a

nontrivial linear combination of the xi equal to zero) and collinearity is the 2-dimensional analog. It is often

easy to identify and fix. In studies using observational data, multicollinearity often means that

the analyst has inadvertently included redundant variables such as the sub-scores and the total

score on an instrument. More problematic is near multicollinearity, in which the design matrix X still has full column rank, yet the analysis results in unstable (and therefore likely misleading) conclusions.

Multicollinearity is fundamentally a geometric feature of the set of predictors and is best

understood through representations in individual space. Considering each predictor as a vector

in individual space, we know from linear algebra that a linearly independent set of p-vectors

spans a p-dimensional space. In a set of predictors that is nearly collinear, there is at least one

vector xi that is close to the subspace spanned by the remaining predictors in the sense that the

angle between xi and the projection of xi into the space spanned by the rest of the vectors is

small. In practice, this suggests that we can detect near multicollinearity by regressing xi onto

the set of the rest of the predictors for each i and checking for good fit. If the fit is good, then

including xi in the set of predictors xj, j ≠ i, may not be justified because it likely adds little

new information.
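A sketch of that diagnostic (not from the thesis): regress each centered predictor on the remaining ones and report the resulting R², which equals the squared cosine of the angle between xᵢ and its projection x̂ᵢ.

    import numpy as np

    def collinearity_check(X):
        # X: (n, p) array whose columns are centered predictors, p >= 2.
        # Returns, for each column, the R^2 from regressing it on the other columns;
        # a value near 1 flags a predictor close to the span of the others.
        n, p = X.shape
        r2 = np.empty(p)
        for i in range(p):
            xi = X[:, i]
            others = np.delete(X, i, axis=1)
            coef, *_ = np.linalg.lstsq(others, xi, rcond=None)
            fitted = others @ coef
            r2[i] = (fitted @ fitted) / (xi @ xi)
        return r2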

It is important to note that high pairwise correlation is indicative of near multicollinearity

but that multicollinearity is possible even if all of the variables are only moderately pairwise

correlated. As we will see, the MCAS variables we examine in this chapter are not related in

this way, but it is not hard to create a hypothetical data set that has high multicollinearity

but in which no pair of predictors are highly correlated. Let xc1 = (1, 0, −0.5, −0.5)T , xc2 =

(0, 1, −0.5, −0.5)T , and xc3 = (1, −1, 0.05, −0.05)T . Then all three pairwise correlations are

moderate: r1,2 = 0.333, r2,3 = −0.577, and r1,3 = 0.577. However, xc3 is very close to the span

of xc1 and xc2 . For example, the vector v = xc1 − xc2 = (1, −1, 0, 0)T is very close to xc3 ; their

correlation is almost 1: rv,xc3 = 0.99875. (See Figure 5.4.)
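These claims are easy to verify numerically (the code is not part of the thesis):

    import numpy as np

    xc1 = np.array([1.0, 0.0, -0.5, -0.5])
    xc2 = np.array([0.0, 1.0, -0.5, -0.5])
    xc3 = np.array([1.0, -1.0, 0.05, -0.05])

    def corr(a, b):
        # the vectors are already centered, so correlation is the cosine of the angle
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(corr(xc1, xc2), corr(xc2, xc3), corr(xc1, xc3))   # 0.333, -0.577, 0.577

    v = xc1 - xc2                                           # a vector in the span of xc1 and xc2
    print(corr(v, xc3))                                     # ~0.99875: xc3 nearly lies in that span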

Another geometric way to think about near multicollinearity is to consider the parallelepiped

defined by the set of normalized and centered predictor vectors. Let uᵢ = xcᵢ/‖xcᵢ‖; then the generalized volume of the parallelepiped defined by the set of vectors {uᵢ : 0 < i ≤ n} is given

(a) The vectors xc1, xc2, and xc3 are moderately pairwise correlated.   (b) The vector xc3 is very close to V_[xc1 xc2].

Figure 5.4: The vectors xc1 , xc2 , and xc3 are moderately pairwise correlated but nearly collinear.

by

    | u₁·u₁   u₁·u₂   ···   u₁·uₙ | ^(1/2)
    | u₂·u₁   u₂·u₂   ···   u₂·uₙ |
    |   ⋮       ⋮      ⋱      ⋮   |
    | uₙ·u₁   uₙ·u₂   ···   uₙ·uₙ |

that is, the square root of the determinant of the Gram matrix of the uᵢ.

If n = 1, then this is just the length of u1 (which is always 1) and when n = 2, the generalized

volume is simply the area of a parallelogram with vertices at the origin and u1 , u2 , u1 + u2

(see Figure 5.5). When u1 and u2 are orthogonal, the area of the parallelogram is 1. As these

vectors approach multicollinearity, the area approaches 0. The same relationship holds when we

extend to higher dimensions. If the predictors are orthogonal, then the generalized volume of

the associated parallelepiped is 1. A generalized volume that is close to 0 is evidence of near

multicollinearity.
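A sketch of this volume computation (not from the thesis), applied to the hypothetical nearly collinear vectors above:

    import numpy as np

    def generalized_volume(X):
        # Square root of the Gram determinant of the normalized, centered columns of X.
        # Values near 1 indicate nearly orthogonal predictors; values near 0 indicate
        # near multicollinearity.
        Xc = X - X.mean(axis=0)
        U = Xc / np.linalg.norm(Xc, axis=0)
        G = U.T @ U                      # Gram matrix of the u_i
        return np.sqrt(np.linalg.det(G))

    X = np.column_stack([[1, 0, -0.5, -0.5], [0, 1, -0.5, -0.5], [1, -1, 0.05, -0.05]])
    print(generalized_volume(X))         # ~0.05, close to 0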

If one suspects near multicollinearity in a set of p predictors, multiple regression can be

used to identify which vectors in the set are problematic. One merely uses regression p times


Figure 5.5: The generalized volume of the parallelepiped formed by the set of vectors {ui : 0 <
i ≤ n} is equivalent to length in one dimension, area in two dimensions, and volume in three
dimensions.

and in each regression analysis, the set {xi : i ≠ k} is used to predict xk, for each 1 ≤ k ≤ p. If the angle between xk and x̂k (which is in the span of {xi : i ≠ k}) is small, then we know that

xk likely adds little new information to the set of predictors that excludes xk . Considerations

about theoretical importance of variables should be used in choosing an appropriate subset of

the original vectors to be used for predicting the dependent variable.

The spending per pupil and the spending per ‘regular’ pupil (not including occupational,

bilingual, or special needs students) are two variables in the MCAS data set that provide a good

example of near multicollinearity. As we might expect, these variables are highly correlated

(r = 0.966) and comparing the models for the student-teacher ratio using each predictor alone

and the model with both predictors illustrates why multicollinearity is problematic.

We now treat the student-teacher ratio x1, previously an independent variable, as the dependent variable and denote it y′. Then let xreg indicate the observed spending per regular pupil in each district and xall the observed spending per pupil. The three models can then be expressed y′ = [1 xreg]b + e, y′ = [1 xall]b + e, and y′ = [1 xall xreg]b + e. The estimates for breg and ball in these models are summarized in

Table 5.3. All three models explain about a quarter of the variation in the student-teacher ratio

(R²) and the overall fit is significant (p-value < 0.05). However, there is not much increase in

explanatory power in the model with both predictors over the models with just one predictor,

especially over the model using spending per regular student. The changes in the regression

coefficients are also noteworthy. Both spending per pupil and spending per regular pupil on

their own contribute negatively to the student-teacher ratio. (This makes sense because low

Figure 5.6: The linear combination of xreg and xall equal to ŷ′ (the projection of y′ into V_[xall xreg]) must include a term with a positive sign.

student-teacher ratios cost more per pupil.) However, in the third model the magnitude of the

coefficient of pupil spending is halved and that of regular student spending is doubled. Even

more unexpected is the change in the sign for ball : in the third model this coefficient is positive,

implying that as spending per pupil increases the student-teacher ratio increases. This coefficient

is not statistically significant; it has a p-value greater than 0.05.

These results might seem paradoxical—Why would two good predictors of the student-

teacher ratio not be even better when used together? The geometry of individual space makes

the reason obvious. The vectors xall and xreg point in almost the same direction. Since the

Table 5.3: The effect of multicollinearity on the stability of regression coefficients.

  Model                         ball (p-value)      breg (p-value)      R² (p-value)
  y′ = [1 xall]b + e            -0.0011 (0.000)     n/a                 0.2289 (0.000)
  y′ = [1 xreg]b + e            n/a                 -0.0013 (0.000)     0.2643 (0.000)
  y′ = [1 xall xreg]b + e       0.0006 (0.241)      -0.0020 (0.001)     0.2656 (0.000)

correlation between these vectors is 0.966, the angle between them is only 15◦ . To move suffi-

ciently in the direction orthogonal to this (so as to reach the projection of y in the model space)

requires a multiple of one of the predictors that goes so far in the first direction that it must be

corrected with a predictor with the wrong sign (see Figure 5.6).

We saw that including predictors that are highly correlated with each other is counterpro-

ductive even when each is a good predictor of the criterion variable. It is perhaps paradoxical

that including predictors that are nearly orthogonal to the criterion variable (and hence very

poor predictors of criterion variables) can actually improve the prediction considerably. Such a

variable is called a suppressor variable and the reason that suppressor variables function as they

do is easily explained using the geometry of individual space.

Let y denote the observed percentages of students eligible for free or reduced lunch in each

school district and xpercap denote the observed average per capita income in each district. The

total spending per pupil xall has a very low correlation with the percentage of students eligible

for free or reduced lunch (r = 0.07) and explains only 0.04% of the variation. However, when it

is added to the model using xpercap to predict y, a much better prediction is achieved. The model y = [1 xpercap]b + e is contrasted with the model y = [1 xpercap xall]b + e in Table 5.4.

What is striking about this example is that xall on its own predicts essentially nothing of

the percentage of students eligible for free or reduced lunch. However, adding it to the model

that uses xpercap significantly improves the prediction. From the geometry, we can see that the

plane spanned by xpercap and xall is much closer to y than the line generated by xpercap alone.

Given such a plane, any vector in the plane together with the first vector provides an improved

Table 5.4: Suppressor variables increase the predictive power of a model although they themselves are uncorrelated with the criterion.

  Model                           bpercap (p-value)    ball (p-value)     R² (p-value)
  y = [1 xpercap]b + e            -1.4591 (0.000)      n/a                0.3135 (0.000)
  y = [1 xall]b + e               n/a                  0.0011 (0.298)     0.0004 (0.298)
  y = [1 xpercap xall]b + e       -1.8038 (0.000)      0.0053 (0.000)     0.4101 (0.000)

prediction of the criterion, even when the new vector happens to be orthogonal to the criterion

variable.
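A synthetic illustration of the suppressor effect (this example is not from the thesis data): x2 is nearly uncorrelated with y, yet adding it to a model that already contains x1 raises R² sharply, because x2 carries exactly the part of x1 that is unrelated to y.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    signal = rng.normal(size=n)
    noise = rng.normal(size=n)

    y = signal                      # criterion
    x1 = signal + noise             # useful but noisy predictor
    x2 = noise                      # suppressor: nearly uncorrelated with y, correlated with x1

    def r_squared(y, *predictors):
        X = np.column_stack([np.ones_like(y), *predictors])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        yc_hat = X @ b - y.mean()
        yc = y - y.mean()
        return (yc_hat @ yc_hat) / (yc @ yc)

    print(np.corrcoef(y, x2)[0, 1])      # near 0
    print(r_squared(y, x1))              # roughly 0.5
    print(r_squared(y, x1, x2))          # essentially 1, since signal = x1 - x2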

5.4 Partial and semi-partial correlation

Figure 5.7: The arcs in the vector diagram indicate angles for three kinds of correlation: the angle θp corresponds to the partial correlation between y and x2, conditioning for x1; the angle θs corresponds to the semi-partial correlation between y and x2 after controlling x2 for x1; and the angle θ_{yx1} corresponds to Pearson's correlation, r_{yx1}.

In regression analyses involving two or more predictors, it is often useful to examine the

relationship between a criterion variable and one of the predictors after taking into account the

relationship of these variables with the other predictors in the model. For example, suppose

the correlation between children’s height and intelligence is found to be quite high and one

is tempted by the dubious hypothesis that tall people are more intelligent. By using a third

variable, age, which is also known to be correlated with both height and intelligence, we would

like to examine the correlation between height and intelligence after taking into consideration

the age of the participants. This is a simple example of statistical control and is described

as conditioning for the effect of some variable(s). It is motivated by the idea of experimental

control in which randomness removes all of the differences between the treatment and control

groups except the variables being studied. The statistic that encodes the correlation between

two variables while conditioning for others is called the partial correlation coefficient.

In order for conditioning to be a valid procedure, we must check the implicit assumptions

about the causal relationship among the variables involved. In particular, by conditioning we

assume that the correlation between the conditioning variable(s) and each of the two variables

to be correlated is entirely due to a presumed causal relationship by which the conditioning

variable(s) affects each of the variables to be correlated. Returning to our example, we assume

that the correlations between age and height and between age and intelligence are entirely

due to the causal process of maturation; one expects that as children get older they also grow

taller and become more intelligent. We would likely find in this hypothetical example that after

accounting for the ages of children the remaining association between height and intelligence

would be quite small and most probably due to chance rather than any true relationship. In

this way, conditioning can be used to identify cases of so-called spurious correlation in which

two variables are highly correlated only because they are both correlated to a third variable,

often a common cause. However, care must be taken with analyses of partial correlation because

in cases where the assumption of causality is not warranted, the partial correlation coefficients

have little if any meaning.

The partial correlation between the variables x1 and x2 , conditioning for the variable x3 is

written rx1 x2 .x3 or simply r12.3 . Partial correlation is best understood using the geometry of

individual space. Given the predictors x1 and x2 , and the predictor x3 (whose effects we are

controlling for), partial correlation of x1 and x2 controlled for x3 is simply the correlation of

x1⊥3 = x1 − proj_{x3} x1 and x2⊥3 = x2 − proj_{x3} x2 . (Note that we extend this notation for centered

vectors in the following way: x1c⊥3 = x1c − projx3 x1c .)

Thus we have the following definition which depends on intuition from individual-space

geometry:
r12.3 = cos(θ_{x1c⊥3, x2c⊥3}) = (x1c⊥3 · x2c⊥3) / (‖x1c⊥3‖ ‖x2c⊥3‖),

that has the same relationship to geometry as correlation: the cosine of the angle between two

centered, normalized vectors. Partial correlation is more often defined

r12.3 = (r12 − r13 r23) / (√(1 − r13²) √(1 − r23²)).

It is straightforward to show that these definitions are equivalent. To simplify notation, in

the following computations we take all vectors as the corresponding centered vector. Using the

first definition, it follows that

r12.3 = [(x1 − proj_{x3} x1) · (x2 − proj_{x3} x2)] / (‖x1⊥3‖ ‖x2⊥3‖)

      = [x1·x2 − (x3·x1)(x2·x3)/‖x3‖² − (x3·x2)(x1·x3)/‖x3‖² + (x3·x1)(x3·x2)(x3·x3)/‖x3‖⁴] / (‖x1⊥3‖ ‖x2⊥3‖)

      = [x1·x2 − (x1·x3)(x2·x3)/‖x3‖²] / (‖x1⊥3‖ ‖x2⊥3‖)

      = (r12 − r13 r23) / (√(1 − r13²) √(1 − r23²)),

since we have xi·xj = ‖xi‖ ‖xj‖ rij from the definition of correlation and ‖xi⊥j‖² = (1 − rij²) ‖xi‖²
2 )kx k2
i

by the Pythagorean theorem. We conclude that the standard definition for partial correlation

can be derived from the definition inspired by geometric intuition of individual space.
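A small numerical check of this equivalence (the code is not part of the thesis):

    import numpy as np

    def partial_corr(x1, x2, x3):
        # Partial correlation r_{12.3} computed two equivalent ways.
        def center(v):
            return v - v.mean()
        def corr(a, b):
            a, b = center(a), center(b)
            return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

        # geometric definition: correlate the components orthogonal to x3
        x1c, x2c, x3c = center(x1), center(x2), center(x3)
        x1p = x1c - (x3c @ x1c / (x3c @ x3c)) * x3c
        x2p = x2c - (x3c @ x2c / (x3c @ x3c)) * x3c
        geometric = (x1p @ x2p) / (np.linalg.norm(x1p) * np.linalg.norm(x2p))

        # standard formula
        r12, r13, r23 = corr(x1, x2), corr(x1, x3), corr(x2, x3)
        formula = (r12 - r13 * r23) / np.sqrt((1 - r13**2) * (1 - r23**2))
        return geometric, formula

Both returned values agree to floating-point precision for any three variables without exact collinearity.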

The vector diagram in Figure 5.7 illustrates the geometric interpretation of correlation and

partial correlation in individual space. We also observe that correlation, the angle between two

unconditioned vectors, can be substantially different from partial correlation, the angle between

vectors conditioned on a common set of variables.

The third notable angle in Figure 5.7 is θs , which is the angle between y and the projection

of y on the component of x2 that is perpendicular to x1 . The corresponding correlation is

called the semipartial correlation and is used for measuring the unique contribution of x2 to the

prediction ŷ over and above that contribution of x1 .

Semipartial correlations play an important role in statistical decisions about whether or

not to include a predictor or set of predictors in a model. In general, if there are two sets

of predictors (say the column vectors of the matrices X1 and X2 ) for a criterion variable y,

then we can interpret the squared semipartial correlation of the first set (X1 ) and y as the

amount of variation in y that is explained by the subspace of the model space orthogonal to the

conditioning space, C(X2 ). Figure 5.7 shows the geometry when these spaces are each spanned

by a single vector.

The ideas here are similar to the orthogonal decomposition of the model space we saw in the

case of ANOVA analyses. However, it is rarely the case that C(X1 ), for example, is orthogonal

to the space C(X2 ), so instead we consider the orthogonal complement of the conditioning space

in the model space, C(X2)⊥ ∩ (C(X1) ⊕ C(X2)). With partial correlation, C(X2)⊥ is compared

to only the portion of y that is perpendicular to the conditioning set, whereas with semipartial

correlation, a comparison is made to the whole of y, including any portion within the span of

the conditioning set.

Because of the additivity provided by orthogonality of the subspaces of the model space,

it is clear that semipartial correlation corresponds to a decomposition of the variability of y.

Consider the model

y = [ X1 X2 ] · b + e.

Taking the dot product of each side of the equation with itself and applying the fact that the

error space and the model space are orthogonal, we obtain

|y|² = |[X1 X2]b|² + |e|²,

or

|y|² = |ŷ_{X1 X2}|² + |e|².    (5.1)

It can be written

y = [ X1⊥2 X2 ]b + e,

where X1⊥2 is a matrix such that C(X1⊥2) = C(X2)⊥ ∩ (C(X1) ⊕ C(X2)). Then we can write

y = X1⊥2 b1 + X2 b2 + e.

Taking the dot product of each side of the equation with itself (and invoking orthogonality) we

get that

|y|² = |X1⊥2 b1|² + |X2 b2|² + |e|²,

which can also be written

|y|² = |ŷ_{X1⊥2}|² + |ŷ_{X2}|² + |e|²,    (5.2)

where ŷX2 , for example, indicates the projection of y onto C(X2 ). Comparing equation 5.1 and

equation 5.2 shows that

|ŷ_{X1⊥2}|² = |ŷ_{X1 X2}|² − |ŷ_{X2}|².

It follows easily that the squared semipartial correlation can be written in terms of the squared

multiple correlation coefficient of the full regression model, y = [ X1⊥2 X2 ]b+e and the reduced

regression model, y = X2 b + e. We have

R²_{y.1(2)} = |ŷ_{X1⊥2}|² / |y|² = |ŷ_{X1 X2}|² / |y|² − |ŷ_{X2}|² / |y|² = R²_{y.12} − R²_{y.2}.

This equation explains why squared semipartial correlation is often interpreted as the importance

of a predictor or set of predictors; it is the increase in the explanatory power of the new model

over and above the conditioning model.
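Computationally, the squared semipartial correlation of a block of predictors is just the increase in R² when that block is added to the conditioning model; a sketch (not from the thesis):

    import numpy as np

    def r_squared(y, X):
        # R^2 for the model y = [1 X]b + e
        n = len(y)
        D = np.column_stack([np.ones(n), X])
        b, *_ = np.linalg.lstsq(D, y, rcond=None)
        yc_hat = D @ b - y.mean()
        yc = y - y.mean()
        return (yc_hat @ yc_hat) / (yc @ yc)

    def squared_semipartial(y, X1, X2):
        # Squared semipartial correlation of the block X1 with y, conditioning on X2:
        # the increase in R^2 over the model that contains only X2.
        return r_squared(y, np.column_stack([X1, X2])) - r_squared(y, X2)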

Chapter 6

Probability Distributions

The statistical techniques of the preceding chapters rely on the four probability distributions

discussed in this chapter. Probability distributions are families of functions indexed by parame-

ters. These parameters specify a distribution function when they are fixed to particular values.

As we have seen, it is common to assume a distribution family (in most examples we have

assumed that variables follow a normal distribution), and then use what is known about the

distribution to estimate the putative parameter(s), which fix the distribution function so that it agrees with the sample data. For example, we use the observation vector to estimate the mean

and standard deviation of the dependent variable in ANOVA and regression models.

We saw that Chebyshev’s inequality (see Section 1.3.4) is useful because it makes no as-

sumptions about the distribution of a random variable. However, if something is known a priori

about the distribution of a variable, then bounds for the probability of extreme events can often

be significantly improved over the estimates provided by Chebyshev’s inequality. Knowing (or

assuming) the distribution function of a random variable gives a great deal of information about

the expected values of the variable.

6.1 The normal distribution

All of the methods of analysis subsumed by the general linear model assume that the random

variables Yi and the random variables of error Ei have distributions that can be roughly approx-

imated by the normal distribution. The normal distribution is denoted N (µ, σ 2 ) because it is

completely determined by the parameters µ and σ 2 . If the random variable Y follows a normal

distribution with mean µY and variance σY2 , then we write Y ∼ N (µY , σY2 ), and the probability

density function for Y is given by

f_Y(y) = (1 / (√(2π) σ_Y)) exp( −(1/2) ((y − µ_Y)/σ_Y)² ).    (6.1)

The other distributions used in analyzing the general linear model (such as the F -distribution

used in hypothesis testing) can be derived from the normal distribution. A more detailed

description of the relationships among the distributions discussed in this section can be found

in many standard statistical texts (e.g., Casella & Berger, 2002; Searle, 1971).

[Figure 6.1 is a plot titled “The Normal Distribution” showing probability density curves for s = 0.8, 1.0, and 2.0 over the range −3 to 3; an annotation marks P(−1.96 < x < 1.96) = 0.95.]

Figure 6.1: The normal distribution with three different standard deviations.

The normal distribution is often used in statistical analyses for a number of reasons. The

best motivation, perhaps, is provided by the Central Limit Theorem, which states that the

distribution of the sample mean Y approaches a normal distribution as the size of the sample

increases, no matter what the distribution of the random variable Y . This is very useful because

it allows one to estimate the distribution of the sample mean although very little is known about

the distribution of the underlying random variables. The normal distribution is useful as the

limit of other probability distributions such as the binomial distribution and the Student’s t-

distribution. The binomial probability distribution gives the probability of obtaining 0 ≤ k ≤ n

successful outcomes in n trials when the probability of a success on a single trial is 0 ≤ p ≤ 1.

The mean of this distribution is np and the variance is np(1 − p). When both np and n(1 − p) are

sufficiently large (in general, it is recommended that both be at least 5), the normal distribution

with µ = np and σ 2 = np(1 − p) provides a very good continuous approximation for the discrete

binomial distribution.

Knowing that a variable follows the normal probability distribution function provides a much

stronger result than Chebyshev’s inequality (see Section 1.3.4). The key point is that the infor-

mation about the distribution of Y allows us to make much better approximations concerning the

probability of extreme observations of Y. To be explicit, an observation of Y falls within 1 standard deviation of the mean with an approximate probability of 68% because ∫_{µ−σ}^{µ+σ} f_Y(y) dy ≈ 0.68, and within 2 standard deviations of the mean with an approximate probability of 95% because ∫_{µ−2σ}^{µ+2σ} f_Y(y) dy ≈ 0.95 (see Figure 6.1). (If the goal is to find bounds containing 95% of the area,
then ±1.96 standard deviations provide more accuracy.) Thus, the probability that an obser-

vation is more extreme than 2 standard deviations from the mean is only 5%, a significantly

tighter bound than that of 25% provided by Chebyshev’s inequality (which is the best bound

available for an unknown distribution).
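The following R calculation (R is the language used for the DataVectors program described in Chapter 7) verifies these probabilities directly; nothing beyond the standard normal distribution is assumed.

```r
# Probability that a normally distributed observation falls within 1, 2, and
# 1.96 standard deviations of its mean (the standard normal suffices because
# these probabilities do not depend on the particular mean or variance).
pnorm(1) - pnorm(-1)        # about 0.68
pnorm(2) - pnorm(-2)        # about 0.95
pnorm(1.96) - pnorm(-1.96)  # about 0.950

# Probability of an observation more extreme than 2 standard deviations:
2 * pnorm(-2)               # about 0.046, versus the Chebyshev bound of 0.25
```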

6.2 The χ2 distribution

The χ2 distribution is not used directly in the statistical analyses discussed in this text; however,

it is an important distribution for many other kinds of statistical analyses because it can often

be used to find the probability of the observed deviation of data from expectation. The χ2

[Figure: χ² density curves for k = 5, 13, and 23, with the region P(x < 22.362) = 0.95 marked.]
Figure 6.2: The χ² distribution with three different degrees of freedom.

distribution is denoted $\chi^2_k$, where $k$ is the only parameter of the distribution, a positive integer called the degrees of freedom. If a random variable $V$ follows a $\chi^2$ distribution with $k$ degrees of freedom, then we write $V \sim \chi^2_k$, and the probability density function for $V$ is given by
$$f_V(v) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, v^{k/2-1} e^{-v/2}, \qquad (6.2)$$
where $\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt$. You may notice that the mean of the $\chi^2$ distribution with $k$ degrees of freedom is $k$ (see Figure 6.2).

The primary reason for mentioning the χ2 distribution is that the squared length of a random

vector Y in Rk (appropriately scaled and with 0 expectation for each coordinate) will follow a

χ2 distribution with k degrees of freedom. The proof follows from two facts:

Lemma 6.2.1 (for proofs, see Casella & Berger, 2002).
• If $Y \sim N(0, 1)$ is a random variable, then $Y^2 \sim \chi^2_1$.
• If $W_1, \dots, W_n$ are independent random variables and $W_i \sim \chi^2_{k_i}$ for all $i$, then $\sum_i W_i \sim \chi^2_{\sum_i k_i}$.

Because $\mathbf{Y}$ is a random vector in $\mathbb{R}^k$, we can write $\mathbf{Y} = (Y_1, \dots, Y_k)^T$. Suppose that the coordinate variables $Y_i$ are independent and that each $Y_i \sim N(0, \sigma^2)$. It follows that if we scale each coordinate variable by $1/\sigma$, denoting the result $Y_i'$, then we have $Y_i' = \frac{1}{\sigma} Y_i \sim N(0, 1)$. Applying the first part of Lemma 6.2.1, we conclude $(Y_i')^2 \sim \chi^2_1$. Applying the second part of Lemma 6.2.1, we obtain $\sum_{i=1}^{k} (Y_i')^2 \sim \chi^2_k$. Moreover,
$$\sum_{i=1}^{k} (Y_i')^2 = \frac{1}{\sigma^2} \sum_{i=1}^{k} Y_i^2 = \frac{1}{\sigma^2}\,\mathbf{Y} \cdot \mathbf{Y} = \frac{1}{\sigma^2}\,\|\mathbf{Y}\|^2.$$

This proves

Theorem 6.2.2. Let $\mathbf{Y}$ be a random vector in $\mathbb{R}^k$ with independent coordinate variables each distributed $N(0, \sigma^2)$. Then $\frac{1}{\sigma^2}\,\|\mathbf{Y}\|^2 \sim \chi^2_k$.

Theorem 6.2.2 shows that the degrees-of-freedom parameter corresponds to the dimension

of the space containing the vector Y. As we will see, this theorem justifies the claim that the

F -ratio (see equation 2.15) follows an F -distribution.
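A small simulation offers a quick numerical check of Theorem 6.2.2; the sketch below uses arbitrary choices of $k$ and $\sigma$ and is not part of the DataVectors program.

```r
set.seed(1)
k <- 5          # dimension of the random vector (arbitrary)
sigma <- 3      # common standard deviation of the coordinates (arbitrary)
n_rep <- 10000  # number of simulated vectors

# Each row of Y is a random vector in R^k with independent N(0, sigma^2) coordinates.
Y <- matrix(rnorm(n_rep * k, mean = 0, sd = sigma), ncol = k)
scaled_sq_length <- rowSums(Y^2) / sigma^2

mean(scaled_sq_length)              # close to k, the chi-square mean
quantile(scaled_sq_length, 0.95)    # close to the chi-square 95th percentile
qchisq(0.95, df = k)
```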

6.3 The F -distribution

Snedecor’s F -distribution has a central role in testing hypotheses related to the general linear

model, and it is named after the statistician R. A. Fisher. The F-distribution has two parameters, often denoted $p$ and $q$ and called the numerator and denominator degrees of freedom, and is denoted $F(p, q)$. Both parameters affect the shape of the distribution, but in general, as the larger of the two degrees-of-freedom parameters increases, the distribution becomes more concentrated around 1 (see Figure 6.3). If a random variable $W$ follows an F-distribution with

$p$ and $q$ degrees of freedom, we write $W \sim F(p, q)$, and the probability density function for $W$ is:
$$f_W(w) = \frac{\Gamma\!\left(\frac{p+q}{2}\right)}{\Gamma\!\left(\frac{p}{2}\right)\Gamma\!\left(\frac{q}{2}\right)\, w}\,\sqrt{\frac{(pw)^{p}\, q^{q}}{(pw+q)^{p+q}}}\,. \qquad (6.3)$$

Variables that follow an F -distribution have a close relationship to variables that follow the χ2

distribution. Whenever independent variables $U$ and $V$ follow $\chi^2$ distributions with $p$ and $q$ degrees of freedom, respectively, the (adjusted) ratio of these variables follows an F-distribution with $p$ and $q$ degrees of freedom. That is, if $U \sim \chi^2_p$ and $V \sim \chi^2_q$ are independent, then
$$F = \frac{U}{V} \cdot \frac{q}{p} = \frac{U/p}{V/q} \sim F(p, q). \qquad (6.4)$$

This equation provides a more general description of the F-ratio (see equation 2.15). Notice that the adjustment factor $q/p$ corrects for the relative degrees of freedom in the two variables that follow $\chi^2$ distributions. If $U$ and $V$ are the squared lengths of random vectors in $\mathbb{R}^p$ and $\mathbb{R}^q$, respectively, then we can understand the adjustment factor geometrically as a correction

for the dimensions of these vectors. In this way, the F -ratio can be understood as a ratio of

per-dimension squared lengths.
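The following R sketch illustrates equation (6.4) numerically; the degrees of freedom are arbitrary illustrative choices.

```r
set.seed(2)
p <- 4   # numerator degrees of freedom (arbitrary)
q <- 20  # denominator degrees of freedom (arbitrary)

U <- rchisq(10000, df = p)   # U ~ chi-square with p df
V <- rchisq(10000, df = q)   # V ~ chi-square with q df, generated independently of U

F_ratio <- (U / p) / (V / q) # ratio of per-degree-of-freedom squared lengths

quantile(F_ratio, c(0.50, 0.95))     # simulated quantiles ...
qf(c(0.50, 0.95), df1 = p, df2 = q)  # ... track the F(p, q) quantiles
```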

In Section 2.3, we saw that the vector ŷ is in the (p + 1)-dimensional subspace of Rn that is

spanned by the p + 1 independent column vectors of the design matrix X. In a similar way, the

vector e is in the (n − p − 1)-dimensional space which is the orthogonal complement of C(X).

The discussion above uses Theorem 6.2.2 to link the more general statement of the F-ratio provided

in equation (6.4) with the F -ratio we will use for testing hypotheses about the general linear

model.

[Figure: F density curves for (p, q) = (27, 500), (11, 130), and (3, 7), with the region P(x < 1.863) = 0.95 marked.]
Figure 6.3: The F-distribution centers around 1 as the maximum degrees-of-freedom parameter becomes large.

6.4 Student’s t-distribution

We briefly consider one other distribution that is often used in analyses based on the general

linear model. This distribution is called Student's t-distribution after the pen

name of William Gosset, a statistician employed by the Guinness Brewery early in the twentieth

century. The t-distribution is often applied to hypothesis tests involving small samples. In

the context of the general linear model, the t-distribution is helpful for obtaining confidence

intervals for the parameter estimates in the vector b. Confidence intervals are an alternative to

hypothesis testing, and provide a range of likely values for parameter estimates.

The t-distribution has one parameter k, which is the number of degrees of freedom, and

[Figure: t density curves for k = 15, 3, and 1, with a central 95% region marked.]
Figure 6.4: The t-distribution approaches the normal distribution as the degrees-of-freedom parameter increases.

the distribution is denoted t(k). If U is a random variable that follows a t-distribution with k

degrees of freedom, then we write $U \sim t(k)$ and the probability density function for $U$ is:
$$f_U(u) = \frac{\Gamma\!\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\,\Gamma\!\left(\frac{k}{2}\right)} \left(1 + \frac{u^2}{k}\right)^{-(k+1)/2}. \qquad (6.5)$$

As k increases, the t-distribution approaches the normal distribution with a mean of 0 and

variance of 1 (see Figure 6.4). In addition, there is a close relationship between the t-distribution

and the F -distribution. This relationship is straightforward: a squared t statistic follows an F -

distribution with 1 degree of freedom in the numerator. That is, if U ∼ t(k) then U 2 ∼ F (1, k).
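This identity can be checked numerically; in the R sketch below, the degrees-of-freedom value is an arbitrary choice.

```r
k <- 12  # arbitrary degrees of freedom

# The squared 97.5% critical value of t(k) equals the 95% critical value of F(1, k).
qt(0.975, df = k)^2
qf(0.950, df1 = 1, df2 = k)

# Equivalently, squared draws from t(k) behave like draws from F(1, k).
set.seed(3)
mean(rt(10000, df = k)^2 <= qf(0.95, df1 = 1, df2 = k))  # roughly 0.95
```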

This relationship has an intriguing geometric implication for the t-distribution. We can think of $U^2$ as a ratio of the per-dimension squared lengths of two vectors. Since the numerator has 1 degree of freedom, the vector in the numerator, $\mathbf{V}_n$, lives in a 1-dimensional subspace. The vector in the denominator of the ratio, $\mathbf{V}_d$, lives in a $k$-dimensional subspace. We can choose an orthonormal basis $\{\mathbf{u}_i\}$ for the combined $(k+1)$-dimensional space so that $\mathbf{V}_n = c\,\mathbf{u}_{k+1}$ and $\mathbf{V}_d = \sum_{i=1}^{k} c_i \mathbf{u}_i$. The variable $U^2$ can then be expressed
$$U^2 = \frac{\|c\|^2}{\sum_{i=1}^{k} \|c_i\|^2 / k},$$
and it follows that $U$ can be expressed
$$U = \frac{\|c\|}{\sqrt{\sum_{i=1}^{k} \|c_i\|^2 / k}} = \frac{\|\mathbf{V}_n\|}{\|\mathbf{V}_d\|/\sqrt{k}}. \qquad (6.6)$$
We saw this expression of a variable that follows the t-distribution in Section 1.4.

In summary, we can say that the sum of squares of independent standard normal variables follows the $\chi^2$ distribution. (Geometrically, this is the squared length of a vector.) In addition, a degrees-of-freedom-adjusted ratio of independent variables that follow the $\chi^2$ distribution itself follows an F-distribution. Finally, the square of a variable that follows the t-distribution itself follows an F-distribution.

Chapter 7
Manipulating Space: Homogeneous Coordinates, Perspective Projections, and the Graphics Pipeline

This chapter describes the mathematical basis of the DataVectors program. Points in R3 can be

represented using homogeneous coordinates which facilitate affine translations of these points via

matrix multiplication. Perspective projections of R3 to an arbitrary plane can also be realized

via matrix multiplication directly from Euclidean or homogeneous coordinates. However, it is

more convenient for computer systems producing computer-generated perspective drawings of

3-dimensional objects to instead transform the viewing frustum in R3 to an appropriately scaled

parallelepiped in perspective space, retaining some information about relative depth.

In order to illustrate and explore more concretely the ideas discussed in this manuscript and

to generate precise figures from data, I wrote the DataVectors program in the R language. This

programming language has built-in support for statistical analysis (including matrix arithmetic

routines) and celebrated graphical capabilities. Because it is open source and free, the language

is used widely by academics and research scientists for developing new statistical techniques.

The DataVectors program accepts up to three data vectors of any length and displays a

3-dimensional model of the (centered) model space that can be rotated in any direction using a

mouse. This chapter describes the mathematical basis of the program, including homogeneous

coordinates, perspective projections, and the graphics pipeline.

7.1 Homogeneous coordinates for points in Rn

The set of $n \times n$ matrices is denoted $M_n$. Linear transformations of $\mathbb{R}^n$ can be realized as left-multiplication by elements of $M_n$. The invertible elements of $M_n$ form a group under matrix multiplication called the general linear group, which is denoted $GL_n$. Affine transformations of $\mathbb{R}^n$ cannot, in general, be represented by matrix multiplication since $\mathbf{0}$ is fixed by matrix multiplication. However, the homogenization of inhomogeneous equations by introducing another variable

with the constant term as its coefficient provides a solution. The group GLn+1 includes every

linear transformation that fixes one dimension (i.e., those matrices that correspond to linear

transformations of Rn ). Moreover, the group GLn+1 contains matrices corresponding to shear

transformations of Rn+1 that can be used to achieve affine transformations of Rn viewed as Rn+1

under an appropriate equivalence class. We begin with a definition.

Definition 1 (Homogeneous Coordinates). Given $x = (x_1, \dots, x_n) \in \mathbb{R}^n$, the vector $X = (X_1, \dots, X_n, X_{n+1}) \in \mathbb{R}^{n+1}$ contains homogeneous coordinates for $x$ if $x_i = X_i / X_{n+1}$ for all $i$.

Whenever Xn+1 = 0, the coordinates represent a point at infinity but this case is not needed

for the present discussion. When $X_{n+1} \neq 0$, the definition establishes an equivalence relation: $X \equiv Y$ if and only if there exists a non-zero $c$ in $\mathbb{R}$ such that $Y = c(X_1, \dots, X_{n+1})$. We take

(X1 , . . . , Xn , 1) as the canonical representative of the equivalence class of X.

A shear transformation of $\mathbb{R}^{n+1}$ is a linear transformation that fixes a subspace $W \subset \mathbb{R}^{n+1}$, translating each point along a vector $\mathbf{a}$ parallel to $W$ and proportionally to the distance between the point and $W$. To obtain a translation of $\mathbb{R}^n$, pick $W$ to be the hyperplane $(x_{n+1} = 0)$ and $\mathbf{a}$ to be the desired vector of translation in $\mathbb{R}^n$. Then the hyperplane $x_{n+1} = 1$ is the image of the natural embedding of $\mathbb{R}^n$ in $\mathbb{R}^{n+1}$, where each point is mapped to its canonical homogeneous coordinates. This hyperplane is effectively translated by the vector $\mathbf{a}$ by means of the shear transformation of $\mathbb{R}^{n+1}$. Translated coordinates in $\mathbb{R}^n$ can be recovered by means of the inverse embedding restricted to the range of the embedding. In other words, if $\mathbf{v} \in \mathbb{R}^n$ and $\mathbf{V} = (\mathbf{v}, 1)^T$ is the corresponding homogeneous embedding, we can recover the translated vector $\mathbf{v}'$ from the translated homogeneous vector $\mathbf{V}' \in \mathbb{R}^{n+1}$ by stripping off the last coordinate of $\mathbf{V}'$. The matrix representation of this translation

is
$${}_h T_{\mathbf{a}} = \begin{pmatrix} I_n & \mathbf{a} \\ 0 \;\cdots\; 0 & 1 \end{pmatrix}, \qquad {}_h T_{\mathbf{a}} \in M_{n+1}, \qquad (7.1)$$

where $I_n$ is the identity in $GL_n$. The subscript $h$ denotes transformations of homogeneous

coordinates and distinguishes these from transformations of Rn described in the rest of this

chapter.
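As an illustration (a minimal sketch, not code taken from the DataVectors program), the following R function builds the homogeneous translation matrix ${}_h T_{\mathbf{a}}$ of equation (7.1) for $\mathbb{R}^3$ and applies it to a point; the vectors involved are arbitrary.

```r
# Homogeneous translation matrix hT_a for R^n: an n x n identity block, the
# translation vector a in the last column, and (0, ..., 0, 1) as the last row.
make_translation <- function(a) {
  n <- length(a)
  Th <- diag(n + 1)
  Th[1:n, n + 1] <- a
  Th
}

a <- c(2, -1, 5)                   # desired translation in R^3 (arbitrary)
v <- c(1, 1, 1)                    # a point in R^3 (arbitrary)
V <- c(v, 1)                       # canonical homogeneous coordinates

V_shifted <- make_translation(a) %*% V
V_shifted[1:3]                     # equals v + a; the last coordinate is still 1
```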

The extension of the linear transformations of rotations and dilations from Euclidean coor-

dinates to homogeneous coordinates is straightforward. Although these results hold in greater

generality, we restrict the discussion to transformations of homogeneous coordinates for points

in R3 . One common approach to rotations in R3 is the yaw-pitch-roll system. An arbitrary ro-

tation is described (non-uniquely) as a succession of rotations by the angles γ, φ, and θ around

the z-axis, y-axis, and x-axis, respectively. A rotation is denoted by the matrix Rθφγ , and the

rotation of a vector v in a right-handed coordinate system is obtained by left-multiplying by the

rotation matrix: vrot = Rθφγ v. We have:

$$R_{\theta\phi\gamma} = R_x(\theta)\, R_y(\phi)\, R_z(\gamma), \quad \text{where}$$
$$R_x(\theta) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{pmatrix}, \quad
R_y(\phi) = \begin{pmatrix} \cos\phi & 0 & \sin\phi \\ 0 & 1 & 0 \\ -\sin\phi & 0 & \cos\phi \end{pmatrix}, \quad
R_z(\gamma) = \begin{pmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

Rotation matrices can be easily realized in R4 by adjoining an extra row and column. Thus,

$R_{\theta\phi\gamma}$ corresponds to the following matrix in $GL(4)$:
$${}_h R_{\theta\phi\gamma} = \begin{pmatrix} R_{\theta\phi\gamma} & \mathbf{0} \\ \mathbf{0}^{T} & 1 \end{pmatrix} \qquad (7.2)$$
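A sketch of these rotation matrices and their homogeneous counterpart in R follows; the helper functions and angles are illustrative and are not drawn from the DataVectors source.

```r
# Elementary rotation matrices about the x-, y-, and z-axes.
Rx <- function(theta) rbind(c(1, 0, 0),
                            c(0, cos(theta), -sin(theta)),
                            c(0, sin(theta),  cos(theta)))
Ry <- function(phi)   rbind(c( cos(phi), 0, sin(phi)),
                            c(0, 1, 0),
                            c(-sin(phi), 0, cos(phi)))
Rz <- function(gamma) rbind(c(cos(gamma), -sin(gamma), 0),
                            c(sin(gamma),  cos(gamma), 0),
                            c(0, 0, 1))

# Yaw-pitch-roll rotation and its homogeneous counterpart in GL(4).
R3 <- Rx(pi / 6) %*% Ry(pi / 4) %*% Rz(pi / 3)   # arbitrary angles
R4 <- diag(4)
R4[1:3, 1:3] <- R3                               # adjoin the extra row and column

R4 %*% c(1, 2, 3, 1)   # rotates the point (1, 2, 3) in homogeneous coordinates
```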

The scaling transformation, where each vector coordinate is transformed to some multiple

of itself, is similarly handled. Let $S_{k_x k_y k_z}$ denote the scaling transformation that multiplies the $x$, $y$, and $z$ coordinates by the constants $k_x$, $k_y$, and $k_z$, respectively. We have $S_{k_x k_y k_z} = S_{k_x} S_{k_y} S_{k_z}$, where
$$S_{k_x} = \begin{pmatrix} k_x & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
S_{k_y} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & k_y & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
S_{k_z} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & k_z \end{pmatrix}.$$

Moreover, $S_{k_x k_y k_z}$ corresponds to the following matrix in $GL(4)$:
$${}_h S_{k_x k_y k_z} = \begin{pmatrix} k_x & 0 & 0 & 0 \\ 0 & k_y & 0 & 0 \\ 0 & 0 & k_z & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \qquad (7.3)$$

By left-multiplying the homogeneous coordinates of points in $\mathbb{R}^3$ by a product of translation, rotation, and scaling matrices, we can achieve all the transformations of interest for representing perspective drawings of 3-dimensional objects.
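Continuing the illustrative sketches above, the composite of a translation, a rotation, and a scaling can be applied with a single matrix multiplication; all constants below are arbitrary.

```r
# Composite transformation in homogeneous coordinates (constants are arbitrary).
S4 <- diag(c(2, 2, 0.5, 1))          # hS: scale x and y by 2, z by 0.5
T4 <- diag(4)
T4[1:3, 4] <- c(0, 0, -10)           # hT: translate by (0, 0, -10)

theta <- pi / 4                      # rotate by 45 degrees about the z-axis
R4 <- diag(4)
R4[1:3, 1:3] <- rbind(c(cos(theta), -sin(theta), 0),
                      c(sin(theta),  cos(theta), 0),
                      c(0, 0, 1))

A <- S4 %*% R4 %*% T4                # translate, then rotate, then scale
A %*% c(1, 2, 3, 1)                  # apply all three with a single multiplication
```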

7.2 The perspective projection

The major challenge in modeling vector space is creating representations of vectors in 3-dimensional

space on a 2-dimensional computer screen. This is made possible by the perspective projection,

defined with a view plane and a viewpoint v = (v1 , v2 , v3 ) ∈ R3 not in the view plane. The

possibly affine view plane is defined by n · x = c or equivalently as N · X = 0 where X is the

homogeneous coordinates for x ∈ R3 and N = (n1 , n2 , n3 , −c). The perspective projection sends

a point in space x to the intersection of the view plane with the line passing through v and x.

Theorem 7.2.1. Given the viewpoint $V$ and the (possibly affine) view plane with normal $N$, the transformation matrix corresponding to the perspective projection is given by $P_{V,N} = V N^T - (N \cdot V) I_4$. (This theorem and its proof follow Marsh, 1999.)

Proof. Let $X$ denote the homogeneous coordinates of a point $x \in \mathbb{R}^3$ and let $k_1 V + k_2 X$ denote the image of $X$ under the perspective projection $P_{V,N}$, for constants $k_1, k_2 \in \mathbb{R}$; requiring the image to lie in the view plane forces $k_1 (N \cdot V) + k_2 (N \cdot X) = 0$. If $N \cdot X = 0$, then
$$(V N^T - (N \cdot V) I_4)\,X = (N \cdot X)\,V - (N \cdot V)\,I_4 X = -(N \cdot V)\,X.$$
Therefore, $(V N^T - (N \cdot V) I_4)\,X$ is a multiple of $X$, which is precisely what we would expect given the equivalence relation on points expressed with homogeneous coordinates, and we conclude that $P_{V,N} = V N^T - (N \cdot V) I_4$ in this case. On the other hand, whenever $N \cdot X \neq 0$ (and $k_1 \neq 0$), we have $k_2 = -k_1 (N \cdot V)/(N \cdot X)$. Because homogeneous coordinates are defined only up to a nonzero scalar, we may choose $k_1 = N \cdot X$, and by substitution obtain
$$P_{V,N}\,X = k_1 V - \bigl(k_1 (N \cdot V)/(N \cdot X)\bigr) X = (N \cdot X)\,V - (N \cdot V)\,X = (V N^T - (N \cdot V) I_4)\,X,$$
as required.
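The projection matrix of Theorem 7.2.1 is easy to construct and test numerically. In the R sketch below, the viewpoint, view plane, and test point are arbitrary; the final line checks that the (normalized) image satisfies the view-plane equation $N \cdot X = 0$.

```r
# Perspective projection matrix P = V N^T - (N . V) I_4 from Theorem 7.2.1.
perspective_matrix <- function(V, N) V %*% t(N) - c(crossprod(N, V)) * diag(4)

V <- c(0, 0, 5, 1)    # viewpoint on the z-axis at height k = 5 (homogeneous)
N <- c(0, 0, 1, 0)    # view plane z = 0, written as N . X = 0

P <- perspective_matrix(V, N)
X <- c(1, 2, 3, 1)    # an arbitrary point in homogeneous coordinates

Y <- P %*% X          # image, defined only up to the homogeneous equivalence
Y / Y[4]              # canonical form: x and y scaled by k/(k - z) = 5/2, z = 0
sum(N * (Y / Y[4]))   # equals 0, so the image lies on the view plane
```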

To apply this to the problem of computer graphics, we make the simplifying assumptions

that the view plane is the xy-plane (i.e., n = (0, 0, 1, 0)T ) and that the view point is on the z-axis

(i.e., v = (0, 0, k, 1)T ). Under these assumptions, the matrix for the perspective transformation

is:
$$P_{v,n} = \begin{pmatrix} -k & 0 & 0 & 0 \\ 0 & -k & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & -k \end{pmatrix} \qquad (7.4)$$

Checking, we have:
$$P_{v,n} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} -kx \\ -ky \\ 0 \\ z-k \end{pmatrix} \equiv \begin{pmatrix} kx/(k-z) \\ ky/(k-z) \\ 0 \\ 1 \end{pmatrix} \qquad (7.5)$$

As we expect, the x- and y-coordinates of the image under this projection are scaled by the factor k/(k − z), the ratio of the distance from the viewpoint to the view plane to the distance of the pre-image from the viewpoint along the z-axis (see Figure 7.1).

[Figure: the viewpoint v = (0, 0, k), a point p = (x, y, z), and its image p′ on the view plane, with similar triangles giving the image y-coordinate yk/(k − z).]
Figure 7.1: The x- and y-coordinates of the perspective projection are proportional to k/(k − z).

7.3 The graphics pipeline

Although it is clear that the perspective projection from a view point on the z-axis to the

xy-plane is quite simple, it is perhaps not yet clear why it is always possible to make these

simplifying assumptions. As long as the line of sight is orthogonal to the view plane, a solution

follows from the transformations developed in the first section. If the view point is not on the

z-axis, we left multiply everything by the appropriate rotation matrix to ensure that the view

point is on the z-axis. Let $\mathbf{v}'$ be a view point not on the z-axis and $R_v$ a rotation taking $\mathbf{v}'$ to $\mathbf{v} = (0, 0, \|\mathbf{v}'\|)^T$. It follows that the matrix product $P_{v,n} R_v$ takes any point $\mathbf{p}$ to some $\mathbf{p}'$ on the xy-plane. To recover the image of $P_{v',n}$ we simply compute $R_v^{-1} \mathbf{p}'$, where $R_v^{-1}$ is the inverse

rotation operator. The graphics pipeline refers to a sequence of mathematical operations such

as this that transform a point in R3 into a pixel on the computer monitor. All transformations

are represented using matrix multiplication and the entire pipeline can be conceived as the

composition of all the matrix transformations.

The perspective transformation is problematic because it is not of full rank and therefore

singular. Information regarding the distance of a point to the center of projection is lost and

cannot be recovered from the information that remains in the image. We will find that it is

useful to decompose the perspective transformation into translation and dilation transformations

followed by an orthogonal projection. We refrain from projecting from 3 dimensions into 2

dimensions until the last step of the pipeline just before the pixel is displayed. In the penultimate

step, points have been distorted to achieve perspective but still retain information about relative

depth. This space is called perspective space. The advantage is that the transformation of

Euclidean space into perspective space has an inverse so the depth information used to resolve

issues such as object collisions can be recovered. In addition, the depth information aids in

drawing realistic effects such as simulated fog, in which the transparency (a color aspect) of a

point is proportional to its distance from the center of projection.

Ideally, the x- and y-coordinates in perspective space would have the same values as the x and y screen coordinates. This allows the final projection from perspective space to the screen coordinates to be an orthogonal projection in the z direction, so that obtaining the screen coordinates requires no further calculations. After rotating the view point to the z-axis

and translating it to the origin, we associate the truncated pyramid called the viewing frustum

(see Figure 7.2) with the region [−bx , bx ] × [−by , by ] × [−1, 1]. The viewing frustum is defined

by the near and far clipping planes, n = (0, 0, n, 0)T and f = (0, 0, f, 0)T , and the dimensions

of the visible view plane. For a centered, square screen, bx = w/2 and by = h/2, and so these

dimensions are [−w/2, w/2] × [−h/2, h/2], where w and h are the screen width and height in

screen coordinates. This association is achieved via the perspective space transformation:
$$S_{v,n} = \begin{pmatrix} \frac{2n}{w} & 0 & 0 & 0 \\ 0 & \frac{2n}{h} & 0 & 0 \\ 0 & 0 & \frac{-(f+n)}{f-n} & \frac{-2fn}{f-n} \\ 0 & 0 & -1 & 0 \end{pmatrix} \qquad (7.6)$$
The perspective space coordinates of $P = (x, y, z, 1)^T$ are given by
$$S_{v,n} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} \frac{2n}{w}\,x \\ \frac{2n}{h}\,y \\ \frac{-(f+n)}{f-n}\,z - \frac{2fn}{f-n} \\ -z \end{pmatrix} \equiv \begin{pmatrix} \frac{2n}{-zw}\,x \\ \frac{2n}{-zh}\,y \\ \frac{f+n}{f-n} + \frac{2fn}{z(f-n)} \\ 1 \end{pmatrix}$$

By using the 4th row of Sv,n to record the z-coordinate, dividing by this coordinate to

obtain the canonical homogeneous coordinates for the point applies the correct perspective

scaling to the x- and y-coordinates. It follows that Sv,n can be decomposed further as the

product of dilation transformations (in the x- and y-coordinates) that associate the original

x and y coordinates in the viewing plane with the desired screen coordinates and an even

simpler perspective transformation. The 3rd coordinate of p0 is not suppressed to 0 as under the
(f +n)
perspective projection, but retains depth information via the invertible function z 0 = f −n +
2f n
z(f −n) .

It is straightforward to verify that this formula maps the viewing frustum to the appropriate

parallelepiped in perspective space. Once mapped to perspective space, the screen coordinates

can be read off the first two coordinates of p0 . If needed, an inverse function can be used to

determine the original z-coordinate of any point in perspective space.
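The mapping of the clipping planes can be spot-checked numerically. The R sketch below uses arbitrary screen dimensions and clipping distances and assumes, as in the setup above, that the scene lies along the negative z-axis after the viewpoint has been moved to the origin.

```r
# Perspective-space transformation S_{v,n} for screen dimensions w x h and
# near/far clipping distances n and f (all values arbitrary).
w <- 800; h <- 600; n <- 1; f <- 100

S <- rbind(c(2 * n / w, 0,         0,                  0),
           c(0,         2 * n / h, 0,                  0),
           c(0,         0,         -(f + n) / (f - n), -2 * f * n / (f - n)),
           c(0,         0,         -1,                 0))

to_canonical <- function(X) X / X[4]   # divide by the last homogeneous coordinate

# With the scene along the negative z-axis, the near plane z = -n maps to
# z' = -1 and the far plane z = -f maps to z' = +1.
to_canonical(S %*% c(3, 2, -n, 1))[3]  # -1
to_canonical(S %*% c(3, 2, -f, 1))[3]  #  1
```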

In sum, beginning with an arbitrary viewpoint v and an orthogonal view plane with normal

v, the graphics pipeline (1) rotates space around the origin so that v lies on the z-axis (say,

via Rθφγ ). Then, (2) space is translated along the z-axis so that v lies at the origin, and the

image of the origin is (0, 0, −kvk), (via T ). Next, (3) the space is dilated in order to identify

the viewing plane with the computer screen (via D) and (4) the viewing frustum is transformed

into a parallelepiped in perspective space (via S). Finally (5), the screen pixel can be drawn

[Figure: the viewing frustum, with near-plane corners at (±bx(n/f), ±by(n/f), n) and far-plane corners at (±bx, ±by, f), and its image with corners at (±w/2, ±h/2, ±1).]
Figure 7.2: The perspective space transformation takes the viewing frustum to the parallelepiped [−w/2, w/2] × [−h/2, h/2] × [−1, 1] in perspective space.

using the x- and y-coordinates of each point in perspective space, via an orthogonal projection onto the xy-plane in perspective space:
$$M = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$

Then the whole pipeline taking $\mathbf{p}$ in the original space to $\mathbf{p}'$ on the screen can be written as a product of matrices:
$$\mathbf{p}' = M\,S\,D\,T\,R\,\mathbf{p}.$$
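A toy version of this pipeline in R might look like the following; every matrix here is an illustrative stand-in rather than the actual DataVectors implementation, and the homogeneous division is performed just before the final orthogonal projection.

```r
# A toy version of the pipeline p' = M S D T R p (illustrative stand-ins only).
theta <- pi / 7
Rm <- diag(4)
Rm[1:3, 1:3] <- rbind(c(cos(theta), -sin(theta), 0),
                      c(sin(theta),  cos(theta), 0),
                      c(0, 0, 1))                 # (1) rotate about the z-axis
Tm <- diag(4); Tm[3, 4] <- -5                     # (2) translate along the z-axis
Dm <- diag(c(100, 100, 1, 1))                     # (3) dilate to screen units

w <- 800; h <- 600; n <- 1; f <- 100
Sm <- rbind(c(2*n/w, 0,     0,            0),
            c(0,     2*n/h, 0,            0),
            c(0,     0,     -(f+n)/(f-n), -2*f*n/(f-n)),
            c(0,     0,     -1,           0))     # (4) map the frustum to perspective space
Mm <- diag(c(1, 1, 0, 0))                         # (5) orthogonal projection onto x and y

p <- c(1, 2, -20, 1)                              # a point in homogeneous coordinates
q <- Sm %*% Dm %*% Tm %*% Rm %*% p                # perspective-space image of p
q <- q / q[4]                                     # canonical homogeneous coordinates
Mm %*% q                                          # screen coordinates in the first two entries
```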

References

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27, 17–21.

Bryant, P. (1984). Geometry, statistics, probability: Variations on a common theme. The American Statistician, 38, 38–48.

Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). South Melbourne, Australia: Thompson Learning.

Christensen, R. (1996). Plane Answers to Complex Questions. New York, NY: Springer-Verlag.

Cook, T. D., & Campbell, D. T. (1979). Quasi-Experimentation: Design and Analysis for Field Settings. Chicago, IL: Rand McNally.

Faraway, J. J. (2004). Linear Models with R. Boca Raton, FL: Chapman & Hall.

Herr, D. G. (1980). On the history of the use of geometry in the general linear model. The American Statistician, 34, 43–47.

Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied Linear Statistical Models. New York, NY: McGraw-Hill.

Marsh, D. (1999). Applied Geometry for Computer Graphics and CAD. New York, NY: Springer-Verlag.

Pedhazur, E. J. (1997). Multiple Regression in Behavioral Research: Explanation and Prediction (3rd ed.). South Melbourne, Australia: Thompson Learning.

Pitman, J. (1992). Probability. New York, NY: Springer-Verlag.

Rogers, J. L., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42, 59–66.

Saville, D. J., & Wood, G. R. (1986). A method for teaching statistics using n-dimensional geometry. The American Statistician, 40, 205–214.

Saville, D. J., & Wood, G. R. (1991). Statistical Methods: The Geometric Approach. New York, NY: Springer-Verlag.

Searle, S. R. (1971). Linear Models. New York, NY: Wiley.

Shifrin, T., & Adams, M. R. (2002). Linear Algebra: A Geometric Approach. New York, NY: Freeman.

Shoemake, K. (1992). ARCBALL: A user interface for specifying three-dimensional orientation using a mouse. Paper presented at the annual proceedings of Graphics Interface, Vancouver, Canada.

Shoemake, K. (no date). Quaternions. Retrieved January 11, 2011, from www.cs.caltech.edu/courses/cs171/quatut.pdf

Wickens, T. D. (1995). The Geometry of Multivariate Statistics. Mahwah, NJ: Erlbaum.
