Jacobson Erik D 201108 Ma
Jacobson Erik D 201108 Ma
by
Erik D. Jacobson
(Under the direction of Malcolm R. Adams, Edward Azoff, and Theodore Shifrin.)
Abstract
Two complementary geometric interpretations of data are used to discuss topics from elementary
statistics including random variables, vectors of random variables, expectation, mean, variance,
and the normal, F , and t probability distributions. The geometry of the general linear model and
of the analysis of variance, simple regression, and multiple regression using examples. Geometry
as multiple, partial, and semi-partial correlation. The last chapter describes the mathematical
used to generate the representations of data vectors for several figures in this text.
by
Erik D. Jacobson
of the
Master of Arts
Athens, Georgia
2011
c
2011
Erik D. Jacobson
by
Erik D. Jacobson
Approved:
Maureen Grasso
Dean of the Graduate School
The University of Georgia
August 2011
Acknowledgments
I recognize the insight, ideas, and encouragement offered by my thesis committee: Theodore
Shifrin, Malcolm R. Adams, and Edward Azoff and by Jonathan Templin. All four gamely dug
into novel material, entertained ideas from other disciplines, and showed admirable forbearance
as deadlines slipped past and the project expanded. I cannot thank them enough.
ii
Contents
Acknowledgments ii
List of Figures v
Introduction 1
iii
4.2 Correlation analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6 Probability Distributions 90
References 109
iv
List of Figures
1.3 Vector diagram representations of variable vectors that span two and three di-
1.4 The centered vector yc is the difference of the observation vector y and the mean
vector y1, and the subspace spanned by centered vectors is orthogonal to the
mean vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
unit for many variables; in this figure, the origin of each axis is shifted to the
1.6 (a) The vector y is plotted in individual space; one must decide if (b) the vector
space or instead (c) a sample from of a distribution centered away from the origin
1.7 The vector y can be understood as the sum of y1 and a vector e that is orthogonal
to 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.8 The distribution of the random variable Y has different centers and relies on a
different estimates for the standard deviation under (a) the null hypothesis and
v
1.9 The t-ratio of ky 1k to kek under the t-distribution provides the same probability
information about the likelihood of the observation under the null hypothesis as
2.1 The vector ŷ, the projection of y onto V = C(X), is seen to be the unique vector
in V that is closest to y. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 The vector ŷ, the projection of the vector y into C([1 x]), is equal to b0 1 + b1 x
4.2 A scatterplot for the simple regression example showing the residuals, the differ-
ence between the observed and predicted values for each individual. . . . . . . . 63
4.3 Least-squares estimate and residuals for the transformed and untransformed data. 64
4.4 (a) Panels of scatter plots give an idealized image of correlation, but in practice,
(b) plots with the same correlation can vary quite widely. . . . . . . . . . . . . . 66
4.5 The vector diagram illustrates that rxc yc = cos(θxc yc ) and that rxc yc0 = − cos(π −
θxc yc0 ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1 The data are illustrated with a 3D scatter plot that also shows the regression
plane and the error component for the prediction of district mean total fourth
5.2 The geometric relationships among the vectors yc , xc1 , and xc2 . . . . . . . . . . . 73
5.3 The vector diagrams of VXc1 and VXc3 suggest why the value of the coefficient b2
5.4 The vectors xc1 , xc2 , and xc3 are moderately pairwise correlated but nearly
collinear. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5 The generalized volume of the parallelepiped formed by the set of vectors {ui :
5.6 The linear combination of xreg and xall equal to ŷ0 (the projection of y0 into
vi
5.7 The arcs in the vector diagram indicate angles for three kinds of correlation
controlling for x2 , and the angle θyx1 corresponds to Pearson’s correlation, ryx1 . 85
parameter increases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.1 The x- and y-coordinates of the perspective projection are proportional to k/(k −z).105
7.2 The perspective space transformation takes the viewing frustum to the paral-
vii
List of Tables
1.1 A generic data set with one dependent variable, m − 1 independent variables, and
3.2 Data for a 2-factor experiment recording observed gain-scores for tutoring and
lecture treatments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1 Simulated score data for 4 tutors employed by the tutoring company. . . . . . . . 59
5.1 Sample data for Massachusetts school districts in the 1997-1998 school year.
Census. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 The value of the regression coefficient for per capita income and corresponding
5.4 Suppressor variables increase the predictive power of a model although they them-
viii
Introduction
Many of the most popular and useful techniques for developing statistical models are subsumed
by the general linear model. This thesis presents the general linear model and the accompanying
strategy of hypothesis testing from a primarily geometric point of view, detailing both the
standard view of data as points in a space defined by the variables (data-as-points) and the less
also develop the relevant statistical ideas geometrically and use geometric arguments to relate the
general linear model to analysis of variance (ANOVA) models, and to correlation and regression
models.
My approach to this material is original, although the central ideas are not new. The standard
treatment of this material is predominantly algebraic with a minimal (in the case of regression) or
nonexistent (in the case of ANOVA) discussion of the data-as-points geometry and no mention of
the data-as-vectors approach. In addition, these models are often taught separately in different
courses and never related. Only a very few texts present the data-as-vectors approach (although
historically this was the geometry emphasized by early statisticians), and all most all of these
texts are written for graduate students in statistics and presume sophisticated understandings
of many statistical and mathematical concepts. Another major limitation of these texts is the
quality of the drawings, which are schematic and not based on examples of real data. By contrast,
statistical models and to explain how they are in fact closely related to one another. I am able to
use precise drawings of examples with real data because of the DataVectors computer program
[email protected]). I am not aware of any other program that can generate these kinds of
1
representations. Although it is unlikely they would be useful representations for reports of
original research, the drawings produced by the program (and the interactivity of the program
itself) have potential as powerful pedagogical tools for those learning to think about linear
Wickens’ (1995) text The Geometry of Multivariate Statistics and David J. Saville’s and Gra-
ham R. Wood’s (1991) Statistical Methods: The Geometric Approach introduced me to the
data-as-vectors geometry. Other statistical texts that were particularly helpful include Ronald
Christensen’s (1996) Plane Answers to Complex Questions; S. R. Searle’s (1971) Linear Mod-
els; Michael H. Kutner and colleagues’ (2005) Applied Linear Statistical Models; and George
Casella’s and Roger L. Berger’s (2002) text titled Statistical Inference. Using R to work out
examples was made possible by Julian J. Faraway’s (2004) Linear Models with R. It is worth
mentioning that Elazar J. Pedhazur’s (1997) Multiple Regression in Behavioral Research: Ex-
planation and Prediction led me to initially ask the questions that this manuscript addresses
projections and homogeneous coordinates in Theodore Shifrin and Malcolm R. Adams’ (2002)
Linear Algebra: A Geometric Approach and Ken Shoemake’s landmark 1992 talk on the arcball
control at the Graphics Interface conference in Vancouver, Canada were invaluable for writing
the DataVectors program. In the work that follows, all of the discussion, examples, and figures
are my own. Except where noted, the proofs are my own work or my adaptation of standard
proofs that are used without citation in two or more standard references.
The statistical foundations of the subsequent chapters are developed in Chapter 1. The
topics addressed include random variables, vectors of random variables, expectation, mean,
variance, and the normal, F -, and t- probability distributions. Two complimentary geometric
the geometry of the general linear hypothesis testing by examining a simple experiment. In
Chapter 3, we turn to several examples of analyses from the ANOVA framework and illustrate
the geometric meaning of dummy variables and contrasts. The relationships between these
2
Chapter 4 describes simple regression and correlation analysis using two geometric interpre-
tations of data. Variable transformations provide a way to make simple regression models more
general and enable the development of models for non-linear data. In Chapter 5, we discuss
multiple regression from a geometric point of view. The geometric perspectives developed in
the previous chapters afford a rich discussion of orthogonality, multicollinearity, and suppressor
variables, as well as multiple, partial, and semi-partial correlation. Chapter 6 takes us through
a tour of four probability distributions that are referenced in the text and complements the
The last chapter describes the mathematical basis of the DataVectors program used to gen-
erate some of the figures in this text. Points in R3 can be represented using homogeneous
coordinates which facilitate affine translations of these points via matrix multiplication. Per-
spective projections of R3 to an arbitrary plane can also be realized via matrix multiplication
3
Chapter 1
variables, the observed features of individuals in a population. Usually, data from an entire
population are not available and instead these relationships must be inferred from a sample, a
subset of the population. We denote the size of the sample by n. Variables are categorized as
independent or dependent; the independent variables are used to predict or explain the depen-
dent variables. Much of the work of applied statistics is finding appropriate models that relate
independent and dependent variables and making disciplined decisions about the hypothesized
parameters of these models. After discussing some foundational ideas of statistics from two
different geometric perspectives (including random variables, expectation, mean, and variance),
statistical models and hypothesis testing are introduced by means of an illustrative example.
Variables
Dependent Independent
Individuals Var1 Var2 Var3 ··· Varm
Individual1 Obs1,1 Obs1,2 Obs1,3 · · · Obs1,m
Individual2 Obs2,1 Obs2,2 Obs2,3 · · · Obs2,m
.. .. .. .. .. ..
. . . . . .
Individualn Obsn,1 Obsn,2 Obsn,3 · · · Obsn,m
Table 1.1: A generic data set with one dependent variable, m − 1 independent variables, and n
observations for each variable.
4
1.1 Two geometric interpretations of data
One canonical representation of a data set is a table of observations (see Table 1.1). The rows of
the table correspond to individuals and the columns of the table correspond to the variables of
interest. Each entry in the table is a single observation (a real number) of a particular variable
for a particular individual. This representation can be taken as an n × m matrix over R where
n is the number of individuals and m is the number of variables comprising the data set. Then
the columns and rows of the data matrix can be treated as vectors in Rm and Rn , respectively.
40
0
10 20 30 40 50 10 20 30 40
MA District Per Capita Income ($K) MA District Per Capita Income ($K)
Figure 1.1: Strip plots (a one-dimensional scatterplot) can illustrate observations of a single
variable, but histograms convey the variable’s distribution more clearly.
Data are usually interpreted geometrically by considering each row vector as a point in
Euclidean space with coordinate axes that correspond to the variables in the data set. Thus,
each individual can be located in Rm with the ith coordinate given by that individual’s ith
variable observation. This space is called variable space. When only one variable is involved, the
space is one dimensional and a strip chart (and more commonly the histogram which more clearly
conveys the variable’s distribution) illustrates this representation (see Figure 1.1). Scatterplots
can illustrate data sets with up to three variables (see Figure 1.2).
There is a second geometric interpretation. The column vectors of the data matrix can
be understood as vectors in Rn and used to generate useful vector diagrams (see Figure 1.3).
Vector diagrams represent (subspaces) of individual space, and offer a complementary geometric
5
MA School Districts (1998−1999)
●
● ● ●
● ●
● ● ●
●
● ●
● ●● ●
780
● ● ● ● ● ● ●
●
● ● ● ●● ●● ●
●
● ●
●● ● ● ●
● ● ●● ●
● ● ● ●
● ● ● ●● ● ● ●
● ● ● ● ●● ● ● ● ●
●
District Mean 4th Grade Total Score
● ● ● ●
● ● ●● ● ●
● ● ● ● ● ●● ● ● ● ●
● ● ●● ●● ● ●● ● ●● ● ●
Teacher−Student Ratio
● ●
760
● ● ● ●● ●
● ● ●● ●● ● ●
● ● ●● ● ●
● ●● ●● ● ● ● ●
●● ● ● ● ●
● ● ● ●
● ● ●● ● ●● ● ●
●
● ● ● ● ●
● ● ●● ● ●●●● ● ● ●
● ●
● ●● ●● ● ●
● ●
●
740
● ● ● ●
● ● ● ●● ● ● ●
●
●● ●●●
● ● ●
● ●
● ● ● ●
●
● 25
720
●
20
700
15
680
660
10
640
5
5 10 15 20 25 30 35 40
Per Capita Income
(thousands)
interpretation of the data. We use boldface letters to denote the vectors, and it is customary for y
to denote the dependent variable(s) (called Var1 in Table 1.1), and for xi where i ranges between
1 and m − 1, to denote the independent variables (called Var2 , Var3 , etc. in Table 1.1). The
reader likely notices one immediate hurdle we face when interpreting the data matrix as vectors in
Rn —the dimension of the individual space is equal to the number of individuals, and, for almost
every useful data set, this far exceeds that which we visualize, let alone reasonably illustrate
3-dimensional subspaces of individual space. In cases where the number of variables (and hence
the dimension of smallest-dimensioned subspace of interest) is greater than three, planes and
lines must represent higher-dimensional subspaces and vector diagrams become schematic.
6
y
y
x
x1
x2
Figure 1.3: Vector diagram representations of variable vectors that span two and three dimen-
sional subspaces of individual space (Rn ).
In statistics, a random experiment is any process with an outcome that cannot be predicted
before the experiment occurs. The set of outcomes for a random experiment is called the sample
space and denoted Ω. The subsets of the sample space are called events and this collection
is denoted S. (Technical considerations may prohibit the inclusion of all subsets of Ω; the
defined on S, such that P (A) ≥ 0, for all A ∈ S and P (S) = 1. In addition, P satisfies the
then !
[ X
P An = P (An ).
n∈N n∈N
A real random variable, often denoted with capital Roman letter, is a real-valued function
defined on Ω, e.g., Y : Ω → R. Each random variable is associated with its cumulative dis-
{ω ∈ Ω : Y (ω) ≤ y}. The cumulative distribution function allows the computation of the prob-
ability of any set in S. The function Y is a continuous random variable if there is a function
Ry
fY : R → R satisfying FY (y) = fY (x)dx; fY is called the probability density function. In an
−∞
analogous way, a discrete random variable has an associated density function mY : R → R that
7
One of the most important concepts in statistics is that of the expected value E(Y ) of a ran-
dom variable. In the case where Y is discrete, the expected value can be defined as a weighted
average, E(Y ) = ymY (y). For example, the expected value of the random variable that
P
y∈R
y
assigns each die roll event to its face-value is E(Y ) = 6 = 6 . If Y is continuous, then
21
P
y∈{1,2,...,6}
R∞
we define E(Y ) = yfY (y)dy.
−∞
Theorem 1.2.1. For any random variables X and Y , any a in R, and any function g : R → R,
• E(aY ) = aE(Y )
R∞
• E(g ◦ Y ) = g(y)fY (y)dy
−∞
linear combination of the respective expected values, but for our purposes we only need the
Corollary 1.2.2. For any random variable Y, any a in R, and any functions g1 and g2
• E(ag1 ◦ Y ) = aE(g1 ◦ Y ).
and PY . The joint probability of the event (A and B), where A ⊂ ΩX and B ⊂ ΩY , is defined
PX ×PY (A and B) = PX (A)PY (B), and the variables have a joint probability distribution defined
FXY (x, y) = P1 × P2 (X ≤ x and Y ≤ y). The expected value of a joint probability distribution
is related to the expected value of the component variables. Two random variables are said to
be independent if FXY (A and B) = PX (A)PY (B). Whenever two random variables, X and Y ,
E(XY ) = E(X)E(Y )
8
Note that independent is used with several different meanings in this text. Independent variables
in models are those used to predict or explain the dependent variable, but we will see that our
methods require us to assume that the sample of n observations of the dependent variable are the
realized values of the independent random variables Y1 , . . . , Yn . We also will discuss independent
hypothesis tests, a usage which follows from the probabilistic definition just provided for random
variables. If two hypothesis tests are independent, then the outcome of one test does not affect
Expectation is also used to define the covariance of two random variables. The covariance
probability model and the real-world situation it describes. They are often denoted by Greek
letters, and many are defined using expectation. For example, the mean, µY , of a random
variable Y is a parameter for the normal distribution (see Section 6.1) and is defined as the
expected value of Y :
µY = E(Y ). (1.1)
Later sections of this chapter provide more discussion of this definition and more examples of
parameters.
Problems in inferential statistics often require one to estimate parameters using information
from a sample data vector, a finite set of values that the random variable Y takes during an
experiment with n trials. The sample is often denoted by a bolded, lowercase Roman letter. This
n
notation is used because a sample can be understood as an n-tuple or a vector in VY ⊂ Rn .
Q
i=1
For example, using the definition of Y in the previous paragraph (the face-value of a die roll),
the experiment in which we roll a die 4 times might yield the sample y = (1, 3, 5, 2)T ∈ R4 . Each
value yi is a realized value of the random variable Y . The sample data vector is geometrically
9
understood as a vector in individual space.
identically distributed random variables Yi . The sample y is the realized value of a vector of
random variables, called a random variable vector and denoted with boldface, capital Roman
letter: Y = (Y1 , Y2 , · · · , Yn )T . Because the roll of a die does not change the likelihood of later
rolls, we say the 4 consecutive rolls are independent. In out example, we might perform the
same experiment just as easily by rolling 4 identical dice at the same time and recording the
4 resulting face-values. The second view of samples, as the realization of random variables, is
more common and will be used from now on in this document. Note that the same notation (a
boldface, capital Roman letter) is used for matrices in standard presentations of linear algebra
and in this text. In the following chapters, the context will indicate whether such a symbol
Functions of sample data are called statistics, and one important class of statistics are es-
timates of parameters called estimators. They are conceptually distinct from the (possibly un-
known) parameters and are usually denoted by roman letters. For example, the sample mean,
an estimator for the parameter µY , is often symbolized y. Another common notation for esti-
mators, including vectors of estimators, is the hat notation. In the next chapter, for example,
In the previous section, we used expectation to define the mean of a random variable (see
equation (??) and claimed that samples can be used to estimate these parameters. In this
section we define the statistics y and s2 , provide geometric descriptions of both, and show they
are estimators for the mean (µ) and the variance (σ 2 ), respectively. We begin with the useful
10
1.3.1 Centered vectors
1·v
vc = (vc1 , vc2 , · · · , vcn ) = v −
T
1, (1.2)
n
The sample mean, y, gives an indication of the central tendency of the sample y, and is defined
as the average value obtained by summing the observations and dividing by n, the number of
observations (see equation 1.3). It is usually denoted by placing a bar over the lowercase letter
denoting the vector, but it is not bolded because it is a scalar. One can gain intuition about
the sample mean by imagining the observations as uniform weights glued to a weightless ruler
according to their value. In this sense, the mean can be understood as the center of mass of the
sample distribution. In variable space, the sample mean can be represented as a point on the
n
1X
y= yi (1.3)
n
i=1
Since the sample mean is a scalar, it has no direct vector representation in individual space.
However, the mean vector is a useful individual-space object and is written y1. The definition
for a centered vector (see equation 1.2) can be written more parsimoniously as the difference of
two vectors, yc = y − y1. This relationship can also be illustrated geometrically with vector
A few facts suggested by Figure 1.4 are worth demonstrating in generality. First, the mean
vector y1 can be understood as (and obtained by) the orthogonal projection of y on the line in
Pn
y·1 i=1 yi
Proj1 y = 1= 1 = y1. (1.4)
1·1 n
11
y1
y1 y x1
x
1
yc
xc
yc
Figure 1.4: The centered vector yc is the difference of the observation vector y and the mean
vector y1, and the subspace spanned by centered vectors is orthogonal to the mean vectors.
It follows that a centered vector and the corresponding mean vector are orthogonal.
In the last section, we called estimators those statistics that can be used to estimate pa-
rameters. A special class of estimators are called unbiased because the expected value of the
estimator is equal to the parameter. For example, we can write E(Y ) = µY if Y is an unbiased
estimator of µY . To make sense of the statement E(Y ) we must think of Y as a linear combina-
tion of the n random variables associated with the sample y, and consider y to be a realization
n
1X
Y = Yi , (1.5)
n
i=1
where Yi is the random variable realized by the ith observation in the sample y. The definition
combinations of other random variables. We will see in later sections that, given distributional
information about the component random variables Yi , we can estimate the mean and variance
12
forward:
n
!
1X
E(Y ) = E Yi
n
i=1
n
1X
= E(Yi )
n
i=1
n
1 X
= µ=µ
n
i=1
Notice that this proof relies on the assumption that the random variables Yi are identically
distributed (in particular, they must all have the same mean, µ).
The variance of a random variable Y , σY2 , indicates its variability and helps answer the question:
How much do the observed values of the variable vary from one to another? Returning to the
physical model for the values in a distribution imagined as uniform weights glued to a massless
ruler according to their value, the variance can be understood as the rotational inertia of the
variable’s distribution about the mean. The greater the variance, the greater the force would be
required to change the rotation rate of the distribution by some fixed amount. A random variable
with low variance has realized values that cluster tightly around the mean of the variable.
The variance is defined to be the expected value of the squared deviation of a variable from
the mean:
If the variable’s values are known for the entire population (of size n), then y = µ and the
variance can be computed as the mean squared deviation from the mean:
n
1X
s2n = (yi − y)2 . (1.7)
n
i=1
In most analyses, however, only a sample of observations are available, and the formula (1.7) sys-
tematically underestimates the true population variance: n1 (yi − µ)2 < σ 2 . This phenomenon
P
13
An unbiased estimator, s2 , of the variance is obtained using n − 1 instead of n in the
denominator:
n n
1 X 1 X 2 kyc k2
s2 = (yi − y)2 = yci = . (1.8)
n−1 n−1 n−1
i=1 i=1
The proof that E(s2n ) 6= σ 2 = E(s2 ) relies on the independence of the observations in the sample,
one of the characteristics of a random sample and a common assumption made of empirical
samples by researchers who perform statistical analyses. It is important to notice the similarity
between equation 1.8 and the numerator and denominator of the F -ratio (see equation 2.15). In
fact, we shall see that the estimate of the sample variance s2 can be understood geometrically as
the per-dimension length of the centered vector yc . This will be more fully explained in Section
2.1.3, but the key idea is that yc lives in the orthogonal complement of 1, and this space is
(n − 1)-dimensional. The information in one dimension of the observation vector can be used to
center the vector (and estimate the mean µ), and the information in each of the remaining n − 1
dimensions provides an estimate of the variance σ 2 . The best estimate is therefore the mean of
In a manner analogous to y, the sample variance s2 can also be understood as the realization
of the random variable S 2 , which is defined as a linear combination of other random variables:
n
1 X
S2 = (Yi − Y )2 . (1.9)
n−1
i=1
In Chapter 2, we prove that S 2 is an unbiased estimator after a prerequisite result about the vari-
ance of random variables formed from linear combinations of random variables. Next, we briefly
consider another statistic related to the sample variance that is frequently used in statistical
analyses.
Sample variance is denoted s2 because it is equal to the square of the sample standard deviation,
s. Both the variance and the standard deviation address the variability of a random variable.
However, the standard deviation is more easily interpreted than variance because the units
for variance are the squared units of the variable. By taking the square root, the variance is
14
transformed to the metric of the variable and can be more easily compared with values in a data
set. The standard deviation is also useful for estimating the probability that the variable will
The most general result of this kind is Chebyshev’s inequality, which states that
1
P (|Y − µ| ≥ kσ) ≤ , (1.10)
k2
regardless of the distribution of Y . For sufficiently large samples, Y can be used to estimate µ
and s can be used to estimate σ. For example, suppose that for some sample of 30 observations,
Y = 1 and s = 2. Then the probability that the next observation of Y deviates from the mean
1 1
by more than 4 = 2s is at most 2 = or 25%.
2 4
The proof of this inequality follows easily from a more general statement called Markov’s
E(Ỹ )
P (Ỹ ≥ a) ≤ (1.11)
a
Z
E(Y ) = yfY (y)dy
ZVY Z
= yfY (y)dy + yfY (y)dy
Y <a Y ≥a
Z
≥ yfY (y)dy
Y ≥a
Z
≥ a fY (y)dy
Y ≥a
= a P (Y ≥ a)
15
and taking a to be k 2 σ 2 . Since |Y − µ| ≥ kσ if and only if Ỹ ≥ k 2 σ 2 , we have
Chebyshev’s inequality illustrates the utility of standard deviations as a tool for describing
the distribution of variables in variable space (Rm ). The standard deviation can be understood
as the length of an interval on the axis of a variable that follows a normal distribution (see Section
without
6.1) and calculating and appropriately
used to helps represent labeling(Figure
the distribution the variable axes.
1.5). Nevertheless, one limitation of
variable space representations of data is that standard deviations are hard to see in a scatterplot
3
2
150
Y in units of sY
Frequency
1
100
−1 0
50
−3
0
−3 −1 0 1 2 3 −3 −1 0 1 2 3
X in units of sX X in units of sX
Figure 1.5: In variable space, the standard deviation s is a convenient, distribution-related unit
Figure
for many1.5: In variable
variables; space,
in this thethe
figure, standard
origin ofdeviation
each axissisisshifted
a convenient, distribution-related
to the mean of the associatedunit
for many
variable. variables; in this figure, the origin of each axis is shifted to the mean of the associated
variable.
Standard deviations also have a useful interpretation in vector drawings of individual space
Standard deviations also have a useful interpretation in vector drawings of individual space
(R nn). Unlike scatterplots in variable space, vector drawings in individual space already represent
(R ). Unlike scatterplots in variable space, vector drawings in individual space already represent
the standard
the standarddeviation
deviationbybynature
natureofoftheir
theirconstruction.
construction. From
From equation
equation (1.8),
(1.8), it isitclear
is clear
thatthat
the the
variance of a variable can be expressed using the dot product of the centered variable with itself.
variance of a variable can be expressed using the dot product of√the centered variable with itself.
It follows from the definition of the Euclidean norm (!v! = √ v · v) that the standard deviation
It follows from the definition of the Euclidean norm (kvk = v · v) that the standard deviation
of a variable is proportional to the length of the centered vector.
yc · yc 16 1
s2 = ⇐⇒ s = !yc ! √
n−1 n−1
Since the constant of proportionality √1
n−1
depends only on the dimension of individual space,
of a variable is proportional to the length of the centered vector.
y c · yc 1
s2 = ⇐⇒ s = kyc k √
n−1 n−1
all centered variable vectors are scaled proportionally. Assuming comparable units, this means
the ratio of the length of two centered vectors in individual space is equal to the ratio of the
standard deviations of these variables. For example, in the left panel of Figure 1.3, the vector
y has a smaller standard deviation than the vector x because it is shorter. Histograms of these
two variables would show the yi bunched more tightly around their mean.
We conclude this chapter with an example to introduce statistical models and hypothesis test-
ing.1 The geometry of individual space is essential here; variable space does not afford a simple
or compelling treatment of these ideas and their standard algebraic treatment at the elementary
level masks rather than illuminates meaning. In fact, the presentation here is closer to the
original formulation of the ideas by R.A. Fisher in the early twentieth century (Herr, 1980).
Suppose a tutor would like to discover if her tutoring improves her students’ scores on a
standardized test. She picks two of her students at random and for each calculates the difference
in test score before and after a one month period of tutoring: g1 = 7, g2 = 9. Then she plots
these gain-scores in individual space as the vector y = (g1 , g2 ). (See Figure 1.6a)
To proceed, we must make a few reasonable assumptions about the situation. First, we
assume that the students’ scores are independent random variables. Among other things, this
means that the gain-score of either student does not affect the gain-score of the other. Next,
we assume that the gain-score for both students follows the same distribution. In particular, we
want to assume that if we could somehow go back in time and repeat the month of tutoring,
then the average gain-score over many repetitions would be the same for each student. Neither
student is predisposed to a greater benefit from tutoring than the other. This assumption lets
1
This example is inspired by something similar in Saville & Wood (1991). I have changed the context and
values, used my own figures, and considerably developed the discussion.
17
y y y
9
Figure 1.6: (a) The vector y is plotted in individual space; one must decide if (b) the vector y
is more likely a sample from a distribution centered at the origin of individual space or instead
(c) a sample from of a distribution centered away from the origin on the line spanned by (1,1).
us postulate a true mean gain-score µ for the population of students, and our goal is to use the
data we have to estimate it. Finally, we make the assumption that the common distribution of
gain-scores is normal with a mean of 0. This implies that the signed length of the vector follows
a normal distribution with a mean of 0 (where the sign of the length is given by the sign of
s
the mean gain-score) and standard deviation √ = s, where s is the standard deviation of
2−1
gain-scores. Moreover, all directions are equally likely because of the assumed independence.
The tutor’s question has two possible answers: The tutoring makes little difference in stu-
dents’ gain-scores or there is some effect. In the first case, we would expect many repetitions of
her procedure to look like Figure 1.6b, and in the second case, many repetitions might look like
Figure 1.6c, with the center of the distribution displaced from the origin. In both figures, the
We call the first possibility the null hypothesis and write H0 : µ = 0, where µ is the mean
gain-score resulting from tutoring. The second possibility is called the alternative hypothesis
H1 : µ 6= 0. Certainly after plotting 100 repetitions of the tutor’s procedure it would likely
be easy to decide which hypothesis was most plausible; the challenge is to pick the most likely
18
hypotheses based only on a single trial.
The center of the distribution for the vector y must lie along the line spanned by the vector
1 = (1, 1). Geometrically, we can understand the vector y as the sum of the vector y1 and a
second vector orthogonal to the first which can be written e = y − y1, where y is the length of
y
9
y
y1
7 y
Figure 1.7: The vector y can be understood as the sum of y1 and a vector e that is orthogonal
to 1.
The idea for testing the null hypothesis is to compare the lengths of y1 and e using the ratio
sgn (y)ky1k
t= .
kek
To make sense of the hypothesis test geometrically, consider both parts of Figure 1.8. In both,
the shaded region indicates the cone in individual space where the t-ratio is large. In Figure 1.8a,
the vector y gives an estimate of the variance of the distribution of gain-scores under the null
hypothesis, and the corresponding standard deviation is indicated by the dashed circle centered
at the origin of individual space. In Figure 1.8b, it is instead the vector e that gives an estimate
of the variance of the distribution of gain-scores relative to y1, and the corresponding standard
deviation of this distribution is indicated by the radius of the dashed circle centered at (y, y).
If the t-ratio is large, then the vector y is ‘close’ in some sense to the line spanned by 1.
19
y y
y1 y1
y
0 y
Figure 1.8: The distribution of the random variable Y has different centers and relies on a dif-
ferent estimates for the standard deviation under (a) the null hypothesis and (b) the alternative
hypothesis.
In this case, we can see geometrically why the null hypothesis is unlikely. If y usually lands
anywhere within the dashed circle in Figure 1.8a, then it is rare y will land in the shaded cone.
The observation vector y is unusual under the null hypothesis and thus we can then reject the
null hypothesis in favor of the more plausible alternative hypothesis. Notice that under the
alternative hypothesis, the t-ratio will usually be large. Geometrically, we can see that the
dashed circle in Figure 1.8b is entirely within the shaded cone. On the other hand, whenever
the t-ratio is small we have no evidence with which we might reject the null hypothesis.
It can be shown that this ratio follows the t-distribution (see Section 6.4). Integrating the
probability distribution function for the t-distribution between the values of −8 and 8 gives the
quantity 0.921. This suggests that under the assumption of the null hypothesis, we can expect
a t-ratio with an absolute value as high or higher than the one we observed only 8% of the time.
Equivalently, we can expect a sample vector in individual space as close or closer to the line
spanned by 1 only 8% of the time under the assumption of the null hypothesis.
20
1.4.2 The F-ratio and the t-test
Like the t-ratio, the F -ratio is used for testing hypotheses and follows the F -distribution (see
Section 6.4). This example illustrates the relationship between the t-ratio and the F -ratio. The
F -ratio is more versatile and will be used almost exclusively from here on for hypothesis testing.
We begin by introducing the linear algebraic notation for describing the vector relationships
depicted in Figure 1.7. This equation is called the model for the experiment. Models will be
defined and elaborated in much more generality in the next chapter. Under this model, the
observation vector y is assumed to be equal to the sum of the error vector e and the matrix
matrix.
7 1 −1
= 8 + .
9 1 1
y = Xb + e,
This model explains each sample as the sum of a constant mean gain-score (the product Xb)
plus a small amount of random error (the vector e) that is different for each sample. Another
way to understand the hypothesis test is to reframe our goal as the search for the estimate vector
b of the true vector β. The later describes the true relationship between Y and X. The null
hypothesis states that the vector we are trying to estimate β is zero, whereas the alternative
hypothesis states it is non-zero. It is noteworthy that throughout this thesis, the alternative
hypothesis is a negation of the null hypothesis instead of a hypothesis that specifies a particular
value of β.
The F -ratio is a comparison of the per-dimension squared lengths of (1) the projection y1
of the observation vector y onto the model vector 1 and (2) the error vector e = y − y1. In
this case, the F -ratio is simply the square of the t-ratio (see Section 6.4) and it follows the
21
F -distribution (see Section 6.3). The F -ratio can be written
ky1k2 /1
F = .
kek2 /1
(We mean by per-dimension a factor that is the inverse of the number of dimensions of the sub-
space containing the vector, after individual space has been partitioned into the model subspace
kek2
y
y1
ky 1k2
Figure 1.9: The t-ratio of ky 1k to kek under the t-distribution provides the same probability
information about the likelihood of the observation under the null hypothesis as the F -ratio of
ky 1k2 to kek2 under the F -distribution.
The values ky1k2 and kek2 are illustrated in Figure 1.9. Again it is quite evident from the
geometry that this ratio is much greater than one. The F -distribution tells us how unusual such
an observation would be, agreeing with our prior result. Under the null hypothesis, we would
expect a sample with an F statistic this large or larger only 8% of the time.
22
Chapter 2
Y as a function of the independent variables xi . One difference between functions and models
is that models incorporate random error. For a fixed set of input values, the model does not
always return the same value because each output contains a random component.
A model is called the true model when it is assumed to express the real relationship between
the variables. However, if the data collected is restricted from the population to a sample, the
true model can never be discovered with certainty. Only approximations of the true model are
defined using a small number of parameters, θ1 , θ2 , · · · , θk (see equation 2.1). The true model
can then be expressed with fixed (but unknown) parameter values, and the sample data can be
this notation and a general framework for statistical models in mind, we are prepared to state
Y = Xβ + E, (2.2)
23
where Y, E ∈ Rn and X is an n × (p + 1) matrix over R. In the general linear model, the
the model and is the difference between the observed values and those predicted by the product
XB.
The example model from the first chapter (see equation 1.12) is a simple case of the general
linear model. The design matrix for this model is simply the vector X = 1, Xb is the vector
y1, and the vector e is the centered observation vector yc . In general, the matrix product Xb
is the projection of y onto the column space of X (which in this case is the line spanned by
1). Recall that the use of a lower-case, bold y and e indicates vectors of observed values from
a particular sample. The capital, bold Y and E in equation (2.2) indicate the corresponding
random variable. Just as Greek characters refer to individual population parameters and Roman
characters refer to corresponding sample statistics (e.g., µ and x), we use β for the vector of
An equivalent statement of the general linear model is a system of equations for each depen-
where Xi is the ith row of the design matrix X. As demonstrated in later chapters, the general
linear model subsumes many of the most popular statistical techniques including analysis of
variance (ANOVA) models in which the design matrix columns xi are categorical variables
(for example, gender or teacher certification status), simple regression and correlation models
comparing two continuous variables, and multiple regression models in which there are two or
more continuous, independent variables (e.g., teacher salary and years of experience).
The formulation of the general linear model presented here treats the observed values of the
tal data sets, this is often appropriate because researchers control the values of the independent
variables (e.g., dosage in drug trials). In observational data sets, it often makes more sense
24
to treat the observations of independent variables as realized random variables, because these
observations are determined by choice of the sample rather than experimental design. Under
certain conditions, the general linear model can be used when both independent and depen-
dent variables are treated as random variables. For example, the requirement that all random
variables have a multivariate normal distribution is sufficient but not strictly necessary. Even
though the results hold in greater generality, all independent variables are treated as fixed in
Given a data set and a model, the next step of statistical analysis is to use the sample data to
estimate the parameters of the model. This process is called fitting the model. Our goal is that
unknown. However, since the model separates realized values of each random variable Yi into a
systematic component Xi β (where Xi is the ith row of the matrix X) and a random component
1 ≤ i ≤ n. To accomplish this, we need a number that can summarize the model deviation over
all the observed values in the sample. A natural choice for this number is the length of the error
vector E, because the Euclidean norm depends on the value of each coordinate. Therefore, it
With a sample of observations y in hand, the best estimate available for Y is simply the
sample y. Restating the goal identified above in terms of the sample, we say we are looking
for a vector b that will minimize the difference between Xb and y. This difference is the best
available estimate for E and is denoted e. It follows that the fitted model can be written
y = Xb + e, (2.4)
25
and similarly for each observed yi we can write
yi = Xi b + ei .
ev
eŷ
v V
ŷ − v
ŷ
Figure 2.1: The vector ŷ, the projection of y onto V = C(X), is seen to be the unique vector in
V that is closest to y.
Euclidean space. The vector β is assumed to lie in the (p + 1)-dimensional subspace spanned
by 1 and the vectors xi . We are looking for a vector ŷ = Xb in the column space of X, C(X),
such that the length of e = y − ŷ is minimized. (The hat notation indicates an estimate; ŷ is an
estimate for the observed vector of y-values y.) The situation is represented in Figure 2.1 and
Lemma 2.1.1. Let V ⊂ Rn be a subspace and suppose y ∈ Rn . For each vector v ∈ V, let
ev = y − v. Then there is a unique vector ŷ ∈ V such that 0 ≤ keŷ k < kev k for all v 6= ŷ.
Proof. We can write Rn = V ⊕ V ⊥ , and claim that ŷ is the projection of y onto V, the unique
The projection of y onto C(X), the vector ŷ = Xb, gives the desired estimate for y. It
remains to state and prove the general method for obtaining b using the sample data vector y
and the matrix X. The trivial case in which y ∈ C(X) does not require estimation because the
model (equation 2.4) can then be solved directly for b—it is not addressed here.
26
Theorem 2.1.2. Let X be an n × k matrix with linearly independent columns, let y ∈ Rn
be in the complement of C(X), and let b be a vector in Rk . Suppose that Xb = ŷ and that
XT (y − Xb) = 0.
Consequently,
(XT X)b = XT y.
b = (XT X)−1 XT y.
This method produces a vector b of parameter estimates, called the least squares estimate
because it minimizes the sum of the squared components of e. Another method of finding a
formula for b is to use calculus to find critical points of the function S(b) = kek2 . The resulting
(XT X)b = XT y,
The least squares solution b can be obtained from the orthogonal projection of the observed
dependent variable vector y onto the column space of the design matrix X = [1 x1 x2 · · · xp ].
In fact, the same solution can be obtained from the projection of y onto the subspace of Rn
27
spanned by the vector 1 and the centered vectors xci = xi − xi · 1. Let X0 denote the block
matrix [1 Xc ] with Xc = [xc1 xc2 · · · xcp ]. Centering a vector entails subtracting the projection
of that vector onto the vector 1 and because 1 is in both C(X) and C(X0 ), these subspaces are
equal.
From Theorem 2.1.2, we know that the least squares solution for y = Xb + e is
b = (XT X)−1 XT y,
Theorem 2.1.3. Whenever b = (b0 , . . . , bp )T is the least squares solution for the linear model of
y with the design matrix X and b0 = (b00 , . . . , b0p )T is the least squares solution for the linear model
of y with the corresponding centered design matrix Xc , then bi = b0i for all 1 ≤ i ≤ p. Moreover,
the parameter estimate b0 can also be obtained from b0 ; in particular, b0 = b00 − pi=1 b0i xi .
P
Proof. Since C(X) = C(X0 ) we know that Xb = X0 b0 . The result follows immediately from the
equation
1 −x1 · · · −xp
X = X0 ,
0 Ip
The geometry of the result is instructive and readily apparent in the case where there is
only one independent variable, that is, when X = [1 x]. It follows from Theorem 2.1.2 that
C(X) = C(Xc ) that is similar to the triangle formed by b1 x, b01 xc , and z = b1 x − b01 xc = z1 (see
Figure 2.2). It follows that b1 = b01 and that b0 = y − z. Certainly b00 = y, and by similarity we
28
have that z = b01 x. Thus, b0 = b00 − b01 x and we can therefore write ŷ = (b00 − b01 x)1 + b01 xc , which
is in terms of b0 as desired.
x1 x
y1 = b00 1 ŷ
b0 1
b1 x1 = b01 x1 b1 x
b1 xc = b01 xc xc
Figure 2.2: The vector ŷ, the projection of the vector y into C([1 x]), is equal to b0 1 + b1 x and
also is equal to b00 1 + b01 xc .
It is often convenient to think of individual space (Rn ) partitioned into several subspaces that
correspond to different parts of the assumed linear model. One important subspace in individual
As we have already seen, the least squares solution b is derived by projecting the observation
vector y into the column space of X, C(X). For analogous notation, we also use VX to denote
VXc denotes the orthogonal complement of V1 in VX . Finally, we let Ve denote the orthogonal
From linear algebra, we know there is corresponding equation relating the dimensions of these
n = 1 + p + (n − p − 1). (2.6)
The vectors that make up linear models (e.g., y, ŷ, e, etc.) are each contained in precisely
one of these subspaces. The (ambient) dimension of each vector is defined to be the dimension
of its associated (smallest) ambient space. Of course, each vector is one-dimensional in the
29
traditional sense. A basic technique for estimating models and testing hypotheses is finding a
convenient basis for individual space based on the vectors that make up the model. This new
definition for dimension is directly related to this implicit basis imposed on Rn by the linear
model. The vector 1 is 1-dimensional and is also (almost always) taken to be the first of the
implicit basis vectors. Next we choose a set of p orthogonal basis vectors for VXc . If the vectors
of centered predictors {xci : 1 ≤ i ≤ p} are all orthogonal, so much the better. The vector
ŷc is contained in VXc and therefore has (ambient) dimension p. Finally, we can pick any set
of n − p − 1 vectors that span Ve to complete the basis for Rn . The vector e, naturally, has
the ambient space Ve and therefore has dimension n − p − 1. The observation vector y is an
n-dimensional vector in individual space. It is rarely necessary to specify these vectors explicitly
We already have used the name individual space for Rn . The subspaces of individual space
imposed by a linear model also have convenient names. The space V1 is called mean space, the
space VXc is called the effect space or the model space, and the space Ve is called the error space.
If we adopt the general linear model for a particular data set, then there are three assumptions
we are required to make. The logic of fitting the model and testing hypotheses about estimated
parameters depends on these assumptions. First, we assume that the sample y ∈ Rn is a set of
n observations of n independent random variables Yi (see Section 1.2), and that each Yi follows
the normal distribution (see Section 6.1). Further we assume that the expected value of each Yi
is a linear combination of the variables xi specified by the parameters in the vector β, that is
µYi = β0 + β1 xi,1 + · · · + βp xi,p . Finally, we assume that the variables Yi have a common variance
σ 2 . In summary, for all i, we assume E (Yi − µYi )(Yj − µYj ) = 0 for all j = 6 i (a consequence
of independence), and that the random variable Yi follows the normal distribution with a mean
The three assumptions about the random variables Yi can be reframed as assumptions about
30
the error component of the general linear model. From equation (2.3), we can write
which illustrates the dependence between the random variable Ei and the random variable Yi ,
for each 1 ≤ i ≤ n. If the variance of Yi is known, it is clear that the variance of Ei must be
the same. Because the true parameter vector β minimizes E (kEk), we know that the expected
value of each Ei must be zero. It follows that the following three assumptions about the random
variables Ei are equivalent to the first set assumptions concerning the random variables Yi .
The assumptions about the random error variables play a central role in justifying hypothesis
To test hypotheses about the parameters of the general linear model requires that we under-
stand the distributions of linear combinations of random variables such as Y and S 2 , which are
linear combinations of the random variables Y1 , · · · , Yn . Assuming that the Yi s are all random
variables with the same normal distribution, it is reasonable to ask for the distribution of linear
We saw in Section 1.3.2 that, given a sample y, the statistic y is an unbiased estimator for
µ. In this section, we prove that S 2 is an unbiased estimator of σ 2 (see equation 1.9) and find
the distributions of Y and S 2 . These results (and the methods developed to obtain them) afford
a rigorous geometric foundation for the F -ratio developed in the next section.
31
Lemma 2.2.1. Let Y be a random variable with |Var(Y )| < ∞. Then for any a and b in R,
Var(aY + b) = a2 Var(Y ).
Proof. Using the definition of variance (see equation 1.9) and the linearity of expectation (see
2
Var(aY + b) = E (aY + b) − E(aY + b)
2
= E aY + b − aE(Y ) − b
2
= E aY − aE(Y )
2
= a2 E Y − E(Y )
= a2 Var(Y )
Lemma 2.2.2. Let Y1 , · · · , Yn denote n random variables and suppose the random variables are
independent (i.e. E (Yi − µYi )(Yj − µYj ) = 0, for all i 6= j). Then
n n
!
X X
Var Yi = Var(Yi ).
i=1 i=1
Proof. We apply the definition of variance, the linearity of expectation, and the hypothesis of
32
independence:
!2 !2
n n n
!
X X X
Var Yi = E Yi − nµi = E (Yi − µi )
i=1 i=1 i=1
Xn X
n n X
X n
= E (Yi − µi )(Yj − µj ) = E (Yi − µi )(Yj − µj )
i=1 j=1 i=1 j=1
n
X n X
X
= E(Yi − µi )2 + E (Yi − µi )(Yj − µj )
n
X
Var(W ) = a2i Var(Yi ).
i=1
Proof. The result is a straightforward consequence of Lemma 2.2.1 and Lemma 2.2.2.
The fact that the variance of a sum of independent random variables is the sum of their variances
(Lemma 2.2.2) yields another useful fact. We claim that the variance of the sample mean,
σ2
Var(Y ), is . This follows easily by applying the definition of Y and recalling the assumption
n
that the observations of a sample are assumed to be independent. Since Y = n1 ni=1 Yi , we
P
33
have:
1 X
= E ( Yi − nµ)2 (2.9)
n 2
1 X
= 2
Var Yi (2.10)
n
1 X σ2
= 2
Var(Yi ) = (2.11)
n n
Recall that the variance of a variable is the square of the standard deviation. Considering
σ
σY = √ , (2.12)
n
where σ is the common standard deviation of the random variables Yi (see Section 2.1.4).
Whenever the variance of a random variable is calculated from a random sample y = (y1 , · · · , yn ),
the unbiased estimator stated in equation (1.8) is used. Recall that this differed from the naive
random variable S 2 corresponding to the estimate s2 is in fact unbiased. We wish to show that
E(S 2 ) = σ 2 and in the process demonstrate that s2n calculated from sample data is a biased
estimate of variance.
We start with a standard algebraic argument that S 2 is unbiased. As with the proof of
34
Lemma 2.2.2, this argument relies on partitioning a sum of squares.
1 X
E(S ) =
2
E (Yi − Y ) 2
n−1
1 X
= (Yi − µ + µ − Y )2
E
n−1
1 X
= (Yi − µ)2 − 2n(Y − µ)2 + n(Y − µ)2
E
n−1
1 X
= E (Yi − µ)2 − nE (Y − µ)2
n−1
1 X
= Var(Yi ) − nVar(Y )
n−1
1
= (nσ 2 − σ 2 ) = σ 2
n−1
Although this algebraic argument that S 2 is an unbiased estimator does not support a
would like to calculate the per-dimension squared length of y − µ1, a vector with ambient space
Rn . Thus, when µ = µYi for all 1 ≤ i ≤ n, is known, the statistic n1 ni=1 (yi − µ)2 gives
P
an unbiased estimate of σ 2 . In the present case, however, the common mean, µ, is unknown.
Instead of using the true parameter µ we use the sample mean y. The key idea here is that one
dimension of individual space y ∈ Rn is used to estimate the mean by calculating y. The true
The situation is different for y which instead depends directly on the sample y. Once y has been
decomposed as the sum of the mean vector y1 and the centered vector yc = y − y1, the centered
vector no longer has ambient dimension of n. Instead, the vector yc has ambient dimension
n − 1. The desired result follows from precisely the same concept—the statistic s2 is the average
per-dimension squared length of the centered data vector yc . In order to demonstrate this
rigorously, we use an approach that will be useful for geometrically justifying hypothesis testing
We have seen that y can be obtained by projecting y on the line spanned by 1 (see equation
other basis vectors fixed but unspecified. We can write each basis vector in terms of the original
35
can find each ci by taking the (signed) length of the projection of y on ui , where ci is negative
if the projection of y on ui has the opposite direction as ui . We know that the coefficient
n √
ci = aij yj for each i, and in particular that c1 = n(y). Since each ui is a unit vector, we
P
j=1
n
also have a2ij = 1, for all i.
P
j=1
Next we consider ci for all 1 ≤ i ≤ n, as the realized value of the corresponding linear
n
combination of random variables Ci = aij Yj . Now if E(Yi ) = 0, for all 1 ≤ i ≤ n, we have
P
j=1
that E(Ci ) = 0. (This assumption is analogous to the assumption that each coordinate of the
error vector E in the general linear model has an expected value of 0; that is, E(Ei ) = 0.) By
X
Var(Ci ) = ai Var(Yi )
X
= a2i σ 2
X
= σ2 a2i
= σ2
We have proved
normal distribution with a mean of 0 and a variation of σ 2 for all i, and suppose that u ∈ Rn
is a unit vector. If Cu is the projection of the random variable vector Y = (Y1 , . . . , Yn ) onto u,
then Var(C) = σ 2 .
2
Var(Ci ) = E Ci − E(Ci )
= E Ci2 .
E(Ci2 ) = σ 2 , (2.13)
and it follows that, with a particular vector of observations y (the realized values of the random variable vector Y), the mean of the squared lengths of the vectors ciui for 1 < i ≤ n will give the best available estimate of σ². It almost goes without saying that the sum c2u2 + · · · + cnun = yc because c1u1 = ȳ1 (see equation 1.2). We now see the utility of the per-dimension squared length of yc:

s² = (c2² + · · · + cn²)/(n − 1) = (‖c2u2‖² + · · · + ‖cnun‖²)/(n − 1) = ‖yc‖²/(n − 1).
We are happy to observe that this method of estimating variance agrees with the definition of s² given in equation (1.8).
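As a quick numerical check of this argument, the following Python sketch (added here for illustration; not part of the original text) simulates many small samples and compares the divisor n − 1 with the naive divisor n. The true variance, sample size, and number of replications are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0             # true variance (illustrative choice)
n, reps = 5, 200000      # a small n makes the bias of the naive divisor visible

est_unbiased = np.empty(reps)
est_naive = np.empty(reps)
for r in range(reps):
    y = rng.normal(loc=10.0, scale=np.sqrt(sigma2), size=n)
    yc = y - y.mean()                  # centered data vector y_c
    ss = yc @ yc                       # squared length of y_c
    est_unbiased[r] = ss / (n - 1)     # s^2: per-dimension squared length of y_c
    est_naive[r] = ss / n              # naive estimate with divisor n

print(est_unbiased.mean())   # close to 4.0
print(est_naive.mean())      # close to 4.0 * (n - 1) / n = 3.2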
Statistical analyses, in addition to making estimates of the parameters defining putative true
models for variables in a data set, often provide statistical tests for these estimates. Hypothesis
testing entails a comparison of the estimated model with a simpler model that is described by
the null hypothesis. Such a test, for example, can provide evidence that the true model is more
similar to the estimated model than to a model, say, in which the dependent variable has no systematic relationship with the independent variables.
The general strategy for hypothesis testing with the general linear model is to compare a
restricted model obtained from a hypothesis that constrains one or more parameters with the estimated model, in which the parameters are free to vary.
The null hypothesis is often the most restricted model (the parameter vector β is set to the
zero vector so there is no systematic portion in the model). On the other hand, if parameter
estimates can vary, the estimated model is the best choice because it has the smallest squared
error vector. If the observed sample under the distribution implied by the restricted model is
so unusual that this null hypothesis is untenable, then the null hypothesis is rejected in favor of the alternative hypothesis corresponding to the estimated model.
The foremost concern, given the sample y and a vector of parameter estimates b, is to
determine how close the estimated model is to the true model. Recall that the general linear
model decomposes each observed yi as the sum of (1) a linear combination of the xi s (this is the
systematic portion of the model) and (2) the term ei , which is the random portion of the model.
Furthermore, by adopting the linear model, we assume that, for all 1 ≤ i ≤ n, the expected
value of Ei is zero and the variance of Ei is σ 2 > 0 (see Section 2.1.4). We cannot compare b
with β without knowing the true model. Instead, we will compare the estimated model with the model that has no systematic portion.
The model with no systematic portion is precisely the model in which the parameters are
each constrained to zero: β = 0. This is the same as saying that we expect all of the variation in Y to be random; setting β = 0 in the general linear model implies that Y = E. It follows that for all 1 ≤ i ≤ n, the expected value of Yi is zero and the
variance of Yi is σ 2 > 0.
Under any orthonormal basis for Rⁿ, the squared coordinates of Y each have expected value σ² (see the discussion in the previous section, for example), and so the expected value of the squared length of the random vector Y is nσ². Turning this around, if the observed squared length of y is improbable under the assumption that E(‖Y‖²) = nσ², then we are able to conclude that β is
unlikely to be the zero vector. In this case, then b is the best estimate for β in the column space
of X (i.e. when the parameters are allowed to vary), and so it is reasonable to conclude that b
is close to β. This logic is used in every hypothesis test of the estimated model. The crux of
the argument is using the observed data to show that we can (or cannot) reasonably expect the
expected value of the squared length of Y to be nσ 2 given the evidence from the sample y.
Let us call the model with no systematic portion the null model. We want to know how
likely the sample y is if reality actually corresponds to the null hypothesis that the linear model
is no better than chance at predicting or explaining the observed data y. In other words, the
observed data are due merely to chance and have no systematic relationship with the independent
variables.
For the sake of argument, we first assume the null model holds and therefore hypothesize that β = 0. Let {u1, u2, · · · , un} be an orthonormal basis for Rⁿ chosen so that {u1, u2, · · · , up+1} span C(X). If Y is written as Σi Ciui, where Ci
is a random variable constructed by the appropriate linear combination of the random variables
Yi, 1 ≤ i ≤ n, then the expected value of Ci² is σ² for all i because we have assumed that each Yi has mean 0 and variance σ². Thus the expected per-dimension squared length of Y is

(1/n) E‖Y‖² = (1/n) E‖C1u1 + · · · + Cnun‖² = σ².
Furthermore, by the least-squares estimate for b, we can express Y as the sum of the vector
Ŷ ∈ C(X) and the random variable vector E ∈ C(X)⊥. It follows that the expected per-dimension squared length of Ŷ is

(1/(p + 1)) E‖Ŷ‖² = (1/(p + 1)) E‖C1u1 + · · · + Cp+1up+1‖² = σ²,

and that

(1/(n − p − 1)) E‖E‖² = (1/(n − p − 1)) E‖Cp+2up+2 + · · · + Cnun‖² = σ².
In fact, we expect the per-dimension squared lengths of Ŷ and E to be the same:

‖Ŷ‖²/(p + 1) = σ² = ‖E‖²/(n − p − 1).   (2.14)
This suggests that the ratio

F = (‖ŷ‖²/(p + 1)) / (‖e‖²/(n − p − 1))

is an appropriate statistic for evaluating the likelihood of the observed data under the assumption of the null hypothesis. If the null hypothesis is correct, then we would expect the per-dimension squared length of the sample least squares estimate ŷ to be similar to the per-dimension squared length of the sample error vector e because of equation (2.14). In this case, we would expect the value of F to be
close to 1. On the other hand, if the F -ratio is large then the average squared lengths of the
projections of ŷ onto the arbitrary basis vectors {ui } would be greater than σ 2 , implying that
the random variables Yi do not have a mean of zero. Moreover, if F is sufficiently large, then
the sample is unlikely under the assumption of the null hypothesis. For this reason, when F
is sufficiently large, we have some confidence in rejecting the null hypothesis and accepting the
alternative hypothesis that β is not the zero vector. It is important to stress that this procedure
does not ‘prove’ that β is close to b. Having decided that in all likelihood β 6= 0, the estimate
b provides the best guess we can make of the unknown parameter β given the sample y.
For example, another common model (also called the null model in some texts) is the model where β0 is left free to be estimated but the rest of the parameters are constrained to 0. The null hypothesis for a test of the estimated model against this model is written

H0 : β = (β0  0)T,   (2.16)

where 0 is a p-dimensional row vector. Under this hypothesis, the restricted model can be written

Y = 1β0 + E.
The alternative model is the full general linear model expressed in equation (2.2). Once the null and alternative hypotheses have been articulated, the corresponding models are fit using the least-squares method, and the test statistic is

F = (‖ŷc‖²/p) / (‖ec‖²/(n − p − 1)).
Because one dimension of individual space is spanned by the vector 1, the vector ŷc has an ambient dimension of n − 1, and the numerator of this F-ratio has p rather than p + 1 degrees of freedom.
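To make the mechanics of these two F-ratios concrete, here is a minimal Python sketch (an added illustration using invented toy data) that computes both ratios by projecting the observed vector onto the column space of a design matrix.

import numpy as np

def proj(A, y):
    """Orthogonal projection of y onto the column space of A."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

# Invented toy data: n = 6 observations, one predictor plus an intercept.
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.1])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
n, p = len(y), 1
X = np.column_stack([np.ones(n), x])

yhat = proj(X, y)                  # projection onto C(X)
e = y - yhat                       # error vector, orthogonal to C(X)

# Test of beta = 0: compare per-dimension squared lengths of yhat and e.
F_all = (yhat @ yhat / (p + 1)) / (e @ e / (n - p - 1))

# Test against the mean-only model: use the centered estimate.
yhat_c = yhat - y.mean()
F_slope = (yhat_c @ yhat_c / p) / (e @ e / (n - p - 1))

print(F_all, F_slope)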
Chapter 3
Analysis of variance (ANOVA) is a term that describes a wide range of statistical models ap-
propriate for answering many different kinds of questions about a wide range of experimental
and observational data sets. What unites these techniques is that the independent variable(s) in
ANOVA are always categorical variables (for example, gender or teacher certification status) and
the dependent variable(s) are continuous. Since ANOVA techniques were developed separately
from the regression techniques that will be discussed in the next chapter, different vocabulary
is used for characteristics of both kinds of models that are actually very similar or identical.
For example, the independent variables in ANOVA models are usually called factors, whereas
the independent variables in regression models are called predictors. In a similar way, the coef-
ficients for the independent variables in an ANOVA model are called effects but are often called
parameters in regression analyses. The hypotheses one can test using ANOVA generally concern
the differences of means in the dependent variable at different levels of a factor, the finite set
of discrete values attained by the factor. When a single independent variable or factor is used
in the model, then the analysis is called one-way ANOVA, and if two factors are used, then the analysis is called two-way ANOVA.
3.1 One-way ANOVA
In the example from section 1.4, we were interested in whether or not the mean gain-score on a
standardized test after one month of tutoring was significantly different from 0. We were only
interested in a single population, namely those students who had been tutored for one month.
In many kinds of research comparisons between two or more treatment groups are required. For
example, it is quite plausible that this particular tutor has no effect over and above the effect of
studying alone for an extra month. One-way ANOVA models allow us to compare the means of
different groups. When individuals are assigned to treatment groups randomly, it is defensible to attribute systematic differences between the groups to the treatments themselves rather than to preexisting differences among the individuals, and thus to draw causal conclusions about the treatments' effects on the outcomes.
Suppose a tutoring company wanted to research the efficacy of private tutoring to generate
data for an advertising campaign. Because individuals who have already elected private tutoring
may differ systematically from individuals who have not, they focus on the population of 42
tutees who participate in group tutoring sessions. Of these, three are randomly selected for 1
hour of private tutoring, three are randomly selected for 2 hours of private tutoring, and three
are randomly selected as controls (they continue to participate in the group tutoring session).
The scores before and after one month of tutoring are used to calculate the gain-scores as before.
The factors (i.e., independent variables) in ANOVA models specify the factor level for each
observation (instead of measurement data) and are called dummy variables. These vectors are
Table 3.1: Data for a 3-level factor recording tutoring treatment.

Level                        Observed gain-scores
Group tutoring               6.93, 6.13, 4.25
1 hour private tutoring      11.94, 7.43, 9.43
2 hours private tutoring     12.44, 14.64, 9.17
more easily interpreted if they are orthogonal and when they can be chosen to encode particular
hypotheses of interest.
The simplest method of creating dummy variables is to first sort y by factor level so that all
observations of the same level are consecutive. For example, the three gain-scores of students
who attended group tutoring might be in the first three slots of y, the three gain-scores of students who received 1 hour of private tutoring might be in the fourth, fifth, and sixth slots, and the scores of the final level might be in the seventh, eighth, and ninth slots. We then create
a dummy variable Xi to represent the ith factor level under the convention that Xij = 1 if yj is
an observation of the ith factor level and 0 elsewhere. The dummy variables constructed in this
manner for the tutoring example are presented in the following fitted model:

y = Xb + e,

[ 6.93]   [1 0 0]             [ 1.16]
[ 6.13]   [1 0 0]             [ 0.36]
[ 4.25]   [1 0 0]   [ 5.77]   [−1.52]
[11.94]   [0 1 0]   [ 9.60]   [ 2.34]
[ 7.43] = [0 1 0]   [12.08] + [−2.17]
[ 9.43]   [0 1 0]             [−0.17]
[12.44]   [0 0 1]             [ 0.36]
[14.64]   [0 0 1]             [ 2.56]
[ 9.17]   [0 0 1]             [−2.91]
Geometrically, this means that we are considering the orthogonal basis of individual space com-
prised of the columns of X and any 6 other arbitrary vectors that span the error space. Notice
that with this simple form of dummy coding, the design matrix X does not include the mean
vector 1. Moreover, if the mean vector were added as another column in the design matrix X, the columns would be linearly dependent (since X1 + X2 + X3 = 1) and XTX would not be invertible.
Estimating this model using least squares gives the vector b = (5.77, 9.60, 12.08)T , which can
be interpreted as the vector of factor level means. An overall test of the hypothesis that these
means are significantly different from 0 can be accomplished by calculating the F-ratio with 3 and 6 degrees of freedom:

F = (‖Xb‖²/3) / (‖e‖²/6) ≈ 271.45/4.8584 ≈ 55.875.

This value is so large that, under the assumption that all of the factor level means are 0 (H0 : µ1 = µ2 = µ3 = 0), we would expect an F value this large or larger only 0.009% of the time.
We can conclude that at least one factor mean is significantly different than zero.
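The fit and the overall F-ratio above can be checked with a few lines of Python (an added illustration using only NumPy); the data are those of Table 3.1.

import numpy as np

# Gain-scores from Table 3.1, sorted by factor level (group, 1 hr, 2 hr private).
y = np.array([6.93, 6.13, 4.25, 11.94, 7.43, 9.43, 12.44, 14.64, 9.17])
X = np.kron(np.eye(3), np.ones((3, 1)))   # simple dummy coding, one column per level

b, *_ = np.linalg.lstsq(X, y, rcond=None) # least-squares estimates of the level means
yhat = X @ b
e = y - yhat

F = (yhat @ yhat / 3) / (e @ e / 6)       # 3 and 6 degrees of freedom
print(np.round(b, 2))                     # [ 5.77  9.6  12.08]
print(round(F, 2))                        # approx 55.9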
In addition, since the dummy variables are orthogonal, each can be used to test a hypothesis
that is independent of the rest of the model. In particular, we can test the hypothesis that each
factor level mean is different than zero (H0 : µ1 = 0; H0 : µ2 = 0; H0 : µ3 = 0). The F -ratio for
each test is presented below, along with the corresponding p-value. The p-value of a hypothesis
test is the probability of obtaining an F -ratio as large or larger under the assumption of the
corresponding null hypothesis. Notice that the numerator degrees of freedom for each of these
tests is 1 because each Xi is a vector in the chosen orthogonal basis for individual space.
The results of these hypotheses tests suggest that we can reject the null hypotheses that any
one of the factor level means is zero; in each case, the gain-score is significantly different from
0 because the p-values are smaller than 0.05. (Other common significance levels are 0.1 and 0.01.)
In spite of the hypothesis tests we were able to perform with the simplest kind of dummy
variables, we have not yet answered the most important question we sought to address with
the tutoring experiment: Does private tutoring have a different effect on gain-scores than group
tutoring? In addition, we might also like to answer the question: Are there differences in the
effect on gain-scores between 1 hour and 2 hours of private tutoring? These questions can be
answered by using a more clever strategy of constructing dummy variables so that they encode
hypotheses of interest.
These more elaborate dummy variables are often called contrasts. Researchers will often
select contrasts that are orthogonal to each other so the contrasts are independent and the
hypotheses they encode can be tested separately. Although not strictly necessary, most designs
include the vector 1 in order to test the null hypothesis H0 : µi = 0 where i ranges over all
factor levels. In the following discussion, we let X01 indicate the vector 1.
First we write down the questions we seek to answer and their translation as null hypotheses: (1) Does private tutoring (in 1 or 2 hour sessions) have a different effect on gain-scores than group tutoring? (H0 : (µ2 + µ3)/2 − µ1 = 0.) (2) Are there differences in the effect on gain-scores between 1 hour and 2 hours of private tutoring? (H0 : µ2 − µ3 = 0.)
The next step is to find a dummy variable for each question so that when the F-ratio is large,
we have evidence to reject the associated null hypothesis. Geometrically, we want to construct a
vector in the column space of X so that the squared length of the projection of y on this vector
can be compared with the average squared length of the projection of y on arbitrary vectors
spanning the error space. In particular, we want relatively large projections on this vector to
be inconsistent with the hypothesis that the average gain-score for private tutoring is the same as the gain-score for group tutoring. This can be accomplished with (any multiple of) the vector ½X2 + ½X3 − X1, where the Xi indicate the simple dummy variables from the previous section. Essentially, we are checking whether our hypothesis about the group means (H0 : (µ2 + µ3)/2 − µ1 = 0)
is a feasible description of the relationship between observed values in these groups, across the
whole data set. Usually, it is convenient to pick dummy values that are integers to ease data-
entry, so let X02 = X2 + X3 − 2X1 . This vector works for testing the first hypothesis because
if the squared length of the projection of y on this vector is large then the observed data are
unlikely to have come from a population that is described by the null hypothesis (1): Private
tutoring (in 1 or 2 hour sessions) has no different effect on gain-scores than group tutoring.
Similarly, we can construct a vector for the null hypothesis corresponding to the second
question, Are there differences in the effect on gain-scores between 1 hour and 2 hours of private
tutoring? We take X03 to be X2 − X3 , and reason that large squared lengths of the projection of
y on this vector are inconsistent with the null hypothesis for the second question. Furthermore,
X01 · X02 = 0, X01 · X03 = 0, and X02 · X03 = 0, so the new design matrix X0 = [ X01 X02 X03 ] is an orthogonal design. The fitted model becomes

y = X0b0 + e,

[ 6.93]   [1 −2  0]             [ 1.16]
[ 6.13]   [1 −2  0]             [ 0.36]
[ 4.25]   [1 −2  0]   [ 9.15]   [−1.52]
[11.94]   [1  1  1]   [ 1.69]   [ 2.34]
[ 7.43] = [1  1  1]   [−1.24] + [−2.17]
[ 9.43]   [1  1  1]             [−0.17]
[12.44]   [1  1 −1]             [ 0.36]
[14.64]   [1  1 −1]             [ 2.56]
[ 9.17]   [1  1 −1]             [−2.91]
Notice that the error vector is the same as in the previous fitted model; this makes sense because the two design matrices have the same column space, so the projection ŷ of y is unchanged.
Comparing these two models, we find that only the values of the estimate b0 are different
and this is because they have different interpretations. The value b01 ≈ 9.15 can be interpreted
as the mean gain-score over all the students. A hypothesis test of this value can allow one to
reject the null hypothesis that all three tutoring treatments have average gain-scores of 0.
The value b02 is related by a constant to an estimate dˆ1 for the average difference in gain-scores between group and private tutoring treatments, d1 = (µ2 + µ3)/2 − µ1. This constant depends on k, the number of observations at each factor level (in this example k = 3), and on the particular multiple of the contrast that was chosen. (When the numbers of observations at each factor level are not equal, the computation is more complicated but still possible—this case is not treated here.) Because the observations are sorted by factor level,

E(y · X02) = −2E(y11 + y12 + . . . + y1k) + 1E(y21 + y22 + . . . + y2k) + 1E(y31 + y32 + . . . + y3k) = 2k d1,

and so if we assume

y · X02 = 2k dˆ1,

then, because y · X02 = ‖X02‖² b02,

dˆ1 = (‖X02‖²/(2k)) b02 = (18/6) b02 = 3b02 = 5.07.
Thus, we estimate the difference in gain-scores between group and private tutoring to be about
5 points.
In the same way, b03 is related by a constant to an estimate dˆ2 for the average difference in gain-score between the 1 hour and 2 hours private tutoring treatments, d2 = µ2 − µ3. Following the same reasoning,

dˆ2 = (‖X03‖²/k) b03   (3.1)

dˆ2 = (6/3) b03 = 2b03 = −2.48.   (3.2)
Thus, we estimate that 1 hour of tutoring yields gain-scores that are a little more than 2 points lower than those for 2 hours of tutoring.
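A brief Python sketch (added for illustration) reproduces the contrast estimates dˆ1 and dˆ2; the dummy variables and scaling constants follow the construction in this section.

import numpy as np

y = np.array([6.93, 6.13, 4.25, 11.94, 7.43, 9.43, 12.44, 14.64, 9.17])
X1 = np.repeat(np.eye(3), 3, axis=0)          # simple dummy variables X1, X2, X3 as columns
one = np.ones(9)
X2p = X1[:, 1] + X1[:, 2] - 2 * X1[:, 0]      # contrast: private vs. group tutoring
X3p = X1[:, 1] - X1[:, 2]                     # contrast: 1 hour vs. 2 hours
Xp = np.column_stack([one, X2p, X3p])         # orthogonal design matrix X'

bp, *_ = np.linalg.lstsq(Xp, y, rcond=None)
k = 3                                         # observations per factor level
d1 = (X2p @ X2p) / (2 * k) * bp[1]            # = 3 * b'_2
d2 = (X3p @ X3p) / k * bp[2]                  # = 2 * b'_3
print(np.round(bp, 2))                        # [ 9.15  1.69 -1.24]
print(round(d1, 2), round(d2, 2))             # approx 5.07 and -2.48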
Next we want to check to see if these values are significantly different from 0. As before, we
can compute the F -ratio for the model overall (which tests the null hypothesis H0 : b0 = 0) and
compute F -ratios for each estimate b0i . An overall test of the null hypothesis that b0 = 0 can be
accomplished by calculating the F-ratio with 3 and 6 degrees of freedom:

F = (‖X0b0‖²/3) / (‖e‖²/6) ≈ 271.45/4.8584 ≈ 55.875.
This value is exactly the same as the F-ratio for the hypothesis test that b = 0 because ŷ =
Xb = X0 b0 . As before, the F value is so large that we would expect an F value this large or
larger only 0.009 % of the time. We can conclude that at least one of the estimates in b0 is not
zero.
Whenever the vector 1 is included in the design matrix, a more sensitive test of the model is available. Subtracting the mean vector b01 1 = ȳ1 from both sides of the fitted model gives an equation which is equivalent to

yc = b02 X02 + b03 X03 + e.

This new model equates the centered vector yc with the sum of two (orthogonal) vectors in the effect space and the error vector. All of these vectors (and those spanning the error space) lie in the subspace orthogonal to 1. The null hypothesis becomes H0 : b02 = b03 = 0 and the corresponding
reduced model has a design matrix made up entirely of the vector 1. The corresponding F -ratio
has only 2 degrees of freedom for the centered estimate ŷc but retains 6 degrees of freedom for the error vector:

F = (‖ŷc‖²/2) / (‖ec‖²/6) ≈ 30.347/4.8584 ≈ 6.246.
This result has a p-value of 0.03416, so we can reject the hypothesis that b02 = b03 = 0. This test
is more sensitive because we will not reject the overall null hypothesis in those cases where only the overall mean, and none of the contrasts, differs from zero.
It remains to see if each of the values estimated by b0 are significantly different than zero.
As with the first set of dummy variables, we can accomplish this by means of three F -ratios:
From these ratios we can reject the first two null hypotheses: the average overall gain score and
the average difference between private and group tutoring are significantly different than 0 (both
p-values are below the common threshold of 0.05). However, there is a relatively high chance
(p-value = 22%) of obtaining the observed estimate for b03 under the null hypothesis H0 : b03 = 0
and we cannot reject this hypothesis. We say that the difference in gain scores between the two private tutoring treatments is not statistically significant.
In fact, the data for the 1 hour and 2 hour treatment gain scores was simulated from normal
distributions with different means, but there are not enough scores to separate the pattern from
chance. This is a problem of insufficient statistical power: the power of a test is the probability of rejecting a false null hypothesis, and here that probability is too low. In linear models, the primary determinant of power is
the size of the sample. Techniques are available to find the minimum sample size for obtaining
a sufficiently powerful test so that the probability of failing to reject a false null hypothesis is
guaranteed to be less than some predetermined threshold. Further discussion of statistical power is beyond the scope of this thesis.
We conclude this section by observing that the number of independent hypotheses that can be
simultaneously tested is constrained by the need to invert the matrix XT X in order to estimate
β using least squares. When conducting one-way ANOVA, a model of a factor that has k levels
can be constructed with k − 1 dummy variables and the vector 1. Any more, and the matrix
XTX will be singular, and a different technique for finding something analogous to (XTX)−1 is required: the generalized inverse. This extension of the least-squares method is beyond the scope of this thesis.
3.2 Factorial designs
We discuss one other kind of ANOVA design, factorial designs. These models are quite flexible,
and, although we only discuss an example of two-way ANOVA, the methods can easily be extended to designs with more factors.
Consider an extension to the tutoring study discussed in the previous section in which the
researchers would like to discover if a short content lecture affects the gain-scores experienced by
the students who are being tutored. In this new experiment, there are two factors: lecture and
the tutoring treatment. Each factor has two levels: students are randomly assigned to attend
(or not attend) the lectures, and students are randomly assigned to participate in group tutoring
or in private tutoring for a one month period. Simulated data for this example are presented in
Table 3.2.
Table 3.2: Data for a 2-factor experiment recording observed gain-scores for tutoring and lecture
treatments.
                     No lectures           Lectures
Group tutoring       5.00, 5.58, 2.60      10.78, 4.83, 6.05
Private tutoring     10.83, 5.17, 6.68     12.53, 9.20, 10.39
One way we can think of this problem is as a one-way ANOVA of a factor with four levels.
The effect or regression space is then the subspace of individual space spanned by the simple
dummy variables seen in the last section, Xi where Xij = 1 whenever yj is an observation of
factor level i. However, to aid the reader, we instead use subscripts that denote the group: Xgn
(group tutoring and no lectures), Xgl (group tutoring and lectures), Xpn (private tutoring and
no lectures), and Xpl (private tutoring and lectures). The fitted model is
y = Xb + e,

[ 5.00]   [1 0 0 0]              [ 0.61]
[ 5.58]   [1 0 0 0]              [ 1.19]
[ 2.60]   [1 0 0 0]              [−1.80]
[10.78]   [0 1 0 0]   [ 4.393]   [ 3.56]
[ 4.83]   [0 1 0 0]   [ 7.220]   [−2.39]
[ 6.05]   [0 1 0 0]   [ 7.560]   [−1.17]
[10.83] = [0 0 1 0]   [10.707] + [ 3.27]
[ 5.17]   [0 0 1 0]              [−2.39]
[ 6.68]   [0 0 1 0]              [−0.88]
[12.53]   [0 0 0 1]              [ 1.82]
[ 9.20]   [0 0 0 1]              [−1.51]
[10.39]   [0 0 0 1]              [−0.32]
As in the last example, the coefficients for these dummy variables can be interpreted as the
means of each group in the population: group tutoring and no lectures (µ̂gn = 4.393), group
tutoring and lectures (µ̂gl = 7.220), private tutoring and no lectures (µ̂pn = 7.560), and private
tutoring and lectures (µ̂pl = 10.707). As before, F tests can be used to show that each one of these means is significantly different than zero.
By using these simple dummy variables, each group mean is estimated using only 3 of the
12 data points; the rest of the data are ignored. We are not as interested in each of these four
groups, however, as much as we are interested in the overall effect of each factor on the outcome
measure. The real strength of factorial designs is the contrasts that can be constructed to make
use of all of the data in the experiment, effectively increasing the sample size for the factors of
interest. This is beneficial because it increases statistical power without the expense of collecting
more data.
3.2.2 Constructing factorial contrasts
To be explicit, consider the question of the effect of lectures on gain-score. We would like to
compare all of the individuals in the experiment who attended lectures with those who did not.
The null hypothesis for this question might be written H0 : 12 (µgl + µpl ) − 12 (µgn + µpn ) = 0. The
appropriate contrast can be formed as a linear combination of the corresponding simple dummy
variables: X02 = −Xgn + Xgl − Xpn + Xpl. This contrast helps answer the question, Do lectures affect gain-scores?
In a similar way, we form the contrast X03 = −Xgn − Xgl + Xpn + Xpl to test the null hypothesis H0 : ½(µpn + µpl) − ½(µgn + µgl) = 0. This null hypothesis corresponds to the question, Does private tutoring affect gain-scores?
A final kind of contrast important in factorial ANOVA models is called the interaction
contrast. This contrast helps to answer the question, Is the increase of gain-scores due to
lectures with group tutoring the same as the increase of gain-scores due to lectures with private
tutoring? The null hypothesis for this question says there is no difference in the increase of gain-
score due to lectures between the two tutoring conditions: H0 : (µgn − µgl ) − (µpn − µpl ) = 0.
Constructing the corresponding contrast is straightforward: X04 = Xgn − Xgl − Xpn + Xpl . As in
the one-way ANOVA example, the first column in the modified design matrix X0 is the vector
1.
y = X0b0 + e,

[ 5.00]   [1 −1 −1  1]             [ 0.61]
[ 5.58]   [1 −1 −1  1]             [ 1.19]
[ 2.60]   [1 −1 −1  1]             [−1.80]
[10.78]   [1  1 −1 −1]   [7.470]   [ 3.56]
[ 4.83]   [1  1 −1 −1]   [1.493]   [−2.39]
[ 6.05]   [1  1 −1 −1]   [1.663]   [−1.17]
[10.83] = [1 −1  1 −1]   [0.080] + [ 3.27]
[ 5.17]   [1 −1  1 −1]             [−2.39]
[ 6.68]   [1 −1  1 −1]             [−0.88]
[12.53]   [1  1  1  1]             [ 1.82]
[ 9.20]   [1  1  1  1]             [−1.51]
[10.39]   [1  1  1  1]             [−0.32]
Since these contrasts are orthogonal, they can each be tested independently for significance using
F -ratios and the F -distribution to obtain the p-values for the corresponding null hypotheses.
When studying the results of tests of factorial designs, the first thing to check is the interac-
tion term. As indicated by the high p-value of 0.91, the estimated interaction parameter is quite
likely under the null hypothesis of no interaction. This is the desired result, because when there
is a significantly non-zero interaction, we can no longer interpret the parameter estimates b02 and
b03 as estimates of the main effect for lectures and private tutoring, respectively. The reason for
this is that X02 averages the increase due to lectures in the case of group and of private tutoring.
If there is an interaction, then this average includes not just the main effect of lectures (the
increase in gain score due to lectures) but also half of the interaction effect (the increase in gain
score over and above the increase due to lectures and the increase due to private tutoring).
In this case, the interaction is very small and also statistically non-significant. This means
that we are free to interpret the values b02 and b03 as functions only of the increase in gain-scores
due to lectures and private tutoring, respectively, as long as these values are significant. To
answer the question of statistical significance, we return to the results of the F -ratios. The
p-value for the Fb02 =0 ratio tells us there is a 6.7% chance of obtaining an estimate of the gain
score due to lectures as large or larger, which is greater than the common threshold of 5%.
However, depending on the consequences of rejecting a true null hypothesis, many researchers
might still consider this a useful estimate of the effect. The p-value for the hypothesis test of
main effect of private tutoring is 4.6% and below the common threshold. We can conclude that private tutoring has a statistically significant effect on gain-scores.
However, we must be careful when interpreting the estimates obtained from b02 and b03 . As
we saw in the last section, we must compute the constant that relates the estimated difference
dˆ1 in gain scores due to lectures to the value b02 , and find the constant relating b03 to an estimate
dˆ2 of the difference in gain scores due to private tutoring. In both cases the squared length of
the column vector is 12, and we multiplied each contrast by a constant of 2 to clear fractions, so

dˆ1 = (12/(3 · 2)) b02 = 2b02 ≈ 2.99,

and

dˆ2 = (12/(3 · 2)) b03 = 2b03 ≈ 3.33.
We can conclude (since b02 and b03 are significantly different from 0) that private tutoring and
short lectures each independently increase gain-scores by about 3 points.
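To make the construction of the factorial contrasts concrete, the following Python sketch (an added illustration) builds the design matrix X0 from the cell dummy variables and recovers the estimates reported above.

import numpy as np

# Gain-scores from Table 3.2, ordered gn, gn, gn, gl, gl, gl, pn, pn, pn, pl, pl, pl.
y = np.array([5.00, 5.58, 2.60, 10.78, 4.83, 6.05,
              10.83, 5.17, 6.68, 12.53, 9.20, 10.39])
cells = np.repeat(np.eye(4), 3, axis=0)        # Xgn, Xgl, Xpn, Xpl as columns
Xgn, Xgl, Xpn, Xpl = cells.T

one = np.ones(12)
lect = -Xgn + Xgl - Xpn + Xpl                  # main-effect contrast for lectures
priv = -Xgn - Xgl + Xpn + Xpl                  # main-effect contrast for private tutoring
inter = Xgn - Xgl - Xpn + Xpl                  # interaction contrast
Xp = np.column_stack([one, lect, priv, inter])

bp, *_ = np.linalg.lstsq(Xp, y, rcond=None)
print(np.round(bp, 3))           # approx [7.47, 1.493, 1.663, 0.08]
print(np.round(2 * bp[1:3], 2))  # effect estimates d1, d2: approx [2.99, 3.33]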
Chapter 4
Regression analysis seeks to characterize the relationship between two continuous variables in
order to describe, predict, or explain phenomena. The nature of the data plays an important
role in interpreting the results of regression analysis. For example, when regression analysis is
applied to experimental data, only the dependent variable is a random variable. The independent
variables are fixed by the researcher and therefore not random. In the context of medical
research, this kind of data can be used to explain how independent variables such as dosage are
related to continuous outcome variables such as blood pressure or the level of glucose in blood.
The experimental design and the scientific theory explaining the mechanism by which the drug affects the dependent variable together support causal claims concerning the effect of changes
in dosage.
With observational data, regression analysis supports predictions of unknown quantities but
the assumption of causality may not be justified or causality may go in the opposite direction. For
example, vocabulary and height among children are correlated but this is likely caused by another
variable that causes both: the child’s age. One can imagine using data from several observatories
to estimate the trajectory of a comet, but in fact it is the actual trajectory that causes the data
collected by the observatories, not vice-versa. Many businesses and other institutions rely on
regression analyses to make predictions. For example, colleges and universities solicit students' scores on standardized tests such as the SAT in order to make enrollment decisions because these scores help predict success in college.
In the social sciences and economics, experimental data are rare. Regression analysis can be
applied to observational data sets as long as appropriate conditions are met (in particular, the
independent variables must not be correlated with estimation errors), and regression analysis
can be used to analyze data sets that are composed of random, independent variables. In
these cases, care must be taken that the presumed regression makes sense from a logical and
theoretical point of view. Regressing incidents of lung cancer on tobacco sales by congressional
district makes sense because smoking may cause cancer. Interpreting the results of such a study
would allow one to make statements such as, “a decrease in tobacco sales by x amount will
result in a decrease in cancer incidence by y amount.” Indeed, this reasoning might motivate tax
policy. However, to speak of increasing the incidence of cancer does not make sense (and were
it somehow possible to do so, it is still doubtful this would then cause more people to smoke).
Regressing tobacco sales on cancer incidence does not make sense because cancer incidence is not
the kind of variable that can be manipulated directly by researchers or society. In certain areas
of the social sciences, a dearth of appropriate theoretical explanations may not allow researchers
to make causal claims at all, although predictions and descriptions are well warranted and useful.
When regression analysis is applied to observational data for the purpose of prediction, the
independent variables are called predictors and the dependent variable is called the criterion.
The theoretical assumption of causality is relaxed in this case, but the regression equation still
has a meaningful interpretation in the context of prediction. For example, economists regressing
income on the number of books owned might conclude that it is reasonable to predict an increase
of x dollars above average income for those who own y more books than average. However, it
is likely in analyses like these that the number of books is a proxy for other factors that might
be more difficult or costly to measure and that presumably cause both book ownership and
income. No one would propose giving out books as policy to eradicate poverty (especially if the
population was illiterate), but the information about books in the home can be used to adjust predictions of income.
Regression analysis is very flexible. One flexibility is that the dependent variable in a re-
gression model can be transformed so that the relationship between independent and dependent
variables is more nearly linear. For example, population growth is often an exponential phe-
nomenon and if population is the dependent variable for a model that is linear in time, then using
the logarithm of the population will likely provide a better-fitting model. Regression analysis is
closely related to correlation analysis, in which both the independent and dependent variables are treated as random.
We take as our example a variation of the tutoring study discussed in the last chapter. Suppose
a tutoring company wanted to research the efficacy of particular tutors in order to generate data
for hiring decisions. The analysis will use the scores of tutors on the standardized test as the
independent variable and the average gain-scores of tutees on the same tests as the dependent
variable. To illustrate this example we use the simulated data presented in Table 4.1.
Table 4.1: Simulated score data for 4 tutors employed by the tutoring company.

Tutor     Tutor score     Average tutee gain-score
A         620             7.33
B         690             8.16
C         720             11.07
D         770             11.94
To proceed with regression analysis, we must assume that the three conditions discussed in
Section 2.1.4 hold for these data. First, we suppose the model of the true relationship between tutor scores and average tutee gain-scores is

Y = Xβ + E,
and we assume that all Ei are independent random variables with normal distributions, where
E(Ei) = 0 and Var(Ei) = σ² for all i. This is equivalent to assuming that Y is a vector of independent normal random variables Yi, each with common variance σ² and with mean given by the corresponding coordinate of Xβ. Recall that β = (β0, β1)T is the vector of unknown parameters.
There are two conventions for defining the design matrix X for linear regression. One option is X = [1 x], where x is the vector of raw scores on the independent variable, in this case the vector of tutor scores, x = (620, 690, 720, 770)T. The other option is to use the centered design matrix X0 = [1 xc], where xc = x − x̄1. We saw in Section 2.1.2 that these matrices produce equivalent results in general, and so we are free to adopt the
centered design matrix X0 for the discussion of regression analyses. Using this design matrix
is convenient for producing vector drawings of relevant subspaces of individual space because
the subspace of individual space spanned by xc (and by the centered columns xci in general) is
orthogonal to V1 .
The fitted model for the tutoring study can now be expressed explicitly:
y = X0 b0 + e,
[ 7.33]   [1  −80]             [ 0.344]
[ 8.16]   [1  −10]   [9.625]   [−1.135]
[11.07] = [1   20]   [0.033] + [ 0.785]
[11.94]   [1   70]             [ 0.006]
We can interpret this fitted model by providing meaning for the estimated parameters in the
vector b0 = (b00 , b01 )T from the given context. In particular, this fitted model gives the overall
mean gain-score of b00 = 9.625 and says that for every point increase in the score of tutors
above the mean of 700, the gain-score of the tutees increases by b01 = 0.033 points. It remains
to determine whether the results are statistically significant. Before answering this important
question using the familiar F statistic, we briefly discuss the geometry of the fitted regression
model.
Since there are only 4 observations of each variable, individual space for this study is R4 .
The model is fitted by projecting y onto the mean space V1 and the model space VXc . Since each
of these spaces is a line, there are 2 remaining dimensions in the error space Ve . Understood
geometrically, the vector b0 indicates that ŷ, the projection of y into the column space of X0 ,
is the sum of the component in mean space (the vector 9.625(1, 1, 1, 1)T ) and the component in
model space (the vector 0.033(−80, −10, 20, 70)T). This is illustrated by the vector diagram in
Figure 4.1.
Figure 4.1: The vector ŷ is the sum of two components: ȳ1 = 9.625 · 1 and ŷc.
These figures give some indication that y and ŷ are close and thus it is plausible that the
fitted model may provide useful information about the relative performance of the tutors. Recall
that the least-squares method for obtaining b0 guarantees the ŷ that minimizes the error vector e.
(Note that in regression contexts, the error vector is often described as the vector of residuals.) As
in the case of ANOVA, we rely on the F statistic to provide a rigorous determination of closeness.
The important hypothesis to test here is whether b01 is different from zero (the null hypothesis is H0 : b01 = 0). We wish to know if the average per-dimension squared length of ŷ is significantly greater than the average per-dimension squared length of e, and whether the average per-dimension squared length of ŷc is significantly greater than the average per-dimension squared length of ec. These
are two different tests and will have different results. The first compares the error between the
regression model and the model that assumes E(Yi) = 0 for all i. The second is a more sensitive test, analogous to the F test introduced in Section 1.4.2. In this second test we are comparing
the full regression model with the model that allows Y to have a non-zero mean. In the first
test, ŷ has 2 degrees of freedom, but in the second test ŷc has only 1 degree of freedom. In
both cases the error vector has 2 degrees of freedom. In either case, the general procedure is the
same: if the F statistic is sufficiently large then we can reject the null hypothesis in favor of the
alternative hypothesis (Hα : b01 = 0.033). The results of these tests follow.
Fb00=b01=0 = (‖ŷ‖²/2) / (‖e‖²/2) ≈ 191.6986/1.0117 ≈ 189.482,   p = 0.00525

Fb01=0 = (‖ŷc‖²/1) / (‖e‖²/2) ≈ 12.837/1.0117 ≈ 12.688,   p = 0.07057
The results of these hypothesis tests are clearly different. The low p-value for the test of the
fitted model against the null model provides support for rejecting the hypothesis that b0 = 0.
However, the second test tells a slightly different story. The hypothesis that b01 = 0 cannot be
rejected at the traditionally accepted level of risk for rejecting a true null hypothesis (5%). However, another often-used level is 10%, and the p-value for the second F test is below this threshold. In some cases, an analyst may decide to reject this null hypothesis in favor of the alternative.
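The centered fit and both F-ratios for this example can be reproduced with the short Python sketch below (an added illustration); the p-values use the F distribution from scipy.

import numpy as np
from scipy.stats import f

x = np.array([620.0, 690.0, 720.0, 770.0])   # tutor scores
y = np.array([7.33, 8.16, 11.07, 11.94])     # average tutee gain-scores

n = len(y)
xc = x - x.mean()
X = np.column_stack([np.ones(n), xc])        # centered design matrix [1 xc]

b, *_ = np.linalg.lstsq(X, y, rcond=None)    # approx (9.625, 0.033)
yhat = X @ b
e = y - yhat
yhat_c = yhat - y.mean()

F_full = (yhat @ yhat / 2) / (e @ e / 2)         # test of b0 = b1 = 0
F_slope = (yhat_c @ yhat_c / 1) / (e @ e / 2)    # test of b1 = 0
print(round(F_full, 2), round(f.sf(F_full, 2, 2), 5))    # approx 189.5, 0.00525
print(round(F_slope, 2), round(f.sf(F_slope, 1, 2), 5))  # approx 12.69, 0.07057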
It is worth noting that because the two columns of the design matrix are orthogonal (1 · xc =
0), the coefficients b00 and b01 can be tested independently. Thus, we can independently test the
hypothesis that µY is zero by comparing the squared length of the other component of ŷ, namely
ŷ − ŷc = ȳ1 = b00 1 (see Figure 4.1), with the squared length of the error vector e:

Fb00=0 = (‖b00 1‖²/1) / (‖e‖²/2).
We will see that in multiple regression analyses the columns of X are not always orthogonal and
thus do not afford independent tests of each predictor. This is one of the primary differences
between the columns of ANOVA design matrices, which almost always are orthogonal, and the columns of regression design matrices, which often are not.
To conclude our discussion of simple regression, we consider the contribution of the geometry
of variable space and compare this with the geometry of individual space presented above.
Figure 4.2: A scatterplot for the simple regression example showing the residuals, the difference between the observed and predicted values for each individual. (Plot title: Scatterplot of Tutor and Tutee Scores; axes: tutor score and average tutee gain-score.)
Recall that scatterplots can be used to represent data sets in variable space; in the case of
simple regression, there are just two variables and so scatterplots often provide convenient
representations of the data. The scatterplot corresponding to Table 4.1 is provided in Figure
4.2, and shows the residuals, the (vertical) differences between each data point and the line
representing the least-squares estimates of gain-scores for all tutor scores. The error vector in
Figure 4.1 under the natural coordinate system of individual space is the vector of residuals.
The simple regression example also provides an opportunity to discuss how linear models can
Table 4.2: Modified data for 4 tutors and log-transformed data.
be extended with link functions. For example, suppose Tutor C had an average tutee gain-score
of 9.77 instead of 11.07. In this case, transforming the dependent variable by taking the natural
logarithm produces a data set that is more nearly linear than the untransformed data (see
Table 4.2). The scatterplots with the least-squares estimation lines for the transformed and untransformed data are shown in Figure 4.3.
Figure 4.3: Least-squares estimate and residuals for the transformed and untransformed data.
4.2 Correlation analysis
In the most basic sense, correlation is a measure of the linear association of two variables. The
conceptual distinction between cause-effect relationships and mere association did not long pre-
cede the development of a statistical measure of this association. In the middle of the nineteenth
century, the philosopher John Stuart Mill recognized the associated occurrence of events as a
necessary but insufficient criterion for causation (Cook and Campbell, 1979). His philosophical
work set the stage for Sir Francis Galton, who defined correlation conceptually, worked out a
mathematical theory for the bivariate normal distribution by 1885, and also observed that corre-
lation must always be less than 1. In 1895, Karl Pearson built on Galton’s work, developing the
most common formula for the correlation coefficient used today. Called the product-moment
correlation coefficient or Pearson's r, this index of correlation can be understood as the dot product of two centered and normalized variable vectors.
In variable space, the correlation coefficient can be understood as an indication of the linearity
of the relationship between two variables. It answers the question: How well can one variable be predicted by a linear function of the other? Panels of scatterplots are frequently presented to illustrate what various values of correlation might
look like (Figure 4.4). These diagrams may lead to the misconception that the correlation tells
more about a bivariate relationship than in fact it does (for example, see Anscombe, 1973). In
practice, it is often difficult to estimate the correlation by looking at a scatterplot or given the
correlation, to obtain a clear sense of what the scatter plot might look like.
In individual space, however, the correlation between two variables has the simple interpreta-
tion of being the cosine of the angle between the centered variable vectors. Much of the power of
the vector approach is derived from this straightforward geometric interpretation of correlation.
This relationship is clear when we consider the 2-dimensional subspace spanned by the centered
variable vectors (Figure 4.5). Given the centered vectors xc and yc , a right triangle is formed
Figure 4.4: (a) Panels of scatter plots give an idealized image of correlation, but in practice, (b) plots with the same correlation can vary quite widely. (Panels show r = −1, r = 0, r = 1 and r = 0.3, r = −0.6, r = 0.9.)
by yc , the projection of yc onto xc (called ŷc ), and the difference between these vectors, yc − ŷc .
The cosine of an angle in a right triangle is the ratio of the adjacent side and the hypotenuse.
When the lengths of xc and yc are 1, the cosine of the angle between them is simply the length
of the projection ŷc, the quantity xc · yc. More generally, the cosine of the angle between two centered vectors is

cos(θxc yc) = ‖Projyc xc‖ / ‖xc‖ = ((xc · yc)/‖yc‖²) ‖yc‖ / ‖xc‖ = (xc · yc)/(‖xc‖ ‖yc‖) = r.   (4.2)
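A small Python check of this identity (added for illustration, with arbitrary simulated data): the cosine of the angle between the centered vectors matches Pearson's r as computed by a library routine.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(size=50)      # arbitrary correlated data for illustration

xc = x - x.mean()
yc = y - y.mean()
cos_angle = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(cos_angle, r))        # True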
Correlation analysis usually involves using r to estimate the parameter called Pearson's ρ, which is defined by ρ = σXY/(σX σY), where σXY is the covariance of the random variables X and Y.
Both variables are assumed to be normal and to follow a bivariate normal distribution. The
Figure 4.5: The vector diagram illustrates that rxc yc = cos(θxc yc) and that rxc yc0 = − cos(π − θxc yc0).
most common hypothesis test is whether or not the correlation of two variables is significantly
different from zero. The test statistic used for this test follows Student’s t distribution (see
Section 6.4), which when squared is equivalent to the F-ratio for the same hypothesis test with 1 and n − 2 degrees of freedom.
For example, consider the variable S defined as the average total scores on state assessments
per district of eighth graders in Massachusetts at the end of the 1997-1998 school year and the
variable I defined as the per capita income in the same school districts during the tax year 1997.
A random sample of 21 districts yields a sample correlation between S and I of rSI = 0.7834.
This suggests there may be a moderate positive linear association between the variables. We
want to test whether this correlation is significantly different from 0 because a correlation of
0 means there is no association. As in the last section, first we establish the null hypothesis,
and compute the probability of obtaining the observed correlation. If it is low, we reject the
null hypothesis and conclude that the correlation of the variables is positive.
(rXY √(n − 2)) / √(1 − rXY²) = (0.7834 · √(21 − 2)) / √(1 − 0.7834²) = 5.4942.
This number is the t value corresponding to the observed correlation. The probability of obtaining a test statistic at least this high can be computed as ∫_{5.4942}^{∞} f(t) dt ≈ 10⁻⁵, where f(t) is the density of the t distribution with n − 2 = 19 degrees of freedom. Because the probability is well below the standard threshold of 0.05, we reject the null hypothesis in favor
of the alternative hypothesis, concluding that there is a positive correlation between districts’
per capita income and the average total score on the eighth grade state exam.
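The test statistic and p-value above can be reproduced in a few lines of Python (an added illustration); the correlation and sample size are taken from the example, and scipy supplies the t distribution.

import numpy as np
from scipy.stats import t

r, n = 0.7834, 21
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p_one_sided = t.sf(t_stat, df=n - 2)    # upper-tail area under t with n - 2 df
print(round(t_stat, 2))                 # approx 5.49
print(p_one_sided)                      # approx 1e-05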
Although correlation plays an important role in research, it frequently does not give the
most useful information about a data set. Fisher (1958) wrote, “The regression coefficients are
of interest and scientific importance in many classes of data where the correlation coefficient, if
used at all, is an artificial concept of no real utility” (p. 129). Correlations are easy to compute
but often hard to interpret. Even correct interpretations might not answer the instrumental
question at the heart of most scientific research, How do manipulations of one variable affect
another? In many cases, social scientists and market analysts are content to avoid addressing
causality and instead answer the different question: How do variations in some variables predict variations in others?
Chapter 5
In the last chapter, we saw that one flexibility of regression analysis is that variables can be
transformed via (not necessarily linear) functions and in this way used to model non-linear phe-
nomena. Another flexibility is the option to use more than one continuous predictor. Regression
models that include more than one independent variable are called multiple regression models.
We begin with a two-predictor multiple regression model using data from the Massachusetts
Comprehensive Assessment System and the 1990 U.S. Census on Massachusetts school districts
in the academic year 1997-1998. (This data set is included with the statistical software package
R.)
Given data from all 220 Massachusetts school districts, suppose Y denotes the per-district
average total score of fourth graders on the Massachusetts state achievement test, and suppose
X1 and X2 denote the student-teacher ratio and the per-capita income, respectively. We want to predict the district average total score (Y) given the student-teacher ratio and the per-capita income (X1 and X2). We hypothesize that a higher student-teacher ratio will be predictive of a
lower average total score and that greater per-capita income will be predictive of higher average
total scores. The first 5 data points are shown in Table 5.1.
The model for multiple regression with two predictors is similar to a two-way ANOVA model;
the only difference is that the vectors x1 and x2 contain measurement data instead of numerical
Table 5.1: Sample data for Massachusetts school districts in the 1997-1998 school year. Source:
Massachusetts Comprehensive Assessment System and the 1990 U.S. Census.
tags for specifying factor levels and contrasts. The design matrix has three columns: X =
[1 x1 x2 ]. The first 5 data points from Table 5.1 are shown below.
y = Xb + e,

[714]   [1  19.0  16.379]                 [ 9.673]
[731]   [1  22.6  25.792]   [699.657]     [15.967]
[704] = [1  19.3  14.040]   [ −1.096]  +  [ 3.642]
[704]   [1  17.9  16.111]   [  1.557]     [−1.116]
[701]   [1  17.5  15.423]                 [−3.483]
  ⋮           ⋮                               ⋮
As always, the goal of the analysis is to find the b = (b0, b1, b2)T so that the squared length of the error vector e is as small as possible. In variable space, the residuals ei are the vertical distances between the ith data point and the regression plane in a three-dimensional scatterplot; this generalizes the two-variable picture (see Figure 4.2) and is illustrated in Figure 5.1. The measure of closeness is the sum of the squared lengths of the vertical error components, and the scatterplot provides a rough sense of how well the model fits the data.
Figure 5.1: The data are illustrated with a 3D scatter plot that also shows the regression plane and the error component for the prediction of district mean total fourth grade achievement score (e2) in the Acton, MA district. (Axes: per capita income in thousands, student-teacher ratio, and district mean 4th grade total score.)
To make use of the second geometric interpretation (a diagram of vectors in individual space),
it is necessary to center the data vectors x1 and x2 . The new design matrix is X0 = [ 1 xc1 xc2 ],
and the corresponding model is y = X0 b0 + e. As noted before, this does not change the model
space because C(X) is equal to C(X0). In general, b0i = bi when i ≠ 0, but the first coefficients in the two models are not necessarily equal. The value b0 denotes the intercept, or the expected value of Y when X1 = X2 = 0. The value of b00 instead denotes the mean value ȳ, which is also the expected value of Y when X1 = x̄1 and X2 = x̄2.
y = X0b0 + e,

[714]   [1.000  1.656  −2.368]               [ 9.673]
[731]   [1.000  5.256   7.045]   [709.827]   [15.967]
[704] = [1.000  1.956  −4.707]   [ −1.096] + [ 3.642]
[704]   [1.000  0.556  −2.636]   [  1.557]   [−1.116]
[701]   [1.000  0.156  −3.324]               [−3.483]
  ⋮               ⋮                              ⋮
We can check by inspection that the error vectors in both models are the same (recall that this
follows from the orthogonality of mean space and error space). By considering the geometric
representation of the models in individual space, we can see that the models are simply two
different ways to write the explanatory portion of the model, Xb = X0 b0 . These two ways of
writing the model essentially correspond to different choices of basis vectors for model space. In the centered form,

y = ȳ1 + xc1 b1 + xc2 b2 + e,
y − ȳ1 = yc = xc1 b1 + xc2 b2 + e.
Because the vectors xc1 and xc2 are both orthogonal to 1, by choosing the centered representa-
tion, it is possible to view only the vectors orthogonal to 1 in the vector diagram, leaving enough
dimensions to show the relationship among yc and the vectors xc1 and xc2 under the constraints
of a two-dimensional figure (see Figure 5.2). We can create similar diagrams schematically with a greater number of predictors by representing hyperplanes using planes or lines. In vector
diagrams, the measure of fit is the squared length of the error vector. The representation in
a single entity, the error vector e, of all of the error across the entire data set is one of the
strengths of vector diagrams and the geometric interpretation afforded by individual space.
Figure 5.2: The geometric relationships among the vectors yc, xc1, and xc2.
After representing the multiple regression solution in these two ways, it remains to determine
if the vector ŷ = X0 b0 actually provides a better prediction of y than chance. As before, there
are two competing hypotheses to consider. On the one hand, the null hypothesis states that b01 = b02 = 0, so that the predictors carry no information about Y; on the other hand, the alternative hypothesis states that at least one of these coefficients is not zero.
As before, we use the F -ratio to compare the estimate of the variance of Y obtained from the
average per-dimension squared length of the error vector with the estimate of the variance of Y
obtained from the average per-dimension squared length of ŷ. Under the null hypothesis, these
estimates should be fairly close and the F -ratio should be small. On the other hand, we reject
the null hypothesis if the F -ratio is large—if the estimates of the variance of Y by projecting y
into model space and into error space are significantly different. In the case of this model, we
have

Fb01=b02=0 = (‖ŷc‖²/2) / (‖ec‖²/217) ≈ 10402.5/135.042 ≈ 77.03,   p = 0.00000.
Based on this analysis, we can reject the null hypothesis and tentatively conclude that ŷ is
close to Y. Note that the very small (and likely non-zero) p-value is due to the high degrees
of freedom in the denominator (see Section 6.3). It is important to stress, however, that one
cannot rule out the possibility that other models provide even better predictions of Y. In the
following sections, we will see extensions of this example that illustrate this point.
One way to measure overall model fit is the generalization of the correlation coefficient, r, called
the multiple correlation coefficient; it is denoted R. In terms of the geometry of individual space,
we saw that the notation rxy indicates the cosine of the angle between the (centered) vectors x
and y. Multiple correlation is the correlation between the criterion variable and the projection
of this variable onto the span of the predictor variables. We thus define

R = cos(θyc ŷc) = (yc · ŷc)/(‖yc‖ ‖ŷc‖),

where ŷc is the projection of yc onto the space C([xc1, . . . , xcn]).
A simple geometric argument is sufficient to justify the role of the squared multiple correlation
coefficient, R2 , as the measure of the proportion of the variance of the criterion yc that is
explained by the model in question. Consider the triangle formed by yc , ŷc , and the error vector
e. Since ŷc and e are orthogonal, the Pythagorean theorem provides the following equation:

‖yc‖² = ‖ŷc‖² + ‖e‖².

The squared correlation is the squared cosine and must therefore be ‖ŷc‖²/‖yc‖². Now a frequently-used estimator for the variance of a random variable vector Y is the per-dimension squared length of yc (see equation 1.3.4). Thus, the ratio of the variance explained by the model (‖ŷc‖²/(n − 1)) to the sample variance (‖yc‖²/(n − 1)) is simply the ratio ‖ŷc‖²/‖yc‖². This equality demonstrates
that R2 can be used to assess model fit. For example, if two models are proposed, then the one
with the greater R2 value is said to be the model with the better fit.
In the present example, the squared length of the vector ŷc is 20805.1 and the squared length of the error vector is 29304.1, so R² = 20805.1/(20805.1 + 29304.1) ≈ 0.42. Thus the model, which uses the student-teacher ratio and the district average per-capita income to predict district average total MCAS scores in the fourth grade, explains a little more than 40% of the observed variance in scores.
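The quantities discussed in this chapter (coefficients, the F-ratio, and R²) can be computed with a generic routine such as the Python sketch below (an added illustration). The MASchools data ship with R rather than Python, so the loading step is only indicated in a comment; the file name and column labels there are assumptions, not part of the original text.

import numpy as np

def fit_and_summarize(y, predictors):
    """Centered least-squares fit; returns coefficients, F-ratio, and R^2."""
    n = len(y)
    p = predictors.shape[1]
    Xc = predictors - predictors.mean(axis=0)      # centered predictor columns
    X = np.column_stack([np.ones(n), Xc])          # design matrix [1 xc1 ... xcp]
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ b
    e = y - yhat
    yhat_c = yhat - y.mean()
    yc = y - y.mean()
    F = (yhat_c @ yhat_c / p) / (e @ e / (n - p - 1))
    R2 = (yhat_c @ yhat_c) / (yc @ yc)
    return b, F, R2

# Hypothetical usage, assuming the MASchools data have been exported from R to a CSV:
# data = np.genfromtxt("MASchools.csv", delimiter=",", names=True)
# b, F, R2 = fit_and_summarize(data["score4"],
#                              np.column_stack([data["stratio"], data["income"]]))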
We turn now to the task of interpreting the model coefficient vector b. The first coefficient is
the easiest to deal with; it is simply the value of y when the predictors are both 0. The meaning
of the first coefficient is slightly different for the centered and non-centered models. One must
decide if the data and the meaning of the variables allow a prediction of y when both predictors
are zero (i.e., when x1 = x2 = 0). However, in centered models, xc1 = xc2 = 0 when x1 = x̄1 1 and x2 = x̄2 1. It follows that b00 is the model prediction of Y in the case that both predictors take their mean values.
One might be tempted to interpret the regression coefficients b01 , and b02 in the last example
as the respective changes produced in the mean total score per unit change in the student-
teacher ratio and per capita income. However, it is quite possible that other variables are
actually responsible for the change in average test score and only happen to be associated with
the variables in the model as well. For example, the cock's crow precedes and is highly correlated with the sunrise, but no one would claim that the crowing causes the sun to rise.
In ANOVA designs with 2 or more factors, the orthogonality of the factors means we can
interpret each model coefficient independently. In regression analyses, however, the predictors
are often correlated. This means that regression coefficients cannot be interpreted without
considering the whole model. Suppose, for example, that the predictors x1 and x2 are correlated; the relationship can certainly be expressed as x2 = x1 + z for some vector z. Then in the model

y = b1 x1 + b2 x2,

if x1 is increased by 1 while x2 is held fixed, we have

y0 = y + b1 1,

but if x2 changes along with x1 (as the relationship x2 = x1 + z suggests it may), the prediction changes by (b1 + b2)1 instead.
The best interpretation of the regression coefficient bi when predictors are correlated is as
the increase in the criterion variable per unit increase in the predictor variable, holding all other
variables constant. In applied contexts, however, this might not actually make much sense.
For example, it might not be feasible for a school district to hire more teachers (decreasing the student-teacher ratio) without a larger per capita income and concomitantly higher tax revenue to pay for them.
Without careful design and theoretical support for the variables included in the model and
reasonable confidence that these variables are the only relevant ones, multiple regression offers
at best prediction, with regression coefficients that have little value for generating explanations of the phenomena under study.
Multiple regression analyses can have features that seem counter-intuitive or paradoxical on
the first take and the explanation of many of these is greatly facilitated by vector diagrams of
individual space and the associated geometric orientation. This section elucidates four of these features.
We have already discussed the caution required when interpreting regression coefficients
in a model that has non-orthogonal predictors. When predictors in a model are orthogonal,
the coefficients can be interpreted independently of one another. Moreover, as we saw in the
section on factorial contrasts (see Section 3.2.2), the
associated regression coefficients can be tested independently for significance. Experimental
design allows the researcher to ensure that predictors are orthogonal, but this is rarely the
case in observational research. What is more important than whether or not two predictors
are orthogonal is how close the predictors are to being orthogonal. Correlation, especially
when understood in relation to the angle formed between two centered vectors in individual
space, provides just such a measure of closeness.
To examine the impact of near orthogonality and its absence, consider three models of the
mean total fourth grade scores for Massachusetts districts, summarized in Table 5.1. The first
model is
$$y = [\,1\ x_{c1}\ x_{c2}\,]b + e$$
and was presented above: the variables of student-teacher ratio and per capita income predict
mean total fourth grade achievement. The second model is simpler, using only per capita income
as a predictor:
$$y = [\,1\ x_{c2}\,]b' + e.$$
The third model is like the first, but uses the percentage of students eligible for free or reduced
lunch in place of the student-teacher ratio.
The estimates of the regression coefficient b02 from each model are presented in Table 5.2 along
with the squared length of the corresponding projection of y onto xc2 , the associated degrees of
freedom, and the corresponding F -ratios.
One important observation to be made about the regression coefficient for the per capita
income variable is that it seems quite similar (yet not identical) in the first two models and
very different in the third model. It is outside the scope of this thesis to describe how one
decides whether differences in these estimates are significant. However, calculating the p-values
assures us that all three F -ratios are significantly different from 0 and, using statistical methods
outside the scope of the present discussion, the estimates from the first and second models are not
significantly different, but the third estimate is significantly different from the first two. Recall
that the independence of predictors in an orthogonal design implies that the inclusion of other
variables in the model would not affect the estimates of the coefficients already in the model. In
three different orthogonal models containing x2 and different other variables as predictors, all
three of the estimates for b02 would be identical.

Table 5.2: The value of the regression coefficient for per capita income and corresponding F -
ratios in three different models of mean total fourth grade achievement.
The correlations among the variables x1 , x2 , and x3 cause differences in the estimates for b02
reported in Table 5.2. The reason that the first two estimates are similar is that student-teacher
ratio and per capita income are not highly correlated (r = −0.157). This implies that the
corresponding centered vectors for these variables, xc1 and xc2 , form a 99◦ angle in individual
space. Geometrically, we can see that they are fairly close to orthogonal. On the other hand,
the per capita income and the percentage of students eligible for free or reduced lunch are more
highly correlated variables with r = −0.563, as one might expect. The angle formed between
xc2 and xc3 is therefore about 124◦ , considerably farther from orthogonal.
The vector diagrams of the three model subspaces in Figure 5.3 illustrate these ideas clearly.
The Pythagorean theorem guarantees that whenever the predictors are mutually orthogonal,
the sum of the squared lengths of the error vector and each projection of y onto the subspaces
corresponding to individual predictors is equal to the squared length of the observation vector.

Figure 5.3: The vector diagrams of VXc1 and VXc3 suggest why the value of the coefficient b2
varies between models. (a) The vector yc and model space VXc1 . (b) The vector yc and model space VXc3 .
When predictors are nearly orthogonal, this additivity and independence are to some
extent maintained: the inclusion of the student-teacher ratio variable did not significantly
change the estimate of b02 . Whenever the predictors are not orthogonal, the additivity property
fails and individual projections are no longer the same as the contribution of a predictor to the
overall model. For this reason, interpreting the regression coefficients for variables in models
of observational data is difficult; the inclusion or exclusion of a correlated variable can have
large consequences for the estimation of these coefficients. The safer approach is to use the
whole model rather than attempting to interpret the regression coefficients. This is especially
appropriate when the model is being used for prediction rather than explanation.
At the other extreme from orthogonality lies collinearity or multicollinearity, the feature of
linear dependence among a set of predictors. Multicollinearity is easy to define (some nontrivial
linear combination of the xi equals zero), and collinearity is the special case involving only two
predictors. Exact multicollinearity is often easy to identify and fix: in studies using observational
data, it frequently means that the analyst has inadvertently included redundant variables, such
as the sub-scores and the total score on an instrument. More problematic is near multicollinearity,
in which the columns of the design matrix X are linearly independent, yet the analysis results
in unstable (and therefore likely misleading) conclusions.
Thinking of the predictors as vectors in individual space, we know from linear algebra that a linearly independent set of p vectors
spans a p-dimensional space. In a set of predictors that is nearly collinear, there is at least one
vector xi that is close to the subspace spanned by the remaining predictors in the sense that the
angle between xi and the projection of xi into the space spanned by the rest of the vectors is
small. In practice, this suggests that we can detect near multicollinearity by regressing xi onto
the set of the rest of the predictors for each i and checking for good fit. If the fit is good, then
including xi in the set of predictors xj , j 6= i, may not be justified because it likely adds little
new information.
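A sketch of this check in R follows. The data here are simulated for illustration (not the MCAS variables), with x3 constructed to be nearly a linear combination of x1 and x2; the resulting R² values are equivalent to variance inflation factors.

```r
# Regress each predictor on the others and inspect the R^2 (near 1 signals trouble).
set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 - x2 + rnorm(n, sd = 0.05)   # nearly a linear combination of x1 and x2
X  <- data.frame(x1, x2, x3)

r2 <- sapply(names(X), function(k) {
  fit <- lm(reformulate(setdiff(names(X), k), response = k), data = X)
  summary(fit)$r.squared
})
r2                 # each predictor regressed on the others: values near 1 indicate redundancy
1 / (1 - r2)       # the corresponding variance inflation factors
```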
It is worth noting that pairwise correlations alone cannot be relied upon to detect this problem:
multicollinearity is possible even if all of the variables are only moderately pairwise
correlated. As we will see, the MCAS variables we examine in this chapter are not related in
this way, but it is not hard to create a hypothetical data set that has high multicollinearity
but in which no pair of predictors are highly correlated. Let xc1 = (1, 0, −0.5, −0.5)T , xc2 =
(0, 1, −0.5, −0.5)T , and xc3 = (1, −1, 0.05, −0.05)T . Then all three pairwise correlations are
moderate: r1,2 = 0.333, r2,3 = −0.577, and r1,3 = 0.577. However, xc3 is very close to the span
of xc1 and xc2 . For example, the vector v = xc1 − xc2 = (1, −1, 0, 0)T is very close to xc3 ; their
correlation is about 0.999, corresponding to an angle of less than 3◦ .
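These numbers are easy to verify in R (the vectors are already centered, so Pearson correlation equals the cosine of the angle):

```r
# Checking the hypothetical example above.
xc1 <- c(1,  0, -0.5,  -0.5)
xc2 <- c(0,  1, -0.5,  -0.5)
xc3 <- c(1, -1,  0.05, -0.05)

cor(cbind(xc1, xc2, xc3))     # moderate pairwise correlations (0.333, 0.577, -0.577)

v <- xc1 - xc2                # a vector in the span of xc1 and xc2
cor(v, xc3)                   # about 0.999: xc3 is nearly in that span
acos(cor(v, xc3)) * 180 / pi  # the angle between v and xc3, in degrees (< 3)
```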
Another geometric way to think about near multicollinearity is to consider the parallelepiped
defined by the set of normalized and centered predictor vectors. Let $u_i = \frac{x_{ci}}{|x_{ci}|}$; then the
generalized volume of the parallelepiped defined by the set of vectors {ui : 0 < i ≤ n} is given
by
$$\begin{vmatrix} u_1\cdot u_1 & u_1\cdot u_2 & \cdots & u_1\cdot u_n \\ u_2\cdot u_1 & u_2\cdot u_2 & \cdots & u_2\cdot u_n \\ \vdots & \vdots & \ddots & \vdots \\ u_n\cdot u_1 & u_n\cdot u_2 & \cdots & u_n\cdot u_n \end{vmatrix}^{1/2}.$$

Figure 5.4: The vectors xc1 , xc2 , and xc3 are moderately pairwise correlated but nearly collinear.
(a) The vectors xc1 , xc2 , and xc3 are moderately pairwise correlated. (b) The vector xc3 is very close to V[xc1 xc2 ] .
If n = 1, then this is just the length of u1 (which is always 1), and when n = 2, the generalized
volume is simply the area of the parallelogram with vertices at the origin, u1 , u2 , and u1 + u2
(see Figure 5.5). When u1 and u2 are orthogonal, the area of the parallelogram is 1. As these
vectors approach collinearity, the area approaches 0. The same relationship holds when we
extend to higher dimensions: if the predictors are orthogonal, then the generalized volume of
the parallelepiped is 1, and the volume approaches 0 as the predictors approach
multicollinearity.
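A short R sketch computes this generalized volume for the hypothetical vectors above; the function name gen_volume is mine, not part of the DataVectors program.

```r
# The generalized volume: the square root of the determinant of the Gram matrix
# of the normalized predictor vectors.
gen_volume <- function(...) {
  U <- sapply(list(...), function(x) x / sqrt(sum(x^2)))  # normalize each vector
  sqrt(det(crossprod(U)))                                 # det(U'U)^{1/2}
}

xc1 <- c(1,  0, -0.5,  -0.5)
xc2 <- c(0,  1, -0.5,  -0.5)
xc3 <- c(1, -1,  0.05, -0.05)

gen_volume(xc1, xc2)        # two moderately correlated vectors: volume well above 0
gen_volume(xc1, xc2, xc3)   # nearly collinear set: volume close to 0
```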
The regression approach described above can also be used to identify which vectors in the set are problematic. One merely uses regression p times
and in each regression analysis, the set {xi : i ≠ k}, 1 ≤ k ≤ p, is used to predict xk . If the
angle between xk and x̂k (which is in the span of {xi : i ≠ k}) is small, then we know that
xk likely adds little new information to the set of predictors that excludes xk . Considerations
of this kind guide decisions about which predictors to retain in the model.

Figure 5.5: The generalized volume of the parallelepiped formed by the set of vectors {ui : 0 <
i ≤ n} is equivalent to length in one dimension, area in two dimensions, and volume in three
dimensions.
The spending per pupil and the spending per ‘regular’ pupil (not including occupational,
bilingual, or special needs students) are two variables in the MCAS data set that provide a good
example of near multicollinearity. As we might expect, these variables are highly correlated
(r = 0.966) and comparing the models for the student-teacher ratio using each predictor alone
and the model with both predictors illustrates why multicollinearity is problematic.
Let xreg indicate the observed spending per regular pupil in each district and xall the observed
spending per pupil. The three models can then be expressed as y′ = [ 1 xreg ]b + e, y′ = [ 1 xall ]b + e,
and y′ = [ 1 xall xreg ]b + e, where y′ is the observed student-teacher ratio. The estimates for
breg and ball in these models are summarized in Table 5.3. All three models explain about a
quarter of the variation in the student-teacher ratio (R2 ), and the overall fit of each is significant
(p-value < 0.05). However, there is not much increase in
explanatory power in the model with both predictors over the models with just one predictor,
especially over the model using spending per regular student. The changes in the regression
coefficients are also noteworthy. Both spending per pupil and spending per regular pupil on
their own contribute negatively to the student-teacher ratio. (This makes sense because low
student-teacher ratios cost more per pupil.) However, in the third model the magnitude of the
coefficient of pupil spending is halved and that of regular-pupil spending is doubled. Even
more unexpected is the change in the sign for ball : in the third model this coefficient is positive,
implying that as spending per pupil increases the student-teacher ratio increases. This coefficient
estimate is an artifact of the instability caused by multicollinearity.

Figure 5.6: The linear combination of xreg and xall equal to ŷ′ (the projection of y′ into
V[xall xreg ] ) must include a term with a positive sign.
Table 5.3: The effect of multicollinearity on the stability of regression coefficients.

These results might seem paradoxical: why would two good predictors of the student-
teacher ratio not be even better when used together? The geometry of individual space makes
the reason obvious. The vectors xall and xreg point in almost the same direction. Since the
correlation between these vectors is 0.966, the angle between them is only 15◦ . To move suffi-
ciently in the direction orthogonal to this (so as to reach the projection of y′ in the model space)
requires a multiple of one of the predictors that goes so far in the first direction that it must be
corrected with a predictor with the wrong sign (see Figure 5.6).
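A small simulation in R (not the MCAS data) reproduces this kind of behavior: two nearly identical predictors that each relate negatively to the criterion can yield unstable joint estimates, sometimes with a sign change.

```r
# A simulated illustration of coefficient instability under near multicollinearity.
set.seed(3)
n     <- 200
x_all <- rnorm(n)
x_reg <- x_all + rnorm(n, sd = 0.1)      # nearly identical to x_all
y     <- -1 * x_all + rnorm(n)           # the true relationship is negative

cor(x_all, x_reg)                        # close to 1
coef(lm(y ~ x_all))                      # negative coefficient on its own
coef(lm(y ~ x_reg))                      # negative coefficient on its own
coef(lm(y ~ x_all + x_reg))              # jointly, the estimates are unstable and
                                         # the smaller coefficient may flip sign
```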
We saw that including predictors that are highly correlated with each other is counterpro-
ductive even when each is a good predictor of the criterion variable. It is perhaps paradoxical
that including a predictor that is nearly orthogonal to the criterion variable (and hence a very
poor predictor of the criterion on its own) can actually improve the prediction considerably. Such a
variable is called a suppressor variable, and the reason that suppressor variables function as they
do is made clear by the geometry of individual space.
Let y denote the observed percentages of students eligible for free or reduced lunch in each
school district and xpercap denote the observed average per capita income in each district. The
total spending per pupil xall has a very low correlation with the percentage of students eligible
for free or reduced lunch (r = 0.07) and on its own explains less than 1% of the variation. However,
when it is added to the model using xpercap to predict y, a much better prediction is achieved. The
model y = [ 1 xpercap ]b + e is contrasted with the model y = [ 1 xpercap xall ]b + e in Table 5.4.
What is striking about this example is that xall on its own predicts essentially nothing of
the percentage of students eligible for free or reduced lunch. However, adding it to the model
that uses xpercap significantly improves the prediction. From the geometry, we can see that the
plane spanned by xpercap and xall is much closer to y than the line generated by xpercap alone.
Given such a plane, any second vector that together with xpercap spans the plane provides an
improved prediction of the criterion, even when that second vector happens to be orthogonal to
the criterion variable.

Table 5.4: Suppressor variables increase the predictive power of a model although they them-
selves are uncorrelated with the criterion.
Figure 5.7: The arcs in the vector diagram indicate angles for three kinds of correlation between
y and x2 : the angle θp corresponds to the partial correlation conditioning for x1 ; the angle θs
corresponds to the semi-partial correlation with x2 after controlling x2 for x1 ; and the angle θyx1
corresponds to Pearson's correlation, ryx1 .
In regression analyses involving two or more predictors, it is often useful to examine the
relationship between a criterion variable and one of the predictors after taking into account the
relationship of these variables with the other predictors in the model. For example, suppose
the correlation between children’s height and intelligence is found to be quite high and one
is tempted by the dubious hypothesis that tall people are more intelligent. By using a third
variable, age, which is also known to be correlated with both height and intelligence, we would
like to examine the correlation between height and intelligence after taking into consideration
the age of the participants. This is a simple example of statistical control and is described
as conditioning for the effect of some variable(s). It is motivated by the idea of experimental
control in which randomness removes all of the differences between the treatment and control
groups except the variables being studied. The statistic that encodes the correlation between
two variables while conditioning for others is called the partial correlation coefficient.
In order for conditioning to be a valid procedure, we must check the implicit assumptions
about the causal relationship among the variables involved. In particular, by conditioning we
assume that the correlation between the conditioning variable(s) and each of the two variables
of interest arises because the conditioning variable(s) affects each of the variables to be correlated.
Returning to our example, we assume that the correlations between age and height and between
age and intelligence are entirely
due to the causal process of maturation; one expects that as children get older they also grow
taller and become more intelligent. We would likely find in this hypothetical example that after
accounting for the ages of children the remaining association between height and intelligence
would be quite small and most probably due to chance rather than any true relationship. In
this way, conditioning can be used to identify cases of so-called spurious correlation in which
two variables are highly correlated only because they are both correlated to a third variable,
often a common cause. However, care must be taken with analyses of partial correlation because
in cases where the assumption of causality is not warranted, the partial correlation coefficients
can be misleading.
The partial correlation between the variables x1 and x2 , conditioning for the variable x3 is
written rx1 x2 .x3 or simply r12.3 . Partial correlation is best understood using the geometry of
individual space. Given the predictors x1 and x2 , and the predictor x3 (whose effects we are
controlling for), partial correlation of x1 and x2 controlled for x3 is simply the correlation of
x1⊥3 = x1 − projx3 x1 and x2⊥3 = x2 − projx3 x2 . (Note that we extend this notation to centered
vectors as well, writing, for example, x1⊥3c = xc1 − projxc3 xc1 .)
Thus we have the following definition which depends on intuition from individual-space
geometry:
$$r_{12.3} = \cos(\theta_{x_{1\perp 3c}\, x_{2\perp 3c}}) = \frac{x_{1\perp 3c} \cdot x_{2\perp 3c}}{|x_{1\perp 3c}|\,|x_{2\perp 3c}|}.$$
This definition yields a statistic that has the same relationship to geometry as correlation: the
cosine of the angle between two centered vectors in individual space. In the following computations
we take all vectors to be the corresponding centered vectors. Expanding the dot product and the
norms shows that this geometric definition agrees with the standard formula
$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}},$$
since we have $x_i \cdot x_j = \|x_i\|\|x_j\|\,r_{ij}$ from the definition of correlation and $\|x_{i\perp j}\|^2 = (1 - r_{ij}^2)\|x_i\|^2$
by the Pythagorean theorem. We conclude that the standard definition for partial correlation
can be derived from the definition inspired by geometric intuition of individual space.
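The equivalence can be illustrated in R. The sketch below uses simulated variables (not the thesis's data) and computes the partial correlation once as the correlation of residual vectors and once from the standard formula.

```r
# Partial correlation two ways: residual vectors versus the standard formula.
set.seed(4)
n  <- 50
x3 <- rnorm(n)
x1 <- 0.6 * x3 + rnorm(n)
x2 <- 0.4 * x3 + rnorm(n)

# Geometric version: project x3 (and the mean) out of x1 and x2, then correlate the residuals.
r_geo <- cor(resid(lm(x1 ~ x3)), resid(lm(x2 ~ x3)))

# Standard formula in terms of the pairwise correlations.
r12 <- cor(x1, x2); r13 <- cor(x1, x3); r23 <- cor(x2, x3)
r_formula <- (r12 - r13 * r23) / sqrt((1 - r13^2) * (1 - r23^2))

c(geometric = r_geo, formula = r_formula)   # the two agree
```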
The vector diagram in Figure 5.7 illustrates the geometric interpretation of correlation and
partial correlation in individual space. We also observe that correlation, the angle between two
unconditioned vectors, can be substantially different from partial correlation, the angle between
the corresponding vectors after the conditioning vector has been projected out of each.
The third notable angle in Figure 5.7 is θs , which is the angle between y and x2⊥1 , the projection
of x2 onto the orthogonal complement of x1 ; its cosine is
called the semipartial correlation and is used for measuring the unique contribution of x2 to the
prediction of y.
Semipartial correlations play an important role in statistical decisions about whether or
not to include a predictor or set of predictors in a model. In general, if there are two sets
of predictors (say the column vectors of the matrices X1 and X2 ) for a criterion variable y,
then we can interpret the squared semipartial correlation of the first set (X1 ) and y as the
amount of variation in y that is explained by the subspace of the model space orthogonal to the
conditioning space, C(X2 ). Figure 5.7 shows the geometry when these spaces are each spanned
by a single vector.
The ideas here are similar to the orthogonal decomposition of the model space we saw in the
case of ANOVA analyses. However, it is rarely the case that C(X1 ), for example, is orthogonal
to the space C(X2 ), so instead we consider the orthogonal complement of the conditioning space
within the model space, C(X2 )⊥ ∩ (C(X1 ) ⊕ C(X2 )). With partial correlation, this subspace is compared
to only the portion of y that is perpendicular to the conditioning set, whereas with semipartial
correlation, a comparison is made to the whole of y, including any portion within the span of
the conditioning set C(X2 ).
Because of the additivity provided by orthogonality of the subspaces of the model space, these
ideas can be made precise. Consider the model
$$y = [\,X_1\ X_2\,]b + e.$$
Taking the dot product of each side of the equation with itself and applying the fact that the
fitted vector and the error vector are orthogonal gives
$$\|y\|^2 = \|[\,X_1\ X_2\,]b\|^2 + \|e\|^2. \tag{5.1}$$
The model can also be written
$$y = [\,X_{1\perp 2}\ X_2\,]b + e,$$
where X1⊥2 is a matrix such that C(X1⊥2 ) = C(X2 )⊥ ∩ (C(X1 ) ⊕ C(X2 )). Then we can write
$$y = X_{1\perp 2}\,b_1 + X_2\,b_2 + e.$$
Taking the dot product of each side of the equation with itself (and invoking orthogonality) we
get that
$$\|y\|^2 = \|\hat{y}_{X_{1\perp 2}}\|^2 + \|\hat{y}_{X_2}\|^2 + \|e\|^2, \tag{5.2}$$
where ŷX2 , for example, indicates the projection of y onto C(X2 ). Comparing equation 5.1 and
equation 5.2, it follows easily that the squared semipartial correlation can be written in terms of the squared
multiple correlation coefficient of the full regression model, y = [ X1⊥2 X2 ]b + e, and that of
the reduced model containing only X2 :
$$\text{(squared semipartial correlation)} = R^2_{\text{full}} - R^2_{\text{reduced}}.$$
This equation explains why squared semipartial correlation is often interpreted as the importance
of a predictor or set of predictors; it is the increase in the explanatory power of the model when
the predictor or set of predictors is added.
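A small R sketch with simulated data (not the MCAS variables) illustrates the claim: the gain in R² when a predictor is added equals the squared correlation between the criterion and the part of that predictor orthogonal to the others.

```r
# Squared semipartial correlation as an increase in R^2.
set.seed(5)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + rnorm(n)

full    <- lm(y ~ x1 + x2)
reduced <- lm(y ~ x2)

# Increase in explained variation when x1 is added to the model containing x2 ...
delta_r2 <- summary(full)$r.squared - summary(reduced)$r.squared

# ... equals the squared semipartial correlation: the squared correlation of y
# with the part of x1 orthogonal to x2 (and the intercept).
x1_perp2       <- resid(lm(x1 ~ x2))
semipartial_sq <- cor(y, x1_perp2)^2

c(delta_r2 = delta_r2, semipartial_sq = semipartial_sq)
```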
Chapter 6
Probability Distributions
The statistical techniques of the preceding chapters rely on the four probability distributions
discussed in this chapter. Probability distributions are families of functions indexed by parame-
ters. These parameters specify a distribution function when they are fixed to particular values.
As we have seen, it is common to assume a distribution family (in most examples we have
assumed that variables follow a normal distribution) and then use what is known about the
distribution to estimate the parameter(s) that fix the distribution function so that it agrees with
the sample data. For example, we use the observation vector to estimate the mean
and standard deviation of the dependent variable in ANOVA and regression models.
We saw that Chebyshev’s inequality (see Section 1.3.4) is useful because it makes no as-
sumptions about the distribution of a random variable. However, if something is known a priori
about the distribution of a variable, then bounds for the probability of extreme events can often
be significantly improved over the estimates provided by Chebyshev’s inequality. Knowing (or
assuming) the distribution function of a random variable gives a great deal of information about
the probabilities of the events of interest.
All of the methods of analysis subsumed by the general linear model assume that the random
variables Yi and the random variables of error Ei have distributions that can be roughly
approximated by the normal distribution. The normal distribution is denoted N (µ, σ 2 ) because it is
completely determined by the parameters µ and σ 2 . If the random variable Y follows a normal
distribution with mean µY and variance σY2 , then we write Y ∼ N (µY , σY2 ), and the probability
density function for Y is
$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma_Y}\, e^{-\frac{1}{2}\left(\frac{y-\mu_Y}{\sigma_Y}\right)^2}. \tag{6.1}$$
The other distributions used in analyzing the general linear model (such as the F -distribution
used in hypothesis testing) can be derived from the normal distribution. A more detailed
description of the relationships among the distributions discussed in this section can be found
in many standard statistical texts (e.g., Casella & Berger, 2002; Searle, 1971).
Figure 6.1: The normal distribution with three different standard deviations (s = 0.8, 1.0, and 2.0).
The normal distribution is often used in statistical analyses for a number of reasons. The
best motivation, perhaps, is provided by the Central Limit Theorem, which states that the
distribution of the sample mean Y approaches a normal distribution as the size of the sample
increases, no matter what the distribution of the random variable Y . This is very useful because
it allows one to estimate the distribution of the sample mean although very little is known about
the distribution of the underlying random variables. The normal distribution is also useful as the
limit of other probability distributions such as the binomial distribution and the Student's
t-distribution. Consider, for example, a binomial random variable based on n independent trials,
each with success probability p.
The mean of this distribution is np and the variance is np(1 − p). When both np and n(1 − p) are
sufficiently large (in general, it is recommended that both be at least 5), the normal distribution
with µ = np and σ 2 = np(1 − p) provides a very good continuous approximation for the discrete
binomial distribution.
Knowing that a variable follows the normal probability distribution function provides a much
stronger result than Chebyshev’s inequality (see Section 1.3.4). The key point is that the infor-
mation about the distribution of Y allows us to make much better approximations concerning the
probability of extreme observations. For example, a normally distributed variable falls within
1 standard deviation of its mean with a probability of about 68%
and within 2 standard deviations of the mean with an approximate probability of 95% because
$\int_{\mu-2\sigma}^{\mu+2\sigma} f_Y(y)\,dy \approx 0.95$ (see Figure 6.1). (If the goal is to find bounds containing exactly 95% of the area,
then ±1.96 standard deviations provide more accuracy.) Thus, the probability that an obser-
vation is more extreme than 2 standard deviations from the mean is only about 5%, a significantly
tighter bound than the 25% provided by Chebyshev's inequality (which is the best bound
available when nothing is assumed about the distribution).
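These probabilities are easy to verify in R:

```r
# Probability of an observation more than 2 standard deviations from the mean.
2 * pnorm(-2)            # about 0.0455 under normality
1 / 2^2                  # 0.25, the best bound Chebyshev's inequality can give

# The multiplier that captures exactly 95% of a normal distribution:
qnorm(0.975)             # about 1.96
```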
The χ2 distribution is not used directly in the statistical analyses discussed in this text; however,
it is an important distribution for many other kinds of statistical analyses because it can often
be used to find the probability of the observed deviation of data from expectation.

Figure 6.2: The χ2 distribution with k = 5, 13, and 23 degrees of freedom.

The χ2 distribution is denoted χ2k , where k is the only parameter of the distribution, a positive integer
called the degrees of freedom. If a random variable V follows a χ2 distribution with k degrees of
freedom, then we write V ∼ χ2k , and the probability density function for V is given by
$$f_V(v) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, v^{k/2-1} e^{-v/2}, \tag{6.2}$$
where $\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\,dt$. You may notice that the mean of the χ2 distribution with k degrees
of freedom is k itself.
The primary reason for mentioning the χ2 distribution is that the squared length of a random
vector Y in Rk (appropriately scaled and with 0 expectation for each coordinate) will follow a
χ2 distribution with k degrees of freedom. The proof follows from two facts:
Lemma 6.2.1.¹ (1) If $Z \sim N(0,1)$, then $Z^2 \sim \chi^2_1$. (2) If $V_1, \ldots, V_m$ are independent and
$V_i \sim \chi^2_{k_i}$, then $\sum_i V_i \sim \chi^2_{\sum_i k_i}$.

Suppose that the coordinate variables Yi of Y are independent and that each Yi ∼ N (0, σ 2 ). It follows that if we
define Yi′ = Yi /σ, then each Yi′ ∼ N (0, 1). Applying the first part of Lemma 6.2.1, we conclude (Yi′ )2 ∼ χ21 . Applying the second part of
the lemma, $\sum_i (Y_i')^2 \sim \chi^2_k$, and
$$\sum_i (Y_i')^2 = \frac{1}{\sigma^2}\sum_i Y_i^2 = \frac{1}{\sigma^2}\,\mathbf{Y}\cdot\mathbf{Y} = \frac{1}{\sigma^2}\|\mathbf{Y}\|^2.$$
This proves

Theorem 6.2.2. Let Y be a random vector in Rk with independent coordinate variables, each
distributed N (0, σ 2 ). Then $\frac{1}{\sigma^2}\|\mathbf{Y}\|^2 \sim \chi^2_k$.

Theorem 6.2.2 shows that the degrees-of-freedom parameter corresponds to the dimension
of the space containing the vector Y. As we will see, this theorem justifies the claim that the
squared lengths of the projections used in the general linear model follow χ2 distributions with
degrees of freedom equal to the dimensions of the corresponding subspaces.
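A quick simulation in R illustrates Theorem 6.2.2; the sample size, k, and σ below are arbitrary choices.

```r
# Scaled squared length of a normal random vector versus the chi-squared distribution.
set.seed(6)
k     <- 5
sigma <- 2
Y     <- matrix(rnorm(10000 * k, mean = 0, sd = sigma), ncol = k)
len2  <- rowSums(Y^2) / sigma^2       # ||Y||^2 / sigma^2 for each replication

c(simulated_mean = mean(len2), theoretical_mean = k)
# Compare the simulated upper tail with the chi-squared distribution function:
c(simulated = mean(len2 > qchisq(0.95, df = k)), theoretical = 0.05)
```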
Snedecor’s F -distribution has a central role in testing hypotheses related to the general linear
model, and it is named after the statistician R. A. Fisher. The F -distribution has two param-
¹For a proof, see Casella & Berger (2002).
eters, often denoted p and q and called the numerator and denominator degrees of freedom,
and is denoted F (p, q). Both parameters affect the shape of the distribution, but in general, as
the degrees-of-freedom parameters increase, the distribution becomes
more concentrated around 1 (see Figure 6.3). If a random variable W follows an F -distribution with
p and q degrees of freedom, we write W ∼ F (p, q), and the probability density function for W
is
$$f_W(w) = \frac{\Gamma\!\left(\frac{p+q}{2}\right)}{\Gamma\!\left(\frac{p}{2}\right)\Gamma\!\left(\frac{q}{2}\right)\,w}\sqrt{\frac{(pw)^p\, q^q}{(pw+q)^{p+q}}}. \tag{6.3}$$
Variables that follow an F -distribution have a close relationship to variables that follow the χ2
distribution. Whenever independent variables U and V follow χ2 distributions with p and q degrees
of freedom, respectively, the (adjusted) ratio of these variables follows an F -distribution
with p and q degrees of freedom. That is, if U ∼ χ2p and V ∼ χ2q are independent, then
$$F = \frac{U}{V}\cdot\frac{q}{p} = \frac{U/p}{V/q} \sim F(p, q). \tag{6.4}$$
This equation provides a more general description of the F -ratio (see equation 2.15). Notice
that the adjustment factor q/p corrects for the relative degrees of freedom in the two variables
that follow χ2 distributions. If U and V are the squared lengths of random vectors in Rp and Rq ,
respectively, then the adjustment corrects
for the dimensions of these vectors. In this way, the F -ratio can be understood as a ratio of
per-dimension squared lengths.
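Equation (6.4) can likewise be checked by simulation; the degrees of freedom below are arbitrary.

```r
# An adjusted ratio of independent chi-squared variables follows an F distribution.
set.seed(7)
p <- 3; q <- 12
U <- rchisq(10000, df = p)
V <- rchisq(10000, df = q)
Fratio <- (U / p) / (V / q)

# Compare simulated quantiles with those of F(p, q).
probs <- c(0.5, 0.9, 0.95, 0.99)
rbind(simulated = quantile(Fratio, probs), theoretical = qf(probs, p, q))
```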
In Section 2.3, we saw that the vector ŷ is in the (p + 1)-dimensional subspace of Rn that is
spanned by the p + 1 independent column vectors of the design matrix X. In a similar way, the
error vector e lies in the (n − p − 1)-dimensional subspace of Rn orthogonal to C(X).
The discussion above uses Theorem 6.2.2 to link the more general statement of the F -ratio provided
in equation (6.4) with the F -ratio we will use for testing hypotheses about the general linear
model.
Figure 6.3: The F -distribution centers around 1 as the degrees-of-freedom parameters become
large; the curves shown correspond to (p, q) = (27, 500), (11, 130), and (3, 7).
We briefly consider one other distribution that is often used in analyses based on the general
linear model. This eponymous distribution is called the Student’s t-distribution after the pen
name of William Gosset, a statistician employed by the Guinness Brewery early in the twentieth
century. The t-distribution is often applied to hypothesis tests involving small samples. In
the context of the general linear model, the t-distribution is helpful for obtaining confidence
intervals for the parameter estimates in the vector b. Confidence intervals are an alternative to
hypothesis testing, and provide a range of likely values for parameter estimates.
The t-distribution has one parameter k, which is the number of degrees of freedom, and
the distribution is denoted t(k). If U is a random variable that follows a t-distribution with k
degrees of freedom, then we write U ∼ t(k) and the probability density function for U is
$$f_U(u) = \frac{\Gamma\!\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\,\Gamma\!\left(\frac{k}{2}\right)}\left(1 + \frac{u^2}{k}\right)^{-(k+1)/2}. \tag{6.5}$$

Figure 6.4: The t-distribution approaches the normal distribution as the degrees-of-freedom
parameter increases (k = 1, 3, and 15).
As k increases, the t-distribution approaches the normal distribution with a mean of 0 and
variance of 1 (see Figure 6.4). In addition, there is a close relationship between the t-distribution
and the F -distribution: the square of a variable following a t-distribution with k degrees of
freedom follows an F -distribution with 1 degree of freedom in the numerator and k in the
denominator. That is, if U ∼ t(k) then U 2 ∼ F (1, k).
This relationship has an intriguing geometric implication for the t-distribution. We can think
of U 2 as a ratio of the per-dimension squared lengths of two vectors. Since the numerator has 1 degree
of freedom, the vector in the numerator, Vn , lives in a 1-dimensional subspace. The vector in the
denominator, Vd , lives in the k-dimensional subspace orthogonal to it. Choose an orthonormal
basis {ui } for the combined (k + 1)-dimensional space so that Vn = c u1 and Vd = $\sum_{i=1}^{k} c_i u_{i+1}$. Then
$$U^2 = \frac{\|c\|^2}{\sum_{i=1}^{k}\|c_i\|^2/k},$$
so that
$$U = \frac{\|c\|}{\sqrt{\sum_{i=1}^{k}\|c_i\|^2}\,\big/\sqrt{k}} = \frac{\|V_n\|}{\|V_d\|/\sqrt{k}}. \tag{6.6}$$
We saw this expression of a variable that follows the t-distribution in Section 1.4.
In summary, we can say that the sum of squares of normally distributed variables follows
the χ2 distribution. (Geometrically, this is the squared length of a vector.) In addition, a ratio
of variables that follow the χ2 distribution itself follows an F -distribution. Finally, the square
of a variable that follows the t-distribution follows an F -distribution with one degree of freedom
in the numerator.
Chapter 7
This chapter describes the mathematical basis of the DataVectors program. Points in R3 can be
represented using homogeneous coordinates, which facilitate affine transformations (including
translations) of these points via matrix multiplication.
In order to illustrate and explore more concretely the ideas discussed in this manuscript and
to generate precise figures from data, I wrote the DataVectors program in the R language. This
programming language has built-in support for statistical analysis (including matrix arithmetic
routines) and celebrated graphical capabilities. Because it is open source and free, the language
is used widely by academics and research scientists for developing new statistical techniques.
The DataVectors program accepts up to three data vectors of any length and displays a
3-dimensional model of the (centered) model space that can be rotated in any direction using a
mouse. This chapter describes the mathematical basis of the program, including homogeneous
coordinates, the matrix transformations that act on them, and the perspective projection used
to render three dimensions on a two-dimensional screen.

The set of invertible n × n matrices forms a group under matrix multiplication called the general
linear group, which is denoted GLn . Affine transformations of Rn , such as translations, are not
linear and so cannot be represented by matrices in GLn ; embedding Rn in Rn+1 by appending an
extra coordinate, which plays a role much like that of the column of ones with the constant term
as its coefficient in a design matrix, provides a solution. The group GLn+1 includes every
linear transformation that fixes one dimension (i.e., those matrices that correspond to linear
transformations of Rn+1 that can be used to achieve affine transformations of Rn viewed as a
hyperplane in Rn+1 ). A point x = (x1 , . . . , xn ) in Rn is represented by homogeneous coordinates
X = (X1 , . . . , Xn+1 ) with xi = Xi /Xn+1 .
Whenever Xn+1 = 0, the coordinates represent a point at infinity, but this case is not needed
for the present discussion. When Xn+1 ≠ 0, the definition establishes an equivalence relation:
X ≡ Y if and only if there exists a non-zero c in R such that Y = c(X1 , . . . , Xn+1 ). We take
the canonical homogeneous coordinates of a point to be the representative with Xn+1 = 1,
namely (x1 , . . . , xn , 1).
A shear transformation of Rn+1 fixes a hyperplane W , translating each point along a vector a
parallel to W and proportionally to the distance between the point and W . To obtain a translation
of Rn , pick W to be the hyperplane (xn+1 = 0) and embed Rn in Rn+1 as the parallel hyperplane
(xn+1 = 1), where each point is mapped to its canonical homogeneous coordinates. This hyperplane is
effectively translated by the vector a by means of the shear transformation of Rn+1 . Translated
coordinates in Rn can be recovered by means of the inverse embedding restricted to the range
of the embedding. In other words, if v ∈ Rn and V = (v, 1)T is the corresponding homogeneous
embedding, we can recover the translated vector v′ from the translated homogeneous vector
V′ ∈ Rn+1 by stripping off the last coordinate of V′ . The matrix representation of this translation
is
$$
{}_{h}T_{a} = \begin{bmatrix} I_n & a \\ 0\;\cdots\;0 & 1 \end{bmatrix}, \qquad {}_{h}T_{a} \in M(n+1), \tag{7.1}
$$
where the pre-subscript h indicates a transformation expressed in homogeneous
coordinates and distinguishes these from transformations of Rn described in the rest of this
chapter.
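As a concrete illustration, the following R sketch builds the matrix of equation (7.1) for n = 3 and applies it to the canonical homogeneous coordinates of a point; the function name and the example values are illustrative, not part of the DataVectors program.

```r
# Translation of R^3 by a vector a, carried out as a linear map on homogeneous coordinates.
translation_matrix <- function(a) {
  n <- length(a)
  rbind(cbind(diag(n), a), c(rep(0, n), 1))   # [ I_n  a ; 0 ... 0  1 ]
}

a <- c(2, -1, 3)                  # translation vector
p <- c(1,  1, 1)                  # a point in R^3
P <- c(p, 1)                      # its canonical homogeneous coordinates

translation_matrix(a) %*% P       # gives (p + a, 1); drop the last coordinate to recover p + a
```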
The extension of the linear transformations of rotations and dilations from Euclidean coordinates
to homogeneous coordinates is equally straightforward. Let θ, φ, and γ denote angles of rotation
about the x-axis, y-axis, and z-axis, respectively. A rotation is denoted by the matrix Rθφγ , the
composition of the three basic rotations
$$R_x(\theta) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix}, \qquad
R_y(\phi) = \begin{bmatrix} \cos\phi & 0 & \sin\phi \\ 0 & 1 & 0 \\ -\sin\phi & 0 & \cos\phi \end{bmatrix}, \qquad
R_z(\gamma) = \begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$
Rotation matrices can be easily realized in R4 by adjoining an extra row and column. Thus,
$$
{}_{h}R_{\theta\phi\gamma} = \begin{bmatrix} R_{\theta\phi\gamma} & \begin{matrix} 0 \\ \vdots \\ 0 \end{matrix} \\ 0\;\cdots\;0 & 1 \end{bmatrix}. \tag{7.2}
$$
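A sketch of these matrices in R follows; the function names are mine and the particular angles are arbitrary.

```r
# Basic rotation matrices and their homogeneous versions (equation 7.2).
Rx <- function(theta) rbind(c(1, 0, 0),
                            c(0, cos(theta), -sin(theta)),
                            c(0, sin(theta),  cos(theta)))
Ry <- function(phi)   rbind(c( cos(phi), 0, sin(phi)),
                            c( 0,        1, 0),
                            c(-sin(phi), 0, cos(phi)))
Rz <- function(gamma) rbind(c(cos(gamma), -sin(gamma), 0),
                            c(sin(gamma),  cos(gamma), 0),
                            c(0,           0,          1))

# Adjoin an extra row and column to act on homogeneous coordinates.
homogeneous <- function(R) rbind(cbind(R, c(0, 0, 0)), c(0, 0, 0, 1))

R <- Rx(pi / 6) %*% Ry(pi / 4) %*% Rz(pi / 3)   # one possible composite rotation
homogeneous(R) %*% c(1, 2, 3, 1)                # rotate a point, leaving the final 1 unchanged
```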
The scaling transformation, where each vector coordinate is transformed to some multiple
of itself, is similarly handled. Let Skx ky kz denote the scaling transformation that multiplies the
x-, y-, and z-coordinates by kx , ky , and kz , respectively; it is the product of the matrices
$$S_{k_x} = \begin{bmatrix} k_x & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad
S_{k_y} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & k_y & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad
S_{k_z} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & k_z \end{bmatrix}.$$
In homogeneous coordinates, the scaling transformation is realized by
$$
{}_{h}S_{k_x k_y k_z} = \begin{bmatrix} k_x & 0 & 0 & 0 \\ 0 & k_y & 0 & 0 \\ 0 & 0 & k_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \tag{7.3}
$$
By left-multiplying the product of translation, rotation, and scaling matrices to the ho-
mogeneous coordinates of points in R3 , we can achieve all the transformations of interest for
the DataVectors program.
The major challenge in modeling vector space is creating representations of vectors in 3-dimensional
space on a 2-dimensional computer screen. This is made possible by the perspective projection,
defined with a view plane and a viewpoint v = (v1 , v2 , v3 ) ∈ R3 not in the view plane. Suppose
the view plane consists of the points x ∈ R3 satisfying n · x = c for a normal vector n = (n1 , n2 , n3 )
and a constant c. Let V and X denote the homogeneous coordinates for v and x ∈ R3 , and let
N = (n1 , n2 , n3 , −c), so that N · X = 0 exactly when x lies in the view plane. The perspective projection sends
a point in space x to the intersection of the view plane with the line passing through v and x.
Theorem 7.2.1. Given the viewpoint V and the (possibly affine) view plane with normal N, the
matrix of the perspective projection is PV,N = VN^T − (N · V)I4 .¹
Proof. Let X denote the homogeneous coordinates of a point x ∈ R3 and let k1 V + k2 X denote
the image of X under the perspective projection PV,N , for constants k1 , k2 ∈ R; then k1 and k2
¹This theorem and its proof follow Marsh (1999).
satisfy k1 (N · V) + k2 (N · X) = 0. If N · X = 0, then k1 = 0 (since N · V ≠ 0), so the image of X is
a multiple of X itself. Checking the proposed matrix, we find
$$(VN^T - (N\cdot V)I_4)X = (N\cdot X)V - (N\cdot V)X = -(N\cdot V)X.$$
Therefore, (VN^T −(N·V)I4 )X is a multiple of X, which is precisely what we would expect given
the equivalence relation on points expressed with homogeneous coordinates, and we conclude
that PV,N = VN^T − (N · V)I4 in this case. On the other hand, whenever N · X ≠ 0 (and
the point x is therefore not in the view plane), we may take k1 = N · X and k2 = −(N · V), so that
the image is
$$k_1 V + k_2 X = (N\cdot X)V - (N\cdot V)X = (VN^T - (N\cdot V)I_4)X,$$
as required.
To apply this to the problem of computer graphics, we make the simplifying assumptions
that the view plane is the xy-plane (i.e., n = (0, 0, 1, 0)T ) and that the view point is on the z-axis
(i.e., v = (0, 0, k, 1)T ). Under these assumptions, the matrix for the perspective transformation
is:
$$P_{v,n} = \begin{bmatrix} -k & 0 & 0 & 0 \\ 0 & -k & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & -k \end{bmatrix}. \tag{7.4}$$
Checking, we have
$$P_{v,n}\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} -kx \\ -ky \\ 0 \\ z-k \end{bmatrix} \equiv \begin{bmatrix} kx/(k-z) \\ ky/(k-z) \\ 0 \\ 1 \end{bmatrix}. \tag{7.5}$$
As we expect, the x- and y-coordinates of the image of this projection are proportional to the
ratio of the distance from the viewpoint to the viewing plane and the distance of the pre-image
from the view point along the z-axis (see Figure 7.1).
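The following R sketch builds the matrix of equation (7.4) and applies it to a point, confirming the form of equation (7.5); the viewpoint distance k and the point are arbitrary choices.

```r
# Perspective projection from the viewpoint (0, 0, k) onto the xy-plane,
# acting on homogeneous coordinates.
perspective_matrix <- function(k) {
  rbind(c(-k,  0, 0,  0),
        c( 0, -k, 0,  0),
        c( 0,  0, 0,  0),
        c( 0,  0, 1, -k))
}

k <- 10
p <- c(3, 4, 2, 1)                       # the point (3, 4, 2) in homogeneous coordinates
image <- perspective_matrix(k) %*% p
image / image[4]                         # canonical coordinates: (kx/(k-z), ky/(k-z), 0, 1)
```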
Figure 7.1: The x- and y-coordinates of the perspective projection are proportional to k/(k − z).
Although it is clear that the perspective projection from a view point on the z-axis to the
xy-plane is quite simple, it is perhaps not yet clear why it is always possible to make these
simplifying assumptions. As long as the line of sight is orthogonal to the view plane, a solution
follows from the transformations developed in the first section. If the view point is not on the
z-axis, we left multiply everything by the appropriate rotation matrix to ensure that the view
point is on the z-axis. Let v′ be a view point not on the z-axis and Rv a rotation taking v′ to
v = (0, 0, ‖v′‖)T . It follows that the matrix product Pv,n Rv takes any point p to some p′ on
the xy-plane. To recover the image of Pv′,n we simply compute $R_v^{-1}$p′ , where $R_v^{-1}$ is the inverse
rotation operator. The graphics pipeline refers to a sequence of mathematical operations such
as this that transform a point in R3 into a pixel on the computer monitor. All transformations
are represented using matrix multiplication, and the entire pipeline can be conceived as the
product of the corresponding matrices.
The perspective transformation is problematic because it is not of full rank and therefore
singular. Information regarding the distance of a point to the center of projection is lost and
cannot be recovered from the information that remains in the image. We will find that it is
useful to decompose the perspective transformation into translation and dilation transformations
together with a simpler projection, retaining all three
dimensions until the last step of the pipeline just before the pixel is displayed. In the penultimate
step, points have been distorted to achieve perspective but still retain information about relative
depth. This space is called perspective space. The advantage is that the transformation of
Euclidean space into perspective space has an inverse, so the depth information used to resolve
issues such as object collisions can be recovered. In addition, the depth information aids in
drawing realistic effects such as simulated fog, in which the transparency (a color aspect) of a
point depends on its distance from the viewer.
Ideally, we want the x- and y-coordinates in perspective space to have the same values
as the x and y screen coordinates. This allows the final projection from perspective space to the
screen coordinates to be an orthogonal projection in the z direction, so that obtaining the screen
coordinates requires no further calculations. After rotating the view point to the z-axis
and translating it to the origin, we associate the truncated pyramid called the viewing frustum
(see Figure 7.2) with the region [−bx , bx ] × [−by , by ] × [−1, 1]. The viewing frustum is defined
by the near and far clipping planes, n = (0, 0, n, 0)T and f = (0, 0, f, 0)T , and the dimensions
of the visible view plane. For a centered, square screen, bx = w/2 and by = h/2, and so these
dimensions are [−w/2, w/2] × [−h/2, h/2], where w and h are the screen width and height in
screen coordinates. This association is achieved via the perspective space transformation:
$$S_{v,n} = \begin{bmatrix} \frac{2n}{w} & 0 & 0 & 0 \\ 0 & \frac{2n}{h} & 0 & 0 \\ 0 & 0 & \frac{-(f+n)}{f-n} & \frac{-2fn}{f-n} \\ 0 & 0 & -1 & 0 \end{bmatrix}. \tag{7.6}$$
Applying Sv,n to a point in homogeneous coordinates gives
$$S_{v,n}\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} 2nx/w \\ 2ny/h \\ \frac{-(f+n)}{f-n}z - \frac{2fn}{f-n} \\ -z \end{bmatrix} \equiv \begin{bmatrix} 2nx/(-zw) \\ 2ny/(-zh) \\ \frac{f+n}{f-n} + \frac{2fn}{z(f-n)} \\ 1 \end{bmatrix}.$$
By using the 4th row of Sv,n to record the negative of the z-coordinate, dividing by this coordinate to
obtain the canonical homogeneous coordinates for the point applies the correct perspective
scaling to the x- and y-coordinates. It follows that Sv,n can be decomposed further as the
product of dilation transformations (in the x- and y-coordinates) that associate the original
x- and y-coordinates in the viewing plane with the desired screen coordinates and an even
simpler perspective transformation. The 3rd coordinate of p′ is not suppressed to 0 as under the
perspective projection, but retains depth information via the invertible function
$$z' = \frac{f+n}{f-n} + \frac{2fn}{z(f-n)}.$$
It is straightforward to verify that this formula maps the viewing frustum to the appropriate
parallelepiped in perspective space. Once mapped to perspective space, the screen coordinates
can be read off the first two coordinates of p′ . If needed, an inverse function can be used to
recover the original depth coordinate.
In sum, beginning with an arbitrary viewpoint v and an orthogonal view plane with normal
v, the graphics pipeline (1) rotates space around the origin so that v lies on the z-axis (say,
via Rθφγ ). Then, (2) space is translated along the z-axis so that v lies at the origin and the
image of the origin is (0, 0, −‖v‖) (via T ). Next, (3) the space is dilated in order to identify
the viewing plane with the computer screen (via D) and (4) the viewing frustum is transformed
into a parallelepiped in perspective space (via S). Finally (5), the screen pixel can be drawn
using the x- and y-coordinates of each point in perspective space, an orthogonal projection onto
the xy-plane accomplished by the matrix
$$M = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.$$

Figure 7.2: The perspective space transformation takes the viewing frustum to the parallelepiped
[−w/2, w/2] × [−h/2, h/2] × [−1, 1] in perspective space.
Then the whole pipeline taking p in the original space to p′ on the screen can be written as a
product of matrices:
$$p' = M\,S\,D\,T\,R\,p.$$
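The following R sketch composes a simplified version of this pipeline for a single point. The particular rotation, translation, screen dimensions, and clipping planes are illustrative assumptions; the dilation D is taken to be the identity, and the perspective divide is performed explicitly before the final projection. This is a sketch, not the DataVectors code itself.

```r
# A simplified pipeline for one point; all parameter values below are illustrative.
Rz <- function(g) rbind(c(cos(g), -sin(g), 0, 0),
                        c(sin(g),  cos(g), 0, 0),
                        c(0,       0,      1, 0),
                        c(0,       0,      0, 1))
Tz <- function(d) rbind(c(1, 0, 0, 0),     # translation along the z-axis (a case of eq. 7.1)
                        c(0, 1, 0, 0),
                        c(0, 0, 1, d),
                        c(0, 0, 0, 1))
S  <- function(w, h, n, f)                 # perspective space transformation (eq. 7.6)
  rbind(c(2 * n / w, 0,         0,                   0),
        c(0,         2 * n / h, 0,                   0),
        c(0,         0,        -(f + n) / (f - n),  -2 * f * n / (f - n)),
        c(0,         0,        -1,                   0))
M  <- diag(c(1, 1, 0, 0))                  # orthogonal projection onto the screen coordinates

p <- c(0.3, 0.2, 1, 1)                                    # a point in homogeneous coordinates
q <- S(w = 2, h = 2, n = 1, f = 10) %*% Tz(-5) %*% Rz(pi / 8) %*% p
q <- q / q[4]                                             # divide to get canonical coordinates
screen_xy <- (M %*% q)[1:2]                               # the x- and y-coordinates to draw
screen_xy
```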
References
Casella, G. & Berger, R. L. (2002). Statistical Inference (2nd ed.). South Melbourne, Australia:
Thomson Learning.
Christensen, R. (1996). Plane Answers to Complex Questions. New York, NY: Springer-Verlag.
Cook, T. D. & Campbell, D. T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings. Boston, MA: Houghton Mifflin.
Pedhazur, E. J. (1997). Multiple Regression in Behavioral Research: Explanation and Prediction (3rd ed.). Fort Worth, TX: Harcourt Brace.
Herr, D. G. (1980). On the history of the use of geometry in the general linear model. The American Statistician.
Faraway, J. J. (2004). Linear Models with R. Boca Raton, FL: Chapman & Hall.
Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied Linear Statistical Models.
Marsh, D. (1999). Applied Geometry for Computer Graphics and CAD. New York, NY: Springer-
Verlag.
Rogers, J. L. & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42, 59–66.
Saville, D. J. & Wood, G. R. (1986). A method for teaching statistics using n-dimensional geometry. The American Statistician.
Saville, D. J. & Wood, G. R. (1991). Statistical Methods: The Geometric Approach. New York,
NY: Springer-Verlag.
Shifrin, T., & Adams, M. R. (2002). Linear Algebra: A Geometric Approach. New York, NY:
Freeman.
Shoemake, K. (1992). ARCBALL: A User Interface for Specifying Three-Dimensional Orientation Using a Mouse. Paper presented at the annual proceedings of Graphics Interface in
Vancouver, Canada.
www.cs.caltech.edu/courses/cs171/quatut.pdf