
CHAPTER 1. INTRODUCTION AND REVIEW OF UNIVARIATE GENERAL LINEAR MODELS

Few data analytic techniques command a position of greater importance in the social, behavioral, and physical sciences than multiple regression analysis. Exemplary applications can be found in the full range of disciplines, including anthropology (Cardoso & Garcia, 2009), economics (Card, Dobkin, & Maestas, 2009), political science (Baek, 2009), sociology (Arthur, Van Buren, & Del Campo, 2009), and all branches of psychology (Ellis, MacDonald, Lincoln, & Cabral, 2008; Pekrun, Elliot, & Maier, 2009).
In each of these disciplines, the investigator's purpose is to study the relationships among variables. Fitting regression models to data allows the analyst to account for, or explain, variation in a criterion variable as a function of one or more predictor variables. The general linear model is an extension of regression models to accommodate both qualitative and quantitative predictor variables. It is widely recognized that multiple regression analysis is a data analytic system that subsumes all linear models (Cohen, 1968), including those that are based on continuously distributed predictor variables (classic regression analysis), those that are based on schemes to accommodate categorical predictors (classic analysis of variance), and those models that are based on any combination of continuous and categorical predictors.1 Together these models define the general linear model. The regression model is flexible enough to handle many different realizations of predictor variables, including interactions between continuous predictor variables, between categorical predictor variables, and between combinations of continuous and categorical predictor variables. The breadth of coverage of possible analyses afforded by these combinations explains why the technique is so widely used in all scientific disciplines from anthropology to zoology.

In this volume, our goal is to introduce the multivariate version of the general linear model and to illustrate several of its applications. Multivariate models are distinguished by the presence of more than one dependent variable, all of which are analyzed simultaneously by fitting a single model to the data. Much of the conceptual and statistical basis of multivariate linear model analysis is a direct generalization of univariate regression analysis, which we briefly review in this chapter. This review of univariate strategies for analyzing linear models is intended to set the stage for the remaining chapters. In Chapter 2, we introduce the example data sets to be used throughout, along with a discussion of the first step in general linear model (GLM) analysis: specifying the model. In Chapters 3, 4, and 5, we cover the estimation of the parameters of the model, the assessment of the goodness of fit of the model along with the related multivariate test statistics, and the testing of hypotheses on the model. Chapter 6 introduces the linear model solution to the multivariate analysis of variance, and Chapter 7 concludes the volume with an introduction to canonical correlation analysis, a linear model that subsumes all of the material of the preceding chapters. The overriding goal of the text is to present an integrated view of all these various techniques under a single modeling framework.

1 Some authors prefer the terms quantitative and qualitative to describe predictor variables that are continuous or categorical. In this volume, we use the term continuous to denote variables whose underlying metric is continuous or discrete, and we use the term categorical to denote nominal group structure that has no meaningful underlying metric except to identify categories.

Review of Univariate Linear Model Analysis


The main goal of the linear model is to evaluate relationships in order to
explain variability in a response variable as a function of some specified
model and an error of prediction:

Response = Model + Error.

In the univariate case, regression models are those models that are limited to a single criterion, response, dependent, or outcome variable.2 Univariate regression models can be expressed mathematically as a regression function,

Y = β0 + β1X1 + ε,  [1.1]

for a simple model with a single predictor variable. For a more complex model with multiple predictors, we may write3

Y = β0 + β1X1 + β2X2 + … + βqXq + ε.  [1.2]

2 We use the terms dependent, criterion, response, and outcome interchangeably in this volume to describe the Y variable in models. The X variables in the model will be interchangeably referred to as predictor, explanatory, or independent variables. These terms appear throughout the literature on regression analysis. Some authors prefer to reserve the term dependent variable for experimental designs with manipulated conditions.

In Equations 1.1 and 1.2, Y represents a single column-vector response variable that is intended to be explained by the weighted linear combination of regression coefficients, β0, β1, …, βq, and explanatory variables, X1, X2, …, Xq, and includes a disturbance or error term ε, which captures all other sources of variability, both systematic and random, that are responsible for variation in Y. The Xj explanatory variables, j = 1, 2, …, q, can be either continuous or categorical.4 Many contemporary textbooks emphasize this integrative linear model approach to both regression analysis and the analysis of variance in the univariate case (see, e.g., Cohen, Cohen, West, & Aiken, 2003; Myers & Well, 2003).
Although we briefly review the basic ideas of univariate regression/linear model analysis in this chapter, our purpose is to set the stage for the analysis of multivariate multiple regression/general linear model analysis with continuous and categorical predictor variables—multivariate models can be conceptualized as generalizations of their univariate counterparts. Whereas univariate regression models are defined by their single column vector of Y scores, multivariate models are defined largely by the fact that more than one dependent variable is simultaneously included in the model specification. The collection of explanatory variables, X1, X2, …, Xq, can be identical for univariate and multivariate models; only the number of Y variables, the number of columns of regression coefficients, and the number of associated disturbance terms, ε, will differ.
As models become more complex, it will be convenient to express the models and their applications in matrix algebraic terms. Although we introduce the basic matrix notation needed to identify the linear models discussed in this volume, we do not present a full coverage of the topic. A chapter-length coverage of many of the details is given in Draper and Smith (1998, Chap. 4); textbook-length coverage can be found in Namboodiri (1984) or Schott (1997).

3 We do not identify the response and explanatory variables Y or X with a subscript to indicate the serial order of the 1st through the nth observations. In this volume, all models are based on the full set of n observations, and the index of summation or multiplication is assumed to be across all n participants.

4 Coding schemes for categorical variables will be introduced at greater length in later sections.
The univariate multiple regression model of Equations 1.1 and 1.2 can be conveniently summarized in matrix notation as5

y(n × 1) = X(n × q+1) β(q+1 × 1) + ε(n × 1),  [1.3]

in which y(n × 1) is a single-column vector whose dimensions are noted in the row-by-column subscript. The Xj predictor variables, j = 1, 2, …, q, collected in a design matrix, X(n × q+1), are the counterpart of the same predictor variables in the univariate model of Equation 1.2, now expressed as a matrix of order (n × q + 1), with n rows identifying each of the i = 1, 2, …, n cases and q + 1 columns that capture the predictor variables. The "+1" in the q + 1 dimension allows for a unit vector X0 ≡ 1 (≡ means "by definition equal to") to estimate the intercept of the model. The vector β of Equation 1.3 is a (q + 1 × 1) column vector of regression coefficients containing one row for each of the q + 1 explanatory variables. Expanding Equation 1.3 shows the elements contained in the matrices for a univariate multiple regression model with q + 1 predictor variables:

    [Y1]   [1  X11  X12  …  X1q] [β0]   [ε1]
    [Y2] = [1  X21  X22  …  X2q] [β1] + [ε2]
    [⋮ ]   [⋮    ⋮    ⋮       ⋮ ] [⋮ ]   [⋮ ]
    [Yn]   [1  Xn1  Xn2  …  Xnq] [βq]   [εn]

The multivariate multiple regression model is a generalization of Equation 1.3 and would be written as

Y(n × p) = X(n × q+1) B(q+1 × p) + E(n × p).  [1.4]

The matrix Y(n × p) is a two-dimensional array of numbers in which the rows of the matrix represent the n observations (subjects, cases) and the columns of the matrix contain the p > 1 response variables, Yk, for k = 1, 2, …, p. Hence, the order of the matrix Y is (n × p).

5 We use italics to represent scalars (e.g., X, Y, Z, β, ε), boldface lowercase letters to denote row or column vectors (e.g., a, b, y, x, β, ε), and boldface uppercase letters to denote matrices (e.g., X, Y, B, E). If a column or row vector is deliberately represented by a matrix symbol, its vector status will be made explicit by the order of the matrix, e.g., (n × 1) or (1 × p).

The structure of the design matrix, X(n × q+1), does not differ from univariate to multivariate models and is identical to that of Equation 1.3. The matrix B(q+1 × p) of Equation 1.4 is an augmented collection of regression coefficients, one row for each of the q + 1 explanatory variables and p columns to accommodate the multiple response variables. Finally, the matrix E(n × p) is a collection of vectors of disturbance terms, one row for each of the n cases on each of the p response variables in the model. Expanding Equation 1.4 reveals the matrix elements that would be contained in the multivariate model,

    [Y11  Y12  …  Y1p]   [1  X11  X12  …  X1q] [β01  β02  …  β0p]   [ε11  ε12  …  ε1p]
    [Y21  Y22  …  Y2p] = [1  X21  X22  …  X2q] [β11  β12  …  β1p] + [ε21  ε22  …  ε2p]
    [ ⋮             ⋮ ]   [⋮               ⋮ ] [ ⋮             ⋮ ]   [ ⋮             ⋮ ]
    [Yn1  Yn2  …  Ynp]   [1  Xn1  Xn2  …  Xnq] [βq1  βq2  …  βqp]   [εn1  εn2  …  εnp]

In the succeeding chapters, we will pursue more of the details of structuring the design matrix to accommodate both continuous and categorical predictor variables. For the remainder of this chapter, we set the stage by focusing on a review of univariate linear models.

We assume that the reader has a reasonably good understanding of univariate multiple regression analysis at the level of Cohen et al. (2003) and a similarly good understanding of analysis of variance models at the level of Myers and Well (2003). We also assume an elementary grasp of matrix addition, subtraction, multiplication, and inversion (division). We hope to show that much of multivariate analysis can be seen as a generalization of univariate analysis. Toward that end, we turn now to a review of the univariate regression model, in which we introduce four steps of general linear model analysis:

1. Specify the model.
2. Estimate the parameters of the model.
3. Define measures of goodness of fit of the model.
4. Develop methods for testing hypotheses about the model.

Because of space constraints, we do not undertake a discussion of the diagnosis of the adequacy of the models, which is covered in detail elsewhere (Cohen et al., 2003, Chap. 4).

Specifying the Univariate Regression Model


The dimensions of Y(n × p) define the initial distinction between univariate and multivariate models. If the designation of the model includes a single-column vector of scores, then y(n × 1) represents the dependent variable as noted in Equation 1.3. Consider a regression model in which y(n × 1) is hypothesized to be a function of three predictors—continuously distributed variables X1 and X2 and a dichotomous categorical variable X3. Ultimately, data must be collected that conform to the model specifications. To make matters more concrete, let Y represent the construct of executive functioning as measured by scores on the Trail Making Test–Part B (TMT-B; Tombaugh, 2004). Neuropsychologists consider the TMT-B to be a measure of higher-order brain function governing the activities of planning, organization, and anticipation. Since executive functioning is a critical cognitive skill, understanding how status on this dimension might vary with advancing age, increasing education, and differences in gender is important. A fictitious data set based on n = 40 observations, with a correlation structure nearly identical to that reported by Tombaugh (2004), specifies a three-predictor model of the form defined in Equation 1.3. The prototypical matrices required to specify this linear model would include the following:

    y(40 × 1) = [72, 115, 117, …, 111]′,

    X(40 × 4) = [1  41  13  0]
                [1  51  18  1]
                [1  80  14  0]
                [⋮            ]
                [1  59  10  0],

    β(4 × 1) = [β0, β1, β2, β3]′,   ε(40 × 1) = [ε1, ε2, ε3, …, ε40]′.

In this univariate model, the vector y is time to completion of the TMT-B task, X1 is the participant's age, and X2 is the participant's education, both continuous predictor variables. The vector X3 is a dummy-coded regressor representing the categorical variable of gender, coded as 1 = female and 0 = male. The vector X0 ≡ 1 is included as the first column of the design matrix to accommodate the model intercept. The means, standard deviations, and correlations for these data are shown in Table 1.1. Articulating this descriptive information, along with writing out the regression model specified in Equation 1.2 or 1.3, constitutes the statistical details required to specify the model.

Table 1.1  Means, Standard Deviations, and Correlations for the TMT-B Data

                     TMT-B     Age    Education   Gender
TMT-B                1.000
Age                   .632    1.000
Education            -.244    -.171     1.000
Gender               -.046     .014     -.114      1.000
Mean                 93.77    58.48     12.60        .45
Standard deviation   32.77    21.68      2.60        .50

Note: n = 40. TMT-B = Trail Making Test–Part B.
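To make the specification step concrete, the short sketch below assembles the response vector and the design matrix for a model of this form in Python with NumPy. Only the four example rows printed above are used, so the numbers are placeholders standing in for the full fictitious n = 40 data set; the point is the column layout (unit vector, age, education, dummy-coded gender), not the values.

import numpy as np

# Four illustrative rows of the fictitious TMT-B data shown above;
# the full data set has n = 40 such rows (placeholder values, for layout only).
age       = np.array([41.0, 51.0, 80.0, 59.0])
education = np.array([13.0, 18.0, 14.0, 10.0])
gender    = np.array([0.0, 1.0, 0.0, 0.0])          # dummy code: 1 = female, 0 = male
y         = np.array([72.0, 115.0, 117.0, 111.0])   # TMT-B time to completion (seconds)

# Design matrix of order (n x q+1): unit vector X0 = 1 for the intercept,
# followed by the q = 3 predictor columns, as in Equation 1.3.
X = np.column_stack([np.ones_like(y), age, education, gender])
print(X)

With the full 40 cases in hand, the same column_stack call would produce the (40 × 4) design matrix of the example.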

A second important aspect of linear model specification depends heavily on the theory that dictates the mathematical model and provides the substantive explanation of the hypothesized relationship between response and explanatory variables. The theoretical basis of the research often includes the logic used to explain the mechanism through which the Y and X variables are presumed to be associated. These very important details of model specification are context specific and will vary from study to study. While we will endeavor to provide the flavor of such arguments in the examples used to illustrate the procedures here, a full discussion of this aspect of model specification is beyond the scope of this volume. Extensive coverage of this topic is given in Jaccard and Jacoby (2010).

Estimating the Parameters of the Model


The models of Equations 1.1 to 1.4 are population regression functions with parameters of the model defined in the elements of β(q+1 × 1) = (β0, β1, …, βq)′ for univariate models and of B(q+1 × p) for the multivariate case. For the q-predictor univariate regression model of Equation 1.3, it is known that the long-run expected value of the function for a single criterion variable is given by

E(Y|X) = Xβ = β0 + β1X1 + β2X2 + … + βqXq.  [1.5]

Figure 1.1  The Linear Regression Function With Expected Values (Means) of the Conditional Distributions of Y on X for the Data of Table 1.1

[Figure: scatter of TMT-B (vertical axis, roughly 75 to 120) against Age (horizontal axis, roughly 30 to 80) with the fitted line E{Y} = 39 + .93X and the conditional means E{Y1} = 75, E{Y2} = 86, and E{Y3} = 109 marked on the line.]

Note: Y1, Y2, and Y3 are illustrative cases.

These expected values are the means of the conditional probability distributions of Y, say μY|Xj, for each of the values of Xj. The linear model specifying the relationship between Y and X requires that the conditional means of Y|X fall precisely on a straight line defined by the model, as illustrated in Figure 1.1 for a single predictor variable. Linear models with two predictors require that the regression surface defined by X be a two-dimensional plane with partial slopes defining the X axes of the graph, as shown in Figure 1.2. For the simple regression model of Equation 1.1, the parameter β0 defines the expected value of Y|X = 0 and β1 defines the expected rate of change in Y per unit change in X. From the example data of Table 1.1, the regression function of Y = TMT-B on X = Age would appear as in Figure 1.1, in which the conditional means of Y (time to completion of the TMT-B) given three values of X = 40, 50, and 75, for example (i.e., E{Y1}, E{Y2}, E{Y3}), lie precisely on the regression line to satisfy the assumption of linearity. Note that the values of the observations Y1, Y2, and Y3 appear in the plane of their respective probability distributions but deviate from their conditional means. The vector of deviations, ε = y − Xβ, contains the error terms of the regression model in Equation 1.3.

Figure 1.2  Regression of Trail Making Test–Part B on Age and Education

[Figure: three-dimensional scatterplot of TMT-B (vertical axis, roughly 40 to 120) against Age (roughly 40 to 80) and Education (roughly 8 to 18), with the fitted regression plane drawn through the points.]

A similar example of a two-predictor model is illustrated in Figure 1.2 by the graph of the relationship between Y = TMT-B, X1 = age, and X2 = education for the n = 40 sample data descriptively summarized in Table 1.1. The scatterplot reveals a positive relationship between Y and X1 and a negative relationship between Y and X2. The population regression function E(Y|X) = Xβ is defined by the planar surface with partial slopes of βX1 and βX2. The discrepancies between the observations and the model (i.e., the distances between the circles and the plane) are indices of lack of model fit and are captured in the errors of the model, ε = Y − Xβ.

Thus, all univariate linear models in which the observations are decomposed into model and error components can be written as

y = Xβ + ε.  [1.6]

The differences between Y and the expected values of Y are the errors of prediction of the model,6

ε = y − E(y|X),  [1.7]

which are illustrated by the distances from each point to the two-dimensional plane in Figure 1.2. The closer all the observed values are to the fitted regression plane, the better the fit of the model to the data.

6 The symbols Ŷ, ŷ, and μ̂(y|X1 X2 … Xq) will denote sample estimates of the population E(Y|X).
The criterion of least squares is used to estimate optimal values of β such that the discrepancies between the observations and the values predicted by the model are as small as possible. Using the differential calculus, the values of β are chosen to minimize the sum of the squared errors of prediction:

Σε² = ε′ε = (y − Xβ)′(y − Xβ).  [1.8]

Substituting the sample estimates of the population parameters, β̂(q+1 × 1) = (β̂0, β̂1, …, β̂q)′, into Equation 1.8, it can be shown that taking the partial derivatives of ε̂′ε̂, setting them to zero, and solving the resulting set of simultaneous equations lead to the optimal solution for the regression coefficients,7

β̂ = (X′X)⁻¹(X′y).  [1.9]

Applying Equation 1.8 to the example data of Table 1.1 gives the unstandardized parameter estimates of the regression of TMT-B on age, education, and gender,8

    β̂ = [β̂0, β̂1, β̂2, β̂3]′ = [65.69, 0.92, −1.87, −4.68]′.

7 We will use the diacritic ^ over a symbol to denote a sample estimate of its population parameter.

8 (X′X)⁻¹ is the inverse of the uncorrected raw score sum of squares and cross products matrix (SSCP) of X, and (X′y) is the uncorrected raw score sum of cross products (SCP) between X and Y. The unstandardized regression coefficients of Equation 1.8 are identical to those obtained from mean-corrected SSCP and SCP matrices. Details of the relationship between raw score and mean-corrected SSCP and SCP matrices are given in Rencher (1998, pp. 269–271).

Interpretations follow the usual rules: Each one-year increase in age is accompanied by an increase of approximately 9/10 of a second in the time to complete the TMT-B task; each additional year of education reduces the time to completion by about 2 seconds; and males and females differ by an average of about 4.7 seconds on the timed TMT-B, with females showing the faster performance. The expected time to completion of the TMT-B for a 50-year-old woman with 12 years of education would be estimated at 85 seconds.
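A minimal sketch of the estimation step, assuming NumPy: the helper below applies Equation 1.9 to any design matrix and response vector (np.linalg.solve is used rather than an explicit inverse for numerical stability), and the reported TMT-B coefficients are then used to reproduce the 85-second prediction quoted above. The function name ols is illustrative, not part of any package.

import numpy as np

def ols(X, y):
    # Least squares estimates beta_hat = (X'X)^(-1) X'y (Equation 1.9).
    return np.linalg.solve(X.T @ X, X.T @ y)

# Reported estimates for the TMT-B example: intercept, age, education, gender.
beta_hat = np.array([65.69, 0.92, -1.87, -4.68])

# Expected TMT-B time for a 50-year-old woman (gender = 1) with 12 years of education.
x_new = np.array([1.0, 50.0, 12.0, 1.0])
print(x_new @ beta_hat)    # about 84.6 seconds, i.e., roughly the 85 seconds cited above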
It is occasionally useful to reparameterize the regression model to mean zero and unit variance (e.g., ZY, ZX1, ZX2, ZX3), in which case the particulars of the regression model in standard score form9 can be expressed in terms of correlation coefficients. The standard score regression model can be written in scalar and matrix form as

ZY = β*1 ZX1 + β*2 ZX2 + … + β*q ZXq + ε,

zy = ZX β* + ε,  [1.10]

with errors of prediction defined as

ε = zy − ZX β*.  [1.11]

The least squares estimates, β̂*, of the standardized regression parameters chosen to minimize the sum of squared errors of Equation 1.11 are found by

β̂* = R⁻¹XX RXY,  [1.12]

where RXX and RXY are, respectively, the matrix of correlations among the predictors and the vector of correlations between the predictors and the criterion.10 Estimating β̂* for the example data of Table 1.1 yields the fitted model,

    β̂* = [β̂*1, β̂*2, β̂*3]′ = [0.61, −0.15, −0.07]′.

9 The symbol β* will be used to denote parameters in standard score form, with the standardized estimates of the parameters denoted by β̂*.

10 RXX and RXY are the sample size–adjusted SSCP and SCP matrices in standard score form.

The usual rules for interpreting standardized coefficients apply; each coefficient represents a β*j standard deviation change in Y per standard deviation change in Xj. There may be little to be gained by interpreting any single standardized regression coefficient in lieu of its unstandardized counterpart, but it is often recommended that standardized coefficients be used if comparative evaluation of the relative influence of predictors is a goal of the analysis (Bring, 1994; Darlington, 1990, pp. 217–218). These recommendations are based on the fact that the absolute values of the unstandardized regression coefficients (β̂j) are partly dependent on the scale of measurement, which can differ across predictors, whereas the standardized coefficients (β̂*j) are scale adjusted.11 For the predictor variables of age and education, the raw regression coefficients suggest that age is a less important predictor than education (ignoring the differences in scale—SDage = 21.68, SDeducation = 2.60), whereas the standardized coefficients suggest the opposite relative importance, with age being greater than education after adjusting for the underlying scale differences. The issue of testing the significance of these differences (i.e., β̂1 vs. β̂2, and β̂*1 vs. β̂*2) will be shown in a later section to involve tests of quite different conceptual hypotheses even if the raw scores are on equal scales, where SDX1 = SDX2.

11 Standardized regression coefficients have little meaning for categorical predictor variables. The standard deviation of the numbers used to designate categories of a nominal grouping variable has no meaningful interpretation beyond the ability of the numerals to distinguish categories. In a later section, we note that the standardized version of a dichotomous predictor may have a useful interpretation when involved in a test of relative importance compared with other predictors in the model.
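Because Equation 1.12 requires only the correlations in Table 1.1, the standardized estimates can be reproduced directly, and rescaling by the ratio of standard deviations recovers the unstandardized slopes. The sketch below uses the table's rounded values, so the output matches the text only up to rounding.

import numpy as np

# Correlations among the predictors (age, education, gender) from Table 1.1.
R_xx = np.array([[ 1.000, -0.171,  0.014],
                 [-0.171,  1.000, -0.114],
                 [ 0.014, -0.114,  1.000]])

# Correlations of each predictor with TMT-B.
r_xy = np.array([0.632, -0.244, -0.046])

# Equation 1.12: standardized estimates beta*_hat = R_XX^(-1) R_XY.
beta_std = np.linalg.solve(R_xx, r_xy)
print(beta_std)                        # approximately [ 0.61, -0.15, -0.07]

# Unstandardized slopes: beta_j = beta*_j * (SD_Y / SD_Xj), using Table 1.1.
sd_y = 32.77
sd_x = np.array([21.68, 2.60, 0.50])
print(beta_std * sd_y / sd_x)          # approximately [ 0.92, -1.87, -4.7]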

Assumptions Needed to Justify the Validity of the Least Squares Estimates

There are no assumptions required to justify the least squares estimation of the parameters—that process is purely descriptive. But several important assumptions about the linear model can be introduced at this point. If met, the assumptions provide a degree of confidence in the interpretation of the coefficients as well as justify the validity of the test statistics to be discussed in a later section of this chapter. The assumptions include the following:

• The model is linear; the E(Y|X) lie precisely on a straight line.
• The model is correctly specified; no important variables are omitted from the analysis.
• The predictor variables Xj are measured without error.
• E(ε) = 0. The errors of the regression model are a random variable with mean zero.
• Cov(εi, εj) = 0, for i ≠ j. The errors are assumed to be independent, with covariance of zero.
• V(ε) = σ²I(n × n). The variance of the errors is assumed to be a constant. The quantity σ² is a population parameter and is estimated in the sample by the mean square error,

  σ̂² = Σ(Y − Xβ̂)² / (n − qf − 1),

  where n − qf − 1 denotes the degrees of freedom for error based on qf predictor variables in the full model.
• ε ~ N(0, σ²I). The errors of the model are assumed to be normally distributed with mean zero and variance σ²I, which provides the connection to the probability distribution that underlies the test statistics applied to the regression coefficients.

More extensive accounts of the assumptions and the diagnosis of their violations can be found in Cohen et al. (2003, Sect. 4.3–4.5).
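None of these assumptions is needed to compute the least squares estimates, but they underwrite the inferences that follow. The sketch below simulates data that satisfy the assumptions and shows the mean square error recovering σ²; every number here is artificial and chosen only for illustration.

import numpy as np

rng = np.random.default_rng(1)
n, q_f = 200, 2
X = np.column_stack([np.ones(n), rng.normal(50, 10, n), rng.normal(12, 3, n)])
beta  = np.array([40.0, 0.9, -2.0])        # artificial population coefficients
sigma = 5.0                                 # artificial error standard deviation
y = X @ beta + rng.normal(0.0, sigma, n)    # errors: independent, mean zero, constant variance

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # Equation 1.9
resid = y - X @ beta_hat
mse = resid @ resid / (n - q_f - 1)              # estimates sigma^2 (= 25 here)
print(beta_hat, resid.mean(), mse)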

Partitioning the Sums of Squares and Defining Measures of Goodness of Fit

The strength of the relationship between the criterion and the predictor variables in a linear model is documented by two indices: the sum of squared errors (SSERROR = Σ(Y − Xβ̂)² = Σε̂² = ε̂′ε̂) and the squared multiple correlation coefficient (R²). To obtain each of these measures requires that the variability in the response variable be partitioned into its constituent parts related to Equation 1.3. The partitioned SS is

SSTOTAL = SSMODEL + SSERROR.  [1.13]

The estimated vector of errors of the model is given by ε̂ = y − Xβ̂, and the sum of squared errors of Equation 1.13 is defined by ε̂′ε̂. As a measure of goodness of fit, ε̂′ε̂ has known lower and upper bounds, 0 ≤ ε̂′ε̂ ≤ SSTOTAL, defining a range from no relationship to perfect relationship. The measure ε̂′ε̂ is ambiguous as a measure of strength of association unless SSTOTAL is known. The mean corrected total sum of squares is SSTOTAL = Σ(Y − Ȳ)² = y′y − ȳ′ȳ, where ȳ is an (n × 1) vector of the mean of Y repeated n times. Redefining y′y = (y′y − ȳ′ȳ) to be the mean corrected SSTOTAL, and redefining β̂′X′y = (β̂′X′y − ȳ′ȳ) to represent the mean corrected SSMODEL, the partition of the sums of squares of Equation 1.13 is12

y′y = β̂′X′y + ε̂′ε̂.  [1.14]

It is common practice to rely on the value of R², which is scaled to take on values in the interval [0, 1], as an index of goodness of fit. SSTOTAL is the maximum variability available in Y, SSERROR is the variability in Y that cannot be accounted for by the model, and SSMODEL is that part of the variability in Y that is accounted for by the model. The proportion of the variability in Y that is accounted for by the model, R², is the scaled measure of goodness of fit and is computed as

R²Y·X1X2…Xq = 1 − ε̂′ε̂ / y′y  [1.15]

or, more commonly,

R²Y·X1X2…Xq = β̂′X′y / y′y.  [1.16]

If the partitioning is done in terms of standard scores, it can be shown that a convenient definition of R² is given by

R²Y·X1X2…Xq = β̂*1 rY·X1 + β̂*2 rY·X2 + … + β̂*q rY·Xq.  [1.17]

For the TMT-B example data of Table 1.1, the mean corrected Total and Model SS are y′y = 41875.33 and β̂′X′y = 17758.00. The fit of the model is found to be

R²Y·X1X2X3 = 17758.00 / 41875.33 = .424.

12 The uncorrected sum of squares of Y, ΣY² = y′y, contains both the SS associated with the predictor variables (β1, β2, …, βq) and the SS associated with the intercept. The mean corrected SS, y′y − ȳ′ȳ, disaggregates these two quantities. Rencher (1998, Sect. 4.3–4.5) gives details of the relationships between uncorrected and mean-corrected SS.

About 42% of the variation in TMT-B performance is accounted for by age, education, and gender. About 58% of the variation in Y remains unexplained and is a function of other unknown sources, both systematic and random.
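Both routes to R² can be checked from quantities already reported: the sums of squares of Equation 1.16 and the standardized coefficients and validities of Equation 1.17. A sketch, assuming the rounded values in the text and in Table 1.1 are exact:

import numpy as np

# Equation 1.16: R^2 as mean-corrected SS_MODEL / SS_TOTAL.
ss_model, ss_total = 17758.00, 41875.33
print(ss_model / ss_total)              # about .424
ss_error = ss_total - ss_model          # 24117.33, by the partition of Equation 1.13

# Equation 1.17: R^2 as the sum of beta*_j * r_Y.Xj, from the Table 1.1 correlations.
R_xx = np.array([[ 1.000, -0.171,  0.014],
                 [-0.171,  1.000, -0.114],
                 [ 0.014, -0.114,  1.000]])
r_xy = np.array([0.632, -0.244, -0.046])
beta_std = np.linalg.solve(R_xx, r_xy)
print(beta_std @ r_xy)                  # also about .424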

Full and Restricted Models and Squared Semipartial Correlations

In addition to the full model R² based on qf predictors, it is often of interest to ascertain the proportion of variation in Y that is uniquely attributable to Xj, adjusted for all remaining X variables. These squared semipartial correlation coefficients (r²Y(X1|X2X3…Xqf), r²Y(X2|X1X3…Xqf), …, r²Y(Xqf|X1X2…Xqf−1)) can be computed from the extra sums of squares approach (Draper & Smith, 1998, pp. 149–160), which requires evaluating the difference between full and restricted model R²s. Define the full model R²full as the proportion of variation in Y accounted for by all the qf predictors in the model, X1, X2, …, Xqf. Define a restricted model R²restricted as the proportion of variability in Y accounted for by a subset of qr < qf predictors, say X1, X2, …, Xqr. Since the full model R²full documents the proportion of variability in Y accounted for by all the predictors and the restricted model R²restricted represents the proportion of the variability in Y accounted for by the qr predictors, the difference between the full and restricted model R²s must represent the unique incremental variation in Y accounted for by those predictors that are not contained in the restricted model. The difference R²full − R²restricted is the squared semipartial correlation coefficient. Examples of squared semipartial correlations for the TMT-B example data are

r²Y(X1|X2X3) = R²Y·X1X2X3 − R²Y·X2X3 = .424 − .065 = .359,
r²Y(X2|X1X3) = R²Y·X1X2X3 − R²Y·X1X3 = .424 − .403 = .021,
r²Y(X3|X1X2) = R²Y·X1X2X3 − R²Y·X1X2 = .424 − .419 = .005,
R²Y(X1X2|X3) = R²Y·X1X2X3 − R²Y·X3 = .424 − .002 = .422.

About 36% of the variation in TMT-B performance is attributable to age after adjusting for the variance accounted for by education and gender; the variance in TMT-B that is uniquely attributable to education or gender is negligible. The multiple squared semipartial correlation of age and education adjusted for gender appears to define the best prediction model, but it is unclear whether this value is a significant improvement over age alone (r²YX1 = .632² = .400) because we know little about the sampling variability that accompanies these models. Methods for assessing statistical significance and testing hypotheses on contrasts between predictors are reviewed in the next section.
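The extra sums of squares logic can be mimicked from the correlation matrix alone: each restricted-model R² is computed from the relevant sub-block of Table 1.1, and its difference from the full-model R² is the squared semipartial correlation. A sketch under the assumption that the rounded correlations are exact:

import numpy as np

# Full correlation matrix from Table 1.1, ordered TMT-B, age, education, gender.
R = np.array([[ 1.000,  0.632, -0.244, -0.046],
              [ 0.632,  1.000, -0.171,  0.014],
              [-0.244, -0.171,  1.000, -0.114],
              [-0.046,  0.014, -0.114,  1.000]])

def r_squared(predictors):
    # Multiple R^2 of TMT-B (index 0) on the listed predictor indices.
    idx = list(predictors)
    R_xx = R[np.ix_(idx, idx)]
    r_xy = R[idx, 0]
    return r_xy @ np.linalg.solve(R_xx, r_xy)

r2_full = r_squared([1, 2, 3])       # about .424
print(r2_full - r_squared([2, 3]))   # age | education, gender    -> about .359
print(r2_full - r_squared([1, 3]))   # education | age, gender    -> about .021
print(r2_full - r_squared([1, 2]))   # gender | age, education    -> about .005
print(r2_full - r_squared([3]))      # age, education | gender    -> about .422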

Testing Hypotheses on the Regression Coefficients and R²s

The trustworthiness of β̂ or R² depends on knowledge of the sampling variability of the statistic and on test statistics for evaluating hypotheses on the parameters of the model. The two most common methods are the F-test on values of R² and the single degree of freedom t-test on a model regression coefficient, where t = √F. A generic F-test on dfh and dfe degrees of freedom based on appropriately specified full and restricted models can be defined as

F(dfh, dfe) = [(R²full − R²restricted) / (1 − R²full)] · (dfe / dfh).  [1.18]

Let qf be the number of predictors in the full model (exclusive of the unit vector X0), let qr be the number of predictors in the restricted model, and let dfh = qf − qr and dfe = n − qf − 1. If it is assumed that εi ~ N(0, σ²), then the test statistic in Equation 1.18 follows the F distribution with qf − qr and n − qf − 1 degrees of freedom. The numerator of the left-most ratio of the F-test is the definition of the squared semipartial correlation. The nature of R²restricted will be dictated by the hypothesis to be tested, since the hypothesis dictates the constraints to be placed on the full model. If the hypothesis on the whole model, H0: β1 = β2 = … = βqf = 0, is desired,13 the restricted model will contain only β0 with R²restricted = 0, leading to the numerator R²full − R²restricted = R²full. A test of a hypothesis on a single regression coefficient, such as H0: β1 = 0, would involve R²restricted = R²Y·X2X3…Xqf. Hypotheses on any single coefficient, or set of coefficients, can be tested in this manner. Further examples of hypothesis tests involving restrictions placed on the linear model are given in Rindskopf (1984).

For single dfh tests, the t-test on the hypothesis H0: βj = k is, in common usage,14

t(dfe) = (β̂j − βj) / √[ (MSE / SSXj) · (1 / (1 − R²Xj·other)) ],  [1.19]

13 This is equivalent to the hypothesis H0: ρ²Y·X1X2…Xqf = 0.

14 The value of k need not be hypothesized to be 0; any theoretically defensible value of k is permissible.

where

MSE = SSERROR / (n − qf − 1),

SSXj is the sum of squares of the predictor variable involved in the test, and

1 / (1 − R²Xj·other)

is the variance inflation factor (VIF) that adjusts for the multicollinearity among the predictor variables. For the TMT-B example, the F-test on the whole model R² = .424 is F(3, 36) = 8.84, p < .001. The tests of the significance of the individual partial regression coefficients for age, education, and gender yielded, respectively, t(36) = 4.74, p < .001; t(36) = −1.15, p = .258; and t(36) = 0.57, p = .575. Only the age variable is uniquely related to TMT-B performance. The t statistics on the individual coefficients are the square roots of the F values that would have been obtained by the full- versus restricted-model approach of Equation 1.18. The results of tests of hypotheses on the values of βj and on the values of their respective partial and semipartial correlations are identical.
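The generic F-test of Equation 1.18 reproduces the whole-model test and the single-coefficient tests from the full and restricted R² values, and Equation 1.19 gives the same answer for a single coefficient through its standard error. A sketch assuming n = 40, qf = 3, and the rounded summaries reported in the chapter; scipy is used only for the p-values, and the helper name glm_f is illustrative.

import numpy as np
from scipy.stats import f as f_dist

n, q_f = 40, 3
df_e = n - q_f - 1                       # 36

def glm_f(r2_full, r2_restricted, df_h, df_e):
    # Equation 1.18: F on (df_h, df_e) degrees of freedom.
    F = (r2_full - r2_restricted) / (1.0 - r2_full) * df_e / df_h
    return F, f_dist.sf(F, df_h, df_e)

print(glm_f(0.424, 0.000, 3, df_e))      # whole model: F(3, 36) ~ 8.84, p < .001
print(glm_f(0.424, 0.065, 1, df_e))      # age:       F ~ 22.4, t = sqrt(F) ~ 4.7
print(glm_f(0.424, 0.403, 1, df_e))      # education: F ~ 1.3
print(glm_f(0.424, 0.419, 1, df_e))      # gender:    F ~ 0.3

# Equation 1.19 route for age, using the mean-corrected SS of age derived from the
# X'X matrix reported later in the chapter and MSE = SS_ERROR / df_e.
ss_age  = 155103 - 2339**2 / 40          # about 18330
mse     = 24117.33 / df_e
vif_age = 1 / (1 - 0.029)                # 1/tolerance; R^2 of age on the other predictors, about .029 from Table 1.1
print(0.919 / np.sqrt(mse / ss_age * vif_age))   # about 4.74 = sqrt(22.44)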

The General Linear Hypothesis Test

Although the two methods for testing hypotheses described above are in wide usage, they are special cases of a much more general approach to testing hypotheses in linear models—the general linear hypothesis test. The general linear test is a compact procedure that covers an astonishing array of common and specialized tests of hypotheses in both the univariate and the multivariate linear models.

Assume for the univariate model of Equation 1.3 that we wish to test the hypothesis that all the sample regression coefficients in a full model have been drawn from a population in which all the coefficients, with the exception of the intercept, are simultaneously equal to zero. We can formalize this hypothesis by a linear combination of the parameters specified by the matrix product Lβ = 0. That is,

              [0  1  0  …  0] [β0 ]     [β1 ]     [0]
              [0  0  1  …  0] [β1 ]     [β2 ]     [0]
    H0: Lβ =  [⋮         ⋱  ⋮] [β2 ]  =  [ ⋮ ]  =  [⋮].  [1.20]
              [0  0  0  …  1] [ ⋮ ]     [βqf]     [0]
                              [βqf]

The L matrix is of order (c × qf + 1), and its role is to identify the coefficients of interest in any hypothesis. Other hypotheses might involve only a single parameter estimate (e.g., H0: β1 = 0) or some subset of the parameters (e.g., H0: [β1, β3]′ = [0, 0]′). In general, any desired hypothesis can be defined as a product of a vector (or matrix) of contrast coefficients, L(c × q+1), and the vector of parameters, β(q+1 × 1), from the full model analysis. A more general form of the contrast is possible, where the vector k can contain zeros (the traditional null hypothesis) or any other vector of theoretically justified nonzero values:

L(c × q+1) β(q+1 × 1) = k(c × 1).  [1.21]

The subscript c denotes the number of rows in L, which is equivalent to dfh in the associated test statistic. Once the desired hypothesis is specified, we can substitute the estimates of the parameters β̂ into Equation 1.22 to obtain the sum of squares for the hypothesis:

SSHYPOTHESIS = (Lβ̂)′ [L(X′X)⁻¹L′]⁻¹ (Lβ̂),  [1.22]

and the SSHYPOTHESIS can be used as the numerator of a familiar version of the F-test,

F(dfh, dfe) = (SSHYPOTHESIS / SSERROR) · (dfe / dfh).  [1.23]

Under the assumption that the errors of the model are normally distributed, F will follow the F distribution on dfh = c and dfe = n − qf − 1 degrees of freedom.

The Test of the Whole Model Hypothesis: β1 = β2 = β3 = 0 and ρ²Y·X1X2X3 = 0

For the TMT-B example data, we found the estimated regression coefficients to be

    β̂ = [β̂0, β̂1, β̂2, β̂3]′ = [65.69, 0.92, −1.87, −4.68]′,

and we desire a test of the hypothesis that the parameters for X1, X2, and X3 are simultaneously equal to 0, H0: β1 = β2 = β3 = 0. This statement is also a test of H0: ρ²Y·X1X2X3 = 0. The general linear test of the full model hypothesis is given by

         [0  1  0  0] [β0]   [β1]   [0]
    Lβ = [0  0  1  0] [β1] = [β2] = [0],
         [0  0  0  1] [β2]   [β3]   [0]
                      [β3]

which ignores the intercept. For the contrast matrix L, the inverse of the sum of squares and cross products matrix, (X′X)⁻¹, and the estimates of the parameters β̂, the hypothesis SS of Equation 1.22 is

    SSHYPOTHESIS = (Lβ̂)′ [L(X′X)⁻¹L′]⁻¹ (Lβ̂),

with

    Lβ̂ = [0.92, −1.87, −4.68]′  and  X′X = [  40    2339    504    18]
                                            [2339  155103  29097  1059]
                                            [ 504   29097   6614   221]
                                            [  18    1059    221    18],

which yields SSHYPOTHESIS = 17758.00. The SSHYPOTHESIS is identical to the SSMODEL obtained from β̂′X′y − ȳ′ȳ. With dfh = c = 3, dfe = n − qf − 1 = 36, and ε̂′ε̂ = 24117.33, the F-test on the whole model association is found to be

F(3, 36) = (17758 / 24117.33) · (36 / 3) = 8.84, p = .0002.

With R²Y·X1X2X3 = .424, there is sufficient evidence to reject H0.
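The whole-model result can also be reproduced with the general linear test machinery of Equations 1.22 and 1.23, using the X′X matrix printed above and the reported estimates. This is a sketch of the procedure rather than a re-analysis of the raw data; because β̂ is rounded to two decimals in the text, the computed hypothesis SS only approximates 17,758.

import numpy as np

# X'X for the TMT-B design matrix (intercept, age, education, gender), as printed above.
XtX = np.array([[  40.,   2339.,   504.,   18.],
                [2339., 155103., 29097., 1059.],
                [ 504.,  29097.,  6614.,  221.],
                [  18.,   1059.,   221.,   18.]])

beta_hat = np.array([65.69, 0.92, -1.87, -4.68])   # rounded estimates from the text
ss_error = 24117.33
n, q_f = 40, 3

# Contrast matrix for H0: beta1 = beta2 = beta3 = 0 (intercept excluded).
L = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])

Lb = L @ beta_hat
middle = np.linalg.inv(L @ np.linalg.inv(XtX) @ L.T)            # [L (X'X)^(-1) L']^(-1)
ss_hypothesis = Lb @ middle @ Lb                                # Equation 1.22
F = (ss_hypothesis / ss_error) * ((n - q_f - 1) / L.shape[0])   # Equation 1.23
print(ss_hypothesis, F)   # close to 17758 and 8.84; the gap reflects rounding of beta_hat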

Testing the Individual Contributions of X1, X2, and X3 by the General Linear Test

Hypothesis tests on the individual partial regression coefficients β1, β2, and β3 can be readily performed within the Lβ = 0 framework. For testing a hypothesis on β1, we specify

    Lβ = [0  1  0  0] [β0, β1, β2, β3]′ = β1 = 0,  [1.24]

and to specify the null hypotheses on β2 and on β3, we employ the vector β and the appropriate contrast vectors L = [0  0  1  0] and L = [0  0  0  1], respectively. All these hypotheses are tested by substituting β̂ into Equations 1.22 and 1.23; the results are summarized in Table 1.2.

Table 1.2  General Linear Hypothesis Tests on Individual Partial Regression Coefficients

Hypothesis              β̂        β̂*     r_semipartial   F(1, 36)      p
Age: β1 = 0            0.919    0.608      .599          22.44      <.001
Education: β2 = 0     −1.870   −0.148     −.145           1.32       .258
Gender: β3 = 0        −4.683   −0.072     −.072           0.32       .575

In this model, age is the only significant contributor to the prediction of TMT-B. The test statistic on any unstandardized β̂j is also the test of the significance of the standardized β̂*j and of the semipartial correlation rY(Xj|X1X2…). The tests of hypotheses on sets of predictors are likewise identical for unstandardized and standardized partial regression coefficients and for the multiple semipartial correlations associated with each set. These equivalences no longer hold when more complex hypotheses are tested by the general linear test.

Testing More Complex Hypotheses With the General Linear Test

The general linear hypothesis test is suitable for formulating and testing many complex hypotheses (Draper & Smith, 1998, pp. 217–221; Rindskopf, 1984). Consider the question of whether age is a better predictor of TMT-B performance than is education, both adjusted for gender. This question asks whether the coefficients β1 and β2 are significantly different from one another, as expressed in the null hypothesis H0: β1 = β2. This hypothesis can be specified by the contrast matrix L = [0  1  −1  0], which deletes β0 and β3 from β and defines the difference between β1 and β2. The symbolic contrast of the null hypothesis, Lβ = 0, is

    Lβ = [0  1  −1  0] [β0, β1, β2, β3]′ = [β1 − β2] = 0,  [1.25]

which gives the basis for evaluating the SSHYPOTHESIS and the numerator of the F-test. Substituting the estimates β̂j into Equation 1.22, we find

 65.69 
 
 = [0 1 −1 0]  .92  = [ −.95]
L
 1.87 
 
 −4.68

( )
−1
and L ( X′ X) L′ = 259.53 with  ′ = 24117.33. The F-test of Equation
−1

1.23 is
234.71 36 = 0.32, p = .573.
F(1,16 ) = •
24117.33 1

The unstandardized partial slopes of age and education (reverse scored),15 shown in Figure 1.3, do not differ significantly from one another. Interpreting this lack of a significant difference should be done cautiously. Many authors point out that such a contrast makes sense only if the two variables are measured on the same scale, which is not the case with age (range = 33–105, SD = 21.7) and education (range = 8–18, SD = 2.6).

The unstandardized partial slopes of Figure 1.3 do not differ significantly from one another, but the squared semipartial correlations suggest that the variance in TMT-B accounted for by age is substantially greater than the variance accounted for by education. The relative rank order of the two predictor variables is opposite when unstandardized slopes and semipartial correlations are used for the ranking, largely because of the differences in scale of the predictor variables.

15 Age has a positive relationship to TMT-B; performance deteriorates with increasing age. Conversely, TMT-B has a negative relationship with increasing years of education. Contrasts between regression coefficients are sensitive to both magnitude and direction, and a choice must be made between testing differences in magnitude only or testing differences in both magnitude and direction. Theoretical considerations based on substantive knowledge should be brought to bear to make this choice. For the age versus education comparison illustrated here, only the magnitude of the effect is of interest. Reversing the scoring of the education variable equates the signs of the age and education coefficients; hence the contrast is one of magnitude and not direction. If there is theoretical justification to leave the signs of the regression coefficients in the original scoring of age and education, then a test of both magnitude and direction would result. The F-test on this contrast is F(1, 36) = 3.01, p = .091, still a nonsignificant result.
An alternative test that avoids the issue of inequalities of scale is the general linear test applied to the standardized coefficients by testing the hypothesis H0: β*1 − β*2 = 0.16

Figure 1.3  Comparison of Unstandardized Partial Slopes, Standardized Partial Slopes, and Semipartial Correlations

[Figure: two added-variable plots of TMT-B. Left panel, age adjusted for education and gender: b = .919, beta = .608, semipartial r = .599, semipartial r-square = .359. Right panel, education (reverse scored) adjusted for age and gender: b = −1.870, beta = −.148, semipartial r = −.145, semipartial r-square = .021.]

Note: Education is reverse scored to guarantee a positive slope.

16 The scoring of the education variable was also reversed in this analysis to constrain the sign of each standardized slope to a positive value. The contrast is therefore a test of the difference in magnitude of the semipartial correlations.

Estimating the parameters by β̂*1 and β̂*2 and performing the same sequence of computations on the standardized variables ZY, ZX1, ZX2, and ZX3 leads to

    Lβ̂* = [1  −1  0] [0.61, 0.15, −0.07]′ = 0.460,

SSHYPOTHESIS = 3.397, SSERROR = 22.461,17 and F(1, 36) = 5.44, p = .025. The standardized parameter estimates differ significantly by the hypothesis test applied to the standardized coefficients. The reason for the differing results is a consequence of the differences in the scales of measurement of the predictor variables; it can be shown that the jth standardized coefficient is a ratio of its semipartial correlation to the square root of the proportion of variation in Xj not accounted for by the remaining predictors (i.e., the tolerance) in the full model, that is,

    β̂*j = rY(Xj|Xj′) / √(1 − R²j·other).

17 The error sum of squares in standard score form is z′Y zY − β̂*′Z′X zY = (n − 1)(1 − R²Y·X1X2X3).
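Both relationships used in this section can be checked numerically from the reported summaries: the tolerance identity linking each standardized coefficient to its semipartial correlation, and the standardized contrast F recovered from the sums of squares quoted above. A sketch, assuming the rounded values are exact:

import numpy as np

# Correlations from Table 1.1, ordered TMT-B, age, education, gender.
R = np.array([[ 1.000,  0.632, -0.244, -0.046],
              [ 0.632,  1.000, -0.171,  0.014],
              [-0.244, -0.171,  1.000, -0.114],
              [-0.046,  0.014, -0.114,  1.000]])

def r_squared(y_idx, x_idx):
    # Multiple R^2 of variable y_idx on the variables in x_idx.
    R_xx = R[np.ix_(x_idx, x_idx)]
    r_xy = R[x_idx, y_idx]
    return r_xy @ np.linalg.solve(R_xx, r_xy)

# Tolerance of age (1 minus R^2 of age on education and gender) and its semipartial r with TMT-B.
tol_age = 1 - r_squared(1, [2, 3])
sp_age  = np.sqrt(r_squared(0, [1, 2, 3]) - r_squared(0, [2, 3]))
print(sp_age / np.sqrt(tol_age))         # about .61, the standardized coefficient for age

# Standardized contrast H0: beta*_1 - beta*_2 = 0, from the reported sums of squares.
ss_hyp, ss_err, df_e = 3.397, 22.461, 36
print(ss_hyp / ss_err * df_e)            # F(1, 36), about 5.44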

Unstandardized regression coefficients and their standard errors take their absolute magnitudes for two reasons: (1) the scale of measurement and (2) the underlying relationship between the predictor and response variables. Conversely, the magnitude of the standardized coefficients is most heavily determined by the semipartial correlations and the tolerances of the predictors. Consequently, a difference between standardized coefficients constitutes a test of the differences between semipartial correlation coefficients18—it is a test of differences between correlations of Y and each predictor after adjustment for the other predictors in the model. The test statistics for differences between raw regression coefficients and between semipartial correlations need not be equal. The two tests are numerically independent because they test conceptually different hypotheses—differences in rates of change versus differences in strength of association. The tests between coefficients for unstandardized and standardized models will be identical only when SX1 = SX2. Similar tests of differences between correlation coefficients are discussed in Olkin and Finn (1995). Draper and Smith (1998, pp. 218–219) and Rencher (1998, pp. 295–300) give examples of more complicated linear hypothesis tests in which the same principles apply.

Generalizing From Univariate to Multivariate General Linear Models

We have begun this volume with a review of the common strategies for modeling a single response variable as a function of one or more continuous and/or categorical explanatory variables. Such models have great flexibility and can accommodate any combination of predictor variable types, including their interactions and powers. Recounting these details here sets the stage for the generalization of these same analytic concepts to those instances where more than one dependent variable is to be analyzed simultaneously. Models with p > 1 response variables are classified as multivariate models that can be treated with the same four-step process—the specification of the multivariate model, estimation of its parameters, identification of measures of strength of association, and definition of appropriate tests of significance. We pursue these topics in the chapters that follow.

18 The test of the difference between two standardized regression coefficients from a regression analysis is defined as

    t = (β̂*1 − β̂*2) / √[MSE (L R⁻¹XX L′)]

(Cohen et al., 2003, pp. 640–642), where R⁻¹XX is the inverse of the correlation matrix among the predictors and

    MSE = (1 − R²Y·X1X2X3) / (n − qf − 1).

Substituting the definitions

    β̂*1 = rY(X1|X2) / √(1 − R²1·2)  and  β̂*2 = rY(X2|X1) / √(1 − R²1·2)

into t sets the numerator to

    [rY(X1|X2) − rY(X2|X1)] / √(1 − R²1·2).

Setting the contrast matrix to L = [1  −1  0] and performing the symbolic multiplication of the quantity MSE(L R⁻¹XX L′), the denominator of t reduces to

    √[MSE · 2(1 + r12) / (1 − R²1·2)].

The quantities √(1 − R²1·2) in the numerator and denominator cancel, leaving

    t = [rY(X1|X2) − rY(X2|X1)] / √[MSE · 2(1 + r12)].

Hence, the test of the hypothesis β*1 − β*2 = 0 is a test of the difference between semipartial correlation coefficients. In this interpretation, approximately 36% of the variance in TMT-B is accounted for by age, while about 2% of the variance in TMT-B is accounted for by education. The absolute values of the two correlations are significantly different from one another, while the absolute values of the two unstandardized slopes do not differ significantly. The difference between unstandardized rates of change is being masked by differences in the variance of the predictors.
