Lecture 06

This document discusses multiple linear regression. It introduces a general regression model that extends simple linear regression to include multiple predictor variables. It explains that including additional variables allows researchers to control for confounding factors and test the effects of multiple factors simultaneously. It provides examples to illustrate interpreting regression coefficients, the least squares fit, fitted residuals, and the mean squared error.

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

VI. Multiple Linear Regression


The term “multiple” refers to the inclusion of more than one
regression variable. The general regression model extends the simple
model by including p – 1 covariates Xi1, Xi2, …, Xi,p-1 as follows:

Yi = β0 + β1Xi1 + β2Xi2 + ··· + βp-1Xi,p-1 + εi,

where as before we assume εi ~ N(0,σ2).



Why “control” for additional factors?


In scientific settings, there may be several reasons to include additional
variables in a regression model:

• Other investigative variables may have scientifically interesting
predictive effects relative to Y.
• We may be interested in testing the simultaneous effects of multiple
factors.
• We need to model a predictor that (a) is a nominal (unordered)
categorical variable (e.g., the feed type in an earlier example), or (b) has
nonlinear effects that require including polynomial terms (e.g., an X2
variable for a quadratic effect).
• There are confounding factors that need to be controlled for as a part
of the modeling analysis.

Confounding Variables
Suppose that we are investigating the relationship between two
variables X and Y. That is, we want to know if variation in X causes
variation in Y.

Controlled or randomized experiments are those in which study
subjects are randomly assigned to levels of X. Randomization balances,
across levels of X, other factors that might influence the relationship
between X and Y. We often refer to these additional factors as
confounding variables.

Observational studies are those in which there is no randomization –
investigative groups are defined based on how subjects selected
themselves to a particular treatment or exposure. In this case, we need
to control for confounding factors as a part of the data analysis.

Example VI.A
Consider recent findings from a study of “lighthearted” pursuits on average blood
pressure:

http://www.cnn.com/2011/HEALTH/03/25/laughter.music.lower.blood.pressure/index.html

Is this a controlled or observational study? What advantages (if any) does the study
design have in this case?

Consider also the following results reported on a study of weight and bullying among
children:

http://inhealth.cnn.com/living-well/overweight-kids-more-likely-to-be-bullied?did=cnnmodule1

What kind of study is this? What factors might the researchers need to control for?

Interpretation of Regression Coefficients


For the multiple regression model, a coefficient βj represents the
effect of Xij on E{Yi} (the average of the outcome variable),
holding all other variables constant.

In this sense, we are controlling for the effects of other factors, by
assuming a common effect of Xij on the average of Y across all levels
of the additional variables.

Note that in the regression setting it is also quite straightforward to
model interactions between two predictor variables, if the effect of
one depends on the level of the other. We will discuss interactions
more later.

Example VI.B

Consider a two-variable model

Yi = β0 + β1Xi1 + β2Xi2 + εi.
Suppose that β0 = 5, β1 = 2, and β2 = 1.5.

What is your interpretation of β0?

What is the effect of Xi1 on the average of Yi when Xi2 = 3? What is


the effect of Xi1 when Xi2 = 10?
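
(One way to see the answer to the last question, a worked sketch using the stated coefficient values:

E{Yi} = 5 + 2Xi1 + 1.5Xi2,
at Xi2 = 3:  E{Yi} = 9.5 + 2Xi1,
at Xi2 = 10: E{Yi} = 20 + 2Xi1.

In either case a one-unit increase in Xi1 raises the mean of Yi by β1 = 2; without an interaction term, the effect of Xi1 does not depend on the value of Xi2.)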

Example VI.B (continued)


A geometric illustration of this relationship is shown in the plot below. This is a
three-dimensional scatterplot, assuming a random sample of points. The
superimposed plane represents the combined linear effect of X1 and X2 on the average
of Y. What model parameter measures the variation of the points around the plane?

What relationship do we assume between X1 and Y if we hold X2 constant?

Example VI.B (continued)

Note that for three or more predictor variables, the linear relationship
is manifest through a hyperplane in p dimensions, where p – 1
represents the number of variables included in the model.

Beyond two predictors, it is therefore difficult to actually visualize
the model, but we can apply the same interpretation of the relative
effects (in terms of the model coefficients) on the average of the
outcome variable.

Additional Remarks on “Linear”


“Linear” refers to linearity in the coefficients, not necessarily in the
predictors themselves.

We have already seen this before when discussing the role of
transformations for a single variable. For example, we may observe a
relationship such that

log(Yi) = β0 + β1Xi1 + ··· + βp-1Xi,p-1 + εi,

so that fitting the linear regression model is simply a matter of fitting

Y’i = β0 + β1Xi1 + ··· + βp-1Xi,p-1 + εi,

where Y’i = log(Yi).
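
As a rough sketch of how such a fit might be set up in SAS (the data set and variable names below are hypothetical placeholders, not files from the course):

data trans;
   set mydata;            /* hypothetical input data set with response y and predictors x1, x2 */
   yprime = log(y);       /* work with Y' = log(Y) as the outcome */
run;

proc reg data=trans;
   model yprime = x1 x2;  /* ordinary multiple linear regression on the transformed outcome */
run;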

Least-squares Fit
Note that the least-squares fit involves the same minimization of
squared residuals (in this case around a plane or hyperplane). The
residual is defined as:

εi = Yi – (β0 + β1Xi1 + ··· + βp-1Xi,p-1).

To find the least-squares fit, we minimize the objective function

Q = Σi=1,…,n εi² = Σi=1,…,n [Yi – (β0 + β1Xi1 + ··· + βp-1Xi,p-1)]².

Note that there is a lot of matrix notation in Chapter 6 to describe the
least-squares approach. Unless you’re interested in the linear
algebra, you can ignore all of that.
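
(For reference only: in that matrix notation, with Y the n × 1 response vector and X the n × p design matrix whose first column is all ones, the least-squares estimates solve the normal equations

X′X b = X′Y,  so that  b = (X′X)⁻¹X′Y

whenever X′X is invertible. This is the computation software carries out, and it is exactly what breaks down under the perfect collinearity discussed at the end of these notes.)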

Fitted Residuals and MSE


As in the simple case, the least-squares fit yields estimates b0, b1, …,
bp-1 for the regression coefficients, and our prediction equation can be
expressed as:

Ŷi = b0 + b1Xi1 + b2Xi2 + ··· + bp-1Xi,p-1.

The observed residuals are then given by:

ei = Yi – Ŷi = Yi – (b0 + b1Xi1 + ··· + bp-1Xi,p-1), i = 1, …, n,

and our estimated error variance (MSE) is computed as

s² = [1/(n – p)] Σi=1,…,n ei² = [1/(n – p)] Σi=1,…,n (Yi – Ŷi)² = SSE/(n – p) = MSE.

Analysis of Variance
Note the error degrees of freedom for MSE. This suggests an
ANOVA table for a multiple regression model that looks something
like this:

Source          Degrees of    Sum of     Mean
                Freedom       Squares    Square    F-statistic    p-value
Regression      p – 1         SSR        MSR       MSR/MSE
Error           n – p         SSE        MSE
Total           n – 1         SSTO



Components of Variation

As with the simple model:

• SSTO represents the total variation in Y.

• SSR represents the variation in Y explained by the regression
model that includes the set of X variables X1, …, Xp-1.

• SSE represents the residual (unexplained) variation, or the total
variation of the points around the hyperplane.

These pieces satisfy the usual decomposition SSTO = SSR + SSE.

ANOVA F Test
Note that there are p – 1 degrees of freedom associated with SSR,
and n – p degrees of freedom for SSE. The ANOVA F test statistic is
therefore given by MSR/MSE, and, under the null hypothesis introduced
below, it follows an F(p – 1, n – p) distribution.
The question is: what is the F statistic for? For a regression model
with multiple predictors, the F statistic tests the relationship between
Y and the set of X variables X1, …, Xp-1. In other words, the overall F
test evaluates the hypothesis

H0: β1 = β2 = ··· = βp-1 = 0, versus
HA: At least one βj ≠ 0, for j = 1, …, p – 1.
Under H0, MSR and MSE should be approximately equal, so a large
value of F (i.e., a small p-value) provides evidence against the null.
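
(As a concrete check, the fitted two-variable model in Example VI.C below gives

F = MSR/MSE = 51285 / 1945.74 ≈ 26.36,

compared against an F(2, 22) distribution since p – 1 = 2 and n – p = 22; this is exactly the F value and <.0001 p-value reported in the SAS output.)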

Distribution of Individual Fitted Coefficients


As with the estimated slope in the simple case, the individual fitted
coefficients b1, …, bp-1 are at least approximately normally distributed.

All we need to know for a given bj is its estimated standard error
s{bj}, and the t distribution can be applied as before.

In other words, the statistic

(bk – βk) / s{bk}

follows a t(n – p) distribution. Note that (especially in the case of
multiple predictors) the computation of s{bj} is too complicated to
perform by hand, and in general is carried out using software.

Inference for Individual Coefficients


Using the distribution on the previous slide, a (1 – α)100%
confidence interval for βj is given by

bj ± t(1 – α/2; n – p) s{bj}.

We also would like to test the null hypothesis H0: βj = 0 versus the
alternative hypothesis HA: βj ≠ 0. A test statistic for assessing the
evidence against H0 is given by

t = (bj – 0) / s{bj}.

Under H0, this test statistic approximately follows the t(n – p)
distribution. The p-value is therefore given by 2P{t(n – p) ≥ |t|}.
Note that we can conceivably test against any specific value of βj,
although 0 is generally the value of interest.
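
(For example, using the fitted age coefficient from the SAS output in Example VI.C below, with t(0.975; 22) ≈ 2.074, a 95% confidence interval is

5.21659 ± 2.074 × 0.75724 ≈ 5.22 ± 1.57, or about (3.65, 6.79),

which agrees with the confidence limits SAS reports for age.)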

Example VI.C
The data for this example come from a study of cholesterol levels (Y) among 25
individuals, who were also measured for weight in kg (X1) and age in years (X2).
The data are tabulated below:
Chol Wt Age Chol Wt Age
354 84 46 254 57 23
190 73 20 395 59 60
405 65 52 434 69 48
263 70 30 220 60 34
451 76 57 374 79 51
302 69 25 308 75 50
288 63 28 220 82 34
385 72 36 311 59 46
402 79 57 181 67 23
365 75 44 274 85 37
209 27 24 303 55 40
290 89 31 244 63 30
346 65 52

Example VI.C (continued)


A good exploratory tool when looking at associations between several
variables is a scatterplot matrix, as shown on the following slide. (A
scatterplot matrix can be produced in SAS – for version 9.2 or
higher – using commands shown later in this example.)

What bivariate associations do you observe?

Based on these associations, what do you predict we will see with


respect to a fitted model that includes both predictors?

Is it possible that age might confound the relationship between weight


and cholesterol level?
[Scatterplot matrix of chol, weight, and age – figure not reproduced.]

Example VI.C (continued)


We want to fit a linear model for these data of the form

Yi = β0 + β1Xi1 + β2Xi2 + εi.

The SAS code below reads in the data file, fits the two-variable model
(with confidence intervals for the individual variables), and also
produces a scatterplot matrix.
options ls=79 nodate;

data;
infile "c:\chris\classes\stat5100\data\cholesterol.txt" firstobs=2;
input chol weight age;

proc reg;
model chol=weight age / clb;
run;

proc sgscatter;
title "Scatterplot Matrix for Cholesterol Data";
matrix chol weight age;
run;

Example VI.C (continued)

The REG Procedure
Model: MODEL1
Dependent Variable: chol

Number of Observations Read          25
Number of Observations Used          25

                        Analysis of Variance

                               Sum of          Mean
Source              DF        Squares        Square    F Value    Pr > F
Model                2         102571         51285      26.36    <.0001
Error               22          42806    1945.73752
Corrected Total     24         145377

Root MSE            44.11051    R-Square    0.7056
Dependent Mean     310.72000    Adj R-Sq    0.6788
Coeff Var           14.19623

Example VI.C (continued)

                        Parameter Estimates

                     Parameter      Standard
Variable     DF       Estimate         Error    t Value    Pr > |t|
Intercept     1       77.98254      52.42964       1.49      0.1511
weight        1        0.41736       0.72878       0.57      0.5727
age           1        5.21659       0.75724       6.89      <.0001

                        Parameter Estimates

Variable     DF       95% Confidence Limits
Intercept     1       -30.74988     186.71495
weight        1        -1.09403       1.92875
age           1         3.64616       6.78702

Example VI.C (continued)


Using the SAS results, write out the fitted model.

Interpret the fitted regression coefficients, including the intercept.

Report and interpret the estimated model variance.

Interpret the 95% confidence intervals for each of the coefficients of
weight and age.

Is there evidence that either weight or age is individually associated


with average cholesterol level in the presence of the other?

Interpret the ANOVA F test results.
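
(A partial sketch of the first items, read directly from the SAS output above: the fitted model is

Ŷ = 77.98 + 0.417(weight) + 5.217(age),

with estimated error variance s² = MSE = 1945.74, i.e., s = Root MSE ≈ 44.1. Holding weight fixed, each additional year of age is associated with an estimated increase of about 5.2 in mean cholesterol; holding age fixed, each additional kg of weight is associated with an increase of about 0.4; the intercept is the estimated mean cholesterol at weight 0 and age 0, an extrapolation with no practical interpretation here.)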



Analyzing Partial Sums of Squares


Sums of squares in the ANOVA table can be decomposed into
components that correspond to individual predictor variables or
sets of predictor variables.

This is a very useful tool for comparing models in order to assess


which variables should be included and in what form.

When variables (either one or more) are added to a given model,


the SSE is always reduced. The idea behind analyzing so-called
partial or extra sums of squares is that we would like to know
whether adding these extra factors leads to a
significant reduction in SSE. If the reduction is significant, this
indicates that the additional variables likewise have significant
effects with respect to the outcome variable.

Example VI.D
Consider again the cholesterol data in Example VI.C. Suppose we
begin with a simple regression model using only the weight variable
Xi1. Fitting that model, we obtain the prediction equation

Ŷi = 199.30 + 1.62Xi1,

along with the ANOVA table below:

Source          df         SS            MS        F      p-value
Regression       1      10232         10232     1.74        0.200
Error           23     135145      5875.883
Total           24     145377

Example VI.D (continued)


If we add in the age variable Xi2, as in the two-variable model of
Example VI.C, note what the effect is on the regression and error
sums of squares. Denote the simple model on the previous slide as (I),
and the two-variable model as (II).

How is (I) nested in (II)?

What are the SSR and SSE for model (I)? What are the SSR and SSE
for model (II)? How do they compare?
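
(Filling in the numbers from the two ANOVA tables: SSR goes from 10232 under model (I) to 102571 under model (II), an increase of 92339; SSE goes from 135145 to 42806, a decrease of the same 92339 – exactly the pattern summarized on the next slide.)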

Effects of Additional Covariates on


ANOVA Sums of Squares

In summary, what we observe in Example VI.D holds generally:


adding covariates to a regression model always increases the SSR,
and decreases the SSE by the same amount.

The issue is whether this change in the sums of squares is significant
enough to warrant the larger model.

In other words, is the information or cost required to estimate
additional effects worth it?

Nested Models
Consider the predictive factors X1, X2, …, Xq, yielding the model

Yi = β0 + β1Xi1 + β2Xi2 + ··· + βqXiq + εi.    (I)

Suppose we augment this with additional factors represented by Xq+1,
Xq+2, …, Xp-1, where p – 1 ≥ q, yielding the model

Yi = β0 + β1Xi1 + ··· + βqXiq + βq+1Xi,q+1 + ··· + βp-1Xi,p-1 + εi.    (II)

We say that model (I) is nested within model (II), in the sense that the
regression parameters in model (I) represent a subset of the
parameters in model (II).

We can evaluate whether (II) is an improvement on (I) – i.e., whether
the reduction in SSE is significant relative to the number of additional
parameters – by using a partial F test that compares (II) to (I).

Notation

• SSR(X1, …, Xp-1): the regression sum of squares including all coefficients.

• SSR(X1, …, Xq): the regression sum of squares including only X1, …, Xq.

• SSR(Xq+1, …, Xp-1 | X1, …, Xq) = SSE(X1, …, Xq) – SSE(X1, …, Xp-1): the additional or
extra regression sum of squares due to inclusion of Xq+1, …, Xp-1.

Comparing Nested Models with Partial F Tests


Considering the effect of adding to the model, how do we assess
whether the additional covariates are “worth it”? We wish to assess
the hypotheses

H0: βq+1 = βq+2 = ··· = βp-1 = 0, versus
HA: βk ≠ 0 for at least one k = q + 1, q + 2, …, p – 1.

To make this comparison, we can use the F statistic

F = MSR(Xq+1, …, Xp-1 | X1, …, Xq) / MSE(X1, …, Xp-1)
  = [SSR(Xq+1, …, Xp-1 | X1, …, Xq) / (p – 1 – q)] / MSE(X1, …, Xp-1).
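
One way to carry out such a partial F test in SAS is the TEST statement in PROC REG. A sketch using the cholesterol variables of Example VI.C (assuming the data have been read into a data set named chol; the label AgeEffect is arbitrary):

proc reg data=chol;
   model chol = weight age;        /* the full model (II) */
   AgeEffect: test age = 0;        /* partial F test against the reduced model (I) without age */
run;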

Comparing Nested Models with Partial F Tests, continued

Note the degrees of freedom in the numerator correspond to the


difference in the number of parameters fit for each model.

Hence, under the null hypothesis, this statistic follows an
F(p – 1 – q, n – p) distribution.

The p-value is the right tail probability corresponding to the observed


statistic.

Example VI.E

Using the results in Example VI.C and Example VI.D, carry out
an F test of the hypothesis of no age effect.

What are the test statistic and corresponding p-value?

What are the degrees of freedom associated with this test statistic?

How do these results compare to the confidence interval and t-test
for the coefficient of age?
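
(A worked sketch using the sums of squares reported in Examples VI.C and VI.D:

SSR(X2 | X1) = SSE(X1) – SSE(X1, X2) = 135145 – 42806 = 92339,

F = [92339 / 1] / MSE(X1, X2) = 92339 / 1945.74 ≈ 47.5,

with (1, 22) degrees of freedom and a p-value far below 0.0001. This agrees with the t test for age in Example VI.C: t = 6.89 and t² ≈ 47.5, so with a single added variable the partial F test and the individual t test are equivalent.)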

Common Variable Types in Multiple Regression

A multivariable regression model can accommodate:

• non-linear (polynomial) effects,

• nominal categorical variables,

• interactions.

The accompanying three handouts describe such models, and


provide examples analyzed using SAS to illustrate their application
and interpretation.
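
As a generic sketch of how such terms are often constructed before calling PROC REG (hypothetical variable names; the handouts show the versions used in this course):

data expanded;
   set mydata;                  /* hypothetical input data set */
   x1sq    = x1*x1;             /* quadratic (polynomial) term */
   groupB  = (group = "B");     /* 0/1 dummy variable for a nominal category */
   x1_grpB = x1*groupB;         /* interaction between x1 and the dummy */
run;

proc reg data=expanded;
   model y = x1 x1sq groupB x1_grpB;
run;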

General Guidelines for Model Selection


What do you do with a large data set that has many covariates? How
might you go about finding which subset of the covariates adequately
describes their effects on the outcome of interest? How might you
decide when to treat a given covariate as discrete versus continuous?
Which covariates might interact?

These questions are often best answered using a combination of both
the art and science of model building. The former has to do with our
substantive knowledge of the research question at hand (e.g., what’s
been accomplished previously in your line of research), and the latter
has to do with empiricism and formal statistical analyses.

The slides that follow contain some suggestions for achieving the
balance of parsimony and sophistication that underlies a good model.

1. Understand your covariates.


In any data analysis setting, you must spend some quality time
exploring the distributions of the outcome and predictor variables.

This involves performing simple cross-validations and univariate
procedures to ensure consistency and correctness, and exploring two-
or three-way (or higher order) associations between the variables in
order to discern the types of marginal associations that may exist.

Exploratory analysis generally involves a combination of descriptive,


inferential, and plotting tools.

1. Understand your covariates (continued).


Many an analyst has found after much work on a problem that, for
example, some of the data were either miscoded or coded in a way
that the investigator neglected to understand. Models and their
associated interpretations in that case need to be completely
reassessed.

In addition, a thoughtful exploratory analysis helps you to avoid


including covariates in the model that are very highly correlated.
Including such a set of covariates results in the problem of
collinearity, where your fitted model may be unreliable: small
perturbations in the data may lead to large differences in the resulting
parameter estimates, as well as their standard errors.

2. Start simple.

It’s a good idea to make a short list of some of the most important
covariates in your data set, and begin by considering simple
models that involve only those. Your list should be dictated by
your familiarity with the problem at hand, as well as the primary
hypotheses of interest. For example, in an observational public
health study of people, it’s a good bet that you would want to
account for measures like gender, race, age, and socioeconomic
status.

Especially with respect to observational data, there is a tendency


for researchers to consider everything at once. This can make the
analysis initially overwhelming.

3. Use automated model selection procedures sparingly.

Procedures that automatically select a model for you, while popular
in some settings, are completely ad hoc. Stepwise methods, for
example, basically add covariates one at a time (or subtract one at a
time), using an arbitrary significance level as the only criterion for
inclusion versus exclusion. While such a tool might prove useful
for exploratory purposes, it’s a terrible idea to wholly rely on these
kinds of procedures.

4. Remember goodness-of-fit.
Goodness-of-fit is an aspect of data analysis that is all too often
ignored. For example, investigators often simply treat covariates
as continuous, assuming that their effects are linear, without
bothering to check such assumptions through simple
exploratory analyses.

Avoid running roughshod over your data by checking to make sure
the final model you’ve selected is not missing something
important.

Variable Inclusion versus Exclusion

There are several motivations that can inform variable inclusion.
For example, you likely want to include a variable if:

• It represents a primary hypothesis of interest.

• It has been consistently used in prior analyses in the same
line of research.

• It has a statistically significant predictive effect within
your own analyses.

Formal Model Comparison Procedures


In terms of the science (i.e., empirical approaches) for model selection,
we have already discussed important tools such as t tests for
individual coefficients and partial F tests for one or more coefficients.

Aside from these, there are other strategies and metrics that make an
objective attempt at balancing predictive power versus parsimony.

The issue: given P – 1 total predictor variables available in your
dataset, how do we select the “best” subset of p – 1? This will yield p
total regression coefficients (including the intercept) such that

1 ≤ p ≤ P.

Using the Coefficient of Determination


The so-called coefficient of multiple determination is conventionally
denoted by R2, and represents the percent of the total variation in Y
that is explained by a given regression model with p – 1 predictor
variables. Its definition is given by

R2 = SSR/SSTO = 1 – SSE/SSTO.

You can see that, by construction, 0 ≤ R2 ≤ 1.



The Adjusted R2
Recall that whenever a predictor is added to a model, SSR increases,
so that R2 will also increase. To balance this increase against the
worth of an additional coefficient, the adjusted R2 is often also
considered, defined as

R2a = 1 – [SSE/(n – p)] / [SSTO/(n – 1)].

Note that the adjusted value may actually decrease with an additional
predictor, if the decrease in SSE is offset by the loss of a degree of
freedom in the numerator.
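
(As a quick numerical check against the SAS output in Example VI.C, with n = 25 and p = 3:

R2 = SSR/SSTO = 102571/145377 ≈ 0.7056,
R2a = 1 – (42806/22)/(145377/24) ≈ 1 – 1945.74/6057.38 ≈ 0.6788,

which match the reported R-Square and Adj R-Sq values.)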

Selection Criteria Based on R2 Measures


R2p = the value of R2 for a given subset of p – 1 covariates (i.e., p regression coefficients).

R2a,p = the value of R2a for the same subset.

One criterion is to choose a subset of covariates that yields a “large” R2p.
Recall that since this value increases each time a variable is
added to the model, under this criterion we need to make a
subjective decision about whether the increase is sufficiently large
to warrant the additional factor.

To make the decision more objective, another criterion is to select the
model that maximizes the adjusted value R2a,p.

Mallows’ Cp

This metric attempts to quantify the total mean square error (MSE)
across the sampled subjects, relative to the model fit. Note that MSE
is a measure that combines variability and bias. Here, the bias
represents deviations from the true underlying model for E{Y} that
arise because important variables have not been included.

While the technical details behind the computation of Cp for a given
model are not that critical, in terms of interpretation we note that if key
factors are omitted then Cp > p. That is, generally speaking, models
with little bias (which is more desirable) will yield values of Cp close
to p.

AICp and SBCp

As with the adjusted R2 and Mallows’ Cp, both the AICp and SBCp
penalize models that have large numbers of predictors relative to their
predictive power. Their definitions are respectively given by

AICp = nlog(SSEp) – nlog(n) + 2p, and

SBCp = nlog(SSEp) – nlog(n) + [log(n)]p.

The first term in these definitions will decrease as p increases. The
second term is fixed for a given sample size. The third term in either
will increase with larger p.
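
(For a rough illustration with the two cholesterol models from Examples VI.C and VI.D, with n = 25 and taking the logs above as natural logs:

Weight only (p = 2):     AIC = 25 ln(135145) – 25 ln(25) + 4 ≈ 295.4 – 80.5 + 4 ≈ 218.9
Weight and age (p = 3):  AIC = 25 ln(42806) – 25 ln(25) + 6 ≈ 266.6 – 80.5 + 6 ≈ 192.1

The smaller AIC favors the two-variable model, consistent with the partial F test; SBC behaves the same way here, with ln(25) ≈ 3.22 replacing the 2 in the penalty.)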

Comparing AICp and SBCp

Models with small SSEp will perform well using either of these
criteria, provided that the penalties imposed by the third term are not
too large.

Note that if n ≥ 8, then the penalty for SBCp is larger than that for AICp.
Hence, the SBCp criterion favors models with fewer covariates.

Example VI.E
To illustrate the use of these various criteria, we will use the
Concord, NH, summer water-use data that we examined for Example
V.U, which are posted on the course website in the file concord.txt.
The outcome variable Y is gallons of water used during the summer
months in 1980. Predictors include annual income, years of
education, retirement status (yes/no), and number of individuals in
the household.

See the accompanying handout for SAS code, including commands
to generate all possible regressions along with the model selection
metrics discussed thus far.
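
A minimal sketch of such an all-possible-regressions run in PROC REG (the data set and variable names here are placeholders and may not match those in concord.txt or the handout):

proc reg data=concord;
   model water = income educ retired household
         / selection=rsquare adjrsq cp aic sbc;   /* all subsets, reporting R-square, adjusted R-square, Cp, AIC, SBC */
run;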

Collinearity
Collinearity or multicollinearity describes a setting in which two or
more predictors are highly correlated. When such a set or subset of
explanatory variables is included in a model, we can experience
problems with

• the conventional interpretation of regression coefficients (i.e.,
effects of predictive factors holding all other variables constant);
• higher sampling variability of fitted coefficients, making it more
difficult to detect actual associations between the response
variable and predictors;
• numerical computation required for the least-squares fit.

Example VI.F
Consider a case where we have a model for an outcome variable Y = (5, 7, 4, 3, 8),
with two predictors X1 = (4, 5, 5, 4, 11) and X2 = (8, 10, 10, 8, 22). Pair-wise
scatterplots of these three variables are shown below. What is the relationship
between Y and X1? Between Y and X2? What do you notice about X1 and X2?
What is the correlation coefficient between the two predictors?

Example VI.F (continued)


Suppose we consider first the model

Y = β0 + β1X1 + ε.

Because X2 = 2X1, we have perfect collinearity between the predictors. In general,
this is true if any predictor variable can be expressed as a linear function of the
others. In this case, suppose we now consider the two-variable model:

Y = α0 + α1X1 + α2X2 + ε = α0 + α1X1 + α2(2X1) + ε = α0 + (α1 + 2α2)X1 + ε.

Notice that for this second model the coefficients are no longer unique. That is,
suppose we choose α2 = 0. Then α1 = β1 yields the original (simple) model. Now
consider α2 = 2. Then α1 = β1 – 4 yields the original model.

Computationally speaking, given our data, the least-squares solution for the two-
variable model will be indeterminate.

Example VI.F (continued)


In fact, if we try to fit the two-variable model in SAS, this is the result:
The REG Procedure
Model: MODEL1
Dependent Variable: Y

Number of Observations Read 5


Number of Observations Used 5

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 10.89811 10.89811 5.19 0.1072


Error 3 6.30189 2.10063
Corrected Total 4 17.20000

Root MSE            1.44935    R-Square    0.6336
Dependent Mean 5.40000 Adj R-Sq 0.5115
Coeff Var 26.83990

Example VI.F (continued)


NOTE: Model is not full rank. Least-squares solutions for the parameters are
      not unique. Some statistics will be misleading. A reported DF of 0 or B
means that the estimate is biased.
NOTE: The following parameters have been set to 0, since the variables are a
linear combination of other variables as shown.

X2 = 2 * X1

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 1.52830 1.81920 0.84 0.4625


X1 B 0.71698 0.31478 2.28 0.1072
X2 0 0 . . .

Near versus Perfect Collinearity


Of course, having perfectly correlated variables is an easily detectable
problem: as in the SAS output above, your software package will flag the
model as not full rank when it includes those variables.

The more practical worry in real-world research settings is the issue


of variables that are highly – but not perfectly – correlated.

How can you detect such a potential problem?



Collinearity Diagnostics
• Compute the correlation matrix for your set of variables. A rule of thumb is to be
very careful about including two predictors with an r2 of around 0.8 or higher.

• Regress each predictor on all others in turn, and examine the value of the
coefficient of determination R2 for each model.

• Examine the marginal effect on the fit of coefficients already in the model when
an additional factor is added. For example, if a previously included variable has
a highly significant coefficient whose magnitude, sign, and/or p-value
dramatically changes with the addition of another factor, then those two
predictors should be closely examined.

• (Not mentioned in the text.) Examine the eigenvalues of the correlation matrix of
the predictors. Eigenvalues equal to zero or relatively close to zero indicate
singularity or near-singularity in this matrix. The ratio of the largest to smallest
eigenvalue is called the condition number of the matrix. A common rule of
thumb is that a condition number > 30 represents a red flag.
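
A sketch of how the first and last of these diagnostics can be requested in SAS (hypothetical variable names; VIF reports 1/(1 – R2) from regressing each predictor on the others, and COLLIN prints the eigenvalue and condition-index table):

proc corr data=mydata;
   var x1 x2 x3;                      /* pairwise correlations among the predictors */
run;

proc reg data=mydata;
   model y = x1 x2 x3 / vif collin;   /* variance inflation factors and eigenvalue-based diagnostics */
run;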
