Lecture 06
The multiple regression model:

Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i
Confounding Variables
Suppose that we are investigating the relationship between two variables X and Y. That is, we want to know if variation in X causes variation in Y.
Example VI.A
Consider recent findings from a study of “lighthearted” pursuits on average blood
pressure:
http://www.cnn.com/2011/HEALTH/03/25/laughter.music.lower.blood.pressure/index.html.
Is this a controlled or observational study? What advantages (if any) does the study
design have in this case?
Consider also the following results reported on a study of weight and bullying among
children:
http://inhealth.cnn.com/living-well/overweight-kids-more-likely-to-be-bullied?did=cnnmodule1.
What kind of study is this? What factors might the researchers need to control for?
Example VI.B
Consider a two-variable model

Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i.
Suppose that β0 = 5, β1 = 2, and β2 = 1.5.
What relationship do we assume between X1 and Y if we hold X2 constant?
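As a sketch of the answer: with these coefficients the mean response is

E[Y] = 5 + 2 X_1 + 1.5 X_2.

Holding X2 fixed at any value c gives

E[Y] = (5 + 1.5 c) + 2 X_1,

a straight line in X1 with slope \beta_1 = 2; changing the value of X2 shifts only the intercept.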
so that fitting the linear regression model is simply a matter of fitting

Y_i' = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i,

where Y' = log(Y).
Least-squares Fit
Note that the least-squares fit involves the same minimization of squared residuals (in this case around a plane or hyperplane). The residual is defined as:

\varepsilon_i = Y_i - (\beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1}).
To find the least-squares fit, we minimize the objective function

Q = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} [Y_i - (\beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1})]^2.
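As an aside (a standard result not shown on the slide), the minimization has a closed form in matrix notation. With Y the n × 1 response vector and X the n × p design matrix whose first column is all ones, setting the partial derivatives of Q to zero yields the normal equations

X^\top X b = X^\top Y,

so that, provided X^\top X is invertible,

b = (X^\top X)^{-1} X^\top Y

is the vector of least-squares estimates.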
The error variance \sigma^2 is then estimated by

s^2 = \frac{1}{n - p} \sum_{i=1}^{n} e_i^2 = \frac{1}{n - p} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \frac{SSE}{n - p} = MSE.
Analysis of Variance
Note the error degrees of freedom for MSE. This suggests an ANOVA table for a multiple regression model that looks something like this:

Source       Degrees of Freedom   Sum of Squares   Mean Squares   F-statistic   p-value
Regression   p - 1                SSR              MSR            MSR/MSE
Error        n - p                SSE              MSE
Total        n - 1                SSTO

where MSR = SSR/(p - 1) and MSE = SSE/(n - p).
Components of Variation
As with the simple model, the total variation decomposes as

SSTO = SSR + SSE.
ANOVA F Test
Note that there are p - 1 degrees of freedom associated with SSR, and n - p degrees of freedom for SSE. The ANOVA F test statistic is therefore given by MSR/MSE, and under the null hypothesis follows an F(p - 1, n - p) distribution.
The question is: what is the F statistic for? For a regression model with multiple predictors, the F statistic tests the relationship between Y and the set of X variables X1, …, Xp-1. In other words, the overall F test evaluates the hypothesis

H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0

against the alternative that at least one of these coefficients is nonzero; we reject H_0 when the statistic is large relative to the F(p - 1, n - p) distribution.
In other words, the statistic

\frac{b_k - \beta_k}{s\{b_k\}}

follows a t(n - p) distribution. Note that (especially in the case of multiple predictors) the computation of s{b_k} is too complicated to perform by hand, and in general is carried out using software.
Example VI.C
The data for this example come from a study of cholesterol level (Y) among 25
individuals, who were also measured for weight in kg (X1) and age in years (X2).
The data are tabulated below:
Chol Wt Age Chol Wt Age
354 84 46 254 57 23
190 73 20 395 59 60
405 65 52 434 69 48
263 70 30 220 60 34
451 76 57 374 79 51
302 69 25 308 75 50
288 63 28 220 82 34
385 72 36 311 59 46
402 79 57 181 67 23
365 75 44 274 85 37
209 27 24 303 55 40
290 89 31 244 63 30
346 65 52
data cholesterol;
   /* read the raw data, skipping the header line */
   infile "c:\chris\classes\stat5100\data\cholesterol.txt" firstobs=2;
   input chol weight age;
run;

proc reg data=cholesterol;
   /* clb requests 95% confidence limits for the coefficients */
   model chol=weight age / clb;
run;

proc sgscatter data=cholesterol;
   title "Scatterplot Matrix for Cholesterol Data";
   matrix chol weight age;
run;
SAS Output
Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2   102571           51285         26.36     <.0001
Error             22   42806            1945.73752
Corrected Total   24   145377

Interpret the ANOVA F test results.

Parameter Estimates

Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
(The estimates, standard errors, t values, and p-values for the intercept, weight, and age appear in the SAS output.)

Is there evidence that either weight or age is individually associated with average cholesterol level in the presence of the other? Interpret the 95% confidence intervals for each of the coefficients of weight and age.
Example VI.D
Consider again the cholesterol data in Example VI.C. Suppose we begin with a simple regression model using only the weight variable Xi1. Fitting that model, we obtain a prediction equation and the following ANOVA table:

Source       df   SS       MS       F      p-value
Regression    1   10232    10232    1.74   0.200
Error        23   135145   5875.9
Total        24   145377
What are the SSR and SSE for model (I)? What are the SSR and SSE for model (II)? How do they compare?
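As a sketch of the answer, taking model (I) to be the weight-only model and model (II) the model with both weight and age: from the two ANOVA tables, SSR(I) = 10232 and SSE(I) = 145377 - 10232 = 135145, while SSR(II) = 102571 and SSE(II) = 42806. Adding age thus moves 135145 - 42806 = 92339 from the error sum of squares into the regression sum of squares.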
The issue is whether this change in the sums of squares is significant enough to warrant the larger model. In other words, is the information or cost required to estimate additional effects worth it?
Nested Models
Consider the predictive factors X1, X2,…, Xq, yielding the model
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_q X_{iq} + \varepsilon_i.    (I)
Suppose we augment this with additional factors represented by Xq+1, Xq+2, …, Xp-1, where p - 1 ≥ q, yielding the model

Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_q X_{iq} + \beta_{q+1} X_{i,q+1} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i.    (II)
We say that model (I) is nested within model (II), in the sense that the regression parameters in model (I) represent a subset of the parameters in model (II).
We can evaluate whether (II) is an improvement on (I) – i.e., whether
the reduction in SSE is significant relative to the number of additional
parameters – by using a partial F test that compares (II) to (I).
Notation

Let SSE(I) and SSE(II) denote the error sums of squares from fitting models (I) and (II), respectively; the number of added parameters is p - q - 1. The partial F statistic is

F^* = \frac{[SSE(I) - SSE(II)]/(p - q - 1)}{SSE(II)/(n - p)}.

Hence, under the null hypothesis H_0: \beta_{q+1} = \cdots = \beta_{p-1} = 0, this statistic follows an F(p - q - 1, n - p) distribution.
Example VI.E
Using the results in Example VI.C and Example VI.D, carry out an F test of the hypothesis of no age effect.
What are the degrees of freedom associated with this test statistic?
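A sketch of the computation, using the sums of squares from Examples VI.C and VI.D (reduced model: weight only; full model: weight and age, so the test has 1 and n - p = 22 degrees of freedom):

F^* = \frac{(135145 - 42806)/1}{42806/22} = \frac{92339}{1945.7} \approx 47.5,

which far exceeds any usual F(1, 22) critical value, so the hypothesis of no age effect is rejected. With a single added parameter, this partial F statistic equals the square of the t statistic for age in the full model; in SAS, the same test can be requested by adding a TEST statement (e.g., test age = 0;) to PROC REG.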
• interactions.
2. Start simple.
4. Remember goodness-of-fit.
Goodness-of-fit
Goodness of fit is an aspect of data analysis that is all too often ignored. For example, investigators often treat covariates as continuous, assuming that their effects are linear, without checking such assumptions through simple exploratory analyses.
There are several motivations that can inform variable inclusion. For example, you likely want to include a variable if:
• It has been consistently used in prior analyses in the same line of research.
Aside from these, there are other strategies and metrics that make an objective attempt at balancing predictive power versus parsimony. In what follows, consider a candidate model with p parameters, where 1 ≤ p ≤ P and P denotes the number of parameters in the largest model under consideration.
R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}.
The Adjusted R2
Recall that whenever a predictor is added to a model, SSR increases, so that R2 will also increase. To balance this increase against the worth of an additional coefficient, the adjusted R2 is often also considered, defined as

R_a^2 = 1 - \frac{SSE/(n - p)}{SSTO/(n - 1)}.
Note that the adjusted value may actually decrease with an additional
predictor, if the decrease in SSE is offset by the loss of a degree of
freedom in the numerator.
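For instance, for the two-predictor cholesterol model of Example VI.C (n = 25, p = 3, SSE = 42806, SSTO = 145377):

R^2 = 1 - \frac{42806}{145377} \approx 0.706,

R_a^2 = 1 - \frac{42806/22}{145377/24} = 1 - \frac{1945.7}{6057.4} \approx 0.679.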
The AICp and SBCp

As with the adjusted R2 and Mallows' Cp, both the AICp and SBCp penalize models that have large numbers of predictors relative to their predictive power. Their definitions are respectively given by

AIC_p = n \ln(SSE_p) - n \ln(n) + 2p,

SBC_p = n \ln(SSE_p) - n \ln(n) + [\ln(n)] p.

Models with small SSEp will perform well using either of these criteria, provided that the penalties imposed by the third term are not too large.

Note that if n ≥ 8, then the penalty for SBCp is larger than that for AICp, since ln(n) > 2 whenever n > e^2 ≈ 7.4.
Hence, the SBCp criterion favors models with fewer covariates.
Example VI.E
To illustrate the use of these various criteria, we will use the Concord, NH, summer water-use data that we examined in Example V.U, which are posted on the course website in the file concord.txt.
The outcome variable Y is gallons of water used during the summer
months in 1980. Predictors include annual income, years of
education, retirement status (yes/no), and number of individuals in
the household.
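As a sketch of how these criteria can be requested in SAS (the file path and the variable names below are hypothetical placeholders for the columns of concord.txt):

data concord;
   infile "c:\chris\classes\stat5100\data\concord.txt" firstobs=2;  /* assumed path */
   input water income educ retire persons;
run;

proc reg data=concord;
   /* selection=rsquare fits all possible subsets; adjrsq, cp, aic,
      and sbc print those criteria for each candidate model */
   model water = income educ retire persons
         / selection=rsquare adjrsq cp aic sbc;
run;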
Collinearity
Collinearity or multicollinearity describes a setting in which two or more predictors are highly correlated. When such a set or subset of explanatory variables is included in a model, we can experience problems with estimating and interpreting the individual regression coefficients.
Example VI.F
Consider a case where we have a model for an outcome variable Y = (5, 7, 4, 3, 8), with two predictors X1 = (4, 5, 5, 4, 1) and X2 = (8, 10, 10, 8, 2). Pair-wise scatterplots of these three variables are shown below. What is the relationship between Y and X1? Between Y and X2? What do you notice about X1 and X2? What is the correlation coefficient between the two predictors?
Suppose the underlying simple model is Y = \beta_0 + \beta_1 X_1. If we fit the two-predictor model Y = \alpha_0 + \alpha_1 X_1 + \alpha_2 X_2, then because X2 = 2 X1,

Y = \alpha_0 + \alpha_1 X_1 + \alpha_2 (2 X_1) = \alpha_0 + (\alpha_1 + 2 \alpha_2) X_1.
Notice that for this second model the coefficients are no longer unique. That is,
suppose we choose α2 = 0. Then α1 = β1 yields the original (simple) model. Now
consider α2 = 2. Then α1 = β1 – 4 yields the original model.
Computationally speaking, given our data, the least-squares solution for the two-variable model will be indeterminate.
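To spell out the computational point: the column of the design matrix corresponding to X2 is exactly twice the column for X1, so X^\top X is singular. The normal equations

X^\top X b = X^\top Y

therefore have infinitely many solutions, and (X^\top X)^{-1} does not exist.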
SAS output for the regression of Y on X1 and X2, where X2 = 2 * X1: the Analysis of Variance table (Source, DF, Sum of Squares, Mean Square, F Value, Pr > F) and the Parameter Estimates table (Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|).
Collinearity Diagnostics
• Compute the correlation matrix for your set of variables. A rule of thumb is to be very careful about including two predictors with an r2 of around 0.8 or higher.
• Regress each predictor on all others in turn, and examine the value of the coefficient of determination R2 for each model.
• Examine the marginal effect on the fit of coefficients already in the model when
an additional factor is added. For example, if a previously included variable has
a highly significant coefficient whose magnitude, sign, and/or p-value
dramatically changes with the addition of another factor, then those two
predictors should be closely examined.
• (Not mentioned in the text.) Examine the eigenvalues of the correlation matrix of the predictors. Eigenvalues equal to zero or relatively close to zero indicate singularity or near-singularity in this matrix. The ratio of the largest to smallest eigenvalue is called the condition number of the matrix. A common rule of thumb is that a condition number > 30 represents a red flag.
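As a sketch of how these diagnostics can be requested in SAS, using the cholesterol data of Example VI.C (the vif option, which reports 1/(1 - R2) from regressing each predictor on the others, is an addition not mentioned above):

proc corr data=cholesterol;
   /* pairwise correlations among the predictors */
   var weight age;
run;

proc reg data=cholesterol;
   /* vif prints variance inflation factors; collin prints
      eigenvalues and condition indices of the scaled X'X matrix */
   model chol = weight age / vif collin;
run;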