Lecture 06
The multiple regression model:

Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i
Confounding Variables
Suppose that we are investigating the relationship between two variables X and Y. That is, we want to know if variation in X causes variation in Y.
Example VI.A
Consider recent findings from a study of “lighthearted” pursuits on average blood
pressure:
http://www.cnn.com/2011/HEALTH/03/25/laughter.music.lower.blood.pressure/index.html.
Is this a controlled or observational study? What advantages (if any) does the study
design have in this case?
Consider also the following results reported on a study of weight and bullying among
children:
http://inhealth.cnn.com/living-well/overweight-kids-more-likely-to-be-bullied?did=cnnmodule1.
What kind of study is this? What factors might the researchers need to control for?
Example VI.B
Consider a two-variable model

Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i.
Suppose that β0 = 5, β1 = 2, and β2 = 1.5.
What relationship do we assume between X1 and Y if we hold X2 constant?
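As a sketch of the answer: with these coefficients the mean response is

E[Y] = 5 + 2 X_1 + 1.5 X_2.

Holding X2 fixed at any value c gives

E[Y] = (5 + 1.5 c) + 2 X_1,

a straight line in X1 with slope \beta_1 = 2; changing the value of X2 shifts only the intercept.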
so that fitting the linear regression model is simply a matter of fitting

Y_i' = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i,

where Y' = log(Y).
Least-squares Fit
Note that the least-squares fit involves the same minimization of squared residuals (in this case around a plane or hyperplane). The residual is defined as:

\varepsilon_i = Y_i - (\beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1}).
To find the least-squares fit, we minimize the objective function

Q = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} [Y_i - (\beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1})]^2.
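As an aside (a standard result not shown on the slide), the minimization has a closed form in matrix notation. With Y the n × 1 response vector and X the n × p design matrix whose first column is all ones, setting the partial derivatives of Q to zero yields the normal equations

X^\top X b = X^\top Y,

so that, provided X^\top X is invertible,

b = (X^\top X)^{-1} X^\top Y

is the vector of least-squares estimates.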
The error variance \sigma^2 is then estimated by

s^2 = \frac{1}{n - p} \sum_{i=1}^{n} e_i^2 = \frac{1}{n - p} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \frac{SSE}{n - p} = MSE.
Analysis of Variance
Note the error degrees of freedom for MSE. This suggests an ANOVA table for a multiple regression model that looks something like this:

Source       Degrees of Freedom   Sum of Squares   Mean Squares   F-statistic   p-value
Regression   p - 1                SSR              MSR            MSR/MSE
Error        n - p                SSE              MSE
Total        n - 1                SSTO

where MSR = SSR/(p - 1) and MSE = SSE/(n - p).
Components of Variation
As with the simple model, the total variation decomposes as

SSTO = SSR + SSE.
ANOVA F Test
Note that there are p - 1 degrees of freedom associated with SSR, and n - p degrees of freedom for SSE. The ANOVA F test statistic is therefore given by MSR/MSE, and under the null hypothesis follows an F(p - 1, n - p) distribution.
The question is: what is the F statistic for? For a regression model with multiple predictors, the F statistic tests the relationship between Y and the set of X variables X1, …, Xp-1. In other words, the overall F test evaluates the hypothesis

H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0

against the alternative that at least one of these coefficients is nonzero; we reject H_0 when the statistic is large relative to the F(p - 1, n - p) distribution.
In other words, the statistic

\frac{b_k - \beta_k}{s\{b_k\}}

follows a t(n - p) distribution. Note that (especially in the case of multiple predictors) the computation of s{b_k} is too complicated to perform by hand, and in general is carried out using software.
Example VI.C
The data for this example come from a study of cholesterol level (Y) among 25
individuals, who were also measured for weight in kg (X1) and age in years (X2).
The data are tabulated below:
Chol Wt Age Chol Wt Age
354 84 46 254 57 23
190 73 20 395 59 60
405 65 52 434 69 48
263 70 30 220 60 34
451 76 57 374 79 51
302 69 25 308 75 50
288 63 28 220 82 34
385 72 36 311 59 46
402 79 57 181 67 23
365 75 44 274 85 37
209 27 24 303 55 40
290 89 31 244 63 30
346 65 52
data cholesterol;
   /* read the raw data, skipping the header line */
   infile "c:\chris\classes\stat5100\data\cholesterol.txt" firstobs=2;
   input chol weight age;
run;

proc reg data=cholesterol;
   /* clb requests 95% confidence limits for the coefficients */
   model chol=weight age / clb;
run;

proc sgscatter data=cholesterol;
   title "Scatterplot Matrix for Cholesterol Data";
   matrix chol weight age;
run;
SAS Output
Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2   102571           51285         26.36     <.0001
Error             22   42806            1945.73752
Corrected Total   24   145377

Interpret the ANOVA F test results.

Parameter Estimates

Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
(The estimates, standard errors, t values, and p-values for the intercept, weight, and age appear in the SAS output.)

Is there evidence that either weight or age is individually associated with average cholesterol level in the presence of the other? Interpret the 95% confidence intervals for each of the coefficients of weight and age.
Example VI.D
Consider again the cholesterol data in Example VI.C. Suppose we begin with a simple regression model using only the weight variable Xi1. Fitting that model, we obtain a prediction equation and the following ANOVA table:

Source       df   SS       MS       F      p-value
Regression    1   10232    10232    1.74   0.200
Error        23   135145   5875.9
Total        24   145377
What are the SSR and SSE for model (I)? What are the SSR and SSE for model (II)? How do they compare?
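As a sketch of the answer, taking model (I) to be the weight-only model and model (II) the model with both weight and age: from the two ANOVA tables, SSR(I) = 10232 and SSE(I) = 145377 - 10232 = 135145, while SSR(II) = 102571 and SSE(II) = 42806. Adding age thus moves 135145 - 42806 = 92339 from the error sum of squares into the regression sum of squares.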
The issue is whether this change in the sums of squares is significant enough to warrant the larger model. In other words, is the information or cost required to estimate additional effects worth it?
Nested Models
Consider the predictive factors X1, X2,…, Xq, yielding the model
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_q X_{iq} + \varepsilon_i.    (I)
Suppose we augment this with additional factors represented by Xq+1, Xq+2, …, Xp-1, where p - 1 ≥ q, yielding the model

Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_q X_{iq} + \beta_{q+1} X_{i,q+1} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i.    (II)
We say that model (I) is nested within model (II), in the sense that the regression parameters in model (I) represent a subset of the parameters in model (II).
We can evaluate whether (II) is an improvement on (I) – i.e., whether
the reduction in SSE is significant relative to the number of additional
parameters – by using a partial F test that compares (II) to (I).
Notation

Let SSE(I) and SSE(II) denote the error sums of squares from fitting models (I) and (II), respectively; the number of added parameters is p - q - 1. The partial F statistic is

F^* = \frac{[SSE(I) - SSE(II)]/(p - q - 1)}{SSE(II)/(n - p)}.

Hence, under the null hypothesis H_0: \beta_{q+1} = \cdots = \beta_{p-1} = 0, this statistic follows an F(p - q - 1, n - p) distribution.
Example VI.E
Using the results in Example VI.C and Example VI.D, carry out an F test of the hypothesis of no age effect.
What are the degrees of freedom associated with this test statistic?
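A sketch of the computation, using the sums of squares from Examples VI.C and VI.D (reduced model: weight only; full model: weight and age, so the test has 1 and n - p = 22 degrees of freedom):

F^* = \frac{(135145 - 42806)/1}{42806/22} = \frac{92339}{1945.7} \approx 47.5,

which far exceeds any usual F(1, 22) critical value, so the hypothesis of no age effect is rejected. With a single added parameter, this partial F statistic equals the square of the t statistic for age in the full model; in SAS, the same test can be requested by adding a TEST statement (e.g., test age = 0;) to PROC REG.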
• interactions.
2. Start simple.
4. Remember goodness-of-fit.
Goodness-of-fit
Goodness of fit is an aspect of data analysis that is all too often ignored. For example, investigators often treat covariates as continuous, assuming that their effects are linear, without checking such assumptions through simple exploratory analyses.
There are several motivations that can inform variable inclusion. For example, you likely want to include a variable if:
• It has been consistently used in prior analyses in the same line of research.
Aside from these, there are other strategies and metrics that make an objective attempt at balancing predictive power versus parsimony. In what follows, consider a candidate model with p parameters, where 1 ≤ p ≤ P and P denotes the number of parameters in the largest model under consideration.
R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}.
The Adjusted R2
Recall that whenever a predictor is added to a model, SSR increases, so that R2 will also increase. To balance this increase against the worth of an additional coefficient, the adjusted R2 is often also considered, defined as

R_a^2 = 1 - \frac{SSE/(n - p)}{SSTO/(n - 1)}.
Note that the adjusted value may actually decrease with an additional
predictor, if the decrease in SSE is offset by the loss of a degree of
freedom in the numerator.
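For instance, for the two-predictor cholesterol model of Example VI.C (n = 25, p = 3, SSE = 42806, SSTO = 145377):

R^2 = 1 - \frac{42806}{145377} \approx 0.706,

R_a^2 = 1 - \frac{42806/22}{145377/24} = 1 - \frac{1945.7}{6057.4} \approx 0.679.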
The AICp and SBCp

As with the adjusted R2 and Mallows' Cp, both the AICp and SBCp penalize models that have large numbers of predictors relative to their predictive power. Their definitions are respectively given by

AIC_p = n \ln(SSE_p) - n \ln(n) + 2p,

SBC_p = n \ln(SSE_p) - n \ln(n) + [\ln(n)] p.

Models with small SSEp will perform well using either of these criteria, provided that the penalties imposed by the third term are not too large.

Note that if n ≥ 8, then the penalty for SBCp is larger than that for AICp, since ln(n) > 2 whenever n > e^2 ≈ 7.4.
Hence, the SBCp criterion favors models with fewer covariates.
Example VI.E
To illustrate the use of these various criteria, we will use the Concord, NH, summer water-use data that we examined in Example V.U, which are posted on the course website in the file concord.txt.
The outcome variable Y is gallons of water used during the summer
months in 1980. Predictors include annual income, years of
education, retirement status (yes/no), and number of individuals in
the household.
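As a sketch of how these criteria can be requested in SAS (the file path and the variable names below are hypothetical placeholders for the columns of concord.txt):

data concord;
   infile "c:\chris\classes\stat5100\data\concord.txt" firstobs=2;  /* assumed path */
   input water income educ retire persons;
run;

proc reg data=concord;
   /* selection=rsquare fits all possible subsets; adjrsq, cp, aic,
      and sbc print those criteria for each candidate model */
   model water = income educ retire persons
         / selection=rsquare adjrsq cp aic sbc;
run;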
Collinearity
Collinearity or multicollinearity describes a setting in which two or more predictors are highly correlated. When such a set or subset of explanatory variables is included in a model, we can experience problems with estimating and interpreting the individual regression coefficients.
Example VI.F
Consider a case where we have a model for an outcome variable Y = (5, 7, 4, 3, 8), with two predictors X1 = (4, 5, 5, 4, 1) and X2 = (8, 10, 10, 8, 2). Pair-wise scatterplots of these three variables are shown below. What is the relationship between Y and X1? Between Y and X2? What do you notice about X1 and X2? What is the correlation coefficient between the two predictors?
Suppose the underlying simple model is Y = \beta_0 + \beta_1 X_1. If we fit the two-predictor model Y = \alpha_0 + \alpha_1 X_1 + \alpha_2 X_2, then because X2 = 2 X1,

Y = \alpha_0 + \alpha_1 X_1 + \alpha_2 (2 X_1) = \alpha_0 + (\alpha_1 + 2 \alpha_2) X_1.
Notice that for this second model the coefficients are no longer unique. That is,
suppose we choose α2 = 0. Then α1 = β1 yields the original (simple) model. Now
consider α2 = 2. Then α1 = β1 – 4 yields the original model.
Computationally speaking, given our data, the least-squares solution for the two-variable model will be indeterminate.
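To spell out the computational point: the column of the design matrix corresponding to X2 is exactly twice the column for X1, so X^\top X is singular. The normal equations

X^\top X b = X^\top Y

therefore have infinitely many solutions, and (X^\top X)^{-1} does not exist.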
SAS output for the regression of Y on X1 and X2, where X2 = 2 * X1: the Analysis of Variance table (Source, DF, Sum of Squares, Mean Square, F Value, Pr > F) and the Parameter Estimates table (Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|).
Collinearity Diagnostics
• Compute the correlation matrix for your set of variables. A rule of thumb is to be very careful about including two predictors with an r2 of around 0.8 or higher.
• Regress each predictor on all others in turn, and examine the value of the coefficient of determination R2 for each model.
• Examine the marginal effect on the fit of coefficients already in the model when
an additional factor is added. For example, if a previously included variable has
a highly significant coefficient whose magnitude, sign, and/or p-value
dramatically changes with the addition of another factor, then those two
predictors should be closely examined.
• (Not mentioned in the text.) Examine the eigenvalues of the correlation matrix of the predictors. Eigenvalues equal to zero or relatively close to zero indicate singularity or near-singularity in this matrix. The ratio of the largest to smallest eigenvalue is called the condition number of the matrix. A common rule of thumb is that a condition number > 30 represents a red flag.
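As a sketch of how these diagnostics can be requested in SAS, using the cholesterol data of Example VI.C (the vif option, which reports 1/(1 - R2) from regressing each predictor on the others, is an addition not mentioned above):

proc corr data=cholesterol;
   /* pairwise correlations among the predictors */
   var weight age;
run;

proc reg data=cholesterol;
   /* vif prints variance inflation factors; collin prints
      eigenvalues and condition indices of the scaled X'X matrix */
   model chol = weight age / vif collin;
run;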