Review Lecture
The goal of regression analysis is to model the relationship between one or more independent variables (predictors) and a dependent variable (response).
1 Estimation: model the relationship between the predictor(s) and the response using an observed data set.
2 Prediction: predict new outcomes for a new set of inputs using the fitted model.
Examples:
(2) high school grade point average (gpa) and college entrance test score
[Figure: example scatterplots; one axis is labeled Celsius]
Yi = β0 + β1 xi + εi , i = 1, . . . , n
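A minimal sketch of this model in Python (not from the lecture; the parameter values and data are made up) that simulates Yi = β0 + β1 xi + εi and recovers the least-squares estimates:

```python
# Hypothetical simulation of the simple linear regression model.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 0.5, 1.0    # assumed population parameters
x = rng.uniform(30, 45, size=50)       # predictor values
eps = rng.normal(0.0, sigma, size=50)  # errors: independent N(0, sigma^2)
y = beta0 + beta1 * x + eps            # response

# Least-squares estimates of the intercept and slope
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # should land near beta0 and beta1
```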
[Plot: heights (roughly 60–70 inches) of Women and Men, grouped by Gender]
If we connect the mean heights of the two genders, we get a “line.”
Normal Distribution
Inference for β1 (Population Slope)
b1 ± t(n−2, 1−α/2) · se(b1),

where se(b1) = σ̂/√Sxx. Recall that σ̂² = MSE, MSE = (1/(n − 2)) Σ_{i=1}^{n} (Yi − Ŷi)², and Sxx = Σ_{i=1}^{n} (xi − x̄)².
Hypothesis Test for β1
1 Hypothesis Test
H0 : β1 = β10 and H1 : β1 ≠ β10 ,
where β10 is a fixed value. Usually β10 = 0.
2 Test statistic
t* = (b1 − β10) / se(b1)
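A sketch of the slope inference above in Python, with placeholder data (swap in your own x and y); it computes se(b1), the confidence interval, and the t-test for H0: β1 = 0:

```python
import numpy as np
from scipy import stats

x = np.array([30., 32., 35., 38., 40., 42., 45.])    # hypothetical data
y = np.array([202., 191., 183., 168., 161., 149., 140.])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

mse = np.sum(resid ** 2) / (n - 2)         # sigma-hat^2 = MSE
sxx = np.sum((x - x.mean()) ** 2)          # Sxx
se_b1 = np.sqrt(mse / sxx)                 # se(b1) = sigma-hat / sqrt(Sxx)

alpha = 0.05
t_mult = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci = (b1 - t_mult * se_b1, b1 + t_mult * se_b1)   # b1 +/- t * se(b1)

t_star = (b1 - 0) / se_b1                  # test statistic with beta10 = 0
p_value = 2 * stats.t.sf(abs(t_star), df=n - 2)
print(ci, t_star, p_value)
```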
[Scatterplot showing an observation Yi, its fitted value Ŷi, and the mean Ȳ relative to the regression line]
3 If R2 = 1, all of the data points fall on the regression line. The predictor x
accounts for all of the variation in Y .
[Scatterplot: GPA (y-axis, 1.5–3.5) vs. Height (x-axis, 60–70), with the fitted line Ŷ and the mean line Ȳ; R² = 0.0028]
In this example SSR = 0.0276, SSE = 9.7055, and SSTO = 9.7331. Note that R² = SSR/SSTO = 0.0028.
Skin cancer mortality rates (Y ) vs latitude (x):
[Scatterplot: Mortality (Deaths per 10 million; y-axis, 100–220) vs. latitude (x-axis, 30–45), with the fitted line Ŷ and the mean line Ȳ; R² = 0.6798]
In this example SSR = 36464.2, SSE = 17173.07, and SSTO = 53637.27. Note that R² = SSR/SSTO = 0.6798.
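A short sketch of the R² bookkeeping with made-up numbers, just to show how SSR, SSE, and SSTO relate:

```python
import numpy as np

x = np.array([61., 63., 65., 67., 69., 71.])   # hypothetical heights
y = np.array([2.8, 3.1, 2.6, 3.4, 2.9, 3.2])   # hypothetical GPAs

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssto = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
sse = np.sum((y - y_hat) ** 2)          # error sum of squares
print(ssto, ssr + sse)                  # SSTO = SSR + SSE
print(ssr / ssto)                       # R^2
```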
Interpretation of R2 when 0 < R2 < 1
When R² is some number between 0 and 1, like 0.6 or 0.3, we say either
“R² × 100% of the variation in Y is reduced by taking into account predictor x,”
or
“R² × 100% of the variation in Y is ‘explained by’ the variation in predictor x.”
For the mean response E(Yh) when the predictor value is xh, the general formula for the confidence interval in words is: sample estimate ± (t-multiplier × standard error), i.e., Ŷh ± t(n−2, 1−α/2) · se(Ŷh).
1 Ŷh is the fitted value or predicted value of the response when the predictor is
xh .
2 tn−2,1−α/2 is the t-multiplier.
3 se(Ŷh) = σ̂ √(1/n + (xh − x̄)²/Sxx), and it is the standard error of Ŷh.
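A by-hand sketch of this confidence interval with hypothetical data; it evaluates se(Ŷh) at xh and forms Ŷh ± t · se(Ŷh):

```python
import numpy as np
from scipy import stats

x = np.array([30., 33., 36., 39., 42., 45.])      # hypothetical data
y = np.array([210., 196., 184., 171., 152., 137.])
n, xh, alpha = len(x), 40.0, 0.05

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)
sxx = np.sum((x - x.mean()) ** 2)

y_hat_h = b0 + b1 * xh                            # fitted value at xh
se_fit = np.sqrt(mse * (1 / n + (xh - x.mean()) ** 2 / sxx))
t_mult = stats.t.ppf(1 - alpha / 2, df=n - 2)
print(y_hat_h - t_mult * se_fit, y_hat_h + t_mult * se_fit)
```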
Prediction Interval for a New Response Yh(new)
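A sketch using statsmodels (assumed available; data are hypothetical): the summary_frame() from get_prediction() reports both the confidence interval for E(Yh) (the mean_ci_* columns) and the wider prediction interval for Yh(new) (the obs_ci_* columns):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([30., 33., 36., 39., 42., 45.])
y = np.array([210., 196., 184., 171., 152., 137.])
res = sm.OLS(y, sm.add_constant(x)).fit()

x_new = np.array([[1.0, 40.0]])        # [intercept, xh = 40]
pred = res.get_prediction(x_new)
print(pred.summary_frame(alpha=0.05))  # mean_ci_* = CI, obs_ci_* = PI
```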
The decomposition Σᵢ (Yi − Ȳ)² = Σᵢ (Ŷi − Ȳ)² + Σᵢ (Yi − Ŷi)², which is equivalent to

SSTO = SSR + SSE
The formula for each entry is summarized for you in the following analysis of variance table:

Source of Variation   DF      SS      MS                  F
Regression            1       SSR     MSR = SSR/1         F* = MSR/MSE
Error                 n − 2   SSE     MSE = SSE/(n − 2)
Total                 n − 1   SSTO
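In software, this table is produced for you; a sketch with statsmodels' formula interface and made-up data (the Total row is the sum of the two rows shown):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({'x': [30., 33., 36., 39., 42., 45.],
                   'y': [210., 196., 184., 171., 152., 137.]})
res = smf.ols('y ~ x', data=df).fit()
print(anova_lm(res))  # rows 'x' (regression) and 'Residual': df, SS, MS, F
```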
Residual Analysis (Diagnostics) To Check “LINE”
Conditions
To conduct a linear regression analysis, the four “LINE” conditions must hold. How can we know whether or not the four “LINE” conditions hold?
The “tools” are based on the residuals (the observed prediction errors):
[Residuals vs. Fits plot: Residual (y-axis, −40 to 40) against Fitted Values (x-axis)]
Normally Distributed Residuals
The following normal Q-Q plot suggests that the residuals (and hence the error
terms) are normally distributed:
[Normal Q-Q plot: sample quantiles of the residuals against Theoretical Quantiles]
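A sketch that draws both diagnostic plots above for a fitted line, using matplotlib and scipy (assumed available; data are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([30., 33., 36., 39., 42., 45.])
y = np.array([210., 196., 184., 171., 152., 137.])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x
resid = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(fitted, resid)                    # want a patternless horizontal band
ax1.axhline(0, color='gray')
ax1.set(xlabel='Fitted Values', ylabel='Residual')
stats.probplot(resid, dist='norm', plot=ax2)  # points near the line => normal
plt.show()
```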
Remedies (Transformations) when “LINE” Conditions
are not Met
(2) What transformation do we use when non-normality and/or unequal variances are the
problem(s)?
In this case, we transform the Y values; log-transforming Y is one commonly used remedy.
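A minimal sketch of this remedy (hypothetical, right-skewed data; Y must be positive for the log):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([12., 30., 55., 150., 320., 800.])  # spread grows with the mean

log_y = np.log(y)  # fit the same straight-line model to log(Y) instead of Y
b1 = np.sum((x - x.mean()) * (log_y - log_y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = log_y.mean() - b1 * x.mean()
print(np.exp(b0 + b1 * x))  # back-transform fitted values to the original scale
```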
A population model for a multiple linear regression model that relates a response Y to (p − 1) x variables is written as

Yi = β0 + β1 xi1 + β2 xi2 + · · · + βp−1 xi,p−1 + εi .
1 We assume that the εi are independent and have a normal distribution with mean 0
and constant variance σ 2 .
2 The subscript i refers to the ith individual or unit in the population, and for the x
variables, the subscript following i simply denotes which x variable it is.
3 The word “linear” in “multiple linear regression” refers to the fact that the model is
linear in the parameters β0 , β1 , ...βp−1 .
2 The intercept term, β0 , represents the mean response, E(Y ), when all the predictors
x1 , x2 , . . . , xp−1 , are zero. As in a simple linear regression setting, it may or may not
have any practical meaning.
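A sketch of fitting this model with p − 1 = 2 predictors via statsmodels (assumed available; data made up):

```python
import numpy as np
import statsmodels.api as sm

x1 = np.array([1., 2., 3., 4., 5., 6., 7.])
x2 = np.array([2., 1., 4., 3., 6., 5., 8.])
y = np.array([5., 6., 10., 10., 15., 14., 20.])

X = sm.add_constant(np.column_stack([x1, x2]))  # columns: 1, x1, x2
res = sm.OLS(y, X).fit()
print(res.params)  # b0, b1, b2 estimating beta0, beta1, beta2
```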
General F-Test

The general linear F-test compares a full model with a reduced model:

F* = [(SSE(R) − SSE(F)) / (dfR − dfF)] / [SSE(F) / dfF],

where SSE(R) and SSE(F) are the error sums of squares of the reduced and full models and dfR, dfF are their error degrees of freedom.
1 Ŷh is the “fitted value” or “predicted value” of the response when the predictor values
are xh .
Ŷh = b0 + b1 xh1 + b2 xh2 + ... + bp−1 xh(p−1)
2 tα/2,n−p is the t-multiplier. Note again that the t-multiplier has n − p degrees of
freedom because the prediction interval uses the mean square error (M SE) whose
denominator is n − p.
3 Observe that the only difference in the formulas is that the standard error of the
prediction for Yh,(new) has an extra M SE term in it that the standard error of the fit for
E(Yh ) does not.
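A sketch of the matrix form behind items 1–3 (hypothetical data): se(fit) is built from xh'(X'X)⁻¹xh, and the prediction's standard error carries the one extra MSE term noted above:

```python
import numpy as np
from scipy import stats

X = np.array([[1., 1., 2.], [1., 2., 1.], [1., 3., 4.], [1., 4., 3.],
              [1., 5., 6.], [1., 6., 5.], [1., 7., 8.]])  # columns: 1, x1, x2
y = np.array([5., 6., 10., 10., 15., 14., 20.])
n, p = X.shape

b = np.linalg.solve(X.T @ X, X.T @ y)          # least-squares coefficients
mse = np.sum((y - X @ b) ** 2) / (n - p)       # MSE with n - p denominator

xh = np.array([1., 4., 4.])                    # [1, xh1, xh2]
yh = xh @ b
quad = xh @ np.linalg.solve(X.T @ X, xh)       # xh'(X'X)^(-1)xh
se_fit = np.sqrt(mse * quad)                   # for the CI on E(Yh)
se_pred = np.sqrt(mse * (1 + quad))            # extra MSE term for Yh(new)
t_mult = stats.t.ppf(0.975, df=n - p)
print(yh - t_mult * se_fit, yh + t_mult * se_fit)     # CI
print(yh - t_mult * se_pred, yh + t_mult * se_pred)   # wider PI
```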
Multiple Linear Regression Model Assumptions
The four conditions (“LINE”) that comprise the multiple linear regression model generalize the
simple linear regression model conditions to take account of the fact that we now have multiple
predictors:
1 The mean of the response, E(Yi), at each set of values of the predictors, (xi1, xi2, . . .), is a Linear function.
2 The errors, εi, are Independent.
3 The errors, εi, at each set of values of the predictors, (xi1, xi2, . . .), are Normally distributed.
4 The errors, εi, at each set of values of the predictors, (xi1, xi2, . . .), have Equal variances (denoted σ²).
Similarly to simple linear regression, an alternative way to describe all four assumptions is that the
errors, εi , are independent normal random variables with mean zero and constant variance, σ 2 .
Residual Analysis (Diagnostics)
As in simple linear regression, we can assess whether these conditions seem to hold for a
multiple linear regression model applied to a particular sample dataset by looking at the prediction
errors, i.e., the residuals, ei = Yi − Ŷi .
[Scatterplot: Birth Weight (y-axis, 2400–3600) vs. Length of Gestation (x-axis, 34–42), with parallel fitted lines for Non-Smoker and Smoker]
The regression model contains additive effects and the response function can be written as a sum of functions of the predictor variables:

E(Yi) = β0 + β1 xi1 + β2 xi2 ,

where Yi is birth weight for individual i, xi1 is length of gestation for individual i, and xi2 is a categorical variable representing smoking status (Yes/No).
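A sketch of fitting this additive model with a 0/1 dummy for smoking status (column names and numbers are made up); the fitted lines for the two groups share one slope, i.e., they are parallel:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    'weight':    [2900., 3080., 3300., 2700., 2910., 3100.],
    'gestation': [38., 40., 42., 38., 40., 42.],
    'smoke':     [0, 0, 0, 1, 1, 1],   # 1 = smoker, 0 = non-smoker
})
res = smf.ols('weight ~ gestation + smoke', data=df).fit()
print(res.params)  # one gestation slope; 'smoke' only shifts the intercept
```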
2. If there’s an interaction effect between a continuous predictor and a categorical
(qualitative) predictor, then the fitted lines for each category are non-parallel.
[Scatterplot titled “Treatment Effectiveness vs Age”: Treatment Effectiveness (y-axis, 20–70) vs. Age (x-axis, 20–60), with non-parallel fitted lines for treatments A, B, and C]
The regression model contains interaction effects and the response function cannot be written as a sum of functions of the predictor variables:
E(Yi ) = β0 + β1 xi1 + β2 xi2 + β3 xi3 + β12 xi1 xi2 + β13 xi1 xi3
where Yi is treatment effectiveness for individual i, xi1 is age for individual i, xi2 = 1 if individual i
receives treatment A; 0 otherwise, and xi3 = 1 if individual i receives treatment B; 0 otherwise.
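A sketch of fitting this interaction model with the formula interface (made-up data; note statsmodels picks its own reference category, here A, whereas the slide's coding uses C as the baseline):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    'effectiveness': [30., 41., 50., 25., 30., 35., 20., 22., 24.],
    'age':           [20., 40., 60., 20., 40., 60., 20., 40., 60.],
    'treatment':     ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
})
# 'age * treatment' expands to main effects plus the age:treatment interactions
res = smf.ols('effectiveness ~ age * treatment', data=df).fit()
print(res.params)  # age:treatment[T.B], age:treatment[T.C] let the slopes differ
```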
1 Recall that if we have two continuous predictors xi1 and xi2 in a multiple linear regression
model, we are fitting a “plane” to an observed data set.
2 What will the plane look like if there is an interaction effect between xi1 and xi2? It will be a “warped plane.”
3 Example: A data set consists of 654 observations on children aged 3 to 19. Forced
Expiratory Volume (FEV), which is a measure of lung capacity, is the response (Y ). Age (x1 )
and height (x2 ) are two continuous predictors.
[Two 3-D surface plots of FEV against age and height: a flat plane (no interaction effect between height and age) and a warped plane (interaction effect between height and age)]
Variable Selection and Model Building
Strategy:
1 Know your goal, know your research question. Knowing how you plan to use your
regression model can assist greatly in the model building stage.
2 Then, at each step along the way, we either enter or remove a predictor based on some criterion, for example:
1 general (partial) F-tests – that is, the t-tests for the slope parameters – obtained at each step.
3 We stop when no more predictors can be justifiably entered or removed from our
stepwise model.
Information Criteria (Other Criteria to Choose
Predictors)
1 Notice that the only difference between AIC and BIC is the multiplier of p, the number of parameters: AIC = n log(SSE/n) + 2p, while BIC = n log(SSE/n) + p log(n).
2 The BIC places a higher penalty (log(n) > 2 when n > 7) on the number of parameters in
the model, so it will tend to reward more parsimonious (smaller) models.
3 For regression models, the information criteria combine information about the SSE, number
of parameters p in the model, and the sample size n.
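A sketch of the SSE-based forms of these criteria (additive constants are sometimes dropped, so software may report shifted values):

```python
import numpy as np

def aic_bic(sse: float, n: int, p: int) -> tuple:
    aic = n * np.log(sse / n) + 2 * p           # multiplier of p is 2
    bic = n * np.log(sse / n) + p * np.log(n)   # multiplier of p is log(n)
    return aic, bic

print(aic_bic(sse=170.0, n=50, p=3))  # hypothetical SSE, n, and p
```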
The general idea behind best subsets regression is that we select the subset of
predictors that do the best at meeting some well-defined objective criterion:
2 largest adjusted R2
Outliers and high leverage data points have the potential to be influential, but we
generally have to investigate further to determine whether or not they are actually
influential.
Example 1
[Scatterplot of Y vs x (x roughly 0–8, y roughly 10–50), with a red data point included]
It’s not an influential point because the predicted responses, estimated slope coefficients, and
hypothesis test results are not affected by the inclusion of the red data point.
Example 2
[Scatterplot of Y vs x (x roughly 0–14, y roughly 10–70), with a red data point included]
It’s not an influential point because the predicted responses, estimated slope coefficients, and
hypothesis test results are not affected by the inclusion of the red data point.
Example 3
[Scatterplot of Y vs x (x roughly 0–12, y roughly 10–50), with a red data point included]
It is also an influential point. The predicted responses and estimated slope coefficients are clearly
affected by the presence of the red data point.
The Idea of “Leave-One-Out”
[Scatterplot with regression lines fitted with and without the red data point]
Observe that the red data point “pulls” the estimated regression line towards it. When the
red data point is omitted, the estimated regression line “bounces back” away from the
point.
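A sketch of quantifying this leave-one-out idea with Cook's distance from statsmodels (made-up data with one high-leverage point that falls off the line):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1., 2., 3., 4., 5., 6., 7., 13.])    # last point: extreme x
y = np.array([2., 4., 6., 8., 10., 12., 14., 10.]) # ...and far from the line
res = sm.OLS(y, sm.add_constant(x)).fit()

cooks_d, _ = res.get_influence().cooks_distance    # one value per observation
print(cooks_d)  # the last observation's Cook's distance dominates
```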
Some Advice with Problematic Data Points
1 If the data point is a procedural error and invalidates the measurement, delete it.
2 If the data point is not representative of the intended study population, delete it.
Consider the possibility that you might have just mis-formulated your regression model:
1 Do not delete data points just because they do not fit your pre-conceived regression model.
2 You must have a good, objective reason for deleting data points.
3 If you delete any data after you’ve collected it, justify and describe it in your reports.
4 If you are not sure what to do about a data point, analyze the data twice – once with and
once without the data point – and report the results of both analyses.