
Chapter 3

STATS 101A Introduction to Data Analysis and Regression


Maria Cha
Recall : Model Assumption
• In the population linear model,
Yᵢ = β₀ + β₁Xᵢ + eᵢ

• Assumptions about the error term (eᵢ):
• 1. They are random and independent of one another.
• 2. They are normally distributed.
• 3. Their mean is 0, and their variance is σ², which is usually unknown.
eᵢ ~ N(0, σ²)
Recall : Model Assumption
• Based on these assumptions,
• E(Y|X = x) = β₀ + β₁x : the average Y for a given X is a linear function of X.
• Var(Y|X = x) = σ² : the variance of Y is constant.

• Question: Do testing the slope, the ANOVA for the model, and R² give enough information about the model's validity? Do we need another step to check whether the model is valid?
Valid model
• Example: Anscombe's four data sets
Valid model
• Model 1: Y₁ vs. X₁
• Regression model: Ŷ₁ = 3.0 + 0.5X₁
• R results: The slope is significant (p-value: 0.002), the regression model is significant (p-value: 0.002), and R² = 0.67.

• The regression model (Ŷ = 3.0 + 0.5X) and the corresponding R results are the same for Model 2 (Y₂ vs. X₂), Model 3 (Y₃ vs. X₃), and Model 4 (Y₄ vs. X₄).
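• A quick check in R: Anscombe's four data sets ship with R as the built-in data frame anscombe, so the identical numerical output can be verified directly.

summary(lm(y1 ~ x1, data = anscombe))  # intercept ≈ 3.0, slope ≈ 0.5, R² ≈ 0.67
summary(lm(y2 ~ x2, data = anscombe))  # (nearly) the same numbers from very different data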
Valid model
• The numerical regression output suggests that all four models are significant.

• Remember, all tests and analyses for the regression model are mathematically derived from the same assumptions about the error term.
Valid model
• Let's take a look at the graphical output, i.e. the scatter plots of Y vs. X.

• What can we observe or check from the scatter plot?
• 1. The average Y should be a linear function of X.
E(Y|X = x) = β₀ + β₁x
• 2. The variance of Y given X should be constant.
Var(Y|X = x) = σ²
Valid model
• The numerical regression output should always be
supplemented by an analysis to ensure that an
appropriate model has been fitted to the data.

• Even though the regression output suggests a significant regression model, it may not be an appropriate model if the model assumptions are violated.

• We need diagnostic tools for checking the model assumptions.
Diagnostic tool (1) Residual
• Residuals give us important information for checking the model assumptions on the error term.

• If β̂₀ and β̂₁ are good estimates of β₀ and β₁, i.e. close enough to the true values, then the residual êᵢ will resemble the error term eᵢ:

êᵢ = yᵢ − ŷᵢ
   = (β₀ + β₁xᵢ + eᵢ) − (β̂₀ + β̂₁xᵢ)
   = (β₀ − β̂₀) + (β₁ − β̂₁)xᵢ + eᵢ
   ≈ eᵢ
Diagnostic tool (1) Residual
• What can we observe or check from the residual plot, i.e. the scatter plot of residuals vs. X?

• Recall that we assume eᵢ ~ N(0, σ²).

• Thus, from the plot of residuals vs. X we want to check
• 1. whether the average of the residuals is zero (around a horizontal line at 0),
• 2. whether the variance of the residuals is constant (not dependent on x),
• 3. whether the residuals show any discernible pattern (see the sketch below).
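• A minimal sketch of this check in R, using simulated data (hypothetical, generated to satisfy the assumptions):

set.seed(1)
x <- runif(50, 0, 10)
y <- 3 + 0.5 * x + rnorm(50)           # errors are N(0, 1) by construction
m <- lm(y ~ x)
plot(x, resid(m), ylab = "Residuals")  # residual plot: residuals vs. x
abline(h = 0, lty = 2)                 # points should scatter evenly around 0, with no pattern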
Diagnostic tool (2) Leverage
• Leverage points: data points that exert considerable influence on the fitted model; their x-values are distant from the other x-values.
• Good leverage point: its y-value follows the trend pattern, i.e. it is not an outlier.
• Bad leverage point: its y-value does not follow the trend pattern, so it is also an outlier.
• Note: an outlier need not be a leverage point. In that case its x-value is not distant from the other x-values, so it does not "lever" the regression line up or down as much.
Diagnostic tool (2) Leverage
• Good vs. bad leverage points
• Adding an extraordinary point may or may not lever the regression line (original fit in black, new fit in red).
Diagnostic tool (2) Leverage
• Leverages (hᵢᵢ) measure how the predicted ŷ changes if we were to change the observed y for each observation i.

• In other words, how does ŷᵢ (mathematically) depend on yᵢ?

ŷᵢ = Σⱼ hᵢⱼ yⱼ , where hᵢⱼ = 1/n + (xᵢ − x̄)(xⱼ − x̄)/SXX.
Diagnostic tool (2) Leverage
• Show: ŷᵢ = Σⱼ hᵢⱼ yⱼ.

ŷᵢ = β̂₀ + β̂₁xᵢ
   = (ȳ − β̂₁x̄) + β̂₁xᵢ
   = ȳ + β̂₁(xᵢ − x̄)
   = Σⱼ₌₁ⁿ (1/n)yⱼ + [Σⱼ₌₁ⁿ (xⱼ − x̄)yⱼ / SXX](xᵢ − x̄)
   = Σⱼ₌₁ⁿ [1/n + (xᵢ − x̄)(xⱼ − x̄)/SXX] yⱼ
   = Σⱼ₌₁ⁿ hᵢⱼ yⱼ
Diagnostic tool (2) Leverage
• 'Hat matrix' H = (hᵢⱼ), where i, j = 1, …, n:

    ⎡ h₁₁  h₁₂  …  h₁ₙ ⎤
H = ⎢ h₂₁  h₂₂  …   ⋮  ⎥
    ⎢  ⋮    ⋮   ⋱   ⋮  ⎥
    ⎣ hₙ₁   …   …  hₙₙ ⎦

• The diagonal entries of the hat matrix are the 'leverages'.
• Leverages are independent of the y values; they depend only on the x values.
Diagnostic tool (2) Leverage
• Leverage (hᵢᵢ) of xᵢ:

hᵢᵢ = 1/n + (xᵢ − x̄)² / Σⱼ(xⱼ − x̄)²

• Note that Σⱼ₌₁ⁿ hᵢⱼ = 1. If hᵢᵢ ≅ 1, then the other hᵢⱼ are close to zero. Thus hᵢᵢ shows how yᵢ affects ŷᵢ, since ŷᵢ = Σⱼ hᵢⱼ yⱼ.

• The average of the leverages is h̄ = 2/n.

• The higher the leverage, the more influential the data point. More specifically, if hᵢᵢ > 4/n, i.e. if a leverage is greater than twice its average value, then the corresponding xᵢ is a leverage point.
Diagnostic tool (2) Leverage
• Revisit the example: Suppose we now have 6 observations of heights and weights, and want to predict weight from height using a simple linear regression model.

X     Y     X − X̄   (X − X̄)²
60    105   −10     100
66    140   −4      16
72    185   2       4
70    145   0       0
62    120   −8      64
90    250   20      400
X̄ = 70, Ȳ = 157.5, Σ(X − X̄)² = 584

• The entries of the hat matrix are hᵢⱼ = 1/n + (xᵢ − x̄)(xⱼ − x̄)/SXX, with n = 6 and SXX = 584.
Diagnostic tool (2) Leverage
• The diagonal entries (hᵢᵢ) are the leverages.

• Based on the leverages, which observation(s) is (are) leverage point(s)?

• Determine whether it (they) is (are) good or bad leverage point(s).
Diagnostic tool (2) Leverage
• In R, hatvalues() returns the leverages of a fitted model.
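A short sketch for the example data above (the computed value for x = 90 matches the hand calculation):

x <- c(60, 66, 72, 70, 62, 90)        # heights from the example
y <- c(105, 140, 185, 145, 120, 250)  # weights
m <- lm(y ~ x)
hatvalues(m)   # leverages h_ii; h ≈ 0.85 for x = 90
4 / length(x)  # cutoff 4/n ≈ 0.67, so x = 90 is a leverage point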
Diagnostic tool (3) Standardized residual
• We can write the variances of the i-th residual and the i-th fitted value in terms of the leverages (hᵢᵢ):

Var(êᵢ) = σ²(1 − hᵢᵢ)
Var(ŷᵢ) = σ²hᵢᵢ
(*See p. 61 for the derivation.)

• This implies that the variance of the residuals is not constant; it depends on hᵢᵢ.

• The relationship between hᵢᵢ and prediction: if hᵢᵢ is close to 1, i.e. xᵢ is a leverage point, then

ŷᵢ = hᵢᵢyᵢ + Σⱼ≠ᵢ hᵢⱼyⱼ ≈ yᵢ
Diagnostic tool (3) Standardized residual
• Now it makes sense that if hᵢᵢ is close to 1, i.e. xᵢ is a leverage point, then

Var(êᵢ) = σ²(1 − hᵢᵢ) ≈ 0 (very small)
Var(ŷᵢ) = σ²hᵢᵢ ≈ σ² = Var(yᵢ)

• Standardized residual (rᵢ): needed to deal with the non-constant variance of the residuals when the data include leverage points.

rᵢ = êᵢ / (s√(1 − hᵢᵢ))

, where s is the estimate of σ, i.e. s = √(RSS/(n − 2)).
Diagnostic tool (3) Standardized residual
• When (high) leverage points exist, it is more informative to look at plots of the standardized residuals instead of plots of the raw residuals.

• Why: even though the (population) errors have a constant variance, the residuals can show non-constant variance.

• Interpretation of a standardized residual: how many estimated standard deviations a point lies away from the fitted regression model (think of the interpretation of a z-score).
Diagnostic tool (3) Standardized residual
• If a standardized residual falls outside the interval from −2 to 2, the point is considered an outlier.

• In addition, any leverage point whose standardized residual falls outside the interval from −2 to 2 is considered a bad leverage point (see the sketch below).
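In R, rstandard() returns the standardized residuals of a fitted model; a brief sketch, reusing the model m fitted in the leverage example:

r <- rstandard(m)  # standardized residuals
which(abs(r) > 2)  # flag potential outliers
# a flagged point that is also a leverage point is a bad leverage point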
Diagnostic tool (3) Standardized residual
• Standardized residuals are also used to check the constant-variance assumption.

• It is recommended to plot √|standardized residual| vs. X (or the fitted Y values); the square root is used to reduce the skewness of the absolute values.

• Interpretation of this plot will be discussed in the later slides.
Diagnostic tool (4) Cook’s distance
• Using leverages and standardized residuals, we are able to identify which observations could be problematic and violate the model assumptions.

• Cook's distance measures and assesses the influence of individual cases:

Dᵢ = Σⱼ₌₁ⁿ (ŷⱼ₍ᵢ₎ − ŷⱼ)² / (2S²) = (rᵢ²/2) · hᵢᵢ/(1 − hᵢᵢ)

• ŷⱼ₍ᵢ₎ denotes the j-th fitted value based on the fit obtained when the i-th case has been deleted from the fit.
• rᵢ is the i-th standardized residual and hᵢᵢ is the i-th leverage.
Diagnostic tool (4) Cook’s distance
Dᵢ = (rᵢ²/2) · hᵢᵢ/(1 − hᵢᵢ)

• Cook's distance increases when the observation produces a large standardized residual, i.e. the point lies far away from the regression line. It also increases when the leverage is large, i.e. the x value is far away from its mean.
• A common guideline flags a Cook's distance as "big" when

Dᵢ > 4/(n − 2)

, which is a somewhat strict cutoff (a sketch of this check follows).
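In R, cooks.distance() returns the Dᵢ of a fitted model; a brief sketch, again reusing m:

d <- cooks.distance(m)        # Cook's distances
which(d > 4 / (nobs(m) - 2))  # flag influential cases by the 4/(n − 2) rule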


Diagnostic tool (5) Normal Q-Q plot
• The normal Q-Q plot is one of the most widely used tools for assessing the normality of the errors.

• It is a scatter plot of the ordered standardized residuals vs. the expected order statistics from N(0, 1).

• If the plot is close to a straight line, the errors are considered consistent with the normal distribution.
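A minimal sketch in R for the model m from above:

r <- rstandard(m)
qqnorm(r)  # ordered standardized residuals vs. N(0, 1) quantiles
qqline(r)  # reference line; points close to it support normality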
Diagnostic tool : R result
• plot(model) produces four diagnostic plots (data: iowatest.txt).
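A sketch of the call (the variable names in iowatest.txt are not shown on the slide, so the formula here is hypothetical):

iowa <- read.table("iowatest.txt", header = TRUE)  # assumed file layout
model <- lm(y ~ x, data = iowa)                    # hypothetical variable names
par(mfrow = c(2, 2))                               # arrange the four plots in a 2x2 grid
plot(model)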
Diagnostic tool : R result
• The plot of residuals vs. fitted values is equivalent to the plot of residuals vs. X, since the fitted Y is a linear function of X.

• The red line shows whether 1) the relationship is linear (the line should be straight), and 2) the average of the errors is 0.
Diagnostic tool : R result
• The normal Q-Q plot checks the normality of the error term.
• If the points are aligned with the straight line, this implies normality of the errors.
• Observation 70: r₇₀ < −3, but it should have been (theoretically) about −2.7 if all the rᵢ's came from the normal distribution.
Diagnostic tool : R result
• The plot of √|standardized residuals| vs. fitted Y values checks the constant variance of the error term.
• If it does not show any apparent pattern, it suggests constant variance.
Diagnostic tool : R result
• The plot of standardized residuals vs. leverage checks for outliers and (possible) influential points.

• Check whether the rᵢ's fall outside (−2, 2).
• Check whether the leverage is greater than 4/n = 0.03 (n = 133).
• Check whether the Dᵢ's are greater than 4/(n − 2) = 0.0305.
Diagnostic tool : R result
• armspan2015.txt data
• Check whether the rᵢ's fall outside (−2, 2): i = 47.
• Check whether the leverage is greater than 4/n = 0.056 (n = 71): i = 67.
• Check whether the Dᵢ's are greater than 4/(n − 2) = 0.058: i = 46, 67.
So... what should we do?
• Points should not be routinely deleted from an analysis just because they do not fit the model. Outliers and bad leverage points are signals, flagging potential problems with the model.
• Outliers often point out an important feature of the problem not considered before. They may point to an alternative model in which the points are no longer outliers. In this case it is worth considering an alternative model, obtained by including extra predictor variables or by transforming Y and/or X.
• When the assumption that the variance of the errors is constant does not hold, we can use weighted least squares to account for the changing variance (Chapter 4).
Recall : Model Assumption
• In the population linear model

Yᵢ = β₀ + β₁Xᵢ + eᵢ

, we assume that

eᵢ ~ N(0, σ²).

• In other words, we want to check
• 1. linearity of the relationship,
• 2. independence and randomness of the error term (difficult to check),
• 3. normality of the error term (or Y),
• 4. constant variance of the error term (homoscedasticity),
• (5. outliers or influential points).
Checking the assumption
• 1. Linearity of the relationship:
o scatter plot of Y vs. X
o scatter plot of residuals vs. X (or Ŷ)
• 2. Normality of the error term or Y:
o normal Q-Q plot
• 3. Constant variance of the error term (homoscedasticity):
o scatter plot of residuals vs. X (or Ŷ)
o scatter plot of |standardized residuals|⁰·⁵ vs. X (or Ŷ)
• 4. Outliers or influential points:
o leverages, standardized residuals, Cook's distance
o scatter plot of standardized residuals vs. leverage
What if they are violated?
• 1. Linearity of the relationship fails: the predictions and the estimates of the slope and intercept will be biased.
• 2. Normality of the error term (or Y) fails: the estimates of the intercept and slope, as well as the predicted values, are still unbiased.
• Large sample size: the p-values for the t-tests are good approximations, and so are the confidence intervals for the slope and intercept.
• Small sample size: then everything falls apart.
• If normality fails, prediction intervals are not accurate regardless of the size of n (the CLT doesn't apply).
• 3. Constant variance or independence of the error term fails: all inference tools are invalidated.
Transformation
• Transformation is the most widely used tool to overcome violations of the assumptions.
• It can change the distribution (shape) of a variable, and points that are outliers in the original data may no longer be outliers in the transformed data.
How to find the best transformation
• Method 1: Inverse response plot
• Suppose that the true regression model is

Y = g(β₀ + β₁X + e)

, so X and Y are not linearly associated.

• In other words, the transformed Y is linearly associated with X:

g⁻¹(Y) = β₀ + β₁X + e

• For example, if g(Y) = exp(Y), then g⁻¹(Y) = log(Y); or if g(Y) = Y², then g⁻¹(Y) = √Y.
How to find the best transformation
• How do we find the best g⁻¹(Y)?
• Consider the power transformation

g⁻¹(Y) = Y^λ

, where λ = 0 corresponds to the natural logarithm.

• Try different λ's and choose the λ that minimizes the RSS of the model.

• Inverse response plot: a scatter plot of the fitted values Ŷ = β̂₀ + β̂₁X vs. Y; fitting Ŷ against Y^λ for different λ visualizes the λ that gives the best fit for the data (see the sketch below).
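The car package provides this plot; a sketch, assuming a hypothetical data frame dat with columns y and x:

library(car)                # inverseResponsePlot() is from the car package
m <- lm(y ~ x, data = dat)  # hypothetical data frame
inverseResponsePlot(m)      # overlays fits for several λ and reports the λ with the smallest RSS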
How to find the best transformation
• Example: the best choice is λ = 1/3. It gives the best fit in the inverse response plot and results in the minimum RSS of the model.
How to find the best transformation
• Method 2: Box-Cox transformation
• The aim of this method is to ensure that the usual normality assumption holds, i.e.

Y ~ N(β₀ + β₁X, σ²)

• When both X and Y are normally distributed, the regression of Y on X is linear.
• So the Box-Cox transformation tries to make Y and X as close to normal as possible, to make the regression linear.
• Note: the normality of X is not part of the model assumptions.
How to find the best transformation
• It is based on the maximum likelihood method.

• If the assumption is true, i.e. Y ~ N(β₀ + β₁X, σ²), then the log-likelihood function of Y is

ℓ(β₀, β₁, σ²) = −(n/2)log(2πσ²) − (1/(2σ²)) Σᵢ(yᵢ − β₀ − β₁xᵢ)²

• Maximizing the log-likelihood function is equivalent to minimizing the last term,

Σᵢ(yᵢ − β₀ − β₁xᵢ)²

, which is the RSS.
How to find the best transformation
• So for the transformed Y with any λ, we want the minimum RSS, i.e.

RSS(λ) = Σᵢ(yᵢ^λ − (β̂₀ + β̂₁xᵢ))²

• The result (the choice of λ) should be consistent with what we get from the inverse response plot (a sketch follows).
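In R, the boxcox() function in the MASS package profiles the likelihood over a grid of λ; a minimal sketch, reusing the hypothetical dat from above:

library(MASS)               # boxcox() is in the MASS package
m <- lm(y ~ x, data = dat)  # hypothetical data frame
bc <- boxcox(m)             # plots the profile log-likelihood over a grid of λ
bc$x[which.max(bc$y)]       # the λ that maximizes the likelihood (minimizes RSS)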
Interpretation of the transformed model
• Interpretation of the slope: in a linear model, we learned that the slope gives "the change in Y associated with a one-unit change in X".

• But when the variables are transformed, it is not easy to interpret the slope in the conventional way.

• Log transformation (special case):
• Widely used in biomedical and psychosocial research to deal with skewed data.
• Used not only because it improves the model, but also because it focuses on the "percentage effect".
Interpretation of the transformed model
• Log transformation: consider the regression model

log Y = β₀ + β₁ log X + e

• Interpretation of the slope:
• Take two values from the data, (X₁, Y₁) and (X₂, Y₂).
• According to the model,

log Y₁ = β₀ + β₁ log X₁ + e₁
log Y₂ = β₀ + β₁ log X₂ + e₂

Subtracting (and dropping the error terms),

log Y₂ − log Y₁ = β₁(log X₂ − log X₁)
log(Y₂/Y₁) = β₁ · log(X₂/X₁)
Y₂/Y₁ = (X₂/X₁)^β₁
Interpretation of the transformed model
• Interpretation of the slope (cont.):

Y₂/Y₁ = (X₂/X₁)^β₁

• If X increases by 1%, i.e. X₂/X₁ = 1.01, then the expected ratio of the Y values is (1.01)^β₁, i.e. a ((1.01)^β₁ − 1) × 100% change in Y.

• For β₁ < 10, 100 × ((1.01)^β₁ − 1) ≈ β₁.

• Thus, for small β₁, we simply interpret it as "a one-percent change in X results in a β₁-percent change in Y", i.e. β₁ describes the percentage effect between the two variables (see the quick check below).
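A quick numerical check of this approximation in R:

beta1 <- 2
100 * (1.01^beta1 - 1)  # 2.01: a 1% increase in X gives about a 2% (≈ beta1) change in Y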
Note
• There is no one-to-one resolution for most violations. Even if two models have the same issue, e.g. non-constant variance, we may apply different transformations.

• Transformation is not always the key to overcoming violations. Sometimes every transformation fails.

• We cannot discard any data point unless we find very strong evidence for excluding it from the model (investigation needed!). In other words, outliers cannot be removed only because they are outliers.
Summary
• Browse the relationship between the two variables and fit the linear model, if it seems reasonable.

• Get the numerical summary of the linear model and understand what each statistic/measure tells you about the model.

• Check the model assumptions using diagnostic plots and useful measures, i.e. leverages, standardized residuals, etc.

• If you find any violation of the assumptions, try transforming the variables to suggest an alternative model.
