
Chapter 3

STATS 101A Introduction to Data Analysis and Regression


Maria Cha
Recall : Model Assumption
• In the population linear model,
Yᵢ = β₀ + β₁Xᵢ + eᵢ

• Assumptions about the error term (eᵢ):
• 1. They are random and independent of one another.
• 2. They are normally distributed.
• 3. Their mean is 0, and their variance is σ², which is usually unknown.
eᵢ ~ N(0, σ²)
Recall : Model Assumption
• Based on these assumptions,
• E(Y|X = x) = β₀ + β₁x : the average Y for a given X is a linear function of X.
• Var(Y|X = x) = σ² : the variance of Y is constant.

• Question: Do testing the slope, the ANOVA for the model, and R² give enough information about the model's validity? Do we need another step to check whether the model is valid?
Valid model
• Example: Anscombe's four data sets
Valid model
• Model 1: Y₁ vs. X₁
• Regression model: Ŷ₁ = 3.0 + 0.5X₁
• R results: The slope is significant (p-value: 0.002), the regression model is significant (p-value: 0.002), and R² = 0.67.

• The regression model (Ŷ = 3.0 + 0.5X) and the corresponding R results are the same for Model 2 (Y₂ vs. X₂), Model 3 (Y₃ vs. X₃), and Model 4 (Y₄ vs. X₄).
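• A quick check in R: Anscombe's four data sets ship with R as the built-in data frame anscombe, so the identical numerical output can be verified directly.

summary(lm(y1 ~ x1, data = anscombe))  # intercept ≈ 3.0, slope ≈ 0.5, R² ≈ 0.67
summary(lm(y2 ~ x2, data = anscombe))  # (nearly) the same numbers from very different data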
Valid model
• The numerical regression output suggests that all four models are significant.

• Remember, all tests and analyses for the regression model are mathematically derived from the same assumptions about the error term.
Valid model
• Let's take a look at the graphical output, i.e. the scatter plots of Y vs. X.

• What can we observe or check from the scatter plot?
• 1. The average Y should be a linear function of X.
E(Y|X = x) = β₀ + β₁x
• 2. The variance of Y given X should be constant.
Var(Y|X = x) = σ²
Valid model
• The numerical regression output should always be
supplemented by an analysis to ensure that an
appropriate model has been fitted to the data.

• Even though the regression output suggests a significant regression model, it may not be an appropriate model if the model assumptions are violated.

• We need diagnostic tools for checking the model assumptions.
Diagnostic tool (1) Residual
• Residuals give us important information for checking the model assumptions on the error term.

• If β̂₀ and β̂₁ are good estimates of β₀ and β₁, i.e. close enough to the true values, then the residual êᵢ will resemble the error term eᵢ:

êᵢ = yᵢ − ŷᵢ
   = (β₀ + β₁xᵢ + eᵢ) − (β̂₀ + β̂₁xᵢ)
   = (β₀ − β̂₀) + (β₁ − β̂₁)xᵢ + eᵢ
   ≈ eᵢ
Diagnostic tool (1) Residual
• What can we observe or check from the residual plot, i.e. the scatter plot of residuals vs. X?

• Recall that we assume eᵢ ~ N(0, σ²).

• Thus, from the plot of residuals vs. X we want to check
• 1. whether the average of the residuals is zero (around a horizontal line at 0),
• 2. whether the variance of the residuals is constant (not dependent on x),
• 3. whether the residuals show any discernible pattern (see the sketch below).
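• A minimal sketch of this check in R, using simulated data (hypothetical, generated to satisfy the assumptions):

set.seed(1)
x <- runif(50, 0, 10)
y <- 3 + 0.5 * x + rnorm(50)           # errors are N(0, 1) by construction
m <- lm(y ~ x)
plot(x, resid(m), ylab = "Residuals")  # residual plot: residuals vs. x
abline(h = 0, lty = 2)                 # points should scatter evenly around 0, with no pattern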
Diagnostic tool (2) Leverage
• Leverage points: data points that exert considerable influence on the fitted model; their x-values are distant from the other x-values.
• Good leverage point: its y-value follows the trend pattern, i.e. it is not an outlier.
• Bad leverage point: its y-value does not follow the trend pattern, so it is also an outlier.
• Note: an outlier need not be a leverage point. In that case its x-value is not distant from the other x-values, so it does not "lever" the regression line up or down as much.
Diagnostic tool (2) Leverage
• Good vs. bad leverage points
• Adding an extraordinary point may or may not lever the regression line (original fit in black, new fit in red).
Diagnostic tool (2) Leverage
• Leverages (hᵢᵢ) measure how the predicted ŷ changes if we were to change the observed y for each observation i.

• In other words, how does ŷᵢ (mathematically) depend on yᵢ?

ŷᵢ = Σⱼ hᵢⱼ yⱼ , where hᵢⱼ = 1/n + (xᵢ − x̄)(xⱼ − x̄)/SXX.
Diagnostic tool (2) Leverage
• Show: ŷᵢ = Σⱼ hᵢⱼ yⱼ.

ŷᵢ = β̂₀ + β̂₁xᵢ
   = (ȳ − β̂₁x̄) + β̂₁xᵢ
   = ȳ + β̂₁(xᵢ − x̄)
   = Σⱼ₌₁ⁿ (1/n)yⱼ + [Σⱼ₌₁ⁿ (xⱼ − x̄)yⱼ / SXX](xᵢ − x̄)
   = Σⱼ₌₁ⁿ [1/n + (xᵢ − x̄)(xⱼ − x̄)/SXX] yⱼ
   = Σⱼ₌₁ⁿ hᵢⱼ yⱼ
Diagnostic tool (2) Leverage
• 'Hat matrix' H = (hᵢⱼ), where i, j = 1, …, n:

    ⎡ h₁₁  h₁₂  …  h₁ₙ ⎤
H = ⎢ h₂₁  h₂₂  …   ⋮  ⎥
    ⎢  ⋮    ⋮   ⋱   ⋮  ⎥
    ⎣ hₙ₁   …   …  hₙₙ ⎦

• The diagonal entries of the hat matrix are the 'leverages'.
• Leverages are independent of the y values; they depend only on the x values.
Diagnostic tool (2) Leverage
• Leverage (hᵢᵢ) of xᵢ:

hᵢᵢ = 1/n + (xᵢ − x̄)² / Σⱼ(xⱼ − x̄)²

• Note that Σⱼ₌₁ⁿ hᵢⱼ = 1. If hᵢᵢ ≅ 1, then the other hᵢⱼ are close to zero. Thus hᵢᵢ shows how yᵢ affects ŷᵢ, since ŷᵢ = Σⱼ hᵢⱼ yⱼ.

• The average of the leverages is h̄ = 2/n.

• The higher the leverage, the more influential the data point. More specifically, if hᵢᵢ > 4/n, i.e. if a leverage is greater than twice its average value, then the corresponding xᵢ is a leverage point.
Diagnostic tool (2) Leverage
• Revisit the example: Suppose we now have 6 observations of heights and weights, and want to predict weight from height using a simple linear regression model.

X     Y     X − X̄   (X − X̄)²
60    105   −10     100
66    140   −4      16
72    185   2       4
70    145   0       0
62    120   −8      64
90    250   20      400
X̄ = 70, Ȳ = 157.5, Σ(X − X̄)² = 584

• The entries of the hat matrix are hᵢⱼ = 1/n + (xᵢ − x̄)(xⱼ − x̄)/SXX, with n = 6 and SXX = 584.
Diagnostic tool (2) Leverage
• The diagonal entries (hᵢᵢ) are the leverages.

• Based on the leverages, which observation(s) is (are) leverage point(s)?

• Determine whether it (they) is (are) good or bad leverage point(s).
Diagnostic tool (2) Leverage
• In R, hatvalues() returns the leverages of a fitted model.
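A short sketch for the example data above (the computed value for x = 90 matches the hand calculation):

x <- c(60, 66, 72, 70, 62, 90)        # heights from the example
y <- c(105, 140, 185, 145, 120, 250)  # weights
m <- lm(y ~ x)
hatvalues(m)   # leverages h_ii; h ≈ 0.85 for x = 90
4 / length(x)  # cutoff 4/n ≈ 0.67, so x = 90 is a leverage point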
Diagnostic tool (3) Standardized residual
• We can write the variances of the i-th residual and the i-th fitted value in terms of the leverages (hᵢᵢ):

Var(êᵢ) = σ²(1 − hᵢᵢ)
Var(ŷᵢ) = σ²hᵢᵢ
(*See p. 61 for the derivation.)

• This implies that the variance of the residuals is not constant; it depends on hᵢᵢ.

• The relationship between hᵢᵢ and prediction: if hᵢᵢ is close to 1, i.e. xᵢ is a leverage point, then

ŷᵢ = hᵢᵢyᵢ + Σⱼ≠ᵢ hᵢⱼyⱼ ≈ yᵢ
Diagnostic tool (3) Standardized residual
• Now it makes sense that if hᵢᵢ is close to 1, i.e. xᵢ is a leverage point, then

Var(êᵢ) = σ²(1 − hᵢᵢ) ≈ 0 (very small)
Var(ŷᵢ) = σ²hᵢᵢ ≈ σ² = Var(yᵢ)

• Standardized residual (rᵢ): needed to deal with the non-constant variance of the residuals when the data include leverage points.

rᵢ = êᵢ / (s√(1 − hᵢᵢ))

, where s is the estimate of σ, i.e. s = √(RSS/(n − 2)).
Diagnostic tool (3) Standardized residual
• When (high) leverage points exist, it is more informative to look at plots of the standardized residuals instead of plots of the raw residuals.

• Why: even though the (population) errors have a constant variance, the residuals can show non-constant variance.

• Interpretation of a standardized residual: how many estimated standard deviations a point lies away from the fitted regression model (think of the interpretation of a z-score).
Diagnostic tool (3) Standardized residual
• If a standardized residual falls outside the interval from −2 to 2, the point is considered an outlier.

• In addition, any leverage point whose standardized residual falls outside the interval from −2 to 2 is considered a bad leverage point (see the sketch below).
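In R, rstandard() returns the standardized residuals of a fitted model; a brief sketch, reusing the model m fitted in the leverage example:

r <- rstandard(m)  # standardized residuals
which(abs(r) > 2)  # flag potential outliers
# a flagged point that is also a leverage point is a bad leverage point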
Diagnostic tool (3) Standardized residual
• Standardized residuals are also used to check the constant-variance assumption.

• It is recommended to plot √|standardized residual| vs. X (or the fitted Y values); the square root is used to reduce the skewness of the absolute values.

• Interpretation of this plot will be discussed in the later slides.
Diagnostic tool (4) Cook’s distance
• Using leverages and standardized residuals, we are able to identify which observations could be problematic and violate the model assumptions.

• Cook's distance measures and assesses the influence of individual cases:

Dᵢ = Σⱼ₌₁ⁿ (ŷⱼ₍ᵢ₎ − ŷⱼ)² / (2S²) = (rᵢ²/2) · hᵢᵢ/(1 − hᵢᵢ)

• ŷⱼ₍ᵢ₎ denotes the j-th fitted value based on the fit obtained when the i-th case has been deleted from the fit.
• rᵢ is the i-th standardized residual and hᵢᵢ is the i-th leverage.
Diagnostic tool (4) Cook’s distance
Dᵢ = (rᵢ²/2) · hᵢᵢ/(1 − hᵢᵢ)

• Cook's distance increases when the observation produces a large standardized residual, i.e. the point lies far away from the regression line. It also increases when the leverage is large, i.e. the x value is far away from its mean.
• A common guideline flags a Cook's distance as "big" when

Dᵢ > 4/(n − 2)

, which is a somewhat strict cutoff (a sketch of this check follows).
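In R, cooks.distance() returns the Dᵢ of a fitted model; a brief sketch, again reusing m:

d <- cooks.distance(m)        # Cook's distances
which(d > 4 / (nobs(m) - 2))  # flag influential cases by the 4/(n − 2) rule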


Diagnostic tool (5) Normal Q-Q plot
• The normal Q-Q plot is one of the most widely used tools for assessing the normality of the errors.

• It is a scatter plot of the ordered standardized residuals vs. the expected order statistics from N(0, 1).

• If the plot is close to a straight line, the errors are considered consistent with the normal distribution.
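A minimal sketch in R for the model m from above:

r <- rstandard(m)
qqnorm(r)  # ordered standardized residuals vs. N(0, 1) quantiles
qqline(r)  # reference line; points close to it support normality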
Diagnostic tool : R result
• plot(model) produces four diagnostic plots (data: iowatest.txt).
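A sketch of the call (the variable names in iowatest.txt are not shown on the slide, so the formula here is hypothetical):

iowa <- read.table("iowatest.txt", header = TRUE)  # assumed file layout
model <- lm(y ~ x, data = iowa)                    # hypothetical variable names
par(mfrow = c(2, 2))                               # arrange the four plots in a 2x2 grid
plot(model)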
Diagnostic tool : R result
• The plot of residuals vs. fitted values is equivalent to the plot of residuals vs. X, since the fitted Y is a linear function of X.

• The red line shows whether 1) the relationship is linear (the line should be straight), and 2) the average of the errors is 0.
Diagnostic tool : R result
• The normal Q-Q plot checks the normality of the error term.
• If the points are aligned with the straight line, this implies normality of the errors.
• Observation 70: r₇₀ < −3, but it should have been (theoretically) about −2.7 if all the rᵢ's came from the normal distribution.
Diagnostic tool : R result
• The plot of √|standardized residuals| vs. fitted Y values checks the constant variance of the error term.
• If it does not show any apparent pattern, it suggests constant variance.
Diagnostic tool : R result
• The plot of standardized residuals vs. leverage checks for outliers and (possible) influential points.

• Check whether the rᵢ's fall outside (−2, 2).
• Check whether the leverage is greater than 4/n = 0.03 (n = 133).
• Check whether the Dᵢ's are greater than 4/(n − 2) = 0.0305.
Diagnostic tool : R result
• armspan2015.txt data
• Check whether the rᵢ's fall outside (−2, 2): i = 47.
• Check whether the leverage is greater than 4/n = 0.056 (n = 71): i = 67.
• Check whether the Dᵢ's are greater than 4/(n − 2) = 0.058: i = 46, 67.
So... what should we do?
• Points should not be routinely deleted from an analysis just because they do not fit the model. Outliers and bad leverage points are signals, flagging potential problems with the model.
• Outliers often point out an important feature of the problem not considered before. They may point to an alternative model in which the points are no longer outliers. In this case it is worth considering an alternative model, obtained by including extra predictor variables or by transforming Y and/or X.
• When the assumption that the variance of the errors is constant does not hold, we can use weighted least squares to account for the changing variance (Chapter 4).
Recall : Model Assumption
• In the population linear model

Yᵢ = β₀ + β₁Xᵢ + eᵢ

, we assume that

eᵢ ~ N(0, σ²).

• In other words, we want to check
• 1. linearity of the relationship,
• 2. independence and randomness of the error term (difficult to check),
• 3. normality of the error term (or Y),
• 4. constant variance of the error term (homoscedasticity),
• (5. outliers or influential points).
Checking the assumption
• 1. Linearity of the relationship:
o scatter plot of Y vs. X
o scatter plot of residuals vs. X (or Ŷ)
• 2. Normality of the error term or Y:
o normal Q-Q plot
• 3. Constant variance of the error term (homoscedasticity):
o scatter plot of residuals vs. X (or Ŷ)
o scatter plot of |standardized residuals|⁰·⁵ vs. X (or Ŷ)
• 4. Outliers or influential points:
o leverages, standardized residuals, Cook's distance
o scatter plot of standardized residuals vs. leverage
What if they are violated?
• 1. Linearity of the relationship fails: the predictions and the estimates of the slope and intercept will be biased.
• 2. Normality of the error term (or Y) fails: the estimates of the intercept and slope, as well as the predicted values, are still unbiased.
• Large sample size: the p-values for the t-tests are good approximations, and so are the confidence intervals for the slope and intercept.
• Small sample size: then everything falls apart.
• If normality fails, prediction intervals are not accurate regardless of the size of n (the CLT doesn't apply).
• 3. Constant variance or independence of the error term fails: all inference tools are invalidated.
Transformation
• Transformation is the most widely used tool to overcome violations of the assumptions.
• It can change the distribution (shape) of a variable, and points that are outliers in the original data may no longer be outliers in the transformed data.
How to find the best transformation
• Method 1: Inverse response plot
• Suppose that the true regression model is

Y = g(β₀ + β₁X + e)

, so X and Y are not linearly associated.

• In other words, the transformed Y is linearly associated with X:

g⁻¹(Y) = β₀ + β₁X + e

• For example, if g(Y) = exp(Y), then g⁻¹(Y) = log(Y); or if g(Y) = Y², then g⁻¹(Y) = √Y.
How to find the best transformation
• How do we find the best g⁻¹(Y)?
• Consider the power transformation

g⁻¹(Y) = Y^λ

, where λ = 0 corresponds to the natural logarithm.

• Try different λ's and choose the λ that minimizes the RSS of the model.

• Inverse response plot: a scatter plot of the fitted values Ŷ = β̂₀ + β̂₁X vs. Y; fitting Ŷ against Y^λ for different λ visualizes the λ that gives the best fit for the data (see the sketch below).
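The car package provides this plot; a sketch, assuming a hypothetical data frame dat with columns y and x:

library(car)                # inverseResponsePlot() is from the car package
m <- lm(y ~ x, data = dat)  # hypothetical data frame
inverseResponsePlot(m)      # overlays fits for several λ and reports the λ with the smallest RSS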
How to find the best transformation
• Example: the best choice is λ = 1/3. It gives the best fit in the inverse response plot and results in the minimum RSS of the model.
How to find the best transformation
• Method 2: Box-Cox transformation
• The aim of this method is to ensure that the usual normality assumption holds, i.e.

Y ~ N(β₀ + β₁X, σ²)

• When both X and Y are normally distributed, the regression of Y on X is linear.
• So the Box-Cox transformation tries to make Y and X as close to normal as possible, to make the regression linear.
• Note: the normality of X is not part of the model assumptions.
How to find the best transformation
• It is based on the maximum likelihood method.

• If the assumption is true, i.e. Y ~ N(β₀ + β₁X, σ²), then the log-likelihood function of Y is

ℓ(β₀, β₁, σ²) = −(n/2)log(2πσ²) − (1/(2σ²)) Σᵢ(yᵢ − β₀ − β₁xᵢ)²

• Maximizing the log-likelihood function is equivalent to minimizing the last term,

Σᵢ(yᵢ − β₀ − β₁xᵢ)²

, which is the RSS.
How to find the best transformation
• So for the transformed Y with any λ, we want the minimum RSS, i.e.

RSS(λ) = Σᵢ(yᵢ^λ − (β̂₀ + β̂₁xᵢ))²

• The result (the choice of λ) should be consistent with what we get from the inverse response plot (a sketch follows).
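In R, the boxcox() function in the MASS package profiles the likelihood over a grid of λ; a minimal sketch, reusing the hypothetical dat from above:

library(MASS)               # boxcox() is in the MASS package
m <- lm(y ~ x, data = dat)  # hypothetical data frame
bc <- boxcox(m)             # plots the profile log-likelihood over a grid of λ
bc$x[which.max(bc$y)]       # the λ that maximizes the likelihood (minimizes RSS)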
Interpretation of the transformed model
• Interpretation of the slope: in a linear model, we learned that the slope gives "the change in Y associated with a one-unit change in X".

• But when the variables are transformed, it is not easy to interpret the slope in the conventional way.

• Log transformation (special case):
• Widely used in biomedical and psychosocial research to deal with skewed data.
• Used not only because it improves the model, but also because it focuses on the "percentage effect".
Interpretation of the transformed model
• Log transformation: consider the regression model

log Y = β₀ + β₁ log X + e

• Interpretation of the slope:
• Take two values from the data, (X₁, Y₁) and (X₂, Y₂).
• According to the model,

log Y₁ = β₀ + β₁ log X₁ + e₁
log Y₂ = β₀ + β₁ log X₂ + e₂

Subtracting (and dropping the error terms),

log Y₂ − log Y₁ = β₁(log X₂ − log X₁)
log(Y₂/Y₁) = β₁ · log(X₂/X₁)
Y₂/Y₁ = (X₂/X₁)^β₁
Interpretation of the transformed model
• Interpretation of the slope (cont.):

Y₂/Y₁ = (X₂/X₁)^β₁

• If X increases by 1%, i.e. X₂/X₁ = 1.01, then the expected ratio of the Y values is (1.01)^β₁, i.e. a ((1.01)^β₁ − 1) × 100% change in Y.

• For β₁ < 10, 100 × ((1.01)^β₁ − 1) ≈ β₁.

• Thus, for small β₁, we simply interpret it as "a one-percent change in X results in a β₁-percent change in Y", i.e. β₁ describes the percentage effect between the two variables (see the quick check below).
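A quick numerical check of this approximation in R:

beta1 <- 2
100 * (1.01^beta1 - 1)  # 2.01: a 1% increase in X gives about a 2% (≈ beta1) change in Y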
Note
• There is no one-to-one resolution for most violations. Even if two models have the same issue, e.g. non-constant variance, we may apply different transformations.

• Transformation is not always the key to overcoming violations. Sometimes every transformation fails.

• We cannot discard any data point unless we find very strong evidence for excluding it from the model (investigation needed!). In other words, outliers cannot be removed only because they are outliers.
Summary
• Browse the relationship between the two variables and fit the linear model, if it seems reasonable.

• Get the numerical summary of the linear model and understand what each statistic/measure tells you about the model.

• Check the model assumptions using diagnostic plots and useful measures, i.e. leverages, standardized residuals, etc.

• If you find any violation of the assumptions, try transforming the variables to suggest an alternative model.
