Chapter 4 MLR
When we fit a simple linear regression model to our sample data and use the estimated model to make predictions and statistical
inferences about a larger population, we make several assumptions that may or may not be correct for the data at hand. The key
theoretical assumptions about the linear regression model are:
Linearity. The equation used to relate the expected value of the Y (dependent) variable to the different levels of the X
(independent) variable describes the actual pattern of the data. In other words, we use a straight-line equation because we
assume the "average" pattern in the data is indeed linear.
Normality. The errors are normally distributed with a mean of 0.
Homoskedasticity / constant variance. The errors have the same theoretical variance, regardless of the values of X and,
thus, regardless of the expected value of Y. For a straight line, this means that the vertical variation of data points around the
line has about the same magnitude everywhere.
Independence. The errors are independent of each other (i.e., they are a random sample) and are independent of any
time order in the data. A full discussion of assessing dependency requires an introduction to time series analysis; a formal
check for serial correlation is the Durbin-Watson test.
This chapter presents various regression diagnostic procedures to assess these assumptions.
Under these assumptions, the same standard deviation holds at all values of the independent variable. Thus, the distribution curves (shown in red
at three different values of the independent variable) all look the same, but they are just translated along the x-axis based on the
regression relationship. Note that the only assumption that is not visualized here is the assumption of independence, which will
usually be satisfied if there is no temporal component and the experimenter did a proper job of designing the experiment.
A. CONSEQUENCES OF INVALID ASSUMPTIONS
Linearity. Using the wrong equation (such as using a straight line for curved data) is very serious. Predicted values will be wrong in
a biased manner, meaning that predicted values will systematically miss the true pattern of the expected value of Y as related to X.
Normality. If the errors do not have a normal distribution, it usually is not particularly serious. Simulation results have shown that
regression inferences tend to be robust with respect to normality (or nonnormality) of the errors. In practice, the residuals may
appear to be nonnormal when the wrong regression equation has been used. With that stated, if the error distribution is significantly
nonnormal, then inferential procedures can be very misleading; e.g., statistical intervals may be too wide or too narrow. It is not
possible to check the assumption that the overall mean of the errors is equal to 0 because the least squares process causes the
residuals to sum to 0. However, if the wrong equation is used and the predicted values are biased, the sample residuals will be
patterned so that they may not average 0 at specific values of X.
Homoskedasticity. The principal consequence of nonconstant variance (i.e., where the variance is not the same at each level of X)
is that prediction intervals for individual Y values will be wrong because they are determined assuming constant variance. There is a
small effect on the validity of t-test and F-test results, but generally regression inferences are robust with regard to the variance
issue.
Independence. As noted earlier, this assumption will usually be satisfied if there is no temporal component and the experimenter
did a proper job of designing the experiment. However, if there is a time-ordering of the observations, then there is the possibility of
correlation between consecutive errors. If present, then this is indicative of a badly misspecified model and any subsequent
inference procedures will be severely misleading.
B. DIAGNOSING VALIDITY OF ASSUMPTIONS
Visualization plots and diagnostic measures used to detect types of disagreement between the observed data and an assumed
regression model have a long history. Central to many of these methods are residuals based on the fitted model. As will be shown,
there are more types of residuals than just the raw residuals, i.e., the differences between the observed values and the fitted values. We briefly outline
the types of strategies that can be used to diagnose validity of each of the four assumptions. We will then discuss these procedures
in greater detail.
DIAGNOSING WHETHER THE REGRESSION EQUATION HAS THE CORRECT FORM
1. Examine a plot of residuals versus fitted values (predicted values). A curved pattern in this plot indicates that the form of
the regression equation does not match the curvature in the data.
2. Use goodness-of-fit measures. For example, r² can be used as a rough goodness-of-fit measure, but it should by no means
be used on its own to determine the appropriateness of the model fit. The plot of residuals versus fitted values can also be used
for assessing goodness-of-fit. Other measures, like model selection criteria, are discussed in Chapter 5.
3. If the regression has repeated measurements at the same $x_i$ values, then you can perform a formal lack-of-fit test
(also called a pure error test) in which the null hypothesis is that the type of equation used as the regression equation is
correct. Failure to reject this null hypothesis is a good thing since it means that the regression equation is okay.
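As a rough illustration, the lack-of-fit test in item 3 can be carried out by comparing the straight-line model against a "full" model that fits a separate mean at each distinct x value and testing the difference with an F-test. The sketch below assumes Python with pandas and statsmodels; the data and variable names are purely hypothetical.

    # Pure-error lack-of-fit test: compare the straight-line model with a
    # model that fits one mean per distinct x value (requires replicates).
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    df = pd.DataFrame({
        "x": [1, 1, 2, 2, 3, 3, 4, 4],                    # repeated x values
        "y": [1.1, 0.9, 2.3, 2.1, 2.8, 3.2, 4.1, 3.9],
    })

    reduced = smf.ols("y ~ x", data=df).fit()     # straight-line (null) model
    full = smf.ols("y ~ C(x)", data=df).fit()     # separate mean at each x

    # F-test of reduced vs. full; a large p-value means we fail to reject the
    # null hypothesis that the straight-line equation is adequate.
    print(anova_lm(reduced, full))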
DIAGNOSING WHETHER THE ERRORS HAVE A NORMAL DISTRIBUTION
1. Examine a histogram of the residuals to see if it appears to be bell-shaped, such as the residuals from the simulated data
given in Figure (a). The difficulty is that the shape of a histogram may be difficult to judge unless the sample size is large.
2. Examine a normal probability plot of the residuals. Essentially, the ordered (standardized) residuals are plotted against
theoretical expected values for a sample from a standard normal curve population. A straight-line pattern for a normal
probability plot (NPP) indicates that the assumption of normality is reasonable, such as the NPP given in Figure (b).
3. Do a hypothesis test in which the null hypothesis is that the errors have a normal distribution. Failure to reject this null
hypothesis is a good result. It means that it is reasonable to assume that the errors have a normal distribution.
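The sketch below illustrates the three checks for a fitted statsmodels OLS results object called fit (a hypothetical name); the Shapiro-Wilk test is used as one common choice of normality test, although the text above does not prescribe a particular one.

    # Normality diagnostics for the residuals of a fitted OLS model `fit`.
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    import statsmodels.api as sm

    resid = fit.resid                       # raw residuals e_i

    plt.hist(resid, bins=20)                # 1. histogram: look for a bell shape
    plt.xlabel("residual")
    plt.ylabel("frequency")
    plt.show()

    sm.qqplot(resid, line="s")              # 2. normal probability plot (NPP)
    plt.show()

    stat, p = stats.shapiro(resid)          # 3. H0: the errors are normal
    print(f"Shapiro-Wilk: W = {stat:.3f}, p-value = {p:.3f}")
    # A large p-value (failure to reject H0) supports the normality assumption.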
DIAGNOSING WHETHER THE ERRORS HAVE CONSTANT VARIANCE
1. Examine a plot of residuals versus fitted values. Obvious differences in the vertical spread of the residuals indicate
nonconstant variance. The most typical pattern for nonconstant variance is a plot of residuals versus fitted values with a
sideways cone or funnel shape, in which the vertical spread increases as the fitted value increases.
2. Do a hypothesis test with the null hypothesis that the variance of the errors is the same for all values of the predictor
variable(s). There are various statistical tests that can be used, such as the modified Levene test and Bartlett's test.
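One simple way to apply such tests to a regression is to split the residuals into two groups, for example at the median fitted value, and compare the spread of the groups. The sketch below does this for a fitted OLS model called fit (a hypothetical name), using scipy's Levene test with center="median" (the Brown-Forsythe variant of the modified Levene test) and Bartlett's test.

    # Constant-variance tests on the residuals of a fitted OLS model `fit`.
    import numpy as np
    from scipy.stats import bartlett, levene

    resid = np.asarray(fit.resid)
    fitted = np.asarray(fit.fittedvalues)

    # Split the residuals into "low" and "high" fitted-value groups.
    low = resid[fitted <= np.median(fitted)]
    high = resid[fitted > np.median(fitted)]

    # Modified Levene (Brown-Forsythe): robust to nonnormality of the errors.
    print(levene(low, high, center="median"))

    # Bartlett's test: more powerful under normality, but sensitive to it.
    print(bartlett(low, high))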
DIAGNOSING INDEPENDENCE OF THE ERROR TERMS
1. Examine a plot of the residuals versus their order. Trends in this plot are indicative of a correlated error structure and, hence,
dependency.
2. If the observations in the dataset represent successive time periods at which they were recorded and you suspect a
temporal component to the study, then time series methods can be used for analyzing and remedying the situation.
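A brief sketch of both checks, assuming a fitted statsmodels OLS results object fit (a hypothetical name) whose residuals are stored in the order in which the observations were collected; the Durbin-Watson statistic mentioned earlier is included as a formal check for correlation between consecutive errors.

    # Independence diagnostics: residuals versus observation order, plus the
    # Durbin-Watson statistic for first-order serial correlation.
    import matplotlib.pyplot as plt
    from statsmodels.stats.stattools import durbin_watson

    resid = fit.resid

    plt.plot(range(1, len(resid) + 1), resid, marker="o")
    plt.axhline(0, linestyle="--")
    plt.xlabel("observation order")
    plt.ylabel("residual")
    plt.show()                              # trends here suggest dependency

    # Values near 2 suggest no first-order autocorrelation; values well below 2
    # suggest positive correlation between consecutive errors.
    print("Durbin-Watson statistic:", durbin_watson(resid))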
Residual diagnostics are usually based on standardized (Studentized) residuals, $r_i = e_i / \sqrt{MSE}$, where $e_i = y_i - \hat{y}_i$ is the raw residual. A value of
$|r_i| > 3$ usually indicates the observation can be considered an outlier. Other methods can be used in outlier detection, which
are discussed later. Plots of residuals versus fitted values are constructed by plotting all pairs $(r_i, \hat{y}_i)$ for the observed sample,
with residuals on the vertical axis. The idea is that properties of the residuals should be the same for all different values of the
predicted values. Specifically, as we move across the plot (across predicted values), the average of the residuals should always be
about 0 and the vertical variation in residuals should maintain about the same magnitude. Note that plots of $e_i$ versus $\hat{y}_i$ can also be
used and typically appear similar to those based on the Studentized residuals.
The usual assumptions for regression imply that the pattern of deviations (errors) from the regression line should be similar
regardless of the value of the predicted value (and value of the X variable). The consequence is that a plot of residuals versus fitted
values (or residuals versus an X variable) ideally has a random, "zero correlation" appearance.
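As a rough illustration, the sketch below constructs such a plot from a fitted statsmodels OLS results object called fit (a hypothetical name). Note that statsmodels' internally Studentized residuals also divide by $\sqrt{1 - h_{ii}}$, a common refinement of the $e_i / \sqrt{MSE}$ form given above.

    # Residuals-versus-fitted-values plot using Studentized residuals,
    # for a fitted statsmodels OLS model `fit`.
    import matplotlib.pyplot as plt

    infl = fit.get_influence()
    r = infl.resid_studentized_internal     # Studentized residuals r_i
    y_hat = fit.fittedvalues

    plt.scatter(y_hat, r)
    plt.axhline(0, linestyle="--")          # residuals should average about 0
    plt.axhline(3, linestyle=":")           # |r_i| > 3 flags potential outliers
    plt.axhline(-3, linestyle=":")
    plt.xlabel("fitted value")
    plt.ylabel("Studentized residual")
    plt.show()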
REGRESSION MODEL IN MATRIX NOTATION
The regression model and the fitted values in matrix notation are
$$Y = X\beta + \varepsilon \quad\Rightarrow\quad \hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y .$$
Hat / projection matrix:
$$H = P = X(X'X)^{-1}X' =
\begin{bmatrix}
h_{11} & h_{12} & \cdots \\
h_{21} & h_{22} & \\
\vdots & & \ddots
\end{bmatrix},
\qquad 0 \le h_{ii} \le 1 ,$$
where the diagonal element $h_{ii}$ is the leverage of observation $i$. Each $h_{ii}$ is compared with twice the average leverage,
$2\bar{h} = 2p/n$, where $p$ is the number of parameters and $n$ is the sample size; observations with $h_{ii} > 2p/n$ are flagged as
having high leverage. The cutoff $2\bar{h} = 2p/n$ is applicable only if the sample size is greater than the number of independent
variables.
In simple linear regression the elements of H can also be computed directly:
$$h_{ii} = p_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_{xx}}
\qquad \text{and} \qquad
h_{ij} = p_{ij} = \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{\sum (x_i - \bar{x})^2} .$$
H is idempotent: $HH = H$, and hence $H^3 = H$.
Proof:
$$HH = [X(X'X)^{-1}X'][X(X'X)^{-1}X'] = X(X'X)^{-1}(X'X)(X'X)^{-1}X' = X\,I\,(X'X)^{-1}X' = H ,$$
since $(X'X)(X'X)^{-1} = I$ (just as $AA^{-1} = I$ for any invertible matrix $A$). As an exercise, show that $(I - H)$ is also idempotent.
The hat matrix produces the fitted values: $\hat{Y} = HY$. To show that $HY = \hat{Y}$:
$$HY = X(X'X)^{-1}X'Y = X\hat{\beta} = \hat{Y} .$$
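The quantities above can be verified numerically; the sketch below assumes NumPy and uses a small, purely hypothetical data set in which the last x value is deliberately extreme.

    # Hat/projection matrix, fitted values, and leverages in matrix notation.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])        # last x value is extreme
    y = np.array([1.2, 1.9, 3.2, 3.8, 9.5])
    X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept

    H = X @ np.linalg.inv(X.T @ X) @ X.T             # H = X (X'X)^{-1} X'
    y_hat = H @ y                                    # fitted values: Y_hat = H Y

    h = np.diag(H)                                   # leverages h_ii, 0 <= h_ii <= 1
    n, p = X.shape
    print("leverages:", np.round(h, 3))
    print("high leverage (h_ii > 2p/n =", 2 * p / n, "):", h > 2 * p / n)

    # Idempotence: H H = H (up to floating-point error)
    print("idempotent:", np.allclose(H @ H, H))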
LEVERAGE, INFLUENCE AND OUTLIERS
An outlier is a data point whose response y does not follow the general trend of the rest of the data.
A data point has high leverage if it has "extreme" predictor x values. With a single predictor, an extreme x value is simply
one that is particularly high or low. With multiple predictors, extreme x values may be particularly high or low for one or more
predictors, or may be "unusual" combinations of predictor values (e.g., with two predictors that are positively correlated, an
unusual combination of predictor values might be a high value of one predictor paired with a low value of the other predictor)
Note that — for our purposes — we consider a data point to be an outlier only if it is extreme with respect to the other y values, not
the x values.
A data point is influential if it excessively influences any part of a regression analysis, such as the predicted responses, the
estimated slope coefficients, or the hypothesis test results.
Outliers and high leverage data points have the potential to be influential, but we generally have to investigate further to determine
whether or not they are actually influential.
One advantage of the case in which we have only one predictor is that we can look at simple scatter plots in order to identify any
outliers and high leverage data points.
Let's take a look at a few examples that should help to clarify the distinction between the two types of extreme values.
Example 1
Based on the definitions above, do you think the following data set contains any outliers? Or, any high leverage data points?
All of the data points follow the general trend of the rest of the data, so there are no outliers (in the y direction). And, none of the
data points are extreme with respect to x, so there are no high leverage points. Overall, none of the data points would appear to be
influential with respect to the location of the best fitting line.
Example 2
Now, how about this example? Do you think the following data set contains any outliers? Or, any high leverage data points?
Of course! Because the red data point does not follow the general trend of the rest of the data, it would be considered an outlier.
However, this point does not have an extreme x value, so it does not have high leverage. Is the red data point influential? An easy
way to determine if the data point is influential is to find the best fitting line twice — once with the red data point included and once
with the red data point excluded.
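A minimal sketch of this fit-twice check, using purely hypothetical data in which the last observation plays the role of the red point; it assumes NumPy only.

    # Influence check by refitting: compare the least squares line fitted with
    # and without a suspect observation.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 3.0])
    y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 8.0])    # last point is the suspect
    suspect = 5                                      # index of the suspect point

    slope_all, intercept_all = np.polyfit(x, y, 1)
    keep = np.arange(len(x)) != suspect
    slope_wo, intercept_wo = np.polyfit(x[keep], y[keep], 1)

    print(f"with the point:    y = {intercept_all:.2f} + {slope_all:.2f} x")
    print(f"without the point: y = {intercept_wo:.2f} + {slope_wo:.2f} x")
    # A large change in the fitted line indicates the point is influential.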
Example 3
Now, how about this example? Do you think the following data set contains any outliers? Or, any high leverage data points?
In this case, the red data point does follow the general trend of the rest of the data. Therefore, it is not deemed an outlier here.
However, this point does have an extreme x value, so it does have high leverage. Is the red data point influential? It certainly
appears to be far removed from the rest of the data (in the x direction), but is that sufficient to make the data point influential in this
case?
Example 4
One last example! Do you think the following data set contains any outliers? Or, any high leverage data points?
In this case, the red data point is most certainly an outlier and has high leverage! The red data point does not follow the general
trend of the rest of the data and it also has an extreme x value. And, in this case the red data point is influential.
DIFFICULTIES POSSIBLY SEEN IN THE PLOTS
Three primary difficulties may show up in plots of residuals versus fitted values:
1. Outliers in the data.
2. The regression equation for the average does not have the right form.
3. The residual variance is not constant.
A curved appearance in a plot of residuals versus fitted values indicates that we used a regression equation that does not match
the curvature of the data. Thus, we have misspecified our model. Figures (c) and (d) show the case of nonlinearity in the residual
plots. When this happens, there is often a pattern to the data similar to that of an exponential or trigonometric function.
Many times, nonconstant residual variance leads to a sideways cone or funnel shape for the plot of residuals versus fitted values.
Figures (b) and (d) show plots of such residuals versus fitted values. Nonconstant variance is noted by how the data "fans out" as
the predicted value increases. In other words, the residual variance (i.e., the vertical range in the plot) is increasing as the size of
the predicted value increases. This typically has a minimal impact on regression estimates, but is a feature of the data that needs to
be taken into account when reporting the accuracy of the predictions, especially for inferential purposes.
D. DATA TRANSFORMATIONS
Transformations of the variables are used in regression to describe curvature and sometimes are also used to adjust for
nonconstant variance in the errors and the response variable. Below are some general guidelines when considering
transformations.
What to Try?
When there is curvature in the data, there might possibly be some theory in the literature of the subject matter that suggests an
appropriate equation. Or, you might have to use trial-and-error data exploration to determine a model that fits the data. In the trial-
and-error approach, you might try polynomial models or transformations of X and/or Y , such as square root, logarithmic, or
reciprocal. One of these will often end up improving the overall fit, but note that interpretations of any quantities will be on the
transformed variable(s).
Transform X or Transform Y?
In the data exploration approach, if you transform Y, you will change the variance of the Y and the errors. You may wish to try
common transformations of the Y (e.g., $\log(Y)$, $\sqrt{Y}$, or $Y^{-1}$) when there is nonconstant variance and possible curvature to the data.
Try transformations of the X (e.g., $X^{-1}$, $X^2$, or $X^3$) when the data are curved, but the variance looks to be constant in the original
scale of Y.
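The sketch below illustrates the trial-and-error approach on purely hypothetical simulated data, comparing a straight-line fit on the original scale with fits after transforming Y or X; as noted above, r² is only a rough guide and interpretations are on the transformed scale.

    # Trial-and-error transformations: compare simple straight-line fits on
    # the original and transformed scales. Hypothetical simulated data.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(1, 10, 30)
    y = 2.0 * np.exp(0.3 * x) * rng.lognormal(sigma=0.1, size=x.size)  # curved, fanning

    def r_squared(xv, yv):
        """r^2 of a straight-line least squares fit of yv on xv."""
        slope, intercept = np.polyfit(xv, yv, 1)
        resid = yv - (intercept + slope * xv)
        return 1.0 - resid.var() / yv.var()

    print("Y on X:       r^2 =", round(r_squared(x, y), 3))
    print("log(Y) on X:  r^2 =", round(r_squared(x, np.log(y)), 3))
    print("sqrt(Y) on X: r^2 =", round(r_squared(x, np.sqrt(y)), 3))
    print("Y on 1/X:     r^2 =", round(r_squared(1.0 / x, y), 3))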
Logarithms often are used because they are connected to common exponential growth and power curve relationships. The
relationships discussed below are easily verified using the algebra of logarithms.