Regn Lect 4
Objectives
Before making use of our regression model we should check the validity of these
assumptions. Some assumptions (e.g. linearity) can be checked before the analysis by plotting
the data. Other assumptions have to be checked after the analysis has been completed.
To do this we use the residuals defined by
residual = observed value - fitted value
We can use the residuals to check our assumptions and test the adequacy of our model. If the
assumptions hold, then the residuals should be a random sample from a normal distribution with
mean zero.
Note that the study of residuals has grown considerably in recent years and goes by the name
of regression diagnostics. Although we will look at the facilities available in Stata, similar
facilities are available in other packages, e.g. SPSS and SAS.
1. Linearity
If we plot the residuals against the fitted values (or against the independent variable x) then any
non-linearity is shown up by a systematic non-linear pattern in the residuals.
We can distinguish between different kinds of residuals. The residuals that we have defined as
observed value minus fitted value are the raw residuals. In checking assumptions such as
constant variance there are two other kinds of residual that are more useful - these are the
standardized residual and the studentized residual.
Standardized and studentized residuals are attempts to adjust residuals for their standard
errors. Although the theoretical errors εi are assumed to have constant variance, the
calculated residuals ei do not: in fact Var(ei) = σ2(1 - hi), where the hi are the leverage
values, which will be discussed shortly. (Thus observations with the greatest leverage have
residuals with the smallest variance.) Standardized residuals estimate σ2 using the residual
mean square from the overall regression model, while studentized residuals use the residual
mean square for the regression that would have been obtained if that particular observation
had been omitted from the model. In general studentized residuals are preferred to
standardized residuals for identifying outliers (discussed below), while either can be used to
check for constant variance. To simplify matters we will use the studentized residuals, which
can be obtained after fitting a regression model in Stata using the predict command.
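To make the definitions above concrete, here is a minimal sketch (in Python rather than Stata, with the formulas for simple linear regression) of how the raw, standardized and studentized residuals are computed; in Stata the latter two are given directly by predict with the rstandard and rstudent options.

```python
import math

def simple_regression_diagnostics(x, y):
    """Fit y = a + b*x by least squares and return the raw,
    standardized, and studentized residuals."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    fitted = [a + b * xi for xi in x]
    e = [yi - fi for yi, fi in zip(y, fitted)]          # raw residuals
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]    # leverage values
    p = 2                                               # parameters fitted
    s2 = sum(ei ** 2 for ei in e) / (n - p)             # residual mean square
    # standardized: scale by s*sqrt(1 - h_i) from the full-model fit
    std = [ei / math.sqrt(s2 * (1 - hi)) for ei, hi in zip(e, h)]
    # studentized: estimate sigma^2 from the fit omitting observation i
    stu = []
    for ei, hi in zip(e, h):
        s2_i = ((n - p) * s2 - ei ** 2 / (1 - hi)) / (n - p - 1)
        stu.append(ei / math.sqrt(s2_i * (1 - hi)))
    return e, std, stu
```

The deleted-observation variance s2_i uses the standard identity for leave-one-out refitting, so no second regression is actually needed.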
Ex: In the Stepping Stones pilot data we can fit a regression of the knowledge score rhknow on
age. The observed points and fitted line are shown below:
[Scatter plot of rhknow against age (10 to 45) with the fitted regression line]
------------------------------------------------------------------------------
rhknow | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0712879 .0265227 2.69 0.008 .0189895 .1235863
_cons | 7.081222 .5457386 12.98 0.000 6.005115 8.15733
------------------------------------------------------------------------------
. predict knowfit
(option xb assumed; fitted values)
(1 missing value generated)
[Plot of the studentized residuals against the fitted values (linear prediction)]
There is no obvious pattern in the residual plot at this stage. Note that Stata's rvfplot
command gives a quick plot of the (raw) residuals versus the fitted values.
Stata also provides a test to see whether there are any omitted variables in our regression
model, which may also indicate some non-linearity (e.g. that we need to introduce a quadratic
term) or else the need for an extra explanatory variable (see multiple regression below). This is
the "omitted variable test" due to Ramsey (the RESET test).
. estat ovtest
From Ramsey's omitted variables test there is in fact very strong evidence of an omitted
variable. In this case we could consider a quadratic relationship with age, which would allow
the effect of age to diminish with increasing age; we will also examine the effect of other
variables later (e.g. education, number of partners, gender).
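The idea behind Ramsey's test can be sketched as follows: refit the model with powers of the fitted values added as extra regressors and test whether they improve the fit. This Python sketch (hypothetical data; not Stata's exact implementation, which uses powers up to the fourth) adds only the squared fitted values, giving an F statistic with 1 and n-3 degrees of freedom.

```python
def reset_test(x, y):
    """Simplified Ramsey RESET for a simple regression of y on x:
    add yhat^2 as an extra regressor and return the F statistic
    for that term (large F suggests an omitted variable or
    non-linearity)."""
    n = len(x)
    # ordinary linear fit of y on x
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    yhat = [a + b * xi for xi in x]
    rss0 = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    # augmented fit: y on 1, x, yhat^2 via the normal equations
    cols = [[1.0] * n, list(x), [f * f for f in yhat]]
    k = 3
    A = [[sum(cols[i][t] * cols[j][t] for t in range(n))
          for j in range(k)] for i in range(k)]
    rhs = [sum(cols[i][t] * y[t] for t in range(n)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        rhs[c], rhs[piv] = rhs[piv], rhs[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            for j in range(c, k):
                A[r][j] -= f * A[c][j]
            rhs[r] -= f * rhs[c]
    beta = [0.0] * k
    for c in range(k - 1, -1, -1):
        beta[c] = (rhs[c] - sum(A[c][j] * beta[j]
                                for j in range(c + 1, k))) / A[c][c]
    rss1 = sum((y[t] - sum(beta[j] * cols[j][t] for j in range(k))) ** 2
               for t in range(n))
    # F test for the single added term
    return (rss0 - rss1) / (rss1 / (n - k))
```

Data generated from a quadratic relationship give a very large F, as the omitted curvature is captured by the yhat^2 term.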
2. Homogeneity of Variance
To check for homogeneity of variance we plot the residuals against the fitted values. Even
scatter of the residuals about the horizontal axis implies constant variance (homogeneity of
variance), while unequal scatter suggests non-constant variance. Looking at the plot of the
studentized residuals versus the fitted values for the pilot Stepping Stones data above, there is
little evidence of non-constant variance.
We can also test for non-constant variance or heterogeneity of variance using the estat hettest
command
. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of rhknow
chi2(1) = 0.01
Prob > chi2 = 0.9343
This confirms that there is no evidence of heterogeneity in this case.
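The logic of the test can be sketched in a few lines: regress the squared residuals on the fitted values and see whether there is any relationship. This Python sketch uses Koenker's n·R² form of the statistic (a common robust variant, not Stata's exact default, which is the Cook-Weisberg score version), approximately chi-squared with 1 degree of freedom under the null of constant variance.

```python
def breusch_pagan(x, y):
    """Koenker's studentized Breusch-Pagan statistic for a simple
    regression of y on x: regress the squared residuals on the
    fitted values and return LM = n * R^2."""
    n = len(x)

    def fitted_values(u, v):
        # least-squares fitted values from regressing v on u
        ub = sum(u) / n
        vb = sum(v) / n
        suu = sum((ui - ub) ** 2 for ui in u)
        b = sum((ui - ub) * (vi - vb) for ui, vi in zip(u, v)) / suu
        a = vb - b * ub
        return [a + b * ui for ui in u]

    fitted = fitted_values(x, y)
    e2 = [(yi - fi) ** 2 for yi, fi in zip(y, fitted)]
    aux = fitted_values(fitted, e2)        # auxiliary regression
    e2bar = sum(e2) / n
    ss_tot = sum((g - e2bar) ** 2 for g in e2)
    ss_reg = sum((g - e2bar) ** 2 for g in aux)
    r2 = ss_reg / ss_tot if ss_tot > 0 else 0.0
    return n * r2
```

A large statistic (relative to the chi-squared(1) distribution) indicates that the residual spread varies with the fitted values, i.e. heteroskedasticity.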
3. Normality
One way of testing for normality is to do a normal probability plot which essentially plots the
cumulative distribution of the residuals against the cumulative distribution of a standard normal
variable. If our residuals are normally distributed then the normal probability plot should give us
an approximate straight line.
This can be done as follows in Stata for the stepping stones pilot data (having saved the
studentized residuals in knowresa as done above).
. pnorm knowresa
In this case the normal probability plot shows no evidence of non-normality.
We can also plot a histogram of the studentized residuals with a normal distribution
superimposed using
. hist knowresa , norm
[Normal probability plot of the studentized residuals: empirical probability against Normal F[(rhres-m)/s]]
[Histogram of the studentized residuals (about -4 to 4) with superimposed normal density]
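The coordinates that a plot like pnorm uses can be computed directly, which makes the idea transparent. This Python sketch (the exact plotting positions Stata uses may differ slightly) pairs the empirical probability (i - 0.5)/n of each sorted residual with the standard normal probability of its standardized value; under normality the points lie close to the 45-degree line.

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def pnorm_points(res):
    """Coordinates for a standardized normal probability plot:
    (empirical probability, Phi((r_(i) - mean)/sd)) for each
    sorted residual r_(i)."""
    n = len(res)
    m = sum(res) / n
    s = math.sqrt(sum((r - m) ** 2 for r in res) / (n - 1))
    return [((i - 0.5) / n, phi((r - m) / s))
            for i, r in enumerate(sorted(res), start=1)]
```

Both coordinates are probabilities, so every point lies in the unit square, and both increase with the sorted residuals.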
4. Outliers
A large residual, either positive or negative, indicates a potential outlier. Since the
studentized residuals should behave approximately like a sample from a standard normal
distribution, about 95% of the studentized (or standardized) residuals should lie between -2
and +2. This gives us a yardstick for what is "large".
For an observation with a large residual, the fitted value is very different from the observed
value, and we should investigate the observation. If the strange value is known to be an error
then we can reject it and report its removal. Outliers can seriously distort the analysis, but
careful consideration is needed before removing them from the data set.
Looking at the histogram of residuals we see one large residual. We can reanalyse the data
removing any observation with a studentized residual larger than 3, but we should first
examine this observation to consider why it merits exclusion, and also see what the effect of
removing it is.
. list age rhknow educ knowfit knowresa if knowresa > 3 & knowresa<.
+-------------------------------------------+
| age rhknow educ knowfit knowresa|
|-------------------------------------------|
| 14 16 2 8.079252 3.435576 |
+-------------------------------------------+
So this observation corresponds to a 14 year old who obtained a score of 16. To put this in
context we can look at the distribution of rhknow: the score of 16 was the highest overall, and
the 14 year old was the only person to achieve it.
We can see the effect of omitting that score. Omitting it leads to a slight increase in the value
of R-squared and a slight increase in the slope. Overall the change probably doesn't warrant
dropping the observation; besides, the score is almost certainly genuine (i.e. was not based on
guessing).
5. Leverage Values
A point (xi, yi) is an outlier if the fitted value is far from the observed value yi. An observation
can also be unusual in another way: xi may be far from the other x's. The regression line of y
on x will then pass very close to this point, so its residual will be very small; nevertheless the
point can have a dramatic effect on the resulting estimates, since if it were removed the
estimates would change markedly. Such a point is said to have high leverage. We can find the
leverage value for each observation using the leverage option of the predict command.
A commonly used rough rule for assessing leverage values is to pay attention to cases with
leverage greater than 2p/n or 3p/n where p is the number of parameters fitted (so p=2 for simple
linear regression) and n is the number of observations. Thus for our example p=2 and n=200 so
we would consider leverage values > 4/200 = 0.02.
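For simple linear regression the leverage values have the closed form hi = 1/n + (xi - xbar)^2/Sxx, so the rule of thumb is easy to apply by hand. This Python sketch (a hypothetical helper, not Stata's predict) computes the leverages and flags those exceeding a multiple of p/n.

```python
def leverage_flags(x, threshold_mult=2):
    """Leverage values for simple linear regression,
    h_i = 1/n + (x_i - xbar)^2 / Sxx, flagged when
    h_i > threshold_mult * p / n (p = 2 parameters)."""
    n = len(x)
    p = 2
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    cutoff = threshold_mult * p / n
    return h, [i for i, hi in enumerate(h) if hi > cutoff]
```

A useful check on any leverage computation is that the leverages always sum to p, the number of fitted parameters.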
A useful plot is the leverage versus residual-squared plot, obtained in Stata with the lvr2plot
command. The lines on the chart show the average values of the leverage and of the
(normalized) squared residuals: points above the horizontal line have higher-than-average
leverage; points to the right of the vertical line have larger-than-average residuals.
6. Cook's Distance
Leverages indicate "unusual" values of the explanatory variables, while residuals indicate
observations with an unusual value of the response variable. It is sometimes useful to have an
overall measure of how "unusual" an observation is, combining both explanatory and response
variables: in other words, a measure of how influential each observation is in determining the
regression. One such measure is Cook's distance. Cook's distance for the ith observation
measures the change in the estimated regression coefficients that would result from deleting
that observation. An observation with a large value of Cook's D should be investigated further.
As a rough measure of size, values of Cook's distance greater than 4/n deserve further
investigation (whatever the value of p, i.e. however many independent variables are in the
model).
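Cook's distance combines the residual and the leverage in a single formula, D_i = (e_i^2 / (p s^2)) * h_i / (1 - h_i)^2, so no refitting is needed. This Python sketch (for simple linear regression, with the 4/n yardstick; in Stata the values come from predict with the cooksd option) shows the calculation.

```python
def cooks_distance(x, y):
    """Cook's distance for each observation in a simple linear
    regression, with indices of observations exceeding the rough
    4/n yardstick."""
    n, p = len(x), 2
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]          # residuals
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]         # leverages
    s2 = sum(ei ** 2 for ei in e) / (n - p)                  # residual MS
    d = [ei ** 2 / (p * s2) * hi / (1 - hi) ** 2
         for ei, hi in zip(e, h)]
    return d, [i for i, di in enumerate(d) if di > 4 / n]
```

Note how an observation needs both a sizeable residual and sizeable leverage to obtain a large D.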
7. DFBETAs
DFBETAs are perhaps the most direct influence measure – they focus on one coefficient and
measure the difference between the regression coefficient when the ith observation is included
and when it is excluded, the difference being scaled by the estimated standard error of the
coefficient. It has been suggested that we check observations with an absolute dfbeta > 2/√n,
but other authors suggest a cut-off of 1 (meaning that the observation shifted the estimate at
least one standard error). The use of dfbetas in our regression model of rhknow on age is
shown below:
Note that here 2/√n is 0.1404, hence the chosen cut-off of 0.15. There are 12 observations
worth checking (including our earlier potential outlier), although none of the dfbetas approach
the more stringent cut-off of 1. Most of these observations are either young people with high
scores or older people with low scores. Without more information on these observations it
would not be advisable to omit them from the model.
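The definition of a DFBETA can be illustrated by brute-force leave-one-out refitting. This Python sketch computes, for the slope of a simple regression, the change in the coefficient when each observation is dropped, scaled by the deleted-fit standard error; note that exact scaling conventions differ slightly between packages, so this is close to, but not identical to, Stata's dfbeta.

```python
import math

def dfbeta_slope(x, y):
    """Leave-one-out DFBETAs for the slope in a simple linear
    regression, with indices of observations whose absolute
    dfbeta exceeds the 2/sqrt(n) yardstick."""
    n = len(x)

    def slope_fit(xs, ys):
        # slope and its standard error from a least-squares fit
        m = len(xs)
        xb = sum(xs) / m
        yb = sum(ys) / m
        sxx = sum((xi - xb) ** 2 for xi in xs)
        b = sum((xi - xb) * (yi - yb) for xi, yi in zip(xs, ys)) / sxx
        a = yb - b * xb
        rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(xs, ys))
        se = math.sqrt(rss / (m - 2) / sxx)
        return b, se

    b_full, _ = slope_fit(x, y)
    out = []
    for i in range(n):
        b_i, se_i = slope_fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        out.append((b_full - b_i) / se_i)
    cutoff = 2 / math.sqrt(n)
    return out, [i for i, d in enumerate(out) if abs(d) > cutoff]
```

Packages compute these without refitting n times (via the leverage identities), but the brute-force version makes the definition explicit.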