
Lecture 4: Residuals and regression diagnostics

Objectives

• Learn what is meant by a “residual”
• Learn how to use residuals to check the assumptions and adequacy of a regression model

In our simple linear regression model

y_i = α + β x_i + ε_i
we made the following assumptions:
1. The independent variable x is measured without error
2. The true value of the response variable y is linearly related to x; y is subject to random error:
   y_i = α + β x_i + ε_i
3. The deviations ε_i are assumed to be
   (a) independent
   (b) normally distributed with zero mean and constant variance σ²

Thus before making use of our regression model we should check on the validity of these
assumptions. Some assumptions (e.g. linearity) can be checked before the analysis by plotting
the data. Other assumptions have to be checked after the analysis has been completed.
To do this we use the residuals defined by
residual = observed value - fitted value

We can use the residuals to check our assumptions and test the adequacy of our model. If the
assumptions hold, then the residuals should be a random sample from a normal distribution with
mean zero.
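As a minimal sketch (assuming a response variable y and a predictor x; the variable name res is ours), the raw residuals can be saved in Stata with the residuals option of predict:

. regress y x
. predict res , residuals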
Note that the study of residuals has grown considerably in recent years and goes by the name of regression diagnostics. Although we will look at the facilities available in Stata, similar facilities are available in other packages, e.g. SPSS and SAS.

1. Linearity
If we plot the residuals against the fitted values (or against the independent variable x), any non-linearity shows up as a systematic non-linear pattern in the residuals.
We can distinguish between different kinds of residuals. The residuals that we have defined as
observed value minus fitted value are the raw residuals. In checking assumptions such as
constant variance there are two other kinds of residual that are more useful - these are the
standardized residual and the studentized residual.
Standardized and studentized residuals are attempts to adjust residuals for their standard errors. Although the ε_i (theoretical residuals) are assumed to have constant variance, the calculated residuals e_i do not: in fact Var(e_i) = σ²(1 − h_i), where the h_i are the leverage values, which will be discussed shortly. (Thus observations with the greatest leverage have corresponding residuals with the smallest variance.) Standardized residuals estimate σ² using the residual mean square from the overall regression model, while studentized residuals use the residual mean square for the regression that would have been obtained if that particular observation had been omitted from the model. In general studentized residuals are preferred to standardized residuals for identifying outliers (discussed below), while either can be used to check for constant variance. Thus to simplify matters we will use the studentized residuals, which can be found after fitting a regression model in Stata using the predict command.
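For instance (a sketch with variable names of our choosing), both kinds can be saved after a regress command:

. predict rstd , rstandard
. predict rstu , rstudent
. *** rstandard gives standardized residuals; rstudent gives studentized residuals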

Ex: In the Stepping Stones pilot data we can fit a regression of the knowledge score rhknow on
age. The observed points and fitted line are shown below:

[Figure: "Regression of knowledge on age" — scatter of reproductive health knowledge score (0–15) against age (10–45), with the fitted values line superimposed]


. regress rhknow age

      Source |       SS       df       MS              Number of obs =     203
-------------+------------------------------           F(  1,   201) =    7.22
       Model |  40.8237239     1  40.8237239           Prob > F      =  0.0078
    Residual |  1135.82652   201  5.65087822           R-squared     =  0.0347
-------------+------------------------------           Adj R-squared =  0.0299
       Total |  1176.65025   202  5.82500122           Root MSE      =  2.3772

------------------------------------------------------------------------------
      rhknow |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0712879   .0265227     2.69   0.008     .0189895    .1235863
       _cons |   7.081222   .5457386    12.98   0.000     6.005115     8.15733
------------------------------------------------------------------------------

. predict knowfit
(option xb assumed; fitted values)
(1 missing value generated)

. twoway (scatter rhknow age, sort msymbol(plus)) ///
>        (connected knowfit age , sort msymbol(none)) , ///
>        xtitle(age) xlabel(10 (5) 45) ti("Regression of knowledge on age")

. *** The plot of studentized residuals versus fitted values

. predict knowresa , rstu
(1 missing value generated)

. *** This gives us the studentized residuals (through the option rstu) in the variable knowresa
. scatter knowresa knowfit , yline(-2 0 2) ti("residuals vs fitted values for rhknow")

[Figure: "Residuals vs fitted values for rhknow" — studentized residuals (−4 to 4) against the linear prediction (8–10), with reference lines at −2, 0 and 2]

There is no obvious pattern in the residual plot at this stage. Note that there is a quick way to
get a plot of the residuals versus the fitted values using the raw residuals.

. *** First a raw residual vs fitted values plot
. rvfplot

Stata also provides a test of whether there are any omitted variables in our regression model, which may also indicate some non-linearity (e.g. that we need to introduce a quadratic term) or else the need for an extra explanatory variable (see multiple regression below). This is the “omitted variable test” due to Ramsey.
. estat ovtest

Ramsey RESET test using powers of the fitted values of rhknow
       Ho:  model has no omitted variables
                 F(3, 198) =      3.89
                  Prob > F =      0.0099

From Ramsey’s omitted variables test there is in fact very strong evidence of an omitted variable. In this case we could consider a quadratic relationship with age, which would allow the effect of age to become weaker with increasing age; we will also examine the effect of other variables later (e.g. education, number of partners, gender, etc.).

Homogeneity of Variance
To test for homogeneity of variance we plot the residuals against the fitted values. Equal scatter
of the residuals about the horizontal axis implies constant variance (homogeneity of variance)
while unequal scatter suggests non constant variance. If we look at the plot of the studentized
residuals versus the fitted values for the pilot stepping stones data above, there is little evidence
of non-constant variance.
We can also test for non-constant variance or heterogeneity of variance using the estat hettest
command
. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
       Ho:  Constant variance
       Variables: fitted values of rhknow

       chi2(1)     =     0.01
       Prob > chi2 =   0.9343
This confirms that there is no evidence of heterogeneity in this case.

Note: If we do find evidence of non-constant variance, one approach is to transform the response variable. Empirically it has been found that if the variance of y increases with increasing x, a log transformation can be effective (see the practical exercise and lecture 7).
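As an illustrative sketch only (our addition, not part of the original analysis; note from the summarize output later that rhknow includes a score of 0, so a shifted log is used here):

. gen lnknow = ln(rhknow + 1)
. regress lnknow age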

Normality
One way of testing for normality is to do a normal probability plot which essentially plots the
cumulative distribution of the residuals against the cumulative distribution of a standard normal
variable. If our residuals are normally distributed then the normal probability plot should give us
an approximate straight line.
This can be done as follows in Stata for the stepping stones pilot data (having saved the
studentized residuals in knowresa as done above).
. pnorm knowresa
In this case the normal probability plot shows no evidence of non-normality.
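As a complement (our addition, not in the original notes), Stata’s swilk command gives a formal Shapiro-Wilk test of normality, which can be applied to the saved studentized residuals:

. swilk knowresa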
We can also plot a histogram of the studentized residuals with a normal distribution superimposed using
. hist knowresa , norm
[Figure: Normal probability plot of the studentized residuals — Normal F[(rhres−m)/s] against Empirical P[i] = i/(N+1)]

[Figure: Histogram of the studentized residuals (−4 to 4), density on the vertical axis, with a normal curve superimposed]

Outliers
A large value of a residual, either positive or negative, indicates a potential outlier. Since our
studentized residuals are assumed to come from a standard normal distribution, 95% of the
studentized (or standardized) residuals should lie between -2 and +2. This gives us a yardstick
for what is “large”.
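For instance, the flagged observations can be listed directly (a sketch using the variables created above):

. list age rhknow knowfit knowresa if abs(knowresa) > 2 & knowresa < .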
For an observation with a large residual, the fitted value is very different from the observed value, and we should investigate the observation. If the cause of the strange value is known then we can reject it and report its removal. Outliers can seriously distort the analysis; however, careful consideration must be given before removing outliers from the data set.
Looking at the histogram of residuals we see one large residual. We can reanalyse the data removing any observation with a studentized residual larger than 3; we should first examine this observation to consider why it merits exclusion, and also see what the effect of removing it is.

. list age rhknow educ knowfit knowresa if knowresa > 3 & knowresa<.

     +---------------------------------------------+
     | age   rhknow   educ    knowfit    knowresa  |
     |---------------------------------------------|
     |  14       16      2   8.079252    3.435576  |
     +---------------------------------------------+

So this observation corresponds to a 14 year old who obtained a score of 16. To put this in context we can look at the distribution of rhknow:

. summa rhknow , det

              reproductive health knowledge score
-------------------------------------------------------------
      Percentiles      Smallest
 1%            2              0
 5%            4              2
10%            5              2       Obs                 204
25%            7              3       Sum of Wgt.         204

50%            9                      Mean           8.436275
                        Largest      Std. Dev.       2.479643
75%           10             13
90%           11             13       Variance       6.148628
95%           12             15       Skewness      -.2512049
99%           13             16       Kurtosis        3.43432

Thus the score of 16 was the highest overall, and the 14 year old was the only person to get that score.
We can see the effect of omitting that score:

. reg rhknow age if knowresa<3

      Source |       SS       df       MS              Number of obs =     202
-------------+------------------------------           F(  1,   200) =    8.81
       Model |  47.2569706     1  47.2569706           Prob > F      =  0.0034
    Residual |  1072.53016   200  5.36265079           R-squared     =  0.0422
-------------+------------------------------           Adj R-squared =  0.0374
       Total |  1119.78713   201  5.57108024           Root MSE      =  2.3157

------------------------------------------------------------------------------
      rhknow |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0768499   .0258881     2.97   0.003     .0258013    .1278985
       _cons |   6.932891   .5333889    13.00   0.000     5.881103    7.984678
------------------------------------------------------------------------------

Thus omitting that score leads to a slight increase in the value of R-squared and a slight
increase in the slope. Overall the change probably doesn’t warrant dropping the observation -
besides the score is almost certainly genuine (i.e. was not based on guessing).

Leverage Values
A point (xi, yi) is an outlier if the fitted value is far from the observed value yi. An observation
could be an outlier in another way, if xi is far from the other x’s. However the regression line of y
on x will pass very close to this point, and thus the residual will be very small. However this
point will have a dramatic effect on the resulting estimates, since if this point were removed the
estimates would change markedly. Such a point is said to have a high leverage. We can find the
leverage values for each point using the leverage option of the predict command. This will give
leverage values for each observation.
A commonly used rough rule for assessing leverage values is to pay attention to cases with leverage greater than 2p/n or 3p/n, where p is the number of parameters fitted (so p = 2 for simple linear regression) and n is the number of observations. Thus for our example p = 2 and n = 203, so we would consider leverage values > 4/203 ≈ 0.02.
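A sketch of this check (the variable name knowlev is our choice; e(N) stores the estimation sample size after regress):

. predict knowlev , leverage
. list age rhknow knowlev if knowlev > 2*2/e(N) & knowlev < .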
A useful plot is the leverage versus residual squared plot, which can be obtained in Stata using the lvr2plot command (shown below). The lines on the chart show the average values of leverage and (normalized) residuals squared. Points above the horizontal line have higher-than-average leverage; points to the right of the vertical line have larger-than-average residuals.
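After the regression above this is simply:

. lvr2plot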

Cook’s Distance
Leverages give us an indication of “unusual” explanatory variables and residuals indicate
observations with an unusual value of the response variable. It is sometimes useful to have an
overall measure of how “unusual” an observation is combining both explanatory and response
variables. What is needed is a measure of how influential each observation is in determining
the regression. One such measure is Cook’s distance. Cook’s distance for the ith observation is
a measure of the change in the estimated regression coefficients that would result from deleting
the ith observation. An observation with a large value of Cook’s D should be further
investigated. As a rough measure of size, values of Cook’s distance greater than 4/n deserve
further investigation (whatever the value of p, i.e. however many independent variables are in
the model).
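A sketch of this check (the variable name knowcook is our choice; cooksd is the relevant predict option):

. predict knowcook , cooksd
. list age rhknow knowcook if knowcook > 4/e(N) & knowcook < .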

DFBETAs
DFBETAs are perhaps the most direct influence measure – they focus on one coefficient and
measure the difference between the regression coefficient when the ith observation is included
and when it is excluded, the difference being scaled by the estimated standard error of the
coefficient. It has been suggested that we check observations with an absolute dfbeta > 2/√n, but other authors suggest a cut-off of 1 (meaning that the observation shifted the estimate by at least one standard error). The use of dfbetas in our regression model of rhknow on age is shown below:

. predict dage , dfbeta(age)
(1 missing value generated)

. summa dage , det

                         Dfbeta age
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -.2687882      -.4621181
 5%    -.1472693      -.3186477
10%    -.0580034      -.2687882       Obs                 203
25%    -.0197982      -.2619583       Sum of Wgt.         203

50%     .0032605                      Mean          -.0003466
                        Largest      Std. Dev.       .0764411
75%     .0290477        .135505
90%     .0804681       .1385067       Variance       .0058432
95%     .1041601       .1576703       Skewness      -1.989395
99%     .1385067       .2315073       Kurtosis       11.92995

. list sex age educ rhknow if abs(dage)>0.15 & dage<. , noobs

+------------------------------+
| sex age educ rhknow |
|------------------------------|
| male 14 2 16 |
| male 14 1 15 |
| male 33 1 13 |
| male 32 1 5 |
| male 41 2 8 |
|------------------------------|
| male 35 2 6 |
| male 42 1 6 |
| male 37 1 6 |
| male 25 1 3 |
| male 36 1 7 |
|------------------------------|
| female 34 2 7 |
| female 28 3 13 |
+------------------------------+

Note that here 2/√n is 0.1404, hence the chosen cut-off of 0.15. There are 12 observations worth checking (including our earlier potential outlier), although none of the dfbetas approach the more stringent cut-off of 1. Most of these observations are either young people with high scores or older people with low scores. Without any more information on these observations it would not be advisable to omit them from the model.
