Regn Lect 4
Objectives
Before making use of our regression model we should check the validity of these
assumptions. Some assumptions (e.g. linearity) can be checked before the analysis by plotting
the data. Other assumptions have to be checked after the analysis has been completed.
To do this we use the residuals defined by
residual = observed value - fitted value
We can use the residuals to check our assumptions and test the adequacy of our model. If the
assumptions hold, then the residuals should be a random sample from a normal distribution with
mean zero.
Note that the study of residuals has grown considerably in recent years and goes by the name
of regression diagnostics. Although we will look at the facilities available in Stata, similar
facilities are available in other packages, e.g. SPSS and SAS.
1. Linearity
If we plot the residuals against the fitted values (or against the independent variable x) then any
non-linearity is shown up by a systematic non-linear pattern in the residuals.
We can distinguish between different kinds of residuals. The residuals that we have defined as
observed value minus fitted value are the raw residuals. In checking assumptions such as
constant variance there are two other kinds of residual that are more useful - these are the
standardized residual and the studentized residual.
Standardized and studentized residuals are attempts to adjust residuals for their standard
errors. Although the theoretical errors εi are assumed to have constant variance, the
calculated residuals ei do not: in fact Var(ei) = σ2(1 - hi), where the hi are the leverage
values, which will be discussed shortly. (Thus observations with the greatest leverage have
residuals with the smallest variance.) Standardized residuals estimate σ2 using the residual
mean square from the overall regression model, while studentized residuals use the residual
mean square for the regression that would have been obtained if that particular observation
had been omitted from the model. In general studentized residuals are preferred to
standardized residuals for identifying outliers (discussed below), while either can be used to
check for constant variance. To simplify matters we will use the studentized residuals, which
can be obtained after fitting a regression model in Stata using the predict command.
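To make the definitions above concrete, here is a minimal sketch (in Python rather than Stata, with the formulas for simple linear regression) of how the raw, standardized and studentized residuals are computed; in Stata the latter two are given directly by predict with the rstandard and rstudent options.

```python
import math

def simple_regression_diagnostics(x, y):
    """Fit y = a + b*x by least squares and return the raw,
    standardized, and studentized residuals."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    fitted = [a + b * xi for xi in x]
    e = [yi - fi for yi, fi in zip(y, fitted)]          # raw residuals
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]    # leverage values
    p = 2                                               # parameters fitted
    s2 = sum(ei ** 2 for ei in e) / (n - p)             # residual mean square
    # standardized: scale by s*sqrt(1 - h_i) from the full-model fit
    std = [ei / math.sqrt(s2 * (1 - hi)) for ei, hi in zip(e, h)]
    # studentized: estimate sigma^2 from the fit omitting observation i
    stu = []
    for ei, hi in zip(e, h):
        s2_i = ((n - p) * s2 - ei ** 2 / (1 - hi)) / (n - p - 1)
        stu.append(ei / math.sqrt(s2_i * (1 - hi)))
    return e, std, stu
```

The deleted-observation variance s2_i uses the standard identity for leave-one-out refitting, so no second regression is actually needed.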
Ex: In the Stepping Stones pilot data we can fit a regression of the knowledge score rhknow on
age. The observed points and fitted line are shown below:
[Scatter plot of rhknow against age (10 to 45) with the fitted regression line]
------------------------------------------------------------------------------
rhknow | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0712879 .0265227 2.69 0.008 .0189895 .1235863
_cons | 7.081222 .5457386 12.98 0.000 6.005115 8.15733
------------------------------------------------------------------------------
. predict knowfit
(option xb assumed; fitted values)
(1 missing value generated)
[Plot of the studentized residuals against the fitted values (linear prediction)]
There is no obvious pattern in the residual plot at this stage. Note that Stata's rvfplot
command gives a quick plot of the (raw) residuals versus the fitted values.
Stata also provides a test to see whether there are any omitted variables in our regression
model, which may also indicate some non-linearity (e.g. that we need to introduce a quadratic
term) or else the need for an extra explanatory variable (see multiple regression below). This is
the "omitted variable test" due to Ramsey (the RESET test).
. estat ovtest
From Ramsey's omitted variables test there is in fact very strong evidence of an omitted
variable. In this case we could consider a quadratic relationship with age, which would allow
the effect of age to diminish with increasing age; we will also examine the effect of other
variables later (e.g. education, number of partners, gender).
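The idea behind Ramsey's test can be sketched as follows: refit the model with powers of the fitted values added as extra regressors and test whether they improve the fit. This Python sketch (hypothetical data; not Stata's exact implementation, which uses powers up to the fourth) adds only the squared fitted values, giving an F statistic with 1 and n-3 degrees of freedom.

```python
def reset_test(x, y):
    """Simplified Ramsey RESET for a simple regression of y on x:
    add yhat^2 as an extra regressor and return the F statistic
    for that term (large F suggests an omitted variable or
    non-linearity)."""
    n = len(x)
    # ordinary linear fit of y on x
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    yhat = [a + b * xi for xi in x]
    rss0 = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    # augmented fit: y on 1, x, yhat^2 via the normal equations
    cols = [[1.0] * n, list(x), [f * f for f in yhat]]
    k = 3
    A = [[sum(cols[i][t] * cols[j][t] for t in range(n))
          for j in range(k)] for i in range(k)]
    rhs = [sum(cols[i][t] * y[t] for t in range(n)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        rhs[c], rhs[piv] = rhs[piv], rhs[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            for j in range(c, k):
                A[r][j] -= f * A[c][j]
            rhs[r] -= f * rhs[c]
    beta = [0.0] * k
    for c in range(k - 1, -1, -1):
        beta[c] = (rhs[c] - sum(A[c][j] * beta[j]
                                for j in range(c + 1, k))) / A[c][c]
    rss1 = sum((y[t] - sum(beta[j] * cols[j][t] for j in range(k))) ** 2
               for t in range(n))
    # F test for the single added term
    return (rss0 - rss1) / (rss1 / (n - k))
```

Data generated from a quadratic relationship give a very large F, as the omitted curvature is captured by the yhat^2 term.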
2. Homogeneity of Variance
To check for homogeneity of variance we plot the residuals against the fitted values. Even
scatter of the residuals about the horizontal axis implies constant variance (homogeneity of
variance), while unequal scatter suggests non-constant variance. Looking at the plot of the
studentized residuals versus the fitted values for the pilot Stepping Stones data above, there is
little evidence of non-constant variance.
We can also test for non-constant variance or heterogeneity of variance using the estat hettest
command
. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of rhknow
chi2(1) = 0.01
Prob > chi2 = 0.9343
This confirms that there is no evidence of heterogeneity in this case.
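The logic of the test can be sketched in a few lines: regress the squared residuals on the fitted values and see whether there is any relationship. This Python sketch uses Koenker's n·R² form of the statistic (a common robust variant, not Stata's exact default, which is the Cook-Weisberg score version), approximately chi-squared with 1 degree of freedom under the null of constant variance.

```python
def breusch_pagan(x, y):
    """Koenker's studentized Breusch-Pagan statistic for a simple
    regression of y on x: regress the squared residuals on the
    fitted values and return LM = n * R^2."""
    n = len(x)

    def fitted_values(u, v):
        # least-squares fitted values from regressing v on u
        ub = sum(u) / n
        vb = sum(v) / n
        suu = sum((ui - ub) ** 2 for ui in u)
        b = sum((ui - ub) * (vi - vb) for ui, vi in zip(u, v)) / suu
        a = vb - b * ub
        return [a + b * ui for ui in u]

    fitted = fitted_values(x, y)
    e2 = [(yi - fi) ** 2 for yi, fi in zip(y, fitted)]
    aux = fitted_values(fitted, e2)        # auxiliary regression
    e2bar = sum(e2) / n
    ss_tot = sum((g - e2bar) ** 2 for g in e2)
    ss_reg = sum((g - e2bar) ** 2 for g in aux)
    r2 = ss_reg / ss_tot if ss_tot > 0 else 0.0
    return n * r2
```

A large statistic (relative to the chi-squared(1) distribution) indicates that the residual spread varies with the fitted values, i.e. heteroskedasticity.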
3. Normality
One way of testing for normality is to do a normal probability plot which essentially plots the
cumulative distribution of the residuals against the cumulative distribution of a standard normal
variable. If our residuals are normally distributed then the normal probability plot should give us
an approximate straight line.
This can be done as follows in Stata for the stepping stones pilot data (having saved the
studentized residuals in knowresa as done above).
. pnorm knowresa
In this case the normal probability plot shows no evidence of non-normality.
We can also plot a histogram of the studentized residuals with a normal distribution
superimposed using
. hist knowresa , norm
[Normal probability plot of the studentized residuals: empirical probability against Normal F[(rhres-m)/s]]
[Histogram of the studentized residuals (about -4 to 4) with superimposed normal density]
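The coordinates that a plot like pnorm uses can be computed directly, which makes the idea transparent. This Python sketch (the exact plotting positions Stata uses may differ slightly) pairs the empirical probability (i - 0.5)/n of each sorted residual with the standard normal probability of its standardized value; under normality the points lie close to the 45-degree line.

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def pnorm_points(res):
    """Coordinates for a standardized normal probability plot:
    (empirical probability, Phi((r_(i) - mean)/sd)) for each
    sorted residual r_(i)."""
    n = len(res)
    m = sum(res) / n
    s = math.sqrt(sum((r - m) ** 2 for r in res) / (n - 1))
    return [((i - 0.5) / n, phi((r - m) / s))
            for i, r in enumerate(sorted(res), start=1)]
```

Both coordinates are probabilities, so every point lies in the unit square, and both increase with the sorted residuals.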
4. Outliers
A large residual, either positive or negative, indicates a potential outlier. Since the
studentized residuals should behave approximately like a sample from a standard normal
distribution, about 95% of the studentized (or standardized) residuals should lie between -2
and +2. This gives us a yardstick for what is "large".
For an observation with a large residual, the fitted value is very different from the observed
value, and we should investigate the observation. If the strange value is known to be an error
then we can reject it and report its removal. Outliers can seriously distort the analysis, but
careful consideration is needed before removing them from the data set.
Looking at the histogram of residuals we see one large residual. We can reanalyse the data
removing any observation with a studentized residual larger than 3, but we should first
examine this observation to consider why it merits exclusion, and also see what the effect of
removing it is.
. list age rhknow educ knowfit knowresa if knowresa > 3 & knowresa<.
+-------------------------------------------+
| age rhknow educ knowfit knowresa|
|-------------------------------------------|
| 14 16 2 8.079252 3.435576 |
+-------------------------------------------+
So this observation corresponds to a 14 year old who obtained a score of 16. To put this in
context we can look at the distribution of rhknow: the score of 16 was the highest overall, and
the 14 year old was the only person to achieve it.
We can see the effect of omitting that score. Omitting it leads to a slight increase in the value
of R-squared and a slight increase in the slope. Overall the change probably doesn't warrant
dropping the observation; besides, the score is almost certainly genuine (i.e. was not based on
guessing).
5. Leverage Values
A point (xi, yi) is an outlier if the fitted value is far from the observed value yi. An observation
can also be unusual in another way: xi may be far from the other x's. The regression line of y
on x will then pass very close to this point, so its residual will be very small; nevertheless the
point can have a dramatic effect on the resulting estimates, since if it were removed the
estimates would change markedly. Such a point is said to have high leverage. We can find the
leverage value for each observation using the leverage option of the predict command.
A commonly used rough rule for assessing leverage values is to pay attention to cases with
leverage greater than 2p/n or 3p/n where p is the number of parameters fitted (so p=2 for simple
linear regression) and n is the number of observations. Thus for our example p=2 and n=200 so
we would consider leverage values > 4/200 = 0.02.
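For simple linear regression the leverage values have the closed form hi = 1/n + (xi - xbar)^2/Sxx, so the rule of thumb is easy to apply by hand. This Python sketch (a hypothetical helper, not Stata's predict) computes the leverages and flags those exceeding a multiple of p/n.

```python
def leverage_flags(x, threshold_mult=2):
    """Leverage values for simple linear regression,
    h_i = 1/n + (x_i - xbar)^2 / Sxx, flagged when
    h_i > threshold_mult * p / n (p = 2 parameters)."""
    n = len(x)
    p = 2
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    cutoff = threshold_mult * p / n
    return h, [i for i, hi in enumerate(h) if hi > cutoff]
```

A useful check on any leverage computation is that the leverages always sum to p, the number of fitted parameters.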
A useful plot is the leverage versus residual-squared plot, obtained in Stata with the lvr2plot
command. The lines on the chart show the average values of the leverage and of the
(normalized) squared residuals: points above the horizontal line have higher-than-average
leverage; points to the right of the vertical line have larger-than-average residuals.
6. Cook's Distance
Leverages indicate "unusual" values of the explanatory variables, while residuals indicate
observations with an unusual value of the response variable. It is sometimes useful to have an
overall measure of how "unusual" an observation is, combining both explanatory and response
variables: in other words, a measure of how influential each observation is in determining the
regression. One such measure is Cook's distance. Cook's distance for the ith observation
measures the change in the estimated regression coefficients that would result from deleting
that observation. An observation with a large value of Cook's D should be investigated further.
As a rough measure of size, values of Cook's distance greater than 4/n deserve further
investigation (whatever the value of p, i.e. however many independent variables are in the
model).
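Cook's distance combines the residual and the leverage in a single formula, D_i = (e_i^2 / (p s^2)) * h_i / (1 - h_i)^2, so no refitting is needed. This Python sketch (for simple linear regression, with the 4/n yardstick; in Stata the values come from predict with the cooksd option) shows the calculation.

```python
def cooks_distance(x, y):
    """Cook's distance for each observation in a simple linear
    regression, with indices of observations exceeding the rough
    4/n yardstick."""
    n, p = len(x), 2
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]          # residuals
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]         # leverages
    s2 = sum(ei ** 2 for ei in e) / (n - p)                  # residual MS
    d = [ei ** 2 / (p * s2) * hi / (1 - hi) ** 2
         for ei, hi in zip(e, h)]
    return d, [i for i, di in enumerate(d) if di > 4 / n]
```

Note how an observation needs both a sizeable residual and sizeable leverage to obtain a large D.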
7. DFBETAs
DFBETAs are perhaps the most direct influence measure – they focus on one coefficient and
measure the difference between the regression coefficient when the ith observation is included
and when it is excluded, the difference being scaled by the estimated standard error of the
coefficient. It has been suggested that we check observations with an absolute dfbeta > 2/√n,
but other authors suggest a cut-off of 1 (meaning that the observation shifted the estimate at
least one standard error). The use of dfbetas in our regression model of rhknow on age is
shown below:
Note that here 2/√n is 0.1404, hence the chosen cut-off of 0.15. There are 12 observations
worth checking (including our earlier potential outlier), although none of the dfbetas approach
the more stringent cut-off of 1. Most of these observations are either young people with high
scores or older people with low scores. Without more information on these observations it
would not be advisable to omit them from the model.
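The definition of a DFBETA can be illustrated by brute-force leave-one-out refitting. This Python sketch computes, for the slope of a simple regression, the change in the coefficient when each observation is dropped, scaled by the deleted-fit standard error; note that exact scaling conventions differ slightly between packages, so this is close to, but not identical to, Stata's dfbeta.

```python
import math

def dfbeta_slope(x, y):
    """Leave-one-out DFBETAs for the slope in a simple linear
    regression, with indices of observations whose absolute
    dfbeta exceeds the 2/sqrt(n) yardstick."""
    n = len(x)

    def slope_fit(xs, ys):
        # slope and its standard error from a least-squares fit
        m = len(xs)
        xb = sum(xs) / m
        yb = sum(ys) / m
        sxx = sum((xi - xb) ** 2 for xi in xs)
        b = sum((xi - xb) * (yi - yb) for xi, yi in zip(xs, ys)) / sxx
        a = yb - b * xb
        rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(xs, ys))
        se = math.sqrt(rss / (m - 2) / sxx)
        return b, se

    b_full, _ = slope_fit(x, y)
    out = []
    for i in range(n):
        b_i, se_i = slope_fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        out.append((b_full - b_i) / se_i)
    cutoff = 2 / math.sqrt(n)
    return out, [i for i, d in enumerate(out) if abs(d) > cutoff]
```

Packages compute these without refitting n times (via the leverage identities), but the brute-force version makes the definition explicit.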