CH 4 - Problems
• Inferences made on the basis of OLS estimation results are valid only so long as the assumptions of the classical linear regression model (CLRM) hold.
1. Heteroskedasticity
Heteroscedasticity is likely to be a problem
when the values of the variables in the
regression equation vary substantially in
different observations.
Causes of Heteroscedasticity
• Non-homogeneous sample: when we consider the product sales of firms, the sales of large firms are usually more volatile than those of small firms. Similarly, consumption expenditure studies show that the consumption expenditure of high-income people is more volatile than that of low-income individuals.
• Outliers: observations situated away or detached from the main body of the data.
• Functional specification error (omitted variable): in such a case the residuals obtained from the regression may give a distinct impression that the error variances are not constant. E.g., consider economies with different economic backgrounds. If we want to estimate the effect of GDP on value added by the manufacturing sector, the variation in value added for a small change in GDP tends to be relatively high for big economies and low for small economies. Hence, the economic background of the countries should be included in the regression.
• Skewness: the distribution of some variables, such as income and wealth, is skewed.
• Incorrect data transformation: in Stata, the ladder and gladder commands (e.g. ladder cons, gladder cons) can suggest an appropriate transformation.
What are the consequences of heteroscedasticity?
• Heteroscedasticity by itself does not cause the OLS estimators to be biased or inconsistent, since neither unbiasedness nor consistency depends on the covariance matrix of the error term. However, the usual OLS standard errors are no longer valid, so inference based on them can be misleading.
• A graph of the residuals against the fitted values of the dependent variable gives a rough indication of the existence of heteroscedasticity. The Stata command to detect heteroscedasticity graphically is rvfplot, yline(0).
• If there appears to be a systematic trend in the graph, it may be an indication of heteroscedasticity.
• First run the OLS regression, obtain the fitted values and residuals, then plot the residuals against the fitted values of the dependent variable and examine the spread (i.e., the variance) of the residuals.
If the spread of the residuals shows a systematic trend with the fitted values, heteroskedasticity exists.
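A minimal Stata sketch of this graphical check (the variable names cons and hhsize are assumptions, used only for illustration):
regress cons hhsize
predict yhat, xb          // fitted values
predict resid, residuals  // OLS residuals
scatter resid yhat        // plot the residuals against the fitted values
rvfplot, yline(0)         // equivalent built-in residual-versus-fitted plot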
However, this is not adequate and we need to conduct further tests. There are
many other formal methods of detecting the problem. They are outlined
below.
Breusch-Pagan test
1. Run the regression  yi = β1 + β2X2i + … + βkXki + ui  and obtain the residuals ûi.
2. Then run the auxiliary regression  ûi² = δ1 + δ2X2i + … + δkXki + ei  and obtain the R² of this auxiliary regression (call it R²aux).
3. Form  F = [R²aux/(k - 1)] / [(1 - R²aux)/(n - k)]  or  LM = n·R²aux.
In this case, H0: There is homoscedasticity (there is no problem of
heteroscedasticity).
Compare the calculated value of F (or LM) with the tabulated critical value. If the calculated value exceeds the critical value (or if the p-value < 0.05 at the 5% level of significance), we reject the null hypothesis of homoskedasticity and conclude that there is a problem of heteroscedasticity.
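In Stata the test can be run with the built-in post-estimation command, or by the auxiliary regression described above. A minimal sketch, assuming the regression of cons on ageh, sexh and hhsize used elsewhere in this chapter:
regress cons ageh sexh hhsize
estat hettest, rhs                 // Breusch-Pagan test, H0: constant variance (rhs = use the regressors)
* manual version of the auxiliary regression:
predict uhat, residuals
gen uhat2 = uhat^2
regress uhat2 ageh sexh hhsize     // LM = n times the R-squared of this regression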
a. Assume a model of the following form: Yi = α + β1X1i + β2X2i + β3X3i + ui, which has a heteroskedastic disturbance. Assume that the heteroskedastic disturbance is generated by the following equation: Var(ui) = σi² = kX2i. Perform the transformation necessary to make the model homoscedastic.
b. What if E(εi²) = σi² = a + bXi + cXi²?
Note: Summary of Stata commands
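The summary table itself is not reproduced here; as a rough recap, the Stata commands referred to in this chapter include:
• rvfplot, yline(0): residual-versus-fitted plot (graphical check for heteroscedasticity)
• estat hettest: Breusch-Pagan test for heteroscedasticity (listed here as the standard post-estimation command; it does not appear in the slides above)
• ladder / gladder: diagnostics for choosing a data transformation
• estat dwatson: Durbin-Watson test for autocorrelation
• estat durbinalt: Durbin's alternative test for autocorrelation
• estat bgodfrey: Breusch-Godfrey LM test for autocorrelation
• acprplot: augmented component-plus-residual plot (check for nonlinearity)
• linktest: specification (link) test
• estat ovtest: Ramsey RESET test for omitted variables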
2. Autocorrelation
• One of the assumptions of the OLS is that successive values of
the error terms are independent or are not related.
cov(ui, uj) = E(ui·uj) = 0   for i ≠ j
Sources of Autocorrelation
2. Misspecification of the mathematical form of the
model:
• If we have adopted a mathematical form which differs
from the true form of the relationship, then the
disturbance term may show serial correlation.
• For example, if the true relationship is nonlinear (say, quadratic in X) but we estimate a linear model, the omitted curvature shows up in the residuals as serial correlation.
3. Lags
• If lagged values of the dependent variable or of the independent variables should be included in the model but we overlook them, the regression will suffer from serial correlation of the error terms.
4. Inertia
The momentum built into economic data tends to continue until something happens to change it. Thus successive observations in time series data are likely to be interdependent.
5. Manipulation of Data
• This involves averaging, interpolation or extrapolation of data.
In empirical analysis, the raw data are often "manipulated." For example,
in time series regressions involving quarterly data, such data are usually
derived from the monthly data by simply adding three monthly
observations and dividing the sum by 3. This averaging introduces
smoothness into the data by dampening the fluctuations in the monthly
data.
Tests for detecting autocorrelation:
i) Asymptotic test
ii) Durbin-Watson test
i) Asymptotic test of autocorrelation
• The OLS residuals from the regression Y = XB + u provide useful
information about the possible presence of serial correlation in
the equation’s error term.
• An intuitively appealing starting point is to consider the regression of the OLS residual ût upon its lag ût−1, that is, ût = ρût−1 + vt.
• Under the null hypothesis of no autocorrelation (ρ = 0), the estimate ρ̂ should be close to zero, and an asymptotic test can be based on the result that, in large samples, √n·ρ̂ is approximately standard normal.
• The Durbin-Watson statistic satisfies dw ≈ 2(1 - ρ̂). Consequently, a value of dw close to 2 indicates that the first-order autocorrelation coefficient is close to zero. If dw is much smaller than 2, this is an indication of positive autocorrelation (ρ > 0); if dw is much larger than 2, then ρ < 0.
• Since the distribution of dw depends not only upon the sample size n and the number of regressors k, but also upon the actual values of the Xs, the critical values cannot be tabulated for general use.
• Fortunately, it is possible to compute upper and lower limits for the critical values of dw that depend only upon the sample size n and the number of regressors k. These values, dL and dU, were tabulated by Durbin and Watson (1950) and by Savin and White (1977).
• The Durbin-Watson statistic has a range from 0 to 4, with a midpoint of 2. (In Stata: estat dwatson)
• The ranges for dw, and the corresponding decisions, are:
Decision
• 0 ≤ dw < dL: reject H0; evidence of positive autocorrelation
• dL ≤ dw ≤ dU: inconclusive
• dU < dw < 4 - dU: do not reject H0; no evidence of first-order autocorrelation
• 4 - dU ≤ dw ≤ 4 - dL: inconclusive
• 4 - dL < dw ≤ 4: reject H0; evidence of negative autocorrelation
Other alternative tests of autocorrelation
using STATA
• estat durbinalt
(This is Durbin's alternative test for autocorrelation)
• estat bgodfrey
(Breusch-Godfrey LM test for autocorrelation)
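A minimal sketch of how these tests are run in Stata (the time variable year and the variables y, x1, x2 are assumptions used only for illustration):
tsset year            // declare the data as time series
regress y x1 x2
estat dwatson         // Durbin-Watson d statistic
estat durbinalt       // Durbin's alternative test
estat bgodfrey        // Breusch-Godfrey LM test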
Remedies to correct autocorrelation
- Use of heteroskedasticity-and-autocorrelation-consistent (HAC) standard errors for OLS (regression with robust standard errors).
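A minimal sketch, assuming hypothetical variables y, x1, x2 in tsset time-series data and an arbitrary lag length:
regress y x1 x2, vce(robust)   // heteroskedasticity-robust (White) standard errors
newey y x1 x2, lag(4)          // Newey-West HAC standard errors; the lag length is the user's choice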
Correction of functional specification
• In many cases the finding of autocorrelation is an
indication that the model is mis-specified.
• If this is the case, the most natural route is not to change
your estimator (from OLS to EGLS) but to change your
model.
• Typically, three (interrelated) types of misspecification
may lead to a finding of autocorrelation in your OLS
residuals:
- dynamic misspecification,
- omitted variables, and
- functional form misspecification.
Summary questions
• What is meant by autocorrelation?
• What are the sources of autocorrelation?
• What are the consequences of autocorrelation?
• How can we detect autocorrelation?
• What are the remedy measures to correct autocorrelation?
3. Multicollinearity
One of the assumptions of CLRM says that there is no exact linear
relationship between any of the explanatory variables included in
the regression analysis.
•When this assumption is violated we speak of Multicollinearity.
• In practice, in empirical research we encounter a moderate to high degree of MC. (Read Gujarati, starting from p. 341.)
• If there is perfect multicollinearity, the matrix of regressors will not have full rank (read about the rank of matrices).
• The rank of a matrix is simply the number of linearly independent row/column vectors in that matrix.
• If the number of linearly independent row/column vectors of a matrix is equal to the number of rows/columns, we say the matrix has full rank.
The following is an example of presence of perfect MC.
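The original example is not reproduced here; the following is a minimal Stata sketch, with hypothetical variables y, x1, x2, of how perfect MC shows up:
gen x3 = 2*x1 + 3*x2      // x3 is an exact linear combination of x1 and x2
regress y x1 x2 x3        // Stata drops x3, reporting it as omitted because of collinearity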
Sources of multicollinearity
1. Problem with data collection method: sampling over a limited range of the
values taken by the regressors in the population. (specially on discrete variables
with limited values). Small sample can also limit the range of the values even if
the variables are continuous.
2. Constraints on the model or on the population sampled: physical constraints in the population can cause multicollinearity, in the sense that, by their very nature, income and house size are highly related variables: families with higher incomes generally have larger homes than families with lower incomes.
5. Time series data: In most cases, the regressors used in time series
data share a common trend, that is, they all increase or decrease
over time together.
Consequences of Multicollinearity
• The presence of multicollinearity has a number of potentially
serious effects on the least-squares estimates of the regression
coefficients.
• Some of these effects may be easily demonstrated.
• To illustrate the effect of multicollinearity on the OLS estimator in
more detail, consider the following example.
• Let the following regression model be estimated: yi = β0 + β1x1i + β2x2i + ui
• where it is assumed that the sample means of x1 and x2 are zero. Moreover, assume that the sample variances of x1 and x2 are equal to 1, while their sample covariance (correlation coefficient) is r12.
• Under these assumptions, the variances of the OLS slope estimators are
Var(b1) = Var(b2) = σ² / [n(1 - r12²)]
• This shows that when there is strong multicollinearity between x1 and x2, the correlation coefficient r12 is close to 1, and the variances of the estimated coefficients become very large.
Detection of Multicollinearity
• One method of detecting multicollinearity is to estimate the correlation coefficients among the explanatory variables.
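A minimal sketch, assuming the regressors ageh, sexh and hhsize used elsewhere in this chapter:
correlate ageh sexh hhsize       // pairwise correlation coefficients among the regressors
pwcorr ageh sexh hhsize, sig     // the same correlations with significance levels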
3. By transforming the functional relationship
E.g., estimation with the first differences of the variables (in the case of time series analysis), as sketched below.
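A minimal sketch of the first-difference transformation in Stata (the time variable year and the variables y, x1, x2 are assumptions):
tsset year
regress D.y D.x1 D.x2     // estimate the model in first differences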
4) Problem of Model Specification
Is it a linear model?
Should the dependent or independent variable be transformed?
Is there any important variable omitted?
Is there an irrelevant variable included in the model?
Is the model estimated with errors in the variables?
Is there a structural break in the regression?
[Figure: fitted values and a lowess smoother plotted against household size (0–20); vertical axis 0–4000]
• In this case, the result shows that the fitted values are almost aligned with the lowess smoother. Hence, we see a linear relationship.
• Checking the linearity assumption is not so straightforward in the case
of multiple regression.
• In this case, the most straightforward thing to do is to plot the
standardized residuals against each of the predictor variables in the
regression model.
• If there is a clear nonlinear pattern, there is a problem of nonlinearity.
• To do so; we first run the regression (eg. let us say reg cons ageh sexh
hhsize)
• Then, find the residual for each observation (predict resid, residuals).
• Then use the command for each regressor:
• scatter resid ageh
• scatter resid sexh (for this variable, we do not have a continuous relationship)
• scatter resid hhsize
• scatter resid ageh
[Figure: residuals (about −1000 to 3000) plotted against age of household head (20–100)]
• scatter resid sexh
[Figure: residuals (about −1000 to 3000) plotted against sex of household head (0–1)]
• scatter resid hhsize
[Figure: residuals (about −1000 to 3000) plotted against household size (0–20)]
• Another command for detecting non-linearity is acprplot.
• acprplot graphs an augmented component-plus-residual plot,
a.k.a. augmented partial residual plot.
• It can be used to identify nonlinearities in the data.
acprplot ageh, lowess lsopts(bwidth(1))
[Figure: augmented component-plus-residual plot against age of household head (20–100)]
acprplot hhsize, lowess lsopts(bwidth(1))
[Figure: augmented component-plus-residual plot against household size (0–20)]
B) Test for functional specification problem in the
regressors
• Since any function can be approximated by a polynomial of sufficiently high order, a polynomial can mimic any non-linear function (Taylor series expansion: Y = f(x) = a + bx + cx² + dx³ + …).
• In practice we do not usually go beyond two terms. Also, instead of using powers of X, we use powers of Ŷ, which of course is a function of the Xs.
• Ramsey’s RESET test procedure is:
• 1. Estimate Y = b1 + b2 X2 + ... + bk Xk + u
• 2. Obtain fitted values
• 3. Estimate Y = b'1 + b'2 X2 + ... + b'k Xk + c Ŷ² + u'
• 4. Test for the significance of c. If it is significantly different from 0, reject the linear specification. The alternative hypothesis is not clear-cut, though.
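A minimal Stata sketch of this procedure, assuming the regression of cons on ageh, sexh and hhsize:
regress cons ageh sexh hhsize
predict yhat, xb                      // step 2: fitted values
gen yhat2 = yhat^2
regress cons ageh sexh hhsize yhat2   // step 3: add the squared fitted values
test yhat2                            // step 4: H0: coefficient on yhat2 equals zero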
• The linktest command can also perform test of a model specification
problem.
• linktest creates two new variables: the prediction, _hat, and the squared prediction, _hatsq.
• The model is then refit using these two variables as predictors.
• The linktest procedure is:
1. Estimate Y = b1 + b2 X2 + ... + bk Xk + u
2. Obtain the fitted values and their squares (_hat and _hatsq)
3. Estimate Y = b'1 + b'2 _hat + b'3 _hatsq + u'
4. Test for the significance of b'3 (the coefficient on _hatsq); if it is significant, there is evidence of a specification problem.
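A minimal usage sketch; the output below is of the form linktest produces after a regression with cons as the dependent variable (the exact regressors are not shown in the original, so ageh, sexh and hhsize are an assumption):
regress cons ageh sexh hhsize
linktest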
------------------------------------------------------------------------------
cons | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_hat | 1.842698 .6423948 2.87 0.004 .5825729 3.102824
_hatsq | -.0007843 .0005914 -1.33 0.185 -.0019444 .0003757
_cons | -215.7849 169.9994 -1.27 0.205 -549.2568 117.687
------------------------------------------------------------------------------
Here _hatsq is not statistically significant (p = 0.185), so there is no evidence of a specification error in this model.
• C) Problem of omitted variable or inclusion of
irrelevant variable
• A model specification error can occur when one or more relevant
variables are omitted from the model or one or more irrelevant
variables are included in the model.
i. Problem of Omitted Variable
True model: Yt = β0 + β1X1 + β2X2 + e
• But we estimate the model: Yt = β0 + β1X1 + e*
where: e* = β2X2 + e
• Clearly this model violates our assumption that the error has zero mean: E(e*) = E(β2X2 + e) is not 0 in general.
• Under these circumstances, what does our
estimate of b1 reflect?
• Recall that:
b1 = (X1'X1)⁻¹ X1'y
b1 = (X1'X1)⁻¹ X1'(X1β1 + X2β2 + e)
b1 = β1 + (X1'X1)⁻¹ X1'X2β2 + (X1'X1)⁻¹ X1'e
• We retain the assumption that E(e)=0
• That is, the expectation of the errors is 0 after we
account for the impact of X2
• Thus our estimate of b1 is:
E(b1) = β1 + (X1'X1)⁻¹ X1'X2β2
• This equation indicates that our estimate of b1 will be
biased by two factors.
- The extent of the correlation between X1 and X2, and
- The extent of X2's impact on Y.
• It is often difficult to know the direction of the bias.
• If either of these sources of bias is 0, then the
overall bias is 0
• Omitted variable bias can be detected using Ramsey regression
specification error test (RESET) for omitted variables.
• The syntax for this is estat ovtest or ovtest
• This test amounts to fitting y = xb + zt + u and then testing t = 0, where powers of the fitted values are used for z. The test is based on an F-statistic.
where: z = a matrix of ŷ², ŷ³ and ŷ⁴
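A minimal usage sketch, again assuming the regression of cons on ageh, sexh and hhsize:
regress cons ageh sexh hhsize
estat ovtest                  // Ramsey RESET test using powers of the fitted values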
ii) Problem of Including Irrelevant Variables
• Two possibilities:
1. If Cov(X, ε) = 0, then ε must be correlated with X*; OLS is unbiased, but we have a larger error variance:
V[(u − β1ε) | X] = V(u | X) + β1² V(ε | X) = σ² + β1² σε²