MULTICOLLINEARITY
BASIC ECONOMETRICS
INTRODUCTION
• Means the existence of a “perfect”, or exact, linear
relationship among some or all explanatory variables of a
regression model (Ragnar Frisch, 1934).
• But today, the term multicollinearity is used in a broader sense
to include the case where the X variables are intercorrelated but
not perfectly.
• Multicollinearity (M-C) violates Assumption 10 of the CLRM,
which states that there is no M-C among the regressors included
in the regression model.
• Why does the CLRM assume there is no M-C among the X’s?
• Reason:
– If M-C is perfect, the regression coefficients of the X
variables are indeterminate and their standard errors are infinite.
– If M-C is less than perfect, the regression coefficients possess
large standard errors (in relation to the coefficients themselves),
which means the coefficients cannot be estimated with great
precision or accuracy.
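• A quick way to see the perfect case in Stata (a sketch with
hypothetical variables y and x2; x3 is generated as an exact
multiple of x2): Stata cannot separate the two coefficients and
drops one regressor rather than report infinite standard errors.
. gen x3 = 2*x2
. regress y x2 x3
Stata responds with a note that x3 was omitted because of
collinearity.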
SOURCES OF M-C
• The data collection method employed. Sampling over a
limited range of the values taken by the regressors in the
population.
• Constraints on the model or in the population being
sampled. Example: regression of electricity consumption on income
(X2) and house size (X3) – families with higher incomes
generally have larger homes.
• Model specification. Example: adding polynomial terms to a
regression model, especially when the range of the X variable
is small.
• An overdetermined model. This happens when the model has
more explanatory variables than the number of observations.
CONSEQUENCES OF M-C
1. Although BLUE, the OLS estimators have large variances and
covariances, making precise estimation difficult.
2. Because of (1), the confidence intervals tend to be much
wider, leading to the acceptance of the “zero null hypothesis” (H0).
3. Also because of (1), the t ratios of one or more coefficients
tend to be statistically insignificant.
4. Although the t ratios are statistically insignificant, R², the
overall measure of goodness of fit, can be very high.
5. The OLS estimators and their standard errors can be
sensitive to small changes in the data.
(refer to Gujarati for a more detailed discussion)
DETECTION OF M-C
• High R² but few significant t ratios.
– This is the classic symptom of M-C.
– If R² is high, say R² > 0.8, the F test will in most cases reject the
hypothesis that the slope coefficients are simultaneously equal to zero.
– But the individual t tests show that none or very few of the coefficients
are statistically different from zero.
• High pair-wise correlations among regressors.
– The pair-wise or zero-order correlation coefficient between two
regressors is high (> 0.8).
– But high zero-order correlations are a sufficient, not a necessary,
condition for the existence of M-C, because it can exist even when
the zero-order or simple correlations are comparatively low.
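– As a quick check in Stata (x2, x3 and x4 are placeholder
regressor names), the pairwise correlation matrix flags pairs
above the 0.8 rule of thumb, bearing in mind the caveat above:
. pwcorr x2 x3 x4, sig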
• Examination of partial correlations.
– In the regression of Y on X2, X3 and X4, a finding that R²1.234 is very high
but r²12.34, r²13.24 and r²14.23 are comparatively low may suggest
that the variables X2, X3 and X4 are highly intercorrelated.
– Although a study of the partial correlations may be useful, there is
no guarantee that they will provide an infallible guide to M-C, for it may
happen that both R² and all the partial correlations are sufficiently high.
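– In Stata, the pcorr command reports the partial correlations of
the first listed variable with the others, holding the rest
constant (placeholder names again); run it once per regressor:
. pcorr x2 x3 x4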
• Variance inflation factor (VIF).
– Measures M-C by regressing one independent variable on all of the
remaining independent variables.
– If we have 3 independent variables X1, X2 and X3, to use the VIF to look
for possible M-C we run 3 regressions, one for each independent variable.
– We would run the following three auxiliary regressions:
X1t = a1 + a2 X2t + a3 X3t + e1t
X2t = b1 + b2 X1t + b3 X3t + e2t
X3t = c1 + c2 X1t + c3 X2t + e3t
– Next, for each of these regressions, we calculate the VIF using this
formula:
VIFj = 1 / (1 − R²j)
– where R²j is the unadjusted R² from the auxiliary regression using Xj as
the dependent variable (not from the original regression that we suspect
has M-C).
– The idea behind the VIF: if the R²j in the equation above is high, the
variances of the slope estimates (and their standard errors) will also be
high, or inflated.
– Some researchers believe that a VIF greater than 4 indicates a serious
M-C problem.
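– A minimal sketch of this calculation in Stata, assuming three
regressors x1, x2 and x3 (regress stores the unadjusted R² in e(r2)):
. regress x1 x2 x3
. display "VIF for x1 = " 1/(1 - e(r2))
. regress x2 x1 x3
. display "VIF for x2 = " 1/(1 - e(r2))
. regress x3 x1 x2
. display "VIF for x3 = " 1/(1 - e(r2))
– This is what Stata’s estat vif automates after a regression, as
shown in the diagnostic section below.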
REMEDIAL MEASURES
• Do nothing.
– Blanchard (1967): M-C is essentially a data deficiency problem, and
sometimes we have no choice over the data we have available for
empirical analysis.
– Not all the coefficients need be statistically insignificant. Even if we
cannot estimate one or more regression coefficients with great precision, a
linear combination of them can be estimated relatively efficiently.
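– Stata’s lincom command illustrates the last point (y, x1 and x2
are placeholders): even when the individual coefficients have large
standard errors, the standard error of their sum can be much smaller.
. regress y x1 x2
. lincom x1 + x2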
• Combining cross-sectional and time series data.
– Known as pooling the data.
– Assume we have time series data on the number of cars sold, the average
price of a car and consumer income:
ln Yt = β1 + β2 ln Pt + β3 ln It + ut
– where Y = number of cars sold, P = average price, I = income and t = time.
– In time series data, price and income generally tend to be highly
collinear.
– If we run this regression, we face the M-C problem.
– Tobin (1950) suggested that if we have cross-sectional data, we can obtain a
fairly reliable estimate of the income elasticity β3 because in such data,
which are at a point in time, prices do not vary much.
• Dropping a variable(s) and specification bias.
– One of the ‘simplest’ things to do is to drop one of the collinear
variables.
– But in dropping a variable we may be committing a specification bias, or
specification error.
– Specification bias arises from an incorrect specification of the model used
in the analysis.
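– In Stata this remedy is a one-line change (placeholder names; the
specification-bias caveat above still applies):
. regress y x1 x2
. regress y x1
– The second regression drops x2; it is justified only if theory says
x2 does not belong in the model.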
• Transformation of variables.
– Suppose we have time series data on consumption expenditure (Y), income
(X1) and wealth (X2):
Yt = β1 + β2 X1t + β3 X2t + ut
– Income and wealth show high collinearity – over time, both variables
tend to move in the same direction.
• At time t−1,
Yt−1 = β1 + β2 X1,t−1 + β3 X2,t−1 + ut−1
• Subtracting the t−1 equation from the original equation, we obtain
Yt − Yt−1 = β2(X1t − X1,t−1) + β3(X2t − X2,t−1) + vt
• where vt = ut − ut−1.
• The equation above is known as the first difference form because we run
the regression not on the original variables, but on their differences.
• The first difference model often reduces the severity of M-C
because there is no a priori reason to expect that the differences will
also be highly correlated.
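• Once the data are tsset, Stata’s difference operator D. estimates
the first difference form directly (y, x1 and x2 are placeholder names):
. tsset t
. regress D.y D.x1 D.x2
• Note that differencing costs one observation, and the new error term
vt = ut − ut−1 may itself be serially correlated, so this transformation
is not costless.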
• Additional or new data.
– Since M-C is a sample feature, it is possible that in another sample
involving the same variables, collinearity may not be as serious as in the
first sample.
– Sometimes, simply increasing the sample size may attenuate the
collinearity problem.
DIAGNOSTIC TEST
THE MODEL ESTIMATED
• Let the model we want to estimate be as below:
ln 𝑃𝐶𝐸𝑡 = 𝛼0 + 𝛼1 ln 𝑃𝐷𝐼𝑡 + 𝛼2 ln 𝐺𝐷𝑃𝑡 + 𝑢𝑡 (1)
where:
PCE = personal consumption expenditure (billions of 1987 dollars)
PDI = personal disposable income (billions of 1987 dollars)
GDP = gross domestic product (billions of 1987 dollars)
CHANGE DATA INTO LOG FORM
• Because the model is in log form, we must transform the data
by creating new variables in log form in Stata.
. gen lngdp = log(gdp)
. gen lnpdi = log(pdi)
. gen lnpce = log(pce)
• Now we have created new variables, namely lngdp, lnpdi and
lnpce, in log form.
• Make sure you have created the time variable t (when using
time series data), then at the Stata command line:
. tsset t
time variable: t, 1 to 88
delta: 1 unit
REGRESS THE MODEL
• Now, based on the model in Eq. (1), we will perform an OLS
regression. (The constant is included by default.)
. regress lnpce lnpdi lngdp
      Source |       SS       df       MS              Number of obs =      88
-------------+------------------------------           F(  2,    85) =15955.37
       Model |  2.93006108     2  1.46503054           Prob > F      =  0.0000
    Residual |  .007804743    85  .000091821           R-squared     =  0.9973
-------------+------------------------------           Adj R-squared =  0.9973
       Total |  2.93786582    87  .033768573           Root MSE      =  .00958

------------------------------------------------------------------------------
       lnpce |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lnpdi |   .4871041   .0601343     8.10   0.000      .367541    .6066672
       lngdp |   .6091764   .0636003     9.58   0.000     .4827219    .7356309
       _cons |  -1.060787   .0689796   -15.38   0.000    -1.197937   -.9236368
------------------------------------------------------------------------------
MULTICOLLINEARITY TEST
• We will now test whether multicollinearity exists in our model
by using the VIF test.
. estat vif
    Variable |      VIF       1/VIF
-------------+----------------------
       lngdp |   102.35    0.009770
       lnpdi |   102.35    0.009770
-------------+----------------------
    Mean VIF |   102.35
• Our VIF = 102.35, which indicates serious multicollinearity
(far above the threshold of 4 mentioned earlier).
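• As a cross-check, the VIF can be reproduced by hand from the
auxiliary regression of one log regressor on the other, using the
formula from the detection section; with two regressors, both VIFs
are identical:
. regress lngdp lnpdi
. display 1/(1 - e(r2))
• This should return approximately 102.35, matching estat vif.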