
Chapter 4B

Multicollinearity
Outline
• The nature of multicollinearity
• Estimation in the presence of multicollinearity.
• Practical consequences
• Detection of multicollinearity
• Remedial measures
1. The Nature of Multicollinearity
• Originally it meant the existence of a “perfect,” or exact,
linear relationship among some or all explanatory variables of
a regression model.
• Today, it includes perfect multicollinearity and less than
perfect multicollinearity.
• Wooldridge (2004): High (but not perfect) correlation between
two or more independent variables is called multicollinearity.
• Perfect multicollinearity:
λ1X1 + λ2X2 + · · · + λkXk = 0
where λ1, λ2, ..., λk are constants that are not all zero simultaneously.
• Imperfect (less than perfect) multicollinearity:
λ1X1 + λ2X2 + · · · + λkXk + vi = 0
where vi is a stochastic error term.
1. The Nature of Multicollinearity

• A numerical example:

• X3i = 5X2i → there is perfect collinearity between X2 and X3.

• The variable X*3 was created from X3 by simply adding to it the numbers vi = 2, 0, 7, 9, 2. Now there is no longer perfect collinearity between X2 and X*3. However, the two variables are highly correlated: the coefficient of correlation between them is 0.9959.
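A minimal Stata sketch of this construction (the data follow Gujarati's textbook version of the example, which these slides appear to be based on; the added numbers 2, 0, 7, 9, 2 are those quoted on the slide):

* build X3 as an exact multiple of X2, then break the exact relation with v
clear
input x2 x3 v
10  50 2
15  75 0
18  90 7
24 120 9
30 150 2
end
gen x3star = x3 + v       // X3* = X3 + v
corr x2 x3 x3star         // corr(x2, x3) = 1; corr(x2, x3star) ≈ 0.9959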
2. Estimation in the presence of multicollinearity
Perfect multicollinearity

• Suppose X3i = λX2i, where λ is a nonzero constant (perfect collinearity).
• Substituting this into the OLS formulas makes both the numerator and the denominator of β̂2 (and β̂3) equal to zero → the estimators are indeterminate: OLS cannot separate the individual effects of X2 and X3.

2. Estimation in the presence of multicollinearity
High multicollinearity
• In the model Yi = β1 + β2X2i + β3X3i + ui, the variances and covariance of β̂2 and β̂3 are given by
var(β̂2) = σ² / [Σx2i²(1 − r23²)]
var(β̂3) = σ² / [Σx3i²(1 − r23²)]
cov(β̂2, β̂3) = −r23σ² / [(1 − r23²)√(Σx2i²)√(Σx3i²)]
where r23 is the coefficient of correlation between X2 and X3 and the lowercase x's denote deviations from sample means.
• As r23 tends toward 1, i.e., as collinearity increases, the variances and covariance of the estimators increase.
• Perfect collinearity: r23 = 1 and the variances are infinite.
2. Estimation in the presence of multicollinearity

• The speed with which variances and covariances increase can be seen with the variance-inflating factor (VIF), which is defined as
VIF = 1 / (1 − r23²)
• Using this definition, we can express
var(β̂2) = (σ² / Σx2i²) · VIF and var(β̂3) = (σ² / Σx3i²) · VIF
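• A quick numerical illustration (arithmetic added here, not taken from the slide): if r23 = 0.95, then VIF = 1/(1 − 0.95²) = 1/0.0975 ≈ 10.3, so the variances of β̂2 and β̂3 are roughly ten times what they would be if X2 and X3 were uncorrelated; with r23 = 0.99 the factor rises to about 50.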


3. Practical consequences of Multicollinearity

High multicollinearity
1. The OLS estimators have large variances and covariances,
making precise estimation difficult.
2. The confidence intervals tend to be much wider, leading to the
acceptance of the “zero null hypothesis” (i.e., the true population
coefficient is zero) more readily.
3. The t ratio of one or more coefficients tends to be statistically
insignificant.
4. Although the t ratio of one or more coefficients is statistically
insignificant, R2, the overall measure of goodness of fit, can be
very high.
5. The OLS estimators and their standard errors can be sensitive
to small changes in the data.
Example
• Income and wealth together explain about 96 percent of the variation in consumption expenditure.
• Neither of the slope coefficients is individually statistically significant.
• Not only is the wealth variable statistically insignificant, but it also has the wrong sign.
• H0: β2 = β3 = 0 is rejected (F = 92.40) → consumption expenditure is related to income and wealth.
→ When collinearity is high, tests on individual regressors are not reliable.
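A minimal Stata sketch of this diagnosis (the dataset and the variable names cons, income, and wealth are hypothetical, chosen only to mirror the example):

* jointly significant regressors, individually insignificant t ratios
use consumption.dta, clear
regress cons income wealth    // high overall F, low individual t ratios
test income wealth            // joint test of H0: both slope coefficients are zero
estat vif                     // large VIFs confirm the collinearity between income and wealth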
Example
• Correlation matrix
. use "D:\Bai giang\Kinh te luong\datasets\WAGE2.DTA", clear

. gen exper2=exper*2

. pwcorr educ exper exper2 tenure age sibs brthord, star(0.05)

educ exper exper2 tenure age sibs brthord

educ 1.0000
exper -0.4556* 1.0000
exper2 -0.4556* 1.0000* 1.0000
tenure -0.0362 0.2437* 0.2437* 1.0000
age -0.0123 0.4953* 0.4953* 0.2706* 1.0000
sibs -0.2393* 0.0643* 0.0643* -0.0392 -0.0407 1.0000
brthord -0.2050* 0.0883* 0.0883* -0.0285 0.0054 0.5939* 1.0000
Example
• Regression results
. reg lwage educ exper exper2 tenure age sibs brthord
note: exper omitted because of collinearity

Source SS df MS Number of obs = 852


F( 6, 845) = 28.37
Model 24.6147323 6 4.10245539 Prob > F = 0.0000
Residual 122.201332 845 .144616961 R-squared = 0.1677
Adj R-squared = 0.1617
Total 146.816065 851 .172521815 Root MSE = .38029

lwage Coef. Std. Err. t P>|t| [95% Conf. Interval]

educ .0675684 .0071676 9.43 0.000 .0534999 .0816368


exper 0 (omitted)
exper2 .0047813 .0020189 2.37 0.018 .0008186 .0087439
tenure .0119621 .0026717 4.48 0.000 .0067181 .0172062
age .0098871 .0050935 1.94 0.053 -.0001103 .0198845
sibs -.0067073 .0071707 -0.94 0.350 -.0207818 .0073672
brthord -.0134547 .0102047 -1.32 0.188 -.0334841 .0065748
_cons 5.406201 .1685784 32.07 0.000 5.07532 5.737083
Example
• Regression results without sibs (here and below, exper is used in place of exper2)
. reg lwage educ exper tenure age brthord

Source SS df MS Number of obs = 852


F( 5, 846) = 33.87
Model 24.4882037 5 4.89764073 Prob > F = 0.0000
Residual 122.327861 846 .14459558 R-squared = 0.1668
Adj R-squared = 0.1619
Total 146.816065 851 .172521815 Root MSE = .38026

lwage Coef. Std. Err. t P>|t| [95% Conf. Interval]

educ .0684463 .0071054 9.63 0.000 .0545001 .0823926


exper .0096979 .0040349 2.40 0.016 .0017782 .0176175
tenure .0120242 .0026707 4.50 0.000 .0067822 .0172662
age .0099569 .0050926 1.96 0.051 -.0000387 .0199524
brthord -.018937 .008353 -2.27 0.024 -.035332 -.002542
_cons 5.383053 .1667398 32.28 0.000 5.055781 5.710325
Example
• Regression results without brthord
. reg lwage educ exper tenure age sibs

Source SS df MS Number of obs = 935


F( 5, 929) = 35.96
Model 26.8611795 5 5.37223589 Prob > F = 0.0000
Residual 138.795104 929 .149402695 R-squared = 0.1622
Adj R-squared = 0.1576
Total 165.656283 934 .177362188 Root MSE = .38653

lwage Coef. Std. Err. t P>|t| [95% Conf. Interval]

educ .0683854 .0069057 9.90 0.000 .0548328 .0819381


exper .0114801 .0039255 2.92 0.004 .0037763 .019184
tenure .0124307 .0026148 4.75 0.000 .0072992 .0175622
age .0086558 .0049387 1.75 0.080 -.0010366 .0183481
sibs -.0121724 .0056602 -2.15 0.032 -.0232807 -.0010642
_cons 5.384746 .1630053 33.03 0.000 5.064844 5.704647
4. Detection of Multicollinearity

• High R2 but few significant t ratios. If R2 is high,


say, in excess of 0.8, the F test in most cases will
reject the hypothesis that the partial slope
coefficients are simultaneously equal to zero, but the
individual t tests will show that none or very few of
the partial slope coefficients are statistically
different from zero.
4. Detection of Multicollinearity

• High pair-wise correlations among regressors: a rule of thumb says that if the pair-wise correlation between two regressors is high, say in excess of 0.8, then multicollinearity is a serious problem.
→ This is a sufficient but not a necessary condition: in models involving more than two explanatory variables, the pair-wise correlations will not provide an infallible guide to the presence of multicollinearity.
• Auxiliary regressions: regress each Xi on the remaining X variables and compute the corresponding R². If the computed F of an auxiliary regression exceeds the critical F at the chosen level of significance, the particular Xi is taken to be collinear with the other X's (see the sketch below).
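A minimal Stata sketch of an auxiliary regression, using the WAGE2 regressors shown earlier (the VIF computed from its R² is shown only to illustrate the procedure):

* regress one explanatory variable on the remaining ones
regress educ exper tenure age sibs brthord
display "R-squared = " e(r2) "  F = " e(F) "  VIF = " 1/(1 - e(r2))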
4. Detection of Multicollinearity

• Tolerance and variance inflation factor: for regressor Xj,
VIFj = 1 / (1 − R²j)
where R²j is the R² from the auxiliary regression of Xj on the other regressors. If the VIF of a variable exceeds 10, which happens when R²j exceeds 0.90, that variable is said to be highly collinear.
• The inverse of the VIF is called tolerance (TOL). That is,
TOLj = 1 / VIFj = 1 − R²j
When R²j = 1 (i.e., perfect collinearity), TOLj = 0.
When R²j = 0 (i.e., no collinearity whatsoever), TOLj = 1.
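In Stata, VIFs and tolerances are available directly after a regression; a sketch for the wage equation used earlier (exper2 is left out so that the perfectly collinear pair does not enter):

regress lwage educ exper tenure age sibs brthord
estat vif        // reports VIF and 1/VIF (tolerance) for each regressor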
5. Remedial Measures

• Do Nothing:
• A priori information.
• Combining cross-sectional and time series data.
• Dropping a variable(s) and specification bias
• Transformation of variables
• Additional or new data.
Do Nothing
• Multicollinearity is essentially a data
deficiency problem and sometimes we have no
choice over the data we have available for
empirical analysis.
A priori information
Ex: Suppose we consider the Cobb-Douglas production function of a country:
Yt = A Lt^α Kt^β e^(ut)
or, in log form:
lnYt = lnA + α lnLt + β lnKt + ut
- High correlation between K and L leads to large variances of the coefficient estimators.
- Based on findings in the prior literature, we know that the country has constant returns to scale: α + β = 1.
A priori information
Replacing β with 1 − α, we obtain:
lnYt = lnA + α lnLt + (1 − α) lnKt + ut
or
ln(Yt/Kt) = lnA + α ln(Lt/Kt) + ut
where Yt/Kt is the output-capital ratio and Lt/Kt is the labour-capital ratio.
We estimate α̂ from this regression and then compute β̂ = 1 − α̂.
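A minimal Stata sketch of estimation under the constraint (the variable names Y, L, and K are hypothetical):

* impose constant returns to scale by estimating the ratio form
gen ln_y_k = ln(Y/K)            // log of the output-capital ratio
gen ln_l_k = ln(L/K)            // log of the labour-capital ratio
regress ln_y_k ln_l_k           // the slope estimate is alpha-hat
display "beta-hat = " 1 - _b[ln_l_k]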
Combining cross-sectional and time series data
• Examine the demand for automobiles:
lnYt = β1 + β2 lnPt + β3 lnIt + ut
where Y = number of cars sold, P = average price, I = income, t = time.
• We want to estimate the price elasticity β2 and the income elasticity β3.
• In time series data, the price and income variables tend to be highly collinear.
Combining cross-sectional and time series data
• If we have cross-sectional data, for example data generated by consumer panels or budget studies conducted by various private and governmental agencies, we can obtain a fairly reliable estimate of the income elasticity β3 (call it β̂3), because prices do not vary much in a cross section.
• Time series regression:
Y*t = β1 + β2 lnPt + ut
where Y*t = lnYt − β̂3 lnIt, i.e., Y adjusted for the income effect using the cross-sectional estimate β̂3.
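A hypothetical Stata sketch of the two-step procedure (the variable names Y, P, I and the value 0.9 for the cross-sectional income elasticity are purely illustrative):

* step 2: purge the income effect with the cross-sectional estimate,
* then estimate the price elasticity from the time series
scalar b3_cross = 0.9                    // income elasticity taken from a cross-sectional study
gen ystar = ln(Y) - b3_cross*ln(I)       // Y adjusted for the income effect
gen lnP = ln(P)
regress ystar lnP                        // the slope estimate is the price elasticity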
Dropping a variable(s) and specification bias
- When we drop the wealth variable, the income variable becomes highly significant.
- But we may be committing a specification bias or specification error. Economic theory says that both income and wealth should be included in the model explaining consumption expenditure, so dropping the wealth variable would constitute specification bias.
- The remedy may be worse than the disease: multicollinearity may prevent precise estimation of the parameters, whereas omitting a variable may seriously mislead us as to the true values of the parameters.
Transformation of variables - first difference form
Regression model:
Yt = β1 + β2X2t + β3X3t + ut
It must also hold at time (t − 1):
Yt−1 = β1 + β2X2,t−1 + β3X3,t−1 + ut−1
Subtracting, we obtain the first-difference form:
ΔYt = β2ΔX2t + β3ΔX3t + vt, where vt = ut − ut−1
→ The first-difference regression model often reduces the severity of multicollinearity.
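A minimal Stata sketch of the first-difference regression (the variable names y, x2, x3 and the time variable t are hypothetical):

tsset t                              // declare the time variable
regress D.y D.x2 D.x3, noconstant    // regress delta-Y on delta-X2 and delta-X3; the derived model has no intercept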
Additional or new data
• Increasing the size of the sample may attenuate the collinearity problem.
• As the sample size increases, Σx2i² will generally increase → var(β̂2) = σ² / [Σx2i²(1 − r23²)] will decrease.
