
CH 10

MULTICOLLINEARITY: WHAT HAPPENS IF THE REGRESSORS ARE CORRELATED?

An assumption of the classical linear regression model (CLRM) is that there is no multicollinearity among the regressors included in the regression model. In this chapter we take a critical look at this assumption by seeking answers to the following questions:
1. What is the nature of multicollinearity?
2. Is multicollinearity really a problem?
3. What are its practical consequences?
4. How does one detect it?
5. What remedial measures can be taken to alleviate the problem of multicollinearity?
THE NATURE OF MULTICOLLINEARITY
The term multicollinearity is due to Ragnar Frisch. Originally it meant the existence of a
“perfect,” or exact, linear relationship among some or all explanatory variables of a regression
model. For the k-variable regression involving explanatory variables X1, X2, . . . , Xk (where X1 = 1
for all observations to allow for the intercept term), an exact linear relationship is said to exist if
the following condition is satisfied:
λ1X1 + λ2X2 + · · · + λkXk = 0        (1)

where λ1, λ2, . . . , λk are constants such that not all of them are zero simultaneously.

As a quick illustration, consider the following small data set, in which X3 = 3X2 exactly, X5 = 2X2 + Ui (Ui a random number), and X4 = X2²:

X2    X3 = 3X2    X5 = 2X2 + Ui    X4 = X2²
 2        6             5               4
 3        9             8               9
 9       27            23              81

Here r23 = 1 (a perfect linear relationship), while r25 = 0.9997 and r24 = 0.9972 (highly linear, but not perfect, relationships).

Today, however, the term multicollinearity is used in a broader sense to include the case of
perfect multicollinearity, as shown by (1), as well as the case where the X variables are
intercorrelated but not perfectly so, as follows:
λ1X1 + λ2X2 + · · · + λkXk + vi = 0
where vi is a stochastic error term.

As a numerical example, consider the following hypothetical data:


It is apparent that X3i = 5X2i . Therefore, there is perfect collinearity between X2 and X3 since the
coefficient of correlation r23 is unity. The variable X*3 was created from X3 by simply adding to it
the following numbers, which were taken from a table of random numbers: 2, 0, 7, 9, 2. Now
there is no longer perfect collinearity between X2 and X*3. However, the two variables are highly
correlated because calculations will show that the coefficient of correlation between them is
0.9959.
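To make the distinction concrete, here is a minimal Python sketch (using made-up numbers, not the chapter's table) showing that an exact relationship such as X3 = 5X2 yields a correlation of exactly 1, while adding small random numbers pushes the correlation just below 1:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values, chosen only for illustration
X2 = np.array([4.0, 7.0, 11.0, 16.0, 22.0])
X3 = 5 * X2                                        # exact relationship: X3 = 5 * X2
X3_star = X3 + rng.integers(0, 10, size=X2.size)   # add small random numbers to X3

print(np.corrcoef(X2, X3)[0, 1])       # exactly 1: perfect collinearity
print(np.corrcoef(X2, X3_star)[0, 1])  # just below 1: high but imperfect collinearity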

If multicollinearity is perfect, as between X2 and X3 above, the regression coefficients of the X variables are indeterminate and their standard errors are infinite. If multicollinearity is less than perfect, as between X2 and X*3, the regression coefficients, although determinate, possess large standard errors (in relation to the coefficients themselves), which means the coefficients cannot be estimated with great precision or accuracy.
There are several sources of multicollinearity. As Montgomery and Peck note, multicollinearity may be due to the following factors:
1. The data collection method employed, for example, sampling over a limited range of the
values taken by the regressors in the population.
2. Constraints on the model or in the population being sampled. For example, in the regression
of electricity consumption on income (X2) and house size (X3) there is a physical constraint in
the population in that families with higher incomes generally have larger homes than families
with lower incomes.
3. Model specification, for example, adding polynomial terms to a regression model, especially when the range of the X variable is small, as in Y = b1 + b2X + b3X² + b4X³ + ui.

4. An overdetermined model. This happens when the model has more explanatory variables than
the number of observations.

PRACTICAL CONSEQUENCES OF MULTICOLLINEARITY


In cases of near or high multicollinearity, one is likely to encounter the following consequences:
1. Although BLUE, the OLS estimators have large variances and covariances, making precise
estimation difficult.
2. Because of consequence 1, the confidence intervals tend to be much wider, leading to the
acceptance of the “zero null hypothesis” (i.e., the true population coefficient is zero) more
readily.
3. Also because of consequence 1, the t ratio of one or more coefficients tends to be statistically
insignificant.
4. Although the t ratio of one or more coefficients is statistically insignificant, R2, the overall
measure of goodness of fit, can be very high.
5. The OLS estimators and their standard errors can be sensitive to small changes in the data.
Large Variances and Covariances of OLS Estimators
To see this, recall that for the two-regressor model Yi = β1 + β2X2i + β3X3i + ui the variances and covariances of βˆ2 and βˆ3 are given by

var(βˆ2) = σ² / [Σx2i² (1 − r23²)]
var(βˆ3) = σ² / [Σx3i² (1 − r23²)]
cov(βˆ2, βˆ3) = −r23 σ² / [(1 − r23²) √(Σx2i²) √(Σx3i²)]

where r23 is the coefficient of correlation between X2 and X3 and the lowercase x's denote deviations from sample means. Since a coefficient's t ratio is t = βˆ/se(βˆ), an inflated standard error translates directly into a depressed t ratio.

The speed with which variances and covariances increase can be seen with the variance-inflating factor (VIF), which is defined as

VIF = 1 / (1 − r23²)

The VIF shows how the variance of an estimator is inflated by the presence of multicollinearity. As r23² approaches 1, the VIF approaches infinity. For example, r23 = 0.90 gives VIF ≈ 5.3, while r23 = 0.99 gives VIF ≈ 50.
For the k-variable model this generalizes to

var(βˆj) = (σ² / Σxj²) · VIFj

where Σxj² is the sum of squared deviations of Xj about its mean. As this expression shows, var(βˆj) is proportional to σ² and to VIFj but inversely proportional to Σxj². Thus, whether var(βˆj) is large or small depends on three ingredients: (1) σ², (2) VIFj, and (3) Σxj². The last of these, which ties in with Assumption 8 of the classical model, says that the larger the variability in a regressor, the smaller the variance of that regressor's coefficient, holding the other two ingredients constant, and therefore the greater the precision with which that coefficient can be estimated.
Before proceeding further, it may be noted that the inverse of the VIF is called tolerance (TOL). That is,

TOLj = 1 / VIFj = 1 − Rj²

where Rj² is the coefficient of determination in the regression of Xj on the remaining regressors. When Rj² = 1 (i.e., perfect collinearity), TOLj = 0, and when Rj² = 0 (i.e., no collinearity whatsoever), TOLj = 1. Because of the intimate connection between VIF and TOL, one can use them interchangeably.
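As a small illustrative sketch (not part of the chapter), the definitions above translate directly into a helper that converts an auxiliary-regression Rj² into VIFj and TOLj:

def vif_and_tol(r_squared_j):
    """Return (VIF_j, TOL_j) given the auxiliary-regression R^2_j."""
    if r_squared_j >= 1.0:
        return float("inf"), 0.0        # perfect collinearity: variance blows up
    vif = 1.0 / (1.0 - r_squared_j)     # VIF_j = 1 / (1 - R^2_j)
    return vif, 1.0 / vif               # TOL_j = 1 / VIF_j = 1 - R^2_j

print(vif_and_tol(0.0))     # (1.0, 1.0): no collinearity at all
print(vif_and_tol(0.90))    # about (10, 0.10): the usual warning threshold
print(vif_and_tol(0.99))    # about (100, 0.01): severe collinearity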

To illustrate the various points made thus far, let us reconsider the consumption–income example. In the accompanying table we reproduce the earlier consumption–income data and add data on the wealth of the consumer. If we assume that consumption expenditure is linearly related to income and wealth, we obtain the corresponding regression of consumption on income and wealth.

That regression shows that income and wealth together explain about 96 percent of the variation in consumption expenditure, and yet neither of the slope coefficients is individually statistically significant. Moreover, the wealth variable is not only statistically insignificant but also carries the wrong sign.

DETECTION OF MULTICOLLINEARITY
Having studied the nature and consequences of multicollinearity, the natural question is: How does one know that collinearity is present in any given situation, especially in models involving more than two explanatory variables? Here it is useful to bear in mind Kmenta's warning:
1. Multicollinearity is a question of degree and not of kind. The meaningful distinction is not between the presence and the absence of multicollinearity, but between its various degrees.
2. Since multicollinearity refers to the condition of the explanatory variables, which are assumed to be nonstochastic, it is a feature of the sample and not of the population. Therefore, we do not "test for multicollinearity" but can, if we wish, measure its degree in any particular sample.
Since multicollinearity is essentially a sample phenomenon, arising out of the largely nonexperimental data collected in most social sciences, we do not have one unique method of detecting it or measuring its strength. What we have are some rules of thumb, some informal and some formal, but rules of thumb all the same. We now consider some of these rules.

1. High R2 but few significant t ratios. As noted, this is the “classic” symptom of
multicollinearity. If R2 is high, say, in excess of 0.8, the F test in most cases will reject the
hypothesis that the partial slope coefficients are simultaneously equal to zero, but the individual t
tests will show that none or very few of the partial slope coefficients are statistically different
from zero.
Example
The data are time series for the years 1947–1962 and pertain to Y = number of people employed, in thousands; X1 = GNP implicit price deflator; X2 = GNP, in millions of dollars; X3 = number of people unemployed, in thousands; X4 = number of people in the armed forces; X5 = noninstitutionalized population over 14 years of age; and X6 = year, equal to 1 in 1947, 2 in 1948, and so on up to 16 in 1962.

Yt = β1 + β2X1t + β3X2t + β4X3t + β5X4t + β6X5t + β7X6t + ut
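As a sketch of how this regression might be reproduced, assuming the statsmodels package is available (it distributes a copy of the Longley data set):

import statsmodels.api as sm

# The Longley employment data: annual observations, 1947-1962 (n = 16)
data = sm.datasets.longley.load_pandas()
y = data.endog                    # total employment
X = sm.add_constant(data.exog)    # deflator, GNP, unemployment, armed forces, population, year

fit = sm.OLS(y, X).fit()
print(fit.summary())              # a very high R-squared, yet several insignificant t ratios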


2. High pair-wise correlations among regressors. Another suggested rule of thumb is that if
the pair-wise or zero-order correlation coefficient between two regressors is high, say, in excess
of 0.8, then multicollinearity is a serious problem. The problem with this criterion is that,
although high zero-order correlations may suggest collinearity, it is not necessary that they be
high to have collinearity in any specific case.

As you can see from the correlation matrix of these regressors, several of the pair-wise correlations are quite high, suggesting that there may be a severe collinearity problem. Of course, remember the warning given earlier that such pair-wise correlations may be a sufficient but not a necessary condition for the existence of multicollinearity.
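Assuming the statsmodels copy of the Longley data from the sketch above, the zero-order correlation matrix of the regressors can be inspected in one line with pandas:

import statsmodels.api as sm

exog = sm.datasets.longley.load_pandas().exog
print(exog.corr().round(3))   # several pair-wise correlations are well above the 0.8 rule of thumb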
3. Examination of partial correlations. Because of the problem just mentioned in relying on zero-order correlations, Farrar and Glauber have suggested that one should look at the partial correlation coefficients. Thus, in the regression of Y on X2, X3, and X4, a finding that R²1.234 is very high but r²12.34, r²13.24, and r²14.23 are comparatively low may suggest that the variables X2, X3, and X4 are highly intercorrelated and that at least one of these variables is superfluous. Although a study of the partial correlations may be useful, there is no guarantee that they will provide an infallible guide to multicollinearity, for it may happen that both R² and all the partial correlations are sufficiently high.

4. Auxiliary regressions. Since multicollinearity arises because one or more of the regressors are exact or approximate linear combinations of the other regressors, one way of finding out which X variable is related to the other X variables is to regress each Xj on the remaining X variables and compute the corresponding coefficient of determination, Rj²; each of these regressions is called an auxiliary regression. One may then adopt Klein's rule of thumb, which suggests that multicollinearity may be a troublesome problem only if the R² obtained from an auxiliary regression is greater than the overall R², that is, the one obtained from the regression of Y on all the regressors. Of course, like all other rules of thumb, this one should be used judiciously.

5. Tolerance and variance inflation factor. We have already introduced TOL and VIF. As Rj², the coefficient of determination in the regression of regressor Xj on the remaining regressors in the model, increases toward unity, that is, as the collinearity of Xj with the other regressors increases, VIF also increases, and in the limit it can be infinite. Some authors therefore use the VIF as an indicator of multicollinearity: the larger the value of VIFj, the more "troublesome" or collinear the variable Xj. As a rule of thumb, if the VIF of a variable exceeds 10, which will happen if Rj² exceeds 0.90, that variable is said to be highly collinear. Of course, one could also use TOLj as a measure of multicollinearity in view of its intimate connection with VIFj. The closer TOLj is to zero, the greater the degree of collinearity of that variable with the other regressors; on the other hand, the closer TOLj is to 1, the greater the evidence that Xj is not collinear with the other regressors.
To shed further light on the nature of the multicollinearity problem in the Longley data, let us run the auxiliary regressions, that is, the regression of each X variable on the remaining X variables. To save space, only the R² values obtained from these regressions are reported. Since the R² values in the auxiliary regressions are very high (with the possible exception of the regression of X4 on the remaining X variables), it seems that we do have a serious collinearity problem. The same information is obtained from the tolerance factors: as noted previously, the closer the tolerance factor is to zero, the greater is the evidence of collinearity. Applying Klein's rule of thumb, we see that the R² values obtained from the auxiliary regressions exceed the overall R² value (that is, the one obtained from the regression of Y on all the X variables) of 0.9954 in 3 out of the 6 auxiliary regressions, again suggesting that the Longley data are indeed plagued by multicollinearity.
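These diagnostics can also be computed directly. The following sketch (again assuming statsmodels and its copy of the Longley regressors) reports, for each regressor, the auxiliary-regression Rj², VIFj, and TOLj; the variance_inflation_factor helper fits each auxiliary regression internally:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(sm.datasets.longley.load_pandas().exog)

rows = []
for j, name in enumerate(X.columns):
    if name == "const":
        continue                                # skip the intercept column
    vif = variance_inflation_factor(X.values, j)
    rows.append({"regressor": name,
                 "aux_R2": 1.0 - 1.0 / vif,     # R^2_j of Xj on the other regressors
                 "VIF": vif,
                 "TOL": 1.0 / vif})

print(pd.DataFrame(rows).round(4))              # VIFs far above 10 signal severe collinearity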

REMEDIAL MEASURES
What can be done if multicollinearity is serious? We have two choices:
(1) do nothing or (2) follow some rules of thumb.
Do Nothing
The "do nothing" school of thought is expressed by Blanchard as follows:
When students run their first ordinary least squares (OLS) regression, the first problem that they usually encounter is that of multicollinearity. Many of them conclude that there is something wrong with OLS; some resort to new and often creative techniques to get around the problem. But, we tell them, this is wrong. Multicollinearity is God's will, not a problem with OLS or statistical technique in general.
Rule-of-Thumb Procedures
1. A priori information. Suppose we consider the model
Yi = β1 + β2X2i + β3 X3i + ui
where Y = consumption, X2 = income, and X3 = wealth. As noted before, income and wealth
variables tend to be highly collinear. But suppose a priori we believe that β3 = 0.10β2; that is, the
rate of change of consumption with respect to wealth is one-tenth the corresponding rate with
respect to income. We can then run the following regression:
Yi = β1 + β2X2i + 0.10β2X3i + ui

Yi = β1 + β2 (X2i + 0.10X3i) + ui

Yi = β1 + β2 Zi + ui
where Zi = X2i + 0.1X3i. Once we obtain βˆ2 from this regression, we can estimate βˆ3 from the postulated relationship between β2 and β3, namely βˆ3 = 0.10βˆ2.
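A minimal sketch of this device, using hypothetical consumption, income, and wealth series (the numbers below are invented purely so that income and wealth are nearly collinear):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Hypothetical data in which wealth is roughly ten times income
income = np.linspace(80.0, 260.0, 25)                         # X2
wealth = 10.0 * income + rng.normal(0.0, 20.0, income.size)   # X3
consumption = 25.0 + 0.45 * income + 0.045 * wealth + rng.normal(0.0, 5.0, income.size)

# Impose the prior beta3 = 0.10 * beta2 by forming Z = X2 + 0.10 * X3 and regressing Y on Z
Z = income + 0.10 * wealth
fit = sm.OLS(consumption, sm.add_constant(Z)).fit()

beta2_hat = fit.params[1]         # estimate of beta2
beta3_hat = 0.10 * beta2_hat      # beta3 recovered from the postulated relationship
print(beta2_hat, beta3_hat)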
2. Combining cross-sectional and time series data.
3. Dropping a variable(s) and specification bias. When faced with severe multicollinearity, one of the "simplest" things to do is to drop one of the collinear variables. Thus, in our consumption–income–wealth illustration, when we drop the wealth variable, we obtain a regression which shows that, whereas in the original model the income variable was statistically insignificant, it is now "highly" significant. But in dropping a variable from the model we may be committing a specification bias or specification error.

4. Transformation of variables.

Example
Let us reconsider our original model.
First of all, we could express GNP not in nominal terms, but in real terms, which we can do by
dividing nominal GNP by the implicit price deflator. Second, since noninstitutional population
over 14 years of age grows over time because of natural population growth, it will be highly
correlated with time, the variable X6 in our model. Therefore, instead of keeping both these
variables, we will keep the variable X5 and drop X6. Third, there is no compelling reason to
include X3, the number of people unemployed; perhaps the unemployment rate would have been
a better measure of labor market conditions. But we have no data on the latter. So, we will drop
the variable X3. Making these changes, we obtain the corresponding regression results (RGNP = real GNP).
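A sketch of these transformations, assuming the statsmodels copy of the Longley data (whose regressor columns are named GNPDEFL, GNP, UNEMP, ARMED, POP, and YEAR):

import statsmodels.api as sm

data = sm.datasets.longley.load_pandas()
X = data.exog.copy()

# Real GNP: deflate nominal GNP by the implicit price deflator (an index around 100)
X["RGNP"] = X["GNP"] / (X["GNPDEFL"] / 100.0)

# Keep armed forces (X4) and population (X5); drop unemployment (X3) and the time trend (X6)
X_new = sm.add_constant(X[["RGNP", "ARMED", "POP"]])

print(sm.OLS(data.endog, X_new).fit().summary())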

5. Additional or new data.


6. Other methods of remedying multicollinearity. Multivariate statistical techniques such as
factor analysis and principal components or techniques such as ridge regression are often
employed to “solve” the problem of multicollinearity. Unfortunately, these techniques are
beyond the scope of this book, for they cannot be discussed competently without resorting to
matrix algebra.

Skill Development
Questions 10.2, 10.3, 10.5, 10.7, 10.10, 10.11, 10.12, 10.19, 10.24, and 10.25.
