6 Multicollinearity
The term multicollinearity is due to Ragnar Frisch. Originally it meant the existence of a “perfect,” or exact,
linear relationship among some or all explanatory variables of a regression model.
Perfect Multicollinearity
For the k-variable regression involving the explanatory variables $X_1, X_2, \ldots, X_k$, an exact linear relationship is said to exist if
$$\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k = 0$$
where $\lambda_1, \lambda_2, \ldots, \lambda_k$ are constants, not all zero simultaneously. Assuming $\lambda_2 \neq 0$, this can be rewritten as
$$X_2 = -\frac{\lambda_1}{\lambda_2} X_1 - \frac{\lambda_3}{\lambda_2} X_3 - \cdots - \frac{\lambda_k}{\lambda_2} X_k$$
which shows how $X_2$ is exactly linearly related to the other variables, or how it can be derived from a linear
combination of the other X variables.
Consider the following regression model:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i$$
with the data

X1i    X2i    X3i    Yi
2      -4      8     58
4      -2     10     42
7      13      1     75
1     -10     12     77
6       7      5     49
3      -2      8     66

For every observation in this data set,
$$X_2 + X_3 = 2X_1 \quad\Longleftrightarrow\quad X_1 = \frac{X_2 + X_3}{2}$$
so one regressor is an exact linear combination of the other two: the model suffers from perfect multicollinearity.
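The singularity implied by $X_2 + X_3 = 2X_1$ is easy to verify numerically. Below is a minimal NumPy sketch (the variable names are ours, not part of the original example) that builds the design matrix from the table above and confirms that $X'X$ is rank deficient, so $(X'X)^{-1}$, and hence the OLS estimator, cannot be computed.

```python
import numpy as np

# Data from the table above; note the exact relation X2 + X3 = 2*X1
X1 = np.array([2, 4, 7, 1, 6, 3], dtype=float)
X2 = np.array([-4, -2, 13, -10, 7, -2], dtype=float)
X3 = np.array([8, 10, 1, 12, 5, 8], dtype=float)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(X1), X1, X2, X3])
XtX = X.T @ X

print("rank of X'X:", np.linalg.matrix_rank(XtX))  # 3, not 4: rank deficient
print("det  of X'X:", np.linalg.det(XtX))          # ~0 up to floating-point rounding
```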
Less than Perfect Multicollinearity
The case where the X variables are intercorrelated but not perfectly so, as follows:
$$\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k + v_i = 0$$
where $\lambda_1, \lambda_2, \ldots, \lambda_k$ are constants such that not all of them are zero simultaneously and $v_i$ is a stochastic error term.
To illustrate with the data above, suppose we add to $X_3$ the following numbers taken from a table of random numbers: 3, 7, 0, 5, 1, -5. Now there is no longer an exact linear relationship among the regressors, since $X_2 + X_3$ no longer equals $2X_1$ exactly; the variables remain highly intercorrelated, however, as the calculations sketched below will show.
Note that multicollinearity refers only to linear relationships among the X variables. Terms such as $X_i^2$ and $X_i^3$ are obviously functionally related to $X_i$, but the relationship is nonlinear, so strictly they do not violate the no-multicollinearity assumption.
If multicollinearity is perfect, the regression coefficients of the X variables are indeterminate and their standard errors are infinite. If multicollinearity is less than perfect,
as in $\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k + v_i = 0$, the regression coefficients, although determinate, possess large
standard errors (in relation to the coefficients themselves), which means the coefficients cannot be estimated
with great precision or accuracy.
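Continuing the numerical sketch above (again plain NumPy with our own variable names), adding the random numbers 3, 7, 0, 5, 1, -5 to $X_3$ destroys the exact relation $X_2 + X_3 = 2X_1$: $X'X$ becomes invertible, but the pairwise correlations remain far from zero, so the collinearity problem persists in a less-than-perfect form.

```python
import numpy as np

# Same data as before, with the random numbers 3, 7, 0, 5, 1, -5 added to X3
X1 = np.array([2, 4, 7, 1, 6, 3], dtype=float)
X2 = np.array([-4, -2, 13, -10, 7, -2], dtype=float)
X3 = np.array([8, 10, 1, 12, 5, 8], dtype=float) + np.array([3, 7, 0, 5, 1, -5], dtype=float)

X = np.column_stack([np.ones_like(X1), X1, X2, X3])

print("rank of X'X:", np.linalg.matrix_rank(X.T @ X))         # 4: X'X is now invertible
print("pairwise correlations:\n", np.corrcoef([X1, X2, X3]))  # still far from an identity matrix
```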
Consequences of Multicollinearity
In cases of near or high multicollinearity, one is likely to encounter the following consequences:
1. With exact linear relationships among the explanatory variables, the condition of exact collinearity, or
exact multicollinearity, exists and the least-squares estimator is not defined. That means $X'X$ is singular
and estimation of the coefficients and standard errors is not possible.
2. For variables that are highly related to one another (but not perfectly related), the OLS (Ordinary Least
Squares) estimators have large variances and covariances, making precise estimation difficult.
3. Because of the consequences of point 2, confidence intervals tend to be much wider, leading to the
acceptance of the null hypothesis more readily. This is due to the relatively large standard error. The
standard error is based, in part, on the correlation between the variables in the model.
4. Although the t ratio of one or more of the coefficients is more likely to be insignificant with
multicollinearity, the R2 value for the model can still be relatively high.
5. The OLS estimators and their standard errors can be sensitive to small changes in the data. In other
words, the results will not be robust; the simulation sketched below illustrates this.
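Consequences 2 and 5 can be made concrete with a short simulation. The sketch below is illustrative only: the sample size, the true coefficients, and the degree of collinearity are assumptions of ours, not values from the text. Re-drawing the error term shows that the individual slope estimates fluctuate widely while their sum is pinned down quite precisely, which is exactly the large-variance behaviour described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1 (correlation close to 1)

def ols(y):
    X = np.column_stack([np.ones(n), x1, x2])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Re-draw only the error term: the individual slopes bounce around,
# while their sum (true value 2 + 3 = 5) is estimated much more stably.
for rep in range(4):
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)
    b0, b1, b2 = ols(y)
    print(f"b1 = {b1:6.2f}   b2 = {b2:6.2f}   b1 + b2 = {b1 + b2:5.2f}")
```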
Consider the following regression equation,
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$$
In deviation form (lowercase x denoting deviations from the sample means), the inverse of the $X'X$ matrix is
$$(X'X)^{-1} = \begin{bmatrix} \dfrac{1}{\sum x_{1i}^2\,(1-r_{12}^2)} & \dfrac{-r_{12}}{(1-r_{12}^2)\sqrt{\sum x_{1i}^2 \sum x_{2i}^2}} \\[2ex] \dfrac{-r_{12}}{(1-r_{12}^2)\sqrt{\sum x_{1i}^2 \sum x_{2i}^2}} & \dfrac{1}{\sum x_{2i}^2\,(1-r_{12}^2)} \end{bmatrix}$$
where $r_{12}$ is the coefficient of correlation between $X_1$ and $X_2$. Since
$$\operatorname{Var\text{-}Cov}(\hat{\beta}_j) = \sigma^2 (X'X)^{-1}$$
it follows that
$$\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum x_{1i}^2\,(1-r_{12}^2)}$$
$$\operatorname{Var}(\hat{\beta}_2) = \frac{\sigma^2}{\sum x_{2i}^2\,(1-r_{12}^2)}$$
$$\operatorname{Cov}(\hat{\beta}_1, \hat{\beta}_2) = \frac{-r_{12}\,\sigma^2}{(1-r_{12}^2)\sqrt{\sum x_{1i}^2 \sum x_{2i}^2}}$$
As $r_{12}$ approaches $\pm 1$, these variances and the covariance increase without limit.
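Holding $\sigma^2$ and $\sum x_i^2$ fixed, each of these variances is proportional to $1/(1 - r_{12}^2)$. The few lines below (the $r_{12}$ values are arbitrary illustrations) show how quickly that factor explodes as the correlation approaches one.

```python
# The multiplier 1 / (1 - r12**2) implied by the variance formulas above
for r12 in (0.0, 0.5, 0.9, 0.95, 0.99, 0.999):
    print(f"r12 = {r12:<6}  1/(1 - r12^2) = {1.0 / (1.0 - r12 ** 2):7.1f}")
```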
Detection of Multicollinearity
1. High R2 but few significant t ratios.
If R2 is high, the F test will in most cases reject the hypothesis that the partial slope coefficients
are simultaneously equal to zero, but the individual t tests will show that none or very few of the
partial slope coefficients are statistically different from zero.
2. High pair-wise correlations among regressors.
3. Examination of partial correlations.
4. Tolerance and variance inflation factor.
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the coefficient of determination from the auxiliary regression of $X_j$ on the remaining regressors; the tolerance is its reciprocal, $\mathrm{TOL}_j = 1 - R_j^2$.
5. Condition number.
$$K = \frac{\lambda_{\max}}{\lambda_{\min}}$$
where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues of the $X'X$ matrix. Both of these last two diagnostics are computed in the sketch following this list.
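Diagnostics 4 and 5 can be computed with a few lines of NumPy. In the sketch below the data are simulated (the construction of x2 as a near copy of x1 is our assumption, made only to produce visible collinearity); each VIF comes from the R2 of an auxiliary regression of one column on the others, and K is the ratio of the extreme eigenvalues of $X'X$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # deliberately near-collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    y = X[:, j]
    Z = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    yhat = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r2 = 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

print("VIFs:", [round(vif(X, j), 1) for j in range(X.shape[1])])

eigvals = np.linalg.eigvalsh(X.T @ X)       # eigenvalues of X'X
print("condition number K =", eigvals.max() / eigvals.min())
```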
Sources of Multicollinearity
There are four primary sources of multicollinearity:
01. The data collection method employed
The data collection method can lead to multicollinearity problems when the analyst samples only a
subspace of the region of the regressors defined, approximately, by the equation below.
Let the jth column of the X matrix be denoted by $X_j$, so that $X = [X_1, X_2, \ldots, X_k]$; thus $X_j$ contains the n
levels of the jth regressor variable. Formally, multicollinearity can be defined as the linear dependence of the
columns of X. The vectors are linearly dependent if there is a set of constants $t_1, t_2, \ldots, t_k$, not all zero, such
that
$$\sum_{j=1}^{k} t_j X_j = 0 \qquad \text{(A)}$$
If Equation (A) holds exactly for a subset of the columns of X, then the rank of the $X'X$ matrix is
less than k and $(X'X)^{-1}$ does not exist. However, suppose Equation (A) is only approximately true for some
subset of the columns of X. Then there will be a near linear dependency in $X'X$ and the problem of
multicollinearity is said to exist. Note that multicollinearity is a form of ill-conditioning of the
$X'X$ matrix. Furthermore, the problem is one of degree; that is, every data set will suffer from
multicollinearity to some extent unless the columns of X are orthogonal. The presence of multicollinearity
can make the usual least-squares analysis of the regression model dramatically inadequate.
02. Constraints on the model or in the population.
Constraints on the model or in the population being sampled can cause multicollinearity. For example,
suppose an electric utility is investigating the effect of family income (in thousands of rupees per month)
and house size (in square meters) on residential electricity consumption.
In this example a physical constraint in the population has caused the phenomenon: families with
higher incomes generally have larger homes than families with lower incomes. When physical
constraints such as this are present, multicollinearity will exist regardless of the sampling method employed.
03. Model specification.
Multicollinearity may also be induced by the choice of model. We know that adding polynomial terms to a
regression model causes ill-conditioning of the $X'X$ matrix. Furthermore, if the range of x is small, adding
an $x^2$ term can result in significant multicollinearity, as the sketch below illustrates.
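A two-line check (illustrative ranges chosen by us) shows the effect: over a wide range x and $x^2$ are correlated, but over a narrow range far from zero the correlation is driven essentially to one.

```python
import numpy as np

wide = np.linspace(0.0, 10.0, 50)     # x varies over a wide range
narrow = np.linspace(10.0, 11.0, 50)  # x varies over a narrow range

print("corr(x, x^2), wide range:  ", np.corrcoef(wide, wide ** 2)[0, 1])
print("corr(x, x^2), narrow range:", np.corrcoef(narrow, narrow ** 2)[0, 1])  # much closer to 1
```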
04. An overdefined model.
An overdefined model has more regressor variables than observations. These models are
sometimes encountered in medical and behavioral research, where there may be only a small number of
subjects (sample units) available, and information is collected for a large number of regressors on each
subject. The usual approach to dealing with the multicollinearity in this context is to eliminate some of the
regressor variables from consideration.
Remedies of Multicollinearity
01. Dropping a variable(s) and specification bias.
When faced with severe multicollinearity, one of the “simplest” things to do is to drop one of the collinear
variables. Thus, in our consumption–income–wealth illustration,
$$\text{Consumption } (Y) = \beta_0 + \beta_1\,\text{Income} + \beta_2\,\text{Wealth} + u$$
we could simply drop the wealth variable.
But in dropping a variable from the model we may be committing a specification bias or specification error.
Specification bias arises from incorrect specification of the model used in the analysis. Thus, if economic theory
says that income and wealth should both be included in the model explaining the consumption expenditure,
dropping the wealth variable would constitute specification bias.
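The trade-off can be seen in a small simulation. Everything below is assumed for illustration (the true coefficients, the strength of the income-wealth link, the noise levels); the point is simply that dropping wealth removes the collinearity but biases the income coefficient when wealth truly belongs in the model.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
income = rng.normal(50, 10, size=n)
wealth = 5.0 * income + rng.normal(0, 10, size=n)   # wealth strongly tied to income
cons = 10 + 0.6 * income + 0.05 * wealth + rng.normal(0, 5, size=n)

def fit(y, *cols):
    X = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("income coef, full model:    ", fit(cons, income, wealth)[1])  # near the true 0.6
print("income coef, wealth dropped:", fit(cons, income)[1])          # absorbs the wealth effect (~0.6 + 0.05*5)
```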
02. Transformation of variables
Suppose we have time series data on consumption expenditure, income, and wealth. One reason for high
multicollinearity between income and wealth in such data is that over time both the variables tend to move
in the same direction. One way of minimizing this dependence is to proceed as follows.
If the relation
$$\text{Consumption } (Y_t) = \beta_0 + \beta_1\,\text{Income}_t + \beta_2\,\text{Wealth}_t + u_t$$
that is,
$$Y_t = \beta_0 + \beta_1 X_{1t} + \beta_2 X_{2t} + u_t \qquad \text{(a)}$$
holds at time t, it must also hold at time t - 1:
$$Y_{t-1} = \beta_0 + \beta_1 X_{1,t-1} + \beta_2 X_{2,t-1} + u_{t-1} \qquad \text{(b)}$$
Subtracting (b) from (a),
$$Y_t - Y_{t-1} = (\beta_0 - \beta_0) + \beta_1 (X_{1t} - X_{1,t-1}) + \beta_2 (X_{2t} - X_{2,t-1}) + (u_t - u_{t-1})$$
$$Y_t - Y_{t-1} = \beta_1 (X_{1t} - X_{1,t-1}) + \beta_2 (X_{2t} - X_{2,t-1}) + (u_t - u_{t-1})$$
Writing $Y_t^* = Y_t - Y_{t-1}$, $X_{1t}^* = X_{1t} - X_{1,t-1}$, $X_{2t}^* = X_{2t} - X_{2,t-1}$ and $V_t = u_t - u_{t-1}$, this becomes
$$Y_t^* = \beta_1 X_{1t}^* + \beta_2 X_{2t}^* + V_t$$
This equation is known as the first difference form because we run the regression, not on the original
variables, but on the differences of successive values of the variables.
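A quick simulated check (the trend slopes and noise levels are our assumptions) illustrates why the first difference form helps: two series that trend together are almost perfectly correlated in levels but only weakly correlated after differencing.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(100)
income = 100 + 2.0 * t + rng.normal(0, 5, size=100)    # trending series
wealth = 500 + 9.0 * t + rng.normal(0, 20, size=100)   # trends together with income

print("correlation in levels:           ", np.corrcoef(income, wealth)[0, 1])
print("correlation of first differences:", np.corrcoef(np.diff(income), np.diff(wealth))[0, 1])
```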
Another commonly used transformation in practice is the ratio transformation.
Consider the model
$$Y_t = \beta_0 + \beta_1 X_{1t} + \beta_2 X_{2t} + u_t$$
where Y is consumption expenditure in real dollars, X 1 is GDP, and X 2 is total population. Since GDP
and population grow over time, they are likely to be correlated. One “solution” to this problem is to
express the model on a per capita basis, that is, by dividing by X 2 , to obtain:
$$\frac{Y_t}{X_{2t}} = \beta_0\left(\frac{1}{X_{2t}}\right) + \beta_1\left(\frac{X_{1t}}{X_{2t}}\right) + \frac{u_t}{X_{2t}}$$
Such a transformation may reduce collinearity in the original variables.
03. Additional or new data.
Since
$$\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum x_{1i}^2\,(1 - r_{12}^2)}$$
and $\sum x_{1i}^2$ will generally increase as the sample size increases (why?), for any given $r_{12}$ the
variance of $\hat{\beta}_1$ will decrease as more data are added, thus decreasing the standard error and enabling us to estimate $\beta_1$ more
precisely.
04. Combining cross-sectional and time series data
05. Reducing collinearity in polynomial regressions.
06. Other methods (Ridge Regression or Principal Component Regression)
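As a pointer to the last remedy, here is a minimal ridge regression sketch in NumPy, assuming the usual closed-form estimator on centered data; the penalty value lam = 1 is arbitrary and would normally be chosen by cross-validation. Adding lam to the diagonal of $X'X$ keeps the matrix well conditioned even when the regressors are nearly collinear, at the cost of some shrinkage bias.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimator (X'X + lam*I)^(-1) X'y on centered data."""
    Xc = X - X.mean(axis=0)   # centering removes the need for an (unpenalised) intercept
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)          # nearly collinear pair
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

print("lam = 0 (plain OLS):", ridge(X, y, 0.0))
print("lam = 1 (ridge):    ", ridge(X, y, 1.0))   # coefficients shrunk relative to OLS
```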