6 Multicollinearity
The term multicollinearity is due to Ragnar Frisch. Originally it meant the existence of a “perfect,” or exact,
linear relationship among some or all explanatory variables of a regression model.
Perfect Multicollinearity
For the k-variable regression involving the explanatory variables $X_1, X_2, \ldots, X_k$, an exact linear relationship is said to exist if
$$\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k = 0$$
where $\lambda_1, \lambda_2, \ldots, \lambda_k$ are constants, not all zero simultaneously. Assuming $\lambda_2 \neq 0$, this can be rewritten as
$$X_2 = -\frac{\lambda_1}{\lambda_2} X_1 - \frac{\lambda_3}{\lambda_2} X_3 - \cdots - \frac{\lambda_k}{\lambda_2} X_k$$
which shows how $X_2$ is exactly linearly related to the other variables, or how it can be derived from a linear
combination of the other X variables.
Consider the following regression model:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i$$
with the data

X1i    X2i    X3i    Yi
2      -4      8     58
4      -2     10     42
7      13      1     75
1     -10     12     77
6       7      5     49
3      -2      8     66

For every observation in this data set,
$$X_2 + X_3 = 2X_1 \quad\Longleftrightarrow\quad X_1 = \frac{X_2 + X_3}{2}$$
so one regressor is an exact linear combination of the other two: the model suffers from perfect multicollinearity.
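The singularity implied by $X_2 + X_3 = 2X_1$ is easy to verify numerically. Below is a minimal NumPy sketch (the variable names are ours, not part of the original example) that builds the design matrix from the table above and confirms that $X'X$ is rank deficient, so $(X'X)^{-1}$, and hence the OLS estimator, cannot be computed.

```python
import numpy as np

# Data from the table above; note the exact relation X2 + X3 = 2*X1
X1 = np.array([2, 4, 7, 1, 6, 3], dtype=float)
X2 = np.array([-4, -2, 13, -10, 7, -2], dtype=float)
X3 = np.array([8, 10, 1, 12, 5, 8], dtype=float)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(X1), X1, X2, X3])
XtX = X.T @ X

print("rank of X'X:", np.linalg.matrix_rank(XtX))  # 3, not 4: rank deficient
print("det  of X'X:", np.linalg.det(XtX))          # ~0 up to floating-point rounding
```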
Less than Perfect Multicollinearity
The case where the X variables are intercorrelated but not perfectly so, as follows:
$$\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k + v_i = 0$$
where $\lambda_1, \lambda_2, \ldots, \lambda_k$ are constants such that not all of them are zero simultaneously and $v_i$ is a stochastic error term.
To illustrate with the data above, suppose we add to $X_3$ the following numbers taken from a table of random numbers: 3, 7, 0, 5, 1, -5. Now there is no longer an exact linear relationship among the regressors, since $X_2 + X_3$ no longer equals $2X_1$ exactly; the variables remain highly intercorrelated, however, as the calculations sketched below will show.
Note that multicollinearity refers only to linear relationships among the X variables. Terms such as $X_i^2$ and $X_i^3$ are obviously functionally related to $X_i$, but the relationship is nonlinear, so strictly they do not violate the no-multicollinearity assumption.
If multicollinearity is perfect, the regression coefficients of the X variables are indeterminate and their standard errors are infinite. If multicollinearity is less than perfect,
as in $\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k + v_i = 0$, the regression coefficients, although determinate, possess large
standard errors (in relation to the coefficients themselves), which means the coefficients cannot be estimated
with great precision or accuracy.
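Continuing the numerical sketch above (again plain NumPy with our own variable names), adding the random numbers 3, 7, 0, 5, 1, -5 to $X_3$ destroys the exact relation $X_2 + X_3 = 2X_1$: $X'X$ becomes invertible, but the pairwise correlations remain far from zero, so the collinearity problem persists in a less-than-perfect form.

```python
import numpy as np

# Same data as before, with the random numbers 3, 7, 0, 5, 1, -5 added to X3
X1 = np.array([2, 4, 7, 1, 6, 3], dtype=float)
X2 = np.array([-4, -2, 13, -10, 7, -2], dtype=float)
X3 = np.array([8, 10, 1, 12, 5, 8], dtype=float) + np.array([3, 7, 0, 5, 1, -5], dtype=float)

X = np.column_stack([np.ones_like(X1), X1, X2, X3])

print("rank of X'X:", np.linalg.matrix_rank(X.T @ X))         # 4: X'X is now invertible
print("pairwise correlations:\n", np.corrcoef([X1, X2, X3]))  # still far from an identity matrix
```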
Consequences of Multicollinearity
In cases of near or high multicollinearity, one is likely to encounter the following consequences:
1. With exact linear relationships among the explanatory variables, the condition of exact collinearity, or
exact multicollinearity, exists and the least-squares estimator is not defined. That means $X'X$ is singular
and estimation of the coefficients and standard errors is not possible.
2. For variables that are highly related to one another (but not perfectly related), the OLS (Ordinary Least
Squares) estimators have large variances and covariances, making precise estimation difficult.
3. Because of the consequences of point 2, confidence intervals tend to be much wider, leading to the
acceptance of the null hypothesis more readily. This is due to the relatively large standard error. The
standard error is based, in part, on the correlation between the variables in the model.
4. Although the t ratio of one or more of the coefficients is more likely to be insignificant with
multicollinearity, the R2 value for the model can still be relatively high.
5. The OLS estimators and their standard errors can be sensitive to small changes in the data. In other
words, the results will not be robust; the simulation sketched below illustrates this.
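Consequences 2 and 5 can be made concrete with a short simulation. The sketch below is illustrative only: the sample size, the true coefficients, and the degree of collinearity are assumptions of ours, not values from the text. Re-drawing the error term shows that the individual slope estimates fluctuate widely while their sum is pinned down quite precisely, which is exactly the large-variance behaviour described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1 (correlation close to 1)

def ols(y):
    X = np.column_stack([np.ones(n), x1, x2])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Re-draw only the error term: the individual slopes bounce around,
# while their sum (true value 2 + 3 = 5) is estimated much more stably.
for rep in range(4):
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)
    b0, b1, b2 = ols(y)
    print(f"b1 = {b1:6.2f}   b2 = {b2:6.2f}   b1 + b2 = {b1 + b2:5.2f}")
```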
Consider the following regression equation,
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$$
In deviation form (lowercase x denoting deviations from the sample means), the inverse of the $X'X$ matrix is
$$(X'X)^{-1} = \begin{bmatrix} \dfrac{1}{\sum x_{1i}^2\,(1-r_{12}^2)} & \dfrac{-r_{12}}{(1-r_{12}^2)\sqrt{\sum x_{1i}^2 \sum x_{2i}^2}} \\[2ex] \dfrac{-r_{12}}{(1-r_{12}^2)\sqrt{\sum x_{1i}^2 \sum x_{2i}^2}} & \dfrac{1}{\sum x_{2i}^2\,(1-r_{12}^2)} \end{bmatrix}$$
where $r_{12}$ is the coefficient of correlation between $X_1$ and $X_2$. Since
$$\operatorname{Var\text{-}Cov}(\hat{\beta}_j) = \sigma^2 (X'X)^{-1}$$
it follows that
$$\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum x_{1i}^2\,(1-r_{12}^2)}$$
$$\operatorname{Var}(\hat{\beta}_2) = \frac{\sigma^2}{\sum x_{2i}^2\,(1-r_{12}^2)}$$
$$\operatorname{Cov}(\hat{\beta}_1, \hat{\beta}_2) = \frac{-r_{12}\,\sigma^2}{(1-r_{12}^2)\sqrt{\sum x_{1i}^2 \sum x_{2i}^2}}$$
As $r_{12}$ approaches $\pm 1$, these variances and the covariance increase without limit.
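Holding $\sigma^2$ and $\sum x_i^2$ fixed, each of these variances is proportional to $1/(1 - r_{12}^2)$. The few lines below (the $r_{12}$ values are arbitrary illustrations) show how quickly that factor explodes as the correlation approaches one.

```python
# The multiplier 1 / (1 - r12**2) implied by the variance formulas above
for r12 in (0.0, 0.5, 0.9, 0.95, 0.99, 0.999):
    print(f"r12 = {r12:<6}  1/(1 - r12^2) = {1.0 / (1.0 - r12 ** 2):7.1f}")
```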
Detection of Multicollinearity
1. High R2 but few significant t ratios.
If R2 is high, the F test will in most cases reject the hypothesis that the partial slope coefficients
are simultaneously equal to zero, but the individual t tests will show that none or very few of the
partial slope coefficients are statistically different from zero.
2. High pair-wise correlations among regressors.
3. Examination of partial correlations.
4. Tolerance and variance inflation factor.
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the coefficient of determination from the auxiliary regression of $X_j$ on the remaining regressors; the tolerance is its reciprocal, $\mathrm{TOL}_j = 1 - R_j^2$.
5. Condition number.
$$K = \frac{\lambda_{\max}}{\lambda_{\min}}$$
where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues of the $X'X$ matrix. Both of these last two diagnostics are computed in the sketch following this list.
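Diagnostics 4 and 5 can be computed with a few lines of NumPy. In the sketch below the data are simulated (the construction of x2 as a near copy of x1 is our assumption, made only to produce visible collinearity); each VIF comes from the R2 of an auxiliary regression of one column on the others, and K is the ratio of the extreme eigenvalues of $X'X$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # deliberately near-collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    y = X[:, j]
    Z = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    yhat = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r2 = 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

print("VIFs:", [round(vif(X, j), 1) for j in range(X.shape[1])])

eigvals = np.linalg.eigvalsh(X.T @ X)       # eigenvalues of X'X
print("condition number K =", eigvals.max() / eigvals.min())
```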
Sources of Multicollinearity
There are four primary sources of multicollinearity:
01. The data collection method employed
The data collection method can lead to multicollinearity problems when the analyst samples only a
subspace of the region of the regressors defined, approximately, by the equation below.
Let the jth column of the X matrix be denoted by $X_j$, so that $X = [X_1, X_2, \ldots, X_k]$; thus $X_j$ contains the n
levels of the jth regressor variable. Formally, multicollinearity can be defined as the linear dependence of the
columns of X. The vectors are linearly dependent if there is a set of constants $t_1, t_2, \ldots, t_k$, not all zero, such
that
$$\sum_{j=1}^{k} t_j X_j = 0 \qquad \text{(A)}$$
If Equation (A) holds exactly for a subset of the columns of X, then the rank of the $X'X$ matrix is
less than k and $(X'X)^{-1}$ does not exist. However, suppose Equation (A) is only approximately true for some
subset of the columns of X. Then there will be a near linear dependency in $X'X$ and the problem of
multicollinearity is said to exist. Note that multicollinearity is a form of ill-conditioning of the
$X'X$ matrix. Furthermore, the problem is one of degree; that is, every data set will suffer from
multicollinearity to some extent unless the columns of X are orthogonal. The presence of multicollinearity
can make the usual least-squares analysis of the regression model dramatically inadequate.
02. Constraints on the model or in the population.
Constraints on the model or in the population being sampled can cause multicollinearity. For example,
suppose an electric utility is investigating the effect of family income (in thousands of rupees per month)
and house size (in square meters) on residential electricity consumption.
In this example a physical constraint in the population has caused the phenomenon: families with
higher incomes generally have larger homes than families with lower incomes. When physical
constraints such as this are present, multicollinearity will exist regardless of the sampling method employed.
03. Model specification.
Multicollinearity may also be induced by the choice of model. We know that adding polynomial terms to a
regression model causes ill-conditioning of the $X'X$ matrix. Furthermore, if the range of x is small, adding
an $x^2$ term can result in significant multicollinearity, as the sketch below illustrates.
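A two-line check (illustrative ranges chosen by us) shows the effect: over a wide range x and $x^2$ are correlated, but over a narrow range far from zero the correlation is driven essentially to one.

```python
import numpy as np

wide = np.linspace(0.0, 10.0, 50)     # x varies over a wide range
narrow = np.linspace(10.0, 11.0, 50)  # x varies over a narrow range

print("corr(x, x^2), wide range:  ", np.corrcoef(wide, wide ** 2)[0, 1])
print("corr(x, x^2), narrow range:", np.corrcoef(narrow, narrow ** 2)[0, 1])  # much closer to 1
```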
04. An overdefined model.
An overdefined model has more regressor variables than observations. These models are
sometimes encountered in medical and behavioral research, where there may be only a small number of
subjects (sample units) available, and information is collected for a large number of regressors on each
subject. The usual approach to dealing with the multicollinearity in this context is to eliminate some of the
regressor variables from consideration.
Remedies of Multicollinearity
01. Dropping a variable(s) and specification bias.
When faced with severe multicollinearity, one of the “simplest” things to do is to drop one of the collinear
variables. Thus, in our consumption–income–wealth illustration,
$$\text{Consumption } (Y) = \beta_0 + \beta_1\,\text{Income} + \beta_2\,\text{Wealth} + u$$
we could simply drop the wealth variable.
But in dropping a variable from the model we may be committing a specification bias or specification error.
Specification bias arises from incorrect specification of the model used in the analysis. Thus, if economic theory
says that income and wealth should both be included in the model explaining the consumption expenditure,
dropping the wealth variable would constitute specification bias.
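The trade-off can be seen in a small simulation. Everything below is assumed for illustration (the true coefficients, the strength of the income-wealth link, the noise levels); the point is simply that dropping wealth removes the collinearity but biases the income coefficient when wealth truly belongs in the model.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
income = rng.normal(50, 10, size=n)
wealth = 5.0 * income + rng.normal(0, 10, size=n)   # wealth strongly tied to income
cons = 10 + 0.6 * income + 0.05 * wealth + rng.normal(0, 5, size=n)

def fit(y, *cols):
    X = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("income coef, full model:    ", fit(cons, income, wealth)[1])  # near the true 0.6
print("income coef, wealth dropped:", fit(cons, income)[1])          # absorbs the wealth effect (~0.6 + 0.05*5)
```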
02. Transformation of variables
Suppose we have time series data on consumption expenditure, income, and wealth. One reason for high
multicollinearity between income and wealth in such data is that over time both the variables tend to move
in the same direction. One way of minimizing this dependence is to proceed as follows.
If the relation
$$\text{Consumption } (Y_t) = \beta_0 + \beta_1\,\text{Income}_t + \beta_2\,\text{Wealth}_t + u_t$$
that is,
$$Y_t = \beta_0 + \beta_1 X_{1t} + \beta_2 X_{2t} + u_t \qquad \text{(a)}$$
holds at time t, it must also hold at time t - 1:
$$Y_{t-1} = \beta_0 + \beta_1 X_{1,t-1} + \beta_2 X_{2,t-1} + u_{t-1} \qquad \text{(b)}$$
Subtracting (b) from (a),
$$Y_t - Y_{t-1} = (\beta_0 - \beta_0) + \beta_1 (X_{1t} - X_{1,t-1}) + \beta_2 (X_{2t} - X_{2,t-1}) + (u_t - u_{t-1})$$
$$Y_t - Y_{t-1} = \beta_1 (X_{1t} - X_{1,t-1}) + \beta_2 (X_{2t} - X_{2,t-1}) + (u_t - u_{t-1})$$
Writing $Y_t^* = Y_t - Y_{t-1}$, $X_{1t}^* = X_{1t} - X_{1,t-1}$, $X_{2t}^* = X_{2t} - X_{2,t-1}$ and $V_t = u_t - u_{t-1}$, this becomes
$$Y_t^* = \beta_1 X_{1t}^* + \beta_2 X_{2t}^* + V_t$$
This equation is known as the first difference form because we run the regression, not on the original
variables, but on the differences of successive values of the variables.
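A quick simulated check (the trend slopes and noise levels are our assumptions) illustrates why the first difference form helps: two series that trend together are almost perfectly correlated in levels but only weakly correlated after differencing.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(100)
income = 100 + 2.0 * t + rng.normal(0, 5, size=100)    # trending series
wealth = 500 + 9.0 * t + rng.normal(0, 20, size=100)   # trends together with income

print("correlation in levels:           ", np.corrcoef(income, wealth)[0, 1])
print("correlation of first differences:", np.corrcoef(np.diff(income), np.diff(wealth))[0, 1])
```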
Another commonly used transformation in practice is the ratio transformation.
Consider the model
$$Y_t = \beta_0 + \beta_1 X_{1t} + \beta_2 X_{2t} + u_t$$
where Y is consumption expenditure in real dollars, X 1 is GDP, and X 2 is total population. Since GDP
and population grow over time, they are likely to be correlated. One “solution” to this problem is to
express the model on a per capita basis, that is, by dividing by X 2 , to obtain:
$$\frac{Y_t}{X_{2t}} = \beta_0\left(\frac{1}{X_{2t}}\right) + \beta_1\left(\frac{X_{1t}}{X_{2t}}\right) + \frac{u_t}{X_{2t}}$$
Such a transformation may reduce collinearity in the original variables.
03. Additional or new data.
Since
$$\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum x_{1i}^2\,(1 - r_{12}^2)}$$
and $\sum x_{1i}^2$ will generally increase as the sample size increases (why?), for any given $r_{12}$ the
variance of $\hat{\beta}_1$ will decrease as more data are added, thus decreasing the standard error and enabling us to estimate $\beta_1$ more
precisely.
04. Combining cross-sectional and time series data
05. Reducing collinearity in polynomial regressions.
06. Other methods (Ridge Regression or Principal Component Regression)
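As a pointer to the last remedy, here is a minimal ridge regression sketch in NumPy, assuming the usual closed-form estimator on centered data; the penalty value lam = 1 is arbitrary and would normally be chosen by cross-validation. Adding lam to the diagonal of $X'X$ keeps the matrix well conditioned even when the regressors are nearly collinear, at the cost of some shrinkage bias.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimator (X'X + lam*I)^(-1) X'y on centered data."""
    Xc = X - X.mean(axis=0)   # centering removes the need for an (unpenalised) intercept
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)          # nearly collinear pair
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

print("lam = 0 (plain OLS):", ridge(X, y, 0.0))
print("lam = 1 (ridge):    ", ridge(X, y, 1.0))   # coefficients shrunk relative to OLS
```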