Multicollinearity
What is multicollinearity?
Multicollinearity generally occurs when there are high correlations between two or more predictor
variables; in other words, one predictor variable can be used to predict another. In its extreme form
it describes a perfect or exact linear relationship among the explanatory variables. Linear
regression analysis assumes that no such exact relationship exists among the explanatory
variables, and when this assumption is violated, the problem of multicollinearity occurs.
This creates redundant information, skewing the results in a regression model. Examples of
correlated predictor variables (also called multicollinear predictors) are: a person’s height and
weight, age and sales price of a car, or years of education and annual income.
The following methods can reveal the presence of multicollinearity:
1. In regression analysis, a very high R-squared for the model combined with only a few
significant t ratios suggests multicollinearity in the data.
2. Calculate correlation coefficients for all pairs of predictor variables. High correlation
between explanatory variables also indicates a multicollinearity problem. If a
correlation coefficient, r, is exactly +1 or -1, this is called perfect multicollinearity.
3. Tolerance and variance inflation factor (VIF): In regression analysis, one divided by one
minus the squared multiple correlation of an explanatory variable with the remaining
explanatory variables is called its variance inflation factor. As the correlation among the
regressor variables increases, the VIF increases as well, and a large VIF indicates the presence
of multicollinearity. The reciprocal of the VIF is called the tolerance, so VIF and tolerance
carry the same information in inverse form. (A short code sketch of these checks follows this list.)
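The following minimal Python sketch illustrates checks 1 and 2, assuming a pandas Series y holding the response and a DataFrame X holding the predictor variables (both names are placeholders, not from the text); the VIF check is shown later in the section.

import pandas as pd
import statsmodels.api as sm

def quick_collinearity_checks(y: pd.Series, X: pd.DataFrame) -> None:
    # Check 1: a high R-squared together with few significant t ratios
    model = sm.OLS(y, sm.add_constant(X)).fit()
    print(f"R-squared: {model.rsquared:.3f}")
    print(model.tvalues.round(2))   # individual t ratios
    print(model.pvalues.round(3))   # and their p-values

    # Check 2: pairwise correlations among the predictors;
    # values near +1 or -1 signal (near-)perfect multicollinearity
    print(X.corr().round(3))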
Example
Let's take a quick look at an example in which data-based multicollinearity exists. The following
data (bloodpress.txt) were collected on 20 individuals with high blood pressure:
blood pressure (y = BP, in mm Hg)
age (x1 = Age, in years)
weight (x2 = Weight, in kg)
body surface area (x3 = BSA, in sq m)
duration of hypertension (x4 = Dur, in years)
basal pulse (x5 = Pulse, in beats per minute)
stress index (x6 = Stress)
The researchers were interested in determining if a relationship exists between blood pressure and
age, weight, body surface area, duration, pulse rate and/or stress level.
A matrix plot of BP, Age, Weight, and BSA allows us to investigate the various marginal
relationships between the response BP and the predictors. Blood pressure appears to be related
fairly strongly to Weight and BSA, and hardly related at all to Stress level. The matrix plots also
allow us to investigate whether or not relationships exist among the predictors. For example,
Weight and BSA appear to be strongly related, while Stress and BSA appear to be hardly related at all.
The correlation matrix of these variables provides further evidence of the above claims. Blood
pressure appears to be related fairly strongly to Weight (r = 0.950) and BSA (r = 0.866), and hardly
related at all to Stress level (r = 0.164). Likewise, Weight and BSA appear to be strongly related
(r = 0.875), while Stress and BSA appear to be hardly related at all (r = 0.018). The high correlation
among some of the predictors suggests that data-based multicollinearity exists.
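The matrix plot and correlation matrix above can be reproduced with a few lines of pandas; in this sketch the column layout of bloodpress.txt (whitespace-delimited with a header row containing Pt, BP, Age, Weight, BSA, Dur, Pulse, Stress) is assumed from the variable list given earlier.

import pandas as pd
import matplotlib.pyplot as plt

bp = pd.read_csv("bloodpress.txt", sep=r"\s+")

# Marginal relationships among the response and three of the predictors
pd.plotting.scatter_matrix(bp[["BP", "Age", "Weight", "BSA"]], figsize=(8, 8))
plt.show()

# Correlation matrix for all variables (compare with the r values quoted above)
print(bp[["BP", "Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]].corr().round(3))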
Why is Multicollinearity a Potential Problem?
A key goal of regression analysis is to isolate the relationship between each independent variable
and the dependent variable. The interpretation of a regression coefficient is that it represents the
mean change in the dependent variable for each 1 unit change in an independent variable when
you hold all of the other independent variables constant. That last portion is crucial for our
discussion about multicollinearity.
Effect of Multicollinearity
If the data suffer from a severe (near-perfect) multicollinearity problem, the impact will be as
follows (a small simulation after this list illustrates the first three effects):
1. In the presence of multicollinearity, the variances and covariances of the estimated
coefficients become larger, which makes it difficult to reach a statistical decision about the
null and alternative hypotheses.
2. In the presence of multicollinearity, the confidence intervals become wider because of the
larger standard errors. As a result, we may fail to reject a null hypothesis that should be
rejected.
3. In the presence of multicollinearity, the standard errors increase, which makes the
t statistics smaller; again, we may fail to reject a null hypothesis that should be rejected.
4. The R-squared of the model can remain high even when the individual coefficients are
insignificant, which gives a misleading impression of the goodness of fit of the model.
5. Multicollinearity makes it difficult to gauge the effect of the individual independent
variables on the dependent variable.
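As an illustration (not part of the original example), the following small simulation uses made-up data to show how the coefficient standard errors balloon when two predictors are nearly duplicates of each other, while the overall fit of the model stays roughly the same.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)

for label, x2 in [
    ("independent x2   ", rng.normal(size=n)),             # essentially uncorrelated with x1
    ("near-duplicate x2", x1 + 0.05 * rng.normal(size=n)), # highly correlated with x1
]:
    y = 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)           # same true coefficients in both cases
    fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    # Standard errors of b1 and b2 are much larger in the correlated case,
    # even though R-squared (the goodness of fit) is comparable
    print(label, "std errors:", fit.bse.round(3), "R-squared:", round(fit.rsquared, 3))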
As the severity of the multicollinearity increases, so do these problematic effects. However, these
issues affect only those independent variables that are correlated. You can have a model with
severe multicollinearity and yet some variables in the model can be completely unaffected.
Multicollinearity makes it hard to interpret your coefficients, and it reduces the power of your
model to identify independent variables that are statistically significant. These are definitely
serious problems. However, the good news is that you don’t always have to find a way to fix
multicollinearity.
The need to reduce multicollinearity depends on its severity and your primary goal for your
regression model. Keep the following three points in mind:
1. The severity of the problems increases with the degree of the multicollinearity. Therefore,
if you have only moderate multicollinearity, you may not need to resolve it.
2. Multicollinearity affects only the specific independent variables that are correlated.
Therefore, if multicollinearity is not present for the independent variables that you are
particularly interested in, you may not need to resolve it. Suppose your model contains the
experimental variables of interest and some control variables. If high multicollinearity exists
for the control variables but not the experimental variables, then you can interpret the
experimental variables without problems.
3. Multicollinearity affects the coefficients and p-values, but it does not influence the
predictions, precision of the predictions, and the goodness-of-fit statistics. If your primary
goal is to make predictions, and you don’t need to understand the role of each independent
variable, you don’t need to reduce severe multicollinearity.
As the name suggests, a variance inflation factor (VIF) quantifies how much the variance is inflated.
But what variance? Recall that we learned previously that the standard errors — and hence the
variances — of the estimated coefficients are inflated when multicollinearity exists. A variance
inflation factor exists for each of the predictors in a multiple regression model. For example, the
variance inflation factor for the estimated regression coefficient bj —denoted VIFj —is just the
factor by which the variance of bj is "inflated" by the existence of correlation among the predictor
variables in the model.
In particular, the variance inflation factor for the jth predictor is

VIF_j = 1 / (1 − R_j²)

where R_j² is the R²-value obtained by regressing the jth predictor on the remaining predictors.
How do we interpret the variance inflation factors for a regression model? A VIF of 1 means that
there is no correlation between the jth predictor and the remaining predictor variables, and hence the
variance of bj is not inflated at all. The general rule of thumb is that VIFs exceeding 4 warrant
further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring
correction.
Statistical software calculates a VIF for each independent variable. VIFs start at 1 and have no
upper limit. A value of 1 indicates that there is no correlation between this independent variable
and any others. VIFs between 1 and 5 suggest that there is a moderate correlation, but it is not
severe enough to warrant corrective measures. VIFs greater than 5 represent critical levels of
multicollinearity where the coefficients are poorly estimated, and the p-values are questionable.
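As a sketch of the computation, the loop below applies VIF_j = 1 / (1 − R_j²) directly to the blood pressure predictors and cross-checks the result against the built-in statsmodels function; it assumes the DataFrame bp loaded in the earlier sketch.

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = ["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]
X = sm.add_constant(bp[predictors])

for j, name in enumerate(predictors, start=1):           # column 0 of X is the constant
    # Manual version: regress predictor j on the remaining predictors
    r2_j = sm.OLS(X[name], X.drop(columns=[name])).fit().rsquared
    vif_manual = 1.0 / (1.0 - r2_j)

    # Built-in version from statsmodels
    vif_sm = variance_inflation_factor(X.values, j)

    flag = "  <-- warrants a closer look" if vif_sm > 4 else ""
    print(f"{name:7s} VIF = {vif_sm:6.2f} (manual {vif_manual:6.2f}) tolerance = {1/vif_sm:.3f}{flag}")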
When you do need to deal with multicollinearity, two common remedies are:
1. Partial least squares regression, which uses a principal-components-type decomposition to
create a set of uncorrelated components to include in the model.
2. LASSO and ridge regression, which are advanced forms of regression analysis that can handle
multicollinearity (a brief ridge regression sketch follows this list). If you know how to perform
linear least squares regression, you’ll be able to handle these analyses with just a little
additional study.
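As an illustrative sketch (not from the original source), ridge regression could be applied to the blood pressure data with scikit-learn as follows, again assuming the DataFrame bp from the earlier sketches; RidgeCV chooses the penalty strength by cross-validation, and that penalty is what keeps the correlated coefficients from swinging wildly.

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

predictors = ["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]
X, y = bp[predictors], bp["BP"]

# Standardize the predictors so a single penalty applies sensibly to all of them
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 25)))
ridge.fit(X, y)

print(dict(zip(predictors, ridge.named_steps["ridgecv"].coef_.round(3))))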
https://online.stat.psu.edu/stat462/node/181/