Multicollinearity and Remedies
DEPARTMENT OF STATISTICS
TYPES:
• Structural multicollinearity: this type of multicollinearity is caused by the researchers (people like us) when we create new predictors from the given predictors in order to solve the problem.
• Data multicollinearity: this type of multicollinearity results from poorly designed experiments or purely observational data. It is present in the data itself and has not been specified/designed by us.
PROBLEMS CAUSED:
• It becomes difficult to determine which variables are actually contributing to the prediction of the response variable.
• In the presence of correlated predictors, the standard errors tend to increase. With large standard errors, the confidence intervals become wider, leading to less precise estimates of the slope parameters.
HOW TO DETECT?
VARIANCE INFLATION FACTOR
The most common way to detect multicollinearity is the variance inflation factor (VIF), which measures the strength of the correlation between the predictor variables in a regression model.
VIF tells us how well an independent variable can be predicted from the other independent variables.
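Concretely, the standard VIF formula (implied by the slide but not written out on it) is:

```latex
\mathrm{VIF}_j \;=\; \frac{1}{1 - R_j^{2}}
```

where R_j^2 is the R-squared obtained by regressing the j-th predictor on all of the remaining predictors. A VIF of 1 means no correlation with the other predictors, while values above roughly 5-10 are usually taken as a sign of problematic multicollinearity.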
Example: FINDING VIF VALUES IN R
From the summary output of the fitted model, we can see that the overall F-statistic is 560.6 and the corresponding p-value is 6.395e-15, which indicates that the overall regression model is significant.
Also, the predictor variables C1, C2 and C3 are statistically significant.
Now, we’ll use the vif() function from the car library to calculate the VIF for each predictor variable in
the model.
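A minimal sketch of those two steps, assuming a data frame df with a response y and the predictors C1, C2 and C3 (the data-frame and response names are assumptions; the slides only show the output):

```r
library(car)

# Fit the multiple regression model (df and y are assumed names)
model <- lm(y ~ C1 + C2 + C3, data = df)
summary(model)   # overall F-statistic and per-predictor p-values

# VIF for each predictor; values above roughly 5-10 signal multicollinearity
vif(model)
```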
To visualize the VIF values for each predictor variable, we can create a simple horizontal bar chart.
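One way to draw that chart in base R (a sketch; the slide does not show the plotting code):

```r
vif_values <- vif(model)

# Horizontal bar chart of the VIF values, with a dashed reference line
# at the common rule-of-thumb cutoff of 5
barplot(vif_values, horiz = TRUE, col = "steelblue", main = "VIF values")
abline(v = 5, lwd = 2, lty = 2)
```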
To gain a better understanding of why one predictor variable may have a high VIF value, we can create a correlation matrix to view the linear correlation coefficients between each pair of variables.
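For example, with the assumed data frame df from above:

```r
# Pairwise correlations between the predictors (response column excluded)
round(cor(df[, c("C1", "C2", "C3")]), 2)
```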
Several of these pair-wise correlations are quite high, suggesting that
there may be a severe collinearity problem.
REMEDIES OF MULTICOLLINEARITY
AIM (Ridge Regression):
Keep all the predictors in the model, but reduce the magnitude of their coefficients by penalising the sum of squared coefficients, thereby reducing the impact of the correlated predictors.
In Ridge Regression, the main idea is that the OLS line can overfit the training data and therefore have much higher variance; so we fit a new trend line that introduces a certain amount of bias in exchange for a substantial reduction in variance.
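For reference, the standard ridge objective behind this idea (not written out on the slide) is:

```latex
\hat{\beta}^{\text{ridge}}
  = \arg\min_{\beta}\;
    \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j\Bigr)^{2}
    \;+\; \lambda \sum_{j=1}^{p} \beta_j^{2}
```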
The lambda in the penalty term determines how severe the penalty is. If lambda is zero, we get back OLS; as lambda increases towards infinity, the fitted slope shrinks towards 0 (a flat line).
However, if lambda is very large, the penalty carries too much weight and leads to under-fitting. That is why it is important how lambda is chosen.
Ridge Regression in R
Calculating the lambdas: the candidate lambdas are very small values (like 0.001), and even the increments between them are very small.
Running the ridge regression model for all values of lambda.
Here, we use a cross-validation process to find the lambda that gives the best fit.
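A sketch of these three steps with the glmnet package (the slides do not name the package used, and df, y, C1-C3 are the assumed names from the VIF example):

```r
library(glmnet)

x <- as.matrix(df[, c("C1", "C2", "C3")])
y <- df$y

# Calculating the lambdas: a fine grid from 100 down to 0.001
lambdas <- 10^seq(2, -3, by = -0.1)

# Running the ridge regression model (alpha = 0) for all values of lambda
ridge_fit <- glmnet(x, y, alpha = 0, lambda = lambdas)

# Cross-validation to find the best lambda, then the coefficients at that lambda
cv_fit      <- cv.glmnet(x, y, alpha = 0, lambda = lambdas)
best_lambda <- cv_fit$lambda.min
coef(ridge_fit, s = best_lambda)
```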
PRINCIPAL COMPONENT ANALYSIS (PCA)
Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity.
PURPOSE: reduce the number of variables of a data set, while preserving as much information as possible. Principal Components are often called the SUMMARIES OF THE DATA.
STEP BY STEP MECHANISM OF PCA
[Diagram: a data set with 4 independent variables X1, X2, X3, X4 (4-D) is reduced to 2 principal components Z1, Z2 (2-D). EIGEN STUFF: from the data we compute the covariance matrix, then its eigenvalues Λ1, Λ2, Λ3, Λ4 (ordered from small to big) with their eigenvectors v1, v2, v3, v4.]
An eigenvector points in a direction in which it is stretched by the transformation, and the eigenvalue is the factor by which it is stretched.
For a symmetric matrix, the eigenvectors are always perpendicular and the eigenvalues are real.
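The same mechanism can be sketched in R. The data below (four numeric columns of the built-in mtcars set standing in for X1-X4) are purely illustrative and not from the slides:

```r
# Hypothetical 4-variable data set standing in for X1, X2, X3, X4
dat <- mtcars[, c("mpg", "disp", "hp", "wt")]

Xc  <- scale(dat, center = TRUE, scale = FALSE)  # centre the variables
S   <- cov(Xc)                                   # covariance matrix
eig <- eigen(S)                                  # eigenvalues and eigenvectors, largest eigenvalue first

V2 <- eig$vectors[, 1:2]   # keep the 2 eigenvectors with the largest eigenvalues
Z  <- Xc %*% V2            # the 2 principal components Z1, Z2

# The same projection via the built-in prcomp() (identical up to sign)
head(prcomp(dat)$x[, 1:2])
```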
STEP 1: Enter the data in SPSS.
STEP 2: Go to ANALYSE -> DIMENSION REDUCTION -> FACTOR ANALYSIS.
STEP 3: Move all the INDEPENDENT VARIABLES to the Variables section.
STEP 4: Choose Univariate descriptives, Initial solution, and KMO and Bartlett's test of sphericity from the Descriptives section.
Two key types of rotation:
• Orthogonal rotation: the factors or components (if there is more than one) will be uncorrelated; in fact, the rotational solution forces them to be uncorrelated.
• Oblique rotation: the factors are rotated in such a way that they are allowed to be correlated.
One way to see how good a job this analysis did at explaining the relationships between those variables is to look at the percent of variance accounted for by the components.
The sum of all the eigenvalues is equal to the number of variables: we have 6 components in the rows of the table because 6 variables were input into the analysis.
Percent of variance = magnitude of the eigenvalues retained / sum of the eigenvalues = 4.401 / 6 = 0.7335.
Notice that we have these distinct eigenvalues in our Initial Eigenvalues table. So 3.013 and 1.388 are the first ones, and everything after that is less than 1. Our first rule, as you may recall, was to only consider components with an eigenvalue larger than one: we keep the factors or components with eigenvalues greater than one and discard all additional components.
In the Extraction Sums of Squared Loadings section of this table, notice that there are only two values now. What this means is that this is how many components SPSS retained, based on the rule.
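The same bookkeeping in R, using only the two retained eigenvalues reported on the slide (3.013 and 1.388) and the fact that 6 standardised variables give a total of 6:

```r
retained <- c(3.013, 1.388)   # eigenvalues greater than 1, as in the SPSS table

# With a correlation-matrix PCA the eigenvalues sum to the number of variables,
# so the proportion of variance explained by the retained components is
sum(retained) / 6             # 4.401 / 6 = 0.7335
```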
SCREE PLOT
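A scree plot simply plots each eigenvalue against its component number. A minimal base-R sketch, reusing the illustrative dat from the PCA sketch above (not the data behind the slide):

```r
pc <- prcomp(dat, scale. = TRUE)   # correlation-matrix PCA
plot(pc$sdev^2, type = "b",
     xlab = "Component number", ylab = "Eigenvalue", main = "Scree plot")
abline(h = 1, lty = 2)             # the eigenvalue-greater-than-one cutoff
```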
Multicollinearity makes it hard to interpret your coefficients, and it reduces the power of your model to identify
independent variables that are statistically significant. These are definitely serious problems.
However, the good news is that you don’t always have to find a way to fix multicollinearity. The need to reduce
multicollinearity depends on your primary goal for your regression model. Keep the following three points in mind:
1. The severity of the problems increases with the degree of the multicollinearity. Therefore, if you have only moderate
multicollinearity, you may not need to resolve it.
2. Multicollinearity affects only the specific independent variables that are correlated. Therefore, if multicollinearity is not
present for the independent variables that you are particularly interested in, you may not need to resolve it.
3. Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions, or the goodness-of-fit statistics. If your primary goal is to make predictions, you don't need to reduce severe multicollinearity.
THANK YOU