
RAM LAL ANAND COLLEGE
DEPARTMENT OF STATISTICS

MULTICOLLINEARITY AND ITS REMEDIES

Submitted by – DIKSHA SINGH
University Roll no. – 19058568012
College Roll no. – 5024
Guide – Dr. SEEMA GUPTA
Paper Name – ECONOMETRICS
Course – B.Sc. Statistics Hons. (V SEM)
MULTICOLLINEARITY

WHAT?
One of the important assumptions in regression analysis is that the independent variables should not be correlated.
Violation of this assumption is known as multicollinearity.
MULTICOLLINEARITY: when two or more variables that are supposed to be independent are in reality highly correlated, so that they do not provide unique or independent information in the regression model.

TYPES:
• Structural multicollinearity: caused by the researchers (people like us) when we create new predictors from the given predictors (for example, a squared term built from an existing variable).
• Data multicollinearity: the result of poorly designed experiments or purely observational data. It is present in the data itself and has not been introduced by us.

PROBLEMS CAUSED:
• It becomes difficult to find out which variable is actually contributing to the prediction of the response variable.
• In the presence of correlated predictors, the standard errors of the coefficients tend to increase. With large standard errors, the confidence intervals become wider, leading to less precise estimates of the slope parameters.
HOW TO DETECT?

VARIANCE INFLATION FACTOR
The most common way to detect multicollinearity is the variance inflation factor (VIF), which measures the strength of the correlation between the predictor variables in a regression model.
VIF tells us how well an independent variable can be predicted using the other independent variables.

Example:

1. Suppose we have 9 independent variables.

2. To compute the VIF of variable V1, we isolate V1 and treat it as the target variable (Y), while the rest of the variables are treated as predictor variables (X).

3. We train a regression model of V1 on all the other predictor variables and find the corresponding R² value.

Using this R² value, we compute the VIF as

VIF = 1 / (1 − R²)

As the R² value increases, the VIF value also increases.
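The three steps above can be sketched in a few lines of R; a minimal illustration, assuming a data frame df that contains only the predictor columns (the function and variable names here are hypothetical, not from the slides):

    # Minimal sketch: VIF of one predictor "by hand".
    # Assumes a data frame `df` that contains only the predictor variables.
    vif_manual <- function(df, target) {
      # Regress the chosen predictor on all the other predictors
      fit <- lm(reformulate(setdiff(names(df), target), response = target), data = df)
      r2  <- summary(fit)$r.squared
      # VIF = 1 / (1 - R^2): the larger R^2, the larger the VIF
      1 / (1 - r2)
    }

    # Example call (hypothetical): vif_manual(df, "V1")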
DATA

• Blood pressure (BP), in mm Hg
• Age, in years
• Weight, in kg
• Body surface area (BSA), in m²
• Duration of hypertension (Dur), in years
• Basal pulse (Pulse), in beats per minute
• Stress index (Stress)

BP   Age  Weight  BSA   Dur   Pulse  Stress
105  47   85.4    1.75  5.1   63     33
115  49   94.2    2.10  3.8   70     14
116  49   95.3    1.98  8.2   72     10
117  50   94.7    2.01  5.8   73     99
112  51   89.4    1.89  7.0   72     95
121  48   99.5    2.25  9.3   71     10
121  49   99.8    2.25  2.5   69     42
110  47   90.9    1.90  6.2   66     8
110  49   89.2    1.83  7.1   69     62
114  48   92.7    2.07  5.6   64     35
114  47   94.4    2.07  5.3   74     90
115  49   94.1    1.98  5.6   71     21
114  50   91.6    2.05  10.2  68     47
106  45   87.1    1.92  5.6   67     80
125  52   101.3   2.19  10.0  76     98
114  46   94.5    1.98  7.4   69     95
106  46   87.0    1.87  3.6   62     18
113  46   94.5    1.90  4.3   70     12
110  48   90.5    1.88  9.0   71     99
122  56   95.7    2.09  7.0   75     99

**Here, BP is taken as the dependent variable, i.e. we build a model that predicts BP.
FINDING VIF VALUES IN R

Here we have added our data in R and changed the column names. We'll fit a regression model using T as the response variable and all the c's as the predictor variables.

In the output, the R-squared value for the model is 0.9962. We can also see that the overall F-statistic is 560.6 and the corresponding p-value is 6.395e-15, which indicates that the overall regression model is significant. Also, the predictor variables c1, c2 and c3 are statistically significant.

VIF = 1 → No correlation
VIF = 1 to 5 → Moderate correlation
VIF > 10 → High correlation

Now, we’ll use the vif() function from the car library to calculate the VIF for each predictor variable in
the model.
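The slides show the R output but not the code; a possible version of the workflow is sketched below, assuming the blood-pressure data from the table above is stored in a text file and keeping the original column names rather than renaming them to T and c1–c6 as in the slides (the file name bloodpress.txt is an assumption):

    library(car)                                        # provides the vif() function

    bp <- read.table("bloodpress.txt", header = TRUE)   # hypothetical file name

    # Fit the full regression model with BP as the response
    model <- lm(BP ~ Age + Weight + BSA + Dur + Pulse + Stress, data = bp)
    summary(model)     # R-squared, overall F-statistic and p-value

    vif(model)         # VIF for each predictor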
To visualize the VIF values for each predictor variable, we can create a simple horizontal bar chart.
To gain a better understanding of why one predictor variable may have a high VIF value, we can create a correlation matrix to view the linear correlation coefficients between each pair of variables.
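A hedged sketch of both of these steps, continuing the code above:

    vif_values <- vif(model)                        # VIFs from the fitted model

    # Simple horizontal bar chart of the VIF values
    barplot(vif_values, horiz = TRUE, las = 1, main = "VIF values")
    abline(v = 10, lty = 2)                         # common "high VIF" reference line

    # Correlation matrix of the predictor variables
    round(cor(bp[, c("Age", "Weight", "BSA", "Dur", "Pulse", "Stress")]), 2)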
Several of these pair-wise correlations are quite high, suggesting that
there may be a severe collinearity problem.
REMEDIES OF MULTICOLLINEARITY

1. Drop a Redundant Variable
If a variable is redundant, it should never have been included in the model in the first place, so dropping it is just correcting a specification error. Use economic theory to guide your choice of which variable to drop.

2. Transform the Multicollinear Variables
Sometimes you can reduce multicollinearity by re-specifying the model, for instance by creating a combination of the multicollinear variables. As an example, rather than including the variables GDP and population in the model, include GDP/population (GDP per capita) instead (see the sketch after this list).

3. Increase the Sample Size
Increasing the sample size improves the precision of an estimator and reduces the adverse effects of multicollinearity. Adding more data, though, is often not feasible.

4. Ridge Regression

5. Principal Component Analysis
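As an illustration of remedy 2, a small hypothetical sketch (the data frame econ and its columns are assumptions, not part of the slides):

    # Replace two collinear predictors by their ratio
    econ$gdp_per_capita <- econ$gdp / econ$population

    # and use the combined variable in place of the two collinear ones:
    # lm(y ~ gdp_per_capita + other_vars, data = econ)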
RIDGE REGRESSION
Ridge Regression performs L2 regularization, i.e. it adds the "squared magnitude" of the coefficients as a penalty term.

AIM:
Penalize the sum of squared coefficients to reduce the impact of correlated predictors while keeping all predictors in the model.
Keep all the features, but reduce the magnitude of the coefficients of the model.

In Ridge Regression, the main idea is that the OLS line can overfit the training data and therefore have high variance. So we introduce a certain amount of bias into the new trend line in exchange for lower variance.

The lambda in the penalty term determines how severe the penalty is. If lambda is zero, we get back OLS. As lambda increases toward infinity, the slope approaches 0 (a flat line).
However, if lambda is very large, it adds too much weight and leads to under-fitting. Hence, it is important how lambda is chosen.

The candidate lambdas are usually very small values (like 0.001), and even the increments between lambdas are very small.
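In symbols, the ridge estimate minimizes the residual sum of squares plus the L2 penalty (a standard textbook formulation of what the slide describes, not taken verbatim from it):

    \hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2, \qquad \lambda \ge 0

Setting λ = 0 gives back OLS, and as λ grows the slope coefficients are shrunk toward zero.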
Ridge Regression in R

• Specify the response and the predictor variables.
• Calculate the grid of lambdas.
• Run the ridge regression model for all values of lambda.
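A sketch of what these steps might look like with the glmnet package, continuing from the data frame bp used in the VIF sketch (the slides do not show the exact code, so object names here are illustrative):

    library(glmnet)

    # Response vector and predictor matrix (glmnet expects a matrix, not a formula)
    y <- bp$BP
    x <- model.matrix(BP ~ Age + Weight + BSA + Dur + Pulse + Stress, data = bp)[, -1]

    # Grid of candidate lambdas
    lambdas <- 10^seq(3, -4, by = -0.1)

    # Ridge regression (alpha = 0) fitted over the whole lambda grid
    ridge_fit <- glmnet(x, y, alpha = 0, lambda = lambdas)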
Here, we have used cross-validation to find the lambda that fits best.

As we can see from the graph, for low values of lambda the mean squared error is quite low. On the contrary, as the value of lambda increases, the mean squared error also increases. So, we decide to continue using a low value of lambda.

The optimal value of lambda that minimizes the error is 0.01, and we stored it in optlambda. So, we can now re-run the model using the optlambda value that we found.
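Continuing the sketch above, the cross-validation and the refit at the optimal lambda could look roughly like this (an illustration, not the slides' exact code):

    # Cross-validation to pick the lambda with the lowest mean squared error
    cv_fit    <- cv.glmnet(x, y, alpha = 0, lambda = lambdas)
    optlambda <- cv_fit$lambda.min        # the slides report 0.01 for this data
    plot(cv_fit)                          # MSE against log(lambda)

    # Predictions at the optimal lambda and the training-set R-squared
    y_hat <- predict(ridge_fit, s = optlambda, newx = x)
    sst   <- sum((y - mean(y))^2)
    sse   <- sum((y - y_hat)^2)
    rsq   <- 1 - sse / sst
    rsq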

As we can see from the R-squared value rsq, we now have an optimal model that has accounted for 99.99% of the variance in the training set.
We were able to obtain this result using all the predictors, and the model does a very good job of predicting BP, explaining about 99% of its variation.
PRINCIPAL COMPONENT ANALYSIS (PCA)
Principal Component Analysis is a dimensionality-reduction method that is often used
to reduce the dimensionality of large data sets, by transforming a large set of variables
into a smaller one that still contains most of the information in the large set.

Reducing the number of variables of a data set naturally comes at the expense of
accuracy, but the trick in dimensionality reduction is to trade a little accuracy for
simplicity.
PURPOSE: reduce the number of variables of a data set, while preserving as much information as possible. Principal components are often called the SUMMARIES OF THE DATA.
STEP BY STEP MECHANISM OF PCA

• Start with the data on the independent variables, e.g. 4 variables X1, X2, X3, X4 (4-dimensional data).
• Compute the covariance matrix of these variables.
• Find its eigenvalues and eigenvectors (Λ1, v1), (Λ2, v2), (Λ3, v3), (Λ4, v4), ordered from small to big.
• Retain the components with the largest eigenvalues, e.g. 2 principal components Z1 and Z2, so the data is reduced from 4-D to 2-D.

**An eigenvector points in a direction in which it is stretched by the transformation, and the eigenvalue is the factor by which it is stretched.
**For a symmetric matrix, the eigenvectors are always perpendicular, and real.
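The slides carry out PCA in SPSS, but the mechanism above can be sketched in a few lines of R on the same predictors from the data frame bp used earlier (an illustration under that assumption):

    # Eigendecomposition of the correlation matrix of the predictors
    X   <- bp[, c("Age", "Weight", "BSA", "Dur", "Pulse", "Stress")]
    R   <- cor(X)            # standardized, so correlation rather than covariance
    eig <- eigen(R)          # eigenvalues (variances of the PCs) and eigenvectors

    eig$values                           # their sum equals the number of variables
    scores <- scale(X) %*% eig$vectors   # principal component scores

    # Equivalent one-liner: prcomp(X, scale. = TRUE)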
STEP 1: Enter the data in SPSS.
STEP 2: Go to ANALYZE -> DIMENSION REDUCTION -> FACTOR ANALYSIS.
STEP 3: Move all the INDEPENDENT VARIABLES to the Variables section.
STEP 4: Choose Univariate descriptives, Initial solution, and KMO and Bartlett's test of sphericity from the Descriptives section.
Two key types of rotation:
• Orthogonal rotation
Your factors or components, if there is more than one, will be uncorrelated; the rotational solution forces them to be uncorrelated.
• Oblique rotation
The components are rotated in such a way that they are allowed to be correlated.

STEP 5: Choose Scree plot, Unrotated factor solution, Correlation matrix, and the Principal components method from the Extraction section.
STEP 6: Choose the Varimax method of rotation from the Rotation section.
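The steps above are SPSS menu actions; a rough R analogue, using the psych package and the correlation matrix R from the PCA sketch above, might look like the sketch below (this is an assumed equivalent, not what the slides actually ran):

    library(psych)

    # Principal components extraction with varimax (orthogonal) rotation,
    # keeping 2 components as decided later from the eigenvalue rule and scree plot
    pc <- principal(R, nfactors = 2, rotate = "varimax")
    pc$loadings        # rotated component matrix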
Correlation matrix of the independent variables
Here we could use the covariance matrix as well, but we prefer the correlation matrix because it is standardized.

KMO and Bartlett's test assesses whether the variables in this correlation matrix are significantly correlated or not.
But unlike the correlation matrix, it does not test each individual correlation separately; instead, in one overall test, it assesses whether the pairwise correlations, taken as a group, differ significantly from zero.

KMO and Bartlett’s Test
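For reference, rough R analogues of these two diagnostics exist in the psych package (an assumption; the slides obtain them from the SPSS dialog). Here n is the number of observations, 20 in this data set:

    KMO(R)                         # Kaiser-Meyer-Olkin measure of sampling adequacy
    cortest.bartlett(R, n = 20)    # Bartlett's test of sphericity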


Factor Extraction Methods
Total Variance Explained and the Scree Plot are two of the most used procedures for deciding how many components to retain.

The sum of all the eigenvalues is equal to the number of variables: we have 6 components here, because the rows correspond to the variables we input in our analysis.

One way to see how good a job this analysis did at explaining the relationships between those variables is to look at the percent of variance accounted for by the retained components.
Percent of variance = (sum of the eigenvalues of the retained components) / (sum of all the eigenvalues) = 4.401 / 6 = 0.7335.

Notice the distinct eigenvalues in our Initial Eigenvalues table: 3.013 and 1.388 are the first ones, and everything after that is less than 1. Our first rule, as you may recall, was to consider only components with an eigenvalue larger than one. So, we keep the factors or components with eigenvalues greater than one and discard all additional components.
In the Extraction Sums of Squared Loadings section of this table, notice that there are only two values now; this is how many components SPSS retained, based on that rule.
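Using the eigendecomposition from the earlier R sketch, the same two decision aids can be reproduced (the numeric values quoted in the slides come from their SPSS output):

    prop <- eig$values / sum(eig$values)   # proportion of variance per component
    cumsum(prop)                           # cumulative proportion; first two ≈ 0.73 here

    which(eig$values > 1)                  # Kaiser rule: keep eigenvalues greater than 1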
SCREE PLOT

• Here we retain the number of components that lie above what is known as the scree, i.e. where the plot tapers off very gradually.
• Notice how for the last 4 eigenvalues the rate of change, or the slope, is quite minimal as we move across.
• But for the first 2 values there is a big drop from component 1 to component 2 and from component 2 to component 3.
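A scree plot of these eigenvalues can be drawn with a couple of lines of base R, continuing the earlier sketch:

    plot(eig$values, type = "b", xlab = "Component", ylab = "Eigenvalue",
         main = "Scree plot")
    abline(h = 1, lty = 2)   # eigenvalue = 1 reference line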
Here we can see how each variable loads on each component. But as you can see, the variable Dur loads almost equally on both components; that is why we use the rotated component matrix. After rotation, the two components have very low correlation.
Is Multicollinearity Necessarily Bad?

Multicollinearity makes it hard to interpret your coefficients, and it reduces the power of your model to identify
independent variables that are statistically significant. These are definitely serious problems.
However, the good news is that you don’t always have to find a way to fix multicollinearity. The need to reduce
multicollinearity depends on your primary goal for your regression model. Keep the following three points in mind:

1. The severity of the problems increases with the degree of the multicollinearity. Therefore, if you have only moderate
multicollinearity, you may not need to resolve it.
2. Multicollinearity affects only the specific independent variables that are correlated. Therefore, if multicollinearity is not
present for the independent variables that you are particularly interested in, you may not need to resolve it.
3. Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions, or the goodness-of-fit statistics. If your primary goal is to make predictions, you don't need to reduce severe multicollinearity.
THANK YOU
