
RAM LAL ANAND COLLEGE
DEPARTMENT OF STATISTICS

MULTICOLLINEARITY AND ITS REMEDIES

Submitted by – DIKSHA SINGH
University Roll no. – 19058568012
College Roll no. – 5024
Guide – Dr. SEEMA GUPTA
Paper Name – ECONOMETRICS
Course – B.Sc. Statistics Hons. (V SEM)
MULTICOLLINEARITY

WHAT?
One of the important assumptions in regression analysis is that the independent variables should not be correlated.
Violation of this assumption is known as multicollinearity.
MULTICOLLINEARITY: when two or more variables that are supposed to be independent are in reality highly correlated, so that they do not provide unique or independent information in the regression model.

TYPES:
• Structural multicollinearity: caused by the researchers (people like us) when we create new predictors from the given predictors (for example, a squared term built from an existing variable).
• Data multicollinearity: the result of poorly designed experiments or purely observational data. It is present in the data itself and has not been introduced by us.

PROBLEMS CAUSED:
• It becomes difficult to find out which variable is actually contributing to the prediction of the response variable.
• In the presence of correlated predictors, the standard errors of the coefficients tend to increase. With large standard errors, the confidence intervals become wider, leading to less precise estimates of the slope parameters.
HOW TO DETECT?

VARIANCE INFLATION FACTOR
The most common way to detect multicollinearity is the variance inflation factor (VIF), which measures the strength of the correlation between the predictor variables in a regression model.
VIF tells us how well an independent variable can be predicted using the other independent variables.

Example:

1. Suppose we have 9 independent variables.

2. To compute the VIF of variable V1, we isolate V1 and treat it as the target variable (Y), while the rest of the variables are treated as predictor variables (X).

3. We train a regression model of V1 on all the other predictor variables and find the corresponding R² value.

Using this R² value, we compute the VIF as

VIF = 1 / (1 − R²)

As the R² value increases, the VIF value also increases.
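The three steps above can be sketched in a few lines of R; a minimal illustration, assuming a data frame df that contains only the predictor columns (the function and variable names here are hypothetical, not from the slides):

    # Minimal sketch: VIF of one predictor "by hand".
    # Assumes a data frame `df` that contains only the predictor variables.
    vif_manual <- function(df, target) {
      # Regress the chosen predictor on all the other predictors
      fit <- lm(reformulate(setdiff(names(df), target), response = target), data = df)
      r2  <- summary(fit)$r.squared
      # VIF = 1 / (1 - R^2): the larger R^2, the larger the VIF
      1 / (1 - r2)
    }

    # Example call (hypothetical): vif_manual(df, "V1")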
DATA

• Blood pressure (BP), in mm Hg
• Age, in years
• Weight, in kg
• Body surface area (BSA), in m²
• Duration of hypertension (Dur), in years
• Basal pulse (Pulse), in beats per minute
• Stress index (Stress)

BP   Age  Weight  BSA   Dur   Pulse  Stress
105  47   85.4    1.75  5.1   63     33
115  49   94.2    2.10  3.8   70     14
116  49   95.3    1.98  8.2   72     10
117  50   94.7    2.01  5.8   73     99
112  51   89.4    1.89  7.0   72     95
121  48   99.5    2.25  9.3   71     10
121  49   99.8    2.25  2.5   69     42
110  47   90.9    1.90  6.2   66     8
110  49   89.2    1.83  7.1   69     62
114  48   92.7    2.07  5.6   64     35
114  47   94.4    2.07  5.3   74     90
115  49   94.1    1.98  5.6   71     21
114  50   91.6    2.05  10.2  68     47
106  45   87.1    1.92  5.6   67     80
125  52   101.3   2.19  10.0  76     98
114  46   94.5    1.98  7.4   69     95
106  46   87.0    1.87  3.6   62     18
113  46   94.5    1.90  4.3   70     12
110  48   90.5    1.88  9.0   71     99
122  56   95.7    2.09  7.0   75     99

**Here, BP is taken as the dependent variable, i.e. we build a model that predicts BP.
FINDING VIF VALUES IN R

Here we have added our data in R and changed the column names. We'll fit a regression model using T as the response variable and all the c's as the predictor variables.

In the output, the R-squared value for the model is 0.9962. We can also see that the overall F-statistic is 560.6 and the corresponding p-value is 6.395e-15, which indicates that the overall regression model is significant. Also, the predictor variables c1, c2 and c3 are statistically significant.

VIF = 1 → No correlation
VIF = 1 to 5 → Moderate correlation
VIF > 10 → High correlation

Now, we’ll use the vif() function from the car library to calculate the VIF for each predictor variable in
the model.
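The slides show the R output but not the code; a possible version of the workflow is sketched below, assuming the blood-pressure data from the table above is stored in a text file and keeping the original column names rather than renaming them to T and c1–c6 as in the slides (the file name bloodpress.txt is an assumption):

    library(car)                                        # provides the vif() function

    bp <- read.table("bloodpress.txt", header = TRUE)   # hypothetical file name

    # Fit the full regression model with BP as the response
    model <- lm(BP ~ Age + Weight + BSA + Dur + Pulse + Stress, data = bp)
    summary(model)     # R-squared, overall F-statistic and p-value

    vif(model)         # VIF for each predictor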
To visualize the VIF values for each predictor variable, we can create a simple horizontal bar chart.
To gain a better understanding of why one predictor variable may have a high VIF value, we can create a correlation matrix to view the linear correlation coefficients between each pair of variables.
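A hedged sketch of both of these steps, continuing the code above:

    vif_values <- vif(model)                        # VIFs from the fitted model

    # Simple horizontal bar chart of the VIF values
    barplot(vif_values, horiz = TRUE, las = 1, main = "VIF values")
    abline(v = 10, lty = 2)                         # common "high VIF" reference line

    # Correlation matrix of the predictor variables
    round(cor(bp[, c("Age", "Weight", "BSA", "Dur", "Pulse", "Stress")]), 2)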
Several of these pair-wise correlations are quite high, suggesting that
there may be a severe collinearity problem.
REMEDIES OF MULTICOLLINEARITY

1. Drop a Redundant Variable
If a variable is redundant, it should never have been included in the model in the first place, so dropping it is just correcting a specification error. Use economic theory to guide your choice of which variable to drop.

2. Transform the Multicollinear Variables
Sometimes you can reduce multicollinearity by re-specifying the model, for instance by creating a combination of the multicollinear variables. As an example, rather than including the variables GDP and population in the model, include GDP/population (GDP per capita) instead (see the sketch after this list).

3. Increase the Sample Size
Increasing the sample size improves the precision of an estimator and reduces the adverse effects of multicollinearity. Adding more data, though, is often not feasible.

4. Ridge Regression

5. Principal Component Analysis
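As an illustration of remedy 2, a small hypothetical sketch (the data frame econ and its columns are assumptions, not part of the slides):

    # Replace two collinear predictors by their ratio
    econ$gdp_per_capita <- econ$gdp / econ$population

    # and use the combined variable in place of the two collinear ones:
    # lm(y ~ gdp_per_capita + other_vars, data = econ)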
RIDGE REGRESSION
Ridge Regression performs L2 regularization, i.e. it adds the "squared magnitude" of the coefficients as a penalty term.

AIM:
Penalize the sum of squared coefficients to reduce the impact of correlated predictors while keeping all predictors in the model.
Keep all the features, but reduce the magnitude of the coefficients of the model.

In Ridge Regression, the main idea is that the OLS line can overfit the training data and therefore have high variance. So we introduce a certain amount of bias into the new trend line in exchange for lower variance.

The lambda in the penalty term determines how severe the penalty is. If lambda is zero, we get back OLS. As lambda increases toward infinity, the slope approaches 0 (a flat line).
However, if lambda is very large, it adds too much weight and leads to under-fitting. Hence, it is important how lambda is chosen.

The candidate lambdas are usually very small values (like 0.001), and even the increments between lambdas are very small.
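In symbols, the ridge estimate minimizes the residual sum of squares plus the L2 penalty (a standard textbook formulation of what the slide describes, not taken verbatim from it):

    \hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2, \qquad \lambda \ge 0

Setting λ = 0 gives back OLS, and as λ grows the slope coefficients are shrunk toward zero.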
Ridge Regression in R

• Specify the response and the predictor variables.
• Calculate the grid of lambdas.
• Run the ridge regression model for all values of lambda.
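A sketch of what these steps might look like with the glmnet package, continuing from the data frame bp used in the VIF sketch (the slides do not show the exact code, so object names here are illustrative):

    library(glmnet)

    # Response vector and predictor matrix (glmnet expects a matrix, not a formula)
    y <- bp$BP
    x <- model.matrix(BP ~ Age + Weight + BSA + Dur + Pulse + Stress, data = bp)[, -1]

    # Grid of candidate lambdas
    lambdas <- 10^seq(3, -4, by = -0.1)

    # Ridge regression (alpha = 0) fitted over the whole lambda grid
    ridge_fit <- glmnet(x, y, alpha = 0, lambda = lambdas)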
Here, we have used cross-validation to find the lambda that fits best.

As we can see from the graph, for low values of lambda the mean squared error is quite low. On the contrary, as the value of lambda increases, the mean squared error also increases. So, we decide to continue using a low value of lambda.

The optimal value of lambda that minimizes the error is 0.01, and we stored it in optlambda. So, we can now re-run the model using the optlambda value that we found.
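Continuing the sketch above, the cross-validation and the refit at the optimal lambda could look roughly like this (an illustration, not the slides' exact code):

    # Cross-validation to pick the lambda with the lowest mean squared error
    cv_fit    <- cv.glmnet(x, y, alpha = 0, lambda = lambdas)
    optlambda <- cv_fit$lambda.min        # the slides report 0.01 for this data
    plot(cv_fit)                          # MSE against log(lambda)

    # Predictions at the optimal lambda and the training-set R-squared
    y_hat <- predict(ridge_fit, s = optlambda, newx = x)
    sst   <- sum((y - mean(y))^2)
    sse   <- sum((y - y_hat)^2)
    rsq   <- 1 - sse / sst
    rsq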

As we can see from the R-squared value rsq, we now have an optimal model that has accounted for 99.99% of the variance in the training set.
We were able to obtain this result using all the predictors, and the model does a very good job of predicting BP, explaining about 99% of its variation.
PRINCIPAL COMPONENT ANALYSIS (PCA)
Principal Component Analysis is a dimensionality-reduction method that is often used
to reduce the dimensionality of large data sets, by transforming a large set of variables
into a smaller one that still contains most of the information in the large set.

Reducing the number of variables of a data set naturally comes at the expense of
accuracy, but the trick in dimensionality reduction is to trade a little accuracy for
simplicity.
PURPOSE: reduce the number of variables of a data set, while preserving as much information as possible. Principal components are often called the SUMMARIES OF THE DATA.
STEP BY STEP MECHANISM OF PCA

• Start with the data on the independent variables, e.g. 4 variables X1, X2, X3, X4 (4-dimensional data).
• Compute the covariance matrix of these variables.
• Find its eigenvalues and eigenvectors (Λ1, v1), (Λ2, v2), (Λ3, v3), (Λ4, v4), ordered from small to big.
• Retain the components with the largest eigenvalues, e.g. 2 principal components Z1 and Z2, so the data is reduced from 4-D to 2-D.

**An eigenvector points in a direction in which it is stretched by the transformation, and the eigenvalue is the factor by which it is stretched.
**For a symmetric matrix, the eigenvectors are always perpendicular, and real.
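The slides carry out PCA in SPSS, but the mechanism above can be sketched in a few lines of R on the same predictors from the data frame bp used earlier (an illustration under that assumption):

    # Eigendecomposition of the correlation matrix of the predictors
    X   <- bp[, c("Age", "Weight", "BSA", "Dur", "Pulse", "Stress")]
    R   <- cor(X)            # standardized, so correlation rather than covariance
    eig <- eigen(R)          # eigenvalues (variances of the PCs) and eigenvectors

    eig$values                           # their sum equals the number of variables
    scores <- scale(X) %*% eig$vectors   # principal component scores

    # Equivalent one-liner: prcomp(X, scale. = TRUE)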
STEP 1: Enter the data in SPSS.
STEP 2: Go to ANALYZE -> DIMENSION REDUCTION -> FACTOR ANALYSIS.
STEP 3: Move all the INDEPENDENT VARIABLES to the Variables section.
STEP 4: Choose Univariate descriptives, Initial solution, and KMO and Bartlett's test of sphericity from the Descriptives section.
Two key types of rotation:
• Orthogonal rotation
Your factors or components, if there is more than one, will be uncorrelated; the rotational solution forces them to be uncorrelated.
• Oblique rotation
The components are rotated in such a way that they are allowed to be correlated.

STEP 5: Choose Scree plot, Unrotated factor solution, Correlation matrix, and the Principal components method from the Extraction section.
STEP 6: Choose the Varimax method of rotation from the Rotation section.
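The steps above are SPSS menu actions; a rough R analogue, using the psych package and the correlation matrix R from the PCA sketch above, might look like the sketch below (this is an assumed equivalent, not what the slides actually ran):

    library(psych)

    # Principal components extraction with varimax (orthogonal) rotation,
    # keeping 2 components as decided later from the eigenvalue rule and scree plot
    pc <- principal(R, nfactors = 2, rotate = "varimax")
    pc$loadings        # rotated component matrix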
Correlation matrix of the independent variables
Here we could use the covariance matrix as well, but we prefer the correlation matrix because it is standardized.

KMO and Bartlett's test assesses whether the variables in this correlation matrix are significantly correlated or not.
But unlike the correlation matrix, it does not test each individual correlation separately; instead, in one overall test, it assesses whether the pairwise correlations, taken as a group, differ significantly from zero.

KMO and Bartlett’s Test
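For reference, rough R analogues of these two diagnostics exist in the psych package (an assumption; the slides obtain them from the SPSS dialog). Here n is the number of observations, 20 in this data set:

    KMO(R)                         # Kaiser-Meyer-Olkin measure of sampling adequacy
    cortest.bartlett(R, n = 20)    # Bartlett's test of sphericity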


Factor Extraction Methods
Total Variance Explained and the Scree Plot are two of the most used procedures for deciding how many components to retain.

The sum of all the eigenvalues is equal to the number of variables: we have 6 components here, because the rows correspond to the variables we input in our analysis.

One way to see how good a job this analysis did at explaining the relationships between those variables is to look at the percent of variance accounted for by the retained components.
Percent of variance = (sum of the eigenvalues of the retained components) / (sum of all the eigenvalues) = 4.401 / 6 = 0.7335.

Notice the distinct eigenvalues in our Initial Eigenvalues table: 3.013 and 1.388 are the first ones, and everything after that is less than 1. Our first rule, as you may recall, was to consider only components with an eigenvalue larger than one. So, we keep the factors or components with eigenvalues greater than one and discard all additional components.
In the Extraction Sums of Squared Loadings section of this table, notice that there are only two values now; this is how many components SPSS retained, based on that rule.
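Using the eigendecomposition from the earlier R sketch, the same two decision aids can be reproduced (the numeric values quoted in the slides come from their SPSS output):

    prop <- eig$values / sum(eig$values)   # proportion of variance per component
    cumsum(prop)                           # cumulative proportion; first two ≈ 0.73 here

    which(eig$values > 1)                  # Kaiser rule: keep eigenvalues greater than 1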
SCREE PLOT

• Here we retain the number of components that lie above what is known as the scree, i.e. where the plot tapers off very gradually.
• Notice how for the last 4 eigenvalues the rate of change, or the slope, is quite minimal as we move across.
• But for the first 2 values there is a big drop from component 1 to component 2 and from component 2 to component 3.
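A scree plot of these eigenvalues can be drawn with a couple of lines of base R, continuing the earlier sketch:

    plot(eig$values, type = "b", xlab = "Component", ylab = "Eigenvalue",
         main = "Scree plot")
    abline(h = 1, lty = 2)   # eigenvalue = 1 reference line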
Here we can see how each variable loads on each component. But as you can see, the variable Dur loads almost equally on both components; that is why we use the rotated component matrix. After rotation, the two components have very low correlation.
Is Multicollinearity Necessarily Bad?

Multicollinearity makes it hard to interpret your coefficients, and it reduces the power of your model to identify
independent variables that are statistically significant. These are definitely serious problems.
However, the good news is that you don’t always have to find a way to fix multicollinearity. The need to reduce
multicollinearity depends on your primary goal for your regression model. Keep the following three points in mind:

1. The severity of the problems increases with the degree of the multicollinearity. Therefore, if you have only moderate
multicollinearity, you may not need to resolve it.
2. Multicollinearity affects only the specific independent variables that are correlated. Therefore, if multicollinearity is not
present for the independent variables that you are particularly interested in, you may not need to resolve it.
3. Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions, or the goodness-of-fit statistics. If your primary goal is to make predictions, you don't need to reduce severe multicollinearity.
THANK YOU
