Multicollinearity
1. What is Multicollinearity?
Multicollinearity occurs in Ordinary Least Squares (OLS) regression when two or more
independent variables (the predictors) are highly correlated with each other. This means
that one predictor can be almost completely explained using the other predictor(s).
OLS Regression: A statistical method used to estimate how one dependent variable (what you are trying to predict) is related to one or more independent variables (the predictors you use to make the prediction).
Example: Suppose you are predicting house prices using square footage and the number of bedrooms as predictors. Since larger houses generally have more bedrooms, these two variables will be highly correlated. This correlation causes a problem for the regression model when trying to separate the individual effects of square footage and bedrooms on house price.
2. Why is Multicollinearity a Problem?
a. Inflated Standard Errors
With multicollinearity, the model struggles to decide how much each predictor contributes to the dependent variable, which leads to larger standard errors. This means the coefficient estimates become unreliable and fluctuate more depending on the sample data.
b. Difficult Interpretation
When variables are highly correlated, it’s hard to determine how much each variable uniquely contributes to the outcome. For example, is house price more influenced by square footage or by the number of bedrooms? Multicollinearity makes it unclear.
c. Insignificant Variables
Because the standard errors are inflated, t-statistics shrink and p-values rise, so predictors can appear statistically insignificant even when they genuinely affect the outcome.
d. Unstable Coefficients
The regression coefficients become unstable, meaning small changes in the data can
lead to big swings in their values. This instability makes the model unreliable for
predictions.
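To see this instability concretely, here is a minimal sketch using the housing example (assuming numpy and statsmodels are installed; all variable names and numbers are invented for illustration). Fitting the same model on two random subsamples of simulated data yields noticeably different coefficients:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Synthetic housing data: bedrooms tracks square footage closely,
# so the two predictors are highly correlated (all numbers invented).
sqft = rng.normal(2000, 500, size=200)
bedrooms = sqft / 600 + rng.normal(0, 0.3, size=200)
price = 100 * sqft + 5000 * bedrooms + rng.normal(0, 20000, size=200)

# Fit the same OLS model on two random subsamples of the data.
for seed in (1, 2):
    idx = np.random.default_rng(seed).choice(200, size=150, replace=False)
    X = sm.add_constant(np.column_stack([sqft[idx], bedrooms[idx]]))
    print(sm.OLS(price[idx], X).fit().params.round(1))
# The sqft and bedroom coefficients swing noticeably between subsamples.
```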
3. How to Detect Multicollinearity
a. Correlation Matrix
A correlation matrix shows the relationships between all pairs of independent variables.
Correlations close to ±1 indicate potential multicollinearity.
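A quick way to compute one, assuming pandas is available (the tiny housing dataset below is invented purely for illustration):

```python
import pandas as pd

# Hypothetical housing data (values are illustrative only).
df = pd.DataFrame({
    "sqft":     [1500, 2200, 1800, 2600, 1200],
    "bedrooms": [3, 4, 3, 5, 2],
})

# Pairwise correlations among the predictors; values near ±1 are red flags.
print(df.corr())
```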
b. Variance Inflation Factor (VIF)
VIF measures how much the variance of a regression coefficient is inflated by multicollinearity. For predictor i, VIF_i = 1 / (1 − R_i²), where R_i² is the R-squared from regressing predictor i on all the other predictors. A common rule of thumb is that a VIF above 5–10 signals problematic multicollinearity.
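A sketch of computing VIFs with statsmodels’ variance_inflation_factor (the simulated data mirrors the housing example above; all numbers are invented):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
sqft = rng.normal(2000, 500, size=200)
bedrooms = sqft / 600 + rng.normal(0, 0.3, size=200)  # strongly tied to sqft

# Include a constant so each VIF is computed against a proper regression.
X = sm.add_constant(pd.DataFrame({"sqft": sqft, "bedrooms": bedrooms}))
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 1))
```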
4. How to Fix Multicollinearity
When you detect multicollinearity, here’s how you can address it:
a. Remove One of the Correlated Variables
If two variables are highly correlated, consider removing one of them. For instance, if square footage and the number of bedrooms are highly correlated, you might choose to keep only square footage.
b. Combine Variables
You can create a new variable that combines the information from the correlated
predictors. For example, you could create a “size index” by combining square footage and
the number of bedrooms into one variable.
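One way to build such an index, as a sketch (the equal weighting and the column names are illustrative assumptions, not a prescribed recipe):

```python
import pandas as pd

df = pd.DataFrame({
    "sqft":     [1500, 2200, 1800, 2600, 1200],
    "bedrooms": [3, 4, 3, 5, 2],
})

# Put both predictors on the same scale, then average them into one feature.
z = (df - df.mean()) / df.std()
df["size_index"] = z.mean(axis=1)
print(df)
```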
c. Principal Component Analysis (PCA)
PCA transforms the correlated variables into a new set of uncorrelated components. These components can then be used as predictors in the regression model.
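A minimal sketch with scikit-learn (same invented housing data as above; the predictors are scaled first so both contribute comparably to the components):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
sqft = rng.normal(2000, 500, size=200)
bedrooms = sqft / 600 + rng.normal(0, 0.3, size=200)
X = np.column_stack([sqft, bedrooms])

# Rotate the scaled predictors into uncorrelated principal components.
Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
print(np.corrcoef(Z, rowvar=False).round(6))  # off-diagonal entries ~ 0
```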
d. Collect More Data
With more data, especially data covering a wider range of predictor values, the model has more information to separate each predictor’s effect, which reduces the standard-error inflation that multicollinearity causes.
e. Standardize Variables
Standardizing (or at least centering) the predictors does not remove the correlation between two distinct variables, but it substantially reduces the structural multicollinearity that appears when a model includes interaction or polynomial terms, such as a predictor and its square.
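A small sketch of the polynomial-term case (numbers invented): the raw predictor and its square are almost perfectly correlated, but centering first breaks most of that link.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1000, 3000, size=200)  # e.g., square footage

print(np.corrcoef(x, x**2)[0, 1])    # close to 1: structural multicollinearity

xc = x - x.mean()                    # center before squaring
print(np.corrcoef(xc, xc**2)[0, 1])  # much closer to 0
```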
5. Real-World Example
Imagine a company modeling its revenue using two predictors:
1. Total Advertising Budget (X1): Everything the company spends on advertising across all channels.
2. Online Ad Spend (X2): A subset of the advertising budget focused on digital ads.
• The Problem: Online ad spend is part of the overall advertising budget, so these
two variables are highly correlated.
• Impact: The regression model can’t distinguish how much revenue is driven by
overall advertising vs. online ads. Coefficients for both variables become unstable
and have large standard errors.
• Solution: Remove one variable (e.g., keep only total advertising budget) or
combine them into a single variable representing "total spend" instead.
Key Takeaways
1. What it is: Multicollinearity means two or more predictors in a regression are highly correlated, so their individual effects overlap.
2. Why it’s a Problem: It inflates standard errors, makes coefficients unstable, and reduces the interpretability and reliability of the regression model.
3. What to Do About it: Detect it with a correlation matrix or VIF; fix it by removing or combining correlated variables, applying PCA, collecting more data, or standardizing the predictors.