LR 4

The document discusses several assumptions of linear regression including multi-collinearity, normality of residuals, and heteroskedasticity. It explains how to check for multi-collinearity using the variance inflation factor and how to remove highly collinear features. The document also states that the residuals of a linear regression should be normally distributed.

Lec-5 : Linear Regression - Assumptions of Linear Regression

- Multi-Collinearity
- Normality of Residuals
- Heteroskedasticity
- Autocorrelation

Recap

Assumption 1) There should be a linear relationship b/w target & feature.

- Assumption of Linearity

[Figure: two sketches of target (y-axis) against feature (x-axis)]

Multicollinearity

* What happens when there is high correlation among the features?

Eg. Year (2011, 2012, ...) and Age

Eg. y = x_1 + 2x_2 + 3x_3 + 5

Which of the features is the most important?
-> x_3, as it has the highest weight.

Let's assume that x_1 and x_2 are collinear, i.e. they are linearly related:
x_2 = 1.5 x_1 (let's say)

y = x_1 + 2x_2 + 3x_3 + 5
  = x_1 + 2(1.5 x_1) + 3x_3 + 5
  = x_1 + 3x_1 + 3x_3 + 5
y = 4x_1 + 3x_3 + 5

So now the model is telling us that x_1 is the most important.

* Multicollinearity messes up the weights and we won't be able to determine feature importances.

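The substitution above can be checked numerically. This is a minimal sketch: the rows are made-up values (an assumption, not data from the lecture) in which x_2 is exactly 1.5 times x_1, and `model_a` / `model_b` are hypothetical names for the two weight settings.

```python
# Made-up rows (x_1, x_2, x_3); x_2 is exactly 1.5 * x_1 in every row.
rows = [(1.0, 1.5, 2.0), (2.0, 3.0, 0.5), (4.0, 6.0, 1.0), (3.0, 4.5, 5.0)]

def model_a(x1, x2, x3):
    # Original weights: x_3 has the largest weight.
    return x1 + 2 * x2 + 3 * x3 + 5

def model_b(x1, x2, x3):
    # Weights after substituting x_2 = 1.5 * x_1: x_1 has the largest weight.
    return 4 * x1 + 0 * x2 + 3 * x3 + 5

preds_a = [model_a(*r) for r in rows]
preds_b = [model_b(*r) for r in rows]
print(preds_a)
print(preds_b)
```

Both weight settings give the same predictions on collinear data, so the fitted weights alone cannot say which feature matters most.
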
Variance Inflation Factor (VIF)

Goal: Identify features which can be expressed as a linear combination of other features.

Features: f_1, f_2, f_3, ..., f_d

To check whether feature f_i is highly correlated with other features or not:
- Train a linear regression model with f_i as target, leaving the rest of the features as inputs.

  f_i = w_1 f_1 + w_2 f_2 + ... + w_{i-1} f_{i-1} + w_{i+1} f_{i+1} + ... + w_d f_d

- Check R^2 score
  · high -> it is collinear -> we can remove it
  · low -> it is not collinear -> we can keep it

- VIF score = 1 / (1 - R^2)

  if R^2 score = 1, VIF = 1 / (1 - 1) = 1/0 -> infinity
  VIF is high: feature is highly collinear, can be removed

  if R^2 score = 0, VIF = 1 / (1 - 0) = 1
  VIF is low: feature is not collinear, should not remove

Thumb rule
* VIF > 10 : very high multicollinearity, drop
* 5 < VIF < 10 : high multicollinearity
* VIF < 5 : low multicollinearity

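The VIF recipe above can be sketched with NumPy least squares. The function name `vif_scores`, the intercept column and the synthetic data are all assumptions for illustration; the lecture itself does not prescribe an implementation. Here f2 is built as roughly 1.5 times f1, and f3 is unrelated.

```python
import numpy as np

def vif_scores(X):
    """Return one VIF per column of X, using VIF = 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    scores = []
    for i in range(d):
        target = X[:, i]                      # f_i plays the role of the target
        others = np.delete(X, i, axis=1)      # the rest of the features are inputs
        A = np.column_stack([np.ones(n), others])  # intercept column (an assumption)
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        scores.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return scores

# Synthetic data: f2 is almost a multiple of f1, f3 is independent noise.
rng = np.random.default_rng(0)
f1 = rng.normal(size=200)
f2 = 1.5 * f1 + rng.normal(scale=0.1, size=200)
f3 = rng.normal(size=200)
X = np.column_stack([f1, f2, f3])
scores = vif_scores(X)
print(scores)
```

On data like this, f1 and f2 should land well above the VIF > 10 threshold while f3 stays close to 1. As a reference point for the thumb rule, R^2 = 0.8 gives VIF = 5 and R^2 = 0.9 gives VIF = 10.
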
Procedure to remove collinear features:

Step 1 - Calculate VIF score for all features
Step 2 - Find the feature with highest VIF
Step 3 - Remove the feature with highest VIF
Step 4 - Refit and calculate VIF of the remaining features
Step 5 - Repeat steps 2, 3 and 4 until multicollinearity is removed.

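The five steps can be sketched as a loop. Two things here are assumptions rather than part of the notes: "multicollinearity is removed" is taken to mean that no remaining feature has VIF above 10 (the thumb-rule cut-off), and the names `vif`, `drop_collinear` and the synthetic data are hypothetical.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: regress it on the other columns, then 1 / (1 - R^2)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
    coef, *_ = np.linalg.lstsq(A, X[:, i], rcond=None)
    resid = X[:, i] - A @ coef
    ss_tot = ((X[:, i] - X[:, i].mean()) ** 2).sum()
    r2 = 1.0 - (resid @ resid) / ss_tot
    return float("inf") if r2 >= 1.0 else float(1.0 / (1.0 - r2))

def drop_collinear(X, names, threshold=10.0):
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        scores = [vif(X, i) for i in range(X.shape[1])]  # Steps 1 and 4
        worst = int(np.argmax(scores))                    # Step 2
        if scores[worst] <= threshold:                    # assumed stopping rule
            break
        X = np.delete(X, worst, axis=1)                   # Step 3
        del names[worst]
    return X, names                                       # loop = Step 5

# Synthetic data: f2 is almost a multiple of f1, f3 is independent noise.
rng = np.random.default_rng(0)
f1 = rng.normal(size=200)
f2 = 1.5 * f1 + rng.normal(scale=0.1, size=200)
f3 = rng.normal(size=200)
X_kept, kept = drop_collinear(np.column_stack([f1, f2, f3]), ["f1", "f2", "f3"])
print(kept)
```

Only one feature is dropped per pass because removing one collinear feature changes the VIF of every feature that was related to it.
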
Normality of Residuals

Assumption: residuals of linear regression should be normally distributed (Gaussian distribution).

error/residual = actual - predicted = y - ŷ

- The residuals should be random.
- Their distribution should be normally distributed.
- Take a histogram of the residuals.

[Figure: histograms of residuals, one right-skewed and one left-skewed]

A right-skewed or left-skewed distribution indicates presence of outliers in data.

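Besides looking at the histogram, skew can be measured as a number. The sketch below computes residuals as actual minus predicted and then their sample skewness. The `skewness` helper and the small data set are assumptions for illustration, with the last actual value deliberately set as an outlier.

```python
from statistics import mean

def skewness(values):
    """Sample skewness: third central moment over second central moment ** 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up values; the last actual value is an outlier.
actual = [10.0, 12.0, 11.0, 13.0, 12.0, 30.0]
predicted = [10.5, 11.5, 11.5, 12.5, 12.5, 13.0]

residuals = [a - p for a, p in zip(actual, predicted)]
print(residuals)
print(skewness(residuals))
```

A skewness near 0 is consistent with a symmetric distribution; a clearly positive value means a right-skewed histogram and a clearly negative value a left-skewed one.
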
Impact of Outliers

* Outliers pull the regression line towards themselves.

Fix: outliers should be removed before implementing Lin. Reg.

[Figure: regression line pulled towards outlying points]

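The pull of a single outlier on the fitted line can be seen with a one-feature least-squares fit. This is a small sketch on made-up points; `fit_line` is a hypothetical helper, not something from the lecture.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up points that lie exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
clean_slope, clean_intercept = fit_line(xs, ys)

# Same points plus one outlier far above the line.
out_slope, out_intercept = fit_line(xs + [6.0], ys + [40.0])

print(clean_slope, clean_intercept)
print(out_slope, out_intercept)
```

One extra point is enough to change the slope substantially, which is why the notes suggest handling outliers before fitting.
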
Heteroskedasticity

Assumption: There should be no heteroskedasticity in the data.

[Figure: two scatter plots of residuals against pred (feature), one labelled homoskedastic and one labelled heteroskedastic]

- Homoskedastic: variance of residuals is constant as pred increases.
- Heteroskedastic: variance of residuals is not constant.

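A rough numeric version of the two plots is to compare the variance of the residuals for low predictions against high predictions. This is only a crude sketch under assumed data, not a formal statistical test; `variance_ratio` and the residual lists are hypothetical.

```python
from statistics import pvariance

def variance_ratio(pred, resid):
    """Variance of residuals in the upper half of pred divided by the lower half."""
    pairs = sorted(zip(pred, resid))          # order residuals by prediction
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return pvariance(high) / pvariance(low)

pred = [1, 2, 3, 4, 5, 6, 7, 8]
homo_resid = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]    # same spread everywhere
hetero_resid = [0.5, -0.5, 1.0, -1.0, 2.0, -2.0, 4.0, -4.0]  # spread grows with pred

homo_ratio = variance_ratio(pred, homo_resid)
hetero_ratio = variance_ratio(pred, hetero_resid)
print(homo_ratio, hetero_ratio)
```

A ratio close to 1 matches the homoskedastic picture; a ratio far from 1 suggests the spread of residuals changes with the prediction.
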
Autocorrelation -> covered in detail in time-series analysis

Data with columns f_1, f_2, ..., f_d, y -> independent

Date        stock-price
3rd April   1440
4th April   1500
5th April   1600
-> Dependent (Autocorrelation)

Linear regression is not the best model.

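One simple way to see this dependence is the lag-1 autocorrelation, which measures how strongly each value is related to the one before it. The helper `lag1_autocorr` and both series below are assumptions for illustration; the trending series only mimics the shape of the stock-price example.

```python
def lag1_autocorr(series):
    """Lag-1 autocorrelation: relation between each value and the previous one."""
    n = len(series)
    m = sum(series) / n
    num = sum((series[t] - m) * (series[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in series)
    return num / den

# Made-up series that rises day after day, like a trending stock price.
trending = [1400, 1450, 1500, 1550, 1600, 1650, 1700, 1750]
# Made-up series that flips sign every step.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]

trend_ac = lag1_autocorr(trending)
alt_ac = lag1_autocorr(alternating)
print(trend_ac, alt_ac)
```

A value well away from 0 means consecutive observations are dependent, which is the situation the notes flag as a poor fit for linear regression.
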
Summary

Assumptions:
① Linearity
② No multicollinearity
③ Residuals should be normally dist.
④ No Heteroskedasticity
⑤ No Autocorrelation
