LR 4
- Normality of residuals
- Heteroskedasticity
- Autocorrelation
Recap
Assumption of linearity: the target should be linearly related to each feature.
[Figure: target plotted against feature]
Multicollinearity
* What happens when there is high correlation among the features?
Eg. Year and Age (2012 -> 9): one feature is determined by the other.
Eg. y = 2*x1 + 3*x2 + 4*x3 + 5
-> x3 looks like the most important feature, as it has the highest weight.
Let's assume that x1 and x2 are collinear, i.e. they are linearly related:
x2 = 1.5*x1
y = 2*x1 + 3*x2 + 4*x3 + 5
  = 2*x1 + 3*(1.5*x1) + 4*x3 + 5
  = 2*x1 + 4.5*x1 + 4*x3 + 5
  = 6.5*x1 + 4*x3 + 5
So, now the model is telling us that x1 is the most important.
* Multicollinearity messes up the weights, and we won't be able to rely on them to judge which feature is important.
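To see how a linear relation between two features shifts the weights, here is a minimal sketch. The weights (2, 3, 4), the intercept 5 and the relation x2 = 1.5*x1 are illustrative assumptions, not fitted values.

```python
# Assumed model: y = 2*x1 + 3*x2 + 4*x3 + 5 (illustrative numbers only).
weights = {"x1": 2.0, "x2": 3.0, "x3": 4.0}
intercept = 5.0
ratio = 1.5  # assumed collinearity: x2 = 1.5 * x1


def predict_full(x1, x3):
    """Prediction with all three features, where x2 is derived from x1."""
    x2 = ratio * x1
    return weights["x1"] * x1 + weights["x2"] * x2 + weights["x3"] * x3 + intercept


# Substituting x2 = 1.5 * x1 folds the weight of x2 into the weight of x1.
merged = {"x1": weights["x1"] + weights["x2"] * ratio, "x3": weights["x3"]}


def predict_merged(x1, x3):
    """Same prediction written without x2."""
    return merged["x1"] * x1 + merged["x3"] * x3 + intercept


top_before = max(weights, key=weights.get)  # feature with the largest weight
top_after = max(merged, key=merged.get)
print(top_before, top_after, merged)
```

Both forms give identical predictions, yet the feature with the largest weight changes, which is why the weights stop being a reliable measure of importance.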
Goal: Identify features which can be expressed as a linear combination of other features.
Features: f1, f2, f3, ..., fd
- Fit a model on f2, f3, ..., fd with f1 as target
- Check the R^2 score:
  · high -> f1 is collinear, we can remove it
  · low -> f1 is not collinear, we can keep it
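The check above can be sketched with plain least squares. This is a minimal sketch on synthetic data; the helper name `r2_against_others` and the generated features are assumptions made for illustration.

```python
import numpy as np


def r2_against_others(X, j):
    """R^2 of a least-squares fit that predicts column j from the other columns."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    # Column of ones so the auxiliary model has an intercept.
    A = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    residual = target - A @ coef
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((target - target.mean()) ** 2))
    return 1.0 - ss_res / ss_tot


# Synthetic features (assumed): x2 is almost a multiple of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 1.5 * x1 + rng.normal(scale=0.05, size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

r2_x1 = r2_against_others(X, 0)  # high -> x1 is collinear
r2_x3 = r2_against_others(X, 2)  # low  -> x3 is not collinear
print(round(r2_x1, 4), round(r2_x3, 4))
```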
VIF score
VIF = 1 / (1 - R^2)
- if R^2 score = 1, VIF -> infinity: VIF is high, the feature is highly collinear and can be removed
- if R^2 score = 0, VIF = 1: VIF is low, the feature is not collinear and should not be removed
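The relation between the R^2 score and the VIF can be written as a tiny helper. This is a sketch: the function name `vif_from_r2` is made up here, and returning infinity for R^2 = 1 is a choice made to avoid dividing by zero.

```python
import math


def vif_from_r2(r2):
    """VIF = 1 / (1 - R^2); treated as infinite when R^2 reaches 1."""
    if r2 >= 1.0:
        return math.inf
    return 1.0 / (1.0 - r2)


for r2 in (0.0, 0.5, 0.8, 0.9, 1.0):
    print(r2, vif_from_r2(r2))
```

An R^2 of 0.8 already corresponds to a VIF of about 5, and 0.9 to a VIF of about 10.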
Thumb rule
* VIF > 10: very high multicollinearity, drop
* 5 < VIF < 10: high multicollinearity
* VIF < 5: low multicollinearity
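The thumb rule maps directly onto a small lookup. In this sketch the function name `classify_vif` is hypothetical, and placing the boundary values 5 and 10 in the "high" band is an assumption, since the rule does not say where they belong.

```python
def classify_vif(vif):
    """Label a VIF score using the thumb rule (boundary handling is assumed)."""
    if vif > 10:
        return "very high multicollinearity, drop"
    if vif >= 5:  # assumption: exactly 5 and exactly 10 count as "high"
        return "high multicollinearity"
    return "low multicollinearity"


for score in (1.2, 7.0, 25.0):
    print(score, "->", classify_vif(score))
```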
Procedure to remove collinear features:
Step 4) Refit and calculate the VIF of the remaining features.
Step 5) Repeat steps 2, 3 and 4 until multicollinearity is removed.
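The refit-and-repeat loop can be sketched as follows. The earlier steps are not spelled out in these notes, so this sketch assumes them: compute the VIF of every feature, then drop the feature with the highest VIF if it exceeds 10. The function names and the synthetic data are also assumptions.

```python
import numpy as np


def vif_scores(X):
    """VIF of every column: regress it on the other columns, then 1 / (1 - R^2)."""
    scores = []
    for j in range(X.shape[1]):
        target = X[:, j]
        A = np.column_stack([np.ones(len(target)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        ss_res = float(np.sum((target - A @ coef) ** 2))
        ss_tot = float(np.sum((target - target.mean()) ** 2))
        r2 = 1.0 - ss_res / ss_tot
        scores.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return scores


def drop_collinear(X, names, threshold=10.0):
    """Drop the highest-VIF feature, recompute, and repeat until all VIFs are low."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        scores = vif_scores(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break  # no feature is above the threshold any more
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names


# Synthetic features (assumed): x2 is almost a multiple of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 1.5 * x1 + rng.normal(scale=0.05, size=500)
x3 = rng.normal(size=500)
X_kept, kept = drop_collinear(np.column_stack([x1, x2, x3]), ["x1", "x2", "x3"])
print(kept, vif_scores(X_kept))
```

Only one feature is removed per pass because dropping one collinear feature usually lowers the VIF of its partner as well.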
Normality of Residuals
Assumption: the residuals of linear regression should be normally distributed.
- error/residual = y - ŷ
- The residuals should be random and normally distributed (Gaussian distribution).
- Take a histogram of the residuals:
[Figure: histogram of residuals with outliers marked in the tail]
- Residuals far from the rest of the histogram indicate the presence of outliers in the data.
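One way to look at the residuals numerically is to bin them into a histogram and measure how lopsided it is. This is a sketch on synthetic data; using sample skewness as the summary is an assumption added here, and the line y = 2x + 1 and the injected outliers are made up.

```python
import numpy as np


def skewness(values):
    """Sample skewness: close to 0 for a symmetric, bell-shaped histogram."""
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)


def fit_residuals(x, y):
    """Residuals y - y_hat of a straight-line least-squares fit."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=1000)
y = 2.0 * x + 1.0 + rng.normal(size=1000)  # assumed data with Gaussian noise
resid_clean = fit_residuals(x, y)

y_out = y.copy()
y_out[:20] += 15.0  # assumed outliers: 20 targets shifted far upwards
resid_out = fit_residuals(x, y_out)

counts, bin_edges = np.histogram(resid_out, bins=30)  # histogram of residuals
skew_clean = skewness(resid_clean)
skew_out = skewness(resid_out)
print(round(skew_clean, 3), round(skew_out, 3))
```

The clean residuals give a skewness near zero, while the outliers add a long right tail to the histogram and push the skewness well above zero.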
Impact of Outliers
- The regression line is pulled towards the outliers.
[Figure: regression line pulled towards outlying points]
- Fix: outliers should be removed before implementing linear regression.
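The pull on the regression line is easy to reproduce. This is a minimal sketch with made-up data: a line y = 2x + 1 with noise, plus five assumed outliers at the right edge.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(size=100)  # assumed clean data

# Assumed outliers: five points at x = 10 with a target far above the line.
x_out = np.concatenate([x, np.full(5, 10.0)])
y_out = np.concatenate([y, np.full(5, 100.0)])

slope_clean, intercept_clean = np.polyfit(x, y, 1)
slope_out, intercept_out = np.polyfit(x_out, y_out, 1)
print(round(slope_clean, 2), round(slope_out, 2))
```

The fitted slope rises noticeably once the five points are added, even though they are under 5% of the data.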
Heteroskedasticity
Assumption: there should be no heteroskedasticity in the data.
[Figure: residuals plotted against a feature and against the prediction, for a heteroskedastic and a homoskedastic case]
- Homoskedastic: the variance of the residuals is constant as the prediction increases.
- Heteroskedastic: the variance of the residuals is not constant.
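A rough numerical version of the residual plot is to compare the residual variance for low and high predictions. This sketch is in the spirit of a Goldfeld-Quandt comparison but is not a formal test; the helper names and the synthetic data are assumptions.

```python
import numpy as np


def variance_ratio(pred, resid):
    """Residual variance in the upper half of predictions over the lower half."""
    order = np.argsort(pred)
    half = len(order) // 2
    low = resid[order[:half]]
    high = resid[order[half:]]
    return float(np.var(high) / np.var(low))


def fit_and_ratio(x, y):
    """Fit a straight line, then compare residual variance across predictions."""
    slope, intercept = np.polyfit(x, y, 1)
    pred = slope * x + intercept
    return variance_ratio(pred, y - pred)


rng = np.random.default_rng(3)
x = np.linspace(1, 10, 2000)
y_homo = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=2000)  # constant noise
y_hetero = 2.0 * x + 1.0 + rng.normal(size=2000) * 0.5 * x  # noise grows with x

ratio_homo = fit_and_ratio(x, y_homo)      # near 1 -> homoskedastic
ratio_hetero = fit_and_ratio(x, y_hetero)  # well above 1 -> heteroskedastic
print(round(ratio_homo, 2), round(ratio_hetero, 2))
```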
Autocorrelation
-> Covered in detail in time-series analysis
- Data with features f1, f2, ..., fd and target y: rows are independent.
- Eg. stock prices: rows are dependent.

Date        stock-price
3rd April   1440
4th April   1500
5th April   1600

Autocorrelation -> Linear regression is not the best model
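Dependence between consecutive rows can be measured with the lag-1 autocorrelation. This is a sketch: the helper name and both series are assumptions, with a random walk starting near 1400 standing in for a stock price.

```python
import numpy as np


def lag1_autocorr(values):
    """Correlation between each value and the value one row before it."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float(np.sum(centered[1:] * centered[:-1]) / np.sum(centered ** 2))


rng = np.random.default_rng(4)
independent_rows = rng.normal(size=500)  # rows unrelated to each other
# Hypothetical price series: each value is the previous one plus a small step.
price_like = 1400 + np.cumsum(rng.normal(loc=1.0, scale=5.0, size=500))

ac_independent = lag1_autocorr(independent_rows)  # near 0
ac_price = lag1_autocorr(price_like)              # near 1
print(round(ac_independent, 3), round(ac_price, 3))
```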
Summary
Assumptions:
① Linearity
② No multicollinearity
③ Residuals should be normally distributed
④ No heteroskedasticity
⑤ No autocorrelation