Chapter 4 - Linear Regression
import pandas as pd
import numpy as np
np.set_printoptions(precision=4, linewidth=100)
   S. No.  Percentage in Grade 10  Salary
0       1                   62.00  270000
1       2                   76.33  200000
2       3                   72.00  240000
3       4                   60.00  250000
4       5                   61.00  180000
5       6                   55.00  300000
6       7                   70.00  260000
7       8                   68.00  235000
8       9                   82.80  425000
9      10                   59.00  240000
mba_salary_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 3 columns):
S. No. 50 non-null int64
Percentage in Grade 10 50 non-null float64
Salary 50 non-null int64
dtypes: float64(1), int64(2)
memory usage: 1.2 KB
   const  Percentage in Grade 10
0    1.0                   62.00
1    1.0                   76.33
2    1.0                   72.00
3    1.0                   60.00
4    1.0                   61.00
Y = mba_salary_df['Salary']
print( mba_salary_lm.params )
const 30587.285652
Percentage in Grade 10 3560.587383
dtype: float64
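The two fitted parameters define the regression line: predicted salary = const + slope × Percentage in Grade 10. The following is a minimal sketch of a point prediction from that line. The coefficient values are copied from the output above; the function name `predict_salary` is a hypothetical helper, not part of the original notebook.

```python
# Coefficients copied from the mba_salary_lm.params output.
INTERCEPT = 30587.285652
SLOPE = 3560.587383   # change in salary per one percentage point in Grade 10


def predict_salary(grade_10_percentage):
    """Point prediction from the fitted simple linear regression line."""
    # predict_salary is a hypothetical helper name used only for this sketch.
    return INTERCEPT + SLOPE * grade_10_percentage


# 62.0 is the Grade 10 percentage of the first row of the MBA salary data.
predicted = predict_salary(62.0)
print(round(predicted, 2))
```

With a fitted statsmodels model, the same prediction would normally come from the model's `predict()` method rather than from hand-copied coefficients.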
Percentage in Grade 10  3560.5874  1116.9258  3.1878  0.0029  1299.4892  5821.6855
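The numbers in this coefficient row are tied together by simple arithmetic. The sketch below assumes the usual statsmodels summary column order (coefficient, standard error, t-value, p-value, lower and upper bound of the 95% confidence interval); the column header itself is not shown in the output above, so that ordering is an assumption.

```python
# Values copied from the "Percentage in Grade 10" summary row.
coef = 3560.5874
std_err = 1116.9258
reported_t = 3.1878
ci_lower, ci_upper = 1299.4892, 5821.6855

# The t-statistic for H0: coefficient == 0 is the coefficient over its standard error.
t_stat = coef / std_err

# A symmetric confidence interval is centred on the coefficient, and its half-width
# divided by the standard error gives the critical t-value that was used.
ci_midpoint = (ci_lower + ci_upper) / 2
implied_t_critical = (ci_upper - ci_lower) / 2 / std_err

print(round(t_stat, 4), round(ci_midpoint, 4), round(implied_t_critical, 4))
```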
4.4.6.1 Z-Score
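A Z-score expresses how many standard deviations an observation lies from the mean, and observations with a large absolute Z-score are candidate outliers. The sketch below uses only the ten salary values shown at the start of the chapter. The cutoff of 3.0 and the use of the population standard deviation are assumptions made for illustration, not values taken from the text.

```python
from statistics import mean, pstdev

# Salary values from the ten rows of the MBA salary data shown earlier.
salaries = [270000, 200000, 240000, 250000, 180000,
            300000, 260000, 235000, 425000, 240000]

mu = mean(salaries)
sigma = pstdev(salaries)   # population standard deviation (an assumption)

# Z-score: distance of each observation from the mean, in standard deviations.
z_scores = [(s - mu) / sigma for s in salaries]

THRESHOLD = 3.0   # assumed cutoff; a common convention, not taken from the text
outliers = [i for i, z in enumerate(z_scores) if abs(z) > THRESHOLD]

# Index of the observation that lies farthest from the mean.
most_extreme = max(range(len(z_scores)), key=lambda i: abs(z_scores[i]))
print(most_extreme, round(z_scores[most_extreme], 2), outliers)
```

None of these ten rows crosses the assumed cutoff; the full 50-row data set may behave differently.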
mba_influence = mba_salary_lm.get_influence()
(c, p) = mba_influence.cooks_distance
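In statsmodels, `cooks_distance` returns a pair: the distances and their p-values, which is why the result is unpacked into `(c, p)`. Cook's distance measures how much the fitted values change when one observation is left out. The following sketch computes it directly from the residuals and leverage values with NumPy, using only the ten rows shown at the start of the chapter, so the numbers will not match the 50-row model.

```python
import numpy as np

# Grade 10 percentage and salary for the ten rows shown earlier.
pct = np.array([62.00, 76.33, 72.00, 60.00, 61.00, 55.00, 70.00, 68.00, 82.80, 59.00])
salary = np.array([270000, 200000, 240000, 250000, 180000,
                   300000, 260000, 235000, 425000, 240000], dtype=float)

X = np.column_stack([np.ones_like(pct), pct])   # design matrix with intercept
n, p = X.shape                                  # p counts the intercept as a parameter

# Ordinary least squares fit and residuals.
beta, *_ = np.linalg.lstsq(X, salary, rcond=None)
resid = salary - X @ beta

# Leverage values: diagonal of the hat matrix H = X (X'X)^-1 X'.
hat_diag = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

# Cook's distance: D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2
s2 = resid @ resid / (n - p)
cooks_d = resid**2 / (p * s2) * hat_diag / (1.0 - hat_diag) ** 2

print(np.round(cooks_d, 3))
print("most influential row:", int(np.argmax(cooks_d)))
```

A frequently quoted rule of thumb treats observations with a Cook's distance above 1 as highly influential; that threshold is a convention rather than a formal test.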
from sklearn.metrics import r2_score, mean_squared_error
np.abs(r2_score(test_y, pred_y))
0.15664584974230378
import numpy as np
np.sqrt(mean_squared_error(test_y, pred_y))
73458.04348346894
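Both metrics can be written out in a few lines, which makes clear what `r2_score` and `mean_squared_error` measure. The sketch below uses small hypothetical arrays, not the MBA salary test set. Note that R-squared computed on held-out data can be negative when the model predicts worse than the mean, so wrapping it in `np.abs` may hide that sign.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

# R-squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# RMSE: square root of the mean squared prediction error, in the units of y.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(r2, rmse)
```

Because RMSE is expressed in the units of the response, the value above for the salary model reads as a typical prediction error in salary units.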
pred_y_df[0:10]
ipl_auction_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 26 columns):
Sl.NO. 130 non-null int64
PLAYER NAME 130 non-null object
AGE 130 non-null int64
COUNTRY 130 non-null object
TEAM 130 non-null object
PLAYING ROLE 130 non-null object
T-RUNS 130 non-null int64
T-WKTS 130 non-null int64
ODI-RUNS-S 130 non-null int64
ODI-SR-B 130 non-null float64
ODI-WKTS 130 non-null int64
ODI-SR-BL 130 non-null float64
CAPTAINCY EXP 130 non-null int64
RUNS-S 130 non-null int64
HS 130 non-null int64
AVE 130 non-null float64
SR-B 130 non-null float64
SIXERS 130 non-null int64
RUNS-C 130 non-null int64
WKTS 130 non-null int64
AVE-BL 130 non-null float64
ECON 130 non-null float64
SR-BL 130 non-null float64
AUCTION YEAR 130 non-null int64
BASE PRICE 130 non-null int64
SOLD PRICE 130 non-null int64
dtypes: float64(7), int64(15), object(4)
memory usage: 26.5+ KB
   Sl.NO.  PLAYER NAME    AGE  COUNTRY  TEAM  PLAYING ROLE  T-RUNS  T-WKTS  ODI-RUNS-S  ODI-SR-B
0       1  Abdulla, YA      2  SA       KXIP  Allrounder         0       0           0      0.00
1       2  Abdur Razzak     2  BAN      RCB   Bowler           214      18         657     71.41
2       3  Agarkar, AB      2  IND      KKR   Bowler           571      58        1269     80.62
4       5  Badrinath, S     2  IND      CSK   Batsman           63       0          79     45.93
ipl_auction_df.iloc[0:5, 13:]
X_features = ipl_auction_df.columns
ipl_auction_df['PLAYING ROLE'].unique()
0 1 0 0 0
1 0 0 1 0
2 0 0 1 0
3 0 0 1 0
4 0 1 0 0
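Categorical features such as PLAYING ROLE must be converted to indicator (dummy) columns before they can enter a regression. The sketch below assumes pandas' `get_dummies` is the encoder; the call that produced the output above is not shown, so this is an illustration rather than the notebook's exact code. The small frame `roles_df` is built from four of the player rows shown earlier.

```python
import pandas as pd

# Playing roles of four players from the rows shown earlier.
roles_df = pd.DataFrame({"PLAYING ROLE": ["Allrounder", "Bowler", "Bowler", "Batsman"]})

# One indicator column per category; astype(int) gives 0/1 regardless of pandas version.
encoded = pd.get_dummies(roles_df, columns=["PLAYING ROLE"]).astype(int)

# drop_first=True drops the first category, so the remaining indicator columns
# are not perfectly collinear with the regression intercept.
encoded_k_minus_1 = pd.get_dummies(
    roles_df, columns=["PLAYING ROLE"], drop_first=True
).astype(int)

print(encoded)
print(list(encoded_k_minus_1.columns))
```

The dropped category becomes the baseline: each remaining coefficient is then read as a difference from that baseline.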
ipl_auction_encoded_df.columns
X_features = ipl_auction_encoded_df.columns
X = sm.add_constant( ipl_auction_encoded_df )
Y = ipl_auction_df['SOLD PRICE']
PLAYING ROLE_Batsman     75724.7643  150250.0240   0.5040  0.6158  -223793.1844  375242.713
PLAYING ROLE_Bowler      15395.8752  126308.1272   0.1219  0.9033  -236394.7744  267186.524
PLAYING ROLE_W. Keeper  -71358.6280  213585.7444  -0.3341  0.7393  -497134.0278  354416.771
CAPTAINCY EXP_1         164113.3972  123430.6353   1.3296  0.1878   -81941.0772  410167.871
4.5.6 Multi-Collinearity
4.5.6.1 VIF
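from statsmodels.stats.outliers_influence import variance_inflation_factor

def get_vif_factors( X ):
    # Convert the feature DataFrame to a NumPy array
    # (DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
    X_matrix = X.to_numpy()
    # VIF of each column, computed against all the other columns
    vif = [ variance_inflation_factor( X_matrix, i ) for i in range( X_matrix.shape[1] ) ]
    vif_factors = pd.DataFrame()
    vif_factors['column'] = X.columns
    vif_factors['vif'] = vif
    return vif_factors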
Calling the above function with the X features returns the VIF for each column.
column vif
0 T-RUNS 12.612694
1 T-WKTS 7.679284
2 ODI-RUNS-S 16.426209
3 ODI-SR-B 13.829376
4 ODI-WKTS 9.951800
5 ODI-SR-BL 4.426818
6 RUNS-S 16.135407
7 HS 22.781017
8 AVE 25.226566
9 SR-B 21.576204
10 SIXERS 9.547268
11 RUNS-C 38.229691
12 WKTS 33.366067
13 AVE-BL 100.198105
14 ECON 7.650140
15 SR-BL 103.723846
16 AGE_2 6.996226
17 AGE_3 3.855003
18 COUNTRY_BAN 1.469017
19 COUNTRY_ENG 1.391524
20 COUNTRY_IND 4.568898
21 COUNTRY_NZ 1.497856
22 COUNTRY_PAK 1.796355
23 COUNTRY_SA 1.886555
24 COUNTRY_SL 1.984902
25 COUNTRY_WI 1.531847
26 COUNTRY_ZIM 1.312168
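The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features, so a feature that the others predict almost perfectly gets a very large VIF. The sketch below implements that definition with NumPy on hypothetical data. It adds an intercept to each auxiliary regression, which is an assumption of this sketch: statsmodels' `variance_inflation_factor` appears to use only the columns it is given, so its values can differ unless the matrix already contains a constant column. The function name `vif_with_intercept` is hypothetical.

```python
import numpy as np


def vif_with_intercept(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


# Hypothetical data: x2 is almost exactly 2 * x1, while x3 is constructed
# to be uncorrelated with both of them.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.1, 3.9, 6.1, 7.9, 10.1, 11.9])
x3 = np.array([1.0, 1.0, -2.0, -2.0, 1.0, 1.0])

vifs = vif_with_intercept(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 2))
```

In this toy data the two nearly collinear columns get very large VIFs while the uncorrelated column stays close to 1, which mirrors the pattern in the table above, where pairs such as AVE-BL and SR-BL stand out.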
column vif
0 COUNTRY_SL 1.519752
1 SIXERS 2.397409
2 COUNTRY_BAN 1.094293
3 COUNTRY_NZ 1.173418
4 AGE_3 1.779861
5 COUNTRY_ENG 1.131869
6 COUNTRY_PAK 1.334773
7 ODI-WKTS 2.742889
10 WKTS 2.883101
12 COUNTRY_ZIM 1.205305
14 COUNTRY_WI 1.194093
15 COUNTRY_IND 3.144668
16 ODI-SR-BL 2.822148
17 COUNTRY_SA 1.416657
CAPTAINCY EXP_1         208376.6957   98128.0284   2.1235  0.0366    13304.6315  403448.760
PLAYING ROLE_W. Keeper  -55121.9240  169922.5271  -0.3244  0.7464  -392916.7280  282672.880
PLAYING ROLE_Bowler     -18315.4968  106035.9664  -0.1727  0.8633  -229108.0215  192477.027
PLAYING ROLE_Batsman    121382.0570  106685.0356   1.1378  0.2584   -90700.7746  333464.888
train_X = train_X[significant_vars]
CAPTAINCY EXP_1  359725.2741  74930.3460  4.8008  0.0000  211065.6018  508384.9463
plot_resid_fitted( ipl_model_3.fittedvalues,
ipl_model_3.resid,
"Figure 4.7 - Residual Plot")
k = train_X.shape[1]
n = train_X.shape[0]
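The lines that use `k` and `n` are not shown here. One common use of the number of features and the number of observations is a leverage cutoff, and the sketch below assumes the rule of thumb 3(k + 1)/n; both that rule and the data are assumptions made for illustration. In this sketch `k` excludes the intercept column, which may differ from how `train_X.shape[1]` counts columns in the notebook.

```python
import numpy as np

# Hypothetical single-feature data; the last x value lies far from the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0])
X = np.column_stack([np.ones_like(x), x])   # intercept + one feature

n = X.shape[0]        # number of observations
k = X.shape[1] - 1    # number of features, excluding the intercept

# Leverage h_ii: diagonal of the hat matrix X (X'X)^-1 X'.
hat_diag = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

# Assumed rule of thumb: flag observations with leverage above 3 * (k + 1) / n.
leverage_cutoff = 3 * (k + 1) / n
high_leverage = np.where(hat_diag > leverage_cutoff)[0].tolist()
print(leverage_cutoff, high_leverage)
```

Leverage depends only on the feature values, not on the response, so a flagged row is unusual in its inputs and still needs a residual or Cook's distance check before it is called influential.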
    Sl.NO.  PLAYER NAME      AGE  COUNTRY  TEAM  PLAYING ROLE  T-RUNS  T-WKTS  ODI-RUNS-S  O SR
58      59  Mascarenhas, AD    2  ENG      RR+   Allrounder         0       0         245  95
3 rows × 26 columns
The R-squared value of the model has increased to 0.751, and the following P-P plot shows that the
residuals are approximately normally distributed.
draw_pp_plot( ipl_model_4,
"Figure 4.8 - Normal P-P Plot of Regression Standardized Residuals"
);
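`draw_pp_plot` is a helper whose definition is not shown in this section. The idea behind a normal P-P plot can be sketched without any plotting: pair the cumulative probability each standardized residual would have under a normal distribution with the cumulative proportion actually observed. If the residuals are normal, the pairs lie close to the 45-degree line. The residuals and the (i - 0.5)/n plotting position below are assumptions made for illustration.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical residuals, for illustration only.
residuals = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Standardize, then sort so each residual can be paired with a rank-based probability.
mu, sigma = mean(residuals), stdev(residuals)
standardized = sorted((r - mu) / sigma for r in residuals)

n = len(standardized)
std_normal = NormalDist()   # mean 0, standard deviation 1

# x: cumulative probability each residual would have under a normal distribution.
theoretical = [std_normal.cdf(z) for z in standardized]
# y: observed cumulative proportion, using the (i - 0.5) / n plotting position.
empirical = [(i - 0.5) / n for i in range(1, n + 1)]

# Largest vertical distance from the 45-degree line.
max_gap = max(abs(t - e) for t, e in zip(theoretical, empirical))
print(round(max_gap, 3))
```

Plotting `theoretical` against `empirical` with a 45-degree reference line gives the P-P plot; a small `max_gap` corresponds to points hugging that line.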
Measuring RMSE
np.sqrt(metrics.mean_squared_error(pred_y, test_y))
496151.18122558104
0.44