Machine Learning using Python

Chapter 4: Linear Regression

4.3 Building Simple Linear Regression Model

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

np.set_printoptions(precision=4, linewidth=100)

mba_salary_df = pd.read_csv( 'MBA Salary.csv' )

mba_salary_df.head( 10 )

   S. No.  Percentage in Grade 10  Salary
0       1                   62.00  270000
1       2                   76.33  200000
2       3                   72.00  240000
3       4                   60.00  250000
4       5                   61.00  180000
5       6                   55.00  300000
6       7                   70.00  260000
7       8                   68.00  235000
8       9                   82.80  425000
9      10                   59.00  240000

More information about the dataset

mba_salary_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 3 columns):
S. No. 50 non-null int64
Percentage in Grade 10 50 non-null float64
Salary 50 non-null int64
dtypes: float64(1), int64(2)
memory usage: 1.2 KB

4.3.1 Creating Feature Set(X) and Outcome Variable(Y)

Copyright © 2019 by Wiley India Pvt. Ltd. 1/25

import statsmodels.api as sm

X = sm.add_constant( mba_salary_df['Percentage in Grade 10'] )

X.head(5)

   const  Percentage in Grade 10
0    1.0                   62.00
1    1.0                   76.33
2    1.0                   72.00
3    1.0                   60.00
4    1.0                   61.00

Y = mba_salary_df['Salary']

4.3.2 Splitting the dataset into training and validation sets

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split( X ,

Y,
train_size = 0.8,
random_state = 100 )
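
The split above holds out 20% of the rows at random, with `random_state` fixing the shuffle so the result is reproducible. A minimal sketch of that idea using only the standard library follows; `split_indices` is a hypothetical helper, and it does not reproduce scikit-learn's shuffling, so the same seed will not select the same rows.

```python
import random

def split_indices(n_rows, train_fraction=0.8, seed=100):
    """Shuffle row indices reproducibly and cut them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]

# 50 rows, as reported by mba_salary_df.info()
train_idx, test_idx = split_indices(50)
print(len(train_idx), len(test_idx))
```

With 50 rows this leaves 40 for training and 10 for validation, which matches the 40 observations reported in the model summary.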

4.3.3 Fitting the Model

mba_salary_lm = sm.OLS( train_y, train_X ).fit()

4.3.3.1 Printing Estimated Parameters and interpreting them

print( mba_salary_lm.params )

const 30587.285652
Percentage in Grade 10 3560.587383
dtype: float64
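
One way to read these estimates: each additional percentage point in Grade 10 is associated with roughly 3,560 more in predicted salary. The sketch below copies the two coefficients from the output above; `predict_salary` is a hypothetical helper introduced only for illustration.

```python
# Coefficients copied from the printed params above
INTERCEPT = 30587.285652
SLOPE = 3560.587383

def predict_salary(grade_10_percentage):
    """Point prediction from the fitted line: intercept + slope * x."""
    return INTERCEPT + SLOPE * grade_10_percentage

predicted_at_70 = predict_salary(70.0)
print(round(predicted_at_70, 2))
```

For a Grade 10 score of 70 this gives about 279,828, which agrees with the `pred_y` value for a grade of 70.0 in the prediction table in Section 4.4.7.3.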

4.4 Model Diagnostics

mba_salary_lm.summary2()

Model: OLS Adj. R-squared: 0.190

Dependent Variable: Salary AIC: 1008.8680

Date: 2019-04-23 18:26 BIC: 1012.2458

No. Observations: 40 Log-Likelihood: -502.43

Df Model: 1 F-statistic: 10.16

Df Residuals: 38 Prob (F-statistic): 0.00287

R-squared: 0.211 Scale: 5.0121e+09

Coef. Std.Err. t P>|t| [0.025 0.975]

const 30587.2857 71869.4497 0.4256 0.6728 -114904.8089 176079.3802

Percentage in Grade 10 3560.5874 1116.9258 3.1878 0.0029 1299.4892 5821.6855

Omnibus: 2.048 Durbin-Watson: 2.611

Prob(Omnibus): 0.359 Jarque-Bera (JB): 1.724

Skew: 0.369 Prob(JB): 0.422

Kurtosis: 2.300 Condition No.: 413
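
Two entries in this summary can be cross-checked by hand, which helps when reading it. The sketch below uses values copied from the summary above; the variable names are illustrative only.

```python
# Values copied from the summary above
r_squared = 0.211
n_obs = 40
n_features = 1          # predictors, excluding the constant
t_slope = 3.1878

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_features - 1)

# With a single predictor, the F-statistic equals the squared t-statistic of the slope
f_from_t = t_slope ** 2

print(round(adj_r_squared, 3), round(f_from_t, 2))
```

Up to rounding of the printed inputs, the results match the reported Adj. R-squared (0.190) and F-statistic (10.16).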

4.4.5 Residual Analysis

4.4.5.1 Checking Normality

import matplotlib.pyplot as plt

import seaborn as sn
%matplotlib inline

mba_salary_resid = mba_salary_lm.resid
probplot = sm.ProbPlot( mba_salary_resid )
plt.figure( figsize = (8, 6) )
probplot.ppplot( line='45' )
plt.title( "Fig 4.1 - Normal P-P Plot of Regression Standardized Residuals" )
plt.show()

<Figure size 576x432 with 0 Axes>

4.4.5.2 Test of Homoscedasticity

def get_standardized_values( vals ):
    return (vals - vals.mean())/vals.std()

plt.scatter( get_standardized_values( mba_salary_lm.fittedvalues ),
get_standardized_values( mba_salary_resid ) )
plt.title( "Fig 4.2 - Residual Plot: MBA Salary Prediction" );
plt.xlabel( "Standardized predicted values")
plt.ylabel( "Standardized Residuals");
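
The `get_standardized_values` helper above centres a series on its mean and scales it by its standard deviation; pandas' `std()` uses the sample standard deviation (ddof=1) by default. A standard-library sketch of the same calculation, with a hypothetical `standardize` function and the first five salaries from the dataset:

```python
import statistics

def standardize(values):
    """Centre on the mean and divide by the sample standard deviation (ddof=1)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

salaries = [270000, 200000, 240000, 250000, 180000]  # first five rows of the data
z = standardize(salaries)
print([round(v, 3) for v in z])
```

After standardizing, the values have mean 0 and sample standard deviation 1, so both axes of the residual plot are on a comparable scale.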

4.4.6 Outlier Analysis

4.4.6.1 Z-Score

from scipy.stats import zscore

mba_salary_df['z_score_salary'] = zscore( mba_salary_df.Salary )

mba_salary_df[ (mba_salary_df.z_score_salary > 3.0) | (mba_salary_df.z_score_salary < -3.0) ]

S. No. Percentage in Grade 10 Salary z_score_salary
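
The empty result above means no salary lies more than three standard deviations from the mean. To show what the filter does when an extreme value is present, here is a small sketch on hypothetical data. It uses the population standard deviation, which is the default (ddof=0) in `scipy.stats.zscore`; `z_scores` and `toy_salaries` are made up for illustration.

```python
import statistics

def z_scores(values):
    """Z-scores using the population standard deviation (ddof=0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical data: twenty ordinary salaries and one extreme value
toy_salaries = [250000] * 10 + [260000] * 10 + [900000]
scores = z_scores(toy_salaries)
outliers = [s for s, z in zip(toy_salaries, scores) if abs(z) > 3.0]
print(outliers)
```

Only the extreme salary is flagged; the remaining values sit well within one standard deviation of the mean.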

4.4.6.2 Cook's Distance

import numpy as np

mba_influence = mba_salary_lm.get_influence()
(c, p) = mba_influence.cooks_distance

plt.stem( np.arange( len( train_X) ),
          np.round( c, 3 ),
          markerfmt="," );
plt.title( "Figure 4.3 - Cook's distance for all observations in MBA Salary dataset" );
plt.xlabel( "Row index")
plt.ylabel( "Cook's Distance");
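
Cook's distance combines the size of an observation's residual with its leverage. For intuition, the sketch below computes it from scratch for a simple regression, using the textbook formula D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2 with p = 2 estimated parameters. The data and the helper names (`simple_ols`, `cooks_distances`) are hypothetical, and this is not how statsmodels implements `get_influence()` internally.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope, x_bar, s_xx

def cooks_distances(x, y):
    """Cook's distance and leverage for every observation."""
    n, p = len(x), 2  # p = number of estimated parameters (intercept and slope)
    intercept, slope, x_bar, s_xx = simple_ols(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    leverages = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
    distances = [e ** 2 / (p * mse) * h / (1 - h) ** 2
                 for e, h in zip(residuals, leverages)]
    return distances, leverages

# Hypothetical toy data; the last point is far from the rest in both x and y
x = [55.0, 60.0, 62.0, 68.0, 70.0, 72.0, 95.0]
y = [300.0, 250.0, 270.0, 235.0, 260.0, 240.0, 900.0]
distances, leverages = cooks_distances(x, y)
most_influential = max(range(len(x)), key=lambda i: distances[i])
print(most_influential, round(sum(leverages), 6))
```

The leverages always sum to the number of parameters (here 2), and the isolated last point has by far the largest Cook's distance.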

4.4.6.3 Leverage Values

from statsmodels.graphics.regressionplots import influence_plot

fig, ax = plt.subplots( figsize=(8,6) )

influence_plot( mba_salary_lm, ax = ax )
plt.title( "Figure 4.4 - Leverage Value Vs Residuals")
plt.show();

4.4.7 Making prediction using the model

4.4.7.1 Predicting on validation set

pred_y = mba_salary_lm.predict( test_X )


4.4.7.2 Finding R-Square and RMSE

from sklearn.metrics import r2_score, mean_squared_error

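# r2_score can be negative on a validation set; np.abs() discards the sign, so read this value with care
np.abs(r2_score(test_y, pred_y))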

0.15664584974230378


np.sqrt(mean_squared_error(test_y, pred_y))

73458.04348346894
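
Both metrics are simple to compute by hand, which makes their meaning clearer: RMSE is in the same units as salary, and R-squared compares the model's squared error with that of always predicting the mean. The sketch below uses hypothetical values and helper names (`rmse`, `r_squared`) and is not scikit-learn's implementation.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """1 - SSE/SST; negative when the model does worse than predicting the mean."""
    mean_true = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - sse / sst

# Hypothetical salaries (in thousands) and two sets of predictions
actual = [200.0, 240.0, 260.0, 300.0]
good_pred = [210.0, 235.0, 265.0, 290.0]
poor_pred = [300.0, 200.0, 320.0, 180.0]

print(rmse(actual, good_pred), r_squared(actual, good_pred))
print(rmse(actual, poor_pred), r_squared(actual, poor_pred))
```

The second set of predictions yields a negative R-squared, which is why the sign of the validation R-squared is worth inspecting rather than discarding.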

4.4.7.3 Calculating prediction intervals

from statsmodels.sandbox.regression.predstd import wls_prediction_std

# Predict the y values

pred_y = mba_salary_lm.predict( test_X )

# Predict the low and high interval values for y

_, pred_y_low, pred_y_high = wls_prediction_std( mba_salary_lm,
test_X,
alpha = 0.1)

# Store all the values in a dataframe

pred_y_df = pd.DataFrame( { 'grade_10_perc': test_X['Percentage in Grade 10'],
'pred_y': pred_y,
'pred_y_left': pred_y_low,
'pred_y_right': pred_y_high } )

pred_y_df[0:10]

    grade_10_perc         pred_y    pred_y_left   pred_y_right
6            70.0  279828.402452  158379.832044  401276.972860
36           68.0  272707.227686  151576.715020  393837.740352
37           52.0  215737.829560   92950.942395  338524.716726
28           58.0  237101.353858  115806.869618  358395.838097
43           74.5  295851.045675  173266.083342  418436.008008
49           60.8  247070.998530  126117.560983  368024.436076
5            55.0  226419.591709  104507.444388  348331.739030
33           78.0  308313.101515  184450.060488  432176.142542
20           63.0  254904.290772  134057.999258  375750.582286
42           74.4  295494.986937  172941.528691  418048.445182
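
For reference, the interval around a new prediction in a simple regression is usually written as below. This is the standard textbook form, which should correspond to what the helper computes for an unweighted OLS fit; the symbols are introduced here and are not from the original. \hat{y}_0 is the predicted value at x_0, s^2 is the residual variance (reported as Scale in the summary), n is the number of training rows (40), and \alpha = 0.1 gives a 90% interval.

```latex
\hat{y}_0 \;\pm\; t_{1-\alpha/2,\; n-2}\; s \,
\sqrt{\,1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,}
```

The leading 1 under the square root accounts for the noise in an individual salary, which is why these intervals are much wider than a confidence interval for the mean response.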


4.5 Multiple Linear Regression

4.5.2.1 Loading the dataset

ipl_auction_df = pd.read_csv( 'IPL IMB381IPL2013.csv' )

ipl_auction_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 26 columns):
Sl.NO. 130 non-null int64
PLAYER NAME 130 non-null object
AGE 130 non-null int64
COUNTRY 130 non-null object
TEAM 130 non-null object
PLAYING ROLE 130 non-null object
T-RUNS 130 non-null int64
T-WKTS 130 non-null int64
ODI-RUNS-S 130 non-null int64
ODI-SR-B 130 non-null float64
ODI-WKTS 130 non-null int64
ODI-SR-BL 130 non-null float64
CAPTAINCY EXP 130 non-null int64
RUNS-S 130 non-null int64
HS 130 non-null int64
AVE 130 non-null float64
SR-B 130 non-null float64
SIXERS 130 non-null int64
RUNS-C 130 non-null int64
WKTS 130 non-null int64
AVE-BL 130 non-null float64
ECON 130 non-null float64
SR-BL 130 non-null float64
AUCTION YEAR 130 non-null int64
BASE PRICE 130 non-null int64
SOLD PRICE 130 non-null int64
dtypes: float64(7), int64(15), object(4)
memory usage: 26.5+ KB

ipl_auction_df.iloc[0:5, 0:10]

   Sl.NO.   PLAYER NAME  AGE COUNTRY  TEAM PLAYING ROLE  T-RUNS  T-WKTS  ODI-RUNS-S  ODI-SR-B
0       1   Abdulla, YA    2      SA  KXIP   Allrounder       0       0           0      0.00
1       2  Abdur Razzak    2     BAN   RCB       Bowler     214      18         657     71.41
2       3   Agarkar, AB    2     IND   KKR       Bowler     571      58        1269     80.62
3       4     Ashwin, R    1     IND   CSK       Bowler     284      31         241     84.56
4       5  Badrinath, S    2     IND   CSK      Batsman      63       0          79     45.93

ipl_auction_df.iloc[0:5, 13:]

   RUNS-S  HS    AVE    SR-B  SIXERS  RUNS-C  WKTS  AVE-BL   ECON  SR-BL  AUCTION YEAR
0       0   0   0.00    0.00       0     307    15   20.47   8.90  13.93          2009  5
1       0   0   0.00    0.00       0      29     0    0.00  14.50   0.00          2008  5
2     167  39  18.56  121.01       5    1059    29   36.52   8.81  24.90          2008  2
3      58  11   5.80   76.32       0    1125    49   22.96   6.23  22.14          2011  1
4    1317  71  32.93  120.71      28       0     0    0.00   0.00   0.00          2011  1

X_features = ipl_auction_df.columns

X_features = ['AGE', 'COUNTRY', 'PLAYING ROLE',

'T-RUNS', 'T-WKTS', 'ODI-RUNS-S', 'ODI-SR-B',
'ODI-WKTS', 'ODI-SR-BL', 'CAPTAINCY EXP', 'RUNS-S',
'HS', 'AVE', 'SR-B', 'SIXERS', 'RUNS-C', 'WKTS',
'AVE-BL', 'ECON', 'SR-BL']

4.5.3 Encoding Categorical Features

ipl_auction_df['PLAYING ROLE'].unique()

array(['Allrounder', 'Bowler', 'Batsman', 'W. Keeper'], dtype=objec

pd.get_dummies(ipl_auction_df['PLAYING ROLE'])[0:5]

   Allrounder  Batsman  Bowler  W. Keeper
0           1        0       0          0
1           0        0       1          0
2           0        0       1          0
3           0        0       1          0
4           0        1       0          0

categorical_features = ['AGE', 'COUNTRY', 'PLAYING ROLE', 'CAPTAINCY EXP']

ipl_auction_encoded_df = pd.get_dummies( ipl_auction_df[X_features],

columns = categorical_features,
drop_first = True )
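
The `drop_first = True` argument omits the first level of each categorical feature, so that level becomes the baseline against which the other levels are compared. A minimal standard-library sketch of that behaviour is shown below; `one_hot` is a hypothetical helper and does not reproduce pandas' column naming (for example `PLAYING ROLE_Batsman`).

```python
def one_hot(values, drop_first=True):
    """Encode a list of category labels as 0/1 columns, one per level.

    With drop_first=True the first level (in sorted order) is omitted and
    acts as the baseline, which avoids a column that is fully determined
    by the others plus the constant.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    return {level: [1 if v == level else 0 for v in values] for level in kept}

roles = ['Allrounder', 'Bowler', 'Bowler', 'Bowler', 'Batsman']  # first five rows
encoded = one_hot(roles)
print(encoded)
```

Here 'Allrounder' is dropped, so a row with zeros in every remaining column is an all-rounder.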

ipl_auction_encoded_df.columns

Index(['T-RUNS', 'T-WKTS', 'ODI-RUNS-S', 'ODI-SR-B', 'ODI-WKTS', 'ODI-SR-BL',
       'RUNS-S', 'HS', 'AVE', 'SR-B', 'SIXERS', 'RUNS-C', 'WKTS', 'AVE-BL',
       'ECON', 'SR-BL', 'AGE_2', 'AGE_3', 'COUNTRY_BAN', 'COUNTRY_ENG',
       'COUNTRY_IND', 'COUNTRY_NZ', 'COUNTRY_PAK', 'COUNTRY_SA', 'COUNTRY_SL',
       'COUNTRY_WI', 'COUNTRY_ZIM', 'PLAYING ROLE_Batsman',
       'PLAYING ROLE_Bowler', 'PLAYING ROLE_W. Keeper', 'CAPTAINCY EXP_1'],
      dtype='object')

X_features = ipl_auction_encoded_df.columns

X = sm.add_constant( ipl_auction_encoded_df )
Y = ipl_auction_df['SOLD PRICE']

train_X, test_X, train_y, test_y = train_test_split( X ,

Y,
train_size = 0.8,
random_state = 42 )

4.5.5 Building the model on training dataset

ipl_model_1 = sm.OLS(train_y, train_X).fit()
ipl_model_1.summary2()

Model: OLS Adj. R-squared: 0.362

Dependent Variable: SOLD PRICE AIC: 2965.2841

Date: 2019-04-23 18:26 BIC: 3049.9046

No. Observations: 104 Log-Likelihood: -1450.6

Df Model: 31 F-statistic: 2.883

Df Residuals: 72 Prob (F-statistic): 0.000114

R-squared: 0.554 Scale: 1.1034e+11

Coef. Std.Err. t P>|t| [0.025 0.975]

const 375827.1991 228849.9306 1.6422 0.1049 -80376.7996 832031.197

T-RUNS -53.7890 32.7172 -1.6441 0.1045 -119.0096 11.4316

T-WKTS -132.5967 609.7525 -0.2175 0.8285 -1348.1162 1082.9228

ODI-RUNS-S 57.9600 31.5071 1.8396 0.0700 -4.8482 120.7681

ODI-SR-B -524.1450 1576.6368 -0.3324 0.7405 -3667.1130 2618.8231

ODI-WKTS 815.3944 832.3883 0.9796 0.3306 -843.9413 2474.7301

ODI-SR-BL -773.3092 1536.3334 -0.5033 0.6163 -3835.9338 2289.3154

RUNS-S 114.7205 173.3088 0.6619 0.5101 -230.7643 460.2054

HS -5516.3354 2586.3277 -2.1329 0.0363 -10672.0855 -360.5853

AVE 21560.2760 7774.2419 2.7733 0.0071 6062.6080 37057.9439

SR-B -1324.7218 1373.1303 -0.9647 0.3379 -4062.0071 1412.5635

SIXERS 4264.1001 4089.6000 1.0427 0.3006 -3888.3685 12416.5687

RUNS-C 69.8250 297.6697 0.2346 0.8152 -523.5687 663.2187

WKTS 3075.2422 7262.4452 0.4234 0.6732 -11402.1778 17552.6622

AVE-BL 5182.9335 10230.1581 0.5066 0.6140 -15210.5140 25576.3810

ECON -6820.7781 13109.3693 -0.5203 0.6045 -32953.8282 19312.2721

SR-BL -7658.8094 14041.8735 -0.5454 0.5871 -35650.7726 20333.1539

AGE_2 -230767.6463 114117.2005 -2.0222 0.0469 -458256.1279 -3279.1648

AGE_3 -216827.0808 152246.6232 -1.4242 0.1587 -520325.1772 86671.0155

COUNTRY_BAN -122103.5196 438719.2796 -0.2783 0.7816 -996674.4194 752467.380

COUNTRY_ENG 672410.7654 238386.2220 2.8207 0.0062 197196.5172 1147625.01

COUNTRY_IND 155306.4011 126316.3449 1.2295 0.2229 -96500.6302 407113.432

COUNTRY_NZ 194218.9120 173491.9293 1.1195 0.2667 -151630.9280 540068.752

COUNTRY_PAK 75921.7670 193463.5545 0.3924 0.6959 -309740.7804 461584.314

COUNTRY_SA 64283.3894 144587.6773 0.4446 0.6579 -223946.8775 352513.656

COUNTRY_SL 17360.1530 176333.7497 0.0985 0.9218 -334154.7526 368875.058

COUNTRY_WI 10607.7792 230686.7892 0.0460 0.9635 -449257.9303 470473.488

COUNTRY_ZIM -145494.4793 401505.2815 -0.3624 0.7181 -945880.6296 654891.671

PLAYING ROLE_Batsman 75724.7643 150250.0240 0.5040 0.6158 -223793.1844 375242.713

PLAYING ROLE_Bowler 15395.8752 126308.1272 0.1219 0.9033 -236394.7744 267186.524

PLAYING ROLE_W. Keeper -71358.6280 213585.7444 -0.3341 0.7393 -497134.0278 354416.771

CAPTAINCY EXP_1 164113.3972 123430.6353 1.3296 0.1878 -81941.0772 410167.871

Omnibus: 0.891 Durbin-Watson: 2.244

Prob(Omnibus): 0.640 Jarque-Bera (JB): 0.638

Skew: 0.190 Prob(JB): 0.727

Kurtosis: 3.059 Condition No.: 84116

4.5.6 Multi-Collinearity

4.5.6.1 VIF

from statsmodels.stats.outliers_influence import variance_inflation_factor

def get_vif_factors( X ):
    # DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
    X_matrix = X.values
    vif = [ variance_inflation_factor( X_matrix, i ) for i in range( X_matrix.shape[1] ) ]
    vif_factors = pd.DataFrame()
    vif_factors['column'] = X.columns
    vif_factors['vif'] = vif

    return vif_factors

Calling the above function with the X features returns the VIF for each column.
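
The VIF of a feature is 1 / (1 - R^2), where R^2 comes from regressing that feature on the other features. With only one other feature, that R^2 is just the squared correlation between the two, which allows a compact illustration. The data and helper names below (`pearson_r`, `vif_two_predictors`) are hypothetical; statsmodels' `variance_inflation_factor` handles the general case with many features.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def vif_two_predictors(a, b):
    """VIF = 1 / (1 - R^2); with one other predictor, R^2 is the squared correlation."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical, strongly related features (think runs scored and sixes hit)
runs = [100.0, 250.0, 400.0, 520.0, 700.0, 880.0]
sixes = [3.0, 9.0, 12.0, 20.0, 24.0, 33.0]
unrelated = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]

print(round(vif_two_predictors(runs, sixes), 1))
print(round(vif_two_predictors(runs, unrelated), 2))
```

The strongly correlated pair produces a VIF far above the cutoff of 4 used later in this section, while the unrelated pair stays close to the minimum value of 1.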

vif_factors = get_vif_factors( X[X_features] )
vif_factors

column vif

0 T-RUNS 12.612694

1 T-WKTS 7.679284

2 ODI-RUNS-S 16.426209

3 ODI-SR-B 13.829376

4 ODI-WKTS 9.951800

5 ODI-SR-BL 4.426818

6 RUNS-S 16.135407

7 HS 22.781017

8 AVE 25.226566

9 SR-B 21.576204

10 SIXERS 9.547268

11 RUNS-C 38.229691

12 WKTS 33.366067

13 AVE-BL 100.198105

14 ECON 7.650140

15 SR-BL 103.723846

16 AGE_2 6.996226

17 AGE_3 3.855003

18 COUNTRY_BAN 1.469017

19 COUNTRY_ENG 1.391524

20 COUNTRY_IND 4.568898

21 COUNTRY_NZ 1.497856

22 COUNTRY_PAK 1.796355

23 COUNTRY_SA 1.886555

24 COUNTRY_SL 1.984902

25 COUNTRY_WI 1.531847

26 COUNTRY_ZIM 1.312168

27 PLAYING ROLE_Batsman 4.843136

28 PLAYING ROLE_Bowler 3.795864

29 PLAYING ROLE_W. Keeper 3.132044

30 CAPTAINCY EXP_1 4.245128


4.5.6.2 Checking correlation of columns with large VIFs

columns_with_large_vif = vif_factors[vif_factors.vif > 4].column

plt.figure( figsize = (12,10) )

sn.heatmap( X[columns_with_large_vif].corr(), annot = True );
plt.title( "Figure 4.5 - Heatmap depicting correlation between features");

columns_to_be_removed = ['T-RUNS', 'T-WKTS', 'RUNS-S', 'HS',

'AVE', 'RUNS-C', 'SR-B', 'AVE-BL',
'ECON', 'ODI-SR-B', 'ODI-RUNS-S', 'AGE_2', 'SR-BL']

X_new_features = list( set(X_features) - set(columns_to_be_removed) )

get_vif_factors( X[X_new_features] )

column vif

0 COUNTRY_SL 1.519752

1 SIXERS 2.397409

2 COUNTRY_BAN 1.094293

3 COUNTRY_NZ 1.173418

4 AGE_3 1.779861

5 COUNTRY_ENG 1.131869

6 COUNTRY_PAK 1.334773

7 ODI-WKTS 2.742889

8 CAPTAINCY EXP_1 2.458745

9 PLAYING ROLE_W. Keeper 1.900941

10 WKTS 2.883101

11 PLAYING ROLE_Bowler 3.060168

12 COUNTRY_ZIM 1.205305

13 PLAYING ROLE_Batsman 2.680207

14 COUNTRY_WI 1.194093

15 COUNTRY_IND 3.144668

16 ODI-SR-BL 2.822148

17 COUNTRY_SA 1.416657

4.5.6.3 Building a new model after removing multicollinearity

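# Note: X_new_features has no 'const' column, so this model is fitted without an intercept
# and the R-squared that statsmodels reports for it is the uncentered version
train_X = train_X[X_new_features]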

ipl_model_2 = sm.OLS(train_y, train_X).fit()

ipl_model_2.summary2()

Model: OLS Adj. R-squared: 0.728

Dependent Variable: SOLD PRICE AIC: 2965.1080

Date: 2019-04-23 18:26 BIC: 3012.7070

No. Observations: 104 Log-Likelihood: -1464.6

Df Model: 18 F-statistic: 16.49

Df Residuals: 86 Prob (F-statistic): 1.13e-20

R-squared: 0.775 Scale: 1.2071e+11

Coef. Std.Err. t P>|t| [0.025 0.975]

COUNTRY_SL 55912.3398 142277.1829 0.3930 0.6953 -226925.3388 338750.018

SIXERS 7862.1259 2086.6101 3.7679 0.0003 3714.0824 12010.1694

COUNTRY_BAN -108758.6040 369274.1916 -0.2945 0.7691 -842851.4010 625334.193

COUNTRY_NZ 142968.8843 151841.7382 0.9416 0.3491 -158882.5009 444820.269

AGE_3 -8950.6659 98041.9325 -0.0913 0.9275 -203851.5772 185950.245

COUNTRY_ENG 682934.7166 216150.8279 3.1595 0.0022 253241.0920 1112628.34

COUNTRY_PAK 122810.2480 159600.8063 0.7695 0.4437 -194465.6541 440086.150

ODI-WKTS 772.4088 470.6354 1.6412 0.1044 -163.1834 1708.0009

CAPTAINCY EXP_1 208376.6957 98128.0284 2.1235 0.0366 13304.6315 403448.760

PLAYING ROLE_W. Keeper -55121.9240 169922.5271 -0.3244 0.7464 -392916.7280 282672.880

WKTS 2431.8988 2105.3524 1.1551 0.2512 -1753.4033 6617.2008

PLAYING ROLE_Bowler -18315.4968 106035.9664 -0.1727 0.8633 -229108.0215 192477.027

COUNTRY_ZIM -67977.6781 390859.9289 -0.1739 0.8623 -844981.5006 709026.144

PLAYING ROLE_Batsman 121382.0570 106685.0356 1.1378 0.2584 -90700.7746 333464.888

COUNTRY_WI -22234.9315 213050.5847 -0.1044 0.9171 -445765.4766 401295.613

COUNTRY_IND 282829.8091 96188.0292 2.9404 0.0042 91614.3356 474045.282

ODI-SR-BL 909.0021 1267.4969 0.7172 0.4752 -1610.6983 3428.7026

COUNTRY_SA 108735.9086 115092.9596 0.9448 0.3474 -120061.3227 337533.139

Omnibus: 8.635 Durbin-Watson: 2.252

Prob(Omnibus): 0.013 Jarque-Bera (JB): 8.345

Skew: 0.623 Prob(JB): 0.015

Kurtosis: 3.609 Condition No.: 1492

significant_vars = ['COUNTRY_IND', 'COUNTRY_ENG', 'SIXERS', 'CAPTAINCY EXP_1']

train_X = train_X[significant_vars]

ipl_model_3 = sm.OLS(train_y, train_X).fit()

ipl_model_3.summary2()

Model: OLS Adj. R-squared: 0.704

Dependent Variable: SOLD PRICE AIC: 2961.8089

Date: 2019-04-23 18:26 BIC: 2972.3864

No. Observations: 104 Log-Likelihood: -1476.9

Df Model: 4 F-statistic: 62.77

Df Residuals: 100 Prob (F-statistic): 1.97e-26

R-squared: 0.715 Scale: 1.3164e+11

Coef. Std.Err. t P>|t| [0.025 0.975]

COUNTRY_IND 387890.2538 63007.1511 6.1563 0.0000 262885.8606 512894.6471

COUNTRY_ENG 731833.6386 214164.4988 3.4172 0.0009 306937.3727 1156729.9045

SIXERS 8637.8344 1675.1313 5.1565 0.0000 5314.4216 11961.2472

CAPTAINCY EXP_1 359725.2741 74930.3460 4.8008 0.0000 211065.6018 508384.9463

Omnibus: 1.130 Durbin-Watson: 2.238

Prob(Omnibus): 0.568 Jarque-Bera (JB): 0.874

Skew: 0.223 Prob(JB): 0.646

Kurtosis: 3.046 Condition No.: 165
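
The AIC and BIC entries in these summaries follow directly from the log-likelihood, which makes them convenient for comparing the successive IPL models (lower is better). A short check using values copied from the `ipl_model_3` summary above; the variable names are illustrative.

```python
import math

# Values copied from the ipl_model_3 summary above
log_likelihood = -1476.9   # rounded in the printed summary
n_params = 4               # four coefficients, no constant
n_obs = 104

aic = 2 * n_params - 2 * log_likelihood
bic = n_params * math.log(n_obs) - 2 * log_likelihood

print(round(aic, 1), round(bic, 1))
```

Up to rounding of the printed log-likelihood, the results agree with the reported AIC (2961.8089) and BIC (2972.3864).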

4.5.7 Residual Analysis

4.5.7.1 P-P Plot

def draw_pp_plot( model, title ):
    probplot = sm.ProbPlot( model.resid )
    plt.figure( figsize = (8, 6) )
    probplot.ppplot( line='45' )
    plt.title( title )
    plt.show()

draw_pp_plot( ipl_model_3,
"Figure 4.6 - Normal P-P Plot of Regression Standardized Residuals"
);

<Figure size 576x432 with 0 Axes>

4.5.7.2 Residual Plot

def plot_resid_fitted( fitted, resid, title):
    plt.scatter( get_standardized_values( fitted ),
                 get_standardized_values( resid ) )
    plt.title( title )
    plt.xlabel( "Standardized predicted values")
    plt.ylabel( "Standardized residual values")
    plt.show()

plot_resid_fitted( ipl_model_3.fittedvalues,
ipl_model_3.resid,
"Figure 4.7 - Residual Plot")


4.5.8 Detecting Influencers

k = train_X.shape[1]
n = train_X.shape[0]

print( "Number of variables:", k, " and number of observations:", n)

Number of variables: 4 and number of observations: 104

leverage_cutoff = 3*((k + 1)/n)

print( "Cutoff for leverage value: ", round(leverage_cutoff, 3) )

Cutoff for leverage value: 0.144

from statsmodels.graphics.regressionplots import influence_plot

fig, ax = plt.subplots( figsize=(8,6) )

influence_plot( ipl_model_3, ax = ax )
plt.title( "Figure 4.7 - Leverage Value Vs Residuals")
plt.show()

ipl_auction_df[ipl_auction_df.index.isin( [23, 58, 83] )]

    Sl.NO.      PLAYER NAME  AGE COUNTRY  TEAM PLAYING ROLE  T-RUNS  T-WKTS  ODI-RUNS-S  ODI-SR-B
23      24      Flintoff, A    2     ENG   CSK   Allrounder    3845     226        3394        88
58      59  Mascarenhas, AD    2     ENG   RR+   Allrounder       0       0         245        95
83      84    Pietersen, KP    2     ENG  RCB+      Batsman    6654       5        4184        86

3 rows × 26 columns

train_X_new = train_X.drop( [23, 58, 83], axis = 0)

train_y_new = train_y.drop( [23, 58, 83], axis = 0)

4.5.9 Transforming Response Variable

train_y = np.sqrt( train_y )

ipl_model_4 = sm.OLS(train_y, train_X).fit()
ipl_model_4.summary2()

Model: OLS Adj. R-squared: 0.741

Dependent Variable: SOLD PRICE AIC: 1527.9999

Date: 2019-04-23 18:27 BIC: 1538.5775

No. Observations: 104 Log-Likelihood: -760.00

Df Model: 4 F-statistic: 75.29

Df Residuals: 100 Prob (F-statistic): 2.63e-29

R-squared: 0.751 Scale: 1.3550e+05

Coef. Std.Err. t P>|t| [0.025 0.975]

COUNTRY_IND 490.7089 63.9238 7.6765 0.0000 363.8860 617.5318

COUNTRY_ENG 563.0261 217.2801 2.5912 0.0110 131.9486 994.1036

SIXERS 8.5338 1.6995 5.0213 0.0000 5.1620 11.9055

CAPTAINCY EXP_1 417.7575 76.0204 5.4953 0.0000 266.9352 568.5799

Omnibus: 0.017 Durbin-Watson: 1.879

Prob(Omnibus): 0.992 Jarque-Bera (JB): 0.145

Skew: 0.005 Prob(JB): 0.930

Kurtosis: 2.817 Condition No.: 165

The R-squared value of the model has increased to 0.751, and the following P-P plot suggests that the
residuals are close to normally distributed.

draw_pp_plot( ipl_model_4,
"Figure 4.8 - Normal P-P Plot of Regression Standardized Residuals"
);

<Figure size 576x432 with 0 Axes>


4.5.10 Making predictions on validation set

pred_y = np.power( ipl_model_4.predict( test_X[train_X.columns] ), 2)
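
Because the model was fitted on the square root of the sold price, its predictions are on that scale and must be squared to get back to prices, which is what `np.power(..., 2)` does above. The sketch below walks through one prediction by hand. The coefficients are copied from the `ipl_model_4` summary; the player profile and the helper names are hypothetical.

```python
import math

def to_model_scale(price):
    """Transform used for the response before fitting."""
    return math.sqrt(price)

def to_price_scale(prediction):
    """Inverse transform applied to the model's predictions."""
    return prediction ** 2

# Coefficients copied from the ipl_model_4 summary; the player profile is hypothetical
coef = {'COUNTRY_IND': 490.7089, 'COUNTRY_ENG': 563.0261,
        'SIXERS': 8.5338, 'CAPTAINCY EXP_1': 417.7575}
player = {'COUNTRY_IND': 1, 'COUNTRY_ENG': 0, 'SIXERS': 20, 'CAPTAINCY EXP_1': 0}

sqrt_price = sum(coef[name] * player[name] for name in coef)
predicted_price = to_price_scale(sqrt_price)
print(round(sqrt_price, 4), round(predicted_price, 2))
```

An Indian player with 20 sixes and no captaincy experience gets a prediction of about 661 on the square-root scale, which corresponds to a sold price of roughly 437,000.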

Measuring RMSE

from sklearn import metrics

np.sqrt(metrics.mean_squared_error(test_y, pred_y))

496151.18122558104

Measuring R-squared value

# Note: r2_score expects (y_true, y_pred) and is not symmetric; this call passes pred_y first
np.round( metrics.r2_score(pred_y, test_y), 2 )

0.44
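
Unlike mean squared error, R-squared depends on which argument is treated as the observed values, because the total sum of squares is computed around the mean of the first argument. A small sketch with hypothetical numbers and a hand-written `r2` function (not scikit-learn's implementation) shows the effect:

```python
def r2(y_true, y_pred):
    """R-squared with y_true as the reference: 1 - SSE / SST(y_true)."""
    mean_true = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - sse / sst

# Hypothetical observed prices and predictions
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [180.0, 220.0, 280.0, 320.0]

print(r2(observed, predicted))   # observed values as the reference
print(r2(predicted, observed))   # arguments swapped
```

The two calls give different values, and in this example the swapped call is even negative, so the argument order matters when reporting a validation R-squared.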
