
112 Machine Learning using Python

from statsmodels.sandbox.regression.predstd import wls_prediction_std

# Predict y and the low/high prediction interval values for the test set
pred_y = mba_salary_lm.predict(test_X)
_, pred_y_low, pred_y_high = wls_prediction_std(mba_salary_lm,
                                                test_X,
                                                alpha = 0.1)

# Store the predictions along with the interval bounds in a DataFrame
pred_y_df = pd.DataFrame({'grade_10_perc': test_X['Percentage in Grade 10'],
                          'pred_y': pred_y,
                          'pred_y_left': pred_y_low,
                          'pred_y_right': pred_y_high})

pred_y_df[0:10]

grade_10_perc pred_y pred_y_left pred_y_right


6 70.0 279828.402452 158379.832044 401276.972860
36 68.0 272707.227686 151576.715020 393837.740352
37 52.0 215737.829560 92950.942395 338524.716726
28 58.0 237101.353858 115806.869618 358395.838097
43 74.5 295851.045675 173266.083342 418436.008008
49 60.8 247070.998530 126117.560983 368024.436076
5 55.0 226419.591709 104507.444388 348331.739030
33 78.0 308313.101515 184450.060488 432176.142542
20 63.0 254904.290772 134057.999258 375750.582286
42 74.4 295494.986937 172941.528691 418048.445182

4.5 | MULTIPLE LINEAR REGRESSION

Multiple linear regression captures the association (relationship) between a dependent variable (aka response variable or outcome variable) and several independent variables (aka explanatory variables, predictor variables, or features).

Yi = b0 + b1X1i + b2X2i + … + bkXki + ei
The regression coefficients b1, b2, …, bk are called partial regression coefficients, since the relationship between an explanatory variable and the response (outcome) variable is calculated after removing (or controlling for) the effect of all the other explanatory variables (features) in the model.
The assumptions that are made in the multiple linear regression model are as follows:

1. The regression model is linear in the regression parameters (b-values).
2. The residuals follow a normal distribution and the expected value (mean) of the residuals is zero.
3. In time series data, residuals are assumed to be uncorrelated.

Chapter 04_Linear Regression.indd 112 4/24/2019 6:54:46 PM


Chapter 4 • Linear Regression 113

4. The variance of the residuals is constant for all values of Xi. When the variance of the residuals
is constant for different values of Xi, it is called homoscedasticity. A non-constant variance of
residuals is called heteroscedasticity.
5. There is no high correlation between independent variables in the model (called multi-collinearity).
Multi-collinearity can destabilize the model and can result in an incorrect estimation of the
regression parameters.
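These assumptions can be checked numerically once a model is fitted. The sketch below is illustrative only: it runs on synthetic residuals rather than on a fitted model's, and uses the Jarque-Bera test for normality plus a hand-computed Durbin-Watson statistic (a value near 2 suggests no autocorrelation).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic, well-behaved residuals standing in for a model's residuals
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

# Assumption 2: residuals follow a normal distribution with mean zero
mean_resid = float(np.mean(residuals))
jb_stat, jb_pvalue = stats.jarque_bera(residuals)

# Assumption 3: no autocorrelation -- Durbin-Watson statistic (~2 is ideal)
dw = float(np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2))

print(round(mean_resid, 3), round(dw, 2))
```

For a real model fitted with statsmodels, the residuals are available as the resid attribute of the fitted result, and the summary output already reports the Durbin-Watson and Jarque-Bera statistics.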

The partial regression coefficients are estimated by minimizing the sum of squared errors (SSE). We will
explain the multiple linear regression model by using the example of auction pricing of players in the
Indian Premier League (IPL).
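Minimizing SSE has a closed-form solution, the normal equations b = (XᵀX)⁻¹Xᵀy. The following sketch fits a two-feature model on synthetic data (all variable names here are illustrative, not from the IPL dataset) and recovers the true coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X1 = rng.uniform(0, 10, n)
X2 = rng.uniform(0, 5, n)
# True model: Y = 2 + 3*X1 - 1.5*X2 + e
y = 2 + 3 * X1 - 1.5 * X2 + rng.normal(0, 0.1, n)

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones(n), X1, X2])

# Normal equations: solve (X'X) b = X'y for the coefficient vector b
b = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(b, 2))  # close to [2.0, 3.0, -1.5]
```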

4.5.1 | Predicting the SOLD PRICE (Auction Price) of Players

The Indian Premier League (IPL) is a professional Twenty20 cricket league in India, in which international players and popular Indian players are auctioned.


Although the IPL follows the Twenty20 format of the game, it is possible that the performance of the players in the other
formats of the game, such as Test and One-Day matches, could influence player pricing. A few players
had excellent records in Test matches, but their records in Twenty20 matches were not very impressive.
The performance of the players up to 2011, measured
through various performance metrics, is provided in Table 4.3.

TABLE 4.3 Metadata of IPL dataset


Data Code Description
AGE Age of the player at the time of auction, classified into three categories. Category 1 (L25) means the player is less
than 25 years old, category 2 means that the age is between 25 and 35 years (B25−35), and category 3 means
that the age is more than 35 (A35).
RUNS-S Number of runs scored by a player.
RUNS-C Number of runs conceded by a player.
HS Highest score by a batsman in IPL.
AVE-B Average runs scored by a batsman in IPL.
AVE-BL Bowling average (number of runs conceded/number of wickets taken) in IPL.
SR-B Batting strike rate (ratio of the number of runs scored to the number of balls faced) in IPL.
SR-BL Bowling strike rate (ratio of the number of balls bowled to the number of wickets taken) in IPL.
SIXERS Number of sixes (six runs) scored by a player in IPL.
WKTS Number of wickets taken by a player in IPL.
ECON Economy rate of a bowler (number of runs conceded by the bowler per over) in IPL.
CAPTAINCY EXP Captained either a T20 team or a national team.
ODI-SR-B Batting strike rate in One-Day Internationals.
ODI-SR-BL Bowling strike rate in One-Day Internationals.
ODI-RUNS-S Runs scored in One-Day Internationals.
ODI-WKTS Wickets taken in One-Day Internationals.
T-RUNS-S Runs scored in Test matches.
T-WKTS Wickets taken in Test matches.
PLAYER-SKILL Player’s primary skill (batsman, bowler, or allrounder).
COUNTRY Country of origin of the player (AUS: Australia; IND: India; PAK: Pakistan; SA: South Africa; SL: Sri Lanka; NZ: New
Zealand; WI: West Indies; OTH: Other countries).
YEAR-A Year of Auction in IPL.
IPL TEAM Team(s) for which the player had played in the IPL (CSK: Chennai Super Kings; DC: Deccan Chargers; DD: Delhi
Daredevils; KXI: Kings XI Punjab; KKR: Kolkata Knight Riders; MI: Mumbai Indians; PWI: Pune Warriors India; RR: Rajasthan
Royals; RCB: Royal Challengers Bangalore). A + sign is used to indicate that the player has played for more than one
team. For example, CSK+ would mean that the player has played for CSK as well as for one or more other teams.

4.5.2 | Developing Multiple Linear Regression Model Using Python


In this section, we will discuss the various steps involved in developing a multiple linear regression model using Python.

4.5.2.1 Loading the Dataset


We load the file using pandas and print the metadata.

ipl_auction_df = pd.read_csv('IPL IMB381IPL2013.csv')

ipl_auction_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 26 columns):
Sl.NO.           130 non-null int64
PLAYER NAME      130 non-null object
AGE              130 non-null int64
COUNTRY          130 non-null object
TEAM             130 non-null object
PLAYING ROLE     130 non-null object
T-RUNS           130 non-null int64
T-WKTS           130 non-null int64
ODI-RUNS-S       130 non-null int64
ODI-SR-B         130 non-null float64
ODI-WKTS         130 non-null int64
ODI-SR-BL        130 non-null float64
CAPTAINCY EXP    130 non-null int64
RUNS-S           130 non-null int64
HS               130 non-null int64
AVE              130 non-null float64
SR-B             130 non-null float64
SIXERS           130 non-null int64
RUNS-C           130 non-null int64
WKTS             130 non-null int64
AVE-BL           130 non-null float64
ECON             130 non-null float64
SR-BL            130 non-null float64
AUCTION YEAR     130 non-null int64
BASE PRICE       130 non-null int64
SOLD PRICE       130 non-null int64
dtypes: float64(7), int64(15), object(4)
memory usage: 26.5+ KB
There are 130 observations (records) and 26 columns (features) in the data, and there are no missing
values.
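The missing-value claim can be verified with pandas' isnull(). A self-contained sketch on a toy DataFrame (the real check would be ipl_auction_df.isnull().sum(), which returns all zeros for this dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for ipl_auction_df; one missing value is planted
df = pd.DataFrame({'RUNS-S': [100, 250, 0],
                   'WKTS': [5.0, np.nan, 12.0]})

missing_per_column = df.isnull().sum()   # Series: count of NaNs per column
total_missing = int(missing_per_column.sum())
print(total_missing)  # 1
```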

4.5.2.2 Displaying the First Five Records


As the number of columns is very large, we will display only the initial 10 columns for the first 5 rows. The
function df.iloc() is used for displaying a subset of the dataset.

ipl_auction_df.iloc[0:5, 0:10]

Sl. NO. PLAYER NAME AGE COUNTRY TEAM PLAYING ROLE T-RUNS T-WKTS ODI-RUNS-S ODI-SR-B
0 1 Abdulla, YA 2 SA KXIP Allrounder 0 0 0 0.00
1 2 Abdur Razzak 2 BAN RCB Bowler 214 18 657 71.41
2 3 Agarkar, AB 2 IND KKR Bowler 571 58 1269 80.62
3 4 Ashwin, R 1 IND CSK Bowler 284 31 241 84.56
4 5 Badrinath, S 2 IND CSK Batsman 63 0 79 45.93

We can build a model to understand what features of players are influencing their SOLD PRICE, or to
predict players' auction prices in the future. However, not all columns are features. For example, Sl. NO.
is just a serial number and cannot be considered a feature of the player. We will build a model using only
the players' statistics, so BASE PRICE can also be removed. We will create a variable X_features, which will
contain the list of features that we will finally use for building the model, and ignore the rest of the columns
of the DataFrame. The following code is used for including the features in the model building.

X_features = ipl_auction_df.columns

Most of the features in the dataset are numerical (ratio scale), whereas features such as AGE, COUNTRY,
PLAYING ROLE, and CAPTAINCY EXP are categorical and hence need to be encoded before building the
model. Categorical variables cannot be directly included in the regression model; they must be
encoded using dummy variables before being incorporated into the model building.

X_features = ['AGE', 'COUNTRY', 'PLAYING ROLE',
              'T-RUNS', 'T-WKTS', 'ODI-RUNS-S', 'ODI-SR-B',
              'ODI-WKTS', 'ODI-SR-BL', 'CAPTAINCY EXP', 'RUNS-S',
              'HS', 'AVE', 'SR-B', 'SIXERS', 'RUNS-C', 'WKTS',
              'AVE-BL', 'ECON', 'SR-BL']

4.5.3 | Encoding Categorical Features


Qualitative variables or categorical variables need to be encoded using dummy variables before incorporating
them in the regression model. If a categorical variable has n categories (e.g., the player role in the data has four
categories, namely, batsman, bowler, wicket-keeper, and allrounder), then we will need n − 1 dummy variables.
So, in the case of PLAYING ROLE, we will need three dummy variables since there are four categories.
Finding the unique values of the column PLAYING ROLE shows the values Allrounder, Bowler, Batsman,
and W. Keeper:

ipl_auction_df['PLAYING ROLE'].unique()

array(['Allrounder', 'Bowler', 'Batsman', 'W. Keeper'], dtype=object)


The variable can be converted into four dummy variables, with the variable value set to 1 to indicate the role
of the player. This can be done using the pd.get_dummies() method. We will first create dummy variables for only
PLAYING ROLE to understand the encoding, and then create dummy variables for the rest of the categorical variables.

pd.get_dummies(ipl_auction_df['PLAYING ROLE'])[0:5]

Allrounder Batsman Bowler W. Keeper


0 1.0 0.0 0.0 0.0
1 0.0 0.0 1.0 0.0

2 0.0 0.0 1.0 0.0
3 0.0 0.0 1.0 0.0
4 0.0 1.0 0.0 0.0

As shown in the table above, the pd.get_dummies() method has created four dummy variables and has
set the appropriate dummy variable to 1 in each sample.
Whenever we have n levels (or categories) for a qualitative variable (categorical variable), we will use
(n − 1) dummy variables, where each dummy variable is a binary variable used for representing whether
an observation belongs to a category or not. The reason why we create only (n − 1) dummy variables
is that inclusion of dummy variables for all categories along with the constant in the regression equation will
create perfect multi-collinearity (will be discussed later). To drop one category, the parameter drop_first
should be set to True.
We must create dummy variables for all categorical (qualitative) variables present in the dataset.
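The effect of drop_first can be seen on a toy Series containing the four PLAYING ROLE categories (a minimal sketch, not the full dataset):

```python
import pandas as pd

roles = pd.Series(['Allrounder', 'Bowler', 'Batsman', 'W. Keeper'],
                  name='PLAYING ROLE')

full = pd.get_dummies(roles)                      # n = 4 dummy columns
reduced = pd.get_dummies(roles, drop_first=True)  # n - 1 = 3 dummy columns

# get_dummies orders categories alphabetically, so 'Allrounder' is dropped
print(full.shape[1], reduced.shape[1])
print(list(reduced.columns))
```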

categorical_features = ['AGE', 'COUNTRY', 'PLAYING ROLE', 'CAPTAINCY EXP']

ipl_auction_encoded_df = pd.get_dummies(ipl_auction_df[X_features],
                                        columns = categorical_features,
                                        drop_first = True)

ipl_auction_encoded_df.columns

Index(['T-RUNS', 'T-WKTS', 'ODI-RUNS-S', 'ODI-SR-B', 'ODI-WKTS',
       'ODI-SR-BL', 'RUNS-S', 'HS', 'AVE', 'SR-B', 'SIXERS',
       'RUNS-C', 'WKTS', 'AVE-BL', 'ECON', 'SR-BL', 'AGE_2',
       'AGE_3', 'COUNTRY_BAN', 'COUNTRY_ENG', 'COUNTRY_IND',
       'COUNTRY_NZ', 'COUNTRY_PAK', 'COUNTRY_SA', 'COUNTRY_SL',
       'COUNTRY_WI', 'COUNTRY_ZIM', 'PLAYING ROLE_Batsman',
       'PLAYING ROLE_Bowler', 'PLAYING ROLE_W. Keeper',
       'CAPTAINCY EXP_1'],
      dtype='object')
The dataset now contains the new dummy variables that have been created. We can reassign the new features
to the variable X_features, which we created earlier to keep track of all features that will be used to build
the model finally.

X_features = ipl_auction_encoded_df.columns

4.5.4 | Splitting the Dataset into Train and Validation Sets


Before building the model, we will split the dataset in an 80:20 ratio. The split function allows a
parameter random_state, which seeds the random number generator for reproducibility. This parameter is
not required to be passed; setting it to a fixed number will make sure that the records that go
into the training and test sets remain unchanged across runs, and hence the results can be reproduced. We will use the
value 42 (it is again selected arbitrarily). You can use the same random seed of 42 to reproduce the results
reported here; a different seed may produce different results.
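The effect of random_state can be demonstrated on toy arrays (illustrative data, not the IPL dataset): two calls with the same seed return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two splits with the same random_state produce the same partition
Xtr1, Xte1, ytr1, yte1 = train_test_split(X, y, train_size=0.8, random_state=42)
Xtr2, Xte2, ytr2, yte2 = train_test_split(X, y, train_size=0.8, random_state=42)

print(np.array_equal(Xtr1, Xtr2) and np.array_equal(yte1, yte2))  # True
```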

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

X = sm.add_constant(ipl_auction_encoded_df)
Y = ipl_auction_df['SOLD PRICE']
train_X, test_X, train_y, test_y = train_test_split(X,
                                                    Y,
                                                    train_size = 0.8,
                                                    random_state = 42)

4.5.5 | Building the Model on the Training Dataset

We build the model using the OLS method in statsmodels on the training dataset. The summary2() method
provides details of the model accuracy, feature significance, and signs of any multi-collinearity effect,
which is discussed in detail in the next section.

ipl_model_1 = sm.OLS(train_y, train_X).fit()


ipl_model_1.summary2()

TABLE 4.4 Model summary for ipl_model_1


Model: OLS Adj. R-squared: 0.362
Dependent Variable: SOLD PRICE AIC: 2965.2841
Date: 2018-04-08 07:27 BIC: 3049.9046
No. Observations: 104 Log-Likelihood: −1450.6
Df Model: 31 F-statistic: 2.883
Df Residuals: 72 Prob (F-statistic): 0.000114
R-squared: 0.554 Scale: 1.1034e+11

Coef. Std.Err. t P > |t| [0.025 0.975]


const 375827.1991 228849.9306 1.6422 0.1049 −80376.7996 832031.1978
T-RUNS −53.7890 32.7172 −1.6441 0.1045 −119.0096 11.4316
T-WKTS −132.5967 609.7525 −0.2175 0.8285 −1348.1162 1082.9228
ODI-RUNS-S 57.9600 31.5071 1.8396 0.0700 −4.8482 120.7681
ODI-SR-B −524.1450 1576.6368 −0.3324 0.7405 −3667.1130 2618.8231
ODI-WKTS 815.3944 832.3883 0.9796 0.3306 −843.9413 2474.7301
ODI-SR-BL −773.3092 1536.3334 −0.5033 0.6163 −3835.9338 2289.3154
RUNS-S 114.7205 173.3088 0.6619 0.5101 −230.7643 460.2054
HS −5516.3354 2586.3277 −2.1329 0.0363 −10672.0855 −360.5853
AVE 21560.2760 7774.2419 2.7733 0.0071 6062.6080 37057.9439
SR-B −1324.7218 1373.1303 −0.9647 0.3379 −4062.0071 1412.5635
SIXERS 4264.1001 4089.6000 1.0427 0.3006 −3888.3685 12416.5687
RUNS-C 69.8250 297.6697 0.2346 0.8152 −523.5687 663.2187
WKTS 3075.2422 7262.4452 0.4234 0.6732 −11402.1778 17552.6622
AVE-BL 5182.9335 10230.1581 0.5066 0.6140 −15210.5140 25576.3810
ECON −6820.7781 13109.3693 −0.5203 0.6045 −32953.8282 19312.2721
SR-BL −7658.8094 14041.8735 −0.5454 0.5871 −35650.7726 20333.1539
AGE_2 −230767.6463 114117.2005 −2.0222 0.0469 −458256.1279 −3279.1648
AGE_3 −216827.0808 152246.6232 −1.4242 0.1587 −520325.1772 86671.0155
COUNTRY_BAN −122103.5196 438719.2796 −0.2783 0.7816 −996674.4194 752467.3801
COUNTRY_ENG 672410.7654 238386.2220 2.8207 0.0062 197196.5172 1147625.0135
COUNTRY_IND 155306.4011 126316.3449 1.2295 0.2229 −96500.6302 407113.4325
COUNTRY_NZ 194218.9120 173491.9293 1.1195 0.2667 −151630.9280 540068.7521
COUNTRY_PAK 75921.7670 193463.5545 0.3924 0.6959 −309740.7804 461584.3143
COUNTRY_SA 64283.3894 144587.6773 0.4446 0.6579 −223946.8775 352513.6563
COUNTRY_SL 17360.1530 176333.7497 0.0985 0.9218 −334154.7526 368875.0586
COUNTRY_WI 10607.7792 230686.7892 0.0460 0.9635 −449257.9303 470473.4887
COUNTRY_ZIM −145494.4793 401505.2815 −0.3624 0.7181 −945880.6296 654891.6710
PLAYING ROLE_Batsman 75724.7643 150250.0240 0.5040 0.6158 −223793.1844 375242.7130
PLAYING ROLE_Bowler 15395.8752 126308.1272 0.1219 0.9033 −236394.7744 267186.5249
PLAYING ROLE_W. Keeper −71358.6280 213585.7444 –0.3341 0.7393 −497134.0278 354416.7718
CAPTAINCY EXP_1 164113.3972 123430.6353 1.3296 0.1878 −81941.0772 410167.8716

Omnibus: 0.891 Durbin-Watson: 2.244


Prob(Omnibus): 0.640 Jarque-Bera (JB): 0.638
Skew: 0.190 Prob(JB): 0.727
Kurtosis: 3.059 Condition No.: 84116

Going by the p-values (< 0.05), only the features
HS, AGE_2, AVE, and COUNTRY_ENG have come out significant. The model says that none of the other
features are influencing SOLD PRICE (at a significance value of 0.05). This is not very intuitive and could
be a result of the multi-collinearity effect of the variables.
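A standard diagnostic for multi-collinearity is the variance inflation factor (VIF), available in statsmodels. The sketch below is illustrative, using synthetic features rather than the IPL data; x2 is built as a near copy of x1, so both get a large VIF, while the unrelated x3 stays near 1.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)                          # unrelated feature

X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# High VIF (often > 4 or > 10 as a rule of thumb) flags collinear features
print([round(v, 1) for v in vifs])
```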

4.5.6 | Multi-Collinearity and Handling Multi-Collinearity


When the dataset has a large number of independent variables (features), it is possible that a few of these
independent variables (features) may be highly correlated. The existence of a high correlation between