# Prediction intervals for the simple linear regression model
# (mba_salary_lm) built in the previous section: wls_prediction_std()
# returns the standard error and the lower/upper bounds of the
# prediction interval for each test observation.
from statsmodels.sandbox.regression.predstd import wls_prediction_std

pred_y = mba_salary_lm.predict(test_X)
_, pred_y_low, pred_y_high = wls_prediction_std(mba_salary_lm,
                                                test_X,
                                                alpha=0.1)

pred_y_df = pd.DataFrame({'grade_10_perc': test_X['Percentage_in_Grade_10'],
                          'pred_y': pred_y,
                          'pred_y_left': pred_y_low,
                          'pred_y_right': pred_y_high})
pred_y_df[0:10]

Multiple Linear Regression
Multiple linear regression models the association relationship between a dependent variable (also known as the response variable or outcome variable) and several independent variables (also known as explanatory variables, predictor variables, or features).
Yi = β0 + β1 X1i + β2 X2i + … + βk Xki + εi
The regression coefficients β1, β2, …, βk are called partial regression coefficients since the relationship
between an explanatory variable and the response (outcome) variable is calculated after removing (or
controlling for) the effect of all the other explanatory variables (features) in the model.
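The partial regression coefficients are the values that minimize the sum of squared errors. A minimal sketch of this idea on synthetic data (not the IPL dataset; the coefficient values below are made up for illustration) using NumPy's least-squares solver:

```python
import numpy as np

# Toy illustration: generate two features and recover the partial
# regression coefficients by minimizing the sum of squared errors.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                                  # X1, X2
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Add the intercept column and solve b = argmin ||y - Xb||^2.
X_design = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(b)   # approximately [3.0, 1.5, -2.0]
```

Each estimated coefficient measures the effect of its feature after controlling for the other feature, which is why the true values are recovered even though X1 and X2 enter the model together.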
The assumptions that are made in the multiple linear regression model are as follows:
4. The variance of the residuals is constant for all values of Xi. When the variance of the residuals
is constant for different values of Xi, it is called homoscedasticity. A non-constant variance of
residuals is called heteroscedasticity.
5. There is no high correlation between independent variables in the model (called multi-collinearity).
Multi-collinearity can destabilize the model and can result in an incorrect estimation of the
regression parameters.
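One common way to quantify multi-collinearity is the variance inflation factor (VIF): regress each feature on all the others and compute 1 / (1 − R²). A self-contained sketch on hypothetical data (this helper is illustrative, not part of the chapter's code):

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2) obtained by
    regressing X[:, j] on the remaining columns (with an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.05, size=500)   # nearly a copy of x1
x3 = rng.normal(size=500)                    # independent of the rest
X = np.column_stack([x1, x2, x3])

print(vif(X, 0))   # very large: x1 is almost determined by x2
print(vif(X, 2))   # close to 1: x3 is uncorrelated with the rest
```

A rule of thumb often used is that VIF values above roughly 4 to 10 signal problematic multi-collinearity.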
The partial regression coefficients are estimated by minimizing the sum of squared errors (SSE). We will
explain the multiple linear regression model by using the example of auction pricing of players in the
Indian Premier League (IPL).

ipl_auction_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 26 columns):
Sl.NO.          130 non-null int64
PLAYER NAME     130 non-null object
AGE             130 non-null int64
COUNTRY         130 non-null object
TEAM            130 non-null object
...
   Sl.NO.  PLAYER NAME    AGE  COUNTRY  TEAM  PLAYING ROLE  T-RUNS  T-WKTS  ODI-RUNS-S  ODI-SR-B
0       1  Abdulla, YA      2  SA       KXIP  Allrounder         0       0           0      0.00
1       2  Abdur Razzak     2  BAN      RCB   Bowler           214      18         657     71.41
2       3  Agarkar, AB      2  IND      KKR   Bowler           571      58        1269     80.62
3       4  Ashwin, R        1  IND      CSK   Bowler           284      31         241     84.56
4       5  Badrinath, S     2  IND      CSK   Batsman           63       0          79     45.93
We can build a model to understand which features of players influence their SOLD PRICE, or to
predict players' auction prices in the future. However, not all columns are features. For example, Sl.NO.
is just a serial number and cannot be considered a feature of the player. We will build a model using only
the players' statistics, so BASE PRICE can also be removed. We will create a variable X_features, which will
contain the list of features that we will finally use for building the model, and ignore the rest of the columns
of the DataFrame. The following code is used for including the features in the model building.
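The feature-list code is not reproduced in this excerpt; a hypothetical sketch, restricted to the column names that actually appear in this section, might look like:

```python
# Hypothetical sketch: keep only player-statistic columns as features,
# dropping identifiers (Sl.NO., PLAYER NAME) and BASE PRICE. Only the
# columns mentioned in this section are listed; the real dataset has
# 26 columns.
X_features = ['AGE', 'COUNTRY', 'PLAYING ROLE',
              'T-RUNS', 'T-WKTS', 'ODI-RUNS-S', 'ODI-SR-B',
              'CAPTAINCY EXP']
```

The model-building steps that follow would then select `ipl_auction_df[X_features]` rather than the full DataFrame.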
Most of the features in the dataset are numerical (ratio scale) whereas features such as AGE, COUNTRY,
PLAYING ROLE, CAPTAINCY EXP are categorical and hence need to be encoded before building the
model. Categorical variables cannot be directly included in the regression model; they must be
encoded using dummy variables before being incorporated into the model.
ipl_auction_df['PLAYING ROLE'].unique()
As shown in the table above, the pd.get_dummies() method has created four dummy variables, and in
each sample the dummy variable corresponding to that player's role is set to 1.
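A minimal, self-contained sketch of this encoding (the four role values below are the categories assumed for PLAYING ROLE in this dataset):

```python
import pandas as pd

# One row per distinct playing role, just to show the encoding.
roles = pd.DataFrame({'PLAYING ROLE':
                      ['Allrounder', 'Batsman', 'Bowler', 'W. Keeper']})

# get_dummies() creates one column per category; for each row, exactly
# one of the four dummy columns is set to 1 (True).
dummies = pd.get_dummies(roles['PLAYING ROLE'])
print(dummies)
```

Each observation thus gets a 1 in the column for its own category and 0 everywhere else.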
Whenever we have n levels (or categories) for a qualitative variable (categorical variable), we will use
(n − 1) dummy variables, where each dummy variable is a binary variable used for representing whether
an observation belongs to a category or not. The reason why we create only (n − 1) dummy variables
is that inclusion of dummy variables for all categories, along with the constant in the regression equation,
will create perfect multi-collinearity (will be discussed later). To drop one category, the parameter drop_first
should be set to True.
We must create dummy variables for all categorical (qualitative) variables present in the dataset.
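The (n − 1) encoding can be sketched as follows; the column values here are illustrative, not the actual dataset:

```python
import pandas as pd

# Illustrative frame with one categorical column (4 categories) and one
# already-binary column.
df = pd.DataFrame({
    'PLAYING ROLE': ['Allrounder', 'Batsman', 'Bowler', 'W. Keeper'],
    'CAPTAINCY EXP': [0, 1, 1, 0],
})

# drop_first=True keeps (n - 1) = 3 dummies for the 4 playing roles; the
# dropped first category is represented by all three dummies being 0.
encoded = pd.get_dummies(df, columns=['PLAYING ROLE'], drop_first=True)
print(encoded.columns.tolist())
```

Passing `columns=` lets one call encode every categorical column at once while leaving numeric columns untouched.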
Setting a random seed ensures that the records that go into the training and test sets remain
unchanged, and hence the results can be reproduced. We will use the
value 42 (it is again selected randomly). You can use the same random seed of 42 for reproducibility
of the results; a different seed may give different results.
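The effect of fixing the seed can be sketched with a small helper (this function is illustrative; the same idea underlies `train_test_split(..., random_state=42)` in scikit-learn):

```python
import numpy as np

def split_indices(n, test_frac=0.2, seed=42):
    """Shuffle row indices with a fixed seed and split them into
    train and test index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(n * test_frac)
    return idx[n_test:], idx[:n_test]   # train, test

# Two calls with the same seed produce the identical split every time.
train_a, test_a = split_indices(130)
train_b, test_b = split_indices(130)
print((train_a == train_b).all() and (test_a == test_b).all())
```

Rerunning the notebook with seed 42 therefore always assigns the same 104 players to training and the same 26 to test.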
The model summary provides details of the model accuracy, feature significance, and signs of any multi-collinearity effect,
which is discussed in detail in the next section.