Practical - Regression
Part 3
SIMPLE LINEAR REGRESSION
• Simple linear regression is a statistical technique used for
finding the existence of an association relationship between
a dependent variable (aka response variable or outcome
variable) and an independent variable (aka explanatory
variable, predictor variable or feature).
• We can only establish that change in the value of the
outcome variable (Y) is associated with change in the value
of the feature X; that is, the regression technique cannot be
used to establish a causal relationship between two variables.
• Regression is one of the most popular supervised learning
algorithms in predictive analytics. A regression model
requires knowledge of both the outcome and the
feature variables in the training dataset.
The following are a few examples of simple
and multiple linear regression problems:
1. A hospital may be interested in finding how the
total cost of a patient for a treatment varies with
the body weight of the patient.
2. Insurance companies would like to understand
the association between healthcare costs and
ageing.
3. An organization may be interested in finding the
relationship between revenue generated from a
product and features such as the price, money spent
on promotion, competitors’ price, and promotion
expenses.
4. Restaurants would like to know the relationship
between the customer waiting time after placing the
order and the revenue.
5. E-commerce companies such as Amazon,
BigBasket, and Flipkart would like to understand the
relationship between revenue and features such as
(a) Number of customer visits to their portal. (b)
Number of clicks on products. (c) Number of items
on sale. (d) Average discount percentage.
6. Banks and other financial institutions would like
to understand the impact of variables such as
unemployment rate, marital status, balance in the
bank account, etc. on the percentage of
non-performing assets (NPA).
STEPS IN BUILDING A REGRESSION
MODEL
• In this section, we will explain the steps used
in building a regression model.
• Building a regression model is an iterative
process and several iterations may be
required before finalizing the appropriate
model.
STEP 1: Collect/Extract Data
• The first step in building a regression model is
to collect or extract data on the dependent
(outcome) variable and independent (feature)
variables from different data sources.
• Data collection in many cases can be
time-consuming and expensive, even when
the organization has a well-designed enterprise
resource planning (ERP) system.
STEP 2: Pre-Process the Data
• Before the model is built, it is essential to ensure the quality
of the data for issues such as reliability, completeness,
usefulness, accuracy, missing data, and outliers.
1. Data imputation techniques may be used to deal with
missing data. Use of descriptive statistics and visualization
(such as box plot and scatter plot) may be used to identify
the existence of outliers and variability in the dataset.
2. Many new variables (such as the ratio of variables or
product of variables) can be derived (aka feature engineering)
and also used in model building.
3. Categorical data must be pre-processed using dummy
variables (part of feature engineering) before it is used in the
regression model; a short pandas sketch of these steps is given below.
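A minimal pandas sketch of these pre-processing steps, using an illustrative toy table (the column names Age, Income, and Gender are assumptions, not taken from a dataset in these slides):

import pandas as pd

# illustrative data with a missing value and a categorical column
df = pd.DataFrame({'Age': [25, 32, None, 41],
                   'Income': [30000, 45000, 52000, 61000],
                   'Gender': ['male', 'female', 'female', 'male']})

# 1. data imputation: fill the missing Age value with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())

# 2. feature engineering: derive a new variable as a ratio of existing variables
df['Income_per_year_of_age'] = df['Income'] / df['Age']

# 3. dummy variables for categorical data (drop_first avoids the dummy variable trap)
df = pd.get_dummies(df, columns=['Gender'], drop_first=True)

print(df)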
STEP 3: Dividing Data into Training and
Validation Datasets
• In this stage the data is divided into two subsets
(sometimes more than two subsets):
• training dataset and validation or test dataset.
• The proportion of training dataset is usually between
70% and 80% of the data and the remaining data is
treated as the validation data.
• The subsets may be created using random/ stratified
sampling procedure.
• This is an important step to measure the performance
of the model using dataset not used in model building.
• It is also essential to check for any overfitting of the
model. In many cases, multiple training and test
datasets are used (called cross-validation); a minimal sketch is given below.
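A minimal sketch of k-fold cross-validation for a linear regression model using scikit-learn (the 5-fold choice, the use of LinearRegression, and the variable names X and Y are illustrative assumptions):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# evaluate a linear regression model with 5-fold cross-validation;
# each fold is held out once as the test set while the remaining folds are used for training
lin_reg = LinearRegression()
scores = cross_val_score(lin_reg, X, Y, cv=5, scoring='r2')

print("R-squared per fold:", scores)
print("Mean R-squared:", scores.mean())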
STEP 4: Perform Descriptive Analytics
or Data Exploration
• It is always a good practice to perform descriptive analytics
before moving to building a predictive analytics model.
• Descriptive statistics will help us understand the
variability in the data, and visualization of the data
through, say, a box plot will show if there are any
outliers in the data.
• Another visualization technique, the scatter plot, may also
reveal if there is any obvious relationship between the two
variables under consideration.
• Scatter plot is useful to describe the functional relationship
between the dependent or outcome variable and features.
STEP 5: Build the Model
• The model is built using the training dataset to
estimate the regression parameters.
• The method of Ordinary Least Squares (OLS) is
used to estimate the regression parameters.
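For reference, the OLS estimates of the two parameters have the standard closed-form expressions (textbook results, not shown explicitly in the slides):

$$\hat{b}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad \hat{b}_0 = \bar{Y} - \hat{b}_1 \bar{X}$$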
STEP 6: Perform Model Diagnostics
• Regression is often misused since many times
the modeler fails to perform the necessary
diagnostic tests before applying the model.
• Before it can be applied, it is necessary that
the model created is validated for all model
assumptions, including the definition of the
functional form. If the model assumptions are
violated, the modeler must apply remedial
measures.
STEP 7: Validate the Model and
Measure Model Accuracy
• A major concern in analytics is over-fitting,
that is, the model may perform very well on
the training dataset, but may perform badly on
the validation dataset.
• It is important to ensure that the model
performance on the validation dataset is
consistent with that on the training dataset.
• In fact, the model may be cross-validated using
multiple training and test datasets.
STEP 8: Decide on Model Deployment
• The final step in the regression model is to
develop a deployment strategy in the form of
actionable items and business rules that can
be used by the organization.
BUILDING SIMPLE LINEAR REGRESSION
MODEL
• Simple Linear Regression (SLR) is a statistical
model in which there is only one independent
variable (or feature), and the functional
relationship between the outcome variable
and the feature is linear in the regression coefficients.
• Linear regression implies that the
mathematical function is linear with respect to
regression parameters.
• The simple linear regression model is given by Eq. (4.1):
Yi = b0 + b1 Xi + ei      (4.1)
where Yi is the value of the ith observation of the dependent
variable (outcome variable) in the sample, Xi is the value of the ith
observation of the independent variable or feature in the
sample, ei is the random error (also known as the residual) in
predicting the value of Yi, and b0 and b1 are the regression
parameters (or regression coefficients or feature weights).
Assumptions of the Linear Regression
Model
1. The errors or residuals ei are assumed to follow a
normal distribution with expected value of error E(ei) = 0.
2. The variance of error, VAR(ei ), is constant for
various values of independent variable X. This is
known as homoscedasticity. When the variance is
not constant, it is called heteroscedasticity.
3. The error and independent variable are
uncorrelated.
4. The functional relationship between the outcome
variable and feature is correctly defined
Properties of Simple Linear Regression
Example 3- Predicting MBA Salary from
Grade in MBA
• Excel file (MBA.xlsx) contains the salary of 50
graduating MBA students of a Business School
in 2022 and their corresponding percentage
marks in MBA
• Develop an SLR model to understand and
predict salary based on the percentage of
marks in MBA.
Steps for building a regression model
using Python:
1. Import pandas and numpy libraries
2. Use read_excel() to load the dataset into a
DataFrame.
3. Identify the feature(s) (X) and outcome (Y)
variable in the DataFrame for building the model.
4. Split the dataset into training and validation sets
using train_test_split().
5. Import the statsmodels library and fit the model using
the OLS() method.
6. Print model summary and conduct model
diagnostics
STEP 1
• To load the dataset into a DataFrame, we need to import the
pandas and numpy libraries; statsmodels.api is imported here
as well for building the regression model.
import pandas as pd
import numpy as np
import statsmodels.api as sm
STEP 2
df = pd.read_excel("C:/Users/LENOVO/Desktop/RLA/BMS/Sem 3/Introduction to Business Analytics/Practical/MBA.xlsx")
df.head()
STEP 3- Creating Feature Set (X) and
Outcome Variable (Y)
• The statsmodels library is used in Python for building
statistical models.
• The OLS API available in statsmodels.api is used for estimation of
the parameters of the simple linear regression model.
• The OLS() model takes two parameters Y and X.
• In this example, Percentage in MBA will be X and Salary will
be Y.
• The OLS API available in statsmodels.api estimates only the
coefficient of the X parameter [refer to Eq. (4.1)].
• To estimate the regression coefficient b0, a constant column
of 1s needs to be added as a separate column.
• As the value of this column remains the same across all
samples, the parameter estimated for this column will be the
intercept term.
Code
X = sm.add_constant(df['Per'])
X.head()

Y = df['Salary']
Y.head()
Output
STEP 4- Splitting the Dataset into
Training and Validation Sets
• The train_test_split() function from the sklearn.model_selection module provides
the ability to split the dataset randomly into training and validation
datasets.
• The parameter train_size takes a fraction between 0 and 1 for specifying
training set size.
• The remaining samples in the original set form the test or validation set. The
records selected for the training and test sets are randomly sampled.
• The method takes a seed value in the parameter named random_state to fix
which samples go to the training set and which go to the test set.
train_test_split()
• The method returns four variables as below.
1. train_X contains X features of the training set.
2. train_y contains the values of response variable for the training set.
3. test_X contains X features of the test set.
4. test_y contains the values of response variable for the test set.
Code
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, Y, train_size=0.8, random_state=100)
Step 5- Fitting the Model
• We will fit the model using the OLS method and
pass train_y and train_X as parameters.
• The fit() method on OLS() estimates the
parameters and returns model information in
the variable df_lm, which contains the model
parameters, accuracy measures, and residual
values, among other details.
Code
df_lm = sm.OLS( train_y, train_X ).fit()
Step 6- Printing Estimated Parameters
and Interpreting Them
print(df_lm.params)
Output
const 32675.756285
Per 3533.873029
dtype: float64
Analysis
• The estimated regression equation is Salary = 32675.76 + 3533.87 × (Percentage in MBA); that is, each additional percentage point in MBA marks is associated with an increase of about 3533.87 in salary.
df_lm.summary2()
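As a quick usage check, the fitted model can be used to predict salaries for the validation set created earlier (a minimal sketch using the df_lm and test_X/test_y objects defined above):

# predict salaries for the validation (test) set using the fitted model
pred_y = df_lm.predict(test_X)

# compare a few predicted salaries with the actual salaries
print(pred_y[:5])
print(test_y[:5])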
MODEL DIAGNOSTICS
• It is important to validate the regression model to
ensure its validity and goodness of fit before it can be
used for practical applications.
• The following measures are used to validate the
simple linear regression models:
1. Co-efficient of determination (R-squared).
2. Hypothesis test for the regression coefficient.
3. Analysis of variance for overall model validity
(important for multiple linear regression).
4. Residual analysis to validate the regression model
assumptions.
5. Outlier analysis, since the presence of outliers can
significantly impact the regression parameters.
1. Co-efficient of Determination
(R-Squared or R2)
• The primary objective of regression is to explain the
variation in Y using the knowledge of X.
• The co-efficient of determination (R-squared or R2 )
measures the percentage of variation in Y explained by
the model (b0 + b1 X).
• The total variation in the outcome variable can be broken into:
1. Variation in the outcome variable explained by the
model.
2. Unexplained variation, as shown in the equation below.
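In standard notation (a well-known decomposition, stated here for reference), the total sum of squares splits into the explained and unexplained parts, and R-squared is the explained share:

$$\underbrace{\sum_{i}(Y_i - \bar{Y})^2}_{\text{SST}} \;=\; \underbrace{\sum_{i}(\hat{Y}_i - \bar{Y})^2}_{\text{SSR}} \;+\; \underbrace{\sum_{i}(Y_i - \hat{Y}_i)^2}_{\text{SSE}}, \qquad R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}$$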
The co-efficient of determination
(R-squared) has the following
properties:
1. The value of R-squared lies between 0 and 1
2. Mathematically, R-squared (R2 ) is square of
correlation coefficient (R2 = r2 ), where r is the
Pearson correlation co-efficient.
3. A higher R-squared indicates a better fit;
however, one should be careful about
spurious relationships.
In the example
• The model R-squared value is 0.211, that is,
21.1% of the variation in salary is explained by
the variation in MBA score.
• The value is less than 50% and hence the
model is not a very good fit.
2. Analysis of Variance (ANOVA) in
Regression Analysis
ANOVA analysis
• H0: β = 0
• H1: β ≠ 0
• If the p-value of the F-statistic is less than 0.05 (the
assumed level of significance), we reject H0; otherwise we do
not reject H0.
• Since Prob (F-statistic) = 0.0032 is less
than 0.05, we reject H0.
• The p value of F-statistic of the model is 0.0032
which indicates that the overall model is
statistically significant.
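These quantities can also be read directly from the fitted statsmodels result object (a minimal sketch using the df_lm model fitted earlier):

# overall F-test for the regression model
print("F-statistic:", df_lm.fvalue)
print("Prob (F-statistic):", df_lm.f_pvalue)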
3. Regression equation & its analysis
• The regression equation is Yi = α + β1X1i + β2X2i + ⋯ + βkXki (one coefficient βj per feature).
x = df.drop(['Score'], axis=1)
y = df['Score']
x.head()
y.head()
Step 3- training and testing
from sklearn.model_selection import train_test_split

# random_state=12 ensures the data is split the same way on every run
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=12)
Step 4- model summary
import statsmodels.api as sm
model = sm.OLS(y_train,x_train)
print(model.fit().summary())
Output
CONCLUSION
Coefficients Table:
• The coefficients table of the model summary reports, for each parameter, the estimated coefficient, its standard error, the t-statistic, and the p-value (P>|t|).
2. Hypothesis Test for the Regression Co-efficient
• The hypothesis test for an individual regression coefficient is H0: β = 0 versus H1: β ≠ 0.
• The test is carried out using the t-statistic reported in the coefficients table of the model summary.
• If the p-value (P>|t|) of a coefficient is less than 0.05 (the assumed level of significance), we reject H0 and conclude that the feature is statistically significant.
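A minimal sketch of how the coefficient t-statistics and p-values can be pulled from a fitted statsmodels result (shown for the df_lm model fitted earlier; the same attributes work for any OLS result):

# t-statistic and p-value for each estimated coefficient
print(df_lm.tvalues)
print(df_lm.pvalues)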
Multi-Collinearity and Handling
Multi-Collinearity
• When the dataset has a large number of
independent variables (features), it is possible
that a few of these features may be highly correlated.
• The existence of a high correlation between
independent variables is called multi-collinearity.
Presence of multi-collinearity can destabilize the
multiple linear regression model.
• Thus, it is necessary to identify the presence of
multi-collinearity and take corrective actions.
• One of the assumptions of the Classical Linear
Regression Model is that there should be no
multicollinearity among the independent
variables included in the regression model.
• Hence, as a researcher we need to detect
presence of such Multicollinearity in the data
using effective methods.
• Usually, as the number of independent variables
in the regression model increases, the problem of
Multicollinearity also increases.
Methods to be used to detect the
presence of Multicollinearity in our
data:
• Variance Inflation Factor (VIF)
Variance Inflation Factor (VIF)
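The VIF of the j-th feature is computed from the R-squared obtained when that feature is regressed on all the other features (standard definition, shown here for reference):

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$

A VIF above 5 (or, by a more lenient rule, above 10) is usually taken as an indication of problematic multicollinearity.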
Code
# checking multicollinearity using VIF
from statsmodels.stats.outliers_influence import
variance_inflation_factor
vif_data = pd.DataFrame()
vif_data["Feature"] = x_train.columns
vif_data["VIF"] =
[variance_inflation_factor(x_train.values, i) for i in
range(x_train.shape[1])]
print(vif_data)
Output
Since VIF of the variables GDP per capita, Social Support and Healthy life
expectancy is very high (above 5 or 10), we say that multicollinearity is
present in this model.
Residual Analysis in Multiple Linear
Regression
• Test for Normality of Residuals (P-P Plot)
• One of the important assumptions of
regression is that the residuals should be
normally distributed.
• This can be verified using P-P plot. We will
develop a method called draw_pp_plot()
which takes the model output (residuals) and
draws the P-P plot.
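The code for draw_pp_plot() is not shown in the slides; a minimal sketch of what it might look like, using statsmodels' ProbPlot on the model residuals (the function name and plot title are illustrative):

import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot

def draw_pp_plot(fitted_model, title="Normal P-P Plot of Regression Residuals"):
    # build a probability plot object from the model residuals
    probplot = ProbPlot(fitted_model.resid)
    # P-P plot with a 45-degree reference line; points close to the line
    # indicate approximately normally distributed residuals
    probplot.ppplot(line='45')
    plt.title(title)
    plt.show()

# example usage with the model fitted earlier
# draw_pp_plot(df_lm)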
Distance Measures and Outliers
Diagnostics
• Outliers can have a significant impact on the
estimated regression coefficients.
• That is, the value of a regression coefficient may
change depending on the presence of outliers in
the data.
• The following distance measures are used for
diagnosing the outliers and influential
observations in MLR model.
1. Cook's Distance
2. Leverage Value
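A minimal sketch of how both measures can be obtained from a fitted statsmodels OLS result (this assumes the x_train and y_train objects from the MLR example above; the variable names are illustrative):

# refit the MLR model and keep the fitted result
fitted_model = sm.OLS(y_train, x_train).fit()

# influence measures from the fitted result
influence = fitted_model.get_influence()

# Cook's distance for each observation (first element of the returned tuple)
cooks_d = influence.cooks_distance[0]

# leverage values (diagonal of the hat matrix)
leverage = influence.hat_matrix_diag

print(cooks_d[:5])
print(leverage[:5])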
Outlier analysis
#outlier analysis
from scipy.stats import zscore

df['z_score_Score'] = zscore(df.Score)
df[(df.z_score_Score > 3.0) | (df.z_score_Score < -3.0)]
Output
Example 3.4
• Data on 12 months of sales of Company X along
with the corresponding advertising expenditure and
distribution expenditure.
Code
#importing libraries
import pandas as pd
import statsmodels.api as sm
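The slides jump from the imports straight to the Durbin-Watson output; a minimal sketch of the intermediate steps is given below (the file name sales.xlsx and the column names Sales, Adv_Exp, and Dist_Exp are assumptions, not taken from the slides):

from statsmodels.stats.stattools import durbin_watson

# load the 12-month sales data (path and column names are illustrative)
df = pd.read_excel("sales.xlsx")

# fit a multiple linear regression of Sales on the two expenditure variables
X = sm.add_constant(df[['Adv_Exp', 'Dist_Exp']])
y = df['Sales']
model = sm.OLS(y, X).fit()

# Durbin-Watson statistic on the residuals (values near 2 suggest no autocorrelation)
print("Durbin-Watson Test Statistic:", durbin_watson(model.resid))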
Output
Durbin-Watson Test Statistic: 2.5228799065114567
• The value of the Durbin-Watson statistic is about 2.52.
• Since it is slightly above 2, we can say that there is a
very slight negative autocorrelation in the data.
Regression with dummy variables
(example3.3)
Step 1: Importing Required Libraries:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
Step 2: importing data and creating a
dataframe
• Import data and create a DataFrame named df
with three columns: 'Gender', 'Age', and 'Income'.
df = pd.read_excel("C:/Users/LENOVO/Desktop/RLA/BMS/Sem 3/Introduction to Business Analytics/Practical/dummy.xlsx")
df.head()
Step 3- Creating Dummy Variables for
Gender:
• This line of code uses the pd.get_dummies()
function to create dummy variables for the
'Gender' column.
• The drop_first=True parameter drops the first
level (female) of the gender variable to
prevent multicollinearity issues in regression
analysis.
Code
df = pd.get_dummies(df, columns=['Gender'],
drop_first=True)
Step 4- Adding a Constant Term to
Predictor Variables Matrix:
• Here, we use sm.add_constant() to add a
constant (intercept) term to the predictor
variables matrix X.
• The matrix includes the dummy variable
'Gender_male' and the 'Age' column.
X = sm.add_constant(df[['Gender_male', 'Age']])
Step 5- Defining the Dependent
Variable:
• The dependent variable y is set to the
'Income' column of the DataFrame.
y = df['Income']
Step 6- Fitting the OLS Regression
Model:
• This line fits an Ordinary Least Squares (OLS)
regression model using the sm.OLS() function.
• It uses the dependent variable y and the
predictor variables matrix X.
model = sm.OLS(y, X).fit()
print(model.summary())
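• Because drop_first=True made female the baseline category, the coefficient on Gender_male in the summary measures the expected difference in Income between males and females at a given Age, while the Age coefficient measures the expected change in Income per additional year of age, holding gender fixed.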