Practical - Regression


Regression

Part 3
SIMPLE LINEAR REGRESSION
• Simple linear regression is a statistical technique used for
finding the existence of an association relationship between
a dependent variable (aka response variable or outcome
variable) and an independent variable (aka explanatory
variable, predictor variable or feature).
• We can only establish that a change in the value of the
outcome variable (Y) is associated with a change in the value
of feature X; that is, the regression technique cannot be used
to establish a causal relationship between two variables.
• Regression is one of the most popular supervised learning
algorithms in predictive analytics. A regression model
requires knowledge of both the outcome and the
feature variables in the training dataset.
The following are a few examples of simple
and multiple linear regression problems:
1. A hospital may be interested in finding how the
total cost of a patient for a treatment varies with
the body weight of the patient.
2. Insurance companies would like to understand
the association between healthcare costs and
ageing.
3. An organization may be interested in finding the
relationship between revenue generated from a
product and features such as the price, money spent
on promotion, competitors’ price, and promotion
expenses.
4. Restaurants would like to know the relationship
between the customer waiting time after placing the
order and the revenue.
5. E-commerce companies such as Amazon,
BigBasket, and Flipkart would like to understand the
relationship between revenue and features such as
(a) Number of customer visits to their portal. (b)
Number of clicks on products. (c) Number of items
on sale. (d) Average discount percentage.
6. Banks and other financial institutions would like
to understand the impact of variables such as
unemployment rate, marital status, balance in the
bank account, etc. on the percentage of
non-performing assets (NPA).
STEPS IN BUILDING A REGRESSION
MODEL
• In this section, we will explain the steps used
in building a regression model.
• Building a regression model is an iterative
process and several iterations may be
required before finalizing the appropriate
model.
STEP 1: Collect/Extract Data
• The first step in building a regression model is
to collect or extract data on the dependent
(outcome) variable and independent (feature)
variables from different data sources.
• Data collection in many cases can be
time-consuming and expensive, even when
the organization has a well-designed enterprise
resource planning (ERP) system.
STEP 2: Pre-Process the Data
• Before the model is built, it is essential to ensure the quality
of the data for issues such as reliability, completeness,
usefulness, accuracy, missing data, and outliers.
1. Data imputation techniques may be used to deal with
missing data. Use of descriptive statistics and visualization
(such as box plot and scatter plot) may be used to identify
the existence of outliers and variability in the dataset.
2. Many new variables (such as the ratio of variables or
product of variables) can be derived (aka feature engineering)
and also used in model building.
3. Categorical data must be pre-processed using dummy
variables (part of feature engineering) before it is used in the
regression model.
STEP 3: Dividing Data into Training and
Validation Datasets
• In this stage the data is divided into two subsets
(sometimes more than two subsets):
• training dataset and validation or test dataset.
• The proportion of training dataset is usually between
70% and 80% of the data and the remaining data is
treated as the validation data.
• The subsets may be created using a random or stratified
sampling procedure.
• This is an important step because it allows the performance
of the model to be measured on a dataset not used in model building.
• It is also essential to check for any overfitting of the
model. In many cases, multiple training and multiple
test datasets are used (called cross-validation).
STEP 4: Perform Descriptive Analytics
or Data Exploration
• It is always a good practice to perform descriptive analytics
before moving to building a predictive analytics model.
• Descriptive statistics will help us to understand the
variability in the model and visualization of the data
through, say, a box plot which will show if there are any
outliers in the data.
• Another visualization technique, the scatter plot, may also
reveal if there is any obvious relationship between the two
variables under consideration.
• Scatter plot is useful to describe the functional relationship
between the dependent or outcome variable and features.
STEP 5: Build the Model
• The model is built using the training dataset to
estimate the regression parameters.
• The method of Ordinary Least Squares (OLS) is
used to estimate the regression parameters
STEP 6: Perform Model Diagnostics
• Regression is often misused since many times
the modeler fails to perform necessary
diagnostics tests before applying the model.
• Before it can be applied, it is necessary that
the model created is validated for all model
assumptions, including the definition of the
functional form. If the model assumptions are
violated, then the modeler must use remedial
measures.
STEP 7: Validate the Model and
Measure Model Accuracy
• A major concern in analytics is over-fitting,
that is, the model may perform very well on
the training dataset but badly on the
validation dataset.
• It is important to ensure that the model's
performance on the validation dataset is
consistent with its performance on the training dataset.
• In fact, the model may be cross-validated using
multiple training and test datasets, as sketched below.
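• As an illustration, a minimal k-fold cross-validation sketch using scikit-learn is given below. It is only a sketch: the file name MBA.xlsx and the column names 'Per' and 'Salary' are assumed from the worked example that appears later in this material.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Load the data (assumed file and column names)
df = pd.read_excel("MBA.xlsx")
X = df[['Per']]        # feature
y = df['Salary']       # outcome

# 5-fold cross-validation: fit and evaluate the model on 5 different train/test splits
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print("R-squared per fold:", scores)
print("Mean R-squared:", scores.mean())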
STEP 8: Decide on Model Deployment
• The final step in the regression modelling process is to
develop a deployment strategy in the form of
actionable items and business rules that can
be used by the organization.
BUILDING SIMPLE LINEAR REGRESSION
MODEL
• Simple Linear Regression (SLR) is a statistical
model in which there is only one independent
variable (or feature) and the functional
relationship between the outcome variable
and the regression coefficient is linear.
• Linear regression implies that the
mathematical function is linear with respect to
regression parameters.
The SLR model can be written as Yi = b0 + b1 Xi + ei,
where Yi is the value of the ith observation of the dependent
variable (outcome variable) in the sample, Xi is the value of the ith
observation of the independent variable or feature in the
sample, ei is the random error (also known as the residual) in
predicting the value of Yi, and b0 and b1 are the regression
parameters (or regression coefficients or feature weights).
Assumptions of the Linear Regression
Model
1. The errors or residuals ei are assumed to follow a
normal distribution with expected value of error E(ei
) = 0.
2. The variance of error, VAR(ei ), is constant for
various values of independent variable X. This is
known as homoscedasticity. When the variance is
not constant, it is called heteroscedasticity.
3. The error and independent variable are
uncorrelated.
4. The functional relationship between the outcome
variable and feature is correctly defined
Properties of Simple Linear Regression
Example 3- Predicting MBA Salary from
Grade in MBA
• Excel file (MBA.xlsx) contains the salary of 50
graduating MBA students of a Business School
in 2022 and their corresponding percentage
marks in MBA
• Develop an SLR model to understand and
predict salary based on the percentage of
marks in MBA.
Steps for building a regression model
using Python:
1. Import pandas and numpy libraries
2. Use read_excel() to load the dataset into a
DataFrame.
3. Identify the feature(s) (X) and outcome (Y)
variable in the DataFrame for building the model.
4. Split the dataset into training and validation sets
using train_test_split().
5. Import the statsmodels library and fit the model using
the OLS() method.
6. Print model summary and conduct model
diagnostics
STEP 1
• For loading dataset into a DataFrame, we
need to import pandas and numpy libraries.

import pandas as pd
import numpy as np
import statsmodels.api as sm
STEP 2
df =
pd.read_excel("C:/Users/LENOVO/Desktop/RLA/
BMS/Sem 3/Introduction to Business
Analytics/Practical/MBA.xlsx")
df.head()
STEP 3- Creating Feature Set (X) and
Outcome Variable (Y)
• The statsmodels library is used in Python for building
statistical models.
• The OLS API available in statsmodels.api is used for estimating the
parameters of the simple linear regression model.
• The OLS() method takes two parameters: Y and X.
• In this example, Percentage in MBA will be X and Salary will
be Y.
• The OLS API available in statsmodels.api estimates only the
coefficient of the X parameter [refer to Eq. (4.1)].
• To estimate the regression coefficient b0, a constant term of 1
needs to be added as a separate column.
• As the value of this column remains the same across all
samples, the parameter estimated for this feature or
column will be the intercept term.
Code
X = sm.add_constant(df['Per'])
X.head()

Y = df['Salary']
Y.head()
Output
STEP 4- Splitting the Dataset into
Training and Validation Sets
• The train_test_split() function from the sklearn.model_selection module provides
the ability to split the dataset randomly into training and validation
datasets.
• The parameter train_size takes a fraction between 0 and 1 for specifying the
training set size.
• The remaining samples in the original set form the test or validation set. The
records that are selected for the training and test sets are randomly sampled.
• The method takes a seed value in the parameter named random_state, to fix
which samples go to the training set and which ones go to the test set.
train_test_split()
• The method returns four variables as below.
1. train_X contains X features of the training set.
2. train_y contains the values of response variable for the training set.
3. test_X contains X features of the test set.
4. test_y contains the values of response variable for the test set.
Code
from sklearn.model_selection import
train_test_split
train_X, test_X, train_y, test_y = train_test_split(
X,
Y, train_size = 0.8, random_state = 100 )
Step 5- Fitting the Model
• We will fit the model using OLS method and
pass train_y and train_X as parameters.
• The fit() method on OLS() estimates the
parameters and returns the fitted model information in
the variable df_lm, which contains the model
parameters, accuracy measures, and residual
values, among other details.
Code
df_lm = sm.OLS( train_y, train_X ).fit()
Step 6- Printing Estimated Parameters
and Interpreting Them
print(df_lm.params)

Output
const 32675.756285
Per 3533.873029
dtype: float64
Analysis
• The regression equation is Yi = α + βXi

• The estimated (predicted) model can be written as

• MBA Salary = 32675.756 + 3533.873 * (Percentage in MBA)
• The equation can be interpreted as follows:
• If a student scores 0 percent in MBA, his/her predicted salary
is still Rs. 32,675.76
• For every 1% increase in the MBA score, the salary of the
MBA student is expected to increase by Rs. 3,533.87
Step 7- Regression Model Summary
Using Python
• The function summary2() prints the model
summary which contains the information
required for diagnosing a regression model

df_lm.summary2()
MODEL DIAGNOSTICS
• It is important to validate the regression model to
ensure its validity and goodness of fit before it can be
used for practical applications.
• The following measures are used to validate the
simple linear regression models:
1. Co-efficient of determination (R-squared).
2. Hypothesis test for the regression coefficient.
3. Analysis of variance for overall model validity
(important for multiple linear regression).
4. Residual analysis to validate the regression model
assumptions.
5. Outlier analysis, since the presence of outliers can
significantly impact the regression parameters.
1. Co-efficient of Determination
(R-Squared or R2)
• The primary objective of regression is to explain the
variation in Y using the knowledge of X.
• The co-efficient of determination (R-squared or R2 )
measures the percentage of variation in Y explained by
the model (b0 + b1 X).
• In simple linear regression, the total variation in the outcome
variable can be broken into
1. Variation in the outcome variable explained by the
model.
2. Unexplained variation, as shown in the equation below
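In standard notation (the slide's own equation is not reproduced in the text, so it is reconstructed here), the decomposition and R-squared are:

SST = SSR + SSE, i.e. \sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i (Y_i - \hat{Y}_i)^2

R^2 = SSR / SST = 1 - SSE / SST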
The co-efficient of determination
(R-squared) has the following
properties:
1. The value of R-squared lies between 0 and 1
2. Mathematically, R-squared (R2) is the square of the
correlation coefficient (R2 = r2), where r is the
Pearson correlation co-efficient.
3. Higher R-squared indicates better fit;
however, one should be careful about the
spurious relationship.
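Property 2 can be checked numerically. The snippet below is only an illustrative sketch; it assumes the fitted model df_lm and the training data (train_X, train_y) from the earlier steps are available.

import numpy as np

# R-squared reported by the fitted model
print(df_lm.rsquared)

# Square of the Pearson correlation between the feature and the outcome
r = np.corrcoef(train_X['Per'], train_y)[0, 1]
print(r ** 2)   # matches df_lm.rsquared for simple linear regression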
In the example
• The model R-squared value is 0.211, that is,
21.1% of the variation in salary is explained by
the variation in MBA score.
• The value is less than 50% and hence the
model is not a very good fit.
2. Analysis of Variance (ANOVA) in
Regression Analysis
ANOVA analysis
• H0: β = 0
• H1: β ≠ 0
• If the p-value is less than 0.05 (assumed level of
significance), we reject H0; otherwise we do
not reject H0.
• Since the probability of the F-statistic = 0.0032 is less
than 0.05, we reject H0.
• The p-value of the F-statistic of the model is 0.0032,
which indicates that the overall model is
statistically significant.
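These quantities can also be read directly from the fitted results object. A small sketch, assuming the fitted model df_lm from the earlier steps:

# F-statistic and its p-value for the overall model
print(df_lm.fvalue)
print(df_lm.f_pvalue)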
3. Regression equation & its analysis
• The regression equation is Yi = α + βXi

• The estimated (predicted) model can be written as

• MBA Salary = 32675.756 + 3533.873 * (Percentage in MBA)
• The equation can be interpreted as follows:
• If a student scores 0 percent in MBA, his/her predicted salary
is still Rs. 32,675.76
• For every 1% increase in the MBA score, the salary of the
MBA student is expected to increase by Rs. 3,533.87
4. Check for Normal Distribution of
Residual
• A Q-Q (Quantile-Quantile) plot is a graphical
tool used in statistics to assess the similarity
between the distribution of a sample of data
and a theoretical distribution, typically the
normal distribution.
• In the context of residual analysis, a Q-Q plot
helps you evaluate whether the residuals of a
statistical model (such as a linear regression
model) follow a normal distribution.
Code
import matplotlib.pyplot as plt
import scipy.stats as stats

model = sm.OLS(train_y, train_X).fit()

# Calculate residuals on the validation set
residuals = test_y - model.predict(test_X)

# Create a Q-Q plot of residuals using scipy.stats
stats.probplot(residuals, plot=plt)
plt.title('Q-Q Plot of Residuals')
plt.show()
Output
Analysis
• The diagonal line is the cumulative
distribution of a normal distribution, whereas
the dots represent the cumulative distribution
of the residuals.
• Since the dots are close to the diagonal line,
we can conclude that the residuals follow an
approximate normal distribution (we need
only an approximate normal distribution).
5. Outlier Analysis
• Outliers are observations whose values show
a large deviation from the mean value.
• Presence of an outlier can have a significant
influence on the values of regression
coefficients.
• Thus, it is important to identify the existence
of outliers in the data.
• The following measure is useful in
identifying outliers: the Z-score.
Z-score
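The Z-score formula (reconstructed here, as the slide's equation is not reproduced in the text) for the ith observation is

Z_i = (Y_i - \bar{Y}) / s_Y

where \bar{Y} is the sample mean and s_Y the sample standard deviation of the variable; observations with |Z| greater than 3 are flagged as outliers, which is the rule used in the code below.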
Code
from scipy.stats import zscore
df['z_score_salary'] = zscore(df.Salary)
df[ (df.z_score_salary > 3.0) | (df.z_score_salary
< -3.0) ]

Output: So, there are no observations that are outliers as per the Z-score.
Practice
• HDI.xlsx
Multiple Regression Analysis
• Multiple linear regression (MLR) is a
supervised learning algorithm for finding the
existence of an association relationship
between a dependent variable (aka response
variable or outcome variable) and several
independent variables (aka explanatory
variables or predictor variable or features).
• The functional form of MLR is given by
Y = b0 + b1 X1 + b2 X2 + … + bk Xk + e
• The regression coefficients b1, b2, …, bk are
called partial regression coefficients since the
relationship between an explanatory variable
and the response (outcome) variable is
calculated after removing (or controlling for) the
effect of all the other explanatory variables
(features) in the model.
The assumptions that are made in
multiple linear regression model are as
follows:
• 1. The regression model is linear in regression parameters
(b-values).
• 2. The residuals follow a normal distribution and the
expected value (mean) of the residuals is zero.
• 3. In time series data, residuals are assumed to be
uncorrelated.
• 4. The variance of the residuals is constant for all values of
Xi . When the variance of the residuals is constant for
different values of Xi , it is called homoscedasticity. A
non-constant variance of residuals is called
heteroscedasticity.
• 5. There is no high correlation between independent
variables in the model (called multi-collinearity).
Multi-collinearity can destabilize the model and can result
in an incorrect estimation of the regression parameters.
Validation of Multiple Regression
Model
The following measures and tests are carried out to validate an MLR model:
1. Coefficient of multiple determination (R-square) and adjusted R-square,
which can be used to judge the overall fitness of the model.
2. t-Test to check the existence of a statistically significant relationship
between the outcome variable and an individual feature variable at a given
significance level (α) or at a (1 - α)100% confidence level.
3. F-test to check the statistical significance of the overall model at a given
significance level (α) or at (1 - α)100% confidence level.
4. Residual analysis to check whether the normality and homoscedasticity
assumptions have been satisfied. Also, checking for any pattern in the
residual plots to check for correct model specification.
5. Checking for the presence of multi-collinearity (strong correlation between
independent variables) that can destabilize the regression model.
6. Checking for autocorrelation in case of time series data.
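Several of these quantities can be read directly from a fitted statsmodels results object. The sketch below is illustrative only and assumes a fitted OLS results object named model, as built later in this example.

# Overall fit
print(model.rsquared)        # R-square
print(model.rsquared_adj)    # adjusted R-square

# Overall significance (F-test)
print(model.fvalue, model.f_pvalue)

# Individual coefficients (t-tests)
print(model.tvalues)
print(model.pvalues)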
Example- example3.2
• DEPENDENT VARIABLE: Happiness Index (HI)
A Happiness Index generally focuses on the following criteria:
• Psychological Well-being (Psychological indicators)
• Mental and Spiritual Health (Health and mental health
indicators)
• Time-balance (Work and Sleep Indicators)
• Social and Community Vitality (Philanthropy, Family related
indicators)
• Cultural Vitality (Native speaker, Domestic industry indicators)
• Education (Literacy indicators)
• Standard of life (Living Standards indicators)
• Good governance (Human rights, political participation,
government performance indicators)
• Ecological Vitality (Wildlife, Forest, Urban-Rural, Environmental
indicators)
• As per the World Happiness Index 2018 report,
Finland has been ranked as the happiest
country while Burundi has been ranked last in
the index out of a pool of 156 countries.
Independent Variables
1. GDP Per Capita
• GDP refers to the total value of final (as opposed to
interim, or work-in-progress) goods and services produced
within a country's borders during a specific calendar period,
such as a quarter or a year. While GDP is the most widely
used measure of a country's economic activity, per capita GDP
is a better indicator of the change or trend in a nation's living
standards over time, since it adjusts for population differences
between countries. GDP per capita is in terms of Purchasing
Power Parity (PPP) adjusted to constant 2011 international
dollars, taken from the World Development Indicators (WDI)
released by the World Bank in September 2017.
2. Social Support
• Social support is the national average of the
binary responses (either 0 or 1) to the
Gallup World Poll (GWP) question “If you
were in trouble, do you have relatives or
friends you can count on to help you
whenever you need them, or not?” Social
support is directly related to the culture and
traditions of the society. For example,
Indian culture supports to help each other
and provide any kind of social support which
induces happiness in the society.
3. Health Life Expectancy
• Healthy Life Expectancy is defined as the average number of
years a newborn is expected to live if he/she were to pass
through life subject to the age-specific mortality rates of the
given period. The time series of healthy life expectancy in the World Happiness Index
2018 are constructed based on data from the World Health
Organization (WHO) and WDI. WHO publishes data on
healthy life expectancy for the year 2012. The time series of
life expectancies, with no adjustment for health, are available
in WDI. We adopt the following strategy to construct the time
series of healthy life expectancy at birth: first we generate the
ratios of healthy life expectancy to life expectancy in 2012 for
countries with both data. We then apply the country-specific
ratios to other years to generate the healthy life expectancy
data. As longevity increases, people tend to satisfy more
needs.
4. Perceptions of Corruption
• Corruption is defined as the abuse of public wealth for
private interests by a particular class of people. It
has several forms, such as nepotism, secret funding
of political parties, close ties
between political elites and the business world, bribery
in the public sector, etc. There are two opposing views
regarding the relationship between corruption and economic
growth, and between corruption and social progress. Perceptions of
corruption are the average of binary answers to two
GWP questions: "Is corruption widespread
throughout the government or not?" and "Is
corruption widespread within businesses or not?"
Where data for government corruption are missing,
the perception of business corruption is used as the
overall corruption-perception measure.
5. Generosity
• Generosity is the residual of regressing the
national average of GWP responses to the
question “Have you donated money to a
charity in the past month?” on GDP per capita.
The idea of the research is to find out the exact
relationship of generosity with the
happiness level of the country.
Objective

• To prove that the level of happiness depends on the
factors GDP per capita, Social Support, Healthy
life expectancy, Perceptions of corruption, and
Generosity
• Y1= Happiness Index
• X1= GDP per capita
• X2= Social Support
• X3= Healthy life expectancy
• X4= Perceptions of corruption
• X5= Generosity
Step 1- Loading the Dataset
df =
pd.read_excel("C:/Users/LENOVO/Desktop/RLA/
BMS/Sem 3/Introduction to Business
Analytics/Practical/multi.xlsx")
df.head()
Step 2- preparing x and y

x=df.drop(['Score'],axis=1)
y=df['Score']
x.head()
y.head()
Step 3- training and testing
from sklearn.model_selection import
train_test_split
x_train, x_test, y_train, y_test =
train_test_split(x,y,test_size=0.20,random_state
=12)
#using random_state=12, the data is split the
same way on every run
Step 4- model summary
import statsmodels.api as sm
# add a constant column so that the intercept term is estimated
model = sm.OLS(y_train, sm.add_constant(x_train))
print(model.fit().summary())
Output
CONCLUSION
• The Regression Equation is as follows:
• World Happiness Index 2018 = 2.058 + 1.105
(GDP per Capita) + 1.225 (Social Support) +
0.855 (Health Life Expectancy) + 0.994
(Generosity) + 1.399 (Perceptions of
Corruption)
INTERPRETATION OF SUMMARY
OUTPUT
Multiple R:
• The multiple correlation coefficient of the
regression is 0.873.
• Hence there is a STRONG POSITIVE
CORRELATION between the dependent
variable and the independent variables.
R Square:

• The Coefficient of Determination of the multiple
regression test has a value of 0.7623.
• Hence we can conclude that around 76% of the
variation in the dependent variable
(Happiness Index) is explained by the
independent variables (GDP Per Capita,
Social Support, Healthy Life Expectancy,
Generosity and Perception of Corruption).
Adjusted R Square:

• This is a measure of the adequacy of the
regression model.
• Its value is 0.7543, which is also high, hence
indicating that the model is adequate.
Standard Error:

• The standard error of the estimate is a measure of the
accuracy of predictions.
• Its value in this example is 0.5548, which is a
small value, hence indicating that the predictions
made by the regression model are quite
accurate.
Observations:

• The total number of observations in this
regression model is 150, as there are 156
countries whose HI and various factors are
taken into account.
Analysis of Variance Table:

Coefficients Table:

2. Hypothesis Test for the Regression Co-efficient
Multi-Collinearity and Handling
Multi-Collinearity
• When the dataset has a large number of
independent variables (features), it is possible
that a few of these independent variables (features)
may be highly correlated.
• The existence of a high correlation between
independent variables is called multi-collinearity.
Presence of multi-collinearity can destabilize the
multiple linear regression model.
• Thus, it is necessary to identify the presence of
multi-collinearity and take corrective actions.

• One of the assumptions of the Classical Linear
Regression Model is that there should be no
multicollinearity among the independent
variables included in the regression model.
• Hence, as a researcher we need to detect
presence of such Multicollinearity in the data
using effective methods.
• Usually, as the number of independent variables
in the regression model increases, the problem of
Multicollinearity also increases.
Methods to be used to detect the
presence of Multicollinearity in our
data:
• Variance Inflating Factor (VIF)
Variance Inflation Factor (VIF)
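The VIF formula (reconstructed here, as the slide's equation is not reproduced in the text) for the j-th independent variable is

VIF_j = 1 / (1 - R_j^2)

where R_j^2 is the R-squared obtained by regressing the j-th independent variable on all the other independent variables. A VIF above 5 (or, by a more lenient rule of thumb, above 10) signals problematic multicollinearity, which is the rule applied in the output discussion below.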
Code
# checking multicollinearity using VIF
from statsmodels.stats.outliers_influence import
variance_inflation_factor

vif_data = pd.DataFrame()
vif_data["Feature"] = x_train.columns
vif_data["VIF"] =
[variance_inflation_factor(x_train.values, i) for i in
range(x_train.shape[1])]

print(vif_data)
Output

Since VIF of the variables GDP per capita, Social Support and Healthy life
expectancy is very high (above 5 or 10), we say that multicollinearity is
present in this model.
Residual Analysis in Multiple Linear
Regression
• Test for Normality of Residuals (P-P Plot)
• One of the important assumptions of
regression is that the residuals should be
normally distributed.
• This can be verified using a P-P plot. We will
develop a method called draw_pp_plot(),
which takes the model output (residuals) and
draws the P-P plot, as sketched below.
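A minimal sketch of such a helper (illustrative only, not code reproduced from the original slides), using the ProbPlot class from statsmodels:

import matplotlib.pyplot as plt
import statsmodels.api as sm

def draw_pp_plot(fitted_model, title="P-P Plot of Residuals"):
    # Build a probability plot object from the model residuals
    pp = sm.ProbPlot(fitted_model.resid)
    # ppplot() plots the cumulative probabilities of the residuals against
    # those of a normal distribution; the 45-degree line is the reference
    pp.ppplot(line='45')
    plt.title(title)
    plt.show()

# Usage (assuming `model` is a fitted OLS results object):
# draw_pp_plot(model)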
Distance Measures and Outliers
Diagnostics
• Outliers can have a significant impact on the
estimated regression coefficients.
• That is, the value of a regression coefficient may
change depending on the presence of outliers in
the data.
• The following distance measures are used for
diagnosing the outliers and influential
observations in MLR model.
1. Cook's Distance
2. Leverage Value
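A minimal sketch (illustrative only, not from the original slides) showing how both measures can be obtained from a fitted statsmodels OLS results object named model:

# Influence measures for a fitted OLS results object
influence = model.get_influence()

cooks_d = influence.cooks_distance[0]    # Cook's distance for each observation
leverage = influence.hat_matrix_diag     # leverage (hat) values

# A common rule of thumb: flag observations with Cook's distance > 4/n
n = len(cooks_d)
print("Possibly influential observations:", (cooks_d > 4.0 / n).nonzero()[0])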
Outlier analysis
#outlier analysis
from scipy.stats import zscore
df['z_score_Score'] = zscore(df.Score)
df[ (df.z_score_Score > 3.0) | (df.z_score_Score
< -3.0) ]
output

There are no outliers


Residual analysis
model = sm.OLS(y_train, X_train).fit()
import scipy.stats as stats
# Calculate residuals
residuals = y_test - model.predict(X_test)

# Create a Q-Q plot of residuals using scipy.stats


stats.probplot(residuals, plot=plt)
plt.title('Q-Q Plot of Residuals')
plt.show()
Analysis
• The diagonal line is the cumulative
distribution of a normal distribution, whereas
the dots represent the cumulative distribution
of the residuals.
• Since the dots are close to the diagonal line,
we can conclude that the residuals follow an
approximate normal distribution (we need
only an approximate normal distribution).
Autocorrelation


METHOD- DURBIN-WATSON TEST
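The Durbin-Watson statistic (reconstructed here, as the slide's equation is not reproduced in the text) is computed from the residuals e_t as

d = \sum_{t=2}^{n} (e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2

Values of d close to 2 indicate no autocorrelation; values well below 2 indicate positive autocorrelation and values well above 2 indicate negative autocorrelation.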


Example3.4
• Data on 12 months sale of Company X along
with their Advertising expenditure and
Distribution expenditure.
Code
#importing libraries
import pandas as pd
import statsmodels.api as sm

# importing the file from Excel
df = pd.read_excel("C:/Users/LENOVO/Desktop/RLA/BMS/Sem
3/Introduction to Business Analytics/Practical/auto.xlsx")
df.head()

# Calculate residuals from a linear regression model (Sales ~ Dist + Adv)
X = sm.add_constant(df[['Dist', 'Adv']])
y = df['Sales']
model = sm.OLS(y, X).fit()
residuals = model.resid

# Perform Durbin-Watson test for autocorrelation
durbin_watson_statistic = sm.stats.stattools.durbin_watson(residuals)

# Print the Durbin-Watson test statistic
print("Durbin-Watson Test Statistic:", durbin_watson_statistic)

Output
Durbin-Watson Test Statistic: 2.5228799065114567
• The value of the Durbin-Watson statistic is approximately 2.52.
• Since it is above 2, we can say that there is a very slight
negative autocorrelation in the data.
Regression with dummy variables
(example3.3)
Step 1: Importing Required Libraries:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
Step 2: importing data and creating a
dataframe
• Import the data and create a DataFrame named df
with three columns: 'Gender', 'Age', and 'Income'.

df =
pd.read_excel("C:/Users/LENOVO/Desktop/RLA/BM
S/Sem 3/Introduction to Business
Analytics/Practical/dummy.xlsx")
df.head()
Step 3- Creating Dummy Variables for
Gender:
• This line of code uses the pd.get_dummies()
function to create dummy variables for the
'Gender' column.
• The drop_first=True parameter drops the first
level (female) of the gender variable to
prevent multicollinearity issues in regression
analysis.
Code
df = pd.get_dummies(df, columns=['Gender'],
drop_first=True)
Step 4- Adding a Constant Term to
Predictor Variables Matrix:
• Here, we use sm.add_constant() to add a
constant (intercept) term to the predictor
variables matrix X.
• The matrix includes the dummy variable
'Gender_male' and the 'Age' column.

X = sm.add_constant(df[['Gender_male', 'Age']])
Step 5- Defining the Dependent
Variable:
• The dependent variable y is set to the
'Income' column of the DataFrame.

y = df['Income']
Step 6- Fitting the OLS Regression
Model:
• This line fits an Ordinary Least Squares (OLS)
regression model using the sm.OLS() function.
• It uses the dependent variable y and the
predictor variables matrix X.

model = sm.OLS(y, X).fit()


Step 7- Printing Regression Summary:
• The model.summary() method prints out a
summary of the regression results.
• This includes statistical information about the
coefficients, their significance, goodness-of-fit
measures, and more.

print(model.summary())
