predictive modelling outputs

The document outlines the performance evaluation of a linear regression model on both training and test datasets, reporting metrics such as RMSE, MAE, and MAPE. It also discusses the process of checking linear regression assumptions, including multicollinearity, linearity, independence, normality, and homoscedasticity, along with methods for feature selection based on p-values. Finally, the document presents the final model summary and performance metrics after addressing the assumptions and refining the model.

# checking model performance on train set (seen 70% of the data)
print("Training Performance\n")
olsmodel1_train_perf = model_performance_regression(olsmodel1, x_train, y_train)
olsmodel1_train_perf

Training Performance

       RMSE       MAE       MAPE
0  1.127269  0.844911  26.957451
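The helper model_performance_regression is not defined in this excerpt; a minimal sketch of what it likely computes, based on the metric columns shown above (the function name and return layout are taken from the calls in this document, the implementation itself is an assumption):

```python
import numpy as np
import pandas as pd

def model_performance_regression(model, predictors, target):
    # Sketch of the helper used in this notebook -- the original definition
    # is not shown, so this is a plausible reconstruction, not the real code.
    pred = model.predict(predictors)                 # predictions from the fitted model
    errors = np.asarray(target) - np.asarray(pred)   # prediction errors
    rmse = np.sqrt(np.mean(errors ** 2))                       # root mean squared error
    mae = np.mean(np.abs(errors))                              # mean absolute error
    mape = np.mean(np.abs(errors / np.asarray(target))) * 100  # mean absolute % error
    return pd.DataFrame({"RMSE": [rmse], "MAE": [mae], "MAPE": [mape]})
```

Any object with a predict method (such as a fitted statsmodels OLS model) can be passed as the first argument.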

# checking model performance on test set (unseen 30% of the data)
print("Test Performance\n")
olsmodel1_test_perf = model_performance_regression(olsmodel1, x_test, y_test)
olsmodel1_test_perf

Output:

Test Performance

       RMSE      MAE       MAPE
0  1.030785  0.81383  23.978281

Checking Linear Regression Assumptions

We will be checking the following Linear Regression assumptions:


1. No Multicollinearity
2. Linearity of variables
3. Independence of error terms
4. Normality of error terms
5. No Heteroscedasticity

TEST FOR MULTICOLLINEARITY

Dropping variables with high VIF

We will test for multicollinearity using the Variance Inflation Factor (VIF).

General rule of thumb:

- If VIF is 1, there is no correlation between that predictor and the remaining predictor variables.
- If VIF exceeds 5 or is close to 5, there is moderate multicollinearity. It can or cannot be treated, with proper reasoning.
- If VIF is 10 or more, it shows signs of high multicollinearity and must be treated.
Feature VIF

0 const 6.124153

1 capital 4.014583

2 patents 2.986430

3 randd 5.545531

4 employment 3.593570

5 tobinq 1.064449

6 value 2.799430

7 institutions 1.331542

8 sp500_yes 1.622238
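VIF tables like the one above can be produced with statsmodels' variance_inflation_factor. A sketch on synthetic stand-in data (the column names are borrowed from the table above for illustration; the data itself is randomly generated, not the actual dataset):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative stand-in for x_train: a constant plus a few predictors,
# with "randd" deliberately correlated with "capital" to inflate their VIFs.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "const": 1.0,
    "capital": rng.normal(size=100),
    "patents": rng.normal(size=100),
})
X["randd"] = X["capital"] * 0.8 + rng.normal(scale=0.5, size=100)

# VIF of each column = 1 / (1 - R^2) from regressing it on the other columns
vif = pd.DataFrame({
    "Feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif)
```

The correlated pair ("capital" and "randd") gets noticeably higher VIFs than the independent column ("patents"); the constant's VIF is conventionally ignored.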

Dropping high p-value variables

We will drop the predictor variables having a p-value greater than 0.05 as they do not
significantly impact the target variable.

But sometimes p-values change after dropping a variable. So, we'll not drop all variables
at once.

Instead, we will do the following:

1. Build a model, check the p-values of the variables, and drop the column with the highest p-value.
2. Create a new model without the dropped feature, check the p-values of the variables, and drop the column with the highest p-value.
3. Repeat the above two steps till there are no columns with p-value > 0.05.

The above process can also be done manually by picking one variable at a time that has
a high p-value, dropping it, and building a model again. But that might be a little tedious
and using a loop will be more efficient.

# initial list of columns
cols = x_train.columns.tolist()

# setting an initial max p-value
max_p_value = 1

while len(cols) > 0:
    # defining the train set
    x_train_aux = x_train[cols]

    # fitting the model
    model = sm.OLS(y_train, x_train_aux).fit()

    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)

    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()

    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features = cols
print(selected_features)

Output:
['const', 'employment', 'tobinq', 'value', 'institutions', 'sp500_yes']

# checking model performance on train set (seen 70% of the data)
print("Training Performance\n")
olsmodel2_train_perf = model_performance_regression(olsmodel2, x_train2, y_train)
olsmodel2_train_perf

Training Performance

RMSE MAE MAPE

0 1.131306 0.843946 26.941502

# checking model performance on test set (unseen 30% of the data)
print("Test Performance\n")
olsmodel2_test_perf = model_performance_regression(olsmodel2, x_test2, y_test)
olsmodel2_test_perf

Test Performance

       RMSE       MAE       MAPE
0  1.030857  0.812045  23.962577

TEST FOR LINEARITY AND INDEPENDENCE

We will test for linearity and independence by plotting the fitted values against the residuals and checking for patterns.

If there is no pattern, the model is linear and the residuals are independent. Otherwise, the model shows signs of non-linearity and the residuals are not independent.
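A sketch of such a fitted-vs-residuals plot with matplotlib, using randomly generated fitted values and residuals in place of the df_pred DataFrame built elsewhere in the notebook (the data here is purely illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Hypothetical stand-ins for df_pred["Fitted Values"] and df_pred["Residuals"]
rng = np.random.default_rng(0)
fitted = rng.normal(6, 1.5, size=200)
residuals = rng.normal(0, 1, size=200)  # patternless scatter around zero

fig, ax = plt.subplots()
ax.scatter(fitted, residuals, alpha=0.6)
ax.axhline(0, color="red", linestyle="--")  # reference line at zero residual
ax.set_xlabel("Fitted Values")
ax.set_ylabel("Residuals")
ax.set_title("Fitted vs Residuals")
fig.savefig("fitted_vs_residuals.png")
```

A roughly even band of points around the zero line, with no funnel or curve, is the pattern-free picture described above.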

     Actual Values  Fitted Values  Residuals
652       5.772882       5.774955  -0.002073
366       6.340426       5.396227   0.944200
447       9.259054       9.546073  -0.287019
618       6.229126       5.588692   0.640434
610       5.455543       5.333378   0.122165

TEST FOR NORMALITY

We will test for normality of the error terms using the Shapiro-Wilk test. If the p-value is greater than 0.05, the residuals can be considered normally distributed.

stats.shapiro(df_pred["Residuals"])

ShapiroResult(statistic=0.9822825883697879, pvalue=1.4029046526104298e-06)

Since the p-value is less than 0.05, the residuals are not normally distributed.

TEST FOR HOMOSCEDASTICITY

We will test for homoscedasticity by using the Goldfeld-Quandt test.

If we get a p-value greater than 0.05, we can say that the residuals are homoscedastic.
Otherwise, they are heteroscedastic.

import statsmodels.stats.api as sms


from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], x_train2)
lzip(name, test)

[('F statistic', 1.0339466399164485), ('p-value', 0.38839487294695685)]

Since the p-value is greater than 0.05, the residuals are homoscedastic.

Final Model Summary

olsmodel_final = sm.OLS(y_train, x_train2).fit()


print(olsmodel_final.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.667
Model:                            OLS   Adj. R-squared:                  0.664
Method:                 Least Squares   F-statistic:                     233.5
Date:                Sun, 05 Jan 2025   Prob (F-statistic):          1.09e-136
Time:                        15:56:45   Log-Likelihood:                -909.96
No. Observations:                 590   AIC:                             1832.
Df Residuals:                     584   BIC:                             1858.
Df Model:                           5
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            4.7867      0.115     41.533      0.000       4.560       5.013
employment       0.0053      0.001      3.947      0.000       0.003       0.008
tobinq          -0.1406      0.015     -9.555      0.000      -0.170      -0.112
value         7.475e-05   8.81e-06      8.488      0.000    5.74e-05     9.2e-05
institutions     0.0251      0.002     10.122      0.000       0.020       0.030
sp500_yes        1.4786      0.129     11.487      0.000       1.226       1.731
==============================================================================
Omnibus:                       25.118   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               68.680
Skew:                          -0.020   Prob(JB):                     1.22e-15
Kurtosis:                       4.671   Cond. No.                     2.28e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.28e+04. This might indicate that there are strong multicollinearity or other numerical problems.

# checking model performance on train set (seen 70% of the data)
print("Training Performance\n")
olsmodel_final_train_perf = model_performance_regression(
olsmodel_final, x_train2, y_train
)
olsmodel_final_train_perf

Training Performance

RMSE MAE MAPE

0 1.131306 0.843946 26.941502

# checking model performance on test set (unseen 30% of the data)
print("Test Performance\n")
olsmodel_final_test_perf = model_performance_regression(olsmodel_final, x_test2, y_test)
olsmodel_final_test_perf

Test Performance

RMSE MAE MAPE

0 1.030857 0.812045 23.962577
