Predictive Modelling Project - Nandini

The document outlines a predictive modeling project aimed at forecasting sales for 759 firms based on attributes such as capital, patents, and R&D spending. It includes sections on data loading, exploratory data analysis, model building, and actionable insights, emphasizing the identification of key attributes influencing sales. The analysis reveals significant correlations among variables and highlights outliers and skewness in the data distributions.

Predictive Modelling

Project

BUSINESS REPORT
NANDINI PRIYA MS
Table of Contents
List of Tables
List of Figures
1.1 Problem Statement
1.2 Loading the Dataset and Data Overview
1.3 Exploratory Data Analysis
1.4 Data Preprocessing
1.5 Model Building
1.6 Final Model Building
1.7 Actionable Insights and Recommendations
List of Tables
Table 1: First 5 rows of the dataset
Table 2: Last 5 rows of the dataset
Table 3: Information about the dataset
Table 4: Datatypes of the variables in the dataset
Table 5: Unique values in the dataset
Table 6: Statistical Summary of the dataset
Table 7: No. of missing values in the dataset
Table 8: Linear Regression - Model Summary
Table 9: Performance Metrics for Linear regression Training data
Table 10: Performance Metrics for Linear regression Test data
Table 11: VIF values for the test of Multicollinearity
Table 12: Performance Metrics after removing the variables
Table 13: Linear Regression - Final Model Summary
Table 14: Performance Metrics for Linear regression Training data
Table 15: Performance Metrics for Linear regression Test data

1.1 Problem Statement

1.1.1 Context

You are part of an investment firm, and your task is to research these 759 firms.
You are provided with a dataset containing the sales and other attributes of these
firms. Predict the sales of the firms from the details given in the dataset to help
the company invest consciously, and identify the most important attributes driving
sales.

1.1.2 Data Dictionary


The detailed data dictionary is given below.
1. sales: Normalized values of the Sales
2. capital: Net stock of property, plant, and equipment.
3. patents: Granted patents.
4. randd: R&D stock (in millions of dollars).
5. employment: Employment (in 1000s).
6. sp500: Membership of the firm in the S&P 500 index. The S&P 500 is a stock
market index that measures the stock performance of 500 large companies listed
on stock exchanges in the United States.
7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio between
a physical asset's market value and its replacement value.
8. value: Stock market value.

1.2 Loading the Dataset and Data Overview


The dataset provided was loaded into a pandas dataframe for the analysis.

1.2.1 Getting the first 5 and last 5 rows


The dataset has been loaded successfully. It has 738 rows × 9 columns.

● First 5 rows of the dataset:

Table 1: First 5 rows of the dataset

● Last 5 rows of the dataset

Table 2: Last 5 rows of the dataset
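The loading and preview steps described above can be sketched as below; the inline three-row sample is a hypothetical stand-in for the real file, which in the actual project would be read with pd.read_csv on the provided dataset.

```python
import io
import pandas as pd

# Hypothetical 3-row sample standing in for the real file; in the actual
# project this would be pd.read_csv("<dataset>.csv").
csv = io.StringIO(
    "sales,capital,patents,randd,employment,sp500,tobinq,value,institutions\n"
    "6.1,205.8,3,36.8,3.0,no,1.7,418.5,44.4\n"
    "7.2,5100.0,120,900.0,60.0,yes,3.2,8000.0,70.1\n"
    "5.4,50.2,0,5.1,1.2,no,0.9,120.0,20.0\n"
)
df = pd.read_csv(csv)
print(df.shape)   # (3, 9) for this sample; (738, 9) for the full dataset
print(df.head())  # Table 1: first 5 rows
print(df.tail())  # Table 2: last 5 rows
```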

1.2.2 Information about the dataset

Table 3: Information about the dataset

● There are 9 variables in the dataset with no null values.

1.2.3 Data types of the variables

Data type    No. of variables
object       1
float64      7
int64        1

Table 4: Datatypes of the variables in the dataset

1.2.4 Unique values in the attributes of the dataset

Table 5: Unique values in the dataset

● Among the variables in the dataset, sales, capital, and value have the
highest counts of unique values.

1.2.5 Check for Duplicate Records


● No duplicate records exist in this dataset.

1.2.6 Statistical Description of the Dataset


● A description of the columns of the dataset is given below. Sales ranges from 4.8 to 7.5.
● Sales: The average sales value is 6.17 (units not specified), with a standard deviation of 1.96 and a median of 6.11. The distribution is slightly skewed to the right: most firms have sales close to the median, but a few have significantly higher sales.
● Capital: The average capital is 2028.51, with a standard deviation of 6550.94 and a median of 205.81. The distribution is highly right-skewed: a few firms with extremely high capital pull the mean far above the median.
● Patents: The average number of patents is 26.32, with a standard deviation of 98.57 and a median of 3.00. Most firms hold relatively few patents, but a few hold a very large number.
● R&D (randd): The average R&D stock is 449.88, with a standard deviation of 2034.80 and a median of 36.82. The distribution is highly right-skewed: a few firms with extremely high R&D spending pull the mean upward.
● Employment: The average employment is 14.50 (in 1000s), with a standard deviation of 43.89 and a median of 3.04. Most firms are relatively small, but a few are very large.
● Tobin's q (tobinq): The average is 2.79, with a standard deviation of 3.37 and a median of 1.68. Most firms have a q below the average, but a few have significantly higher values.
● Value: The average stock market value is 2797.51, with a standard deviation of 7159.92 and a median of 418.52. A few firms with extremely high values pull the mean to the right.
● Institutions: The average is 43.10, with a standard deviation of 21.73 and a median of 44.43. Since the median slightly exceeds the mean, the distribution is, if anything, mildly left-skewed: most firms have a relatively large institutional presence, with a few at much lower values.
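The summary statistics above are the kind produced by pandas' describe(); the mean-versus-median comparison used throughout this section can be illustrated on a small hypothetical series (the values below are made up for the example):

```python
import pandas as pd

# Hypothetical mini-series; the real figures come from the firms dataset.
s = pd.Series([1, 2, 3, 4, 100], name="capital")
print(s.describe())           # count, mean, std, min, quartiles, max
print(s.mean() > s.median())  # True: the large value pulls the mean right
```

A right-skewed column shows exactly this signature: mean well above median, as with capital (2028.51 vs 205.81).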
1.3 Exploratory Data Analysis
1.3.1 Univariate Analysis

In this section, we analyze the distribution of the independent variables. This
helps us identify patterns among the variables and the effects they have on the
target variable.

First, let us see how the target variable (sales) is distributed.

The data appears to be approximately normally distributed, with a slight skew to
the right.
Capital
The distribution is heavily skewed to the right, with a large number of outliers on the right side. Key observations:
● Extreme outliers: The numerous dots extending to the right of the boxplot indicate a significant number of outliers, i.e., some extremely large capital values.
● Tight distribution: The majority of the data points are clustered near the lower end of the distribution.
● Long tail: The long right tail confirms the heavy right-skewness of the data.
● Median and quartiles: The median (the line within the box) sits towards the lower end of the box, and the interquartile range (the length of the box) is small relative to the overall range of the data.

A few very large capital values pull the mean to the right; frequency decreases steadily as capital increases.
Patents
The data appears to be heavily skewed to the right, with a large number of outliers on the right side (median 3.00, far below the mean of 26.32). Key observations:
● Extreme outliers: The numerous dots extending to the right indicate some extremely large patent counts.
● Tight distribution: The majority of the data points are clustered near the lower end of the distribution.
● Long tail: The long right tail confirms the heavy right-skewness.
● Median and quartiles: The median sits towards the lower end of the box, and the interquartile range is small relative to the overall range of the data.

randd, employment, tobinq, and value
All four variables show the same pattern as capital: heavily right-skewed distributions with a large number of outliers on the right side. In each boxplot:
● Extreme outliers: Numerous dots extend to the right, indicating some extremely large values.
● Tight distribution: The majority of the data points cluster near the lower end of the distribution.
● Long tail: The long right tail confirms the heavy right-skewness.
● Median and quartiles: The median sits towards the lower end of the box, and the interquartile range is small relative to the overall range of the data.

sp500
The count plot shows the distribution of this categorical variable. Based on the plot:
● Categories: The variable has two categories, "no" and "yes."
● Distribution: The "no" category has a significantly higher count: 70.6% of the firms are not in the S&P 500, while 29.4% are.
● Dominant category: "no" is the dominant category, i.e., most firms in the dataset are not index members.

1.3.2 Bivariate Analysis

For bivariate analysis, we plot boxplots of the independent variables against the
target variable. This helps us analyze each variable's contribution to the trends
and patterns in firm sales.

The boxplots show a clear difference in the distribution of sales between companies
in the S&P 500 and those not in it. Key observations:
● Median sales: The median sales for companies in the S&P 500 (right box) are higher than for those not in it (left box), indicating that the typical sales value is greater for companies in the index.
● Interquartile range: The interquartile range (the length of the box) for S&P 500 companies is larger, suggesting greater variability in sales among companies in the index.
● Outliers: There are more outliers (points outside the whiskers) among companies not in the S&P 500, indicating a few non-index companies with sales far above or below the majority.
Strong Positive Correlations:
● Sales and Capital: A strong positive correlation exists between sales and capital: companies with higher capital tend to have higher sales.
● Capital and R&D: A strong positive correlation between capital and randd implies that companies with more capital allocate more resources to research and development.
● Patents and R&D: A strong positive correlation between patents and randd indicates that companies with more patents invest more in research and development.
Moderate Positive Correlations:
● Sales and Employment: Larger companies (with more employees) generally have higher sales.
● Capital and Patents: Companies with more capital may invest more in research and development, leading to more patents.
● Employment and R&D: Larger companies may have more resources for research and development.
● Sales and R&D: Companies investing more in R&D tend to have higher sales.
● Sales and Value: Companies with higher sales tend to have higher stock market value.
● Capital and Value: Companies with more capital tend to be valued more highly.
● Patents and Value: Companies with more patents tend to be valued more highly.
● Employment and Value: Larger companies tend to be valued more highly.
Weak or No Correlation:
● tobinq and other variables: tobinq has weak or no correlation with most of the other variables, suggesting it is not strongly related to the other factors considered in this analysis.
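The pairwise correlations above come from a correlation matrix of the numeric columns, typically shown as a heatmap. A minimal sketch with hypothetical values (the real matrix uses the full dataset):

```python
import pandas as pd

# Hypothetical numbers; the report's heatmap uses the real firm columns.
df = pd.DataFrame({
    "sales":   [1.0, 2.1, 3.2, 4.0, 5.1],
    "capital": [10, 22, 31, 44, 50],
    "tobinq":  [2.0, 1.1, 3.0, 0.9, 2.2],
})
corr = df.corr(numeric_only=True)
print(corr.round(2))  # sales-capital near 1; tobinq weakly related to both
```

In the actual analysis, seaborn's heatmap(corr, annot=True) would render this matrix as the figure the observations are drawn from.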

1.4 Data Preprocessing

1.4.1 Outliers Check


General Observations:
● Skewness: Most variables exhibit a rightward skew: a few large values pull the distribution to the right, as is evident in the long right tails of the box plots.
● Outliers: Several outliers are present in most variables, especially capital, patents, randd, and value. These are a few extreme data points significantly different from the majority.
● Median and interquartile range (IQR): Median values and IQRs vary across the variables, indicating different levels of central tendency and dispersion.
Specific Variable Observations:
● Sales: The distribution is relatively symmetric, with a median around 6 and a few outliers on both sides.
● Capital: Highly right-skewed with many outliers; the median is well below the mean.
● Patents: Similar to capital, highly right-skewed with outliers; the median is well below the mean.
● Randd: Another highly skewed distribution with outliers, suggesting that R&D spending is concentrated in a few companies.
● Employment: Also right-skewed, but less severely than the previous variables, with fewer outliers.
● tobinq: Right-skewed with a number of outliers (median 1.68, per the summary statistics).
● Value: Like capital and patents, highly right-skewed with outliers.
● Institutions: Less skewed than the other variables, but some outliers remain.
Overall Insights:
● The data contains a significant number of outliers, especially for capital, patents, randd, and value.
● The distributions of many variables are right-skewed: a few large values pull the mean to the right.
● Medians and IQRs vary across the variables, indicating different levels of central tendency and dispersion.
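The outlier counts behind these boxplots follow the standard Tukey-fence rule (1.5 × IQR beyond the quartiles). A minimal sketch on a hypothetical series:

```python
import pandas as pd

def iqr_outliers(s: pd.Series) -> int:
    """Count points outside the Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

# Hypothetical series standing in for a column such as capital.
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])
print(iqr_outliers(s))  # 1 (the value 100)
```

Applied per column, this function reproduces the dots plotted beyond the whiskers in the boxplots above.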
1.4.2 Missing values

● There are no missing values in the dataset (Table 7).

1.4.3 Data preparation for modeling

● Added an intercept term (a column of ones) to the independent variable matrix X. This is essential for fitting regression models with an intercept, as it allows the model to capture the baseline value of the dependent variable when all independent variables are zero.
● Converted the categorical variables in the DataFrame X into numerical representations: each unique category within a categorical variable is assigned a new binary column. The drop_first=True parameter avoids the "dummy variable trap," which occurs when redundant information is introduced by including all possible dummy columns; dropping the first category reduces the number of columns while retaining the necessary information.
● Randomly split the data into training and testing sets, with 80% of the samples used for training and 20% for testing. The random_state=1 parameter seeds the random number generator, ensuring the split is reproducible.

Number of rows in train data = 590
Number of rows in test data = 148
1.5 Model Building

1.5.1 Linear Regression:

The regression analysis suggests that the model explains a significant portion of
the variation in the dependent variable (sales). The R-squared of 0.669 indicates
that approximately 66.9% of the variation in sales can be explained by the
included variables.
Key Findings:

Significant Predictors:
Based on the regression results, the following predictors are statistically
significant at the typical 0.05 level:
● employment
● tobinq
● value
● institutions
● sp500_yes
These predictors have p-values less than 0.05, indicating that they make a
significant contribution to explaining the variation in sales.
Non-Significant Predictors:
● capital
● patents
● randd
These predictors have p-values greater than 0.05, suggesting that they are not
statistically significant at that level.

Model Fit
● R-squared: 0.669 - approximately 66.9% of the variation in sales can be explained by the independent variables in the model.
● Adjusted R-squared: 0.664 - a version of R-squared that penalizes the number of predictors, giving a more accurate measure of the model's fit.
● The Durbin-Watson statistic is close to 2, suggesting no significant autocorrelation in the residuals.
● The Omnibus and Jarque-Bera tests indicate that the residuals are not normally distributed, which may be a concern.

1.5.2 Model evaluation

Checking model performance on the training set (the 80% of data seen during fitting):

Based on the computed values:

● RMSE: 1.127269
● MAE: 0.844911
● MAPE: 26.96%
Overall Assessment:
● RMSE and MAE: Both are relatively low, indicating that the model's predictions are generally close to the actual values; their precise interpretation depends on the scale of the (normalized) sales variable.
● MAPE: 26.96% is relatively high, suggesting the model makes sizable percentage errors, especially for smaller sales values. This could be due to outliers or systematic biases in the model.

Checking model performance on the test set (the 20% of data held out):

Metrics:
● RMSE: 1.030785
● MAE: 0.81383
● MAPE: 23.98%
Overall Assessment:
● RMSE and MAE: Both are slightly lower than on the training set, indicating the model generalizes well and its predictions are generally close to the actual values.
● MAPE: 23.98% is still relatively high, suggesting sizable percentage errors, especially for smaller sales values.
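The three metrics reported in these tables can be computed as below; the function is a generic sketch (the sample arrays are made up), matching the standard definitions of RMSE, MAE, and percentage MAPE:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, and MAPE (%) as used in the report's evaluation tables."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true)) * 100  # assumes y_true has no zeros
    return rmse, mae, mape

rmse, mae, mape = regression_metrics([5.0, 6.0, 8.0], [5.5, 6.0, 7.0])
print(round(rmse, 3), round(mae, 3), round(mape, 2))  # 0.645 0.5 7.5
```

Because MAPE divides by the actual value, small true sales inflate it, which is why the report flags the ~24-27% MAPE despite low RMSE/MAE.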

1.5.3 Checking Linear Regression Assumptions

We will be checking the following Linear Regression assumptions:

1. No Multicollinearity
2. Linearity of variables
3. Independence of error terms
4. Normality of error terms
5. No Heteroscedasticity

1.5.3.1 TEST FOR MULTICOLLINEARITY

General rule of thumb:
● If VIF is 1, there is no correlation between that predictor and the remaining predictor variables.
● If VIF exceeds 5 (or is close to 5), there is moderate multicollinearity, which can be treated or left alone with proper reasoning.
● If VIF is 10 or more, there is high multicollinearity, which must be treated.
Observations:
● Moderate multicollinearity: The features const, capital, randd, and employment have VIF values above 5, indicating moderate multicollinearity. These variables may be correlated with other predictors in the model, potentially affecting the stability of the coefficient estimates. (A high VIF on the constant column is common and usually not interpreted.)
● Low multicollinearity: The remaining features have VIF values below 5.

1.5.3.2 Dropping high p-value variables

Selected features

['const', 'employment', 'tobinq', 'value', 'institutions', 'sp500_yes']

1.5.3.3 Re-building the model


Model Summary
● R-squared: 0.667 - approximately 66.7% of the variation in the dependent variable (sales) can be explained by the independent variables in the model.
● Adjusted R-squared: 0.664 - a version of R-squared that penalizes the number of predictors, giving a more accurate measure of the model's fit.
● F-statistic: 233.5 - a test of the overall significance of the model. The high F-statistic and very small p-value (1.09e-136) indicate the model is statistically significant.
Coefficient Interpretation
● const: 4.7867 - the intercept of the regression line; the predicted value of sales when all independent variables are zero.
● employment: 0.0053 - for a one-unit increase in employment, sales are expected to increase by 0.0053 units, holding all other variables constant.
● tobinq: -0.1406 - for a one-unit increase in tobinq, sales are expected to decrease by 0.1406 units, holding all other variables constant.
● value: 7.475e-05 - for a one-unit increase in value, sales are expected to increase by 7.475e-05 units, holding all other variables constant.
● institutions: 0.0251 - for a one-unit increase in institutions, sales are expected to increase by 0.0251 units, holding all other variables constant.
● sp500_yes: 1.4786 - if sp500 is yes (compared to no), sales are expected to be higher by 1.4786 units, holding all other variables constant.
Diagnostic Tests
● Omnibus, Jarque-Bera, Skew, Kurtosis: These assess the normality of the residuals. The small p-values for Omnibus and Jarque-Bera suggest the residuals may not be perfectly normally distributed.
● Durbin-Watson: 1.983 - checks for autocorrelation in the residuals; a value close to 2 suggests no significant autocorrelation.
Overall Assessment
The regression model is statistically significant and explains a substantial portion of the variation in sales, though there are some concerns about the normality of the residuals.

Training Performance Metrics:

● RMSE: 1.131306
● MAE: 0.843946
● MAPE: 26.94%
Overall Assessment:
● RMSE and MAE: Essentially unchanged from the full model, indicating that dropping the non-significant variables cost almost no predictive accuracy on the training data.
● MAPE: 26.94% remains relatively high, suggesting sizable percentage errors, especially for smaller sales values.

Testing Performance Metrics:

● RMSE: 1.030785
● MAE: 0.81383
● MAPE: 23.98%
Overall Assessment:
● RMSE and MAE: Slightly lower than the training values, indicating that the reduced model generalizes well to unseen data.
● MAPE: 23.98% is still relatively high, suggesting sizable percentage errors, especially for smaller sales values.

1.5.4 Test for Linearity and Independence


Test for linearity

Here are some observations:

1. Random scatter: The residuals are scattered randomly around the horizontal line at 0, suggesting the model captures the underlying relationship between the variables reasonably well.
2. No clear pattern: There is no obvious pattern or trend in the residuals, indicating that the assumption of homoscedasticity (constant error variance) is likely met.
3. Outlier: There may be a potential outlier on the right side of the plot, far from the other points; it is worth investigating whether it is an anomaly or is influencing the model's fit.

Test for normality


Key observations from the histogram of residuals:
● Skewness: The distribution appears slightly left-skewed, with more residuals on the lower end of the distribution than the higher end.
● Outlier: There is a potential outlier on the left side of the plot, far from the majority of the residuals.
● Shape: The histogram is roughly bell-shaped, but not perfectly symmetric.
Key observations from the Q-Q plot:
● Linearity: The points follow a fairly straight line, especially in the middle section, suggesting the residuals are approximately normally distributed.
● Deviations: There are deviations from the line in the tails, indicating a distribution with slightly heavier tails than normal.
● Outlier: An outlier on the left side of the plot may be influencing the tail behavior.

ShapiroResult(statistic=0.9822825883697885, pvalue=1.4029046526110507e-06)

● Since the p-value < 0.05, the residuals are not normal as per the Shapiro-Wilk test.
● Strictly speaking, then, the normality assumption is violated. However, the test statistic (0.982) is close to 1 and the Q-Q plot is close to linear, so as an approximation the distribution can be accepted as close to normal.
● The assumption is therefore approximately satisfied.

TEST FOR HOMOSCEDASTICITY

[('F statistic', 1.0339466399164483), ('p-value', 0.3883948729469446)]


Given that the p-value (0.388) is greater than 0.05 (a common significance level), we fail to reject the null hypothesis of homoscedasticity: there is not enough evidence to conclude that the errors in the regression model are heteroscedastic.
1.6 Final Model Summary

Model Summary
● R-squared: 0.667 - approximately 66.7% of the variation in the dependent variable (sales) can be explained by the independent variables in the model.
● Adjusted R-squared: 0.664 - a version of R-squared that penalizes the number of predictors, giving a more accurate measure of the model's fit.
● F-statistic: 233.5 - the high F-statistic and very small p-value (1.09e-136) indicate the model is statistically significant overall.

Coefficient Interpretation
● const: 4.7867 - the intercept; the predicted value of sales when all independent variables are zero.
● employment: 0.0053 - for a one-unit increase in employment, sales are expected to increase by 0.0053 units, holding all other variables constant.
● tobinq: -0.1406 - for a one-unit increase in tobinq, sales are expected to decrease by 0.1406 units, holding all other variables constant.
● value: 7.475e-05 - for a one-unit increase in value, sales are expected to increase by 7.475e-05 units, holding all other variables constant.
● institutions: 0.0251 - for a one-unit increase in institutions, sales are expected to increase by 0.0251 units, holding all other variables constant.
● sp500_yes: 1.4786 - if sp500 is yes (compared to no), sales are expected to be higher by 1.4786 units, holding all other variables constant.
Diagnostic Tests
● Omnibus, Jarque-Bera, Skew, Kurtosis: The small p-values for Omnibus and Jarque-Bera suggest the residuals may not be perfectly normally distributed.
● Durbin-Watson: 1.983 - a value close to 2 suggests no significant autocorrelation in the residuals.
Overall Assessment
The final model is statistically significant and explains a substantial portion of the variation in sales. The remaining concerns about residual normality may be worth investigating further, potentially with alternative modeling approaches or transformations.

Training performance

Overall Assessment:
● RMSE and MAE: Relatively low on the training data, indicating predictions generally close to the actual values.
● MAPE: 26.94% is relatively high, suggesting sizable percentage errors, especially for smaller sales values.

Test performance

Overall Assessment:
● RMSE and MAE: Relatively low on the test data as well, indicating the model generalizes to unseen firms.
● MAPE: 23.98% is still relatively high, suggesting sizable percentage errors, especially for smaller sales values.

1.7 Actionable Insights and Recommendations

The final model explains 66.7% of the variation in sales, as indicated by the R-squared value.
The most significant predictors are:
● employment
● tobinq
● value
● institutions
● sp500_yes
These predictors have p-values less than 0.05, so they contribute meaningfully to explaining the variation in the dependent variable (sales).
Regarding the impact of a unit increase in the predictors:
● employment: For every unit increase in employment, sales are expected to increase by 0.0053 units, holding other variables constant.
● tobinq: For every unit increase in tobinq, sales are expected to decrease by 0.1406 units, holding other variables constant.
● value: For every unit increase in value, sales are expected to increase by 7.475e-05 units, holding other variables constant.
● institutions: For every unit increase in institutions, sales are expected to increase by 0.0251 units, holding other variables constant.
● sp500_yes: If a firm is in the S&P 500 (compared to not), its sales are expected to be higher by 1.4786 units, holding other variables constant.
Business Insights from the Model:
● Key drivers of sales: Employment, tobinq, stock market value, institutional presence, and S&P 500 membership (sp500_yes) are the significant drivers of sales among these firms.
● Impact of tobinq: Interestingly, Tobin's q (the ratio of a firm's market value to its replacement value) has a negative coefficient: firms with a higher q tend to have lower sales, holding the other factors constant.
● Importance of institutions: Institutions play a positive role in driving sales, indicating that institutional involvement is associated with higher sales for these firms.
● S&P 500 membership: The sp500_yes coefficient suggests that membership in the S&P 500 index is associated with significantly higher sales.
