Predictive Modelling Project - Nandini

The document outlines a predictive modeling project aimed at forecasting sales for 759 firms based on attributes such as capital, patents, and R&D spending. It includes sections on data loading, exploratory data analysis, model building, and actionable insights, emphasizing the identification of key attributes influencing sales. The analysis reveals significant correlations among variables and highlights outliers and skewness in the data distributions.

Predictive Modelling

Project

BUSINESS REPORT
NANDINI PRIYA MS
Table of Contents
List of Tables
List of Figures
1.1 Problem Statement
1.2 Loading the Dataset and Data Overview
1.3 Exploratory Data Analysis
1.4 Data Preprocessing
1.5 Model Building
1.6 Final Model Building
1.7 Actionable Insights and Recommendations
List of Tables
Table 1: First 5 rows of the dataset
Table 2: Last 5 rows of the dataset
Table 3: Information about the dataset
Table 4: Datatypes of the variables in the dataset
Table 5: Unique values in the dataset
Table 6: Statistical Summary of the dataset
Table 7: No. of missing values in the dataset
Table 8: Linear Regression - Model Summary
Table 9: Performance Metrics for Linear regression Training data
Table 10: Performance Metrics for Linear regression Test data
Table 11: VIF values for the test of Multicollinearity
Table 12: Performance Metrics after removing the variables
Table 13: Linear Regression - Final Model Summary
Table 14: Performance Metrics for Linear regression Training data
Table 15: Performance Metrics for Linear regression Test data

1.1 Problem Statement

1.1.1 Context

You are part of an investment firm, and your task is to research these 759 firms.
You are provided with a dataset containing the sales and other attributes of these
firms. Predict the sales of the firms from the details given in the dataset to help
the company invest consciously, and identify the most important attributes driving
sales.

1.1.2 Data Dictionary


The detailed data dictionary is given below.
1. sales: Normalized values of the Sales
2. capital: Net stock of property, plant, and equipment.
3. patents: Granted patents.
4. randd: R&D stock (in millions of dollars).
5. employment: Employment (in 1000s).
6. sp500: Membership of the firm in the S&P 500 index. The S&P 500 is a stock
market index that measures the stock performance of 500 large companies listed
on stock exchanges in the United States.
7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio between
a physical asset's market value and its replacement value.
8. value: Stock market value.

1.2 Loading the Dataset and Data Overview


The dataset provided was loaded into a pandas dataframe for the analysis.

1.2.1 Getting the first 5 and last 5 rows


The dataset has been loaded successfully. It has 738 rows × 9 columns.

● First 5 rows of the dataset:

Table 1: First 5 rows of the dataset

● Last 5 rows of the dataset

Table 2: Last 5 rows of the dataset
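The loading and preview steps described above can be sketched as below; the inline three-row sample is a hypothetical stand-in for the real file, which in the actual project would be read with pd.read_csv on the provided dataset.

```python
import io
import pandas as pd

# Hypothetical 3-row sample standing in for the real file; in the actual
# project this would be pd.read_csv("<dataset>.csv").
csv = io.StringIO(
    "sales,capital,patents,randd,employment,sp500,tobinq,value,institutions\n"
    "6.1,205.8,3,36.8,3.0,no,1.7,418.5,44.4\n"
    "7.2,5100.0,120,900.0,60.0,yes,3.2,8000.0,70.1\n"
    "5.4,50.2,0,5.1,1.2,no,0.9,120.0,20.0\n"
)
df = pd.read_csv(csv)
print(df.shape)   # (3, 9) for this sample; (738, 9) for the full dataset
print(df.head())  # Table 1: first 5 rows
print(df.tail())  # Table 2: last 5 rows
```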

1.2.2 Information about the dataset

Table 3: Information about the dataset

● There are 9 variables in the dataset with no null values.

1.2.3 Data types of the variables

Data type    No. of variables
object       1
float64      7
int64        1

Table 4: Datatypes of the variables in the dataset

1.2.4 Unique values in the attributes of the dataset

Table 5: Unique values in the dataset

● Among the variables in the dataset, sales, capital, and value have the
highest counts of unique values.

1.2.5 Check for Duplicate Records


● No duplicate records exist in this dataset.

1.2.6 Statistical Description of the Dataset


● A description of the columns of the dataset is given below. Sales ranges from 4.8 to 7.5.
● Sales: The average sales value is 6.17 (units not specified), with a standard deviation of 1.96 and a median of 6.11. The distribution is slightly skewed to the right: most firms have sales close to the median, but a few have significantly higher sales.
● Capital: The average capital is 2028.51, with a standard deviation of 6550.94 and a median of 205.81. The distribution is highly right-skewed: a few firms with extremely high capital pull the mean far above the median.
● Patents: The average number of patents is 26.32, with a standard deviation of 98.57 and a median of 3.00. Most firms hold relatively few patents, but a few hold a very large number.
● R&D (randd): The average R&D stock is 449.88, with a standard deviation of 2034.80 and a median of 36.82. The distribution is highly right-skewed: a few firms with extremely high R&D spending pull the mean upward.
● Employment: The average employment is 14.50 (in 1000s), with a standard deviation of 43.89 and a median of 3.04. Most firms are relatively small, but a few are very large.
● Tobin's q (tobinq): The average is 2.79, with a standard deviation of 3.37 and a median of 1.68. Most firms have a q below the average, but a few have significantly higher values.
● Value: The average stock market value is 2797.51, with a standard deviation of 7159.92 and a median of 418.52. A few firms with extremely high values pull the mean to the right.
● Institutions: The average is 43.10, with a standard deviation of 21.73 and a median of 44.43. Since the median slightly exceeds the mean, the distribution is, if anything, mildly left-skewed: most firms have a relatively large institutional presence, with a few at much lower values.
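The summary statistics above are the kind produced by pandas' describe(); the mean-versus-median comparison used throughout this section can be illustrated on a small hypothetical series (the values below are made up for the example):

```python
import pandas as pd

# Hypothetical mini-series; the real figures come from the firms dataset.
s = pd.Series([1, 2, 3, 4, 100], name="capital")
print(s.describe())           # count, mean, std, min, quartiles, max
print(s.mean() > s.median())  # True: the large value pulls the mean right
```

A right-skewed column shows exactly this signature: mean well above median, as with capital (2028.51 vs 205.81).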
1.3 Exploratory Data Analysis
1.3.1 Univariate Analysis

In this section, we analyze the distribution of the independent variables. This
helps us identify patterns among the variables and the effects they have on the
target variable.

First, let us see how the target variable (sales) is distributed.

The data appears to be approximately normally distributed, with a slight skew to
the right.
Capital
The distribution is heavily skewed to the right, with a large number of outliers on the right side. Key observations:
● Extreme outliers: The numerous dots extending to the right of the boxplot indicate a significant number of outliers, i.e., some extremely large capital values.
● Tight distribution: The majority of the data points are clustered near the lower end of the distribution.
● Long tail: The long right tail confirms the heavy right-skewness of the data.
● Median and quartiles: The median (the line within the box) sits towards the lower end of the box, and the interquartile range (the length of the box) is small relative to the overall range of the data.

A few very large capital values pull the mean to the right; frequency decreases steadily as capital increases.
Patents
The data appears to be heavily skewed to the right, with a large number of outliers on the right side (median 3.00, far below the mean of 26.32). Key observations:
● Extreme outliers: The numerous dots extending to the right indicate some extremely large patent counts.
● Tight distribution: The majority of the data points are clustered near the lower end of the distribution.
● Long tail: The long right tail confirms the heavy right-skewness.
● Median and quartiles: The median sits towards the lower end of the box, and the interquartile range is small relative to the overall range of the data.

randd, employment, tobinq, and value
All four variables show the same pattern as capital: heavily right-skewed distributions with a large number of outliers on the right side. In each boxplot:
● Extreme outliers: Numerous dots extend to the right, indicating some extremely large values.
● Tight distribution: The majority of the data points cluster near the lower end of the distribution.
● Long tail: The long right tail confirms the heavy right-skewness.
● Median and quartiles: The median sits towards the lower end of the box, and the interquartile range is small relative to the overall range of the data.

sp500
The count plot shows the distribution of this categorical variable. Based on the plot:
● Categories: The variable has two categories, "no" and "yes."
● Distribution: The "no" category has a significantly higher count: 70.6% of the firms are not in the S&P 500, while 29.4% are.
● Dominant category: "no" is the dominant category, i.e., most firms in the dataset are not index members.

1.3.2 Bivariate Analysis

For bivariate analysis, we plot boxplots of the independent variables against the
target variable. This helps us analyze each variable's contribution to the trends
and patterns in firm sales.

The boxplots show a clear difference in the distribution of sales between companies
in the S&P 500 and those not in it. Key observations:
● Median sales: The median sales for companies in the S&P 500 (right box) are higher than for those not in it (left box), indicating that the typical sales value is greater for companies in the index.
● Interquartile range: The interquartile range (the length of the box) for S&P 500 companies is larger, suggesting greater variability in sales among companies in the index.
● Outliers: There are more outliers (points outside the whiskers) among companies not in the S&P 500, indicating a few non-index companies with sales far above or below the majority.
Strong Positive Correlations:
● Sales and Capital: A strong positive correlation exists between sales and capital: companies with higher capital tend to have higher sales.
● Capital and R&D: A strong positive correlation between capital and randd implies that companies with more capital allocate more resources to research and development.
● Patents and R&D: A strong positive correlation between patents and randd indicates that companies with more patents invest more in research and development.
Moderate Positive Correlations:
● Sales and Employment: Larger companies (with more employees) generally have higher sales.
● Capital and Patents: Companies with more capital may invest more in research and development, leading to more patents.
● Employment and R&D: Larger companies may have more resources for research and development.
● Sales and R&D: Companies investing more in R&D tend to have higher sales.
● Sales and Value: Companies with higher sales tend to have higher stock market value.
● Capital and Value: Companies with more capital tend to be valued more highly.
● Patents and Value: Companies with more patents tend to be valued more highly.
● Employment and Value: Larger companies tend to be valued more highly.
Weak or No Correlation:
● tobinq and other variables: tobinq has weak or no correlation with most of the other variables, suggesting it is not strongly related to the other factors considered in this analysis.
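The pairwise correlations above come from a correlation matrix of the numeric columns, typically shown as a heatmap. A minimal sketch with hypothetical values (the real matrix uses the full dataset):

```python
import pandas as pd

# Hypothetical numbers; the report's heatmap uses the real firm columns.
df = pd.DataFrame({
    "sales":   [1.0, 2.1, 3.2, 4.0, 5.1],
    "capital": [10, 22, 31, 44, 50],
    "tobinq":  [2.0, 1.1, 3.0, 0.9, 2.2],
})
corr = df.corr(numeric_only=True)
print(corr.round(2))  # sales-capital near 1; tobinq weakly related to both
```

In the actual analysis, seaborn's heatmap(corr, annot=True) would render this matrix as the figure the observations are drawn from.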

1.4 Data Preprocessing

1.4.1 Outliers Check


General Observations:
● Skewness: Most variables exhibit a rightward skew: a few large values pull the distribution to the right, as is evident in the long right tails of the box plots.
● Outliers: Several outliers are present in most variables, especially capital, patents, randd, and value. These are a few extreme data points significantly different from the majority.
● Median and interquartile range (IQR): Median values and IQRs vary across the variables, indicating different levels of central tendency and dispersion.
Specific Variable Observations:
● Sales: The distribution is relatively symmetric, with a median around 6 and a few outliers on both sides.
● Capital: Highly right-skewed with many outliers; the median is well below the mean.
● Patents: Similar to capital, highly right-skewed with outliers; the median is well below the mean.
● Randd: Another highly skewed distribution with outliers, suggesting that R&D spending is concentrated in a few companies.
● Employment: Also right-skewed, but less severely than the previous variables, with fewer outliers.
● tobinq: Right-skewed with a number of outliers (median 1.68, per the summary statistics).
● Value: Like capital and patents, highly right-skewed with outliers.
● Institutions: Less skewed than the other variables, but some outliers remain.
Overall Insights:
● The data contains a significant number of outliers, especially for capital, patents, randd, and value.
● The distributions of many variables are right-skewed: a few large values pull the mean to the right.
● Medians and IQRs vary across the variables, indicating different levels of central tendency and dispersion.
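The outlier counts behind these boxplots follow the standard Tukey-fence rule (1.5 × IQR beyond the quartiles). A minimal sketch on a hypothetical series:

```python
import pandas as pd

def iqr_outliers(s: pd.Series) -> int:
    """Count points outside the Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

# Hypothetical series standing in for a column such as capital.
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])
print(iqr_outliers(s))  # 1 (the value 100)
```

Applied per column, this function reproduces the dots plotted beyond the whiskers in the boxplots above.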
1.4.2 Missing values

● There are no missing values in the dataset (Table 7).

1.4.3 Data preparation for modeling

● Added an intercept term (a column of ones) to the independent variable matrix X. This is essential for fitting regression models with an intercept, as it allows the model to capture the baseline value of the dependent variable when all independent variables are zero.
● Converted the categorical variables in the DataFrame X into numerical representations: each unique category within a categorical variable is assigned a new binary column. The drop_first=True parameter avoids the "dummy variable trap," which occurs when redundant information is introduced by including all possible dummy columns; dropping the first category reduces the number of columns while retaining the necessary information.
● Randomly split the data into training and testing sets, with 80% of the samples used for training and 20% for testing. The random_state=1 parameter seeds the random number generator, ensuring the split is reproducible.

Number of rows in train data = 590
Number of rows in test data = 148
1.5 Model Building

1.5.1 Linear Regression:

The regression analysis suggests that the model explains a significant portion of
the variation in the dependent variable (sales). The R-squared of 0.669 indicates
that approximately 66.9% of the variation in sales can be explained by the
included variables.
Key Findings:

Significant Predictors:
Based on the regression results, the following predictors are statistically
significant at the typical 0.05 level:
● employment
● tobinq
● value
● institutions
● sp500_yes
These predictors have p-values less than 0.05, indicating that they make a
significant contribution to explaining the variation in sales.
Non-Significant Predictors:
● capital
● patents
● randd
These predictors have p-values greater than 0.05, suggesting that they are not
statistically significant at that level.

Model Fit
● R-squared: 0.669 - approximately 66.9% of the variation in sales can be explained by the independent variables in the model.
● Adjusted R-squared: 0.664 - a version of R-squared that penalizes the number of predictors, giving a more accurate measure of the model's fit.
● The Durbin-Watson statistic is close to 2, suggesting no significant autocorrelation in the residuals.
● The Omnibus and Jarque-Bera tests indicate that the residuals are not normally distributed, which may be a concern.

1.5.2 Model evaluation

Checking model performance on the training set (the 80% of data seen during fitting):

Based on the computed values:

● RMSE: 1.127269
● MAE: 0.844911
● MAPE: 26.96%
Overall Assessment:
● RMSE and MAE: Both are relatively low, indicating that the model's predictions are generally close to the actual values; their precise interpretation depends on the scale of the (normalized) sales variable.
● MAPE: 26.96% is relatively high, suggesting the model makes sizable percentage errors, especially for smaller sales values. This could be due to outliers or systematic biases in the model.

Checking model performance on the test set (the 20% of data held out):

Metrics:
● RMSE: 1.030785
● MAE: 0.81383
● MAPE: 23.98%
Overall Assessment:
● RMSE and MAE: Both are slightly lower than on the training set, indicating the model generalizes well and its predictions are generally close to the actual values.
● MAPE: 23.98% is still relatively high, suggesting sizable percentage errors, especially for smaller sales values.
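The three metrics reported in these tables can be computed as below; the function is a generic sketch (the sample arrays are made up), matching the standard definitions of RMSE, MAE, and percentage MAPE:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, and MAPE (%) as used in the report's evaluation tables."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true)) * 100  # assumes y_true has no zeros
    return rmse, mae, mape

rmse, mae, mape = regression_metrics([5.0, 6.0, 8.0], [5.5, 6.0, 7.0])
print(round(rmse, 3), round(mae, 3), round(mape, 2))  # 0.645 0.5 7.5
```

Because MAPE divides by the actual value, small true sales inflate it, which is why the report flags the ~24-27% MAPE despite low RMSE/MAE.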

1.5.3 Checking Linear Regression Assumptions

We will be checking the following Linear Regression assumptions:

1. No Multicollinearity
2. Linearity of variables
3. Independence of error terms
4. Normality of error terms
5. No Heteroscedasticity

1.5.3.1 TEST FOR MULTICOLLINEARITY

General rule of thumb:
● If VIF is 1, there is no correlation between that predictor and the remaining predictor variables.
● If VIF exceeds 5 (or is close to 5), there is moderate multicollinearity, which can be treated or left alone with proper reasoning.
● If VIF is 10 or more, there is high multicollinearity, which must be treated.
Observations:
● Moderate multicollinearity: The features const, capital, randd, and employment have VIF values above 5, indicating moderate multicollinearity. These variables may be correlated with other predictors in the model, potentially affecting the stability of the coefficient estimates. (A high VIF on the constant column is common and usually not interpreted.)
● Low multicollinearity: The remaining features have VIF values below 5.

1.5.3.2 Dropping high p-value variables

Selected features

['const', 'employment', 'tobinq', 'value', 'institutions', 'sp500_yes']

1.5.3.3 Re-building the model


Model Summary
● R-squared: 0.667 - approximately 66.7% of the variation in the dependent variable (sales) can be explained by the independent variables in the model.
● Adjusted R-squared: 0.664 - a version of R-squared that penalizes the number of predictors, giving a more accurate measure of the model's fit.
● F-statistic: 233.5 - a test of the overall significance of the model. The high F-statistic and very small p-value (1.09e-136) indicate the model is statistically significant.
Coefficient Interpretation
● const: 4.7867 - the intercept of the regression line; the predicted value of sales when all independent variables are zero.
● employment: 0.0053 - for a one-unit increase in employment, sales are expected to increase by 0.0053 units, holding all other variables constant.
● tobinq: -0.1406 - for a one-unit increase in tobinq, sales are expected to decrease by 0.1406 units, holding all other variables constant.
● value: 7.475e-05 - for a one-unit increase in value, sales are expected to increase by 7.475e-05 units, holding all other variables constant.
● institutions: 0.0251 - for a one-unit increase in institutions, sales are expected to increase by 0.0251 units, holding all other variables constant.
● sp500_yes: 1.4786 - if sp500 is yes (compared to no), sales are expected to be higher by 1.4786 units, holding all other variables constant.
Diagnostic Tests
● Omnibus, Jarque-Bera, Skew, Kurtosis: These assess the normality of the residuals. The small p-values for Omnibus and Jarque-Bera suggest the residuals may not be perfectly normally distributed.
● Durbin-Watson: 1.983 - checks for autocorrelation in the residuals; a value close to 2 suggests no significant autocorrelation.
Overall Assessment
The regression model is statistically significant and explains a substantial portion of the variation in sales, though there are some concerns about the normality of the residuals.

Training Performance Metrics:

● RMSE: 1.131306
● MAE: 0.843946
● MAPE: 26.94%
Overall Assessment:
● RMSE and MAE: Essentially unchanged from the full model, indicating that dropping the non-significant variables cost almost no predictive accuracy on the training data.
● MAPE: 26.94% remains relatively high, suggesting sizable percentage errors, especially for smaller sales values.

Testing Performance Metrics:

● RMSE: 1.030785
● MAE: 0.81383
● MAPE: 23.98%
Overall Assessment:
● RMSE and MAE: Slightly lower than the training values, indicating that the reduced model generalizes well to unseen data.
● MAPE: 23.98% is still relatively high, suggesting sizable percentage errors, especially for smaller sales values.

1.5.4 Test for Linearity and Independence


Test for linearity

Here are some observations:

1. Random scatter: The residuals are scattered randomly around the horizontal line at 0, suggesting the model captures the underlying relationship between the variables reasonably well.
2. No clear pattern: There is no obvious pattern or trend in the residuals, indicating that the assumption of homoscedasticity (constant error variance) is likely met.
3. Outlier: There may be a potential outlier on the right side of the plot, far from the other points; it is worth investigating whether it is an anomaly or is influencing the model's fit.

Test for normality


Key observations from the histogram of residuals:
● Skewness: The distribution appears slightly left-skewed, with more residuals on the lower end of the distribution than the higher end.
● Outlier: There is a potential outlier on the left side of the plot, far from the majority of the residuals.
● Shape: The histogram is roughly bell-shaped, but not perfectly symmetric.
Key observations from the Q-Q plot:
● Linearity: The points follow a fairly straight line, especially in the middle section, suggesting the residuals are approximately normally distributed.
● Deviations: There are deviations from the line in the tails, indicating a distribution with slightly heavier tails than normal.
● Outlier: An outlier on the left side of the plot may be influencing the tail behavior.

ShapiroResult(statistic=0.9822825883697885, pvalue=1.4029046526110507e-06)

● Since the p-value < 0.05, the residuals are not normal as per the Shapiro-Wilk test.
● Strictly speaking, then, the normality assumption is violated. However, the test statistic (0.982) is close to 1 and the Q-Q plot is close to linear, so as an approximation the distribution can be accepted as close to normal.
● The assumption is therefore approximately satisfied.

TEST FOR HOMOSCEDASTICITY

[('F statistic', 1.0339466399164483), ('p-value', 0.3883948729469446)]


Given that the p-value (0.388) is greater than 0.05 (a common significance level), we fail to reject the null hypothesis of homoscedasticity: there is not enough evidence to conclude that the errors in the regression model are heteroscedastic.
1.6 Final Model Summary

Model Summary
● R-squared: 0.667 - approximately 66.7% of the variation in the dependent variable (sales) can be explained by the independent variables in the model.
● Adjusted R-squared: 0.664 - a version of R-squared that penalizes the number of predictors, giving a more accurate measure of the model's fit.
● F-statistic: 233.5 - the high F-statistic and very small p-value (1.09e-136) indicate the model is statistically significant overall.

Coefficient Interpretation
● const: 4.7867 - the intercept; the predicted value of sales when all independent variables are zero.
● employment: 0.0053 - for a one-unit increase in employment, sales are expected to increase by 0.0053 units, holding all other variables constant.
● tobinq: -0.1406 - for a one-unit increase in tobinq, sales are expected to decrease by 0.1406 units, holding all other variables constant.
● value: 7.475e-05 - for a one-unit increase in value, sales are expected to increase by 7.475e-05 units, holding all other variables constant.
● institutions: 0.0251 - for a one-unit increase in institutions, sales are expected to increase by 0.0251 units, holding all other variables constant.
● sp500_yes: 1.4786 - if sp500 is yes (compared to no), sales are expected to be higher by 1.4786 units, holding all other variables constant.
Diagnostic Tests
● Omnibus, Jarque-Bera, Skew, Kurtosis: The small p-values for Omnibus and Jarque-Bera suggest the residuals may not be perfectly normally distributed.
● Durbin-Watson: 1.983 - a value close to 2 suggests no significant autocorrelation in the residuals.
Overall Assessment
The final model is statistically significant and explains a substantial portion of the variation in sales. The remaining concerns about residual normality may be worth investigating further, potentially with alternative modeling approaches or transformations.

Training performance

Overall Assessment:
● RMSE and MAE: Relatively low on the training data, indicating predictions generally close to the actual values.
● MAPE: 26.94% is relatively high, suggesting sizable percentage errors, especially for smaller sales values.

Test performance

Overall Assessment:
● RMSE and MAE: Relatively low on the test data as well, indicating the model generalizes to unseen firms.
● MAPE: 23.98% is still relatively high, suggesting sizable percentage errors, especially for smaller sales values.

1.7 Actionable Insights and Recommendations

The final model explains 66.7% of the variation in sales, as indicated by the R-squared value.
The most significant predictors are:
● employment
● tobinq
● value
● institutions
● sp500_yes
These predictors have p-values less than 0.05, so they contribute meaningfully to explaining the variation in the dependent variable (sales).
Regarding the impact of a unit increase in the predictors:
● employment: For every unit increase in employment, sales are expected to increase by 0.0053 units, holding other variables constant.
● tobinq: For every unit increase in tobinq, sales are expected to decrease by 0.1406 units, holding other variables constant.
● value: For every unit increase in value, sales are expected to increase by 7.475e-05 units, holding other variables constant.
● institutions: For every unit increase in institutions, sales are expected to increase by 0.0251 units, holding other variables constant.
● sp500_yes: If a firm is in the S&P 500 (compared to not), its sales are expected to be higher by 1.4786 units, holding other variables constant.
Business Insights from the Model:
● Key drivers of sales: Employment, tobinq, stock market value, institutional presence, and S&P 500 membership (sp500_yes) are the significant drivers of sales among these firms.
● Impact of tobinq: Interestingly, Tobin's q (the ratio of a firm's market value to its replacement value) has a negative coefficient: firms with a higher q tend to have lower sales, holding the other factors constant.
● Importance of institutions: Institutions play a positive role in driving sales, indicating that institutional involvement is associated with higher sales for these firms.
● S&P 500 membership: The sp500_yes coefficient suggests that membership in the S&P 500 index is associated with significantly higher sales.
