
BADM Assignment 1

Evaluate Model 0:

Question 1. Using the information in the training set, what is a naive prediction (without building any prediction model)
for average dollars spent per week by a panelist? Why?
A naïve prediction is the benchmark that sets the baseline for any prediction model. Instead of building a model, we simply use the mean (or median) of the output variable in the training set as the prediction for every record, since that single value is the best representation of the output without any predictors. In my case the training-set mean comes out to be $3.54 (refer to Model_0 in the Excel sheet).
Question 2. What is the RMSE on the validation set using naïve prediction?

The RMSE on the validation set comes out to be $1.85 (refer to Model_0 in the Excel sheet).
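For reference, here is a minimal Python sketch of this naïve benchmark. The data frames and the AVG_DOLLARS column name are placeholders for the assignment's training/validation split and the Avg(Sum(DOLLARS)) field; the actual work was done in Excel/XLMiner.

```python
# Minimal sketch of the naive benchmark, assuming the data sits in pandas
# DataFrames with a hypothetical "AVG_DOLLARS" column (average dollars spent
# per week), already split into training and validation sets.
import numpy as np
import pandas as pd

def naive_rmse(train: pd.DataFrame, valid: pd.DataFrame, target: str = "AVG_DOLLARS") -> float:
    """Predict the training-set mean for every validation record and return the RMSE."""
    naive_pred = train[target].mean()            # e.g. ~$3.54 in this assignment
    errors = valid[target] - naive_pred          # residuals of the constant prediction
    return float(np.sqrt((errors ** 2).mean()))  # RMSE, e.g. ~$1.85 here
```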

Evaluate Model 1:

Predictor Variables:

Output Variable: Avg(Sum(DOLLARS))

Question 3. Please copy the Prediction Summary for the training and the validation sets into your report. How does the
performance compare across the training and validation sets? Which error metrics can be compared across the two
summary reports? Is there an indication of over-fitting? How does this model (Model 1) compare to the Naïve model
(Model 0) as per the RMSE of this model and that of the Naïve Model?

Comparing the two Prediction Summary reports, we can see that performance drops on the validation set: every error metric except SSE increases. RMSE goes from $1.65 on the training set to $1.81 on the validation set, and R2 falls from 0.12 to 0.03, which is an indication of over-fitting.
All metrics except SSE can be compared across the two reports. SSE depends on the number of records, and the data is split 60:40 between training and validation here, so the SSE values are not comparable. The other metrics involve averaging over the number of records at some point in their calculation, which normalizes for the difference in sample size.
Model 1 looks only slightly better than the naïve model: its RMSE on both the training set ($1.65) and the validation set ($1.81) is below the naïve RMSE ($1.85). However, since R2 drops sharply on the validation set, we can only say that Model_1 is marginally better than Model_0.
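As an aside, the same train-versus-validation comparison can be sketched in Python. X_train, X_valid, y_train and y_valid are placeholder names for the Model 1 predictors and the Avg(Sum(DOLLARS)) output; the numbers quoted above come from the XLMiner Prediction Summary, not from this code.

```python
# Sketch of the train-vs-validation check used to spot over-fitting.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

model1 = LinearRegression().fit(X_train, y_train)

for name, X, y in [("training", X_train, y_train), ("validation", X_valid, y_valid)]:
    pred = model1.predict(X)
    rmse = np.sqrt(mean_squared_error(y, pred))  # comparable across sets (per-record average)
    r2 = r2_score(y, pred)                       # drops sharply when the model over-fits
    print(f"{name}: RMSE={rmse:.2f}, R2={r2:.2f}")
```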
Question 4. Create a histogram for the training set residuals as obtained in Model 1. What does this chart tell us about
potential prediction errors---skewed / not skewed, the nature of positive or negative errors, etc.? How is this situation
typically handled?

The histogram of the training-set residuals is skewed to the left, with a long tail of large negative residuals, i.e. records where the actual value is well below the predicted value.

This situation is typically handled by transforming the output variable, for example with a logarithmic or square-root transformation. Such a transformation compresses the long tail, brings the values onto a more comparable scale, and makes the relationship with the predictors closer to linear.
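A rough Python illustration of this remedy, using the same placeholder X_train/y_train names as above: fit the model on the log of the output and inspect the residual histogram.

```python
# Model the log of the output and check whether the residuals become symmetric.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

y_log = np.log(y_train)                      # LNAvg(Sum(DOLLARS))
model2 = LinearRegression().fit(X_train, y_log)
residuals = y_log - model2.predict(X_train)  # residuals on the log scale

plt.hist(residuals, bins=30)                 # should now look roughly symmetric
plt.xlabel("Residual (log dollars)")
plt.ylabel("Count")
plt.show()
```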

Evaluate Model 2:

Predictor Variables: Same as Model_1

Output Variable: LNAvg(Sum(DOLLARS))

Question 5. Include the histogram of the training set residuals in Model 2 in your report. How does this histogram
compare to Model 1 histogram? Can you say which model ‘better’ fits the training data?

The residual histogram for Model_2 looks approximately normally distributed, suggesting that the skewness was removed by the logarithmic transformation of Avg(Sum(DOLLARS)).

Since the residuals are now roughly symmetric and normally distributed, we can say that Model_2 fits the training data better.
Question 6. From Model 2 output, copy the Prediction Summary Reports for the training and validation sets. Can the
Prediction Summary Reports as obtained in Model 2 be compared to those obtained in Model 1 as is? Well, the answer is no, but I would like you to explain clearly why not.

The metrics for Model_2 look better than those for Model_1: the error measures are much smaller and R2 is a bit higher as well. However, the values for Model_1 and Model_2 cannot be compared directly, because the two models are measured on different scales: Model_1 predicts dollars, while Model_2 predicts the natural log of dollars, so its errors are in log units.

Question 7. Based on your explanation above, manually compute the RMSE of the training and the validation sets
corresponding to Model 2 so that this RMSE can be compared to the RMSE obtained in Model 1.

After applying the inverse of the log transformation (exponentiating each individual prediction) in the training and validation data sets:
Refer Model_2 training worksheet: RMSE_Model_2 (training set): $1.69
Refer Model_2 validation worksheet: RMSE_Model_2 (validation set): $1.84

RMSE values of Model_1, taken from Question 3:
RMSE_Model_1 (training set): $1.65
RMSE_Model_1 (validation set): $1.81
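A short sketch of this back-transformation, with model2 standing in for a model fitted on log-dollars (as in the earlier sketch) and the other names as placeholders:

```python
# Exponentiate each log-scale prediction, then compute RMSE against the
# original dollar values so the result is comparable with Model 1.
import numpy as np

def rmse_on_dollar_scale(model, X, y_dollars):
    pred_dollars = np.exp(model.predict(X))   # inverse of the natural-log transform
    return float(np.sqrt(np.mean((y_dollars - pred_dollars) ** 2)))

rmse_train = rmse_on_dollar_scale(model2, X_train, y_train)  # ~1.69 in this assignment
rmse_valid = rmse_on_dollar_scale(model2, X_valid, y_valid)  # ~1.84 in this assignment
```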

Question 8. Which model would you prefer for predicting average dollars spent if you go with RMSE? Why?

Based on RMSE alone, I would go with Model_1, since its RMSE on both the training and the validation set is lower than that of Model_2. On the other hand, the skewness of the residuals is eliminated in Model_2, so from a residual-analysis standpoint Model_2 would be the preferred model for prediction.

Model_1 (LinReg_Output):
Question 9. If we only use two predictors in addition to the intercept, which pair will you choose? Why?

Based on the LinReg_FS output, we would choose the following pair of predictors in addition to the intercept:
• HH_AGE
• CHILDREN_GROUP_CODE

After the intercept, these are the two highest-ranked predictor variables in the feature selection report.
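A rough Python analogue of this feature-selection step (greedy forward selection of two predictors); the XLMiner LinReg_FS procedure may differ in detail, and the expected output simply mirrors the report.

```python
# Greedily pick the two predictors that most improve a linear model.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=2,
                                direction="forward")
sfs.fit(X_train, y_train)
print(list(X_train.columns[sfs.get_support()]))  # expected: ['HH_AGE', 'CHILDREN_GROUP_CODE']
```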

Question 10. Which subset of predictors seems to be most promising in terms of predictive power? Why?

Subset 12 (with 12 predictor variables) seems to be the most promising in terms of predictive power: its Mallow's Cp (~15) is close to the number of coefficients in the model (12 predictors plus the intercept), and among all the candidate subsets it is the one whose Cp comes closest to its number of predictors. A Cp value near the number of coefficients indicates that the subset introduces little bias relative to the full model.
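For completeness, a sketch of how Mallow's Cp is computed for a candidate subset; sse_subset, sse_full and the counts are placeholders to be read off the regression outputs.

```python
# Cp = SSE_p / MSE_full - (n - 2*p), where p counts coefficients incl. the intercept.
def mallows_cp(sse_subset: float, sse_full: float, n: int, k_full: int, k_subset: int) -> float:
    """Return Mallow's Cp for a subset with k_subset coefficients."""
    mse_full = sse_full / (n - k_full)        # error variance estimate from the full model
    return sse_subset / mse_full - (n - 2 * k_subset)

# A subset with 12 predictors has p = 13 coefficients; a Cp near that value
# (as for Subset 12 here) suggests the subset is not missing important predictors.
```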
