Business Analytics - Prediction Model
Evaluate Model 0:
Question 1. Using the information in the training set, what is a naive prediction (without building any prediction model)
for average dollars spent per week by a panelist? Why?
The naïve prediction is the benchmark that sets the baseline against which any prediction model is judged. Instead of building a model, we simply use the mean (or median) of the output variable in the training set as the prediction for every record, since that single value is the best representation of the output without any predictors. In my case the training-set mean comes out to be $3.54 (refer to Model_0 in the Excel sheet).
Question 2. What is the RMSE on the validation set using naïve prediction?
Using the naïve prediction, the RMSE on the validation set comes out to be $1.85 (refer to Model_0 in the Excel sheet).
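As an illustration of the two steps above, the sketch below computes the naïve benchmark and its validation RMSE in Python. The file and column names (training_set.csv, validation_set.csv, AVG_DOLLARS_SPENT) are assumptions for the example only; the actual figures come from the Model_0 worksheet.

```python
import numpy as np
import pandas as pd

# Hypothetical file/column names; the real calculation sits in the Model_0 worksheet.
train = pd.read_csv("training_set.csv")
valid = pd.read_csv("validation_set.csv")

# Naive benchmark: the training-set mean of the output (~$3.54 here).
naive_pred = train["AVG_DOLLARS_SPENT"].mean()

# Every validation record gets the same constant prediction.
residuals = valid["AVG_DOLLARS_SPENT"] - naive_pred
rmse = np.sqrt(np.mean(residuals ** 2))            # ~$1.85 here
print(f"Naive prediction: ${naive_pred:.2f}, validation RMSE: ${rmse:.2f}")
```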
Evaluate Model 1:
Predictor Variables:
Question 3. Please copy the Prediction Summary for the training and the validation sets into your report. How does the
performance compare across the training and validation sets? Which error metrics can be compared across the two
summary reports? Is there an indication of over-fitting? How does this model (Model 1) compare to the Naïve model
(Model 0) as per the RMSE of this model and that of the Naïve Model?
Comparing the two summary tables, model performance deteriorates when moving from the training set to the validation set: every error metric except SSE increases. In particular, RMSE rises from $1.65 to $1.81 and R2 falls from 0.12 to 0.03, which can be an indication of over-fitting.
All metrics other than SSE can be compared across the two reports. SSE depends on the number of records, and the training and validation sets are split 60:40, so the SSE values are not directly comparable. The other metrics involve an average over the number of records at some point in their calculation, which normalizes for sample size.
Model 1 looks slightly better than the naïve model, since its RMSE on both the training set ($1.65) and the validation set ($1.81) is lower than the naïve model's RMSE ($1.85). However, because R2 also drops sharply on the validation set, we can only say that Model_1 is slightly better than Model_0.
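A small sketch of how these comparable metrics could be recomputed outside the Excel workbook is shown below; the variable names (y_train, pred_train, etc.) are placeholders for the actual Model_1 outputs, not values taken from the report.

```python
import numpy as np

def error_report(actual, predicted):
    """Fit metrics; only the averaged/scale-free ones compare across sample sizes."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    resid = actual - predicted
    sse = np.sum(resid ** 2)                    # NOT comparable: grows with the number of records
    rmse = np.sqrt(np.mean(resid ** 2))         # comparable: averaged over records
    mad = np.mean(np.abs(resid))                # comparable: averaged over records
    r2 = 1 - sse / np.sum((actual - actual.mean()) ** 2)   # comparable: scale-free
    return {"SSE": sse, "RMSE": rmse, "MAD": mad, "R2": r2}

# With the Model_1 predictions loaded as y_train/pred_train and y_valid/pred_valid,
# a large gap between the two RMSE (or R2) values signals over-fitting:
# print(error_report(y_train, pred_train))
# print(error_report(y_valid, pred_valid))
```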
Question 4. Create a histogram for the training set residuals as obtained in Model 1. What does this chart tell us about
potential prediction errors---skewed / not skewed, the nature of positive or negative errors, etc.? How is this situation
typically handled?
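The chart itself is not reproduced here, but the histogram can be generated along the lines of the sketch below. The workbook path, sheet name, and Residual column name are assumptions; the actual residuals come from the Model_1 training worksheet.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file/sheet/column names for the Model_1 training residuals.
resid = pd.read_excel("Model_1.xlsx", sheet_name="Training")["Residual"]

plt.hist(resid, bins=30, edgecolor="black")
plt.axvline(0, color="red", linestyle="--")    # a symmetric, unskewed error distribution centres here
plt.xlabel("Training residual (actual - predicted, $)")
plt.ylabel("Count")
plt.title("Model 1: training-set residuals")
plt.show()
```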
Evaluate Model 2:
Question 5. Include the histogram of the training set residuals in Model 2 in your report. How does this histogram
compare to Model 1 histogram? Can you say which model ‘better’ fits the training data?
The training metrics for Model_2 look better than those for Model_1: the error measures are considerably smaller and R2 is slightly higher. However, these numbers cannot be compared directly, because the two models are fitted on different scales (Model_1 on the original dollar scale, Model_2 on the logarithmic scale), so the residuals are in different units. The Model_2 residual histogram is also much less skewed than the Model_1 histogram (the skewness is essentially eliminated, as noted under Question 8), so on its own scale Model_2 appears to fit the training data better.
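One way to compare the two residual distributions despite the different scales is to look at a unit-free statistic such as skewness; a rough sketch follows, with hypothetical workbook, sheet, and column names standing in for the actual worksheets.

```python
import pandas as pd
from scipy.stats import skew

# Hypothetical file/sheet/column names for the two training-set residual columns.
resid_m1 = pd.read_excel("Model_1.xlsx", sheet_name="Training")["Residual"]   # dollar scale
resid_m2 = pd.read_excel("Model_2.xlsx", sheet_name="Training")["Residual"]   # log scale

# Skewness is dimensionless, so it can be compared even though the residuals
# are measured in different units (dollars vs log-dollars).
print("Model 1 residual skewness:", skew(resid_m1))
print("Model 2 residual skewness:", skew(resid_m2))
```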
Question 7. Based on your explanation above, manually compute the RMSE of the training and the validation sets
corresponding to Model 2 so that this RMSE can be compared to the RMSE obtained in Model 1.
After taking the antilog (the inverse of the log transformation) of each individual prediction in the training and validation sets, the residuals and RMSE can be recomputed on the original dollar scale:
Refer to the Model_2 training worksheet: RMSE_Model_2 (training set): $1.69
Refer to the Model_2 validation worksheet: RMSE_Model_2 (validation set): $1.84
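A sketch of this back-transformation is given below. It assumes the Model_2 predictions are stored on the natural-log scale in a column named Predicted_Log_Spend, with the actual output in AVG_DOLLARS_SPENT; both column names (and the use of natural rather than base-10 logs) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def rmse_on_dollar_scale(path, sheet):
    """Back-transform Model_2's log-scale predictions and compute RMSE in dollars."""
    df = pd.read_excel(path, sheet_name=sheet)
    pred_dollars = np.exp(df["Predicted_Log_Spend"])      # antilog: back to the original scale
    resid = df["AVG_DOLLARS_SPENT"] - pred_dollars        # residual in dollars, comparable to Model 1
    return np.sqrt(np.mean(resid ** 2))

# print(rmse_on_dollar_scale("Model_2.xlsx", "Training"))    # ~1.69 in this data
# print(rmse_on_dollar_scale("Model_2.xlsx", "Validation"))  # ~1.84 in this data
```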
Question 8. Which model would you prefer for predicting average dollars spent if you go with RMSE? Why?
Going by RMSE alone, I would prefer Model_1, since its RMSE is lower than Model_2's on both the training set ($1.65 vs $1.69) and the validation set ($1.81 vs $1.84). On the other hand, the skewness of the residuals is essentially eliminated in Model_2, so if residual behaviour is taken into account as well, Model_2 would be the preferred model for prediction.
Model_1 (LinReg_Output):
Question 9. If we only use two predictors in addition to the intercept, which pair will you choose? Why?
Based on the LinReg_FS (feature selection) output, I would choose the following pair of predictors in addition to the intercept:
• HH_AGE
• CHILDREN_GROUP_CODE
These are the first two variables selected after the intercept, i.e. the strongest predictors according to the feature selection report.
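For reference, a comparable forward-selection step can be sketched in Python as below; the data file and target column name are assumptions, while HH_AGE and CHILDREN_GROUP_CODE are the variables named in the report.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

# Hypothetical file/column names for the training data.
train = pd.read_csv("training_set.csv")
X = train.drop(columns=["AVG_DOLLARS_SPENT"])
y = train["AVG_DOLLARS_SPENT"]

# Forward selection of exactly two predictors; the intercept is fitted automatically.
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=2,
                                     direction="forward")
selector.fit(X, y)
print(X.columns[selector.get_support()])   # expected here: HH_AGE, CHILDREN_GROUP_CODE
```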
Question 10. Which subset of predictors seems to be most promising in terms of predictive power? Why?
Subset 12 (with 12 predictor variables) seems most promising in terms of predictive power. The rule of thumb for best-subset selection is to prefer a subset whose Mallows' Cp is close to the number of coefficients in the model; among the candidate subsets, subset 12 has the Cp value (~15) closest to its own number of predictors, indicating a good trade-off between fit and model size.
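To make the rule of thumb concrete, a minimal sketch of the Cp calculation is given below; the inputs (SSE of the subset, MSE of the full model, sample size) would come from the subset-selection output, and no values from the workbook are reproduced here.

```python
def mallows_cp(sse_subset, mse_full, n, num_params):
    """Mallows' Cp = SSE_subset / MSE_full - (n - 2 * num_params).

    num_params counts the intercept plus the predictors in the subset.
    A subset whose Cp is close to num_params carries little bias from
    the variables it leaves out, so such subsets are preferred.
    """
    return sse_subset / mse_full - (n - 2 * num_params)

# Usage (with SSE/MSE values read from the subset-selection report):
# cp_12 = mallows_cp(sse_subset=..., mse_full=..., n=..., num_params=13)
```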