ISYE6501 Homework 5
2024-02-13
rm(list=ls())
setwd("~/Georgia Tech - OMSA/ISYE 6501")
if(!require(pacman)) install.packages("pacman")
library(pacman)
p_load(tinytex, tidyverse, caret, ggplot2, datasets)
Question 8.1
As a webmaster overseeing an e-commerce platform, I often rely on data-driven insights to forecast sales
performance. One situation where a linear regression model proves invaluable is in predicting monthly
revenue. In this scenario, I typically consider several predictors (a brief sketch of such a model follows the list):
1. Marketing Spend: I analyze the impact of our marketing investments, like digital ads and social media
campaigns, on sales. Tracking how different spending levels correlate with revenue helps optimize our
marketing budget.
2. Website Traffic: Monitoring the number of visitors to our site provides a clear indication of potential
sales. Understanding traffic patterns helps anticipate demand and tailor promotional efforts accord-
ingly.
3. Seasonal Trends: Recognizing how seasonal variations affect sales allows me to adjust strategies accord-
ingly. Holidays and special events often drive fluctuations in consumer behavior, influencing purchasing
decisions.
4. Product Pricing: By examining how changes in product pricing impact sales, I can refine pricing
strategies to maximize revenue. Finding the balance between competitiveness and profitability is key.
5. Customer Feedback: Incorporating customer reviews and ratings into the model gives insight into
product perception. Understanding how sentiment influences purchasing decisions helps refine product
offerings and marketing approaches.
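Such a model could be sketched in R as follows; the monthly_sales data frame and its column names are hypothetical placeholders for the predictors listed above, not data from this report.
# hypothetical monthly data: one row per month with the predictors described above
revenue_model <- lm(revenue ~ marketing_spend + website_traffic + season + avg_price + avg_rating,
                    data = monthly_sales)
summary(revenue_model)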
Question 8.2
# Store the data point as test
test <- data.frame(M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5, LF = 0.640, M.F = 94.0, Pop = 150,
                   NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.04, Time = 39.0)
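The crime data frame itself has to be read in before any model is fit; the loading step is not shown in the original, and the file name below is assumed from the course materials.
# read the US crime data set (file name assumed)
crime <- read.table("uscrime.txt", header = TRUE, stringsAsFactors = FALSE)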
Objective: We need to identify the significant predictor variables that impact the outcome
of our model when predicting the crime rate for the provided test data point. Our hypotheses are:
• Ho : the selected variable does not impact the outcome (the crime rate)
• Ha : the selected variable does have some impact on predicting the outcome
Methods: First, we will create a naive linear regression model with lm() using all the available
predictors and observe the result. Then we will apply recursive feature elimination rfe()
with cross-validation to drop the less important predictors and produce an optimal model.
set.seed(1)
# create a naive model and print result
naive_model <- lm(Crime~., data=crime)
summary(naive_model)
##
## Call:
## lm(formula = Crime ~ ., data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -395.74 -98.09 -6.69 112.99 512.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.984e+03 1.628e+03 -3.675 0.000893 ***
## M 8.783e+01 4.171e+01 2.106 0.043443 *
## So -3.803e+00 1.488e+02 -0.026 0.979765
## Ed 1.883e+02 6.209e+01 3.033 0.004861 **
## Po1 1.928e+02 1.061e+02 1.817 0.078892 .
## Po2 -1.094e+02 1.175e+02 -0.931 0.358830
## LF -6.638e+02 1.470e+03 -0.452 0.654654
## M.F 1.741e+01 2.035e+01 0.855 0.398995
## Pop -7.330e-01 1.290e+00 -0.568 0.573845
## NW 4.204e+00 6.481e+00 0.649 0.521279
## U1 -5.827e+03 4.210e+03 -1.384 0.176238
## U2 1.678e+02 8.234e+01 2.038 0.050161 .
## Wealth 9.617e-02 1.037e-01 0.928 0.360754
## Ineq 7.067e+01 2.272e+01 3.111 0.003983 **
## Prob -4.855e+03 2.272e+03 -2.137 0.040627 *
## Time -3.479e+00 7.165e+00 -0.486 0.630708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
## F-statistic: 8.429 on 15 and 31 DF, p-value: 3.539e-07
# let's predict with the naive model and compare the result with the Crime rate range
predict(naive_model, test)
## 1
## 155.4349
range(crime$Crime)
## [1]  342 1993
First Observation: The predicted crime rate (155.4) falls well outside the observed Crime
rate range (342 - 1993). One possibility is that the naive model is overfitting by using all
of the predictors.
Taking a closer look at the Pr(>|t|) column of the Coefficients table and comparing the values to
the significance codes, we conclude that every predictor with a p-value of 0.05 or above should be
excluded from our final model, since it does not significantly impact the outcome. Keep in mind
that a predictor's p-value must be lower than alpha (0.05) in order for us to reject the null
hypothesis.
With these insignificant predictors excluded, we end up with 4 strong predictors: M (0.043),
Ed (0.0048), Ineq (0.0039), and Prob (0.040), so we are ready to build, hopefully, a better model.
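As a quick sanity check, the same four predictors can be pulled out programmatically; the short helper below is not part of the original analysis and simply filters the naive model's coefficient table at alpha = 0.05.
# extract the coefficient table from the naive model
coefs <- summary(naive_model)$coefficients
# keep predictors whose p-values fall below alpha = 0.05, dropping the intercept
setdiff(rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05], "(Intercept)")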
set.seed(1)
better_model <- lm(Crime ~ M + Ed + Ineq + Prob, data = crime, x = TRUE, y = TRUE)
summary(better_model)
##
## Call:
## lm(formula = Crime ~ M + Ed + Ineq + Prob, data = crime, x = TRUE,
## y = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -532.97 -254.03 -55.72 137.80 960.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1339.35 1247.01 -1.074 0.28893
## M 35.97 53.39 0.674 0.50417
## Ed 148.61 71.92 2.066 0.04499 *
## Ineq 26.87 22.77 1.180 0.24458
## Prob -7331.92 2560.27 -2.864 0.00651 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 347.5 on 42 degrees of freedom
## Multiple R-squared: 0.2629, Adjusted R-squared: 0.1927
## F-statistic: 3.745 on 4 and 42 DF, p-value: 0.01077
# predict the crime rate for the test point with the reduced model
predict(better_model, test)
## 1
## 897.2307
Second observation: There is an improvement this time: the predicted crime rate of 897.2307
falls within the observed range, although the Adjusted R-squared value dropped from 0.7078
(naive model) to 0.1927. However, we don't think we have built the best model yet, since the
excluded predictors might still have a positive effect when combined with the included ones.
For this, we want to go the extra mile and apply recursive feature elimination rfe() to train
our model and identify the best predictors using cross-validation.
set.seed(1)
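# The call that builds rfe_lm is not shown in the original output; the lines below are a
# reconstruction, assuming caret's lmFuncs with 10-fold cross-validation repeated 25 times
# to match the resampling summary printed further down.
ctrl <- rfeControl(functions = lmFuncs, method = "repeatedcv", number = 10, repeats = 25)
rfe_lm <- rfe(x = crime[, names(crime) != "Crime"], y = crime$Crime,
              sizes = 1:15, rfeControl = ctrl)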
rfe_lm
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 25 times)
##
## Resampling performance over subset size:
##
## Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
## 1 360.8 0.3296 291.5 133.31 0.2986 100.85
## 2 345.8 0.3610 277.7 134.61 0.2975 110.02
## 3 352.0 0.3339 285.5 132.77 0.2993 109.49
## 4 333.4 0.4115 277.6 99.41 0.3142 87.97
## 5 316.4 0.4805 260.8 96.84 0.3183 86.07
## 6 306.9 0.4952 253.8 96.65 0.3224 88.09
## 7 308.9 0.5038 256.0 93.59 0.3101 84.88
## 8 270.1 0.5720 224.2 85.00 0.3001 74.44
## 9 235.8 0.6502 190.8 94.79 0.2851 78.07
## 10 230.9 0.6636 186.9 85.05 0.2645 68.75 *
## 11 236.8 0.6374 191.7 90.27 0.2725 70.84
## 12 249.0 0.6112 199.9 90.39 0.2848 72.46
## 13 252.8 0.6063 203.9 92.74 0.2881 75.30
## 14 261.2 0.5911 210.2 97.12 0.2918 80.58
## 15 265.2 0.5841 215.0 94.74 0.2874 81.74
##
## The top 5 variables (out of 10):
## U1, Prob, LF, Po1, Ed
Third observation: Comparing RMSE and R-squared across the subset sizes, the best-fitting
model is the one built with the 10 strongest predictors (10 variables, RMSE: 230.9,
Rsquared: 0.6636). Now let's list our 10 best predictors and predict the outcome with the
best fitted model.
## [1] "U1" "Prob" "LF" "Po1" "Ed" "U2" "Po2" "M" "Ineq" "So"
best_model <- lm(Crime ~ U1 + Prob + LF + Po1 + Ed + U2 + Po2 + M + Ineq + So, data = crime, x = TRUE, y = TRUE)
predict(best_model, test)
## 1
## 870.6834