
ISYE6501-Homework-5

2024-02-13

Load required packages using pacman

rm(list=ls())
setwd("~/Georgia Tech - OMSA/ISYE 6501")
if(!require(pacman)) install.packages("pacman")

## Loading required package: pacman

library(pacman)
p_load(tinytex, tidyverse, caret, ggplot2, datasets)

Question 8.1

As a webmaster overseeing an e-commerce platform, I often rely on data-driven insights to forecast sales
performance. One situation where a linear regression model proves invaluable is in predicting monthly
revenue. In this scenario, I typically consider several predictors (a hypothetical sketch of such a model
follows the list):

1. Marketing Spend: I analyze the impact of our marketing investments, like digital ads and social media
campaigns, on sales. Tracking how different spending levels correlate with revenue helps optimize our
marketing budget.
2. Website Traffic: Monitoring the number of visitors to our site provides a clear indication of potential
sales. Understanding traffic patterns helps anticipate demand and tailor promotional efforts accordingly.
3. Seasonal Trends: Recognizing how seasonal variations affect sales allows me to adjust strategies
accordingly. Holidays and special events often drive fluctuations in consumer behavior, influencing
purchasing decisions.
4. Product Pricing: By examining how changes in product pricing impact sales, I can refine pricing
strategies to maximize revenue. Finding the balance between competitiveness and profitability is key.
5. Customer Feedback: Incorporating customer reviews and ratings into the model gives insight into
product perception. Understanding how sentiment influences purchasing decisions helps refine product
offerings and marketing approaches.
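
As a rough illustration (a minimal sketch; the data frame monthly_sales and all of its column names are hypothetical, not from a real dataset), such a model could be fit in R with lm():

# hypothetical sketch: monthly revenue as a linear function of the five
# predictors above; 'monthly_sales' and its columns are illustrative names
revenue_model <- lm(revenue ~ marketing_spend + web_traffic + season +
                      avg_price + avg_rating, data = monthly_sales)
summary(revenue_model) # coefficients estimate each predictor's effect on revenue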

Question 8.2

Using crime data from http://www.statsci.org/data/general/uscrime.txt, use regression (a useful
R function is lm or glm) to predict the observed crime rate in a city with the following data:
M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5, LF = 0.640, M.F = 94.0, Pop = 150, NW
= 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.04, Time = 39.0
Show your model (factors used and their coefficients), the software output, and the quality of fit.

# Store the data point as test
test <- data.frame(M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5, LF = 0.640, M.F = 94.0, Pop = 150,
                   NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.04, Time = 39.0)

# Load the US crime data
file_path <- "~/Georgia Tech - OMSA/ISYE 6501/hw5/data 8.2/uscrime.txt"
crime <- read.table(file_path, stringsAsFactors = FALSE, header=TRUE)
head(crime)

## M So Ed Po1 Po2 LF M.F Pop NW U1 U2 Wealth Ineq Prob


## 1 15.1 1 9.1 5.8 5.6 0.510 95.0 33 30.1 0.108 4.1 3940 26.1 0.084602
## 2 14.3 0 11.3 10.3 9.5 0.583 101.2 13 10.2 0.096 3.6 5570 19.4 0.029599
## 3 14.2 1 8.9 4.5 4.4 0.533 96.9 18 21.9 0.094 3.3 3180 25.0 0.083401
## 4 13.6 0 12.1 14.9 14.1 0.577 99.4 157 8.0 0.102 3.9 6730 16.7 0.015801
## 5 14.1 0 12.1 10.9 10.1 0.591 98.5 18 3.0 0.091 2.0 5780 17.4 0.041399
## 6 12.1 0 11.0 11.8 11.5 0.547 96.4 25 4.4 0.084 2.9 6890 12.6 0.034201
## Time Crime
## 1 26.2011 791
## 2 25.2999 1635
## 3 24.3006 578
## 4 29.9012 1969
## 5 21.2998 1234
## 6 20.9995 682

Objective: We need to find the significant predictor variables that impact the outcome
of our model in predicting the crime rate of the provided test data point. Our hypotheses are:

• Ho : the selected variable does not impact the outcome (the crime rate)
• Ha : the selected variable does have some impact on predicting the outcome

Methods: First, we will create a naive linear regression model with lm() using all the available
predictors and observe the result. Then we will apply recursive feature elimination rfe()
as a cross-validation method to eliminate the less important predictors and produce an optimal
model.

set.seed(1)
# create a naive model and print result
naive_model <- lm(Crime~., data=crime)
summary(naive_model)

##
## Call:
## lm(formula = Crime ~ ., data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -395.74 -98.09 -6.69 112.99 512.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.984e+03 1.628e+03 -3.675 0.000893 ***
## M 8.783e+01 4.171e+01 2.106 0.043443 *

## So -3.803e+00 1.488e+02 -0.026 0.979765
## Ed 1.883e+02 6.209e+01 3.033 0.004861 **
## Po1 1.928e+02 1.061e+02 1.817 0.078892 .
## Po2 -1.094e+02 1.175e+02 -0.931 0.358830
## LF -6.638e+02 1.470e+03 -0.452 0.654654
## M.F 1.741e+01 2.035e+01 0.855 0.398995
## Pop -7.330e-01 1.290e+00 -0.568 0.573845
## NW 4.204e+00 6.481e+00 0.649 0.521279
## U1 -5.827e+03 4.210e+03 -1.384 0.176238
## U2 1.678e+02 8.234e+01 2.038 0.050161 .
## Wealth 9.617e-02 1.037e-01 0.928 0.360754
## Ineq 7.067e+01 2.272e+01 3.111 0.003983 **
## Prob -4.855e+03 2.272e+03 -2.137 0.040627 *
## Time -3.479e+00 7.165e+00 -0.486 0.630708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
## F-statistic: 8.429 on 15 and 31 DF, p-value: 3.539e-07

# let's predict with the naive model and compare against the Crime rate range
predict(naive_model, test)

## 1
## 155.4349

range(crime$Crime)

## [1] 342 1993

First Observation: We can see that the predicted crime rate (155.4) is clearly outside the
Crime rate range (342 - 1993); one possibility is that the naive model is overfitting with all
the predictors.
Taking a closer look at the Pr(>|t|) column of the Coefficients table and comparing the values
to the significance codes, we conclude that all predictors with p-values of 0.05 and above
should be excluded from our final model, since they do not significantly impact the outcome.
Keep in mind that a predictor must have a p-value lower than alpha (0.05) for us to reject
the null hypothesis.
With these insignificant predictors excluded, we end up with 4 strong predictors: M (0.043),
Ed (0.0048), Ineq (0.0039), and Prob (0.040). We are now ready to build a hopefully better
model.
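
As a quick sanity check (a minimal sketch, not part of the assignment output), the same four predictors can be pulled programmatically from the fitted model's coefficient table:

# extract the predictors with p-values below alpha = 0.05; the column name
# "Pr(>|t|)" matches the summary() coefficient matrix shown above
coefs <- summary(naive_model)$coefficients
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05 & rownames(coefs) != "(Intercept)"]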

set.seed(1)
better_model <- lm(Crime ~ M + Ed + Ineq + Prob, data = crime, x = TRUE, y = TRUE)
summary(better_model)

##
## Call:
## lm(formula = Crime ~ M + Ed + Ineq + Prob, data = crime, x = TRUE,
## y = TRUE)

##
## Residuals:
## Min 1Q Median 3Q Max
## -532.97 -254.03 -55.72 137.80 960.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1339.35 1247.01 -1.074 0.28893
## M 35.97 53.39 0.674 0.50417
## Ed 148.61 71.92 2.066 0.04499 *
## Ineq 26.87 22.77 1.180 0.24458
## Prob -7331.92 2560.27 -2.864 0.00651 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 347.5 on 42 degrees of freedom
## Multiple R-squared: 0.2629, Adjusted R-squared: 0.1927
## F-statistic: 3.745 on 4 and 42 DF, p-value: 0.01077

# predict the outcome with the test data point
predict(better_model, test)

## 1
## 897.2307

Second observation: There is an improvement this time: the predicted crime rate of 897.2307
now falls within the observed range. However, the Adjusted R-squared value dropped from
0.7078 (naive model) to 0.1927, so we don't think we have built the best model yet; the other
excluded predictors might still have a positive effect when combined with the included
predictors. For this, we want to go the extra mile and apply recursive feature elimination
rfe() to train our model and identify the best predictors using cross-validation.
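
Before running rfe(), one way to gauge the reduced model's out-of-sample performance is caret's train() with the same repeated 10-fold cross-validation used below; this is a sketch under that assumption, not part of the original output:

# estimate cross-validated RMSE and R-squared for the 4-predictor model
set.seed(1)
cv_fit <- train(Crime ~ M + Ed + Ineq + Prob, data = crime, method = "lm",
                trControl = trainControl(method = "repeatedcv",
                                         number = 10, repeats = 25))
cv_fit$results # RMSE and Rsquared averaged over the resamples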

set.seed(1)

# set parameters for the rfe() method
params <- rfeControl(functions = lmFuncs,
                     method = "repeatedcv",
                     number = 10,
                     repeats = 25,
                     verbose = FALSE)

# run the rfe() method
rfe_lm <- rfe(crime[,-16], crime[[16]],
              sizes = c(1:15),
              rfeControl = params)

rfe_lm

##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 25 times)
##

## Resampling performance over subset size:
##
## Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
## 1 360.8 0.3296 291.5 133.31 0.2986 100.85
## 2 345.8 0.3610 277.7 134.61 0.2975 110.02
## 3 352.0 0.3339 285.5 132.77 0.2993 109.49
## 4 333.4 0.4115 277.6 99.41 0.3142 87.97
## 5 316.4 0.4805 260.8 96.84 0.3183 86.07
## 6 306.9 0.4952 253.8 96.65 0.3224 88.09
## 7 308.9 0.5038 256.0 93.59 0.3101 84.88
## 8 270.1 0.5720 224.2 85.00 0.3001 74.44
## 9 235.8 0.6502 190.8 94.79 0.2851 78.07
## 10 230.9 0.6636 186.9 85.05 0.2645 68.75 *
## 11 236.8 0.6374 191.7 90.27 0.2725 70.84
## 12 249.0 0.6112 199.9 90.39 0.2848 72.46
## 13 252.8 0.6063 203.9 92.74 0.2881 75.30
## 14 261.2 0.5911 210.2 97.12 0.2918 80.58
## 15 265.2 0.5841 215.0 94.74 0.2874 81.74
##
## The top 5 variables (out of 10):
## U1, Prob, LF, Po1, Ed

Third observation: Comparing subset sizes by lowest RMSE and highest R-squared, we can
tell that our best fitted model is built with the 10 strongest predictors (10 variables, RMSE:
230.9, Rsquared: 0.6636). Now let's list our 10 best predictors and predict the outcome with
our best fitted model.

# list the top ten predictors
set.seed(1)
predictors(rfe_lm)

## [1] "U1" "Prob" "LF" "Po1" "Ed" "U2" "Po2" "M" "Ineq" "So"

best_model <- lm(Crime ~ U1 + Prob + LF + Po1 + Ed + U2 + Po2 + M + Ineq + So, data = crime, x = TRUE, y = TRUE)
predict(best_model, test)

## 1
## 870.6834

Conclusion: by running cross-validation (repeatedcv) combined with recursive feature
elimination (rfe), we are able to find the ten best predictors to build our best model, with
an acceptable predicted crime value of 870.6834. Our final best model is lm(Crime ~ U1 + Prob
+ LF + Po1 + Ed + U2 + Po2 + M + Ineq + So, data = crime, x = TRUE, y = TRUE).
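
For completeness, the factors, coefficients, and quality of fit of this final model can be reported with summary(); the output is omitted here:

# report the final model's coefficients, residual standard error, and adjusted R-squared
summary(best_model)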
