Customer Analytics Homework 1
By Group 11: Boying Li, Mingyue Dai
and Yuantong Zhou
Estimation Preliminaries
Converting categorical variables to factors ensures that models treat them correctly.
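A minimal sketch of this step, assuming categorical columns named sex, education, marriage, and pay_1 (only pay_1 appears elsewhere in this write-up; the other column names are assumptions):
# Treat categorical columns as discrete levels rather than numbers.
# Column names other than pay_1 are assumptions about this dataset.
cat_vars <- c("sex", "education", "marriage", "pay_1")
df[cat_vars] <- lapply(df[cat_vars], factor)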
Descriptive Statistics
barplot(table(df$y))
hist(df$age)
barplot(table(df$pay_1))
barplot(df$bill_amt1)
Question 1: Generate a random training/validation index that
implements a 70/30 split. Use a random seed of your choice.
First, we set a random seed (set.seed(365)) to ensure reproducibility. Then, we used the sample()
function to randomly assign each observation to either the training set or the validation set, with a
probability of 70% for training and 30% for validation (prob = c(0.7, 0.3)). This splits the data into the
two groups randomly but reproducibly. After running the code, table(idx) confirmed the distribution of
samples, where idx == 1 marks the training set and idx == 2 marks the validation set.
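A minimal sketch of the split described above (the exact sample() call is an assumption, since the original code is not shown):
set.seed(365)                                    # reproducibility
idx <- sample(1:2, size = nrow(df), replace = TRUE,
              prob = c(0.7, 0.3))                # 1 = training, 2 = validation
table(idx)                                       # check the 70/30 split
train <- df[idx == 1, ]
valid <- df[idx == 2, ]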
Question 2:
Estimate two logistic specifications that allow you to generate out-of-sample predictions of y. Take the following points into account:
• You choose the variables X that enter each model specification. These variables X can be continuous or categorical. Make sure continuous and categorical variables are entered appropriately into the models.
• Specify model 1 as the simplest of the two. This model must include at least 5 explanatory variables.
• Specify model 2 as the richer/more flexible of the two. Control flexibility through the set of X variables used. Include at least one variable interaction. [An interaction of two variables, x1 and x2, would be x3 = x1*x2.]
Model 1
• Model 1 is a simple logistic regression model built to predict whether a customer will default. It
excludes non-predictive variables such as id, along with features such as bill_amt5, bill_amt6, pay_amt5,
and pay_amt6, to streamline the model (see the sketch after these bullets).
• The ROC curve has an AUC of 0.770, indicating reasonably good discrimination between defaulters
and non-defaulters.
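A minimal sketch of Model 1, assuming the exclusions listed above and a glm() fit on the training set from Question 1 (the exact formula is an assumption, since the original code is not shown):
m1 <- glm(y ~ . - id - bill_amt5 - bill_amt6 - pay_amt5 - pay_amt6,
          data = train, family = binomial)   # logistic regression on the training set
summary(m1)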
Model 2
• Model 2 is a richer logistic regression model that includes all features except id, plus an interaction
between bill_amt1 and pay_amt1 (see the sketch after these bullets).
• The ROC curve has an AUC of 0.776, indicating slightly better discrimination between defaulters
and non-defaulters.
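A minimal sketch of Model 2 under the same assumptions, adding the bill_amt1 * pay_amt1 interaction described above:
m2 <- glm(y ~ . - id + bill_amt1:pay_amt1,
          data = train, family = binomial)   # main effects plus the interaction term
summary(m2)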
Model 1 and Model 2 Comparison
• Compared to Model 1, Model 2 performs slightly better overall, with a higher AUC of 0.776 compared to 0.770,
indicating improved ability to distinguish between defaulters and non-defaulters. Both models achieve
similar accuracy, around 82%, and maintain high specificity (Model 2: 95.5%, Model 1: 95.6%), making them
highly reliable at identifying non-defaulters. The inclusion of the interaction term in Model 2 adds flexibility,
capturing relationships that Model 1 may miss. Model 2 provides a slight edge in overall classification
performance and adaptability.
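A sketch of how the out-of-sample comparison above can be computed, using the objects from the sketches earlier; the pROC package and the 0.5 cutoff are assumptions:
library(pROC)

# Out-of-sample predicted default probabilities
p1 <- predict(m1, newdata = valid, type = "response")
p2 <- predict(m2, newdata = valid, type = "response")

# AUC for each model (reported above as ~0.770 and ~0.776)
auc(roc(valid$y, p1))
auc(roc(valid$y, p2))

# Confusion table for Model 2 at a 0.5 cutoff; accuracy and specificity follow from it
tab2 <- table(actual = valid$y, predicted = as.numeric(p2 > 0.5))
sum(diag(tab2)) / sum(tab2)          # accuracy
tab2["0", "0"] / sum(tab2["0", ])    # specificity: non-defaulters correctly identified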
Question 3: Do any of your models exhibit signs of
overfitting? Explain.
Accuracy        Model 1    Model 2
In-Sample       0.8219     0.8204
Out-of-Sample   0.8202     0.8172
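A sketch of how the in-sample and out-of-sample accuracies in the table can be obtained, again assuming the objects from the earlier sketches and a 0.5 cutoff:
acc <- function(model, data) {
  p <- predict(model, newdata = data, type = "response")
  mean(as.numeric(p > 0.5) == data$y)    # share of correct predictions
}
acc(m1, train); acc(m1, valid)   # Model 1: in-sample vs out-of-sample
acc(m2, train); acc(m2, valid)   # Model 2: in-sample vs out-of-sample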
Neither Model 1 nor Model 2 shows clear signs of overfitting, as their performance on the training
and validation sets is very similar. Both models maintain consistent results across the two datasets,
including high specificity and a solid AUC, suggesting they generalize well. While Model 2 is slightly
more complex due to the interaction term, this does not lead to any noticeable overfitting. Overall,
both models perform reliably without overfitting.
Question 4: Provide a discussion of which of the two models you would
prefer for the purpose of identifying consumers who will default in the
future. If needed, make assumptions.
Between the two models, I would prefer Model 2 for identifying consumers who will default. Both models
achieve similar accuracy (~82%) and high specificity, but Model 2 has a slight edge with a higher AUC
(0.776 vs. 0.770), meaning it is better at distinguishing defaulters from non-defaulters overall. The
inclusion of the interaction term (bill_amt1 * pay_amt1) makes Model 2 more flexible, allowing it to
capture relationships that Model 1 might miss. While neither model is perfect at identifying all defaulters,
Model 2's improved classification ability and adaptability make it the better choice for future predictions.
Additionally, in this setting false negatives, i.e., predicting non-default for customers who will actually
default, are especially costly to the company, so Model 2's slightly better discrimination should also translate
into better recall of defaulters.