
Customer Analytics Homework 1

The document presents a customer analytics homework focused on logistic regression models to predict customer defaults. Two models are developed, with Model 2 showing slightly better performance due to its inclusion of an interaction term, achieving an AUC of 0.776 compared to Model 1's 0.770. Both models demonstrate good accuracy and specificity, but Model 2 is preferred for its flexibility and improved classification ability.


Customer Analytics Homework 1
By Group 11: Boying Li, Mingyue Dai and Yuantong Zhou
Estimation Preliminaries

Rename the target column to y.

Convert categorical variables to factors so the models treat them correctly (as dummy variables rather than as numeric codes).
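These preliminaries might look like the following in R. The data frame, the original target column name, and the set of categorical columns are assumptions standing in for the homework's actual data:

```r
# Toy data frame standing in for the homework data (column names are assumptions)
df <- data.frame(default.payment.next.month = c(0, 1, 0),
                 sex = c(1, 2, 2),
                 age = c(25, 40, 33))

# Rename the target column to y
names(df)[names(df) == "default.payment.next.month"] <- "y"

# Convert categorical variables to factors so glm() builds dummy variables
df$y   <- as.factor(df$y)
df$sex <- as.factor(df$sex)
```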
Descriptive Statistics

• barplot(table(df$y))
• hist(df$age)
• barplot(table(df$pay_1))
• barplot(df$bill_amt1)
Question 1: Generate a random training/validation index that
implements a 70/30 split. Use a random seed of your choice.

First, we set a random seed (set.seed(365)) to ensure reproducibility. Then, we used the sample()
function to randomly assign observations to either the training set or the validation set, with a
probability of 70% for training and 30% for validation (prob = c(0.7, 0.3)). This approach ensures the data
is split consistently and randomly into the two groups. After running the code, the table(idx) function
confirmed the distribution of samples, where idx == 1 represents the training set and idx == 2 represents
the validation set.
Question 2:
Estimate two logistic specifications that allow you to generate out-of-sample predictions of y. Take the following points into account:
You choose the variables X that enter each model specification. These variables X can be continuous or categorical. Make sure continuous and
categorical variables are entered appropriately into the models.
Specify model 1 as the simplest of the two. This model must include at least 5 explanatory variables.
Specify model 2 as the richer/more flexible of the two. Control flexibility through the set of X variables used. Include at least one variable interaction.
(An interaction of two variables, x1 and x2, would be x3 = x1*x2.)
Model 1
• Model 1 is a simple logistic regression model created to predict whether a customer will default. It
excludes unnecessary variables like id and features such as bill_amt5, bill_amt6, pay_amt5, and
pay_amt6 to streamline the model.

• On the validation set, the model achieved an accuracy of 82.0%, showing it predicts most outcomes correctly. It performed exceptionally well in identifying non-defaulters, with a specificity of 95.6%, but its sensitivity was relatively low at 34.6%.
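A sketch of how Model 1 could be estimated and evaluated. The simulated data, variable names, and data-generating process below are assumptions, not the homework's actual data or exact variable set:

```r
set.seed(365)
# Simulated stand-in for the credit data (names and distributions are assumptions)
n  <- 2000
df <- data.frame(limit_bal = runif(n, 1e4, 5e5),
                 age       = sample(21:70, n, replace = TRUE),
                 pay_1     = sample(-1:3, n, replace = TRUE),
                 bill_amt1 = runif(n, 0, 1e5),
                 pay_amt1  = runif(n, 0, 2e4))
df$y <- rbinom(n, 1, plogis(-2 + 0.5 * df$pay_1))

idx   <- sample(1:2, n, replace = TRUE, prob = c(0.7, 0.3))
train <- df[idx == 1, ]
valid <- df[idx == 2, ]

# Model 1: simple additive logistic regression with five explanatory variables
m1 <- glm(y ~ limit_bal + age + pay_1 + bill_amt1 + pay_amt1,
          data = train, family = binomial)

# Out-of-sample predicted probabilities and a confusion matrix at a 0.5 cutoff
p1 <- predict(m1, newdata = valid, type = "response")
table(actual = valid$y, predicted = as.integer(p1 > 0.5))
```

Accuracy, specificity, and sensitivity as reported on the slide can all be read off this confusion matrix.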
Model 1 – ROC & AUC

• The ROC curve had an AUC of 0.770, meaning the model is good at distinguishing between defaulters
and non-defaulters.
Model 2
• Model 2 is a richer logistic regression model that includes all features except id, plus an interaction between bill_amt1 and pay_amt1.
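In an R formula, the interaction can be written with `*`, which expands `x1 * x2` into `x1 + x2 + x1:x2`. A minimal sketch, using simulated stand-in data (the values are assumptions; only the variable names follow the slide):

```r
set.seed(365)
# Minimal stand-in data for illustrating the interaction term
train <- data.frame(bill_amt1 = runif(300, 0, 1e5),
                    pay_amt1  = runif(300, 0, 2e4))
train$y <- rbinom(300, 1, 0.2)

# bill_amt1 * pay_amt1 expands to bill_amt1 + pay_amt1 + bill_amt1:pay_amt1,
# so the interaction enters alongside both main effects
m2 <- glm(y ~ bill_amt1 * pay_amt1, data = train, family = binomial)
names(coef(m2))  # includes the interaction coefficient "bill_amt1:pay_amt1"
```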

• The accuracy remains relatively stable between training (82.04%) and test data (81.72%), suggesting the model generalizes well. Specificity is consistently high (~95%), indicating the model is good at identifying non-defaulters (the negative class).
Model 2 – ROC & AUC

• The ROC curve had an AUC of 0.776, meaning the model is good at distinguishing between defaulters
and non-defaulters.
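The ROC curves and AUC values for both models can be computed with the pROC package (assuming it is installed; the labels and probabilities below are simulated stand-ins for a model's validation-set predictions):

```r
library(pROC)  # assumes the pROC package is available

set.seed(365)
# Stand-in validation labels and predicted probabilities
y_valid <- rbinom(500, 1, 0.25)
p_hat   <- plogis(rnorm(500) + 2 * y_valid)  # probabilities correlated with y

roc_obj <- roc(y_valid, p_hat)  # build the ROC curve
auc(roc_obj)                    # area under the curve
plot(roc_obj)                   # draw the curve
```

In the homework, `p_hat` would be the `predict(..., type = "response")` output of each model on the validation set, yielding the reported AUCs of 0.770 and 0.776.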
Model 1 and Model 2 Comparison

• Compared to Model 1, Model 2 performs slightly better overall, with a higher AUC of 0.776 compared to 0.770,
indicating improved ability to distinguish between defaulters and non-defaulters. Both models achieve
similar accuracy, around 82%, and maintain high specificity (Model 2: 95.5%, Model 1: 95.6%), making them
highly reliable at identifying non-defaulters. The inclusion of the interaction term in Model 2 adds flexibility,
capturing relationships that Model 1 may miss. Model 2 provides a slight edge in overall classification
performance and adaptability.
Question 3: Do any of your models exhibit signs of
overfitting? Explain.
Accuracy Model 1 Model 2
In-Sample 0.8219 0.8204
Out-of-Sample 0.8202 0.8172

Neither Model 1 nor Model 2 shows clear signs of overfitting, as their performance on the training and validation sets is very similar. Both models maintain consistent results, such as high specificity and a solid AUC, across datasets, suggesting they generalize well. While Model 2 is slightly more complex due to the interaction term, this added flexibility does not produce any noticeable overfitting. Overall, both models perform reliably.
Question 4: Provide a discussion of which of the two models you would
prefer for the purpose of identifying consumers who will default in the
future. If needed, make assumptions.

Between the two models, I would prefer Model 2 for identifying consumers who will default. Both models
achieve similar accuracy (~82%) and high specificity, but Model 2 has a slight edge with a higher AUC
(0.776 vs. 0.770), meaning it is better at distinguishing defaulters from non-defaulters overall. The
inclusion of the interaction term (bill_amt1 * pay_amt1) makes Model 2 more flexible, allowing it to
capture relationships that Model 1 might miss. While neither model is perfect at identifying all defaulters,
Model 2's improved classification ability and adaptability make it the better choice for future predictions.
Additionally, false negatives (predicting non-default for customers who will actually default) are especially costly to the company, so Model 2's slightly better ability to separate defaulters from non-defaulters makes it the safer choice.
