Lab 3 - Logistic Regression: Part B
Load data
library(tidyverse)  # for as_tibble(), dplyr and ggplot2
(default <- as_tibble(ISLR::Default))
Split the data into training (60%) and testing (40%) sets so we can assess how well
the model performs on out-of-sample data.
set.seed(123)  # make the split reproducible
sample <- sample(c(TRUE, FALSE), nrow(default), replace = TRUE, prob = c(0.6, 0.4))
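The logical vector can then be used to split `default` into the two sets; a minimal sketch, assuming the names `train` and `test` used later in the lab:

```r
train <- default[sample, ]   # ~60% of rows, used to fit the models
test  <- default[!sample, ]  # ~40% of rows, held out for evaluation
```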
The glm function fits generalized linear models, a class of models that includes logistic
regression. Its syntax is similar to that of lm, except that we must pass the argument
family = binomial to tell R to run a logistic regression rather than some other type of
generalized linear model. Behind the scenes, glm uses maximum likelihood to fit the model.
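Later commands in this lab refer to three fitted models (model1, model2, model3) and to predicted probabilities on the test set, which are not defined in this part. A plausible set of fits, assuming model1 uses balance, model2 uses student, model3 uses all predictors, and the split above:

```r
model1 <- glm(default ~ balance, family = binomial, data = train)
model2 <- glm(default ~ student, family = binomial, data = train)
model3 <- glm(default ~ balance + income + student, family = binomial, data = train)

# Predicted default probabilities on the held-out test set
test.predicted.m1 <- predict(model1, newdata = test, type = "response")
test.predicted.m2 <- predict(model2, newdata = test, type = "response")
test.predicted.m3 <- predict(model3, newdata = test, type = "response")
```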
The default data, together with the fitted logistic curve from glm, can be plotted with ggplot:
default %>%
mutate(prob = ifelse(default == "Yes", 1, 0)) %>%
ggplot(aes(balance, prob)) +
geom_point(alpha = .15) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
ggtitle("Logistic regression model fit") +
xlab("Balance") +
ylab("Probability of Default")
The deviance is analogous to the sum of squares in linear regression: it measures the
lack of fit of a logistic regression model to the data. A lower model deviance (reported
as Residual deviance in the summary output) indicates a better fit.
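The null and residual deviance are printed near the bottom of the model summary; for example, assuming model1 has been fit as above:

```r
summary(model1)
# The output ends with the Null deviance, the Residual deviance,
# and the AIC of the fitted model.
```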
The β1 coefficient gives the change in the log-odds of default for a one-unit increase in
balance. Confidence intervals for the estimates can be obtained with the following
command.
confint(model1)
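Because the coefficients are on the log-odds scale, exponentiating them (and their confidence limits) converts them to odds ratios; a short sketch, assuming model1 from above:

```r
exp(coef(model1))     # multiplicative change in the odds of default per unit increase
exp(confint(model1))  # the corresponding confidence interval on the odds scale
```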
broom::tidy(model2)  # tidy() from the broom package returns the coefficient table
Predict the probability of default for a student and a non-student, using the model above:
predict(model2, data.frame(student = factor(c("Yes", "No"))), type = "response")
With multiple predictor variables, we often want to know which variable is most
influential in predicting the response (Y). We can assess this with varImp from the
caret package; the variable with the highest value is the most important predictor.
Indicate which is the most important predictor
caret::varImp(model3)
A confusion matrix is a table that describes the classification performance of each model on
the test data. Each quadrant of the table has an important meaning. Here the "No" and
"Yes" row labels indicate whether the customer actually defaulted, and the "FALSE" and
"TRUE" column labels indicate whether we predicted the customer to default.
- True negatives (top-left quadrant): we predicted no default, and the customer did not default.
- False positives (top-right quadrant): we predicted default, but the customer did not default (a "Type I error").
- False negatives (bottom-left quadrant): we predicted no default, but the customer did default (a "Type II error").
- True positives (bottom-right quadrant): we predicted default, and the customer did default.
Diagonal elements of the confusion matrix indicate correct (true) predictions, while the off-diagonal elements represent incorrect (false) predictions.
The commands below show the confusion-matrix proportions for each of the models.
Indicate the % correct and incorrect predictions for each of the models and indicate which is
the best model.
list(
model1 = table(test$default, test.predicted.m1 > 0.5) %>% prop.table() %>% round(3),
model2 = table(test$default, test.predicted.m2 > 0.5) %>% prop.table() %>% round(3),
model3 = table(test$default, test.predicted.m3 > 0.5) %>% prop.table() %>% round(3)
)
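The fraction of correct predictions for a model is the sum of the diagonal of its table, and the error rate is its complement; a sketch, assuming test and the predicted probabilities test.predicted.m1 exist:

```r
cm <- table(test$default, test.predicted.m1 > 0.5)
accuracy   <- sum(diag(cm)) / sum(cm)  # proportion of correct predictions
error_rate <- 1 - accuracy             # proportion of incorrect predictions
```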
2. Include screen output and screen shots of your console to show the steps performed