Lab 3 - Logistic Regression: Part B
Load data
library(tidyverse)  # for as_tibble(), dplyr and ggplot2
(default <- as_tibble(ISLR::Default))
Split the data into training (60%) and testing (40%) sets so we can assess how well
the model performs on out-of-sample data.
set.seed(123)  # make the split reproducible
sample <- sample(c(TRUE, FALSE), nrow(default), replace = TRUE, prob = c(0.6, 0.4))
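The logical vector can then be used to split `default` into the two sets; a minimal sketch, assuming the names `train` and `test` used later in the lab:

```r
train <- default[sample, ]   # ~60% of rows, used to fit the models
test  <- default[!sample, ]  # ~40% of rows, held out for evaluation
```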
The glm function fits generalized linear models, a class of models that includes logistic
regression. Its syntax is similar to that of lm, except that we must pass the argument
family = binomial to tell R to run a logistic regression rather than some other type of
generalized linear model. Behind the scenes, glm uses maximum likelihood to fit the model.
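Later commands in this lab refer to three fitted models (model1, model2, model3) and to predicted probabilities on the test set, which are not defined in this part. A plausible set of fits, assuming model1 uses balance, model2 uses student, model3 uses all predictors, and the split above:

```r
model1 <- glm(default ~ balance, family = binomial, data = train)
model2 <- glm(default ~ student, family = binomial, data = train)
model3 <- glm(default ~ balance + income + student, family = binomial, data = train)

# Predicted default probabilities on the held-out test set
test.predicted.m1 <- predict(model1, newdata = test, type = "response")
test.predicted.m2 <- predict(model2, newdata = test, type = "response")
test.predicted.m3 <- predict(model3, newdata = test, type = "response")
```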
The default data, together with the fitted logistic curve from glm, can be plotted with ggplot:
default %>%
mutate(prob = ifelse(default == "Yes", 1, 0)) %>%
ggplot(aes(balance, prob)) +
geom_point(alpha = .15) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
ggtitle("Logistic regression model fit") +
xlab("Balance") +
ylab("Probability of Default")
The deviance is analogous to the sum of squares in linear regression: it measures the
lack of fit of a logistic regression model to the data. A lower model deviance (reported
as Residual deviance in the summary output) indicates a better fit.
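The null and residual deviance are printed near the bottom of the model summary; for example, assuming model1 has been fit as above:

```r
summary(model1)
# The output ends with the Null deviance, the Residual deviance,
# and the AIC of the fitted model.
```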
The β1 coefficient gives the change in the log-odds of default for a one-unit increase in
balance. Confidence intervals for the estimates can be obtained with the following
command.
confint(model1)
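Because the coefficients are on the log-odds scale, exponentiating them (and their confidence limits) converts them to odds ratios; a short sketch, assuming model1 from above:

```r
exp(coef(model1))     # multiplicative change in the odds of default per unit increase
exp(confint(model1))  # the corresponding confidence interval on the odds scale
```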
broom::tidy(model2)  # tidy() from the broom package returns the coefficient table
Predict the probability of default for a student and a non-student, using the model above:
predict(model2, data.frame(student = factor(c("Yes", "No"))), type = "response")
With multiple predictor variables, we often want to know which variable is most
influential in predicting the response (Y). We can assess this with varImp from the
caret package; the variable with the highest value is the most important predictor.
Indicate which is the most important predictor
caret::varImp(model3)
A confusion matrix is a table that describes the classification performance of each model on
the test data. Each quadrant of the table has an important meaning. Here the "No" and
"Yes" row labels indicate whether the customer actually defaulted, and the "FALSE" and
"TRUE" column labels indicate whether we predicted the customer to default.
- True negatives (top-left quadrant): we predicted no default, and the customer did not default.
- False positives (top-right quadrant): we predicted default, but the customer did not default (a "Type I error").
- False negatives (bottom-left quadrant): we predicted no default, but the customer did default (a "Type II error").
- True positives (bottom-right quadrant): we predicted default, and the customer did default.
Diagonal elements of the confusion matrix indicate correct (true) predictions, while the off-diagonal elements represent incorrect (false) predictions.
The commands below show the confusion-matrix proportions for each of the models.
Indicate the % correct and incorrect predictions for each of the models and indicate which is
the best model.
list(
model1 = table(test$default, test.predicted.m1 > 0.5) %>% prop.table() %>% round(3),
model2 = table(test$default, test.predicted.m2 > 0.5) %>% prop.table() %>% round(3),
model3 = table(test$default, test.predicted.m3 > 0.5) %>% prop.table() %>% round(3)
)
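The fraction of correct predictions for a model is the sum of the diagonal of its table, and the error rate is its complement; a sketch, assuming test and the predicted probabilities test.predicted.m1 exist:

```r
cm <- table(test$default, test.predicted.m1 > 0.5)
accuracy   <- sum(diag(cm)) / sum(cm)  # proportion of correct predictions
error_rate <- 1 - accuracy             # proportion of incorrect predictions
```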
2. Include screen output and screen shots of your console to show the steps performed