Lab 3 - Logistic Regression: Part B


Lab-3 July 18, 2020

Lab 3 – Logistic Regression


Part B
1. Perform the steps listed below, include screenshots / outputs, and answer the
questions.

Note: Install any missing packages as required during execution.

Dataset is Default from the ISLR package.

STEP 1: Loading data


Packages
library(tidyverse) # data manipulation and visualization
library(modelr)    # provides easy pipeline modeling functions
library(broom)     # helps to tidy up model outputs

Load data
(default <- as_tibble(ISLR::Default))

Split the data into training (60%) and testing (40%) sets so we can assess how well
our model performs on out-of-sample data. Setting a seed first makes the random split
reproducible (the seed value 123 is arbitrary).
set.seed(123)
sample <- sample(c(TRUE, FALSE), nrow(default), replace = TRUE, prob = c(0.6, 0.4))

train <- default[sample, ]


test <- default[!sample, ]

STEP 2: Simple Logistic Regression


Fit a logistic regression model in order to predict the probability of a customer
defaulting based on the average balance carried by the customer

The glm function fits generalized linear models, a class of models that includes logistic
regression. Its syntax is similar to that of lm, except that we must pass the argument
family = binomial to tell R to run a logistic regression rather than some other type of
generalized linear model.

model1 <- glm(default ~ balance, family = "binomial", data = train)

Behind the scenes, glm uses maximum likelihood to fit the model.
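The model being fit is the logistic function p(X) = exp(b0 + b1·X) / (1 + exp(b0 + b1·X)), which maps the linear predictor onto a probability. As a sketch, the fitted probability at a given balance can be reproduced by hand from the estimated coefficients (assuming model1 has been fit as above):

```r
# rebuild the fitted probability from the estimated coefficients
b <- coef(model1)
p <- function(x) exp(b[1] + b[2] * x) / (1 + exp(b[1] + b[2] * x))
p(1500)  # should match predict(model1, data.frame(balance = 1500), type = "response")
```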

The default data and the fitted model can be plotted with ggplot2:

Shahima Khan I071 A3



default %>%
mutate(prob = ifelse(default == "Yes", 1, 0)) %>%
ggplot(aes(balance, prob)) +
geom_point(alpha = .15) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
ggtitle("Logistic regression model fit") +
xlab("Balance") +
ylab("Probability of Default")

Model details can be seen using the command summary


summary(model1)

The deviance is analogous to the sum-of-squares calculation in linear regression and is
a measure of lack of fit in a logistic regression model.

The goal is for the model deviance (reported as Residual deviance) to be low; smaller
values indicate a better fit.

STEP 3: Assessing coefficients


Coefficient estimates from logistic regression characterize the relationship between the
predictor and response variable on a log-odds scale. The following command shows the
estimates.
tidy(model1)

The β1 coefficient indicates the increase in the log-odds of default for a one-unit
increase in balance. The confidence intervals of the estimates can be seen using the
following command.
confint(model1)
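Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret; a minimal sketch (assuming model1 has been fit as above):

```r
# odds ratios: multiplicative change in the odds of default
# for a one-unit increase in each predictor
exp(coef(model1))
exp(confint(model1))  # confidence intervals on the odds-ratio scale
```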

STEP 4: Making predictions


Predict the probability of defaulting in R using the predict function (be sure to include type =
"response")
predict(model1, data.frame(balance = c(500, 1500)), type = "response")

STEP 5: Using qualitative predictors


Previous steps used the quantitative predictor balance. Now fit a model that uses
the student variable (a qualitative predictor).
model2 <- glm(default ~ student, family = "binomial", data = train)

tidy(model2)


Predict the probabilities of a student and a non-student defaulting, using the above model:
predict(model2, data.frame(student = factor(c("Yes", "No"))), type = "response")

STEP 6: Multiple logistic regression


model3 <- glm(default ~ balance + income + student, family = "binomial", data = train)
tidy(model3)

In the case of multiple predictor variables, we sometimes want to understand which variable
is the most influential in predicting the response (Y) variable. We can do this with varImp
from the caret package; the variable with the highest value is the most important predictor.
Indicate which is the most important predictor.
caret::varImp(model3)
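For a glm, varImp is based on the absolute value of each coefficient's z-statistic, so a similar ranking can be recovered directly from the model summary; a sketch (assuming model3 has been fit as above):

```r
# absolute z-statistics for each predictor (intercept dropped):
# the same quantity varImp reports for a glm
abs(summary(model3)$coefficients[-1, "z value"])
```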

STEP 7: Making predictions for the above


To make a prediction, we first create a test-data row for both student = "Yes" and
student = "No". In this case, balance = 1000 and income = $30K (entered here as 30),
and we then pass the new rows to predict.
new.df <- tibble(balance = 1000, income = 30, student = c("Yes", "No"))
predict(model3, new.df, type = "response")

What do you observe about the prediction probabilities?


STEP 8: Goodness-of-Fit
Compute McFadden's R2 for each of the models.
list(model1 = pscl::pR2(model1)["McFadden"],
model2 = pscl::pR2(model2)["McFadden"],
model3 = pscl::pR2(model3)["McFadden"])
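McFadden's R2 is defined as 1 − ℓ(model)/ℓ(null), where ℓ is the maximised log-likelihood and the null model contains only an intercept. A minimal sketch computing it directly for model1 (assuming model1 and train exist as above); the result should match pscl::pR2:

```r
# McFadden's pseudo-R^2 for model1, computed from log-likelihoods
null.model <- glm(default ~ 1, family = "binomial", data = train)
1 - as.numeric(logLik(model1)) / as.numeric(logLik(null.model))
```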

Which model has the best fit?


STEP 9: Validation of predicted values
We now use the estimated models to predict values on the testing data set (test). When
using predict, be sure to include type = "response" so that the prediction returns the
probability of default.
test.predicted.m1 <- predict(model1, newdata = test, type = "response")
test.predicted.m2 <- predict(model2, newdata = test, type = "response")
test.predicted.m3 <- predict(model3, newdata = test, type = "response")

A confusion matrix is a table that describes the classification performance of each model on
the test data. Each quadrant of the table has an important meaning. Here the "No" and
"Yes" rows represent whether customers actually defaulted, while the "FALSE" and "TRUE"
columns represent whether we predicted them to default.


The four quadrants are:

No / FALSE: true negatives (top-left quadrant). We predicted no default, and the customer
did not default.
No / TRUE: false positives (top-right quadrant). We predicted default, but the customer
did not default (also known as a "Type I error").
Yes / FALSE: false negatives (bottom-left quadrant). We predicted no default, but the
customer did default (also known as a "Type II error").
Yes / TRUE: true positives (bottom-right quadrant). We predicted default, and the
customer did default.

Diagonal elements of the confusion matrix indicate correct (true) predictions, while the off-
diagonals represent incorrect (false) predictions.
The commands below show the confusion-matrix proportions for each of the models.
Indicate the percentage of correct and incorrect predictions for each model, and
indicate which is the best model.
list(
model1 = table(test$default, test.predicted.m1 > 0.5) %>% prop.table() %>% round(3),
model2 = table(test$default, test.predicted.m2 > 0.5) %>% prop.table() %>% round(3),
model3 = table(test$default, test.predicted.m3 > 0.5) %>% prop.table() %>% round(3)
)
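The diagonal of each proportion table sums to the overall accuracy. A minimal sketch for model1 (assuming the predicted probabilities above exist); the same pattern applies to the other models:

```r
# classify with a 0.5 cutoff and compute the fraction of correct predictions
pred.class <- ifelse(test.predicted.m1 > 0.5, "Yes", "No")
mean(pred.class == test$default)  # overall test accuracy for model1
```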

2. Include screen output and screen shots of your console to show the steps performed
