Logistic Regression
Logistic regression is analogous to multiple linear regression (see Chapter 4), except
the outcome is binary. Various transformations are employed to convert the problem
to one in which a linear model can be fit. Like discriminant analysis, and unlike K-
Nearest Neighbor and naive Bayes, logistic regression is a structured model approach
rather than a data-centric approach. Due to its fast computational speed and its output of a model that lends itself to rapid scoring of new data, it is a popular method.
The first step is to think of the outcome variable not as a binary label but as the probability p that the label is a 1. It might be tempting to model p directly as a linear function of the predictors, but fitting such a model does not ensure that p will end up between 0 and 1, as a probability must.
Instead, we model p by applying a logistic response or inverse logit function to the predictors:

$$ p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q)}} $$

This transformation ensures that p stays between 0 and 1. To get the exponential expression out of the denominator, we work with odds instead of probabilities; the odds of an event are the probability of the event divided by the probability that it will not occur:

$$ \mathrm{Odds}(Y = 1) = \frac{p}{1 - p} $$
We can obtain the probability from the odds using the inverse odds function:

$$ p = \frac{\mathrm{Odds}}{1 + \mathrm{Odds}} $$
We combine this with the logistic response function, shown earlier, to get:

$$ \mathrm{Odds}(Y = 1) = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q} $$
Finally, taking the logarithm of both sides, we get an expression that involves a linear function of the predictors:

$$ \log\bigl(\mathrm{Odds}(Y = 1)\bigr) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q $$
The log-odds function, also known as the logit function, maps the probability p from (0, 1) to any value in (−∞, +∞); see Figure 5-2. The transformation circle is complete: we have used a linear model to predict a probability, which we can in turn map to a class label by applying a cutoff rule; any record with a probability greater than the cutoff is classified as a 1.
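To make the round trip concrete, here is a minimal sketch in plain Python; the probability value and the 0.5 cutoff are illustrative, not taken from the loan model:
import math

def logit(p):
    # log odds: maps a probability in (0, 1) to a value in (-inf, +inf)
    return math.log(p / (1 - p))

def inverse_logit(y):
    # logistic response function: maps a log odds value back to a probability
    return 1 / (1 + math.exp(-y))

p = 0.8                            # illustrative probability
log_odds = logit(p)                # about 1.386
p_back = inverse_logit(log_odds)   # recovers 0.8
label = 1 if p_back > 0.5 else 0   # cutoff rule: classify as 1 above the cutoff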
Consider a model fit to the loan data, predicting whether a loan defaults. The response is outcome, which takes a 0 if the loan is paid off and a 1 if the loan
defaults. purpose_ and home_ are factor variables representing the purpose of the loan
and the home ownership status. As in linear regression, a factor variable with P levels
is represented with P – 1 columns. By default in R, the reference coding is used, and
the levels are all compared to the reference level (see “Factor Variables in Regression”
on page 163). The reference levels for these factors are credit_card and MORTGAGE,
respectively. The variable borrower_score is a score from 0 to 1 representing the
creditworthiness of the borrower (from poor to excellent). This variable was created
from several other variables using K-Nearest Neighbor—see “KNN as a Feature
Engine” on page 247.
In Python, we use the scikit-learn class LogisticRegression from sklearn.linear_model. The arguments penalty and C are used to prevent overfitting by L1 or L2 regularization. Regularization is switched on by default; in order to fit without regularization, we set C to a very large value. The solver argument selects the minimizer to use; the method liblinear is the default:
predictors = ['payment_inc_ratio', 'purpose_', 'home_', 'emp_len_',
              'borrower_score']
outcome = 'outcome'
X = pd.get_dummies(loan_data[predictors], prefix='', prefix_sep='',
                   drop_first=True)
y = loan_data[outcome]
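The snippet above stops at the data preparation step, while the later snippets refer to a fitted model object named logit_reg. A minimal sketch of that fit, assuming we follow the advice above; the value 1e42 is simply an illustrative "very large" C:
from sklearn.linear_model import LogisticRegression

# a very large C effectively disables regularization; liblinear is the solver
# discussed in the text
logit_reg = LogisticRegression(penalty='l2', C=1e42, solver='liblinear')
logit_reg.fit(X, y)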
In contrast to R, scikit-learn derives the classes from the unique values in y (paid off and default). Internally, the classes are ordered alphabetically. As this is the reverse of the order of the factors used in R, you will see that the coefficients are reversed.
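A quick way to check that ordering is the classes_ attribute of the fitted model; the labels shown are what we would expect given the two outcome values, but are illustrative:
print(logit_reg.classes_)
# expected output along the lines of: ['default' 'paid off']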
Logistic regression is a special case of a generalized linear model (GLM): a linear model combined with a probability distribution (here, the binomial family) and a link function (here, the logit). Logistic regression is by far the most common form of GLM, but a data scientist will encounter other types. Sometimes a log link function is used instead of the logit; in practice, use of a log link is unlikely to lead to very different results for most applications. The Poisson distribution is commonly used to model count data (e.g., the number of times a user visits a web page in a certain amount of time). Other families include negative binomial and gamma, often used to model elapsed time (e.g., time to failure). In contrast to logistic regression, application of GLMs with these families is more nuanced and involves greater care; they are best avoided unless you are familiar with their utility and pitfalls.
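For illustration only, here is a sketch of fitting a Poisson GLM with the same statsmodels interface used later in this section; the count outcome and predictor are made-up data, not from the loan data set:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(size=100)                     # hypothetical predictor
visits = rng.poisson(np.exp(0.5 + 1.5 * x))   # hypothetical count outcome

# Poisson family with its default log link
poisson_model = sm.GLM(visits, sm.add_constant(x), family=sm.families.Poisson())
poisson_result = poisson_model.fit()
print(poisson_result.summary())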
The predicted values from logistic regression, $\hat{Y}$, are on the scale of the log odds; the predicted probability is obtained by applying the logistic response function:

$$ \hat{p} = \frac{1}{1 + e^{-\hat{Y}}} $$
In Python, we can convert the predicted log probabilities into a data frame and use the describe method to get summary characteristics of the distribution:
pred = pd.DataFrame(logit_reg.predict_log_proba(X),
                    columns=loan_data[outcome].cat.categories)
pred.describe()
The probabilities are directly available using the predict_proba method in scikit-learn:
pred = pd.DataFrame(logit_reg.predict_proba(X),
                    columns=loan_data[outcome].cat.categories)
pred.describe()
These are on a scale from 0 to 1 and don’t yet declare whether the predicted value is
default or paid off. We could declare any value greater than 0.5 as default. In practice,
a lower cutoff is often appropriate if the goal is to identify members of a rare class
(see “The Rare Class Problem” on page 223).
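For example, a sketch of turning the scikit-learn probabilities into class labels with a 0.5 cutoff; the column name 'default' assumes the categories used for pred above:
import numpy as np

prob_default = pred['default']   # per-loan probability of default
labels = np.where(prob_default > 0.5, 'default', 'paid off')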
The key to interpreting the coefficients is the odds ratio, which is easiest to understand for a binary factor variable X:

$$ \text{odds ratio} = \frac{\mathrm{Odds}(Y = 1 \mid X = 1)}{\mathrm{Odds}(Y = 1 \mid X = 0)} $$
This is interpreted as the odds that Y = 1 when X = 1 versus the odds that Y = 1 when
X = 0. If the odds ratio is 2, then the odds that Y = 1 are two times higher when X = 1
versus when X = 0.
Why bother with an odds ratio rather than probabilities? We work with odds because the coefficient $\beta_j$ in the logistic regression is the log of the odds ratio for $X_j$.
An example will make this more explicit. For the model fit in “Logistic Regression and the GLM” on page 210, the regression coefficient for purpose_small_business is 1.21526. This means that a loan to a small business, compared to a loan to pay off credit card debt, increases the odds of defaulting versus being paid off by a factor of exp(1.21526) ≈ 3.4. Clearly, loans for the purpose of creating or expanding a small business are considerably riskier than other types of loans.
Figure 5-3 shows the relationship between the odds ratio and the log-odds ratio for odds ratios greater than 1. Because the coefficients are on the log scale, an increase of 1 in a coefficient multiplies the odds ratio by exp(1) ≈ 2.72.
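A quick numerical check of these statements, using the purpose_small_business coefficient reported above (a sketch with numpy):
import numpy as np

np.exp(1.21526)   # about 3.4: the odds ratio for purpose_small_business
np.exp(1)         # about 2.72: a 1-unit coefficient increase multiplies the odds ratio by e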
Odds ratios for numeric variables X can be interpreted similarly: they measure the change in the odds ratio for a unit change in X. For example, increasing the payment-to-income ratio from, say, 5 to 6 increases the odds of the loan defaulting by a factor of exp(0.08244) ≈ 1.09. The variable borrower_score is a score of the borrowers’ creditworthiness and ranges from 0 (low) to 1 (high). The odds of the best borrowers defaulting on their loans, relative to the worst borrowers, are smaller by a factor of exp(−4.61264) ≈ 0.01. In other words, the default risk from the borrowers with the poorest creditworthiness is 100 times greater than that of the best borrowers!
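The same arithmetic confirms the borrower_score statement (continuing with numpy; the coefficient is the one reported above):
np.exp(-4.61264)       # about 0.01: odds ratio of best vs. worst borrowers
1 / np.exp(-4.61264)   # about 100: how much greater the worst borrowers' default odds are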
Fortunately, most practitioners don’t need to concern themselves with the details of the fitting algorithm, since it is handled by the software; it is enough to understand that it is an iterative way to find a good model under certain assumptions. As with linear regression, summary in R produces detailed output for the fitted model:
Call:
glm(formula = outcome ~ payment_inc_ratio + purpose_ + home_ +
emp_len_ + borrower_score, family = "binomial", data = loan_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.51951 -1.06908 -0.05853 1.07421 2.15528
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.638092 0.073708 22.224 < 2e-16 ***
payment_inc_ratio 0.079737 0.002487 32.058 < 2e-16 ***
purpose_debt_consolidation 0.249373 0.027615 9.030 < 2e-16 ***
purpose_home_improvement 0.407743 0.046615 8.747 < 2e-16 ***
purpose_major_purchase 0.229628 0.053683 4.277 1.89e-05 ***
purpose_medical 0.510479 0.086780 5.882 4.04e-09 ***
purpose_other 0.620663 0.039436 15.738 < 2e-16 ***
purpose_small_business 1.215261 0.063320 19.192 < 2e-16 ***
home_OWN 0.048330 0.038036 1.271 0.204
home_RENT 0.157320 0.021203 7.420 1.17e-13 ***
emp_len_ > 1 Year -0.356731 0.052622 -6.779 1.21e-11 ***
borrower_score -4.612638 0.083558 -55.203 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The package statsmodels has an implementation of generalized linear models (GLMs) that provides similarly detailed information:
y_numbers = [1 if yi == 'default' else 0 for yi in y]
logit_reg_sm = sm.GLM(y_numbers, X.assign(const=1),
                      family=sm.families.Binomial())
logit_result = logit_reg_sm.fit()
logit_result.summary()
Interpretation of the p-value comes with the same caveat as in regression and should
be viewed more as a relative indicator of variable importance (see “Assessing the
Model” on page 153) than as a formal measure of statistical significance. A logistic
regression model, which has a binary response, does not have an associated RMSE or R-squared. Instead, a logistic regression model is typically evaluated using more general metrics for classification; see “Evaluating Classification Models” on page 219.
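As a minimal sketch of such metrics with scikit-learn, using the model and data defined earlier (the metric choice here is illustrative):
from sklearn.metrics import accuracy_score, confusion_matrix

pred_class = logit_reg.predict(X)
print(confusion_matrix(y, pred_class))
print(accuracy_score(y, pred_class))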
Many other concepts for linear regression carry over to the logistic regression setting
(and other GLMs). For example, you can use stepwise regression, fit interaction
terms, or include spline terms. The same concerns regarding confounding and correlated variables apply to logistic regression (see “Interpreting the Regression Equation”
on page 169). You can fit generalized additive models (see “Generalized Additive
Models” on page 192) using the mgcv package in R:
logistic_gam <- gam(outcome ~ s(payment_inc_ratio) + purpose_ +
                    home_ + emp_len_ + s(borrower_score),
                    data=loan_data, family='binomial')
Analysis of residuals
One area where logistic regression differs from linear regression is in the analysis of
the residuals. As in linear regression (see Figure 4-9), it is straightforward to compute
partial residuals in R:
terms <- predict(logistic_gam, type='terms')
partial_resid <- resid(logistic_gam) + terms
df <- data.frame(payment_inc_ratio = loan_data[, 'payment_inc_ratio'],
                 terms = terms[, 's(payment_inc_ratio)'],
                 partial_resid = partial_resid[, 's(payment_inc_ratio)'])
Key Ideas
• Logistic regression is like linear regression, except that the outcome is a binary
variable.
• Several transformations are needed to get the model into a form that can be fit as a linear model, with the log odds as the response variable.
• After the linear model is fit (by an iterative process), the log odds is mapped back
to a probability.
• Logistic regression is popular because it is computationally fast and produces a model that can score new data with only a few arithmetic operations.
Further Reading
• The standard reference on logistic regression is Applied Logistic Regression, 3rd
ed., by David Hosmer, Stanley Lemeshow, and Rodney Sturdivant (Wiley, 2013).
• Also popular are two books by Joseph Hilbe: Logistic Regression Models (very
comprehensive, 2017) and Practical Guide to Logistic Regression (compact, 2015),
both from Chapman & Hall/CRC Press.
• Both The Elements of Statistical Learning, 2nd ed., by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (Springer, 2009), and its shorter cousin, An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (Springer, 2013), have a section on logistic regression.
• Data Mining for Business Analytics by Galit Shmueli, Peter Bruce, Nitin Patel,
Peter Gedeck, Inbal Yahav, and Kenneth Lichtendahl (Wiley, 2007–2020, with
editions for R, Python, Excel, and JMP) has a full chapter on logistic regression.