Topic 7 Regression (Cont2) Logistic Regression
Logistic regression
Vincent Hoang (2022), Lecture 11
Logistic regression analysis
Logistic regression for a categorical dependent variable.
How to assess model performance using different metrics, for different research purposes, in
classification models.
Loan application modelling
• There are many factors that determine if a loan application is approved
or not approved by the bank
◦ What is the dependent variable?
[Scatter plot of loan applications: large bubbles indicate rejected applications]
Reject if Credit Score < 640; Reject if Years + 0.095 × (Credit Score) < 74.66
Logistic regression (1)
• It uses predictor variables to estimate the log-odds (not the probabilities) of an
observation belonging to the category of interest (Y = 1):

ln(p / (1 − p)) = β0 + β1x1 + β2x2 + ⋯ + βkxk

where ln is the natural log and p is the probability that the event of interest occurs.
• Once estimated, we can use algebra to rearrange the above equation for the
predicted probability (p):

p = 1 / (1 + e^−(β0 + β1x1 + ⋯ + βkxk))
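The rearrangement above can be sketched in a few lines of code. This is an illustrative helper (not from the lecture, which uses an Excel add-in); the function name is my own.

```python
import math

def predicted_probability(log_odds):
    """Convert log-odds from a logistic regression into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Log-odds of 0 correspond to a probability of exactly 0.5
print(predicted_probability(0.0))  # 0.5
```

Because the logistic function maps any real-valued log-odds into (0, 1), the predicted probabilities are always properly bounded.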
Logistic regression vs linear regression
• We could use simple or multiple regression to estimate the probabilities – tends to work well near
the centre of the data, but doesn’t do well in the tails
• Predicted probabilities from logistic regression are bounded between 0 and 1, while predicted
probabilities from linear regression can fall outside these bounds
• Logistic regression uses maximum likelihood estimation to estimate parameters, while linear
regression uses ordinary least squares.
Credit approval data
Probabilities vs odds
• p: the probability that an event will occur, i.e. the fraction of times you expect to see
that event in many trials.
• The odds are defined as the probability that the event will occur divided by the
probability that the event will not occur.
• Example: If the horse runs 100 races and wins 80, the probability of winning is
80/100 = 0.80 or 80%, and the odds of winning are 80/20 = 4 to 1.
◦ The probability of success is p = 0.8, so the probability of failure is 1 − p = 0.2
◦ The odds of success are p / (1 − p) = 0.8 / 0.2 = 4, i.e. the odds of success are 4 to 1
◦ The odds of failure are (1 − p) / p = 0.2 / 0.8 = 0.25, i.e. 0.25 to 1
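The horse-racing arithmetic above can be reproduced directly:

```python
p = 0.80                  # probability the horse wins (80 wins in 100 races)
odds_win = p / (1 - p)    # odds of winning: 0.8 / 0.2 = 4, i.e. 4 to 1
odds_lose = (1 - p) / p   # odds of losing: 0.2 / 0.8 = 0.25, i.e. 0.25 to 1
print(odds_win, odds_lose)
```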
Odds vs odds ratios
• Example: seven out of 10 males are admitted to an engineering school while three of 10
females are admitted.
• For males:
◦ the probability of admission is p = 0.7, so the probability of failure is 1 − p = 0.3
◦ the odds of success are 0.7 / 0.3 = 2.333
• For females:
◦ the probability of admission is p = 0.3, so the probability of failure is 1 − p = 0.7
◦ the odds of success are 0.3 / 0.7 = 0.42857
• The odds ratio for admission is OR = 2.333 / 0.42857 = 5.44. So for a male, the odds
of being admitted are 5.44 times as large as the odds for a female being admitted.
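The odds-ratio calculation for the admissions example can be checked as follows (a small sketch using the probabilities from the slide):

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

odds_male = odds(0.7)                  # ≈ 2.333
odds_female = odds(0.3)                # ≈ 0.42857
odds_ratio = odds_male / odds_female   # ≈ 5.44
print(round(odds_ratio, 2))
```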
Logistic regression (2)
• If we know the odds value, we can calculate the probability.
• For example, rearranging odds = p / (1 − p) gives the predicted probability: p = odds / (1 + odds)
• We can also estimate our model using a subset of our data and evaluate its
performance out-of-sample in a test sample.
Partitioning data
• Partitioning a data set is splitting the data randomly into two,
sometimes three smaller data sets: Training, Validation and Test.
◦ Training: The subset of data used to create (build) a model
◦ Validation: the subset of data that remains unseen when building the model
and is used to tune the model parameter estimates.
◦ Test (hold-out): A subset of data used to measure overall model performance
and compare the performance among different candidate models.
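The three-way random split described above can be sketched as follows. This is an illustrative implementation using only the standard library; the fractions (60/20/20) and function name are my own choices, not from the lecture.

```python
import random

def partition(data, train_frac=0.6, valid_frac=0.2, seed=42):
    """Randomly split data into training, validation, and test subsets."""
    rng = random.Random(seed)     # fixed seed so the split is reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # remainder is the hold-out set
    return train, valid, test

train, valid, test = partition(list(range(100)))
print(len(train), len(valid), len(test))  # 60 20 20
```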
Classification performance - Accuracy
• Accuracy is simply the percentage of observations that are correctly classified
[Confusion matrix: rows = actual outcome, columns = predicted outcome (No = 0, Yes = 1), with row and column totals]
• A trade-off between sensitivity and specificity – given the context, are you more
willing to have false positives or false negatives?
◦ For example, are you more willing to approve a loan that shouldn’t be approved (false positive),
or not approve a loan that should be approved (false negative)?
An example – loan approval data
Logistic regression to predict loan approval
• To run a logistic regression, click on the “Logistic Regression” button
• Take the time to familiarise yourself with the instruction manual
Running the regression
• To run a logistic regression, click on the “Logistic Regression” button, then add the input range
Results
• Log-odds of approval is positively and significantly related to credit
score and homeowner status:
ln(p / (1 − p)) = −30.6003 + 0.0413 × Credit Score + 4.7397 × Homeowner
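Using the fitted coefficients from the slide, the predicted approval probability for any applicant can be computed by converting the log-odds back to a probability (an illustrative helper; the function name and example inputs are my own):

```python
import math

def approval_probability(credit_score, homeowner):
    """Predicted approval probability from the fitted logistic model."""
    log_odds = -30.6003 + 0.0413 * credit_score + 4.7397 * homeowner
    return 1.0 / (1.0 + math.exp(-log_odds))

# A homeowner with a credit score of 700 has a high predicted probability
print(round(approval_probability(700, 1), 3))
```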
Classification performance
• 45/50 (90%) correctly classified
(accuracy)
• 21/23 (91.3%) of approvals correctly
classified (sensitivity)
• 24/27 (88.9%) of rejections correctly
classified (specificity)
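These three figures follow directly from the confusion-matrix counts. The counts below (21 correct approvals, 2 missed, 24 correct rejections, 3 false approvals) are inferred from the percentages stated above:

```python
TP, FN = 21, 2   # approvals: correctly vs wrongly classified
TN, FP = 24, 3   # rejections: correctly vs wrongly classified

accuracy = (TP + TN) / (TP + TN + FP + FN)  # 45/50 = 0.90
sensitivity = TP / (TP + FN)                # 21/23 ≈ 0.913
specificity = TN / (TN + FP)                # 24/27 ≈ 0.889
print(accuracy, round(sensitivity, 3), round(specificity, 3))
```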
Out-of-sample testing – thoughts?
• Let’s evaluate the classification in the test sample
Predicted probabilities
Observation   Homeowner   Credit Score
51            1           700
52            0           520
53            1           650
54            0           602
55            0           549
56            1           742
Model 1: using different cut-off values
• Cut-off values of 0.5, 0.7 and 0.8
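Raising the cut-off makes the classifier stricter: fewer observations are labelled as approvals, trading sensitivity for specificity. A minimal sketch (the probabilities below are hypothetical, not from the loan data):

```python
def classify(probabilities, cutoff):
    """Label an observation 1 (approve) when its predicted probability meets the cut-off."""
    return [1 if p >= cutoff else 0 for p in probabilities]

probs = [0.95, 0.72, 0.55, 0.30]  # hypothetical predicted probabilities
print(classify(probs, 0.5))  # [1, 1, 1, 0]
print(classify(probs, 0.7))  # [1, 1, 0, 0]
print(classify(probs, 0.8))  # [1, 0, 0, 0]
```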