
Topic 7 Regression (cont)

Logistic regression
Vincent Hoang (2022), Lecture 11
Logistic regression analysis
Logistic regression for a categorical dependent variable.
How to assess a model's performance using different metrics, depending on the research purpose, for
classification models.
Loan application modelling
• There are many factors that determine whether a loan application is approved
or not approved by the bank
◦ What is the dependent variable?

• Note that our dependent variable is not continuous but rather
dichotomous (the original data are ordinal)
• Now we can use logistic regression to build this specific dichotomous
classification model.
Classification models
• Classification methods seek to classify a categorical outcome into one of
two or more categories based on various data attributes.
• For each record (observation) in a dataset, we have a categorical variable
of interest and additional predictor variables.
• For a given set of predictor variables, we would like to assign the best
value of the categorical variable (dependent variable).
◦ Should we approve this loan or not? Is this person high risk or not?
• Very useful practical implications: efficient allocation of resources,
targeted marketing, better decisions, etc.
Visual classification methods
• A simple plot says a lot – can you see any distinct points where loans go from
being approved to not approved, looking at credit history and credit score?

[Bubble plots of years of credit history against credit score; large bubbles
indicate rejected applications]

Rule illustrated: Reject if Credit Score < 640
Rule illustrated: Reject if Years + 0.095 × (Credit Score) < 74.66
Logistic regression (1)
• It uses predictor variables to estimate the log-odds (not the probabilities) of an
observation belonging to the category of interest (Y = 1):

ln(p / (1 − p)) = β0 + β1X1 + β2X2 + … + βkXk

where ln is the natural log and p is the probability that the event of interest occurs.
• Once estimated, we can use algebra to rearrange the above equation for the
predicted probability (p):

p = 1 / (1 + e^−(β0 + β1X1 + … + βkXk))
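To make the algebra concrete, here is a minimal Python sketch of the log-odds-to-probability conversion; the coefficient values are made up for illustration and are not from the lecture's model:

```python
import math

def predict_probability(intercept, coefs, x):
    """Convert the log-odds from a logistic model into a probability."""
    # Log-odds: ln(p / (1 - p)) = b0 + b1*x1 + ... + bk*xk
    log_odds = intercept + sum(b * xi for b, xi in zip(coefs, x))
    # Rearranged: p = 1 / (1 + e^(-log_odds))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Illustrative values only (not the fitted loan model)
p = predict_probability(intercept=-2.0, coefs=[0.05], x=[60])
print(round(p, 3))  # 0.731
```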


Logistic regression vs linear regression
• We could use simple or multiple regression to estimate the probabilities – this tends to work well
near the centre of the data, but does poorly in the tails

• Predicted probabilities from logistic regression are bounded between 0 and 1, while predicted
probabilities from linear regression can fall outside these bounds

• Logistic regression uses maximum likelihood estimation to estimate parameters, while linear
regression uses ordinary least squares.
Credit approval data
[Slide shows the credit approval dataset]
Probabilities vs odds
• p: the probability that an event will occur, i.e. the fraction of times you expect to see
that event in many trials.
• The odds are defined as the probability that the event will occur divided by the
probability that the event will not occur.
• Example: If the horse runs 100 races and wins 80, the probability of winning is
80/100 = 0.80 or 80%, and the odds of winning are 80/20 = 4 to 1.
◦ The probability of success is p = 0.8, so the probability of failure is 1 − p = 0.2
◦ The odds of success are p / (1 − p) = 0.8 / 0.2 = 4, i.e. the odds of success are 4 to 1
◦ The odds of failure are (1 − p) / p = 0.2 / 0.8 = 0.25 to 1
Odds vs odds ratios
• Example: seven out of 10 males are admitted to an engineering school while three of 10
females are admitted.
• For males:
◦ the probability of admission is p = 0.7, so the probability of failure is 1 − p = 0.3
◦ the odds of success are 0.7 / 0.3 = 2.333
• For females:
◦ the probability of admission is p = 0.3, so the probability of failure is 1 − p = 0.7
◦ the odds of success are 0.3 / 0.7 = 0.42857
• The odds ratio for admission is OR = 2.333 / 0.42857 = 5.44. So for a male, the odds
of being admitted are 5.44 times as large as the odds for a female being admitted.
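A minimal Python sketch reproducing the arithmetic above:

```python
def odds(p):
    """Odds of an event that occurs with probability p."""
    return p / (1 - p)

odds_male = odds(0.7)     # 0.7 / 0.3 = 2.333...
odds_female = odds(0.3)   # 0.3 / 0.7 = 0.42857...
print(round(odds_male / odds_female, 2))  # odds ratio: 5.44
```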
Logistic regression (2)
• If we know the odds value, we can calculate the probability value.
• For example, re-arranging the definition of the odds gives the predicted probability (p):

p = odds / (1 + odds)

For the horse example above, odds of 4 give p = 4 / (1 + 4) = 0.8.
• The predicted probabilities smoothly approach 0 and 1 as values of the independent
variables go to their own extremes.
• If we plot the predictions against a single continuously-distributed independent
variable while holding the others fixed, we see an S-shaped or reverse-S-shaped
"logistic curve."
Logistic regression (3)
• There are applications of logistic regression in which the objective is to build a
model to generate a predicted probability of the dependent event under a given set
of values for the independent variables.
• In other settings the objective is to make categorical predictions: a definite 1 or a
definite 0 in each case (as in our loan application case study)
◦ Usually this is done by adding one more parameter to the model: a cut-off value.
◦ If the predicted probability of a positive outcome is greater than the cut-off value, the
observation is categorically predicted to be a 1, otherwise 0.
◦ The chosen cut-off value depends on the relative consequences of type 1 and type 2 errors (false
positives and false negatives), although a value of 0.5 is often used by default (see the sketch below).
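A minimal Python sketch of applying a cut-off value; the predicted probabilities here are made-up illustrations:

```python
def classify(probabilities, cutoff=0.5):
    """Label each predicted probability as 1 (positive) or 0 (negative)."""
    return [1 if p > cutoff else 0 for p in probabilities]

predicted = [0.92, 0.35, 0.61, 0.08, 0.55]  # illustrative probabilities
print(classify(predicted))                  # [1, 0, 1, 0, 1]
print(classify(predicted, cutoff=0.7))      # [1, 0, 0, 0, 0]
```

Raising the cut-off makes the model stricter: fewer cases are predicted positive, which typically lowers sensitivity and raises specificity.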
Measuring classification performance
• How well did our chosen model do? We normally look at three measures:
1. Accuracy – the percentage of observations that were correctly classified
2. Sensitivity – the percentage of positives (Dependent Variable = 1) that were correctly classified
3. Specificity – the percentage of negatives (Dependent Variable = 0) that were correctly classified

• We can also estimate our model using a subset of our data and evaluate its
performance out-of-sample in a test sample (see the sketch below).
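A minimal Python sketch computing the three measures from confusion-matrix counts; the counts match the classification results reported later in these slides:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct / all observations
    sensitivity = tp / (tp + fn)                # correct positives / actual positives
    specificity = tn / (tn + fp)                # correct negatives / actual negatives
    return accuracy, sensitivity, specificity

# 21 of 23 approvals and 24 of 27 rejections correctly classified
acc, sens, spec = classification_metrics(tp=21, tn=24, fp=3, fn=2)
print(f"accuracy={acc:.1%} sensitivity={sens:.1%} specificity={spec:.1%}")
# accuracy=90.0% sensitivity=91.3% specificity=88.9%
```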
Partitioning data
• Partitioning a data set is splitting the data randomly into two,
sometimes three smaller data sets: Training, Validation and Test.
◦ Training: The subset of data used to create (build) a model
◦ Validation: the subset of data that remains unseen when building the model
and is used to tune the model parameter estimates.
◦ Test (hold-out): A subset of data used to measure overall model performance
and compare the performance among different candidate models.
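A minimal Python sketch of a random partition; the 60/20/20 split proportions are an assumption for illustration, not from the lecture:

```python
import random

def partition(records, train=0.6, validation=0.2, seed=42):
    """Randomly split records into training, validation, and test subsets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],                 # training: build the model
            shuffled[n_train:n_train + n_val],  # validation: tune the model
            shuffled[n_train + n_val:])         # test (hold-out): compare models

training, validation, test = partition(list(range(50)))
print(len(training), len(validation), len(test))  # 30 10 10
```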
Classification performance - Accuracy
• Accuracy is simply the percentage of observations that are correctly classified:
Accuracy = (True Negatives + True Positives) / Overall Total

                  Predicted: No = 0            Predicted: Yes = 1           Total
Actual: No = 0    True Negative                False Positive               Total Actual Negatives
Actual: Yes = 1   False Negative               True Positive                Total Actual Positives
Total             Total Predicted Negatives    Total Predicted Positives    Overall Total
Classification performance - Sensitivity
• Sensitivity is the percentage of positives (Dummy = 1) that are correctly classified:
Sensitivity = True Positives / Total Actual Positives

                  Predicted: No = 0            Predicted: Yes = 1           Total
Actual: No = 0    True Negative                False Positive               Total Actual Negatives
Actual: Yes = 1   False Negative               True Positive                Total Actual Positives
Total             Total Predicted Negatives    Total Predicted Positives    Overall Total
Classification performance - Specificity
• Specificity is the percentage of negatives (Dummy = 0) that are correctly classified:
Specificity = True Negatives / Total Actual Negatives

                  Predicted: No = 0            Predicted: Yes = 1           Total
Actual: No = 0    True Negative                False Positive               Total Actual Negatives
Actual: Yes = 1   False Negative               True Positive                Total Actual Positives
Total             Total Predicted Negatives    Total Predicted Positives    Overall Total
Classification threshold – cut-off value
• The cut-off value is the threshold at which we classify an observation as belonging to the
category of interest (1) rather than the reference group (0)
• Common cut-off values include:
◦ 0.5
◦ The mid-point between the smallest and largest predicted scores/probabilities

• There is a trade-off between sensitivity and specificity – given the context, are you more
willing to have false positives or false negatives?
◦ For example, are you more willing to approve a loan that shouldn't be approved (false positive),
or not approve a loan that should be approved (false negative)?
An example – loan approval data
Logistic regression to predict loan approval

Spreadsheet: Topic 7 Credit Approval Data.xlsx

We will use logistic regression to predict loan approval using the following predictors:
◦ Homeowner status
◦ Credit score
◦ Years of credit history
◦ Balance of existing credit facilities the applicant has and can use
◦ % of existing credit facility utilised by applicant
Logistic regression with “RegressIt” in Excel
• You will need to download the "RegressIt" Excel add-in at the link below,
then activate it
◦ https://regressit.com/logistic.html
• To activate, go to File > Options > Add-Ins > Manage Add-ins, then search for
the RegressIt file that you downloaded (it will likely be in your "Downloads"
folder)
• RegressIt toolbar: [screenshot on slide]
Data preparation
• The data provided has two parts:
◦ 50 loan applicants, their details, and whether their loan was approved or not.
◦ 6 potential applicants that have yet to be assessed.

• In the RegressIt tab, click on "Select Data", then click on "Create Names"
and follow the prompts. This will tell Excel what data you are using.
Make sure all the data is selected.
The Logistic Regression interface

• To run a logistic regression, click on the "Logistic Regression" button
• Take the time to familiarise yourself with the instruction manual
Running the regression

• Let's start with a simpler model – predicting loan approval with credit score
and homeowner status
Logistic regression with "Real Statistics" in Excel
• You will need to download the "XRealStats" Excel add-in at the link
below, then activate it
◦ https://real-statistics.com/free-download/
◦ To activate, go to File > Options > Add-Ins > Manage Add-ins, then search for the
XRealStats file that you downloaded (it will likely be in your "Downloads" folder)

• Real Statistics toolbar: [screenshot on slide]
The Logistic Regression interface

• To run a logistic regression, click on the "Logistic Regression" button
• Then add the input range
Results
• The log-odds of approval are positively and significantly related to credit
score and homeowner status:

ln(p / (1 − p)) = −30.6003 + 0.0413 × Credit Score + 4.7397 × Homeowner
Classification performance
• 45/50 (90%) correctly classified (accuracy)
• 21/23 (91.3%) of approvals correctly classified (sensitivity)
• 24/27 (88.9%) of rejections correctly classified (specificity)
Out-of-sample testing – thoughts?
• Let's evaluate the classification in the test sample
Predicted probabilities

Observation   Homeowner   Credit Score
51            1           700
52            0           520
53            1           650
54            0           602
55            0           549
56            1           742

Let's predict the probabilities of getting approval for observations 51-55 (see the sketch below)
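A minimal Python sketch applying the fitted equation from the Results slide to these applicants; treat the outputs as approximations, since the Excel add-ins carry more decimal places in the coefficients:

```python
import math

def approval_probability(homeowner, credit_score):
    """Predicted approval probability from the fitted logistic model."""
    log_odds = -30.6003 + 0.0413 * credit_score + 4.7397 * homeowner
    return 1.0 / (1.0 + math.exp(-log_odds))

applicants = {51: (1, 700), 52: (0, 520), 53: (1, 650), 54: (0, 602), 55: (0, 549)}
for obs, (homeowner, score) in applicants.items():
    print(obs, round(approval_probability(homeowner, score), 3))
# Homeowners with high credit scores get probabilities near 1;
# non-homeowners with low scores get probabilities near 0.
```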


Classification threshold – cut-off value
• The cut-off value is the threshold at which we classify an observation as belonging to the
category of interest (1) rather than the reference group (0)
• Common cut-off values include:
◦ 0.5
◦ The mid-point between the smallest and largest predicted scores/probabilities (not
discussed here)

• There is a trade-off between sensitivity and specificity – given the context, are you more
willing to have false positives or false negatives?
Model 1: using different cut-off values
• Cut-off value of 0.5 → 0.7 → 0.8
• We can change the cut-off value; the classification tables will change accordingly.
[Slide shows the classification tables and the resulting sensitivity, specificity and
accuracy for each cut-off value]
Comparing models (using a cut-off of 0.5)
[Slide shows classification results for Model 1 and Model 2 side by side]
Systematic modelling approach
• Analysts can take an approach in which:
◦ They start with a simple model with several key variables
◦ How do we determine the key variables? Often we rely on:
   ◦ Theories or theoretical foundations: e.g. prices going up means less consumption
   ◦ Widely accepted views or norms: e.g. managers earn more than non-managerial employees, or
     experienced staff get paid more
   ◦ Your own experience or research understanding (e.g. your work using multiple regression analysis)
   ◦ Research objectives
◦ Then they can add "more" variables into the model, in doing so watching out for:
   ◦ Increases in adjusted R-squared
   ◦ Issues of multicollinearity
   ◦ Statistical significance of the predictor / explanatory / independent X variables (together with the
     signs and magnitudes of the coefficients)
Systematic modelling (2)
• Over the last few weeks, we have been following another approach in which:
◦ We start with a complicated model with many independent variables, acknowledging
potential issues related to multicollinearity
◦ Then we remove those variables that are not statistically significant to get to a simpler model

• In general, both approaches are good for the purpose of examining a
certain hypothesis or understanding more about the situation.
• In case we are more concerned about "predictive" power, there are
other measures of model performance, as discussed today.
