
Classification Methods

Logistic Regression Model


Introduction
• The linear regression model assumes that the response variable Y is
quantitative.
• In many situations, the response variable is instead qualitative.
• Often qualitative variables are referred to as categorical; we will use
these terms interchangeably.
• Here we study approaches for predicting qualitative responses, a
process that is known as classification.
Introduction
• Predicting a qualitative response for an observation can be referred to
as classifying that observation, since it involves assigning the
observation to a category, or class.
• The methods used for classification generally predict the probability
of each of the categories of the qualitative variable, as the basis for
making the classification. In this sense they also behave like
regression methods.
• Here we will use logistic regression in modelling a binary response
variable.
Logistic Regression
• Instead of modelling the response Default directly, logistic
regression models the probability that Default belongs to a particular
category.
• For example, the probability of default given balance can be written
as Pr(Default = Yes | Balance).
• The values of Pr(Default = Yes | Balance), which we denote by
p(balance), will range between 0 and 1.
Logistic Regression
• Then for any given value of balance, a prediction can be made for
default.
• For example, one might predict default = Yes for any individual for
whom p(balance) > 0.5.
• Alternatively, if a company wishes to be conservative in predicting
individuals who are at risk for default, then it may choose to use a
lower threshold, such as p(balance) > 0.1.
Logistic Model
• We need to model p(X) using a function that gives outputs between
0 and 1 for all values of X.
• Many functions meet this criterion.
• In logistic regression, we use the logistic function,

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \quad \ldots\ldots (1)$$

• β0 and β1 are the parameters of the model.
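The logistic function in Equation (1) is easy to sketch in code. This is a minimal illustration; the parameter values passed in below are arbitrary placeholders, not estimates from the Default data:

```python
import math

def logistic(x, beta0, beta1):
    """Logistic function p(X) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)).

    Written in the algebraically equivalent form 1 / (1 + e^(-z)),
    which avoids overflow for large positive z.
    """
    z = beta0 + beta1 * x
    return 1 / (1 + math.exp(-z))

# The output always lies strictly between 0 and 1, whatever x is.
print(logistic(0.0, 0.0, 1.0))  # 0.5 at z = 0
```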


Logistic Model
• After a bit of manipulation of Equation (1), we find that

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X} \quad \ldots\ldots (2)$$

• The quantity p(X)/[1 − p(X)] is called the odds, and can take on any
value between 0 and ∞.
Odds
• Values of the odds close to 0 and ∞ indicate very low and very high
probabilities of default, respectively.
• For example, on average 1 in 5 people with an odds of 1/4 will
default, since p(X) = 0.2 implies an odds of 0.2/(1 − 0.2) = 1/4.
• Likewise, on average nine out of every ten people with an odds of 9
will default, since p(X) = 0.9 implies an odds of 0.9/(1 − 0.9) = 9.
• Odds are traditionally used instead of probabilities in horse-racing,
since they relate more naturally to the correct betting strategy.
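The probability-to-odds conversion in the two worked examples above can be reproduced directly:

```python
def odds(p):
    """Convert a probability p into odds p / (1 - p)."""
    return p / (1 - p)

print(odds(0.2))  # 0.25, i.e. odds of 1/4: "1 in 5 people default"
print(odds(0.9))  # 9.0: "nine out of every ten default"
```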
Logistic Model
• Alternatively, Equation (2) can be written as

$$\ln\!\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X \quad \ldots\ldots (3)$$

• The left-hand side is called the log-odds or logit.


• We see that the logistic regression model (1) has a logit that is linear
in X.
Interpreting the Coefficient Table
• We see that β1 = 0.0055; this indicates that an increase in balance is
associated with an increase in the probability of default.
• To be precise, a one-unit increase in balance is associated with an
increase in the log odds of default by 0.0055 units.
• Also note that the estimated odds ratio for a one-unit increase in
balance is e^0.0055 ≈ 1.005.
• This indicates that for every one-unit increase in balance, the odds of
defaulting are multiplied by approximately 1.005.
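The link between the coefficient and the odds ratio is just exponentiation; this one-liner reproduces the quoted value:

```python
import math

beta1 = 0.0055                 # estimated coefficient for balance (from the slide)
odds_ratio = math.exp(beta1)   # multiplicative change in odds per one-unit increase
print(odds_ratio)              # ≈ 1.0055 (reported as 1.005 in the slide)
```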
Making Predictions
• Once the coefficients have been estimated, it is a simple matter to
compute the probability of default for any given credit card balance.
• For example, using the coefficient estimates given in the Table, we
predict that the default probability for an individual with a balance of
$1,000 is

$$p(X) = \frac{e^{\hat\beta_0 + \hat\beta_1 \times 1000}}{1 + e^{\hat\beta_0 + \hat\beta_1 \times 1000}} = 0.00576.$$

• This is below 1%.

• In contrast, the predicted probability of default for an individual with a
balance of $2,000 is much higher, and equals 0.586 or 58.6%.
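As a check on these numbers: the intercept estimate is not shown in this excerpt, but an intercept of about −10.6513 (an assumed value, the standard estimate for this example in the Introduction to Statistical Learning textbook, which these slides appear to follow) together with β1 = 0.0055 reproduces both quoted probabilities:

```python
import math

# Assumed coefficient estimates -- the slide's coefficient table is not shown
# here, so the intercept is taken from the ISLR treatment of the same example.
b0, b1 = -10.6513, 0.0055

def p_default(balance):
    """Predicted default probability from Equation (1)."""
    z = b0 + b1 * balance
    return math.exp(z) / (1 + math.exp(z))

print(round(p_default(1000), 5))  # 0.00576, below 1%
print(round(p_default(2000), 3))  # 0.586
```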
Qualitative Predictors
• One can use qualitative predictors with the logistic regression model
using the dummy variable approach.
• As an example, the Default data set contains the qualitative variable
student.
• To fit the model we simply create a dummy variable that takes on a
value of 1 for students and 0 for non-students.
• The logistic regression model that results from predicting probability
of default from student status can be seen in the next Table.
Qualitative Predictors
              Coefficient   Std. error   z-statistic   p-value    Odds Ratio
Intercept       −3.5041       0.0707       −49.55      <0.0001
Student[Yes]     0.4049       0.1150         3.52       0.0004      1.499
Interpreting the Coefficients
• The coefficient associated with the dummy variable is positive, and the
associated p-value is statistically significant.
• This indicates that students tend to have higher default probabilities than
non-students:
$$\Pr(\text{default} = \text{Yes} \mid \text{student} = \text{Yes}) = \frac{e^{-3.5041 + 0.4049 \times 1}}{1 + e^{-3.5041 + 0.4049 \times 1}} = 0.0431,$$

$$\Pr(\text{default} = \text{Yes} \mid \text{student} = \text{No}) = \frac{e^{-3.5041 + 0.4049 \times 0}}{1 + e^{-3.5041 + 0.4049 \times 0}} = 0.0292.$$
• Also, the odds of default for students are approximately 1.5 times the
odds for non-students.
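Both probabilities follow directly from the coefficients in the table above:

```python
import math

# Coefficients from the table: intercept and the Student[Yes] dummy.
b0, b_student = -3.5041, 0.4049

def p_default(is_student):
    """Predicted default probability; is_student is the 0/1 dummy variable."""
    z = b0 + b_student * is_student
    return math.exp(z) / (1 + math.exp(z))

print(round(p_default(1), 4))  # 0.0431 for students
print(round(p_default(0), 4))  # 0.0292 for non-students
```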
Multiple Logistic Regression
• We now consider the problem of predicting a binary response using
multiple predictors.
• By analogy with the extension from simple to multiple linear
regression, we can generalize Equation (3) as follows:

$$\ln\!\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p, \quad \ldots\ldots (4)$$

where X = (X_1, X_2, …, X_p) are p predictors.
Multiple Logistic Regression
• Equation (4) can be rewritten as

$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}} \quad \ldots\ldots (5)$$

• As before, we use the maximum likelihood method to estimate the
parameters β0, …, βp.
• The next Table shows the coefficient estimates for a logistic
regression model that uses balance, income (in thousands of dollars),
and student status to predict probability of default.
Making Predictions
• By substituting estimates for the regression coefficients from the
Table into Equation (5), we can make predictions.
• For example, a student with a credit card balance of $1,500 and an
income of $40,000 has an estimated probability of default of

$$p(X) = \frac{e^{-10.869 + 0.00574 \times 1500 + 0.003 \times 40 - 0.6468 \times 1}}{1 + e^{-10.869 + 0.00574 \times 1500 + 0.003 \times 40 - 0.6468 \times 1}} = 0.058.$$

• A non-student with the same balance and income has an estimated
probability of default of

$$p(X) = \frac{e^{-10.869 + 0.00574 \times 1500 + 0.003 \times 40 - 0.6468 \times 0}}{1 + e^{-10.869 + 0.00574 \times 1500 + 0.003 \times 40 - 0.6468 \times 0}} = 0.105.$$
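The same calculation generalizes to any number of predictors; this sketch uses the coefficient estimates quoted above:

```python
import math

# Coefficient estimates from the slide: intercept, balance, income (in $1000s),
# and the student dummy.
b0, b_bal, b_inc, b_stu = -10.869, 0.00574, 0.003, -0.6468

def p_default(balance, income_k, student):
    """Predicted default probability from Equation (5)."""
    z = b0 + b_bal * balance + b_inc * income_k + b_stu * student
    return math.exp(z) / (1 + math.exp(z))

print(round(p_default(1500, 40, 1), 3))  # 0.058 for a student
print(round(p_default(1500, 40, 0), 3))  # 0.105 for a non-student
```

Note that income enters in thousands of dollars, so $40,000 is passed as 40.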
Confusion Matrix
• In practice, a binary classifier such as logistic regression can make two
types of errors.
• It can incorrectly assign an individual who defaults to the no default
category, or it can incorrectly assign an individual who does not
default to the default category.
• It is often of interest to determine which of these two types of errors
is being made.
• A confusion matrix, shown for the Default data in the next Table, is a
convenient way to display this information.
Confusion Matrix
                            True default status
Predicted default status    No        Yes       Total
No                          9627      228        9855
Yes                           40      105         145
Total                       9667      333       10000

Sensitivity = 105/333 = 31.53%
Specificity = 9627/9667 = 99.59%
Total Error Rate = 268/10000 = 2.68%
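The three summary measures follow from the four cell counts of the matrix:

```python
# Cell counts from the confusion matrix above.
tn, fn = 9627, 228   # predicted No:  truly No, truly Yes (missed defaulters)
fp, tp = 40, 105     # predicted Yes: truly No (false alarms), truly Yes

sensitivity = tp / (tp + fn)                    # 105 / 333
specificity = tn / (tn + fp)                    # 9627 / 9667
total_error = (fn + fp) / (tn + fn + fp + tp)   # 268 / 10000

print(f"{sensitivity:.2%} {specificity:.2%} {total_error:.2%}")  # 31.53% 99.59% 2.68%
```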
Confusion Matrix
• The table reveals that the logistic regression predicted that a total of
145 people would default.
• Of these people, 105 actually defaulted and 40 did not.
• Hence only 40 out of 9,667 of the individuals who did not default
were incorrectly labelled.
• This looks like a pretty low error rate!
Confusion Matrix
• However, of the 333 individuals who defaulted, 228 (or 68.47%) were
missed by Logistic Regression.
• So while the overall error rate is low, the error rate among individuals
who defaulted is very high.
• From the perspective of a credit card company that is trying to
identify high-risk individuals, an error rate of 228/333 = 68.47%
among individuals who default may well be unacceptable.
Sensitivity and Specificity
• The terms sensitivity and specificity characterize the performance of a
classifier.
• In this case the sensitivity is the percentage of true defaulters that are
identified: a low 31.53%.
• The specificity is the percentage of non-defaulters that are correctly
identified, here (1 − 40/9,667) × 100 = 99.59%.
Improving Logistic Regression Classifier
• For example, we might assign any customer with a probability of
default above 20% to the default class.
• That is, we may assign an observation to the default class if

Pr(default = Yes | X = x) > 0.2.   … … … (7)
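The thresholding rule itself is a one-line comparison; this sketch shows how lowering the cutoff changes a prediction:

```python
def classify(p, threshold=0.5):
    """Assign to the default class when the predicted probability exceeds threshold."""
    return "Yes" if p > threshold else "No"

# Lowering the threshold from 0.5 to 0.2 flags more at-risk customers.
print(classify(0.3))                  # "No" under the usual 0.5 rule
print(classify(0.3, threshold=0.2))   # "Yes" under the conservative 0.2 rule
```

This trades specificity for sensitivity: more true defaulters are caught, at the cost of more false alarms.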
Deciding the Optimal Threshold
ROC Curve
• The ROC curve is used for simultaneously displaying the two types of
errors for all possible thresholds.
• The name “ROC” comes from communications theory; it is an
acronym for receiver operating characteristic.
• The overall performance of a classifier, summarized over all possible
thresholds, is given by the area under the (ROC) curve (AUC).
• An ideal ROC curve should touch the top left corner, so the larger the
AUC the better the classifier.
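A bare-bones sketch of how an ROC curve and its AUC are computed by sweeping over all thresholds. The scores and labels here are made-up illustrations, and ties in scores are not handled:

```python
# Hypothetical predicted probabilities and true labels (1 = default).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,    0,   1,   0]

def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the threshold one score at a time."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in sorted(zip(scores, labels), reverse=True):
        tp += y          # a positive crosses the threshold: TPR rises
        fp += 1 - y      # a negative crosses the threshold: FPR rises
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

print(auc(roc_points(scores, labels)))  # 0.75 for this toy example
```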
General Rule
AUC                   Decision
AUC = 0.5             No discrimination
0.7 ≤ AUC < 0.8       Acceptable discrimination
0.8 ≤ AUC < 0.9       Excellent discrimination
AUC ≥ 0.9             Outstanding discrimination
How well does the model fit the data?
• The Hosmer-Lemeshow (HL) test is widely used to address the
question “How well does my model fit the data?”
• It serves as a goodness-of-fit (GOF) test for the logistic regression
model.
• This test is used to find out whether there is any significant evidence
against the model fitting the data well.
• If the 𝑝-value is small, this is indicative of poor fit.
• For the Default data set, the observed 𝑝-value is 0.8846, indicating
that there is no evidence of poor fit.
• So there is no significant evidence that the model is misspecified.
How well does the model fit the data?
• The logistic regression model is fitted using the method of maximum likelihood.
• The parameter estimates are those values which maximize the likelihood of the
data which have been observed.
• McFadden's R-squared measure is given by

$$R^2 = 1 - \frac{\log L_C}{\log L_{NULL}},$$

where L_C denotes the (maximized) likelihood value from the current fitted model,
and L_NULL denotes the corresponding value for the null model: the model with
only an intercept and no predictors.
• McFadden's R-squared measure also takes values between 0 and 1.
• For the Default data set, McFadden's R-squared value is 46.19%, indicating that
the model may be useful in practice.
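Given the two maximized log-likelihoods from the software output, the measure is a single ratio. The log-likelihood values below are illustrative assumptions (not taken from the slides), chosen so that the result matches the quoted 46.19%:

```python
# Hypothetical maximized log-likelihoods; in practice these come from the
# fitted logistic regression output (e.g. the model and null deviances).
ll_model = -785.8    # log L_C, current fitted model (illustrative value)
ll_null = -1460.3    # log L_NULL, intercept-only model (illustrative value)

mcfadden_r2 = 1 - ll_model / ll_null
print(round(mcfadden_r2, 4))  # 0.4619, i.e. about 46.19%
```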
Training Error Rate and Test Error Rate
• The misclassification error rate calculated earlier with the optimal
threshold was 13.81%.
• However, we have used the same data to train and test our model.
• In reality, this error rate is in fact the training error rate.
• In order to assess the accuracy of the model, we should first fit a
model using a part of the data and then should examine the
performance on the “hold-out” data.
• This error rate is called the test error rate.
• Next, we use 80% of the observations to fit the model and keep the
remaining 20% aside for validating the model.
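The hold-out split itself can be sketched as follows; the model is then fitted on the training indices only, and the test error rate is the misclassification rate on the held-out 20%:

```python
import random

n = 10_000                # observations in the Default data set
random.seed(42)           # fixed seed so the split is reproducible

indices = list(range(n))
random.shuffle(indices)   # random assignment to train/test

cut = int(0.8 * n)
train_idx, test_idx = indices[:cut], indices[cut:]

print(len(train_idx), len(test_idx))  # 8000 2000
```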
