Note 4
Logistic Regression
Extends the idea of linear regression to situations where the outcome variable is categorical
Widely used, particularly where a structured model is useful to explain (=profiling) or to predict
The Logit
Goal: Find a function of the predictor variables that relates them to a 0/1 outcome.
The straight-line model
$p = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q$
is problematic: the right-hand side is not restricted to the interval [0, 1], but a probability must be.
The Fix:
use the logistic response function
$p = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q)}}$
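As a quick numeric check (a sketch that is not part of the original slides; the coefficient values below are made up purely for illustration), the logistic response function can be evaluated directly in R, and base R's plogis() gives the same result:
# Hypothetical coefficients, chosen only to illustrate the formula
b0 <- -3; b1 <- 0.05; b2 <- 0.8
x1 <- 100; x2 <- 2
eta <- b0 + b1 * x1 + b2 * x2   # linear predictor: beta0 + beta1*x1 + beta2*x2
p <- 1 / (1 + exp(-eta))        # logistic response function: always in (0, 1)
p                               # about 0.973
plogis(eta)                     # same value via base R's logistic distribution function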
Step 2: The Odds
The odds of an event are defined as:
$\text{Odds} = \dfrac{p}{1 - p}$    (eq. 10.3)
Or, given the odds of an event, the probability of the event can be computed by:
$p = \dfrac{\text{Odds}}{1 + \text{Odds}}$    (eq. 10.4)
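A tiny numeric sketch (illustrative values only) of the two conversions:
# eq. 10.3: odds from probability; eq. 10.4: probability from odds
p <- 0.8
odds <- p / (1 - p)     # 4: the event is 4 times as likely to occur as not
odds / (1 + odds)       # back to the probability, 0.8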
[Figure: Odds (a) and Logit (b) as functions of p]
Example
Personal Loan Offer (UniversalBank.csv)
Single Predictor Model
Data Prep:
#### Table 10.2
bank.df <- read.csv("UniversalBank.csv")
head(bank.df)
> head(bank.df)
  ID Age Experience Income ZIP.Code Family CCAvg Education Mortgage Personal.Loan Securities.Account CD.Account Online CreditCard
1  1  25          1     49    91107      4   1.6         1        0             0                  1          0      0          0
2  2  45         19     34    90089      3   1.5         1        0             0                  1          0      0          0
3  3  39         15     11    94720      1   1.0         1        0             0                  0          0      0          0
4  4  35          9    100    94112      1   2.7         2        0             0                  0          0      0          0
5  5  35          8     45    91330      4   1.0         2        0             0                  0          0      0          1
6  6  37         13     29    92121      4   0.4         2      155             0                  0          0      1          0
>
bank.df <- bank.df[ , -c(1, 5)] # Drop ID and zip code in columns 1 and 5.
# Education is 1, 2, 3.
# Want to convert it to categorical (R will create dummy variables)
bank.df$Education <- factor(bank.df$Education, levels = c(1, 2, 3),
labels = c("Undergrad", "Graduate",
"Advanced/Professional"))
head(bank.df)
Age Experience Income Family CCAvg Education Mortgage Personal.Loan Securities.Account CD.Account Online CreditCard
1 25 1 49 4 1.6 Undergrad 0 0 1 0 0 0
2 45 19 34 3 1.5 Undergrad 0 0 1 0 0 0
3 39 15 11 1 1.0 Undergrad 0 0 0 0 0 0
4 35 9 100 1 2.7 Graduate 0 0 0 0 0 0
5 35 8 45 4 1.0 Graduate 0 0 0 0 0 1
6 37 13 29 4 0.4 Graduate 155 0 0 0 1 0
>
# partition data
set.seed(2) # random sample can be reproduced by setting a value for seed
train.index <- sample(c(1:dim(bank.df)[1]), dim(bank.df)[1]*0.6)
train.df <- bank.df[train.index, ]
valid.df <- bank.df[-train.index, ]
Seeing the Relationship
Estimation: Estimates of the β's are derived through an iterative process called maximum likelihood estimation (MLE)
> # run logistic regression
> # use glm() (generalized linear model) with family = "binomial" to fit a logistic
> # regression.
> logit.reg <- glm(Personal.Loan ~ ., data = train.df, family = "binomial")
> options(scipen=999)
> summary(logit.reg)
Call:
glm(formula = Personal.Loan ~ ., family = "binomial", data = train.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0380 -0.1847 -0.0627 -0.0183 3.9810
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.6805628 2.2903370 -5.537 0.0000000308 ***
Age -0.0369346 0.0848937 -0.435 0.66351
Experience 0.0490645 0.0844410 0.581 0.56121
Income 0.0612953 0.0039762 15.416 < 0.0000000000000002 ***
Family 0.5434657 0.0994936 5.462 0.0000000470 ***
CCAvg 0.2165942 0.0601900 3.599 0.00032 ***
EducationGraduate 4.2681068 0.3703378 11.525 < 0.0000000000000002 ***
EducationAdvanced/Professional 4.4408154 0.3723360 11.927 < 0.0000000000000002 ***
Mortgage 0.0015499 0.0007926 1.955 0.05052 .
Securities.Account -1.1457476 0.3955796 -2.896 0.00377 **
CD.Account 4.5855656 0.4777696 9.598 < 0.0000000000000002 ***
Online -0.8588074 0.2191217 -3.919 0.0000888005 ***
CreditCard -1.2514213 0.2944767 -4.250 0.0000214111 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
varImp(logit.reg) # variable importance; varImp() is from the caret package (loaded below)
names(logit.reg)
# the components listed by names() can be accessed with logit.reg$<name>
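Because the coefficients are on the logit scale, exponentiating them gives odds multipliers. A short sketch (not in the original code) using the fitted model:
# exp(coefficient) = factor by which the odds of Personal.Loan = 1 are multiplied
# for a one-unit increase in that predictor, holding the other predictors fixed.
round(exp(coef(logit.reg)), 4)
# e.g., exp(0.0613) is about 1.06: each one-unit increase in Income multiplies the
# estimated odds of accepting the loan offer by roughly 1.06.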
Evaluating Classification Performance
Performance measures: Confusion matrix and % of misclassifications
# some libraries
install.packages("e1071")
library(e1071) # misc functions in stat, prob
install.packages("Rcpp")
library(Rcpp) # R and C++ integration
install.packages("caret")
library(caret) # Classification And REgression Training
# evaluate
# By default, predict(logit.reg, valid.df) returns predictions on the scale of the
# linear predictor, i.e., the logit (log odds).
# Adding type = "response" returns the predicted probabilities instead.
pred <- predict(logit.reg, valid.df, type = "response")
> # get the confusion matrix using table(); column 8 of valid.df is Personal.Loan
> c.mat <- table(ifelse(pred > 0.5, 1, 0), valid.df[,8])
> c.mat
0 1
0 1779 58
1 30 133
>
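The caret package loaded above also provides confusionMatrix(), which adds accuracy, sensitivity, and specificity to the raw counts; a sketch using the same cutoff (assumes pred as computed above):
# caret's confusionMatrix() expects factors with matching levels; "1" is the positive class
confusionMatrix(factor(ifelse(pred > 0.5, 1, 0), levels = c(0, 1)),
                factor(valid.df$Personal.Loan, levels = c(0, 1)),
                positive = "1")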
#ROC Curve and other capabilities of "InformationValue"
install.packages("InformationValue")
library(InformationValue)
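The output below was produced with functions from the InformationValue package; a minimal sketch of the kind of calls involved (argument names as in that package's documentation; the exact criterion used for the slides' output is not shown, so treat this as illustrative):
# ROC curve for the validation-set predictions
plotROC(actuals = valid.df$Personal.Loan, predictedScores = pred)
# Search over probability cutoffs; returnDiagnostics = TRUE returns a list with the
# optimal cutoff, a sensitivity table, and misclassification measures (shown below).
optimalCutoff(actuals = valid.df$Personal.Loan, predictedScores = pred,
              optimiseFor = "Both", returnDiagnostics = TRUE)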
$optimalCutoff
[1] 0.4799484
$sensitivityTable
CUTOFF FPR TPR YOUDENSINDEX SPECIFICITY MISCLASSERROR
1 0.999948397 0 0.005235602 0.005235602 1.0000000 0.0950
2 0.989948397 0.0005527916 0.2041885 0.203635690 0.9994472 0.0765
3 0.979948397 0.0005527916 0.3036649 0.303112130 0.9994472 0.0670
4 0.969948397 0.0005527916 0.3350785 0.334525742 0.9994472 0.0640
... (remaining rows of the sensitivity table skipped)
$misclassificationError
[1] 0.043
$TPR
[1] 0.7120419
$FPR
[1] 0.01713654
$Specificity
[1] 0.9828635
>
> # optimiseFor: the maximization criterion for which the probability cutoff is optimised.
> # Can take one of the following values: "Ones", "Zeros", "Both", or "misclasserror"
> # (the default). If "Ones" is used, the optimal cutoff is chosen to maximise detection
> # of ones. If "Both" is specified, the probability cutoff that gives the maximum
> # Youden's index is chosen. If "misclasserror" is specified, the probability cutoff
> # that gives the minimum misclassification error is chosen.
>
Converting to Probability
$p = \dfrac{\text{Odds}}{1 + \text{Odds}}$
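Applying the same conversion to the loan model: predictions on the link scale are logits, which can be turned into odds and then probabilities. A sketch (assumes logit.reg, valid.df, and pred from above):
logit.pred <- predict(logit.reg, valid.df, type = "link")   # log(odds); "link" is the default
odds.pred <- exp(logit.pred)                                # odds of accepting the loan
prob.pred <- odds.pred / (1 + odds.pred)                    # eq. 10.4
all.equal(as.numeric(prob.pred), as.numeric(pred))          # matches type = "response"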
Loan Example:
Evaluating Classification Performance
Performance measures: Confusion matrix and % of misclassifications
The “lift” over the base curve indicates, for a given number of cases (read on the x-axis), the
additional responders that you can identify by using the model.
The same information is portrayed in the decile-wise lift chart: taking the 10% of the records
that are ranked by the model as “most probable 1’s” yields 7.9 times as many 1’s as simply
selecting 10% of the records at random would.
#### Figure 10.3
> install.packages("gains") # to get lift chart
library(gains)
gain <- gains(valid.df$Personal.Loan, pred, groups = 10) # pred: predicted probabilities from above
> class(gain)
[1] "gains"
> names(gain)
 [1] "depth"           "obs"             "cume.obs"        "mean.resp"       "cume.mean.resp"  "cume.pct.of.total"
 [7] "lift"            "cume.lift"       "mean.prediction" "min.prediction"  "max.prediction"  "conf"
[13] "optimal"         "num.groups"      "percents"
>
> data.frame(c(0,gain$cume.pct.of.total*sum(valid.df$Personal.Loan)) ,
+ c(0,gain$cume.obs) )
c.0..gain.cume.pct.of.total...sum.valid.df.Personal.Loan.. c.0..gain.cume.obs.
1 0 0
2 151 200
3 172 400
4 180 600
5 185 800
6 187 1000
7 187 1200
8 189 1400
9 190 1600
10 191 1800
11 191 2000
>
> data.frame( c(0,sum(valid.df$Personal.Loan)) , c(0, dim(valid.df)[1]) )
c.0..sum.valid.df.Personal.Loan.. c.0..dim.valid.df..1..
1 0 0
2 191 2000
>
# plot lift chart
plot(c(0,gain$cume.pct.of.total*sum(valid.df$Personal.Loan))~c(0,gain$cume.obs),
xlab="# cases", ylab="Cumulative", main="", type="l")
lines(c(0,sum(valid.df$Personal.Loan))~c(0, dim(valid.df)[1]), lty=2)
> # compute deciles and plot decile-wise chart
> heights <- gain$mean.resp/mean(valid.df$Personal.Loan)
>
> gain$mean.resp
[1] 0.755 0.105 0.040 0.025 0.010 0.000 0.010 0.005 0.005 0.000
> gain$mean.resp*200
[1] 151 21 8 5 2 0 2 1 1 0
> mean(valid.df$Personal.Loan)
[1] 0.0955
> heights
[1] 7.90575916 1.09947644 0.41884817 0.26178010 0.10471204 0.00000000 0.10471204
[8] 0.05235602 0.05235602 0.00000000
>
midpoints <- barplot(heights, names.arg = gain$depth, ylim = c(0,9),
xlab = "Percentile", ylab = "Mean Response", main = "Decile-wise lift chart")
In the gains package, gains() takes:
actual: a numeric vector of actual response values
predicted: a numeric vector of predicted response values. This vector must have the same length
as actual, and the ith value of this vector needs to be the model score for the subject with the ith
value of the actual vector as its actual response.
groups: an integer giving the number of rows in the gains table. The default value is 10.
Multicollinearity
Problem: As in linear regression, highly correlated predictors (or a predictor that is a linear
combination of others) make the coefficient estimates unstable or cause estimation to fail.
Solution: Remove extreme redundancies (by dropping predictors via variable selection, or by
data reduction methods such as PCA)
Variable Selection
This is the same issue as in linear regression
1. The number of correlated predictors can grow when we create derived variables such as
interaction terms (e.g. Income x Family), to capture more complex relationships
2. Problem: Overly complex models have the danger of overfitting
3. Solution: Reduce variables via automated selection of variable subsets (as with linear
regression); see the sketch after this list
4. See Chapter 6
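One way to automate this (a sketch, not from the original slides) is backward elimination by AIC using base R's step() on the fitted model:
# Backward elimination by AIC; trace = 0 suppresses the step-by-step listing
logit.reg.step <- step(logit.reg, direction = "backward", trace = 0)
summary(logit.reg.step)   # the reduced model keeps only the surviving predictors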
Summary
1. Logistic regression is similar to linear regression, except that it is used with a categorical
response
2. It can be used for explanatory tasks (=profiling) or predictive tasks (=classification)
3. The predictors are related to the response Y via a nonlinear function called the logit
4. As in linear regression, reducing predictors can be done via variable selection
5. Logistic regression can be generalized to more than two classes (see the sketch below)
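For the multi-class case mentioned in point 5, a common option is multinomial logistic regression via nnet::multinom(); a minimal sketch (the three-level Education variable is used here purely as an illustrative outcome):
library(nnet)                                       # provides multinom()
multi.reg <- multinom(Education ~ Income + CCAvg, data = train.df)
summary(multi.reg)
head(predict(multi.reg, valid.df, type = "probs"))  # one column of probabilities per class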
Problems
1. Financial Condition of Banks. The file Banks.csv includes data on a sample of
20 banks. The “Financial Condition” column records the judgment of an expert on the
financial condition of each bank. This outcome variable takes one of two possible
values—weak or strong—according to the financial condition of the bank. The predictors
are two ratios used in the financial analysis of banks: TotLns&Lses/Assets is the ratio of
total loans and leases to total assets and TotExp/Assets is the ratio of total expenses to
total assets. The target is to use the two ratios for classifying the financial condition of a
new bank.
Run a logistic regression model (on the entire dataset) that models the status of a bank as
a function of the two financial measures provided. Specify the success class as weak (this
is similar to creating a dummy that is 1 for financially weak banks and 0 otherwise), and
use the default cutoff value of 0.5.
a. Write the estimated equation that associates the financial condition of a bank with
its two predictors in three formats:
i. The logit as a function of the predictors
ii. The odds as a function of the predictors
iii. The probability as a function of the predictors
b. Consider a new bank whose total loans and leases/assets ratio = 0.6 and
total expenses/assets ratio = 0.11. From your logistic regression model, estimate
the following four quantities for this bank (use R to do all the intermediate
calculations; show your final answers to four decimal places): the logit, the odds,
the probability of being financially weak, and the classification of the bank (use
cutoff = 0.5).
c. The cutoff value of 0.5 is used in conjunction with the probability of being
financially weak. Compute the threshold that should be used if we want to make a
classification based on the odds of being financially weak, and the threshold for
the corresponding logit.
d. Interpret the estimated coefficient for the total loans & leases to total
assets ratio (TotLns&Lses/Assets) in terms of the odds of being financially weak.
e. When a bank that is in poor financial condition is misclassified as
financially strong, the misclassification cost is much higher than when a
financially strong bank is misclassified as weak. To minimize the expected cost of
misclassification, should the cutoff value for classification (which is currently at
0.5) be increased or decreased?
2. Identifying Good System Administrators. A management consultant is studying
the roles played by experience and training in a system administrator’s ability to
complete a set of tasks in a specified amount of time. In particular, she is interested in
discriminating between administrators who are able to complete given tasks within a
specified time and those who are not. Data are collected on the performance of 75
randomly selected administrators. They are stored in the file SystemAdministrators.csv.
The variable Experience measures months of full-time system administrator experience,
while Training measures the number of relevant training credits. The outcome variable
Completed is either Yes or No, according to whether or not the administrator completed
the tasks.
a. Create a scatter plot of Experience vs. Training using color or symbol to
distinguish programmers who completed the task from those who did not
complete it. Which predictor(s) appear(s) potentially useful for classifying task
completion?
b. Run a logistic regression model with both predictors using the entire dataset as
training data. Among those who completed the task, what is the percentage of
programmers incorrectly classified as failing to complete the task?
c. To decrease the percentage in part (b), should the cutoff probability be increased
or decreased?
d. How much experience must be accumulated by a programmer with 4 years of
training before his or her estimated probability of completing the task exceeds
0.5?
3. Sales of Riding Mowers. A company that manufactures riding mowers wants to
identify the best sales prospects for an intensive sales campaign. In particular, the
manufacturer is interested in classifying households as prospective owners or nonowners
on the basis of Income (in $1000s) and Lot Size (in 1000 ft²). The marketing expert
looked at a random sample of 24 households, given in the file RidingMowers.csv. Use all
the data to fit a logistic regression of ownership on the two predictors.
a. What percentage of households in the study were owners of a riding mower?
b. Create a scatter plot of Income vs. Lot Size using color or symbol to distinguish
owners from nonowners. From the scatter plot, which class seems to have a
higher average income, owners or nonowners?
c. Among nonowners, what is the percentage of households classified correctly?
d. To increase the percentage of correctly classified nonowners, should the cutoff
probability be increased or decreased?
e. What are the odds that a household with a $60K income and a lot size of 20,000
ft² is an owner?
f. What is the classification of a household with a $60K income and a lot size of
20,000 ft²? Use cutoff = 0.5.
g. What is the minimum income that a household with a 16,000 ft² lot size should
have before it is classified as an owner?