
DS 535: ADVANCED DATA MINING FOR BUSINESS

Lecture Notes #4: Logistic Regression

(Textbook reading: Chapter 10)

Logistic Regression
• Extends the idea of linear regression to situations where the outcome variable is categorical

• Widely used, particularly where a structured model is useful to explain (= profiling) or to predict

• We focus on binary classification, i.e., Y = 0 or Y = 1

The Logit
Goal: Find a function of the predictor variables that relates them to a 0/1 outcome

• Instead of Y as the outcome variable (as in linear regression), we use a function of Y called the logit

• The logit can be modeled as a linear function of the predictors

• The logit can be mapped back to a probability, which, in turn, can be mapped to a class

Step 1: Logistic Response Function


p = probability of belonging to class 1

• Need to relate p to the predictors with a function that guarantees 0 ≤ p ≤ 1

• A standard linear function (shown below) does not guarantee 0 ≤ p ≤ 1:

$$p = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q$$

The Fix: use the logistic response function

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q)}}$$
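A quick numeric check of this property (a sketch with made-up coefficients, not the fitted model):

logistic <- function(x, b0 = -6, b1 = 0.04) 1 / (1 + exp(-(b0 + b1 * x)))
logistic(c(-500, 0, 500))   # even extreme inputs give values strictly inside (0, 1)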

Step 2: The Odds
The odds of an event are defined as:

$$\text{Odds} = \frac{p}{1 - p} \qquad \text{(eq. 10.3)}$$

Or, given the odds of an event, the probability of the event can be computed by:

$$p = \frac{\text{Odds}}{1 + \text{Odds}} \qquad \text{(eq. 10.4)}$$

We can also relate the odds to the predictors:

$$\text{Odds} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q}$$
Step 3: Take log on both sides
This gives us the logit:

$$\log(\text{Odds}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q \qquad \text{(eq. 10.6)}$$

log(Odds) is called the logit.

So the logit is a linear function of the predictors x1, x2, …, and it takes values from -infinity to +infinity.
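A small sketch of the round trip between probability, odds, and logit (an illustrative value of p, not model output):

p <- 0.8
odds <- p / (1 - p)          # eq. 10.3: 4
logit <- log(odds)           # eq. 10.6: about 1.386
odds2 <- exp(logit)          # back to the odds: 4
p2 <- odds2 / (1 + odds2)    # eq. 10.4: recovers 0.8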

[Figure: Odds (a) and logit (b) as functions of p]

Example
Personal Loan Offer (UniversalBank.csv)

Outcome variable: accept bank loan (0/1)

Predictors: Demographic info, and info about their bank relationship


Description of Predictors for Acceptance of Personal Loan Example

Age                  Customer's age in completed years
Experience           Number of years of professional experience
Income               Annual income of the customer ($000s)
Family Size          Family size of the customer
CCAvg                Average spending on credit cards per month ($000s)
Education            Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage             Value of house mortgage, if any ($000s)
Securities Account   Coded as 1 if customer has a securities account with the bank
CD Account           Coded as 1 if customer has a certificate of deposit (CD) account with the bank
Online Banking       Coded as 1 if customer uses Internet banking facilities
Credit Card          Coded as 1 if customer uses a credit card issued by Universal Bank

Single Predictor Model

Data Prep:
#### Table 10.2
bank.df <- read.csv("UniversalBank.csv")
> head(bank.df)
  ID Age Experience Income ZIP.Code Family CCAvg Education Mortgage Personal.Loan Securities.Account CD.Account Online CreditCard
1  1  25          1     49    91107      4   1.6         1        0             0                  1          0      0          0
2  2  45         19     34    90089      3   1.5         1        0             0                  1          0      0          0
3  3  39         15     11    94720      1   1.0         1        0             0                  0          0      0          0
4  4  35          9    100    94112      1   2.7         2        0             0                  0          0      0          0
5  5  35          8     45    91330      4   1.0         2        0             0                  0          0      0          1
6  6  37         13     29    92121      4   0.4         2      155             0                  0          0      1          0
>
bank.df <- bank.df[ , -c(1, 5)] # Drop ID and zip code in columns 1 and 5.
# Education is 1, 2, 3.
# Want to convert it to categorical (R will create dummy variables)
bank.df$Education <- factor(bank.df$Education, levels = c(1, 2, 3),
                            labels = c("Undergrad", "Graduate", "Advanced/Professional"))
> head(bank.df)
  Age Experience Income Family CCAvg Education Mortgage Personal.Loan Securities.Account CD.Account Online CreditCard
1  25          1     49      4   1.6 Undergrad        0             0                  1          0      0          0
2  45         19     34      3   1.5 Undergrad        0             0                  1          0      0          0
3  39         15     11      1   1.0 Undergrad        0             0                  0          0      0          0
4  35          9    100      1   2.7  Graduate        0             0                  0          0      0          0
5  35          8     45      4   1.0  Graduate        0             0                  0          0      0          1
6  37         13     29      4   0.4  Graduate      155             0                  0          0      1          0
>
# partition data
set.seed(2) # random sample can be reproduced by setting a value for seed
train.index <- sample(c(1:dim(bank.df)[1]), dim(bank.df)[1]*0.6)
train.df <- bank.df[train.index, ]
valid.df <- bank.df[-train.index, ]

Single Predictor Model: modeling loan acceptance on Income (x)

Fitted coefficients (more later): b0 = -6.3525, b1 = 0.0392
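A minimal sketch of this single-predictor fit (it reuses train.df from the data prep above; the exact estimates depend on the random training partition):

income.reg <- glm(Personal.Loan ~ Income, data = train.df, family = "binomial")
coef(income.reg)   # intercept b0 and Income slope b1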

Seeing the Relationship

Last step - classify

• The model produces an estimated probability of being a "1"

• Convert to a classification by establishing a cutoff level

• If the estimated probability > cutoff, classify as "1"

Ways to Determine the Cutoff

• 0.50 is a popular initial choice

• Additional considerations (see Chapter 5):
  - Maximize classification accuracy
  - Maximize sensitivity (subject to a minimum level of specificity)
  - Minimize false positives (subject to a maximum false negative rate)
  - Minimize the expected cost of misclassification (need to specify costs)


Estimation: Estimates of the b's are derived through an iterative process called maximum likelihood estimation (MLE)

Let's include all 12 predictors in the model now

In R, use the function glm (for generalized linear model) with

family = "binomial"

Fitting the model

> # run logistic regression
> # use glm() (generalized linear model) with family = "binomial" to fit a
> # logistic regression.
> logit.reg <- glm(Personal.Loan ~ ., data = train.df, family = "binomial")
> options(scipen=999)
> summary(logit.reg)

Call:
glm(formula = Personal.Loan ~ ., family = "binomial", data = train.df)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.0380 -0.1847 -0.0627 -0.0183 3.9810

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.6805628 2.2903370 -5.537 0.0000000308 ***
Age -0.0369346 0.0848937 -0.435 0.66351
Experience 0.0490645 0.0844410 0.581 0.56121
Income 0.0612953 0.0039762 15.416 < 0.0000000000000002 ***
Family 0.5434657 0.0994936 5.462 0.0000000470 ***
CCAvg 0.2165942 0.0601900 3.599 0.00032 ***
EducationGraduate 4.2681068 0.3703378 11.525 < 0.0000000000000002 ***
EducationAdvanced/Professional 4.4408154 0.3723360 11.927 < 0.0000000000000002 ***
Mortgage 0.0015499 0.0007926 1.955 0.05052 .
Securities.Account -1.1457476 0.3955796 -2.896 0.00377 **
CD.Account 4.5855656 0.4777696 9.598 < 0.0000000000000002 ***
Online -0.8588074 0.2191217 -3.919 0.0000888005 ***
CreditCard -1.2514213 0.2944767 -4.250 0.0000214111 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1901.71 on 2999 degrees of freedom


Residual deviance: 682.19 on 2987 degrees of freedom
AIC: 708.19

Number of Fisher Scoring iterations: 8

# To assess the relative importance of individual predictors in the model,
# we can look at the absolute value of the z-statistic for each model
# parameter (the "z value" column in the summary above; the caret
# documentation calls this the t-statistic). This technique is used by the
# varImp function in the caret package.

library(caret)   # loaded here so that varImp() is available
varImp(logit.reg)

names(logit.reg)
# the components listed by names() can be accessed with logit.reg$
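For example (standard components of a fitted glm object):

logit.reg$coefficients          # same as coef(logit.reg)
head(logit.reg$fitted.values)   # in-sample predicted probabilities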

Evaluating Classification Performance
Performance measures: Confusion matrix and % of misclassifications

# some libraries
install.packages("e1071")
library(e1071) # misc functions in stat, prob
install.packages("Rcpp")
library(Rcpp) # R and C++ integration
install.packages("caret")
library(caret) # Classification And REgression Training

# evaluate
# predict(logit.reg, valid.df) by default returns predictions on the
# linear (logit) scale; adding type = "response" returns the predicted
# probabilities from the logistic model instead
pred <- predict(logit.reg, valid.df, type = "response")


# the same as
# pred1 <- predict(logit.reg, valid.df[,-8], type = "response")

# evaluate
> # get the confusion Matrix using table
> c.mat <- table(ifelse(pred > 0.5, 1, 0), valid.df[,8])
> c.mat

0 1
0 1779 58
1 30 133

> sum(diag(c.mat))/sum(c.mat) # this gives accuracy


[1] 0.956
>
> # number of 0 and 1 in valid.df
> table(valid.df[,8])
0 1
1809 191
>
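As a baseline for the 0.956 accuracy above, a naive rule that classifies every validation customer as 0 (the majority class) already gets 1809/2000 correct:

> max(table(valid.df[,8])) / length(valid.df[,8])
[1] 0.9045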
# sort(pred, decreasing = TRUE)

>

#ROC Curve and other capabilities of "InformationValue"

install.packages("InformationValue")
library(InformationValue)

> plotROC(valid.df[,8], pred)

> confusionMatrix(valid.df[,8], pred, threshold = 0.5)


0 1
0 1779 58
1 30 133

> misClassError(valid.df[,8], pred, threshold = 0.5)


[1] 0.044

> # Compute the optimal probability cutoff score, based on a
> # user-defined objective
> optimalCutoff(valid.df[,8], pred, optimiseFor = "misclasserror",
+               returnDiagnostics = TRUE)

$optimalCutoff
[1] 0.4799484

$sensitivityTable
    CUTOFF      FPR          TPR         YOUDENSINDEX SPECIFICITY MISCLASSERROR
1   0.999948397 0            0.005235602 0.005235602  1.0000000   0.0950
2   0.989948397 0.0005527916 0.2041885   0.203635690  0.9994472   0.0765
3   0.979948397 0.0005527916 0.3036649   0.303112130  0.9994472   0.0670
4   0.969948397 0.0005527916 0.3350785   0.334525742  0.9994472   0.0640
... (output skipped)
48  0.529948397 0.013267     0.6701571   0.656890070  0.9867330   0.0435
49  0.519948397 0.01437258   0.6806283   0.666255691  0.9856274   0.0435
50  0.509948397 0.01547816   0.6963351   0.680856914  0.9845218   0.0430
51  0.499948397 0.01658375   0.6963351   0.679751331  0.9834163   0.0440
52  0.489948397 0.01658375   0.7015707   0.684986933  0.9834163   0.0435
53  0.479948397 0.01713654   0.7120419   0.694905345  0.9828635   0.0430
54  0.469948397 0.01768933   0.7120419   0.694352554  0.9823107   0.0435
55  0.459948397 0.01824212   0.7172775   0.699035364  0.9817579   0.0435
... (output skipped)
95  0.059948397 0.129906     0.9005236   0.770617535  0.8700940   0.1270
96  0.049948397 0.1448314    0.9162304   0.771398968  0.8551686   0.1390
97  0.039948397 0.1658375    0.921466    0.755628489  0.8341625   0.1575
98  0.029948397 0.2012161    0.9267016   0.725485429  0.7987839   0.1890
99  0.019948397 0.2399116    0.9424084   0.702496824  0.7600884   0.2225
100 0.009948397 0.3228303    0.9685864   0.645756094  0.6771697   0.2950

$misclassificationError
[1] 0.043

$TPR
[1] 0.7120419

$FPR
[1] 0.01713654

$Specificity
[1] 0.9828635

>
> # The criterion for which the probability cutoff score is to be optimised.
> # Can take one of the following values: "Ones", "Zeros", "Both", or
> # "misclasserror" (default). If "Ones" is used, optimalCutoff is chosen to
> # maximise detection of "Ones". If "Both" is specified, the probability cutoff
> # that gives the maximum Youden's index is chosen. If "misclasserror" is
> # specified, the probability cutoff that gives the minimum misclassification
> # error is chosen.
>
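As a check on the sensitivity table above, Youden's index is sensitivity + specificity - 1 (equivalently TPR - FPR); for cutoff row 53, 0.7120419 + 0.9828635 - 1 = 0.6949054, matching the reported YOUDENSINDEX.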

Converting to Probability

$$p = \frac{\text{Odds}}{1 + \text{Odds}}$$

The predict function (with type = "response") does the conversion from logit to probability

> #### Table 10.3
>
> # use predict() with type = "response" to compute predicted probabilities.
> logit.reg.pred <- predict(logit.reg, valid.df[, -8], type = "response")
>
> # first 5 actual and predicted records
> data.frame(actual = valid.df$Personal.Loan[1:5],
+            predicted = logit.reg.pred[1:5])
actual predicted
2 0 0.00002707663
6 0 0.00326343313
9 0 0.03966293189
10 1 0.98846040544
11 0 0.59933974797
> # same as
> # data.frame(actual = valid.df$Personal.Loan[1:5], predicted = pred[1:5])
>
> sort(pred, decreasing = TRUE)
           2957            2813            4283             783            4311            1015            2357             483
0.9999483968208 0.9998765031437 0.9997600847462 0.9997511337986 0.9994943018654 0.9993746369214 0.9993590055815 0.9991725259612
           1412            3369             289              48            1936             972            2047             132
0.9991002979159 0.9990119318105 0.9989045045567 0.9988276036858 0.9987851429399 0.9986437054436 0.9984859004053 0.9984301560280
(output skipped)
           3080            1507            3825            2835            1101            1483            1545            2599
0.0000029329685 0.0000028524356 0.0000026253577 0.0000025433706 0.0000024437677 0.0000022997677 0.0000020775111 0.0000020074794
           4474            4267             778             940            1531            3126             552            4512
0.0000017726574 0.0000016705920 0.0000016452636 0.0000011837855 0.0000011409230 0.0000009971214 0.0000009389717 0.0000007090812

>

Interpreting Odds, Probability

• For predictive classification, we typically use the probability with a cutoff value

• For explanatory purposes, the odds have a useful interpretation:

If we increase x1 by one unit, holding x2, x3, …, xq constant, then e^(b1) is the factor by which the odds of belonging to class 1 are multiplied
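A short sketch of this interpretation using the fitted 12-predictor model from above:

exp(coef(logit.reg))   # multiplicative effect on the odds of a one-unit increase
# e.g., for Income: exp(0.0613) is about 1.063, so each extra $1,000 of annual
# income multiplies the odds of accepting the loan by about 1.063, holding the
# other predictors fixed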

Loan Example:
Evaluating Classification Performance
Performance measures: Confusion matrix and % of misclassifications

More useful in this example: lift (or gains) chart

The "lift" over the base curve indicates, for a given number of cases (read on the x-axis), the additional responders that you can identify by using the model.
The same information is portrayed in the decile-wise lift chart: taking the 10% of the records that are ranked by the model as "most probable 1's" yields 7.9 times as many 1's as simply selecting 10% of the records at random.

#### Figure 10.3
> install.packages("gains") # to get lift chart
library(gains)
gain <- gains(valid.df$Personal.Loan, logit.reg.pred, groups=10)

> class(gain)
[1] "gains"
> names(gain)
 [1] "depth"             "obs"               "cume.obs"          "mean.resp"         "cume.mean.resp"    "cume.pct.of.total"
 [7] "lift"              "cume.lift"         "mean.prediction"   "min.prediction"    "max.prediction"    "conf"
[13] "optimal"           "num.groups"        "percents"
>

> data.frame(c(0,gain$cume.pct.of.total*sum(valid.df$Personal.Loan)) ,
+ c(0,gain$cume.obs) )
c.0..gain.cume.pct.of.total...sum.valid.df.Personal.Loan.. c.0..gain.cume.obs.
1 0 0
2 151 200
3 172 400
4 180 600
5 185 800
6 187 1000
7 187 1200
8 189 1400
9 190 1600
10 191 1800
11 191 2000
>
> data.frame( c(0,sum(valid.df$Personal.Loan)) , c(0, dim(valid.df)[1]) )
c.0..sum.valid.df.Personal.Loan.. c.0..dim.valid.df..1..
1 0 0
2 191 2000
>
# plot lift chart
plot(c(0,gain$cume.pct.of.total*sum(valid.df$Personal.Loan))~c(0,gain$cume.obs),
xlab="# cases", ylab="Cumulative", main="", type="l")
lines(c(0,sum(valid.df$Personal.Loan))~c(0, dim(valid.df)[1]), lty=2)

> # compute deciles and plot decile-wise chart
> heights <- gain$mean.resp/mean(valid.df$Personal.Loan)
>
> gain$mean.resp
[1] 0.755 0.105 0.040 0.025 0.010 0.000 0.010 0.005 0.005 0.000
> gain$mean.resp*200
[1] 151 21 8 5 2 0 2 1 1 0
> mean(valid.df$Personal.Loan)
[1] 0.0955
> heights
 [1] 7.90575916 1.09947644 0.41884817 0.26178010 0.10471204 0.00000000 0.10471204 0.05235602 0.05235602 0.00000000
>
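As a check, the first bar equals gain$mean.resp[1] / mean(valid.df$Personal.Loan) = 0.755 / 0.0955, about 7.91: the "7.9 times as many 1's" quoted earlier.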

midpoints <- barplot(heights, names.arg = gain$depth, ylim = c(0,9),
                     xlab = "Percentile", ylab = "Mean Response",
                     main = "Decile-wise lift chart")

# add labels to columns
text(midpoints, heights+0.5, labels=round(heights, 1), cex = 0.8)

In the gains package:
actual: a numeric vector of actual response values.
predicted: a numeric vector of predicted response values. This vector must have the same length as actual, and its ith value must be the model score for the subject whose actual response is the ith value of the actual vector.
groups: an integer giving the number of rows in the gains table. The default value is 10.

Multicollinearity

Problem: As in linear regression, if one predictor is a linear combination of other predictor(s), model estimation will fail. Note that in such a case, we have at least one redundant predictor.

Solution: Remove extreme redundancies (by dropping predictors via variable selection, or by data reduction methods such as PCA).
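One possible diagnostic is a variance inflation factor check (a sketch; it assumes the car package, which these notes do not otherwise use):

# install.packages("car")
library(car)
vif(logit.reg)   # generalized VIFs; large values flag near-redundant predictors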

Variable Selection
This is the same issue as in linear regression:
1. The number of correlated predictors can grow when we create derived variables such as interaction terms (e.g., Income x Family) to capture more complex relationships (see the sketch after this list)
2. Problem: Overly complex models have the danger of overfitting
3. Solution: Reduce variables via automated selection of variable subsets (as with linear regression)
4. See Chapter 6
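A hypothetical sketch of item 1, adding an Income x Family interaction to the model fit earlier:

logit.int <- glm(Personal.Loan ~ . + Income:Family,
                 data = train.df, family = "binomial")
summary(logit.int)   # more terms to estimate, hence a greater risk of overfitting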

P-values for Predictors


1. Test null hypothesis that coefficient = 0
2. Useful for review to determine whether to include variable in model
3. Important in profiling tasks, but less important in predictive classification

Summary
1. Logistic regression is similar to linear regression, except that it is used with a categorical
response
2. It can be used for explanatory tasks (=profiling) or predictive tasks (=classification)
3. The predictors are related to the response Y via the logit: the log odds of Y = 1 is linear in the predictors, so the probability is a nonlinear (logistic) function of them
4. As in linear regression, reducing predictors can be done via variable selection
5. Logistic regression can be generalized to more than two classes

Problems
1. Financial Condition of Banks. The file Banks.csv includes data on a sample of
20 banks. The “Financial Condition” column records the judgment of an expert on the
financial condition of each bank. This outcome variable takes one of two possible
values—weak or strong—according to the financial condition of the bank. The predictors
are two ratios used in the financial analysis of banks: TotLns&Lses/Assets is the ratio of
total loans and leases to total assets and TotExp/Assets is the ratio of total expenses to
total assets. The target is to use the two ratios for classifying the financial condition of a
new bank.
Run a logistic regression model (on the entire dataset) that models the status of a bank as
a function of the two financial measures provided. Specify the success class as weak (this
is similar to creating a dummy that is 1 for financially weak banks and 0 otherwise), and
use the default cutoff value of 0.5.
a. Write the estimated equation that associates the financial condition of a bank with
its two predictors in three formats:
i. The logit as a function of the predictors
ii. The odds as a function of the predictors
iii. The probability as a function of the predictors
b. Consider a new bank whose total loans and leases/assets ratio = 0.6 and
total expenses/assets ratio = 0.11. From your logistic regression model, estimate
the following four quantities for this bank (use R to do all the intermediate
calculations; show your final answers to four decimal places): the logit, the odds,
the probability of being financially weak, and the classification of the bank (use
cutoff = 0.5).
c. The cutoff value of 0.5 is used in conjunction with the probability of being
financially weak. Compute the threshold that should be used if we want to make a
classification based on the odds of being financially weak, and the threshold for
the corresponding logit.
d. Interpret the estimated coefficient for the total loans & leases to total
assets ratio (TotLns&Lses/Assets) in terms of the odds of being financially weak.
e. When a bank that is in poor financial condition is misclassified as
financially strong, the misclassification cost is much higher than when a
financially strong bank is misclassified as weak. To minimize the expected cost of
misclassification, should the cutoff value for classification (which is currently at
0.5) be increased or decreased?
2. Identifying Good System Administrators. A management consultant is studying
the roles played by experience and training in a system administrator’s ability to
complete a set of tasks in a specified amount of time. In particular, she is interested in
discriminating between administrators who are able to complete given tasks within a
specified time and those who are not. Data are collected on the performance of 75
randomly selected administrators. They are stored in the file SystemAdministrators.csv.

The variable Experience measures months of full-time system administrator experience,
while Training measures the number of relevant training credits. The outcome variable
Completed is either Yes or No, according to whether or not the administrator completed
the tasks.
a. Create a scatter plot of Experience vs. Training using color or symbol to
distinguish programmers who completed the task from those who did not
complete it. Which predictor(s) appear(s) potentially useful for classifying task
completion?
b. Run a logistic regression model with both predictors using the entire dataset as
training data. Among those who completed the task, what is the percentage of
programmers incorrectly classified as failing to complete the task?
c. To decrease the percentage in part (b), should the cutoff probability be increased
or decreased?
d. How much experience must be accumulated by a programmer with 4 years of
training before his or her estimated probability of completing the task exceeds
0.5?
3. Sales of Riding Mowers. A company that manufactures riding mowers wants to
identify the best sales prospects for an intensive sales campaign. In particular, the
manufacturer is interested in classifying households as prospective owners or nonowners
on the basis of Income (in $1000s) and Lot Size (in 1000 ft2). The marketing expert
looked at a random sample of 24 households, given in the file RidingMowers.csv. Use all
the data to fit a logistic regression of ownership on the two predictors.
a. What percentage of households in the study were owners of a riding mower?
b. Create a scatter plot of Income vs. Lot Size using color or symbol to distinguish
owners from nonowners. From the scatter plot, which class seems to have a
higher average income, owners or nonowners?
c. Among nonowners, what is the percentage of households classified correctly?
d. To increase the percentage of correctly classified nonowners, should the cutoff
probability be increased or decreased?
e. What are the odds that a household with a $60K income and a lot size of 20,000
ft2 is an owner?
f. What is the classification of a household with a $60K income and a lot size of
20,000 ft2? Use cutoff = 0.5.
g. What is the minimum income that a household with 16,000 ft2 lot size should
have before it is classified as an owner?

