Note 4
Logistic Regression
Extends the idea of linear regression to situations where the outcome variable is categorical
Widely used, particularly where a structured model is useful to explain (=profiling) or to predict
The Logit
Goal: Find a function of the predictor variables that relates them to a 0/1 outcome.
The straight-line model
$p = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q$
is problematic: the right-hand side is not restricted to the interval [0, 1], but a probability must be.
The Fix:
use the logistic response function
$p = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q)}}$
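As a quick numeric check (a sketch that is not part of the original slides; the coefficient values below are made up purely for illustration), the logistic response function can be evaluated directly in R, and base R's plogis() gives the same result:
# Hypothetical coefficients, chosen only to illustrate the formula
b0 <- -3; b1 <- 0.05; b2 <- 0.8
x1 <- 100; x2 <- 2
eta <- b0 + b1 * x1 + b2 * x2   # linear predictor: beta0 + beta1*x1 + beta2*x2
p <- 1 / (1 + exp(-eta))        # logistic response function: always in (0, 1)
p                               # about 0.973
plogis(eta)                     # same value via base R's logistic distribution function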
Step 2: The Odds
The odds of an event are defined as:
$\text{Odds} = \dfrac{p}{1 - p}$    (eq. 10.3)
Or, given the odds of an event, the probability of the event can be computed by:
$p = \dfrac{\text{Odds}}{1 + \text{Odds}}$    (eq. 10.4)
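A tiny numeric sketch (illustrative values only) of the two conversions:
# eq. 10.3: odds from probability; eq. 10.4: probability from odds
p <- 0.8
odds <- p / (1 - p)     # 4: the event is 4 times as likely to occur as not
odds / (1 + odds)       # back to the probability, 0.8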
[Figure: Odds (a) and Logit (b) as functions of p]
Example
Personal Loan Offer (UniversalBank.csv)
Single Predictor Model
Data Prep:
#### Table 10.2
bank.df <- read.csv("UniversalBank.csv")
head(bank.df)
> head(bank.df)
  ID Age Experience Income ZIP.Code Family CCAvg Education Mortgage Personal.Loan Securities.Account CD.Account Online CreditCard
1  1  25          1     49    91107      4   1.6         1        0             0                  1          0      0          0
2  2  45         19     34    90089      3   1.5         1        0             0                  1          0      0          0
3  3  39         15     11    94720      1   1.0         1        0             0                  0          0      0          0
4  4  35          9    100    94112      1   2.7         2        0             0                  0          0      0          0
5  5  35          8     45    91330      4   1.0         2        0             0                  0          0      0          1
6  6  37         13     29    92121      4   0.4         2      155             0                  0          0      1          0
>
bank.df <- bank.df[ , -c(1, 5)] # Drop ID and zip code in columns 1 and 5.
# Education is 1, 2, 3.
# Want to convert it to categorical (R will create dummy variables)
bank.df$Education <- factor(bank.df$Education, levels = c(1, 2, 3),
labels = c("Undergrad", "Graduate",
"Advanced/Professional"))
head(bank.df)
Age Experience Income Family CCAvg Education Mortgage Personal.Loan Securities.Account CD.Account Online CreditCard
1 25 1 49 4 1.6 Undergrad 0 0 1 0 0 0
2 45 19 34 3 1.5 Undergrad 0 0 1 0 0 0
3 39 15 11 1 1.0 Undergrad 0 0 0 0 0 0
4 35 9 100 1 2.7 Graduate 0 0 0 0 0 0
5 35 8 45 4 1.0 Graduate 0 0 0 0 0 1
6 37 13 29 4 0.4 Graduate 155 0 0 0 1 0
>
# partition data
set.seed(2) # random sample can be reproduced by setting a value for seed
train.index <- sample(c(1:dim(bank.df)[1]), dim(bank.df)[1]*0.6)
train.df <- bank.df[train.index, ]
valid.df <- bank.df[-train.index, ]
Seeing the Relationship
Estimation: Estimates of the β's are derived through an iterative process called maximum likelihood estimation (MLE)
> # run logistic regression
> # use glm() (generalized linear model) with family = "binomial" to fit a logistic
> # regression.
> logit.reg <- glm(Personal.Loan ~ ., data = train.df, family = "binomial")
> options(scipen=999)
> summary(logit.reg)
Call:
glm(formula = Personal.Loan ~ ., family = "binomial", data = train.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0380 -0.1847 -0.0627 -0.0183 3.9810
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.6805628 2.2903370 -5.537 0.0000000308 ***
Age -0.0369346 0.0848937 -0.435 0.66351
Experience 0.0490645 0.0844410 0.581 0.56121
Income 0.0612953 0.0039762 15.416 < 0.0000000000000002 ***
Family 0.5434657 0.0994936 5.462 0.0000000470 ***
CCAvg 0.2165942 0.0601900 3.599 0.00032 ***
EducationGraduate 4.2681068 0.3703378 11.525 < 0.0000000000000002 ***
EducationAdvanced/Professional 4.4408154 0.3723360 11.927 < 0.0000000000000002 ***
Mortgage 0.0015499 0.0007926 1.955 0.05052 .
Securities.Account -1.1457476 0.3955796 -2.896 0.00377 **
CD.Account 4.5855656 0.4777696 9.598 < 0.0000000000000002 ***
Online -0.8588074 0.2191217 -3.919 0.0000888005 ***
CreditCard -1.2514213 0.2944767 -4.250 0.0000214111 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
varImp(logit.reg) # variable importance; varImp() is from the caret package (loaded below)
names(logit.reg)
# the components listed by names() can be accessed with logit.reg$<name>
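Because the coefficients are on the logit scale, exponentiating them gives odds multipliers. A short sketch (not in the original code) using the fitted model:
# exp(coefficient) = factor by which the odds of Personal.Loan = 1 are multiplied
# for a one-unit increase in that predictor, holding the other predictors fixed.
round(exp(coef(logit.reg)), 4)
# e.g., exp(0.0613) is about 1.06: each one-unit increase in Income multiplies the
# estimated odds of accepting the loan offer by roughly 1.06.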
Evaluating Classification Performance
Performance measures: Confusion matrix and % of misclassifications
# some libraries
install.packages("e1071")
library(e1071) # misc functions in stat, prob
install.packages("Rcpp")
library(Rcpp) # R and C++ integration
install.packages("caret")
library(caret) # Classification And REgression Training
# evaluate
# By default, predict(logit.reg, valid.df) returns predictions on the scale of the
# linear predictor, i.e., the logit (log odds).
# Adding type = "response" returns the predicted probabilities instead.
pred <- predict(logit.reg, valid.df, type = "response")
> # get the confusion matrix using table(); column 8 of valid.df is Personal.Loan
> c.mat <- table(ifelse(pred > 0.5, 1, 0), valid.df[,8])
> c.mat
0 1
0 1779 58
1 30 133
>
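The caret package loaded above also provides confusionMatrix(), which adds accuracy, sensitivity, and specificity to the raw counts; a sketch using the same cutoff (assumes pred as computed above):
# caret's confusionMatrix() expects factors with matching levels; "1" is the positive class
confusionMatrix(factor(ifelse(pred > 0.5, 1, 0), levels = c(0, 1)),
                factor(valid.df$Personal.Loan, levels = c(0, 1)),
                positive = "1")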
#ROC Curve and other capabilities of "InformationValue"
install.packages("InformationValue")
library(InformationValue)
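The output below was produced with functions from the InformationValue package; a minimal sketch of the kind of calls involved (argument names as in that package's documentation; the exact criterion used for the slides' output is not shown, so treat this as illustrative):
# ROC curve for the validation-set predictions
plotROC(actuals = valid.df$Personal.Loan, predictedScores = pred)
# Search over probability cutoffs; returnDiagnostics = TRUE returns a list with the
# optimal cutoff, a sensitivity table, and misclassification measures (shown below).
optimalCutoff(actuals = valid.df$Personal.Loan, predictedScores = pred,
              optimiseFor = "Both", returnDiagnostics = TRUE)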
$optimalCutoff
[1] 0.4799484
$sensitivityTable
CUTOFF FPR TPR YOUDENSINDEX SPECIFICITY MISCLASSERROR
1 0.999948397 0 0.005235602 0.005235602 1.0000000 0.0950
2 0.989948397 0.0005527916 0.2041885 0.203635690 0.9994472 0.0765
3 0.979948397 0.0005527916 0.3036649 0.303112130 0.9994472 0.0670
4 0.969948397 0.0005527916 0.3350785 0.334525742 0.9994472 0.0640
... (remaining rows of the sensitivity table skipped)
$misclassificationError
[1] 0.043
$TPR
[1] 0.7120419
$FPR
[1] 0.01713654
$Specificity
[1] 0.9828635
>
> # optimiseFor: the maximization criterion for which the probability cutoff is optimised.
> # Can take one of the following values: "Ones", "Zeros", "Both", or "misclasserror"
> # (the default). If "Ones" is used, the optimal cutoff is chosen to maximise detection
> # of ones. If "Both" is specified, the probability cutoff that gives the maximum
> # Youden's index is chosen. If "misclasserror" is specified, the probability cutoff
> # that gives the minimum misclassification error is chosen.
>
Converting to Probability
$p = \dfrac{\text{Odds}}{1 + \text{Odds}}$
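Applying the same conversion to the loan model: predictions on the link scale are logits, which can be turned into odds and then probabilities. A sketch (assumes logit.reg, valid.df, and pred from above):
logit.pred <- predict(logit.reg, valid.df, type = "link")   # log(odds); "link" is the default
odds.pred <- exp(logit.pred)                                # odds of accepting the loan
prob.pred <- odds.pred / (1 + odds.pred)                    # eq. 10.4
all.equal(as.numeric(prob.pred), as.numeric(pred))          # matches type = "response"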
Loan Example:
Evaluating Classification Performance
Performance measures: Confusion matrix and % of misclassifications
The “lift” over the base curve indicates, for a given number of cases (read on the x-axis), the
additional responders that you can identify by using the model.
The same information is portrayed in the decile-wise lift chart: taking the 10% of the records
that are ranked by the model as “most probable 1’s” yields 7.9 times as many 1’s as simply
selecting 10% of the records at random would.
#### Figure 10.3
> install.packages("gains") # to get lift chart
library(gains)
gain <- gains(valid.df$Personal.Loan, pred, groups = 10) # pred: predicted probabilities from above
> class(gain)
[1] "gains"
> names(gain)
 [1] "depth"           "obs"             "cume.obs"        "mean.resp"       "cume.mean.resp"  "cume.pct.of.total"
 [7] "lift"            "cume.lift"       "mean.prediction" "min.prediction"  "max.prediction"  "conf"
[13] "optimal"         "num.groups"      "percents"
>
> data.frame(c(0,gain$cume.pct.of.total*sum(valid.df$Personal.Loan)) ,
+ c(0,gain$cume.obs) )
c.0..gain.cume.pct.of.total...sum.valid.df.Personal.Loan.. c.0..gain.cume.obs.
1 0 0
2 151 200
3 172 400
4 180 600
5 185 800
6 187 1000
7 187 1200
8 189 1400
9 190 1600
10 191 1800
11 191 2000
>
> data.frame( c(0,sum(valid.df$Personal.Loan)) , c(0, dim(valid.df)[1]) )
c.0..sum.valid.df.Personal.Loan.. c.0..dim.valid.df..1..
1 0 0
2 191 2000
>
# plot lift chart
plot(c(0,gain$cume.pct.of.total*sum(valid.df$Personal.Loan))~c(0,gain$cume.obs),
xlab="# cases", ylab="Cumulative", main="", type="l")
lines(c(0,sum(valid.df$Personal.Loan))~c(0, dim(valid.df)[1]), lty=2)
> # compute deciles and plot decile-wise chart
> heights <- gain$mean.resp/mean(valid.df$Personal.Loan)
>
> gain$mean.resp
[1] 0.755 0.105 0.040 0.025 0.010 0.000 0.010 0.005 0.005 0.000
> gain$mean.resp*200
[1] 151 21 8 5 2 0 2 1 1 0
> mean(valid.df$Personal.Loan)
[1] 0.0955
> heights
[1] 7.90575916 1.09947644 0.41884817 0.26178010 0.10471204 0.00000000 0.10471204
[8] 0.05235602 0.05235602 0.00000000
>
midpoints <- barplot(heights, names.arg = gain$depth, ylim = c(0,9),
xlab = "Percentile", ylab = "Mean Response", main = "Decile-wise lift chart")
In the gains package, gains() takes:
actual: a numeric vector of actual response values
predicted: a numeric vector of predicted response values. This vector must have the same length
as actual, and the ith value of this vector needs to be the model score for the subject with the ith
value of the actual vector as its actual response.
groups: an integer giving the number of rows in the gains table. The default value is 10.
Multicollinearity
Problem: As in linear regression, highly correlated predictors (or a predictor that is a linear
combination of others) make the coefficient estimates unstable or cause estimation to fail.
Solution: Remove extreme redundancies (by dropping predictors via variable selection, or by
data reduction methods such as PCA)
Variable Selection
This is the same issue as in linear regression
1. The number of correlated predictors can grow when we create derived variables such as
interaction terms (e.g. Income x Family), to capture more complex relationships
2. Problem: Overly complex models have the danger of overfitting
3. Solution: Reduce variables via automated selection of variable subsets (as with linear
regression); see the sketch after this list
4. See Chapter 6
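One way to automate this (a sketch, not from the original slides) is backward elimination by AIC using base R's step() on the fitted model:
# Backward elimination by AIC; trace = 0 suppresses the step-by-step listing
logit.reg.step <- step(logit.reg, direction = "backward", trace = 0)
summary(logit.reg.step)   # the reduced model keeps only the surviving predictors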
Summary
1. Logistic regression is similar to linear regression, except that it is used with a categorical
response
2. It can be used for explanatory tasks (=profiling) or predictive tasks (=classification)
3. The predictors are related to the response Y via a nonlinear function called the logit
4. As in linear regression, reducing predictors can be done via variable selection
5. Logistic regression can be generalized to more than two classes (see the sketch below)
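For the multi-class case mentioned in point 5, a common option is multinomial logistic regression via nnet::multinom(); a minimal sketch (the three-level Education variable is used here purely as an illustrative outcome):
library(nnet)                                       # provides multinom()
multi.reg <- multinom(Education ~ Income + CCAvg, data = train.df)
summary(multi.reg)
head(predict(multi.reg, valid.df, type = "probs"))  # one column of probabilities per class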
Problems
1. Financial Condition of Banks. The file Banks.csv includes data on a sample of
20 banks. The “Financial Condition” column records the judgment of an expert on the
financial condition of each bank. This outcome variable takes one of two possible
values—weak or strong—according to the financial condition of the bank. The predictors
are two ratios used in the financial analysis of banks: TotLns&Lses/Assets is the ratio of
total loans and leases to total assets and TotExp/Assets is the ratio of total expenses to
total assets. The target is to use the two ratios for classifying the financial condition of a
new bank.
Run a logistic regression model (on the entire dataset) that models the status of a bank as
a function of the two financial measures provided. Specify the success class as weak (this
is similar to creating a dummy that is 1 for financially weak banks and 0 otherwise), and
use the default cutoff value of 0.5.
a. Write the estimated equation that associates the financial condition of a bank with
its two predictors in three formats:
i. The logit as a function of the predictors
ii. The odds as a function of the predictors
iii. The probability as a function of the predictors
b. Consider a new bank whose total loans and leases/assets ratio = 0.6 and
total expenses/assets ratio = 0.11. From your logistic regression model, estimate
the following four quantities for this bank (use R to do all the intermediate
calculations; show your final answers to four decimal places): the logit, the odds,
the probability of being financially weak, and the classification of the bank (use
cutoff = 0.5).
c. The cutoff value of 0.5 is used in conjunction with the probability of being
financially weak. Compute the threshold that should be used if we want to make a
classification based on the odds of being financially weak, and the threshold for
the corresponding logit.
d. Interpret the estimated coefficient for the total loans & leases to total
assets ratio (TotLns&Lses/Assets) in terms of the odds of being financially weak.
e. When a bank that is in poor financial condition is misclassified as
financially strong, the misclassification cost is much higher than when a
financially strong bank is misclassified as weak. To minimize the expected cost of
misclassification, should the cutoff value for classification (which is currently at
0.5) be increased or decreased?
2. Identifying Good System Administrators. A management consultant is studying
the roles played by experience and training in a system administrator’s ability to
complete a set of tasks in a specified amount of time. In particular, she is interested in
discriminating between administrators who are able to complete given tasks within a
specified time and those who are not. Data are collected on the performance of 75
randomly selected administrators. They are stored in the file SystemAdministrators.csv.
The variable Experience measures months of full-time system administrator experience,
while Training measures the number of relevant training credits. The outcome variable
Completed is either Yes or No, according to whether or not the administrator completed
the tasks.
a. Create a scatter plot of Experience vs. Training using color or symbol to
distinguish programmers who completed the task from those who did not
complete it. Which predictor(s) appear(s) potentially useful for classifying task
completion?
b. Run a logistic regression model with both predictors using the entire dataset as
training data. Among those who completed the task, what is the percentage of
programmers incorrectly classified as failing to complete the task?
c. To decrease the percentage in part (b), should the cutoff probability be increased
or decreased?
d. How much experience must be accumulated by a programmer with 4 years of
training before his or her estimated probability of completing the task exceeds
0.5?
3. Sales of Riding Mowers. A company that manufactures riding mowers wants to
identify the best sales prospects for an intensive sales campaign. In particular, the
manufacturer is interested in classifying households as prospective owners or nonowners
on the basis of Income (in $1000s) and Lot Size (in 1000 ft²). The marketing expert
looked at a random sample of 24 households, given in the file RidingMowers.csv. Use all
the data to fit a logistic regression of ownership on the two predictors.
a. What percentage of households in the study were owners of a riding mower?
b. Create a scatter plot of Income vs. Lot Size using color or symbol to distinguish
owners from nonowners. From the scatter plot, which class seems to have a
higher average income, owners or nonowners?
c. Among nonowners, what is the percentage of households classified correctly?
d. To increase the percentage of correctly classified nonowners, should the cutoff
probability be increased or decreased?
e. What are the odds that a household with a $60K income and a lot size of 20,000
ft² is an owner?
f. What is the classification of a household with a $60K income and a lot size of
20,000 ft²? Use cutoff = 0.5.
g. What is the minimum income that a household with a 16,000 ft² lot size should
have before it is classified as an owner?