Logistic Regression

Logistic regression is used when the target variable is binary or categorical. It models the relationship between predictor variables and the log odds of the target variable. Some key differences from linear regression are that the target variable is not continuous and the logistic curve is S-shaped and bounded between 0 and 1. The document provides an example using the German Credit dataset to predict credit risk, and discusses partitioning data, model building, interpretation and evaluation using measures like ROC curve, Hosmer-Lemeshow test and Kolmogorov-Smirnov chart.

Uploaded by

Subodh Kumar


Business Analytics

Chapter - 1

Logistic Regression
Linear Regression

Recall:
Linear regression is the process of
finding a best fitting straight line
passing through points with the
objective of being able to use the
equation of the line as a model for
prediction.
The key assumption is that both
the predictor and target variables
are continuous as seen in this chart
below. Intuitively, one can state that
when X increases, Y increases along
the slope of the line.
Logistic Regression - Introduction

Logistic regression is a form of regression where the dependent
variable (outcome variable) is dichotomous (binary) and the
independent variables (influencing factors) can be either continuous or
categorical. In other words, it predicts a categorical outcome.

Logistic regression can be binomial (binary) or multinomial.

In binary logistic regression, the outcome can have only two
possible types of values, e.g. "Yes" or "No", "Success" or "Failure".

The outcome is coded as "0" and "1" in binary logistic regression.

These models are extensively used in the banking, finance and
telecommunication sectors to estimate the risks involved.
Multinomial logistic regression refers to cases where the outcome can
have three or more possible types of values, e.g. "good" vs. "very good"
vs. "best".
Examples
You might want to predict who will win the next T20 World Cup,
based on the players' strength and other details.
Imagine you want to predict the outcome of an election. Here
your outcome variable is binary i.e. Win/Lose. Factors influencing
this outcome could be the amount of money spent on the
campaign, amount of time spent campaigning, previous
election history, etc.
You might be interested in factors that influence (or explain)
whether or not a person in the U.S. owns a U.S.-made or foreign-
made (non-U.S.) car. It would be natural to code owning a U.S.
car as a 1 and owning a foreign car as a 0. So the Y-variable in the
logistic regression is whether or not a person owns a U.S. car
coded as a 1 if he or she does and a 0 otherwise.
Logistic Curve
The logistic model: The logistic curve, illustrated below, is better
for modeling binary dependent variables coded 0 or 1 because it
hugs the y = 0 and y = 1 levels where the observed points lie.
Moreover, the logistic function is bounded by 0 and 1, so its output
can be read as a probability.
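The boundedness of the logistic curve can be checked numerically. The following is a minimal Python sketch (the slides themselves use R); the function name logistic and the sample z values are chosen here for illustration only.

```python
import math

def logistic(z: float) -> float:
    """Logistic (sigmoid) function: maps a real number z to a value between 0 and 1."""
    # Branch on the sign of z so that exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# The curve is S-shaped: near 0 for very negative z, 0.5 at z = 0, near 1 for large z.
for z in (-6, -2, 0, 2, 6):
    print(f"z = {z:>2}  ->  p = {logistic(z):.4f}")
```

The curve is symmetric about z = 0, so logistic(z) + logistic(-z) = 1 for any z.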
Returning to the sales vs. advertising spend example:
What happens if the target variable is not continuous?
When the target variable (Y) is discrete, the straight line is no longer a
good fit, as seen in this chart.
Intuitively we can still state that when X (say advertising spend)
increases, Y (say response or no response to a mailing campaign) also
increases. However, there is no gradual transition: the Y value abruptly
jumps from one binary outcome to the other.
Thus the straight line is a poor fit for this data.
The Straight Line is a Poor Fit for Binary Outcome
S-Shaped Curve is a Better Fit

On the other hand, take a look at the S-shaped curve below.
This is certainly a better fit for the data shown.
If we then know the equation of this "sigmoid" curve, we can use it as
effectively as we used the straight line in the case of linear regression.
Logistic Regression
Logistic regression is thus the process of obtaining an appropriate
sigmoid curve to fit the data when the target variable is discrete.
In statistics, logistic regression or logit regression is a type of regression
analysis used for predicting the outcome of a categorical dependent
variable.
Key facts to keep in mind
Logistic regression is the counterpart of linear regression to use when
the target (or dependent) variable is discrete, i.e. not continuous.
The predictor variables can be either continuous or categorical.
R Data Analysis Examples

Logistic regression, also called a logit model, is used to model

dichotomous outcome variables.
In the logit model the log odds of the outcome is modeled as a linear
combination of the predictor variables.
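The link between probability, odds and log odds can be sketched as follows. This is a Python illustration rather than the R code used in the slides, and the intercept, coefficients and predictor values are hypothetical numbers picked only to make the arithmetic easy to follow.

```python
import math

def log_odds(p: float) -> float:
    """Logit: natural log of the odds p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def predicted_probability(intercept: float, coefs: list[float], x: list[float]) -> float:
    """Invert the logit: p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and predictor values, for illustration only.
b0, b = -1.0, [0.5, 0.25]
x = [2.0, 4.0]

p = predicted_probability(b0, b, x)
# The log odds of p recover the linear combination b0 + b1*x1 + b2*x2.
print(f"probability = {p:.4f}, log odds = {log_odds(p):.4f}")
```

Taking the log odds of the predicted probability returns the linear combination of the predictors, which is exactly what "linear on the log-odds scale" means.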
We will require the following packages:
- caret
- aod
- ggplot2
Each package can be installed with install.packages("packagename"),
for example install.packages("caret").
Example: German Credit data set
The data set consists of customers' information about bank
account, age, sex, credit history and present credit situation, credit
purpose, property and installment, employment, and residence as its
20 independent variables.

The target variable is Class, which is binary with levels Bad and Good:
Bad: Customer with high credit risk
Good: Customer with low credit risk

The purpose of analyzing this data set is to predict a customer's
credit risk based on the other information about the customer.
The German Credit data set ships with the caret package in R and can be
loaded by loading the caret library and then the data, e.g.
library(caret) followed by data(GermanCredit).

Descriptive statistics for all the variables can be obtained with a
command such as summary(GermanCredit).
To apply logistic regression, the target variable Class needs to be
converted from Bad/Good to 0/1.
Partitioning the Data Set
One issue that arises when fitting the model is to check how
well the newly created model behaves when applied to new
data.
To address this issue, the data set can be divided into two
partitions: a training partition used to build the model and a
test partition used to validate how well the model is performing.

Now we will partition our German Credit data set into training and
testing sets using the createDataPartition function.
The training data set contains 700 observations and the testing data
set contains 300 observations.
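The idea behind a stratified 70/30 partition can be sketched in plain Python. This is not the caret implementation: the function stratified_split, the seed and the label list are assumptions made for illustration, with the labels mimicking the 700 Good / 300 Bad balance of the German Credit data.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx), sampling train_fraction of the rows within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # random order within the class
        cut = round(train_fraction * len(idx)) # number of rows sent to training
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical label vector with the class balance of the German Credit data.
labels = ["Good"] * 700 + ["Bad"] * 300
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Sampling within each class keeps the Good/Bad proportion the same in both partitions, so the test set remains representative of the training set.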

We will now take the training data set and use logistic regression to
model Class as a function of 7 predictor variables (Age, Amount,
ForeignWorker, Property.RealEstate, Housing.Own,
CreditHistory.Critical and Purpose.NewCar) using the glm function.
Model Building
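R's glm fits a logistic model by iteratively reweighted least squares, which for the logit link amounts to Newton-Raphson on the log-likelihood. The sketch below shows that iteration in plain Python for a single predictor. It is an illustration under stated assumptions, not the slides' model: the function fit_logistic_1d and the toy data are hypothetical and unrelated to the German Credit data.

```python
import math

def fit_logistic_1d(xs, ys, iterations=25):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by Newton-Raphson on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p          # gradient with respect to the intercept
            g1 += (y - p) * x    # gradient with respect to the slope
            h00 += w             # entries of X'WX (the negative Hessian)
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X'WX) * delta = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical toy data: one predictor and a 0/1 outcome with overlapping classes.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1]
b0, b1 = fit_logistic_1d(xs, ys)
print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
```

The toy classes overlap on purpose: with perfectly separated data the maximum-likelihood estimates do not exist and the iteration would not settle.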
Interpreting Output
All the variables in our model are significant.
The null deviance is the deviance of a model without predictor
variables (intercept only), whereas the residual deviance is that of our model.
The larger the drop from the null deviance to the residual deviance,
the more the predictors improve the fit.
Here the null deviance is 839.40 and the residual deviance is
769.93, so the predictors reduce the deviance only modestly.
Estimates from logistic regression characterize the relationship
between the predictor and response variable on a log-odds scale.
The estimate for the variable Age is 0.01839. This implies that for every 1
unit increase in Age, holding the other predictors fixed, the log odds of
the consumer having good credit increase by 0.01839; equivalently, the
odds are multiplied by exp(0.01839) = 1.01856.
The other variables are interpreted similarly.
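The conversion from a coefficient on the log-odds scale to an odds multiplier is a single exponentiation. A small Python check, using the Age estimate reported above (the 10-year comparison is a hypothetical extra):

```python
import math

age_coefficient = 0.01839  # estimate for Age reported in the model output

# A coefficient is an additive change in log odds, so exp() turns it into
# a multiplicative change in odds.
odds_ratio = math.exp(age_coefficient)
print(f"log-odds change per unit of Age: {age_coefficient}")
print(f"odds multiplier per unit of Age: {odds_ratio:.5f}")

# Hypothetical comparison: coefficients add on the log-odds scale,
# so a 10-unit difference multiplies the odds by exp(10 * coefficient).
ten_unit_ratio = math.exp(10 * age_coefficient)
print(f"odds multiplier per 10 units of Age: {ten_unit_ratio:.4f}")
```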
Prediction

There is 28.43% misclassification in our model.
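A misclassification rate of this kind is obtained by turning predicted probabilities into class labels at a cutoff and counting disagreements with the actual classes. The Python sketch below assumes a 0.5 cutoff and uses made-up outcomes and probabilities; neither is taken from the slides.

```python
def misclassification_rate(actual, predicted_prob, cutoff=0.5):
    """Share of observations whose predicted class differs from the actual class."""
    predicted = [1 if p >= cutoff else 0 for p in predicted_prob]
    errors = sum(1 for a, y in zip(actual, predicted) if a != y)
    return errors / len(actual)

# Hypothetical actual classes and model probabilities, for illustration only.
actual = [1, 1, 1, 0, 0, 1, 0, 1]
probs = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55]

rate = misclassification_rate(actual, probs)
print(f"misclassification rate = {rate:.2%}")
```

The rate depends on the chosen cutoff, so a different cutoff can give a different figure for the same model.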

Goodness of Fit: Hosmer-Lemeshow Test
The Hosmer-Lemeshow test is used for testing overall goodness of fit.
The statistic is computed after the observations have been grouped by
similar predicted probabilities.
It examines whether the observed proportions of events are similar
to the predicted probabilities of occurrence in the subgroups of the
data set, using a Pearson chi-square test.
The hypotheses are

H0: the current model fits well

vs.
H1: the current model does not fit well.

Since the p-value is greater than 0.05, we do not reject H0, i.e. there
is no evidence that the current model fits poorly.
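The grouping-and-comparing logic of the Hosmer-Lemeshow test can be sketched in plain Python. This is an illustrative implementation under assumptions, not the routine used in the slides: the data are hypothetical, the demo uses 4 groups instead of the customary 10, and the p-value helper relies on a closed form that holds only when the degrees of freedom (groups - 2) are even.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable; valid for even df only."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def hosmer_lemeshow(actual, predicted_prob, groups=10):
    """Return (statistic, p_value); observations are grouped by sorted predicted probability."""
    pairs = sorted(zip(predicted_prob, actual))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of events in the group
        observed = sum(y for _, y in chunk)   # observed number of events in the group
        mean_p = expected / size
        stat += (observed - expected) ** 2 / (size * mean_p * (1.0 - mean_p))
    return stat, chi2_sf_even_df(stat, groups - 2)

# Hypothetical predictions: four groups of five observations each.
probs = [0.2] * 5 + [0.4] * 5 + [0.6] * 5 + [0.8] * 5
well_calibrated = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0]
poorly_calibrated = [1 - y for y in well_calibrated]

stat_good, p_good = hosmer_lemeshow(well_calibrated, probs, groups=4)
stat_bad, p_bad = hosmer_lemeshow(poorly_calibrated, probs, groups=4)
print(f"well calibrated:   statistic = {stat_good:.3f}, p = {p_good:.4f}")
print(f"poorly calibrated: statistic = {stat_bad:.3f}, p = {p_bad:.6f}")
```

A large p-value means the observed event counts are close to what the model predicts in every group, which is why failing to reject H0 is read as an acceptable fit.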
Wald Test
A Wald test is used to evaluate the statistical significance of each
coefficient in the model and is calculated by taking the ratio of the
square of the regression coefficient to the square of the standard
error of the coefficient.
We test whether the coefficient of the independent variable in the
model is significantly different from zero.
If the test fails to reject the null hypothesis, this suggests that
removing the variable from the model will not substantially harm the
fit of that model.
To apply the Wald test, the regTermTest function from the survey library
is used.

Since the p-value is less than 0.05, we reject H0, i.e. removing the
variable ForeignWorker from the model would harm the fit of the model.
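For a single coefficient, the Wald statistic described above can be computed by hand. The Python sketch below is not the regTermTest routine from R's survey package; the coefficient and standard error are hypothetical, and the p-value uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal.

```python
import math

def wald_test(coefficient, standard_error):
    """Return (Wald chi-square statistic, p-value) for H0: coefficient = 0."""
    statistic = (coefficient / standard_error) ** 2
    # With 1 degree of freedom, P(chi2 > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical estimate and standard error, not values from the model above.
stat, p = wald_test(coefficient=0.9, standard_error=0.3)
print(f"Wald statistic = {stat:.3f}, p-value = {p:.4f}")
```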
McFadden's R2
Unlike linear regression, logistic regression has no R2 statistic that
gives the proportion of variation in the dependent variable explained by
the predictors.
Instead we use a pseudo-R2 such as McFadden's R2.

The predictors in our model explain just 8.27% of the variation in the
data.
This suggests that one or more important variables are missing from our model.
We can try a model that considers all the variables.
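McFadden's R2 is defined as 1 minus the ratio of the fitted model's log-likelihood to that of the intercept-only model. For ungrouped 0/1 data the deviance equals -2 times the log-likelihood, so it can be recomputed from the two deviances reported for this model. The Python check below is a sketch under that assumption; it gives about 0.083, in line with the reported figure up to the rounding of the displayed deviances.

```python
null_deviance = 839.40      # reported for the intercept-only model
residual_deviance = 769.93  # reported for the fitted model

# For ungrouped 0/1 data, deviance = -2 * log-likelihood, so the ratio of
# deviances equals the ratio of log-likelihoods used in McFadden's R2.
mcfadden_r2 = 1.0 - residual_deviance / null_deviance
print(f"McFadden R2 = {mcfadden_r2:.4f}")
```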
ROC Curve
The Receiver Operating Characteristic (ROC) curve is a plot of the true
positive rate against the false positive rate.

It shows the tradeoff between sensitivity and specificity (any increase
in sensitivity will be accompanied by a decrease in specificity).

The closer the curve follows the left-hand border and then the top
border of the ROC space, the more accurate the test.
The closer the curve comes to the 45-degree diagonal of the ROC
space, the less accurate the test.

Accuracy is measured by the area under the ROC curve. An area of 1
represents a perfect test; an area of 0.5 represents a worthless test.
Area Under ROC Curve

Since the area under the curve is 0.6944, we can say the discrimination
ability of our model is fair.
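The area under the ROC curve equals the probability that a randomly chosen event receives a higher score than a randomly chosen non-event, which gives a direct way to compute it. The Python sketch below uses that pairwise definition; the function name auc and the scores are hypothetical and are not the model's predictions.

```python
def auc(actual, scores):
    """Area under the ROC curve via pairwise comparison of events and non-events."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # event ranked above non-event
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical classes and scores, for illustration only.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

area = auc(actual, scores)
print(f"AUC = {area:.4f}")
```

The pairwise loop is quadratic in the number of observations, which is fine for a small illustration but slow for large data sets.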
Kolmogorov-Smirnov Chart
The Kolmogorov-Smirnov chart measures the performance of classification
models.

It is a measure of the degree of separation between Goods (Event) and
Bads (Non-Event).

The distance between the goods and bads should be as large as possible.

The random line in the chart corresponds to the case of capturing the
responders (Ones) by random selection, i.e., when you don't have any
model at your disposal.

The model line represents the case of capturing the responders if you
go by the model-generated probability scores, where you begin by
targeting the data points with the highest probability scores.
As the separation between the goods and bads is very small, we can say
that the performance of the model is not good.
Kolmogorov-Smirnov Statistic

The Kolmogorov-Smirnov statistic is the maximum difference between
the cumulative true positive rate and the cumulative false positive rate.

It is the maximum difference between the goods and the bads.

It is often used as the deciding metric to judge the efficacy of
models in credit scoring.

The higher the value of the Kolmogorov-Smirnov statistic, the more
efficient the model is at capturing the responders (Ones or Goods).

As the value of the Kolmogorov-Smirnov statistic is small, we can say that
the model is not very efficient at capturing the responders.
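The statistic can be computed by ranking observations from the highest score down and tracking the gap between the cumulative share of events and the cumulative share of non-events. The Python sketch below assumes that convention; the function name ks_statistic and the data are hypothetical.

```python
def ks_statistic(actual, scores):
    """Maximum gap between cumulative event and non-event rates, scanning from the top score."""
    ranked = sorted(zip(scores, actual), reverse=True)
    total_pos = sum(actual)
    total_neg = len(actual) - total_pos
    cum_pos = cum_neg = 0
    best = 0.0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            cum_pos += 1
        else:
            cum_neg += 1
        # Compare the two curves only once every observation tied at this score is counted.
        last_of_tie = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if last_of_tie:
            best = max(best, abs(cum_pos / total_pos - cum_neg / total_neg))
    return best

# Hypothetical classes and scores, for illustration only.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

ks = ks_statistic(actual, scores)
print(f"KS statistic = {ks:.4f}")
```

A value of 0 means the model ranks events and non-events no better than chance, while a value of 1 means the scores separate the two groups completely.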
Tree Diagrams
A useful way of investigating probability problems is to use what are
known as tree diagrams.
Tree diagrams are a useful way of mapping out all possible outcomes
for a given scenario.
They are widely used in probability and are often referred to as
probability trees.
They are also used in decision analysis where they are referred to as
decision trees.
In the context of decision theory, a complex series of choices is
available with various different outcomes, and we are looking for the
best of these under a given performance criterion such as maximizing
profit or minimizing cost.
Tree Diagram Example 1
Suppose we are given three boxes, Box A contains 10 light bulbs, of
which 4 are defective, Box B contains 6 light bulbs, of which 1 is
defective and Box C contains 8 light bulbs, of which 3 are defective.
We select a box at random and then draw a light bulb from that box at
random. What is the probability that the bulb is defective?

Here we are performing two experiments:

Selecting a box at random
Selecting a bulb at random from the chosen box

If A, B and C denote the events choosing box A, B, or C respectively

and D and N denote the events defective/non-defective bulb chosen,
the two experiments can be represented on the diagram below.
We can compute the following probabilities and insert them onto
the branches of the tree:
P(A) = P(B) = P(C) = 1/3
P(D | A) = 4/10, P(N | A) = 6/10
P(D | B) = 1/6, P(N | B) = 5/6
P(D | C) = 3/8, P(N | C) = 5/8
To get the probability for a particular path of the tree (left to right) we
multiply the corresponding probabilities on the branches of the path.

For example, the probability of selecting box A and then getting a
defective bulb is:
P(A and D) = P(A) * P(D | A) = 1/3 * 4/10 = 4/30.

Since all the paths are mutually exclusive and there are three paths
which lead to a defective bulb, to answer the original question we
must add the probabilities for the three paths,
i.e. 4/30 + 1/3*1/6 + 1/3*3/8 = 2/15 + 1/18 + 1/8 = 113/360 = 0.314.
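The path arithmetic can be verified with exact fractions. A short Python check, using the bulb counts given in the example (the variable names are chosen here for illustration):

```python
from fractions import Fraction

# (defective bulbs, total bulbs) in each box, as stated in the example.
boxes = {"A": (4, 10), "B": (1, 6), "C": (3, 8)}

# Each box is equally likely to be chosen.
p_box = Fraction(1, len(boxes))

# Multiply along each path (box, then defective bulb) and add the three paths.
p_defective = sum(p_box * Fraction(d, n) for d, n in boxes.values())
print(p_defective, round(float(p_defective), 3))
```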
Tree Diagram Example 2

Machines A and B turn out respectively 10% and 90% of
the total production of a certain type of article.
The probability that machine A turns out a defective
item is 0.01 and the probability that machine B turns out
a defective item is 0.05.

(i) What is the probability that an article taken at random from
the production line is defective?

(ii) What is the probability that an article taken at random from
the production line was made by machine A, given that it is
defective?
Tree Diagram Example 2
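One way to work through both parts is to add the two paths that end in a defective item for part (i), then apply Bayes' rule for part (ii). The Python sketch below follows the same multiply-along-paths approach as Example 1, using the figures stated in the question; it is offered as a worked check rather than as the slide's own solution.

```python
from fractions import Fraction

# Share of production and defect probability for each machine, as stated above.
p_machine = {"A": Fraction(10, 100), "B": Fraction(90, 100)}
p_defect_given = {"A": Fraction(1, 100), "B": Fraction(5, 100)}

# (i) Total probability: add the two paths that end in a defective item.
p_defect = sum(p_machine[m] * p_defect_given[m] for m in p_machine)

# (ii) Bayes' rule: share of the defective probability that comes from machine A.
p_a_given_defect = p_machine["A"] * p_defect_given["A"] / p_defect

print(f"(i)  P(defective) = {p_defect} = {float(p_defect):.3f}")
print(f"(ii) P(A | defective) = {p_a_given_defect} = {float(p_a_given_defect):.4f}")
```

Although machine B has the higher defect rate and most of the output, the calculation makes explicit how little of the defective output is attributable to machine A.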
Thank You
