Lesson 13
Logistic Regression
Daniele Tonini
[email protected]
Agenda
Logistic regression:
• Background
• Problem setting
• Model features
• Interpretation of the coefficients
• ODDS ratio
• Estimation of the model
• Model comparison and evaluation
Background
Consider the following business case: [business case illustration omitted]
Background
In order to deal with some specific managerial problems, it is sometimes not sufficient to rely on the linear regression model, since the response (dependent) variable is a categorical variable.
For example, we may be interested in studying the purchase drivers for a specific product category, in explaining the membership of customers to market segments, or in identifying the factors leading to customer defection.
For these kinds of problems, the business analyst needs specific quantitative tools to model the phenomenon of interest: as will become clear in what follows, estimating a logistic regression model leads to the proper solution to these issues.
Problem Setting
Goal: analysis of the probability of occurrence of the event of interest (Y = 1), given one or more predictor (independent) variables (X):
Pr(Y = 1 | X)
(note: the binary response variable can be modeled using a Bernoulli random variable)
Problem Setting
The linear regression model is not suitable to solve the previous problem (categorical binary response variable), since:
1. In the linear regression, the predicted value is not bounded, and so it can assume values outside the interval [0; 1] (for example, a fitted line ŷ = −0.2 + 0.01x yields ŷ = 1.3 when x = 150).
2. The usual tests for the parameters of the linear regression model are based on the assumption of Normality for the distribution of the error terms; since y can take on only the values 0 and 1, the Normality assumption is hard to justify, even in an approximate sense.
The Logistic Regression Model
Model features
To solve the problems mentioned above, it is possible to use a non-linear function of the independent variables that, contrary to the linear function, assumes values bounded between 0 and 1.
Among all the suitable non-linear functions, the standard logistic function (or sigmoid function) is the one most often used in practice: 𝜋 = Pr(𝑌 = 1) is defined as a function of p independent variables, according to the following expression:
(1)  $\pi = \Pr(Y=1) = \dfrac{e^{\beta_0+\beta_1 x_1+\beta_2 x_2+\dots+\beta_p x_p}}{1+e^{\beta_0+\beta_1 x_1+\beta_2 x_2+\dots+\beta_p x_p}}$  OR  $\pi = \Pr(Y=1) = \dfrac{1}{1+e^{-(\beta_0+\beta_1 x_1+\beta_2 x_2+\dots+\beta_p x_p)}}$
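A minimal sketch of equation (1) in Python (NumPy); the coefficient values below are made up purely for illustration:

```python
import numpy as np

def predict_prob(beta, X):
    """Predicted probability Pr(Y=1) from equation (1).
    beta: array of p+1 coefficients (intercept first); X: (n, p) matrix."""
    eta = beta[0] + X @ beta[1:]       # linear combination of the predictors
    return 1.0 / (1.0 + np.exp(-eta))  # standard logistic (sigmoid) function

# Illustrative coefficients: intercept 0.5, slopes 1.2 and -0.8
beta = np.array([0.5, 1.2, -0.8])
X = np.array([[1.0, 2.0], [0.0, 0.5]])
print(predict_prob(beta, X))  # probabilities always lie inside (0, 1)
```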
The Logistic Regression Model
The LOGIT expression
(2)  $\ln\!\left(\dfrac{\pi}{1-\pi}\right) = \beta_0+\beta_1 x_1+\beta_2 x_2+\dots+\beta_p x_p$
The left-hand side of the equation is often called the LOGIT transformation of 𝜋, while the right-hand side is the linear combination of the p independent variables 𝑥1, 𝑥2, …, 𝑥𝑝, just as for the linear regression model.
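To see why (1) and (2) are equivalent, write $\eta = \beta_0+\beta_1 x_1+\dots+\beta_p x_p$ for brevity and solve (1) for the log-odds:

```latex
\pi = \frac{e^{\eta}}{1+e^{\eta}}
\;\Longrightarrow\;
1-\pi = \frac{1}{1+e^{\eta}}
\;\Longrightarrow\;
\frac{\pi}{1-\pi} = e^{\eta}
\;\Longrightarrow\;
\ln\frac{\pi}{1-\pi} = \eta
```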
The Logistic Regression Model
Definition of ODDS and interpretation of coefficients
There is a simple relation between the odds (O) and the probability (𝜋):

$O = \dfrac{\pi}{1-\pi}$ ⟺ $\pi = \dfrac{O}{1+O}$
The LOGIT formulation (2) for the logistic model corresponds to a linear model for
the logarithm of the odds (log-odds)
⟹ the generic coefficient 𝜷𝒌 of the k-th independent variable 𝒙𝒌 is interpreted as the change in the log-odds for a one-unit increase in 𝒙𝒌, keeping all other variables constant.
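For instance (an illustrative value, not taken from the slides): if 𝛽𝑘 = 0.7, a one-unit increase in 𝑥𝑘 raises the log-odds by 0.7, i.e. it multiplies the odds by $e^{0.7} \approx 2.01$.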
The Logistic Regression Model
Example: direct marketing campaign
[Slide shows the estimated logistic regression output for the campaign data, with coefficients for GENDER, ACTIVITY, and AGE; table omitted]
The Logistic Regression Model
Example: direct marketing campaign
The estimated equation for the logistic model (in Logit form) for the previous example is (changing the sign for category 1): [estimated equation omitted]
GENDER: keeping the other variables constant, the estimated log-odds for the event of interest (i.e. positive interest of the customer) increases by 0.9865 for a man compared to a woman.
ACTIVITY: keeping the other variables constant, the estimated log-odds increases by 0.9350 for an active customer compared to a non-active one.
AGE: keeping the other variables constant, the estimated log-odds decreases by 0.0011 when the age of the customer increases by one year.
A positive coefficient denotes an increasing relation between the corresponding predictor and the (estimate of the) probability of occurrence of the event.
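A small sketch of how such estimates translate into a predicted probability; the intercept below is a made-up placeholder, since its value is not reported here:

```python
import math

# Slope coefficients from the example; the intercept is hypothetical.
intercept = -1.5  # placeholder value, not from the lesson
b_gender, b_activity, b_age = 0.9865, 0.9350, -0.0011

def prob_positive(gender, activity, age):
    """Pr(positive interest), with gender/activity coded 1/0 and age in years."""
    logit = intercept + b_gender * gender + b_activity * activity + b_age * age
    return 1.0 / (1.0 + math.exp(-logit))

# Active 40-year-old man vs. inactive 40-year-old woman
print(prob_positive(1, 1, 40))
print(prob_positive(0, 0, 40))
```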
The Logistic Regression Model
An additional interpretation: the ODDS RATIO
The ODDS RATIO is the ratio of two ODDS computed at two different values
of the independent variable:
                          Independent Variable
                          x = 1          x = 0
Dependent     y = 1       𝜋(1)           𝜋(0)
Variable      y = 0       1 − 𝜋(1)       1 − 𝜋(0)

$ODDS_1 = \dfrac{\pi(1)}{1-\pi(1)}$,  $ODDS_0 = \dfrac{\pi(0)}{1-\pi(0)}$

$\text{ODDS RATIO} = \dfrac{ODDS_1}{ODDS_0} = \dfrac{\pi(1)/(1-\pi(1))}{\pi(0)/(1-\pi(0))}$
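For the logistic model with a single binary predictor x, this ratio has a simple closed form: since by (2) the odds at x are $e^{\beta_0+\beta_1 x}$, the odds ratio reduces to $e^{\beta_1}$:

```latex
\mathrm{ODDS}(x) = \frac{\pi(x)}{1-\pi(x)} = e^{\beta_0+\beta_1 x}
\quad\Longrightarrow\quad
\mathrm{ODDS\ RATIO} = \frac{\mathrm{ODDS}(1)}{\mathrm{ODDS}(0)}
 = \frac{e^{\beta_0+\beta_1}}{e^{\beta_0}} = e^{\beta_1}
```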
The Logistic Regression Model
An additional interpretation: the ODDS RATIO
For a categorical independent variable, a «unit increase» clearly stands for the change from the reference category (coded as «0») to the category coded as «1».
So, the ODDS RATIO for a categorical predictor represents the ratio of the ODDS for an individual belonging to the specific category TO the ODDS for an individual NOT belonging to the category.
The Logistic Regression Model
An additional interpretation: the ODDS RATIO
To sum up, the estimated coefficient 𝜷𝒌 in a logistic regression model can be interpreted alternatively as:
• the change in the log-odds for a one-unit increase in 𝒙𝒌 (keeping the other variables constant), or
• once exponentiated, as the ODDS RATIO $e^{\beta_k}$, i.e. the multiplicative change in the odds for a one-unit increase in 𝒙𝒌.
The Logistic Regression Model
An additional interpretation: the ODDS RATIO
EXAMPLE: consider the output of the logistic regression related to the direct marketing campaign and the estimated coefficients (the example above).
We interpreted the coefficient on the variable ACTIVITY as the estimated increase in the log-odds for an active customer with respect to that for an inactive one (ceteris paribus); equivalently, its exponential is the estimated ODDS RATIO, as in the sketch below.
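A quick numeric check (a sketch using the coefficients reported earlier):

```python
import math

coefficients = {"GENDER": 0.9865, "ACTIVITY": 0.9350, "AGE": -0.0011}

# Exponentiating each coefficient gives the corresponding ODDS RATIO
for name, beta in coefficients.items():
    print(f"{name}: OR = exp({beta}) = {math.exp(beta):.3f}")

# ACTIVITY: exp(0.9350) ≈ 2.55 -> an active customer's odds of a positive
# response are about 2.5 times those of an inactive customer.
```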
Similarly to the linear regression model, the relation between the dependent variable and the independent ones is known up to the values of the parameters 𝜷𝒌, which have to be estimated.
The Logistic Regression Model
Estimation of the Model
Likelihood function of the 𝑦𝑖 (𝑖 = 1, … , 𝑁) independent observations:

$L = \Pr(y_1, y_2, \dots, y_N) = \Pr(y_1)\Pr(y_2)\cdots\Pr(y_N) = \prod_{i=1}^{N} p_i^{\,y_i}\,(1-p_i)^{\,1-y_i}$

Finding the Log-Likelihood function (LL): taking the logarithm of both sides (this helps the optimization process):

$LL = \ln L = \sum_{i=1}^{N}\left[\,y_i \ln p_i + (1-y_i)\ln(1-p_i)\,\right] = \sum_{i=1}^{N} y_i \ln\!\frac{p_i}{1-p_i} + \sum_{i=1}^{N} \ln(1-p_i)$

From the LOGIT regression equation, $\ln\dfrac{p_i}{1-p_i} = \beta x_i$, so the expression to maximize is:

$LL = \sum_{i} y_i\,\beta x_i - \sum_{i} \ln\!\left(1+\exp(\beta x_i)\right)$
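A minimal sketch of maximum likelihood estimation in Python, maximizing LL numerically with scipy on synthetic data (the lesson's dataset is not available here):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data: an intercept column plus one predictor
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
true_beta = np.array([0.5, 1.2])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

def neg_log_likelihood(beta):
    """-LL = -[sum_i y_i*(beta x_i) - sum_i ln(1 + exp(beta x_i))]"""
    eta = X @ beta
    return -(y @ eta - np.logaddexp(0.0, eta).sum())

# Minimizing -LL is equivalent to maximizing LL (no closed form exists)
result = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
print(result.x)  # estimates should be close to [0.5, 1.2]
```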
The Logistic Regression Model
Estimation of the Model
The Log-Likelihood expression (i.e. the conditions that determine the estimates) is non-linear in the parameters and does not admit an explicit (closed-form) solution, so it has to be maximized numerically.
Maximum Likelihood Estimates have some optimality properties for a sufficiently large sample size:
– Asymptotically unbiased (estimates are approximately unbiased)
– Asymptotically efficient (the standard errors of the estimates are as low as those of any other procedure)
– Asymptotically normal (it is possible to use a Normal or chi-square distribution to compute confidence intervals, critical values, and p-values in statistical tests)
The Logistic Regression Model
Regularization
Overfitting the training data is a problem that can arise in logistic regression, especially when the data is high-dimensional and sparse.
One approach to reducing overfitting is Regularization, in which we create a modified “penalized log-likelihood function” that penalizes large values of the estimates.
Generally, we don't want large estimates: if the weights are large, a small change in a feature can result in a large change in the prediction.
Penalized log-likelihood function:

$LL_{pen} = \sum_{i} y_i\,\beta x_i - \sum_{i} \ln\!\left(1+\exp(\beta x_i)\right) - \lambda R(\beta)$

where 𝜆 is the penalty factor and 𝑅(𝛽) is the regularization function.

L1 regularization: $R(\beta) = \sum_{i=1}^{p} |\beta_i|$. It tends to produce sparse solutions by forcing unimportant coefficients to be zero (L1 is equivalent to the Laplace method in Knime).

L2 regularization: $R(\beta) = \frac{1}{2}\sum_{i=1}^{p} \beta_i^{2}$. It keeps the coefficients from becoming too large but does not force them to be zero (L2 is equivalent to the Gauss method in Knime).
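A sketch of L1/L2-regularized logistic regression using scikit-learn, one common implementation (Knime's Laplace/Gauss options correspond to the same penalties); note that sklearn's C parameter is the inverse of the penalty factor 𝜆:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy high-dimensional data, where regularization helps against overfitting
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

# L1 (Laplace-like): drives unimportant coefficients exactly to zero
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)

# L2 (Gauss-like): shrinks coefficients but keeps them non-zero
l2 = LogisticRegression(penalty="l2", solver="lbfgs", C=0.5).fit(X, y)

print("non-zero coefficients with L1:", (l1.coef_ != 0).sum())
print("non-zero coefficients with L2:", (l2.coef_ != 0).sum())
```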
The Logistic Regression Model
Goodness-of-fit and model comparison
The category of the target variable predicted by the model (the “predicted values” in the confusion matrix) is assigned by comparing the estimated probability of the logistic model with a pre-determined threshold, called the cut-off (default value equal to 0.5): if the estimated probability is greater than the cut-off, the observation is assigned to category “1”; otherwise, it is assigned to “0”.
In order to obtain more reliable results, it is possible to select a cut-off value equal to the “a priori” probability of the target variable (the proportion of category “1” in the target variable).
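A minimal sketch of the cut-off rule, including the “a priori” proportion of category “1” as an alternative threshold:

```python
import numpy as np

y_true = np.array([0, 1, 0, 1, 1, 0, 0, 1])                     # observed target
p_hat  = np.array([0.2, 0.7, 0.4, 0.55, 0.35, 0.1, 0.45, 0.8])  # estimated probabilities

# Default cut-off of 0.5
pred_default = (p_hat > 0.5).astype(int)

# Cut-off equal to the "a priori" probability of the target
prior = y_true.mean()                 # proportion of category "1"
pred_prior = (p_hat > prior).astype(int)

print(pred_default)
print(pred_prior)
```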
The Logistic Regression Model
Model evaluation: ROC Curve
Given a confusion matrix based on a specific cut-off, the ROC curve is calculated starting from the joint frequencies of predicted and observed events (correct classifications) and predicted but not observed events (errors). Specifically, with regard to the confusion matrix of the previous slide, the ROC curve is based on the following indicators:
Sensitivity (true positive proportion): d / (c + d)
Specificity (true negative proportion): a / (a + b)
1 − Sensitivity (false negative proportion): c / (c + d)
1 − Specificity (false positive proportion): b / (a + b)
(in the confusion matrix, a = true negatives, b = false positives, c = false negatives, d = true positives)
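A small sketch computing these four indicators from the confusion-matrix cells, using the same a, b, c, d labels as above:

```python
def roc_indicators(a, b, c, d):
    """a = true negatives, b = false positives,
    c = false negatives, d = true positives."""
    sensitivity = d / (c + d)   # true positive proportion
    specificity = a / (a + b)   # true negative proportion
    return sensitivity, specificity, 1 - sensitivity, 1 - specificity

# Example counts for one cut-off
print(roc_indicators(a=50, b=10, c=8, d=32))
```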
The Logistic Regression Model
Model evaluation: ROC Curve
Once these values have been computed, the ROC curve is obtained by plotting, for each possible threshold (cut-off) value, a point in the Cartesian plane with the percentage of false positives (1 − Specificity) on the horizontal axis and the percentage of true positives (Sensitivity) on the vertical axis. Each point of the curve, therefore, represents a particular value of the cut-off (varying from 0 to 1) on which a confusion matrix has been built.
[Figure: ROC curves of Model 1 and Model 2, plotting Sensitivity (vertical axis) against 1 − Specificity (horizontal axis)]
The Logistic Regression Model
Model evaluation: ROC Curve
The point (0, 1) in the chart represents the ideal model, which does not commit any prediction error: the percentage of false positives is 0 and the percentage of true positives is equal to 1 ⟹ this allows a perfect separation between the two classes.
The point (0, 0) corresponds to a model that assigns all the observations to the negative class of the response variable (the absence of the target characteristic), while the point (1, 1) corresponds to a model that predicts all observations as belonging to the positive class.
In terms of comparison between two different models, the best-fitting curve is, therefore, the one closest to the upper-left corner of the graph. Comparing the two sample models represented in the graph of the previous slide, Model 1 can be considered better from this point of view.
The Logistic Regression Model
Model evaluation: AUC
Sometimes, especially when two models have similar performances, it is difficult to clearly distinguish which curve is better than the other, because the curves often overlap at some points.
For this reason, an index that numerically measures the goodness of the model is frequently used together with the graphical display.
This index, called AUC (Area Under the Curve), is obtained simply by calculating the area below the ROC curve.
Formally, the AUC index is calculated, by varying the threshold, in this way:

$AUC_{ROC} = \int_{0}^{1} \frac{TP}{P}\, d\!\left(\frac{FP}{N}\right) = \frac{1}{P \cdot N} \int_{0}^{N} TP\; dFP$

(where TP and FP denote the numbers of true and false positives, and P and N the total numbers of positive and negative observations)
A random classifier has an AUC of 0.5 (graphically represented by a ROC curve corresponding to the bisecting diagonal of the plot), while a perfect classifier has an AUC of 1. A logistic regression model, therefore, will have an AUC between these two extreme values.
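A sketch of the threshold-sweep computation, a discrete approximation of the integral above via the trapezoidal rule (in practice, library functions such as sklearn.metrics.roc_auc_score do this for you):

```python
import numpy as np

def auc_roc(y_true, p_hat):
    """Approximate AUC by sweeping the cut-off and integrating the ROC curve."""
    thresholds = np.linspace(1, 0, 101)   # cut-off varies from 1 down to 0
    P, N = y_true.sum(), (1 - y_true).sum()
    tpr = np.array([((p_hat > t) & (y_true == 1)).sum() / P for t in thresholds])
    fpr = np.array([((p_hat > t) & (y_true == 0)).sum() / N for t in thresholds])
    # Trapezoidal rule over the (FPR, TPR) points
    return np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)

y = np.array([0, 0, 1, 1, 1, 0, 1, 0])
p = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5])
print(auc_roc(y, p))   # 0.5 = random classifier, 1 = perfect classifier
```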
Keywords
Logistic function
LOGIT Transformation
ODDS
ODDS RATIO
AIC
Confusion matrix
ROC Curve