Logistic Regression
• When z = 0, exp(−z) = 1, so that P = 0.5. Thus the logistic curve has its
“center” at (z, P) = (0, 0.5). This point is called an inflection point. The
logistic curve is symmetric about its inflection point.
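A minimal numerical check of these properties, sketched in Python (NumPy assumed; the value of z below is arbitrary):

```python
import numpy as np

def logistic(z):
    """Logistic function P = 1 / (1 + exp(-z)), equation (1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(logistic(0.0))                   # 0.5 -- the inflection point (z, P) = (0, 0.5)

# Symmetry about the inflection point: P(z) + P(-z) = 1 for any z
z = 2.3
print(logistic(z) + logistic(-z))      # 1.0
```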
The multivariate logistic function
• Here Z is a linear function of a set of predictor variables:
$$Z = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$$

Then

$$P = \frac{1}{1 + \exp[-(b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k)]} \qquad (3)$$

This function ranges between 0 and 1.
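A short Python sketch of equation (3); the coefficient and predictor values are made up for illustration:

```python
import numpy as np

b = np.array([-1.5, 0.8, 0.3])     # hypothetical b0 (intercept), b1, b2
x = np.array([1.0, 2.0, -1.0])     # leading 1 pairs with the intercept; then X1, X2

z = b @ x                          # Z = b0 + b1*X1 + b2*X2
p = 1.0 / (1.0 + np.exp(-z))       # equation (3); always strictly between 0 and 1
print(z, p)
```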
The Logit Link Function
• The basic form of the logistic function as given in equation (1) is
$$P = \frac{1}{1 + \exp(-z)} \qquad (1)$$
It follows that

$$1 - P = 1 - \frac{1}{1 + \exp(-z)} = \frac{\exp(-z)}{1 + \exp(-z)} \qquad (4)$$

Dividing (1) by (4) yields

$$\frac{P}{1 - P} = \exp(z) \qquad (5)$$

• Odds $= \dfrac{P}{1 - P} = \Omega$

and

Logit $P = \log\left[\dfrac{P}{1 - P}\right] = \log \Omega$
A link function is simply a function of the mean of the response variable Y
that we use as the response instead of Y itself. This means we use the logit
of Y as the response in our regression equation instead of Y alone.
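A quick numerical check of equations (4) and (5), sketched in Python with an arbitrary value of z:

```python
import numpy as np

z = 0.7                               # any value of z
p = 1.0 / (1.0 + np.exp(-z))          # equation (1)

odds = p / (1.0 - p)                  # equation (5): equals exp(z)
logit = np.log(p / (1.0 - p))         # logit P = log(odds): recovers z

print(odds, np.exp(z))                # identical
print(logit, z)                       # identical
```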
• The multivariate equation becomes
$$\text{logit}\, P = \log\left[\frac{P}{1 - P}\right] = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k \qquad (6)$$
Equation (6) has the form of a multiple regression equation, and the
coefficients are interpreted in a similar way. Multiple regression
employs the method of least squares to estimate its parameters;
logistic regression, however, uses the maximum likelihood procedure.
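As a sketch of such a maximum-likelihood fit in Python (assuming NumPy and statsmodels are available; the data are simulated purely for illustration):

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: two predictors and a binary response (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_logit = 0.5 + 1.2 * X[:, 0] - 0.7 * X[:, 1]
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

X_design = sm.add_constant(X)            # adds the column of 1s for the intercept b0
fit = sm.Logit(y, X_design).fit()        # coefficients estimated by maximum likelihood
print(fit.params)                        # b0, b1, b2 on the logit scale, as in equation (6)
```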
Odds Ratios as measures of effects on the odds
• The trouble is that logit P is not a familiar quantity, so the
meaning of these effects is not very clear.
• Two other measures are similar in design to the pseudo R2 value and
are generally categorized as pseudo R2 measures as well. The Cox and
Snell R2 measure operates in the same manner, with higher values
indicating better model fit. However, this measure is limited in that it
cannot reach the maximum value of 1, so Nagelkerke proposed a
modification with a range of 0 to 1.
• Both of these additional measures are interpreted as reflecting the
amount of variation accounted for by the logistic model, with 1.0
indicating perfect model fit.
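A sketch of the usual formulas for these measures, computed from the log-likelihoods of the fitted model and the intercept-only (null) model; the numbers passed in at the end are made up:

```python
import numpy as np

def pseudo_r2(ll_model, ll_null, n):
    """Cox & Snell and Nagelkerke R-squared from the log-likelihoods of the
    fitted model (ll_model) and the intercept-only model (ll_null), n observations."""
    cox_snell = 1.0 - np.exp(2.0 * (ll_null - ll_model) / n)
    nagelkerke = cox_snell / (1.0 - np.exp(2.0 * ll_null / n))  # rescaled so 1 is attainable
    return cox_snell, nagelkerke

print(pseudo_r2(ll_model=-95.2, ll_null=-130.6, n=200))  # illustrative values only
```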
Hosmer-Lemeshow Test
• The Hosmer-Lemeshow test (HL test) is a goodness of fit test for
logistic regression, especially for risk prediction models.
• The test is used only for binary response variables.
• It checks whether the observed event rates match the expected
event rates in subgroups of the population.
• If the p-value is less than 0.05, the model is a poor fit.
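A minimal sketch of the test (assuming NumPy and SciPy), grouping observations into deciles of predicted risk and comparing observed with expected event counts:

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow chi-square statistic and p-value (deciles-of-risk sketch).

    y: observed binary outcomes (0/1); p: predicted probabilities from the model.
    """
    order = np.argsort(p)
    y, p = np.asarray(y)[order], np.asarray(p)[order]
    bins = np.array_split(np.arange(len(p)), groups)   # roughly equal-sized risk groups
    chi2 = 0.0
    for idx in bins:
        obs = y[idx].sum()                  # observed events in the group
        exp = p[idx].sum()                  # expected events in the group
        n_g = len(idx)
        chi2 += (obs - exp) ** 2 / (exp * (1.0 - exp / n_g))
    dof = groups - 2                        # conventional degrees of freedom
    return chi2, stats.chi2.sf(chi2, dof)
```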
Interpreting coefficients
• Interpreting the direction of the original coefficients: The sign of the original
coefficients (positive or negative) indicates the direction of the relationship, just
as with ordinary regression coefficients. A positive coefficient increases the predicted
probability, whereas a negative coefficient decreases it.
• Interpreting the direction of exponentiated coefficients: Exponentiated
coefficients must be interpreted differently because they are the antilogs
(exponentials) of the original coefficients. By exponentiating, we state the
coefficient in terms of odds, which means that exponentiated
coefficients cannot be negative. Because the exponential of 0 (no effect) is
1.0, an exponentiated coefficient of 1.0 corresponds to a relationship
with no direction. Thus, exponentiated coefficients above 1.0 reflect a positive
relationship and values below 1.0 represent a negative relationship.
• Percentage change in odds = (exponentiated coefficient − 1.0) × 100
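A small Python illustration of these interpretations; the predictor names and coefficient values are hypothetical:

```python
import numpy as np

# Hypothetical logistic coefficients on the logit scale
coefs = {"age": 0.05, "smoker": 0.90, "exercise": -0.40}

for name, b in coefs.items():
    odds_ratio = np.exp(b)                    # exponentiated coefficient (always positive)
    pct_change = (odds_ratio - 1.0) * 100     # percentage change in odds per unit increase
    print(f"{name}: OR = {odds_ratio:.2f}, change in odds = {pct_change:+.1f}%")
```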
Testing for significance of the coefficients
• We use a statistical test called the Wald statistic to see whether a
logistic coefficient is different from 0.
• The Wald test is used to test hypotheses about parameters that
are estimated by the maximum likelihood method.
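A brief sketch of the computation (SciPy assumed; the estimate and standard error are made-up numbers):

```python
from scipy import stats

b, se = 0.90, 0.35                      # hypothetical coefficient estimate and its standard error
wald = (b / se) ** 2                    # Wald statistic, chi-square with 1 degree of freedom
p_value = stats.chi2.sf(wald, df=1)     # tests H0: the coefficient equals 0
print(wald, p_value)
```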