Logistic Regression

Binary logistic regression is used when the response variable is dichotomous (binary). It can be used for classification and to estimate the relationship between predictor variables and the likelihood of the response variable being 1 or 0. Key aspects include using the logit link function to transform the probability into odds, interpreting coefficients as log odds ratios, and assessing model fit using -2 log likelihood and pseudo R-squared values.

Uploaded by

tsandrasanal
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views

Logistic Regression

Binary logistic regression is used when the response variable is dichotomous (binary). It can be used for classification and to estimate the relationship between predictor variables and the likelihood of the response variable being 1 or 0. Key aspects include using the logit link function to transform the probability into odds, interpreting coefficients as log odds ratios, and assessing model fit using -2 log likelihood and pseudo R-squared values.

Uploaded by

tsandrasanal
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 25

Binary Logistic Regression

• Logistic regression, more commonly called logit regression, is used when the response variable is dichotomous (i.e., binary or 0-1). The predictor variables may be quantitative, categorical, or a mixture of the two.
• Logistic regression does not require the strict assumptions of multivariate normality and equal variance-covariance matrices across groups. Even when those assumptions are met, many researchers prefer logistic regression because it is more similar to multiple regression.
Assumptions
1. The outcome variable is binary. If the dependent variable has three or more outcomes, multinomial or ordinal logistic regression should be used instead.
2. The observations must be independent of each other.
3. The logistic model assumes a linear relationship between the logit and the independent variables. The Box-Tidwell test is used to check this assumption.
4. Absence of multicollinearity among the independent variables.
Objectives
1. Explanation: providing estimates of the ability of a set of independent variables, collectively and individually, to distinguish between the two outcome groups.
2. Classification: providing a means for classifying cases into the outcome groups, along with a range of diagnostic measures of predictive accuracy.
Assigning binary values
• Logistic regression begins by assigning binary values to the dependent variable. It does not matter which group is assigned the value of 1 versus 0, but this assignment (coding) must be noted for the interpretation of the coefficients.
• Suppose the groups represent outcomes or events (e.g., success or failure), with success coded as 1 and failure coded as 0. The coefficients then reflect the impact of the independent variables on the likelihood of the outcome coded as 1 (success).
The logistic function
• The basic form of the logistic function is

P = 1 / (1 + exp(−z))        (1)

where z is the predictor variable.
If the numerator and denominator of the right-hand side of equation (1) are multiplied by exp(z), then

P = exp(z) / (1 + exp(z))        (2)
• Instead of a straight line, the logistic function fits an S-shaped (sigmoid) curve to the observed points.
• The tails of the sigmoid curve level off before reaching P = 0 or P = 1, so the problem of impossible values of P is avoided.
• A property of the logistic function is that as z becomes infinitely negative, exp(−z) becomes infinitely large, so that P approaches 0. As z becomes infinitely positive, exp(−z) becomes infinitesimally small, so that P approaches unity.
• When z = 0, exp(−z) = 1, so that P = 0.5. Thus the logistic curve has its center at (z, P) = (0, 0.5). This point is called an inflection point, and the logistic curve is symmetric about it.
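These properties can be checked numerically with a short sketch (Python here, purely illustrative):

```python
import math

def logistic(z):
    """Logistic function P = 1 / (1 + exp(-z)), as in equation (1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Center of the curve: P = 0.5 at z = 0.
assert logistic(0.0) == 0.5

# Tails level off toward 0 and 1 without ever reaching them.
assert logistic(-10) < 0.001 and logistic(10) > 0.999

# Symmetry about the inflection point: P(z) + P(-z) = 1.
assert abs(logistic(2.0) + logistic(-2.0) - 1.0) < 1e-12
```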
The multivariate logistic function
• Here z is a linear function of a set of predictor variables:

z = b0 + b1X1 + b2X2 + … + bkXk

Then

P = 1 / (1 + exp[−(b0 + b1X1 + b2X2 + … + bkXk)])        (3)

This function ranges between 0 and 1.
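Equation (3) can be computed directly; the coefficients and predictor values below are hypothetical, for illustration only:

```python
import math

def predict_prob(coefs, x):
    """P from equation (3): coefs = [b0, b1, ..., bk], x = [X1, ..., Xk]."""
    z = coefs[0] + sum(b * xi for b, xi in zip(coefs[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients b0 = -1.0, b1 = 0.8, b2 = 0.5 and values X1 = 2, X2 = 1,
# so z = -1.0 + 0.8*2.0 + 0.5*1.0 = 1.1.
p = predict_prob([-1.0, 0.8, 0.5], [2.0, 1.0])
assert abs(p - 1.0 / (1.0 + math.exp(-1.1))) < 1e-12
assert 0.0 < p < 1.0  # the function stays between 0 and 1
```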
The Logit Link Function
• The basic form of the logistic function, as given in equation (1), is

P = 1 / (1 + exp(−z))        (1)

It follows that

1 − P = 1 − 1 / (1 + exp(−z)) = exp(−z) / (1 + exp(−z))        (4)

Dividing (1) by (4) yields

P / (1 − P) = exp(z)        (5)

Taking the natural logarithm of both sides of (5), we get

log [P / (1 − P)] = z
• The quantity P / (1 − P) is called the odds, denoted more concisely as Ω, and the quantity log [P / (1 − P)] is called the log odds, or the logit of P. Thus

Odds = P / (1 − P) = Ω

and

Logit P = log [P / (1 − P)] = log Ω

A link function is simply a function of the mean of the response variable Y that we use as the response instead of Y itself. Here, the logit of Y serves as the response in the regression equation.
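A quick numerical check of these definitions (Python, illustrative only):

```python
import math

def logit(p):
    """Log odds: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of the logit: the logistic function of equation (1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Odds and log odds for P = 0.8: odds = 0.8 / 0.2 = 4.
p = 0.8
odds = p / (1.0 - p)
assert abs(odds - 4.0) < 1e-12
assert abs(logit(p) - math.log(4.0)) < 1e-12

# The logit and the logistic function are inverses of each other.
assert abs(inv_logit(logit(p)) - p) < 1e-12
```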
• The multivariate equation becomes

logit P = log [P / (1 − P)] = b0 + b1X1 + b2X2 + … + bkXk        (6)

Equation (6) has the form of a multiple regression equation, and its coefficients are interpreted in a similar way. Multiple regression employs the method of least squares for estimating parameters; logistic regression, however, uses a maximum likelihood procedure.
Odds Ratios as Measures of Effect on the Odds
• The trouble is that logit P is not a familiar quantity, so the meaning of effects on the logit scale is not very clear.
• We therefore use another measure, the odds ratio, to interpret the impact of the independent variables.
Consider the model

logit P = a + bE + cU + dI        (7)

where
P: estimated probability of health awareness
E: number of completed years of education
U: 1 if urban, 0 otherwise
I: 1 if Indian, 0 otherwise
The model may be expressed as

log Ω = a + bE + cU + dI        (10)

Taking the exponential of both sides, we obtain

Ω = e^(a + bE + cU + dI)

Suppose we increase E by one unit, holding U and I constant. Denoting the new value of Ω as Ω*, we have

Ω* = e^(a + b(E+1) + cU + dI)
   = e^(a + bE + cU + dI + b)
   = e^(a + bE + cU + dI) e^b
   = Ω e^b        (11)

which can be written alternatively as

Ω*/Ω = e^b        (12)
From (11), it is evident that a one-unit increase in E, holding the other predictor variables constant, multiplies the odds by the factor e^b. The quantity e^b is called an odds ratio.

The original coefficient b represents the additive effect of a one-unit change in E on the log odds of the outcome (here, health awareness). Equivalently, the odds ratio e^b represents the multiplicative effect of a one-unit change in E on the odds. Insofar as the odds is a more intuitively meaningful concept than the log odds, e^b is more readily understandable than b as a measure of effect.
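The multiplicative effect in (11)-(12) can be verified numerically; the coefficients a, b, c, d below are hypothetical, for illustration only:

```python
import math

# Hypothetical coefficients for logit P = a + bE + cU + dI (illustration only).
a, b, c, d = -2.0, 0.3, 0.5, 0.2

def odds(E, U, I):
    """Odds = exp(a + bE + cU + dI), as obtained by exponentiating (10)."""
    return math.exp(a + b * E + c * U + d * I)

# A one-unit increase in E, holding U and I constant, multiplies the odds
# by exactly exp(b) -- the odds ratio of equations (11)-(12).
ratio = odds(E=11, U=1, I=0) / odds(E=10, U=1, I=0)
assert abs(ratio - math.exp(b)) < 1e-12
```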
Assessing the goodness-of-fit of the estimated model
• There are two primary approaches to evaluating model fit. The first uses an overall measure of the statistical significance of the model fit, together with "pseudo" R2 values.
• The second approach examines predictive accuracy: the ability of the model to correctly classify the outcome measure is computed in what is termed a classification matrix.
Model estimation fit: the basic measure of how well the maximum likelihood estimation procedure fits is the likelihood value. Logistic regression measures model estimation fit with −2 times the log of the likelihood value, referred to as −2LL or −2 log likelihood. The minimum value of −2LL is 0, which corresponds to a perfect fit (likelihood = 1, so −2LL = 0). Thus, the lower the −2LL value, the better the fit of the model.
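For binary outcomes, −2LL can be computed directly from the fitted probabilities; the outcomes and probabilities below are hypothetical, for illustration only:

```python
import math

def neg2_log_likelihood(y, p):
    """-2LL for binary outcomes y (0/1) and fitted probabilities p."""
    ll = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
             for yi, pi in zip(y, p))
    return -2.0 * ll

# Hypothetical outcomes and fitted probabilities (illustration only).
y = [1, 0, 1, 1, 0]
p = [0.9, 0.2, 0.8, 0.7, 0.1]
assert neg2_log_likelihood(y, p) > 0.0

# As the fit approaches perfection (probabilities near 1 for events and
# near 0 for non-events), -2LL is driven toward its minimum of 0.
assert neg2_log_likelihood([1, 0], [0.999999, 0.000001]) < 1e-4
```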
Between-model comparisons: the likelihood value can be compared between equations to assess the difference in predictive fit from one equation to another, with statistical tests for the significance of these differences. The basic approach follows three steps:
1. Estimate a null model: the first step is to estimate a null model, which acts as the baseline for comparisons of improvement in model fit. The most common null model is one without any independent variables; it serves as a baseline against which any model containing independent variables can be compared.
2. Estimate the proposed model: this model contains the independent variables to be included in the logistic regression. Ideally, model fit will improve over the null model, resulting in a lower −2LL value. Any number of proposed models can be estimated.
3. −2LL difference: the final step is to assess the statistical significance of the difference in −2LL between the two models (null versus proposed). If the statistical test indicates a significant difference, we can state that the set of independent variables in the proposed model significantly improves model estimation fit and that the model as a whole is statistically significant.
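The three steps reduce to a likelihood-ratio test on the −2LL difference; the −2LL values below are hypothetical, for illustration only:

```python
# Likelihood-ratio (-2LL difference) test sketch. The -2LL values below are
# hypothetical, for illustration only.
neg2ll_null = 120.5      # null model: intercept only
neg2ll_model = 101.2     # proposed model with k = 3 added predictors

# The -2LL difference follows a chi-square distribution with df = k = 3.
lr_statistic = neg2ll_null - neg2ll_model
assert abs(lr_statistic - 19.3) < 1e-9

# Compare against the chi-square critical value for df = 3 at alpha = 0.05 (7.815):
assert lr_statistic > 7.815  # the added predictors significantly improve fit
```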
Pseudo R2 measures
• Pseudo R2 measures are interpreted in a manner similar to the coefficient of determination in multiple regression. A pseudo R2 value can be derived for logistic regression analogous to the R2 value in regression analysis. The pseudo R2 for a logit model (R2_logit) is calculated as:

R2_logit = [−2LL_null − (−2LL_model)] / (−2LL_null)

The R2_logit value ranges from 0.0 to 1.0. As the proposed model improves model fit, the −2LL value decreases. A perfect fit has a −2LL value of 0.0 and an R2_logit of 1.0.
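The formula is a one-liner; the −2LL values below are hypothetical, for illustration only:

```python
# Pseudo R2 for a logit model, using hypothetical -2LL values (illustration only).
neg2ll_null = 120.5
neg2ll_model = 101.2

r2_logit = (neg2ll_null - neg2ll_model) / neg2ll_null
assert 0.0 <= r2_logit <= 1.0

# A perfect fit (-2LL_model = 0) gives R2_logit = 1.0.
assert (neg2ll_null - 0.0) / neg2ll_null == 1.0
```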
• Two other measures are similar in design and are generally categorized as pseudo R2 measures as well. The Cox and Snell R2 operates in the same manner, with higher values indicating better model fit. However, this measure is limited in that it cannot reach the maximum value of 1, so Nagelkerke proposed a modification with a range of 0 to 1.
• Both of these additional measures are interpreted as reflecting the amount of variation accounted for by the logistic model, with 1.0 indicating perfect model fit.
Hosmer-Lemeshow Test
• The Hosmer-Lemeshow (HL) test is a goodness-of-fit test for logistic regression, used especially for risk prediction models.
• The test is used only for binary response variables.
• It assesses whether the observed event rates match the expected event rates in subgroups of the population.
• If the p-value is less than 0.05, the model is a poor fit.
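The grouping logic behind the HL statistic can be sketched as follows (a simplified illustration with hypothetical data; a full test would compare the statistic to a chi-square distribution to obtain the p-value):

```python
def hosmer_lemeshow_statistic(y, p, groups=10):
    """HL chi-square statistic (sketch): sort cases by predicted probability,
    split them into groups, and compare observed vs. expected event counts.
    The p-value would come from a chi-square with (groups - 2) df."""
    pairs = sorted(zip(p, y))                      # order cases by predicted risk
    size = max(1, len(pairs) // groups)
    stat = 0.0
    for i in range(0, len(pairs), size):
        chunk = pairs[i:i + size]
        n = len(chunk)
        observed = sum(yi for _, yi in chunk)      # observed events in the group
        expected = sum(pi for pi, _ in chunk)      # expected events in the group
        if 0 < expected < n:
            stat += (observed - expected) ** 2 / (expected * (1 - expected / n))
    return stat

# Hypothetical outcomes and fitted probabilities (illustration only).
y = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
p = [0.1, 0.2, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9]
assert hosmer_lemeshow_statistic(y, p, groups=5) >= 0.0
```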
Interpreting coefficients
• Interpreting the direction of original coefficients: the sign of an original coefficient (positive or negative) indicates the direction of the relationship, just as with regression coefficients. A positive coefficient increases the predicted probability, whereas a negative one decreases it.
• Interpreting the direction of exponentiated coefficients: exponentiated coefficients must be interpreted differently because they are the antilogs of the original coefficients. By exponentiating, we state the coefficient in terms of odds, which means that exponentiated coefficients cannot be negative. Because the exponential of 0 (no effect) is 1.0, an exponentiated coefficient of 1.0 corresponds to a relationship with no direction. Thus, exponentiated coefficients above 1.0 reflect a positive relationship and values below 1.0 a negative relationship.
• Percentage change in odds = (exponentiated coefficient − 1.0) × 100
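The percentage-change formula can be checked directly (Python, illustrative only):

```python
import math

def pct_change_in_odds(b):
    """Percentage change in the odds per one-unit change in a predictor,
    computed from its logistic coefficient b: (exp(b) - 1) * 100."""
    return (math.exp(b) - 1.0) * 100.0

assert pct_change_in_odds(0.0) == 0.0    # exp(0) = 1.0: no effect
assert pct_change_in_odds(0.5) > 0.0     # exponentiated value above 1.0: odds increase
assert pct_change_in_odds(-0.5) < 0.0    # exponentiated value below 1.0: odds decrease
```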
Testing for the significance of the coefficients
• We use a statistical test called the Wald statistic to see whether a logistic coefficient is different from 0.
• The Wald statistic is used to test hypotheses about parameters that are estimated by the maximum likelihood method.
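For a single coefficient, the Wald test divides the estimate by its standard error; the coefficient and standard error below are hypothetical, for illustration only:

```python
# Wald test sketch: z = b / SE(b); z squared follows a chi-square with 1 df.
# The coefficient and standard error below are hypothetical (illustration only).
b = 0.42
se_b = 0.15

z = b / se_b
wald = z ** 2
assert abs(wald - (0.42 / 0.15) ** 2) < 1e-12

# Compare |z| against the standard normal critical value at alpha = 0.05 (1.96):
assert abs(z) > 1.96  # reject H0: b = 0 at the 5% level
```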
