Logistic Regression

Comparison between linear and logistic regression

Basic:
• Linear regression: the data is modelled using a straight line.
• Logistic regression: the probability of some event is represented as a linear function of a combination of predictor variables.

Linear relationship between dependent and independent variables:
• Linear regression: required.
• Logistic regression: not required.

Independent variables:
• Linear regression: may be correlated with each other (especially in multiple linear regression).
• Logistic regression: should not be correlated with each other (no multicollinearity).
Logistic regression

Logistic regression is an extension of regression that allows us to predict categorical outcomes from predictor variables. For example, it can be used for cancer detection problems: it computes the probability of an event occurring.

When we are trying to predict membership of only two categorical outcomes, the analysis is known as binary logistic regression; when we want to predict membership of more than two categories, we use multinomial (or polychotomous) logistic regression.
Categorical Response Variables
Examples:

Binary response: whether or not a person smokes, Y = {Non-smoker, Smoker}
Binary response: success of a medical treatment, Y = {Survives, Dies}
Ordinal response: opinion poll responses, Y = {Agree, Neutral, Disagree}
Principles behind logistic regression

In simple linear regression we predict Y from a straight line:

Y = b0 + b1*X1 + ε

b0 = the Y intercept
b1 = the gradient of the straight line
X1 = the value of the predictor variable
ε = residual term

With several predictors, the multiple regression equation is Y = b0 + b1*X1 + ... + bn*Xn + ε, where bn is the regression coefficient of the corresponding variable Xn.

In logistic regression, instead of predicting the value of a variable Y from a predictor variable X1 or several predictor variables (Xs), we predict the probability of Y occurring given known values of X1 (or Xs).

P(Y) = 1 / (1 + e^-(b0 + b1*X1 + ... + bn*Xn))

P(Y) = probability of Y occurring
e = the base of natural logarithms

Expressing the multiple linear regression equation in logarithmic terms (called the logit) overcomes the problem of violating the assumption of linearity.
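A minimal Python sketch of this equation (the function name and example coefficients are illustrative, not from the original):

```python
import math

def logistic_probability(b0, coefs, xs):
    """P(Y) = 1 / (1 + e^-(b0 + b1*X1 + ... + bn*Xn))."""
    linear = b0 + sum(b * x for b, x in zip(coefs, xs))
    return 1 / (1 + math.exp(-linear))

# Made-up coefficients: intercept -1.0, one predictor with b1 = 0.5
print(logistic_probability(-1.0, [0.5], [2.0]))  # 0.5, since -1.0 + 0.5*2 = 0
```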
Assumptions

1. Linearity: There is a linear relationship between any continuous predictors and the logit of the outcome variable.

2. Independence of errors: Cases of data should not be related; for example, the same people cannot be measured at different points in time. Violating this assumption produces overdispersion.

3. Multicollinearity: Predictors should not be too highly correlated.

One way around this problem is to transform the data using the logarithmic
transformation. This transformation is a way of expressing a non-linear
relationship in a linear way. The logistic regression equation described above is
based on this principle: it expresses the multiple linear regression equation in
logarithmic terms (called the logit) and thus overcomes the problem of violating
the assumption of linearity.
Why use logistic regression?

▪ There are many important research topics for which the dependent variable is "limited."

▪ For example: voting, morbidity or mortality, and participation data are not continuous or normally distributed.

▪ Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not vote) or 1 (did vote).
Odds is a way of expressing the probability of an event, P(Event).

Odds = P(Event) / [1-P(Event)]

Suppose you shot 100 free throws and made 70. Based on this sample, the
probability of making a free throw is 70%. The odds of making a free throw can
be calculated as:
Odds = 0.70 / (1–0.70) = 2.333
More crucial to the interpretation of logistic regression is the value of the odds ratio
(Exp(B) in the SPSS output), which is an indicator of the change in odds resulting from a
unit change in the predictor.

The odds of an event occurring are defined as the probability of an event occurring
divided by the probability of that event not occurring.
Odds

Definition:

P / (1 - P) = P(Yes) / P(No) is the odds of Yes.

odds = P / (1 - P), and conversely P = odds / (1 + odds).
Odds

Logit form of the model:

log(p / (1 - p)) = b0 + b1*X

The logistic model assumes a linear relationship between the predictors and log(odds).

Equivalently, odds = p / (1 - p) = e^(b0 + b1*X).
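A small Python sketch of these conversions (the function names are illustrative):

```python
def odds_from_p(p):
    # odds = p / (1 - p)
    return p / (1 - p)

def p_from_odds(odds):
    # p = odds / (1 + odds)
    return odds / (1 + odds)

# Free-throw example from earlier: p = 0.70 gives odds of 2.333
print(odds_from_p(0.70))    # 2.333...
print(p_from_odds(2.333))   # ~0.70
```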
Odds ratio

The odds of an event happening are defined as the likelihood that the event will occur, expressed as a proportion of the likelihood that the event will not occur. Therefore, if A is the probability of subjects affected and B is the probability of subjects not affected, then odds = A / B.

For example, the probability of rolling a four on a die is 1/6 and the probability of not rolling a four is 5/6, so the odds of rolling a four are (1/6) / (5/6) = 1/5, or 0.2.

An odds ratio (OR) is a measure of association between a certain property A and a second property B in a population. Specifically, it tells us how the presence or absence of property A affects the presence or absence of property B.

An odds ratio (OR) is a measure of association between an exposure and an outcome. The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.

OR = 1 indicates no effect
OR >1 indicates increased occurrence of event (Risk factor)
OR <1 indicates decreased occurrence of event (protective exposure)
Calculating Odds Ratio

OR = (a/b) / (c/d) = (a*d) / (b*c)

Where
a = Number of exposed cases
b = Number of exposed non-cases
c = Number of unexposed cases
d = Number of unexposed non-cases

Ice cream example. Ill people: people who ate ice cream / people who did not = 13/17. People who are not ill: people who ate ice cream / people who did not = 32/23. So OR = (13/17) / (32/23) = 0.765 / 1.391 ≈ 0.55.

An odds ratio of exactly 1 means that exposure to property A does not affect the odds of property B.
An odds ratio of more than 1 means that there are higher odds of property B happening with exposure to property A.
An odds ratio of less than 1 means that there are lower odds of property B happening with exposure to property A.
Example: TMS for Migraines
Transcranial Magnetic Stimulation vs. Placebo
Pain Free? TMS Placebo
YES 39 22
NO 61 78
Total 100 100

P(TMS) = 39/100 = 0.39
odds(TMS) = 39/61 = 0.639 (equivalently, P = odds / (1 + odds) = 0.639 / 1.639 = 0.39)

P(Placebo) = 22/100 = 0.22
odds(Placebo) = 22/78 = 0.282

Odds ratio = 0.639 / 0.282 = 2.27

The odds of getting relief are 2.27 times higher using TMS than placebo.
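The same 2x2 calculation can be written directly from the a, b, c, d definitions above; a sketch (the function name is illustrative):

```python
def odds_ratio(a, b, c, d):
    """OR = (a/b) / (c/d) = (a*d) / (b*c).

    a: exposed cases, b: exposed non-cases,
    c: unexposed cases, d: unexposed non-cases.
    """
    return (a * d) / (b * c)

# TMS example: exposure = TMS, case = pain free
print(odds_ratio(39, 61, 22, 78))  # ~2.27
```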
A Single Binary Predictor for a Binary Response

Response variable: Y = Success/Failure


Predictor variable: X = Group #1 / Group #2
• Method #1: Binary logistic regression
• Method #2: Z-test, compare two proportions
• Method #3: Chi-square test for 2-way table

All three "tests" are essentially equivalent, but the logistic regression approach allows us to mix other categorical and quantitative predictors in the model.
Example:

Greenfield and colleagues (2008) looked at previously suicidal adolescents (n = 263) and used logistic regression to analyze the associations between baseline variables such as age, sex, presence of psychiatric disorder, previous hospitalizations, and drug and alcohol use, with suicidal behavior at six-month follow-up.

In the study, 186 of the 263 adolescents previously judged as having experienced a suicidal behavior requiring immediate psychiatric consultation did not exhibit suicidal behavior (non-suicidal, NS) at six-month follow-up. Of this group, 86 young people had been assessed as having depression at baseline. Of the 77 young people with persistent suicidal behavior at follow-up (suicidal behavior, SB), 45 had been assessed as having depression at baseline.

What is the OR of suicidal behaviour at six-month follow-up given presence of depression at baseline?

First we determine the numbers to use for (a), (b), (c), (d)

a: Number of exposed cases (+ +) = ?

b: Number of exposed non-cases (+ –) = ?

c: Number of unexposed cases (– +) = ?

d: Number of unexposed non-cases (– –) = ?

Q1: Who are the exposed cases (+ + = a)?
A1: Youth with persistent SB assessed as having depression at baseline: a = 45

Q2: Who are the exposed non-cases (+ - = b)?
A2: Youth with no SB at follow-up assessed as having depression at baseline: b = 86

Q3: Who are the unexposed cases (- + = c)?
A3: Youth with persistent SB not assessed as having depression at baseline: c = 77 (SB) - 45 (depression) = 32

Q4: Who are the unexposed non-cases (- - = d)?
A4: Youth with no SB at follow-up not assessed as having depression at baseline: d = 186 (NS) - 86 (depression) = 100
OR = (a/b) / (c/d) = (45/86) / (32/100) = 0.523 / 0.320 = 1.63

Thus, the odds of persistent suicidal behaviour are 1.63 times higher given a baseline depression diagnosis compared to no baseline depression.
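Checking this with the same 2x2 arithmetic in Python (a quick sketch, plain division only):

```python
a, b, c, d = 45, 86, 32, 100   # counts from the Greenfield example above
print((a / b) / (c / d))       # ~1.63
```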
Log-likelihood

The logistic regression model predicts the probability of an event occurring for a
given person (we would denote this as P(Yi) the probability that Y occurs for the ith
person), based on observations of whether or not the event did occur for that person
(we could denote this as Yi, the actual outcome for the ith person). For a given person,
Y will be either 0 (the outcome didn’t occur) or 1 (the outcome did occur), and the
predicted value, P(Y), will be a value between 0 (there is no chance that the outcome
will occur) and 1 (the outcome will certainly occur).

The log-likelihood is based on summing the probabilities associated with the predicted and actual outcomes:

log-likelihood = Σi [ Yi ln(P(Yi)) + (1 - Yi) ln(1 - P(Yi)) ]
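A minimal sketch of this sum in Python (the data values are made up for illustration):

```python
import math

def log_likelihood(y, p):
    """Sum of Yi*ln(P(Yi)) + (1 - Yi)*ln(1 - P(Yi)) over all cases."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

# Good predictions give a log-likelihood close to 0;
# poor predictions give a large negative value.
print(log_likelihood([1, 0, 1], [0.9, 0.1, 0.8]))  # about -0.43
```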
Binary logistic regression

Outcome (dependent variable): Cured (cured or not cured).


Predictor (independent variable): Intervention (intervention or no treatment).
Predictor (independent variable): Duration (the number of days before treatment
that the patient had the problem).
Block 0 includes only the intercept.

Observed odds = 128/187 = .684
Block 1

This is a test of the null hypothesis that adding the gender variable to the model has not significantly increased our ability to predict the decisions made by our subjects.

This statistic measures how poorly the model predicts the decisions: the smaller the statistic, the better the model.
If our subject is a woman (gender = 0):

P(continue) = 1 / (1 + e^-(-.847 + 1.217*0)) = .30
P(stop) = 1 - P(continue) = 1 - .30 = .70
Odds = .30 / .70 = .43

That is, a woman is only .43 times as likely to decide to continue the research as she is to decide to stop the research.

If our subject is a man (gender = 1):

P(continue) = 1 / (1 + e^-(-.847 + 1.217*1)) = .59
P(stop) = 1 - P(continue) = 1 - .59 = .41
Odds = .59 / .41 = 1.44

That is, a man is 1.44 times more likely to decide to continue the research than to decide to stop the research.
Interpretation

Odds ratio = 1.44 / .43 = 3.35

The Exp(B) or odds ratio tells us that the model predicts that the odds of deciding to continue the research are 3.35 times higher for men than they are for women. For the men, the odds are 1.44, and for the women they are 0.429.
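Using the coefficients quoted above (b0 = -.847, b1 = 1.217), the same probabilities and odds can be reproduced with a short Python sketch (variable names are illustrative):

```python
import math

def p_continue(gender, b0=-0.847, b1=1.217):
    # P(continue) = 1 / (1 + e^-(b0 + b1*gender)); gender: 0 = woman, 1 = man
    return 1 / (1 + math.exp(-(b0 + b1 * gender)))

p_woman, p_man = p_continue(0), p_continue(1)   # .30 and .59
odds_woman = p_woman / (1 - p_woman)            # .43
odds_man = p_man / (1 - p_man)                  # 1.44
print(odds_man / odds_woman)  # about 3.38; the text's 3.35 uses rounded odds
```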
Classification table

The results of our logistic regression can be used to classify subjects with respect to
what decision we think they will make. The model leads to the prediction that the
probability of deciding to continue the research is 30% for women and 59% for men.
Before we can use this information to classify subjects, we need to have a decision
rule. Our decision rule will take the following form: If the probability of the event is
greater than or equal to some threshold, we shall predict that the event will take
place. By default, SPSS sets this threshold to .5. While that seems reasonable, in
many cases we may want to set it higher or lower than .5. Using the default
threshold, SPSS will classify a subject into the “Continue the Research” category if
the estimated probability is .5 or more, which it is for every male subject. SPSS will
classify a subject into the “Stop the Research” category if the estimated probability is
less than .5, which it is for every female subject.

Two terms:

Test sensitivity is the ability of a test to correctly identify those with the disease (true
positive rate), whereas test specificity is the ability of the test to correctly identify
those without the disease (true negative rate).
The Classification Table shows us that this rule allows us to correctly classify 68 / 128 = 53% of the
subjects where the predicted event (deciding to continue the research) was observed. This is
known as the sensitivity of prediction, the P(correct | event did occur), that is, the percentage of
occurrences correctly predicted. We also see that this rule allows us to correctly classify 140 / 187
= 75% of the subjects where the predicted event was not observed. This is known as the
specificity of prediction, the P(correct | event did not occur), that is, the percentage of
nonoccurrences correctly predicted. Overall our predictions were correct 208 out of 315 times,
for an overall success rate of 66%. Recall that it was only 59% for the model with intercept only.

We could focus on error rates in classification. A false positive would be predicting that the event
would occur when, in fact, it did not. Our decision rule predicted a decision to continue the
research 115 times. That prediction was wrong 47 times, for a false positive rate of 47 / 115 =
41%. A false negative would be predicting that the event would not occur when, in fact, it did
occur. Our decision rule predicted a decision not to continue the research 200 times. That
prediction was wrong 60 times, for a false negative rate of 60 / 200 = 30%.
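These rates follow directly from the four cells of the classification table. A sketch using the counts quoted above; note that the text defines the false positive and false negative rates relative to the predictions, not the events, and the code below follows that definition:

```python
# Counts from the classification table discussed above
tp, fn = 68, 60    # event occurred: predicted yes / predicted no (68 + 60 = 128)
fp, tn = 47, 140   # event did not occur: predicted yes / predicted no (47 + 140 = 187)

sensitivity = tp / (tp + fn)                # 68 / 128  = 0.53
specificity = tn / (tn + fp)                # 140 / 187 = 0.75
overall = (tp + tn) / (tp + fn + fp + tn)   # 208 / 315 = 0.66
false_positive_rate = fp / (tp + fp)        # 47 / 115 = 0.41 (wrong "yes" predictions)
false_negative_rate = fn / (fn + tn)        # 60 / 200 = 0.30 (wrong "no" predictions)
print(sensitivity, specificity, overall)
```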
Example

A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average), and prestige of the undergraduate institution are related to admission into graduate school. The response variable, admit/don't admit, is a binary variable.

SPSS output (tables not reproduced here):

The first table above gives the overall test for the model that includes the predictors.
The chi-square value of 41.46 with a p-value of less than 0.0005 tells us that our
model as a whole fits significantly better than an empty model (i.e., a model with no
predictors).

The -2*log likelihood (458.517) in the Model Summary table can be used in
comparisons of nested models.
In the table labeled Variables in the Equation we see the coefficients, their standard
errors, the Wald test statistic with associated degrees of freedom and p-values, and the
exponential coefficient (also known as an odds ratio). Both gre and gpa are statistically
significant. The overall (i.e., multiple degree of freedom) test for rank is given first,
followed by the terms for rank=1, rank=2, and rank=3. The overall effect of rank is
statistically significant, as are the terms for rank=1 and rank=2. The logistic regression
coefficients give the change in the log odds of the outcome for a one unit increase in
the predictor variable.

For every one unit change in gre, the log odds of admission (versus non-admission)
increases by 0.002.

For a one unit increase in gpa, the log odds of being admitted to graduate school
increases by 0.804.

The indicator variables for rank have a slightly different interpretation. For example,
having attended an undergraduate institution with rank of 1, versus an institution with
a rank of 4, increases the log odds of admission by 1.551.
Building Model: What is the equation of the binary logistic regression model

log(p / (1 - p)) = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4 + b5*x5

where p is the probability of being admitted. Substituting the fitted coefficients:

log(p / (1 - p)) = -3.45 + .002*gre + .804*gpa + 1.55*rank(1) + .876*rank(2) + .211*rank(3)

Equivalently, p = 1 / (1 + e^-(-3.45 + .002*gre + .804*gpa + 1.55*rank(1) + .876*rank(2) + .211*rank(3))).
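As a hedged sketch of fitting the same model outside SPSS, assuming a pandas DataFrame with columns admit (0/1), gre, gpa, and rank (1 to 4); the file name admissions.csv is hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("admissions.csv")  # hypothetical file with admit, gre, gpa, rank

# C(rank, Treatment(reference=4)) creates indicator variables for rank,
# with rank = 4 as the reference category, matching the interpretation above
model = smf.logit("admit ~ gre + gpa + C(rank, Treatment(reference=4))",
                  data=df).fit()
print(model.summary())        # coefficients are changes in the log odds
print(np.exp(model.params))   # exponentiated coefficients = odds ratios, Exp(B)
```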
