Lec-4 Logistic Regression

Uploaded by

ghania azhar
Logistic Regression

Categorical Response Variables


Examples:
Binary Response
  Whether or not a person smokes: Y = Non-smoker / Smoker
  Success of a medical treatment: Y = Survives / Dies
Ordinal Response
  Opinion poll responses: Y = Agree / Neutral / Disagree

Example: Height predicts Gender
Y = Gender (0=Male 1=Female)
X = Height (inches)
Try an ordinary linear regression
> regmodel=lm(Gender~Hgt,data=Pulse)
> summary(regmodel)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.343647 0.397563 18.47 <2e-16 ***
Hgt -0.100658 0.005817 -17.30 <2e-16 ***
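One way to see why a straight-line fit is awkward for a 0/1 response is to plug a few heights into the fitted line above. The sketch below uses only the printed coefficients; the heights 60, 68 and 75 are arbitrary illustrative values, not taken from the Pulse data.

```r
# Coefficients copied from the lm() output above
b0 <- 7.343647
b1 <- -0.100658

# Illustrative heights in inches (assumed values, not from the data set)
hgt <- c(60, 68, 75)
pred <- b0 + b1 * hgt
print(round(pred, 3))

# A proportion must lie in [0, 1]; flag the predictions that do not
outside <- pred < 0 | pred > 1
print(outside)
```

Predictions outside the 0 to 1 range cannot be read as proportions, which is one motivation for the logistic model.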
Ordinary linear regression is used a lot and is taught in every intro stat class.
Logistic regression is rarely taught or even mentioned in intro stats, mostly because of inertia.
We now have the computing power and software to implement logistic regression.
π = Proportion of “Success”

In ordinary regression the model predicts the mean Y for any combination of predictors.
What’s the “mean” of a 0/1 indicator variable?
$$\bar{y} = \frac{\sum y_i}{n} = \frac{\#\text{ of 1's}}{\#\text{ of trials}} = \text{Proportion of "success"}$$
Goal of logistic regression: Predict the “true” proportion of success, π, at any value of the predictor.
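A quick check that the mean of a 0/1 variable is the proportion of 1's. The counts (84 successes, 17 failures) are borrowed from the 3-foot row of the golf putts example in these slides; the ordering of the individual 0/1 values is an assumption and does not affect the result.

```r
# 0/1 responses: 84 successes and 17 failures (order is arbitrary)
y <- c(rep(1, 84), rep(0, 17))

ybar <- mean(y)                    # mean of the indicator variable
prop <- sum(y == 1) / length(y)    # number of 1's over number of trials

print(c(mean = ybar, proportion = prop))
```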
Binary Logistic Regression Model
Y = Binary response X = Quantitative predictor
π = proportion of 1’s (yes,success) at any X
Equivalent forms of the logistic regression model:
Logit form: $$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X$$
Probability form: $$\pi = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
What does this look like?
N.B.: This is natural log (aka “ln”)
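The two forms of the model are inverses of each other, which a short R sketch can confirm numerically. The coefficients b0 = 0 and b1 = 1 are assumed illustrative values, and the helper names `logit` and `logistic` are hypothetical; R's `log()` is the natural log, matching the note above.

```r
# Logit: maps a proportion in (0, 1) to the whole real line
logit <- function(p) log(p / (1 - p))

# Probability form: maps b0 + b1*x back to a proportion
logistic <- function(x, b0, b1) exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))

# Assumed illustrative coefficients (not from any data set in the slides)
b0 <- 0
b1 <- 1
x <- seq(-10, 12, by = 2)
p <- logistic(x, b0, b1)

# Applying the logit to p recovers the linear predictor b0 + b1*x
print(all.equal(logit(p), b0 + b1 * x))

# Base R has the same pair built in: plogis() and qlogis()
print(all.equal(p, plogis(b0 + b1 * x)))
```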
Logit Function
[Figure: function plot (no data) of $y = \frac{\exp(b_0 + b_1 x)}{1 + \exp(b_0 + b_1 x)}$, with x from -10 to 12 and y from 0 to 1.]
Binary Logistic Regression via R
> logitmodel=glm(Gender~Hgt,family=binomial,
data=Pulse)
> summary(logitmodel)
Call:
glm(formula = Gender ~ Hgt, family = binomial)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.77443 -0.34870 -0.05375 0.32973 2.37928

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 64.1416 8.3694 7.664 1.81e-14 ***
Hgt -0.9424 0.1227 -7.680 1.60e-14***
---
Fitted model, using the coefficients above:
$$\hat{\pi} = \frac{e^{64.14 - 0.9424\,\mathrm{Hgt}}}{1 + e^{64.14 - 0.9424\,\mathrm{Hgt}}}$$
= estimated proportion of females at that Hgt
> plot(fitted(logitmodel)~Pulse$Hgt)
> with(Pulse,plot(Hgt,jitter(Gender,amount=0.05)))
> curve(exp(64.1-0.94*x)/(1+exp(64.1-0.94*x)), add=TRUE)
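The fitted curve can also be evaluated directly from the printed coefficients, without the Pulse data. In this sketch the heights 62, 68 and 74 are assumed illustrative values and `prob_female` is a hypothetical helper name.

```r
# Coefficients copied from the glm() output above
b0 <- 64.1416
b1 <- -0.9424

# Estimated proportion of females at a given height (inches)
prob_female <- function(hgt) plogis(b0 + b1 * hgt)

heights <- c(62, 68, 74)   # assumed illustrative heights
print(round(prob_female(heights), 3))

# Height at which the estimated proportion is 0.5: solve b0 + b1*hgt = 0
h50 <- -b0 / b1
print(round(h50, 2))
```

Because the slope is negative, the estimated proportion of females falls as height increases, crossing 0.5 at about 68 inches.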
Example: Golf Putts
Length    3    4    5    6    7
Made     84   88   61   61   44
Missed   17   31   47   64   90
Total   101  119  108  125  134

Build a model to predict the proportion of putts made (success) based on length (in feet).
Logistic Regression for Putting

Call:
glm(formula = Made ~ Length, family = binomial, data =
Putts1)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.8705 -1.1186 0.6181 1.0026 1.4882

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.25684 0.36893 8.828 <2e-16 ***
Length -0.56614 0.06747 -8.391 <2e-16 ***
---
[Figure: linear part of the logistic fit. logitPropMade (about -0.5 to 1.5) plotted against PuttLength (3 to 7).]
Probability Form of Putting Model

$$\hat{\pi} = \frac{e^{3.257 - 0.566\,\mathrm{Length}}}{1 + e^{3.257 - 0.566\,\mathrm{Length}}}$$
[Figure: Probability Made (0 to 1) plotted against PuttLength (2 to 12).]
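To see how closely the probability form tracks the data, the fitted proportions can be set beside the sample proportions. This sketch uses the counts from the golf putts table and the coefficients printed in the putting output; the variable names are illustrative.

```r
# Counts from the golf putts table
length_ft <- 3:7
made   <- c(84, 88, 61, 61, 44)
missed <- c(17, 31, 47, 64, 90)

# Coefficients copied from the putting glm() output
b0 <- 3.25684
b1 <- -0.56614

sample_prop <- made / (made + missed)
fitted_prop <- exp(b0 + b1 * length_ft) / (1 + exp(b0 + b1 * length_ft))

print(round(cbind(length_ft, sample_prop, fitted_prop), 3))
```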
Odds
Definition:
$$\text{odds} = \frac{\pi}{1-\pi} = \frac{P(\text{Yes})}{P(\text{No})}$$ is the odds of Yes.

$$\text{odds} = \frac{\pi}{1-\pi} \iff \pi = \frac{\text{odds}}{1+\text{odds}}$$
Fair die
Event      Prob   Odds
even #     1/2    1 [or 1:1]
X>2        2/3    2 [or 2:1]
roll a 2   1/6    1/5 [or 1/5:1 or 1:5]
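The conversions between probability and odds are easy to check in R with the fair-die events above. The function names `odds_from_prob` and `prob_from_odds` are hypothetical helpers introduced for this sketch.

```r
# Convert a probability to odds, and odds back to a probability
odds_from_prob <- function(p) p / (1 - p)
prob_from_odds <- function(odds) odds / (1 + odds)

# Probabilities for the three fair-die events in the table
p <- c(even = 1/2, greater_than_2 = 2/3, roll_a_2 = 1/6)

odds <- odds_from_prob(p)
print(odds)

# Converting back recovers the original probabilities
print(all.equal(prob_from_odds(odds), p))
```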
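[Figure: logistic curve annotated at two places. Each time x increases by 1, the odds increase by a factor of 2.718. At one point on the curve the same step increases π by .231; at another it increases π by only .072.]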
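The annotated values can be reproduced with a short sketch. It assumes b0 = 0 and b1 = 1 (chosen because e^1 = 2.718) and starting points x = 0 and x = 2; these are assumptions that happen to match the slide's numbers, not values stated in the source.

```r
# Assumed coefficients: slope 1 gives an odds factor of e^1 = 2.718
b0 <- 0
b1 <- 1

odds <- function(x) exp(b0 + b1 * x)
prob <- function(x) plogis(b0 + b1 * x)

x <- c(0, 2)   # assumed starting points on the curve

# Multiplicative change in odds for a one-unit step: the same everywhere
print(round(odds(x + 1) / odds(x), 3))

# Additive change in probability for the same step: depends on where you start
print(round(prob(x + 1) - prob(x), 3))
```

The odds change by a constant factor, while the change in π depends on the position along the curve.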
Odds
Logit form of the model:
$$\log(\text{odds}) = \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X$$
⇒ The logistic model assumes a linear relationship between the predictors and log(odds).
Odds Ratio
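A common way to compare two groups is to look at the ratio of their odds:
$$\text{Odds Ratio} = OR = \frac{\text{Odds}_1}{\text{Odds}_2}$$
Note: Odds ratio (OR) is similar to relative risk (RR).
$$RR = \frac{p_1}{p_2}, \qquad OR = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} = RR \cdot \frac{1-p_2}{1-p_1}$$
So when p is small, OR ≈ RR.

X is replaced by X + 1:
$$\text{odds} = e^{\beta_0 + \beta_1 X}$$
is replaced by
$$\text{odds} = e^{\beta_0 + \beta_1 (X+1)}$$
So the ratio is
$$\frac{e^{\beta_0 + \beta_1 (X+1)}}{e^{\beta_0 + \beta_1 X}} = e^{\beta_1}$$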
Example: TMS for Migraines
Transcranial Magnetic Stimulation vs. Placebo
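Pain Free?   TMS   Placebo
YES           39      22
NO            61      78
Total        100     100

$\hat{\pi}_{TMS} = 0.39$, $\text{odds}_{TMS} = 39/61 = 0.639$
$\hat{\pi}_{Placebo} = 0.22$, $\text{odds}_{Placebo} = 22/78 = 0.282$
Odds ratio = 0.639/0.282 = 2.27
The odds of getting relief are 2.27 times higher with TMS than with placebo.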
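The same arithmetic in R, using only the counts from the TMS table. The relative risk is included for comparison with the earlier note on OR and RR; the variable names are illustrative.

```r
# Counts from the TMS vs. placebo table
yes <- c(TMS = 39, Placebo = 22)
no  <- c(TMS = 61, Placebo = 78)

p_hat <- yes / (yes + no)   # sample proportions pain free
odds  <- yes / no           # sample odds of being pain free

odds_ratio    <- odds[["TMS"]] / odds[["Placebo"]]
relative_risk <- p_hat[["TMS"]] / p_hat[["Placebo"]]

print(round(odds, 3))
print(round(c(OR = odds_ratio, RR = relative_risk), 2))
```

Here the proportions are not small, so the odds ratio and the relative risk differ noticeably.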
Logistic Regression for TMS data
> lmod=glm(cbind(Yes,No)~Group,family=binomial,data=TMS)
> summary(lmod)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.2657 0.2414 -5.243 1.58e-07 ***
GroupTMS 0.8184 0.3167 2.584 0.00977 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 6.8854 on 1 degrees of freedom


Residual deviance: 0.0000 on 0 degrees of freedom
AIC: 13.701

Note: $e^{0.8184} = 2.27$ = odds ratio
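Since the TMS counts are given in full, the fit can be reproduced from scratch. The layout of the `TMS` data frame below (columns `Group`, `Yes`, `No`) is an assumption inferred from the model formula; the source does not show the data frame itself.

```r
# Assumed layout of the TMS data frame, built from the 2-way table
TMS <- data.frame(
  Group = factor(c("Placebo", "TMS")),  # Placebo is the reference level
  Yes   = c(22, 39),
  No    = c(78, 61)
)

lmod <- glm(cbind(Yes, No) ~ Group, family = binomial, data = TMS)

# Coefficients are on the log(odds) scale
print(round(coef(lmod), 4))

# Exponentiating gives the placebo odds and the TMS-to-placebo odds ratio
print(round(exp(coef(lmod)), 3))
```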


Chi-Square Test for 2-way table
> datatable=rbind(c(39,22),c(61,78))
> datatable
     [,1] [,2]
[1,]   39   22
[2,]   61   78
> chisq.test(datatable,correct=FALSE)
Pearson's Chi-squared test

data: datatable
X-squared = 6.8168, df = 1, p-value = 0.00903

Binary Logistic Regression
> lmod=glm(cbind(Yes,No)~Group,family=binomial,data=TMS)
> summary(lmod)

Call:
glm(formula = cbind(Yes, No) ~ Group, family = binomial)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.2657 0.2414 -5.243 1.58e-07 ***
GroupTMS 0.8184 0.3167 2.584 0.00977 **
A Single Binary Predictor for a Binary Response
Response variable: Y = Success/Failure
Predictor variable: X = Group #1 / Group #2
• Method #1: Binary logistic regression
• Method #2: Z-test, compare two proportions
• Method #3: Chi-square test for 2-way table

All three “tests” are essentially equivalent, but the logistic regression approach allows us to mix other categorical and quantitative predictors in the model.
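A sketch comparing the methods on the TMS counts. It uses `prop.test()` for the two-proportion comparison (it reports the squared z statistic as X-squared), and takes the logistic Wald z of 2.584 from the output shown earlier rather than refitting the model.

```r
# 2-way table of counts (rows: Yes/No, columns: TMS/Placebo)
datatable <- rbind(c(39, 22), c(61, 78))

# Method #3: chi-square test without continuity correction
chi <- chisq.test(datatable, correct = FALSE)

# Method #2: compare two proportions (39/100 vs 22/100)
zt <- prop.test(x = c(39, 22), n = c(100, 100), correct = FALSE)

print(c(chisq = unname(chi$statistic), prop = unname(zt$statistic)))
print(c(chisq = chi$p.value, prop = zt$p.value))

# Method #1: squared Wald z for GroupTMS, copied from the glm() output
wald_z <- 2.584
print(wald_z^2)
```

The chi-square and two-proportion statistics agree exactly, while the logistic Wald statistic is close but not identical, consistent with "essentially equivalent".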
Putting Data

Odds using data from 6 feet = 0.953
Odds using data from 5 feet = 1.298
⇒ Odds ratio (6 ft to 5 ft) = 0.953/1.298 = 0.73
The odds of making a putt from 6 feet are 73% of the odds of making from 5 feet.
Golf Putts Data
Length      3      4      5      6      7
Made       84     88     61     61     44
Missed     17     31     47     64     90
Total     101    119    108    125    134
π̂       .8317  .7394  .5648  .4880  .3284
Odds    4.941  2.839  1.298  0.953  0.489
OR              .575   .457   .734   .513
(Each OR compares a length with the length one foot shorter.)
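The derived rows of the table can be recomputed from the counts alone. A minimal sketch, with illustrative variable names:

```r
# Counts from the golf putts table
length_ft <- 3:7
made   <- c(84, 88, 61, 61, 44)
missed <- c(17, 31, 47, 64, 90)

prop <- made / (made + missed)   # sample proportion made
odds <- made / missed            # sample odds of making the putt

# Odds ratio of each length versus the length one foot shorter
or <- odds[-1] / odds[-length(odds)]
names(or) <- paste(length_ft[-1], "to", length_ft[-length(length_ft)])

print(round(prop, 4))
print(round(odds, 3))
print(round(or, 3))
```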


Interpreting “Slope” using Odds Ratio

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X \;\Rightarrow\; \text{odds} = e^{\beta_0 + \beta_1 X}$$

When we increase X by 1, the ratio of the new odds to the old odds is $e^{\beta_1}$, i.e. odds are multiplied by $e^{\beta_1}$.
Odds Ratios for Putts
From samples at each distance:
4 to 3 feet   5 to 4 feet   6 to 5 feet   7 to 6 feet
   0.575         0.457         0.734         0.513
From fitted logistic:
4 to 3 feet   5 to 4 feet   6 to 5 feet   7 to 6 feet
   0.568         0.568         0.568         0.568
In a logistic model, the odds ratio is constant when changing the predictor by one.
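The constant fitted odds ratio follows directly from the slope. This sketch uses the coefficients printed in the putting output; nothing else is assumed.

```r
# Coefficients copied from the putting glm() output
b0 <- 3.25684
b1 <- -0.56614

length_ft <- 3:7
fitted_odds <- exp(b0 + b1 * length_ft)

# Ratio of fitted odds for each one-foot increase in length
fitted_or <- fitted_odds[-1] / fitted_odds[-length(fitted_odds)]

print(round(fitted_or, 3))
print(round(exp(b1), 3))   # the same constant, e^(slope)
```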
Example: 2012 vs 2014 congressional elections

How does % vote won by Obama relate to a Democrat winning a House seat?

See the script elections 12, 14.R


In 2012 a Democrat had a decent chance even if
Obama got only 50% of the vote in the district. In
2014 that was less true.
There is an easy way to graph logistic curves in R.
> library(TeachingDemos)
> with(elect, plot(Obama12,jitter(Dem12,amount=.05)))
> logitmod14=glm(Dem14~Obama12,family=binomial,data=elect)
> Predict.Plot(logitmod14, pred.var="Obama12",add=TRUE,
plot.args = list(lwd=3,col="black"))
R Logistic Output
> PuttModel=glm(Made~Length, family=binomial,data=Putts1)
> anova(PuttModel)
Analysis of Deviance Table
Df Deviance Resid. Df Resid. Dev
NULL 586 800.21
Length 1 80.317 585 719.89

> summary(PuttModel)
Call:
glm(formula = Made ~ Length, family = binomial)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.25684 0.36893 8.828 <2e-16 ***
Length -0.56614 0.06747 -8.391 <2e-16 ***
---
Null deviance: 800.21 on 586 degrees of freedom
Residual deviance: 719.89 on 585 degrees of freedom
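The drop from null to residual deviance can be compared with a chi-square distribution to judge whether Length helps. This likelihood-ratio style test is a standard use of these numbers, but it is an addition here, not something the output above states explicitly.

```r
# Deviances and degrees of freedom copied from the output above
null_dev  <- 800.21
resid_dev <- 719.89

drop_in_deviance <- null_dev - resid_dev   # matches the anova() Deviance column
df <- 586 - 585                            # one extra parameter (Length)

# Upper-tail chi-square probability for the drop in deviance
p_value <- pchisq(drop_in_deviance, df = df, lower.tail = FALSE)

print(drop_in_deviance)
print(p_value)
```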
Two forms of logistic data
1. Response variable Y = Success/Failure or 1/0: “long
form” in which each case is a row in a spreadsheet
(e.g., Putts1 has 587 cases). This is often called
“binary response” or “Bernoulli” logistic regression.
2. Response variable Y = Number of Successes for a
group of data with a common X value: “short form”
(e.g., Putts2 has 5 cases – putts of 3 ft, 4 ft, … 7 ft).
This is often called “Binomial counts” logistic
regression.
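Both forms can be rebuilt from the golf putts counts and fitted side by side. The names `Putts1` and `Putts2` come from the slides, but the column layouts below are assumptions, since the source does not show the data frames.

```r
# Counts from the golf putts table
length_ft <- 3:7
made   <- c(84, 88, 61, 61, 44)
missed <- c(17, 31, 47, 64, 90)

# Short form (assumed layout): one row per putt length
Putts2 <- data.frame(Length = length_ft, Made = made, Missed = missed)
short_fit <- glm(cbind(Made, Missed) ~ Length, family = binomial, data = Putts2)

# Long form (assumed layout): one row per putt, Made coded 1/0
Putts1 <- data.frame(
  Length = rep(c(length_ft, length_ft), times = c(made, missed)),
  Made   = rep(c(1, 0), times = c(sum(made), sum(missed)))
)
long_fit <- glm(Made ~ Length, family = binomial, data = Putts1)

print(nrow(Putts1))
print(round(rbind(short = coef(short_fit), long = coef(long_fit)), 5))
```

The two fits give the same coefficients. The reported deviances differ between the forms because they are measured against different saturated models.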
Binary Logistic Regression Model
Y = Binary response     X = Single predictor
π = proportion of 1’s (yes, success) at any x
Equivalent forms of the logistic regression model:
Logit form: $$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X$$
Probability form: $$\pi = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
Binary Logistic Regression Model

Y = Binary response     X1, X2, …, Xk = Multiple predictors
π = proportion of 1’s (yes, success) at any x1, x2, …, xk

Equivalent forms of the logistic regression model:
Logit form: $$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$$
Probability form: $$\pi = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k}}$$
Interactions in logistic regression
Consider Survival in an ICU as a function of SysBP (BP for short) and Sex
> intermodel=glm(Survive~BP*Sex, family=binomial, data=ICU)
> summary(intermodel)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.439304 1.021042 -1.410 0.15865
BP 0.022994 0.008325 2.762 0.00575 **
Sex 1.455166 1.525558 0.954 0.34016
BP:Sex -0.013020 0.011965 -1.088 0.27653

Null deviance: 200.16 on 199 degrees of freedom


Residual deviance: 189.99 on 196 degrees of freedom
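With an interaction term, each group gets its own intercept and slope on the logit scale. The sketch below works from the printed coefficients and assumes Sex is coded 0/1 (suggested by the coefficient name `Sex`, but not stated in the source); BP = 120 is an arbitrary illustrative value.

```r
# Coefficients copied from the interaction model output above
b <- c(intercept = -1.439304, BP = 0.022994, Sex = 1.455166, BPxSex = -0.013020)

# Assumption: Sex is a 0/1 indicator
int_sex0   <- b[["intercept"]]
slope_sex0 <- b[["BP"]]
int_sex1   <- b[["intercept"]] + b[["Sex"]]
slope_sex1 <- b[["BP"]] + b[["BPxSex"]]

print(c(intercept = int_sex0, slope = slope_sex0))   # logit line when Sex = 0
print(c(intercept = int_sex1, slope = slope_sex1))   # logit line when Sex = 1

# Estimated survival proportions at an illustrative BP of 120
bp <- 120
print(round(plogis(c(sex0 = int_sex0 + slope_sex0 * bp,
                     sex1 = int_sex1 + slope_sex1 * bp)), 3))
```

The interaction coefficient is the difference between the two slopes; its p-value of 0.27653 in the output gives no strong evidence that the slopes differ.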
[Figure: fitted curves by group (Rep = red, Dem = blue). The lines are very close to parallel; not a significant interaction.]
Generalized Linear Model
(1) What is the link between Y and b0 + b1X?
(a) Regular reg: identity
(b) Logistic reg: logit
(c) Poisson reg: log
(2) What is the distribution of Y given X?
(a) Regular reg: Normal (Gaussian)
(b) Logistic reg: Binomial
(c) Poisson reg: Poisson
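In R these two choices are bundled in a family object, which is what the `family=` argument of `glm()` receives. A small sketch listing the default link and distribution for the three cases above:

```r
# Family objects for the three models; each carries a default link
fams <- list(regular = gaussian(), logistic = binomial(), poisson = poisson())

links <- sapply(fams, function(f) f$link)     # answer to question (1)
dists <- sapply(fams, function(f) f$family)   # answer to question (2)

print(data.frame(link = links, distribution = dists))
```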
C-index, a measure of concordance

Med school acceptance: predicted by MCAT and GPA?
Med school acceptance: predicted by coin toss??
> library(Stat2Data)
> data(MedGPA)
> str(MedGPA)
> GPA10=MedGPA$GPA*10
> Med.glm3=glm(Acceptance~MCAT+GPA10, family=binomial, data=MedGPA)
> summary(Med.glm3)
> Accept.hat <- Med.glm3$fitted > .5
> with(MedGPA, table(Acceptance,Accept.hat))

          Accept.hat
Acceptance FALSE TRUE
         0    18    7
         1     7   23

18 + 23 = 41 correct out of 55
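The classification rate follows from the table, and a concordance index can be computed from fitted probabilities. The sketch below uses a common definition of the C-index (the share of success/failure pairs in which the success has the higher fitted probability, ties counted as one half); `c_index` is a hypothetical helper, and the toy `y_toy` and `p_toy` vectors are made up for illustration, not taken from MedGPA.

```r
# Classification table copied from the output above
conf <- matrix(c(18, 7, 7, 23), nrow = 2,
               dimnames = list(Acceptance = c("0", "1"),
                               Accept.hat = c("FALSE", "TRUE")))
accuracy <- sum(diag(conf)) / sum(conf)
print(round(accuracy, 3))

# Hypothetical helper: concordance over all (success, failure) pairs
c_index <- function(y, p) {
  p1 <- p[y == 1]
  p0 <- p[y == 0]
  diffs <- outer(p1, p0, "-")   # every success paired with every failure
  (sum(diffs > 0) + 0.5 * sum(diffs == 0)) / length(diffs)
}

# Toy data (assumed values, for illustration only)
y_toy <- c(0, 0, 1, 1)
p_toy <- c(0.10, 0.40, 0.35, 0.80)
print(c_index(y_toy, p_toy))
```

With Stat2Data installed and the model fitted as above, the same helper could be applied as `c_index(MedGPA$Acceptance, fitted(Med.glm3))`. A coin toss would be expected to give a value near 0.5.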
