
LOGISTIC REGRESSION, POISSON REGRESSION AND GENERALIZED LINEAR MODELS
We have seen that a continuous response Y can depend on continuous or discrete predictor variables X_1, X_2, ..., X_{p-1}. However, a dichotomous (binary) outcome is the most common situation in biology and epidemiology.
Example:
In a longitudinal study of coronary heart disease as a function of age, the response variable Y was defined to have two possible outcomes: the person developed heart disease during the study, or the person did not develop heart disease during the study. These outcomes may be coded 1 and 0, respectively.
Logistic regression

Age and signs of coronary heart disease (CD):

Age  CD    Age  CD    Age  CD
22   0     40   0     54   0
23   0     41   1     55   1
24   0     46   0     58   1
27   0     47   0     60   1
28   0     48   0     60   0
30   0     49   1     62   1
30   0     49   0     65   1
32   0     50   1     67   1
33   0     51   0     71   1
35   1     51   1     77   1
38   0     52   0     81   1
Prevalence (%) of signs of CD according to age group:

Age group   # in group   # Diseased   % Diseased
20-29       5            0            0
30-39       6            1            17
40-49       7            2            29
50-59       7            4            57
60-69       5            4            80
70-79       2            2            100
80-89       1            1            100

[Figure: % diseased plotted against age (years)]
The simple linear regression model:

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad Y_i = 0, 1

The response function:

E\{Y_i\} = \beta_0 + \beta_1 X_i
We view Y_i as a random variable with a Bernoulli distribution with parameter \pi_i:

P(Y_i = 1) = \pi_i, \qquad P(Y_i = 0) = 1 - \pi_i

or, equivalently,

P(Y_i = k) = \pi_i^{k} (1 - \pi_i)^{1-k}, \qquad k = 0, 1

so that

E\{Y_i\} = 1 \cdot \pi_i + 0 \cdot (1 - \pi_i) = \pi_i
Special Problems When the Response Variable Is Binary

1. Nonnormal Error Terms

When Y_i = 1: \varepsilon_i = 1 - \beta_0 - \beta_1 X_i
When Y_i = 0: \varepsilon_i = -\beta_0 - \beta_1 X_i

Can we assume the \varepsilon_i are normally distributed?

2. Nonconstant Error Variance

\sigma^2\{\varepsilon_i\} = (\beta_0 + \beta_1 X_i)(1 - \beta_0 - \beta_1 X_i)

Ordinary least squares is no longer optimal.

3. Constraints on the Response Function

0 \le E\{Y_i\} \le 1

What does E\{Y_i\} mean?

E\{Y_i\} = \beta_0 + \beta_1 X_i = \pi_i

E\{Y_i\} is the probability that Y_i = 1 when the level of the predictor variable is X_i. This interpretation applies whether the response function is a simple linear one, as shown above, or a complex multiple regression one.
The logistic function

P(Y = 1) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}

[Figure: probability of disease as a logistic function of x]

Both theoretical and empirical results suggest that when the response variable is binary, the shape of the response function is frequently curvilinear, resembling either a tilted S or a reverse tilted S.
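As a quick algebraic check (a standard manipulation, added here for clarity), the logistic response function keeps the probability between 0 and 1 while its log-odds is exactly linear in x:

\frac{P(Y=1)}{1 - P(Y=1)} = e^{\beta_0 + \beta_1 x}
\quad\Longrightarrow\quad
\log_e\!\left(\frac{P(Y=1)}{1 - P(Y=1)}\right) = \beta_0 + \beta_1 x

This linear form is the logit transformation used in the remainder of these notes.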
Simple Logistic Regression
1. Model: Y_i = E\{Y_i\} + \varepsilon_i

where the Y_i are independent Bernoulli random variables with

E\{Y_i\} = \pi_i = \frac{\exp(\beta_0 + \beta_1 X_i)}{1 + \exp(\beta_0 + \beta_1 X_i)}
2. How to estimate \beta_0 and \beta_1?

a. Likelihood Function:
Since the Y_i observations are independent, their joint probability function is

g(Y_1, \ldots, Y_n) = \prod_{i=1}^{n} \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}

The logarithm of the joint probability function (the log-likelihood function) is

\log_e L(\beta_0, \beta_1) = \log_e g(Y_1, \ldots, Y_n)
= \sum_{i=1}^{n} \left[ Y_i \log_e\!\left( \frac{\pi_i}{1 - \pi_i} \right) \right] + \sum_{i=1}^{n} \log_e (1 - \pi_i)
= \sum_{i=1}^{n} Y_i (\beta_0 + \beta_1 X_i) - \sum_{i=1}^{n} \log_e [1 + \exp(\beta_0 + \beta_1 X_i)]
b. Maximum Likelihood Estimation:

The logit transformation:

\pi_i' = \log_e\!\left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1 X_i

[Figure: Maximum Likelihood Estimation — plot of the log-likelihood]
The maximum likelihood estimates of \beta_0 and \beta_1 in the simple logistic regression model are those values of \beta_0 and \beta_1 that maximize the log-likelihood function. However, no closed-form solution exists for the values of \beta_0 and \beta_1 that maximize the log-likelihood function. Computer-intensive numerical search procedures are therefore widely used to find the maximum likelihood estimates b_0 and b_1. We shall rely on standard statistical software programs specifically designed for logistic regression to obtain the maximum likelihood estimates b_0 and b_1.
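Although these notes rely on software for the search, it may help to see what the search is solving. Setting the partial derivatives of the log-likelihood to zero gives the estimating equations (a standard result, stated here for completeness):

\frac{\partial \log_e L}{\partial \beta_0} = \sum_{i=1}^{n} (Y_i - \pi_i) = 0,
\qquad
\frac{\partial \log_e L}{\partial \beta_1} = \sum_{i=1}^{n} X_i (Y_i - \pi_i) = 0

Because \pi_i depends on \beta_0 and \beta_1 through the logistic function, these equations are nonlinear in the parameters, which is why no closed-form solution exists.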
3. Fitted Logit Response Function

\hat{\pi}_i' = \log_e\!\left( \frac{\hat{\pi}_i}{1 - \hat{\pi}_i} \right) = b_0 + b_1 X_i
4. Interpretation of b_1

\log_e\!\left( \frac{\hat{\pi}}{1 - \hat{\pi}} \right) = b_0 + b_1 X

When X = X_j:

\text{odds}_1 = \frac{\hat{\pi}_j}{1 - \hat{\pi}_j} = e^{b_0 + b_1 X_j}

When X = X_j + 1:

\text{odds}_2 = \frac{\hat{\pi}_{j+1}}{1 - \hat{\pi}_{j+1}} = e^{b_0 + b_1 (X_j + 1)}

The estimated odds ratio:

\widehat{OR} = \frac{\text{odds}_2}{\text{odds}_1} = e^{b_1}, \qquad \log_e \widehat{OR} = b_1

b_1 = the increase in the log-odds for a one-unit increase in X.

Assumption
[Figure: \pi versus the predictor, and the logit transform versus the predictor]
Example:
Y = 1 if the task was finished, 0 if the task wasn't finished
X = months of programming experience

Person   Months of        Task         Fitted      Deviance
i        Experience Xi    Success Yi   Value π̂i    Residual devi
1        14               0            0.31        -0.862
2        29               0            0.835       -1.899
3        6                0            0.110       -0.483
.        .                .            .           .
23       28               1            0.812       0.646
24       22               1            0.621       0.976
25       8                1            0.146       1.962
SAS CODE:
proc logistic data = ch14ta01 ;
model y (event='1') = x ;
run;
Notice that we can specify which event to model using the event= option in the model statement. The other way of specifying that we want to model 1 as the event instead of 0 is to use the descending option in the proc logistic statement, as sketched below.
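A minimal sketch of that alternative call (assuming the same dataset ch14ta01; both forms should produce the same fit):

proc logistic data = ch14ta01 descending;
model y = x;
run;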
SAS OUTPUT:
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -3.0597 1.2594 5.9029 0.0151
x 1 0.1615 0.0650 6.1760 0.0129
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
x 1.175 1.035 1.335
How can we use the output to calculate \hat{\pi}_1? How do we interpret \hat{\pi}_1 = 0.31?
Interpretation of Odds Ratio
OR = 1.175 means that the odds of completing the task increase by 17.5 percent with each additional month of experience.

Interpretation of b_1
b_1 = 0.1615 means that the log-odds of completing the task increase by 0.1615 with each additional month of experience.
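As a worked check of these numbers (using b_0 = -3.0597 and b_1 = 0.1615 from the output above, and X_1 = 14 for person 1):

\hat{\pi}_1 = \frac{\exp(-3.0597 + 0.1615 \times 14)}{1 + \exp(-3.0597 + 0.1615 \times 14)}
= \frac{\exp(-0.7987)}{1 + \exp(-0.7987)} \approx 0.31,
\qquad
\widehat{OR} = e^{b_1} = e^{0.1615} \approx 1.175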
4. Repeat Observations: Binomial Outcomes

In some cases, particularly for designed experiments, a number of repeat observations are obtained at several levels of the predictor variable X. For example, in a study of the effectiveness of coupons offering a price reduction on a given product, 1000 homes were selected at random. The coupons offered different price reductions (5, 10, 15, 20 and 30 dollars), and 200 homes were assigned at random to each of the price reduction categories.
Level   Price          Number of       Number of Coupons   Proportion of Coupons   Model-Based
j       Reduction Xj   Households nj   Redeemed Y.j        Redeemed pj             Estimate π̂j
1       5              200             30                  .150                    .1736
2       10             200             55                  .275                    .2543
3       15             200             70                  .350                    .3562
4       20             200             100                 .500                    .4731
5       30             200             137                 .685                    .7028
Y_{ij} = 1 if the ith household redeemed the coupon at level X_j, and Y_{ij} = 0 if the ith household did not redeem the coupon at level X_j; i = 1, \ldots, n_j; j = 1, 2, 3, 4, 5.

Y_{\cdot j} = \sum_{i=1}^{n_j} Y_{ij}, \qquad p_j = \frac{Y_{\cdot j}}{n_j}

The random variable Y_{\cdot j} has a binomial distribution given by:

f(Y_{\cdot j}) = \binom{n_j}{Y_{\cdot j}} \pi_j^{Y_{\cdot j}} (1 - \pi_j)^{n_j - Y_{\cdot j}},
\qquad \text{where } \binom{n_j}{Y_{\cdot j}} = \frac{n_j!}{Y_{\cdot j}!\,(n_j - Y_{\cdot j})!}

The log-likelihood function:

\log_e L(\beta_0, \beta_1) = \sum_{j=1}^{c} \left[ \log_e \binom{n_j}{Y_{\cdot j}} + Y_{\cdot j}(\beta_0 + \beta_1 X_j) - n_j \log_e [1 + \exp(\beta_0 + \beta_1 X_j)] \right]
SAS CODE:
data ch14ta02;
infile 'c:\stat231B06\ch14ta02.txt';
input x n y pro;
proc logistic data=ch14ta02;
model y/n=x;
/*request estimates of the predicted */
/*values to be stored in a file named */
/*estimates under the variable name pie*/
output out=estimates p=pie;
run;
proc print data=estimates;
run;
SAS OUTPUT:
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -2.0443 0.1610 161.2794 <.0001
x 1 0.0968 0.00855 128.2924 <.0001
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
x 1.102 1.083 1.120
Obs x n y pro pie
1 5 200 30 0.150 0.17362
2 10 200 55 0.275 0.25426
3 15 200 70 0.350 0.35621
4 20 200 100 0.500 0.47311
5 30 200 137 0.685 0.70280
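As a worked check of the model-based estimates (using b_0 = -2.0443 and b_1 = 0.0968 from the output above), for the 5-dollar price reduction:

\hat{\pi}_1 = \frac{\exp(-2.0443 + 0.0968 \times 5)}{1 + \exp(-2.0443 + 0.0968 \times 5)}
= \frac{\exp(-1.5603)}{1 + \exp(-1.5603)} \approx 0.1736

which agrees with the value of pie printed for Obs 1.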
Multiple Logistic Regression

1. Model: Y_i = E\{Y_i\} + \varepsilon_i

where the Y_i are independent Bernoulli random variables with

E\{Y_i\} = \pi_i = \frac{\exp(\mathbf{X}_i' \boldsymbol{\beta})}{1 + \exp(\mathbf{X}_i' \boldsymbol{\beta})}

2. How to estimate the vector \boldsymbol{\beta}?

\log_e L(\boldsymbol{\beta}) = \sum_{i=1}^{n} Y_i (\mathbf{X}_i' \boldsymbol{\beta}) - \sum_{i=1}^{n} \log_e [1 + \exp(\mathbf{X}_i' \boldsymbol{\beta})]

3. Fitted Logit Response Function

\log_e\!\left( \frac{\hat{\pi}_i}{1 - \hat{\pi}_i} \right) = \mathbf{X}_i' \mathbf{b}
Example:

Y = 1 if disease present, 0 if disease absent
X1 = age
X2 = 1 if Middle Class, 0 otherwise (socioeconomic status)
X3 = 1 if Lower Class, 0 otherwise (socioeconomic status)
X4 = 1 if city sector 2, 0 if city sector 1

Study purpose: assess the strength of the association between each of the predictor variables and the probability of a person having contracted the disease.
SAS CODE:
data ch14ta03;
infile 'c:\stat231B06\ch14ta03.txt'
DELIMITER='09'x;
input case x1 x2 x3 x4 y;
proc logistic data=ch14ta03;
model y (event='1')=x1 x2 x3 x4;
run;
Case   Age    Socioeconomic    City        Disease     Fitted
i      Xi1    Status           Sector      Status      Value
              Xi2   Xi3        Xi4         Yi          π̂i
1      33     0     0          0           0           .209
2      35     0     0          0           0           .219
3      6      0     0          0           0           .106
4      60     0     0          0           0           .371
5      18     0     1          0           1           .111
6      26     0     1          0           0           .136
.      .      .     .          .           .           .
98     35     0     1          0           0           .171
SAS OUTPUT:
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -2.3127 0.6426 12.9545 0.0003
x1 1 0.0297 0.0135 4.8535 0.0276
x2 1 0.4088 0.5990 0.4657 0.4950
x3 1 -0.3051 0.6041 0.2551 0.6135
x4 1 1.5746 0.5016 9.8543 0.0017
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
x1 1.030 1.003 1.058
x2 1.505 0.465 4.868
x3 0.737 0.226 2.408
x4 4.829 1.807 12.907
The odds of a person having contracted the disease increase by about 3.0 percent with each additional year of age (X1), for given socioeconomic status and city sector location. The odds of a person in sector 2 (X4) having contracted the disease are almost five times as great as for a person in sector 1, for given age and socioeconomic status.
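As a worked link between the coefficient and odds-ratio columns (arithmetic only, using the estimates above):

\widehat{OR}_{X1} = e^{0.0297} \approx 1.030, \qquad \widehat{OR}_{X4} = e^{1.5746} \approx 4.83

which is where the "3.0 percent" and "almost five times" figures come from.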
Polynomial Logistic Regression

1. Model: Y_i = E\{Y_i\} + \varepsilon_i

with logit response function

\pi_i' = \log_e\!\left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_{11} x_i + \beta_{22} x_i^2 + \cdots + \beta_{kk} x_i^k

where the Y_i are independent Bernoulli random variables with

E\{Y_i\} = \pi_i = \frac{\exp(\mathbf{X}_i' \boldsymbol{\beta})}{1 + \exp(\mathbf{X}_i' \boldsymbol{\beta})}

and x denotes the centered predictor, X - \bar{X}.
Example:

Y = 1 if the IPO was financed by venture capital funds, 0 if the IPO wasn't financed by venture capital funds
X1 = the face value of the company

Study purpose: determine the characteristics of companies that attract venture capital.
SAS CODE:
data ipo;
infile 'c:\stat231B06\appenc11.txt';
input case vc faceval shares x3;
lnface=LOG(faceval);
run;
* Run 1st order logistic regression analysis;
proc logistic data=ipo descending;
model vc=lnface;
output out=linear p=linpie;
run;
* Produce scatter plot and fitted 1st order logistic;
data graph1;
set linear;
run;
proc sort data=graph1;
by lnface;
run;
proc gplot data=graph1;
symbol1 color=black value=none interpol=join;
symbol2 color=black value=circle;
title 'Scatter Plot and 1st Order Logit Curve';
plot linpie*lnface vc*lnface/overlay;
/* /overlay means to overlay the two graphs */
run;
* Find mean of lnface=16.7088;
proc means;
var lnface;
run;
* Run 2nd order logistic regression analysis;
data step2;
set linear;
xcnt=lnface-16.708;
xcnt2=xcnt**2;
run;
proc logistic data=step2 descending;
model vc=xcnt xcnt2;
output out=estimates p=pie;
run;
* Produce scatter plot and fitted 2nd order logistic;
data graph2;
set estimates;
run;
proc sort data=graph2;
by xcnt;
run;
proc gplot data=graph2;
symbol1 color=black value=none interpol=join;
symbol2 color=black value=circle;
title 'Scatter Plot and 2nd Order Logit Curve';
plot pie*xcnt vc*xcnt/overlay;
/* /overlay means to overlay the two graphs */
run;
[Figures: estimated probability versus lnface (first-order fit) and estimated probability versus xcnt (second-order fit), with the observed outcomes overlaid]
1. The natural logarithm of face value is chosen because face value ranges over several orders of magnitude, with a highly skewed distribution.
2. The lowess smooth clearly suggests a mound-shaped relationship.
SAS OUTPUT:
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 0.3000 0.1240 5.8566 0.0155
xcnt 1 0.5530 0.1385 15.9407 <.0001
xcnt2 1 -0.8615 0.1404 37.6504 <.0001
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
xcnt 1.739 1.325 2.281
xcnt2 0.423 0.321 0.556
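An illustrative reading of this fit (simple arithmetic from the estimates above, not part of the original notes): the fitted second-order logit is

\hat{\pi}' = 0.3000 + 0.5530\, x_{cnt} - 0.8615\, x_{cnt}^2

which is maximized at x_{cnt} = 0.5530 / (2 \times 0.8615) \approx 0.32, i.e., near lnface \approx 16.71 + 0.32 \approx 17.0. The negative coefficient on xcnt2 is what produces the mound-shaped relationship seen in the plot.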
Inferences about Regression Parameters

1. Test Concerning a Single \beta_k: Wald Test

Hypothesis: H_0: \beta_k = 0 vs. H_a: \beta_k \ne 0

Test statistic:

z^* = \frac{b_k}{s\{b_k\}}

Decision rule: If |z^*| \le z(1 - \alpha/2), conclude H_0.
If |z^*| > z(1 - \alpha/2), conclude H_a.
Here z(1 - \alpha/2) is the (1 - \alpha/2) percentile of the standard normal distribution.
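For example, a worked check using the programming-task fit (b_1 = 0.1615 and s\{b_1\} = 0.0650 from the SAS output for that example):

z^* = \frac{0.1615}{0.0650} \approx 2.485, \qquad (z^*)^2 \approx 6.176

Since |z^*| = 2.485 > z(0.975) = 1.96, we conclude H_a; note that (z^*)^2 is the Wald chi-square statistic that SAS reports.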
Note: Approximate joint confidence intervals for several logistic regression model parameters can be developed by the Bonferroni procedure. If g parameters are to be estimated with a family confidence coefficient of approximately 1 - \alpha, the joint Bonferroni confidence limits are b_k \pm B s\{b_k\}, where B = z(1 - \alpha/2g).
2. Interval Estimation of a Single \beta_k

The approximate 1 - \alpha confidence limits for \beta_k:

b_k \pm z(1 - \alpha/2)\, s\{b_k\}

The corresponding confidence limits for the odds ratio \exp(\beta_k):

\exp[\, b_k \pm z(1 - \alpha/2)\, s\{b_k\} \,]
Example:
Y = 1 if the task was finished, 0 if the task wasn't finished
X = months of programming experience

Person   Months of        Task         Fitted      Deviance
i        Experience Xi    Success Yi   Value π̂i    Residual devi
1        14               0            0.31        -0.862
2        29               0            0.835       -1.899
3        6                0            0.110       -0.483
.        .                .            .           .
23       28               1            0.812       0.646
24       22               1            0.621       0.976
25       8                1            0.146       1.962
SAS CODE:
proc logistic data=ch14ta01 ;
model y (event='1')=x /cl;
run ;
Notice that (1) we can specify cl in the model statement to get the output for interval estimates of \beta_0, \beta_1, etc.; (2) the test for \beta_1 is a two-sided test; for a one-sided test, we simply divide the p-value (0.0129) by 2, which yields the one-sided p-value of 0.0065; and (3) the text authors report z^* = 2.485, and the square of z^* equals the Wald chi-square statistic 6.176, which is distributed approximately as a chi-square distribution with df = 1.
SAS OUTPUT:
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -3.0597 1.2594 5.9029 0.0151
x 1 0.1615 0.0650 6.1760 0.0129
H_0: \beta_1 \le 0 vs. H_a: \beta_1 > 0
For \alpha = 0.05: since the one-sided p-value = 0.0065 < 0.05, we conclude H_a, that \beta_1 is positive.
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
x 1.175 1.035 1.335
Wald Confidence Interval for Parameters
Parameter Estimate 95% Confidence Limits
Intercept -3.0597 -5.5280 -0.5914
x 0.1615 0.0341 0.2888
We conclude with approximately 95% confidence that \beta_1 is between 0.0341 and 0.2888. The corresponding 95% confidence limits for the odds ratio are exp(0.0341) = 1.03 and exp(0.2888) = 1.33.
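A worked check of these limits (using b_1 = 0.1615 and s\{b_1\} = 0.0650 from the output above, with z(0.975) = 1.96):

0.1615 \pm 1.96 \times 0.0650 = 0.1615 \pm 0.1274 \;\Rightarrow\; (0.0341,\ 0.2889)

which matches the Wald limits printed by SAS up to rounding.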
3. Test Whether Several \beta_k = 0: Likelihood Ratio Test

Hypothesis: H_0: \beta_q = \beta_{q+1} = \cdots = \beta_{p-1} = 0
vs. H_a: not all of the \beta_k in H_0 equal zero

Full model:

\pi = \frac{\exp(\mathbf{X}' \boldsymbol{\beta}_F)}{1 + \exp(\mathbf{X}' \boldsymbol{\beta}_F)},
\qquad \mathbf{X}' \boldsymbol{\beta}_F = \beta_0 + \beta_1 X_1 + \cdots + \beta_{p-1} X_{p-1}

Reduced model:

\pi = \frac{\exp(\mathbf{X}' \boldsymbol{\beta}_R)}{1 + \exp(\mathbf{X}' \boldsymbol{\beta}_R)},
\qquad \mathbf{X}' \boldsymbol{\beta}_R = \beta_0 + \beta_1 X_1 + \cdots + \beta_{q-1} X_{q-1}

The likelihood ratio statistic:

G^2 = -2 \log_e \frac{L(R)}{L(F)} = -2[\log_e L(R) - \log_e L(F)]

Decision rule: If G^2 \le \chi^2(1-\alpha;\, p-q), conclude H_0.
If G^2 > \chi^2(1-\alpha;\, p-q), conclude H_a.
Example:

Y = 1 if disease present, 0 if disease absent
X1 = age
X2 = 1 if Middle Class, 0 otherwise (socioeconomic status)
X3 = 1 if Lower Class, 0 otherwise (socioeconomic status)
X4 = 1 if city sector 2, 0 if city sector 1

Study purpose: assess the strength of the association between each of the predictor variables and the probability of a person having contracted the disease.
Case   Age    Socioeconomic    City        Disease     Fitted
i      Xi1    Status           Sector      Status      Value
              Xi2   Xi3        Xi4         Yi          π̂i
1      33     0     0          0           0           .209
2      35     0     0          0           0           .219
3      6      0     0          0           0           .106
4      60     0     0          0           0           .371
5      18     0     1          0           1           .111
6      26     0     1          0           0           .136
.      .      .     .          .           .           .
98     35     0     1          0           0           .171
SAS CODE:
data ch14ta03;
infile 'c:\stat231B06\ch14ta03.txt'
DELIMITER='09'x;
input case x1 x2 x3 x4 y;
/*fit full model*/
proc logistic data=ch14ta03;
model y (event='1')=x1 x2 x3 x4;
run;
/*fit reduced model*/
proc logistic data=ch14ta03;
model y (event='1')=x2 x3 x4;
run;
SAS OUTPUT:
Full model:
Model Fit Statistics
Intercept
Intercept and
Criterion Only Covariates
AIC 124.318 111.054
SC 126.903 123.979
-2 Log L 122.318 101.054
Reduced model:
Model Fit Statistics
Intercept
Intercept and
Criterion Only Covariates
AIC 124.318 114.204
SC 126.903 124.544
-2 Log L 122.318 106.204
We use proc logistic to regress Y on X1, X2, X3 and X4 and refer to this as the full model. In the SAS output for the full model we see that the -2 Log Likelihood statistic = 101.054. We now regress Y on X2, X3 and X4 and refer to this as the reduced model. In the SAS output for the reduced model we see that the -2 Log Likelihood statistic = 106.204. Using equation (14.60) of the text (page 581), we find G^2 = 106.204 - 101.054 = 5.15. For \alpha = 0.05 we require \chi^2(.95, 1) = 3.84. Since our computed G^2 value (5.15) is greater than the critical value 3.84, we conclude H_a, that X1 should not be dropped from the model.
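If a p-value for G^2 is wanted, it can be computed in a short SAS data step (a minimal sketch using the built-in PROBCHI and CINV functions; this step is not part of the original code):

data lrtest;
  g2 = 106.204 - 101.054;        /* likelihood ratio statistic G^2        */
  df = 1;                        /* one parameter (X1) is dropped         */
  pvalue = 1 - probchi(g2, df);  /* upper-tail chi-square p-value         */
  crit = cinv(0.95, df);         /* chi-square critical value, about 3.84 */
run;
proc print data=lrtest;
run;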
4. Global Test Whether All \beta_k = 0: Score Chi-Square Test

Let U(\boldsymbol{\beta}) be the vector of first partial derivatives of the log-likelihood with respect to the parameter vector \boldsymbol{\beta}, and let H(\boldsymbol{\beta}) be the matrix of second partial derivatives of the log-likelihood with respect to \boldsymbol{\beta}. Let I(\boldsymbol{\beta}) be either -H(\boldsymbol{\beta}) or the expected value of -H(\boldsymbol{\beta}). Consider a null hypothesis H_0, and let \hat{\boldsymbol{\beta}}_0 be the MLE of \boldsymbol{\beta} under H_0. The chi-square score statistic for testing H_0 is defined by

U'(\hat{\boldsymbol{\beta}}_0)\, I^{-1}(\hat{\boldsymbol{\beta}}_0)\, U(\hat{\boldsymbol{\beta}}_0)

and it has an asymptotic \chi^2 distribution with r degrees of freedom under H_0, where r is the number of restrictions imposed on \boldsymbol{\beta} by H_0.
Example:

Y = 1 if disease present, 0 if disease absent
X1 = age
X2 = 1 if Middle Class, 0 otherwise (socioeconomic status)
X3 = 1 if Lower Class, 0 otherwise (socioeconomic status)
X4 = 1 if city sector 2, 0 if city sector 1

Study purpose: assess the strength of the association between each of the predictor variables and the probability of a person having contracted the disease.
Case   Age    Socioeconomic    City        Disease     Fitted
i      Xi1    Status           Sector      Status      Value
              Xi2   Xi3        Xi4         Yi          π̂i
1      33     0     0          0           0           .209
2      35     0     0          0           0           .219
3      6      0     0          0           0           .106
4      60     0     0          0           0           .371
5      18     0     1          0           1           .111
6      26     0     1          0           0           .136
.      .      .     .          .           .           .
98     35     0     1          0           0           .171
SAS CODE:
data ch14ta03;
infile 'c:\stat231B06\ch14ta03.txt'
DELIMITER='09'x;
input case x1 x2 x3 x4 y;
proc logistic data=ch14ta03;
model y (event='1')=x1 x2 x3 x4;
run;
SAS OUTPUT:
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 21.2635 4 0.0003
Score 20.4067 4 0.0004
Wald 16.6437 4 0.0023
Since the p-value for the score test is 0.0004, we reject the null hypothesis H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0. We can also use the Wald test and the likelihood ratio test to test this null hypothesis.