Logistic Regression-1

This document discusses logistic regression, which is a statistical model used for binary dependent variables. It covers the rationale behind the logistic model, odds ratios, and how to interpret logistic regression coefficients. Examples are provided to illustrate key concepts like binary and continuous independent variables, as well as multivariate logistic regression.

Uploaded by

Neha Singh

Logistic Regression

Dr. Gajendra K. Vishwakarma


Associate Scientist-Clinical Research
Lupin Research Park, Pune
LOGISTIC REGRESSION:

E.g. 1. Y = Cure / No cure
        X = Therapy, other patient variables [COHORT / RCT]
E.g. 2. Y = Case / Control (cancer / non-cancer)
        X = Risk factors [Age, Sex, Smoking, Occupation] [CASE / CONTROL]
E.g. 3. Y = MI / No MI
        X = Risk factors [Age, Sex, Family history, ...] [COHORT]
OUTCOME Y IS BINARY
Logistic Model

Dependent variable:

$$Y = \begin{cases} 1 & \text{if ``success'' (event)} \\ 0 & \text{if ``failure'' (no event)} \end{cases}$$

Examples: Dead / Alive


Case / Control
Exposed / Non exposed
LOGISTIC MODEL (contd.)

$$\Pr(Y = 1 \mid X) = \frac{e^{a + bx}}{1 + e^{a + bx}}$$

where X is an independent variable.
RATIONALE
$$P = \Pr(Y = 1 \mid X) = \frac{e^{a + bx}}{1 + e^{a + bx}}$$

$$1 - P = \Pr(Y = 0 \mid X) = 1 - \frac{e^{a + bx}}{1 + e^{a + bx}} = \frac{1}{1 + e^{a + bx}}$$

$$\frac{P}{1 - P} = e^{a + bx} = \text{ODDS}$$

$$\ln\frac{P}{1 - P} = a + bx = \text{LOG ODDS} = \text{LOGIT}$$
[Figure: ln(P/(1-P)) and P plotted against X]

b = "Slope", a = "Location"
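The relationship between probability, odds and logit can be checked numerically. The following is a minimal sketch in Python; the values of a, b and x are hypothetical and chosen only for illustration.

```python
import math

def logistic(a, b, x):
    """Pr(Y = 1 | X = x) under the logistic model with location a and slope b."""
    z = a + b * x
    return math.exp(z) / (1.0 + math.exp(z))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

def logit(p):
    """Log odds (logit) corresponding to a probability p."""
    return math.log(odds(p))

# Hypothetical parameter values, not taken from the slides.
a, b, x = -2.0, 0.5, 3.0
p = logistic(a, b, x)

print("P     =", p)
print("odds  =", odds(p), "(equals e^(a + bx) =", math.exp(a + b * x), ")")
print("logit =", logit(p), "(equals a + bx =", a + b * x, ")")
```

The logit recovers a + bx exactly, which is why the model is linear on the log-odds scale even though P itself is not linear in X.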
ODDS RATIOS: BINARY X

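E.g.

$$X = \begin{cases} 1 & \text{Exposed} \\ 0 & \text{Unexposed} \end{cases}$$

FOR EXPOSED:

$$\ln\frac{P_1}{1 - P_1} = a + b x_1 = a + b$$

FOR UNEXPOSED:

$$\ln\frac{P_0}{1 - P_0} = a + b x_0 = a$$

Therefore,

$$\ln\frac{P_1}{1 - P_1} - \ln\frac{P_0}{1 - P_0} = (a + b) - a = b$$

$$\ln\left[\frac{P_1 / (1 - P_1)}{P_0 / (1 - P_0)}\right] = b$$

Therefore,

b = ln(Odds Ratio)
or
Odds Ratio = $e^{b}$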
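As a quick numerical check of this result, the sketch below computes the odds for exposed and unexposed individuals directly from the model and compares their ratio with e^b. The values of a and b are hypothetical.

```python
import math

def prob(a, b, x):
    """Pr(Y = 1 | X = x) under the logistic model."""
    z = a + b * x
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical location and slope, for illustration only.
a, b = -1.5, 0.8

p1 = prob(a, b, 1)  # exposed (X = 1)
p0 = prob(a, b, 0)  # unexposed (X = 0)

odds_exposed = p1 / (1.0 - p1)
odds_unexposed = p0 / (1.0 - p0)
odds_ratio = odds_exposed / odds_unexposed

print("odds ratio from probabilities:", odds_ratio)
print("e^b                          :", math.exp(b))
```

The location a cancels in the ratio, so the odds ratio depends on b alone.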
CONTINUOUS X:

E.g. X = Packs per day

b = ln(Odds Ratio) associated with a one-unit increase in X

E.g. 4 vs. 3 packs per day


MULTIPLE LOGISTIC MODEL
For an individual with independent variable values $X_1, X_2, \ldots, X_k$:

$$\Pr(Y = 1 \mid X_1, X_2, \ldots, X_k) = \frac{e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k}}{1 + e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k}}$$

$$\ln\frac{P}{1 - P} = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$

$b_1$ = ln(OR) for $X_1$, adjusted for $X_2, X_3, \ldots, X_k$

$b_2$ = ln(OR) for $X_2$, adjusted for $X_1, X_3, \ldots, X_k$, etc.

INTERPRETATION SIMILAR TO LINEAR REGRESSION, BUT ON LOGIT SCALE
Age and Coronary Heart Disease Status
(CHD) of 100 subjects
ID AGRP AGE CHD
1 1 20 0
2 1 23 0
3 1 24 1
4 1 25 0
5 1 25 0
. . . .
. . . .
. . . .
97 8 64 0
98 8 64 1
99 8 65 1
100 8 65 1
[Figure: scatter plot of Coronary Heart Disease (CHD) status against AGE]
EFFECT OF DATA GROUPING

Frequency table of age group by CHD

Age group     n    CHD absent   CHD present   Mean (proportion with CHD)
20 - 29      10         9            1          0.10
30 - 34      15        13            2          0.13
35 - 39      12         9            3          0.25
40 - 44      15        10            5          0.33
45 - 49      13         7            6          0.46
50 - 54       8         3            5          0.63
55 - 59      17         4           13          0.76
60 - 69      10         2            8          0.80
Total       100        57           43          0.43
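The grouped proportions can be recomputed from the counts and put on the log-odds scale used by the logistic model. This is a minimal sketch in Python that uses only the counts in the table; the variable names are illustrative.

```python
import math

# (age group, n, CHD present) copied from the frequency table.
age_groups = [
    ("20-29", 10, 1), ("30-34", 15, 2), ("35-39", 12, 3), ("40-44", 15, 5),
    ("45-49", 13, 6), ("50-54", 8, 5), ("55-59", 17, 13), ("60-69", 10, 8),
]

summary = []
for label, n, present in age_groups:
    proportion = present / n                               # estimate of P in this group
    log_odds = math.log(proportion / (1.0 - proportion))   # ln(P / (1 - P))
    summary.append((label, proportion, log_odds))
    print(f"{label}: proportion = {proportion:.2f}, log odds = {log_odds:+.2f}")

total_n = sum(n for _, n, _ in age_groups)
total_present = sum(present for _, _, present in age_groups)
print("overall proportion =", total_present / total_n)
```

Grouping replaces the 0/1 outcome with a proportion per age group, which makes the rise in CHD with age visible in a way the raw scatter of zeros and ones does not.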
MAXIMUM LIKELIHOOD ESTIMATION (MLE)

A method of estimating (finding the values of) the unknown parameters ($\beta$) in such a way that it maximizes the probability of obtaining the observed data set.

$$P(\text{CHD} = \text{yes} \mid \text{Age}) = \frac{e^{\text{constant} + \beta(\text{age})}}{1 + e^{\text{constant} + \beta(\text{age})}}$$
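Likelihood Function:
The probability of the observed data is expressed as a function of the unknown parameters. That is,

$$P(y = 1 \mid X = \text{Age}) = \pi(x) = \pi(\text{Age})$$
$$P(y = 0 \mid X = \text{Age}) = 1 - \pi(x) = 1 - \pi(\text{Age})$$

$$\pi_{10}(\text{Age} = 40) = \frac{e^{\text{constant} + \beta(\text{age} = 40)}}{1 + e^{\text{constant} + \beta(\text{age} = 40)}}$$

The likelihood function for the 10th individual is

$$\pi_{10}(\text{age} = 40)^{\text{chd}} \, \left[1 - \pi_{10}(\text{age} = 40)\right]^{1 - \text{chd}}$$

where chd = 1 if the individual has CHD and chd = 0 otherwise, so only one of the two factors contributes.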
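The likelihood for the whole sample is the product of the individual contributions, so the log-likelihood is their sum. The following sketch evaluates it for the nine subjects that are listed explicitly in the data table (IDs 1 to 5 and 97 to 100). The values of the constant and of beta are hypothetical trial values, not estimates from the slides.

```python
import math

def pi(constant, beta, age):
    """P(CHD = 1 | age) under the logistic model."""
    z = constant + beta * age
    return math.exp(z) / (1.0 + math.exp(z))

def log_likelihood(constant, beta, ages, chd):
    """Sum over individuals of chd*ln(pi) + (1 - chd)*ln(1 - pi)."""
    total = 0.0
    for age, y in zip(ages, chd):
        p = pi(constant, beta, age)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Subjects 1-5 and 97-100 from the data table (the rows in between are elided there).
ages = [20, 23, 24, 25, 25, 64, 64, 65, 65]
chd = [0, 0, 1, 0, 0, 0, 1, 1, 1]

# Hypothetical trial values of the parameters, for illustration only.
ll_trial = log_likelihood(-5.0, 0.1, ages, chd)
ll_flat = log_likelihood(0.0, 0.0, ages, chd)  # every pi equals 0.5

print("log-likelihood at (constant=-5, beta=0.1):", ll_trial)
print("log-likelihood at (constant=0,  beta=0)  :", ll_flat)
```

Maximum likelihood estimation searches for the constant and beta that make this sum as large as possible; in practice that search is done by statistical software rather than by hand.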


Likelihood Ratio:

Significance: are the predicted values better, or more accurate, with the variable in the model than without it? The likelihood is higher when the values expected under the model lie closer to the observed values, that is, when the differences between observed and expected (O - E) are small.

As you add a variable to the model, the maximized likelihood goes up (it cannot go down). Therefore, to assess the significance of a variable we need the log-likelihood of the model with and without the variable.

G = -2 (log-likelihood for the model without the variable - log-likelihood for the model with the variable)

Under the null hypothesis, G approximately follows a chi-square distribution, with 1 d.f. when a single coefficient is added.

The null hypothesis that the slope coefficient $\beta_1 = 0$ can be tested using this chi-square statistic.
Example:
The null hypothesis for the table is that age is not associated with CHD.
The model is

$$P(\text{CHD} = 1 \mid \text{age}) = \frac{e^{\beta_0 + \beta_1 \text{age}}}{1 + e^{\beta_0 + \beta_1 \text{age}}}$$

The null hypothesis implies that $\beta_1 = 0$.

The log-likelihood without age is -68.331
The log-likelihood with age is -53.677
G = -2 (-68.331 - (-53.677)) = 29.31
G = 29.31 is a chi-square value with 1 d.f.
P($\chi^2(1)$ > 29.31) < 0.001
This implies that age is significantly (P < 0.001) associated with CHD.
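The test statistic in this example can be reproduced with a few lines of Python. This is a sketch under two assumptions: the log-likelihood without age is computed from the 43 CHD cases among the 100 subjects in the frequency table, and the log-likelihood with age (-53.677) is taken from the example as given. The tail probability uses a closed form that holds for 1 d.f. only.

```python
import math

n, cases = 100, 43          # totals from the frequency table
p_hat = cases / n           # fitted probability when the model has a constant only

# Log-likelihood of the constant-only model (no age).
ll_without_age = cases * math.log(p_hat) + (n - cases) * math.log(1.0 - p_hat)

# Log-likelihood of the model with age, as reported in the example.
ll_with_age = -53.677

G = -2.0 * (ll_without_age - ll_with_age)

# For 1 d.f., P(chi-square > g) = erfc(sqrt(g / 2)).
p_value = math.erfc(math.sqrt(G / 2.0))

print("log-likelihood without age:", round(ll_without_age, 3))
print("G =", round(G, 2))
print("P < 0.001:", p_value < 0.001)
```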
Multivariate Model:

Consider 4 independent variables: age, weight at last menstrual period (LWT), race and number of first trimester physician visits (FTV). The dependent variable is low birth weight (LBW).

The 4 independent variables can be represented as follows.

X' = (Age, LWT, Race, FTV)


Where,
Age  - Continuous
LWT  - Continuous
FTV  - Discrete
Race - Polychotomous (White, Black, Other)

Therefore, the design variables for RACE are

          Design variable
Race      D1    D2
White      0     0
Black      1     0
Other      0     1
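The coding in this table can be written as a small function. The sketch below is illustrative only; the function name design_variables is hypothetical, and White is treated as the reference category because it is coded (0, 0).

```python
RACE_LEVELS = ("White", "Black", "Other")  # White is the reference category

def design_variables(race):
    """Return (D1, D2) for a race category, following the design variable table."""
    if race not in RACE_LEVELS:
        raise ValueError(f"unknown race category: {race!r}")
    d1 = 1 if race == "Black" else 0
    d2 = 1 if race == "Other" else 0
    return (d1, d2)

for level in RACE_LEVELS:
    print(level, design_variables(level))
```

A variable with three categories needs two design variables; each coefficient then compares one category with the reference category.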
Testing for Significance
Significance of the Model:
Assessing the significance of the model means testing the overall significance of the 4 variables in the model. However, one or more variables individually may not be significant.

The values below are twice the log-likelihood, so their difference gives G directly.

2 x log-likelihood for 4 variables + constant = -222.583
2 x log-likelihood for constant only = -234.673
G = (-222.583) - (-234.673) = 12.09
The 4 variables contribute 5 coefficients (RACE has two design variables), so G has 5 d.f.
The P value for the chi-square test: P[$\chi^2(5)$ > 12.1] = 0.033
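The P value of the overall test can be checked without a statistics library. The sketch below assumes 5 degrees of freedom, one for each coefficient in the fitted model (AGE, LWT, two RACE design variables and FTV), and uses a closed-form upper-tail formula that is valid for 5 d.f. only. The function name chi2_sf_5df is hypothetical.

```python
import math

def chi2_sf_5df(x):
    """Upper-tail probability P(chi-square with 5 d.f. > x); closed form for 5 d.f. only."""
    tail_normal = math.erfc(math.sqrt(x / 2.0))
    correction = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0) * (1.0 + x / 3.0)
    return tail_normal + correction

# Difference of the two reported values, as on the slide.
G = (-222.583) - (-234.673)
p_value = chi2_sf_5df(G)

print("G =", round(G, 2))
print("P value =", round(p_value, 3))
```

A P value below 0.05 here says only that the variables are jointly informative; it does not say which of them are.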
Significance of the Variables:

Variables in the Equation

Step 1(a)     B      S.E.    Wald    df   Sig.   Exp(B)
AGE         -.023    .034    .483    1    .487    .977
LWT         -.014    .007   4.751    1    .029    .986
RACE                        4.425    2    .109
RACE(1)     1.005    .498   4.078    1    .043   2.732
RACE(2)      .434    .362   1.434    1    .231   1.543
FTV         -.049    .167    .087    1    .768    .952
Constant    1.287   1.070   1.447    1    .229   3.623
a. Variable(s) entered on step 1: AGE, LWT, RACE, FTV.

-2 Log-likelihood = 222.583
Wald statistic = $\beta / SE(\beta)$; the Wald column reports its square, which is compared with a chi-square distribution with 1 d.f.
We conclude that the variables LWT and possibly Race are significant at P < 0.05.
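The Exp(B) and Wald columns can be recomputed from B and S.E. The following sketch uses the rounded values printed in the table, so the recomputed Wald statistics differ slightly from the printed ones, which were computed from unrounded estimates.

```python
import math

# (B, S.E.) copied from the "Variables in the Equation" table.
coefficients = {
    "AGE": (-0.023, 0.034),
    "LWT": (-0.014, 0.007),
    "RACE(1)": (1.005, 0.498),
    "RACE(2)": (0.434, 0.362),
    "FTV": (-0.049, 0.167),
}

results = {}
for name, (b, se) in coefficients.items():
    odds_ratio = math.exp(b)   # Exp(B): odds ratio per unit increase, adjusted for the others
    wald = (b / se) ** 2       # squared ratio of estimate to standard error
    results[name] = (odds_ratio, wald)
    print(f"{name:8s} Exp(B) = {odds_ratio:.3f}  Wald = {wald:.3f}")
```

For example, Exp(B) for RACE(1) is the adjusted odds ratio of low birth weight for Black versus White mothers, since White is the reference category.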
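USE OF LR COEFFICIENTS FOR
GENERAL COMPARISONS OF RISK

General: Compare individuals with X = a to individuals with X = b (here a and b are two values of X, not the location and slope of the model)

$$\ln(\text{OR}) = \beta_1 (a - b)$$

$$\text{OR} = e^{\beta_1 (a - b)}$$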
Example:
Risk of birth defect to mothers, ages 35+
$\beta_1$ = 0.182 (Age in years)
OR = $e^{0.182}$ = 1.2 (change of 1 year)

For a change of 5 years (e.g. 45 vs 40)
5 $\beta_1$ = 0.91, OR = $e^{0.91}$ = 2.5

For a change of 10 years (e.g. 45 vs 35)
10 $\beta_1$ = 1.82, OR = $e^{1.82}$ = 6.2

NOTES: Non-linear effect on OR
       Linear effect on ln(OR), in multiples of $\beta_1$
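The same arithmetic as a short Python sketch, using the coefficient 0.182 per year from the example. The function name odds_ratio is illustrative.

```python
import math

beta1 = 0.182  # ln(OR) per year of age, from the example

def odds_ratio(beta, x_a, x_b):
    """Odds ratio comparing X = x_a with X = x_b: exp(beta * (x_a - x_b))."""
    return math.exp(beta * (x_a - x_b))

or_1 = odds_ratio(beta1, 36, 35)    # change of 1 year
or_5 = odds_ratio(beta1, 45, 40)    # change of 5 years
or_10 = odds_ratio(beta1, 45, 35)   # change of 10 years

print(round(or_1, 1), round(or_5, 1), round(or_10, 1))
```

Doubling the age difference doubles ln(OR) but squares the OR, which is the non-linear effect noted above.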