Lect7 Math231

The document discusses logistic regression, which models the probability of binary outcomes as a function of predictor variables. It provides background on logistic regression and how it addresses limitations of linear regression for binary outcomes. An example analyzes the relationship between age and coronary heart disease using logistic regression. The results show that the odds of coronary heart disease increase by 11.6% for each additional year of age.


Statistics

Logistic Regression

Shaheena Bashir

Fall 2019
Outline

Background
Introduction
  Logit Transformation
  Assumptions
Estimation
Example
  Analysis
  How Good is the Fitted Model?
Single Categorical Predictor
Types of Logistic Regression Models


Background

Motivating Example

Background

Scatter Plot
Relationship between Age & CHD

[Scatter plot: Coronary heart disease (0 = absent, 1 = present) vs. Age (years), ages 20-70. The points fall on two horizontal bands at 0 and 1.]

Not informative!!
Background

Regression Model: Objective

- Describe the relationship between an outcome (dependent or response) variable and a set of independent (predictor or explanatory) variables by some regression model (equation).
- Predict some future outcome based on the regression model.

How to model the relationship of CHD with age?

Background

Background

- What distinguishes a logistic regression model from the linear regression model is that the outcome variable is binary (or dichotomous), e.g.:
  - Whether a tumor is malignant (Yes=1) or not (No=0)
  - Whether a newborn baby has low birth weight (Yes=1) or not (No=0)
  - Whether a student gets admission at LUMS (Yes=1) vs. not (No=0)

For a categorical response variable, the assumption that the errors follow a normal distribution fails.

Background

Tabular Form of CHD Data

Age Group      n    CHD Present    Proportion with CHD

20-29         10         1               0.10
30-34         15         2               0.13
35-39         12         3               0.25
40-44         15         5               0.33
45-49         13         6               0.46
50-54          8         5               0.63
55-59         17        13               0.76
60-69         10         8               0.80
Total        100        43

Background

Proportion of Individuals with CHD


Relationship between Age & CHD

[Figure: proportion with CHD (y-axis, 0-1) vs. Age (years, 20-70), one point per age group; the proportion rises with age.]

Introduction

Logistic Regression Model

- The response variable in logistic regression is categorical. The linear regression model, i.e., Y = Xβ + ε, does not work well for a few reasons:
  - The response values, 0 and 1, are arbitrary, so modeling the actual values of Y is not exactly of interest.
  - Our interest is in modeling the probability that each individual in the population responds with 0 or 1.
  - The error terms in this case do not follow a normal distribution.

Thus, we might consider modeling P, the probability, as the response variable.

Introduction

Sigmoid Function

Modeling the probability as the response raises some problems:

- Although the probability generally increases with age, we know that P, like all probabilities, can only fall within the boundaries of 0 and 1.
- It is better to assume that the relationship between age and P is sigmoidal (S-shaped), rather than a straight line.
- It is possible, however, to find a linear relationship between age and a function of P. Although a number of functions work, one of the most useful is the logit function.

Introduction
Logit Transformation

Logit Function
The logit function ln(p/(1−p)) (also called the log-odds) is simply the log of the ratio of P(Y = 1) to P(Y = 0):

    ln(p/(1−p)) = Xβ

The odds:

    p/(1−p) = exp(Xβ)

Solving for p,

    p = Pr(Y = 1 | X = x) = exp(y)/[1 + exp(y)] = 1/[1 + exp(−y)],  where y = Xβ,

gives the standard logistic (sigmoid) function.
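Base R implements this pair of transformations as qlogis() (the logit) and plogis() (the inverse logit); a quick illustration, using the 55-59 age group's proportion from the earlier table:

p <- 0.76                    # proportion with CHD in the 55-59 age group
qlogis(p)                    # log-odds: log(0.76/0.24), about 1.15
plogis(qlogis(p))            # the inverse logit recovers p = 0.76
1 / (1 + exp(-qlogis(p)))    # same value, computed by the formula above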
Introduction
Logit Transformation

Logit Function

g(x) = ln(p/(1−p)) has many of the desirable properties of a linear regression model:

- It may be continuous
- It is linear in the parameters
- It has the potential for a range between −∞ and +∞, depending on the range of x

Introduction
Logit Transformation

Summary: Logit Transformation

Quantity               Formula          min     max

Probability            p                 0       1
Odds                   p/(1−p)           0       ∞
Logit or 'log-odds'    ln(p/(1−p))      −∞       ∞

The logit stretches the probability scale.

Introduction
Assumptions

Assumptions

                        Linear Regression     Logistic Regression

Error distribution      ε ∼ N(0, σ²)          ε ∼ Bin(p)
Model                   Y = Xβ + ε            ln(p/(1−p)) = Xβ + ε
Conditional dist.       Y|X ∼ N(Xβ, σ²)       Y|X ∼ Bin(p)

Estimation

Estimation of Parameters of Regression Model: β

- The method of maximum likelihood yields values for the unknown parameters that maximize the probability of obtaining the observed set of data.
- For logistic regression the likelihood equations are non-linear in the parameters β and require special methods for their solution.
- These methods are iterative in nature and have been programmed into available logistic regression software.
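The iteration can be sketched in a few lines of R. This is a minimal illustration of the idea (irls_logit is a hypothetical helper, not the slides' code); it assumes a design matrix X whose first column is all 1s and a 0/1 response vector y:

irls_logit <- function(X, y, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))                # start at beta = 0
  for (i in seq_len(maxit)) {
    eta <- as.vector(X %*% beta)         # linear predictor X beta
    p   <- 1 / (1 + exp(-eta))           # current fitted probabilities
    w   <- p * (1 - p)                   # Bernoulli variances = IRLS weights
    z   <- eta + (y - p) / w             # working response
    beta_new <- as.vector(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}
# e.g., for the CHD data (chd coded 0/1):
# irls_logit(cbind(1, chdage$age), chdage$chd)   # close to coef(glm(...))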

Example

Example: CHD Data

- Is age a risk factor for CHD? How does the probability of CHD change with age?
- Outcome variable: CHD (Yes, No)
- Predictor: Age (in years)

Logistic regression models the probability of some event occurring as a linear function of a set of predictors (on the logit scale).

Example
Analysis

CHD Analysis

ln(p̂/(1−p̂)) = −5.31 + 0.11 Age

- The coefficient is interpreted as the MARGINAL increase in the log odds of CHD when age increases by 1 year.

              Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)     −5.31         1.13     −4.68       0.00
age              0.11         0.02      4.61       0.00

OR = exp(0.11) = 1.116
The odds of getting CHD are multiplied by 1.116 (an increase of about 11.6%) when age increases by 1 year.
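In R, the odds ratio comes straight from the fitted object (mod1 as defined on the R Software slide below), and exponentiating Wald confidence limits gives an interval on the OR scale:

exp(coef(mod1)["age"])       # about 1.116, the OR above
exp(confint.default(mod1))   # Wald 95% confidence intervals, OR scale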

Example
Analysis

Fitted Values

p = exp(β0 + β1 X) / [1 + exp(β0 + β1 X)]
  = exp(−5.31 + 0.11 Age) / [1 + exp(−5.31 + 0.11 Age)]
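For instance, base R's plogis() (the inverse logit) evaluates this fitted curve at chosen ages; with the rounded coefficients above:

plogis(-5.31 + 0.11 * c(30, 50, 70))   # about 0.12, 0.55, 0.92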

Example
Analysis

R Software

mod1 <- glm(chd ~ age, family = binomial, data = chdage)
summary(mod1)
predict(mod1, type = "response")
anova(mod1, test = "Chisq")
plot(mod1)

Example
Analysis

Predicted Probabilities

[Figure: predicted probabilities from mod1 vs. Age (years, 20-70); the fitted probabilities rise in an S-shape with age.]
Example
How Good is the Fitted Model?

Analysis of Deviance
Model: binomial, link: logit
Terms added sequentially (first to last)

        Df   Deviance   Resid. Df   Resid. Dev    Pr(>Chi)
NULL                        99        136.66
Age      1     29.31        98        107.35     6.168e-08 ***

- Deviance is a measure of goodness of fit of a generalized linear model. Or rather, it's a measure of badness of fit: higher values indicate a worse fit.
- If our new model explains the data better than the null model, there should be a significant reduction in the deviance, which can be tested against the chi-square distribution to give a p-value.
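The drop in deviance, 136.66 − 107.35 = 29.31 on 1 df, can be checked against the chi-square distribution directly:

pchisq(136.66 - 107.35, df = 1, lower.tail = FALSE)   # about 6.2e-08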
Example
How Good is the Fitted Model?

Hosmer-Lemeshow Goodness of Fit

How well our model fits depends on the difference between the model and the observed data.

library(ResourceSelection)
hoslem.test(as.numeric(chdage$chd) - 1, fitted(mod1))

R Output:
Hosmer and Lemeshow goodness of fit (GOF) test
data: as.numeric(chdage$chd) - 1, fitted(mod1)
X-squared = 2.2243, df = 8, p-value = 0.9734

Our model appears to fit well, because there is no significant difference between the model and the observed data (i.e., the p-value > 0.05).
o
23/29
Single Categorical Predictor

Simple Logistic Regression Model with a Categorical Predictor

- How some function of the probability of a categorical response is linearly related to a predictor
- Interpretation of the resulting intercept β0 and the slope β1 when the predictor variable is also binary

Single Categorical Predictor

Case-Control Study: A Recap Example

Past exposure    CHD Cases    Controls (without disease)

Smokers             112              176
Non-smokers          88              224
Totals              200              400

Odds of CHD for Smokers = 112/176 ≈ 0.64
Odds of CHD for Non-smokers = 88/224 ≈ 0.39

Single Categorical Predictor

Case-Control Study: A Recap Example Cont’d

Let yi be the binary response variable:

- yi = 1 if CHD = yes
- yi = 0 if CHD = no

Past exposure     yi     ni
Smokers          112    288
Non-smokers       88    312

Then yi ∼ Bin(ni, pi).

Let xi be the binary predictor of past smoking:

- xi = 1 if past smoker
- xi = 0 if non-smoker in the past
Single Categorical Predictor

Case-Control Study: A Recap Example Cont’d

The probability of CHD, pi, can be modeled as:

    logit(pi) = β0 + β1 xi

- If xi = 1, then logit(pi | xi = 1) = β0 + β1
- If xi = 0, then logit(pi | xi = 0) = β0

    β1 = logit(pi | xi = 1) − logit(pi | xi = 0) = ln[ odds(xi = 1) / odds(xi = 0) ]

∴ OR = exp(β1)
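As a sketch (not from the slides), the same model can be fitted in R from the aggregated counts above, using glm's grouped-binomial interface cbind(successes, failures):

y <- c(112, 88)     # CHD cases: smokers, non-smokers
n <- c(288, 312)    # group sizes
x <- c(1, 0)        # past-smoking indicator
fit <- glm(cbind(y, n - y) ~ x, family = binomial)
coef(fit)           # intercept about -0.93, slope about 0.48 (next slide)
exp(coef(fit))      # baseline odds about 0.39; OR about 1.62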

Single Categorical Predictor

Example: Logistic Regression

              Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)     −0.93         0.13     −7.43       0.00
pastsmoke1       0.48         0.17      2.76       0.01

- For past smokers (xi = 1): ln(odds of CHD) = β0 + β1 = −0.45, ∴ odds for smokers = exp(−0.45) ≈ 0.64
- For past non-smokers (xi = 0): ln(odds of CHD) = β0 = −0.93, ∴ odds for non-smokers = exp(−0.93) ≈ 0.39

OR = exp(0.48) ≈ 1.62

Types of Logistic Regression Models

Types of Logistic Regression Model

- Binary Logistic Regression Model: the categorical response is dichotomous (has only two possible outcomes), e.g., an email is Spam or Not.
- Multinomial Logistic Regression Model: three or more categories without ordering (polytomous response), e.g., predicting food choices (Veg, Non-Veg, Vegan).
- Ordinal Logistic Regression Model: three or more categories with ordering, e.g., movie ratings from 1 to 5, teaching evaluations by students, etc.
