
Chapter 9: Multiple and logistic regression

Wen-Han Hwang
( Slides primarily developed by Mine Çetinkaya-Rundel from OpenIntro.)

Institute of Statistics
National Tsing Hua University
Taiwan



Outline

1 Introduction

2 Logistic Regression

3 Additional Example

4 Sensitivity and Specificity

5 ROC curves
6 Utility Functions



Introduction

Regression so far ...

At this point we have covered:

Simple linear regression: Y = β0 + β1 X + ε, with ε ~ N(0, σ²), so Y | X is Normal.
Relationship between a numerical response and a numerical or categorical predictor.

Multiple regression: Y = β0 + β1 X1 + · · · + βk Xk + ε.
Relationship between a numerical response and multiple numerical and/or categorical predictors.

However, several challenges remain:

Complex Predictors: Handling predictors with nonlinear relationships or intricate dependency structures.

Diverse Response Types: Addressing different types of response variables, such as binary, percentage, categorical, and count data (e.g. Y ∈ {0, 1}, or Y = 0, 1, 2, 3, ...).

Logistic Regression: Specifically focuses on cases with binary response variables.


Understanding Odds

Odds provide another method for quantifying the likelihood of an event and are frequently used in contexts like gambling and logistic regression.

Definition:

For an event E, the odds are defined as:

odds(E) = P(E) / P(E^c) = P(E) / (1 − P(E))

If the odds of E are given as x to y, then:

odds(E) = x / y = [x / (x + y)] / [y / (x + y)]

which leads to:

P(E) = x / (x + y),    P(E^c) = y / (x + y)

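As a quick sanity check of these formulas (not part of the original slides), odds of 2 to 1 correspond to P(E) = 2/3:

x <- 2; y <- 1                # the odds of E are "x to y"
odds <- x / y                 # 2
p    <- x / (x + y)           # 2/3 = P(E); the complement is y / (x + y) = 1/3
odds - p / (1 - p)            # 0: the two ways of writing the odds agree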


Example - Donner Party

In 1846, the Donner and Reed families departed from Springfield, Illinois, bound for California in
covered wagons. By July, the group, known as the Donner Party, had reached Fort Bridger,
Wyoming. Here, they opted to try a new, untested route to the Sacramento Valley. The party had
grown to 87 people and 20 wagons.

Their journey was hampered by difficult crossings of the Wasatch Range and the desert west of the
Great Salt Lake. They became stranded in the eastern Sierra Nevada mountains due to heavy
snows in late October. When rescuers arrived on April 21, 1847, only 47 of the original 87 members
had survived, the rest succumbing to famine and extreme cold.
Source: Ramsey, F.L. and Schafer, D.W. (2002). The Statistical Sleuth: A Course in Methods of Data Analysis (2nd ed.)



Example - Donner Party - Data

This data contains the ages and sexes of the adult (over 15 years) survivors and nonsurvivors of the Donner party.

      Age    Sex     Status
1     23.00  Male    Died
2     40.00  Female  Survived
3     40.00  Male    Survived
4     30.00  Male    Died
5     28.00  Male    Died
...   ...    ...     ...
43    23.00  Male    Survived
44    24.00  Male    Died
45    25.00  Female  Survived



Example - Donner Party - EDA

Status vs. Gender:

           Male  Female
Died        20      5
Survived    10     10

Odds of Mortality: Male 2 (2:1), Female 0.5 (1:2)

Status vs. Age: (figure: distribution of age by survival status)

Example - Donner Party: Modeling Survival

It appears that both age and gender significantly influence survival outcomes. How can we develop a model to further explore these relationships?

Simply coding the outcomes as 0 for Died and 1 for Survived doesn't sufficiently address the complexity of the problem. We need a more robust approach.

One effective method is to view survival in terms of a binomial distribution, where the probability of survival (success) versus non-survival (failure) can be modeled using a logistic function applied to a linear combination of predictors (age and gender).



Generalized linear models

It turns out that this is a very general way of addressing this type of problem in regression, and the
resulting models are called generalized linear models (GLMs). Logistic regression is just one
example of this type of model.

All generalized linear models have the following three characteristics:

1 A probability distribution describing the outcome variable

2 A linear model

      η = β0 + β1 X1 + · · · + βr Xr

3 A link function that relates the linear model to the parameter of the outcome distribution

      g(p) = η   or   p = g⁻¹(η)



Logistic Regression

Logistic Regression

Logistic regression is a GLM used to model a binary categorical variable using numerical and categorical predictors.

We assume a binomial distribution produced the outcome variable, and we therefore want to model p, the probability of success for a given set of predictors.

To finish specifying the logistic model we just need to establish a reasonable link function that connects η to p. There are a variety of options, but the most commonly used is the logit function.

Logit function

logit(p) = log( p / (1 − p) ),   for 0 ≤ p ≤ 1
Properties of the Logit

The logit function takes a value between 0 and 1 and maps it to a value between −∞ and ∞.

Inverse logit (logistic) function

g⁻¹(x) = exp(x) / (1 + exp(x)) = 1 / (1 + exp(−x))

The inverse logit function takes a value between −∞ and ∞ and maps it to a value between 0 and 1.

This formulation also has some use when it comes to interpreting the model, as the logit can be interpreted as the log odds of a success; more on this later.

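In R the logit and its inverse are available directly as qlogis() and plogis(); a small sketch (the numerical values are for illustration only):

qlogis(0.86)    # log(0.86 / 0.14), the log odds corresponding to p = 0.86
plogis(1.8185)  # 1 / (1 + exp(-1.8185)) = 0.86, back on the probability scale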


Note that p(x) = exp(β0 + β1 x) / [1 + exp(β0 + β1 x)] is equivalent to p(x) / (1 − p(x)) = exp(β0 + β1 x), i.e. ln(odds) = β0 + β1 x.
The logistic regression model

The three GLM criteria give us:

y_i ~ Binom(1, p_i)

η = β0 + β1 x1 + · · · + βr xr

logit(p) = η

From which we arrive at,

p_i = exp(β0 + β1 x1,i + · · · + βr xr,i) / [1 + exp(β0 + β1 x1,i + · · · + βr xr,i)]



Example - Donner Party - Model

In R we fit a GLM in the same way as a linear model, except using glm instead of lm, and we must also specify the type of GLM to fit using the family argument:

fit1 <- glm(y ~ X, data = ..., family = binomial)

summary(glm(Status ~ Age, data=donner, family=binomial))
## Call:
## glm(formula = Status ~ Age, family = binomial, data = donner)

##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.81852 0.99937 1.820 0.0688 .
## Age -0.06647 0.03222 -2.063 0.0391 *
##
## Null deviance: 61.827 on 44 degrees of freedom
## Residual deviance: 56.291 on 43 degrees of freedom
## AIC: 60.291
##
## Number of Fisher Scoring iterations: 4



Example - Donner Party - Prediction

             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)    1.8185      0.9994     1.82    0.0688
Age           -0.0665      0.0322    -2.06    0.0391

Model:

log( p / (1 − p) ) = 1.8185 − 0.0665 × Age

Odds / Probability of survival for a newborn (Age = 0):

log( p / (1 − p) ) = 1.8185 − 0.0665 × 0
p / (1 − p) = exp(1.8185) = 6.16      (odds)
p = 6.16 / 7.16 = 0.86



Example - Donner Party - Prediction (cont.)

Model:

log( p / (1 − p) ) = 1.8185 − 0.0665 × Age

Odds / Probability of survival for a 25-year-old:

log( p / (1 − p) ) = 1.8185 − 0.0665 × 25
p / (1 − p) = exp(0.156) = 1.17      (odds)
p = 1.17 / 2.17 = 0.539

Odds / Probability of survival for a 50-year-old:

log( p / (1 − p) ) = 1.8185 − 0.0665 × 50
p / (1 − p) = exp(−1.5065) = 0.222      (odds)
p = 0.222 / 1.222 = 0.181

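These predictions can be reproduced in R with predict(); a sketch, assuming the donner data frame used in the earlier summary() call:

fit1 <- glm(Status ~ Age, data = donner, family = binomial)

newdata <- data.frame(Age = c(0, 25, 50))
predict(fit1, newdata, type = "link")      # log odds: roughly 1.82, 0.156, -1.51
predict(fit1, newdata, type = "response")  # probabilities: roughly 0.86, 0.54, 0.18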


Example - Donner Party - Prediction (cont.)

(Figure: fitted survival probability p(Age) = exp(1.8185 − 0.0665 × Age) / [1 + exp(1.8185 − 0.0665 × Age)] plotted against age, with the observed outcomes shown at 1 for "survived" and 0 for "died".)


Example - Donner Party - Interpretation

             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)    1.8185      0.9994     1.82    0.0688
Age           -0.0665      0.0322    -2.06    0.0391

Interpretation:

Intercept: The log odds of survival for a party member hypothetically aged 0. This base value allows for calculation of odds and probability with further computation.

Slope (Age): The coefficient for age, −0.0665, represents the change in the log odds of survival for each additional year of age. This can be converted to an odds ratio (OR):

OR = e^(−0.0665) ≈ 0.935

This means that each additional year of age is associated with a 6.5% decrease in the odds of survival (since 1 − 0.935 = 0.065), emphasizing the negative impact of increasing age on survival prospects.
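The odds ratio and the percentage change in the odds follow directly from the reported coefficient; a quick check in R:

exp(-0.0665)      # about 0.936, the odds ratio per additional year of age
1 - exp(-0.0665)  # about 0.064, i.e. roughly the 6.5% drop quoted above
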
Understanding the Odds Ratio

Definition: Odds Ratio (OR) is a statistic that quantifies the strength and direction of the association between two binary variables. It compares the odds of an event occurring in one group to the odds of it occurring in another group.

Odds Ratio (OR) = Odds of event in the exposed group / Odds of event in the unexposed group
                = [ P(E=1 | X=1) / P(E=0 | X=1) ] / [ P(E=1 | X=0) / P(E=0 | X=0) ]

OR = 1: No association between exposure and outcome.
OR > 1: Positive association; greater odds of the event occurring with the exposure.
OR < 1: Negative association; lower odds of the event occurring with the exposure.

Application in Logistic Regression: In logistic regression, the exponentiated coefficient (e^β) of a predictor variable gives the odds ratio. This measures how the odds of the outcome change with a one-unit increase in the predictor, holding all else constant.
Example - Donner Party - Interpretation - Slope

log( p1 / (1 − p1) ) = 1.8185 − 0.0665 (x + 1)
                     = 1.8185 − 0.0665 x − 0.0665

log( p0 / (1 − p0) ) = 1.8185 − 0.0665 x

log( p1 / (1 − p1) ) − log( p0 / (1 − p0) ) = −0.0665

log( [ p1 / (1 − p1) ] / [ p0 / (1 − p0) ] ) = −0.0665

[ p1 / (1 − p1) ] / [ p0 / (1 − p0) ] = exp(−0.0665) = 0.94      (odds ratio)



Example - Donner Party - Age and Gender

With multiple covariates the model becomes ln( p_i / (1 − p_i) ) = β0 + β1 Age_i + β2 Sex_i, with Y_i ~ Bernoulli(p_i).

summary(glm(Status ~ Age + Sex, data=donner, family=binomial))

## Call:
## glm(formula = Status ~ Age + Sex, family = binomial, data = donner)
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.63312    1.11018   1.471   0.1413
## Age         -0.07820    0.03728  -2.097   0.0359 *
## SexFemale    1.59729    0.75547   2.114   0.0345 *
## ---
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 61.827 on 44 degrees of freedom
## Residual deviance: 51.256 on 42 degrees of freedom
## AIC: 57.256
##
## Number of Fisher Scoring iterations: 4

Gender slope: When the other predictors are held constant this is the log odds ratio between the
given level (Female) and the reference level (Male).



Example - Donner Party - Gender Models

Just like MLR, we can plug in gender to arrive at two status vs. age models, for men and women respectively.

General model:

log( p / (1 − p) ) = 1.63312 − 0.07820 × Age + 1.59729 × Sex

Male model (Sex = 0):

log( p / (1 − p) ) = 1.63312 − 0.07820 × Age + 1.59729 × 0
                   = 1.63312 − 0.07820 × Age

Female model (Sex = 1):

log( p / (1 − p) ) = 1.63312 − 0.07820 × Age + 1.59729 × 1
                   = 3.23041 − 0.07820 × Age



Example - Donner Party - Gender Models (cont.)

(Figure: fitted survival probability vs. age, one curve for men and one for women.)
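The two curves can be evaluated in R from the Age + Sex fit; a sketch, with the age 30 chosen only for illustration:

fit2 <- glm(Status ~ Age + Sex, data = donner, family = binomial)

# Predicted survival probability for a 30-year-old man and a 30-year-old woman
newdata <- data.frame(Age = c(30, 30), Sex = c("Male", "Female"))
predict(fit2, newdata, type = "response")
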
Hypothesis test for the whole model

summary(glm(Status ~ Age + Sex, data=donner, family=binomial))

## Call:
## glm(formula = Status ~ Age + Sex, family = binomial, data = donner)
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.63312    1.11018   1.471   0.1413
## Age         -0.07820    0.03728  -2.097   0.0359 *
## SexFemale    1.59729    0.75547   2.114   0.0345 *
## ---
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 61.827 on 44 degrees of freedom
## Residual deviance: 51.256 on 42 degrees of freedom
## AIC: 57.256
##
## Number of Fisher Scoring iterations: 4

Note: The model output does not include any F-statistic.

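Although glm() reports no F-statistic, the whole model can still be tested with a likelihood ratio (deviance) test against the intercept-only model. This is not shown on the slide; a minimal sketch:

fit2 <- glm(Status ~ Age + Sex, data = donner, family = binomial)
fit0 <- glm(Status ~ 1,         data = donner, family = binomial)

# Drop in deviance (61.827 - 51.256) compared with a chi-square on 2 df
anova(fit0, fit2, test = "Chisq")
# Equivalently: pchisq(61.827 - 51.256, df = 44 - 42, lower.tail = FALSE)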


Hypothesis tests for a coefficient

Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.6331 1.1102 1.47 0.1413
Age -0.0782 0.0373 -2.10 0.0359
SexFemale 1.5973 0.7555 2.11 0.0345

We are, however, still able to perform inference on individual coefficients; the basic setup is exactly the same as what we've seen before, except we use a Z test.
Note: The only tricky bit, which is way beyond the scope of this course, is how the standard error
is calculated.



Testing for the slope of Age

             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)    1.6331      1.1102     1.47    0.1413
Age           -0.0782      0.0373    -2.10    0.0359
SexFemale      1.5973      0.7555     2.11    0.0345

H0: β_age = 0
HA: β_age ≠ 0

Z = (β̂_age − 0) / SE_age = (−0.0782 − 0) / 0.0373 = −2.10

p-value = P(|Z| > 2.10) = P(Z > 2.10) + P(Z < −2.10) = 2 × 0.0178 = 0.0359
Confidence interval for age slope coefficient

             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)    1.6331      1.1102     1.47    0.1413
Age           -0.0782      0.0373    -2.10    0.0359
SexFemale      1.5973      0.7555     2.11    0.0345

Remember, the interpretation for a slope is the change in log odds ratio per unit change in the predictor.

Log odds ratio (95% CI for β_age):

CI = PE ± CV × SE = −0.0782 ± 1.96 × 0.0373 = (−0.151, −0.005)

Odds ratio:

exp(CI) = (e^(−0.151), e^(−0.005)) = (0.859, 0.995)

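Both intervals can be reproduced in R; a sketch using the Wald interval shown above (confint() without .default would instead give a profile-likelihood interval, which differs slightly):

fit2 <- glm(Status ~ Age + Sex, data = donner, family = binomial)

confint.default(fit2)["Age", ]        # roughly (-0.151, -0.005), log odds scale
exp(confint.default(fit2)["Age", ])   # roughly (0.859, 0.995), odds ratio scale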


Aside: if we instead allow an interaction between age and sex,

glm(Status ~ Age*Sex, data=donner, family=binomial)

then ln(odds | X, Sex = 1) = β0 + β1 X + β2 + β3 X while ln(odds | X, Sex = 0) = β0 + β1 X, so the odds ratio for gender, OR(X) = e^(β2 + β3 X), depends on the age X.
Additional Example

Example - Birdkeeping and Lung Cancer

A health survey conducted from 1972 to 1981 in The Hague, Netherlands, revealed an association
between keeping pet birds and an increased risk of lung cancer. In response, researchers initiated a
case-control study in 1985 at four hospitals within The Hague (population: 450,000).

Study Details:

Cases: 49 lung cancer patients, registered with a general practice, aged 65 or younger, and
residents of the city since at least 1965.
Controls: 98 residents selected to match the general age structure of the cases.

This study aimed to explore birdkeeping as a potential risk factor for lung cancer among the
population.

Source: Ramsey, F.L. and Schafer, D.W. (2002). The Statistical Sleuth: A Course in Methods of Data Analysis (2nd edition).



Example - Birdkeeping and Lung Cancer - Data

      LC          FM      SS    BK      AG     YR     CD
1     LungCancer  Male    Low   Bird    37.00  19.00  12.00
2     LungCancer  Male    Low   Bird    41.00  22.00  15.00
3     LungCancer  Male    High  NoBird  43.00  19.00  15.00
...   ...         ...     ...   ...     ...    ...    ...
147   NoCancer    Female  Low   NoBird  65.00   7.00   2.00

LC   Whether subject has lung cancer
FM   Sex of subject
SS   Socioeconomic status
BK   Indicator for birdkeeping
AG   Age of subject (years)
YR   Years of smoking prior to diagnosis or examination
CD   Average rate of smoking (cigarettes per day)

Note: NoCancer is the reference response (0 or failure); LungCancer is the non-reference response (1 or success) - this matters for interpretation.
Example - Birdkeeping and Lung Cancer - EDA

(Figure: birdkeeping status (Bird / No Bird) plotted against age, with points marked as Lung Cancer or No Lung Cancer.)



Example - Birdkeeping and Lung Cancer - Model

summary(glm(LC ~ FM + SS + BK + AG + YR + CD, data=bird, family=binomial))

## Call:
## glm(formula = LC ~ FM + SS + BK + AG + YR + CD, family = binomial,
## data = bird)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.93736 1.80425 -1.074 0.282924
## FMFemale 0.56127 0.53116 1.057 0.290653

## SSHigh 0.10545 0.46885 0.225 0.822050
## BKBird       1.36259    0.41128   3.313 0.000923 ***
## AG -0.03976 0.03548 -1.120 0.262503
## YR 0.07287 0.02649 2.751 0.005940 **
## CD 0.02602 0.02552 1.019 0.308055
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 187.14 on 146 degrees of freedom
## Residual deviance: 154.20 on 140 degrees of freedom
## AIC: 168.2
##
## Number of Fisher Scoring iterations: 5



Example - Birdkeeping and Lung Cancer - Interpretation

Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.9374 1.8043 -1.07 0.2829
FMFemale 0.5613 0.5312 1.06 0.2907
SSHigh 0.1054 0.4688 0.22 0.8221
BKBird 1.3626 0.4113 3.31 0.0009
AG -0.0398 0.0355 -1.12 0.2625
YR 0.0729 0.0265 2.75 0.0059
CD 0.0260 0.0255 1.02 0.3081

Keeping all other predictors constant, then:

The odds ratio of getting lung cancer for bird keepers vs non-bird keepers is exp(1.3626) = 3.91.
The odds ratio of getting lung cancer for an additional year of smoking is exp(0.0729) = 1.08.



What do the numbers not mean ...

The most common mistake made when interpreting logistic regression is to treat an odds ratio as a ratio of probabilities.

Bird keepers are not 4x more likely to develop lung cancer than non-bird keepers.

This is the difference between relative risk and an odds ratio.

RR = P(disease | exposed) / P(disease | unexposed)

OR = { P(disease | exposed) / [1 − P(disease | exposed)] } / { P(disease | unexposed) / [1 − P(disease | unexposed)] }



Back to the birds

What is the probability of lung cancer in a bird keeper if we knew that P(lung cancer | no birds) = 0.05?

OR = { P(lung cancer | birds) / [1 − P(lung cancer | birds)] } / { P(lung cancer | no birds) / [1 − P(lung cancer | no birds)] }
   = { P(lung cancer | birds) / [1 − P(lung cancer | birds)] } / { 0.05 / [1 − 0.05] } = 3.91

P(lung cancer | birds) = [ 3.91 × (0.05 / 0.95) ] / [ 1 + 3.91 × (0.05 / 0.95) ] = 0.171

RR = P(lung cancer | birds) / P(lung cancer | no birds) = 0.171 / 0.05 = 3.41

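The back-calculation from an odds ratio to a probability and a relative risk is easy to script; a sketch in R:

or        <- 3.91
p_nobird  <- 0.05
odds_bird <- or * p_nobird / (1 - p_nobird)  # odds of lung cancer for bird keepers
p_bird    <- odds_bird / (1 + odds_bird)     # ~0.171
p_bird / p_nobird                            # ~3.4, the relative risk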


Sensitivity and Specificity

(An old) Example - House

If you are familiar with the TV show House on Fox, you might recall Dr. House’s frequent remark:
“It’s never lupus.”

What is Lupus?
Lupus is an autoimmune disease where antibodies, instead of protecting against infections,
mistakenly target the body’s own proteins as foreign invaders.

This abnormal immune response can lead to increased blood clotting risks.
Approximately 2% of the population suffers from lupus.
Diagnostic Accuracy:
If a person has lupus, the test is 98% accurate.
If a person does not have lupus, the test is 74% accurate.
Discussion Point:
Considering the test’s accuracy, is Dr. House correct in his skepticism even if a test result is positive
for lupus?



(An old) Example - House

(Probability tree: P(lupus) = 0.02 with P(+ | lupus) = 0.98, and P(no lupus) = 0.98 with P(+ | no lupus) = 1 − 0.74 = 0.26.)

P(Lupus | +) = P(+, Lupus) / [ P(+, Lupus) + P(+, No Lupus) ]
             = 0.0196 / (0.0196 + 0.2548)
             = 0.0714

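The same calculation in R, using the incidence and accuracy figures from the previous slide:

p_lupus <- 0.02
sens    <- 0.98                               # P(+ | lupus)
spec    <- 0.74                               # P(- | no lupus)
p_pos_lupus   <- p_lupus * sens               # 0.0196
p_pos_nolupus <- (1 - p_lupus) * (1 - spec)   # 0.2548
p_pos_lupus / (p_pos_lupus + p_pos_nolupus)   # ~0.0714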


Testing for Lupus

Diagnosing lupus involves a complex array of multiple tests, reflecting the multifaceted nature of
the disease.

Common Tests for Lupus:


Complete Blood Count (CBC): Assesses overall health and detects disorders like anemia, infection, and other diseases.
Erythrocyte Sedimentation Rate (ESR): Measures the rate at which red blood cells sediment in a
period of one hour, indicating inflammation.
Kidney and Liver Assessment: Evaluates the function of these organs which can be affected by
lupus.
Urinalysis: Tests for protein or red blood cells in the urine, which indicates kidney damage.
Antinuclear Antibody (ANA) Test: Detects antibodies that often are present in individuals with
autoimmune diseases like lupus.



Testing for Lupus: Binary Decision Making

Diagnosing lupus can be viewed as a binary decision (lupus or no lupus) that requires the
integration of multiple test results.

Binary Decision:
The decision involves considering lupus or not based on a range of explanatory variables derived from various tests.
Importance of Sensitivity and Specificity:
Sensitivity indicates the test’s ability to correctly identify those with the disease (true positive rate).
Specificity refers to the test’s ability to correctly identify those without the disease (true negative
rate).
These metrics help interpret what a positive or negative test result actually means in the context of
diagnosing lupus.



Sensitivity and Specificity

Sensitivity - measures a test's ability to identify positive results.

P(Test + | Condition +) = P(+ | lupus) = 0.98

Specificity - measures a test's ability to identify negative results.

P(Test − | Condition −) = P(− | no lupus) = 0.74

It is illustrative to think about the extreme cases - what is the sensitivity and specificity of a test
that always returns a positive result? What about a test that always returns a negative result?



Sensitivity and Specificity (cont.)

(2 × 2 table of test result vs. true condition: TP, FN, FP, TN.)

Sensitivity = P(Test + | Condition +) = TP / (TP + FN)
Specificity = P(Test − | Condition −) = TN / (FP + TN)
False negative rate (β) = P(Test − | Condition +) = FN / (TP + FN)
False positive rate (α) = P(Test + | Condition −) = FP / (FP + TN)



So What?

Understanding test sensitivity, specificity, and disease incidence is crucial for accurate medical
decision-making.

Key Measures:
Sensitivity and specificity help calculate probabilities like P(lupus|+).
Using This Information:
How do we apply these insights to improve diagnostic decisions?
ROC curves

Identifying Spam Messages

Our analysis involved using logistic regression models to determine the likelihood of emails being
spam based on various predictors.

Purpose of Models:
Assess influence of different predictors on spam classification.

Assign probabilities to incoming messages for real-time filtering.
Beyond Probability Assignment:
Effective spam filtering requires decision-making based on assigned probabilities.
Decision Rule:
Implement a simple threshold probability.
Emails exceeding this threshold are flagged as spam.



Picking a threshold

Let's see what happens if we pick our threshold to be 0.75.



Consequences of picking a threshold

For our data set picking a threshold of 0.75 gives us the following results:
FN = 340 TP = 27
TN = 3545 FP = 9

What are the sensitivity and specificity for this particular decision rule?

Sensitivity = TP/( TP + FN ) = 27/(27 + 340) = 0.073


Specificity = TN/( FP + TN ) = 3545/(9 + 3545) = 0.997

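A small helper makes it easy to repeat this calculation for other thresholds; the function name is ours, the counts are taken from the slide:

sens_spec <- function(TP, FN, FP, TN) {
  c(sensitivity = TP / (TP + FN),   # true positive rate
    specificity = TN / (FP + TN))   # true negative rate
}
sens_spec(TP = 27, FN = 340, FP = 9, TN = 3545)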


Trying other thresholds

Threshold     0.75   0.625  0.5    0.375  0.25
Sensitivity   0.074  0.106  0.136  0.305  0.510
Specificity   0.997  0.995  0.995  0.963  0.936


Relationship between Sensitivity and Specificity

Threshold     0.75   0.625  0.5    0.375  0.25
Sensitivity   0.074  0.106  0.136  0.305  0.510
Specificity   0.997  0.995  0.995  0.963  0.936

(Figure: sensitivity and specificity plotted against the threshold.)
Receiver operating characteristic (ROC) curve

(Figure: ROC curve for the spam model, plotting sensitivity against 1 − specificity across all thresholds.)
Receiver operating characteristic (ROC) curve (cont.)

Why do we care about ROC curves?

Shows the trade-off in sensitivity and specificity for all possible thresholds.
Straightforward to compare performance vs. chance.
Can use the area under the curve (AUC) as an assessment of the predictive ability of a model.

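One way to draw the curve and compute the AUC in R is the pROC package; a sketch, where the predictor set is abbreviated for illustration and spam is assumed to be the 0/1 outcome in the email data:

library(pROC)

spam_fit <- glm(spam ~ to_multiple + attach + winner + format,
                data = email, family = binomial)

roc_spam <- roc(email$spam, fitted(spam_fit))  # observed labels vs. fitted probabilities
plot(roc_spam)                                 # traces sensitivity against specificity
auc(roc_spam)                                  # area under the curve
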
Refining the Spam model

g_refined = glm(spam ~ to_multiple + cc + image + attach + winner
                       + password + line_breaks + format + re_subj
                       + urgent_subj + exclaim_mess,
                data=email, family=binomial)
summary(g_refined)

                  Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)        -1.7594      0.1177   -14.94    0.0000
to_multipleyes     -2.7368      0.3156    -8.67    0.0000
ccyes              -0.5358      0.3143    -1.71    0.0882
imageyes           -1.8585      0.7701    -2.41    0.0158
attachyes           1.2002      0.2391     5.02    0.0000
winneryes           2.0433      0.3528     5.79    0.0000
passwordyes        -1.5618      0.5354    -2.92    0.0035
line_breaks        -0.0031      0.0005    -6.33    0.0000
formatPlain         1.0130      0.1380     7.34    0.0000
re_subjyes         -2.9935      0.3778    -7.92    0.0000
urgent_subjyes      3.8830      1.0054     3.86    0.0001
exclaim_mess        0.0093      0.0016     5.71    0.0000



Comparing models

(Figure: ROC curves for the original and refined spam models, compared on the same axes.)
Utility Functions

Utility Functions

There are many other reasonable quantitative approaches we can use to decide on what is the
“best” threshold.

If you've taken an economics course you have probably heard of the idea of utility functions: we can assign costs and benefits to each of the possible outcomes and use those to calculate a utility for each circumstance.

Utility function for our spam filter

To write down a utility function for a spam filter we need to consider the costs / benefits of each outcome.

Outcome          Utility
True Positive        1
True Negative        1
False Positive     -50
False Negative      -5

U(p) = TP(p) + TN(p) − 50 × FP(p) − 5 × FN(p)


Utility for the 0.75 threshold

For the email data set picking a threshold of 0.75 gives us the following results:
FN = 340 TP = 27
TN = 3545 FP = 9

U(p) = TP(p) + TN(p) − 50 × FP(p) − 5 × FN(p)
     = 27 + 3545 − 50 × 9 − 5 × 340 = 1422

Not useful by itself, but allows us to compare with other thresholds.

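The same bookkeeping can be wrapped in a function and evaluated over a grid of thresholds; a sketch, assuming fitted probabilities from g_refined and a 0/1 spam indicator in the email data:

utility <- function(threshold, probs, spam) {
  pred <- as.numeric(probs > threshold)
  TP <- sum(pred == 1 & spam == 1); TN <- sum(pred == 0 & spam == 0)
  FP <- sum(pred == 1 & spam == 0); FN <- sum(pred == 0 & spam == 1)
  TP + TN - 50 * FP - 5 * FN
}

thresholds <- seq(0.05, 0.95, by = 0.01)
u <- sapply(thresholds, utility, probs = fitted(g_refined), spam = email$spam)
thresholds[which.max(u)]   # threshold with maximum utility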


Utility curve

(Figure: total utility plotted against the threshold.)
Utility curve (zoom)

(Figure: the utility curve, zoomed in.)
Maximum Utility

(Figure: the utility curve with the maximum-utility threshold marked.)
