Lecture 8
Table 1: Hospitalization counts by sex (column percentages in parentheses)

                         Male          Female
Not hospitalized (y=0)   902 (95.3%)   941 (89.3%)
Hospitalized (y=1)        44 (4.7%)    113 (10.7%)
Total                    946           1054
πi = Pr(yi = 1)

E(yi) = xi′β = πi
• Residual analysis is meaningless. The response must be either 0 or 1, although regression models typically regard the distribution of the error term as continuous. This mismatch implies, for example, that the usual residual analysis in regression modeling is meaningless. (In ordinary regression models, residuals are the differences between the observed values of the dependent variable and the predicted values from the regression equation. Residual analysis is commonly used to assess the goodness of fit of a regression model and to check the assumptions of the regression analysis, such as linearity, normality of residuals, and constant variance. However, when working with binary dependent variables, which take on values of 0 and 1 (representing categories like “success” and “failure”), the nature of the data is inherently different. Binary outcomes do not vary continuously like numeric variables, and the assumptions of traditional regression analysis may not hold.)
The name “logistic” refers to the logistic function, also known as the sig-
moid function, which is the key component of logistic regression. This function
transforms a linear combination of the independent variables into a probability
value between 0 and 1. By applying a suitable threshold, we can then classify
observations into one of the two categories.
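To illustrate, here is a minimal R sketch (with made-up coefficients and data) that applies the sigmoid to a linear predictor and classifies with a 0.5 cutoff:

sigmoid <- function(z) 1 / (1 + exp(-z))   # inverse logit: maps the real line to (0, 1)

beta0 <- -2; beta1 <- 0.8                  # hypothetical coefficients
x <- 0:5                                   # hypothetical predictor values
p <- sigmoid(beta0 + beta1 * x)            # fitted probabilities
yhat <- as.integer(p > 0.5)                # classify with a 0.5 threshold
cbind(x, p, yhat)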
• Logit Link Function: The most commonly used link function in logistic regression is the logit link. It equals the log-odds of the probability of the binary outcome, and its inverse applies the logistic transformation (sigmoid function) to the linear predictor. The logit link function is given by:

g(π) = logit(π) = log( π / (1 − π) ).
• Probit Link Function: The probit link function is an alternative to the logit link function. Its inverse uses the cumulative distribution function (CDF) of the standard normal distribution to transform the linear predictor into a probability; the link itself is the standard normal quantile (probit) function. The probit link function is given by:

g(π) = Φ⁻¹(π).
Table 2: How the binary response mean can be written as an appropriate distribution function in the Probit, Logit, and Complementary log-log models. Here z is the argument of the distribution function.

Model                     π as a function of z
Probit                    π = Φ(z)
Logit                     π = 1/(1 + exp(−z))
Complementary log-log     π = 1 − exp(−exp(z))
Fig. 1: Comparison of the distribution functions for the probit, logit, and complementary log-log cases.
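A comparison like Fig. 1 can be drawn with a short R sketch that plots the three distribution functions from Table 2:

z <- seq(-4, 4, by = 0.01)
probit  <- pnorm(z)              # standard normal CDF
logit   <- plogis(z)             # logistic CDF: 1/(1 + exp(-z))
cloglog <- 1 - exp(-exp(z))      # complementary log-log case
matplot(z, cbind(probit, logit, cloglog), type = "l", lty = 1:3, col = 1:3,
        xlab = "z", ylab = "pi")
legend("topleft", c("probit", "logit", "cloglog"), lty = 1:3, col = 1:3)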
Example 1
Determine which of the following pairs of distribution and link function is the most appropriate to model whether a person is hospitalized or not.
(A) Normal distribution, Identity link function
(B) Normal distribution, logit link function
(C) Binomial distribution, linear link function
(D) Binomial distribution, logit link function
(E) It cannot be determined from the information given
Comments: The term “linear link” is not a standard one; presumably it means “identity link”.
Solution: Answer (D). Whether a person is hospitalized or not is a binary variable, which is best modeled by the Binomial (more precisely, Bernoulli) distribution, leaving only answers (C) and (D). The link function should be one that restricts the Bernoulli response mean to the zero-one range. Between the identity and logit links, only the logit link has this property.
Threshold Interpretation:
Suppose that there exists an underlying linear model,

yi* = xi′β + εi*.

Here, we do not observe the response yi*, yet we interpret it to be the propensity to possess a characteristic. For example, we might think about the financial strength of an insurance company as a measure of its propensity to become insolvent (no longer capable of meeting its financial obligations). Under the threshold interpretation, we do not observe the propensity, but we do observe when the propensity crosses a threshold. It is customary to assume that this threshold is 0, for simplicity. Thus, we observe

yi = 0 if yi* ≤ 0,   and   yi = 1 if yi* > 0.
To see how the logit case is derived from the threshold model, assume a logistic distribution function for the disturbances, so that

Pr(εi* ≤ a) = 1 / (1 + exp(−a)).

Like the normal distribution, one can verify by calculating the density that the logistic distribution is symmetric about zero. Thus, −εi* has the same distribution as εi*, and so

πi = Pr(yi = 1) = Pr(xi′β + εi* > 0) = Pr(εi* > −xi′β) = Pr(−εi* < xi′β) = Pr(εi* ≤ xi′β) = 1 / (1 + exp(−xi′β)).

This establishes the threshold interpretation for the logit case. The development for the other two cases is similar and is omitted.
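A quick simulation (a minimal sketch with made-up coefficients) illustrates the threshold interpretation: generate latent propensities with logistic disturbances, record only the threshold crossings, and a logistic regression on the observed 0/1 outcomes recovers the latent-model coefficients.

set.seed(1)
n <- 10000
x <- runif(n, -2, 2)
beta0 <- 0.5; beta1 <- 1.5                # hypothetical latent-model coefficients
ystar <- beta0 + beta1 * x + rlogis(n)    # latent propensity with logistic errors
y <- as.integer(ystar > 0)                # we observe only the threshold crossing
coef(glm(y ~ x, family = binomial(link = logit)))  # estimates close to (0.5, 1.5)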
Logistic Regression:
• Logistic regression is another phrase used to describe the logit case.
• logit(p) = ln( p / (1 − p) ) is the logit function.
Odds Interpretation
- When the response y is binary, knowing only p = Pr(y = 1) summarizes the entire distribution.
- The odds are given by p / (1 − p).
- For example, suppose that y indicates whether a horse wins a race and that p is the probability of the horse winning. If p = 0.25, then the odds of the horse winning are 0.25/(1.00 − 0.25) = 0.3333.
- We might say that the odds of winning are 0.3333 to 1, or 1 to 3.
Thus,

exp(βj) = [ Pr(yi = 1 | xij = 1) / (1 − Pr(yi = 1 | xij = 1)) ] / [ Pr(yi = 1 | xij = 0) / (1 − Pr(yi = 1 | xij = 0)) ].

- This shows that exp(βj) can be expressed as the ratio of two odds, known as the odds ratio.
- That is, the numerator of this expression is the odds when xij = 1, whereas the denominator is the odds when xij = 0.
- Thus, we can say that the odds when xij = 1 are exp(βj) times as large as the odds when xij = 0.
Similarly, assuming that the jth explanatory variable is continuous (differentiable), we have

βj = ∂/∂xij (xi′β) = ∂/∂xij ln[ Pr(yi = 1 | xij) / (1 − Pr(yi = 1 | xij)) ]
   = [ ∂/∂xij ( Pr(yi = 1 | xij) / (1 − Pr(yi = 1 | xij)) ) ] / [ Pr(yi = 1 | xij) / (1 − Pr(yi = 1 | xij)) ].

In particular, with a single explanatory variable x,

β1 = ∂/∂x ln(odds) = (∂(odds)/∂x) / odds,

which is the proportional change (i.e., absolute change divided by the current value) in the odds.
For a binary explanatory variable x,

odds = exp(β0) when x = 0,   and   odds = exp(β0 + β1) when x = 1.

Then

exp(β1) = (odds when x = 1) / (odds when x = 0),

which is the ratio of two odds, called an odds ratio. Equivalently, the odds when x = 1 are exp(β1) times as large as the odds when x = 0. If β1 > 0 (resp. β1 < 0), the odds are higher when x = 1 (resp. x = 0).
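In practice, the odds-ratio interpretation is applied to a fitted model by exponentiating its coefficients; a minimal sketch, where fit is any glm object with a logit link (for example, the one from the simulation above):

fit <- glm(y ~ x, family = binomial(link = logit))
exp(coef(fit))             # each exp(beta_j) is an odds ratio
exp(confint.default(fit))  # Wald confidence intervals on the odds-ratio scale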
Example 2
• One of the models uses a logistic link function and the other uses a probit link function.
• Both models happen to produce the same coefficient estimates β̂0 = 0.02 and β̂1 = 0.3.

Calculate the absolute difference in the predicted values from the two models at x = 4.
(A) less than 0.1
(B) At least 0.1, but less than 0.2
(C) At least 0.2, but less than 0.3
(D) At least 0.3, but less than 0.4
(E) At least 0.4
Solution: Answer (B).
At x = 4, the linear predictor is 0.02 + 0.3 × 4 = 1.22. The probit model predicts Φ(1.22) = 0.8888, while the logit model predicts 1/(1 + exp(−1.22)) = 0.7721. The absolute difference between these two predicted values is 0.8888 − 0.7721 = 0.1167.
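The two predictions are easy to verify in R, where plogis is the logistic CDF and pnorm the standard normal CDF:

eta <- 0.02 + 0.3 * 4      # linear predictor at x = 4
pnorm(eta)                 # probit prediction: 0.8888
plogis(eta)                # logit prediction: 0.7721
pnorm(eta) - plogis(eta)   # absolute difference: 0.1167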
Example 3
A statistician uses a logistic model to predict the probability of success, π, of a
binomial random variable.
You are given the following information:
Example 4
You are given the following information about an insurance policy:
• β0 = 5
• β1 = −0.65
Solution:
The odds of renewal at x = 5 are
exp(β0 + β1x) = exp(5 − 0.65 × 5) = exp(1.75) = 5.7546.
(Answer: (C))
> summary(PosExpglmFull)

Call:
glm(formula = POSEXP ~ AGE + GENDER + factor(RACE) + factor(REGION) +
    factor(EDUC) + factor(PHSTAT) + factor(ANYLIMIT) + factor(INCOME) +
    factor(insure), family = binomial(link = logit), data = df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.3846  -0.4211  -0.3161  -0.2341   2.9673

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)
(Intercept)               -4.784829   0.736136  -6.500 8.04e-11 ***
AGE                       -0.001252   0.007346  -0.170 0.864699
GENDER                     0.734233   0.192474   3.815 0.000136 ***
factor(RACE)BLACK          0.217677   0.570656   0.381 0.702869
factor(RACE)NATIV          0.824219   0.835982   0.986 0.324168
factor(RACE)OTHER          0.007304   0.847974   0.009 0.993127
factor(RACE)WHITE          0.221817   0.532313   0.417 0.676895
factor(REGION)NORTHEAST    0.086940   0.280031   0.310 0.756207
factor(REGION)SOUTH       -0.186302   0.237347  -0.785 0.432493
factor(REGION)WEST        -0.518257   0.274666  -1.887 0.059179 .
factor(EDUC)HIGHSCH       -0.062887   0.229231  -0.274 0.783822
factor(EDUC)LHIGHSC       -0.068306   0.267349  -0.255 0.798342
factor(PHSTAT)FAIR         0.114993   0.357955   0.321 0.748020
factor(PHSTAT)GOOD         0.370522   0.263039   1.409 0.158948
factor(PHSTAT)POOR         1.668471   0.368824   4.524 6.08e-06 ***
factor(PHSTAT)VGOO         0.174648   0.267145   0.654 0.513269
factor(ANYLIMIT)1          0.553917   0.208929   2.651 0.008020 **
factor(INCOME)LINCOME      0.506546   0.303061   1.671 0.094636 .
factor(INCOME)MINCOME      0.311229   0.252860   1.231 0.218384
factor(INCOME)NPOOR        0.711665   0.399682   1.781 0.074981 .
factor(INCOME)POOR         0.910737   0.295533   3.082 0.002058 **
factor(insure)1            1.232008   0.304847   4.041 5.31e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Number of Fisher Scoring iterations: 6
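As a follow-up, the fitted coefficients can be read on the odds-ratio scale. For example, exp(1.232008) ≈ 3.43 for factor(insure)1 says the odds of the modeled outcome are about 3.4 times as large when insure = 1, holding the other variables fixed:

exp(coef(PosExpglmFull))              # coefficients on the odds-ratio scale
anova(PosExpglmFull, test = "Chisq")  # sequential likelihood ratio tests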
Parameter Estimation:
The customary method of estimation for logistic and probit models is maximum
likelihood, described in further detail in Section 11.9. To provide intuition, we
outline the ideas in the context of binary dependent variable regression models.
The likelihood is the observed value of the probability function. For a single observation, the likelihood is

1 − πi if yi = 0,   and   πi if yi = 1.

The objective of maximum likelihood estimation is to find the parameter values that produce the largest likelihood. Finding the maximum of the logarithm of a function yields the same solution as finding the maximum of the function itself. Because it is generally computationally simpler, we consider the logarithmic (or log-) likelihood, written as

ln(1 − πi) if yi = 0,   and   ln(πi) if yi = 1.
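To make the maximization concrete, here is a minimal sketch on simulated data (all names and values are illustrative) that codes the Bernoulli log-likelihood directly and maximizes it numerically; the estimates agree with those from glm:

set.seed(2)
X <- cbind(1, runif(200, -1, 1))                # design matrix with an intercept column
y <- rbinom(200, 1, plogis(X %*% c(-0.5, 1)))   # simulate from a logit model

# log-likelihood: ln(pi_i) when y_i = 1 and ln(1 - pi_i) when y_i = 0
loglik <- function(beta) {
  p <- plogis(X %*% beta)
  sum(y * log(p) + (1 - y) * log(1 - p))
}

opt <- optim(c(0, 0), loglik, control = list(fnscale = -1))  # maximize
opt$par                                         # numerical MLE; compare with:
coef(glm(y ~ X - 1, family = binomial))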
Taking partial derivatives of the log-likelihood L(β) with respect to β yields the score equations:

∂/∂β L(β) = Σ_{i=1}^{n} xi (yi − π(xi′β)) π′(xi′β) / [ π(xi′β)(1 − π(xi′β)) ] = 0,

where π′ is the derivative of π. The solution of these equations, denoted as bMLE, is the maximum likelihood estimator. For the logit function the score equations reduce to

∂/∂β L(β) = Σ_{i=1}^{n} xi (yi − π(xi′β)) = 0,     (1)

where π(z) = 1/(1 + exp(−z)).
Example 6:
Additional Inference:
An estimator of the large-sample variance of β may be calculated by taking partial derivatives of the score equations. Specifically, the term

I(β) = −E[ ∂²L(β) / ∂β ∂β′ ]

is the information matrix. As a special case, using the logit function and equation (1), straightforward calculations show that the information matrix is

I(β) = Σ_{i=1}^{n} σi² xi xi′,

where σi² = π(xi′β)(1 − π(xi′β)). The square root of the (j + 1)th diagonal element of this matrix evaluated at β = bMLE yields the standard error for bj,MLE, denoted as se(bj,MLE).
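Continuing the simulated-data sketch above, the logit information matrix can be assembled directly, and the square roots of the diagonal of its inverse reproduce the standard errors reported by summary(glm):

b_mle <- coef(glm(y ~ X - 1, family = binomial))
p_hat <- as.vector(plogis(X %*% b_mle))
I_hat <- t(X) %*% (X * (p_hat * (1 - p_hat)))   # I(beta) = sum of sigma_i^2 x_i x_i'
sqrt(diag(solve(I_hat)))                        # standard errors se(b_j,MLE)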
To test the overall adequacy of the model, consider the null hypothesis

H0: β1 = β2 = ··· = βk = 0.

The likelihood ratio test statistic is

LRT = 2(L(bMLE) − L0),

where L0 is the maximized log-likelihood of the intercept-only model; under H0, the LRT has an approximate chi-square distribution with k degrees of freedom.
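A minimal sketch of the likelihood ratio test on the same simulated data, with L0 taken from the intercept-only fit:

fit_full <- glm(y ~ X - 1, family = binomial)
fit_null <- glm(y ~ 1, family = binomial)        # intercept-only model gives L0
LRT <- as.numeric(2 * (logLik(fit_full) - logLik(fit_null)))
LRT
pchisq(LRT, df = 1, lower.tail = FALSE)          # p-value; df = number of slopes (here 1)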
Example 7:
Solution: In class (Answer (E))
Example 8:
A practitioner built a GLM to predict claim frequency. She used a Poisson error
structure with a log link. You are given the following information regarding the
model summary statistics:
• The likelihood ratio test statistic for testing the overall model adequacy
is 11.601
Determine the best statement of what we can conclude from the value of the
likelihood ratio test statistic.
Solution: (Answer:(A))
H0: β1 = β2 = β3 = β4 = 0.
Under the null hypothesis, the likelihood ratio test statistic (LRT) has an approximate χ²₄ distribution. Since 11.601 exceeds the 95th percentile of this distribution (9.488) but not the 99th percentile (13.277), the overall model is significant at the 5% level but not at the 1% level.
Nominal Regression:
- Consider a c-level categorical response variable, taking values 1, 2, . . . , c.
- If there is no natural order on the different values of the response variable, then the response is called nominal.
- For example, color (red, green, yellow, purple, . . . ), type of household insurance claim (theft, fire, storm damage, etc.).
- For nominal response variables, one may pursue a generalized logit model by selecting one level, say the last category or the first category, as the baseline category, relative to which the log-odds of the multinomial probabilities πj = Pr(y = j) for j = 1, 2, . . . , c − 1 are modeled in terms of predictors:

ln( πj / πc ) = x′βj = β0j + β1j x1 + ··· + βkj xk,   j = 1, 2, . . . , c − 1,

(note that the denominator is πc, not 1 − πj),
with each level of j having a separate set of parameters βj = (β0j, β1j, . . . , βkj). In particular,
πc = 1 / Σ_{k=1}^{c} exp(x′βk) = 1 / ( 1 + Σ_{k=1}^{c−1} exp(x′βk) ),

with βc = 0 for the baseline category.
Once the parameter estimates β̂k are available, we can estimate the multinomial probabilities at explanatory variables x as

π̂j = exp(x′β̂j) / Σ_{k=1}^{c} exp(x′β̂k),   j = 1, 2, . . . , c.
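In R, a generalized logit model of this kind can be fit with multinom from the nnet package; a minimal sketch in which the data frame df_nom, the nominal response claim_type, and the predictors x1 and x2 are hypothetical names:

library(nnet)
# baseline-category (generalized) logit model; the first factor level is the baseline
fit_nom <- multinom(claim_type ~ x1 + x2, data = df_nom)
summary(fit_nom)                         # one row of estimates beta_j per non-baseline level
head(predict(fit_nom, type = "probs"))   # fitted multinomial probabilities pi_hat_j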
Example 9:
In a study, 100 subjects were asked to choose one of three election candidates (A, B, or C). The subjects were organized into four age categories: (18-30, 31-45, 45-61, 61+).
For age group (18-30), the log-odds for preference of Candidate B and Candidate C were -0.535 and -1.489, respectively.
Calculate the modeled probability of someone from age group (18-30) preferring Candidate B.
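Assuming Candidate A is the baseline category (so its log-odds are 0), the modeled probabilities follow from the formula for π̂j above; a quick check in R:

logodds <- c(A = 0, B = -0.535, C = -1.489)   # log-odds relative to the baseline A
exp(logodds) / sum(exp(logodds))              # pi_hat; Candidate B comes out near 0.323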
Example 10:
You are given the following information about a generalized logit model:
(A) 0.2
(B) 0.3
(C) 0.4
(D) 0.5
(E) 0.6
Solution: In class. (Answer: (B), π̂1 = 0.3203)
Example 11:
III. ANOVA is a useful approach for analyzing the means of groups of continuous response variables, where the groups are categorical.
(A) I only
(B) II only
Solution:
Ordinal Regression:
To fix ideas, consider an ordinal response variable y with c ordered values which, without loss of generality, we label as 1, 2, . . . , c. The distribution of y is described by the probabilities

πj = Pr(y = j),   j = 1, 2, . . . , c,

among which only c − 1 probabilities are free due to the condition Σ_{j=1}^{c} πj = 1, and these require modeling in terms of predictors.
There are different ways to extend ordinary logistic regression to ordinal responses.
• Cumulative logit model: The most common ordinal regression model uses the “logit” link to explain the “cumulative” probabilities τj = π1 + π2 + ··· + πj for j = 1, 2, . . . , c − 1 in terms of explanatory variables, leading to the model equation

ln( τj / (1 − τj) ) = ln( (π1 + π2 + ··· + πj) / (πj+1 + πj+2 + ··· + πc) ) = x′βj,   j = 1, 2, . . . , c − 1,

or

τj = 1 / (1 + exp(−x′βj)).
For example, with c = 3 ordered levels and three explanatory variables,

ln( π1 / (π2 + π3) ) = x′β1 = β01 + β11 x1 + β21 x2 + β31 x3,
ln( (π1 + π2) / π3 ) = x′β2 = β02 + β12 x1 + β22 x2 + β32 x3.

This is the cumulative logit model.
A common simplification lets the slopes be the same for every j:

logit(τj) = ln( τj / (1 − τj) ) = β0j + β1 x1 + β2 x2 + ··· + βk xk,   j = 1, 2, . . . , c − 1,

where the slopes β1, . . . , βk do not depend on j, or

τj = 1 / (1 + exp[−(β0j + β1 x1 + β2 x2 + ··· + βk xk)]).
20
These c − 1 equations have different intercepts but the same slope with respect to each explanatory variable. As a result, the explanatory variables exert the same effect on all c − 1 cumulative probabilities. For this reason, this specification is known as the proportional odds model, which is the default and most important form of ordinal logistic regression.
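In R, the proportional odds model is typically fit with polr from the MASS package; a minimal sketch with hypothetical names (df_ord, x1, x2), noting that polr parameterizes the model as logit(τj) = ζj − x′β, so its slope signs are flipped relative to the equation above:

library(MASS)
# proportional odds (cumulative logit) model; the response must be an ordered factor
fit_ord <- polr(factor(y, ordered = TRUE) ~ x1 + x2, data = df_ord,
                method = "logistic", Hess = TRUE)
summary(fit_ord)   # common slopes plus c - 1 intercepts (zeta)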
Example 12:
You are given the following information about a proportional odds model for the degree of vehicle crash classified on a three-point scale, 1, 2, and 3:
• The model uses two categorical explanatory variables: Age and Sex.

Parameter                  β̂
Intercept 1 (Degree 1)     0.450
Intercept 2 (Degree 2)     5.089
Age: Junior                0.179
Age: Senior                0.000
Sex: F                    -0.172
Sex: M                     0.000
Age × Sex                 -0.129
Calculate the ratio of the odds of having degree of crash 2 or lower for a
junior female to that for a senior male.
(Note that “senior male” is the reference level.)
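A sketch of the computation in R, assuming the parameterization logit(τj) = β0j + (Age effect) + (Sex effect) + (Age × Sex effect), so that the intercept cancels in the ratio:

# odds of degree 2 or lower: exp(beta_02 + effects); beta_02 cancels in the ratio
log_or <- 0.179 + (-0.172) + (-0.129)   # Junior + F + Age x Sex interaction
exp(log_or)                             # odds ratio vs. senior male: about 0.885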