SAHADEB - Logistic Reg - Sessions 8-10
Sahadeb Sarkar
IIM Calcutta
Readings: Chapters 1-4
Odds & Odds Ratio:
Extension to Regression Analysis
Response Y is binary and the covariate X may be continuous instead of categorical.
Logit(x) = ln[P(Y=1|x) / P(Y=0|x)] = β0 + β1 x;
e^{β1} = odds ratio, for X = x and X = x+1.
H0: β1 = 0 means Logit(x) = β0 does not depend on x (or e^{β1} = odds ratio = 1, for X = x and X = x+1), i.e., the odds of Y=1 do not change with the x-value.
Logistic Regression/ Binary Logit Model
Problems with Classical Linear Regression
Solution with Logistic Regression
Estimates: a = –6.39, b = 0.44
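A minimal R sketch of where such estimates come from, via glm with the binomial family (the data here are hypothetical stand-ins, not the slide's actual data set):

# Hypothetical data: x = score on a predictor, y = 1 (pass) / 0 (fail)
x <- c(5, 8, 10, 12, 13, 14, 16, 18, 20, 25)
y <- c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)   # intercept a and slope b in Logit(x) = a + b*x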
Comparison to Classical Linear Regression
Comparison to Classical Linear Regression
(contd.)
• Instead of the sum of squared errors (SSE), one calculates the negative of the log-likelihood for the fitted model.
• Interpretation of the parameters β1, …, βk is now done through the log of odds, called the "logit", or equivalently through the odds. For example, suppose X1 changes by one unit and the values of X2, …, Xk remain constant. Then β1 denotes the change in the logit; equivalently, e^{β1} denotes the multiplicative change-factor in the odds of Y=1 when X1 changes its value from x to x+1.
[Odds(Y=1 | X1 = x+1) = Odds(Y=1 | X1 = x) × e^{β1}, while X2, …, Xk remain at the same "levels"/values.]
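Continuing the R sketch above, exponentiating a coefficient gives this multiplicative change-factor in the odds:

exp(coef(fit))              # e^b = factor by which the odds change per unit increase in x
exp(confint.default(fit))   # Wald confidence limits, on the odds-ratio scale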
Intentionally Kept Blank
Properties of the Logit Model
Estimation and Model Fit
MLE calculation - Example 1:
The distribution of scores in a public examination is
normal with unknown mean µ and known standard
deviation 5. It is, however, known that µ equals 52,
60 or 67. If the scores of three randomly chosen
candidates turn out to be 88, 93, 59, then find the
maximum likelihood estimate of µ.
MLE calculation - Example 1(contd.):
The distribution of scores in a public examination is normal
with unknown mean µ and known standard deviation 5. It is,
however, known that µ equals 52, 60 or 67. If the scores of
three randomly chosen candidates turn out to be 88, 93, 59,
then find the maximum likelihood estimate of µ.
(Ans. mle of µ is 67; sample mean = 80)
L(µ) = ∏_{i=1}^{n} [1/(5√(2π))] exp(−(X_i − µ)²/(2·5²))
log_e L(µ) = constant − Σ_{i=1}^{n} (X_i − µ)² / (2·5²)
= constant − Σ_{i=1}^{n} [(X_i − X̄) + (X̄ − µ)]² / 50
= constant − [Σ_{i=1}^{n} (X_i − X̄)² + n(X̄ − µ)² + 2(X̄ − µ) Σ_{i=1}^{n} (X_i − X̄)] / 50
Since Σ_{i=1}^{n} (X_i − X̄) = 0, maximizing L(µ) over {52, 60, 67} amounts to minimizing (X̄ − µ)²; here X̄ = 80, so the mle is the candidate nearest 80, namely µ̂ = 67.
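A quick R check of Example 1: evaluate the normal log-likelihood at the three candidate means and pick the largest.

scores <- c(88, 93, 59)
cand   <- c(52, 60, 67)
ll <- sapply(cand, function(mu) sum(dnorm(scores, mean = mu, sd = 5, log = TRUE)))
cand[which.max(ll)]   # 67, the candidate nearest mean(scores) = 80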
MLE calculation - Example 2:
(a) The distribution of a certain quality characteristic is continuous uniform over the range [−2, θ], where θ is unknown. If the values of the quality characteristic in a random sample of size five are 0.63, 0.27, −1.62, 0.71 and −1.38, then what is the maximum likelihood estimate of θ? (Ans. ??)
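Here the likelihood is L(θ) = (θ + 2)^(−5) for θ ≥ max(x_i) and 0 otherwise, which decreases in θ, so the mle is the sample maximum. A grid-search sketch in R:

x  <- c(0.63, 0.27, -1.62, 0.71, -1.38)
ll <- function(theta) ifelse(theta >= max(x), -length(x) * log(theta + 2), -Inf)
grid <- seq(0.5, 3, by = 0.01)
grid[which.max(ll(grid))]   # 0.71 = max(x)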
Example 3. MLE calculation:
An electrical unit consists of three identical components, and the unit functions if and only if at least one of the components functions. The components are known to function independently, each with an exponential lifetime distribution. Out of 500 randomly chosen units of this type, 468 were found to be functioning for 10 hours or more. On the basis of this information, (i) find the maximum likelihood estimate of the probability for an individual component to function for 10 hours or more; (ii) find the mle of the expected life of an electrical unit. (Ans. (i) 0.936 ?? (ii) 151.2 hours ??)
Example 3. MLE calculation (contd.)
Solution: (i) Let π = P(any component functions for 10 hours or more) = exp(−10/θ), where θ = expected life of any component.
• p = P(i-th unit functions for 10 hours or more) = 1 − P(i-th unit functions for less than 10 hours) = 1 − P(all three components of the i-th unit function for less than 10 hours) = 1 − (1 − π)³
• Now we are given a sample of binary observations on 500 units: X1, …, X500 ~ B(1, p).
• The sum of X1, …, X500 is 468, and hence the mle of p [= 1 − (1 − π)³] = sample proportion = 468/500 = 0.936.
• Thus the mle of π = 0.6 [solving 0.936 = 1 − (1 − π)³ and using the invariance property of the mle].
• (ii) The mle of the expected life of an electrical component is θ̂ = −10/ln(mle of π) = −10/ln(0.6) = 19.57615 hours [π = exp(−10/θ) and the invariance property of the mle].
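The arithmetic of this solution can be verified in a few lines of R:

p_hat     <- 468 / 500               # mle of p = P(unit lasts >= 10 h) = 0.936
pi_hat    <- 1 - (1 - p_hat)^(1/3)   # inverting p = 1 - (1 - pi)^3 gives 0.6
theta_hat <- -10 / log(pi_hat)       # inverting pi = exp(-10/theta) gives 19.576 h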
Logistic Regression:
Estimating Model Parameters
• Linear Regression model uses OLS method to minimize sum
of squared errors of prediction
• Logistic Regression model maximizes “likelihood” of
observing y1,…,yn, defined by
Estimation (contd.)
L(π1, …, πn) = ∏_{i=1}^{n} π_i^{y_i} (1 − π_i)^{1−y_i}
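A sketch of this maximization done directly with optim on the negative log-likelihood (hypothetical data as before; glm performs the same maximization via iteratively reweighted least squares):

x <- c(5, 8, 10, 12, 13, 14, 16, 18, 20, 25)
y <- c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
negloglik <- function(b) {
  p <- 1 / (1 + exp(-(b[1] + b[2] * x)))   # pi_i under (b0, b1)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}
optim(c(0, 0), negloglik)$par   # close to coef(glm(y ~ x, family = binomial))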
Model-Fit Measures
(https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds)
Model Fit Measures
(https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds)
Cox & Snell R square: R² = 1 − (L0/L)^{2/n},
where L0 = maximum likelihood of the intercept-only model and L = maximum likelihood under the specified model; the ratio of the likelihoods reflects the improvement of the specified model over the intercept-only model. If the specified model fits the data perfectly and hence L = 1, the maximum value of the Cox & Snell R square will be R²_max = 1 − (L0)^{2/n} < 1.
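A sketch of the computation in R, using log-likelihoods so that (L0/L)^{2/n} = exp(2/n × (log L0 − log L)) (y and x as in the earlier hypothetical sketch):

fit0 <- glm(y ~ 1, family = binomial)   # intercept-only model
fit1 <- glm(y ~ x, family = binomial)   # specified model
n  <- length(y)
cs <- 1 - exp(2 / n * (as.numeric(logLik(fit0)) - as.numeric(logLik(fit1))))
cs                                                  # Cox & Snell R^2
cs / (1 - exp(2 / n * as.numeric(logLik(fit0))))    # Nagelkerke rescaling, max = 1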
Interpretation of Coefficients
Interpretation of Coefficients
If Xi is increased by one unit, the log odds will change by
i units, when the values of other independent variables
are held constant.
Log(Odds)
Sign of i will determine Probability Odds =Logit
0 0 NC
whether the probability 0.1 0.11 -2.20
increases (if the sign is 0.2 0.25 -1.39
positive) or decreases (if the 0.3 0.43 -0.85
0.4 0.67 -0.41
sign is negative). 0.5 1.00 0.00
0.6 1.50 0.41
0.7 2.33 0.85
0.8 4.00 1.39
0.9 9.00 2.20
1 NC NC
30
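The table is reproduced by a short R computation:

p <- seq(0.1, 0.9, by = 0.1)
round(cbind(Probability = p, Odds = p / (1 - p), Logit = log(p / (1 - p))), 2)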
Intentionally Kept Blank
Prescribe “Buy/Sell/Hold” for a Security
Daily data on Price, Volume, MACD, RSI, Put-Call Ratio, Open Interest, etc., and a "Buy/Sell/Hold prediction" (attached in hindsight on the basis of the next trading day's closing price)
Pass Fail Data Example
(choice of cut-off value matters)
% Correct classification depends on the cutoff prob. value
Pass Fail Data Example
http://127.0.0.1:57322/help/index.jsp?topic=/com.ibm.spss.statistics.help/idh_lreg_sav.htm
Intentionally Kept Blank
Example: Insurance Requirement
Data (2-level Response)
SPSS Windows: Logit Analysis
1. Select ANALYZE from the SPSS menu bar.
6. Click OK.
Recovering Probability (π) from Logit
logit = ln[π/(1 − π)] = β0 + β1 X1 + ⋯ + βp Xp
π = exp(logit)/(1 + exp(logit)) = 1/(1 + exp(−logit))
1 − π = 1/(1 + exp(logit))
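In R, plogis is the inverse logit. A sketch using the illustrative estimates a = −6.39, b = 0.44 from the earlier slide and a hypothetical covariate value x = 20:

logit <- -6.39 + 0.44 * 20
plogis(logit)       # pi = exp(logit)/(1 + exp(logit)) = 1/(1 + exp(-logit))
1 - plogis(logit)   # 1/(1 + exp(logit))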
Output (2-level response)
Output (2-level response)
https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds
http://127.0.0.1:57322/help/index.jsp?topic=/com.ibm.spss.statistics.help/idh_lreg_sav.htm
Output (2-level response; w/o Income)
Output (2-level response; w/o Income)
Classification Table
(or Confusion Matrix or Error Table)
• When there are two categories, Y=1 and Y=2, this will produce a 2×2 table of counts for classification or misclassification.
• When Y has 3 categories, one will have a 3×3 table of counts for classification or misclassification, with sample size n = n11 + n12 + n13 + n21 + n22 + n23 + n31 + n32 + n33.
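A sketch of producing such a table in R at cut-off 0.5 (fit and y from the earlier hypothetical glm sketch):

pred <- ifelse(fitted(fit) >= 0.5, 1, 0)
table(Observed = y, Predicted = pred)   # 2x2 classification table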
Receiver Operating Characteristic
(ROC) Curve
• Sensitivity is the true positive rate (TPR) since it is the rate of
correctly classifying an observed “yes” as “yes” (true positive)
by the model for a fixed cut-off or threshold value.
• Similarly, (1-specificity) is known as the false positive rate
(FPR) as it represents the rate of incorrectly classifying an
observed “no” as “yes” (false positive) by the model for the
same cut-off or threshold value.
• Thus, the ROC plot of sensitivity versus (1 − specificity) can be thought of as a plot of the power versus the Type I error of the decision (classification) rule based on the chosen cut-off or threshold value, using the model-predicted probabilities.
Receiver Operating Characteristic (ROC) Curve
• Let x = FPR and y = TPR. Then ROC points (x,y) represent relative
trade-off between true positive rate (beneficial) and false positive
rate (costly). The best possible scenario for prediction would be to
have the point (0,1) in the upper left corner of the ROC space, as it
represents 100% sensitivity (no false negatives) and 100% specificity
(no false positives).
• The (0,1) point is also known as perfect classification.
Receiver Operating Characteristic (ROC) Curve
• Thus, one desires to see the ROC curve as far above the diagonal as possible. A binary logistic regression model producing the largest area under the ROC curve (AUC), computed over the various choices of cut-off value p ∈ [0,1], may be declared the best model.
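A sketch of the ROC curve and AUC, assuming the add-on pROC package (y and fit from the earlier hypothetical glm sketch):

library(pROC)              # assumption: install.packages("pROC") beforehand
r <- roc(y, fitted(fit))   # sensitivity/specificity over all cut-offs
plot(r); auc(r)            # the curve, and the area under it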
Receiver Operating Characteristic (ROC) Curve
Pearson Chi-Square & the Deviance Tests
The Pearson Chi-Square test is mainly used for categorical explanatory variables. For a sample of n subjects with a binary response and a number of covariates, using a fitted binary logit model one can construct an I × 2 table of observed counts, with rows consisting of the I distinct patterns of the covariates in the data and columns representing the two levels of the binary response. For each of these I patterns, construct model-based expected counts (Eij) and compare them with the observed counts (nij), i = 1, …, I, j = 1, …, J.
Pass-fail data
Pearson Chi-Square Tests
• Then the Pearson Chi-Square test statistic is defined by
PCS = Σ_{i=1}^{I} Σ_{j=1}^{J} (n_ij − E_ij)² / E_ij
• If each of the I × J cells has 5 or more observations n_ij, then under the null hypothesis of a correct model, PCS has an asymptotic chi-square distribution with degrees of freedom (df) = [I(J−1) − m], where m = number of (independent) model parameters being estimated. If J = 2, then the PCS has df = (I − m).
Deviance test statistic
The deviance test statistic (denoted by G²) also compares the observed to the expected counts; it is defined by
G² = 2 Σ_{i=1}^{I} Σ_{j=1}^{J} n_ij ln(n_ij / E_ij).
It has the same df = [I(J−1) − m], where m = number of (independent) model parameters being estimated. The deviance test is the likelihood ratio test, LRT = −2 log_e(Λ) = −2 log_e(L0/L), when H0: θ ∈ Θ0 is true (H1: θ ∈ Θ0^c).
Example: One may test the null hypothesis that the conditional (i.e., given x_i) distributions of the responses {Y_i, i = 1, …, I} are independent binomial(1, π(x_i)) distributions, where
π(x_i) = exp(β0 + β1 x_1i + ⋯ + βk x_ki) / [1 + exp(β0 + β1 x_1i + ⋯ + βk x_ki)],
against the saturated model, which does not assume {π(x_i), i = 1, …, n} to be related in any manner. In this case, J = 2 and m = k + 1.
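In R, the residual deviance of a fitted binomial glm is this G² against the saturated model; a sketch of the corresponding p-value (valid only when the counts per covariate pattern are large, not for ungrouped binary data):

pchisq(deviance(fit), df = df.residual(fit), lower.tail = FALSE)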
PCS & LRT statistics as members of Power Divergence
Family for multinomial goodness of fit tests
2nI^λ = [2/(λ(λ+1))] Σ_{i=1}^{k} O_i [(O_i/E_i)^λ − 1], λ ∈ ℝ,
where O_i = X_i is the observed frequency for the ith category in a sample of size n from a multinomial(1, (π_1, …, π_k)) distribution.
Note: [2/(λ(λ+1))] O_i [(O_i/E_i)^λ − 1] denotes the deviance (or power divergence) between the observed and expected counts/frequencies for the ith category. The choice λ = 1 gives the PCS statistic, and the limit λ → 0 gives the deviance/LRT statistic G².
Pass-fail data
Hosmer Lemeshow Test: Binary case
Pearson Chi-square test is not appropriate when there are
continuous (i.e., non-categorical) explanatory variables if they are
not grouped. In this case for each “covariate” pattern X=x, there
will usually be one observation, and hence the asymptotic or
approximate chi-square distribution for the PCS or the Deviance
test is not valid.
This problem is also faced if the number of categorical binary predictors k is large. For example, if k is 6, one may have as many as I = 2^k = 64 distinct combinations or patterns of covariates.
To solve this problem of large I, Hosmer and Lemeshow (1980) order the n subjects according to the fitted probabilities π̂(x) = Pr(Y = "Yes" | x) and then divide the n subjects into g groups of comparable sizes, where g is a small number, often chosen to be 10.
Hosmer Lemeshow Test: Binary case
n_ij = number of observations in the j-th cell of the i-th group; E_i1 = sum of the fitted probabilities π̂(x) = Pr(Y = "Yes" | x) of the subjects in the i-th group, and E_i2 = n_i+ − E_i1. A PCS-type test statistic χ²_HL is then constructed from these observed and expected counts.
In simulation studies, it has been seen that χ²_HL has approximately a chi-square distribution with df = (g − 2).
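A sketch of the test, assuming the add-on ResourceSelection package (y and fit from the earlier hypothetical glm sketch; g is often 10, smaller here only because the toy sample is tiny):

library(ResourceSelection)           # assumption: package installed
hoslem.test(y, fitted(fit), g = 5)   # Hosmer-Lemeshow chi-square and p-value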
Hosmer Lemeshow Test:
Multinomial Response Case
• Fagerland and Hosmer (2012) generalized the Hosmer–Lemeshow test to multinomial logistic regression models. The groups are formed in a similar manner as in the binary case, and the table of observed and expected counts for the J categories (namely 0, 1, …, J−1) takes the form:
Exercise
• Fagerland, Hosmer and Bofin (2008) showed that χ²_FH has an asymptotic chi-square distribution with df = (g − 2)(J − 1). Various statistical software packages readily provide the χ²_HL test in the binary case, but do not usually produce the χ²_FH test in the multinomial case. One may compute the latter by writing program code (in R, etc.).
AIC, BIC
The Akaike Information Criterion:
AIC = 2m − 2 log_e(estimated maximum likelihood value under a model),
where m = number of parameters being estimated in the model. The term 2m penalizes a model with a higher number of parameters, to reward a parsimonious model. Given a number of candidate models for the data, the preferred model is the one with the minimum AIC value. This enables us to compare two models, nested or not.
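In R, AIC() compares fitted models directly; a sketch with the earlier hypothetical fits (BIC replaces the penalty 2m by m·log n):

AIC(fit0, fit1)   # smaller is preferred; the models need not be nested
BIC(fit0, fit1)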
AIC, BIC (contd)
AIC is not consistent, in that its probability of selecting the true model among the candidates does not tend to one as the sample size increases to infinity.
Intentionally Kept Blank
Example: Insurance Requirement
Multinomial Logit Model
• Logit_j = ln(π_j / π_m) = β_0j + β_1j X1 + … + β_kj Xk, j = 1, …, (m−1)
i.e., π_j = exp(Logit_j) / [1 + Σ_{l=1}^{m−1} exp(Logit_l)],
and π_m = 1 / [1 + Σ_{l=1}^{m−1} exp(Logit_l)].
Multinomial Logit Model
Logit_j: ln(π_j / π_m) = β_0j + β_1j X1 + … + β_kj Xk, j = 1, …, (m−1)
Likelihood function L = π_3(obs 1) · π_2(obs 2) · π_3(obs 3) · π_1(obs 4) · π_3(obs 5) · π_2(obs 6) ⋯ π_2(obs 100)
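A sketch of fitting such a model, assuming the add-on nnet package and simulated 3-level data:

library(nnet)   # assumption: package installed
set.seed(1)
d <- data.frame(yy = factor(sample(c("buy", "sell", "hold"), 100, replace = TRUE)),
                x1 = rnorm(100))
mfit <- multinom(yy ~ x1, data = d)   # (m - 1) = 2 logits against the baseline level
summary(mfit)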
Output (3-level response)
Output (3-level response)
df = [I(J−1) − m], where m = no. of (independent) model parameters being estimated, I = no. of distinct value-combinations of covariates, J = no. of response categories.
Output (3-level response; w/o Income)
Output (3-level response; w/o Income)
Bankruptcy Example
Bankruptcy Example
(Applied Multivariate Statistical Analysis by Johnson & Wichern)
Seq#   CF/TD   NI/TD   CA/CL   CA/NS   Bankrupt
17 0.51 0.1 2.49 0.54 1
18 0.08 0.02 2.01 0.53 1
19 0.38 0.11 3.27 0.35 1
20 0.19 0.05 2.25 0.33 1
21 0.32 0.07 4.24 0.63 1
22 0.31 0.05 4.45 0.69 1
23 0.12 0.05 2.52 0.69 1
24 -0.02 0.02 2.05 0.35 1
25 0.22 0.08 2.35 0.4 1
26 0.17 0.07 1.8 0.52 1
27 0.15 0.05 2.17 0.55 1
28 -0.1 -0.01 2.5 0.58 1
29 0.14 -0.03 0.46 0.26 1
30 0.14 0.07 2.61 0.52 1
31 0.15 0.06 2.23 0.56 1
32 0.16 0.05 2.31 0.2 1
33 0.29 0.06 1.84 0.38 1
34 0.54 0.11 2.33 0.48 1
35 -0.33 -0.09 3.01 0.47 1
36 0.48 0.09 1.24 0.18 1
SPSS Windows: Logit Analysis
1. Select ANALYZE from the SPSS menu bar.
6. Click OK.
Regressors: CFTD, CACL, NITD, CANS
Regressors: CFTD, CACL
Intentionally Kept Blank
Example: Survival of the Donner Party
Survival of the Donner Party (contd.)
…and again in the crossing of the desert west of the Great Salt Lake. The group became stranded in the eastern Sierra Nevada mountains when the region was hit by heavy snow in October. By the time the last survivor was rescued on April 21, 1847, 40 of the 87 members had died of famine, hunger, and exposure to extreme cold.
Donner Party Data (age above 15yrs)
[sex=1 for males]
Route of the Donner Party
The Logistic Regression Model
• The response variable in this case study is binary; it takes values 0 and 1. The Donner Party study involves two explanatory variables, one categorical and one numerical. Logistic regression is the appropriate tool in such a situation.
Example: Donner Party
GLM code in R
> glm_survive = glm(survive ~ age+sex, family = binomial(logit),
data = donner)
> summary(glm_survive)
• Coefficients:
(Intercept)      age      sex
    3.23041  −0.0782  −1.5973
• Degrees of freedom: 44 Total (i.e., Null); 42 Residual (Model)
Residual deviance = 51.26, df = 42 (= 45 − 3), AIC = 57.26, P-value (HL test) = 0.19, AUC = 0.73, correct classification % = 77.8, Nagelkerke R² = 0.28
Null deviance: for the model that includes only the intercept
Residual deviance: for the model with the explanatory variables included
Example: Donner Party
• The fit of the logistic regression model to the Donner Party data (45 observations with age ≥ 15), where π represents the survival probability, gives
logit(π_i) = ln(π_i/(1 − π_i)) = 3.23 − 0.078 age_i − 1.60 sex_i,
where "sex" is an indicator variable for males.
• Interpretation:
−1.60 = ln(odds ratio) for a male compared to a female of the same age
−0.078 = ln(odds ratio) for a person one year older, of the same gender
Odds from Logistic Regression
ω = odds of Y being 1 (survival) = P(Y=1)/P(Y=0)
In logistic regression:
ω = exp(β0 + β1 X1 + … + βk Xk)
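A sketch of the odds and survival probability implied by the fitted Donner model, for a hypothetical 25-year-old male:

logit <- 3.23 - 0.078 * 25 - 1.60 * 1   # intercept, age, sex (male = 1)
exp(logit)      # omega = odds of survival
plogis(logit)   # probability of survival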
Interpretation of Coefficients (Donner Party)
Testing Presence of Interaction (Donner data)
Wald's test for the hypothesis that the coefficient of the interaction term is zero, in the logistic regression of survival (1 or 0) on age, sex (=1 for males), and the interaction term age × sex:
Donner Party data (n = 45)

Variable    Coefficient   Standard Error   Z-Statistic
Constant       7.246          3.205           2.261
age           −0.194          0.087          −2.220
sex           −6.928          3.399          −2.038
age × sex      0.162          0.094           1.714

Residual deviance = 47.34, df = 41 (= 45 − 4), AIC = 55.35, P-value (HL test) = 0.65, AUC = 0.762, correct classification % = 73.3, Nagelkerke R² = 0.368
From the normal distribution, two-sided p-value = 2 × Pr(Z > 1.714) = 0.087
Conclusion: There is suggestive but inconclusive(?) evidence of an interaction.
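The interaction fit reported above can be reproduced with the formula operator *, which expands to the main effects plus their product (donner data frame as in the earlier glm call):

glm_int <- glm(survive ~ age * sex, family = binomial(logit), data = donner)
summary(glm_int)   # the Wald z for the age:sex row tests the interaction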
Example (Confidence Interval for a Coefficient)
Confidence interval for the odds of survival for females divided by the odds of survival for males, accounting for age, from the model without interaction:
95% confidence interval for the coefficient of sex: −1.597 ± 1.96 × 0.755 = −3.077 to −0.117
Take anti-logarithms of the endpoints to get an interval for the odds ratio: 0.046 to 0.889
Conclusion: The odds of survival for males are estimated to have been exp(−1.597) = 0.203, i.e., about one-fifth of the odds of survival for females of similar age.
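The interval arithmetic in two lines of R:

ci <- -1.597 + c(-1, 1) * 1.96 * 0.755   # Wald CI for the sex coefficient
exp(ci)                                  # odds-ratio interval, about 0.046 to 0.889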
Generalized Linear Model (GLM)
Generalized Linear Model (GLM)
• A generalized linear model is one where some function of the mean of the response variable is related to the explanatory variables through a linear regression equation:
g(μ) = β0 + β1 X1 + ⋯ + βk Xk
Generalized Linear Model (GLM)
A GLM has three components:
• A random component identifies the response variable Y and
its probability distribution.
• A systematic component which specifies the explanatory
variables used in a linear predictor function.
• A link function which specifies the function of E(Y) that the
model equates to the systematic component.
Example (classical linear regression model, quantitative Y):
• Y ~ N(μ, σ²), E(Y) = μ;
• X1, …, Xk, with linear predictor β0 + β1 X1 + ⋯ + βk Xk;
• g(μ) = μ
Generalized Linear Model (GLM)
Example (classical linear regression model, quantitative Y):
• Y ~ N(μ, σ²), E(Y) = μ;
• X1, …, Xk, β0 + β1 X1 + ⋯ + βk Xk;
• g(μ) = μ [identity link function]
Example (binary logistic regression model, qualitative Y):
• Y ~ Bin(1, p), E(Y) = p;
• X1, …, Xk, β0 + β1 X1 + ⋯ + βk Xk;
• g(p) = ln(p/(1 − p)) [logit link function]
Example (Poisson regression model, count variable Y):
• Y ~ Poi(λ), E(Y) = λ;
• X1, …, Xk, β0 + β1 X1 + ⋯ + βk Xk;
• g(λ) = ln(λ) [log link function]
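The three examples correspond to family/link choices in R's glm; a minimal sketch with simulated placeholder data:

set.seed(1)
dat <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
dat$yy <- rbinom(30, 1, 0.5)                                          # 0/1 placeholder response
glm(yy ~ x1 + x2, family = gaussian(link = "identity"), data = dat)   # classical linear
glm(yy ~ x1 + x2, family = binomial(link = "logit"),    data = dat)   # binary logistic
glm(yy ~ x1 + x2, family = poisson(link = "log"),       data = dat)   # Poisson log-linear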
Generalized Linear Model (GLM)
• A GLM has three components:
• A random component identifies the response variable Y and
its probability distribution.
• A systematic component which specifies the explanatory
variables used in a linear predictor function.
• A link function which specifies the function of E(Y)=µ that the
model equates to the systematic component.
Exponential Family of Distributions with Vector of Parameters
The natural exponential family has probability density function or probability mass function given in the form
f(y; θ) = h(y) c(θ) exp[Σ_j θ_j T_j(y)],
i.e., a product of a term in y alone, a term in the parameters alone, and an exponential term linking the two.
Exponential Family of Distributions with Vector of Parameters
f(y; μ) = [1/(σ√(2π))] e^{−(y−μ)²/(2σ²)} = [1/(σ√(2π))] e^{−y²/(2σ²)} · e^{−μ²/(2σ²)} · e^{yμ/σ²}
= ( [1/(σ√(2π))] e^{−y²/(2σ²)} ) ( e^{−μ²/(2σ²)} ) ( e^{(μ/σ²) y} )
Equivalently, carrying both y and y² in the exponent,
= [1/√(2π)] ( e^{−μ²/(2σ²)} ) exp( y(μ/σ²) + y²(−1/(2σ²)) )
Here the canonical (natural) parameter vector is (θ1, θ2) = (μ/σ², −1/(2σ²)).
Exponential Family of Distributions
For Y ~ Bin(1, π):
f(y; π) = π^y (1 − π)^{1−y} = (1 − π) [π/(1 − π)]^y = (1 − π) exp( y ln(π/(1 − π)) )
Here, the logit link is the canonical link. The link function that transforms the mean to the natural parameter is called the canonical link.
Exponential Family of Distributions
For Y ~ Poi(λ):
f(y; λ) = e^{−λ} λ^y / y! = e^{−λ} (1/y!) exp( y ln(λ) )
Here the log link is the canonical link. The link function that transforms the mean to the natural parameter is called the canonical link.
Log-linear Regression for Poisson Counts
Interpretation of the Poisson Regression Coefficients
With one explanatory variable, our model is:
log E(Y) = α + β x
The log of the expected value of Y is a linear function of the explanatory variable; the expected value of Y is a multiplicative function of x:
E(Y) = exp(α + β x) = exp(α) exp(β x)
Interpretation of the Regression Coefficient
Now suppose the value of x is one unit larger, i.e., (x + 1). Then the mean of the response becomes:
E(Y | X = x+1) = exp(α + β(x+1)) = exp(β) [exp(α) exp(β x)] = exp(β) E(Y | X = x)
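A sketch in R, using the first column of (Age, Matings) pairs transcribed from the elephant data on the next slide:

age     <- c(27, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 30, 32, 33)
matings <- c( 0,  1,  1,  1,  3,  0,  0,  0,  2,  2,  2,  1,  2,  4)
pois_fit <- glm(matings ~ age, family = poisson(link = "log"))
exp(coef(pois_fit))["age"]   # multiplicative change in mean matings per year of age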
Age and Mating Success of Male Elephants
[Ramsey and Schafer]
(paired columns: Age, number of Matings)
27 0 33 3 39 1
28 1 33 3 41 3
28 1 33 3 42 4
28 1 33 2 43 0
28 3 34 1 43 2
29 0 34 1 43 3
29 0 34 2 43 4
29 0 34 3 43 9
29 2 36 5 44 3
29 2 36 6 45 5
29 2 37 1 47 7
30 1 37 1 48 2
32 2 37 6 52 9
33 4 38 2
Age and Mating Success of Male Elephants
Fitting the Poisson Log-Linear Model
• The likelihood has the form
L = ∏_{i=1}^{n} e^{−μ_i} μ_i^{y_i} (1/y_i!),
where ln(μ_i) = β0 + β1 x_i,
or ln(μ_i) = β0 + β1 x_i + β2 x_i²
Fitted Log-Linear Regression Model

Variable    Coefficient    Standard Error    t-statistic    Two-sided p-value
Constant      −2.857           3.036            0.941            0.3467
Estimated Means and Variances
Estimated Means and Variances
Goodness of Fit
(Regression for Count Data)
• PCS = Σ_{i=1}^{n} (y_i − μ̂_i)² / μ̂_i ~ χ²_{n−p} asymptotically, if the {μ̂_i, i = 1, …, n} are "large";
p = number of parameters in the Poisson regression model
• Deviance = 2 Σ_{i=1}^{n} [ y_i ln(y_i/μ̂_i) − (y_i − μ̂_i) ] ~ χ²_{n−p}
• If PCS/(n−p) and Deviance/(n−p) are much larger than 1, the model is suspect.
• Exercise: Examples 5.1, 5.3 (pp. 176, 181, Text)
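A sketch of both ratios for the Poisson fit from the earlier elephant sketch:

pcs <- sum(residuals(pois_fit, type = "pearson")^2)   # Pearson chi-square
c(PCS = pcs, Deviance = deviance(pois_fit)) / df.residual(pois_fit)
# values much larger than 1 make the model suspect (overdispersion)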
Elephant Data: Goodness of Fit
(n=41, df1=39, df2=38)
Australian AIDS Data Example
Source: Internet
Correction: In the table above, read “month” as “quarter”
Australian AIDS Data (Contd.)
Australian AIDS Data (Contd.)
• Fitted Regression Model: log(μ) = 0.2596 + 0.2565 x
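A sketch of the fitted means from this model, with x taken as the quarter index (per the correction note above; the range 1–14 is illustrative):

quarter <- 1:14
exp(0.2596 + 0.2565 * quarter)   # fitted mean counts mu = exp(0.2596 + 0.2565 x)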
Goodness of Fit
Intentionally Kept Blank
Intentionally Kept Blank
Retrospective studies
In the logistic regression model, the probabilities of binary responses are modeled prospectively as functions of explanatory variables. In studies in which the probability of a "yes" response (e.g., cancer) is very small, independent samples are drawn retrospectively. Prospective probabilities cannot be estimated from such retrospective sampling. However, since the odds ratio is invariant to the study design, it can be estimated from a retrospective study using the logistic regression model.
In a logistic regression model for a retrospective study, the estimated intercept is not an estimate of the prospective intercept.
Logistic Regression with Retrospective Studies
(Agresti, p. 168, 3rd ed)
Let Z = 1 if a subject is sampled; let Z = 0 otherwise.
Let ρ1 = P(Z=1 | y=1) = probability of sampling a "case" [= P(Z=1 | y=1, x), assumed]
Let ρ0 = P(Z=1 | y=0) = probability of sampling a "control" [= P(Z=1 | y=0, x), assumed]
We have not sampled from the conditional distribution of the response Y given X = x; instead,
P(Y=1 | Z=1, x) = P(Z=1 | Y=1, x) P(Y=1 | x) / Σ_{j=0}^{1} P(Z=1 | Y=j, x) P(Y=j | x)
= ρ1 P(Y=1|x) / [ρ1 P(Y=1|x) + ρ0 P(Y=0|x)]
= ρ1 [P(Y=1|x)/P(Y=0|x)] / { ρ1 [P(Y=1|x)/P(Y=0|x)] + ρ0 }
= ρ1 e^{α+βx} / [ρ1 e^{α+βx} + ρ0]
= (ρ1/ρ0) e^{α+βx} / [(ρ1/ρ0) e^{α+βx} + 1]
= e^{ln(ρ1/ρ0)+α+βx} / [e^{ln(ρ1/ρ0)+α+βx} + 1]
Hence logit(P(Y=1 | Z=1, x)) = [ln(ρ1/ρ0) + α] + βx = α* + β* x,
where logit(P(Y=1 | x)) = α + βx.
Thus β* = β, but α* = α + ln(ρ1/ρ0).
Logistic Regression with Retrospective Studies
(Agresti, p. 168, 3rd ed)
CANCER CONTROL TOTAL
SMOKER 83 72 155
NON-SMOKER 3 14 17
TOTAL 86 86 172
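Because the odds ratio is invariant to the study design, it can be read off this retrospective table directly; a one-line check:

(83 / 3) / (72 / 14)   # = (83*14)/(72*3), approx. 5.38: smokers' odds of cancer relative to non-smokers'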
Continuous Covariate and Invariance to Linear Transformation, p. 122, Text
Let x′ = a + x/k [e.g., x′ = body weight in kg, x = weight in lb, k = 2.20462 (lb per kg), a = 0]. Thus {x′ = 1} corresponds to {x = k − ka}, and {x′ = 0} to {x = −ka}.
Let the two logistic regression models (Y = 1 for heart attack) be defined by
logit(P(Y=1 | x)) = β0 + β1 x;    logit(P(Y=1 | x′)) = β0′ + β1′ x′
Intentionally Kept Blank
Categorical Variable Y as a Function of
Continuous Latent Variable Z
Example: Z = Exam
score, Y = letter grade or
grade point; covariate x
= no. of hours of study.
Proportional Odds Model for Ordinal Data (p. 162-)
• Suppose the observed variable Y = j iff αj−1 ≤ Z ≤ αj, j = 1, …, J, and the latent variable Z depends on the vector of covariates x through a linear or non-linear model. If linear, then Z = βᵀx + ε, where ε has CDF F. Then,
• γj = P(Y ≤ j) = P(Z ≤ αj) = F(αj − βᵀx), j = 1, …, J−1
• F⁻¹(γj) = αj − βᵀx
• If F(z) = 1/(1+e⁻ᶻ), the standard logistic distribution, then
F⁻¹(γj) = ln[γj/(1−γj)] = αj − βᵀx (only αj changes with j);
[γj(x1)/(1−γj(x1))] / [γj(x2)/(1−γj(x2))] = exp(−βᵀ(x1 − x2)),
i.e., the ratio of the odds of the event {Y ≤ j} between any two levels of the vector of predictors, x = x1 and x = x2, is independent of j; i.e., the odds of the cumulative response probabilities are proportional to each other (text, p. 163)
Proportional Odds Model for Ordinal Data (p. 162-)
• Z = βᵀx + ε, where ε has CDF F(z) = 1/(1+e⁻ᶻ), the standard logistic distribution
• ln[γj/(1−γj)] = αj − βᵀx (only αj changes with j);
[γj(x1)/(1−γj(x1))] / [γj(x2)/(1−γj(x2))] = exp(−βᵀ(x1 − x2)), i.e., the odds of the cumulative response probabilities are proportional to each other
• When J = 2, we have the binary logistic regression model:
F⁻¹(γ1) = ln[γ1/(1−γ1)] = α1 − βᵀx
• Since the odds of the cumulative response probability γj(x) increase with j, for the proportional odds model we have:
α1 ≤ α2 ≤ ⋯ ≤ αJ−1.
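A sketch of fitting a proportional odds model, assuming the MASS package (shipped with R) and simulated ordinal data:

library(MASS)
set.seed(1)
d <- data.frame(grade = factor(sample(c("C", "B", "A"), 200, replace = TRUE),
                               levels = c("C", "B", "A"), ordered = TRUE),
                hours = rnorm(200, 10, 3))
pfit <- polr(grade ~ hours, data = d, method = "logistic")
summary(pfit)   # one slope for hours, plus J - 1 = 2 cut-points alpha_j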
Proportional Odds Model for Ordinal Data
• Suppose the observed variable Y = j iff αj−1 ≤ Z ≤ αj, j = 1, …, J, where the latent variable Z depends on the covariates through a linear or non-linear model. If linear, then Z = βᵀx + ε, where ε has CDF F. Then,
• γj = P(Y ≤ j) = P(Z ≤ αj) = F(αj − βᵀx), so F⁻¹(γj) = αj − βᵀx
• F(z) = 1/(1+e⁻ᶻ), the standard logistic distribution CDF → logit model
• F(z) the normal CDF → probit model
• F(z) an extreme-value distribution CDF → complementary log-log model
[note: if exp(ε) has an exponential or Weibull distribution, then ε has the extreme-value distribution]
Intentionally Kept Blank
Degrees of Freedom
The degrees of freedom for the LRT = −2 log_e(Λ) = −2 log_e(L0/L) equals the difference in dimensionality of Θ and Θ0, when H0: θ ∈ Θ0 is true (H1: θ ∈ Θ0^c).
Let I = no. of distinct patterns of covariates; then the df are:
1. Independence test: {IJ − 1} − {(I−1) + (J−1)}
2. Homogeneity test: {I(J−1)} − (J−1)
3. Multinomial logistic (J = 3, say): {I(J−1)} − (J−1); (no. of model parameters under H0 = J−1; the null hypothesis of a multiple logistic regression is that there is no relationship between the X variables and the Y variable)
4. Binary logistic (J = 2): I(2−1) − 1
Degrees of freedom (contd.)
• (i) Logistic regression (read Sec 4.3.1, Text, p. 148):
df = I×(J−1) − m, where I = number of distinct patterns of the (categorical or otherwise) covariates, m = number of parameters estimated under H0.
[In product binomial or product multinomial sampling, "I" represents the distinct number of binomial or multinomial distributions we are sampling from.]
• (ii) Poisson regression (read Sec 5.2, Text, pp. 179-180):
df = (n − p), where n = number of observations, p = number of parameters estimated under H0.
• (iii) Summary: In logistic regression, the binomial or multinomial Y has finitely many category-values; count I = distinct patterns of the (categorical or otherwise) covariates. In Poisson regression and the classical linear regression model, the Poisson or normal Y assumes infinitely many values. The degrees of freedom then involves n, and not I = the number of distinct patterns of the (categorical or otherwise) covariate variables.
References
• Agresti, A. (2012). Categorical Data Analysis. Wiley Series in Probability and Statistics.
• Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1975). Discrete Multivariate Analysis. MIT Press, Cambridge.
• Christensen, R. (1990). Log-Linear Models. Springer-Verlag, New York.
• Goon, A. M., Gupta, M. K. and Dasgupta, B. Fundamentals of Statistics, Volume One. The World Press, Calcutta.
• Ramsey, F. L. and Schafer, D. W. (1997). The Statistical Sleuth. Duxbury Press, Belmont, California.
• Read, T. R. C. and Cressie, N. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.