SAHADEB - Logistic Reg - Sessions 8-10

Categorical Data Analysis

Sahadeb Sarkar
IIM Calcutta

• Slides Adapted from Prof Ayanendranath Basu’s Class-notes


• R Programs and Data Sets in Textbook (Tang, He & Tu):
http://accda.sph.tulane.edu/r.html
• Readings: Chapters 1-6, Text

1
Readings: Chapters 1-4 2
Odds & Odds Ratio:
Extension to Regression Analysis
Response Y binary and Covariate X may
be continuous instead of categorical.
Logit(x) = ln[ P(Y=1|x) / P(Y=0|x) ] = β0 + β1x;
e^β1 = odds ratio, for X = x+1 vs X = x.
H0: β1 = 0 ⟺ Logit(x) = β0 does not depend on x (or e^β1 = odds ratio = 1, for X = x and X = x+1) ⟺ the odds of Y=1 do not change with x-value.
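R check (a minimal sketch with simulated data; all object names here are illustrative): the fitted slope exponentiates to the odds ratio for a one-unit increase in x.
> set.seed(1)
> x = runif(200, 0, 10)
> y = rbinom(200, 1, plogis(-2 + 0.5*x))   # true beta1 = 0.5
> fit = glm(y ~ x, family = binomial(logit))
> exp(coef(fit)["x"])   # estimated odds ratio for X=x+1 vs X=x; close to exp(0.5) = 1.65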

3
Logistic Regression/ Binary Logit Model

• Logistic regression (“binary logit”) model extends


ideas of multiple linear regression to situation where
response variable, Y, is binary (with coded values 0 and
1)
• Explanatory variables X1, X2 · · ·, Xk may be
categorical/continuous variables.
• It estimates chances of an observation belonging to
each group.

4
Problems with Classical Linear Regression

For X=100, predicted value = 3.2121 !!!

R Codes for Logistic Regression:


> LM1 = lm(PassFail ~ HoursOfStudy)
> summary(LM1)
> glm_PassFail = glm(PassFail ~ HoursOfStudy, family = binomial(logit), data = passfail)
> summary(glm_PassFail)
Estimates: a = –6.39, b = 0.44
5
Problems with Usual Regression
Regression Errors not independent

6
Solution with Logistic Regression

Estimates: a = –6.39, b = 0.44

logistic function: Prob = 1/[1 + exp(–(–6.39 + 0.44*x))]

i.e., ln(Prob/(1–Prob)) = –6.39 + 0.44*x [= logit].
For X = 100, predicted Prob ≈ 1;
For X = 18, predicted Prob = 0.82
7
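R check of the two predictions above, using the reported estimates a = –6.39, b = 0.44:
> a = -6.39; b = 0.44
> prob = function(x) 1/(1 + exp(-(a + b*x)))   # equivalently plogis(a + b*x)
> prob(18)    # 0.82
> prob(100)   # essentially 1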
“S”-shaped Logistic Function

f(X) = L / (1 + exp(–K(X – X0))), where

L = curve’s maximum value
K = steepness of the curve
X0 = curve’s midpoint

8
Comparison to Classical Linear Regression

• Check if the logistic regression model fits well, if some of the


variables are redundant etc. However, statistical tools to be
used for this purpose are not the usual R2, t- and F-tests.
• In place of least squares estimation, maximum likelihood
estimation;
• In place of t-test, large sample Z-test due to Wald, to check
redundancy of a predictor.
• In place of F- test or adjusted R2, pseudo R2 and information
criteria such as AIC (Akaike Information Criterion) or BIC
(Bayesian Information Criterion) to figure out which model is
better.

9
Comparison to Classical Linear Regression
(contd.)
• Instead of the sum of squared errors (SSE), one calculates
negative of the log-likelihood for the fitted model.
• Interpretation of parameters 1, …, k is now done through
the log of odds called “logit”, or equivalently through odds.
For example, suppose X1 changes by one unit and values of
X2, …, Xk remain constant. Thus, 1 denotes change in logit,
equivalently, 𝑒 β1 denotes multiplicative change-factor in
odds of Y=1 when X1 changes its value from x to x+1.
[Odds(Y=1|X1=x+1) = Odds(Y=1|X1 =x)*(𝑒 β1 ), while X2, …, Xk
remain at the same “level”s/values; ]

10
Intentionally Kept Blank

11
Properties of the Logit Model

• Although Xi may vary from – to  , P is constrained


to lie between 0 and 1.

• When Xi approaches – , P approaches 0.

• When Xi approaches  , P approaches 1.

• When linear regression model is used, P is not


constrained to lie between 0 and 1.
12
Relation between Probability, Odds & Logit

Probability   Odds   Log(Odds) = Logit
0             0       NC
0.1           0.11   –2.20
0.2           0.25   –1.39
0.3           0.43   –0.85
0.4           0.67   –0.41
0.5           1.00    0.00
0.6           1.50    0.41
0.7           2.33    0.85
0.8           4.00    1.39
0.9           9.00    2.20
1             NC      NC

Note: Odds maps probability asymmetrically whereas logit does it symmetrically.

13
Estimation and Model Fit

• Estimation Method: Maximum Likelihood


• Fit: Cox & Snell R Square and Nagelkerke R Square
(similar to R2 in multiple regression).
• Cox & Snell R Square cannot equal 1.0 even if the fit is perfect; this limitation is overcome by the Nagelkerke R Square.
• Hosmer-Lemeshow test used for assessing goodness of
fit of a model (Large P-value indicates overall model fit is
good)
• Compare predicted and actual values of Y to determine
the percentage of correct predictions.

14
MLE calculation - Example 1:
The distribution of scores in a public examination is
normal with unknown mean µ and known standard
deviation 5. It is, however, known that µ equals 52,
60 or 67. If the scores of three randomly chosen
candidates turn out to be 88, 93, 59, then find the
maximum likelihood estimate of µ.

(Exercise: mle µ = ??; sample mean = ??)

15
MLE calculation - Example 1(contd.):
The distribution of scores in a public examination is normal
with unknown mean µ and known standard deviation 5. It is,
however, known that µ equals 52, 60 or 67. If the scores of
three randomly chosen candidates turn out to be 88, 93, 59,
then find the maximum likelihood estimate of µ.
(Ans. mle of µ = 67; sample mean = 80)

L(µ) = ∏_{i=1}^{n} [1/(5√(2π))] exp{ –(1/(2·5²)) Σ_{i=1}^{n} (Xi – µ)² }

log_e L(µ) = constant – Σ_{i=1}^{n} (Xi – µ)² / (2·5²)

log_e L(µ) = constant – Σ_{i=1}^{n} [(Xi – X̄) + (X̄ – µ)]² / 50

log_e L(µ) = constant – [ Σ_{i=1}^{n} (Xi – X̄)² + n(X̄ – µ)² + 2(X̄ – µ) Σ_{i=1}^{n} (Xi – X̄) ] / 50

Since Σ_{i=1}^{n} (Xi – X̄) = 0, log_e L(µ) is maximized over {52, 60, 67} by the candidate closest to X̄ = 80, namely µ = 67.
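R check (evaluating the log-likelihood at the three candidate means; the MLE is the candidate closest to the sample mean 80):
> x = c(88, 93, 59)
> loglik = sapply(c(52, 60, 67), function(mu) sum(dnorm(x, mu, 5, log = TRUE)))
> c(52, 60, 67)[which.max(loglik)]   # 67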
16
MLE calculation - Example 2:
(a) The distribution of a certain quality characteristic is continuous uniform over the range [–2θ, θ], where θ is unknown. If the values of the quality characteristic in a random sample of size five are 0.63, 0.27, –1.62, 0.71 and –1.38, then what is the maximum likelihood estimate of θ? (Ans. ??)

(b) The life X (in hours) of an electrical component is exponentially distributed with mean θ, where θ is unknown. Let p be the probability that a system, consisting of two such components arranged in series, survives for five hours or more. If five randomly chosen components X1, …, X5 are seen to survive for 7, 6, 4, 8 and 5 hours, then what is the maximum likelihood estimate of p? (Ans. ??)
17
MLE calculation - Example 2(contd.):
(a) The distribution of a certain quality characteristic is continuous uniform over the range [–2θ, θ], where θ is unknown. If the values of the quality characteristic in a random sample of size five are 0.63, 0.27, –1.62, 0.71, –1.38, then what is the maximum likelihood estimate of θ?
(Ans. 0.81 = max(–X(1)/2, X(n)), where X(1) = min and X(n) = max of X1, X2, …, Xn)

(b) The life X (in hours) of an electrical component is exponentially distributed with mean θ, where θ is unknown. Let p be the probability that a system, consisting of two such components arranged in series, survives for five hours or more. If five randomly chosen components X1, …, X5 are seen to survive for 7, 6, 4, 8 and 5 hours, then what is the maximum likelihood estimate of p?
(Ans. mle of θ = X̄ = 6; mle of p = P(both components survive 5 hours) = exp(–10/θ̂) = exp(–10/6) = 0.1888 ≈ 0.19)
18
MLE calculation - Example 3:
An electrical unit consists of three identical components and
the unit functions if and only if at least one of the
components function. The components are known to function independently, each with an exponentially distributed life. Out of
500 randomly chosen units of this type 468 were found to be
functioning for 10 hours or more. On the basis of this
information, (i) find the maximum likelihood estimate of the
probability for an individual component to function for 10
hours or more (ii) find the mle of expected life of an electrical
unit. (Ans. (i) ?? (ii) ??)

19
Example 3. MLE calculation (contd.):
An electrical unit consists of three identical components and
the unit functions if and only if at least one of the
components function. The components are known to function independently, each with an exponentially distributed life. Out of
500 randomly chosen units of this type 468 were found to be
functioning for 10 hours or more. On the basis of this
information, (i) find the maximum likelihood estimate of the
probability for an individual component to function for 10
hours or more (ii) find the mle of expected life of an electrical
unit. (Ans. (i) 0.936 ?? (ii) 151.2 hours ??)

20
Example 3. MLE calculation (contd.)
Solution: (i) Let θ = P(any component functions for 10 hours or more) = exp(–10/λ), where λ = expected life of any component.
• π = P(i-th unit functions for 10 hours or more) = 1 – P(i-th unit functions for less than 10 hours) = 1 – P(all three components of the i-th unit function for less than 10 hours) = 1 – (1 – θ)³
• Now we are given a sample of binary observations on 500 units: X1, …, X500 ~ B(1, π).
• Sum of X1, …, X500 is 468 and hence the mle of π [= 1 – (1 – θ)³] = sample proportion = 468/500 = 0.936.
• Thus the mle of θ = 0.6 [from π = 1 – (1 – θ)³ and the invariance property of the mle].
• (ii) The mle of the expected life λ of an electrical component is –10/ln(mle of θ) = –10/ln(0.6) = 19.576 hours.
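R check (MLE via the invariance property):
> pi_hat = 468/500                      # 0.936
> theta_hat = 1 - (1 - pi_hat)^(1/3)    # 0.6
> lambda_hat = -10/log(theta_hat)       # 19.576 hours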
21
Logistic Regression:
Estimating Model Parameters
• Linear Regression model uses OLS method to minimize sum
of squared errors of prediction
• Logistic Regression model maximizes “likelihood” of
observing y1,…,yn, defined by

L(π1, …, πn) = ∏_{i=1}^{n} πi^{yi} (1 – πi)^{1–yi}

By Model (single X variable) assumption,

πi = exp(β0 + β1Xi) / [1 + exp(β0 + β1Xi)] = 1 / [1 + exp(–β0 – β1Xi)];
(1 – πi) = 1 / [1 + exp(β0 + β1Xi)].

Thus, L is a function of the two parameters β0 and β1
22
Logistic Regression: Likelihood function
• Model: yi ~ Bin(1, 𝜋𝑖 ), i=1,…,n,
independent;
πi = exp(β0 + β1Xi) / [1 + exp(β0 + β1Xi)] = 1 / [1 + exp(–β0 – β1Xi)]

• Then the likelihood function L is given by

L(π1, …, πn) = (1 – π1) π2 (1 – π3) … π20 = L(β0, β1)
= [1/(1 + e^{β0+β1(2.5)})] [1/(1 + e^{–β0–β1(22.6)})] [1/(1 + e^{β0+β1(17.8)})] ⋯ [1/(1 + e^{–β0–β1(17.1)})]
23
Estimation (contd.)
L(π1, …, πn) = ∏_{i=1}^{n} πi^{yi} (1 – πi)^{1–yi}

By Model (multiple X variables) assumption,

πi = exp(β0 + β1X1i + … + βkXki) / [1 + exp(β0 + β1X1i + … + βkXki)]

To maximize L, or equivalently ln(L), one uses software (a sketch follows below).

We require initial values of β0, β1, …, βk, which can be obtained by MLR of the “empirical log odds” on X1, …, Xk:

ln[ (yi + ½) / (1 – yi + ½) ] = β0 + β1X1i + … + βkXki
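A sketch in R of this direct maximization with optim(), using the empirical-log-odds OLS fit for starting values (x and y are simulated here; all names are illustrative):
> set.seed(2)
> x = runif(50, 0, 30)
> y = rbinom(50, 1, plogis(-3 + 0.3*x))
> start = coef(lm(log((y + 0.5)/(1 - y + 0.5)) ~ x))   # initial values for beta0, beta1
> negloglik = function(b) { eta = b[1] + b[2]*x; -sum(y*eta - log(1 + exp(eta))) }
> optim(start, negloglik)$par   # close to coef(glm(y ~ x, family = binomial(logit)))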

24
Model-Fit Measures
(https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds)

When analyzing data with a logistic regression, an equivalent
statistic to R-squared does not exist. The model estimates from a
logistic regression are maximum likelihood estimates arrived at
through an iterative process. They are not calculated to minimize
variance, so the OLS approach to goodness-of-fit does not
apply. However, to evaluate the goodness-of-fit of logistic models,
several pseudo R-squareds have been developed. These are
“pseudo” R-squareds because they look like R-squared in the sense
that they are on a similar scale, ranging from 0 to 1 (though some
pseudo R-squareds never achieve 0 or 1) with higher values
indicating better model fit, but they cannot be interpreted as one
would interpret an OLS R-squared and different pseudo R-squareds
can arrive at very different values.
25
Model Fit Measures
(https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds)

“While pseudo R-squares cannot be interpreted independently or
compared across datasets, they are valid and useful in evaluating
multiple models predicting the same outcome on the same dataset.
In other words, a pseudo R-squared statistic without context has
little meaning.

A pseudo R-squared only has meaning when compared to another
pseudo R-squared of the same type, on the same data, predicting
the same outcome. In this situation, the higher pseudo R-squared
indicates which model better predicts the outcome.”

26
Model Fit Measures
(https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds)

Cox & Snell R square: R² = 1 – (L0/L)^{2/n},
where L0 = max likelihood of the intercept-only model, L = max likelihood under the specified model; the ratio of the likelihoods reflects the improvement of the specified model over the intercept-only model. If the specified model fits the data perfectly and hence L = 1, the max value of the Cox & Snell R square will be: R²_max = 1 – (L0)^{2/n} < 1.

Nagelkerke adjusted R square (due to the Dutch biostatistician Nagelkerke):

Adjusted R² = R² / R²_max = [1 – (L0/L)^{2/n}] / [1 – (L0)^{2/n}]
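A sketch in R computing both measures from a fitted binary logit 'fit' (using (L0/L)^{2/n} = exp((2/n)(log L0 – log L))):
> pseudo_r2 = function(fit) {
+   ll  = as.numeric(logLik(fit))
+   ll0 = as.numeric(logLik(update(fit, . ~ 1)))   # intercept-only model
+   n   = nobs(fit)
+   cs  = 1 - exp((2/n)*(ll0 - ll))                # Cox & Snell
+   c(CoxSnell = cs, Nagelkerke = cs/(1 - exp((2/n)*ll0)))
+ }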
27
Other Goodness of Fit Measures
• Akaike’s Information Criterion:
AIC = –2*ln(L) + 2*k,
where k = # of model parameters
• Corrected AIC (for small samples):
AICc = AIC + 2*(k(k+1))/(n–k–1)
• Bayesian Information Criterion (Schwarz's Bayesian Criterion):
BIC = –2*ln(L) + (ln(n))*k

28
Interpretation of Coefficients

• Log odds, i.e., ln(π/(1–π)), is a linear combination of β0, β1, …, βk
• If the Xi value increases by one unit with the other X-variables remaining constant, the log odds of Y=1 will change by βi units
• β0 = log odds if/when all X-variables equal zero

“Logit” = ln[πi/(1 – πi)] = β0 + β1X1i + … + βkXki
⇒ πi = exp(β0 + β1X1i + … + βkXki) / [1 + exp(β0 + β1X1i + … + βkXki)] = exp(Logit)/[1 + exp(Logit)]
⇒ πi = 1/[1 + exp(–Logit)];
⇒ (1 – πi) = 1/[1 + exp(Logit)]

29
Interpretation of Coefficients
If Xi is increased by one unit, the log odds will change by βi units, when the values of the other independent variables are held constant.

The sign of βi will determine whether the probability increases (if the sign is positive) or decreases (if the sign is negative).

Probability   Odds   Log(Odds) = Logit
0             0       NC
0.1           0.11   –2.20
0.2           0.25   –1.39
0.3           0.43   –0.85
0.4           0.67   –0.41
0.5           1.00    0.00
0.6           1.50    0.41
0.7           2.33    0.85
0.8           4.00    1.39
0.9           9.00    2.20
1             NC      NC
30
Intentionally Kept Blank

31
Prescribe “Buy/Sell/Hold” for a Security

Daily Data on
Price, Volume, MACD, RSI, Put-Call Ratio, Open
Interest, etc & “Buy/Sell/Hold prediction” (attached in
hindsight on the basis of next trading day’s closing price)

Using a logistic regression model fitted to the above dataset over a period, we can make future decisions

32
33
34
Pass Fail Data Example
(choice of cut-off value matters)

% Correct
classification
depends on the
cutoff prob. value

35
Pass Fail Data Example
(choice of cut-off value matters)

% Correct
classification
depends on the
cutoff prob. value

36
Pass Fail Data Example
http://127.0.0.1:57322/help/index.jsp?topic=/com.ibm.spss.statistics.help/idh_lreg_sav.htm

Hosmer Lemeshow test is obtained by applying a chi-square
test on a 2×g contingency table, constructed by cross-
classifying the binary response with a grouping variable
(with g groups) in which groups are formed by partitioning the
predicted probabilities using the percentiles of the predicted
event probability. In the calculation, usually 10 groups are used
(g=10). The corresponding groups are often referred to as the
“deciles of risk” (Hosmer and Lemeshow, 2000)
37
Pass Fail Data Example (contd.)

38
Intentionally Kept Blank

39
Example: Insurance Requirement

• 100 respondents, [2-level response] 63 yes, 37 no


• Predictors are Age, Income, dependent (whether
have dependents or not, binary).
• Want to estimate the chances of a prospective customer purchasing insurance.

40
Data (2-level Response)

41
SPSS Windows: Logit Analysis
1. Select ANALYZE from the SPSS menu bar.

2. Click REGRESSION and then BINARY LOGISTIC.

3. Move “Willing” in to the DEPENDENT VARIABLE box.

4. Move “Age,” “Dependent,” and “Income” into the COVARIATES box.

5. Select ENTER for METHOD (default option)

6. Click OK.

42
Recovering Probability () from Logit

logit = ln[π/(1 – π)] = β0 + β1X1 + ⋯ + βpXp

⇒ π = exp(logit)/[1 + exp(logit)] = 1/[1 + exp(–logit)]
⇒ (1 – π) = 1/[1 + exp(logit)]

43
Output (2-level response)

Given age, dependent & income of a person, compute:
logit = –50.326 + 1.077*age + 13.778*(if has dependents) + 0.000*income;
& chances of insurance purchase = 1/[1+exp(–logit)]
R Codes:
> glm_insurance = glm(nWilling ~ Age+Dependent+Income, family = binomial(logit), data = insurance)
> summary(glm_insurance)
44
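Predicted purchase probabilities for new prospects (the covariate values below are hypothetical):
> new = data.frame(Age = c(45, 30), Dependent = c(1, 0), Income = c(60000, 40000))
> predict(glm_insurance, newdata = new, type = "response")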
Output (2-level response)

45
Output (2-level response)
https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds

Deviance Test; Hosmer-Lemeshow Test, p.151-152

http://127.0.0.1:57322/help/index.jsp?topic=/com.ibm.spss.statistics.help/idh_lreg_sav.htm
46
Output (2-level response; w/o Income)

Given age & dependent status of a person, compute:
logit = –50.190 + 1.083*age + 13.782*(if has dependents);
& chances of insurance purchase = 1/[1+exp(–logit)]
> glm_insurance = glm(nWilling ~ Age+Dependent, family = binomial(logit), data = insurance)
> summary(glm_insurance)
47
Output (2-level response; w/o Income)

48
Output (2-level response; w/o Income)

49
Classification Table
(or Confusion Matrix or Error Table)
• When there are two categories Y=1 and Y=2, this will produce a
table of counts for classification or misclassification

• when Y has 3 categories one will have a 3x3 table of counts for
classification or misclassification

• It shows how accurately a model classifies for a chosen classification cut-off probability.
50
Classification Table

Proportion of correct classification for the subjects in the sample is given by

(n11 + n22 + n33) / (n11 + n12 + n13 + n21 + n22 + n23 + n31 + n32 + n33)

Sample size n = n11 + n12 + n13 + n21 + n22 + n23 + n31 + n32 + n33

51
Receiver Operating Characteristic
(ROC) Curve
• Sensitivity is the true positive rate (TPR) since it is the rate of
correctly classifying an observed “yes” as “yes” (true positive)
by the model for a fixed cut-off or threshold value.
• Similarly, (1-specificity) is known as the false positive rate
(FPR) as it represents the rate of incorrectly classifying an
observed “no” as “yes” (false positive) by the model for the
same cut-off or threshold value.
• Thus, the ROC plot, of sensitivity versus (1–specificity), can be thought of as a plot of the Power versus Type I Error of the decision (classification) rule based on the chosen cut-off or threshold value using the model predicted probabilities.

• [The method was developed for operators of military radar receivers, which is why it is so named.]

52
Receiver Operating Characteristic (ROC) Curve

For a good model, the proportion of yes as yes (TPR) should be much higher than the proportion of no as yes (FPR).

• Let x = FPR and y = TPR. Then ROC points (x,y) represent relative
trade-off between true positive rate (beneficial) and false positive
rate (costly). The best possible scenario for prediction would be to
have the point (0,1) in the upper left corner of the ROC space, as it
represents 100% sensitivity (no false negatives) and 100% specificity
(no false positives).
• The (0,1) point is also known as perfect classification. 53
Receiver Operating Characteristic (ROC) Curve

Random Classifier: A prediction method based on random guessing: flip a coin with P(Heads) = p ∈ [0,1], and classify a subject into the “yes” category if “Heads” falls and the “no” category if “Tails” appears.
• For such a method, (x = (1–specificity), y = sensitivity) = (p, p) lies along the diagonal line from (0,0) to (1,1). Points above (below) the diagonal line in the ROC space represent classification results better (worse) than random guessing. For a good model, the proportion of yes as yes (TPR) should be much higher than the proportion of no as yes (FPR).

• Thus, one desires to see the ROC curve as far above the diagonal as possible. A binary logistic regression model producing the largest area under the ROC curve (AUC), traced out over the various choices of cut-off value p ∈ [0,1], may be declared the best model.

54
Receiver Operating Characteristic (ROC) Curve

For a good model, the proportion of yes as yes (TPR) should be much higher than the proportion of no as yes (FPR).

• The (0,1) point is also known as perfect classification.
• Thumb Rule (for AUC): around 0.7 = fair discrimination, around 0.8 = good discrimination, around 0.9 = very good discrimination, much more than 0.9 = excellent discrimination.
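One way to obtain the ROC curve and AUC in R (a sketch assuming the pROC package is installed; 'y' = observed 0/1 response, 'fit' = fitted binary logit):
> library(pROC)
> roc_obj = roc(response = y, predictor = fitted(fit))
> auc(roc_obj)     # area under the ROC curve
> plot(roc_obj)    # ROC curve (pROC plots sensitivity against specificity)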

55
Pearson Chi-Square & the Deviance Tests
Pearson Chi-Square test is mainly used for categorical explanatory variables. For a sample of n subjects with binary response and a number of covariates, using a fitted binary logit model one can construct an I×2 table of observed counts, with rows consisting of the I distinct patterns of the covariates in the data and columns representing the two levels of the binary response. For each of these I patterns construct model-based expected counts (Eij) and compare them with the observed counts (nij), i = 1, …, I, j = 1, 2.

56
Pass-fail data

Rows consisting of I distinct patterns of the covariates in the data.
Here, I = 20 or 19?

1.9, 2.5, 2.7, 4.5, 5.4, 10.1, 13.3, 13.6, 14, 14.5, 16.3, 17.1, 17.8, 22.6, 22.6, 26, 27, 31.4, 33.1, 45.3

57
Pearson Chi-Square Tests
• Then the Pearson Chi-Square test statistic is defined by

PCS = Σ_{i=1}^{I} Σ_{j=1}^{J} (nij – Eij)² / Eij

• If each of the I×2 cells has 5 or more observations nij, then under the null hypothesis of correct model, PCS has an asymptotic chi-square distribution with degrees of freedom (df) = [I(J–1) – m], where m = number of (independent) model parameters being estimated. If J = 2, then the PCS has df = (I – m).

58
Deviance test statistic
Deviance test statistic (denoted by G²) also compares the observed to the expected counts, and is defined by

G² = 2 Σ_{i=1}^{I} Σ_{j=1}^{J} nij ln(nij / Eij).

It also has the same df = [I(J–1) – m], where m = number of (independent) model parameters being estimated. The deviance test is the likelihood ratio test, LRT = –2log_e(Λ) = –2log_e(L0/L), when H0: θ ∈ Θ0 is true (H1: θ ∈ Θ0^c).

Example: One may test the null hypothesis that the conditional (i.e., given xi) distributions of the responses {Yi, i = 1, …, I} are independent binomial(1, π(xi)), where

π(xi) = e^{β0 + β1x1i + ⋯ + βkxki} / (1 + e^{β0 + β1x1i + ⋯ + βkxki}),

against the saturated model, which does not assume {π(xi), i = 1, …, n} to be related in any manner. In this case, J = 2 and m = k + 1.

59
PCS & LRT statistics as members of the Power Divergence Family for multinomial goodness-of-fit tests

2nI^λ = [2/(λ(λ+1))] Σ_{i=1}^{k} Oi [ (Oi/Ei)^λ – 1 ], λ ∈ ℝ,

where Oi = Xi is the observed frequency for the i-th category in a sample of size n from a multinomial(1, (π1, …, πk)) distribution.

Note: [2/(λ(λ+1))] Oi [ (Oi/Ei)^λ – 1 ] denotes the deviance (or power divergence) between the observed and expected counts/frequencies for the i-th category.

Check that: PCS test statistic = 2nI^λ for λ = 1;
LRT statistic = 2 Σ_{i=1}^{k} Oi ln(Oi/Ei) = lim_{λ→0} 2nI^λ

60
Pass-fail data

Rows consisting of I distinct patterns of the covariates in the data.
Here, I = 19 (?)

1.9, 2.5, 2.7, 4.5, 5.4, 10.1, 13.3, 13.6, 14, 14.5, 16.3, 17.1, 17.8, 22.6, 22.6, 26, 27, 31.4, 33.1, 45.3

61
Hosmer Lemeshow Test: Binary case
Pearson Chi-square test is not appropriate when there are
continuous (i.e., non-categorical) explanatory variables if they are
not grouped. In this case for each “covariate” pattern X=x, there
will usually be one observation, and hence the asymptotic or
approximate chi-square distribution for the PCS or the Deviance
test is not valid.
This problem is also faced if the number of categorical binary predictors k is large. For example, if k is 6, one may have as many as I = 2^k = 64 distinct combinations or patterns of covariates.
To solve this problem of large I, Hosmer and Lemeshow (1980) order the n subjects according to the fitted probabilities π̂(x) = Pr(Y=“Yes”|x) and then divide the n subjects into g groups of comparable sizes, where g is a small number, often chosen to be 10.
62
Hosmer Lemeshow Test: Binary case

nij = number of obs in the j-th cell of the i-th group; Ei1 = sum of the fitted probabilities π̂(x) = Pr(Y=“Yes”|x) of the subjects in the i-th group, and Ei2 = ni+ – Ei1. The PCS-type test statistic is constructed as

χ²_HL = Σ_{i=1}^{g} Σ_{j=1}^{2} (nij – Eij)² / Eij

In simulation studies, it has been seen that χ²_HL has approximately a chi-square distribution with df = (g – 2).
63
63
Hosmer Lemeshow Test:
Multinomial Response Case
• Fagerland and Hosmer (2012) generalized the Hosmer-Lemeshow test to multinomial logistic regression models. The groups are formed in a similar manner as in the binary case, and the table of observed and expected counts for the J categories (namely, 0, 1, …, J–1) takes the form:

64
Exercise
• Fagerland, Hosmer and Bofin (2008) showed that χ²_FH has an asymptotic chi-square distribution with df = (g–2)(J–1). Various statistical software readily provide the χ²_HL test in the binary case, but do not usually produce the χ²_FH test in the multinomial case. One may compute the latter by writing program code (in R etc.).

65
AIC, BIC
The Akaike Information criterion
AIC = 2m – 2loge(estimated maximum likelihood value under a
model)
where m = number of parameters being estimated in the model.
The value 2m is used to penalize a model with higher number of
parameters, to reward a parsimonious model. Given a number of
candidate models for the data, the preferred model is the one with the minimum AIC value. This enables us to compare two models, nested or not.

66
AIC, BIC (contd)
AIC is not consistent, in that its probability of selecting the true model among the candidates does not tend to one as the sample size increases to infinity.

One popular alternative that corrects this problem is the Bayesian Information Criterion (BIC), defined by Schwarz (1978):
BIC = [log_e(n)]m – 2log_e(estimated maximum likelihood value under a model)
Usually, AIC and BIC agree on the choice of the best model.

67
Intentionally Kept Blank

68
Example: Insurance Requirement

• 100 respondents, [3-level response] 42 yes, 28 no, 30


maybe
• Predictors are Age, Income, dependent (whether have
dependents or not, binary).
• Want to estimate the chances of a prospective customer purchasing insurance.

69
Multinomial Logit Model
• Logit j = ln(πj/πm) = β0j + β1jX1 + … + βkjXk, j = 1, …, (m–1)

i.e., for j = 1, …, (m–1),

πj = exp(β0j + β1jX1 + … + βkjXk) / [1 + Σ_{j=1}^{m–1} exp(β0j + β1jX1 + … + βkjXk)];

i.e., πj = exp(Logit j) / [1 + Σ_{j=1}^{m–1} exp(Logit j)];

and πm = 1 / [1 + Σ_{j=1}^{m–1} exp(Logit j)].

70
Multinomial Logit Model
Logit j: ln(πj/πm) = β0j + β1jX1 + … + βkjXk, j = 1, …, (m–1)

Insurance Example (3-level response): m = 3, k = 3
• Need two logistic regression equations to model P(“may be”)/P(“yes”) = π1/π3 and P(“no”)/P(“yes”) = π2/π3;

Logit 1: ln(π1/π3) = β01 + β11X1 + β21X2 + β31X3,
Logit 2: ln(π2/π3) = β02 + β12X1 + β22X2 + β32X3,

• 1 = π1 + π2 + π3 = π3·exp(Logit 1) + π3·exp(Logit 2) + π3;
Hence, check:
π3 = 1 / [1 + exp(Logit 1) + exp(Logit 2)];
π1 = exp(Logit 1) / [1 + exp(Logit 1) + exp(Logit 2)];
π2 = exp(Logit 2) / [1 + exp(Logit 1) + exp(Logit 2)]
71
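A sketch of fitting this 3-level multinomial logit in R (assumes the nnet package; multinom() uses the first level of the factor response as the reference category, so relevel() it to “yes” to match the slides):
> library(nnet)
> fit3 = multinom(Resp3level ~ Age + Dependent + Income, data = insurance)
> summary(fit3)   # one coefficient row per non-reference logit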
Insurance Data (3-level Response)

Likelihood function L =
π3(obs 1)·π2(obs 2)·π3(obs 3)·π1(obs 4)·π3(obs 5)·π2(obs 6)·…·π2(obs 100)

P(may be) = π1; P(No) = π2; P(yes) = π3;
π3 = 1 / [1 + exp(Logit 1) + exp(Logit 2)];
π1 = exp(Logit 1) / [1 + exp(Logit 1) + exp(Logit 2)];
π2 = exp(Logit 2) / [1 + exp(Logit 1) + exp(Logit 2)];

For observation 1: Logit 1 = β01 + β11(44) + β21(1) + β31(54261); Logit 2 = β02 + β12(44) + β22(1) + β32(54261);
For observation 2: Logit 1 = β01 + β11(35) + β21(0) + β31(77313); Logit 2 = β02 + β12(35) + β22(0) + β32(77313).
72
SPSS Windows: Logit Analysis
1. Select ANALYZE from the SPSS menu bar.
2. Click REGRESSION and then MULTINOMIAL LOGISTIC.
3. Move “Resp3level” in to the DEPENDENT VARIABLE box.
4. Move “Age,” “Dependent,” and “Income,” in to the
COVARIATES box.)
5. Under STATISTICS: under MODEL select everything except Monotonicity measures; under PARAMETERS select Estimates, Likelihood ratio tests; & then ENTER
6. Click OK.

73
Output (3-level response)

Logit1 = 63.913 – 29.447*dependent – 0.877*age + 0.000*income
Logit2 = 132.935 – 50.587*dependent – 2.403*age + 0.000*income
P(“Yes”) = π3 = 1/(1 + exp(Logit1) + exp(Logit2))
P(“May be”) = π1 = exp(Logit1)/(1 + exp(Logit1) + exp(Logit2))
P(“No”) = π2 = exp(Logit2)/(1 + exp(Logit1) + exp(Logit2))
74
Output (3-level response)

75
Output (3-level response)

df = [I(J1)  m],
df = [I(J1)  m], where m = no. of
(independent) model parameters
being estimated, I = no. of distinct
value-combinations of covariates, J =
no. of response categories.

76
Output (3-level response; w/o Income)

Logit1 = 63.637 – 29.573*dependent – 0.878*age
Logit2 = 128.431 – 49.638*dependent – 2.325*age
P(“Yes”) = π3 = 1/(1 + exp(Logit1) + exp(Logit2))
P(“May be”) = π1 = exp(Logit1)/(1 + exp(Logit1) + exp(Logit2))
P(“No”) = π2 = exp(Logit2)/(1 + exp(Logit1) + exp(Logit2))
77
78
Output (3-level response; w/o Income)

79
Output (3-level response; w/o Income)

df = [I(J1)  m], where m = no. of


(independent) model parameters
being estimated, I = no. of distinct
value-combinations of covariates, J =
no. of response categories.

80
Bankruptcy Example

Source: Book by Johnson & Wichern

81
Bankruptcy Example
(Applied Multivariate Statistical Analysis by Johnson & Wichern)

• Annual financial data collected for 16 (currently) bankrupt firms about 2 years prior to their bankruptcy, and for 20 (currently) financially sound firms at about the same time. [0: Bankrupt Firms; 1: Non-Bankrupt Firms]
• Four explanatory variables
X1 (CFTD)= cash flow/total debt,
X2 (NITA) = net income/total assets
X3 (CACL) = current assets/current liabilities,
X4 (CANS) = current assets/net sales

82
Seq#  CF/TD  NI/TD  CA/CL  CA/NS  Bankrupt    (0 = bankrupt; 1 = sound)

1 -0.45 -0.41 1.09 0.45 0
2 -0.56 -0.31 1.51 0.16 0
3 0.06 0.02 1.01 0.4 0
4 -0.07 -0.09 1.45 0.26 0
5 -0.1 -0.09 1.56 0.67 0
6 -0.14 -0.07 0.71 0.28 0
7 0.04 0.01 1.5 0.71 0
8 -0.06 -0.06 1.37 0.4 0
9 0.07 -0.01 1.37 0.34 0
10 -0.13 -0.14 1.42 0.44 0
11 -0.23 -0.3 0.33 0.18 0
12 0.07 0.02 1.31 0.25 0
13 0.01 0 2.15 0.7 0
14 -0.28 -0.23 1.19 0.66 0
15 0.15 0.05 1.88 0.27 0
16 0.37 0.11 1.99 0.38 0
83
Seq#  CF/TD  NI/TD  CA/CL  CA/NS  Bankrupt
17 0.51 0.1 2.49 0.54 1
18 0.08 0.02 2.01 0.53 1
19 0.38 0.11 3.27 0.35 1
20 0.19 0.05 2.25 0.33 1
21 0.32 0.07 4.24 0.63 1
22 0.31 0.05 4.45 0.69 1
23 0.12 0.05 2.52 0.69 1
24 -0.02 0.02 2.05 0.35 1
25 0.22 0.08 2.35 0.4 1
26 0.17 0.07 1.8 0.52 1
27 0.15 0.05 2.17 0.55 1
28 -0.1 -0.01 2.5 0.58 1
29 0.14 -0.03 0.46 0.26 1
30 0.14 0.07 2.61 0.52 1
31 0.15 0.06 2.23 0.56 1
32 0.16 0.05 2.31 0.2 1
33 0.29 0.06 1.84 0.38 1
34 0.54 0.11 2.33 0.48 1
35 -0.33 -0.09 3.01 0.47 1
36 0.48 0.09 1.24 0.18 1
84
SPSS Windows: Logit Analysis
1. Select ANALYZE from the SPSS menu bar.

2. Click REGRESSION and then BINARY LOGISTIC.

3. Move “Bankrupt” in to the DEPENDENT VARIABLE box.

4. Move “CFTD”, “NITD”, “CACL”, & “CANS” into the COVARIATES box.

5. Select ENTER for METHOD (default option)

6. Click OK.

85
Regressors: CFTD, CACL, NITD, CANS

86
Regressors: CFTD, CACL, NITD, CANS

87
Regressors: CFTD, CACL

Given the “CFTD” & “CACL” values of a firm, compute:
Logit = –4.863 + 5.654*CFTD + 2.550*CACL;
& chances of being “sound” = 1/[1 + exp(–Logit)]

88
Regressors: CFTD, CACL

89
Intentionally Kept Blank

90
Example: Survival of the Donner Party

• In 1846, the Donner and Reed families left Springfield, Illinois, for California by covered wagon. In July the Donner party reached Fort Bridger, Wyoming. There its leaders decided to attempt a new and untested route to the Sacramento valley. Having reached its full size of 87 people and 20 wagons (with 45 individuals of age ≥ 15), the party was delayed by a difficult crossing of the Wasatch range …
• (https://en.wikipedia.org/wiki/Donner_Party)
• http://www.stat.ufl.edu/~winner/data/donner.txt (full dataset, 87 obs)
• https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson07/donner.txt (only 45 obs)

91
Survival of the Donner Party (contd.)
…. and again in the crossing of the desert west of the
great Salt Lake. The group became stranded in the
eastern Sierra Nevada mountains when the region was hit
by heavy snow in October. By the time the last survivor
was rescued on April 21, 1847, 40 of the 87 members had died of famine, hunger and exposure to extreme cold.

92
Donner Party Data (age above 15yrs)
[sex=1 for males]

93
Route of the Donner Party

94
The Logistic Regression Model
• The response variable in this case study is binary; it takes values 0 and 1. The Donner Party study involves two explanatory variables, one categorical and one numerical. Logistic regression is the appropriate tool in such a situation.

95
Example: Donner Party
GLM code in R
> glm_survive = glm(survive ~ age+sex, family = binomial(logit),
data = donner)
> summary(glm_survive)
• Coefficients:
(Intercept)      age      sex
    3.23041  –0.0782  –1.5973
• Degrees of Freedom: 44 ‘Total’(i.e. Null); 42 Residual (Model)

Residual Deviance = 51.26, DF= 42 (=45 - 3), AIC=57.26, P-value (HL test)
= 0.19, AUC=0.73, Correct Classification %= 77.8, Nagelkerke R2 = 0.28
Null deviance: For the model that includes only the intercept
Residual Deviance: For the model with explanatory variables included
96
Example: Donner Party
• The fit of the logistic regression model to the Donner
Party data (with 45 obs with age ≥15), where π
represents the survival probability gives
• logit(πi) = ln(πi/(1 – πi)) = 3.23 – 0.078 agei – 1.60 sexi, where “sex” is an indicator variable for males.
• Interpretation:
–1.60 = ln(odds ratio) for a male compared to a female of the same age
–0.078 = ln(odds ratio) for a one-year-older person of the same gender

97
Odds from Logistic Regression
ω = Odds of Y being 1 (survival) = P(Y=1)/P(Y=0)

In Logistic Regression:
ω = exp(β0 + β1X1 + … + βkXk)

The ratio of the odds (ωa) at Xi = a, relative to that (ωb) at Xi = b, for fixed values of the other X’s, is
ωa/ωb = exp(βi(a – b))

98
Interpretation of Coefficients (Donner Party)

• Comparing women 50 years old (a = 50) with women 20 years old (b = 20), the odds ratio is estimated as
exp[–0.078(50 – 20)] = 0.096, or about 1/10.
So the odds of survival for a 20-year-old woman are about 10 times the odds of survival for a 50-year-old woman.

• Comparing a woman (sex = 0 = a) with a man (sex = 1 = b) of the same age, the estimated odds ratio is
exp(–1.60 × [0 – 1]) = exp(1.60) = 4.95,
that is, a woman’s odds of survival were about 5 times the odds of survival of a man of the same age.
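R check of the two odds-ratio calculations:
> exp(-0.078*(50 - 20))   # 0.096: 50-year-old vs 20-year-old woman
> exp(-1.597*(0 - 1))     # 4.95: woman vs man of the same age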

99
Testing Presence of Interaction (Donner data)
Wald’s test for the hypothesis that the coefficient of the interaction term is zero, in the logistic regression of survival (1 or 0) on age, sex (= 1 for males), and the interaction term age × sex:
Donner Party data (n = 45)

Variable    Coefficient   Standard Error   Z-Statistic
Constant        7.246          3.205           2.261
age            -0.194          0.087          -2.220
sex            -6.928          3.399          -2.038
age × sex       0.162          0.094           1.714

Residual Deviance = 47.34, DF = 41 (= 45 – 4), AIC = 55.35, P-value (HL test) = 0.65, AUC = 0.762, Correct Classification % = 73.3, Nagelkerke R² = 0.368
From the normal distribution, two-sided p-value = 2 × Pr(Z > 1.714) = 0.087
Conclusion: There is suggestive but inconclusive(?) evidence of an interaction.
100
Example (Confidence Interval for a Coefficient)
Confidence intervals for the odds of survival for females divided by the odds of survival for males, accounting for age, from the model without interaction:

Variable       Coefficient   Standard Error   Z-Statistic
Constant           3.230         1.387            2.33
age               -0.078         0.037           -2.10
Sex (male=1)      -1.597         0.755           -2.11

Deviance = 51.256, Degrees of freedom = 42, Z.025 = 1.96

95% Confidence Interval for the coefficient of Sex: –1.597 ± 1.96×0.755 = –3.077 to –0.117

Take anti-logarithms of the endpoints to get an interval for the odds ratio: 0.046 to 0.889

Conclusion: The odds of survival for males are estimated to have been exp(–1.597) = 0.203, i.e., one-fifth of the odds of survival for females of similar age.
101
101
Generalized Linear Model (GLM)

102
Generalized Linear Model (GLM)
• A generalized linear model is one where some function g of the mean μ of the response variable is related to the explanatory variables through a linear regression equation:
g(μ) = β0 + β1X1 + ⋯ + βkXk

• The function g is the link function, and g(μ) = μ leads to the classical multiple linear regression
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd edition, London, Chapman and Hall.

103
Generalized Linear Model (GLM)
A GLM has three components:
• A random component identifies the response variable Y and
its probability distribution.
• A systematic component which specifies the explanatory
variables used in a linear predictor function.
• A link function which specifies the function of E(Y) that the
model equates to the systematic component.
Example (Classical linear regression model, quantitative Y):
• Y ~ N(μ, σ²), E(Y) = μ;
• X1, …, Xk; linear predictor β0 + β1X1 + ⋯ + βkXk;
• g(μ) = μ

104
Generalized Linear Model (GLM)
Example (Classical linear regression model, quantitative Y):
• Y ~ N(μ, σ²), E(Y) = μ;
• X1, …, Xk; linear predictor β0 + β1X1 + ⋯ + βkXk;
• g(μ) = μ [identity link function]
Example (Binary logistic regression model, qualitative Y):
• Y ~ Bin(1, p), E(Y) = p;
• X1, …, Xk; linear predictor β0 + β1X1 + ⋯ + βkXk;
• g(p) = ln(p/(1 – p)) [logit link function]
Example (Poisson regression model, count variable Y):
• Y ~ Poi(λ), E(Y) = λ;
• X1, …, Xk; linear predictor β0 + β1X1 + ⋯ + βkXk;
• g(λ) = ln(λ) [log link function]
105
Generalized Linear Model (GLM)
• A GLM has three components:
• A random component identifies the response variable Y and
its probability distribution.
• A systematic component which specifies the explanatory
variables used in a linear predictor function.
• A link function which specifies the function of E(Y)=µ that the
model equates to the systematic component.
Exponential Family of Distributions with Vector of Parameters θ
The natural exponential family has probability density function or probability mass function given in the form

f(x|θ) = h(x) exp{ η(θ)ᵀT(x) – A(θ) } = h(x) g(θ) exp{ η(θ)ᵀT(x) }

η(θ) is called the natural parameter of the exponential family.

Examples: Exponential, chi-squared, gamma, normal, log-normal, beta, Bernoulli, Poisson, geometric distributions

107
Exponential Family of Distributions with Vector of Parameters θ
The natural exponential family has probability density function or probability mass function given in the form

f(y|θ) = h(y) exp{ η(θ)ᵀT(y) – A(θ) } = h(y) g(θ) exp{ η(θ)ᵀT(y) }

η(θ) is called the natural parameter of the exponential family.

For Bin(1, π): θ = π and η(π) = ln(π/(1 – π));
η(π) = ln(π/(1 – π)) is the logit link function.

For Poisson(λ): θ = λ; η(λ) = ln(λ) is the log link function.

108
Exponential Family of Distributions with Vector of Parameters θ

For Y ~ N(μ, σ²), σ² known:

f(y, μ) = [1/(σ√(2π))] e^{–(y–μ)²/(2σ²)} = [1/(σ√(2π))] e^{–y²/(2σ²)} e^{–μ²/(2σ²)} e^{yμ/σ²}
= ( [1/(σ√(2π))] e^{–y²/(2σ²)} ) ( e^{–μ²/(2σ²)} ) ( e^{(y/σ²)μ} )

Here the identity link, η(μ) = μ, is the canonical link.


Exponential Family of Distributions with Vector of Parameters θ
For Y ~ N(μ, σ²), σ² unknown:

f(y, μ, σ²) = [1/(σ√(2π))] e^{–(y–μ)²/(2σ²)} = [1/(σ√(2π))] e^{–y²/(2σ²)} e^{–μ²/(2σ²)} e^{yμ/σ²}
= [1/√(2π)] · [ (1/σ) e^{–μ²/(2σ²)} ] · e^{ y(μ/σ²) + y²(–1/(2σ²)) }

Here the canonical (natural) parameter is the vector (μ/σ², –1/(2σ²)).
Exponential Family of Distributions
For Y ~ Bin(1, π):

f(y, π) = π^y (1 – π)^{1–y}
        = (1 – π) [π/(1 – π)]^y
        = (1 – π) exp{ y ln(π/(1 – π)) }
Here, the logit link is the canonical link. The link function that
transforms the mean to the natural parameter is called the
canonical link.
Exponential Family of Distributions
For Y ~ Poi(λ):

f(y, λ) = e^{–λ} λ^y / y!
        = e^{–λ} (1/y!) exp{ y ln(λ) }
Here the log link is the canonical link. The link function that
transforms the mean to the natural parameter is called the
canonical link.
Log-linear Regression for Poisson
Counts

113
Log-linear Regression for Poisson Counts

Integer counts differ from binomial counts in that they represent the occurrences of events over time or space without a definite upper bound.

The Poisson probability distribution is often useful for describing the population distribution of this type of counts.

The Poisson log-linear regression model is a popular generalized linear model, like the binomial logistic model.
114
Interpretation of the Poisson Regression
Coefficients
With one variable, our model is:

log E(Y) = α + βx

The log of the expected value of Y is a linear function of the explanatory variable; the expected value of Y is a multiplicative function of x:

E(Y) = exp(α + βx) = exp(α) exp(βx)

115
Interpretation of the Regression Coefficient
Now suppose the value of x is one unit larger, i.e., (x+1). Then the mean of the response becomes:

E(Y | X = x+1) = exp(α + β(x+1)) = exp(β)[exp(α) exp(βx)] = exp(β) E(Y | X = x)

Thus an increase of one unit in the value of the explanatory variable leads to a multiplicative change factor of exp(β) in the mean.

116
Age and Mating Success of Male Elephants [Ramsey and Schafer]
(each pair below is: age, number of matings)
27 0 33 3 39 1
28 1 33 3 41 3
28 1 33 3 42 4
28 1 33 2 43 0
28 3 34 1 43 2
29 0 34 1 43 3
29 0 34 2 43 4
29 0 34 3 43 9
29 2 36 5 44 3
29 2 36 6 45 5
29 2 37 1 47 7
30 1 37 1 48 2
32 2 37 6 52 9
33 4 38 2 117
Age and Mating Success of Male Elephants

• What is the relationship between mating success and


age?

• Do elephants have diminished success after reaching


some optimal age?

118
Fitting the Poisson Log-Linear Model
• The likelihood has the form
L = ∏_{i=1}^{n} e^{–μi} μi^{yi} / yi!,
where ln(μi) = β0 + β1xi, or ln(μi) = β0 + β1xi + β2xi²

• Iterative solution of the model parameters is needed.
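A sketch of the two fits in R; glm() performs the iterative (IRLS) maximization. ('elephants', with columns 'age' and 'matings', is an assumed data frame built from the table above.)
> fit1 = glm(matings ~ age,            family = poisson(link = "log"), data = elephants)
> fit2 = glm(matings ~ age + I(age^2), family = poisson(link = "log"), data = elephants)
> summary(fit2)   # the quadratic term turns out insignificant (next slide)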

119
Fitted Log-Linear Regression Model
Variable   Coefficient   Standard Error   T-statistic   Two-sided p-value
Constant     -2.857         3.036            0.941         0.3467
Age           0.136         0.158            0.861         0.3894
Age^2        -0.00086       0.00201          0.427         0.6692

The quadratic term is insignificant.

There is no evidence of a peak for the current data set with the given range of age values.
120
Fitted Log-Linear Regression Model

Variable   Coefficient   Standard Error   T-statistic   Two-sided p-value
Constant     -1.582         0.545            2.905         0.0037
Age           0.0687        0.0138           4.996        <0.0001

Both coefficients are highly significant.
121
Estimated Means and Variances

• For an elephant aged 25 years:
log(Mean) = –1.582 + 0.0687×25 = 0.1355
⇒ Mean = exp(0.1355) = 1.145
Thus, the mean number of matings for a 25-year-old elephant is 1.145.
From the model, so is the variance in the number of matings for a 25-year-old elephant.

122
Estimated Means and Variances

• For an elephant aged 45 years, the mean number of matings is 4.53. So is the variance of the number of matings.
• Notice that the variance of response changes with
the level of the explanatory variable (the structure of
the model leads to heteroscedasticity, unlike in
classical linear regression model).

123
Goodness of Fit
(Regression for Count Data)
• PCS = Σ_{i=1}^{n} (yi – μ̂i)² / μ̂i ~ χ²_{n–p} asymptotically, if the {μ̂i, i = 1, …, n} are “large”;
p = number of parameters in the Poisson regression model
• Deviance = 2 Σ_{i=1}^{n} [ yi ln(yi/μ̂i) – (yi – μ̂i) ] ~ χ²_{n–p}
• If PCS/(n-p) and Deviance/(n-p) are much larger than
1, the model is suspect.
• Exercise: Example 5.1, 5.3 (pp. 176, 181, Text)
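R check for a fitted Poisson regression 'fit':
> pcs = sum(residuals(fit, type = "pearson")^2)   # Pearson chi-square
> dev = deviance(fit)                             # deviance G^2
> c(pcs, dev)/df.residual(fit)                    # values much larger than 1 make the model suspect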

124
Elephant Data: Goodness of Fit
(n=41, df1=39, df2=38)

125
Australian AIDS Data Example

Source: Internet
Correction: In the table above, read “month” as “quarter”

126
Australian AIDS Data (Contd.)

127
Australian AIDS Data (Contd.)
• Fitted Regression Model:
log(μ)=0.2596+0.2565x

• For a one-quarter increase (per the correction above), the estimated count increases by a multiplicative change factor of exp(0.2565) = 1.292

128
Goodness of Fit

129
Intentionally Kept Blank

130
Intentionally Kept Blank

131
Retrospective studies
In the logistic regression model, probabilities of binary responses are modeled prospectively as functions of explanatory variables. In studies in which the probability of a “yes” response (e.g. cancer) is very small, independent samples are drawn retrospectively. But prospective probabilities cannot be estimated from such retrospective sampling. However, since the odds ratio is invariant to the study design, it can be estimated from a retrospective study using the logistic regression model.
In a logistic regression model for a retrospective study, the
estimated intercept is not an estimate of the prospective
intercept.

132
Logistic Regression with Retrospective Studies
(Agresti, p. 168, 3rd ed)
Let Z = 1 if a subject is sampled; let Z = 0 otherwise.
Let ρ1 = P(Z=1|y=1) = prob of sampling a “case” [= P(Z=1|y=1, x), assumed]
Let ρ0 = P(Z=1|y=0) = prob of sampling a “control” [= P(Z=1|y=0, x), assumed]
We have not sampled from the conditional distribution of the response Y given X = x.

P(Y=1|Z=1, x) = P(Z=1|Y=1, x) P(Y=1|x) / Σ_{j=0}^{1} P(Z=1|Y=j, x) P(Y=j|x)
= ρ1 P(Y=1|x) / [ρ1 P(Y=1|x) + ρ0 P(Y=0|x)]
= ρ1 e^{α+βx} / [ρ1 e^{α+βx} + ρ0]    [dividing through by P(Y=0|x), with Odds(Y=1|x) = e^{α+βx}]
= e^{ln(ρ1/ρ0)+α+βx} / [e^{ln(ρ1/ρ0)+α+βx} + 1]

⇒ logit(P(Y=1|Z=1, x)) = [ln(ρ1/ρ0) + α] + βx = α* + β*x,
where logit(P(Y=1|x)) = α + βx.
Thus, β* = β, but α* = α + ln(ρ1/ρ0).

Usually, the probability ρ1 of sampling a case (“the sampling rate for cases”) is much greater than the probability ρ0 of sampling a control, and in that case α* > α.

133
Logistic Regression with Retrospective Studies
(Agresti, p. 168, 3rd ed)
            CANCER   CONTROL   TOTAL
SMOKER        83        72      155
NON-SMOKER     3        14       17
TOTAL         86        86      172

Let Z = 1 if a subject is sampled; let Z = 0 otherwise.
Let ρ1 = P(Z=1|y=1) = prob of sampling a “case” [= P(Z=1|y=1, x), assumed]
Let ρ0 = P(Z=1|y=0) = prob of sampling a “control” [= P(Z=1|y=0, x), assumed]
We have not sampled from the conditional distribution of the response Y given X = x.

⇒ logit(P(Y=1|Z=1, x)) = [ln(ρ1/ρ0) + α] + βx = α* + β*x,
where logit(P(Y=1|x)) = α + βx.
Thus, β* = β, but α* = α + ln(ρ1/ρ0).

Usually, the probability ρ1 of sampling a case (“the sampling rate for cases”) is much greater than the probability ρ0 of sampling a control, and in that case α* > α.
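R check: the odds ratio is estimable from the retrospective table above via the cross-product ratio:
> (83*14)/(72*3)   # 5.38 = estimated odds ratio; its log estimates beta* = beta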
134
Intentionally Kept Blank

135
Continuous Covariate and Invariance to Linear
Transformation, p. 122, Text
Let x′ = a + x/k. [e.g., x′ = body weight in kg, x = weight in lb, k = 0.453592, a = 0]. Thus, x′ = 1 ⟺ x = k – ka, and x′ = 0 ⟺ x = –ka.

Let the two logistic regression models (Y = 1 for heart attack) be defined by
logit(P(Y=1|x)) = β0 + β1x;  logit(P(Y=1|x′)) = β0′ + β1′x′.

Then the odds ratio per unit change in x′ is

e^{β1′} = [Odds(Y=1|x′=1)] / [Odds(Y=1|x′=0)] = [Odds(Y=1|x = k–ka)] / [Odds(Y=1|x = –ka)]
        = e^{β0+β1(k–ka)} / e^{β0+β1(–ka)} = e^{kβ1} = (odds ratio per unit change in x)^k.

Thus, β1′ = kβ1, which implies β1′ = 0 ⟺ β1 = 0;
β0′ = logit(P(Y=1|x′ = 0)) = logit(P(Y=1|x = –ka)) = β0 + β1(–ka)

136
Intentionally Kept Blank

137
Categorical Variable Y as a Function of
Continuous Latent Variable Z

In the graph, read x as y, e.g., x = 0 as y = 0, etc.

Example: Z = exam score, Y = letter grade or grade point; covariate x = no. of hours of study.

138
Proportional Odds Model for Ordinal Data (p.162-)
• Suppose the observed variable Y = j iff αj–1 ≤ Z ≤ αj, j = 1, …, J, and the latent variable Z depends on the vector of covariates x through a linear or non-linear model. If linear, then Z = βᵀx + ε, where ε has CDF F. Then,
• γj = P(Y ≤ j) = P(Z ≤ αj) = F(αj – βᵀx), j = 1, …, J–1
• ⇒ F⁻¹(γj) = αj – βᵀx
• If F(z) = 1/(1+e^{–z}), the standard logistic distribution, then

F⁻¹(γj) = ln[γj/(1 – γj)] = αj – βᵀx (only αj changes with j);

⇒ [γj(x1)/(1 – γj(x1))] / [γj(x2)/(1 – γj(x2))] = exp(–βᵀ(x1 – x2)),

i.e., the ratio of the odds of the event {Y ≤ j} between any two levels x = x1 and x = x2 of the vector of predictors is independent of j; i.e., the odds of the cumulative response probabilities are proportional to each other (text, p.163)

139
Proportional Odds Model for Ordinal Data (p.162-)
• Z = βᵀx + ε, where ε has CDF F(z) = 1/(1+e^{–z}), the standard logistic distribution
• ⇒ ln[γj/(1 – γj)] = αj – βᵀx (only αj changes with j);
⇒ [γj(x1)/(1 – γj(x1))] / [γj(x2)/(1 – γj(x2))] = exp(–βᵀ(x1 – x2)), i.e., the odds of the cumulative response probabilities are proportional to each other
• When J = 2, we have the Binary Logistic Regression Model:
F⁻¹(γ1) = ln[γ1/(1 – γ1)] = α1 – βᵀx
• Since the odds of the cumulative response probability γj(x) increase with j, for the proportional odds model we have:
α1 ≤ α2 ≤ ⋯ ≤ αJ–1.
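A sketch of a proportional-odds fit in R (assumes the MASS package; 'y' an ordered factor and 'x' a predictor in a hypothetical data frame 'd'):
> library(MASS)
> fit_po = polr(y ~ x, data = d, method = "logistic")
> summary(fit_po)   # one common beta; J-1 ordered intercepts (zeta)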

140
Proportional Odds Model for Ordinal Data
• Suppose the observed variable Y = j iff αj–1 ≤ Z ≤ αj, j = 1, …, J, where the latent variable Z depends on the covariates through a linear or non-linear model. If linear, then Z = βᵀx + ε, where ε has CDF F. Then,
• γj = P(Y ≤ j) = P(Z ≤ αj) = F(αj – βᵀx) ⇒ F⁻¹(γj) = αj – βᵀx
• F(z) = 1/(1+e^{–z}) is the standard logistic distribution CDF ⇒ logit model
• F(z) is the normal CDF ⇒ probit model
• F(z) is an extreme-value dist CDF ⇒ complementary log-log model
[note: if exp(ε) has an exponential or Weibull dist, then ε has the extreme value dist]

141
Intentionally Kept Blank

142
Degrees of Freedom
The degrees of freedom for the LRT = –2log_e(Λ) = –2log_e(L0/L) is the difference in dimensionality of Θ and Θ0, when H0: θ ∈ Θ0 is true (H1: θ ∈ Θ0^c).
Let I = no. of distinct patterns of covariates; then the df are:
1. Independence Test: {IJ – 1} – {(I–1) + (J–1)}
2. Homogeneity Test: {I(J–1)} – (J–1)
3. Multinomial Logistic (J=3, say): {I(J–1)} – (J–1); (no. of model parameters under H0 = J–1; the null hypothesis of a multiple logistic regression is that there is no relationship between the X variables and the Y variable)
4. Binary Logistic (J=2): I(2–1) – 1

143
Degrees of freedom (contd.)
• (i) Logistic regression (Read Sec 4.3.1, Text, p.148):
df = I×(J–1) – m, where I = no. of distinct patterns of (categorical or otherwise) covariates, m = number of parameters estimated under H0.
[In product binomial or product multinomial sampling, "I" represents the distinct number of binomial or multinomial distributions we are sampling from.]
• (ii) Poisson regression (Read Sec 5.2, Text, p.179-180):
df = (n – p), where n = number of obs, p = no. of parameters estimated under H0.
• (iii) Summary: In logistic regression, the binomial or multinomial Y has finitely many category-values; one counts I = the no. of distinct patterns of (categorical or otherwise) covariates. In Poisson regression and the classical linear regression model, the Poisson or normal Y assumes infinitely many values; the degrees of freedom involves n, and not I.
144
References
• Agresti, A. (2012). Categorical Data Analysis, Wiley Series in
Probability and Statistics.
• Bishop, Y., Fienberg, S. E. and Holland, P. W. (1975). Discrete
Multivariate Analysis, MIT Press, Cambridge.
• Christensen, R. (1990). Loglinear Models. Springer-Verlag,
New York.
• Ramsey, F. L. and Schafer, D. W. (1997). The Statistical Sleuth.
Duxbury Press, Belmont, California.
• Read, T. R. C. and Cressie, N. (1988). Goodness of fit Statistics
for Discrete Multivariate Data. Springer-Verlag, New York.
• Goon, Gupta, Dasgupta, Fundamentals of Statistics, Volume
One.
145
