An R Companion to Applied Regression, Second Edition
John Fox, McMaster University
Sanford Weisberg, University of Minnesota

5.3 GLMs for Binary-Response Data
Table 5.4 Variables in the Mroz data set.

Variable  Description                              Remarks
lfp       wife’s labor-force participation         factor: no, yes
k5        number of children ages 5 and younger    0-3
k618      number of children ages 6 to 18
age       wife’s age in years                      30-60, single years
wc        wife’s college attendance                factor: no, yes
hc        husband’s college attendance             factor: no, yes
lwg       log of wife’s estimated wage rate        see text
inc       family income excluding wife’s income    $1,000s
To illustrate logistic regression, we turn to an example from Long (1997),
which draws on data from the 1976 U.S. Panel Study of Income Dynamics,
and in which the response variable is married women’s labor-force participation.
The data were originally used in a different context by Mroz (1987), and
the same data appear in Berndt (1991) as an exercise in logistic regression.
The data are in the data frame Mroz in the car package; printing 10 of the
n = 753 observations at random:
> library(car)
> some(Mroz)  # sample 10 rows
    lfp k5 k618 age  wc  hc     lwg   inc
43  yes  1    2  31 yes yes  0.9450 22.50
127 yes  0    3  45 yes yes -0.9606 23.67
194 yes  0    3  31  no  no  1.4971 18.00
232 yes  0    0  52  no  no  1.2504 10.40
277 yes  0    3  36  no yes  1.6032 16.40
351 yes  0    0  46  no  no  1.3069 28.00
362 yes  0    0  54 yes  no  2.1893 18.22
408 yes  1    3  36  no  no  3.2189 21.00
415 yes  0    3  36 yes yes  0.5263 32.00
607  no  1    1  44  no  no  0.9905  9.80
> nrow(Mroz)
[1] 753
The definitions of the variables in the Mroz data set are shown in Table 5.4.
With the exception of lwg, these variables are straightforward. The log of
each woman’s estimated wage rate, lwg, is based on her actual earnings if she
is in the labor force; if the woman is not in the labor force, then this variable is
imputed (i.e., filled in) as the predicted value from a regression of log wages
on the other predictors for women in the labor force. As we will see later (in
Section 6.6.3), this definition of expected earnings creates a problem for the
logistic regression.
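The imputation need not be repeated here, because lwg is already supplied in Mroz; the following sketch merely illustrates the general recipe, assuming a hypothetical data frame dat whose column logwage holds the observed log wage for women in the labor force:

> inLF <- dat$lfp == "yes"                  # women in the labor force
> wage.mod <- lm(logwage ~ k5 + k618 + age + wc + hc + inc,
+     data=dat, subset=inLF)                # regression for participants only
> dat$lwg <- dat$logwage                    # observed log wage where available
> dat$lwg[!inLF] <- predict(wage.mod, newdata=dat[!inLF, ])   # imputed values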
5.3.1 FITTING THE BINARY LOGISTIC-REGRESSION MODEL
The variable lfp is a factor with two levels, and if we use this variable as
the response, then the first level, no, corresponds to failure (0) and the second
level, yes, to success (1).
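If there is any doubt about which level comes first, the ordering can be checked directly:

> levels(Mroz$lfp)
[1] "no"  "yes"

The logistic-regression model is then fit with glm: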
> mroz.mod <- glm(lfp ~ k5 + k618 + age + wc + hc + lwg + inc,
+     family=binomial, data=Mroz)
The only features that differentiate this command from fitting a linear model
are the change of function from lm to glm and the addition of the family
argument. The family argument is set to the family-generator function
binomial. The first argument to glm, the model formula, specifies the
linear predictor for the logistic regression, not the mean function directly, as it
did in linear regression. Because the link function is not given explicitly, the
default logit link is used; the command is therefore equivalent to
> mroz.mod <- glm(lfp ~ k5 + k618 + age + wc + hc + lwg + inc,
+     family=binomial(link=logit), data=Mroz)
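The link may also be supplied as a quoted string, binomial(link="logit"). As a quick check that these variants produce the same fit, using the arbitrary name mroz.mod.2 for the refit:

> mroz.mod.2 <- glm(lfp ~ k5 + k618 + age + wc + hc + lwg + inc,
+     family=binomial(link="logit"), data=Mroz)
> all.equal(coef(mroz.mod), coef(mroz.mod.2))
[1] TRUE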
The model summary for a logistic regression is very similar to that for a
linear regression:
> summary(mroz.mod)

Call:
glm(formula = lfp ~ k5 + k618 + age + wc + hc + lwg +
    inc, family = binomial, data = Mroz)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-2.106  -1.090   0.598   0.972   2.189

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.18214    0.64438    4.94  7.9e-07
k5          -1.46291    0.19700   -7.43  1.1e-13
k618        -0.06457    0.06800   -0.95  0.34234
age         -0.06287    0.01278   -4.92  8.7e-07
wcyes        0.80727    0.22998    3.51  0.00045
hcyes        0.11173    0.20604    0.54  0.58762
lwg          0.60469    0.15082    4.01  6.1e-05
inc         -0.03445    0.00821   -4.20  2.7e-05

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1029.75  on 752  degrees of freedom
Residual deviance:  905.27  on 745  degrees of freedom
AIC: 921.3

Number of Fisher Scoring iterations: 4
The Wald tests, given by the ratio of the coefficient estimates to their standard
errors, are now labeled as z values because the large-sample reference
distribution for the tests is the normal distribution, not the t distribution as in
a linear model. The dispersion parameter, φ = 1, for the binomial family is
noted in the output. Additional output includes the null deviance and degrees
of freedom, which are for a model with all parameters apart from the intercept
set to 0; the residual deviance and degrees of freedom for the model actually
fit to the data; and the AIC, an alternative measure of fit sometimes used for
model selection (see Section 4.5). Finally, the number of iterations required to
obtain the maximum-likelihood estimates is printed.*
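As a small check on this description, the z values and p-values in the coefficient table, and the overall likelihood-ratio test implied by comparing the two deviances, can be recomputed directly (output not shown):

> wald <- coef(summary(mroz.mod))           # estimates, SEs, z values, p-values
> wald[, "Estimate"]/wald[, "Std. Error"]   # reproduces the z value column
> 2*pnorm(-abs(wald[, "z value"]))          # reproduces the two-sided p-values
> pchisq(1029.75 - 905.27, df=752 - 745, lower.tail=FALSE)   # test of all slopes = 0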
5.3.2 PARAMETER ESTIMATES FOR LOGISTIC REGRESSION
The estimated logistic-regression model is given by

\log_e\left[\frac{\hat{\mu}(x)}{1-\hat{\mu}(x)}\right] = b_0 + b_1 x_1 + \cdots + b_k x_k

If we exponentiate both sides of this equation, we get

\frac{\hat{\mu}(x)}{1-\hat{\mu}(x)} = \exp(b_0) \times \exp(b_1 x_1) \times \cdots \times \exp(b_k x_k)
where the left-hand side of the equation, \hat{\mu}(x)/[1-\hat{\mu}(x)], gives the fitted
odds of success—that is, the fitted probability of success divided by the fitted
probability of failure. Exponentiating the model removes the logarithms and
changes it from a model that is additive in the log-odds scale to one that is
multiplicative in the odds scale. For example, increasing the age of a woman by 1
year, holding the other predictors constant, multiplies the odds of her being in
the workforce by exp(b_3) = exp(-0.06287) = 0.9391—that is, reduces the
odds of her working by 6%. The exponentials of the coefficient estimates are
generally called risk factors (or odds ratios), and they can be viewed all at
once, along with their confidence intervals, by the command
> round(exp(cbind(Estimate=coef(mroz.mod), confint(mroz.mod))), 2)
            Estimate 2.5 % 97.5 %
(Intercept)    24.10  6.94  87.03
k5              0.23  0.16   0.34
k618            0.94  0.82   1.07
age             0.94  0.92   0.96
wcyes           2.24  1.43   3.54
hcyes           1.12  0.75   1.68
lwg             1.83  1.37   2.48
inc             0.97  0.95   0.98
Compared with a woman who did not attend college, for example, a college-
educated woman with all other predictors the same has odds of working about
2.24 times higher, with 95% confidence interval 1.43 to 3.54.
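For a change of more than 1 year in age, the coefficient is scaled before exponentiating; for example, a quick calculation of the fitted odds ratio for a 10-year increase in age:

> exp(10*coef(mroz.mod)["age"])   # roughly exp(10 * -0.06287) = 0.53

so the fitted odds of working for a woman 10 years older, with the other predictors held constant, are roughly halved.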
The confint function provides confidence intervals for GLMs based on
profiling the log-likelihood rather than on the Wald statistics used for linear
models (Venables and Ripley, 2002, sec. 8.4). Confidence intervals for GLMs
based on the log-likelihood take longer to compute but tend to be more accu-
rate than those based on the Wald statistic. Even before exponentiation, the
log-likelihood-based confidence intervals need not be symmetric about the
estimated coefficients.
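For comparison, Wald-based intervals, computed directly from the estimates and their standard errors, are available from confint.default(); on the odds-ratio scale, for example (output not shown):

> round(exp(confint.default(mroz.mod)), 2)   # Wald rather than profile-likelihood intervals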
*The iterative algorithm employed by glm to maximize the likelihood is described in Section 5.12.