
15
Generalized Linear Models

Due originally to Nelder and Wedderburn (1972), generalized linear models are a remarkable synthesis and extension of familiar regression models such as the linear models described in Part II of this text and the logit and probit models described in the preceding chapter. The current chapter begins with a consideration of the general structure and range of application of generalized linear models; proceeds to examine in greater detail generalized linear models for count data, including contingency tables; briefly sketches the statistical theory underlying generalized linear models; and concludes with the extension of regression diagnostics to generalized linear models. The unstarred sections of this chapter are perhaps more difficult than the unstarred material in preceding chapters. Generalized linear models have become so central to effective statistical data analysis, however, that it is worth the additional effort required to acquire a basic understanding of the subject.
15.1 The Structure of Generalized Linear Models
A generalized linear model (or GLM[1]) consists of three components:

1. A random component, specifying the conditional distribution of the response variable, Y_i (for the ith of n independently sampled observations), given the values of the explanatory variables in the model. In Nelder and Wedderburn's original formulation, the distribution of Y_i is a member of an exponential family, such as the Gaussian (normal), binomial, Poisson, gamma, or inverse-Gaussian families of distributions. Subsequent work, however, has extended GLMs to multivariate exponential families (such as the multinomial distribution), to certain non-exponential families (such as the two-parameter negative-binomial distribution), and to some situations in which the distribution of Y_i is not specified completely. Most of these ideas are developed later in the chapter.
2. A linear predictor, that is, a linear function of regressors,

η_i = α + β_1 X_i1 + β_2 X_i2 + ⋯ + β_k X_ik

As in the linear model, and in the logit and probit models of Chapter 14, the regressors X_ij are prespecified functions of the explanatory variables and therefore may include quantitative explanatory variables, transformations of quantitative explanatory variables, polynomial regressors, dummy regressors, interactions, and so on. Indeed, one of the advantages of GLMs is that the structure of the linear predictor is the familiar structure of a linear model.
3. A smooth and invertible linearizing link function g(·), which transforms the expectation of the response variable, μ_i ≡ E(Y_i), to the linear predictor:

g(μ_i) = η_i = α + β_1 X_i1 + β_2 X_i2 + ⋯ + β_k X_ik
[1] Some authors use the acronym GLM to refer to the general linear model, that is, the linear regression model with normal errors described in Part II of the text, and instead employ GLIM to denote generalized linear models (which is also the name of a computer program used to fit GLMs).
Table 15.1  Some Common Link Functions and Their Inverses

Link                     η_i = g(μ_i)                 μ_i = g^{−1}(η_i)
Identity                 μ_i                          η_i
Log                      log_e μ_i                    e^{η_i}
Inverse                  μ_i^{−1}                     η_i^{−1}
Inverse-square           μ_i^{−2}                     η_i^{−1/2}
Square-root              √μ_i                         η_i²
Logit                    log_e [μ_i/(1 − μ_i)]        1/(1 + e^{−η_i})
Probit                   Φ^{−1}(μ_i)                  Φ(η_i)
Log-log                  −log_e[−log_e(μ_i)]          exp[−exp(−η_i)]
Complementary log-log    log_e[−log_e(1 − μ_i)]       1 − exp[−exp(η_i)]

NOTE: μ_i is the expected value of the response; η_i is the linear predictor; and Φ(·) is the cumulative distribution function of the standard-normal distribution.
Because the link function is invertible, we can also write

μ_i = g^{−1}(η_i) = g^{−1}(α + β_1 X_i1 + β_2 X_i2 + ⋯ + β_k X_ik)

and, thus, the GLM may be thought of as a linear model for a transformation of the expected response or as a nonlinear regression model for the response. The inverse link g^{−1}(·) is also called the mean function. Commonly employed link functions and their inverses are shown in Table 15.1. Note that the identity link simply returns its argument unaltered, η_i = g(μ_i) = μ_i, and thus μ_i = g^{−1}(η_i) = η_i.
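To make the pairing of links and inverse links concrete, here is a minimal Python sketch (illustrative helper code, not from the text) that encodes three of the links in Table 15.1 and checks numerically that each inverse undoes its link:

    import numpy as np

    # Three of the links from Table 15.1, paired with their inverses
    # (the mean functions).
    links = {
        "log":     (lambda mu: np.log(mu),
                    lambda eta: np.exp(eta)),
        "logit":   (lambda mu: np.log(mu / (1 - mu)),
                    lambda eta: 1 / (1 + np.exp(-eta))),
        "cloglog": (lambda mu: np.log(-np.log(1 - mu)),
                    lambda eta: 1 - np.exp(-np.exp(eta))),
    }

    mu = np.array([0.1, 0.3, 0.5, 0.9])      # expected responses in (0, 1)
    for name, (g, g_inv) in links.items():
        eta = g(mu)                          # mean mapped to the linear-predictor scale
        assert np.allclose(g_inv(eta), mu)   # the inverse link recovers the mean
        print(name, np.round(eta, 3))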
The last four link functions in Table 15.1 are for binomial data, where Y_i represents the observed proportion of successes in n_i independent binary trials; thus, Y_i can take on any of the values 0, 1/n_i, 2/n_i, . . . , (n_i − 1)/n_i, 1. Recall from Chapter 14 that binomial data also encompass binary data, where all the observations represent n_i = 1 trial, and consequently Y_i is either 0 or 1. The expectation of the response μ_i = E(Y_i) is then the probability of success, which we symbolized by π_i in the previous chapter. The logit, probit, log-log, and complementary log-log links are graphed in Figure 15.1. In contrast to the logit and probit links (which, as we noted previously, are nearly indistinguishable once the variances of the underlying normal and logistic distributions are equated), the log-log and complementary log-log links approach the asymptotes of 0 and 1 asymmetrically.[2]
Beyond the general desire to select a link function that renders the regression of Y on the Xs linear, a promising link will remove restrictions on the range of the expected response. This is a familiar idea from the logit and probit models discussed in Chapter 14, where the object was to model the probability of success, represented by μ_i in our current general notation. As a probability, μ_i is confined to the unit interval [0, 1]. The logit and probit links map this interval to the entire real line, from −∞ to +∞. Similarly, if the response Y is a count, taking on only non-negative integer values, 0, 1, 2, . . . , and consequently μ_i is an expected count, which (though not necessarily an integer) is also non-negative, the log link maps μ_i to the whole real line. This is not to say that the choice of link function is entirely determined by the range of the response variable.

[2] Because the log-log link can be obtained from the complementary log-log link by exchanging the definitions of success and failure, it is common for statistical software to provide only one of the two, typically the complementary log-log link.
[Figure 15.1: Logit, probit, log-log, and complementary log-log links for binomial data, plotting μ_i = g^{−1}(η_i) against η_i. The variances of the normal and logistic distributions have been equated to facilitate the comparison of the logit and probit links (by graphing the cumulative distribution function of N(0, π²/3) for the probit link).]
A generalized linear model (or GLM) consists of three components:

1. A random component, specifying the conditional distribution of the response variable, Y_i (for the ith of n independently sampled observations), given the values of the explanatory variables in the model. In the initial formulation of GLMs, the distribution of Y_i was a member of an exponential family, such as the Gaussian, binomial, Poisson, gamma, or inverse-Gaussian families of distributions.

2. A linear predictor, that is, a linear function of regressors,

η_i = α + β_1 X_i1 + β_2 X_i2 + ⋯ + β_k X_ik

3. A smooth and invertible linearizing link function g(·), which transforms the expectation of the response variable, μ_i = E(Y_i), to the linear predictor:

g(μ_i) = η_i = α + β_1 X_i1 + β_2 X_i2 + ⋯ + β_k X_ik
A convenient property of distributions in the exponential families is that the conditional variance of Y_i is a function of its mean μ_i [say, v(μ_i)] and, possibly, a dispersion parameter φ. The variance functions for the commonly used exponential families appear in Table 15.2. The conditional variance of the response in the Gaussian family is a constant, φ, which is simply alternative notation for what we previously termed the error variance, σ_ε². In the binomial and Poisson families, the dispersion parameter is set to the fixed value φ = 1.

Table 15.2 also shows the range of variation of the response variable in each family, and the so-called canonical (or natural) link function associated with each family.
Table 15.2  Canonical Link, Response Range, and Conditional Variance Function for Exponential Families

Family             Canonical Link    Range of Y_i          V(Y_i | η_i)
Gaussian           Identity          (−∞, +∞)              φ
Binomial           Logit             0, 1/n_i, ..., 1      μ_i(1 − μ_i)/n_i
Poisson            Log               0, 1, 2, ...          μ_i
Gamma              Inverse           (0, ∞)                φμ_i²
Inverse-Gaussian   Inverse-square    (0, ∞)                φμ_i³

NOTE: φ is the dispersion parameter, η_i is the linear predictor, and μ_i is the expectation of Y_i (the response). In the binomial family, n_i is the number of trials.
The canonical link simplifies the GLM,[3] but other link functions may be used as well. Indeed, one of the strengths of the GLM paradigm, in contrast to transformations of the response variable in linear regression, is that the choice of linearizing transformation is partly separated from the distribution of the response, and the same transformation does not have to both normalize the distribution of Y and make its regression on the Xs linear.[4] The specific links that may be used vary from one family to another and also, to a certain extent, from one software implementation of GLMs to another. For example, it would not be promising to use the identity, log, inverse, inverse-square, or square-root links with binomial data, nor would it be sensible to use the logit, probit, log-log, or complementary log-log link with nonbinomial data.

I assume that the reader is generally familiar with the Gaussian and binomial families and simply give their distributions here for reference. The Poisson, gamma, and inverse-Gaussian distributions are perhaps less familiar, and so I provide some more detail:[5]
• The Gaussian distribution with mean μ and variance σ² has density function

p(y) = [1/(σ√(2π))] exp[−(y − μ)²/(2σ²)]    (15.1)
• The binomial distribution for the proportion Y of successes in n independent binary trials with probability of success π has probability function

p(y) = C(n, ny) π^{ny} (1 − π)^{n(1−y)}    (15.2)
[3] This point is pursued in Section 15.3.

[4] There is also this more subtle difference: When we transform Y and regress the transformed response on the Xs, we are modeling the expectation of the transformed response,

E[g(Y_i)] = α + β_1 x_i1 + β_2 x_i2 + ⋯ + β_k x_ik

In a GLM, in contrast, we model the transformed expectation of the response,

g[E(Y_i)] = α + β_1 x_i1 + β_2 x_i2 + ⋯ + β_k x_ik

While similar in spirit, this is not quite the same thing when (as is true except for the identity link) the link function g(·) is nonlinear.

[5] The various distributions used in this chapter are described in a general context in Appendix D on probability and estimation.
Here, ny is the observed number of successes in the n trials, and n(1 − y) is the number of failures; and

C(n, ny) = n!/{(ny)! [n(1 − y)]!}

is the binomial coefficient.
• The Poisson distributions are a discrete family with probability function indexed by the rate parameter μ > 0:

p(y) = μ^y e^{−μ}/y!    for y = 0, 1, 2, . . .

The expectation and variance of a Poisson random variable are both equal to μ. Poisson distributions for several values of the parameter μ are graphed in Figure 15.2. As we will see in Section 15.2, the Poisson distribution is useful for modeling count data. As μ increases, the Poisson distribution grows more symmetric and is eventually well approximated by a normal distribution.
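The growing symmetry is easy to check numerically; the following sketch (illustrative values, and the use of scipy is an assumption) compares the Poisson probability function with a normal density of matching mean and variance:

    import numpy as np
    from scipy.stats import norm, poisson

    # Compare Poisson(mu) probabilities with the N(mu, mu) density at the integers.
    for mu in (0.5, 4.0, 16.0):
        k = np.arange(0, int(mu + 4 * np.sqrt(mu)) + 1)
        gap = np.max(np.abs(poisson.pmf(k, mu)
                            - norm.pdf(k, loc=mu, scale=np.sqrt(mu))))
        print(f"mu = {mu:4}: max |Poisson - normal| = {gap:.4f}")

The maximum discrepancy shrinks steadily as μ grows, in line with the normal approximation described above.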
• The gamma distributions are a continuous family with probability-density function indexed by the scale parameter ω > 0 and shape parameter ψ > 0:

p(y) = [(y/ω)^{ψ−1} exp(−y/ω)] / [ω Γ(ψ)]    for y > 0    (15.3)

where Γ(·) is the gamma function.[6] The expectation and variance of the gamma distribution are, respectively, E(Y) = ωψ and V(Y) = ω²ψ. In the context of a generalized linear model, where, for the gamma family, V(Y) = φμ² (recall Table 15.2), the dispersion parameter is simply the inverse of the shape parameter, φ = 1/ψ. As the names of the parameters suggest, the scale parameter in the gamma family influences the spread (and, incidentally, the location) but not the shape of the distribution, while the shape parameter controls the skewness of the distribution. Figure 15.3 shows gamma distributions for scale ω = 1 and several values of the shape parameter ψ. (Altering the scale parameter would change only the labelling of the horizontal axis in the graph.) As the shape parameter gets larger, the distribution grows more symmetric. The gamma distribution is useful for modeling a positive continuous response variable, where the conditional variance of the response grows with its mean but where the coefficient of variation of the response, SD(Y)/μ, is constant.
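As a quick numerical check on this parameterization (a sketch assuming scipy, whose gamma family uses the same shape-scale convention, with shape a = ψ and scale = ω):

    from scipy.stats import gamma

    omega, psi = 2.0, 5.0               # illustrative scale and shape values
    dist = gamma(a=psi, scale=omega)

    print(dist.mean())                  # E(Y) = omega * psi = 10.0
    print(dist.var())                   # V(Y) = omega**2 * psi = 20.0
    print(dist.std() / dist.mean())     # CV = 1/sqrt(psi), free of omega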
• The inverse-Gaussian distributions are another continuous family indexed by two parameters, μ and λ, with density function

p(y) = √[λ/(2πy³)] exp[−λ(y − μ)²/(2yμ²)]    for y > 0

The expectation and variance of Y are E(Y) = μ and V(Y) = μ³/λ.
[6] *The gamma function is defined as

Γ(x) = ∫₀^∞ e^{−z} z^{x−1} dz

and may be thought of as a continuous generalization of the factorial function in that when x is a non-negative integer, x! = Γ(x + 1).
[Figure 15.2: Poisson distributions for various values of the rate parameter μ: (a) μ = 0.5, (b) μ = 1, (c) μ = 2, (d) μ = 4, (e) μ = 8, (f) μ = 16. Each panel plots p(y) against y.]
[Figure 15.3: Several gamma distributions for scale ω = 1 and various values of the shape parameter ψ (ψ = 0.5, 1, 2, 5), plotting p(y) against y.]
In the context of a GLM, where, for the inverse-Gaussian family, V(Y) = φμ³ (as recorded in Table 15.2), λ is the inverse of the dispersion parameter φ. Like the gamma distribution, therefore, the variance of the inverse-Gaussian distribution increases with its mean, but at a more rapid rate. Skewness also increases with the value of μ and decreases with λ. Figure 15.4 shows several inverse-Gaussian distributions.
A convenient property of distributions in the exponential families is that the conditional variance of Y_i is a function of its mean μ_i and, possibly, a dispersion parameter φ. In addition to the familiar Gaussian and binomial families (the latter for proportions), the Poisson family is useful for modeling count data, and the gamma and inverse-Gaussian families for modeling positive continuous data, where the conditional variance of Y increases with its expectation.
15.1.1 Estimating and Testing GLMs

GLMs are fit to data by the method of maximum likelihood, providing not only estimates of the regression coefficients but also estimated asymptotic (i.e., large-sample) standard errors of the coefficients.[7] To test the null hypothesis H_0: β_j = β_j^{(0)}, we can compute the Wald statistic Z_0 = (B_j − β_j^{(0)})/SE(B_j), where SE(B_j) is the asymptotic standard error of the estimated coefficient B_j. Under the null hypothesis, Z_0 follows a standard normal distribution.[8]

As explained, some of the exponential families on which GLMs are based include an unknown dispersion parameter φ. Although this parameter can, in principle, be estimated by maximum likelihood as well, it is more common to use a method of moments estimator, which I will denote φ̃.[9]
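As a concrete illustration of the Wald test, here is a minimal Python sketch (the coefficient and standard error are the Canada estimates from Table 15.3 later in the chapter; the use of scipy is an assumption, not something the text prescribes):

    from scipy.stats import norm

    # Wald test of H0: beta_j = 0 for an estimated GLM coefficient.
    b_j, se_j = 0.8259, 0.0490       # Canada coefficient in Table 15.3
    z0 = (b_j - 0.0) / se_j          # Wald statistic for beta_j^(0) = 0
    p = 2 * norm.sf(abs(z0))         # two-sided p-value from the standard normal
    print(f"z = {z0:.2f}, p = {p:.2g}")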
[7] Details are provided in Section 15.3.2. The method of maximum likelihood is introduced in Appendix D on probability and estimation.

[8] Wald tests and F-tests of more general linear hypotheses are described in Section 15.3.3.

[9] Again, see Section 15.3.2.
[Figure 15.4: Inverse-Gaussian distributions for several combinations of values of the mean μ and inverse-dispersion λ: (μ = 1, λ = 1), (μ = 2, λ = 1), (μ = 1, λ = 5), (μ = 2, λ = 5).]
As is familiar from the preceding chapter on logit and probit models, the ANOVA for linear models has a close analog in the analysis of deviance for GLMs. In the current more general context, the residual deviance for a GLM is

D_m ≡ 2(log_e L_s − log_e L_m)

where L_m is the maximized likelihood under the model in question and L_s is the maximized likelihood under a saturated model, which dedicates one parameter to each observation and consequently fits the data as closely as possible. The residual deviance is analogous to (and, indeed, is a generalization of) the residual sum of squares for a linear model.
In GLMs for which the dispersion parameter is fixed to 1 (i.e., binomial and Poisson GLMs), the likelihood-ratio test statistic is simply the difference in the residual deviances for nested models. Suppose that Model 0, with k_0 + 1 coefficients, is nested within Model 1, with k_1 + 1 coefficients (where, then, k_0 < k_1); most commonly, Model 0 would simply omit some of the regressors in Model 1. We test the null hypothesis that the restrictions on Model 1 represented by Model 0 are correct by computing the likelihood-ratio test statistic

G_0² = D_0 − D_1

Under the hypothesis, G_0² is asymptotically distributed as chi-square with k_1 − k_0 degrees of freedom.
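In code, the test is immediate once the two residual deviances are in hand; this sketch uses the deviances from the Ornstein regression discussed later in the chapter (Section 15.2):

    from scipy.stats import chi2

    # Likelihood-ratio test: Model 0 (Nation, Sector) vs. Model 1 (full model).
    D0, df0 = 2278.298, 235     # residual deviance and df, omitting Assets
    D1, df1 = 1887.402, 234     # residual deviance and df, full model
    G2 = D0 - D1                # likelihood-ratio statistic
    p = chi2.sf(G2, df0 - df1)  # chi-square tail probability on k1 - k0 df
    print(f"G2 = {G2:.2f} on {df0 - df1} df, p = {p:.2g}")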
Likelihood-ratio tests can be turned around to provide confidence intervals for coefficients; as mentioned in Section 14.1.4 in connection with logit and probit models, tests and intervals based on the likelihood-ratio statistic tend to be more reliable than those based on the Wald statistic. For example, the 95% confidence interval for β_j includes all values β_j* for which the hypothesis H_0: β_j = β_j* is acceptable at the .05 level, that is, all values of β_j* for which 2(log_e L_1 − log_e L_0) ≤ χ²_{.05,1} = 3.84, where log_e L_1 is the maximized log likelihood for the full model, and log_e L_0 is the maximized log likelihood for a model in which β_j is constrained to the value β_j*. This procedure is computationally intensive because it requires profiling the likelihood, refitting the model for various fixed values β_j* of β_j.
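A sketch of this profiling idea for a Poisson GLM, using Python's statsmodels (an assumption, not the text's software; the function and array names are hypothetical). Fixing β_j at a trial value β_j* is accomplished by absorbing β_j* x_j into the model offset:

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    def profile_accepts(y, X_others, x_j, beta_star):
        # Refit with beta_j fixed at beta_star; return True if the
        # likelihood-ratio statistic against the full model is below 3.84.
        full = sm.GLM(y, sm.add_constant(np.column_stack([X_others, x_j])),
                      family=sm.families.Poisson()).fit()
        constrained = sm.GLM(y, sm.add_constant(X_others),
                             offset=beta_star * x_j,
                             family=sm.families.Poisson()).fit()
        g2 = 2 * (full.llf - constrained.llf)
        return g2 <= chi2.ppf(0.95, 1)   # 3.84

    # Scanning beta_star over a grid of values, the 95% profile-likelihood
    # confidence interval for beta_j is the set of accepted values.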
For GLMs in which there is a dispersion parameter to estimate (Gaussian, gamma, and inverse-Gaussian GLMs), we can instead compare nested models by an F-test,

F_0 = [(D_0 − D_1)/(k_1 − k_0)] / φ̃

where the estimated dispersion φ̃, analogous to the estimated error variance for a linear model, is taken from the largest model fit to the data (which is not necessarily Model 1). If the largest model has k + 1 coefficients, then, under the hypothesis that the restrictions on Model 1 represented by Model 0 are correct, F_0 follows an F-distribution with k_1 − k_0 and n − k − 1 degrees of freedom. Applied to a Gaussian GLM, this is simply the familiar incremental F-test. The residual deviance divided by the estimated dispersion, D* ≡ D/φ̃, is called the scaled deviance.[10]

[10] Usage is not entirely uniform here, and either the residual deviance or the scaled deviance is often simply termed the deviance.
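A minimal sketch of this incremental F-test (the numbers here are hypothetical, since the Ornstein model of Section 15.2 is a Poisson GLM with φ fixed at 1):

    from scipy.stats import f

    # Incremental F-test for nested GLMs with an estimated dispersion.
    D0, D1 = 250.0, 210.0        # hypothetical residual deviances
    k0, k1 = 3, 5                # coefficients (excluding the constant)
    n, k = 100, 5                # sample size; largest model has k + 1 coefficients
    phi = 2.1                    # hypothetical dispersion from the largest model
    F0 = ((D0 - D1) / (k1 - k0)) / phi
    p = f.sf(F0, k1 - k0, n - k - 1)
    print(f"F = {F0:.2f} on {k1 - k0} and {n - k - 1} df, p = {p:.3f}")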
As we did for logit and probit models,[11] we can base a GLM analog of the squared multiple correlation on the residual deviance: Let D_0 be the residual deviance for the model including only the regression constant (termed the null deviance) and D_1 the residual deviance for the model in question. Then,

R² ≡ 1 − D_1/D_0

represents the proportion of the null deviance accounted for by the model.

[11] See Section 14.1.4.
GLMs are fit to data by the method of maximum likelihood, providing not only estimates of the regression coefficients but also estimated asymptotic standard errors of the coefficients.

The ANOVA for linear models has an analog in the analysis of deviance for GLMs. The residual deviance for a GLM is D_m = 2(log_e L_s − log_e L_m), where L_m is the maximized likelihood under the model in question and L_s is the maximized likelihood under a saturated model. The residual deviance is analogous to the residual sum of squares for a linear model.

In GLMs for which the dispersion parameter is fixed to 1 (binomial and Poisson GLMs), the likelihood-ratio test statistic is the difference in the residual deviances for nested models. For GLMs in which there is a dispersion parameter to estimate (Gaussian, gamma, and inverse-Gaussian GLMs), we can instead compare nested models by an incremental F-test.
15.2 Generalized Linear Models for Counts

The basic GLM for count data is the Poisson model with log link. Consider, by way of example, Michael Ornstein's data on interlocking directorates among 248 dominant Canadian firms, previously discussed in Chapters 3 and 4. The number of interlocks for each firm is the number of ties that a firm maintained by virtue of its board members and top executives also serving as board members or executives of other firms in the data set. Ornstein was interested in the regression of number of interlocks on other characteristics of the firms, specifically, on their assets (measured in billions of dollars), nation of control (Canada, the United States, the United Kingdom, or another country), and the principal sector of operation of the firm (10 categories, including banking, other financial institutions, heavy manufacturing, etc.).

[Figure 15.5: The distribution of number of interlocks among 248 dominant Canadian corporations (histogram of frequency against number of interlocks).]

Examining the distribution of number of interlocks (Figure 15.5) reveals that the variable is highly positively skewed, and that there are many zero counts. Although the conditional distribution of interlocks given the explanatory variables could differ from its marginal distribution, the extent to which the marginal distribution of interlocks departs from symmetry bodes ill for least-squares regression. Moreover, no transformation will spread out the zeroes.[12]
The results of the Poisson regression of number of interlocks on assets, nation of control, and sector are summarized in Table 15.3. I set the United States as the baseline category for nation of control, and Construction as the baseline category for sector; these are the categories with the smallest fitted numbers of interlocks controlling for the other variables in the regression, and the dummy-regressor coefficients are therefore all positive.

The residual deviance for this model is D(Assets, Nation, Sector) = 1887.402 on n − k − 1 = 248 − 13 − 1 = 234 degrees of freedom. Deleting each explanatory variable in turn from the model produces the following residual deviances and degrees of freedom:

Explanatory Variables    Residual Deviance    df
Nation, Sector           2278.298             235
Assets, Sector           2216.345             237
Assets, Nation           2248.861             243

[12] Ornstein (1976) in fact performed a linear least-squares regression for these data, though one with a slightly different specification from that given here. He cannot be faulted for having done so, however, inasmuch as Poisson regression models, and, with the exception of loglinear models for contingency tables, other specialized models for counts, were not typically in sociologists' statistical toolkit at the time.
Table 15.3  Estimated Coefficients for the Poisson Regression of Number of Interlocks on Assets, Nation of Control, and Sector, for Ornstein's Canadian Interlocking-Directorate Data

Coefficient                            Estimate    Standard Error
Constant                               0.8791      0.2101
Assets                                 0.02085     0.00120
Nation of Control (baseline: United States)
  Canada                               0.8259      0.0490
  Other                                0.6627      0.0755
  United Kingdom                       0.2488      0.0919
Sector (baseline: Construction)
  Wood and paper                       1.331       0.213
  Transport                            1.297       0.214
  Other financial                      1.297       0.211
  Mining, metals                       1.241       0.209
  Holding companies                    0.8280      0.2329
  Merchandising                        0.7973      0.2182
  Heavy manufacturing                  0.6722      0.2133
  Agriculture, food, light industry    0.6196      0.2120
  Banking                              0.2104      0.2537
Taking differences between these deviances and the residual deviance for the full model yields the following analysis-of-deviance table:

Source    G_0²      df    p
Assets    390.90    1     <.0001
Nation    328.94    3     <.0001
Sector    361.46    9     <.0001

All the terms in the model are therefore highly statistically significant.
Because the model uses the log link, we can interpret the exponentiated coefficients (i.e., the e^{B_j}) as multiplicative effects on the expected number of interlocks. Thus, for example, holding nation of control and sector constant, increasing assets by 1 billion dollars (the unit of the assets variable) multiplies the estimated expected number of interlocks by e^{0.02085} = 1.021, that is, an increase of just over 2%. Similarly, the estimated expected number of interlocks is e^{0.8259} = 2.283 times as high in a Canadian-controlled firm as in a comparable U.S.-controlled firm.

As mentioned, the residual deviance for the full model fit to Ornstein's data is D_1 = 1887.402; the deviance for a model fitting only the constant (i.e., the null deviance) is D_0 = 3737.010. Consequently, R² = 1 − 1887.402/3737.010 = .495, revealing that the model accounts for nearly half the deviance in number of interlocks.
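For readers who want to reproduce this kind of fit in software, here is a minimal sketch using Python's statsmodels (an assumption, not the text's software). The data below are simulated stand-ins with hypothetical column names, since the actual data set is not reproduced here:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Simulated stand-in for Ornstein's data.
    rng = np.random.default_rng(0)
    n = 248
    df = pd.DataFrame({
        "assets": rng.exponential(scale=10, size=n),
        "nation": rng.choice(["CAN", "US", "UK", "OTH"], size=n),
        "sector": rng.choice(["BNK", "CON", "MAN", "TRN"], size=n),
    })
    eta = 0.9 + 0.02 * df["assets"] + np.where(df["nation"] == "CAN", 0.8, 0.0)
    df["interlocks"] = rng.poisson(np.exp(eta))

    # Poisson GLM with log link (the statsmodels default for Poisson).
    model = smf.glm("interlocks ~ assets + C(nation) + C(sector)",
                    data=df, family=sm.families.Poisson()).fit()
    print(model.summary())
    print("Pseudo R2:", 1 - model.deviance / model.null_deviance)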
The Poisson-regression model is a nonlinear model for the expected response, and I therefore find it generally simpler to interpret the model graphically using effect displays than to examine the estimated coefficients directly. The principles of construction of effect displays for GLMs are essentially the same as for linear models and for logit and probit models:[13] We usually construct one display for each high-order term in the model, allowing the explanatory variables in that term to range over their values, while holding other explanatory variables in the model to typical values. In a GLM, it is advantageous to plot effects on the scale of the estimated linear predictor, η̂, a procedure that preserves the linear structure of the model. In a Poisson model with the log link, the linear predictor is on the log-count scale. We can, however, make the display easier to interpret by relabeling the vertical axis in the scale of the expected response, μ̂, most informatively by providing a second vertical axis on the right-hand side of the plot. For a Poisson model, the expected response is a count.

[13] See Section 15.3.4 for details.

[Figure 15.6: Effect displays for (a) assets, (b) nation of control, and (c) sector in the Poisson regression for Ornstein's interlocking-directorate data. The vertical axes are in the log-count scale of the linear predictor, with a second axis on the right in the scale of expected counts. The broken lines and error bars give 95% confidence intervals around the fitted effects (computed using the quasi-Poisson model described below). A rug-plot at the bottom of panel (a) shows the distribution of assets.]
Effect displays for the terms in Ornstein's Poisson regression are shown in Figure 15.6. This model has an especially simple structure because each high-order term is a main effect; there are no interactions in the model. The effect display for assets shows a one-dimensional scatterplot (a rug-plot) for this variable at the bottom of the graph, revealing that the distribution of assets is highly skewed to the right. Skewness produces some high-leverage observations and suggests the possibility of a nonlinear effect for assets, points that I pursue later in the chapter.[14]
15.2.1 Models for Overdispersed Count Data

The residual deviance for the Poisson regression model fit to the interlocking-directorate data, D = 1887.4, is much larger than the 234 residual degrees of freedom for the model. If the Poisson model fits the data reasonably, we would expect the residual deviance to be roughly equal to the residual degrees of freedom.[15] That the residual deviance is so large suggests that the conditional variation of the expected number of interlocks exceeds the variation of a Poisson-distributed variable, for which the variance equals the mean. This common occurrence in the analysis of count data is termed overdispersion.[16] Indeed, overdispersion is so common in regression models for count data, and its consequences are potentially so severe, that models such as the quasi-Poisson and negative-binomial GLMs discussed in this section should be employed as a matter of course.
The Quasi-Poisson Model

A simple remedy for overdispersed count data is to introduce a dispersion parameter into the Poisson model, so that the conditional variance of the response is now V(Y_i|η_i) = φμ_i. If φ > 1, therefore, the conditional variance of Y increases more rapidly than its mean. There is no exponential family corresponding to this specification, and the resulting GLM does not imply a specific probability distribution for the response variable. Rather, the model specifies the conditional mean and variance of Y_i directly. Because the model does not give a probability distribution for Y_i, it cannot be estimated by maximum likelihood. Nevertheless, the usual procedure for maximum-likelihood estimation of a GLM yields the so-called quasi-likelihood estimators of the regression coefficients, which share many of the properties of maximum-likelihood estimators.[17]

As it turns out, the quasi-likelihood estimates of the regression coefficients are identical to the ML estimates for the Poisson model. The estimated coefficient standard errors differ, however: If φ̃ is the estimated dispersion for the model, then the coefficient standard errors for the quasi-Poisson model are φ̃^{1/2} times those for the Poisson model. In the event of overdispersion, therefore, where φ̃ > 1, the effect of introducing a dispersion parameter and obtaining quasi-likelihood estimates is (realistically) to inflate the coefficient standard errors. Likewise, F-tests for terms in the model will reflect the estimated dispersion parameter, producing smaller test statistics and larger p-values.
As explained in the following section, we use a method-of-moments estimator for the dispersion parameter. In the quasi-Poisson model, the dispersion estimator takes the form

φ̃ = [1/(n − k − 1)] Σ_i (Y_i − μ̂_i)²/μ̂_i

where μ̂_i = g^{−1}(η̂_i) is the fitted expectation of Y_i. Applied to Ornstein's interlocking-directorate regression, for example, we get φ̃ = 7.9435, and, therefore, the standard errors of the regression coefficients for the Poisson model in Table 15.3 are each multiplied by √7.9435 = 2.818.

[14] See Section 15.4 on diagnostics for GLMs.

[15] That is, the ratio of the residual deviance to degrees of freedom can be taken as an estimate of the dispersion parameter φ, which, in a Poisson model, is fixed to 1. It should be noted, however, that this deviance-based estimator of the dispersion can perform poorly. A generally preferable method of moments estimator is given in Section 15.3.

[16] Although it is much less common, it is also possible for count data to be underdispersed, that is, for the conditional variation of the response to be less than the mean. The remedy for underdispersed count data is the same as for overdispersed data; for example, we can fit a quasi-Poisson model with a dispersion parameter, as described immediately below.

[17] See Section 15.3.2.
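Continuing the statsmodels sketch from above (again an assumption about software), the method-of-moments dispersion estimate and the quasi-Poisson standard errors follow directly from a fitted Poisson model:

    import numpy as np

    # `model` is the fitted Poisson GLM from the earlier sketch.
    mu_hat = model.fittedvalues                # fitted expectations
    y = model.model.endog                      # observed counts
    phi = np.sum((y - mu_hat) ** 2 / mu_hat) / model.df_resid
    print("Dispersion estimate:", phi)         # the text reports 7.9435 for Ornstein's data

    quasi_se = model.bse * np.sqrt(phi)        # quasi-Poisson standard errors
    print(quasi_se)

The same quantity is available as model.pearson_chi2 / model.df_resid, since the numerator of φ̃ is the Pearson statistic.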
I note in passing that there is a similar quasi-binomial model for overdispersed proportions, replacing the fixed dispersion parameter of 1 in the binomial distribution with a dispersion parameter to be estimated from the data. Overdispersed binomial data can arise, for example, when different individuals who share the same values of the explanatory variables nevertheless differ in their probability of success, a situation that is termed unmodelled heterogeneity. Similarly, overdispersion can occur when binomial observations are not independent, as required by the binomial distribution, for example, when each binomial observation is for related individuals, such as members of a family.
The Negative-Binomial Model

There are several routes to models for counts based on the negative-binomial distribution (see, e.g., Long, 1997, sect. 8.3; McCullagh & Nelder, 1989, sect. 6.2.3). One approach (following McCullagh & Nelder, 1989, p. 233) is to adopt a Poisson model for the count Y_i but to suppose that the expected count is itself an unobservable random variable that is gamma-distributed with mean μ_i and constant scale parameter ω (implying that the gamma shape parameter is ψ_i = μ_i/ω[18]). Then the observed count Y_i follows a negative-binomial distribution,[19]

p(y_i) = [Γ(y_i + ω)/(y_i! Γ(ω))] · μ_i^{y_i} ω^ω/(μ_i + ω)^{y_i+ω}    (15.4)

with expected value E(Y_i) = μ_i and variance V(Y_i) = μ_i + μ_i²/ω. Unless the parameter ω is large, therefore, the variance of Y increases more rapidly with the mean than the variance of a Poisson variable. Making the expected value of Y_i a random variable incorporates additional variation among observed counts for observations that share the same values of the explanatory variables and consequently have the same linear predictor η_i.
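A small numerical check of Equation 15.4 (an illustrative sketch; computing on the log scale with gammaln avoids overflow for large counts):

    import numpy as np
    from scipy.special import gammaln

    def nb_pmf(y, mu, omega):
        # Negative-binomial probability from Equation 15.4.
        log_p = (gammaln(y + omega) - gammaln(y + 1) - gammaln(omega)
                 + y * np.log(mu) + omega * np.log(omega)
                 - (y + omega) * np.log(mu + omega))
        return np.exp(log_p)

    mu, omega = 6.0, 1.5                 # illustrative values
    y = np.arange(0, 400)                # effectively the whole support here
    p = nb_pmf(y, mu, omega)
    print(p.sum())                       # ~1: probabilities sum to one
    print((y * p).sum())                 # ~6.0: E(Y) = mu
    print((y ** 2 * p).sum() - 6.0**2)   # ~30.0: V(Y) = mu + mu**2/omega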
With the gamma scale parameter ω fixed to a known value, the negative-binomial distribution is an exponential family (in the sense of Equation 15.15 in Section 15.3.1), and a GLM based on this distribution can be fit by iterated weighted least squares (as developed in the next section). If instead, as is typically the case, the value of ω is unknown and must therefore be estimated from the data, standard methods for GLMs based on exponential families do not apply. We can, however, obtain estimates of both the regression coefficients and ω by the method of maximum likelihood. Applied to Ornstein's interlocking-directorate regression, and using the log link, the negative-binomial GLM produces results very similar to those of the quasi-Poisson model (as the reader may wish to verify). The estimated scale parameter for the negative-binomial model is ω̂ = 1.312, with standard error SE(ω̂) = 0.143; we have, therefore, strong evidence that the conditional variance of the number of interlocks increases more rapidly than its expected value.[20]

[18] See Equation 15.3.

[19] A simpler form of the negative-binomial distribution is given in Appendix D on probability and estimation.

[20] See Exercise 15.1 for a test of overdispersion based on the negative-binomial GLM.

Zero-Inflated Poisson Regression

A particular kind of overdispersion obtains when there are more zeroes in the data than is consistent with a Poisson (or negative-binomial) distribution, a situation that can arise when only certain members of the population are at risk of a nonzero count. Imagine, for example, that
we are interested in modeling the number of children born to a woman. We might expect that this number is a partial function of such explanatory variables as marital status, age, ethnicity, religion, and contraceptive use. It is also likely, however, that some women (or their partners) are infertile and are distinct from fertile women who, though at risk for bearing children, happen to have none. If we knew which women are infertile, we could simply exclude them from the analysis, but let us suppose that this is not the case. To reiterate, there are two sources of zeroes in the data that cannot be perfectly distinguished: women who cannot bear children and those who can but have none.

Several statistical models have been proposed for count data with an excess of zeroes, including the zero-inflated Poisson regression (or ZIP) model, due to Lambert (1992). The ZIP model consists of two components: (1) A binary logistic-regression model for membership in the latent class of individuals for whom the response variable is necessarily 0 (e.g., infertile individuals)[21] and (2) a Poisson-regression model for the latent class of individuals for whom the response may be 0 or a positive count (e.g., fertile women).[22]
Let π_i represent the probability that the response Y_i for the ith individual is necessarily 0. Then

log_e[π_i/(1 − π_i)] = γ_0 + γ_1 z_i1 + γ_2 z_i2 + ⋯ + γ_p z_ip    (15.5)

where the z_ij are regressors for predicting membership in the first latent class; and

log_e μ_i = α + β_1 x_i1 + β_2 x_i2 + ⋯ + β_k x_ik    (15.6)

p(y_i | x_1, . . . , x_k) = μ_i^{y_i} e^{−μ_i}/y_i!    for y_i = 0, 1, 2, . . .

where μ_i ≡ E(Y_i) is the expected count for an individual in the second latent class, and the x_ij are regressors for the Poisson submodel. In applications, the two sets of regressors, the Xs and the Zs, are often the same, but this is not necessarily the case. Indeed, a particularly simple special case arises when the logistic submodel is log_e[π_i/(1 − π_i)] = γ_0, a constant, implying that the probability of membership in the first latent class is identical for all observations.
The probability of observing a 0 count is

p(0) ≡ Pr(Y_i = 0) = π_i + (1 − π_i)e^{−μ_i}

and the probability of observing any particular nonzero count y_i is

p(y_i) = (1 − π_i) μ_i^{y_i} e^{−μ_i}/y_i!

The conditional expectation and variance of Y_i are

E(Y_i) = (1 − π_i)μ_i
V(Y_i) = (1 − π_i)μ_i(1 + π_i μ_i)

with V(Y_i) > E(Y_i) for π_i > 0 [unlike a pure Poisson distribution, for which V(Y_i) = E(Y_i) = μ_i].[23]
[21] See Section 14.1 for a discussion of logistic regression.

[22] Although this form of the zero-inflated count model is the most common, Lambert (1992) also suggested the use of other binary GLMs for membership in the zero latent class (i.e., probit, log-log, and complementary log-log models) and the alternative use of the negative-binomial distribution for the count submodel (see Exercise 15.2).

[23] See Exercise 15.2.
Estimation of the ZIP model would be simple if we knew to which latent class each observation belongs, but, as I have pointed out, that is not true. Instead, we must maximize the somewhat more complex combined log likelihood for the two components of the ZIP model:[24]

log_e L(β, γ) = Σ_{y_i=0} log_e{exp(z_i′γ) + exp[−exp(x_i′β)]} + Σ_{y_i>0} [y_i x_i′β − exp(x_i′β)]
              − Σ_{i=1}^n log_e[1 + exp(z_i′γ)] − Σ_{y_i>0} log_e(y_i!)    (15.7)

where z_i′ ≡ [1, z_i1, . . . , z_ip], x_i′ ≡ [1, x_i1, . . . , x_ik], γ ≡ [γ_0, γ_1, . . . , γ_p]′, and β ≡ [α, β_1, . . . , β_k]′.
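Equation 15.7 translates nearly line by line into code. The sketch below is illustrative (scipy's general-purpose optimizer stands in for a purpose-built routine, and the simulated data are hypothetical); it evaluates the negative of the combined log likelihood so that it can be minimized:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import gammaln

    def zip_negloglik(params, X, Z, y):
        # Negative of the ZIP log likelihood in Equation 15.7; X and Z carry
        # a leading column of ones, and params stacks beta then gamma.
        beta, gamma = params[:X.shape[1]], params[X.shape[1]:]
        xb, zg = X @ beta, Z @ gamma
        zero = y == 0
        ll = (np.sum(np.logaddexp(zg[zero], -np.exp(xb[zero])))   # zero counts
              + np.sum(y[~zero] * xb[~zero] - np.exp(xb[~zero]))  # positive counts
              - np.sum(np.logaddexp(0.0, zg))                     # -log(1 + e^{z'gamma})
              - np.sum(gammaln(y[~zero] + 1)))                    # -log(y!)
        return -ll

    # Hypothetical usage with simulated data:
    rng = np.random.default_rng(1)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Z = X.copy()
    always_zero = rng.random(n) < 0.25
    y = np.where(always_zero, 0, rng.poisson(np.exp(0.5 + 0.8 * X[:, 1])))
    fit = minimize(zip_negloglik, x0=np.zeros(4), args=(X, Z, y), method="BFGS")
    print(fit.x)   # estimates of [alpha, beta_1, gamma_0, gamma_1]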
The basic GLM for count data is the Poisson model with log link. Frequently, however, when the response variable is a count, its conditional variance increases more rapidly than its mean, producing a condition termed overdispersion, and invalidating the use of the Poisson distribution. The quasi-Poisson GLM adds a dispersion parameter to handle overdispersed count data; this model can be estimated by the method of quasi-likelihood. A similar model is based on the negative-binomial distribution, which is not an exponential family. Negative-binomial GLMs can nevertheless be estimated by maximum likelihood. The zero-inflated Poisson regression model may be appropriate when there are more zeroes in the data than is consistent with a Poisson distribution.
15.2.2 Loglinear Models for Contingency Tables

The joint distribution of several categorical variables defines a contingency table. As discussed in the preceding chapter,[25] if one of the variables in a contingency table is treated as the response variable, we can fit a logit or probit model (that is, for a dichotomous response, a binomial GLM) to the table. Loglinear models, in contrast, which are models for the associations among the variables in a contingency table, treat the variables symmetrically; they do not distinguish one variable as the response. There is, however, a relationship between loglinear models and logit models that I will develop later in this section. As we will see as well, loglinear models have the formal structure of two-way and higher-way ANOVA models[26] and can be fit to data by Poisson regression.

Loglinear models for contingency tables have many specialized applications in the social sciences, for example to square tables, such as mobility tables, where the variables in the table have the same categories. The treatment of loglinear models in this section merely scratches the surface.[27]

[24] See Exercise 15.2.

[25] See Section 14.3.

[26] See Sections 8.2 and 8.3.

[27] More extensive accounts are available in many sources, including Agresti (2002), Fienberg (1980), and Powers and Xie (2000).
Table 15.4  Voter Turnout by Intensity of Partisan Preference, for the 1956 U.S. Presidential Election

Intensity of Preference    Voted    Did Not Vote    Total
Weak                       305      126             431
Medium                     405      125             530
Strong                     265      49              314
Total                      975      300             1275
Table 15.5  General Two-Way Frequency Table

                         Variable C
Variable R    1       2       ⋯    c       Total
1             Y_11    Y_12    ⋯    Y_1c    Y_1+
2             Y_21    Y_22    ⋯    Y_2c    Y_2+
⋮             ⋮       ⋮            ⋮       ⋮
r             Y_r1    Y_r2    ⋯    Y_rc    Y_r+
Total         Y_+1    Y_+2    ⋯    Y_+c    n
Two-Way Tables

I will examine contingency tables for two variables in some detail, for this is the simplest case, and the key results that I establish here extend straightforwardly to tables of higher dimension. Consider the illustrative two-way table shown in Table 15.4, constructed from data reported in The American Voter (Campbell, Converse, Miller, & Stokes, 1960), introduced in the previous chapter.[28] The table relates intensity of partisan preference to voting turnout in the 1956 U.S. presidential election. To anticipate my analysis, the data indicate that voting turnout is positively associated with intensity of partisan preference.
More generally, two categorical variables with r and c categories, respectively, define an r × c contingency table, as shown in Table 15.5, where Y_ij is the observed frequency count in the i, jth cell of the table. I use a + to represent summation over a subscript; thus Y_i+ ≡ Σ_{j=1}^c Y_ij is the marginal frequency in the ith row; Y_+j ≡ Σ_{i=1}^r Y_ij is the marginal frequency in the jth column; and n = Y_++ ≡ Σ_{i=1}^r Σ_{j=1}^c Y_ij is the number of observations in the sample.
I assume that the n observations in Table 15.5 are independently sampled from a population with proportion π_ij in cell i, j, and therefore that the probability of sampling an individual observation in this cell is π_ij. Marginal probability distributions π_i+ and π_+j may be defined as above; note that π_++ = 1. If the row and column variables are statistically independent in the population, then the joint probability π_ij is the product of the marginal probabilities for all i and j: π_ij = π_i+ π_+j.

Because the observed frequencies Y_ij result from drawing a random sample, they are random variables that generally take on different values in different samples.

[28] Table 14.9 examined the relationship of voter turnout to intensity of partisan preference and perceived closeness of the election. The current example collapses the table for these three variables over the categories of perceived closeness to examine the marginal table for turnout and preference. I return below to the analysis of the full three-way table.
The expected frequency in cell i, j is μ_ij ≡ E(Y_ij) = nπ_ij. If the variables are independent, then we have μ_ij = nπ_i+π_+j. Moreover, because μ_i+ = Σ_{j=1}^c nπ_ij = nπ_i+ and μ_+j = Σ_{i=1}^r nπ_ij = nπ_+j, we may write μ_ij = μ_i+μ_+j/n. Taking the log of both sides of this last equation produces

η_ij ≡ log_e μ_ij = log_e μ_i+ + log_e μ_+j − log_e n    (15.8)
That is, under independence, the log expected frequencies η_ij depend additively on the logs of the row marginal expected frequencies, the column marginal expected frequencies, and the sample size. As Fienberg (1980, pp. 13-14) points out, Equation 15.8 is reminiscent of a main-effects two-way ANOVA model, where log_e n plays the role of the constant, log_e μ_i+ and log_e μ_+j are analogous to main-effect parameters, and η_ij appears in place of the response-variable mean. If we impose ANOVA-like sigma constraints on the model, we may reparametrize Equation 15.8 as follows:

η_ij = μ + α_i + β_j    (15.9)

where α_+ ≡ Σ_i α_i = 0 and β_+ ≡ Σ_j β_j = 0. Equation 15.9 is the loglinear model for independence in the two-way table. Solving for the parameters of the model, we obtain
μ = η_++/(rc)    (15.10)
α_i = η_i+/c − μ
β_j = η_+j/r − μ
It is important to stress that although the loglinear model is formally similar to an ANOVA model, the meaning of the two models differs importantly: In analysis of variance, the α_i and β_j are main-effect parameters, specifying the partial relationship of the (quantitative) response variable to each explanatory variable. The loglinear model in Equation 15.9, in contrast, does not distinguish a response variable, and, because it is a model for independence, specifies that the row and column variables in the contingency table are unrelated; for this model, the α_i and β_j merely express the relationship of the log expected cell frequencies to the row and column marginals. The model for independence describes rc expected frequencies in terms of

1 + (r − 1) + (c − 1) = r + c − 1

independent parameters.
By analogy to the two-way ANOVA model, we can add parameters to extend the loglinear model to data for which the row and column classifications are not independent in the population but rather are related in an arbitrary manner:

η_ij = μ + α_i + β_j + γ_ij    (15.11)

where α_+ = β_+ = γ_i+ = γ_+j = 0 for all i and j. As before, we may write the parameters of the model in terms of the log expected counts η_ij. Indeed, the solutions for μ, α_i, and β_j are the same as in Equation 15.10, and

γ_ij = η_ij − μ − α_i − β_j

By analogy to the ANOVA model, the γ_ij in the loglinear model are often called interactions, but this usage is potentially confusing. I will therefore instead refer to the γ_ij as association parameters because they represent deviations from independence.
Under the model in Equation 15.11, called the saturated model for the two-way table, the number of independent parameters is equal to the number of cells in the table,

1 + (r − 1) + (c − 1) + (r − 1)(c − 1) = rc

The model is therefore capable of capturing any pattern of association in a two-way table.
Remarkably, maximum-likelihood estimates for the parameters of a loglinear model (that is, in the present case, either the model for independence in Equation 15.9 or the saturated model in Equation 15.11) may be obtained by treating the observed cell counts Y_ij as the response variable in a Poisson GLM; the log expected counts η_ij are then just the linear predictor for the GLM, as the notation suggests.[29]
The constraint that all γ_ij = 0 imposed by the model of independence can be tested by a likelihood-ratio test, contrasting the model of independence (Equation 15.9) with the more general model (Equation 15.11). Because the latter is a saturated model, its residual deviance is necessarily 0, and the likelihood-ratio statistic for the hypothesis of independence H_0: γ_ij = 0 is simply the residual deviance for the independence model, which has (r − 1)(c − 1) residual degrees of freedom. Applied to the illustrative two-way table for The American Voter data, we get G_0² = 19.428 with (3 − 1)(2 − 1) = 2 degrees of freedom, for which p < .0001, suggesting that there is strong evidence that intensity of preference and turnout are related.[30]
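This test is easy to reproduce by fitting the independence model as a Poisson GLM to the six cell counts of Table 15.4 (a sketch using Python's statsmodels, an assumption about software; the residual deviance of the fit is exactly the G_0² reported above):

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # The six cell counts of Table 15.4.
    cells = pd.DataFrame({
        "count":      [305, 126, 405, 125, 265, 49],
        "preference": ["Weak", "Weak", "Medium", "Medium", "Strong", "Strong"],
        "turnout":    ["Voted", "Did not", "Voted", "Did not", "Voted", "Did not"],
    })

    # Loglinear model of independence: main effects only, no association term.
    indep = smf.glm("count ~ C(preference) + C(turnout)",
                    data=cells, family=sm.families.Poisson()).fit()
    print(indep.deviance, indep.df_resid)   # G2 = 19.43 on 2 df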
Maximum-likelihood estimates of the parameters of the saturated loglinear model are shown in Table 15.6. It is clear from the estimated association parameters γ̂_ij that turning out to vote, j = 1, increases with partisan preference (and, of course, that not turning out to vote, j = 2, decreases with preference).
Three-Way Tables

The saturated loglinear model for a three-way (a × b × c) table for variables A, B, and C is defined in analogy to the three-way ANOVA model, although, as in the case of two-way tables, the meaning of the parameters is different:

η_ijk = μ + α_A(i) + α_B(j) + α_C(k) + α_AB(ij) + α_AC(ik) + α_BC(jk) + α_ABC(ijk)    (15.12)
[29] *The reason that this result is remarkable is that a direct route to a likelihood function for the loglinear model leads to the multinomial distribution (discussed in Appendix D on probability and estimation), not to the Poisson distribution. That is, selecting n independent observations from a population characterized by cell probabilities π_ij results in cell counts following the multinomial distribution,

p(y_11, . . . , y_rc) = [n!/(∏_{i=1}^r ∏_{j=1}^c y_ij!)] ∏_{i=1}^r ∏_{j=1}^c π_ij^{y_ij} = [n!/(∏_{i=1}^r ∏_{j=1}^c y_ij!)] ∏_{i=1}^r ∏_{j=1}^c (μ_ij/n)^{y_ij}

Noting that the expected counts μ_ij are functions of the parameters of the loglinear model leads to the multinomial likelihood function for the model. It turns out that maximizing this multinomial likelihood is equivalent to maximizing the likelihood for the Poisson GLM described in the text (see, e.g., Fienberg, 1980, app. II).

[30] This test is very similar to the usual Pearson chi-square test for independence in a two-way table. See Exercise 15.3 for details, and for an alternative formula for calculating the likelihood-ratio test statistic G_0² directly from the observed frequencies, Y_ij, and estimated expected frequencies under independence, μ̂_ij.

Table 15.6  Estimated Parameters for the Saturated Loglinear Model Fit to Table 15.4

               γ̂_ij
i          j = 1      j = 2       α̂_i
1          −0.183     0.183       0.135
2          −0.037     0.037       0.273
3          0.219      −0.219      −0.408
β̂_j        0.625      −0.625      μ̂ = 5.143

with sigma constraints specifying that each set of parameters sums to zero over each subscript; for example, α_A(+) = α_AB(i+) = α_ABC(ij+) = 0. Given these constraints, we may solve for the parameters in terms of the log expected counts, with the solution following the usual ANOVA pattern; for example,
μ = η_+++/(abc)

α_A(i) = η_i++/(bc) − μ

α_AB(ij) = η_ij+/c − μ − α_A(i) − α_B(j)

α_ABC(ijk) = η_ijk − μ − α_A(i) − α_B(j) − α_C(k) − α_AB(ij) − α_AC(ik) − α_BC(jk)
The presence of the three-way term α_ABC in the model implies that the relationship between any pair of variables (say, A and B) depends on the category of the third variable (say, C).[31]

Other loglinear models are defined by suppressing certain terms in the saturated model, that is, by setting parameters to zero. In specifying a restricted loglinear model, we will be guided by the principle of marginality:[32] Whenever a high-order term is included in the model, its lower-order relatives are included as well. Loglinear models of this type are often called hierarchical. Nonhierarchical loglinear models may be suitable for special applications, but they are not sensible in general (see Fienberg, 1980). According to the principle of marginality, for example, if α_AB appears in the model, so do α_A and α_B.
If we set all of α_ABC, α_AB, α_AC, and α_BC to zero, we produce the model of mutual independence, implying that the variables in the three-way table are completely unrelated:

η_ijk = μ + α_A(i) + α_B(j) + α_C(k)

Setting α_ABC, α_AC, and α_BC to zero yields the model

η_ijk = μ + α_A(i) + α_B(j) + α_C(k) + α_AB(ij)

which specifies (1) that variables A and B are related, controlling for (i.e., within categories of) variable C; (2) that this partial relationship is constant across the categories of variable C; and (3) that variable C is independent of variables A and B taken jointly; that is, if we form the two-way table with rows given by combinations of categories of A and B, and columns given by C, the two variables in this table are independent. Note that there are two other models of this sort: one in which α_AC is nonzero and another in which α_BC is nonzero.
[31] Here and below I use the shorthand notation α_ABC to represent the whole set of α_ABC(ijk), and similarly for the other terms in the model.

[32] See Section 7.3.2.
Table 15.7  Voter Turnout by Perceived Closeness of the Election and Intensity of Partisan Preference, for the 1956 U.S. Presidential Election

                                                          (C) Turnout
(A) Perceived Closeness    (B) Intensity of Preference    Voted    Did Not Vote
One-sided                  Weak                           91       39
                           Medium                         121      49
                           Strong                         64       24
Close                      Weak                           214      87
                           Medium                         284      76
                           Strong                         201      25
A third type of model has two nonzero two-way terms; for example, setting α_ABC and α_BC to zero, we obtain

η_ijk = μ + α_A(i) + α_B(j) + α_C(k) + α_AB(ij) + α_AC(ik)

This model implies that (1) variables A and B have a constant partial relationship across the categories of variable C; (2) variables A and C have a constant partial relationship across the categories of variable B; and (3) variables B and C are independent within categories of variable A. Again, there are two other models of this type.
Finally, consider the model that sets only the three-way term α_ABC to zero:

η_ijk = μ + α_A(i) + α_B(j) + α_C(k) + α_AB(ij) + α_AC(ik) + α_BC(jk)

This model specifies that each pair of variables (e.g., A and B) has a constant partial association across the categories of the remaining variable (e.g., C).

These descriptions are relatively complicated because the loglinear models are models of association among variables. As we will see presently, however, if one of the variables in a table is taken as the response variable, then the loglinear model is equivalent to a logit model with a simpler interpretation.
Table 15.7 shows a three-way table cross-classifying voter turnout by perceived closeness of the election and intensity of partisan preference, elaborating the two-way table for The American Voter data presented earlier in Table 15.4.[33] I have fit all hierarchical loglinear models to this three-way table, displaying the results in Table 15.8. Here I employ a compact notation for the high-order terms in each fitted model: For example, AB represents the two-way term α_AB and implies that the lower-order relatives of this term, α_A and α_B, are also in the model. As in the loglinear model for a two-way table, the saturated model has a residual deviance of 0, and consequently the likelihood-ratio statistic to test any model against the saturated model (within which all of the other models are nested, and which is the last model shown) is simply the residual deviance for the unsaturated model.

The first model in Table 15.8 is the model of complete independence, and it fits the data very poorly. At the other end, the model with high-order terms AB, AC, and BC, which may be used to test the hypothesis of no three-way association, H_0: all α_ABC(ijk) = 0, also has a statistically significant likelihood-ratio test statistic (though not overwhelmingly so), suggesting that the association between any pair of variables in the contingency table varies over the levels of the remaining variable.

[33] This table was also discussed in Chapter 14 (see Table 14.9).
Table 15.8  Hierarchical Loglinear Models Fit to Table 15.7

                   Residual Degrees of Freedom
High-Order Terms   General                                               Table 15.7    G_0²     p
A, B, C            (a−1)(b−1)+(a−1)(c−1)+(b−1)(c−1)+(a−1)(b−1)(c−1)      7             36.39    <.0001
AB, C              (a−1)(c−1)+(b−1)(c−1)+(a−1)(b−1)(c−1)                 5             34.83    <.0001
AC, B              (a−1)(b−1)+(b−1)(c−1)+(a−1)(b−1)(c−1)                 6             27.78    <.0001
A, BC              (a−1)(b−1)+(a−1)(c−1)+(a−1)(b−1)(c−1)                 5             16.96    .0046
AB, AC             (b−1)(c−1)+(a−1)(b−1)(c−1)                            4             26.22    <.0001
AB, BC             (a−1)(c−1)+(a−1)(b−1)(c−1)                            3             15.40    .0015
AC, BC             (a−1)(b−1)+(a−1)(b−1)(c−1)                            4             8.35     .079
AB, AC, BC         (a−1)(b−1)(c−1)                                       2             7.12     .028
ABC                0                                                     0             0.0

NOTE: The column labeled G_0² is the likelihood-ratio statistic for testing each model against the saturated model.
This approach generalizes to contingency tables of any dimension, although the interpretation
of high-order association terms can become complicated.
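These loglinear models can be fit directly as Poisson GLMs. Below is a minimal sketch in Python using statsmodels (an illustrative choice of software, not used in the text); the data frame simply lists the 12 cell counts of Table 15.7, and the variable names are my own. Fitting the model with high-order terms AB, AC, and BC should reproduce, up to rounding, the residual deviance G²₀ = 7.12 on 2 degrees of freedom reported in Table 15.8.

```python
# A sketch of fitting a hierarchical loglinear model as a Poisson GLM with a
# log link; variable names are illustrative, not from the text.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

cells = pd.DataFrame({
    "closeness":  ["one-sided"] * 6 + ["close"] * 6,
    "preference": ["weak", "weak", "medium", "medium", "strong", "strong"] * 2,
    "turnout":    ["voted", "not"] * 6,
    "count":      [91, 39, 121, 49, 64, 24, 214, 87, 284, 76, 201, 25],
})

# High-order terms AB, AC, BC (no three-way association):
fit_no3way = smf.glm("count ~ closeness*preference + closeness*turnout"
                     " + preference*turnout",
                     data=cells, family=sm.families.Poisson()).fit()
print(fit_no3way.deviance, fit_no3way.df_resid)  # residual deviance and df
```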
Loglinear Models and Logit Models
As I explained, the loglinear model for a contingency table is a model for association among the variables in the table; the variables are treated symmetrically, and none is distinguished as the response variable. When one of the variables in a contingency table is regarded as the response, however, the loglinear model for the table implies a logit model (identical to the logit model for a contingency table developed in Chapter 14), the parameters of which bear a simple relationship to the parameters of the loglinear model for the table.

For example, it is natural to regard voter turnout in Table 15.7 as a dichotomous response variable, potentially affected by perceived closeness of the election and by intensity of partisan preference. Indeed, this is precisely what we did previously when we analyzed this table using a logit model.³⁴ With this example in mind, let us return to the saturated loglinear model for the three-way table (repeating Equation 15.12):

    η_ijk = μ + λ_A(i) + λ_B(j) + λ_C(k) + λ_AB(ij) + λ_AC(ik) + λ_BC(jk) + λ_ABC(ijk)
For convenience, I suppose that the response variable is variable C, as in the illustration. Let Λ_ij symbolize the response-variable logit within categories i, j of the two explanatory variables; that is,

    Λ_ij = log_e(π_ij1/π_ij2) = log_e(μ_ij1/μ_ij2) = η_ij1 − η_ij2
Then, from the saturated loglinear model for η_ijk,

    Λ_ij = [λ_C(1) − λ_C(2)] + [λ_AC(i1) − λ_AC(i2)] + [λ_BC(j1) − λ_BC(j2)] + [λ_ABC(ij1) − λ_ABC(ij2)]    (15.13)

³⁴See Section 14.3.
Noting that the first bracketed term in Equation 15.13 does not depend on the explanatory variables, that the second depends only upon variable A, and so forth, let us rewrite this equation in the following manner:

    Λ_ij = ω + ω_A(i) + ω_B(j) + ω_AB(ij)    (15.14)

where, because of the sigma constraints on the λs,

    ω      ≡ λ_C(1) − λ_C(2)       = 2λ_C(1)
    ω_A(i) ≡ λ_AC(i1) − λ_AC(i2)   = 2λ_AC(i1)
    ω_B(j) ≡ λ_BC(j1) − λ_BC(j2)   = 2λ_BC(j1)
    ω_AB(ij) ≡ λ_ABC(ij1) − λ_ABC(ij2) = 2λ_ABC(ij1)

Furthermore, because they are defined as twice the λs, the ωs are also constrained to sum to zero over any subscript:

    ω_A(+) = ω_B(+) = ω_AB(i+) = ω_AB(+j) = 0,  for all i and j
Note that the loglinear-model parameters for the association of the explanatory variables A and B do not appear in Equation 15.13. This equation (or, equivalently, Equation 15.14), the saturated logit model for the table, therefore shows how the response-variable log-odds depend on the explanatory variables and their interactions. In light of the constraints that they satisfy, the ωs are interpretable as ANOVA-like effect parameters, and indeed we have returned to the binomial logit model for a contingency table introduced in the previous chapter: Note, for example, that the likelihood-ratio test for the three-way term in the loglinear model for the American Voter data (given in the penultimate line of Table 15.8) is identical to the likelihood-ratio test for the interaction between closeness and preference in the logit model fit to these data (see Table 14.11 on page 373).
A similar argument may also be pursued with respect to any unsaturated loglinear model for the three-way table: Each such model implies a model for the response-variable logits. Because, however, our purpose is to examine the effects of the explanatory variables on the response, and not to explore the association between the explanatory variables, we generally include λ_AB and its lower-order relatives in any model that we fit, thereby treating the association (if any) between variables A and B as given. Furthermore, a similar argument to the one developed here can be applied to a table of any dimension that has a response variable, and to a response variable with more than two categories. In the latter event, the loglinear model is equivalent to a multinomial logit model for the table, and in any event, we would generally include in the loglinear model a term of dimension one less than the table, corresponding to all associations among the explanatory variables.
Loglinear models for contingency tables bear a formal resemblance to analysis-of-variance models and can be fit to data as Poisson generalized linear models with a log link. The loglinear model for a contingency table, however, treats the variables in the table symmetrically (none of the variables is distinguished as a response variable), and consequently the parameters of the model represent the associations among the variables, not the effects of explanatory variables on a response. When one of the variables is construed as the response, the loglinear model reduces to a binomial or multinomial logit model.
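To make the equivalence concrete, here is a hedged continuation of the earlier sketch: treating turnout as the response, the binomial logit model with main effects of closeness and preference should have the same residual deviance as the loglinear model with high-order terms AB, AC, and BC (7.12 on 2 degrees of freedom), and the likelihood-ratio test of the logit interaction matches the test of the three-way loglinear term.

```python
# Sketch: the loglinear model {AB, AC, BC} corresponds to an additive
# binomial logit model for turnout. 'cells' is the data frame defined above.
import patsy

wide = cells.pivot_table(index=["closeness", "preference"],
                         columns="turnout", values="count").reset_index()
y = wide[["voted", "not"]].to_numpy()       # (successes, failures) per cell
X = patsy.dmatrix("closeness + preference", wide)
logit_fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(logit_fit.deviance)   # should agree with the loglinear G² of 7.12
```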
15.3 Statistical Theory for Generalized Linear Models*

In this section, I revisit with greater rigor and more detail many of the points raised in the preceding sections.³⁵
15.3.1 Exponential Families

As much else in modern statistics, the insight that many of the most important distributions in statistics could be expressed in the following common linear-exponential form was due to R. A. Fisher:

    p(y; θ, φ) = exp{ [yθ − b(θ)]/a(φ) + c(y, φ) }    (15.15)

where

• p(y; θ, φ) is the probability function for the discrete random variable Y, or the probability-density function for continuous Y.
• a(·), b(·), and c(·) are known functions that vary from one exponential family to another (see below for examples).
• θ = g_c(μ), the canonical parameter for the exponential family in question, is a function of the expectation μ ≡ E(Y) of Y; moreover, the canonical link function g_c(·) does not depend on φ.
• φ > 0 is a dispersion parameter, which, in some families, takes on a fixed, known value, while in other families it is an unknown parameter to be estimated from the data along with θ.
Consider, for example, the normal or Gaussian distribution with mean μ and variance σ², the density function for which is given in Equation 15.1 (on page 382). To put the normal distribution in the form of Equation 15.15 requires some heroic algebraic manipulation, eventually producing³⁶

    p(y; θ, φ) = exp{ (yμ − μ²/2)/σ² − ½[y²/σ² + log_e(2πσ²)] }

with θ = g_c(μ) = μ; φ = σ²; a(φ) = φ; b(θ) = θ²/2; and c(y, φ) = −½[y²/φ + log_e(2πφ)].
Now consider the binomial distribution in Equation 15.2 (page 382), where Y is the proportion of successes in n independent binary trials, and π is the probability of success on an individual trial. Written after more algebraic gymnastics as an exponential family,³⁷

    p(y; θ, φ) = exp{ [yθ − log_e(1 + e^θ)]/(1/n) + log_e (n choose ny) }

with θ = g_c(μ) = log_e[μ/(1 − μ)]; φ = 1; a(φ) = 1/n; b(θ) = log_e(1 + e^θ); and c(y, φ) = log_e (n choose ny).
Similarly, the Poisson, gamma, and inverse-Gaussian families can all be put into the form of Equation 15.15, using the results given in Table 15.9.³⁸

³⁵The exposition here owes a debt to Chapter 2 of McCullagh and Nelder (1989), which has become the standard source on GLMs, and to the remarkably lucid and insightful briefer treatment of the topic by Firth (1991).
³⁶See Exercise 15.4.
³⁷See Exercise 15.5.
³⁸See Exercise 15.6.
Table 15.9  Functions a(·), b(·), and c(·) for Constructing the Exponential Families

Family             a(φ)    b(θ)              c(y, φ)
Gaussian           φ       θ²/2              −½[y²/φ + log_e(2πφ)]
Binomial           1/n     log_e(1 + e^θ)    log_e (n choose ny)
Poisson            1       e^θ               −log_e y!
Gamma              φ       −log_e(−θ)        φ⁻¹ log_e(y/φ) − log_e y − log_e Γ(φ⁻¹)
Inverse-Gaussian   φ       −√(−2θ)           −½[log_e(2πφy³) + 1/(φy)]

NOTE: In this table, n is the number of binomial observations, and Γ(·) is the gamma function.
The advantage of expressing diverse families of distributions in the common exponential form is that general properties of exponential families can then be applied to the individual cases. For example, it is true in general that

    b′(θ) ≡ db(θ)/dθ = μ

and that

    V(Y) = a(φ)b″(θ) = a(φ) d²b(θ)/dθ² = a(φ)v(μ)

leading to the results in Table 15.2 (on page 382).³⁹ Note that b′(θ) is the inverse of the canonical link function. For example, for the normal distribution,

    b′(θ) = d(θ²/2)/dθ = θ = μ
    a(φ)b″(θ) = φ × 1 = σ²
    v(μ) = 1

and for the binomial distribution,

    b′(θ) = d[log_e(1 + e^θ)]/dθ = e^θ/(1 + e^θ) = 1/(1 + e^−θ) = μ
    a(φ)b″(θ) = (1/n) × [ e^θ/(1 + e^θ) − (e^θ/(1 + e^θ))² ] = μ(1 − μ)/n
    v(μ) = μ(1 − μ)
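These identities are easy to check numerically. The sketch below (my own illustration, not from the text) differentiates the binomial cumulant function b(θ) by finite differences and confirms that b′(θ) recovers μ and that a(φ)b″(θ) recovers μ(1 − μ)/n.

```python
# Numerical check of mu = b'(theta) and V(Y) = a(phi) b''(theta) for the
# binomial family, using central finite differences.
import numpy as np

def b(theta):                       # binomial cumulant function
    return np.log(1 + np.exp(theta))

theta, n, h = 0.4, 10, 1e-4
mu = 1 / (1 + np.exp(-theta))                        # inverse canonical link
b1 = (b(theta + h) - b(theta - h)) / (2 * h)         # approximates b'(theta)
b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2   # approximates b''
print(np.isclose(b1, mu), np.isclose(b2 / n, mu * (1 - mu) / n))
```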
The Gaussian, binomial, Poisson, gamma, and inverse-Gaussian distributions can all be written in the common linear-exponential form:

    p(y; θ, φ) = exp{ [yθ − b(θ)]/a(φ) + c(y, φ) }

where a(·), b(·), and c(·) are known functions that vary from one exponential family to another; θ = g_c(μ) is the canonical parameter for the exponential family in question; g_c(·) is the canonical link function; and φ > 0 is a dispersion parameter, which takes on a fixed, known value in some families. It is generally the case that μ = E(Y) = b′(θ) and that V(Y) = a(φ)b″(θ).

³⁹See Exercise 15.7.
15.3.2 Maximum-Likelihood Estimation of Generalized Linear Models

The log likelihood for an individual observation Y_i follows directly from Equation 15.15 (page 402):

    log_e L(θ_i, φ; Y_i) = [Y_i θ_i − b(θ_i)]/a_i(φ) + c(Y_i, φ)

For n independent observations, we have

    log_e L(θ, φ; y) = Σ_{i=1}^{n} { [Y_i θ_i − b(θ_i)]/a_i(φ) + c(Y_i, φ) }    (15.16)

where θ ≡ {θ_i} and y ≡ {Y_i}.
Suppose that a GLM uses the link function g(·), so that⁴⁰

    g(μ_i) = η_i = β₀ + β₁X_i1 + β₂X_i2 + ··· + β_k X_ik

The model therefore expresses the expected values of the n observations in terms of a much smaller number of regression parameters. To get estimating equations for the regression parameters, we have to differentiate the log likelihood with respect to each coefficient in turn. Let l_i represent the ith component of the log likelihood. Then, by the chain rule,

    ∂l_i/∂β_j = (dl_i/dθ_i) × (dθ_i/dμ_i) × (dμ_i/dη_i) × (∂η_i/∂β_j),  for j = 0, 1, ..., k    (15.17)
After some work, we can rewrite Equation 15.17 as⁴¹

    ∂l_i/∂β_j = [(Y_i − μ_i)/(a_i(φ)v(μ_i))] × (dμ_i/dη_i) × x_ij
Summing over observations, and setting the sum to zero, produces the maximum-likelihood estimating equations for the GLM,

    Σ_{i=1}^{n} [(Y_i − μ_i)/(a_i v(μ_i))] × (dμ_i/dη_i) × x_ij = 0,  for j = 0, 1, ..., k    (15.18)

where a_i ≡ a_i(φ)/φ does not depend upon the dispersion parameter, which is constant across observations. For example, in a Gaussian GLM, a_i = 1, while in a binomial GLM, a_i = 1/n_i.

⁴⁰It is notationally convenient here to write β₀ for the regression constant α.
⁴¹See Exercise 15.8.
Further simplification can be achieved when g(·) is the canonical link. In this case, the maximum-likelihood estimating equations become

    Σ_{i=1}^{n} Y_i x_ij / a_i = Σ_{i=1}^{n} μ_i x_ij / a_i

setting the observed sum on the left of the equation to the expected sum on the right. We noted this pattern in the estimating equations for logistic-regression models in the previous chapter.⁴² Nevertheless, even here the estimating equations are (except in the case of the Gaussian family paired with the identity link) nonlinear functions of the regression parameters and generally require iterative methods for their solution.
Iterative Weighted Least Squares

Let

    Z_i ≡ η_i + (Y_i − μ_i) dη_i/dμ_i = η_i + (Y_i − μ_i)g′(μ_i)

Then

    E(Z_i) = η_i = β₀ + β₁X_i1 + β₂X_i2 + ··· + β_k X_ik

and

    V(Z_i) = [g′(μ_i)]² a_i v(μ_i)

If, therefore, we could compute the Z_i, we would be able to fit the model by weighted least-squares regression of Z on the Xs, using the inverses of the V(Z_i) as weights.⁴³ Of course, this is not the case because we do not know the values of the η_i and μ_i, which, indeed, depend on the regression coefficients that we wish to estimate; that is, the argument is essentially circular. This observation suggested to Nelder and Wedderburn (1972) the possibility of estimating GLMs by iterative weighted least squares (IWLS), cleverly turning the circularity into an iterative procedure:
1. Start with initial estimates of the μ_i and the η_i = g(μ_i), denoted μ_i^(0) and η_i^(0). A simple choice is to set μ_i^(0) = Y_i.⁴⁴

2. At each iteration l, compute the working response variable Z using the values of μ and η from the preceding iteration,

       Z_i^(l−1) = η_i^(l−1) + [Y_i − μ_i^(l−1)] g′(μ_i^(l−1))
⁴²See Sections 14.1.5 and 14.2.1.
⁴³See Section 12.2.2 for a general discussion of weighted least squares.
⁴⁴In certain settings, starting with μ_i^(0) = Y_i can cause computational difficulties. For example, in a binomial GLM, some of the observed proportions may be 0 or 1 (indeed, for binary data, this will be true for all the observations), requiring us to divide by 0 or to take the log of 0. The solution is to adjust the starting values, which are in any event not critical, to protect against this possibility. For a binomial GLM, where Y_i = 0, we can take μ_i^(0) = 0.5/n_i, and where Y_i = 1, we can take μ_i^(0) = (n_i − 0.5)/n_i. For binary data, then, all the μ_i^(0) are 0.5.
   along with weights

       W_i^(l−1) = 1 / { [g′(μ_i^(l−1))]² a_i v(μ_i^(l−1)) }

3. Fit a weighted least-squares regression of Z^(l−1) on the Xs, using the W^(l−1) as weights. That is, compute

       b^(l) = (X′W^(l−1)X)⁻¹ X′W^(l−1) z^(l−1)

   where b^(l) (k+1 × 1) is the vector of regression coefficients at the current iteration; X (n × k+1) is (as usual) the model matrix; W^(l−1) (n × n) ≡ diag{W_i^(l−1)} is the diagonal weight matrix; and z^(l−1) (n × 1) ≡ {Z_i^(l−1)} is the working-response vector.
4. Repeat Steps 2 and 3 until the regression coefficients stabilize, at which point b converges to the maximum-likelihood estimates of the βs.

Applied to the canonical link, IWLS is equivalent to the Newton-Raphson method (as we discovered for a logit model in the previous chapter); more generally, IWLS implements Fisher's method of scoring.
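The following is a minimal IWLS sketch for a Poisson GLM with log link, transcribing the four steps above into Python with numpy; it is written for clarity rather than numerical robustness, and the adjustment of starting values at zero counts follows the spirit of footnote 44.

```python
# Minimal IWLS for a Poisson GLM with log link (Steps 1-4 above).
import numpy as np

def iwls_poisson(X, y, n_iter=25, tol=1e-8):
    mu = np.where(y > 0, y, 0.5)     # Step 1: initial mu, adjusted at y = 0
    eta = np.log(mu)
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Step 2: working response and weights; for the Poisson family with
        # log link, g'(mu) = 1/mu, a_i = 1, and v(mu) = mu, so W_i = mu_i
        z = eta + (y - mu) / mu
        w = mu
        # Step 3: weighted least-squares regression of z on the columns of X
        b_new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
        if np.max(np.abs(b_new - b)) < tol:  # Step 4: coefficients stable?
            return b_new
        b = b_new
        eta = X @ b
        mu = np.exp(eta)
    return b
```

Run against the same model matrix, the coefficients from this sketch should agree with those from a packaged Poisson GLM routine to several decimal places.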
Estimating the Dispersion Parameter

Note that we do not require an estimate of the dispersion parameter φ to estimate the regression coefficients in a GLM. Although it is in principle possible to estimate φ by maximum likelihood as well, this is rarely done. Instead, recall that V(Y_i) = φ a_i v(μ_i). Solving for the dispersion parameter, we get φ = V(Y_i)/[a_i v(μ_i)], suggesting the method of moments estimator

    φ̃ = [1/(n − k − 1)] Σ (Y_i − μ̂_i)² / [a_i v(μ̂_i)]    (15.19)

The estimated asymptotic covariance matrix of the coefficients is then obtained from the last IWLS iteration as

    V̂(b) = φ̃ (X′WX)⁻¹

Because the maximum-likelihood estimator b is asymptotically normally distributed, V̂(b) may be used as the basis for Wald tests of the regression parameters.
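Continuing the earlier sketch, the method-of-moments estimator of Equation 15.19 and the coefficient covariance matrix can be computed from the converged fit; for the Poisson family, a_i v(μ̂_i) = μ̂_i and the IWLS weights are μ̂_i.

```python
# Dispersion estimate (Equation 15.19) and covariance matrix for the
# Poisson/log-link sketch above; b is the converged coefficient vector.
def dispersion_and_cov(X, y, b):
    mu = np.exp(X @ b)
    phi = np.sum((y - mu) ** 2 / mu) / (len(y) - X.shape[1])  # n - k - 1
    cov = phi * np.linalg.inv(X.T @ (X * mu[:, None]))        # phi (X'WX)^-1
    return phi, cov
```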
The maximum-likelihood estimating equations for generalized linear models take the common form

    Σ_{i=1}^{n} [(Y_i − μ_i)/(a_i v(μ_i))] (dμ_i/dη_i) x_ij = 0,  for j = 0, 1, ..., k

These equations are generally nonlinear and therefore have no general closed-form solution, but they can be solved by iterated weighted least squares (IWLS). The estimating equations for the coefficients do not involve the dispersion parameter, which (for models in which the dispersion is not fixed) then can be estimated as

    φ̃ = [1/(n − k − 1)] Σ (Y_i − μ̂_i)² / [a_i v(μ̂_i)]

The estimated asymptotic covariance matrix of the coefficients is

    V̂(b) = φ̃ (X′WX)⁻¹

where b is the vector of estimated coefficients and W is a diagonal matrix of weights from the last IWLS iteration.
Quasi-Likelihood Estimation

The argument leading to IWLS estimation rests only on the linearity of the relationship between η = g(μ) and the Xs, and on the assumption that V(Y) depends in a particular manner on a dispersion parameter and μ. As long as we can express the transformed mean of Y as a linear function of the Xs, and can write down a variance function for Y (expressing the conditional variance of Y as a function of its mean and a dispersion parameter), we can apply the maximum-likelihood estimating equations (Equation 15.18 on page 404) and obtain estimates by IWLS, even without committing ourselves to a particular conditional distribution for Y.

This is the method of quasi-likelihood estimation, introduced by Wedderburn (1974), and it has been shown to retain many of the properties of maximum-likelihood estimation: Although the quasi-likelihood estimator may not be maximally asymptotically efficient, it is consistent and has the same asymptotic distribution as the maximum-likelihood estimator of a GLM in an exponential family.⁴⁵ We can think of quasi-likelihood estimation of GLMs as analogous to least-squares estimation of linear regression models with potentially non-normal errors: Recall that as long as the relationship between Y and the Xs is linear, the error variance is constant, and the observations are independently sampled, the theory underlying OLS estimation applies, although the OLS estimator may no longer be maximally efficient.⁴⁶
The maximum-likelihood estimating equations, and IWLS estimation, can be applied whenever we can express the transformed mean of Y as a linear function of the Xs, and can write the conditional variance of Y as a function of its mean and (possibly) a dispersion parameter, even when we do not specify a particular conditional distribution for Y. The resulting quasi-likelihood estimator shares many of the properties of maximum-likelihood estimators.
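In practice, quasi-Poisson estimates can be obtained by fitting a Poisson GLM and estimating the dispersion from the Pearson statistic. The sketch below assumes an `ornstein` data frame with `interlocks`, `assets`, `nation`, and `sector` columns (placeholder names for the interlocking-directorate data discussed in this chapter).

```python
# Quasi-Poisson sketch: Poisson GLM with dispersion estimated from the
# Pearson chi-square statistic (scale="X2" in statsmodels).
import statsmodels.api as sm
import statsmodels.formula.api as smf

qp_fit = smf.glm("interlocks ~ assets + nation + sector", data=ornstein,
                 family=sm.families.Poisson()).fit(scale="X2")
print(qp_fit.scale)   # the estimated dispersion parameter
```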
⁴⁵See, for example, McCullagh and Nelder (1989, chap. 9) and McCullagh (1991).
⁴⁶See Chapter 9.
15.3.3 Hypothesis Tests

Analysis of Deviance

Originally (in Equation 15.16 on page 404), I wrote the log likelihood for a GLM as a function log_e L(θ, φ; y) of the canonical parameters for the observations. Because μ_i = g_c⁻¹(θ_i), for the canonical link g_c(·), we can equally well think of the log likelihood as a function of the expected response, and therefore can write the maximized log likelihood as log_e L(μ̂, φ; y). If we then dedicate a parameter to each observation, so that μ̂_i = Y_i (e.g., by removing the constant from the regression model and defining a dummy regressor for each observation), the log likelihood becomes log_e L(y, φ; y). The residual deviance under the initial model is twice the difference in these log likelihoods:

    D(y; μ̂) ≡ 2[log_e L(y, φ; y) − log_e L(μ̂, φ; y)]    (15.20)
            = 2 Σ_{i=1}^{n} [log_e L(Y_i, φ; Y_i) − log_e L(μ̂_i, φ; Y_i)]
            = 2 Σ_{i=1}^{n} { Y_i[g(Y_i) − g(μ̂_i)] − b[g(Y_i)] + b[g(μ̂_i)] } / a_i

Dividing the residual deviance by the estimated dispersion parameter produces the scaled deviance, D*(y; μ̂) ≡ D(y; μ̂)/φ̃. As explained in Section 15.1.1, deviances are the building blocks of likelihood-ratio and F-tests for GLMs.
Applying Equation 15.20 to the Gaussian distribution, where g_c(·) is the identity link, a_i = 1, and b(θ) = θ²/2, produces (after some simplification)

    D(y; μ̂) = Σ (Y_i − μ̂_i)²

that is, the residual sum of squares for the model. Similarly, applying Equation 15.20 to the binomial distribution, where g_c(·) is the logit link, a_i = 1/n_i, and b(θ) = log_e(1 + e^θ), we get (after quite a bit of simplification)⁴⁷

    D(y; μ̂) = 2 Σ n_i [ Y_i log_e(Y_i/μ̂_i) + (1 − Y_i) log_e((1 − Y_i)/(1 − μ̂_i)) ]
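As a quick check on these formulas, the Poisson deviance (derived in Exercise 15.9) can be computed directly from a fitted model and compared with the value the software reports; `fit` below stands for any fitted Poisson GLM, as in the earlier sketches.

```python
# Verify the Poisson residual deviance formula against a fitted model.
import numpy as np

y, mu = fit.model.endog, np.asarray(fit.fittedvalues)
with np.errstate(divide="ignore", invalid="ignore"):
    term = np.where(y > 0, y * np.log(y / mu), 0.0)  # y log(y/mu), 0 at y=0
dev = 2 * np.sum(term - (y - mu))
print(np.isclose(dev, fit.deviance))
```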
The residual deviance for a model is twice the difference in the log likelihoods for the saturated model, which dedicates one parameter to each observation, and the model in question:

    D(y; μ̂) ≡ 2[log_e L(y, φ; y) − log_e L(μ̂, φ; y)]
            = 2 Σ_{i=1}^{n} { Y_i[g(Y_i) − g(μ̂_i)] − b[g(Y_i)] + b[g(μ̂_i)] } / a_i

Dividing the residual deviance by the estimated dispersion parameter produces the scaled deviance, D*(y; μ̂) ≡ D(y; μ̂)/φ̃.

⁴⁷See Exercise 15.9, which also develops formulas for the deviance in Poisson, gamma, and inverse-Gaussian models.
Testing General Linear Hypotheses

As was the case for linear models,⁴⁸ we can formulate a test for the general linear hypothesis

    H₀: L β = c
        (q × k+1)(k+1 × 1) = (q × 1)

where the hypothesis matrix L and right-hand-side vector c contain pre-specified constants; usually, c = 0. For a GLM, the Wald statistic

    Z₀² = (Lb − c)′ [L V̂(b) L′]⁻¹ (Lb − c)

follows an asymptotic chi-square distribution with q degrees of freedom under the hypothesis. The simplest application of this result is to the Wald statistic Z₀ = B_j/SE(B_j), testing that an individual regression coefficient is zero. Here, Z₀ follows a standard-normal distribution under H₀: β_j = 0 (or, equivalently, Z₀² follows a chi-square distribution with one degree of freedom). Alternatively, when the dispersion parameter is estimated from the data, we can calculate the test statistic

    F₀ = (Lb − c)′ [L V̂(b) L′]⁻¹ (Lb − c) / q

which is distributed as F_{q, n−k−1} under H₀. Applied to an individual coefficient, t₀ = √F₀ = B_j/SE(B_j) produces a t-test on n − k − 1 degrees of freedom.
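A generic Wald test is a few lines of linear algebra; the helper below (my own naming) takes the hypothesis matrix L, the right-hand-side vector c, and the estimated coefficients and covariance matrix, which most packages expose.

```python
# Wald chi-square test of H0: L b = c.
import numpy as np
from scipy import stats

def wald_test(L, c, b, V):
    L, c = np.atleast_2d(L), np.atleast_1d(c)
    d = L @ b - c
    z2 = float(d @ np.linalg.solve(L @ V @ L.T, d))
    return z2, stats.chi2.sf(z2, df=L.shape[0])

# e.g., test that coefficients 1 and 2 are jointly zero:
# z2, p = wald_test(np.eye(len(fit.params))[[1, 2]], [0, 0],
#                   np.asarray(fit.params), np.asarray(fit.cov_params()))
```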
To test the general linear hypothesis H₀: Lβ = c, where the hypothesis matrix L has q rows, we can compute the Wald chi-square test statistic Z₀² = (Lb − c)′[L V̂(b) L′]⁻¹(Lb − c), with q degrees of freedom. Alternatively, if the dispersion parameter is estimated from the data, we can compute the F-test statistic F₀ = (Lb − c)′[L V̂(b) L′]⁻¹(Lb − c)/q on q and n − k − 1 degrees of freedom.
Testing Nonlinear Hypotheses

It is occasionally of interest to test a hypothesis or construct a confidence interval for a nonlinear function of the parameters of a linear or generalized linear model. If the nonlinear function in question is a differentiable function of the regression coefficients, then an approximate asymptotic standard error may be obtained by the delta method.⁴⁹ Suppose that we are interested in the function

    γ = f(β) = f(β₀, β₁, ..., β_k)

where, for notational convenience, I have used β₀ to denote the regression constant. The function f(·) need not use all the regression coefficients (see the example below). The maximum-likelihood estimator of γ is simply γ̂ = f(b) (which, as an MLE, is also asymptotically normal), and the approximate sampling variance of γ̂ is then

    V̂(γ̂) ≈ Σ_{j=0}^{k} Σ_{j′=0}^{k} v_{jj′} (∂γ̂/∂B_j)(∂γ̂/∂B_{j′})

where v_{jj′} is the j, j′th element of the estimated asymptotic covariance matrix of the coefficients, V̂(b).

⁴⁸See Section 9.4.4.
⁴⁹The delta method (Rao, 1973) is described in Appendix D on probability and estimation. The method employs a first-order (i.e., linear) Taylor-series approximation to the nonlinear function. The delta method is appropriate here because the maximum-likelihood (or quasi-likelihood) estimates of the coefficients of a GLM are asymptotically normally distributed. Indeed, the procedure described in this section is applicable whenever the parameters of a regression model are normally distributed and can therefore be applied in a wide variety of contexts, such as to the nonlinear regression models described in Chapter 17. In small samples, however, the delta-method approximation to the standard error may not be adequate, and the bootstrapping procedures described in Chapter 21 will usually provide more reliable results.
To illustrate the application of this result, imagine that we are interested in determining the maximum or minimum value of a quadratic partial regression.⁵⁰ Focusing on the partial relationship between the response variable and a particular X, we have an equation of the form

    E(Y) = α + β₁X + β₂X² + ···

Differentiating this equation with respect to X, we get

    dE(Y)/dX = β₁ + 2β₂X

Setting the derivative to 0 and solving for X produces the value at which the function reaches a minimum (if β₂ is positive) or a maximum (if β₂ is negative),

    X = −β₁/(2β₂)

which is a nonlinear function of the regression coefficients β₁ and β₂.

For example, in Section 12.3.1, using data from the Canadian Survey of Labour and Income Dynamics (the SLID), I fit a least-squares regression of log wage rate on a quadratic in age, a dummy regressor for sex, and the square of education, obtaining (repeating, and slightly rearranging, Equation 12.7 on page 280):

    log₂(Wages)-hat = 0.5725 + 0.1198 × Age − 0.001230 × Age²
                      (0.0834)  (0.0046)      (0.000059)
                    + 0.3195 × Male + 0.002605 × Education²
                      (0.0180)        (0.000113)

    R² = .3892
Imagine that we are interested in the age γ ≡ −β₁/(2β₂) at which wages are at a maximum, holding sex and education constant. The necessary derivatives are

    ∂γ̂/∂B₁ = −1/(2B₂) = −1/[2(−0.001230)] = 406.5
    ∂γ̂/∂B₂ = B₁/(2B₂²) = 0.1198/[2(−0.001230)²] = 39,593

Our point estimate of γ is

    γ̂ = −B₁/(2B₂) = −0.1198/[2 × (−0.001230)] = 48.70 years

⁵⁰See Section 17.1 for a discussion of polynomial regression. The application of the delta method to finding the minimum or maximum of a quadratic curve is suggested by Weisberg (2005, sect. 6.1.2).
The estimated sampling variance of the age coefficient is V̂(B₁) = 2.115 × 10⁻⁵, and of the coefficient of age-squared, V̂(B₂) = 3.502 × 10⁻⁹; the estimated sampling covariance for the two coefficients is Ĉ(B₁, B₂) = −2.685 × 10⁻⁷. The approximate estimated variance of γ̂ is then

    V̂(γ̂) ≈ (2.115 × 10⁻⁵)(406.5²) + 2(−2.685 × 10⁻⁷)(406.5)(39,593)
           + (3.502 × 10⁻⁹)(39,593²)
         = 0.3419

Consequently, the approximate standard error of γ̂ is SE(γ̂) ≈ √0.3419 = 0.5847, and an approximate 95% confidence interval for the age at which income is highest on average is γ = 48.70 ± 1.96(0.5847) = (47.55, 49.85).
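The arithmetic above is compactly reproduced as a gradient-covariance quadratic form; the sketch uses only the coefficient estimates, variances, and covariance reported in the text.

```python
# Delta-method variance for gamma = -b1/(2*b2), using the reported values.
import numpy as np

b1, b2 = 0.1198, -0.001230
V = np.array([[2.115e-5, -2.685e-7],
              [-2.685e-7, 3.502e-9]])              # cov matrix of (B1, B2)
g = np.array([-1 / (2 * b2), b1 / (2 * b2**2)])    # gradient of gamma
gamma = -b1 / (2 * b2)                             # 48.70 years
var = g @ V @ g                                    # approximately 0.3419
print(gamma, var, np.sqrt(var))                    # SE approximately 0.5847
```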
The delta method may be used to approximate the standard error of a nonlinear function of regression coefficients in a GLM. If γ ≡ f(β₀, β₁, ..., β_k), then

    V̂(γ̂) ≈ Σ_{j=0}^{k} Σ_{j′=0}^{k} v_{jj′} (∂γ̂/∂B_j)(∂γ̂/∂B_{j′})
15.3.4 Effect Displays

Let us write the GLM in matrix form, with linear predictor

    η = Xβ
    (n × 1) = (n × k+1)(k+1 × 1)

and link function g(μ) = η, where μ is the expectation of the response vector y. As described in Section 15.3.2, we compute the maximum-likelihood estimate b of β, along with the estimated asymptotic covariance matrix V̂(b) of b.

Let the rows of X* include regressors corresponding to all combinations of values of explanatory variables appearing in a high-order term of the model (or, for a continuous explanatory variable, values spanning the range of the variable), along with typical values of the remaining regressors. The structure of X* with respect to interactions, for example, is the same as that of the model matrix X. Then the fitted values η̂* = X*b represent the high-order term in question, and a table or graph of these values (or, alternatively, of the fitted values transformed to the scale of the response variable, g⁻¹(η̂*)) is an effect display. The standard errors of η̂*, available as the square-root diagonal entries of X*V̂(b)X*′, may be used to compute pointwise confidence intervals for the effects, the end-points of which may then also be transformed to the scale of the response.

For example, for the Poisson regression model fit to Ornstein's interlocking-directorate data, the effect display for assets in Figure 15.6(a) (page 390) is constructed by letting assets range between its minimum value of 0.062 and maximum of 147.670 billion dollars, fixing the dummy variables for nation of control and sector to their sample means, that is, to the observed proportions of the data in each of the corresponding categories of nation and sector. As noted previously, this is an especially simple example, because the model includes no interactions. The model was fit with the log link, and so the estimated effects, which in general are on the scale of the linear predictor, are on the log-count scale; the right-hand axis of the graph shows the corresponding count scale, which is the scale of the response variable.
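The construction of X* is mechanical, and a minimal sketch for a single continuous focal regressor follows; the column index and the 50-point grid are illustrative choices, and `fit` is a fitted GLM with a log link as in the earlier sketches. Setting the remaining columns to their means fixes dummy regressors at the observed proportions, as in the assets display described above.

```python
# Sketch of an effect display: fitted values and pointwise 95% limits over a
# grid for one focal column of the model matrix, others held at their means.
import numpy as np

X = fit.model.exog
beta, V = np.asarray(fit.params), np.asarray(fit.cov_params())
focal = 1                                    # index of the focal regressor
grid = np.linspace(X[:, focal].min(), X[:, focal].max(), 50)
Xstar = np.tile(X.mean(axis=0), (50, 1))
Xstar[:, focal] = grid
eta = Xstar @ beta                           # effects on the linear predictor
se = np.sqrt(np.diag(Xstar @ V @ Xstar.T))
lo, hi = np.exp(eta - 1.96 * se), np.exp(eta + 1.96 * se)   # response scale
```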
Effect displays for GLMs are based on the fitted values η̂* = X*b, representing a high-order term in the model; that is, X* has the same general structure as the model matrix X, with the explanatory variables in the high-order term ranging over their values in the data while other explanatory variables are set to typical values. The standard errors of η̂*, given by the square-root diagonal entries of X*V̂(b)X*′, may be used to compute pointwise confidence intervals for the effects.
15.4 Diagnostics for Generalized Linear Models

Most of the diagnostics for linear models presented in Chapters 11 and 12 extend relatively straightforwardly to GLMs. These extensions typically take advantage of the computation of maximum-likelihood and quasi-likelihood estimates for GLMs by iterated weighted least squares, as described in Section 15.3.2. The final weighted-least-squares fit linearizes the model and provides a quadratic approximation to the log likelihood. Approximate diagnostics are then either based directly on the WLS solution or are derived from statistics easily calculated from this solution. Seminal work on the extension of linear least-squares diagnostics to GLMs was done by Pregibon (1981), Landwehr, Pregibon, and Shoemaker (1984), Wang (1985, 1987), and Williams (1987). In my experience, and with the possible exception of added-variable plots for non-Gaussian GLMs, these extended diagnostics typically work reasonably well.
15.4.1 Outlier, Leverage, and Influence Diagnostics

Hat-Values

Hat-values, h_i, for a GLM can be taken directly from the final iteration of the IWLS procedure for fitting the model,⁵¹ and have the usual interpretation, except that, unlike in a linear model, the hat-values in a GLM depend on the response variable Y as well as on the configuration of the Xs.

Residuals

Several kinds of residuals can be defined for GLMs:

• Most straightforwardly (but least usefully), response residuals are simply the differences between the observed response and its estimated expected value: Y_i − μ̂_i, where

    μ̂_i = g⁻¹(η̂_i) = g⁻¹(A + B₁X_i1 + B₂X_i2 + ··· + B_k X_ik)

• Working residuals are the residuals from the final WLS fit. These may be used to define partial residuals for component-plus-residual plots (see below).
⁵¹*The hat-matrix is

    H = W^(1/2) X (X′WX)⁻¹ X′ W^(1/2)

where W is the weight matrix from the final IWLS iteration.
• Pearson residuals are casewise components of the Pearson goodness-of-fit statistic for the model:⁵²

    φ̃^(1/2) (Y_i − μ̂_i) / √V̂(Y_i|η_i)

  where φ̃ is the estimated dispersion parameter for the model (Equation 15.19 on page 406) and V̂(Y_i|η_i) is the conditional variance of the response (given in Table 15.2 on page 382).

• Standardized Pearson residuals correct for the conditional response variation and for the differential leverage of the observations:

    R_Pi ≡ (Y_i − μ̂_i) / √[V̂(Y_i|η_i)(1 − h_i)]

• Deviance residuals, G_i, are the square-roots of the casewise components of the residual deviance (Equation 15.20 on page 408), attaching the sign of the corresponding response residual.

• Standardized deviance residuals are

    R_Gi ≡ G_i / √[φ̃(1 − h_i)]

• Several different approximations to studentized residuals have been proposed. To calculate exact studentized residuals would require literally refitting the model deleting each observation in turn and noting the decline in the deviance; this procedure, of course, is computationally unattractive. Williams suggests the approximation

    E*_i ≡ ± √[(1 − h_i)R²_Gi + h_i R²_Pi]

  where, once again, the sign is taken from the response residual. A Bonferroni outlier test using the standard normal distribution may be based on the largest absolute studentized residual.
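A sketch of these formulas for the Poisson family appears below, computing hat-values from the hat-matrix of footnote 51 together with standardized Pearson, standardized deviance, and approximate studentized residuals; `fit` is again a fitted Poisson GLM from the earlier sketches, and the dispersion is estimated as in Equation 15.19.

```python
# Hat-values and residuals for a Poisson GLM, from the final IWLS weights.
import numpy as np

X, y = fit.model.exog, fit.model.endog
mu = np.asarray(fit.fittedvalues)
w = mu                                       # IWLS weights, Poisson/log link
A = np.sqrt(w)[:, None] * X                  # W^(1/2) X
h = np.einsum("ij,ij->i", A @ np.linalg.inv(X.T @ (X * w[:, None])), A)
phi = np.sum((y - mu) ** 2 / mu) / (len(y) - X.shape[1])
r_P = (y - mu) / np.sqrt(phi * mu * (1 - h))  # standardized Pearson
with np.errstate(divide="ignore", invalid="ignore"):
    d_i = 2 * (np.where(y > 0, y * np.log(y / mu), 0.0) - (y - mu))
sign = np.sign(y - mu)
r_G = sign * np.sqrt(d_i / (phi * (1 - h)))   # standardized deviance
e_star = sign * np.sqrt((1 - h) * r_G**2 + h * r_P**2)  # approx. studentized
```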
Influence Measures

An approximation to Cook's distance influence measure is

    D_i ≡ [R²_Pi / (φ̃(k + 1))] × [h_i / (1 − h_i)]

This is essentially Williams's definition, except that I divide by the estimated dispersion φ̃ to scale D_i as an F-statistic rather than as a chi-square statistic.

Approximate values of influence measures for individual coefficients, DFBETA_ij and DFBETAS_ij, may be obtained directly from the final iteration of the IWLS procedure.

Wang (1985) suggests an extension of added-variable plots to GLMs that works as follows: Suppose that the focal regressor is X_j. Refit the model with X_j removed, extracting the working residuals from this fit. Then regress X_j on the other Xs by WLS, using the weights from the last IWLS step, obtaining residuals. Finally, plot the working residuals from the first regression against the residuals for X_j from the second regression.

⁵²The Pearson statistic, an alternative to the deviance for measuring the fit of the model to the data, is the sum of squared Pearson residuals.
Figure 15.7  Hat-values, studentized residuals, and Cook's distances from the quasi-Poisson regression for Ornstein's interlocking-directorate data. The areas of the circles are proportional to the Cook's distances for the observations. Horizontal lines are drawn at −2, 0, and 2 on the studentized-residual scale, vertical lines at twice and three times the average hat-value.
Figure 15.8  Index plot of DFBETA for the assets coefficient. The horizontal lines are drawn at 0 and ±SE(B_Assets).
Figure 15.7 shows hat-values, studentized residuals, and Cook's distances for the quasi-Poisson model fit to Ornstein's interlocking-directorate data. One observation, Number 1, the corporation with the largest assets, stands out by combining a very large hat-value with the biggest absolute studentized residual.⁵³ This point is not a statistically significant outlier, however (indeed, the Bonferroni p-value for the largest studentized residual exceeds 1). As shown in the DFBETA plot in Figure 15.8, Observation 1 makes the coefficient of assets substantially smaller than it would otherwise be (recall that the coefficient for assets is 0.02085).⁵⁴ In this case, the approximate DFBETA is quite accurate: If Observation 1 is deleted, the assets coefficient increases to 0.02602.

⁵³Unfortunately, the data source does not include the names of the firms, but Observation 1 is the largest of the Canadian banks, which, in the 1970s, was (I believe) the Royal Bank of Canada.
⁵⁴I invite the reader to plot the DFBETA values for the other coefficients in the model.
Figure 15.9  Component-plus-residual plot for assets in the interlocking-directorate quasi-Poisson regression. The broken line shows the least-squares fit to the partial residuals; the solid line is for a nonrobust lowess smooth with a span of 0.9.
Before concluding that Observation 1 requires special treatment, however, consider the check for nonlinearity in the next section.

15.4.2 Nonlinearity Diagnostics

Component-plus-residual and CERES plots also extend straightforwardly to GLMs. Nonparametric smoothing of the resulting scatterplots can be important to interpretation, especially in models for binary response variables, where the discreteness of the response makes the plots difficult to examine. Similar (if typically less extreme) effects can occur for binomial and count data. Component-plus-residual and CERES plots use the linearized model from the last step of the IWLS fit. For example, the partial residual for X_j adds the working residual to B_j X_ij; the component-plus-residual plot then graphs the partial residual against X_j. In smoothing a component-plus-residual plot for a non-Gaussian GLM, it is generally preferable to use a nonrobust smoother.

A component-plus-residual plot for assets in the quasi-Poisson regression for the interlocking-directorate data is shown in Figure 15.9. Assets is so highly positively skewed that the plot is difficult to examine, but it is nevertheless apparent that the partial relationship between number of interlocks and assets is nonlinear, with a much steeper slope at the left than at the right. Because the bulge points to the left, we can try to straighten this relationship by transforming assets down the ladder of powers and roots. Trial and error suggests the log transformation of assets, after which a component-plus-residual plot for the modified model (Figure 15.10) is unremarkable.
Box-Tidwell constructed-variable plots⁵⁵ also extend straightforwardly to GLMs: When considering the transformation of X_j, simply add the constructed variable X_j log_e X_j to the model and examine the added-variable plot for the constructed variable. Applied to assets in Ornstein's quasi-Poisson regression, this procedure produces the constructed-variable plot in Figure 15.11, which suggests that evidence for the transformation is spread throughout the data. The coefficient for assets × log_e assets in the constructed-variable regression is −0.02177 with a standard error of 0.00371; the Wald-test statistic Z₀ = −0.02177/0.00371 = −5.874 therefore indicates strong evidence for the transformation of assets. By comparing the coefficient of assets in the original

⁵⁵See Section 12.5.2.
Figure 15.10  Component-plus-residual plot following the log-transformation of assets. The lowess fit is for a span of 0.6.
Figure 15.11  Constructed-variable plot for the transformation of assets in the interlocking-directorate quasi-Poisson regression.
quasi-Poisson regression (0.02085) with the coefficient of the constructed variable, we get the suggested power transformation

    λ̃ = 1 + (−0.02177)/0.02085 = −0.044

that is, essentially the log-transformation, λ = 0.
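The constructed-variable step is easy to script; the sketch below reuses the hypothetical `ornstein` data frame and the original quasi-Poisson `fit` from earlier sketches, so all names are placeholders rather than the text's own code.

```python
# Box-Tidwell constructed variable for assets, and the implied power.
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

ornstein["cv"] = ornstein["assets"] * np.log(ornstein["assets"])
cv_fit = smf.glm("interlocks ~ assets + cv + nation + sector", data=ornstein,
                 family=sm.families.Poisson()).fit(scale="X2")
lam = 1 + cv_fit.params["cv"] / fit.params["assets"]  # suggested power, ~0
```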
Finally, it is worth noting the relationship between the problems of influence and nonlinearity in this example: Observation 1 was influential in the original regression because its very large assets gave it high leverage and because unmodelled nonlinearity put the observation below the erroneously linear fit for assets, pulling the regression surface towards it. Log-transforming assets fixes both these problems.

Alternative effect displays for assets in the transformed model are shown in Figure 15.12. Panel (a) in this figure graphs assets on its natural scale; on this scale, of course, the fitted partial relationship between log-interlocks and assets is nonlinear. Panel (b) uses a log scale for assets, rendering the partial relationship linear.
Figure 15.12  Effect displays for assets in the quasi-Poisson regression model in which assets has been log-transformed. Panel (a) plots assets on its natural scale, while panel (b) uses a log scale for assets. Rug plots for assets appear at the bottom of the graphs. The broken lines give pointwise 95% confidence intervals around the estimated effect.
Most of the standard diagnostics for linear models extend relatively straightforwardly to GLMs. These extensions typically take advantage of the computation of maximum-likelihood and quasi-likelihood estimates for GLMs by iterated weighted least squares. Such diagnostics include studentized residuals, hat-values, Cook's distances, DFBETA and DFBETAS, added-variable plots, component-plus-residual plots, and the constructed-variable plot for transforming an explanatory variable.
Exercises
Exercise 15.1. Testing overdispersion: Let α ≡ 1/θ represent the inverse of the scale parameter θ for the negative-binomial regression model (see Equation 15.4 on page 392). When α = 0, the negative-binomial model reduces to the Poisson regression model (why?), and consequently a test of H₀: α = 0 against the one-sided alternative hypothesis H_a: α > 0 is a test of overdispersion. A Wald test of this hypothesis is straightforward, simply dividing α̂ by its standard error. We can also compute a likelihood-ratio test contrasting the deviance under the more specific Poisson regression model with that under the more general negative-binomial model. Because the negative-binomial model has one additional parameter, we refer the likelihood-ratio test statistic to a chi-square distribution with one degree of freedom; as Cameron and Trivedi (1998, p. 78) explain, however, the usual right-tailed p-value obtained from the chi-square distribution must be halved. Apply this likelihood-ratio test for overdispersion to Ornstein's interlocking-directorate regression.
Exercise 15.2. *Zero-inflated count regression models:

(a) Show that the mean and variance of the response variable Y_i in the zero-inflated Poisson (ZIP) regression model, given in Equations 15.5 and 15.6 on page 393, are

    E(Y_i) = (1 − π_i)μ_i
    V(Y_i) = (1 − π_i)μ_i(1 + π_i μ_i)

(Hint: Recall that there are two sources of zeroes: observations in the first latent class, whose value of Y_i is necessarily 0, and observations in the second latent class, whose value may be zero. The probability of membership in the first class is π_i, and in the second, 1 − π_i.) Show that V(Y_i) > E(Y_i) when π_i > 0.
(b) Derive the log likelihood for the ZIP model, given in Equation 15.7 (page 394).

(c) The zero-inflated negative-binomial (ZINB) regression model substitutes a negative-binomial GLM for the Poisson-regression submodel of Equation 15.6 on page 393:

    log_e μ_i = α + β₁x_i1 + β₂x_i2 + ··· + β_k x_ik

    p(y_i | x_1, ..., x_k) = [Γ(y_i + θ)/(y_i! Γ(θ))] × [μ_i^(y_i) θ^θ / (μ_i + θ)^(y_i+θ)]

Show that E(Y_i) = (1 − π_i)μ_i (as in the ZIP model) and that

    V(Y_i) = (1 − π_i)μ_i[1 + μ_i(π_i + 1/θ)]

When π_i > 0, the conditional variance is greater in the ZINB model than in the standard negative-binomial GLM, V(Y_i) = μ_i + μ_i²/θ; why? Derive the log likelihood for the ZINB model. [Hint: Simply substitute the negative-binomial GLM for the Poisson-regression submodel in Equation 15.7 (page 394).]
Exercise 15.3. The usual Pearson chi-square statistic for testing for independence in a two-way contingency table is

    X²₀ = Σ_{i=1}^{r} Σ_{j=1}^{c} (Y_ij − μ̂_ij)² / μ̂_ij

where the Y_ij are the observed frequencies in the table, and the μ̂_ij are the estimated expected frequencies under independence. The estimated expected frequencies can be computed from the maximum-likelihood estimates for the loglinear model of independence, or they can be computed directly as μ̂_ij = Y_{i+}Y_{+j}/n. The likelihood-ratio statistic for testing for independence can also be computed from the estimated expected counts as

    G²₀ = 2 Σ_{i=1}^{r} Σ_{j=1}^{c} Y_ij log_e(Y_ij/μ̂_ij)

Both test statistics have (r − 1)(c − 1) degrees of freedom. The two tests are asymptotically equivalent and usually produce similar results. Applying these formulas to the two-way table for voter turnout and intensity of partisan preference in Table 15.4 (page 395), compute both test statistics, verifying that the direct formula for G²₀ produces the same result as given in the text.
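Both statistics are simple to compute; a small helper (the function name is my own) is sketched below for any two-way table of counts.

```python
# Pearson X2 and likelihood-ratio G2 tests of independence for a two-way
# table of counts, with expected frequencies mu_ij = Y_i+ Y_+j / n.
import numpy as np
from scipy import stats

def independence_tests(Y):
    Y = np.asarray(Y, dtype=float)
    mu = np.outer(Y.sum(axis=1), Y.sum(axis=0)) / Y.sum()
    X2 = np.sum((Y - mu) ** 2 / mu)
    with np.errstate(divide="ignore", invalid="ignore"):
        G2 = 2 * np.sum(np.where(Y > 0, Y * np.log(Y / mu), 0.0))
    df = (Y.shape[0] - 1) * (Y.shape[1] - 1)
    return X2, G2, stats.chi2.sf(X2, df), stats.chi2.sf(G2, df)
```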
Exercise 15.4. *Show that the normal distribution can be written in exponential form as

    p(y; θ, φ) = exp{ (yμ − μ²/2)/σ² − ½[y²/σ² + log_e(2πσ²)] }

where θ = g_c(μ) = μ; φ = σ²; a(φ) = φ; b(θ) = θ²/2; and c(y, φ) = −½[y²/φ + log_e(2πφ)].
Exercise 15.5. *Show that the binomial distribution can be written in exponential form as

    p(y; θ, φ) = exp{ [yθ − log_e(1 + e^θ)]/(1/n) + log_e (n choose ny) }

where θ = g_c(μ) = log_e[μ/(1 − μ)]; φ = 1; a(φ) = 1/n; b(θ) = log_e(1 + e^θ); and c(y, φ) = log_e (n choose ny).
Exercise 15.6. *Using the results given in Table 15.9 (on page 403), verify that the Poisson, gamma, and inverse-Gaussian families can all be written in the common exponential form

    p(y; θ, φ) = exp{ [yθ − b(θ)]/a(φ) + c(y, φ) }
Exercise 15.7. *Using the general result that the conditional variance of a distribution in an exponential family is

    V(Y) = a(φ) d²b(θ)/dθ²

and the values of a(φ) and b(θ) given in Table 15.9 (on page 403), verify that the variances of the Gaussian, binomial, Poisson, gamma, and inverse-Gaussian families are, consecutively, φ, μ(1 − μ)/n, μ, φμ², and φμ³.
Exercise 15.8. *Show that the derivative of the log likelihood for an individual observation with respect to the regression coefficients in a GLM can be written as

    ∂l_i/∂β_j = [(Y_i − μ_i)/(a_i(φ)v(μ_i))] × (dμ_i/dη_i) × x_ij,  for j = 0, 1, ..., k

(See Equation 15.17 on page 404.)
Exercise 15.9. *Using the general expression for the residual deviance,

    D(y; μ̂) = 2 Σ_{i=1}^{n} { Y_i[g(Y_i) − g(μ̂_i)] − b[g(Y_i)] + b[g(μ̂_i)] } / a_i

show that the deviances for the several exponential families can be written in the following forms:

Family             Residual Deviance
Gaussian           Σ (Y_i − μ̂_i)²
Binomial           2 Σ [ n_i Y_i log_e(Y_i/μ̂_i) + n_i(1 − Y_i) log_e((1 − Y_i)/(1 − μ̂_i)) ]
Poisson            2 Σ [ Y_i log_e(Y_i/μ̂_i) − (Y_i − μ̂_i) ]
Gamma              2 Σ [ −log_e(Y_i/μ̂_i) + (Y_i − μ̂_i)/μ̂_i ]
Inverse-Gaussian   Σ (Y_i − μ̂_i)² / (Y_i μ̂_i²)
Exercise 15.10. *Using the SLID data, Table 12.1 in Section 12.3.2 (on page 283) reports the results of a regression of log wages on sex, the square of education, a quadratic in age, and interactions between sex and education-squared, and between sex and the quadratic for age.

(a) Estimate the age γ₁ at which women attain on average their highest level of wages, controlling for education. Use the delta method to estimate the standard error of γ̂₁. Note: You will need to refit the model to obtain the covariance matrix for the estimated regression coefficients.

(b) Estimate the age γ₂ at which men attain on average their highest level of wages, controlling for education. Use the delta method to estimate the standard error of γ̂₂.

(c) Let γ₃ ≡ γ₁ − γ₂, the difference between the ages at which women and men attain their highest wage levels. Compute γ̂₃. Use the delta method to find the standard error of γ̂₃ and then test the null hypothesis H₀: γ₃ = 0.
Exercise 15.11. Coefficient quasi-variances: Coefficient quasi-variances for dummy-variable regressors were introduced in Section 7.2.1. Recall that the object is to approximate the standard errors for pairwise differences between categories,

    SE(C_j − C_j′) = √[ V̂(C_j) + V̂(C_j′) − 2Ĉ(C_j, C_j′) ]

where C_j and C_j′ are two dummy-variable coefficients for an m-category polytomous explanatory variable; V̂(C_j) is the estimated sampling variance of C_j; and Ĉ(C_j, C_j′) is the estimated sampling covariance of C_j and C_j′. By convention, we take C_m (the coefficient of the baseline category) and its standard error, SE(C_m), to be 0. We seek coefficient quasi-variances Ṽ(C_j), so that

    SE(C_j − C_j′) ≈ √[ Ṽ(C_j) + Ṽ(C_j′) ]

for all pairs of coefficients C_j and C_j′, by minimizing the total log relative error of approximation, Σ_{j<j′} [log(RE_jj′)]², where

    RE_jj′ ≡ Ṽ(C_j − C_j′)/V̂(C_j − C_j′) = [Ṽ(C_j) + Ṽ(C_j′)] / [V̂(C_j) + V̂(C_j′) − 2Ĉ(C_j, C_j′)]
Firth (2003) cleverly suggests implementing this criterion by fitting a GLM in which the response variable is Y_jj′ ≡ log_e[V̂(C_j − C_j′)] for all unique pairs of categories j and j′; the linear predictor is η_jj′ ≡ β_j + β_j′; the link function is the exponential link, g(μ) = exp(μ) (which is, note, not one of the common links in Table 15.1); and the variance function is constant, V(Y|μ) = φ. The quasi-likelihood estimates of the coefficients β_j are the quasi-variances Ṽ(C_j). For example, for the Canadian occupational prestige regression described in Section 7.2.1, where the dummy variables pertain to type of occupation (professional and managerial, white collar, or blue collar), we have

Pair (j, j′)                   Y_jj′ = log_e[V̂(C_j − C_j′)]
Professional, White Collar     log_e(2.771²) = 2.038
Professional, Blue Collar      log_e(3.867²) = 2.705
White Collar, Blue Collar      log_e(2.514²) = 1.844
and model matrix

         (β₁) (β₂) (β₃)
    X = [  1    1    0
           1    0    1
           0    1    1 ]

With three unique pairs and three coefficients, we should get a perfect fit: As I mentioned in Section 7.2.1, when there are only three categories, the quasi-variances perfectly recover the estimated variances for pairwise differences in coefficients. Demonstrate that this is the case by fitting the GLM (a sketch follows the comments below). Some additional comments:

• The computation outlined here is the basis of Firth's qvcalc package (described in Firth, 2003) for the R statistical programming environment.
• The computation of quasi-variances applies not only to dummy regressors in linear models but to all models with a linear predictor for which coefficients and their estimated covariance matrix are available (for example, the GLMs described in this chapter).
• Quasi-variances may be used to approximate the standard error for any linear combination of dummy-variable coefficients, not just for pairwise differences.
• Having found the quasi-variance approximations to a set of standard errors, we can then compute and report the (typically small) maximum relative error of these approximations. Firth and De Menezes (2004) give more general results for the maximum relative error for any contrast of coefficients.
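For this three-category example the fitted GLM reproduces the three response values exactly, so the quasi-variances can be recovered by solving the implied linear system directly; the sketch below does exactly that with the Y_jj′ values from the table above. (Note that a quasi-variance may legitimately be negative, provided the implied variances of contrasts remain positive.)

```python
# With three pairs and three coefficients the fit is exact, so the
# quasi-variances solve X beta = exp(Y) directly.
import numpy as np

Y = np.array([2.038, 2.705, 1.844])   # log_e of V(C_j - C_j') for each pair
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
qv = np.linalg.solve(X, np.exp(Y))    # quasi-variances for the 3 categories
print(qv)
```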
Summary

• A generalized linear model (or GLM) consists of three components:

1. A random component, specifying the conditional distribution of the response variable, Y_i (for the ith of n independently sampled observations), given the values of the explanatory variables in the model. In the initial formulation of GLMs, the distribution of Y_i was a member of an exponential family, such as the Gaussian (normal), binomial, Poisson, gamma, or inverse-Gaussian families of distributions.

2. A linear predictor, that is, a linear function of regressors,

       η_i = α + β₁X_i1 + β₂X_i2 + ··· + β_k X_ik

3. A smooth and invertible linearizing link function g(·), which transforms the expectation of the response variable, μ_i ≡ E(Y_i), to the linear predictor:

       g(μ_i) = η_i = α + β₁X_i1 + β₂X_i2 + ··· + β_k X_ik
• A convenient property of distributions in the exponential families is that the conditional variance of Y_i is a function of its mean μ_i and, possibly, a dispersion parameter φ. In addition to the familiar Gaussian and binomial families (the latter for proportions), the Poisson family is useful for modeling count data, and the gamma and inverse-Gaussian families for modeling positive continuous data, where the conditional variance of Y increases with its expectation.

• GLMs are fit to data by the method of maximum likelihood, providing not only estimates of the regression coefficients but also estimated asymptotic standard errors of the coefficients.

• The ANOVA for linear models has an analog in the analysis of deviance for GLMs. The residual deviance for a GLM is D_m ≡ 2(log_e L_s − log_e L_m), where L_m is the maximized likelihood under the model in question, and L_s is the maximized likelihood under a saturated model. The residual deviance is analogous to the residual sum of squares for a linear model. In GLMs for which the dispersion parameter is fixed to 1 (binomial and Poisson GLMs), the likelihood-ratio test statistic is the difference in the residual deviances for nested models. For GLMs in which there is a dispersion parameter to estimate (Gaussian, gamma, and inverse-Gaussian GLMs), we can instead compare nested models by an incremental F-test.

• The basic GLM for count data is the Poisson model with log link. Frequently, however, when the response variable is a count, its conditional variance increases more rapidly than its mean, producing a condition termed overdispersion and invalidating the use of the Poisson distribution. The quasi-Poisson GLM adds a dispersion parameter to handle overdispersed count data; this model can be estimated by the method of quasi-likelihood. A similar model is based on the negative-binomial distribution, which is not an exponential family. Negative-binomial GLMs can nevertheless be estimated by maximum likelihood. The zero-inflated Poisson regression model may be appropriate when there are more zeroes in the data than is consistent with a Poisson distribution.

• Loglinear models for contingency tables bear a formal resemblance to ANOVA models and can be fit to data as Poisson GLMs with a log link. The loglinear model for a contingency table, however, treats the variables in the table symmetrically (none of the variables is distinguished as a response variable), and consequently the parameters of the model represent the associations among the variables, not the effects of explanatory variables on a response. When one of the variables is construed as the response, the loglinear model reduces to a binomial or multinomial logit model.
• The Gaussian, binomial, Poisson, gamma, and inverse-Gaussian distributions can all be written in the common linear-exponential form:

    p(y; θ, φ) = exp{ [yθ − b(θ)]/a(φ) + c(y, φ) }

where a(·), b(·), and c(·) are known functions that vary from one exponential family to another; θ = g_c(μ) is the canonical parameter for the exponential family in question; g_c(·) is the canonical link function; and φ > 0 is a dispersion parameter, which takes on a fixed, known value in some families. It is generally the case that μ = E(Y) = b′(θ) and that V(Y) = a(φ)b″(θ).
• The maximum-likelihood estimating equations for generalized linear models take the common form

    Σ_{i=1}^{n} [(Y_i − μ_i)/(a_i v(μ_i))] (dμ_i/dη_i) x_ij = 0,  for j = 0, 1, ..., k

These equations are generally nonlinear and therefore have no general closed-form solution, but they can be solved by iterated weighted least squares (IWLS). The estimating equations for the coefficients do not involve the dispersion parameter, which (for models in which the dispersion is not fixed) then can be estimated as

    φ̃ = [1/(n − k − 1)] Σ (Y_i − μ̂_i)² / [a_i v(μ̂_i)]

The estimated asymptotic covariance matrix of the coefficients is

    V̂(b) = φ̃ (X′WX)⁻¹

where b is the vector of estimated coefficients and W is a diagonal matrix of weights from the last IWLS iteration.
• The maximum-likelihood estimating equations, and IWLS estimation, can be applied whenever we can express the transformed mean of Y as a linear function of the Xs and can write the conditional variance of Y as a function of its mean and (possibly) a dispersion parameter, even when we do not specify a particular conditional distribution for Y. The resulting quasi-likelihood estimator shares many of the properties of maximum-likelihood estimators.

• The residual deviance for a model is twice the difference in the log likelihoods for the saturated model, which dedicates one parameter to each observation, and the model in question:

    D(y; μ̂) ≡ 2[log_e L(y, φ; y) − log_e L(μ̂, φ; y)]
            = 2 Σ_{i=1}^{n} { Y_i[g(Y_i) − g(μ̂_i)] − b[g(Y_i)] + b[g(μ̂_i)] } / a_i

Dividing the residual deviance by the estimated dispersion parameter produces the scaled deviance, D*(y; μ̂) ≡ D(y; μ̂)/φ̃.
• To test the general linear hypothesis H₀: Lβ = c, where the hypothesis matrix L has q rows, we can compute the Wald chi-square test statistic

    Z₀² = (Lb − c)′ [L V̂(b) L′]⁻¹ (Lb − c)

with q degrees of freedom. Alternatively, if the dispersion parameter is estimated from the data, we can compute the F-test statistic

    F₀ = (Lb − c)′ [L V̂(b) L′]⁻¹ (Lb − c) / q

on q and n − k − 1 degrees of freedom.

• The delta method may be used to approximate the standard error of a nonlinear function of regression coefficients in a GLM. If γ ≡ f(β₀, β₁, ..., β_k), then

    V̂(γ̂) ≈ Σ_{j=0}^{k} Σ_{j′=0}^{k} v_{jj′} (∂γ̂/∂B_j)(∂γ̂/∂B_{j′})
• Effect displays for GLMs are based on the fitted values η̂* = X*b, representing a high-order term in the model; that is, X* has the same general structure as the model matrix X, with the explanatory variables in the high-order term ranging over their values in the data, while other explanatory variables are set to typical values. The standard errors of η̂*, given by the square-root diagonal entries of X*V̂(b)X*′, may be used to compute pointwise confidence intervals for the effects.

• Most of the standard diagnostics for linear models extend relatively straightforwardly to GLMs. These extensions typically take advantage of the computation of maximum-likelihood and quasi-likelihood estimates for GLMs by iterated weighted least squares. Such diagnostics include studentized residuals, hat-values, Cook's distances, DFBETA and DFBETAS, added-variable plots, component-plus-residual plots, and the constructed-variable plot for transforming an explanatory variable.
Recommended Reading

• McCullagh and Nelder (1989), the bible of GLMs, is a rich and interesting, if generally difficult, text.
• Dobson (2001) presents a much briefer overview of generalized linear models at a more moderate level of statistical sophistication.
• Aitkin, Francis, and Hinde's (2005) text, geared to the statistical computer package GLIM for fitting GLMs, is still more accessible.
• A chapter by Firth (1991) is the best brief treatment of generalized linear models that I have read.
• Long (1997) includes an excellent presentation of regression models for count data (though not from the point of view of GLMs); an even more extensive treatment may be found in Cameron and Trivedi (1998).