Generalized Linear Models
The class of generalized linear models (GLMs) extends the classical linear model
for continuous, normal responses to describe the relationship between one or
more predictor variables x1 , . . . , xp and a wide variety of nonnormally distributed
responses Y including binary, count, and positive-valued variates. GLMs expand
the class of response densities from the normal to an exponential family that
contains the normal, Poisson, binomial, and other popular distributions as special
cases. The models produce estimated expected values that conform to response
constraints and allow nonlinear relationships between predictors and expected
values. It is straightforward to construct the likelihood for a set of data, and so maximum likelihood and related likelihood-based methods are popular techniques for parameter estimation and inference. A key point with GLMs is that many of the considerations in model construction are the same as for standard linear regression models, as the two classes share many common features.
case, cost itself) can be modeled directly as a function of predictors and it is straightforward to calculate estimates of expected values.

A linear model does not provide a good description of the patient understanding variable, as the expected value of this binary random variable is a probability and must lie between 0 and 1. Linear models can produce fitted probabilities outside of the (0, 1) range. In addition, as toxicological studies from as far back as Bliss4 show, the proportion responding tends to follow a sigmoidal curve as a function of a continuous predictor, rather than a line. GLMs produce estimated expected values that conform to response constraints and allow nonlinear relationships between predictors and expected values.

The Poisson distribution for counts provides a natural distribution for the number of pain-limited days, rather than a distribution such as the normal with support on the entire real line. GLMs expand the class of response densities from the normal to an exponential family that contains the normal, Poisson, binomial, and other popular distributions as special cases.

STRUCTURE OF THE MODEL

Constructing a GLM involves three separate specifications:

1. The random component: the distribution of the response Y.

2. The systematic component: the function of the covariates x = x1, . . . , xp that relates x to the expected value of the response. With GLMs the covariates enter the statistical model via a function η = β0 + β1x1 + · · · + βpxp, known as the linear predictor.

3. The connection between the random and systematic components, which is assumed to be through µ = E(Y). It is assumed that this takes the form µ = g⁻¹(η), where g is a known function.

We consider each of these specifications in more detail.

Random Component
The responses yi are assumed to be independent with densities from the exponential family:

yi ∼ indep. fYi(yi),
fYi(yi) = exp{[yiθi − b(θi)]/τ² − c(yi, τ)},  (2)

where θi is known as the canonical parameter, c is a known function, and τ² is a dispersion parameter. For example, for binary data

fYi(yi) = pi^{yi}(1 − pi)^{1−yi},  (3)

where pi is the probability of a success. Writing fYi(yi) in the format of Eq. (2) shows that θi = log[pi/(1 − pi)] (or pi = 1/(1 + e^{−θi})), b(θi) = log(1 + e^{θi}), and c(yi, τ) = 0. Most commonly used distributions can be written in the form of Eq. (2).
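To make the correspondence concrete, the following minimal Python sketch (illustrative code, not part of the original article) checks numerically that the Bernoulli density of Eq. (3) matches the exponential-family form of Eq. (2) with θi = log[pi/(1 − pi)], b(θi) = log(1 + e^{θi}), τ² = 1, and c(yi, τ) = 0.

```python
import numpy as np

def bernoulli_density(y, p):
    """Bernoulli density of Eq. (3): p^y * (1 - p)^(1 - y)."""
    return p**y * (1.0 - p)**(1 - y)

def exp_family_density(y, p):
    """Exponential-family form of Eq. (2) with tau^2 = 1 and c(y, tau) = 0."""
    theta = np.log(p / (1.0 - p))      # canonical parameter: the logit of p
    b = np.log(1.0 + np.exp(theta))    # cumulant function b(theta)
    return np.exp(y * theta - b)

for p in (0.2, 0.5, 0.9):
    for y in (0, 1):
        assert np.isclose(bernoulli_density(y, p), exp_family_density(y, p))
print("Eq. (3) matches the exponential-family form of Eq. (2).")
```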
Function of Covariates x
As in the classical linear model, GLMs assume that the covariates x1, . . . , xp relate to the expected value of the response through a linear predictor η with

η = β0 + β1x1 + · · · + βpxp.  (4)

This key feature of GLMs, namely that the predictors affect Eq. (4) in exactly the same way as they do in a linear regression model, Eq. (1), means that many of the features of the latter carry over to the former and many of the considerations in modeling are the same. For example, just like linear regression, GLMs can handle a mix of continuous and categorical predictors. Issues of how to represent predictors and model nonlinear relationships are similar (e.g., using polynomial functions of predictors or splines); the sketch below illustrates this carryover. Furthermore, analysts can employ similar strategies to assess whether to retain a predictor in a model, and random factors can be incorporated in exactly the same way in linear models and GLMs.
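As a hypothetical illustration (the data and coefficient values are invented for the sketch), the code below builds a GLM design matrix exactly as one would for linear regression, dummy-coding a three-level categorical predictor and including a quadratic term for a continuous one, and then evaluates the linear predictor of Eq. (4).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
age = rng.uniform(20, 70, size=n)    # continuous predictor (hypothetical)
group = rng.integers(0, 3, size=n)   # categorical predictor with 3 levels

# Design matrix: intercept, age, age^2 (a polynomial term), and dummy
# codes for groups 1 and 2 (group 0 serves as the reference level).
X = np.column_stack([
    np.ones(n),
    age,
    age**2,
    (group == 1).astype(float),
    (group == 2).astype(float),
])

beta = np.array([-1.0, 0.05, -0.0004, 0.3, -0.2])  # illustrative coefficients
eta = X @ beta                                     # linear predictor, Eq. (4)
print(eta)
```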
Link Function
Rather than assume that µ = E(Y) is a linear function of covariates, as in linear models, GLMs assume that a function g of µ is equal to the linear predictor, η:

g(µ) = η,  (5)
where g is known as the link function. An attractive feature of this formulation is that regression coefficients in Eq. (4) retain the nice interpretation of linear regression coefficients as changes in the function of the expected value of the response, g(µ), associated with a one unit increase in a predictor, for any value of the predictor. The function g is the identity for the linear regression model but takes other forms for other models. For example, for the probit model5 for binary data, in which µ is the probability of a success and is related to the canonical parameter via µ = 1/(1 + exp[−θ]), the link function is g(µ) = Φ⁻¹(µ), the inverse of the standard normal cumulative distribution function. In general, g can be any monotonic, differentiable function but should be appropriate for the situation. For example, log link models for binary responses are problematic because for some covariate values the model will produce estimates of the expected value of the binary response, a probability, that exceed 1. Link functions based on inverse cumulative probability functions, such as the logit and probit, do not have this defect and always produce expected values in the interval (0, 1). A potential drawback of a standard GLM is that predictors relate to the expected value of the response solely through the linear predictor g(µ) = η, whereas predictors may affect the response distribution in a more complicated manner.

For some special models known as canonical link models, the canonical parameter of Eq. (2), θi, is equal to the linear predictor ηi: θi = ηi. Canonical link models include identity link models for normal responses, logistic models for binary responses, and log-linear Poisson models for counts. For these special models, sufficient statistics for the model parameters exist.
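The sketch below (illustrative values; assumes numpy and scipy) contrasts the inverse logit and probit links with the inverse of a log link: the first two map any linear predictor value into (0, 1), while exp(η) exceeds 1 whenever η > 0.

```python
import numpy as np
from scipy.stats import norm

eta = np.linspace(-4.0, 4.0, 9)          # a range of linear predictor values

mu_logit = 1.0 / (1.0 + np.exp(-eta))    # inverse logit link
mu_probit = norm.cdf(eta)                # inverse probit link
mu_log = np.exp(eta)                     # inverse log link

print(np.all((mu_logit > 0) & (mu_logit < 1)))    # True: valid probabilities
print(np.all((mu_probit > 0) & (mu_probit < 1)))  # True: valid probabilities
print(mu_log.max())                               # about 54.6, not a probability
```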
ESTIMATION AND INFERENCE

Using Eq. (2), it is straightforward to construct the likelihood based on a set of n independent observations, so maximum likelihood and related likelihood-based methods are popular techniques for parameter estimation and inference. These methods possess several optimal properties, such as smallest asymptotic variance,6 but require strong assumptions about the density of the responses Y. In some settings data analysts may not have sufficient information to make strong assumptions about the density of responses but are able to specify models for the expected value and variance of Y. Wedderburn7 showed that the specification of just a few features of a random variable, such as its expected value and variance, allows the construction of a likelihood-like quantity known as a quasilikelihood, which can perform as well as the full likelihood or else nearly as well. The following sections discuss estimation and inference by full likelihood, quasilikelihood, and conditional likelihood methods.

Estimation by Maximum Likelihood
For a sample of n independent observations, indexed by i = 1, . . . , n, the log likelihood, l, from Eq. (2) is

l = Σ_{i=1}^{n} [yiθi − b(θi)]/τ² − Σ_{i=1}^{n} c(yi, τ),  (6)

which we maximize as a function of β and τ, if it is unknown, to obtain estimates of the parameters of interest. Before discussing parameter estimation further, it is useful to relate the parameters of Eq. (6) to the moments of Y. It is well known that

E[∂l/∂θi] = E[{yi − b′(θi)}/τ²] = 0,  (7)

and

var[∂l/∂θi] = −E[∂²l/∂θi²] = b″(θi)/τ²  (8)
under regularity conditions.6 Equations (7) and (8) show that

E(Yi) = b′(θi) = µi  (9)

and

var(Yi) = τ²b″(θi) = τ²v(µi),  (10)

where v is known as the variance function. Standard texts such as McCullagh and Nelder2 show that

∂l/∂βk = (1/τ²) Σ_{i} (yi − µi) wi g′(µi) xik,  k = 0, . . . , p,  (11)

where wi = [v(µi)g′(µi)²]⁻¹. Solving ∂l/∂β = 0 yields the maximum likelihood estimators β̂. The inverse of the negative expected value of the matrix of second derivatives of the log likelihood provides asymptotic variances of the estimators β̂, var∞(β̂). Replacing parameters in this matrix by their maximum likelihood estimates provides estimated variances of β̂.

An extension of least squares estimation known as the iteratively re-weighted least squares (IRLS) algorithm provides a common approach to computing the maximum likelihood estimators of the parameters of GLMs. Given the current value of the linear predictor η̂0 and associated fitted value µ̂0 = g⁻¹(η̂0), one obtains updated estimates β̂ by regressing an adjusted dependent variable z0 on the covariates x with weight w0. The adjusted dependent variable z0 is

z0 = η̂0 + (y − µ̂0)g′(µ̂0),  (12)

while the weight is defined by

w0⁻¹ = [g′(µ̂0)]²v(µ̂0).  (13)

Iterating these weighted regressions to convergence yields β̂. The estimated variances of β̂ support tests of hypotheses about individual coefficients, for example, about the value of β1. To conduct the test, we form the Wald statistic, the difference between β̂1 and its hypothesized value divided by the estimated standard error of β̂1, and compare it to a standard normal reference distribution.
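The following is a minimal numpy sketch of the IRLS algorithm for the canonical logistic model, where Eqs. (12) and (13) simplify to z0 = η̂0 + (y − µ̂0)/[µ̂0(1 − µ̂0)] and w0 = µ̂0(1 − µ̂0); the simulated data and starting values are illustrative assumptions, not taken from the article.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-10):
    """Fit a logistic GLM by iteratively re-weighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                     # current linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))    # fitted values, mu = g^{-1}(eta)
        g_prime = 1.0 / (mu * (1.0 - mu))  # g'(mu) for the logit link
        z = eta + (y - mu) * g_prime       # adjusted dependent variable, Eq. (12)
        w = mu * (1.0 - mu)                # weights, Eq. (13)
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))  # weighted LS step
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Illustrative data
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * x)))
y = rng.binomial(1, p).astype(float)
print(irls_logistic(X, y))   # approximately [0.5, 1.2]
```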
Quasilikelihood
To implement the maximum likelihood procedures of the previous section, a data analyst must specify the density of Y given in Eq. (2), and there may not be sufficient information about Y to make this specification; an analyst may be able to specify a model for the mean and variance of the response but not the full density. Quasilikelihood methods provide parameter estimation and inference in such settings. The approach involves a likelihood-like quantity whose construction involves fewer assumptions. To construct the quasilikelihood, we note that the function

qi = (yi − µi)/[τ²v(µi)]  (16)

has important properties in common with the log likelihood derivative ∂l/∂θi. Namely, E(qi) = 0, as in Eq. (7), and var(qi) is equal to the negative of the expected value of the derivative of qi, as in Eq. (8). Analogous to the relationship of ∂l/∂θi to the log likelihood, we define the log quasilikelihood via the contribution yi makes to it:

Qi = ∫_{yi}^{µi} (yi − t)/[τ²v(t)] dt.  (17)
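For the Poisson-type variance function v(t) = t with τ² = 1, the integral in Eq. (17) has the closed form Qi = yi log(µi/yi) − (µi − yi), which, up to terms not involving µi, is the Poisson log likelihood: an instance of a mean-to-variance specification recovering a legitimate likelihood. The sketch below (illustrative values, assuming scipy) verifies the closed form numerically.

```python
import numpy as np
from scipy.integrate import quad

def Q_poisson_closed_form(y, mu):
    """Closed form of Eq. (17) for v(t) = t and tau^2 = 1."""
    return y * np.log(mu / y) - (mu - y)

def Q_numeric(y, mu):
    """Eq. (17) evaluated by numerical integration."""
    val, _ = quad(lambda t: (y - t) / t, y, mu)
    return val

y, mu = 3.0, 5.0
print(Q_poisson_closed_form(y, mu), Q_numeric(y, mu))  # both approx. -0.467
```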
It is important to note that the construction of Qi only requires specifying how the variance changes with the mean. Also, it is often the case that if we specify a mean-to-variance relationship, we obtain maximum quasilikelihood equations that are exactly the same as those corresponding to a legitimate likelihood. Directly modeling the mean-to-variance relationship frees data analysts from some constraints imposed by fully parametric GLMs. For example, the quasilikelihood approach provides methods to analyze count data where the response variance substantially exceeds the mean, a feature that a standard Poisson regression model cannot accommodate. Settings where the actual variability exceeds model-based values are termed 'overdispersed'. Common statistical packages estimate the degree of overdispersion using moment-based methods applied to model goodness-of-fit measures such as the Pearson χ².2

Inference for the parameters β using quasilikelihood methods proceeds much as with full maximum likelihood. This follows from the work of McCullagh,8 which shows that maximum quasilikelihood estimators consistently estimate β and have an asymptotic normal distribution.
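As a sketch of this moment-based approach (assuming the statsmodels package; the data are simulated for illustration), one can fit a Poisson GLM and estimate the dispersion as the Pearson χ² statistic divided by the residual degrees of freedom; values well above 1 indicate overdispersion.

```python
import numpy as np
import statsmodels.api as sm

# Simulate overdispersed counts: a negative binomial has variance > mean.
rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
mu = np.exp(0.3 + 0.5 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))   # mean mu, variance mu + mu^2/2

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Moment-based dispersion estimate: Pearson chi-square / residual df.
phi_hat = fit.pearson_chi2 / fit.df_resid
print(phi_hat)   # substantially larger than 1 for these data
```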
Conditional Likelihood
Settings commonly arise where experimental units are gathered in clusters, matched sets, or other fine strata. Gathering costs and pain-related outcomes on the same patients (= clusters) during several time periods in the back pain study would be an example. Cluster-specific means of responses typically exhibit substantial variability between clusters, and it would be natural to consider describing such data by fitting a GLM with a parameter for each cluster, along with other predictors of interest. The model will typically contain many cluster-specific parameters and relatively few parameters corresponding to the predictors of interest. It is well known that standard full likelihood estimation methods can yield inconsistent estimators of the effects of the predictors of interest.9 Intuitively, the inconsistent estimation results from the fact that the parameter set grows with the size of the sample with little new information about these additional parameters.

Instead of trying to simultaneously estimate both the parameters associated with the predictors of interest and all the cluster-specific parameters, conditional likelihood methods focus on the parameters of interest and eliminate all the cluster-specific parameters from the likelihood by conditioning on sufficient statistics, which exist for canonical link models.

Let i = 1, . . . , m denote clusters or strata, j = 1, . . . , ni denote units within clusters, and αi denote the cluster-specific intercepts in the linear predictor ηij, for example ηij = αi + β1xij. For canonical link models the sufficient statistics for the αi are Σ_{j=1}^{ni} yij, and the conditional likelihood has terms

f(yi1, . . . , yini | Σ_{j=1}^{ni} yij),  (20)

which depend on the parameters of interest, β, but not the αi.

We can use the conditional likelihood as we would use a standard likelihood. For example, we obtain parameter estimates by maximizing the likelihood built up from terms in Eq. (20) and obtain estimated large sample standard errors from the Hessian of the conditional likelihood. We can also use standard likelihood-based inference methods such as likelihood ratio tests.

Although conditional likelihood methods are only readily available for canonical link models, Neuhaus and Kalbfleisch10 and Neuhaus and McCulloch11 show that one can obtain conditional likelihood-like inference for noncanonical link models. These authors show that the conditional likelihoods (Eq. (20)) depend on the covariates xij only through the deviations (xij − x̄i·). This suggests a strong connection between conditional likelihood estimates and estimates from generalized linear mixed models (GLMMs)12 that model covariate effects in terms of these deviations. Such models replace β1xij in the linear predictor by

βB x̄i· + βW (xij − x̄i·),  (21)

so that βB measures the change in E[yij] associated with between-cluster differences in covariate means while βW measures the change in E[yij] associated with within-cluster covariate differences. The connection between covariate decomposition and conditional likelihood approaches suggests that GLMMs that include separate between- and within-cluster covariate components can provide conditional likelihood-like inference even for noncanonical link models that do not support conditional likelihood methods.
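The following sketch (hypothetical clustered data, invented for illustration) constructs the between- and within-cluster covariate components of Eq. (21); these two columns would then enter the linear predictor of a GLMM in place of the single column xij.

```python
import numpy as np

# Hypothetical clustered data: cluster labels and a covariate x_ij.
# Labels are assumed here to be contiguous integers 0, 1, ..., m-1.
cluster = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
x = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 1.0, 1.0, 2.0, 4.0])

# Cluster means xbar_i, expanded to one value per observation.
cluster_means = np.array([x[cluster == c].mean() for c in np.unique(cluster)])
x_between = cluster_means[cluster]   # enters the model with coefficient beta_B
x_within = x - x_between             # enters the model with coefficient beta_W

# Eq. (21): beta_B * xbar_i + beta_W * (x_ij - xbar_i) replaces beta_1 * x_ij.
print(np.column_stack([x_between, x_within]))
```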
DIAGNOSTICS

After specifying and fitting a statistical model of interest, it is important for the data analyst to assess the quality of the fit to the data in order to validate model assumptions, to identify observations where the model fits the data particularly poorly, and to identify observations which influence the fit far more than others. Methods to assess the quality of the fit of a linear regression model, termed regression diagnostics, are well developed.13 The close connection of the
linear predictor (Eq. (4)) to the classical linear model allows the extension of linear regression diagnostics to GLMs in a straightforward manner. For example, Pregibon14 replaces the continuous response in the diagnostic methods of Cook and Weisberg13 with the adjusted dependent variable z in the form of Eq. (12) to develop deletion diagnostics and leverage plots for GLMs. In addition, Wang15 developed residual plots for GLMs. Common statistical computing packages have implemented these diagnostics so that data analysts can apply common model assessment methods for any GLM.
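In that spirit, the sketch below (assuming statsmodels; simulated data) extracts Pearson residuals from a fitted logistic GLM and computes leverages as the hat-matrix diagonal of the final IRLS weighted regression, the quantity underlying Pregibon's14 leverage plots; the manual hat-matrix computation is one possible implementation, not the article's.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.2 + 0.8 * x))))

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()

# Pearson residuals, (y - mu)/sqrt(v(mu)): large values flag poorly fit points.
r_pearson = fit.resid_pearson

# Leverages: diagonal of the hat matrix of the final IRLS weighted regression,
# H = W^(1/2) X (X'WX)^(-1) X' W^(1/2).
mu = fit.fittedvalues
w = mu * (1.0 - mu)                  # IRLS weights for the logistic model
Xw = np.sqrt(w)[:, None] * X         # W^(1/2) X
H = Xw @ np.linalg.solve(X.T @ (w[:, None] * X), Xw.T)
leverage = np.diag(H)

print(r_pearson[:3])
print(leverage.sum())                # equals the number of columns of X, here 2
```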
COMMENTS

This article has focused on GLMs for independent observations, but as the Conditional Likelihood Section indicates, settings often arise where observations are gathered in clusters or groups and analysts need to accommodate within-cluster dependence of the responses. Longitudinal data are an important case of such dependent data. GLMMs12 extend standard GLMs to settings with dependent data by introducing random effects into the linear predictor η to model the dependence. The assumptions underlying GLMMs allow the construction of a likelihood, and analysts can use likelihood-based methods for parameter estimation and inference. GLMMs are usually more difficult to fit than standard GLMs, often requiring numerical integration methods, but provide estimates of the associations of changes in predictors within clusters with changes in response, typically the object of scientific interest with longitudinal data.

Lee and Nelder16 describe a related class of models, known as hierarchical GLMs, which combine GLMs with latent variables in the linear predictor. Rather than estimating parameters by maximizing the marginal likelihood obtained by integrating out the random effects, as with GLMMs, Lee and Nelder16 advocate the use of a hierarchical likelihood, known as h-likelihood, which avoids the integration but can yield biased estimators.

Generalized estimating equations (GEE) methods17 are another extension of GLMs to dependent data settings. GEE is essentially a quasilikelihood approach in which the data analyst specifies a GLM for each unit in the sample along with some working assumptions about the within-cluster correlation structure of the responses. Like the methods of the Quasilikelihood Section, GEE approaches do not specify the joint distribution of the responses, so standard likelihood procedures are not available. Instead, one constructs an estimating equation whose solution is a consistent estimator of the parameters of the unit-specific models.18 The parameters of the models underlying the GEE approach measure the same covariate effects that one would with a single observation per subject and do not involve the cluster structure of the data.
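A minimal sketch of a GEE fit (assuming statsmodels; simulated, illustrative data): a Poisson GLM for each observation combined with an exchangeable working correlation within clusters.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
m, ni = 50, 4                                      # 50 clusters of 4 observations
groups = np.repeat(np.arange(m), ni)
u = np.repeat(rng.normal(scale=0.5, size=m), ni)   # shared cluster effect
x = rng.normal(size=m * ni)
y = rng.poisson(np.exp(0.2 + 0.4 * x + u))         # within-cluster dependent counts

X = sm.add_constant(x)
model = sm.GEE(y, X, groups=groups, family=sm.families.Poisson(),
               cov_struct=sm.cov_struct.Exchangeable())
result = model.fit()
print(result.params)   # covariate effect estimates
print(result.bse)      # robust standard errors accounting for clustering
```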
CONCLUSION

GLMs are extensions of the classical linear regression model for continuous, normal responses that allow the regression analysis of a variety of nonnormal responses such as binary indicators, counts, and positively valued random variables. As the covariates of GLMs enter the statistical model via a linear predictor, as in linear regression models, many of the modeling decisions and strategies of linear models carry over exactly or with minor modification to GLMs. Analysis with GLMs is typically likelihood-based, and data analysts can use a common algorithm to obtain maximum likelihood estimators for any model in the class. Like linear regression models for continuous, normal-like data, GLMs provide useful assessments of the associations of predictors with a wide variety of responses.
REFERENCES

1. Nelder JA, Wedderburn RWM. Generalized linear models. J R Stat Soc, Ser A 1972, 135:370–384.
2. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. London: Chapman & Hall; 1989.
3. Von Korff M, Barlow W, Cherkin D, Deyo R. Effects of practice style in managing back pain. Ann Internal Med 1994, 121:187–195.
4. Bliss C. The calculation of the dose-mortality curve. Ann Appl Biol 1935, 22:134–167.
5. Finney DJ. Probit Analysis. 3rd ed. Cambridge: Cambridge University Press; 1971.
6. Cox DR, Hinkley DV. Theoretical Statistics. London: Chapman & Hall; 1974.
7. Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika 1974, 61:439–447.
8. McCullagh P. Quasi-likelihood functions. Ann Stat 1983, 11:59–67.
9. Neyman J, Scott EL. Consistent estimates based on partially consistent observations. Econometrica 1948, 16:1–32.
10. Neuhaus JM, Kalbfleisch JD. Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics 1998, 54:638–645.
11. Neuhaus JM, McCulloch CE. Separating between and within-cluster covariate effects using conditional and partitioning methods. J R Stat Soc, Ser B 2006, 68:859–872.
12. McCulloch CE, Searle SR, Neuhaus JM. Generalized, Linear and Mixed Models. 2nd ed. New York: John Wiley & Sons; 2008.
13. Cook RD, Weisberg S. Residuals and Influence in Regression. New York: Chapman & Hall; 1982.
14. Pregibon D. Logistic regression diagnostics. Ann Stat 1981, 9:705–724.
15. Wang P. Residual plots for detecting nonlinearity in generalized linear models. Technometrics 1987, 29:435–438.
16. Lee Y, Nelder J. Hierarchical generalized linear models (with discussion). J R Stat Soc, Ser B 1996, 58:619–678.
17. Diggle PJ, Heagerty PJ, Liang K-Y, Zeger SL. Analysis of Longitudinal Data. 2nd ed. Oxford: Oxford University Press; 2002.
18. Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986, 73:13–22.