Lecture 27 - Poisson Regression
or equivalently,
\[
\lambda_i = e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}.
\]
Together with the distributional assumption Yi ∼ Poisson(λi ), this is called the Poisson
log-linear model, or the Poisson regression model. It is a special case of what is known in
neuroscience as the linear-nonlinear Poisson cascade model.
More generally, the Poisson log-linear model is a model for n responses Y1 , . . . , Yn that
take integer count values. Each Yi is modeled as an independent Poisson(λi ) random variable,
where log λi is a linear combination of the covariates corresponding to the ith observation.
As in the cases of linear and logistic regression, we treat the covariates as fixed constants,
and the model parameters to be inferred are the regression coefficients β = (β0 , . . . , βp ).
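To make the model concrete, the following is a minimal simulation sketch in Python with NumPy (not part of the original notes); the sample size, coefficient values, and variable names are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200
beta0, beta1 = 0.5, 1.2              # illustrative regression coefficients
x = rng.uniform(-1.0, 1.0, size=n)   # a single fixed covariate

lam = np.exp(beta0 + beta1 * x)      # lambda_i = e^{beta_0 + beta_1 x_i}
Y = rng.poisson(lam)                 # independent Y_i ~ Poisson(lambda_i)
```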
Since Y1 , . . . , Yn are independent Poisson random variables, the likelihood function is
given by
\[
\mathrm{lik}(\beta_0, \ldots, \beta_p) = \prod_{i=1}^n \frac{\lambda_i^{Y_i} e^{-\lambda_i}}{Y_i!},
\]
where λi is defined in terms of β0 , . . . , βp and the covariates xi1 , . . . , xip via equation (27.1).
Setting xi0 ≡ 1 for all i, the log-likelihood is then
\[
l(\beta_0, \ldots, \beta_p) = \sum_{i=1}^n \Bigl( Y_i \log \lambda_i - \lambda_i - \log Y_i! \Bigr)
= \sum_{i=1}^n \Biggl( Y_i \sum_{j=0}^p \beta_j x_{ij} - e^{\sum_{j=0}^p \beta_j x_{ij}} - \log Y_i! \Biggr)
\]
and the MLEs are the solutions to the system of score equations, for m = 0, . . . , p,
\[
0 = \frac{\partial l}{\partial \beta_m} = \sum_{i=1}^n x_{im} \Bigl( Y_i - e^{\sum_{j=0}^p \beta_j x_{ij}} \Bigr).
\]
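Since the score equations have no closed-form solution, the MLE is computed numerically. Below is a minimal sketch in Python with NumPy of Newton's method applied to these equations (illustrative code, not from the notes; the function names, zero initialization, and fixed iteration count are arbitrary choices, and a real implementation would monitor convergence).

```python
import numpy as np
from scipy.special import gammaln   # gammaln(Y + 1) = log(Y!)

def poisson_loglik(beta, X, Y):
    """Log-likelihood l(beta) for the Poisson log-linear model."""
    eta = X @ beta                                  # eta_i = sum_j beta_j x_ij
    return np.sum(Y * eta - np.exp(eta) - gammaln(Y + 1))

def poisson_mle(X, Y, n_iter=25):
    """Newton-Raphson for the score equations; X should include a column of ones."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        lam = np.exp(X @ beta)                      # lambda_i = e^{sum_j beta_j x_ij}
        score = X.T @ (Y - lam)                     # partial l / partial beta_m
        fisher = X.T @ (lam[:, None] * X)           # X^T W(beta) X = -Hessian
        beta = beta + np.linalg.solve(fisher, score)  # Newton update
    return beta
```

For instance, with the simulated data above one could call `poisson_mle(np.column_stack([np.ones(n), x]), Y)`.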
Writing Xj = (x1j , . . . , xnj ) as the jth column of the covariate matrix X and defining the diagonal matrix
\[
W = W(\beta) := \mathrm{diag}\Bigl( e^{\sum_{j=0}^p \beta_j x_{1j}}, \ldots, e^{\sum_{j=0}^p \beta_j x_{nj}} \Bigr),
\]
the second-order partial derivatives may be written as $\frac{\partial^2 l}{\partial \beta_m \,\partial \beta_l} = -X_m^T W X_l$, so $\nabla^2 l(\beta) = -X^T W X$ and $I_Y(\beta) = X^T W X$. For large n, if the Poisson log-linear model is correct, then the MLE vector β̂ is approximately distributed as $N(\beta, (X^T W X)^{-1})$. We may then estimate the standard error of β̂j by
\[
\widehat{\mathrm{se}}_j = \sqrt{\bigl( (X^T \hat{W} X)^{-1} \bigr)_{jj}}\,,
\]
where Ŵ = W (β̂) is the plugin estimate for W . These formulas are the same as for the case
of logistic regression in Lecture 26, except with a different form of the diagonal matrix W .
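Continuing the sketch above, the standard error estimates follow directly from the plug-in matrix X^T Ŵ X (again illustrative code; `poisson_mle` and the data arrays are the hypothetical names used earlier).

```python
import numpy as np

def poisson_se(X, beta_hat):
    """Plug-in standard errors sqrt(((X^T W_hat X)^{-1})_{jj})."""
    lam_hat = np.exp(X @ beta_hat)                  # diagonal of W_hat = W(beta_hat)
    fisher_hat = X.T @ (lam_hat[:, None] * X)       # X^T W_hat X
    cov_hat = np.linalg.inv(fisher_hat)             # approximate covariance of beta_hat
    return np.sqrt(np.diag(cov_hat))                # se_hat_j for j = 0, ..., p
```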
The modeling assumption of a Poisson distribution for Yi is rather restrictive, as it implies
that the variance of Yi must be equal to its mean. This is rarely true in practice, and it is
frequently the case that the observed variance of Yi is larger than its mean—this problem is
known as overdispersion. Nonetheless, the Poisson regression model is oftentimes used in
overdispersed settings: As long as Y1 , . . . , Yn are independent and
\[
E[Y_i] = e^{\sum_{j=0}^p \beta_j x_{ij}}
\]
for each i (so the model for the means of the Yi ’s is correct), it may be shown that the MLE β̂ in the Poisson regression model is still a consistent estimate of β, even if the distribution of Yi is not Poisson and the variance of Yi exceeds its mean. The above standard error estimate $\widehat{\mathrm{se}}_j$ and the associated confidence interval for βj , though, would not be correct in the overdispersed
setting. One may use instead the robust sandwich estimate of the covariance of β̂, given by
\[
(X^T \hat{W} X)^{-1} (X^T \tilde{W} X) (X^T \hat{W} X)^{-1}
\]
where
\[
\tilde{W} = \mathrm{diag}\bigl( (Y_1 - \hat\lambda_1)^2, \ldots, (Y_n - \hat\lambda_n)^2 \bigr)
\]
and $\hat\lambda_i = e^{\sum_{j=0}^p \hat\beta_j x_{ij}}$ is the fitted value of λ for the ith observation. Alternatively, one may
use the pairs bootstrap procedure as described in Lecture 26.
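As a sketch of how the sandwich estimate might be computed (continuing the earlier illustrative NumPy code; the function name is hypothetical):

```python
import numpy as np

def poisson_sandwich_cov(X, Y, beta_hat):
    """Sandwich covariance (X^T W_hat X)^{-1} (X^T W_tilde X) (X^T W_hat X)^{-1}."""
    lam_hat = np.exp(X @ beta_hat)                            # fitted values lambda_hat_i
    bread = np.linalg.inv(X.T @ (lam_hat[:, None] * X))       # (X^T W_hat X)^{-1}
    meat = X.T @ (((Y - lam_hat) ** 2)[:, None] * X)          # X^T W_tilde X
    return bread @ meat @ bread

# Robust standard errors are the square roots of the diagonal entries:
# np.sqrt(np.diag(poisson_sandwich_cov(X, Y, beta_hat)))
```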
Remark 27.2. The linear model, logistic regression model, and Poisson regression model
are all examples of the generalized linear model (GLM). In a generalized linear model,
Y1 , . . . , Yn are modeled as independent observations with distributions Yi ∼ f (y|θi ) for some
one-parameter family f (y|θ). The parameter θi is modeled as
g(θi ) = β0 + β1 xi1 + . . . + βp xip
for some one-to-one transformation g : R → R called the link function, where xi1 , . . . , xip
are covariates corresponding to Yi. In the linear model considered in Lecture 25, the parameter was θ ≡ µ where f(y|µ) was the PDF of the $N(\mu, \sigma_0^2)$ distribution (for a known variance $\sigma_0^2$), and g(µ) = µ. In logistic regression, the parameter was θ ≡ p where f(y|p) was the PMF of the Bernoulli(p) distribution, and $g(p) = \log \frac{p}{1-p}$. In Poisson regression,
the parameter was θ ≡ λ where f (y|λ) was the PMF of the Poisson(λ) distribution, and
g(λ) = log λ.
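In practice these three GLMs can be fit with standard software; the following sketch uses the statsmodels package (an assumption of this example, not something referenced in the notes), with placeholder data arrays.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
X = sm.add_constant(rng.normal(size=(n, 2)))    # design matrix with an intercept column

y_gauss = rng.normal(size=n)                    # continuous response (linear model)
y_binary = rng.integers(0, 2, size=n)           # 0/1 response (logistic regression)
y_count = rng.poisson(2.0, size=n)              # count response (Poisson regression)

gaussian_fit = sm.GLM(y_gauss, X, family=sm.families.Gaussian()).fit()   # identity link
logistic_fit = sm.GLM(y_binary, X, family=sm.families.Binomial()).fit()  # logit link
poisson_fit = sm.GLM(y_count, X, family=sm.families.Poisson()).fit()     # log link

print(poisson_fit.params)   # estimated coefficients beta_hat
print(poisson_fit.bse)      # their standard errors
```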
The choice of the link function g is an important modeling decision, as it determines which
transform of the model parameter should be modeled as linear in the observed covariates.
In each of the three examples discussed, we used what is called the natural link, which
is motivated by considering a change-of-variable for the parameter, θ ↦ η(θ), so that the
PDF/PMF f (y|η) in terms of the new parameter η has the form
\[
f(y \mid \eta) = e^{\eta y - A(\eta)} h(y)
\]
for some functions A and h. For example, the Bernoulli PMF is
\[
f(y) = p^y (1-p)^{1-y} = \left( \frac{p}{1-p} \right)^{y} (1-p) = e^{\left(\log \frac{p}{1-p}\right) y + \log(1-p)},
\]
so we may set $\eta = \log \frac{p}{1-p}$, $A(\eta) = -\log(1-p) = \log(1 + e^{\eta})$, and $h(y) = 1$. This is called
the exponential family form of the PDF/PMF, and η is called the natural parameter.
In each example, the natural link simply sets g(θ) = η(θ) (or equivalently, g(θ) = cη(θ) for
a constant c).
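For comparison (a short check not written out above), the Poisson(λ) PMF used in this lecture can be put in the same exponential family form:
\[
f(y) = \frac{\lambda^y e^{-\lambda}}{y!} = e^{(\log \lambda)\, y - \lambda} \cdot \frac{1}{y!},
\]
so $\eta = \log \lambda$, $A(\eta) = \lambda = e^{\eta}$, and $h(y) = 1/y!$; this is why g(λ) = log λ is the natural link for Poisson regression.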
Use of the natural link leads to some nice mathematical properties for likelihood-based
inference—for instance, since η is modeled as linear in β, the second-order partial derivatives
of
log f (Y |η) = ηY − A(η) + log h(Y )
with respect to β do not depend on Y , so the Fisher information is always given by −∇2 l(β)
without needing to take an expectation. (We sometimes say in this case that the “observed
and expected Fisher information matrices” are the same.) On the other hand, from the
modeling perspective, there is usually no intrinsic reason to believe that the natural link
g(θ) = η(θ) is the correct transformation of θ that is well-modeled as a linear combination
of the covariates, and other link functions are also commonly used, especially if they lead to
a better fit for the data.
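To spell out the computation behind the claim that the observed and expected Fisher information matrices coincide under the natural link: writing $\eta_i = \sum_{j=0}^p \beta_j x_{ij}$ for the natural parameter of the ith observation,
\[
\frac{\partial^2}{\partial \beta_m \,\partial \beta_l} \bigl( \eta_i Y_i - A(\eta_i) + \log h(Y_i) \bigr) = -A''(\eta_i)\, x_{im} x_{il},
\]
which involves no Yi, so −∇²l(β) is a nonrandom function of β alone and no expectation over Y is needed.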