Modeling Count Data. ISBN 1107611253, 978-1107611252
Other Statistics Books by Joseph M. Hilbe
www.cambridge.org
Information on this title: www.cambridge.org/9781107611252
© Joseph M. Hilbe 2014
A catalog record for this publication is available from the British Library.
ISBN 978-1-107-02833-3 Hardback
ISBN 978-1-107-61125-2 Paperback
Additional resources for this publication at www.cambridge.org/9781107611252
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or
third-party Internet web sites referred to in this publication and does not guarantee that any content on
such web sites is, or will remain, accurate or appropriate.
Contents
Preface
Chapter 1. Varieties of Count Data
Chapter 2. Poisson Regression
Chapter 3. Testing Overdispersion
Chapter 4. Assessment of Fit
Chapter 5. Negative Binomial Regression
Chapter 6. Poisson Inverse Gaussian Regression
Chapter 7. Problems with Zeros
Chapter 8. Modeling Underdispersed Count Data – Generalized Poisson
Chapter 9. Complex Data: More Advanced Models
Modeling Count Data is written for the practicing researcher who has a
reason to analyze and draw sound conclusions from modeling count data.
More specifically, it is written for an analyst who needs to construct a count
response model but is not sure how to proceed.
A count response model is a statistical model for which the dependent, or
response, variable is a count. A count is understood as a nonnegative discrete
integer ranging from zero to some specified greater number. This book aims
to be a clear and understandable guide to the following points:
There is indeed a lot to consider when selecting the best-fitted model for
your data. I will do my best in these pages to clarify the foremost concepts
and problems unique to modeling counts. If you follow along carefully, you
should have a good overview of the subject and a basic working knowledge
needed for constructing an appropriate model for your study data. I focus
on understanding the nature of the most commonly used count models and
Joseph M. Hilbe
Florence, Arizona
August 12, 2013
CHAPTER 1
When discussing the modeling of count data, it’s important to clarify exactly
what is meant by a count, as well as “count data” and “count variable.” The
word “count” is typically used as a verb meaning to enumerate units, items,
or events. We might count the number of road kills observed on a stretch of
highway, how many patients died at a particular hospital within 48 hours of
having a myocardial infarction, or how many separate sunspots were observed
in March 2013. “Count data,” on the other hand, is a plural noun referring
Understanding how count data are modeled, and what modeling entails, is
discussed in the following section. For readers with little background in linear
models, I strongly suggest that you read through Chapter 1 even though various
points may not be fully understood. Then re-read the chapter carefully.
The essential concepts and relationships involved in modeling should then
be clear. In Chapter 1, I have presented the fundamentals of modeling, focusing
on normal and count model estimation from several viewpoints, which
should at the end provide the reader with a sense of how the modeling process
is to be understood when applied to count models. If certain points are still
1.2 Understanding a Statistical Count Model
1
A model may consist of only the response variable, unadjusted by explanatory
variables. Such a model is estimated by modeling the response on the intercept.
For example, using R: lm(y ~ 1); using Stata: reg y.
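The intercept-only model in this footnote can be sketched in R as follows. This is a minimal illustration, not a listing from the book: the response values are the six systolic blood pressure readings used later in this section, and the names `y`, `fit0`, and `intercept` are chosen here for convenience. With no explanatory variables, the least squares estimate of the intercept is simply the mean of the response.

```r
# Illustrative response: the six sbp readings used later in this section.
y <- c(131, 132, 122, 119, 123, 115)

# Intercept-only model: the response is modeled on the constant 1.
fit0 <- lm(y ~ 1)

# The only estimated coefficient is the intercept.
intercept <- unname(coef(fit0))

# For an intercept-only least squares fit, the intercept equals mean(y).
print(intercept)
print(mean(y))
```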
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \varepsilon \tag{1.2}$$
Statisticians usually convert equation (1.2) to one that has the left-hand side
being the predicted or expected mean value of the response, based on the sum
of the predictors and coefficients. Each associated coefficient and predictor is
called a regression term:
$$\hat{y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n \tag{1.3}$$
or
$$\hat{\mu} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n \tag{1.4}$$
Notice that the error became part of the expected or predicted mean response.
“^”, or hat over y and μ (mu), indicates that this is an estimated value. From
this point on, I use the symbol μ to refer to the predicted value, without a hat.
Understand, though, that when we are estimating a parameter or a statistic,
a hat should go over it. The true unknown parameter, on the other hand, has
no hat. You will also at times see the term E(y), the expected value of y, used for this predicted mean. I
will not use it here.
In matrix form, where the individual terms of the regression are expressed
in a single term, we have
$$\mu = X\beta \tag{1.5}$$
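As a rough sketch of the matrix form, the following R code builds the design matrix and computes the linear predictor as a single matrix product. The data are the six-observation blood pressure example that this section goes on to list. Solving the normal equations with `solve()` and `crossprod()` is an assumption made here for illustration; it is not necessarily how `lm()` or Stata computes the estimates, and the names `X` and `beta` are introduced only for this example.

```r
# Data from the blood pressure example listed later in this section.
sbp    <- c(131, 132, 122, 119, 123, 115)
male   <- c(1, 1, 1, 0, 0, 0)
smoker <- c(1, 1, 0, 0, 1, 0)
age    <- c(34, 36, 30, 32, 26, 23)

# Design matrix X: a leading column of 1s for the intercept, then the predictors.
X <- cbind(intercept = 1, male = male, smoker = smoker, age = age)

# Least squares coefficients from the normal equations (X'X) beta = X'y.
# (Illustrative choice of method; an assumption, not the book's procedure.)
beta <- solve(crossprod(X), crossprod(X, sbp))

# Matrix form of the linear predictor: mu = X beta.
mu <- drop(X %*% beta)
print(round(mu, 4))
```

Each row of `X` multiplied by `beta` reproduces the sum of regression terms in equation (1.4) for one observation, which is why the intercept needs its own column of 1s.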
STATA CODE
. regress sbp male smoker age, nohead
------------------------------------------------------------------------
   sbp |    Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
-------+----------------------------------------------------------------
  male | 4.048601   .2507664   16.14   0.004    2.96964    5.127562
smoker | 6.927835   .1946711   35.59   0.001   6.090233    7.765437
   age | .4698085     .02886   16.28   0.004   .3456341     .593983
 _cons | 104.0059   .7751557  134.17   0.000   100.6707    107.3411
------------------------------------------------------------------------
Continuing with Stata, we may obtain the predicted value, μ, which is the
estimated mean systolic blood pressure, and display the predictor values
together with μ (mu) as
. predict mu
. l // 'l' is an abbreviation for list
+------------------------------------+
| sbp male smoker age mu |
|------------------------------------|
1. | 131 1 1 34 130.9558 |
2. | 132 1 1 36 131.8954 |
3. | 122 1 0 30 122.1488 |
4. | 119 0 0 32 119.0398 |
5. | 123 0 1 26 123.1488 |
6. | 115 0 0 23 114.8115 |
+------------------------------------+
To see exactly what this means, we sum the terms of the regression. The
intercept term is also summed, but its values are set at 1. The _b[] term
captures the coefficient from the results saved by the software. For the
intercept, _b[_cons] adds the intercept term, slope[1], to the other values. The
term xb is also commonly referred to as the linear predictor.
. gen xb = _b[male]*male + _b[smoker]*smoker + _b[age]*age + _b[_cons]
. l
+-----------------------------------------------+
| sbp male smoker age mu xb |
|-----------------------------------------------|
1. | 131 1 1 34 130.9558 130.9558 |
2. | 132 1 1 36 131.8954 131.8954 |
3. | 122 1 0 30 122.1488 122.1488 |
4. | 119 0 0 32 119.0398 119.0398 |
5. | 123 0 1 26 123.1488 123.1488 |
6. | 115 0 0 23 114.8115 114.8115 |
+-----------------------------------------------+
Using R, we may obtain the same results with the following code:
R CODE
> sbp <- c(131,132,122,119,123,115)
> male <- c(1,1,1,0,0,0)
> smoker <- c(1,1,0,0,1,0)
> age <- c(34,36,30,32,26,23)
> summary(reg1 <- lm(sbp ~ male + smoker + age))
<results not displayed>
> mu <- predict(reg1)
> mu
1 2 3 4 5 6
130.9558 131.8954 122.1487 119.0398 123.1487 114.8115
As was done with the Stata code, we may calculate the linear predictor, which
is the same as μ, by first extracting the coefficients
> cof <- reg1$coef
> cof
(Intercept) male smoker age
104.0058910 4.0486009 6.9278351 0.4698085
and then the linear predictor, xb. Each coefficient can be identified with [ ].
The values are identical to mu.
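The listing for this step does not appear in the text above, so the following is an illustrative reconstruction under stated assumptions rather than the author's code. It mirrors the Stata `gen xb` command shown earlier: each coefficient is picked out by position with `[ ]` and multiplied by its predictor, with the intercept added on its own. The use of `coef(reg1)` in place of `reg1$coef` is a choice made here; both return the same values.

```r
# Data and model as defined earlier in this section.
sbp    <- c(131, 132, 122, 119, 123, 115)
male   <- c(1, 1, 1, 0, 0, 0)
smoker <- c(1, 1, 0, 0, 1, 0)
age    <- c(34, 36, 30, 32, 26, 23)
reg1 <- lm(sbp ~ male + smoker + age)

# Coefficients in model order: (Intercept), male, smoker, age.
cof <- coef(reg1)

# Linear predictor: intercept plus each coefficient times its predictor.
xb <- cof[1] + cof[2] * male + cof[3] * smoker + cof[4] * age
print(xb)

# Check that xb agrees with the predicted values from predict().
print(all.equal(unname(xb), unname(predict(reg1))))
```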
Notice the closeness of the observed response and predicted values. The
differences are
> diff <- sbp - mu
> diff
1 2 3 4 5 6
0.04418262 0.10456554 -0.14874816 -0.03976436 -0.14874816 0.18851252
When the predicted or expected values given by the linear predictor are close
to the observed values of the response, we call the model well fitted.
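The comparison of observed and predicted values can be summarized in a short R sketch. The root mean squared difference used here is one common single-number summary, added as an assumption for illustration; the text itself only lists the differences. The names `diffs` and `rmse` are introduced for this example (`diffs` is used so that base R's `diff()` function is not masked).

```r
# Data and model as defined earlier in this section.
sbp    <- c(131, 132, 122, 119, 123, 115)
male   <- c(1, 1, 1, 0, 0, 0)
smoker <- c(1, 1, 0, 0, 1, 0)
age    <- c(34, 36, 30, 32, 26, 23)
reg1 <- lm(sbp ~ male + smoker + age)

# Observed minus predicted values; these are the model residuals.
mu <- predict(reg1)
diffs <- sbp - mu
print(all.equal(unname(diffs), unname(resid(reg1))))

# With an intercept in the model, least squares residuals sum to
# (numerically) zero, so the raw sum says nothing about fit.
print(sum(diffs))

# Root mean squared difference: a typical size of the discrepancy,
# in the same units as sbp. (Illustrative summary, not from the text.)
rmse <- sqrt(mean(diffs^2))
print(rmse)
```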
One of the points about statistical modeling rarely discussed is the
relationship of the data to a probability distribution. All parametric statistical models
are based on an underlying probability distribution. I mentioned before that
the normal or linear regression model is based on the Gaussian, or normal,
probability distribution (see example in Figure 1.1). It is what defines
the error terms. When we are attempting to estimate a least squares regression
or more sophisticated maximum likelihood model, we are estimating
the parameters of the underlying probability distribution that characterize
the data. These two foremost methods of estimation are described in the next
section of this opening chapter. The important point here is always to remember
that when modeling count data we are really estimating the parameters
of a probability distribution that we believe best represents the data we are
modeling. We are never able to knowingly determine the true parameters