Do Not Log-Transform Count Data
Summary
1. Ecological count data (e.g. number of individuals or species) are often log-transformed to satisfy parametric test assumptions.
2. Apart from the fact that generalized linear models are better suited to dealing with count data, a log-transformation of counts raises the additional problem of how to deal with zero observations. With just one zero observation (if this observation represents a sampling unit), the whole data set needs to be fudged by adding a value (usually 1) before transformation.
3. Simulating data from a negative binomial distribution, we compared the outcome of fitting models to data that were transformed in various ways (log, square root) with results from fitting quasi-Poisson and negative binomial models to untransformed count data.
4. We found that the transformations performed poorly, except when the dispersion was small and the mean counts were large. The quasi-Poisson and negative binomial models consistently performed well, with little bias.
5. We recommend that count data should not be analysed by log-transforming them; instead, models based on Poisson and negative binomial distributions should be used.
Key-words: generalized linear models, linear models, overdispersion, Poisson, transformation
2010 The Authors. Journal compilation 2010 British Ecological Society, Methods in Ecology & Evolution, 1, 118–122
120 R. B. O’Hara & D. J. Kotze
The simulations were compared by calculating the mean bias, B:

$$B = \frac{1}{S}\sum_{i=1}^{S}\left(\hat{\mu}_i - \mu\right)$$

and root mean-squared error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{S}\sum_{i=1}^{S}\left(\hat{\mu}_i - \mu\right)^{2}}$$

for the simulations, where $\hat{\mu}_i$ is the estimated parameter from the ith simulation, μ is the true value (known from the simulations) and S is the number of simulations. We calculated these on the log scale, i.e. μ = log(λ). This is the scale on which the parameters are estimated in all of the models except the square-root transformation; so, for the latter model we transformed the parameters onto the log scale.
Simulations and analyses were carried out in the R statistical program (R Development Core Team 2009), using the MASS (Venables & Ripley 2002) package. The code that was used is available as an online supplement (Appendix S1 in Supporting Information).

Results
The proportion of counts that were zero is shown in Fig. 1. Naturally, the proportion decreases as the mean increases, and it also decreases as the variance (controlled by θ) decreases.
The biases for the different estimation methods are plotted in Fig. 2 (the quasi-Poisson and negative binomial models behave similarly; so, only the latter is presented; see below). The negative binomial model has negligible bias, whereas the models based on a normal distribution are all biased, particularly at low mean values and high variances. The square-root transformation has a lower bias than any of the log-transformations, unless the mean is low.
The amount of bias also depends on the transformation used. When there is little variation (i.e. high θ, when the negative binomial distribution approaches the Poisson), the square-root transformation has little bias, as does the log-transformation when the mean is high, i.e. there are few zeroes (compare with Fig. 1).
The root mean-squared error shows a similar pattern, with the negative binomial model consistently having a low RMSE, and a high value added to the log-transformation being better (Fig. 3). The behaviour of the log + 1 transformation is a result of a change in sign of the bias, with the minimum at the point where the mean bias is zero (compare with Fig. 2).
The difference between the negative binomial and quasi-Poisson models is negligible. The largest absolute difference in bias was 2.4 × 10⁻⁸, and the largest RMSE was only 1.1 × 10⁻⁸, both of which are much smaller than the scales in Figs 2 and 3.

Discussion
When the error structure of data is simple, a transformation (usually a log or power-transformation) can be quite useful to improve the ability of a model to fit to the data by stabilizing variances or by making relationships linear (Miller 1997; Piepho 2009) before applying simple linear regression. However, a transformation is not guaranteed to solve these problems: there may be a trade-off between homoscedasticity and linearity, or the family of transformations used may not be able to correct one or both of these problems. Different models may therefore need to be applied, and there is now a wide variety of possibilities, of which GLMs and their derivatives (McCullagh
[Fig. 1 image. Y-axis: Proportion of zeroes.]
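The way the proportion of zeroes depends on the mean and on θ can be made explicit with a short worked formula. This is a sketch under an assumption: it uses the negative binomial parameterization with mean λ and dispersion θ in which the variance is λ + λ²/θ (the convention of the MASS package), which is consistent with the statement that high θ approaches the Poisson but is not spelled out in this section.

```latex
P(Y = 0) = \left(\frac{\theta}{\theta + \lambda}\right)^{\theta},
\qquad
\lim_{\theta \to \infty} P(Y = 0) = e^{-\lambda}
```

For fixed θ the expression falls as λ rises, and for fixed λ it falls as θ rises, i.e. as the variance shrinks towards the Poisson value λ. Both directions agree with the pattern described for Fig. 1.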
[Fig. 2 image. Six panels: θ = 0.5, 1, 2, 5, 10, 100. X-axis: True mean (5 to 20). Y-axis: Bias (−3.0 to 0.5). Legend: Neg Bin; Sqrt; Log, +1; Log, +0.5; Log, +0.1; Log, +0.001.]
Fig. 2. Estimated mean biases from six different models, applied to data simulated from a negative binomial distribution. A low bias means that
the method will, on average, return the ‘true’ value. Note that the curves for a quasi-Poisson model would be indistinguishable from a negative
binomial curve.
[Fig. 3 image. Panels: θ = 5, 10, 100. Y-axis: RMSE. Legend: Neg Bin; Sqrt; Log, +1; Log, +0.5; Log, +0.1; Log, +0.001.]
Fig. 3. Estimated root mean-squared error from six different models, applied to data simulated from a negative binomial distribution. Note that
the curves for a quasi-Poisson model would be indistinguishable from a negative binomial curve.
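The comparison summarized in Figs 2 and 3 can be illustrated with a small simulation. The sketch below is not the authors' code (theirs is in R, in Appendix S1); it is a standard-library Python illustration under stated assumptions: an intercept-only model, negative binomial counts drawn as a gamma-Poisson mixture with variance λ + λ²/θ, and hypothetical function names (`rpois`, `rnegbin`, `log_scale_estimates`, `bias_and_rmse`). For an intercept-only log-link count model the fitted log-mean is the log of the sample mean, which stands in here for the quasi-Poisson and negative binomial fits.

```python
import math
import random
import statistics


def rpois(rng, lam):
    """Draw one Poisson(lam) variate by inverting the CDF."""
    if lam > 500.0:
        # exp(-lam) would underflow for very large lam; a Poisson variate is
        # the sum of two independent Poisson variates with half the mean.
        return rpois(rng, lam / 2.0) + rpois(rng, lam / 2.0)
    u = rng.random()
    k = 0
    p = math.exp(-lam)  # P(Y = 0)
    cdf = p
    while u > cdf and p > 0.0:
        k += 1
        p *= lam / k  # P(Y = k) obtained from P(Y = k - 1)
        cdf += p
    return k


def rnegbin(rng, n, mean, theta):
    """Draw n negative binomial counts as a gamma-Poisson mixture.

    Assumed parameterization: variance = mean + mean**2 / theta.
    """
    return [rpois(rng, rng.gammavariate(theta, mean / theta)) for _ in range(n)]


def log_scale_estimates(y):
    """Intercept-only estimates of log(mean) from one data set.

    Raises ValueError if every count is zero (log of zero).
    """
    return {
        # Log of the sample mean: intercept of a log-link count model.
        "count model": math.log(statistics.fmean(y)),
        # Mean of log(y + 1): intercept of a linear model on log(y + 1).
        "log(y + 1)": statistics.fmean([math.log(v + 1.0) for v in y]),
        # Mean of sqrt(y), squared and logged to reach the log scale.
        "sqrt(y)": 2.0 * math.log(statistics.fmean([math.sqrt(v) for v in y])),
    }


def bias_and_rmse(mean, theta, n=100, n_sims=200, seed=1):
    """Return {method: (mean bias B, RMSE)} on the log scale."""
    rng = random.Random(seed)
    true_mu = math.log(mean)
    errors = {}
    for _ in range(n_sims):
        y = rnegbin(rng, n, mean, theta)
        for name, est in log_scale_estimates(y).items():
            errors.setdefault(name, []).append(est - true_mu)
    return {
        name: (
            statistics.fmean(errs),
            math.sqrt(statistics.fmean([d * d for d in errs])),
        )
        for name, errs in errors.items()
    }


if __name__ == "__main__":
    for name, (bias, rmse) in bias_and_rmse(mean=5.0, theta=1.0).items():
        print(f"{name:12s} bias={bias:+.3f} rmse={rmse:.3f}")
```

The sample size, number of simulations and seed are arbitrary choices for the illustration, not the settings used in the paper. Varying `mean` and `theta` gives a rough feel for how the bias of the transformed-data estimates changes across the panels.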
foundation for the model. The extra variability that can be added can be chosen according to the way it affects the relationship between the mean and variance (Ver Hoef & Boveng 2007).
In our simulations, the quasi-Poisson and negative binomial models gave almost identical estimates. This suggests that the models are robust to a mis-specification of the relationship between the mean and variance. In contrast, Ver Hoef & Boveng (2007) gave an example from a real data set where they differed in their predictions. Whilst their data set is unusual (as they acknowledge), it does serve as a warning that our result may not generalize to real data, which rarely have as balanced a design as our simulations. The two models differ in their relationships between the mean and variance; so, if distinguishing between them becomes important, this can be done by plotting $(y_i - \lambda_i)^2$ against $\lambda_i$: it will be linear for a quasi-Poisson model
but quadratic for a negative binomial model. A clear curve in the plot would therefore suggest that a negative binomial model will provide a better fit. In practice, it is probably advisable to bin the data, i.e. calculate the average mean values and variances for data points with similar mean values, as this will make the plots less messy (Ver Hoef & Boveng 2007).
Even though the choice of the type of GLM depends on many things (O’Hara 2009; Zuur, Ieno & Elphick 2010), we do recommend that count data not be transformed to be used in parametric tests. For such data, GLMs and their derivatives are more appropriate.

Acknowledgments
The order of the authors was determined by the result of the South Africa–England cricket ODI on 27 September 2009, which England won by 22 runs. The study was financially supported by the research funding programme ‘LOEWE – Landes-Offensive zur Entwicklung Wissenschaftlich-ökonomischer Exzellenz’ of Hesse’s Ministry of Higher Education, Research, and the Arts, and the Academy of Finland. We thank Alain Zuur and an anonymous referee for their helpful comments on an earlier version of this manuscript.

References
Box, G.E.P. & Cox, D.R. (1964) An analysis of transformations. Journal of the Royal Statistical Society B, 26, 211–252.
Crawley, M.J. (2003) Statistical Computing. An Introduction to Data Analysis using S-Plus. John Wiley & Sons Ltd, London.
Cuesta, D., Taboada, A., Calvo, L. & Salgado, J.M. (2008) Short- and medium-term effects of experimental nitrogen fertilization on arthropods associated with Calluna vulgaris heathlands in north-west Spain. Environmental Pollution, 152, 394–402.
Dalthorp, D. (2004) The generalized linear model for spatial data: assessing the effects of environmental covariates on population density in the field. Entomologia Experimentalis et Applicata, 111, 117–131.
Gebeyehu, S. & Samways, M.J. (2002) Grasshopper assemblage response to a restored national park (Mountain Zebra National Park, South Africa). Biodiversity and Conservation, 11, 283–304.
Jiao, Y., Chen, Y., Schneider, D. & Wroblewski, J. (2004) A simulation study of impacts of error structure on modeling stock-recruitment data using generalized linear models. Canadian Journal of Fisheries and Aquatic Sciences, 61, 122–133.
Magura, T., Tóthmérész, B. & Elek, Z. (2005) Impacts of leaf-litter addition on carabids in a conifer plantation. Biodiversity and Conservation, 14, 475–491.
Maindonald, J. & Braun, J. (2007) Data Analysis and Graphics Using R – An Example-Based Approach, 2nd edn. Cambridge University Press, Cambridge.
McCullagh, P. & Nelder, J.A. (1989) Generalized Linear Models, 2nd edn. Chapman & Hall, London.
Miller, R.G., Jr (1997) Beyond ANOVA. Chapman & Hall/CRC Press, London.
O’Hara, R.B. (2009) How to make models add up – a primer on GLMMs. Annales Zoologici Fennici, 46, 124–137.
Piepho, H.-P. (2009) Data transformation in statistical analysis of field trials with changing treatment variance. Agronomy Journal, 101, 865–869.
R Development Core Team (2009) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
Sileshi, G., Hailu, G. & Nyadzi, G.I. (2009) Traditional occupancy-abundance models are inadequate for zero-inflated ecological count data. Ecological Modelling, 220, 1764–1775.
Sokal, R.R. & Rohlf, F.J. (1995) Biometry, 3rd edn. Freeman and Company, New York, New York, USA.
Ver Hoef, J.M. & Boveng, P.L. (2007) Quasi-Poisson vs. negative binomial regression: how should we model overdispersed count data? Ecology, 88, 2766–2772.
Venables, W.N. & Ripley, B.D. (2002) Modern Applied Statistics with S, 4th edn. Springer, New York, New York, USA.
White, G.C. & Bennetts, R.E. (1996) Analysis of frequency count data using the negative binomial distribution. Ecology, 77, 2549–2557.
Wright, D.H. (1991) Correlations between incidence and abundance are expected by chance. Journal of Biogeography, 18, 463–466.
Zar, J.H. (1999) Biostatistical Analysis, 4th edn. Prentice Hall, Englewood Cliffs, New Jersey, USA.
Zuur, A.F., Ieno, E.N. & Smith, G.M. (2007) Analysing Ecological Data. Springer, New York, NY, USA.
Zuur, A.F., Ieno, E.N., Walker, N.J., Saveliev, A.A. & Smith, G.M. (2009) Mixed Effects Models and Extensions in Ecology with R. Springer, New York, NY, USA.
Zuur, A.F., Ieno, E.N. & Elphick, C.S. (2010) A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1, 3–14.

Received 13 December 2009; accepted 19 January 2010
Handling Editor: Robert P. Freckleton

Supporting Information
Additional Supporting Information may be found in the online version of this article:
Appendix S1. Simulation code for R.
As a service to our authors and readers, this journal provides supporting information supplied by the authors. Such materials may be re-organized for online delivery, but are not copy-edited or typeset. Technical support issues arising from supporting information (other than missing files) should be addressed to the authors.
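The diagnostic suggested in the Discussion, plotting squared residuals against fitted means after binning points with similar means, can be sketched as follows. This is an illustration rather than the authors' or Ver Hoef & Boveng's implementation: the function name `binned_squared_residuals` and the equal-count binning rule are assumptions, and the observed counts and fitted means are taken as given (in practice the fitted means would come from a fitted GLM).

```python
import math
import statistics


def binned_squared_residuals(y, fitted, n_bins=10):
    """Average squared residuals within bins of similar fitted mean.

    Returns a list of (mean fitted value, mean squared residual) pairs,
    one per bin, ordered by fitted value. Bins hold equal numbers of
    points, except possibly the last one.
    """
    if len(y) != len(fitted):
        raise ValueError("y and fitted must have the same length")
    pairs = sorted(zip(fitted, y))  # order observations by fitted mean
    size = max(1, math.ceil(len(pairs) / n_bins))
    points = []
    for start in range(0, len(pairs), size):
        chunk = pairs[start:start + size]
        mean_fit = statistics.fmean([m for m, _ in chunk])
        mean_sq = statistics.fmean([(obs - m) ** 2 for m, obs in chunk])
        points.append((mean_fit, mean_sq))
    return points


if __name__ == "__main__":
    # Artificial data whose squared residual equals the fitted mean exactly,
    # i.e. the linear (quasi-Poisson-like) case.
    fitted = [float(i) for i in range(1, 41)]
    y = [m + math.sqrt(m) for m in fitted]
    for mean_fit, mean_sq in binned_squared_residuals(y, fitted, n_bins=4):
        print(f"fitted={mean_fit:6.2f}  mean squared residual={mean_sq:6.2f}")
```

Plotting the second element of each pair against the first gives the binned version of the plot: points that follow a straight line are compatible with a quasi-Poisson mean-variance relationship, whereas clear upward curvature points towards the negative binomial.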