Bayesian Models For Astrophysical Data Using R, JAGS, Python, and Stan
Bayesian Models For Astrophysical Data Using R, JAGS, Python, and Stan
Bayesian Models for Astrophysical Data provides those who are engaged in the Bayesian
modeling of astronomical data with guidelines on how to develop code for modeling such
data, as well as on how to evaluate a model as to its fit. One focus in this volume is on
developing statistical models of astronomical phenomena from a Bayesian perspective. A
second focus of this work is to provide the reader with statistical code that can be used for
a variety of Bayesian models.
We provide fully working code, not simply code snippets, in R, JAGS, Python, and Stan
for a wide range of Bayesian statistical models. We also employ several of these models
in real astrophysical data situations, walking through the analysis and model evaluation.
This volume should foremost be thought of as a guidebook for astronomers who wish to
understand how to select the model for their data, how to code it, and finally how best
to evaluate and interpret it. The codes shown in this volume are freely available online at
www.cambridge.org/bayesianmodels. We intend to keep it continuously updated and report
any eventual bug fixes and improvements required by the community. We advise the reader
to check the online material for practical coding exercises.
This is a volume devoted to applying Bayesian modeling techniques to astrophysical
data. Why Bayesian modeling? First, science appears to work in accordance with Bayesian
principles. At each stage in the development of a scientific study new information is used
to adjust old information. As will be observed when reviewing the examples later in this
volume, this is how Bayesian modeling works. A posterior distribution created from the
mixing of the model likelihood (derived from the model data) and a prior distribution
(outside information we use to adjust the observed data) may itself be used as a prior for
yet another enhanced model. New information is continually being used in models over
time to advance yet newer models. This is the nature of scientific discovery. Yet, even
if we think of a model in isolation from later models, scientists always bring their own
perspectives into the creation of a model on the basis of previous studies or from their own
experience in dealing with the study data. Models are not built independently of the context,
so bringing in outside prior information to the study data is not unusual or overly subjective.
Frequentist statisticians choose the data and predictors used to study some variable – most
of the time based on their own backgrounds and external studies. Bayesians just make the
process more explicit.
A second reason for focusing on Bayesian models is that recently there has been a rapid
move by astronomers to Bayesian methodology when analyzing their data. Researchers in
many other disciplines are doing the same, e.g., in ecology, environmental science, health
outcomes analysis, communications, and so forth. As we discuss later, this is largely the
xiii
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:10:42, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.001
xiv Preface
case because computers are now finally able to engage with complex MCMC-based algo-
rithms, which entail thousands of sampling iterations and many millions of calculations
in arriving at a single posterior distribution. Moreover, aside from faster computers with
much greater memory, statisticians and information scientists have been developing ever
more efficient estimation algorithms, which can now be found in many commercial statis-
tical software packages as well as in specially developed Bayesian packages, e.g., JAGS,
Stan, OpenBUGS, WinBUGS, and MLwiN. Our foremost use in the book is of JAGS and
Stan. The initial release of JAGS by Martyn Plummer was in December 2007. Stan, named
after Stanislaw Ulam, co-developer of the original MCMC algorithm in 1949 with Nicholas
Metropolis (Metropolis and Ulam, 1949), was first released in late August 2012. However,
the first stable release of Stan was not until early December 2015, shortly before we began
writing this volume. In fact, Stan was written to overcome certain problems with the con-
vergence of multilevel models experienced with BUGS and JAGS. It is clear that this book
could not have been written a decade ago, or even five years ago, as of this writing. The
technology of Bayesian modeling is rapidly advancing, indicating that astrostatistics will
be advancing with it as well. This book was inspired by the new modeling capabilities
being jointly provided by the computer industry and by statisticians, who are developing
better methods of analyzing data.
Bayesian Models for Astrophysical Data differs from other books on astrostatistics. The
book is foremost aimed to provide the reader with an understanding of the statistical mod-
eling process, and it displays the complete JAGS and, in most cases, Stan code for a wide
range of models. Each model is discussed, with advice on when to use it and how best
to evaluate it with reference to other models. Following an overview of the meaning and
scope of statistical modeling, and of how frequentist and Bayesian models differ, we exam-
ine the basic Gaussian or normal model. This sets the stage for us to then present complete
modeling code based on synthetic data for what may be termed Bayesian generalized linear
models. We then extend these models, discussing two-part models, mixed models, three-
parameter models, and hierarchical models. For each model we show the reader how to
create synthetic data based on the distributional assumptions of the model being evaluated.
The model code is based on the synthetic data but because of that it is generic and can
easily be adapted to alternative synthetic data or to real data. We provide full JAGS and
Stan code for each model. In the majority of the examples in this volume, JAGS code is
run from the R environment and Stan from within the Python environment. In many cases
we also display the code for, and run, stand-alone R and Python models.
Following our examination of models, including continuous, binary, proportion,
grouped, and count response models we address model diagnostics. Specifically, we
discuss information criteria including the Bayesian deviance, the deviance information
criterion (DIC), and the pD and model predictor selection methods, e.g., the Kuo and
Mallick test and Bayesian LASSO techniques. In Chapter 10 on applications we bring in
real astronomical data from previously published studies and analyze them using the mod-
els discussed earlier in the book. Examples are the use of time series for sunspot events,
lognormal models for the stellar initial mass function, and errors in variables for the anal-
ysis of supernova properties. Other models are discussed as well, with the likelihood-free
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:10:42, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.001
xv Preface
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:10:42, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.001
xvi Preface
most examples complete commented Python scripts are provided, allowing the reader to
benefit also from a direct comparison between these programming languages. In addition,
we do not expect that readers have a background in Bayesian modeling – the subject of the
text – but the more you know already, the better.
Owing to these assumptions we cover frequency-based modeling concepts rather
quickly, touching on only major points to be remembered when contrasting frequency-
based and Bayesian methodologies. We provide an overview of Bayesian modeling and of
how it differs from frequentist-based modeling. However, we do not focus on theory. There
are a plethora of books and other publications on this topic. We do attempt, however, to
provide sufficient background on Bayesian methodology to allow the reader to understand
the logic and purpose of the Bayesian code discussed in the text. Our main emphasis is to
provide astrostatisticians with the code and the understanding to employ models on astro-
physical data that have previously not been used, or have only seldom been used – but
which perhaps should be used more frequently. We provide modeling code and diagnostics
using synthetic data as well as using real data from the literature.
We should mention that researchers in disciplines other than astrophysics may also find
the book useful. The code and discussion using synthetic data are applicable to nearly all
disciplines.
We are grateful to a number of our colleagues for their influence on this work. Foremost
we wish to acknowledge Alain F. Zuur of Highlands Statistics in Newburgh, Scotland for
his contributions to several of the JAGS models used in the book. The codes for various
models are adaptations of code from Zuur, Hilbe, and Ieno (2013), a book on both the fre-
quentist and the Bayesian approaches to generalized linear models and generalized linear
mixed models for ecologists. Moreover, we have adopted a fairly uniform style or format
for constructing JAGS models on this basis of the work of Dr. Zuur and the first author of
this volume, which is reflected in Zuur, Hilbe, and Ieno (2013).
We would like to express our appreciation to Diana Gillooly, statistics editor at
Cambridge University Press, for accepting our proposal to write this book. She has been
supportive throughout the life of the book’s preparation. We thank Esther Miguéliz, our
Content Manager, and Susan Parkinson, our freelance copy-editor, for their dedicated work
in improving this book in a number of ways. Their professionalism and assistance is greatly
appreciated. We also wish to express our gratitude to John Hammersley, CEO of Overleaf
(WriteLatex Ld), who provided us with Overleaf Pro so that we could work simultane-
ously on the manuscript. This new technology makes collaborative authorship endeavors
much easier than in the past. Since we live and work in Arizona, Hungary, and France
respectively, this was an ideal way to write the book.
Finally, we each have those in our personal lives who have contributed in some way to
the creation of this volume.
The third author would like to thank Wolfgang Hillebrandt and Emmanuel Gangler for
providing unique working environments which enabled the completion of this project. In
addition, she would like to acknowledge all those who supported the Cosmostatistics Ini-
tiative and its Residence Programs, where many applications described in this volume were
developed. Special thanks to Alberto Krone-Martins, Alan Heavens, Jason McEwen, Bruce
Bassett, and Zsolt Frei as well as her fellow co-authors.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:10:42, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.001
xvii Preface
The second author thanks all members of the Cosmostatistics Initiative. Particular
thanks to Ewan Cameron, Alberto Krone-Martins, Maria Luiza Linhares Dantas, Mad-
hura Killedar, and Ricardo Vilalta, who have demonstrated their incredible support for our
endeavor.
The first author wishes to acknowledge Cheryl Hilbe, his wife, who has supported his
taking time from other family activities to devote to this project. In addition, he also wishes
to expressly thank Eric Feigelson for advice and support over the past seven years as he
learned about the astrophysical community and the unique concerns of astrostatistics. Pro-
fessor Feigelson’s experience and insights have helped shape how he views the discipline.
He also acknowledges the true friendship which has evolved between the authors of this
volume, one which he looks forward to continuing in the coming years. Finally, he dedi-
cates this book to two year old Kimber Lynn Hilbe, his granddaughter, who will likely be
witness to a future world that cannot be envisioned by her grandfather.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:10:42, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.001
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:10:42, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.001
1 Astrostatistics
Astrostatistics is at the same time one of the oldest disciplines, and one of the youngest.
The Ionian Greek philosopher Thales of Miletus is credited with correctly predicting a total
solar eclipse in central Lydia, which he had claimed would occur in May of 585 BCE. He
based this prediction on an examination of records maintained by priests throughout the
Mediterranean and Near East. The fact that his prediction was apparently well known, and
the fact that the Lydians were engaged in a war with the Medes in Central Lydia during
this period, brought his prediction notice and fame. Thales was forever after regarded as a
sage and even today he is named the father of philosophy and the father of science in books
dealing with these subjects.
Thales’ success spurred on others to look for natural relationships governing the motions
of astronomical bodies. Of particular note was Hipparchus (190–120 BCE) who, fol-
lowing on the earlier work of Aristarchus of Samos (310–230 BCE and Eratosthenes
(276–147 BCE), is widely regarded as the first to clearly apply statistical principles to
the analysis of astronomical events. Hipparchus also is acknowledged to have first devel-
oped trigonometry, spherical trigonometry, and trigonometric tables, applying these to the
motions of both the moon and sun. Using the size of the moon’s parallax and other data
from the median percent of the Sun covered by the shadow of the Earth at various sites in
the area, he calculated the distance from the Earth to the Moon as well as from the Earth
to the Sun in terms of the Earth’s radius. His result was that the median value is 60.5 Earth
radii. The true value is 60.3. He also calculated the length of the topical year to within six
minutes per year of its true value.
Others in the ancient world, as well as scientists until the early nineteenth century,
also used descriptive statistical techniques to describe and calculate the movements and
relationships between the Earth and astronomical bodies. Even the first application of a
normal or ordinary least squares regression was to astronomy. In 1801 Hungarian Franz
von Zach applied the new least squares regression algorithm developed by Carl Gauss for
predicting the position of Ceres as it came into view from its orbit behind the Sun.
The development of the first inferential statistical algorithm by Gauss, and its success-
ful application by von Zach, did not immediately lead to major advances in inferential
statistics. Astronomers by and large seemed satisfied to work with the Gaussian, or normal,
model for predicting astronomical events, and statisticians turned much of their attention to
1
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:28:45, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.002
2 Astrostatistics
deriving various probability functions. In the early twentieth century William Gosset, Karl
Pearson, and Ronald Fisher made the most advances in statistical modeling and hypothesis
testing. Pearson developed the mathematics of goodness-of-fit and of hypothesis testing.
The Pearson χ 2 test is still used as an assessment of frequentist based model fit.1 In addi-
tion, Pearson developed a number of tests related to correlation analysis, which is important
in both frequentist and Bayesian modeling. Fisher (1890–1962) is widely regarded as the
father of modern statistics. He is basically responsible for the frequentist interpretation
of hypothesis testing and of statistical modeling. He developed the theories of maximum
likelihood estimation and analysis of variance and the standard way that statisticians under-
stood p-values and confidence intervals until the past 20 years. Frequentists still employ
his definition of the p-value in their research. His influence on twentieth century statistics
cannot be overestimated.
It is not commonly known that Pierre-Simon Laplace (1749–1827) is the person fore-
most responsible for bringing attention to the notion of Bayesian analysis, which at the
time meant employing inverse probability to the analysis of various problems. We shall
discuss Bayesian methodology in a bit more detail in Chapter 3. Here we can mention that
Thomas Bayes (1702–1761) developed the notion of inverse probability in unpublished
notes he made during his lifetime. These notes were discovered by Richard Price, Bayes’s
literary executor and a mathematician, who restructured and presented Bayes’ paper to the
Royal Society. It had little impact on British mathematicians at the time, but it did catch
the attention of Laplace. Laplace fashioned Bayes’ work into a full approach to probability,
which underlies current Bayesian statistical modeling. It would probably be more accurate
to call Bayesian methodology Bayes–Laplace methodology, but simplification has given
Bayes the nominal credit for this approach to both probability and statistical modeling.
After enthusiastically promoting inverse probability, Laplace abandoned this work
and returned to researching probability theory from the traditional perspective. He also
made major advances in differential equations, including his discovery of Laplace trans-
forms, which are still very useful in mathematics. Only a few adherents to the Bayesian
approach to probability carried on the tradition throughout the next century and a half.
Mathematicians such as Bruno de Finetti (Italy) and Harold Jeffreys (UK) promoted
inverse probability and Bayesian analysis during the early part of the twentieth century,
while Dennis Lindley (UK), Leonard J Savage (US), and Edwin Jaynes, a US physicist,
were mainstays of the tradition in the early years of the second half of the twentieth century.
But their work went largely unnoticed.
The major problem with Bayesian analysis until recent times has been related to the use
of priors. Priors are distributions representing information from outside the model data
that is incorporated into the model. We shall be discussing priors in some detail later
in the text. Briefly though, except for relatively simple models the mathematics of cal-
culating so-called posterior distributions was far too difficult to do by hand, or even by
most computers, until computing speed and memory became powerful enough. This was
particularly the case for Bayesian models, which are extremely demanding on computer
power. The foremost reason why most statisticians and analysts were not interested in
1 In Chapter 5 we shall discuss how this statistic can be used in Bayesian modeling as well.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:28:45, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.002
3 1.1 The Nature and Scope of Astrostatistics
implementing Bayesian methods into their research was that computing technology was
not advanced enough to execute more than fairly simple problems. This was certainly the
case with respect to the use of Bayesian methods in astronomy.
We shall provide a brief overview of Bayesian methodology in Chapter 3. Until such
methods became feasible, however, Fisherian or frequentist methodology was the standard
way of doing statistics. For those who are interested in understanding the history of this era
of mathematics and statistics, we refer you to the book Willful Ignorance: The Mismeasure
of Uncertainty (Weisberg, 2014).
Astronomy and descriptive statistics, i.e., the mathematics of determining mean, median,
mode, range, tabulations, frequency distributions, and so forth, were closely tied together
until the early nineteenth century. Astronomers have continued to employ descriptive statis-
tics, as well as basic linear regression, to astronomical data. But they did not concern
themselves with the work being done in statistics during most of the twentieth century.
Astronomers found advances in telescopes and the new spectroscope, as well as in cal-
culus and differential equations in particular, to be much more suited to understanding
astrophysical data than hypothesis testing and other frequentist methods. There was a near
schism between astronomers and statisticians until the end of the last century.
As mentioned in the Preface, astrostatistics can be regarded as the statistical analysis of
astronomical data. Unlike how astronomers utilized statistical methods in the past, primar-
ily focusing on descriptive measures and to a moderate extent on linear regression from
within the frequentist tradition, astrostatistics now entails the use, by a growing number
of astronomers, of the most advanced methods of statistical analysis that have been devel-
oped by members of the statistical profession. However, we still see astronomers using
linear regression on data that to a statistician should clearly be modeled by other, non-
linear, means. Recent texts on the subject have been aimed at informing astronomers of
these new advanced statistical methods.
It should be mentioned that we have incorporated astroinformatics under the general
rubric of astrostatistics. Astroinformatics is the study of the data gathering and computing
technology needed to gather astronomical data. It is essential to statistical analysis. Some
have argued that astroinformatics incorporates astrostatistics, which is the reverse of the
manner in which we envision it. But the truth is that gathering information without an
intent to analyze it is a fairly useless enterprise. Both must be understood together. The
International Astronomical Union (IAU) has established a new commission on astroinfor-
matics and astrostatistics, seeming to give primacy to the former, but how the order is given
depends somewhat on the interests of those who are establishing such names. Since we are
focusing on statistics, although we are also cognizant of the important role that informat-
ics and information sciences bring to bear on statistical analysis, we will refer to the dual
studies of astrostatistics and astroinformatics as simply astrostatistics.
In the last few decades the size and complexity of the available astronomical information
has closely followed Moore’s law, which states roughly that computing processing power
doubles every two years. However, our ability to acquire data has already surpassed our
capacity to analyze it. The Sloan Digital Sky Survey (SDSS), in operation since 2000, was
one of the first surveys to face the big data challenge: in its first data release it observed 53
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:28:45, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.002
4 Astrostatistics
million individual objects. The Large Synoptic Survey Telescope (LSST) survey is aim-
ing to process and store 30 terabytes of data each night for a period of ten years. Not
only must astronomers (astroinformaticists) deal with such massive amounts of data but
also it must be stored in a form in which meaningful statistical analysis can proceed. In
addition, the data being gathered is permeated with environmental noise (due to clouds,
moon, interstellar and intergalactic dust, cosmic rays), instrumental noise (due to distor-
tions from telescope design, detector quantum efficiency, a border effect in non-central
pixels, missing data), and observational bias (since, for example, brighter objects have a
higher probability of being detected). All these problem areas must be dealt with prior
to any attempt to subject the data to statistical analysis. The concerns are immense, but
nevertheless this is a task which is being handled by astrostatisticians and information
scientists.
Statistical analysis has developed together with the advance in computing speed and stor-
age. Maximum likelihood and all such regression procedures require that a matrix be
inverted. The size of the matrix is based on the number of predictors and parameters in
the model, including the intercept. Parameter estimates can be determined using regression
for only very small models if done by hand. Beginning in the 1960s, larger regression and
multivariate statistical procedures could be executed on mainframe computers using SAS,
SPSS, and other statistical and data management software designed specifically for main-
frame systems. These software packages were ported to the PC environment when PCs
with hard drives became available in 1983. Statistical routines requiring a large number of
iterations could take a long time to converge. But as computer speeds became ever faster,
new regression procedures could be developed and programmed that allowed for the rapid
inversion of large matrices and the solution to complex modeling projects.
Complex Bayesian modeling, if based on Markov chain Monte Carlo (MCMC) sam-
pling, was simply not feasible until near the turn of the century. By 2010 efficient Bayesian
sampling algorithms were implemented into statistical software, and the computing power
by then available allowed astronomers to analyze complex data situations using advanced
statistical techniques. By the closing years of the first decade of this century astronomers
could take advantage of the advances in statistical software and of the training and expertise
of statisticians.
As mentioned before, from the early to middle years of the nineteenth century until
the final decade of the twentieth century there was little communication between astron-
omers and statisticians. We should note, though, that there were a few astronomers who
were interested in applying sophisticated statistical models to their study data. In 1990, for
example, Thomas Loredo wrote his PhD dissertation in astrophysics at the University of
Chicago on “From Laplace Supernova SN 1987A: Bayesian inference in astrophysics” (see
the book Feigelson and Babu, 2012a), which was the first thorough application of Bayesian
modeling to astrophysical data. It was, and still is, the seminal work in the area and can be
regarded as one of the founding articles in the new discipline of astrostatistics.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:28:45, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.002
5 1.2 The Recent Development of Astrostatistics
In 1991 Eric Feigelson and Jogesh Babu, an astronomer and statistician respectively
at Pennsylvania State University, collaboratively initiated a conference entitled Statistical
Challenges in Modern Astronomy. The conference was held at their home institution and
brought together astronomers and a few statisticians for the purpose of collaboration.
The goal was also to find a forum to teach astronomers how to use appropriate statis-
tical analysis for their study projects. These conferences were held every five years at
Penn State until 2016. The conference site has now shifted to Carnegie Mellon University
under the direction of Chad Schafer. During the 1990s and 2000s a few other conferences,
workshops, and collaborations were held that aimed to provide statistical education to
astronomers. But they were relatively rare, and there did not appear to be much growth
in the area.
Until 2008 there were no astrostatistics working groups or committees authorized under
the scope of any statistical or astronomical association or society. Astrostatistics was nei-
ther recognized by the IAU nor recognized as a discipline in its own right by the principal
astronomical and statistical organizations. However, in 2008 the first interest group in
astrostatistics was initiated under the International Statistical Institute (ISI), the statisti-
cal equivalent of the IAU for astronomy. This interest group, founded by the first author of
the book, and some 50 other interested astronomers and statisticians, met together at the
2009 ISI World Statistics Congress in Durban, South Africa. The attendees voted to apply
for the creation of a standing committee on astrostatistics under the auspices of the ISI. The
proposal was approved at the December 2009 meeting of the ISI executive board. It was
the first such astrostatistics committee authorized by an astronomical or statistical society.
In the following month the ISI committee expanded to become the ISI Astrostatistics
Network. Network members organized and presented two invited sessions and two spe-
cial theme sessions in astrostatistics at the 2011 World Statistics Congress in Dublin,
Ireland. Then, in 2012, the International Astrostatistics Association (IAA) was formed
from the Network as an independent professional scientific association for the discipline.
Also in 2012 the Astrostatistics and Astroinformatics Portal (ASAIP) was instituted at
Pennsylvania State under the editorship of Eric Feigelson and the first author of this
volume. As of January 2016 the IAA had over 550 members from 56 nations, and the
Portal had some 900 members. Working Groups in astrostatistics and astroinformatics
began under the scope of IAU, the American Astronomical Society, and the American
Statistical Association. In 2014 the IAA sponsored the creation of its first section, the
Cosmostatistics Initiative (COIN), led by the second author of this volume. In 2015
the IAU working group became the IAU Commission on Astroinformatics and Astro-
statistics, with Feigelson as its initial president. Astrostatistics, and astroinformatics, is
now a fully recognized discipline. Springer has a series on astrostatistics, the IAA has
begun agreements with corporate partnerships, Cambridge University Press is sponsor-
ing IAA awards for contributions to the discipline, and multiple conferences in both
astrostatistics and astroinformatices are being held – all to advance the discipline. The
IAA headquarters is now at Brera Observatory in Milan. The IAA website is located at:
http://iaa.mi.oa-brera.inaf.it and the ASAIP Portal URL is https://asaip.psu.edu. We rec-
ommend the readers of this volume to visit these websites for additional resources on
astrostatistics.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:28:45, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.002
6 Astrostatistics
Statistics and statistical modeling have been defined in a variety of ways. We shall define
it in a general manner as follows:
Statistics may be generically understood as the science of collecting and analyzing data for the
purpose of classification, prediction, and of attempting to quantify and understand the uncer-
tainty inherent in phenomena underlying data.
Note that this definition ties data collection and analysis under the scope of statistics.
This is analogical to how we view astrostatistics and astroinformatics. In this text our fore-
most interest is in parametric models. Since the book aims to develop code and to discuss
the characterization of a number of models, it may be wise to define what is meant by
statistics and statistic models. Given that many self-described data scientists assert that
they are not statisticians and that statistics is dying, we should be clear about the meaning
of these terms and activities.
Statistical models are based on probability distributions. Parametric models are derived
from distributions having parameters which are estimated in the execution of a statistical
model. Non-parametric models are based on empirical distributions and follow the natural
or empirical shape of the data. We are foremost interested in parametric models in this text.
The theoretical basis of a statistical model differs somewhat from how analysts view
a model when it is estimated and interpreted. This is primarily the case when dealing
with models from a frequentist-based view. In the frequentist tradition, the idea is that the
data to be modeled is generated by an underlying probability distribution. The researcher
typically does not observe the entire population of the data that is being modeled but rather
observes a random sample from the population data, which itself derives from a probability
distribution with specific but unknown fixed parameters. The parameters specifying both
the population and sample data consist of the distributional mean and one or more shape
or scale parameters. The mean is regarded as a location parameter. For the normal model
the variance σ 2 is the scale parameter, which has a constant value across the range of
observations in the data. Binomial and count models have a scale parameter but its value is
set at unity. The negative binomial has a dispersion parameter, but it is not strictly speaking
a scale parameter. It adjusts the model for extra correlation or dispersion in the data. We
shall address these parameters as we discuss various models in the text.
In frequentist-based modeling the slopes or coefficients that are derived in the estimation
process for explanatory predictors are also considered as parameters. The aim of modeling
is to estimate the parameters defining the probability distribution that is considered to gen-
erate the data being modeled. The predictor coefficients, and intercept, are components of
the mean parameter.
In the Bayesian tradition parameters are considered as randomly distributed, not as fixed.
The data is characterized by an underling probability distribution but each parameter is
separately estimated. The distribution that is used to explain the predictor and parameter
data is called the likelihood. The likelihood may be mixed with outside or additional
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:28:45, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.002
7 1.4 Classification of Statistical Models
information known from other studies or obtained from the experience or background of
the analyst. This external information, when cast as a probability distribution with speci-
fied parameters, is called the prior distribution. The product of the model (data) likelihood
and prior distributions is referred to as the posterior distribution. When the model is sim-
ple, the posterior distribution of each parameter may be analytically calculated. However,
for most real model data, and certainly for astronomical data, the posterior must be deter-
mined through the use of a sampling algorithm. A variety of MCMC sampling algorithms
or some version of Gibbs sampling are considered to be the standard sampling algorithms
used in Bayesian modeling. The details of these methods go beyond the scope of this text.
For those interested in sampling algorithms we refer you to the articles Feroz and Hobson
(2008) and Foreman-Mackey et al. (2013) and the books Gamerman and Lopes (2006),
Hilbe and Robinson (2013), and Suess and Trumbo (2010).
We mentioned earlier that statistical models are of two general varieties – parametric and
non-parametric. Parametric models are based on a probability distribution, or a mixture
of distributions. This is generally the case for both frequentist-based and Bayesian models.
Parametric models are classified by the type of probability distribution upon which a model
is based. In Figure 1.1 we provide a non-exhaustive classification of the major models
discussed in this volume.
Beta-
Poisson
binomial
Gen.-
Poisson
Negative
Bernoulli
Binom.
Binomial Count
Zero-
Inverse- Inflated
Gaussian
Beta
Zero- Poisson–
Trunc. Logit
Gamma Major
Continuous Statistical Models Hurdle
(non-exhaustive)
Gamma–
Logit
Log-
Normal Log-
Normal Normal–
Logit
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:28:45, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.002
8 Astrostatistics
Astronomers utilize, or should utilize, most of these model types in their research. In
this text examples will be given, and code provided, for all. We aim to show how only rela-
tively minor changes need to be given to the basic estimation algorithm in order to develop
and expand the models presented in Figure 1.1. Regarding data science and statistics, it is
clear that when one is engaged in employing statistical models to evaluate data – whether
for better understanding of the data or to predict observations beyond it – one is doing
statistics. Statistics is a general term characterizing both descriptive and predictive mea-
sures of data and represents an attempt to quantify the uncertainty inherent in both the data
being evaluated and in our measuring and modeling mechanisms. Many statistical tools
employed by data scientists can be of use in astrostatistics, and we encourage astronomers
and traditional statisticians to explore how they can be used to obtain a better evaluation of
astronomical data. Our focus on Bayesian modeling can significantly enhance this process.
Although we could have begun our examination of Bayesian models with single-
parameter Bernoulli-based models such as Bayesian logistic and probit regression, we shall
first look at the normal or Gaussian model. The normal distribution has two characteristic
parameters, the mean or location parameter and the variance or scale parameter. The nor-
mal model is intrinsically more complex than single-parameter models, but it is the model
most commonly used in statistics – from both the frequentist and Bayesian traditions. We
believe that most readers will be more familiar with the normal model from the outset
and will have worked with it in the past. It therefore makes sense to begin with it. The
Bayesian logistic and probit binary response models, as well as the Poisson count model,
are, however, easier to understand and work with than the normal model.
Further Reading
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:28:45, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.002
2 Prerequisites
2.1 Software
The subtitle of this book is “using R, JAGS, Python, and Stan.” These software pack-
ages are used by astronomers more than any other Bayesian modeling software. Other
packages are commonly used by those in other disciplines, e.g., WinBUGS, OpenBUGS,
MLwiN, Minitab, SAS, SPSS, and recently, Stata. Minitab, SAS, SPSS, and Stata are
general commercial statistical packages. WinBUGS is no longer being supported, future
development being given to OpenBUGS. It is freeware, as are R, JAGS, Python, and Stan.
MLwiN provides hierarchical and Bayesian hierarchical modeling capabilities and is free
for academics.
In this chapter we provide an overview of each package discussed in this text. JAGS and
Stan can be run within the R or Python environment. Most scripts discussed in this book
use JAGS from within R and Stan from within Python. It is important, however, to state
that this is merely a presentation choice and that the alternative combination (Stan from
R and JAGS from Python) is also possible. We chose the first combination for didactic
reasons, as an opportunity for the interested reader to familiarize themself with a second
programming language. In two different contexts (Chapters 8 and 10) we also show how
Stan can be used from within R.
The R environment has become the most popular all purpose statistical software world-
wide. Many statistical departments require their graduate students to learn R before gaining
an advanced degree. Python, however, has quickly been gaining adherents. It is a powerful
and user-friendly tool for software development but does not have nearly as many already
supported procedures as R.
Astronomers are pretty much split between using R and Python for the statistical analysis
of their study data. Each software package has its own advantage. R can be used by itself
for a large range of statistical modeling tasks, but until recently its Bayesian capability
has been limited to relatively simple models. It is certainly possible, however, for more
complex Bayesian models to be written in R, but for the most part R has been used as
a framework within which specific Bayesian packages are run, e.g., JAGS, INLA, bmsr,
and even Stan. Bayesian astronomers nearly always turn to JAGS or Python for executing
complex models. Stan is a new tool that is quickly gaining popularity. Again, when using
JAGS it is easiest to run it from within R so that other statistical analysis using R can easily
be run on the data. Likewise, Stan can be run from within R or Python, as we shall observe
in this volume.
9
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:30:06, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.003
10 Prerequisites
We will demonstrate Bayesian models using R, JAGS, Python, and Stan, providing
the code for each in order to allow readers the ability to adapt the code for their own
purposes. In some cases, though, we will provide JAGS code within R, or Stan from within
Python, without showing the code for a strictly R or Python model. The reason is that few
researchers would want to use R or Python alone for these models but would want to use
JAGS and R, or Stan and Python together, for developing a useful Bayesian model with
appropriate statistical tests. In fact, using a Bayesian package from within R or Python
has become a fairly standard way of creating and executing Bayesian models for most
astronomers.
Note also that the integrated nested Laplace approximation (INLA) Bayesian software
package is described in Appendix A of this book. The INLA algorithm samples data by
approximation, not by MCMC-based methods. Therefore the time required to estimate
most Bayesian models is measured in seconds and even tenths of a second, not minutes or
hours.
The INLA package is becoming popular in disciplines such as ecology, particularly for
its Bayesian spatial analysis and additive modeling capabilities. However, there are impor-
tant limitations when using INLA for general-purpose Bayesian modeling, so we are not
providing a discussion of the package in the main part of the text. We felt, however, that
some mention should be made of the method owing to its potential use in the spatial anal-
ysis of astrophysical data. As INLA advances in its capabilities, we will post examples of
its use on the book’s web site.
Many models we describe in this book have not been previously discussed in the lit-
erature, but they could well be used for the better understanding of astrophysical data.
Nearly all the models examined, and the code provided, are for non-linear modeling
endeavors.
2.2 R
1 https://cran.r-project.org/
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:30:06, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.003
11 2.2 R
# Fit
mymodel <- lm(y ˜ x1 + x2) # linear regression of y on x1 and x2
# Output
summary(mymodel) # summary display
par(mfrow=c(2, 2)) # create a 2 by 2 window
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:30:06, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.003
12 Prerequisites
Standardized residuals
2
4 5 8 5 8
2 1
Residuals
0 0
–2
–1
–4
–6 3 –2 3
5 10 15 –1.5 –1.0 –0.5 0.0 0.5 1.0 1.5
Fitted values Theoretical quantiles
Standardized residuals
3 2
0.5
8 8
1.2 5 1 5
0.8 0
0.4 –1
Cook’s distance 0.5
–2 3
0.0 1
5 10 15 0.00 0.05 0.10 0.15 0.20 0.25 0.30
t
Fitted values Leverage
Table 2.1 lists the main packages that you should obtain in order to run the examples
in the subsequent chapters. All the packages can be installed through CRAN using the com-
mand install.packages(). Note that JAGS needs to be installed outside R. The function
jagsresults can be loaded from your local folder and is part of the package jagstools
available on Github:2
> source("../auxiliar_functions/jagsresults.R")
2.3 JAGS
JAGS is an acronym for Just Another Gibbs Sampler. It was designed and is maintained by
Martyn Plummer and is used strictly for Bayesian modeling. The key to running JAGS from
within R is to install and load a library called R2jags. The code is similar in appearance to
R. Typically, when run from within R, the data is brought into the program using R. The
JAGS code begins by defining a matrix for the vector of model predictors, X. In Code 2.2,
we show how a typical JAGS model looks.
Analysts have structured codes for determining the posterior means, standard deviations,
and credible intervals for the posterior distributions of Bayesian models in a variety of
2 https://github.com/johnbaums/jagstools
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:30:06, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.003
13 2.3 JAGS
Table 2.1 List of the main R packages which should be installed in order to run the examples
shown in subsequent chapters.
Name of package Description
JAGSa Analysis of Bayesian hierarchical models
R2jagsb R interface to JAGS
rstanc R interface to Stan
latticed Graphics library
ggplot2e Graphics library
MCMCpackf Functions for Bayesian inference
mcmcplotsg Plots for MCMC Output
a http://mcmc-jags.sourceforge.net,
b https://cran.r-project.org/web/packages/R2jags,
c https://cran.r-project.org/web/packages/rstan/index.html,
d http://lattice.r-forge.r-project.org/,
e http://ggplot2.org,
f http://mcmcpack.berkeley.edu,
g https://cran.r-project.org/web/packages/mcmcplots
ways. We shall implement a standard format or paradigm for developing a Bayesian model
using JAGS. The same structure will be used, with some variations based on the peculiari-
ties of the posterior distribution being sampled. We shall walk through an explanation of a
Bayesian normal model in Chapter 4.
# Fit
cat("model{
# priors
for (i in 1:N){
# log-likelihood
}
}", fill = TRUE)
sink()
inits <- function () {
list(
# Initial values
)
}
params <- c({ "beta", others ) # parameters to be displayed in output
J <- jags(data = model.data, # main JAGS sampling
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:30:06, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.003
14 Prerequisites
inits = inits,
parameters = params,
model.file = "MOD.txt",
n.thin = 1, # number of thinning
n.chains = 3, # 3 chains
n.burnin = 10000, # burn-in sample
n.iter = 20000) # samples used for posterior
2.4 Python
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:30:06, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.003
15 2.4 Python
Table 2.2 List of Python packages which should be installed in order to run the examples shown in
subsequent chapters.
Name of package Description
matplotliba Plotting library
numpyb Basic numerical library
pandasc Data structure tools
pymc3d Full Python probabilistic programming tool
pystane Python interface to Stan
scipyf Advanced numerical library
statsmodelsg Statistics library
a http://matplotlib.org/
b www.numpy.org/
c http://pandas.pydata.org/
d https://pymc-devs.github.io/pymc3/
e https://pystan.readthedocs.org/en/latest/
f www.scipy.org/
g http://statsmodels.sourceforge.net/
Python can be installed in all main operating systems by following the instructions on
the official website.8 Once you are sure you have the desired version of Python it is also
important to get pip, the recommended installer for Python packages.9 The official website
can guide you through the necessary steps.10 Windows users may prefer the graphical
interface pip-Win.11
You will now be able to install a few packages which will be used extensively for exam-
ples in subsequent chapters. Table 2.2 lists the packages12 you should get in order to
run most of the examples.13 All the packages can be installed through pip. Beyond the
tools listed in Table 2.2, we also recommend installing IPython,14 an enhanced interactive
Python shell or environment which makes it easier for the beginning user to manipulate
and check small portions of the code. It can be installed via pip and accessed by typing $
ipython on the command line.
Throughout this volume, Python-related code will be always presented as complete
scripts. This means that the reader can copy the entire script in a .py file using the text
editor of choice and run the example on the command line by typing
python my_file.py
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:30:06, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.003
16 Prerequisites
This is highly recommended since the Python shell stores results and information about
intermediate steps which can be used for further analysis. The same result can be achieved
by pasting code directly into the shell. This last alternative is appropriate if you wish to
run parts of the script separately for checking, plotting, etc. In order to paste code snippets
directly into the shell use
In [1]: %paste
Keep in mind that the standard ctrl+V shortcut will not work properly within IPython.
Finally, we emphasize that code snippets are divided into sections. The three basic compo-
nents are # Data, # Fit, and # Output. The # Data section shows all the steps necessary
for data preparation (how to simulate simple synthetic data for introductory examples, how
to read real data from standard astronomical data formats, or how to prepare the data to
be used as input for more complex modules). Sometimes, even in cases where previously
generated toy data is reused, the # Data section is shown again. This is redundant, but it
allows the reader to copy and run entire snippets without the need to rebuild them from
different parts of the book. Once data are available, the # Fit section shows examples of
how to construct the model and perform the fit. The # Output section shows how to access
the results via screen summaries or graphical representations.
Note that we will not present extracts or single-line Python code beyond this chapter.
Sometimes such code will be shown in R for the sake of argument or in order to highlight
specific parts of a model. However, we believe that presenting such extracts in both lan-
guages would be tedious. Given the similarities between R and Python we believe that the
experienced Python user will find no difficulties in interpreting these lines.
As an example, we show below the equivalent Python code for the example of linear
regression presented in Section 2.2.
Code 2.3 Example of linear regression in Python.
==================================================
import numpy as np
import statsmodels.formula.api as smf
# Data
y = np.array([13,15,9,17,8,5,19,23,10,7,10,6]) # response variable
x1 = np.array([1,1,1,1,1,1,0,0,0,0,0,0]) # binary predictor
x2 = np.array([1,1,1,1,2,2,2,2,3,3,3,3]) # categorical predictor
# Fit
results = smf.ols(formula=’y ˜ x1 + x2’, data=mydata).fit()
# Output
print(str(results.summary()))
==================================================
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:30:06, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.003
17 2.5 Stan
The import statements in the first two lines load the packages which will be used in the
calculations below. You can also recognize the sections of the code we mentioned before.
Each Python package has a specific format in which the input data should be passed to
its main functions. In our example, statsmodels require data organized as a dictionary
(mydata variable in Code 2.3).
Here we will focus on Bayesian modeling (see Chapter 3). Our goal is to demonstrate
Python’s capabilities for Bayesian analysis when using pure Python tools, e.g. pymc3, and
in its interaction with Stan (Section 2.5). Through our examples of synthetic and real data
sets we shall emphasize the latter, showing how Python and Stan can work together as
a user-friendly tool allowing the implementation of complex Bayesian statistical models
useful for astronomy. This will be done using the pystan package, which requires the data
to be formatted as a dictionary. The package may be called using an expression such as
or
In the first option, model_code receives a multiple-line string specifying the statistical
model to be fitted to the data (here called stan_code, see Section 2.5) and in the sec-
ond option file receives the name of the file where the model is stored. In what follows
we will always use the first option in order to make the model definition an explicit part of
our examples. However, the reader may find the second option safer from an organizational
point of view, especially when dealing with more complex situations. A closer look at Stan
and its modeling capabilities is given below.
2.5 Stan
Stan is a probabilistic programming language whose name was chosen in honor of Stanis-
law Ulam, one of the pioneers of the Monte Carlo method. It is written in C++ with
available interfaces in R, Python, Stata, Julia, MatLab, and the shell (CmD). As we men-
tioned before, we will provide examples of Stan code using its Python interface pystan
and in a few situations using its R interface, rstan. However, the models can be eas-
ily adapted to other interfaces. Stan uses a variation of the Metropolis algorithm called
Hamiltonian Monte Carlo, which allows for an optimized sampling of non-standard pos-
terior distributions (see the book Kruschke, 2015). It is a recent addition to the available
Bayesian modeling tools. The examples presented in this volume were implemented using
version 2.14.
Before we begin, it is necessary to know what kinds of element we are allowed to handle
when constructing a model. Stan15 provides a large range of possibilities and in what
15 http://mc-stan.org
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:30:06, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.003
18 Prerequisites
follows we will briefly describe only the features used to manipulate the models described
in this book.
A complete Stan model is composed of six code blocks (see Table 2.3). Each block
must contain instructions for specific tasks. In Section 2.4 we explained how a Stan code
block can be given as input to the Python package pystan. Here we will describe each
code block, independently of the Python interface, and highlight its role within the Stan
environment.
The main components of a Stan code are as follows:
• The data block holds the declaration for all input data to be subsequently used in the
Stan code. The user must define within this block all variables which will store different
elements of the data. Notice that the actual connection between the data dictionary and
Stan is done on the fly through the pystan package. An example of a simple data block is
data{
int<lower=0> nobs; # input integer
real beta0; # input real
int x[nobs]; # array of integers
vector[nobs] y; # vector of floats
}
transformed data {
real xnew[nobs]; # new temporary variable
for (i in 1:nobs) xnew[i] = x[i] * beta0; # data only transformation
}
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:30:06, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.003
19 2.5 Stan
In the snippet above we also make use of a loop running through the vector elements
with index i. This syntax will be useful throughout the book.
• The parameters block defines the domain to be sampled when calculating the posterior.
This might require the use of upper and/or lower limits such as
parameters {
real beta1; # unconstrained variable
real<upper=0> beta2; # beta2 <= 0
real<lower=0> sigma; # sigma >= 0
}
The limits imposed on the parameters beta2 and sigma are already a form of prior. Stan
is able to handle improper priors, meaning that, if we do not define a prior for these
parameters in the model block, it will consider the prior to be uniform over the entire
domain allowed by the constraints. However, the reader should be aware of the numerical
problems that might occur if the priors lead to improper posterior distributions.16
• The transformed parameters block allows the construction of intermediate parameters
which are built from the original ones and are frequently used to make the likelihood less
verbose. This is the case for a linear predictor; for example,
transformed parameters {
vector[nobs] lambda;
for (i in 1:nobs) lambda[i] = beta1 * xnew[i] + beta2;
}
• The model block is the core of the code structure, in which the recipe for the posterior is
given:
model {
beta1 ˜ normal(0,10); # prior for beta1
beta2 ˜ normal(0, 10); # prior for beta2
sigma ˜ gamma(1.0, 0.8); # gamma prior for
y ˜ normal(lambda, sigma); # likelihood
}
Here there is some freedom in the order in which the statements are presented. It is
possible to define the likelihood before the priors. Note that we have an apparent contra-
diction here. Parameters beta1 and beta2 were assigned the same prior, but beta1 can
take any real value while beta2 is non-positive (see our definition in the parameters
block).
We chose this example to illustrate that, in Stan, prior distributions may be truncated
through the combination of a constrained variable with an unbounded distribution. Thus,
from the parameters and model blocks above, we imposed a normal prior over beta1
and an half-normal prior over beta2. Moreover, we did not explicitly define a prior
for sigma because, in the absence of a specific prior definition, Stan assigns improper
priors over the entire variable domain. Thus, if we wish to use a non-informative prior
16 Please check the community recommendations and issues about hard priors in the article Stan (2016).
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:30:06, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.003
20 Prerequisites
over sigma, merely defining it with the appropriate constraint is sufficient although not
recommended (Stan, 2016).
Another important point is the vectorization of sampling statements (y and lambda
are vectors). If any other argument of a sampling statement is a vector, it needs to be of
the same size. If the other arguments are scalars (as in the case of sigma), they are used
repeatedly for all the elements of vector arguments (Team Stan, 2016, Chapter 6).
In this example we used the normal and gamma distributions for priors, but other
distributions are also possible. Figure 2.2 shows the major distributions available in
both JAGS and Stan, many of which will be covered throughout the book. We will also
show examples of how to implement and customize distributions for non-standard data
situations.
• The generated quantities block should be used when one wants to predict the value
of the response variable, given values for the explanatory variable and the resulting pos-
terior. In the case where we wish to predict the response variable value in the first 1000
values of the xnew vector, we have
generated quantities{
vector[nobs] y_predict;
for (n in 1:nobs)
y_predict[n] = normal_rng(xnew[n] * beta1 + beta2, sigma);
}
In this block the distribution name must be followed by _rng (random number genera-
tor). This nomenclature can be used only inside the generated quantities block. Also
Major univariate k τ
distributions available in σ σ r, λ
JAGS/Stan μ μ μ
Normal Lognormal Student t Gamma
k λ v, λ
τ τ
μ μ
Chi-square Exponential Double exp. Logistic Weibull
α, c p
p, n
a b
a, b
Pareto
Uniform Beta Bernoulli
Bernou
ull
ulli Binomial
Bino
Binom
om
mial
ial
a
a, b λ p, r
π
Beta–binomial
Beta-b nom
Beta-bino mial
ia Categorical
Categor
a g rical
c Poisson
Poisso
Poiss on Neg.
Neg.binomial
b
bino
no
omia
omial
m
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:30:06, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.003
21 2.5 Stan
important to emphasize is that, unlike the sampling statements in the model block, _rng
functions cannot be vectorized.
Putting together all the data blocks in a single-string object allows us to use Python (or
R) as an interface to Stan, as explained in Section 2.4. The online material accompanying
this book contains a complete script allowing the use of the elements above in a synthetic
model example. A similar code is presented in Chapter 4 in the context of a normal linear
model.
Important Remarks
Before one starts manipulating codes in Stan it is important to familiarize oneself with
the language style guide. We urge readers to go through Appendix C of the Stan User’s
Manual (Team Stan, 2016) before writing their own Stan programs. The recommendations
in that appendix are designed to improve readability and space optimization. In the exam-
ples presented in this volume we do our best to follow them and apologize in advance for
eventual deviations.
The reader should also be aware of the types of errors and warning messages that might
be encountered. Appendix D of Team Stan (2016) describes the messages prompted by the
Stan engine (not those originating from R, Python, or other interpreters). We call special
attention to a message which looks like this:
Information Message: The current Metropolis proposal is about to be rejec-
ted...Scale parameter is 0, but must be > 0!
Further Reading
Andreon, S. and B. Weaver (2015). Bayesian Methods for the Physical Sciences: Learning
from Examples in Astronomy and Physics. Springer Series in Astrostatistics. Springer.
17 In Stan, “warm-up” is equivalent to the burn-in phase in JAGS. See the arguments in Betancourt (2015).
18 https://groups.google.com/forum/#!topic/stan-users/hn4W_p8j3fs
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:30:06, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.003
22 Prerequisites
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:30:06, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.003
3 Frequentist vs. Bayesian Methods
Linear regression is the basis of a large part of applied statistics and has long been a main-
stay of astronomical data analysis. In its most traditional description one searches for the
best linear relation describing the correlation between x (the exploratory variable) and y
(the response variable).
Approaching the problem through a physical–astronomical perspective, the first step
is to visualize the data behavior through a scatter plot (Figure 3.1). On the basis of this
first impression, the researcher concludes that there is a linear correlation between the two
observed quantities and builds a model. This can be represented by
yi = axi + b, (3.1)
where xi and yi are the measured quantities,1 with the index running through all avail-
able observations, i = 1, . . . , n, and {a, b} the model parameters whose values must be
determined.
In the frequentist approach, the “true” values for the model parameters can be estimated
by minimizing the residuals εi between the predicted and measured values of the response
variable y for each measured value of the explanatory variable x (Figure 3.2). In other
words the goal is to find values for a and b that minimize
εi = yi − axi − b. (3.2)
1 At this point we assume there are no uncertainties associated with the observations.
23
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
24 Frequentist vs. Bayesian Methods
y 4
t
x1
t
x1
Figure 3.2 Results from linear regression applied to the toy data. The dots are synthetic observations, the solid diagonal line is
the fitted regression line, and the gray vertical lines are the residuals.
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
25 3.1 Frequentist Statistics
Assuming that {εi } is normally distributed and that observations are independent of each
other, the likelihood of the data given the model is
N
1 (yi − axi − b)2
L(data|a, b) = √ exp , (3.3)
2 πσ2 2σ 2
i=1
1
N
L ∝ (yi − axi − b)2 . (3.4)
2σ 2
i=1
The maximum likelihood estimation (MLE) method aims to find the values of the param-
eters {a, b} that minimize Equation 3.4. We shall discuss the likelihood in depth in
subsequent sections. First, though, let us see how this simple example can be implemented.
# Fit
summary(mod <- lm(y ˜ x1)) # model of the synthetic data.
==================================================
This will generate the following output on screen:
Call:
lm(formula = y ˜ x1)
Residuals:
Min 1Q Median 3Q Max
-3.2599 -0.7708 -0.0026 0.7888 3.9575
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
26 Frequentist vs. Bayesian Methods
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.9885 0.1379 14.42 <2e-16 ***
x1 2.8935 0.2381 12.15 <2e-16 ***
Note that the values we assigned to the synthetic model are close to those displayed in the
model output. In order to visualize the regression line in Figure 3.2 we can type
# Data
np.random.seed(1056) # set seed to replicate example
nobs = 250 # number of obs in model
x1 = uniform.rvs(size=nobs) # random uniform variable
# Output
print(str(results.summary()))
==================================================
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
27 3.2 Basic Theory of Bayesian Modeling
The final intercept value is reported as const and the slope as x1. The code produces
the mean (coeff), standard deviations (std err), t statistics (this can be understood as the
statistical significance of the coefficient), and p-values (p >|t|), on the null hypothesis that
the coefficient is zero, and the lower and upper bounds for the 95% confidence intervals
along with other diagnostics (some of which will be discussed later on).
Notice that here we have used a different option from that presented in Code 2.3, which
does not require writing the formula (y ∼ x) explicitly. We have done this so the reader
can be aware of different possibilities of implementation. In what follows we shall always
use the formula version in order to facilitate the correspondence between the R and Python
codes.
Bayesian statistical analysis, and Bayesian modeling in particular, started out slowly,
growing in adherents and software applications as computing power became ever more
powerful. When personal computers had sufficient speed and memory to allow statisticians
and other analysts to engage in meaningful Bayesian modeling, developers began to write
software packages aimed at the Bayesian market. This movement began in earnest after the
turn of the century. Prior to that, most analysts were forced to use sometimes very tedious
analytical methods to determine the posterior distributions of model parameters. This is
a generalization, of course, and statisticians worked to discover alternative algorithms to
calculate posteriors more optimally.
Enhancements to Bayesian modeling also varied by discipline. Mathematically adept
researchers in fields such as ecology, econometrics, and transportation were particularly
interested in applying the Bayesian approach. The reason for this attitude rests on the belief
that science tends to evolve on the basis of a Bayesian methodology. For example, the
emphasis of incorporating outside information as it becomes available, or is recognized,
into one’s modeling efforts seems natural to the way in which astrophysicists actually
work.
Currently there are a number of societies devoted to Bayesian methodology. Moreover,
the major statistical associations typically have committees devoted to the promotion of
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
28 Frequentist vs. Bayesian Methods
Bayesian techniques. The number of new books on Bayesian methodology has been rapidly
increasing each year. In astronomy, Bayesian methods are being employed ever more fre-
quently for the analysis of astronomical data. It is therefore important to develop a text that
describes a wide range of Bayesian models that can be of use to astronomical study. Given
that many astronomers are starting to use Python, Stan, or JAGS for Bayesian modeling
and R for basic statistical analysis, we describe Bayesian modeling from this perspective.
In this chapter we present a brief overview of this approach. As discussed in Chapter 1,
Bayesian statistics is named after Thomas Bayes (1702–1761), a British Presbyterian min-
ister and amateur mathematician who was interested in the notion of inverse probability,
now referred to as posterior probability. The notion of posterior probability can perhaps
best be understood with a simple example.
Suppose that 60% of the stellar systems in a galaxy far, far away host an Earth-like
planet. It follows therefore that 40% of these stellar systems fail to have Earth-like planets.
Moreover, let us also assume that every system that hosts an Earth-like planet also hosts
a Jupiter-like planet, while only half the systems which fail to host Earth-like planets host
Jupiter-like planets. Now, let us suppose that, as we peruse stellar systems in a Galaxy,
we observe a particular system with a Jupiter-like planet. What is the probability that this
system also hosts an Earth-like planet?
This is an example of inverse probability. Bayes developed a formula that can be used to
determine the probability that the stellar system being examined here does indeed host an
Earth-like planet. The equation used to solve this problem is called Bayes’ theorem.
We can more easily calculate the probability if we list each element of the problem
together with its probability. We symbolize the terms as follows:
⊕ = Earth-like planet
= Jupiter-like planet
∼ = not, or not the case. This symbol is standard in symbolic logic.
P(⊕) = 0.6, the probability that the system hosts an Earth-like planet, notwithstanding (i.e.,
ignoring) other information;
P(∼ ⊕) = 0.4, the probability that a system does not have a Earth-like planet, notwith-
standing other information;
P(|∼ ⊕) = 0.5, the probability that a system hosts a Jupiter-like planet given that it does
not host an Earth-like planet;
P(|⊕) = 1, the probability that a system hosts a Jupiter-like planet given that it also hosts
an Earth-like planet;
P() = 0.8, the probability of randomly selecting a system that has a Jupiter-like planet.
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
29 3.2 Basic Theory of Bayesian Modeling
In terms of the above notation, the probability that a system contains both an Earth-like
and a Jupiter-like planet is given by
P(B|A)P(A)
P(A|B) = , (3.9)
P(B)
where A and B are specified events. Suppose we have a model consisting of continuous
variables, including the response y. When using Bayes’ theorem for drawing inferences
from data such as these, and after analyzing the data, one might be interested in the prob-
ability distribution of one or several parameters θ , i.e., the posterior distribution p(θ |y).
To this end, the Bayesian theorem Equation 3.9 is reformulated for continuous parameters
using probability distributions:
p(y|θ )p(θ ) L(θ )π (θ )
p(θ |y) = = . (3.10)
p(y) L(θ )π (θ )dθ
In Bayesian modeling, the second right-hand side of Equation 3.10 is formed by the
model likelihood L(θ ) times the prior distribution π (θ ) divided by L(θ )π (θ )dθ , which
is the total probability of the data taking into account all possible parameter values. The
denominator is also referred to as the marginal probability or the normalization constant
guaranteeing that the observation probabilities sum to 1. Since the denominator is a con-
stant, it is not normally used in determining the posterior distribution p(θ |y). The posterior
distribution of a model parameter is key to understanding a Bayesian model. The Bayesian
formula underlying Bayesian modeling is therefore
with p(θ |y) as the posterior probability of a model parameter, given the data. The mean of
the posterior distribution in a Bayesian model is analogous to the coefficient of a maxi-
mum likelihood model. The standard deviation of the posterior is similar to the standard
error of the coefficient and the usual 95% credible intervals are similar to the 95% confi-
dence intervals in frequentist thinking. The similarities are, however, purely external. Their
interpretations differ, as do the manner in which they are calculated.
In Bayesian modeling, the model parameters are assumed to be distributed according to
some probability distribution. In frequentist statistics the parameters are considered to be
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
30 Frequentist vs. Bayesian Methods
constant: they are the values of the underlying probability distribution which describes the
data being modeled. The goal in maximum likelihood modeling is to estimate the unknown
parameter values as accurately as possible. The error in this estimation is reflected in a
parameter’s standard error. The parameters to be estimated are slopes informing us of the
rate of change in the response variable given a one-unit change in the respective predic-
tors. The confidence interval for an estimated parameter is obtained on the basis that the
true parameter would be within the confidence interval if the experiment were repeated an
infinite number of times.
In Bayesian modeling the likelihood of the predictor data is combined with outside
information which the analyst knows from other sources. If this outside information (data
not in the original model specification) is to be mixed with the likelihood distribution,
it must be described by an appropriate probability distribution with specified parame-
ter values, or hyperparameters. This is the prior distribution. Each model parameter is
mixed with an appropriate prior distribution. The product of the likelihood distribution
and prior distribution produces a posterior distribution for each parameter in the model.
To be clear, this includes posteriors for each predictor in the model plus the intercept
and any ancillary shape or scale parameters. For instance, in a normal Bayesian model,
posterior distributions must be found for the intercept, for each predictor, and for the
variance, σ 2 . If the analyst does not know of any outside information bearing on a
model parameter then a diffuse prior is given, which maximizes the information regard-
ing the likelihood or the data. Some statisticians call this type of prior non-informative,
but all priors carry some information. We shall employ informative prior distributions in
Bayesian models when we discuss real astronomical applications toward the end of the
book. Examples of diffuse priors will also be given. When we describe synthetic Bayesian
models, in Chapters 4 to 8, we will nearly always use diffuse priors. This will allow the
reader to see the structure of the code so that it can be used for many types of applica-
tion. It also allows the reader to easily compare the input fiducial model and the output
posteriors.
Typically the foremost value of interest in a Bayesian model is the posterior mean. The
median, or even the mode, can also be used as the central tendency of the posterior distribu-
tion. The median is generally used when the posterior distribution is substantially skewed.
Many well-fitted model predictor-parameters are normally distributed, or approximately
so; consequently the mean is most often used as the standard statistic of central tendency.
Standard deviations and credible intervals are based on the mean and shape of the poste-
rior. The credible intervals are defined as the outer 0.025 (or some other value) quantiles of
the posterior distribution of a parameter. When the distribution is highly skewed, the cred-
ible intervals are usually based on the highest posterior density (HPD) region. For 0.025
quantiles, we can say that there is a 95% probability that the credible interval contains the
true posterior mean. This is the common sense interpretation of a credible interval, and
is frequently confused with the meaning of the maximum likelihood confidence interval.
However, this interpretation cannot be used with maximum likelihood models. A con-
fidence interval is based on the hypothesis that if we repeat the modeling estimation a
large number of times then the true coefficient of a predictor or parameter will be within
the range of the interval on 95% of the time. Another way to understand the difference
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
31 3.2 Basic Theory of Bayesian Modeling
between confidence and credible intervals is to realize that for a confidence interval the
interval itself is a random variable whereas for a credible interval the estimated posterior
parameter is the random variable.
In the simplest case, an analyst can calculate the posterior distribution of a parameter
as the product of the model likelihood and a prior providing specific parameter values that
reflect the outside information being mixed with the likelihood distribution. Most texts on
Bayesian analysis focus on this process. More often though, the calculations are too diffi-
cult to do analytically, so that a Markov chain Monte Carlo (MCMC) sampling technique
must be used to calculate the posterior. For the models we discuss in this book, we use
software that assumes that such a sampling occurs.
A MCMC sampling method was first developed by Los Alamos National Laborato-
ries physicists Nicholas Metropolis and Stanislaw Ulam (Metropolis and Ulam, 1949).
Their article on “The Monte Carlo method” in the Journal of the American Statisti-
cal Association initiated a methodology that the authors could not have anticipated.
The method was further elaborated by Metropolis et al. (1953). It was further refined
and formulated in what was to be known as the Metropolis–Hastings method by W. K.
Hastings (Hastings, 1970). Gibbs sampling was later developed by Stuart and Donald
Geman in 1984 in what is now one of the most cited papers in statistical and engi-
neering literature (Geman and Geman, 1984); JAGS, the main software we use in this
text, is an acronym for “Just another Gibbs sampler.” Further advances were made by
Gelfand et al. (1990), Gelfand and Smith (1990), and Tanner and Wong (1987). The
last authors made it possible to obtain posterior distributions for parameters and latent
variables (unobserved variables) of complex models. Bayesian algorithms are continu-
ally being advanced and, as software becomes faster, more sophisticated algorithms are
likely to be developed. Advances in sampling methods and software have been devel-
oped also within the astronomical community itself; these include importance-nested
sampling (Feroz and Hobson, 2008) and an affine-invariant ensemble sampler for MCMC
(Foreman-Mackey et al., 2013). Stan, which we use for many examples in this text, is
one of the latest apparent advances in Bayesian modeling; it had an initial release date of
2012.
A Markov chain, named after Russian mathematician Andrey Markov (1856–1922),
steps through a series of random variables in which a given value is determined on the
basis of the value of the previous element in the chain. Values previous to the last are
ignored. A transition matrix is employed which regulates the range of values randomly
selected during the search process. When applied to posterior distributions, MCMC runs
through a large number of samples until it finally achieves the shape of the distribution
of the parameter. When this occurs, we say that convergence has been achieved. Bayesian
software normally come with tests of convergence as well as graphical outputs displaying
the shape of the posterior distribution. We will show examples of this in the text. A simple
example of how a transition matrix is constructed and how convergence is achieved can
be found in Hilbe (2011). Finally, without going into details, in Appendix A we show a
very simple application of a faster alternative to MCMC sampling known as the integrated
nested Laplace approximation, which results in posteriors that are close to MCMC-based
posteriors.
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
32 Frequentist vs. Bayesian Methods
Likelihood
Suppose that we set out to randomly observe 50 star clusters in a particular type of galaxy
over the course of a year. We observe one cluster per observing session, determining
whether the cluster has a certain characteristic X, e.g., is an open cluster. We record our
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
33 3.2 Basic Theory of Bayesian Modeling
where π is the symbol used in many texts to represent the predicted mean value of the
model response term, y. For a Bernoulli distribution, which is the function used in binary
logistic regression, π is the probability that y = 1, whereas 1 − π is the probability that
y = 0. The value of π ranges from 0 to 1, whereas y has values of only 0 or 1. Note that the
symbol μ is commonly used in place of π to represent the predicted mean. In fact, we shall
be using μ throughout most of the text. However, we shall use π here for our example of a
Bayesian Bernoulli and beta predicted probability. It may be helpful to get used to the fact
that these symbols are used interchangeably in the literature.
In Bayesian modeling we assume that there is likely outside knowledge, independent
of the model data, which can provide extra information about the subject of our model.
Every parameter in the model – the predictor parameters, the intercept, and any ancillary
dispersion, variance, or scale parameters – has a corresponding prior. If there is no outside
information then we multiply the likelihood by a diffuse prior, which is also referred to as a
reference prior, a flat prior, a uniform prior, or even a non-informative prior. The parameter
estimates are then nearly identical to the maximum likelihood estimated parameters.
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
34 Frequentist vs. Bayesian Methods
Prior
Let us return for a moment to the open or globular cluster example we mentioned before.
According to previous studies we know that 22% to 26% of star clusters, in the particular
galaxy type we are studying, have the characteristic X (are open clusters). We are using
the observed data to specify our likelihood function, but wish to use the information from
previous studies as an informative prior. That is, we wish to combine the information from
our study with information from previous studies to arrive at a posterior distribution.
We start to construct a prior by determining its mean, which is 24%, or half of 22% and
26%. Given the rather narrow range in the previous observations we might consider the
prior as heavily informative, but it does depend on the number of previous studies made,
how similar the other galaxies are to our study, and our confidence in their accuracy. We
do not have this information, though.
A reasonable prior for a Bernoulli likelihood parameter π is the beta distribution (Sec-
tion 5.2.4). This is a family of continuous probability distribution functions defined on
the interval [0, 1] and hence ideal for modeling probabilities. The beta distribution may be
expressed as
(α + β) α−1
f (π ; α, β) = π (1 − π )β−1 . (3.15)
(α) (β)
the symbol for proportionality, ∝, appears owing to the exclusion of the normalization
constant. As already mentioned, the parameters of prior distributions are generally referred
to as hyperparameters, in this case α and β.
Our posterior will therefore be determined by the product of a Bernoulli likelihood
(Equation 3.14) and a beta prior over the parameter π (Equation 3.16):
The influence of the prior in shaping the posterior distribution is highly dependent on
the sample size n. If we have an overwhelmingly large data set at hand then the data itself,
through the likelihood, will dictate the shape of the posterior distribution. This means that
there is so much information in the current data that external or prior knowledge is statisti-
cally irrelevant. However, if the data set is small (in comparison with our prior knowledge
and taking into account the complexity of the model), the prior will lead to the construction
of the posterior. We shall illustrate this effect in our stellar cluster example.
As we mentioned above, we know from previous experience that, on average, 24% of the
stellar clusters we are investigating are open clusters. Comparing Equations 3.13 and 3.16
we see that the same interpretation as assigned to the exponents of the Bernoulli distribu-
tion can be used for those of the beta distribution (i.e. the distributions are conjugate). In
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
35 3.2 Basic Theory of Bayesian Modeling
other words, α − 1 can be thought of as the number of open clusters (i.e. where characteris-
tic X was observed) and β − 1 as the number of globular clusters (i.e. without characteristic
X). Considering our initial sample of n = 50 clusters, this means that
α − 1 is given by 0.24 × 50 = 12
and
β − 1 is given by 0.76 × 50 = 38.
Thus, the parameters of our prior are α = 13 and β = 39 and the prior itself is described
as
p(π ; α, β) = Beta(π ; 13, 39) ∝ π 13−1 (1 − π )39−1 = π 12 (1 − π )38 (3.18)
A plot of this prior, displayed in Figure 3.3, can be generated in R with the following code:
We have placed a vertical line at the mean, 0.24, to emphasize that it coincides with the
peak of the distribution.
Posterior
The posterior distribution is calculated as the product of the likelihood and the prior. Given
the simple form of the Bernoulli and beta distributions, the calculation of the posterior is
straightforward:
4
y
3
0
0.0 0.2 0.4 0.6 0.8 1.0
t
x
Figure 3.3 Beta prior p for α = 13 and β = 39. The vertical line indicates the mean at x = 0.24.
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
36 Frequentist vs. Bayesian Methods
Notice that the posterior distribution has shifted over to have a mean of 19.6%.
The triplot function in the LearnBayes R package, which can be downloaded from
CRAN, produces a plot of the likelihood, prior, and posterior, given the likelihood and
prior parameters, as in Figure 3.5:
> library(LearnBayes)
> triplot(prior=c(13,39), data=c(7,43))
10
6
y
t
x
Figure 3.4 Beta posterior with α = 20 and β = 82. The vertical line highlights the mean at x = 0.196.
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
37 3.2 Basic Theory of Bayesian Modeling
10 Prior
Likelihood
8 Posterior
6
Density
t
p
Figure 3.5 Beta prior distribution (dotted line), likelihood (solid line), and posterior distribution (dashed line) over π with
sample size n = 50, calculated with the triplot R function.
Note how the posterior (dashed line) summarizes the behavior of the likelihood (solid line)
and the prior (dotted line) distributions.
We can see from Figure 3.5 that the prior was influential and shifted the posterior mean
to the right. The earlier plots of the prior and posterior are identical to the plots of the prior
and posterior in this figure. The nice feature of the triplot function is that the likelihood
is also displayed.
Let us take a look at how the influence of the prior in the likelihood changes if we
increase the number of data points drastically. Consider the same prior, a beta distribution
with 0.24 mean, and a sample size of n = 1000 star clusters rather than 50, with the
proportion between the classes as in our initial data set (7 out of 50, or 14% of the clusters
in the sample are open clusters); then
number of open clusters is 0.14 × 1000 = 140
and
number of globular clusters is 0.86 × 1000 = 860.
Note that we still have 14% of the new observed star clusters as open clusters, but now with
a much larger data set to support this new evidence. The distributions shown in Figure 3.6
can be generated by typing
> triplot(prior=c(13, 39), data=c(140, 860))
in the R terminal.
It is clear that the likelihood and posterior are nearly the same. Given a large sample size,
the emphasis is on the data. We are telling the model that the prior is not very influential
given the overwhelming new evidence.
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
38 Frequentist vs. Bayesian Methods
30
Prior
Likelihood
25 Posterior
20
Density
15
10
0
0.0 0.2 0.4 0.6 0.8 1.0
t
p
Figure 3.6 Beta prior distribution (dotted line), likelihood (solid line), and posterior distribution (dashed line) over π , with
sample size n = 1000.
Finally we employ a sample size of 10, which is five times smaller than our initial data
set, in order to quantify the influence of the prior in this data situation where
number of open clusters is 0.14 × 10 = 1.4
and
number of globular clusters is 0.86 × 10 = 8.6.
A figure of all three distributions with n = 10 is displayed, as in Figure 3.7, by typing
Note that now the prior is highly influential, resulting in the near identity of the prior and
posterior. Given a low sample size, the outside information coming to bear on the model is
much more important than the study data itself.
The above example is but one of many that can be given. We have walked through this
example to demonstrate the importance of the prior relative to new information present in
the data set. Other commonly used priors are the normal, gamma, inverse gamma, Cauchy,
half-Cauchy, and χ 2 . Of course there are a number of diffuse priors as well – the uniform or
flat prior, and standard priors with hyperparameters that wash out meaningful information
across the range of the likelihood – the so-called non-informative priors. These priors will
be discussed in the context of their use throughout the text.
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
39 3.2 Basic Theory of Bayesian Modeling
Prior
Likelihood
Posterior
6
Density
4
t
p
Figure 3.7 Beta prior distribution (dotted line), likelihood (solid line), and posterior distribution (dashed line) over π with
sample size n = 10.
MCMCpack package. The code below demonstrates how this can be done using the latter
option.
# Data
nobs = 5000 # number of obs in model
x1 <- runif(nobs) # random uniform variable
# Fit
posteriors <- MCMCregress(y ˜ x1, thin=1, seed=1056, burnin=1000,
mcmc=10000, verbose=1)
# Output
summary(posteriors)
==================================================
The above code will generate the following diagnostic information on screen:
Iterations = 1001:11000
Thinning interval = 1
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
40 Frequentist vs. Bayesian Methods
Number of chains = 1
Sample size per chain = 10000
> plot(posteriors)
8 12
4
0
2000 4000 6000 8000 10000 1.90 1.95 2.00 2.05 2.10
Iterations N = 10000 Bandwidth = 0.004772
Trace of x1 Density of x1
0 2 4 6 8
2.80 2.95 3.10
2000 4000 6000 8000 10000 2.8 2.9 3.0 3.1 3.2
Iterations N = 10000 Bandwidth = 0.008223
t
Iterations N = 10000 Bandwidth = 0.003339
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
41 3.2 Basic Theory of Bayesian Modeling
# Fit
df = pandas.DataFrame({’x1’: x1, ’y’: y}) # rewrite data
# Output
summary(trace)
Intercept:
Mean SD MC Error 95% HPD interval
------------------------------------------------------------
1.985 0.052 0.002 [1.934, 2.045]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
1.930 1.967 1.986 2.006 2.043
x1:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------
2.995 0.079 0.002 [2.897, 3.092]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
2.897 2.965 2.997 3.031 3.092
sd_log:
Mean SD MC Error 95% HPD interval
-----------------------------------------------------------------
-0.009 0.013 0.001 [-0.028, 0.012]
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
Intercept Intercept
3.0 3.5
Sample value
2.5 3.0
Frequency
2.0 2.5
1.5 2.0
1.5
1.0 1.0
0.5 0.5
0.0 0.0
0 1000 2000 3000 4000 5000 0.0 0.5 1.0 1.5 2.0 2.5 3.0
x1 x1
4.5 2.0
4.0
Sample value
Frequency
3.5 1.5
3.0
2.5 1.0
2.0
1.5
1.0 0.5
0.5
0.0 0.0
0 1000 2000 3000 4000 5000 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5
sd_log sd_log
0.4 10
Sample value
0.3
Frequency
8
0.2
0.1 6
0.0 4
–0.1
–0.2 2
–0.3 0
0 1000 2000 3000 4000 5000 –0.3 –0.2 –0.1 0.0 0.1 0.2 0.3 0.4
sd sd
1.5 10
Sample value
1.4 Frequency 8
1.3
1.2 6
1.1 4
1.0
0.9 2
0.8 0
0 1000 2000 3000 4000 5000 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5
tFigure 3.9 Output plots for the chains (left) and posterior (right) from Code 3.4.
43 3.3 Selecting Between Frequentist and Bayesian Modeling
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
-0.028 -0.016 -0.009 -0.002 0.012
sd:
Mean SD MC Error 95% HPD interval
------------------------------------------------------------------
0.992 0.013 0.001 [0.972, 1.012]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
0.972 0.984 0.991 0.998 1.012
==================================================
The plots for the chains (MCMC sampling sessions) and the posterior plots (the Python
equivalent to Figure 3.8) are shown in Figure 3.9.
A key reason to prefer a Bayesian model over one from the frequentist tradition is to
bring information into the model that comes from outside the specific model data. That
is, these models are based on data. Maximum likelihood methods attempt to find the best
estimates of the true parameter values of the probability distribution that generates the
data being modeled. The data are given; however, most modeling situations, in particular
scientific modeling, are not based on data that is unconnected to previous study infor-
mation or to the prior beliefs of those modeling the data. When modeling astrophysical
data we want to account for any information we may have about the data that is rele-
vant to understanding the parameter values being determined. The difficult aspect of doing
this is that such prior information needs to be converted to a probability function with
parameter values appropriate for the information to be included in the model. When we
mix the likelihood with a prior distribution, we are effectively changing the magnitude
and/or direction of movement of the likelihood parameters. The result is a posterior dis-
tribution. In Chapter 10 we will show a number of examples of how informative priors
affect posterior distributions when we discuss applications of Bayesian models to real
astronomical data.
In some instances converting prior information to a probability function is rather sim-
ple, but it is typically challenging. For these latter types of situation most statisticians
subject the data to an MCMC-type sampling algorithm that converges to the mean (or
median) of the parameters being estimated. On the basis of the mean of the distribution
and the shape of the parameter, standard deviation and credible interval values may be
calculated. In Bayesian modeling separate parameter distributions are calculated for each
explanatory predictor in the model as well as for the intercept and for each scale or ancillary
parameter specified for the model in question. For a normal model, separate parameters are
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
44 Frequentist vs. Bayesian Methods
solved for each predictor, the intercept, and the variance parameter, usually referred to as
sigma-squared. The Bayesian logistic model provides parameters for each predictor and the
intercept. It has no scale parameter. Likewise, a Bayesian Poisson model provides parame-
ter means for predictors and the intercept. The negative binomial model, however, also has
a dispersion parameter which must be estimated. The dispersion parameter is not strictly
speaking a scale parameter but, rather, is an ancillary parameter which adjusts for Poisson
overdispersion. We shall describe these relationships when addressing count models later
in the text.
A common criticism of Bayesian modeling relates to the fact that the prior information
brought into a model is basically subjective. It is for those modeling the data to decide
which prior information to bring into the model, and how that data is cast into a prob-
ability function with specific parameter values. This criticism has been leveled against
Bayesian methodology from the first time it was used as a model. We believe that the
criticism is not truly valid unless the prior information is not relevant to the modeling
question. However, it is vital for those defining prior information to present it clearly in
their study report or journal article. After all, when frequentist statisticians model data,
it is their decision on which predictors to include in the model and, more importantly,
which to exclude. They also transform predictor variables by logging, taking the square,
inverting, and so forth. Frequentists decide whether to factor continuous predictors into
various levels or smooth them using a cubic spline, or a lesser smoother, etc. Thus deci-
sions are made by those modeling a given data situation on the basis of their experience
as well as the goals of the study. Moreover, when appropriate, new information can be
brought into a maximum likelihood model by adding a new variable related to the new
information.
However, in Bayesian modeling the prior information is specific to each parameter,
providing a much more powerful way of modeling. Moreover, Bayesian modeling is
much more in tune with how scientific studies are carried out and with how science
advances. Remember that, when selecting so called non-informative or diffuse priors on
parameters in a Bayesian model, the modeling results are usually close to the param-
eter values produced when using maximum likelihood estimation. The model is then
based on the likelihood, or log-likelihood, and not on outside prior information. There
is no a priori reason why almost all statistical models should not be based on Bayesian
methods. If a researcher has no outside information to bear on a model that he or
she is developing, a normal(0,0.00001) or equivalent prior may be used to wash out
any prior information2 and only the model data is used in the sampling. Of course
this involves a new way of looking at statistical modeling, but it may become com-
mon practice as computing power becomes ever faster and sampling algorithms more
efficient.
As we shall discuss later in the book, how out-of-sample predictions are made using a
Bayesian model differs considerably from how such predictions are made using maximum
2 Although this is common practice, and will be used in many instances in this volume, it might lead to numerical
problems. Frequently, using a weakly informative prior instead is advisable (see the recommendations from the
Stan community in Stan, 2016).
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
45 3.3 Selecting Between Frequentist and Bayesian Modeling
Further Reading
Cowles, M. K. (2013). Applied Bayesian Statistics: With R and Open BUGS Examples.
Springer Texts in Statistics. Springer.
Feigelson, E. D. and G. J. Babu (2012a). Modern Statistical Methods for Astronomy: With
R Applications. Cambridge University Press.
Hilbe, J. M. and A. P. Robinson (2013). Methods of Statistical Model Estimation. EBL-
Schweitzer. CRC Press.
Korner-Nievergelt, F., T. Roth, S. von Felten, J. Guélat, B. Almasi, and P. Korner-
Nievergelt (2015). Bayesian Data Analysis in Ecology Using Linear Models with R,
BUGS, and Stan. Elsevier Science.
McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and
Stan. Chapman & Hall/CRC Texts in Statistical Science. CRC Press.
Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 21:27:48, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.004
4 Normal Linear Models
The first Bayesian model we address is the normal or Gaussian model. We shall examine
the normal probability distribution function (PDF), as well as its likelihood and log-
likelihood functions. All the Bayesian models we discuss in this volume are based on
specific likelihoods, which in turn are simple re-parameterizations of an underlying PDF.
For most readers, this section will be no more than review material since it is the most
commonly used model employed by analysts and researchers when modeling physical and
astrophysical data.
As discussed in Chapter 3, a PDF links the outcome of an experiment described by a
random variable with its probability of occurrence. The normal or Gaussian PDF may be
expressed as
1
e−(y−μ) /2σ ,
2 2
f (y; μ, σ 2 ) = √ (4.1)
2π σ 2
where y represents the experimental measurements of the random variable y, μ is its mean,
and σ 2 is the scale parameter, or variance. The square root of the variance, σ , is called
the standard deviation. The Gaussian PDF has the well-known format of a bell curve, with
the mean parameter determining the location of its maximum in the real number domain R
and the standard deviation controlling its shape (Figure 4.1 illustrates how these parameters
affect the shape of the distribution). Notice the function arguments on the left-hand side of
Equation 4.1. The probability function, indicated as f (·), tells us that the data y is generated
on the basis of the parameters μ and σ . Any symbol on the right-hand side of Equation 4.1
that is not y or a parameter is regarded as a constant. This will be the case for all the PDFs
we discuss.
A statistical model, whether it is based on a frequentist or Bayesian methodology, is
structured to determine the values of parameters based on the given data. This is just the
reverse of how a PDF is understood. The function that reverses the relationship of what
is to be generated or understood in a PDF is called a likelihood function. The likelihood
function determines which parameter values make the data being modeled most likely –
hence the name likelihood. It is similar to the PDF except that the left-hand side of the
equation now appears as
1
e−(y−μ) /2σ ,
2 2
L μ, σ 2 ; y = √ (4.2)
2π σ 2
46
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
47 4.1 The Gaussian or Normal Model
Normal PDF
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0.0 0.0
–6 –4 –2 0 2 4 6 –6 –4 –2 0 2 4 6
t
x x
Figure 4.1 Left: Set of gaussian or normal probability distribution functions centered at the origin, μ = 0, with different values
for the standard deviation σ . Right: Set of gaussian or normal probability distribution functions centered in different
locations with standard deviation σ = 1.0.
indicating that the mean and variance parameters are to be determined on the basis of the
data.
Again, in statistical modeling we are interested in determining estimated parameter
values. It should also be mentioned that both the probability and likelihood functions apply
to all the observations in a model. The likelihood of a model is the product of the likeli-
hood values for each observation. Many texts indicate this relationship with a product
sign, , placed in front of the term on the right-hand side of Equation 4.1. Multiply-
ing individual likelihood values across observations can easily lead to overflow, however.
As a result, statisticians log the likelihood and use this as the basis of model estimation.
The log-likelihood is determined from the sum of individual observation log-likelihood
values. Statisticians are not interested in individual log-likelihood values, but the estimation
algorithm uses them to determine the model’s overall log-likelihood. The log-likelihood
function L for the normal model may be expressed as follows:
N
(yi − μi )2 1
L (μ, σ ; y) =
2
− − ln 2π σ 2
. (4.3)
2σ 2 2
i=1
The normal model is used to estimate the mean and variance parameters from continuous
data with a range of −∞ to +∞. A key distributional assumption of the normal or Gaus-
sian distribution is that the variance is constant for all values of the mean. This is an
important assumption that is often neglected when models are selected for specific data
situations. Also note that the normal model allows for the possibility that the continuous
variable being modeled has both negative and positive values. If the data being modeled
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
48 Normal Linear Models
Normal PDF
y
x1
x2
x3 x
tFigure 4.2 Illustration of points normally distributed around a linear relation yi = β0 + β1 xi + εi , with ε ∼ Normal(0, σ 2 ).
can only have positive values, or if the values are discrete or are counts, using a normal
model violates the distributional assumptions upon which the model is based.
The Gaussian, or normal, model is defined by the probability and log-likelihood distri-
butions underlying the data to be modeled, specifically the behavior of the noise associated
with the response variable. In the case of a normal linear model with a single predictor, the
response variable noise is normally distributed and the relationship between the response
variable y and the explanatory variable x traditionally takes the form
yi = β0 + β1 xi + ε, ε ∼ Normal(0, σ 2 ). (4.4)
Figure 4.2 illustrates this behavior, showing a systematic component (the linear
relationship between x and y) and the normal PDF driving the stochastic element of the
model (the errors in y). In Equation 4.4, the intercept and slope of the explanatory variable
x are represented by β0 and β1 , respectively, and ε is the error in predicting the response
variable y. The index i runs through all the available observations. We can drop the error
term by indicating ŷ as an estimated fitted value with its own standard error:
ŷi = β0 + β1 xi . (4.5)
In the case of the normal linear model, ŷ is the fitted value and is also symbolized by
μ or, more technically, μ̂. Statisticians use the hat over a symbol to represent an estimate
(be aware that this notation is not used when the true value of the parameter is known).
However, we will not employ this convention. It will be straightforward to use the context
in which variables are being described in order to distinguish between these two meanings.
For the normal linear model the two values are identical, ŷ = μ. This is not the case for
models not based on the normal or Gaussian distribution, e.g., logistic, Poisson, or gamma
regression models.
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
49 4.1 The Gaussian or Normal Model
# Likelihood function
for (i in 1:N){
Y[i]˜dnorm(mu[i],tau)
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
50 Normal Linear Models
# Initial values
inits <- function () {
list(beta = rnorm(K, 0, 0.01))
}
# Parameters to be displayed
params <- c("beta", "sigma")
# MCMC
normfit <- jags(data = model.data,
inits = inits,
parameters = params,
model = textConnection(NORM),
n.chains = 3,
n.iter = 15000,
n.thin = 1,
n.burnin = 10000)
The source code below, CH Figures-R, comes from a section of code that is made
available to readers of Zuur, Hilbe, and Ieno (2013) from the publisher’s website.
Plot the chains to assess mixing (Figure 4.3):
source("CH-Figures.R")
out <- normfit$BUGSoutput
MyBUGSChains(out,c(uNames("beta",K),"sigma"))
source("CH-Figures.R")
out <- normfit$BUGSoutput
MyBUGSHist(out,c(uNames("beta",K),"sigma"))
The algorithm begins by making certain that the R2jags library is loaded into the mem-
ory. If it is not then an error message will appear when one attempts to run the model.
We set the seed at 1056; in fact we could have set it at almost any value. The idea is that
the same results will be obtained if we remodel the data using the same seed. We also set
the number of observations in the synthetic model at 5000. A single predictor consisting
of random uniform values is defined for the model and given the name x1. The following
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
51 4.1 The Gaussian or Normal Model
3.2
1.04
1.90 1.95 2.00 2.05 2.10
1.02
3.1
Sampled values
1.00
3.0
0.98
2.9
0 1000 2000 3000 0 1000 2000 3000
t
MCMC iterations
Figure 4.3 MCMC chains for the two regression parameters β1 and β2 and for the standard deviation σ for the normal model.
0
1.90 1.95 2.00 2.05 2.10 2.9 3.0 3.1 3.2 0.98 1.00 1.02 1.04
t
Posterior distribution
Figure 4.4 Histograms of the MCMC iterations for each parameter. The thick line at the base of each histogram represents the
95% credible interval. Note that no 95% credible interval contains 0.
line specifies the linear predictor, xb, which is the sum of the intercept and each term in
the model. A term is understood as the predictor value times its associated coefficient (or
estimated mean). We have only a single term here. Finally, we use R’s (pseudo)random nor-
mal number generator, rnorm(), with three arguments to calculate a random response y.
We specify that the synthetic model has intercept value 2 and slope or coefficient 3. When
we run a linear OLS model of y on x1, we expect the values of the intercept and coefficient
to be close to 2 and 3 respectively. We defined sigma to be 1, but sigma is not estimated
using R’s glm function. It is estimated using R’s lm function as the “residual mean squared
error.” It can also be estimated using our JAGS function for a normal model. The value
estimated is 0.99 ± 0.1.
The Bayesian code begins by constructing a directory or listing of the components
needed for developing a Bayesian model. Recall that the code will provide posterior
distributions for the intercept, for the x1 slope, and for the variance.
The term X consists of a matrix of 5000 observations and two rows, one for the inter-
cept, 1, and another for x1; model.data is a list containing Y, the response, X, the data
matrix, K, the number of columns in the matrix, and N, the number of observations in the
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
52 Normal Linear Models
matrix. We will have other models where we define b0 and B0 in the list as the mean and
variance of the priors used in the model. If we are using so-called non-informative pri-
ors then they are not needed but, if we do want to use them here, we could specify b0 =
rep(0,K) meaning that there are K priors each with value 0. The variance B0 can be defined
as diag(0.00001,K), meaning that each prior has a very wide variance (as the inverse of
the precision) and thus does not contribute additional information to the model. Again, our
model here is simple and does not require knowing or adjusting the mean and variance of
the priors.
The next grouping consists of a block where the priors and log-likelihood are defined.
For the intercept and x1 we assign what is called a non-informative prior. We do this when
we have no outside information about x1, which is certainly the case here given its synthetic
nature. Bayesian statisticians have demonstrated that no prior is truly non-informative if
run on real data. We learn something about the data by specifying that we have no outside
information about it.
Recall that outside information (the prior) must be converted to a probability distribution
and multiplied with the log-likelihood distribution of the model. The idea here is that
we want the log-likelihood, or distribution defining the data, to be left as much as pos-
sible as it is. By specifying a uniform distribution across the range of parameter values,
we are saying that nothing is influencing the data. There can be problems at times with
using the uniform prior, though, since it may be improper, so most Bayesians prefer to use
the normal distribution with mean 0 and a very small precision value e.g., 0.0001. JAGS
uses a precision statistic in place of the variance or even the standard deviation as the
scale parameter for the normal distribution. The precision, though, is simply the inverse
of the variance. A precision of 0.0001 is the same as a variance of 10 000. This generally
guarantees that the variance is spread so wide that it is equal for all values in the range of
the response.
The second block in the NORM model relates to defining priors for the variance σ 2 . Sigma
is the standard deviation, which is given a uniform distribution running from 0 to 100. The
precision tau (τ ) is then given as the inverse of σ 2 , i.e. pow(sigma, -2).
The third block defines the normal probability with the betas and the means and the
precision as the inverse variance. As a synthetic model, this defines the random normal
values adjusted by the mean and variance. The line mu[i] <- eta[i] assigns the values of
the linear predictor of the model, obtained from inprod(beta[], X[i,]), to mu, which is
commonly regarded as the term indicating the fitted or predicted value. For a normal model
eta = mu. This will not be the case for other models we discuss in the book.
It should be noted that, when using JAGS, the order in which we place the code directives
makes no difference. In the code lines above it would seem to make sense to calculate the
linear predictor, eta, first and then assign it to mu. The software is actually doing this, but
it is not apparent from viewing the code.
The next block of code provides initial values for each parameter in the model. The
initial values for beta are specified as 0 mean and 0.01 precision. These values are put
into a list called beta. Following the initial values, the code specifies the parameters to be
estimated, which are the beta or mean values and sigma, the standard deviation.
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
53 4.1 The Gaussian or Normal Model
The final block of code submits the information above into the MCMC sampling algo-
rithm, which is in fact a variety of Gibbs sampling. The details of how the sampling is to
occur are specified in the argument list. The model algorithm samples the data 15 000 times
(m.iter); the first 10 000 are used as burn-in values. These values are not kept. However,
the next 5000 are kept and define the shape of the posterior distributions for each parameter
in the model.
The final line in the code prints the model results to the screen. There are a variety of
ways of displaying this information, some of which are quite complicated. Post-estimation
code can be given for the model diagnostics, which are vital to proper modeling. We shall
discuss diagnostics in the context of examples.
Recall that we mentioned that “non-informative” priors are not completely non-
informative. Some information is entailed in the priors even if we have no outside
information to use for adjusting the model parameters. Moreover, a prior that has lit-
tle impact on one type of likelihood may in fact have a rather substantial impact on
another type of likelihood. However, the non-informative priors used in this book are
aimed to truly minimize outside information from influencing or adjusting the given model
data.
Since we employed non-informative priors on the above synthetic Bayesian normal
model, we should expect that the posterior means are close to the values we set for each
parameter. In the top R code used to define the data, we specified values of 2 for the inter-
cept β0 , 3 for the slope β1 , and 1 for the standard deviation σ . The results of our model
are 2.06 for the intercept, 2.92 for the slope, and 0.99 for the standard deviation. This is
a simple model, so the MCMC sampling algorithm does not require a very large num-
ber of iterations in order to converge. When the data consist of a large number of factor
variables, and a large number of observations, or when we are modeling mixed and multi-
level models, convergence may require a huge number of sampling iterations. Moreover, it
may be necessary to skip one or more iterations when sampling, a process called thinning.
This helps ameliorate, if not eliminate, autocorrelation in the posterior distribution being
generated.
Another tactic when having problems with convergence is to enlarge the default burn-in
number. When engaging a sampling algorithm such as MCMC the initial sample is very
likely to be far different from the final posterior distribution. Convergence as well as appro-
priate posterior values are obtained if the initial sampling values are excluded from the
posterior distribution. Having a large number of burn-in samples before accepting the
remainder of the sample values generally results in superior estimates. If we requested
more iterations (n.iter = 50000) and a greater burn-in number (n.burnin = 30000) it is
likely that the mean, standard deviation, and credible interval would be closer to the values
we defined for the model. Experimentation and trial-and-error are common when working
with Bayesian models. Remember also that we have used a synthetic model here; real data
is much more messy.
Finally, whenever reporting results from a Bayesian model, it is important to be clear
about the elements enclosed in a given implementation. In this context, the Gaussian
normal model studied in this section can be reported as
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
54 Normal Linear Models
Yi ∼ Normal(μi , σ 2 )
g(μi ) = ηi
ηi ≡ xTi β = β1 + β2 x1 (4.6)
β1 ∼ Normal(0, 104 )
β2 ∼ Normal(0, 104 )
σ 2 ∼ Uniform(0, 102 )
i = 1, . . . , N
where N is the number of observations. Notice that this is almost a direct mathematical
translation from the JAGS implementation shown in Code 4.1. Although in this particular
example the description is trivial, this type of representation can be extremely helpful as
models increase in complexity, as we will show in the following chapters.
4.1.2 Bayesian Synthetic Normal Model in R using JAGS and the Zero Trick
For complex non-linear models it may be difficult to construct the correct log-likelihood
and associated statistics. Analysts commonly use what is termed the zero trick when setting
up a Bayesian model. We shall use this method often throughout the book since it makes
estimation easier. In some cases it is helpful to display the codes for both standard and
zero-trick approaches.
The zero trick1 is based on the probability of zero counts in a Poisson distribution with a
given mean. You may come across references to a one-trick model as well, which is based
on the probability of obtaining the value 1 from a Bernoulli or binary logistic distribution.
The zero-trick approach is much more common than the one-trick approach, so we focus
on it.
The code below converts a standard JAGS model to a model using the zero trick for
estimation of the posterior distributions of the model parameters. We use the same data
and same model set-up. All that differs is that the zero trick is used. It is instructive to use
this method since the reader can easily observe the differences and how to set up the zero
trick for his or her own purposes.
The amendments occur in the model.data function, where we add Zeros = rep(0,obs)
to the list. In the likelihood block we have added C <- 1000 and two lines beginning with
Zero. We give C the value 1000 and add it to the likelihood to ensure positive values.
Otherwise, all is the same.
1 The zero trick is a way to include a given customized likelihood, L(θ ), not available in JAGS. It uses the fact
that for a Poisson distribution, the probability of observing a zero is e−μ . So, if we observe a data point with
a value of zero, a Possion distribution with mean log(L(θ )) yields the probability L(θ ). But because log(L(θ ))
must always be a positive value (a requirement of the Poission distribution), a large constant needs to be added
to the likelihood to ensure that it is always positive.
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
55 4.1 The Gaussian or Normal Model
Code 4.2 Normal linear model in R using JAGS and the zero trick.
==================================================
require(R2jags)
set.seed(1056) # set seed to replicate example
nobs = 5000 # number of obs in model
x1 <- runif(nobs) # predictor, random uniform variable
# Model setup
X <- model.matrix(˜ 1 + x1)
K <- ncol(X)
model.data <- list(Y = y, # Response variable
X = X, # Predictors
K = K, # Number of predictors including
the intercept
N = nobs, # Sample size
Zeros = rep(0, nobs) # Zero trick
)
# Likelihood function
C <- 10000
for (i in 1:N){
Zeros[i] ˜ dpois(Zeros.mean[i])
Zeros.mean[i] <- -L[i] + C
l1[i] <- -0.5 * log(2*3.1416) - 0.5 * log(sigma)
l2[i] <- -0.5 * pow(Y[i] - mu[i],2)/sigma
L[i] <- l1[i] + l2[i]
mu[i] <- eta[i]
eta[i] <- inprod(beta[], X[i,])
}
}"
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
56 Normal Linear Models
n.iter = 15000,
n.thin = 1,
n.burnin = 10000)
The values are the same as in the model not using the zero trick. We have displayed both
models to assure you that the zero trick works. We shall use it frequently for other models
discussed in this volume.
# Data
np.random.seed(1056) # set seed to replicate example
nobs = 5000 # number of obs in model
x1 = uniform.rvs(size=nobs) # random uniform variable
# Fit
toy_data = {} # build data dictionary
toy_data[’nobs’] = nobs # sample size
toy_data[’x’] = x1 # explanatory variable
toy_data[’y’] = y # response variable
stan_code = """
data {
int<lower=0> nobs;
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
57 4.1 The Gaussian or Normal Model
vector[nobs] x;
vector[nobs] y;
}
parameters {
real beta0;
real beta1;
real<lower=0> sigma;
}
model {
vector[nobs] mu;
mu = beta0 + beta1 * x;
y ˜ normal(mu, sigma); # Likelihood function
}"""
# Output
nlines = 8 # number of lines on screen output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
The summary dumped to the screen will have the form
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta0 1.99 7.7e-4 0.03 1.93 1.97 1.99 2.01 2.04 1349.0 1.0
beta1 3.0 1.4e-3 0.05 2.9 2.96 3.0 3.03 3.1 1345.0 1.0
sigma 0.99 2.4e-4 9.9e-3 0.97 0.98 0.99 1.0 1.01 1683.0 1.0
4.1.4 Bayesian Synthetic Normal Model using Stan with a Customized Likelihood
In the example above we used the built-in normal distribution in Stan by employing the
symbol ∼ in the model code block when defining the likelihood. Stan also offers the pos-
sibility to implement custom probability distributions without the necessity to use the zero
trick. As an example, if you wish to write your own normal distribution in the above code,
it is enough to replace the model block by
Code 4.5 Modifications to be applied to Code 4.3 in order to use a customized likelihood.
==================================================
model {vector[nobs] mu;
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
58 Normal Linear Models
real temp_const;
vector[nobs] loglike;
mu = beta0 + beta1 * x;
All other parts of the code remain the same. This feature is redundant at this point but,
as explained in Section 4.1.2, it greatly enlarges the range of models that we are able to
address.
The model can be easily generalized for more predictors when the response variable y is a
linear combination of a set of explanatory variables x. In this case, Equation 4.4 takes the
form
yi = β0 + β1 x1i + β2 x2i + · · · + βj xji + ε, (4.7)
where each term of βi xj depicts a predictor xj and its associated slope or coefficient βi . In
addition, ε represents the error in predicting the response variable on basis of the explana-
tory variables, i runs through all available observations in the model, and j represents the
number of predictors in the model.
The mathematical representation can be easily extended to
Yi ∼ Normal(μi , σ )
g(μi ) = ηi
ηi ≡ xTi β = β0 + βj xi + · · · + βj xN (4.8)
βj ∼ Normal(0, 10 ) 4
σ ∼ Uniform(0, 102 )
i = 1, . . . , N
j = 1, . . . , J
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
59 4.2 Multivariate Normal Model
# Likelihood function
for (i in 1:N){
Y[i]˜dnorm(mu[i],tau)
mu[i] <- eta[i]
eta[i] <- inprod(beta[], X[i,])
}
}"
inits <- function () {
list (
beta = rnorm(K, 0, 0.01))
}
params <- c ("beta", "sigma")
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
60 Normal Linear Models
# Data
np.random.seed(1056) # set seed to replicate example
nobs = 5000 # number of obs in model
x1 = uniform.rvs(size=nobs) # random uniform variable
x2 = uniform.rvs(size=nobs) # second explanatory
# Fit
toy_data = {}
toy_data[’y’] = y # response variable
toy_data[’x’] = X # predictors
toy_data[’k’] = toy_data[’x’].shape[1] # number of predictors including
intercept
toy_data[’nobs’] = nobs # sample size
# Stan code
stan_code = """
data {
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
61 4.3 Bayesian Errors-in-Measurements Modeling
int<lower=1> k;
int<lower=0> nobs;
matrix[nobs, k] x;
vector[nobs] y;
}
parameters {
matrix[k,1] beta;
real<lower=0> sigma;
}
transformed parameters{
matrix[nobs,1] mu;
vector[nobs] mu2;
mu = x * beta;
mu2 = to_vector(mu); # normal distribution
# does not take matrices as input
}
model {
for (i in 1:k){ # diffuse normal priors for predictors
beta[i] ˜ normal(0.0, 100);
}
sigma ˜ uniform(0, 100); # uniform prior for standard deviation
y ˜ normal(mu2, sigma); # likelihood function
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=toy_data, iter=5000, chains=3,
verbose=False, n_jobs=3)
# Output
nlines = 9 # number of lines to appear on screen
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0,0] 2.02 9.1e-4 0.04 1.94 1.99 2.02 2.04 2.09 1645.0 1.0
beta[1,0] 2.97 1.2e-3 0.05 2.88 2.94 2.98 3.01 3.07 1712.0 1.0
beta[2,0] -2.49 1.2e-3 0.05 -2.58 -2.52 -2.49 -2.46 -2.4 1683.0 1.0
sigma 0.99 2.3e-4 9.9e-3 0.97 0.99 0.99 1.0 1.01 1922.0 1.0
Notice that in the block of transformed parameters it is necessary to define two new
parameters: one to store the result of multiplying the data and parameter matrices, which
returns a matrix object of dimensions {nobs, 1}, and another to transform this result into a
vector object. This is necessary since the normal sampler accepts only scalars or vectors as
a mean. The fitting and plotting routines are the same as those presented in Section 4.1.3.
The standard methodology for regression modeling often assumes that the variables are
measured without errors or that the variance of the uncertainty is unknown. Most quantities
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
62 Normal Linear Models
in astronomy arise from measurements which are taken with some error. When the mea-
surement error is large relative to the quantity being measured, it can be important to
introduce an explicit model of errors in variables. A Bayesian approach to errors-in-
variables models is to treat the true quantities being measured as missing data (Richardson
and Gilks, 1993). This requires a model of how the measurements are derived from the
true values. Let us assume a normal linear model where the true values of the predictor,
xi , and the response, yi , variables are not known, but only the observed quantities xiobs and
i . If the measurement uncertainties σx and σy can be estimated, the observed values can
yobs
be modeled as function of the true values plus a measurement noise. A common approach
is to assume the measurement error as normal with known deviation: εx ∼ Normal(0, σx )
and εy ∼ Normal(0, σy ).
A synthetic normal model with measurement errors in both the x and y variables may be
created in R using the code given below.
This code has only one predictor, x1, with assigned coefficient 7 and intercept −4. First,
we try to make an inference about the parameters ignoring the presence of errors.
The JAGS code for a Bayesian normal model, ignoring errors, is given below.
Code 4.9 Normal linear model in R using JAGS and ignoring errors in measurements.
==================================================
Require (R2 jags)
K <- 2
model.data <- list(obsy = obsy,
obsx = obsx,
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
63 4.3 Bayesian Errors-in-Measurements Modeling
K = K,
N = nobs)
# likelihood function
for (i in 1:N){
obsy[i]˜dnorm(mu[i],tau)
mu[i] <- eta[i]
eta[i] <- beta[1]+beta[2]*obsx[i]
}
}"
# Initial value
inits <- function () {
list(
beta = rnorm(K, 0, 0.01))
}
# Parameters to display and save
params <- c("beta", "sigma")
Code 4.10 Normal linear model in R using JAGS and including errors in variables.
==================================================
Require (R2 jags)
model.data <- list(obsy = obsy,
obsx = obsx,
K = K,
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
64 Normal Linear Models
errx = errx,
erry = erry,
N = nobs)
NORM_err <-" model{
# Diffuse normal priors for predictors
for (i in 1:K) { beta[i] ˜ dnorm(0, 1e-3) }
# Likelihood
for (i in 1:N){
obsy[i]˜dnorm(y[i],pow(erry[i],-2))
y[i]˜dnorm(mu[i],tauy)
obsx[i]˜dnorm(x[i],pow(errx[i],-2))
mu[i]<-beta[1]+beta[2]*x[i]
}
}"
# Initial values
inits <- function () {
list(
beta = rnorm(K, 0, 0.01))
}
# Parameter to display and save
params <- c("beta", "sigma")
The inference with the errors-in-variables model is much closer to the assigned val-
ues; in particular, ignoring the errors largely overestimates the intrinsic scatter of the data.
Figure 4.5 enables a comparison of the two models.
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
65 4.3 Bayesian Errors-in-Measurements Modeling
40
y 0
−40
−10 −5 0 5 10
t
x
Figure 4.5 Visualization of different elements of a Bayesian normal linear model that includes errors in variables. The dots and
corresponding error bars represent the data. The dashed line and surrounding shaded bands show the mean, 50%
(darker), and 95% (lighter) credible intervals when the errors are ignored. The dotted line and surrounding shaded
bands show the mean, 50% (darker), and 95% (lighter) credible intervals obtained when the errors are taken into
account (note that these shaded bands are very narrow). The solid line (i.e., the line with the steepest slope) shows
the fiducial model used to generate the data. (A black and white version of this figure will appear in some formats. For
the color version, please refer to the plate section.)
Code 4.11 Normal linear model in Python using Stan and including errors in variables.
==================================================
import numpy as np
import statsmodels.api as sm
import pystan
# Data
np.random.seed(1056) # set seed to replicate example
nobs = 1000 # number of obs in model
sdobsx = 1.25
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
66 Normal Linear Models
beta0 = -4
beta1 = 7
sdy = 1.25
sdobsy = 2.5
# Fit
toy_data = {} # build data dictionary
toy_data[’N’] = nobs # sample size
toy_data[’obsx’] = obsx # explanatory variable
toy_data[’errx’] = errx # uncertainty in explanatory variable
toy_data[’obsy’] = obsy # response variable
toy_data[’erry’] = erry # uncertainty in response variable
toy_data[’xmean’] = np.repeat(0, nobs) # initial guess for true x position
# Stan code
stan_code = """
data {
int<lower=0> N;
vector[N] obsx;
vector[N] obsy;
vector[N] errx;
vector[N] erry;
vector[N] xmean;
}
transformed data{
vector[N] varx;
vector[N] vary;
for (i in 1:N){
varx[i] = fabs(errx[i]);
vary[i] = fabs(erry[i]);
}
}
parameters {
real beta0;
real beta1;
real<lower=0> sigma;
vector[N] x;
vector[N] y;
}
transformed parameters{
vector[N] mu;
for (i in 1:N){
mu[i] = beta0 + beta1 * x[i];
}
}
model{
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
67 4.3 Bayesian Errors-in-Measurements Modeling
x ˜ normal(xmean, 100);
obsx ˜ normal(x, varx);
y ˜ normal(mu, sigma);
obsy ˜ normal(y, vary);
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=toy_data, iter=5000, chains=3,
n_jobs=3, warmup=2500, verbose=False, thin=1)
# Output
nlines = 8 # number of lines in screen output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
Output on screen:
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta0 -3.73 3.2e-3 0.14 -4.02 -3.83 -3.73 -3.63 -3.45 2049.0 1.0
beta1 6.68 2.4e-3 0.06 6.56 6.64 6.68 6.72 6.8 632.0 1.01
sigma 1.69 0.01 0.19 1.33 1.56 1.69 1.82 2.07 173.0 1.03
Further Reading
Feigelson, E. D. and G. J. Babu (1992). “Linear regression in astronomy. II”. ApJ 397,
55–67. DOI: 10.1086/171766
Gelman, A., J. Carlin, and H. Stern (2013). Bayesian Data Analysis, Third Edition.
Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis.
Isobe, T. et al. (1990). “Linear regression in astronomy.” ApJ 364, 104–113. DOI:
10.1086/169390
Kelly, B. C. (2007). “Some aspects of measurement error in linear re-gression of
astronomical data.” ApJ 665, 1489–1506. DOI: 10.1086/519947. arXiv: 0705.2774
Zuur, A. F., J. M. Hilbe, and E. N. Ieno (2013). A Beginner’s Guide GLM and GLMM with
R: A Frequentist and Bayesian Perspective for Ecologists. Highland Statistics.
Downloaded from https:/www.cambridge.org/core. University of Illinois at Urbana-Champaign Library, on 26 Jun 2017 at 21:31:18, subject to the Cambridge Core
terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.005
5 GLMs Part I – Continuous and Binomial Models
The normal model, also referred to as linear regression, was discussed in Chapter 4. The
characteristic assumption of the normal model is that the data being modeled is normally
distributed. A normal or Gaussian distribution has two parameters, a location or mean
parameter μ and a variance or scale parameter σ 2 . In most frequency-based linear models
the variance is not estimated but is regarded as having a fixed or constant value. A full
maximum likelihood estimation of a normal model, however, does estimate both the mean
and variance parameters. We found that both parameters are estimated in a Bayesian nor-
mal model. It is important to remember that analysts employing a Bayesian model nearly
always estimate all the parameters associated with the probability function considered to
underlie a model.
Generalized linear models (GLMs) are fundamental to much contemporary statistical
modeling. The normal or linear model examined in the previous chapter is also a GLM
and rests at the foundation of this class of models. However, the true power of GLMs rests
in the extensions that can be drawn from it. The results are binary, binomial or propor-
tional, multinomial, count, gamma, and inverse Gaussian models, mixtures of these, and
panel models for the above to account for clustering, nesting, time series, survival, and
longitudinal effects. Other models can be developed from within this framework, so it is
worth taking time to outline what is involved in GLM modeling and to see how it can be
extended. Bayesian models can be derived from these distributions, and combinations of
distributions, in order to obtain a posterior distribution that most appropriately represents
the data being modeled. Despite the ubiquitous implementation of GLMs in general statis-
tical applications, there have been only a handful of astronomical studies applying GLM
techniques such as logistic regression (e.g. de Souza et al., 2015a, 2016; Lansbury et al.,
2014; Raichoor and Andreon, 2014), Poisson regression (e.g. Andreon and Hurn, 2010),
gamma regression (Elliott et al., 2015), and negative binomial (NB) regression (de Souza
et al., 2015b).
The generalized linear model approach was developed by statisticians John Nelder and
Robert Wedderburn in 1972 while working together at the Rothamsted Experimental Sta-
tion in the UK. Two years later they developed a software application for estimating GLMs
called Generalized Linear Interactive Models (GLIM), which was used by statisticians
worldwide until it was discontinued in 1994. By this time large commercial statistical
68
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
69 5.1 Brief Overview of Generalized Linear Models
packages were beginning to have GLM procedures as part of their official packages. The
earliest of these procedures were authored by Nelder for GenStat in the late 1970s and by
Trevor Hastie (then at AT&T) for S in the mid 1980s; this was later incorporated into S-
Plus and then later into R. The first author of this book authored a full GLM procedure in
Stata, including the negative binomial, in late 1992 and then with Berwin Turlach did the
same for XploRe economic statistical software in 1993. Gordon Johnston of SAS created
the SAS/STAT Genmod procedure for GLM models using SAS in 1994. Now nearly all
statistical packages have GLM capabilities.
The traditional or basic GLM algorithm provides a unified way to estimate the mean
parameter of models belonging to the single-parameter exponential family of distributions.
The random response variable, Yi , i = 1, 2, . . . , n, may be represented as
Yi ∼ Normal(μi , σ 2 ),
μi = β0 + β1 x1 + · · · + βp xp . (5.2)
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
70 GLMs Part I – Continuous and Binomial Models
and usually took much longer to come to a solution than a GLM. Actually, GLM method-
ology is a variety of maximum likelihood modeling but is a simplification of it applied
to models belonging to the exponential family of distributions. Contemporary maximum
likelihood estimation algorithms are much more efficient than those constructed before the
1990s and usually do not have the problems experienced in earlier times. However, GLM
methodology was, and still is, an efficient way to model certain data situations and is at the
foundation of a number of more advanced models.
In order for a model to be a GLM, it must meet the following criteria:
It should be noted that the use of a full maximum likelihood estimation algorithm allows
estimation of a scale or ancillary parameter if the otherwise GLM model has such a second
parameter. This is the case for models with a continuous variable and for the negative
binomial count model.
The unique property of the exponential family of probability distributions is that the link,
mean, and variance functions of the GLM distributions can easily be determined from the
distribution – if it is cast in exponential-family form. The formula for the log-likelihood of
the exponential family is given as
n
yi θi − b(θi )
L = + c(yi , φ) (5.3)
α(φ)
i=1
with y as the response variable to be modeled, θ , the link function, b(θ ) the cumulant, and
φ the scale parameter. The first derivative of the cumulant, with respect to θ , defines the
mean of the distribution. The second derivative with respect to θ defines the distributional
variance. The PDF normalization term c() ensures that the distribution sums to 1. Note that
the above equation provides for the estimation of both the mean and scale parameters. The
scale parameter is applicable only for continuous response models and is generally ignored
when one is modeling traditional GLMs. It is set to 1 for all GLM count models, including
such models as logistic and probit regression, Poisson, and negative binomial regression.
As we discuss later, though, the negative binomial has a second dispersion parameter, but
it is not a GLM scale parameter so is dealt with in a different manner (it is entered into the
GLM algorithm as a constant). When estimated as single-parameter models, such as is the
case with R’s glm function, the interior terms of the above log-likelihood equation reduce
to, for a single observation,
yθ − b(θ ) + c(y). (5.4)
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
71 5.1 Brief Overview of Generalized Linear Models
The value of θ is then ln(μ) or log μ, which is the Poisson link function. The first derivative
of μ with respect to ln(μ), is μ, as is the second derivative. The means and variances of the
Poisson distribution and model are therefore identical, both being μ. In terms of the linear
predictor xβ, the mean and variance of the Poisson model is exp(xβ). All the other GLM
models share the above properties, which makes it rather easy for an analyst to switch
between models.
It should also be mentioned that the canonical or natural link function is derived directly
from the distribution, as is the case for the Poisson distribution above. However, one may
change link functions within the same distribution. We may therefore have lognormal mod-
els or log-gamma models. In the note to Table 5.1 below, it is mentioned that the traditional
negative binomial model used in statistical software is in fact a log-linked negative bino-
mial. The link that is used is not derived directly from the negative binomial PDF (see the
book Hilbe, 2011).
The variance functions in Table 5.1 show two columns for continuous response models.
The second column displays the scale parameter for a two-parameter GLM. The scale
parameter is defined as 1 for discrete response models, e.g., binomial and count models.
Note that ν is sometimes regarded as a shape parameter with the scale defined as φ = 1/ν.
As we shall find later in this chapter, the Bernoulli distribution is a subset of the bino-
mial distribution. Both are used for models with binary responses, e.g. 0, 1. The m in the
binomial variance and link functions indicates the binomial denominator. For the Bernoulli
distribution m = 1. The geometric distribution is identical to the negative binomial with
a dispersion parameter equal to 1. The Poisson can also be regarded as a negative bino-
mial with a dispersion parameter equal to 0. These relationships can easily be observed by
inspecting the variance functions in Table 5.1.
When estimating a negative binomial model within a traditional GLM program, a value
for the dispersion parameter α is entered into the estimation algorithm as a constant. How-
ever, most contemporary GLM functions estimate α in a subroutine and then put it back
into the regular GLM algorithm. The results are identical to a full maximum likelihood
estimation (MLE) of the model.
Notice, as was mentioned before, the normal variance function is set at 1 and is not
estimated when using a traditional GLM algorithm. When GLMs are estimated using MLE
it is possible to estimate the variance (normal) and scale parameters for the gamma and
inverse Gaussian models. Technically this is an extension to the basic GLM algorithm.
Recall our earlier discussion of the probability distribution function, the parameter val-
ues of which are assumed to generate the observations we are modeling. The equation used
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
72 GLMs Part I – Continuous and Binomial Models
Table 5.1 The traditional GLM families with their variance and link functions.
Discrete distributions Variance function Link function
Bernoulli μ(1 − μ) log (μ/(1 − μ))
Binomial μ(1 − μ/m) log (μ/(m − μ))
Poisson μ log (μ)
Geometric μ(1 + μ) log (μ/(1 + μ))∗
Negative binomial μ(1 + αμ) log (αμ/(1 + αμ))∗
Continuous distributions Variance function Link function
Gaussian or normal 1 σ2 μ
Gamma μ2 μ2 /ν 1/μ
Inverse Gaussian μ3 μ3 /ν 1/μ2
∗ The geometric model is the negative binomial with α = 1. It is not usually estimated as a model in
its own right but is considered as a type of negative binomial. The geometric and negative binomial
models are nearly always parameterized with a log link, like the Poisson. In using the same link as
for the Poisson, the negative binomial can be used to adjust for overdispersion or extra correlation in
otherwise Poisson data. The link functions displayed in the table are the canonical links. Lastly, the
negative binomial has been parameterized here with a direct relation between the dispersion
parameter and the mean. There is a good reason for this, which we discuss in the section on negative
binomial models.
for estimating the distributional parameters underlying the model data is called the likeli-
hood function. It is an equation which tells us how likely our model data is, given specific
parameter values for the probability characterizing our model. The log of the likelihood
function is used in the estimation process for frequency-based models as well as for most
Bayesian models.
In order to better visualize how the IRLS algorithm works for a GLM, we shall con-
struct a simple model using R. We shall provide values for an x continuous predictor and
a binary variable y, which is the variable being modeled, the response variable. Since
y is a binary variable, we model it using a logistic regression. To replicate the exam-
ple below, place the code in the R editor, select it, and run. The output below displays
the results of the simple GLM that we created using R’s glm function. The top code is
an example of IRLS. Note that it is a linear model re-weighted by the variance at each
iteration.
# Fit
mu <- (y + 0.5)/2 # initialize mu
eta <- log(mu/(1-mu)) # initialize eta with the Bernoulli link
for (i in 1:8) {
w <- mu*(1-mu) # variance function
z <- eta + (y - mu)/(mu*(1-mu)) # working response
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
73 5.1 Brief Overview of Generalized Linear Models
# Output
summary(mod)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.28606 1.65744 0.776 0.451
x -0.07915 0.09701 -0.816 0.428
The glm logistic model coefficients are identical to the coefficients of our synthetic
model. The standard errors differ a bit but there are very few observations in the model,
so we do not expect to have near identical standard errors and hence p-values. Remember
that the z-value is the ratio of the coefficient and the standard error. It is easy to develop
an R model that will produce equivalent standard errors for both the regular glm function
and a synthetic model. But we need not be concerned with that here. See the book Hilbe
and Robinson (2013) for a full explanation of how to program GLM as well as of Bayesian
modeling.
We have provided a brief overview of traditional GLM methodology, understanding
that it is squarely within the frequentist tradition of statistics. However, we will be dis-
cussing the Bayesian logistic model as well as the Bayesian normal model and others, as
Bayesian alternatives to standard GLM models. We have earlier discussed the advantages
of Bayesian modeling, but it may be helpful to observe the contrast.
The equivalent Python code for the example above concerning weighted regression may
be written as:
# Data
x = np.array([13, 10, 15, 9, 18, 22, 29, 13, 17, 11, 27, 21, 16, 14, 18, 8])
y = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0])
X = np.transpose(x)
X = sm.add_constant(X) # add intercept
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
74 GLMs Part I – Continuous and Binomial Models
# Fit
mu = (y + 0.5) / 2 # initialize mu
eta = np.log(mu/(1 - mu)) # initialize eta with the Bernoulli
link
for i in range(8):
w = mu * (1 - mu) # variance function
z = eta + (y - mu)/(mu * (1 - mu)) # working response
mod = sm.WLS(z, X, weights=w).fit() # weighted regression
eta = mod.predict() # linear predictor
mu = 1/(1 + np.exp(-eta)) # fitted value using inverse link
function
print (mod.summary())
==============================================================================
coeff std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 1.2861 1.657 0.776 0.451 -2.269 4.841
x1 -0.0792 0.097 -0.816 0.428 -0.287 0.129
==============================================================================
# Write data as dictionary
mydata = {}
mydata[’x’] = x
mydata[’y’] = y
We have discussed the normal model in Chapter 4. It is the paradigm continuous response
model, and it stands at the foundation of all other statistical models. With respect to gener-
alized linear models, the normal model is an identity-linked model. The linear predictor, eta
(η) or xβ, and the fitted value, μ, are identical. No transformation of the linear predictor
needs to take place in order to determine the fit or predicted value. The linear regres-
sion estimation process is therefore not iterative, whereas estimation using other GLMs
is iterative. Bayesian modeling, however, still recognizes log-likelihood distributions as
well as link functions that can be used within the Bayesian modeling code to convert, for
instance, a Gaussian or normal model to a lognormal model or a Bernoulli logistic model
to a Bernoulli probit model. Several important terms originating in GLM and maximum
likelihood estimation are also used in Bayesian modeling, although their interpretations
can differ.
We shall discuss four continuous response models in this section, two of which we
will later demonstrate with astrophysical applications. Of course, we shall then focus
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
75 5.2 Bayesian Continuous Response Models
completely on Bayesian methods for the estimation of model parameters. The models
addressed in this section are the lognormal, gamma, inverse Gaussian, and beta mod-
els. The gamma and inverse Gaussian distributions are traditional GLM family members.
The two-parameter lognormal and beta are not, but some authors have classified them
as extended GLMs. A lognormal model may be created using GLM software by using
a log link with the Gaussian family instead of the default identity link. However, only
the mean parameter is estimated, not the scale. Frequentist two-parameter lognormal
software is difficult to locate, but we shall find it easy to construct using Bayesian tech-
niques. In this section we create synthetic lognormal, log-gamma, log-inverse-Gaussian,
and beta data, modeling each using a Bayesian model. Using the log parameterization
for the gamma and inverse Gaussian distributions has better applications for modeling
astronomical data than does using the canonical forms. We discuss this in the relevant
subsections.
It should be mentioned that the lognormal, log-gamma, and log-inverse-Gaussian mod-
els are appropriate for modeling positive real numbers. Unlike in the normal model, in
which it is assumed that the variance is constant across all observations, the lognormal vari-
ance is proportional to the square of the mean. There is no restriction for the log-gamma
and log-inverse-Gaussian models, though. In comparison with the normal and lognormal
models, the shape or scale parameters expand the range of modeling space for positive real
values. The beta distribution is an important model, which is used for proportional data
with values between 0 and 1, i.e., {0 < x < 1}.
The lognormal model is commonly used on positive continuous data for which the values
of the response variable are real. Figure 5.1 shows examples of the lognormal PDF for
different parameter values and Figure 5.2 shows an example of lognormal-distributed data.
Note the skewness as the data approaches zero on the y-axis. It is preferable to use a
lognormal model rather than logging the response and modeling the result as a normal
or linear model. A lognormal model can be created from the Gaussian or normal log-
likelihood by logging the single y term in the normal distribution. Recalling the normal
PDF and log-likelihood from Chapter 4, the lognormal PDF, which is sometimes referred
to as the Galton distribution, can be expressed as
1
e−(ln y−μ)
2 /2σ 2
f (y; μ, σ 2 ) = √ , (5.7)
yσ 2π
while the associated log-likelihood assumes the form
n
(ln yi − μi )2 1
L (μ, σ 2 ; y) = − − ln(2π σ 2 2
yi ) . (5.8)
2σ 2 2
i=−1
The lognormal model is based on a lognormal PDF, which log-transforms the response
term y within the distribution itself. The actual data being modeled are left in their origi-
nal form. The key is that the data being modeled are positive real values. Values equal to
zero or negative values are not appropriate for the lognormal model. Because the normal
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
76 GLMs Part I – Continuous and Binomial Models
2.5 0.40
μ = 1.0, σ = 0.2 μ = 0.5, σ = 1.0
μ = 1.0, σ = 0.5 0.35 μ = 1.0, σ = 1.0
2.0 μ = 1.0, σ = 1.0 μ = 1.5, σ = 1.0
Lognormal PDF μ = 1.0, σ = 2.0 0.30 μ = 2.2, σ = 1.0
Lognormal PDF
1.5 0.25
0.20
1.0
0.15
0.10
0.5
0.05
0.0 0.00
0 1 2 3 4 5 0 1 2 3 4 5
t
x x
Figure 5.1 Left: Set of lognormal probability distribution functions with different values for the scatter parameter σ , centered at
μ = 1.0. Right: Set of lognormal probability distribution functions with different values for μ, and scatter
parameter σ = 1.0.
Lognormal PDF
0
x1
x2
x3 x
distribution assumes the possibility of negative and zero values in addition to positive val-
ues, using a normal model on data that can only be positive violates the distributional
assumptions underlying the normal model. To reiterate, it is statistically preferable to
model positive continuous data using a lognormal model rather than a log-y-transformed
normal model. A foremost advantage of using the lognormal is that the variance can vary
with the mean. However, the constraint is that the ratio of the mean and the standard devi-
ation in a lognormal model is constant. For a normal model, the variance is assumed to
be constant throughout the entire range of model observations. Unfortunately the lognor-
mal model has not always been available in standard commercial statistical software, so
many researchers instead have log-transformed y and estimated parameters using a normal
model. This is no longer a problem, though.
Novice researchers get into trouble when they model count data using a lognormal
model, or, worse, when they log the response and model the data as normal. Count data
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
77 5.2 Bayesian Continuous Response Models
should always be modeled using a so-termed count model, e.g., the Poisson, negative bino-
mial, generalized Poisson, and similar models. Count models are especially designed to
account for the distributional assumptions underlying count data. Count data is discrete,
and should always be modeled using an appropriate count model (O’Hara and Kotze,
2010). With this caveat in mind, let us discuss the nature of lognormal models, provid-
ing examples using synthetic data. A synthetic lognormal model may be created in R using
the following code.
Code 5.3 Synthetic lognormal data and model generated in R.
==================================================
require(gamlss)
# Data
set.seed(1056) # set seed to replicate example
nobs = 5000 # number of observations in model
x1 <- runif(nobs) # random uniform variable
xb <- 2 + 3*x1 # linear predictor, xb
y <- rlnorm(nobs, xb, sdlog=1) # create y as random lognormal variate
summary(mylnm <- gamlss(y ~ x1, family=LOGNO))
==================================================
Estimate Std.Error t value Pr(>|t|)
(Intercept) 1.99350 0.02816 70.78 <2e-16∗∗∗
x1 3.00663 0.04868 61.76 <2e-16∗∗∗
This code has only one predictor, x1, with assigned value 3 and intercept 2. This provides
us with the data set which we will use for developing a Bayesian lognormal model. Note
that the rlnorm pseudo-random number generator creates lognormal values adjusted by the
linear predictor that we specified. This data may be modeled using R’s glm function with
a normal family and log link or by using the maximum likelihood lognormal model found
in the gamlss package. If an analyst wishes to model the data using a frequentist-based
model, the gamlss model is preferred since it estimates a variance parameter in addition to
coefficients. The glm function does not. Of course, our purpose is to model the data using a
Bayesian model. Again, the y variable is lognormally distributed, as adjusted by the terms
of the linear predictor and the log link function.
Keep in mind that, when constructing a synthetic model such as the above, the values
placed in the predictors are coefficients, or slopes. That is what is being estimated when
we are using a maximum-likelihood-based model. However, when we use the synthetic
data for a Bayesian model, we are estimating posterior parameter means, not specifically
coefficients. If diffuse or minimal-information priors (unfortunately usually referred to as
non-informative priors, as mentioned earlier) are used with the Bayesian model, the results
of the Bayesian model will usually be very close to the maximum likelihood parame-
ter estimates. We have discussed why this is the case. It is a good test for determining
whether our Bayesian code is correct. For real data applications, though, it is preferred to
use meaningful priors if they are available. We shall continually discuss this subject as we
progress through the book.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
78 GLMs Part I – Continuous and Binomial Models
Again, it is important to emphasize that we are constructing Bayesian code with largely
minimal-information priors to model synthetic data with preset parameter values, in order
to demonstrate the appropriateness of the code. This procedure also allows us to describe
the model generically, which makes it easier for the reader to use for his or her own appli-
cations. We will also demonstrate that the R program creating the lognormal function is
correct and that the response y is in fact lognormally distributed. The output demonstrates
that the synthetic data is structured as we specified.
For some distributions there exists R software that provides the analyst with the
opportunity to employ a Bayesian model on the data. Specifically, the R package
MCMCpack can be used to to develop several different Bayesian models. When this is a
possibility in this chapter, we shall create synthetic data, modeling it using the appro-
priate R function as well as our JAGS code. For some of the hierarchical models
discussed in Chapter 8, the R MCMCglmm package can be used to calculate posteri-
ors as well as diagnostics. We will use the R package when possible for R models.
Unfortunately, the MCMCpack package does not support Bayesian lognormal models or any
of the continuous response models discussed in this section. We therefore shall proceed
directly to the creation of a Bayesian lognormal model using JAGS, from within the R
environment.
Lognormal Model in R using JAGS
We now shall run a JAGS lognormal algorithm on the data in Code 5.3, expecting to have
values of the predictor parameters close to those we specified, i.e., an intercept parameter
close to 2 and a coefficient parameter for x1 close to 3. Unless we set a seed value, each
time we run the code below the results will differ a bit. Estimation of parameters is obtained
by sampling. Note the closeness of the JAGS and gamlss results.
# Likelihood
for (i in 1:N){
Y[i] ˜ dlnorm(mu[i],tau)
mu[i] <- eta[i]
eta[i] <- inprod(beta[], X[i,])
}
}"
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
79 5.2 Bayesian Continuous Response Models
It should be noted that the time taken for a sampling run can be given with the
system.time function. Type ?system.time for help with this capability.
To display the histograms (Figure 5.3) and trace plots (Figure 5.4) of the model
parameters one can use:
source("CH-Figures.R")
out <- LN$BUGSoutput
MyBUGSHist(out,c(uNames("beta",K),"sigma"))
and
MyBUGSChains(out,c(uNames("beta",K),"sigma"))
1.90 1.95 2.00 2.05 2.8 2.9 3.0 3.1 0.98 1.00 1.02 1.04
t
Posterior distribution
Figure 5.3 Histogram of the MCMC iterations for each parameter. The thick lines at the bases of the histograms represent the
95% credible intervals.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
80 GLMs Part I – Continuous and Binomial Models
00
00
00
00
0
50
10
15
20
25
0
beta[1] beta[2] sigma
3.1
Sampled values
3.0
2.9
2.8
0
0
00
00
00
00
0
00
00
00
00
50
50
10
15
20
25
10
15
20
25
t
MCMC iterations
Figure 5.4 MCMC chains for the two parameters, β1 and β2 , and the standard deviation σ for the lognormal model.
104
103
2
y 10
101
0
–1 0 1 2
t
x
Figure 5.5 Visualization of the lognormal model. The dashed line and darker (lighter) shaded areas show the fitted model and
50% (95%) credible intervals, respectively. The dots represent the synthetic data points.
Note that the R2jags traceplot function can also be used to produce trace plots. Type
traceplot(LN) for this example.
If the data involves further parameters and observations then you will probably need to
increase the burn-in and number of iterations. If there is also substantial autocorrelation in
the data, we suggest thinning it, i.e., arranging for the sampling algorithm in JAGS to accept
only every other sample (n.thin=2), or every third sample (n.thin=3), etc. The greater
the thinning value, the longer the algorithm will take to converge to a solution, but the
solution is likely to be more appropriate for the data being modeled. Note that the variance
parameter of the lognormal model increases in proportion to the square of the mean, unlike
for the normal model where the variance is constant throughout the range of values in the
model. Figure 5.5 shows the fitted model in comparison with the synthetic data, together
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
81 5.2 Bayesian Continuous Response Models
with 50% and 95% prediction intervals. The interpretation of the 95% prediction intervals
is straightforward: there is a 95% probability that a new observation will fall within this
region.
# Data
np.random.seed(1056) # set seed to replicate example
nobs = 5000 # number of obs in model
x1 = uniform.rvs(size=nobs) # random uniform variable
stan_lognormal = """
data{
int<lower=0> N;
vector[N] x1;
vector[N] y;
}
parameters{
real beta0;
real beta1;
real<lower=0> sigma;
}
transformed parameters{
vector[N] mu;
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
82 GLMs Part I – Continuous and Binomial Models
# Fit
fit = pystan.stan(model_code=stan_lognormal, data=mydata, iter=5000, chains=3,
verbose=False, n_jobs=3)
# Output
nlines = 8 # number of lines in output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta0 1.99 7.7e-4 0.03 1.93 1.97 1.99 2.01 2.04 1390.0 1.0
beta1 3.0 1.3e-3 0.05 2.9 2.96 3.0 3.03 3.09 1408.0 1.0
sigma 0.99 2.4e-4 9.9e-3 0.97 0.98 0.99 1.0 1.01 1678.0 1.0
The results of the Python model are very close to those for the JAGS model, as we
expect. Both models properly estimate the means of the posterior distributions for each
parameter in the synthetic model we specified at the outset of the section. Of course, if
we provide an informative prior for any parameter then the mean value for the calculated
posterior distribution will change. By providing diffuse priors the posteriors are based on
the log-likelihood function, which means that they are based on the model data and not on
any information external to it. As a consequence the Bayesian results are similar to those
obtained using maximum likelihood estimation.
Finally, it is important to highlight that there is a difference between the lognormal
parameterization in JAGS and Stan such that their dispersion parameters are related by
1
σStan = √ . (5.9)
τJAGS
The results are consistent between Codes 5.4 and 5.5 because for a unitary dispersion
τJAGS = σStan .
1/φ
1 y
f (y; μ, φ) = e−y/μφ . (5.10)
y (1/φ) μφ
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
83 5.2 Bayesian Continuous Response Models
The canonical gamma log-likelihood, which is also sometimes referred to as the inverse
gamma model, can be expressed as
n
1 yi yi 1
L (μ, φ; y) = ln − − ln(yi ) − ln . (5.11)
φ φμi φμi φ
i=1
Gamma models are used for positive-only continuous data, like the lognormal model. The
link function of the canonical gamma model, however, dictates that the model is to be used
when there is an indirect relationship between the fit and the linear predictor. For a simple
economic example, suppose that the data to be modeled is set up to have a direct rela-
tionship with the mean miles-per-gallon used by automobiles. Using a canonical gamma
model provides an interpretation of the predictors based on the mean gallons-per-mile –
the inverse of the manner in which the data was established. The same inverse relationship
obtains with respect to any other application as well.
Owing to the distributional interpretation of the canonical gamma model, most ana-
lysts employ a log-gamma model for evaluating positive real data based on a gamma
distribution. The value in modeling positive real data using a log-gamma model rather
than a normal model is that the shape parameter varies with the mean, allowing substan-
tially greater flexibility to the range of modeling. For the lognormal model the variance
increases with the square of the mean, so it is limited to that relationship. However, with
the log-gamma model, the analyst has two flexible parameters with which to model positive
real data. Moreover the mean and fitted value are directly related, unlike in the canonical
gamma. Figure 5.6 shows examples of the gamma PDF for different parameter values and
Figure 5.7 shows an example of gamma-distributed data.
3.0 2.0
μ = 1.0, φ = 0.4 μ = 0.5, φ = 1.0
μ = 1.0, φ = 0.8 μ = 1.5, φ = 1.0
2.5 μ = 1.0, φ = 1.2 μ = 2.0, φ = 1.0
μ = 1.0, φ = 1.6 μ = 5.0, φ = 1.0
1.5
2.0
Gamma PDF
Gamma PDF
1.5 1.0
1.0
0.5
0.5
0.0 0.0
0 1 2 3 4 5 0 1 2 3 4 5
t
x x
Figure 5.6 Left: Set of gamma probability distribution functions with different values for the parameter φ; μ = 1. Right: Set of
gamma probability distribution functions with different values for μ; φ = 1.0.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
84 GLMs Part I – Continuous and Binomial Models
Gamma PDF
y
0
x1
x2
x3
t
x
The form that, in JAGS, can be used as the log-likelihood for a gamma Bayesian model
is as follows:
Notice that it is the same as for the inverse or canonical linked gamma, except that mu is
defined as exp(xb).
Code to create synthetic log-gamma data with specific values for the predictor param-
eters is based on specifying terms in a linear predictor and submitting it to the rgamma
function. The logic of the rgamma function is
rgamma(n, shape, rate, scale = 1/rate)
We provide a rate value of exp(xb) or a scale of 1/exp(xb), where xb is the linear predictor.
The synthetic data below specifies values for the intercept coefficient parameters, and shape
parameter r.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
85 5.2 Bayesian Continuous Response Models
sink("LGAMMA.txt")
cat("
model{
# Diffuse priors for model betas
beta ˜ dmnorm(b0[], B0[,])
# Likelihood
for (i in 1:N){
Y[i] ˜ dgamma(r, lambda[i])
lambda[i] <- r / mu[i]
log(mu[i]) <- eta[i]
eta[i] <- inprod(beta[], X[i,])
}
}
",fill = TRUE)
sink()
# JAGS MCMC
LGAM <- jags(data = model.data,
inits = inits,
parameters = params,
model.file = "LGAMMA.txt",
n.thin = 1,
n.chains = 3,
n.burnin = 3000,
n.iter = 5000)
print(LGAM, intervals=c(0.025, 0.975), digits=3)
==================================================
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
86 GLMs Part I – Continuous and Binomial Models
Even with only 2500 samples used to determine the parameter values, the results are
close to what we specified. The above generic example of a log-gamma model in JAGS can
be expanded to handle a wide variety of astronomical data situations as well as applications
from almost any other discipline.
# Data
np.random.seed(33559) # set seed to replicate example
nobs = 3000 # number of obs in model
x1 = uniform.rvs(size=nobs) # random uniform variable
x2 = uniform.rvs(size=nobs) # second explanatory
# Fit
mydata = {} # build data dictionary
mydata[’N’] = nobs # sample size
mydata[’x1’] = x1 # explanatory variable
mydata[’x2’] = x2
mydata[’y’] = y # response variable
# STAN code
stan_gamma = """
data{
int<lower=0> N;
vector[N] x1;
vector[N] x2;
vector[N] y;
}
parameters{
real beta0;
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
87 5.2 Bayesian Continuous Response Models
real beta1;
real beta2;
real<lower=0> r;
}
transformed parameters{
vector[N] eta;
vector[N] mu;
vector[N] lambda;
for (i in 1:N){
eta[i] = beta0 + beta1 * x1[i] + beta2 * x2[i];
mu[i] = exp(eta[i]);
lambda[i] = r/mu[i];
}
}
model{
r ˜ gamma(0.01, 0.01);
for (i in 1:N) y[i] ˜ gamma(r, lambda[i]);
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_gamma, data=mydata, iter=7000, chains=3,
warmup=6000, n_jobs=3)
# Output
nlines = 9 # number of lines in screen output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta0 1.05 1.4e-3 0.04 0.97 1.02 1.05 1.07 1.12 647.0 1.0
beta1 0.6 1.8e-3 0.05 0.51 0.57 0.6 0.63 0.69 663.0 1.0
beta2 -1.28 1.8e-3 0.05 -1.38 -1.31 -1.28 -1.25 -1.19 692.0 1.0
r 1.83 1.6e-3 0.04 1.75 1.8 1.83 1.86 1.91 733.0 1.0
λ −λ(y−μ)2 /(2μ2 y)
f (y; μ, λ) = e . (5.12)
2π y3
The log-likelihood of the canonical inverse Gaussian distribution, with inverse quadratic
link (1/μ2 ), is
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
88 GLMs Part I – Continuous and Binomial Models
3.0 3.0
μ = 0.5, σ = 1.0 μ = 1.0, σ = 0.5
μ = 1.0, σ = 1.0 μ = 1.0, σ = 1.0
2.5 μ = 1.5, σ = 1.0 2.5 μ = 1.0, σ = 1.5
Invgaussian PDF μ = 2.0, σ = 1.0 μ = 1.0, σ = 2.5
Invgaussian PDF
2.0 2.0
1.5 1.5
1.0 1.0
0.5 0.5
0.0 0.0
0.0 0.5 1.0 1.5 2.0 2.5 0.0 0.5 1.0 1.5 2.0 2.5
t
x x
Figure 5.8 Left: Set of inverse Gaussian probability distribution functions with different values for the parameter μ; σ = 1.
Right: Set of inverse Gaussian probability distribution functions with different values for σ ; μ = 1.0.
n
1 λ(yi − μi )2 1 1
L (μ, σ 2 ; y) = − ln(π y3i ) + ln(λ) . (5.13)
2 μ2 yi 2 2
i=1
The inverse Gaussian evaluator is similar to the gamma and others that we have
discussed. The difference is in how the log-likelihood is defined. In pseudocode the
log-likelihood for a canonical inverse Gaussian model is given as follows:
mu <- 1/sqrt(xb)
LL <- -.5*(y-mu)^2/(sig2*(mu^2)*y)-.5*log(pi*y^3*sig2)
Figure 5.8 shows different shapes of inverse Gaussian PDFs for various different values of
μ and σ .
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
89 5.2 Bayesian Continuous Response Models
xb <- 1 + 0.5*x1
mu <- exp(xb)
y <- rinvgauss(nobs,mu,20) # create y as adjusted random inverse-gaussian
variate
==================================================
The synthetic inverse Gaussian data is now in the y variable, as adjusted by x1, which is
a random uniform variate. The JAGS code below, without truly informative priors, should
result in a Bayesian model with roughly the same values for the parameters. Notice should
be taken of the prior put on the variance. We use a half-Cauchy(25) distribution, which
appears to be optimal for a diffuse prior on the variance function. JAGS does not have a
built-in Cauchy or half-Cauchy function, but such a function can be obtained by dividing
the normal distributions, N(0, 625) by N(0, 1). The absolute value of this ratio follows the
half-Cauchy(25) distribution. Given that JAGS employs the precision rather than the vari-
ance in its normal distribution function, the numerator will appear as dnorm(0,0.0016),
where the precision is 1/625. In the code below we call the half-Cauchy distribution
lambda, which is the diffuse prior we put on the inverse Gaussian variance (see Zuur,
Hilbe, and Ieno, 2013, for an extended explanation). Other diffuse priors put on the vari-
ance are dgamma(0.001, 0.001), which represents a gamma prior distribution with mean 1
and variance 1000 and dunif(0.001,10). Checking the mixing and comparative summary
statistics such as DIC and pD when using different priors is wise if you suspect convergence
problems. Even if you do not, selecting an alternative prior may result in a better-fitted
model. These sorts of sensitivity tests should be performed as a matter of routine.
cat("
model{
# Diffuse normal priors betas
for (i in 1:K) { beta[i] ˜ dnorm(0, 0.0001)}
# Likelihood
C <- 10000
for (i in 1:N){
Zeros[i] ˜ dpois(Zeros.mean[i])
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
90 GLMs Part I – Continuous and Binomial Models
These values are nearly identical to those we specified in generating the model data. Note
that the variance parameter was specified in the rinvgauss function to be 20. The JAGS
result is 19.85 ± 0.885. The log-inverse-Gaussian model provides even more flexibility in
the modeling space than the log-gamma model. It is most appropriate for modeling data
that is heavily peaked at lower values, with a long right skew of data.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
91 5.2 Bayesian Continuous Response Models
# Data
np.random.seed(1056) # set seed to replicate example
nobs = 1000 # number of obs in model
x1 = uniform.rvs(size=nobs) # random uniform variable
beta0 = 1.0
beta1 = 0.5
l1 = 20
# Fit
stan_data = {} # build data dictionary
stan_data[’Y’] = y # response variable
stan_data[’x1’] = x1 # explanatory variable
stan_data[’N’] = nobs # sample size
# Stan code
stan_code = """
data{
int<lower=0> N;
vector[N] Y;
vector[N] x1;
}
parameters{
real beta0;
real beta1;
real<lower=0> lambda;
}
transformed parameters{
vector[N] exb;
vector[N] xb;
for (i in 1:N){
l1 = 0.5 * (log(lambda) - log(2 * pi() * pow(Y[i], 3)));
l2 = -lambda*pow(Y[i] - exb[i], 2)/(2 * pow(exb[i], 2) * Y[i]);
loglike[i] = l1 + l2;
}
target += loglike;
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
92 GLMs Part I – Continuous and Binomial Models
}
"""
# Output
nlines = 8 # number of lines in screen output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta0 0.99 6.7e-4 0.03 0.94 0.97 0.99 1.01 1.04 1470.0 1.0
beta1 0.52 1.2e-3 0.05 0.43 0.49 0.52 0.55 0.61 1504.0 1.0
lambda 20.26 0.02 0.89 18.54 19.65 20.24 20.86 22.05 1623.0 1.0
Note the closeness of the parameter means, standard deviation, and credible intervals to
the fiducial values and JAGS output.
The beta model is used to model a response variable that is formatted as a proportion x
between the values of 0 and 1. That is, the range of values for a beta model is 0 < x < 1.
The beta PDF can be parameterized either with two parameters, a and b, or in terms of
the mean, μ. The standard beta PDF is given as
(a + b) a−1
f (y; a, b) = y (1 − y)b−1 , 0 < y < 1. (5.14)
(a) (b)
The mean and variance of the beta distribution are defined as
a ab
E(y) = = μ, V(y) = , (5.15)
a+b (a + b)2 (a + b + 1)
and the log-likelihood function is given as
n
L (a, b; y) = {log (a + b) − log (a) − log (b) + (a − 1) log yi + (b − 1) log(1 − yi )} .
i=1
(5.16)
We show in Figure 5.9 examples of the beta PDF for different values of the a and b
parameters, and in Figure 5.10 an illustration of beta-distributed data.
We can create a synthetic beta model in R using the rbeta pseudo-random number
generator.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
93 5.2 Bayesian Continuous Response Models
12 4.0
a = 0.5, b = 0.5 a = 5.0, b = 2.0
a = 2.0, b = 0.5 a = 5.0, b = 3.0
3.5
10 a = 3.0, b = 0.5 a = 5.0, b = 4.0
a = 4.0, b = 0.5 a = 5.0, b = 5.0
3.0
8
2.5
Beta PDF
Beta PDF
6 2.0
1.5
4
1.0
2
0.5
0 0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
t
x x
Figure 5.9 Left: Set of beta probability distribution functions with different values for the parameter a; b = 0.5. Right: Set of
beta probability distribution functions with different values for b; a = 5.0.
Beta PDF
1 y
0
x1
x2
x3 x
As an example of synthetic beta data, we executed the code in the above table. To deter-
mine the range of values in y, we used a summary function and checked a histogram
(Figure 5.11) of the beta-distributed variable:
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2620 0.6494 0.7526 0.7346 0.8299 0.9856
> hist(y)
Note that with the median slightly greater than the mean we expect that the distribution
will show a slight left skew.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
94 GLMs Part I – Continuous and Binomial Models
120
10080
Frequency
60 40
20
0
t
y
Figure 5.11 Histogram of the beta-distributed data y generated in the code snippet 5.12.
The beta distribution has a wide variety of shapes, as seen in Figure 5.9, which makes
it an ideal prior distribution with binomial models. We shall discuss the beta prior in some
detail when we address priors later in the book.
The model is estimated using the default logit link for the mean and the identity link for
the scale parameter, φ. At times you will find that a beta model has a log link for the scale
parameter. The coefficient parameters of the model are identical whether an identity or log
link is used for the scale; only the φ statistic changes.
# Data
# define model parameters
set.seed(1056) # set seed to replicate example
nobs<-1000 # number of obs in model
x1 <- runif(nobs) # random normal variable
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
95 5.2 Bayesian Continuous Response Models
p <- exb/(1+exb)
theta <- 15
y <- rbeta(nobs,theta*(1-p),theta*p)
# Likelihood function
for(i in 1:N){
Y[i] ˜ dbeta(shape1[i],shape2[i])
shape1[i]<-theta*pi[i]
shape2[i]<-theta*(1-pi[i])
logit(pi[i]) <- eta[i]
eta[i]<-inprod(beta[],X[i,])
}
}"
# A function to generate initial values for mcmc
inits <- function () { list(beta = rnorm(ncol(X), 0, 0.1)) }
The output provides the parameter values that we expect. A maximum likelihood beta
model can be estimated on this data using the betareg function found in the betareg
package on CRAN.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
96 GLMs Part I – Continuous and Binomial Models
1.00
0.75
y 0.50
0.25
0.00
–0.5 0.0 0.5
t
x
Figure 5.12 Visualization of synthetic data generated from a beta distribution. The dashed line and shaded areas show the fitted
curve and the 50% and 95% prediction intervals. The dots correspond to the synthetic data.
require(betareg)
betareg(y˜x1)
Call:
betareg(formula = y ˜ x1)
The parameter values calculated using JAGS very closely approximate the coefficients,
intercept, and scale parameter values of the betareg function, thus confirming the JAGS
model. Of course, we deliberately did not add informative priors to the Bayesian model,
as this would alter the parameters. How much they would differ depends on the number of
observations in the data as well as the values of the hyperparameters defining the priors.
Remember that the real power of Bayesian modeling rests in its ability to have informative
priors that can be used to adjust the model on the basis of information external to the
original model data.
Figure 5.12 provides a graphic of fitted beta values and prediction intervals for the
synthetic beta model from Code 5.13.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
97 5.2 Bayesian Continuous Response Models
# Data
np.random.seed(1056) # set seed to replicate example
nobs = 2000 # number of obs in model
x1 = uniform.rvs(size=nobs) # random uniform variable
beta0 = 0.3
beta1 = 1.5
xb = beta0 + beta1 * x1
exb = np.exp(-xb)
p = exb / (1 + exb)
theta = 15
# Fit
mydata = {} # build data dictionary
mydata[’N’] = nobs # sample size
mydata[’x1’] = x1 # predictors
mydata[’y’] = y # response variable
stan_code = """
data{
int<lower=0> N;
vector[N] x1;
vector<lower=0, upper=1>[N] y;
}
parameters{
real beta0;
real beta1;
real<lower=0> theta;
}
model{
vector[N] eta;
vector[N] p;
vector[N] shape1;
vector[N] shape2;
for (i in 1:N){
eta[i] = beta0 + beta1 * x1[i];
p[i] = inv_logit(eta[i]);
shape1[i] = theta * p[i];
shape2[i] = theta * (1 - p[i]);
}
y ˜ beta(shape1, shape2);
}
"""
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
98 GLMs Part I – Continuous and Binomial Models
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=mydata, iter=5000, chains=3,
warmup=2500, n_jobs=3)
# Output
print fit
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta0 0.34 6.3e-4 0.02 0.29 0.32 0.34 0.35 0.38 1426.0 1.0
beta1 1.43 1.2e-3 0.04 1.34 1.4 1.43 1.46 1.52 1417.0 1.0
theta 15.1 0.01 0.48 14.2 14.78 15.09 15.42 16.04 1578.0 1.0
lp_ _ 1709.3 0.03 1.22 1706.1 1708.7 1709.6 1710.2 1710.7 1326.0 1.0
Notice that, in this example, we skipped the transformed parameters block and defined
all intermediate steps in the model block. As a consequence, the code does not track the
evolution of the parameters eta and pi or the shape parameters. The reader should also be
aware that when parameters are defined in the transformed parameter block the declared
constraints are checked every time the log-posterior is calculated; this does not happen
when variables are declared in the model block. Thus, this approach should be avoided for
non-trivial parameter constraints.
Binomial models have a number of very useful properties. Perhaps the foremost character-
istic of a binomial model is that the fitted or predicted value is a probability. A second
characteristic, related to the first, is that the response term to be modeled is a binary
variable. It is assumed that the values of the binary response are 0 and 1. In fact, the soft-
ware dealing with the estimation of binomial parameters assumes that the variable being
modeled has values of only 0 or 1. If the data being modeled is cast as 1, 2 for example, the
software converts it to 0 and 1 prior to estimation. Some software will simply not accept
any values other than 0 and 1. In fact, we recommend formatting all binary variables in a
statistical model, whether response variable or predictor, as 0, 1.
There are two major parameterizations of the binomial probability distribution, as well as
corresponding models based on each parameterization. The most commonly used param-
eterization is based on the Bernoulli PDF, which is a subset of the full binomial PDF. The
binomial distribution can be expressed as
m
f (y; p, m) = p y (1 − p)m−y , (5.17)
y
where
m m!
= (5.18)
y y! (m − k)!
is the binomial normalization term, y is the response term and binomial numerator, m is the
binomial denominator, and p represents the probability that y has the value 1. Thus y = 1
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
99 5.3 Bayesian Binomial Models
n = 25 n = 25 n = 25 n = 25
0.25 p = 0.2 p = 0.5 p = 0.7 p = 0.9
Binomial PDF
0.20
0.15
0.10
0.05
n = 20 n = 25 n = 35 n = 45
0.25 p = 0.5 p = 0.5 p = 0.5 p = 0.5
Binomial PDF
0.20
0.15
0.10
0.05
5 10 15 20 25 30 35 40 5 10 15 20 25 30 35 40 5 10 15 20 25 30 35 40 5 10 15 20 25 30 35 40
t
x x x x
Figure 5.13 Upper: Set of binomial probability distribution functions with different values for the parameter p; n = 25. Lower:
Set of binomial probability distribution functions with different values for n; p = 5.0.
indicates success, usually thought of as success in obtaining whatever the model is testing
for; y = 0 indicates a lack of success, or failure. A binary response such as this does not
allow intermediate values, or values less than 0 or over 1. A more detailed examination of
the full binomial PDF and model is provided in the section on grouped logistic or binomial
models later in this chapter. Figure 5.13 shows examples of binomial distributions with
different probabilities p and numbers of trials n ≡ m.
The Bernoulli PDF sets m to the value 1, which eliminates the choose function; the
choose function serves as the normalization term, ensuring that the individual probabilities
sum to 1. The Bernoulli PDF is therefore expressed as
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
100 GLMs Part I – Continuous and Binomial Models
are symmetric functions, and the cloglog and loglog links are asymmetric (see e.g. Hilbe,
2015).
The Bernoulli log-likelihood function is used to describe the binary data being modeled.
Prior distributions are multiplied with the log-likelihood to construct a posterior distribu-
tion for each parameter in the model. For the Bayesian logistic model the only parameters
estimated are the intercept and predictor posteriors. The Bernoulli log-likelihood function
in its full form may be expressed as
n
pi
L (p, y) = yi ln + ln(1 − pi ) . (5.20)
1 − pi
i=1
y
1
0
x1
x2
x3 x
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
101 5.3 Bayesian Binomial Models
We use the MCMClogit function, to be found in the package MCMCpack on CRAN, to model
logitmod data. As the name of the function implies, it uses MCMC sampling to determine
the mean, standard deviation, and credible intervals of the model parameters. The code
within the single lines produces a summary of the parameter statistics plus plots of the
trace and density. The trace plot should have no deviations from the random line in pro-
gressing from the left to the right side of the box. For the density plot, generally one sees a
normal bell-shaped curve for the posterior distributions of predictors. Skewed – even highly
skewed – curves are sometimes displayed for the variance and scale parameters. The mean
of each parameter is at the mode of the respective density plot. The median of the posterior
is also sometimes used as the reported summary statistic for the distribution. It should be
noted that, when the log-likelihood, AIC, and BIC are given sampling distributions, these
are nearly always highly skewed.
The parameter values for the intercept, x1, and x2 are close to what we specified. The
trace plot looks good and the density plots are as expected. The credible intervals, together
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
102 GLMs Part I – Continuous and Binomial Models
01234
6000 8000 10000 12000 14000 1.6 1.8 2.0 2.2
Iterations N = 10000 Bandwidth = 0.01397
Trace of x1 Density of x1
0.6 0.8 1.0
0 2 4
6000 8000 10000 12000 14000 0.6 0.7 0.8 0.9 1.0 1.1
Iterations N = 10000 Bandwidth = 0.01169
Trace of x2 Density of x2
6000 8000 10000 12000 14000 –5.6 –5.4 –5.2 –5.0 –4.8 –4.6 –4.4
t
Iterations N = 10000 Bandwidth = 0.0243
Figure 5.15 Trace plot and posteriors for the three regression parameters of the logit model.
with the trace and density plots in Figure 5.15, appear to confirm that the model is well
fitted.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
103 5.3 Bayesian Binomial Models
b0 = rep(0, K),
B0 = diag(0.00001, K)
)
sink("LOGIT.txt")
cat("
model{
# Priors
beta ˜ dmnorm(b0[], B0[,])
# Likelihood
for (i in 1:N){
Y[i] ˜ dbern(p[i])
logit(p[i]) <- eta[i]
# logit(p[i]) <- max(-20,min(20,eta[i])) used to avoid numerical
instabilities
# p[i] <- 1/(1+exp(-eta[i])) can use for logit(p[i]) above
eta[i] <- inprod(beta[], X[i,])
LLi[i] <- Y[i] * log(p[i]) +
(1 - Y[i]) * log(1 - p[i])
}
LogL <- sum(LLi[1:N])
AIC <- -2 * LogL + 2 * K
BIC <- -2 * LogL + LogN * K
}
",fill = TRUE)
sink()
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
104 GLMs Part I – Continuous and Binomial Models
The parameter values obtained with the JAGS code are nearly the same as those
obtained using MCMClogit. To display the chains and histograms of the model parameters
(Figures 5.16 and 5.17) use
source("CH-Figures.R")
out <- LOGT$BUGSoutput
MyBUGSHist(out,c(uNames("beta",K),"AIC", " BIC", "LogL"))
MyBUGSChains(out,c(uNames("beta",K),"AIC", " BIC", "LogL"))
Figure 5.18 displays the fitted model. The y-axis represents the probability of success as
function of x1 for x2 = 1 (lower curve) and x2 = 0 (upper curve).
Note that in Code 5.17 the line
This is the inverse link function, which defines the logistic fitted probability, or μ, but
in fact p is often used to symbolize the predicted probability, as previously mentioned. It
is better programming practice to use the logit() and similar functions in situations like
this rather than the actual formula. We show both in case you use this code for models that
do not have a function like logit(), e.g., a Bernoulli model with a complementary loglog
link. The above substitution can be applied to the Bernoulli, binomial, and beta binomial
models, as we shall observe.
00
00
00
00
00
10
20
30
40
50
0
–2545
–5.4 5100
–5.6
AIC beta[1] beta[2]
2.2 1.2
5100
2.1
1.0
2.0
5090 1.9
0.8
1.8
5080 1.7 0.6
0
0
00
00
00
00
00
00
00
00
00
00
10
20
30
40
50
10
20
30
40
50
t
MCMC iterations
Figure 5.16 MCMC chains for the three model parameters, β1 , β2 , and β3 , and for the log-likelihood, the AIC and BIC, for the
Bernoulli model.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
105 5.3 Bayesian Binomial Models
1500
1500
1000
1000
500
500
Frequencies
0
0
0
–5.6 –5.4 –5.2 –5.0 –4.8 –4.6 5100 5105 5110 5115 5120 5125 –2545 –2540 –2535
0
5080 5085 5090 5095 5100 5105 1.7 1.8 1.9 2.0 2.1 2.2 0.6 0.8 1.0 1.2
t
Posterior distribution
Figure 5.17 Histogram of the MCMC iterations for each parameter. The thick line at the base of each histogram represents the 95%
credible interval. Note that no 95% credible interval contains 0.
1.00
0.75
y 0.50
0.25
0.00
t
x
Figure 5.18 Visualization of the synthetic data from the Bayesian logistic model. The dashed and dotted lines and respective
shaded areas show the fitted and 95% probability intervals. The dots in the upper horizontal border correspond to
observed successes for each binary predictor and those in the lower horizontal border correspond to observed failures.
The dots with error bars denote the fraction of successes, in bins of 0.05. (A black and white version of this figure will
appear in some formats. For the color version, please refer to the plate section.)
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
106 GLMs Part I – Continuous and Binomial Models
We have added a calculation of the AIC and BIC goodness-of-fit statistics. These fit-
test statistics are comparative tests; they have no real informative power by themselves.
They may be used on both nested and non-nested data, unlike most fit tests. The deviance
information criterion (DIC; Spiegelhalter et al., 2002) test is similar to AIC and BIC, which
are acronyms for the Aikake information criterion (Akaike, 1974) and the Bayesian infor-
mation criterion (Schwarz, 1978) respectively. However, the DIC is specific to Bayesian
models and should be used when assessing comparative model value in preference to AIC
and BIC. We shall discuss the deviance, the DIC, and pD tests in Chapter 9. The way in
which we set up the AIC and BIC tests in this model using JAGS, however, can be used for
the other models we address in this volume.
There are times when the binary response data that we are modeling is correlated.
Remember, a likelihood, or log-likelihood, function is based on a probability distribu-
tion. As such, it involves the assumption that each observation in the data is independent
of the others. But what if the model data has been gathered in various small groups? A
typical example used in the social sciences is data collected about students in schools
throughout a large metropolitan area. It is likely that teaching methods are more similar
within schools than between schools. The data is likely to be correlated on the basis of
the differences in within-school teaching methods. To assess student learning across all
the schools in the area without accounting for the within-school correlation effect would
bias the standard errors. This same panel or nesting effect may also occur in astrophys-
ical data, when, for example, comparing properties of galaxies within galaxy clusters
and between galaxy clusters. In frequency-based modeling, analysts typically scale the
model standard errors, apply a sandwich or robust adjustment to the standard errors, or
model the data as a fixed or random effects panel model. We discuss panel models in
Chapter 8.
When employing a Bayesian logistic model on the data, it is still important to account
for possible correlations in the data. Using sandwich or robust adjustments to the model
is recommended as a first type of adjustment when the model has been estimated using
maximum likelihood. However, such an adjustment is not feasible for a Bayesian model.
Scaling can be, though. The scaling of a logistic model produces what R calls a quasi-
binomial model. It simply involves multiplying the standard error by the square root of
the dispersion statistic. This may be done to the posterior standard deviations. We define
the dispersion statistic as the ratio of the Pearson χ 2 statistic and the residual degrees
of freedom. For binomial models the deviance statistic may also be used in place of the
Pearson χ 2 . The deviance statistic is produced in the R glm model output, as well as in the
default JAGS output. The residual degrees of freedom is calculated as the number of model
observations less the number of parameters in the model. We suggest scaling the binomial
posterior standard deviations only if there is a substantial difference in values between the
scaled and default standard deviation values. You will then need to calculate new credible
intervals based on the scaled standard deviation values. HPDinterval(), from the pack-
age lme4, for calculating credible intervals may also be used in place of scaling; HDP is
an acronym for highest posterior density intervals. Again, we recommend this method of
adjustment only if there is a substantial panel-effect correlation in the data. But in that case
it would be preferable to employ a hierarchical model on the data (Chapter 8).
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
107 5.3 Bayesian Binomial Models
def invlogit(x):
""" Inverse logit function.
input: scalar
output: scalar
"""
return 1.0 / (1 + np.exp(-x))
# Data
np.random.seed(13979) # set seed to replicate example
nobs = 5000 # number of obs in model
beta0 = 2.0
beta1 = 0.75
beta2 = -5.0
# Fit
niter = 5000 # parameters for MCMC
# Likelihood
p = invlogit(beta0 + beta1 * x1 + beta2 * x2)
y_obs = pm.Binomial(’y_obs’, n=1, p=p, observed=by)
# Inference
start = pm.find_MAP()
step = pm.NUTS()
trace = pm.sample(niter, step, start, progressbar=True)
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
108 GLMs Part I – Continuous and Binomial Models
beta1:
Mean SD MC Error 95% HPD interval
-----------------------------------------------------------------------
0.753 0.071 0.001 [0.617, 0.897]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
0.611 0.7055 0.754 0.802 0.893
beta2:
Mean SD MC Error 95% HPD interval
------------------------------------------------------------------------
-4.883 0.144 0.004 [-5.186, -4.620]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
-5.179 -4.978 -4.879 -4.787 -4.612
# Data
np.random.seed(13979) # set seed to replicate example
nobs = 5000 # number of obs in model
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
109 5.3 Bayesian Binomial Models
x1 = bernoulli.rvs(0.6, size=nobs)
x2 = uniform.rvs(size=nobs)
beta0 = 2.0
beta1 = 0.75
beta2 = -5.0
by = bernoulli.rvs(exb, size=nobs)
mydata = {}
mydata[’K’] = 3
mydata[’X’] = sm.add_constant(np.column_stack((x1,x2)))
mydata[’N’] = nobs
mydata[’Y’] = by
mydata[’LogN’] = np.log(nobs)
# Fit
stan_code = """
data{
int<lower=0> N;
int<lower=0> K;
matrix[N, K] X;
int Y[N];
real LogN;
}
parameters{
vector[K] beta;
}
transformed parameters{
vector[N] eta;
eta = X * beta;
}
model{
Y ˜ bernoulli_logit(eta);
}
generated quantities{
real LL[N];
real AIC;
real BIC;
real LogL;
real<lower=0, upper=1.0> pnew[N];
vector[N] etanew;
etanew = X * beta;
for (i in 1:N){
pnew[i] = inv_logit(etanew[i]);
LL[i] = bernoulli_lpmf(1|pnew[i]);
}
LogL = sum(LL);
AIC = -2 * LogL + 2 * K;
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
110 GLMs Part I – Continuous and Binomial Models
output = str(fit).split(’\n’)
for i in lines:
print (output[i])
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] 1.92 2.4e-3 0.08 1.76 1.87 1.92 1.98 2.08 1115.0 1.0
beta[1] 0.75 2.0e-3 0.07 0.62 0.7 0.75 0.8 0.9 1296.0 1.0
beta[2] -4.89 4.3e-3 0.15 -5.18 -4.99 -4.88 -4.79 -4.6 1125.0 1.0
AIC 9651.1 5.24 217.68 9239.7 9501.8 9647.8 9798.4 1.0e4 1723.0 1.0
BIC 9670.7 5.24 217.68 9259.2 9521.3 9667.4 9818.0 1.0e4 1723.0 1.0
LogL -4822 2.62 108.84 -5039 -4896 -4820 -4747 -4616 1723.0 1.0
> library(boot)
> inv.logit(0.4)
[1] 0.5986877
and
> 1/(1+exp(-0.4))
[1] 0.5986877
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
111 5.3 Bayesian Binomial Models
We add a note on when a probit model should be used. The origin of the probit model
derives from the case where statisticians divide a normally distributed variable into two
components, 0 and 1. The goal might be to determine which pattern of explanatory predic-
tors produces predicted probabilities greater than 0.5 and which pattern produces predicted
probabilities less than 0.5. Whatever goal a statistician might have, though, the important
point is that if the binary data originate from an underlying, latent, normally distributed
variable, which may be completely unknown, a probit model is preferable to a logit,
complementary loglog, or loglog model. Typically, predicted probit probabilities are lit-
tle different in value to predicted logit probabilities, so which model one uses may be a
matter of simple preference. However, when modeled using maximum likelihood, expo-
nentiated logistic model coefficients are interpreted as odds ratios,1 which is very useful
in fields such as medicine, social science, and so forth. Exponentiated probit coefficients
cannot be interpreted as odds ratios and have no interesting interpretation. For this rea-
son the maximum likelihood (and GLM) logistic regression model is very popular. In the
Bayesian context, exponentiated logistic parameter means are not technically odds ratios
although some analysts have interpreted them in such a manner. Their values may be iden-
tical to those produced in maximum likelihood estimation, but the theoretical basis for this
interpretation is missing. In astronomy the odds and odds-ratio concepts are rarely if ever
used.
When making the selection of either a logistic or a probit Bayesian model, the best
tactic is to evaluate the comparative diagnostic statistics and compare the DIC statis-
tics. If the diagnostics are nearly the same, but the DIC statistic of a logistic model
is less than that for a probit model on the same data, the logistic model should be
preferred.
1 The odds of success is defined as the ratio of the probability of success and the probability of failure. The odds
ratio is the ratio of the odds of success for two different classes.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
112 GLMs Part I – Continuous and Binomial Models
The results show that the parameter means are close to those specified in the synthetic
probit data. As for the logistic model, the trace and density plots appear to fit the model
well.
Again, this is due to the fact that the logit and probit models are both based on the
Bernoulli distribution and only differ in their link functions.
# Likelihood
for (i in 1:N){
Y[i] ˜ dbern(p[i])
probit(p[i]) <- max(-20, min(20, eta[i]))
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
113 5.3 Bayesian Binomial Models
# Information criteria
AIC <- -2 * LogL + 2 * K
BIC <- -2 * LogL + LogN * K
}
",fill = TRUE)
sink()
The results are nearly identical to the parameter estimates obtained with MCMCprobit.
Other non-canonical Bernoulli models include the complementary cloglog and loglog
models. These links are found in GLM software but are rarely used. Both these alternatives
are asymmetric about the distributional mean, 0.5. In terms of Bayesian applications, an
analyst may amend the line of code we used to distinguish JAGS logit from probit models,
in order to model cloglog and loglog models. This is done by using the formulae for their
respective inverse link functions. In Table 5.2 we display a list of Bernoulli (binomial)
inverse link functions, which are used to define mu, the fitted or predicted value.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
114 GLMs Part I – Continuous and Binomial Models
def probit_phi(x):
"""Probit transformation."""
mu = 0
sd = 1
return 0.5 * (1 + tsr.erf((x - mu) / (sd * tsr.sqrt(2))))
# Data
np.random.seed(135) # set seed to replicate example
nobs = 5000 # number of obs in model
x1 = uniform.rvs(size=nobs)
x2 = 2 * uniform.rvs(size=nobs)
py = bernoulli.rvs(exb)
# Fit
niter = 10000 # parameters for MCMC
# define likelihood
theta_p = beta0 + beta1*x1 + beta2 * x2
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
115 5.3 Bayesian Binomial Models
theta = probit_phi(theta_p)
y_obs = pm.Bernoulli(’y_obs’, p=theta, observed=py)
# inference
start = pm.find_MAP() # find starting value by optimization
step = pm.NUTS()
trace = pm.sample(niter, step, start, random_seed=135, progressbar=True)
beta1:
Mean SD MC Error 95% HPD interval
-----------------------------------------------------------------
0.760 0.084 0.001 [0.602, 0.927]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
0.598 0.703 0.759 0.815 0.923
beta2:
Mean SD MC Error 95% HPD interval
-----------------------------------------------------------------
-1.234 0.048 0.001 [-1.330, -1.142]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
-1.329 -1.266 -1.234 -1.202 -1.140
# Data
np.random.seed(1944) # set seed to replicate example
nobs = 2000 # number of obs in model
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
116 GLMs Part I – Continuous and Binomial Models
x1 = uniform.rvs(size=nobs)
x2 = 2 * uniform.rvs(size=nobs)
beta0 = 2.0
beta1 = 0.75
beta2 = -1.25
py = bernoulli.rvs(exb)
# Fit
probit_data = {}
probit_data[’N’] = nobs
probit_data[’K’] = K
probit_data[’X’] = X
probit_data[’Y’] = py
probit_data[’logN’] = np.log(nobs)
probit_code = """
data{
int<lower=0> N;
int<lower=0> K;
matrix[N,K] X;
int Y[N];
real logN;
}
parameters{
vector[K] beta;
}
transformed parameters{
vector[N] xb;
xb = X * beta;
}
model{
xb2 = X * beta;
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
117 5.3 Bayesian Binomial Models
for (i in 1:N){
p[i] = Phi(xb2[i]);
LLi[i] = Y[i] * log(p[i]) + (1-Y[i]) * log(1 - p[i]);
}
LogL = sum(LLi);
AIC = -2 * LogL + 2 * K;
BIC = -2 * LogL + logN * K;
}
"""
# Output
lines = list(range(8)) + [2 * nobs + 8, 2 * nobs + 9, 2 * nobs + 10]
output = str(fit).split(’\n’)
for i in lines:
print(output[i])
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] 1.92 3.4e-3 0.11 1.71 1.85 1.92 1.99 2.14 1075.0 1.0
beta[1] 0.67 3.7e-3 0.12 0.44 0.59 0.68 0.76 0.92 1147.0 1.0
beta[2] -1.2 2.2e-3 0.07 -1.34 -1.25 -1.2 -1.16 -1.06 1085.0 1.0
AIC 1643.0 0.08 2.43 1640.3 1641.2 1642.4 1644.1 1649.1 911.0 1.0
BIC 1659.8 0.08 2.43 1657.1 1658.0 1659.2 1660.9 1665.9 911.0 1.0
LogL -818.5 0.04 1.21 -821.5 -819.0 -818.2 -817.6 -817.1 911.0 1.0
y m x1 x2
2 8 1 0
6 9 0 1
0 3 1 1
7 7 0 0
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
118 GLMs Part I – Continuous and Binomial Models
The binomial and beta binomial models are appropriate for modeling such data. The beta
binomial model, however, is used when a binomial model is overdispersed. We discuss the
beta binomial in the next section.
The binomial model is the probability distribution most often used for binary response
data. When the response term is binary and the binomial denominator is 1, the distribu-
tion is called Bernoulli. The standard logistic and probit models with binary response data
are based on the Bernoulli distribution, which itself is a subset of the binomial distribu-
tion. We described the structure of a binomial data set above. We now turn to the binomial
probability distribution, the binomial log-likelihood, and other equations fundamental to
the estimation of binomial-based models. We shall discuss the logistic and probit binomial
models in this section, noting that they are frequently referred to as grouped logistic and
grouped probit models. Our main discussion will relate to the binomial logistic model since
it is used in research far more than is the probit or other models such as the complemen-
tary loglog and loglog models. These are binomial models as well, but with different link
functions.
The binomial PDF in terms of p is expressed as
mi yi
f (y; p, m) = pi (1 − pi )mi −yi . (5.21)
yi
The log-likelihood PDF in terms of μ may be given as
n
μi mi
L (μ; y, m) = yi ln + m ln(1 − μi ) + ln (5.22)
1 − μi yi
i=1
where the choose function within the final term of the log-likelihood can be re-expressed
in terms of factorials as
mi !
. (5.23)
yi ! (mi − yi )!
The log of the choose function may be converted to log-gamma functions and expressed as
These three terms are very often used in estimation algorithms to represent the final term
of the log-likelihood, which is the normalization term of the binomial PDF. Normalization
of a PDF guarantees that the individual probabilities in the PDF sum to 1.
The pseudocode for the binomial log-likelihood can be given as
The pseudocode for the binomial log-likelihood in term of the linear predictor xb can be
given as
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
119 5.3 Bayesian Binomial Models
Binomial PDF
y
0
x1
x2 x
x3
MCMCpack does not have a built-in binomial or grouped logistic function. It is possible to
create one’s own function as an option to MCMCregress, but it is much easier to use JAGS
from within R to do the modeling.
y <- rbinom(nobs,prob=p,size=m)
bindata=data.frame(y=y,m=m,x1,x2)
==================================================
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
120 GLMs Part I – Continuous and Binomial Models
The following code may be used to obtain maximum likelihood or GLM results from
modeling the above data. Note that the parameter estimates closely approximate the values
specified in the code creating the data. They also will be close to the posterior parameter
means calculated from the JAGS code 5.26 below.
> noty <- m - y
> mybin <- glm(cbind(y, noty) ˜ x1 + x2, family=binomial, data=bindata)
> summary(mybin)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.99406 0.06354 -31.38 <2e-16 ***
x1 -1.59785 0.08162 -19.58 <2e-16 ***
x2 3.09066 0.08593 35.97 <2e-16 ***
The simulated binomial numerator values are contained in y and the binomial denom-
inator values in m. The JAGS code for modeling the above data is given in Code 5.26. In
place of using the written out code for the binomial log-likelihood, we use the internal
JAGS dbin function with the m option for the log-likelihood. It is generally faster.
sink("GLOGIT.txt")
cat("
model{
# Priors
# Diffuse normal priors betas
for (i in 1:K) { beta[i] ˜ dnorm(0, 0.0001)}
# Likelihood
for (i in 1:N){
Y[i] ˜ dbin(p[i],m[i])
logit(p[i]) <- eta[i]
eta[i] <- inprod(beta[], X[i,])
}
}
",fill = TRUE)
sink()
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
121 5.3 Bayesian Binomial Models
The estimated mean parameter values are close to the values we specified when the data
were created. The specified parameter mean values for the intercept and for x1 and x2 are
−2, −1.5, and 3. The model parameter values are −1.96, −1.58, and 3.0 respectively.
It should be noted that the line
Here logit(p[i]) is identical to the inverse link, which provides the fitted probability.
Substituting one line for the other results in the same results, given a sampling error, that is.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
122 GLMs Part I – Continuous and Binomial Models
x1 <- c(1,1,1,1,1,1,0,0,0,0,0,0)
x2 <- c(1,1,0,0,1,1,0,0,1,1,0,0)
x3 <- c(1,0,1,0,1,0,1,0,1,0,1,0)
bindata1 <-data.frame(y,m,x1,x2,x3)
==================================================
The only lines that need to be amended in Code 5.26 are those specifying the data set
and the added predictor, x3. All else remains the same.
The resulting parameter means, standard deviations, and credible intervals are as follows:
Given that the “true” parameters for the intercept and three coefficient parameters
are, respectively, −1.333, 0.25, 0.5, and −0.3, the parameter values found by MCMC
sampling are very close. The algorithm is fast owing to the grouped nature of the
data.
# Data
np.random.seed(33559) # set seed to replicate example
nobs = 2000 # number of obs in model
m = 1 + poisson.rvs(5, size=nobs)
x1 = uniform.rvs(size=nobs) # random uniform variable
x2 = uniform.rvs(size=nobs)
beta0 = -2.0
beta1 = -1.5
beta2 = 3.0
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
123 5.3 Bayesian Binomial Models
mydata = {}
mydata[’K’] = 3
mydata[’X’] = sm.add_constant(np.column_stack((x1,x2)))
mydata[’N’] = nobs
mydata[’Y’] = y
mydata[’m’] = m
# Fit
stan_code = """
data{
int<lower=0> N;
int<lower=0> K;
matrix[N, K] X;
int Y[N];
int m[N];
}
parameters{
vector[K] beta;
}
transformed parameters{
vector[N] eta;
vector[N] p;
eta = X * beta;
for (i in 1:N) p[i] = inv_logit(eta[i]);
}
model{
Y ˜ binomial(m, p);
}
"""
# Output
nlines = 8
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] -2.08 1.9e-3 0.06 -2.2 -2.12 -2.08 -2.03 -1.95 1135.0 1.0
beta[1] -1.55 2.4e-3 0.08 -1.71 -1.61 -1.56 -1.5 -1.4 1143.0 1.0
beta[2] 3.15 2.6e-3 0.09 2.98 3.09 3.15 3.21 3.33 1179.0 1.0
Analogously to what was shown in the previous section, if we wish to explicitly use the
example code 5.27 in Python, the data section of Code 5.28 must be changed as follows:
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
124 GLMs Part I – Continuous and Binomial Models
Code 5.29 Adaptation of binomial model data section of Code 5.28 allowing it to handle
explicit three-parameter data.
==================================================
# Data
np.random.seed(33559) # set seed to replicate example
y = [6,11,9,13,17,21,8,10,15,19,7,12]
m = [45,54,39,47,29,44,36,57,62,55,66,48]
x1 = [1,1,1,1,1,1,0,0,0,0,0,0]
x2 = [1,1,0,0,1,1,0,0,1,1,0,0]
x3 = [1,0,1,0,1,0,1,0,1,0,1,0]
mydata = {}
mydata[’K’] = 4
mydata[’X’] = sm.add_constant(np.column_stack((x1,x2,x3)))
mydata[’N’] = len(y)
mydata[’Y’] = y
mydata[’m’] = m
# Output nlines = 9
==================================================
. . . in R using JAGS
The grouped logistic or binomial model can be made into a grouped or binomial probit
model by changing a line in the above JAGS code from
to
The output when run on the real grouped data that we used for the logistic model above
appears as
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
125 5.3 Bayesian Binomial Models
No seed value was given to the data or sampling algorithm, so each run will produce
different results. The results will not differ greatly and should be within the range of the
credible intervals 95% of the time.
p[i] = inv_logit(eta[i]);
to
p[i] = Phi(eta[i]);
in Code 5.28.
Using the data shown in Code 5.27 and the changes mentioned above, the output on the
screen should look like
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
126 GLMs Part I – Continuous and Binomial Models
Binomial Beta–binomial
200
Frequency
100
15 20 25 30 35
t
x
Figure 5.20 Histograms of a binomial distributtion and a beta–binomial distribution with same total number of trials and
probability of success.
f (y; μ, σ )
ln (n + 1) ln (1/σ ) ln (y + μ/σ ) ln [n − y + (1 − μ)/σ ]
= .
ln (y + 1) ln (n − y + 1) ln (n + 1/σ ) ln (μ/σ ) ln [(1 − μ)/σ ]
(5.25)
JAGS code may be given for the log-likelihood below, for which the logit inverse link
is used. This log-likelihood may be converted to a probit, complementary loglog, or
loglog beta–binomial by using the desired inverse link function. The remainder of the log-
likelihood function remains the same. Simply amending one line changes the model being
used.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
127 5.3 Bayesian Binomial Models
# Simulation
set.seed(33559)
nobs = 2500
m = 1 + rpois(nobs,5)
x1 = runif(nobs)
beta1 <- -2
beta2 <- -1.5
eta <- beta1 + beta2 * x1
sigma <- 20
p <- inv.logit(eta)
shape1 = sigma*p
shape2 = sigma*(1 - p)
Now we use JAGS to develop a Bayesian beta–binomial model with parameter values
that closely resemble the values we assigned to the simulated data. We base this model on
the values given to the two beta–binomial parameters, beta1 (−2.0) and beta2 (−1.50),
and sigma (20).
sink("GLOGIT.txt")
cat("
model{
# Diffuse normal priors betas
for (i in 1:K) { beta[i] ˜ dnorm(0, 0.0001)}
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
128 GLMs Part I – Continuous and Binomial Models
for (i in 1:N){
Y[i] ˜ dbin(p[i],m[i])
p[i] ˜ dbeta(shape1[i],shape2[i])
shape1[i] <- sigma*pi[i]
shape2[i] <- sigma*(1-pi[i])
logit(pi[i]) <- eta[i]
eta[i] <- inprod(beta[],X[i,])
}
}
",fill = TRUE)
sink()
The results are close to the values −2 and −1.5 we gave to the simulated beta–binomial
parameters. The value of sigma that we assigned (20) is closely estimated as 20.539.
As we did for the binomial model, we provide the same synthetic data set for modeling
the beta–binomial data grouped in standard format. Instead of using p[i] for the predicted
probability, we shall use mu[i] for the beta–binomial models. Recall, though, that they
symbolize the same thing – the fitted value. Again, we repeat the display of the binomial
data for convenience.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
129 5.3 Bayesian Binomial Models
x3 <- c(1,0,1,0,1,0,1,0,1,0,1,0)
bindata <- data.frame(y,m,x1,x2,x3)
==================================================
The JAGS code used to determine the beta–binomial parameter means, standard devia-
tions, and credible intervals for the above data is found in Code 5.34 below. Note that the
line mu[i] <- 1/(1+exp(-eta)) is equivalent to logit(mu[i]) <- max(-20, min(20,
eta[i])). The second expression, however, sets limits on the range of predicted values
allowed by the inverse logistic link. When the zero trick is used then the log-likelihood of
the distribution underlying the model must be used. The beta–binomial log-likelihood is
provided over four lines of code. The line above the start of the log-likelihood defines the
link that will be used with the model, which is the logistic link for this model.
Code 5.34 Beta–binomial model (in R using JAGS) for explicitly given data and the zero
trick.
==================================================
library(R2jags)
X <- model.matrix(˜ x1 + x2 + x3, data = bindata)
K <- ncol(X)
model.data <- list(Y = bindata$y,
N = nrow(bindata),
X = X,
K = K,
m = m,
Zeros = rep(0, nrow(bindata))
)
sink("BBL.txt")
cat("
model{
# Diffuse normal priors betas
for (i in 1:K) { beta[i] ˜ dnorm(0, 0.0001)}
C <- 10000
for (i in 1:N){
Zeros[i] ˜ dpois(Zeros.mean[i])
Zeros.mean[i] <- -LL[i] + C
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
130 GLMs Part I – Continuous and Binomial Models
In Code 5.34 above we provide a beta–binomial model using the inverse link for sigma.
The log link was used in the above parameterization of the beta–binomial. Note that we do
not use the zero trick this time, which facilitates speed.
sink("BBI.txt")
cat("
model{
# Diffuse normal priors betas
for (i in 1:K) { beta[i] ˜ dnorm(0, 0.0001)}
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
131 5.3 Bayesian Binomial Models
sink()
# Identify parameters
params <- c("beta", "sigma")
The values closely approximate those given in the results of the initial model.
Code 5.36 Beta–binomial model with synthetic data in Python using Stan.
==================================================
import numpy as np
import statsmodels.api as sm
import pystan
# Data
np.random.seed(33559) # set seed to replicate example
nobs = 4000 # number of obs in model
m = 1 + poisson.rvs(5, size=nobs)
x1 = uniform.rvs(size=nobs) # random uniform variable
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
132 GLMs Part I – Continuous and Binomial Models
beta0 = -2.0
beta1 = -1.5
p = np.exp(eta) / (1 + np.exp(eta))
shape1 = sigma * p
shape2 = sigma * (1-p)
mydata = {}
mydata[’K’] = 2
mydata[’X’] = sm.add_constant(np.transpose(x1))
mydata[’N’] = nobs
mydata[’Y’] = y
mydata[’m’] = m
# Fit
stan_code = """
data{
int<lower=0> N;
int<lower=0> K;
matrix[N, K] X;
int Y[N];
int m[N];
}
parameters{
vector[K] beta;
real<lower=0> sigma;
}
transformed parameters{
vector[N] eta;
vector[N] pi;
vector[N] shape1;
vector[N] shape2;
eta = X * beta;
for (i in 1:N){
pi[i] = inv_logit(eta[i]);
shape1[i] = sigma * pi[i];
shape2[i] = sigma * (1 - pi[i]);
}
}
model{
# Output
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
133 5.3 Bayesian Binomial Models
nlines = 8
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] -2.07 1.1e-3 0.05 -2.17 -2.1 -2.06 -2.03 -1.97 2136.0 1.0
beta[1] -1.44 2.3e-3 0.11 -1.65 -1.51 -1.44 -1.37 -1.22 2102.0 1.0
sigma 17.56 0.04 2.16 13.9 16.02 17.38 18.88 22.31 2394.0 1.0
to
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
134 GLMs Part I – Continuous and Binomial Models
Beta–binomial complementary loglog and loglog links can also be created from the
above code by amending the same line as that used for a probit link. The inverse link func-
tions provided in Table 5.1 can be used. The second model we gave can also be converted to
a probit model by changing the line logit(pi[i]) <- eta[i] to the line probit(pi[i])
<- eta[i].
pi[i] = inv_logit(eta[i]);
by
pi[i] = Phi(eta[i]);
which results in
This concludes the section on binomial models, including Bernoulli, full binomial, and
beta–binomial combination models. In Chapter 6 we shall discuss hierarchical or gener-
alized linear mixed-effect models. Some statisticians use the terms “random intercept”
and “random slopes” to refer to these models. However, we next turn, in Chapter 6,
to presenting an overview of Bayesian count models. We discuss the standard Bayesian
Poisson count model, as well as the Bayesian negative binomial, generalized Poisson,
zero-truncated, and three-parameter NB-P models. This discussion is followed by an
examination of Bayesian zero-inflated mixture models and two-part hurdle models.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:04, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.006
6 GLMs Part II – Count Models
Count data refer to observations made about events or enumerated items. In statistics
they represent observations that have only non-negative integer values, for example, the
number of globular clusters in a galaxy or the number of exoplanets orbiting a stellar
system. Examples of discrete count models discussed in this chapter are displayed in
Table 6.1.
A count response consists of any discrete number of counts: for example, the number
of hits recorded by a Geiger counter, patient days in a hospital, or sunspots appearing
per year. All count models aim to explain the number of occurrences, or counts, of an
event. The counts themselves are intrinsically heteroskedastic, i.e., right skewed, and have
a variance that increases with the mean of the distribution. If the variance is greater than the
mean, the model is said to be overdispersed; if less than the mean, the model is said to be
underdispersed; the term “extra-dispersed” includes both these possibilities. The Poisson
model is assumed to be equidispersed, with the mean equal to the variance. Violations of
this assumption in a Poisson model lead to biased standard errors. Models other than the
Poisson are aimed to adjust for what is causing the data to be under- or overdispersed.
Refer to Hilbe (2014) for a full text on the modeling of count data.
The Poisson model is the traditional standard model to use when modeling count data.
Counts are values ranging from 0 to a theoretical infinity. The important thing to remember
is that count data are discrete. A count response variable modeled using a Poisson
regression model (for frequency-based estimation) or a Poisson model (for Bayesian esti-
mation) consists of zero and positive integers. The Poisson probability distribution may be
expressed as
e−μi μi i
y
f (y; μ) = (6.1)
yi !
and the Poisson log-likelihood as
n
L = {yi ln(μi ) − μi − ln(yi )} (6.2)
i=1
with μi = exp(xi β), where xi β is the linear predictor for each observation in the model.
The general expression in Equation 5.1 then takes the form
135
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
136 GLMs Part II – Count Models
0
x1
x2
x3 x
Yi ∼ Poisson(μi );
μi = eηi ,
ηi = β0 + β1 x1 + · · · + βp xp .
When the data are formatted as integers, it is not appropriate to model them using nor-
mal, lognormal, or other continuous-response models. Prior to when Poisson regression
became available in commercial statistical software, analysts faced with modeling pos-
itive count data logged the counts, usually adding 0.5 to their value so that zero count
values would not end up with missing values. Of course, doing this changes the data and,
most importantly, violates the distributional assumptions of the normal model. Logging the
response variable and modeling it with a normal or linear regression still corresponds to a
normal model – and is based on various important assumptions.
Probably the main reason not to log a count variable and model it using ordinary least
squares regression (OLS) is that in linear regression it is assumed that the variance of the
counts remains constant throughout the range of count values. This is unrealistic and rarely
if ever occurs with real data. Nevertheless, when the model is estimated using a GLM-based
normal or linear regression, the variance is set to 1. When estimated as a two-parameter
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
137 6.1 Bayesian Poisson Models
model the normal model estimates both the mean and variance, with the variance having a
constant value for all observations. In fact, in count data the variance nearly always varies
with the mean, becoming larger as the mean becomes higher. To model a count variable as
if it were continuous violates the distributional assumptions of both the normal and count
model distributions. Typically the standard errors become biased, as well as the predicted
values. Also, goodness-of-fit statistics such as the AIC or BIC are generally substantially
higher for count data estimated using OLS rather than a standard count model.
We shall first provide a simple example of synthetic Poisson data with a corresponding
Poisson model. In the model below we create a 500-observation data set. We provide an
intercept value of 1, and a predictor x, with parameter value 2. We have combined lines by
placing the inverse link transformation within the Poisson random number generator rather
than on a separate line before it. We have also encased the regression within the summary
function to save another step. We can further simplify the code so that the synthetic data is
created on two lines; this is displayed below the following simple model.
set.seed(2016)
x <- runif(500)
py <- rpois(500, exp(1+2*x))
summary(myp <- glm(py $\newtilde$ x, family=poisson))
The result displayed is identical to that for the first version. Although it is nearly always
possible to shorten code, even though it works it may be challenging to decipher: The
problem is that when you come back to the code later, you may find it difficult to interpret.
It is better practice to write your code in a clear and understandable manner, including
comments throughout to identify exactly what the code is for.
Given the above, we now create a more complex synthetic data set. The goal is to
demonstrate how synthetic binary and continuous predictors can be used to simulate real
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
138 GLMs Part II – Count Models
x12 x2
80
500
400 60
300
Count 40
200
20
100
0 0
–0.5 0.0 0.5 1.0 1.5 –2 0 2 4
t
Value
Figure 6.2 Binary and continuous predictors for the Poisson model.
data. In this case we format the synthetic data so that it has two predictors: a binary
predictor x12 and a random normal continuous predictor x2, as shown in Figure 6.2.
Code to generate the data is given below:
Code 6.2 Synthetic Poisson data and model in R: binary and continuous predictors.
==================================================
set.seed(18472)
nobs <- 750
x1_2 <- rbinom(nobs,size=1,prob=0.7) # 70% 1’s; 30% 0’s
x2 <- rnorm(nobs,0,1)
xb <- 1 - 1.5*x1_2 - 3.5*x2
exb <- exp(xb)
py <- rpois(nobs, exb)
pois <- data.frame(py, x1_2, x2)
poi <- glm(py ~˜ x1_2 + x2, family=poisson, data=pois)
summary(poi)
==================================================
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.003426 0.010634 94.36 <2e-16 ***
x1_2 -1.507078 0.004833 -311.81 <2e-16 ***
x2 -3.500726 0.004228 -828.00 <2e-16 ***
The resulting model values are quite close to what we specified; considering that the
data has only 750 observations, they are in fact very close. The coefficients we have for
this data are what we use for comparison with a Bayesian model of the data. The data
is not identified with the values we initially specified, but rather with the coefficients and
parameter values actually created in the sampling algorithm we used. If we employed larger
data sets, the values would be closer to what we specified.
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
139 6.1 Bayesian Poisson Models
An important distributional assumption of the Poisson model is that the mean and vari-
ance of the count variable being modeled are the same, or at least almost the same. Various
tests have been designed to determine whether this criterion has been met. Recall that it
is not sufficient simply to tabulate the raw count variable; rather, it must be tabulated as
adjusted by its predictors. This is not easy to do with models on real data, but is not difficult
with synthetic data.
An easy way to determine whether the variance is greater than the mean is to check the
Poisson model dispersion statistic. If the statistic is greater than 1.0, the model is said to be
overdispersed, that is, there is more variability in the data than is allowed by the Poisson
assumption (variance > mean). If the dispersion statistic is less than 1.0, the model is
said to be underdispersed (mean > variance). Most real count data is overdispersed, which
biases the model’s standard errors: a predictor may appear to contribute significantly to
the model when in fact it does not. The greater the dispersion, the greater the bias. Such
a Poisson model should not be relied on to provide meaningful parameter estimates. The
standard way to model Poisson overdispersed data is by using a negative binomial model,
which has an extra dispersion parameter that is intended to adjust for the overdispersion.
We discuss the negative binomial model in the following section.
Underdispersion is fairly rare with real data, but it does happen when by far the
largest number of observations in the data have low count values. A negative binomial
model cannot be used with Poisson underdispersed data; however, underdispersion may
be adjusted for by using a generalized Poisson model. The generalized Poisson can also
model Poisson overdispersed data. We shall discuss the generalized Poisson later in this
chapter.
It is probably wise to use a Poisson model first when modeling count data, i.e., a model
with a count response variable. Then one checks the Pearson dispersion statistic to deter-
mine whether the model is under- or overdispersed. If the dispersion statistic is close to 1
then a Poisson model is appropriate. A boundary likelihood ratio test when one is using
a negative binomial model can provide a p-value to determine whether a count model is
Poisson. We shall deal with that as well, in the next section.
When using R’s glm function for a Poisson model, the output does not include the dis-
persion statistic. The glm function in the major commercial packages provides a dispersion
statistic as part of the default output. It is easy, however, to calculate the dispersion statistic
ourselves when using R. The dispersion statistic is the ratio of the Pearson χ 2 statistic and
the model’s residual degrees of freedom. We can calculate the Pearson χ 2 statistic as the
sum of squared Pearson residuals. The number of residual degrees of freedom is calculated
as the number of observations in the model less the number of parameters, which includes
the intercept. For the above model the dispersion statistic may be determined using the
following code:
pr <- resid(poi, type="pearson")
N <- nrow(pois)
p <- length(coef(poi))
sum(pr^2) / (N - p)
[1] 1.071476
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
140 GLMs Part II – Count Models
The COUNT package on CRAN has a function that automatically calculates the Pearson
χ 2 statistic and dispersion following glm. It is called P__disp() (two underscores). It can
be used on our data as
library(COUNT)
P__disp(poi)
pearson.chi2 dispersion
800.392699 1.071476
We can also direct the glm function to display a Pearson dispersion characteristic by
using the following code:
The output, which is identical to a quasipoisson model owing to the scaling of standard
errors by the dispersion, displays the following relevant line:
(Dispersion parameter for poisson family taken to be 1.071476)
As an aside, recall that in the section on Bernoulli logistic models we mentioned the scal-
ing of standard errors (in maximum likelihood models) or of standard deviation statistics
(in Bayesian posterior distributions). Scaling is used perhaps more with Poisson and neg-
ative binomial models than with any other model. If there is evidence of extra-dispersion
in the otherwise Poisson data, and we cannot identify its source, many statisticians scale
standard errors by multiplying the standard error by the square root of the dispersion statis-
tic. It is important for count models to base the dispersion on the Pearson χ 2 statistic and
not on the deviance, which would bias the true dispersion upward. The standard advice
to use robust or sandwich-adjusted standard errors for all count models is only applica-
ble for maximum likelihood estimated models. Scaling may be used on Bayesian Poisson
models, but we advise against it. If the source of Poisson overdispersion is unknown then
we advise employing a negative binomial model, as discussed in the following section. If
the Poisson data is underdispersed, and the source of the underdispersion is unknown, we
advise using a generalized Poisson model, which we address later in this chapter. In R,
one may scale maximum likelihood Poisson standard errors by using the glm family option
quasipoisson. This family consists of nothing more than scaling, although the glm model
documentation does not mention this.
It is also instructive to check the range of the count variable in our model by using the
summary function. Here we find that values of the response range from 0 to 63920, the
median and mean values being quite different. Because the mean is so much greater than
the median, we know that the data is likely to be highly skewed to the right, which is not
untypical of count data.
summary(py)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.0 1.0 307.8 11.0 63920.0
A tabulation of the counts allows us to determine whether there are any counts or
sections of counts without values, thus indicating the possible need for a two-part model.
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
141 6.1 Bayesian Poisson Models
200
50
10
5
2
1
t
0 7 15 25 34 45 62 85 126 176 301 856 3782
It is important to check a tabulation table of counts to determine whether there are zero
counts and, if so, whether there are more than are allowed on the basis of the Poisson
distribution:
barplot(table(py),log="y",col="gray75")
μ0i × e−μi
P(Yi = 0|μi ) = = e−μi ; (6.3)
0!
thus the probability of zeros for the data set above can be estimated with the following
command:
P_zeros = mean(exp(-xb))
P_zeros
[1] 0.4461361
We convert this value to zero counts with the data we are modeling:
There are in fact sum(py==0) = 329 zero counts in py, which is very close to the
expected value. In cases where a Poisson model expects many fewer zeros than are actu-
ally observed in the data, this might indicate a problem when modeling the data as Poisson;
then it may be better to use a zero-inflated model or a two-part hurdle model on this data,
as will be discussed later in this book.
Having excessive zero counts in the response variable being modeled nearly always leads
to overdispersion. The primary way in which statisticians model overdispersed Poisson
data is by using a negative binomial, a hurdle, or a zero-inflated model in place of Poisson.
This is true for Bayesian models as well. Recent work on which model bests fits Poisson
overdispersed data has been leaning towards the use of a negative binomial or hurdle model,
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
142 GLMs Part II – Count Models
Observed Predicted
100
50
Frequency
0
–50
–100
0 10 20 30 40 50
t
Count
Figure 6.4 Observed vs. predicted count frequency per cell for the Poisson model.
but it really depends on the source of the overdispersion. There is no single catch-all model
to use for overdispersed count data. However, employing the Bayesian negative binomial
model or other advanced Bayesian count models introduced in this volume greatly expands
our ability to model overdispersed count data appropriately.
We can add that there are still other models and adjustments that can be made in case of
over- or underdispersion. The remaining models discussed in this chapter can be used for
this purpose.
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
143 6.1 Bayesian Poisson Models
The parameter means and standard deviations are in fact nearly identical to the Poisson
regression coefficients and standard errors. The plot function generated with this code will
display histograms that appear normally distributed, as we expect, with the distributional
mode centered over the estimated parameter mean. The trace plots can confirm that there
are no evident problems with convergence. The output plots are not shown here but the
reader can find them in the online material.
sink("Poi.txt")
cat("
model{
for (i in 1:K) {beta[i] ˜ dnorm(0, 0.0001)}
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
144 GLMs Part II – Count Models
for (i in 1:N) {
Y[i] ˜ dpois(mu[i])
log(mu[i]) <- inprod(beta[], X[i,])
}
}",fill = TRUE)
sink()
The JAGS results are nearly identical to those for the maximum likelihood (GLM) and
MCMCpoisson models run on the data. We used only 1000 sampling iterations to deter-
mine the mean and standard deviation values. Four thousand samples were run but were
discarded as burn-in samples.
# Data
np.random.seed(2016) # set seed to replicate example
nobs = 1000 # number of obs in model
x = uniform.rvs(size=nobs)
xb = 1 + 2 * x # linear predictor
py = poisson.rvs(np.exp(xb)) # create y as adjusted
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
145 6.1 Bayesian Poisson Models
# Fit
# Build model
myp = sm.GLM(py, X, family=sm.families.Poisson())
# Output
print(res.summary())
==================================================
Generalized Linear Model Regression Results
===============================================================
Dep. Variable: y No. Observations: 1000
Model: GLM Df Residuals: 998
Model Family: Poisson Df Model: 1
Link Function: log Scale: 1.0
Method: IRLS Log-Likelihood: -2378.8
Date: Tue, 21 Jun 2016 Deviance: 985.87
Time: 12:05:03 Pearson chi2: 939.0
No. Iterations: 8
===============================================================
coeff std err z P>|z| [95.0% Conf. Int.]
const 0.9921 0.029 34.332 0.000 0.935 1.049
x1 2.0108 0.041 48.899 0.000 1.930 2.091
# Data
np.random.seed(18472) # set seed to replicate example
nobs = 750 # number of obs in model
my_data = {}
my_data[’x1_2’] = x1_2
my_data[’x2’] = x2
my_data[’py’] = py
# Build model
myp = smf.glm(’py ˜ x1_2 + x2’, data=my_data, family=sm.families.Poisson())
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
146 GLMs Part II – Count Models
# Output
print(res.summary())
==================================================
Generalized Linear Model Regression Results
=======================================================================
Dep. Variable: py No. Observations: 750
Model: GLM Df Residuals: 747
Model Family: Poisson Df Model: 2
Link Function: log Scale: 1.0
Method: IRLS Log-Likelihood: -1171.0
Date: Tue, 21 Jun 2016 Deviance: 556.26
Time: 14:24:33 Pearson chi2: 643.0
No. Iterations: 12
=======================================================================
coeff std err z P>|z| [95.0% Conf. Int.]
Intercept 1.0021 0.012 83.174 0.000 0.979 1.026
x1_2 -1.5003 0.006 -252.682 0.000 -1.512 -1.489
x2 -3.5004 0.004 -796.736 0.000 -3.509 -3.492
# Data
np.random.seed(18472) # set seed to replicate example
nobs = 750 # number of obs in model
# Fit
niter = 10000 # parameters for MCMC
# define likelihood
mu = np.exp(beta0 + beta1 * x1_2 + beta2 * x2)
y_obs = pm.Poisson(’y_obs’, mu, observed=py)
# inference
start = pm.find_MAP() # Find starting value by
optimization
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
147 6.1 Bayesian Poisson Models
step = pm.NUTS()
trace = pm.sample(niter, step, start, progressbar=True)
# Output
pm.summary(trace)
==================================================
beta0:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
1.002 0.027 0.000 [0.979, 1.025]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
0.979 0.994 1.002 1.010 1.026
beta1:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
-1.495 0.082 0.005 [-1.512, -1.489]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
-1.512 -1.504 -1.500 -1.496 -1.488
beta2:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
-3.498 0.079 0.003 [-3.509, -3.492]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
-3.509 -3.503 -3.500 -3.497 -3.492
# Data
np.random.seed(18472) # set seed to replicate example
nobs = 750 # number of obs in model
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
148 GLMs Part II – Count Models
X = sm.add_constant(np.column_stack((x1_2, x2)))
#Fit
stan_code = """
data{
int N;
int K;
matrix[N,K] X;
int Y[N];
}
parameters{
vector[K] beta;
}
model{
Y ˜ poisson_log(X * beta);
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=mydata, iter=5000, chains=3,
warmup=4000, n_jobs=3)
# Output
print(fit)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] 1.0 5.6e-4 0.01 0.98 0.99 1.0 1.01 1.03 508.0 1.0
beta[1] -1.5 2.4e-4 5.9e-3 -1.51 -1.5 -1.5 -1.5 -1.49 585.0 1.0
beta[2] -3.5 2.0e-4 4.6e-3 -3.51 -3.5 -3.5 -3.5 -3.49 505.0 1.0
The negative binomial distribution is generally thought of as a mixture of the Poisson and
gamma distributions. A model based on the negative binomial distribution is commonly
regarded as a Poisson–gamma mixture model. However, the negative binomial distribu-
tion has been derived in a variety of ways. Hilbe (2011) accounts for some 13 of these
methods, which include derivations as the probability of observing y failures before the
rth success in a series of Bernoulli trials, as a series of logarithmic distributions, as a pop-
ulation growth model, and from a geometric series. Nevertheless, the now standard way
of understanding the negative binomial model is as a Poisson–gamma mixture, in which
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
149 6.2 Bayesian Negative Binomial Models
the gamma scale parameter is used to adjust for Poisson overdispersion. The variance of a
Poisson distribution (and model) is μ, and the variance of a two-parameter gamma model
is μ2 /ν. The variance of the negative binomial, as a Poisson–gamma mixture model, is
therefore μ+μ2 /ν. In this sense ν can be regarded as a dispersion parameter. However, the
majority of statisticians have preferred to define the negative binomial dispersion param-
eter as α, where α = 1/ν. Parameterizing the dispersion as α guarantees that there is a
direct relationship between the mean and the amount of overdispersion, or correlation, in
the data. The greater the overdispersion, the greater the value of the dispersion parameter.
The direct form of the variance is μ + αμ2 .
Keep in mind that there is a difference between the Poisson dispersion statistic discussed
in the previous section and the negative binomial dispersion parameter. The dispersion
statistic is the ratio of the Poisson model Pearson χ 2 statistic and the residual degrees of
freedom. If the data is overdispersed, the dispersion statistic is greater than 1. The more
overdispersion in the data to be modeled, the higher the value of the dispersion statistic.
Values of the dispersion statistic less than 1 indicate underdispersion in the data.
As an aside, note that there are times when the Poisson dispersion statistic is greater
than 1, yet the data is in fact equidispersed. This can occur when the data is apparently
overdispersed. If transforming the predictor data, by for example logging or squaring a
predictor, by adding a needed variable, by employing an interaction term, or by adjusting
for outliers, changes the dispersion to approximately 1 then the data is not truly overdis-
persed – it is only apparently overdispersed and can be modeled using a Poisson model.
Thus one should make certain that the model is adjusted for possible apparent overdisper-
sion rather than concluding that it is truly overdispersed on the sole basis of a first look at
the dispersion statistic.
Another item to keep in mind is that the negative binomial model itself has a dispersion
statistic. It is an indicator of extra correlation or dispersion in the data over what we expect
of a negative binomial model. The same corrective operations that we suggested for ame-
liorating apparent overdispersion for Poisson models can be used on a negative binomial
model. But if the extra-dispersion persists, a three-parameter generalized negative bino-
mial model may be used to adjust for this in a negative binomial model. Another tactic is
to employ an alternative count model such as a generalized Poisson or a Poisson inverse
Gaussian model. Sometimes these models fit the data better than either a Poisson or nega-
tive binomial. In any case, nearly every reference to overdispersion relates to the Poisson
model. Most of the other more advanced count models that we discuss in this volume are
different methods to adjust for the possible causes of overdispersion in a Poisson model.
We should also mention that all major general-purpose commercial statistical packages
parameterize the negative binomial variance as having a direct relationship to the mean (or
variance) and to the amount of dispersion or correlation in the data. Moreover, specialized
econometric software such as Limdep and XploRe also employs a direct relationship. How-
ever, R’s glm and glm.nb functions incorporate an indirect relationship between the mean
and dispersion. The greater overdispersion in the data, the higher the dispersion statistic and
lower the dispersion parameter. Using the direct relationship results in a negative binomial
with α = 0, the same as for a Poisson model. Since a true Poisson model has an identical
mean and variance, it makes sense that an α value of 0 is Poisson. The greater the dispersion
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
150 GLMs Part II – Count Models
is above 0, the more dispersion over the basic Poisson model is indicated. For an indirect
parameterization, with θ = 1/α, as θ approaches 0 there is infinite overdispersion, i.e., as
the overdispersion reduces to Poisson, the dispersion in terms of θ approaches infinity.
For maximum likelihood models, the coefficients and standard errors are the same for
both direct and indirect parameterizations. Likewise, they are the same when the negative
binomial is modeled using Bayesian methodology. The difference in results between the
direct and indirect parameterizations rests with the dispersion parameter and Pearson-based
residuals and fit statistics. We warn readers to be aware of how a particular negative bino-
mial function parameterizes the dispersion parameter when attempting to interpret results
or when comparing results with models produced using other software.
It should be noted that the negative binomial functions in R’s gamlss and COUNT pack-
ages are direct parameterizations, and their default results differ from glm and glm.nb. The
R nbinomial function in the COUNT package (see the book Hilbe and Robinson 2013)
has a default direct parameterization, but also an option to change results to an indirect
parameterization. The nbinomial function also has an option to parameterize the dispersion
parameter, providing predictor coefficients influencing the dispersion. The value of this is
to show which predictors most influence overdispersion in the data.
In the subsection on the negative binomial model using JAGS below, we provide both
parameterizations of the negative binomial. The negative binomial PDF for the direct
parameterization is displayed in Equation 6.4 below. The PDF for an indirect parameteri-
zation can be obtained by inverting each α in the right-hand side of the equation. Again,
the indirect dispersion parameter θ is defined as θ = 1/α. We use theta (θ ) as the name of
the indirect dispersion because of R’s glm and glm.nb use of the symbol.
The negative binomial probability distribution function can be expressed as terms of α as
1/α
y
y + 1/α − 1 1 αμ
f (y; μ, α) = . (6.4)
1/α − 1 1 + αμ 1 + αμ
The negative binomial log-likelihood is therefore
n
αμi 1
L (μ; y, α) = yi log − log (1 + αμi )
1 + αμi α
i=1
1 1
+ log yi + − log (yi + 1) − log . (6.5)
α α
The terms on the right-hand side of the above log-likelihood may be configured into
other form. The form here is that of the exponential family. The canonical link is therefore
log(αμ/(1+αμ)), and the variance, which is the second derivative of the cumulant log(1+
αμ)/α with respect to the link, is μ + αμ2 . Figure 6.5 gives a comparison between two
data sets with similar means but one with samples from a negative binomial distribution
and the other with a Poisson distribution. Note how the negative binomial data has a much
wider spread.
The canonical form of the distribution is not used as the basis of the traditional negative
binomial model. To be used as a means to adjust for Poisson overdispersion, the link is
converted to the same link as that of the Poisson model, log(μ). With respect to the linear
predictor xβ the mean is defined as μ = exp(xβ).
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
151 6.2 Bayesian Negative Binomial Models
150
Frequency
100
50
0 20 40
t
x
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
152 GLMs Part II – Count Models
The model may be estimated assuming a direct relationship between the mean and any
overdispersion that may exist in the data. We use the nbinomial function from the COUNT
package, which is available on CRAN. The model is given as
Deviance Residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.58600 -1.12400 -0.79030 -0.48950 0.06517 3.04800
Pearson Residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.539500 -0.510700 -0.445600 0.000435 0.067810 10.170000
Note that the Pearson and deviance residuals are both provided in the results. The Pear-
son χ 2 and the negative binomial dispersion statistic are also provided. A negative binomial
dispersion statistic value of 0.9894685 is close enough to 1 that it may be concluded that the
model is well fitted as a negative binomial. We expect this result since the model is based
on true synthetic negative binomial data with these specified parameters. The dispersion
parameter is given as 3.38, which is the inverse of the θ dispersion parameter displayed
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
153 6.2 Bayesian Negative Binomial Models
in glm.nb above. Notice also that the 95% confidence intervals come with the output by
default. This is not the case with glm or glm.nb. Values of zero are not within the range
of the confidence intervals, indicating that there is prima facie evidence that the predictors
are significant.
We recommend that all frequentist-based count models be run using robust or sandwich
standard errors. This is an adjustment to the standard errors that attempts to adjust standard
errors to values that would be the case if there were no overdispersion in the data. This
is similar to scaling the standard errors, which is done by multiplying the standard errors
by the square root of the dispersion statistic. For both methods of adjustment, if the data
is not over- (or under-) dispersed then the robust and scaled standard error reduces to a
standard-model standard error. It is not something that Bayesian statisticians or analysts
need to be concerned with, however. But it is wise to keep this in mind when modeling
frequentist count models. It is a way to adjust for extra-dispersion in the data. Moreover,
before waiting a long time while modeling data with a large number of observations and
parameters, it is wise first to run a maximum likelihood model on the data if possible.
Recent literature in the area has recommended the negative binomial as a general-
purpose model to adjust for Poisson overdispersion. However, the same problem with zeros
exists for the negative binomial as for the Poisson. It may be that the data has too many
zeros for either a Poisson or a negative binomial model. The formula for calculating the
probability and expected number of zero counts for a negative binomial model is
1/α
0
0 + 1/α − 1 1 αμ
P(Y = 0|μ) = = (1 + αμ)−1/α . (6.6)
1/α − 1 1 + αμ 1 + αμ
We can estimate the expected mean number of zeros for our data as follows:
library(COUNT)
P_ _disp(nb2)
pearson.chi2 dispersion
2469.7136566 0.9890723 # NB dispersion statistic
barplot(table(nby),log="y",col="gray")
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
154 GLMs Part II – Count Models
500
50
10
5
t
0 4 8 13 19 25 31 37 43 49 59 67 75 82 98 113 135
In this case the predicted zero count for the negative binomial model is 1137 and the
observed number of zeros is 1167, which again is pretty close as we should expect for
synthetic data. Note that the number of zeros expected from a Poisson distribution with a
similar mean would be
prpois <- mean(exp(-exb))
cntpois <- 2500 * prpois
cntpois
[1] 287.1995
This is much lower than is allowed by the negative binomial distribution. We should com-
pare the range of observed versus predicted counts to inspect the fit, and check for places
in the distributions where they vary. A fit test can be designed to quantify the fit. We shall
discuss such a test later in the book.
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
155 6.2 Bayesian Negative Binomial Models
Y[i] ˜ dpois(g[i])
g[i] ˜ dgamma(alpha, rateParm[i])
The first model can be converted to a directly parameterized model by substituting the
code provided in Code 6.12. Note also that the code can easily be used to model other data:
simply amend these lines to specify different predictors, a different response variable, and
model name.
library(R2jags)
# Attach(negbml)
X <- model.matrix(˜ x1_2 + x2)
K <- ncol(X)
sink("NBGLM.txt")
cat("
model{
# Priors for coefficients
for (i in 1:K) { beta[i] ˜ dnorm(0, 0.0001)}
# Prior for dispersion
theta ˜ dunif(0.001, 5)
# Likelihood function
for (i in 1:N){
Y[i] ˜ dpois(g[i])
g[i] ˜ dgamma(theta, rateParm[i])
rateParm[i] <- theta / mu[i]
log(mu[i]) <- eta[i]
eta[i] <- inprod(beta[], X[i,])
}
}
",fill = TRUE)
sink()
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
156 GLMs Part II – Count Models
The results clearly approximate the parameter values we expected on the basis of the
data set used for the model.
The above code produces an indirectly parameterized negative binomial. To change it to
a direct parameterization insert Code 6.12 below in place of the log-likelihood code given
in Code 6.11 above.
The resulting output is the same as that for the indirect dispersion, except that the line for
alpha now reads
alpha 3.306 0.084 3.141 3.479 1.007 930
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
157 6.2 Bayesian Negative Binomial Models
Code 6.13 below provides a directly parameterized negative binomial; but, instead of
mixing the Poisson and gamma distributions, we use the dnegbin function. Internally this
function does exactly the same thing as the two lines in Code 6.12 that mix distributions.
Code 6.13 Negative binomial: direct parameterization using JAGS and dnegbin.
==================================================
library(R2jags)
# Likelihood function
for (i in 1:N){
Y[i] ˜ dnegbin(p[i], 1/alpha) # for indirect, (p[i], alpha)
p[i] <- 1/(1 + alpha*mu[i])
log(mu[i]) <- eta[i]
eta[i] <- inprod(beta[], X[i,])
}}
",fill = TRUE)
sink()
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
158 GLMs Part II – Count Models
alpha beta[1]
3.8
1.2
3.6
1.0
3.4
0.8
3.2
3.0
t
MCMC iterations
Figure 6.7 MCMC chains for the three model parameters β1 , β2 , β3 and for α, for the negative binomial model.
To change the above direct parameterization we simply amend 1/alpha to alpha in the
line with the comment # for indirect, (p[i], alpha). To display the chains for each
parameter (Figure 6.7) one can type:
source("CH-Figures.R")
out <- NB3$BUGSoutput
MyBUGSChains(out,c(uNames("beta",K),"alpha"))
Use the following code to display the histograms of the model parameters (Figure 6.8):
Finally, we will use the zero trick with a negative binomial to model the same data.
The zero-trick method is very useful if the analyst knows only the log-likelihood of the
model.
The code provides a direct parameterization of the dispersion parameter. The downside
of using the zero trick when modeling is that convergence usually takes longer to achieve.
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
159 6.2 Bayesian Negative Binomial Models
beta[2] beta[3]
0
0
1.7 1.8 1.9 2.0 2.1 2.2 2.3 –2.0 –1.8 –1.6 –1.4 –1.2
alpha beta[1]
300
0
0
t
Posterior distribution
Figure 6.8 Histogram of the MCMC iterations for each parameter. The thick line at the base of each histogram represents the 95%
credible interval. Note that no 95% credible interval contains 0.
Code 6.14 Negative binomial with zero trick using JAGS directly.
==================================================
library(R2jags)
sink("NB0.txt")
cat("
model{
# Priors regression parameters
for (i in 1:K) { beta[i] ˜ dnorm(0, 0.0001)}
# Prior for alpha
numS ˜ dnorm(0, 0.0016)
denomS ˜ dnorm(0, 1)
alpha <- abs(numS / denomS)
C <- 10000
for (i in 1:N) {
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
160 GLMs Part II – Count Models
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
161 6.2 Bayesian Negative Binomial Models
The parameter values that have been determined using the above three Bayesian neg-
ative binomial algorithms are all nearly the same. Seeds were not given to the runs to
demonstrate the sameness of the results. Moreover, each algorithm can easily be amended
to allow for either a direct or indirect parameterization of the dispersion. Under a direct
parameterization alpha = 3.39–3.41; under an indirect parameterization, alpha = 0.29–
0.31. Depending on the parameters and priors given to the sampling algorithm as well
as the burn-in number, thinning rate, and so forth, the range of values for alpha can be
considerably wider.
When modeling count data that has evidence of excessive correlation, or overdispersion,
analysts should employ a negative binomial model to determine parameter values and the
associated standard deviation and credible intervals. We recommend that a direct param-
eterization be given between the mean and the value given to the dispersion parameter. It
is easy to run a Bayesian negative binomial model with an indirect parameterization of
the dispersion parameter, but doing so appears to be contrary to how statisticians normally
think of variability. See Hilbe (2014) for an in-depth evaluation of this discussion and its
consequences.
A negative binomial is appropriate for use with overdispersed Poisson data. In the
following section we turn to a model that can be used for either underdispersed or overdis-
persed count data – the generalized Poisson model. First, however, we look at a Python
implementation of the negative binomial.
# Data
np.random.seed(141) # set seed to replicate example
nobs = 2500 # number of obs in model
theta = 0.303
xb = 1 + 2 * x1 - 1.5 * x2 # linear predictor
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
162 GLMs Part II – Count Models
exb = np.exp(xb)
nby = nbinom.rvs(exb, theta)
# Fit
niter = 10000 # parameters for MCMC
# Define likelihood
linp = beta0 + beta1 * x1 + beta2 * x2
mu = np.exp(linp)
mu2 = mu * (1 - theta)/theta # compensate for difference between
# parameterizations from pymc3 and scipy
y_obs = pm.NegativeBinomial(’y_obs’, mu2, theta, observed=nby)
# Inference
start = pm.find_MAP() # find starting value by optimization
step = pm.NUTS()
trace = pm.sample(niter, step, start, progressbar=True)
beta0:
Mean SD MC Error 95% HPD interval
----------------------------------------------------------------------
1.020 0.089 0.002 [0.846, 1.193]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|================|=================|-----------|
0.849 0.960 1.017 1.080 1.197
beta1:
Mean SD MC Error 95% HPD interval
------------------------------------------------------------------------
1.989 0.078 0.001 [1.833, 2.138]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|================|=================|-----------|
1.837 1.936 1.989 2.041 2.143
beta2:
Mean SD MC Error 95% HPD interval
----------------------------------------------------------------------
-1.516 0.130 0.002 [-1.769, -1.256]
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
163 6.2 Bayesian Negative Binomial Models
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|================|=================|-----------|
-1.777 -1.603 -1.512 -1.428 -1.261
# Data
np.random.seed(141) # set seed to replicate example
nobs = 2500 # number of obs in model
theta = 0.303
X = sm.add_constant(np.column_stack((x1, x2)))
beta = [1.0, 2.0, -1.5]
xb = np.dot(X, beta) # linear predictor
exb = np.exp(xb)
nby = nbinom.rvs(exb, theta)
# Fit
stan_code = """
data{
int N;
int K;
matrix[N,K] X;
int Y[N];
}
parameters{
vector[K] beta;
real<lower=0, upper=5> alpha;
}
transformed parameters{
vector[N] mu;
mu = exp(X * beta);
}
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
164 GLMs Part II – Count Models
model{
for (i in 1:K) beta[i] ˜ normal(0, 100);
Y ˜ neg_binomial(mu, alpha);
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=mydata, iter=10000, chains=3,
warmup=5000, n_jobs=3)
# Output
nlines = 9 # number of lines in screen output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] 1.05 8.8e-4 0.05 0.95 1.02 1.05 1.08 1.14 3181.0 1.0
beta[1] 1.97 5.0e-4 0.03 1.91 1.95 1.97 1.99 2.03 3692.0 1.0
beta[2] -1.52 5.3e-4 0.03 -1.58 -1.54 -1.52 -1.5 -1.45 3725.0 1.0
alpha 0.44 2.3e-4 0.02 0.41 0.43 0.44 0.46 0.48 7339.0 1.0
There are two foremost parameterizations of the generalized Poisson model. Both
parameterizations allow for the modeling of both overdispersed as well as underdispersed
count data. The first parameterization was developed by Consul and Famoye (1992) and
is the method used in current software implementations of the model. The second param-
eterization was developed by Famoye and Singh (2006). The models are built upon the
generalized Poisson distribution. The model was recently given a complete chapter in Hilbe
(2014), to which we refer the interested reader. The approach is maximum likelihood but
the logic of the model is the same regardless of its method of estimation. This is the only
discussion of the Bayesian generalized Poisson of which we are aware, but it can be a
powerful tool in modeling both under- and overdispersed count data.
The use of either parameterization results in the same parameter estimates. The
generalized Poisson model has an extra dispersion parameter, similar to that of the negative
binomial model. The difference is, however, that it can have both positive and negative
values. If the data being modeled is overdispersed, the generalized Poisson model disper-
sion parameter will be positive. If the data is underdispersed, the dispersion parameter
is negative. Negative binomial models can only model overdispersed data. If the data is
equidispersed (Poisson) then the dispersion parameter has the value zero, or approximately
zero. We note that frequently the generalized Poisson model fits overdispersed data better
than does a negative binomial model. Unfortunately most researchers have not had access
to the generalized Poisson for use in their modeling endeavors. There are only a very few
software implementations of the model. Bayesian generalized Poisson model software has
not appeared before.
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
165 6.3 Bayesian Generalized Poisson Model
The probability distribution and log-likelihood for the generalized Poisson can be
expressed as
μi (μi + δyi )yi −1 e−μi −δyi
f (yi ; μ, δ) = yi = 0, 1, 2, . . . (6.7)
yi !
where μ > 0, −1 ≤ δ ≤ 1,and
x1 <- runif(nobs)
xb <- 1 + 3.5 * x1
exb <- exp(xb)
delta <- -0.3
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
166 GLMs Part II – Count Models
sink("GP1reg.txt")
cat("
model{
# Priors beta
for (i in 1:K) { beta[i] ˜ dnorm(0, 0.0001)}
C <- 10000
for (i in 1:N){
Zeros[i] ˜ dpois(Zeros.mean[i])
Zeros.mean[i] <- -L[i] + C
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
167 6.3 Bayesian Generalized Poisson Model
",fill = TRUE)
sink()
The results are very close to the values specified in setting up the synthetic data. Again,
this data could not have been modeled using a negative binomial since it is underdispersed.
We know this is due to the negative value for delta. Of course, we set up the data to
be underdispersed, but in real data situations it is unlikely that we would know this in
advance unless it has been tested using a Poisson model. If the Pearson dispersion statistic
from a Poisson model is less than 1.0, the model is very likely to be underdispersed. We
always advise readers to first model the count data using a Poisson model and checking
its dispersion statistic (not its parameter). Keep in mind, though, that generalized Poisson
data may also be apparently overdispersed or underdispersed. We may be able to amelio-
rate apparent extra-dispersion by performing the operations on the data that we detailed
when discussing the apparently extra-dispersed negative binomial model in the previous
section.
We need to give a caveat regarding the parameterization of the dispersion parameter,
delta. When the data is underdispersed, take care to be sure that the parameterization,
given the dispersion statistic in the model code, is correct. The generalized Poisson disper-
sion needs to be parameterized with a hyperbolic tangent function or equivalent in order to
produce negative values of delta when the data is underdispersed. Some R software with
which we are familiar parameterizes the dispersion using an exponential function instead,
which does not allow negative values for delta in the case of underdispersed data. However,
the maximum likelihood vglm function in the VGAM package provides both positive- and
negative-valued dispersion parameters.
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
168 GLMs Part II – Count Models
def sign(delta):
"""Returns a pair of vectors to set sign on
generalized Poisson distribution. """
if delta > 0:
value = delta
sig = 1.5
else:
value = abs(delta)
sig = 0.5
class gpoisson(rv_discrete):
"""Generalized Poisson distribution."""
def _pmf(self, n, mu, delta, sig):
if sig < 1.0:
delta1 = -delta
else:
delta1 = delta
term1 = mu * ((mu + delta1 * n) ** (n - 1))
term2 = np.exp(-mu- n * delta1) / factorial(n)
return term1 * term2
# Data
np.random.seed(160) # set seed to replicate example
nobs = 1000 # number of obs in model
x1 = uniform.rvs(size=nobs)
xb = 1.0 + 3.5 * x1 # linear predictor
delta = -0.3
exb = np.exp(xb)
# Fit
stan_code = """
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
169 6.4 Bayesian Zero-Truncated Models
data{
int N;
int K;
matrix[N, K] X;
int Y[N];
}
parameters{
vector[K] beta;
mu = exp(X * beta);
}
model{
vector[N] l1;
vector[N] l2;
vector[N] LL;
for (i in 1:N){
l1[i] = log(mu[i]) + (Y[i] - 1) * log(mu[i] + delta * Y[i]);
l2[i] = mu[i] + delta * Y[i] + lgamma(Y[i] + 1);
LL[i] = l1[i] - l2[i];
}
target += LL;
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=mydata, iter=5000, chains=3,
warmup=4000, n_jobs=3)
# Output
nlines = 8 # number of lines in screen output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] 1.0 1.4e-3 0.03 0.93 0.98 1.0 1.02 1.06 520.0 1.0
beta[1] 3.51 1.1e-3 0.03 3.46 3.5 3.51 3.53 3.56 564.0 1.0
delta -0.31 1.3e-3 0.03 -0.37 -0.33 -0.31 -0.29 -0.25 549.0 1.0
Zero-truncated models typically refer to count models for which the count response vari-
able cannot include zero counts. For instance, in health care, the length-of-hospital-stay
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
170 GLMs Part II – Count Models
data begin from day one upwards. There are no zero lengths of stay when entering the
hospital. Astronomical research also has many instances of this type of data. It is therefore
important to recognize when Bayesian zero-truncated models should be used in place of
standard Bayesian models. In particular, we shall discuss Bayesian zero-truncated Poisson
(ZTP) and zero-truncated negative binomial (ZTNB) models. These are by far the most
popular truncated model types in the statistical literature.
n
L (β; y) = {yi ln(xi β) − exp(xi β) − ln (yi + 1) − ln[1 − exp(− exp(xi β))]} .
i=1
(6.10)
Not all zero-truncated data need to have the underlying PDF adjusted as we discussed
above. If the mean of the count response variable being modeled is greater than 5 or 6, it
makes little difference to the Poisson PDF if there can be no zeros in the variable. But if
the mean is less than 6, for example, it can make a difference.
Let us take as an example some count data which we desire to model as Poisson. Suppose
that the mean of the count response is 3. Now, given that the probability of a Poisson zero
count is exp(−mean) we can calculate, for a specific mean, how much of the underlying
Poisson probability and log-likelihood is lost by ignoring the fact that there are no zeros in
the model. We may use R to determine the percentage of zeros that we should expect for a
given mean on the basis of the Poisson PDF:
> exp(-3)
[1] 0.04978707
If there are 500 observations in the model then we expect the response to have 25 zero
counts:
> exp(-3) * 500
[1] 24.89353
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
171 6.4 Bayesian Zero-Truncated Models
Poisson ZTP
600
400
Frequency
200
0
0.0 2.5 5.0 7.5
t
x
Using a Poisson model on such data biases the results in proportion to the percentage of
zero counts expected to be modeled. The Poisson dispersion statistic will most likely differ
from 1 by a substantial amount, indicating either under- or overdispersion. Using a ZTP
model on the data should ameliorate any extra-dispersion, particularly if there is no other
cause for it in the data.
It does not take a much greater mean value than 3 to reduce the probability of zero counts
to near zero. For instance, consider a count response variable with mean 6:
> exp(-6)
[1] 0.002478752
> exp(-6) * 500
[1] 1.239376
With mean 6, we should expect that the response data has about a one-quarter per-
cent probability of having zero counts. For a 500-observation model we would expect
only one zero in the response variable being modeled. Employing a Poisson model on
the data will only result in a slight difference from a ZTP model on the data. Thus,
when mean values exceed 6, it is generally not necessary to use a zero-truncated Poisson
model.
However, and this is an important caveat, it may well be the case that a Poisson model on
the data is inappropriate at the outset. If zero-truncated data remains overdispersed in spite
of being adjusted by using a zero-truncated Poisson, it may be necessary to use a zero-
truncated negative binomial. The formula we used for the probability of a zero count given
a specific mean is only valid for a model based on the Poisson distribution. The probability
of a zero negative binomial count is quite different.
The code below provides the user with a Bayesian zero-truncated Poisson model. Note
that the zero-truncated Poisson log-likelihood uses the full Poisson log-likelihood with
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
172 GLMs Part II – Count Models
the addition of a term excluding zero counts. It is easiest to use the zero trick for such a
model.
We first set up the zero-truncated Poisson data. We will use the same data as for the
Poisson model earlier in the chapter but we add 1 to each count in py, the response count
variable. Adding 1 to the response variable of synthetic data such as we have created
is likely to result in an underdispersed Poisson model. Use of the zero-truncated model
attempts to adjust the otherwise Poisson model for the structurally absent zero counts.
We display the counts to make certain that there are no zero counts in py. We also
show that the mean of ztp is 2.43, which, as we learned before, does affect the modeling
results.
> table(ztpy)
ztpy
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 27
1014 207 80 31 22 12 9 24 12 15 9 8 8 8 5 8 6 8 5 1 1 3 2 1
30
1
> mean(ztpy)
[1] 2.43
The code listed in the table below first models the data using a standard maximum like-
lihood Poisson model. We can determine how closely the coefficients specified in our
synthetic data are identified by the Poisson model. Remember, though, that we added 1
to py after assigning the coefficient values, so we expect the Poisson model coefficients
to differ from what was originally specified in setting up the model. Moreover, no adjust-
ment is being made by the Poisson model for the absent zero counts. The fact that there
were originally 123 zeros in the data, i.e., 25 percent of the total, indicates that there will
be a considerable difference in the Poisson coefficient values and in the Bayesian ZTP
parameter means.
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
173 6.4 Bayesian Zero-Truncated Models
X = X,
K = K, # number of betas
N = nobs, # sample size
Zeros = rep(0, nobs))
ZTP<-"
model{
for (i in 1:K) {beta[i] ˜ dnorm(0, 1e-4)}
The maximum likelihood Poisson model provided the results given below. Again,
no adjustment was made for the structural absence of zero counts in the data. Since
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
174 GLMs Part II – Count Models
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.35848 0.03921 34.64 <2e-16 ***
x1 1.27367 0.03390 37.57 <2e-16 ***
x2 -1.41602 0.03724 -38.02 <2e-16 ***
x3 -0.92188 0.06031 -15.29 <2e-16 ***
> library(LOGIT)
> P_disp(poi)
pearson.chi2 dispersion
943.0343784 0.6303706
return np.array(ztp)
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
175 6.4 Bayesian Zero-Truncated Models
# Data
np.random.seed(123579) # set seed to replicate example
nobs = 3000 # number of obs in model
X = np.column_stack((x1,x2,x3))
X = sm.add_constant(X)
# Fit
stan_code = """
data{
int N;
int K;
matrix[N, K] X;
int Y[N];
}
parameters{
vector[K] beta;
}
model{
vector[N] mu;
mu = exp(X * beta);
# likelihood
for (i in 1:N) Y[i] ˜ poisson(mu[i]) T[1,];
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=mydata, iter=5000, chains=3,
warmup=4000, n_jobs=3)
# Output
print(fit)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] 1.01 1.7e-3 0.04 0.94 0.99 1.01 1.04 1.1 566.0 1.0
beta[1] 1.98 1.6e-3 0.04 1.9 1.95 1.98 2.0 2.05 593.0 1.0
beta[2] -3.05 2.6e-3 0.07 -3.19 -3.1 -3.04 -3.0 -2.9 792.0 1.0
beta[3] -1.51 2.1e-3 0.05 -1.62 -1.55 -1.51 -1.48 -1.41 659.0 1.0
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
176 GLMs Part II – Count Models
NB ZTNB
400
300
Frequency
200
100
0
0 5 10 15
t
x
Figure 6.10 For comparison, negative binomial and zero-truncated negative binomial data.
instead of exp(−μ) for calculating the probability of zero counts. For a mean of 3, we
display the expected number of zeros given α values of 0.5, 1, and 2:
The number of expected zero counts for a given mean is much higher for a negative
binomial model than for a Poisson. We advise using the negative binomial as the default
zero-truncated model since a wider range of count values is allowed, which make sense
for a zero-truncated model. The DIC should be used as a general guide for deciding which
model to use in a given modeling situation, all other considerations being equal.
The code in the following table is intended for modeling zero-truncated negative
binomial data, for which direct parameterization of the dispersion parameter is used.
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
177 6.4 Bayesian Zero-Truncated Models
Code 6.24 Zero-truncated negative binomial with 0-trick using JAGS – direct.
==================================================
require(MASS)
require(R2jags)
require(VGAM)
set.seed(123579)
nobs <- 1000
x1 <- rbinom(nobs,size=1,0.7)
x2 <- runif(nobs)
xb <- 1 + 2 * x1 - 4 * x2
exb <- exp(xb)
alpha = 5
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
178 GLMs Part II – Count Models
ZTNB1 <-
jags(data = model.data,
inits = inits,
parameters = params,
model = textConnection(ZTNB),
n.thin = 1,
n.chains = 3,
n.burnin = 2500,
n.iter = 5000)
print(ZTNB1, intervals=c(0.025, 0.975), digits=3)
==================================================
mu.vect sd.vect 2.5% 97.5% Rhat n.eff
alpha 4.619 1.198 2.936 7.580 1.032 81
beta[1] 1.099 0.199 0.664 1.445 1.016 150
beta[2] 1.823 0.131 1.564 2.074 1.004 600
beta[3] -3.957 0.199 -4.345 -3.571 1.008 290
deviance 20004596.833 2.861 20004593.165 20004603.978 1.000 1
A similar result can be achieved using Stan from Python. Code 6.25 demonstrates
how this can be implemented. As always, it is necessary to pay attention to the different
parameterizations used in Python and Stan. In this particular case, the scipy.stats.binom
function takes as input (n, p)1 and the closest Stan negative binomial parameterization takes
(α, β), where α = n and β = p/(1 − p).
Code 6.25 Zero-truncated negative binomial model in Python using Stan.
==================================================
import numpy as np
import pystan
import statsmodels.api as sm
return np.array(ztnb)
# Data
np.random.seed(123579) # set seed to replicate example
nobs = 2000 # number of obs in model
1 http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.stats.nbinom.html
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
179 6.5 Bayesian Three-Parameter NB Model (NB-P)
X = np.column_stack((x1,x2))
X = sm.add_constant(X)
# Fit
stan_code = """
data{
int N;
int K;
matrix[N, K] X;
int Y[N];
}
parameters{
vector[K] beta;
real<lower=1> alpha;
}
model{
vector[N] mu;
# Covariates transformation
mu = exp(X * beta);
# likelihood
for (i in 1:N) Y[i] ˜ neg_binomial(mu[i], 1.0/(alpha - 1.0)) T[1,];
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=mydata, iter=5000, chains=3,
warmup=2500, n_jobs=3)
# Output
nlines = 9 # number of lines in screen output
output = str(fit).split(’\n’)
for item in output[:nlines]
print(item)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] 0.96 2.3e-3 0.08 0.79 0.91 0.96 1.02 1.12 1331.0 1.0
beta[1] 2.07 1.8e-3 0.07 1.94 2.02 2.07 2.11 2.2 1439.0 1.0
beta[2] -3.96 1.6e-3 0.07 -4.09 -4.01 -3.96 -3.92 -3.83 1879.0 1.0
alpha 4.84 4.7e-3 0.19 4.49 4.71 4.84 4.97 5.22 1577.0 1.0
The three-parameter negative binomial P, or NB-P, model allows a much wider range of
fitting capability than does a two-parameter count model. Designed by William Greene
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
180 GLMs Part II – Count Models
(2008) for econometric data, the model can be of very good use in astronomy. As a two-
parameter negative binomial model can adjust for one-parameter Poisson overdispersion,
the three-parameter NB-P can adjust for two-parameter negative binomial overdispersion.
The NB-P log-likelihood is defined as
n
αμ2−P μ
L (μ; y, α, ρ) = αμ 2−P
ln + y ln
αμ2−P + μ αμ2−P + μ
i=1
− ln (1 − y) − ln (αμ ) + ln (y + αμ ) .
2−P 2−P
(6.12)
The evaluator now has four arguments: the log-likelihood, the linear predictor, the log of
dispersion parameter, and P, which appears in the factor
αμ2−P .
For the NB-P model we provide a simple data set based on the generic rnegbin random
number generator. The code does not provide a specified value for α or P. The algorithm
solves for Q, which is defined as 2 − P. It is easier to estimate Q and then convert it to
P than to estimate P. The model provides support for deciding whether an analyst should
model data using an NB1 model or an NB2 model. NB1 has a variance of μ + αμ and
NB2, the traditional negative binomial model, has a variance of μ + αμ2 . The NB-P
model parameterizes the exponent, understanding that the NB1 model can be expressed
as μ + αμ. If the NB-P model results in a Q value close to 2, the analyst will model
the data using an NB2 model. Statisticians now do not concern themselves with choosing
between NB1 and NB2 but prefer to utilize the power of using all three parameters to fit
the data better. The extra parameter can also be used to adjust for negative binomial over-
or underdispersion. For a full explanation of the model see Hilbe and Greene (2008) and
Hilbe (2011).
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
181 6.5 Bayesian Three-Parameter NB Model (NB-P)
}
",fill = TRUE)
sink()
# Inits function
inits <- function () {
list(beta = rnorm(K, 0, 0.1),
theta = 1,
Q = 1)
}
# Parameters to display n output
params <- c("beta",
"theta",
"Q"
)
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
182 GLMs Part II – Count Models
The synthetic model is well recovered after 2500 samples, where a previous 2500 were
discarded as burn-in. For the parameter value P in the model log-likelihood, this can be
determined as P = 2 − Q. If Q = 1.485, P = 0.515. The direct parameterization of α
equals 1/0.461 or 2.169.
# Load R package
r(’library(MASS)’)
# Get R functions
nbinomR = r[’rnegbin’]
return res
# Data
nobs = 750 # number of obs in model
x1 = uniform.rvs(size=nobs) # categorical explanatory variable
xb = 2 - 5 * x1 # linear predictor
exb = np.exp(xb)
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
183 6.5 Bayesian Three-Parameter NB Model (NB-P)
theta = 0.5
Q = 1.4
nbpy = gen_negbin(nobs, exb, theta * (exb ** Q))
# Fit
stan_code = """
data{
int N;
int K;
matrix[N,K] X;
int Y[N];
}
parameters{
vector[K] beta;
real<lower=0> theta;
real<lower=0, upper=3.0> Q;
}
transformed parameters{
vector[N] mu;
real<lower=0> theta_eff[N];
mu = exp(X * beta);
for (i in 1:N) {
theta_eff[i] = theta * pow(mu[i], Q);
}
}
model{
Y ˜ neg_binomial_2(mu, theta_eff);
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=mydata, iter=5000, chains=3,
warmup=2500, n_jobs=3)
# Output
nlines = 9 # number of lines in screen output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] 1.96 1.9e-3 0.07 1.81 1.91 1.96 2.01 2.1 1417.0 1.0
beta[1] -4.97 9.0e-3 0.32 -5.58 -5.19 -4.98 -4.77 -4.31 1261.0 1.0
theta 0.52 2.5e-3 0.09 0.35 0.45 0.51 0.58 0.72 1414.0 1.0
Q 1.33 4.4e-3 0.16 1.03 1.21 1.32 1.43 1.67 1378.0 1.0
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:40:03, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.007
7 GLMs Part III – Zero-Inflated and Hurdle Models
Zero-inflated models are mixture models. In the domain of count models, zero-inflated
models involve the mixtures of a binary model for zero counts and a count model. It is a
mixture model because the zeros are modeled by both the binary and the count components
of a zero-inflated model.
The logic of a zero-inflated model can be expressed as
Pr(Y = 0) : Pr(Bin = 0) + [1 − Pr(Bin = 0)] × Pr(Count = 0)
Pr(Y ≥ 0) : 1 − Pr(Bin = 0) + PDFcount
Thus, the probability of a zero in a zero-inflated model is equal to the probability of a zero
in the binary model component (e.g., the logistic) plus one minus the probability of a zero
in the binary model times the probability of a zero count in the count model component.
The probability that the response is greater than or equal to zero (as in e.g. the Poisson
model) is equal to one minus the probability of a zero in the binary component plus the
count model probability distribution. The above formulae are valid for all zero-inflated
models.
184
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
185 7.1 Bayesian Zero-Inflated Models
where the subscripts p and l stand for betas on coefficients from the Possion and logistic
components respectively. The mean of the Poisson distribution is given by μW, where W
is the outcome of the Bernoulli draw. This can be expressed as
W ∼ Bernoulli(π ),
Y ∼ Poisson(μW),
(7.3)
log(μ) = β1 + β2 X,
logit(π ) = γ1 + γ2 X.
The interpretation of the zero-inflated counts differs substantially from how zero counts
are interpreted for hurdle models. Hurdle models are two-part models, each component
being separately modeled. The count component is estimated as a zero-truncated model.
The binary component is usually estimated as a 0, 1 logistic model, with the logistic “1”
values representing all positive counts from the count component. In a zero-inflated model,
however, the zeros are modeled by both the count and binary components. That is where
the mixture occurs.
Some statisticians call zero-inflated-model zeros, estimated using the binary component,
false zeros, and the zeros from the count model true zeros. This terminology is partic-
ularly common in ecology, environmental science, and now in the social sciences. The
notion behind labeling “true” and “false” zeros comes from considering them as errors.
For instance, when one is counting the number of exoplanets identified each month over a
two-year period in a given area of the Milky Way, it may be that for a specific month none
is found, even though there were no problems with the instruments and weather conditions
were excellent. These are true zeros. If exoplanets are not observed owing to long-lasting
periods of inclement weather, or if the instruments searching for exoplanets broke down,
the zeros recorded for exoplanet counts in these circumstances would be considered as
“false” zeros. The zero counts from the count component are assigned as true zeros; the
zeros of the binary component are assigned as false zeros.
When interpreting a zero-inflated model, it must be remembered that the binary com-
ponent models the zero counts, not the 1s. This is unlike hurdle models, or even standard
binary response models. A logistic model, for example, models the probability that y == 1.
For a zero-inflated Poisson–logit model, the logistic component models the probability that
y == 0. This is important to keep in mind when interpreting zero-inflated model results.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
186 GLMs Part III – Zero-Inflated and Hurdle Models
Poisson ZIP
250
200
Frequency 150
100
50
0
0 5 10 15 20
t
x
Figure 7.1 For comparison, random variables drawn from Poisson (lighter) and zero-inflated Poisson (darker) distributions.
JAGS model for the zero-inflated Poisson model, which is displayed in Code 7.1. Note that
the lines in the log-likelihood that define Y[i] and W[i], and the two lines below the log-
likelihood that define W, are the key lines that mix the Poisson and Bernoulli logit models.
Moreover, linear predictors are specified for both the count and binary components of the
model
. . . in R using JAGS
The JAGS code in Code 7.1 provides posterior means, standard errors, and credible
intervals for each parameter from both components of the model.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
187 7.1 Bayesian Zero-Inflated Models
Kc <- ncol(Xc)
Kb <- ncol(Xb)
# Likelihood
for (i in 1:N) {
W[i] ˜ dbern(1 - Pi[i])
Y[i] ˜ dpois(W[i] * mu[i])
log(mu[i]) <- inprod(beta[], Xc[i,])
logit(Pi[i]) <- inprod(gamma[], Xb[i,])
}
}"
W <- zipdata$zipy
W[zipdata$zipy > 0] <- 1
The count or Poisson component of the output is provided in the upper beta means and
the binary or logit component in the lower gamma means. Notice that the parameter values
we gave the synthetic data are close to the means of the posterior parameters displayed in
the output.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
188 GLMs Part III – Zero-Inflated and Hurdle Models
model{
y ˜ poisson(lambda);
}
can be expressed as
model{
increment_log_prob(poisson_log(y, lambda));
}
# Load R package
r(’library(VGAM)’)
# Get R functions
zipoissonR = r[’rzipois’]
# Data
np.random.seed(141) # set seed to replicate example
nobs = 5000 # number of obs in model
x1 = uniform.rvs(size=nobs)
exb = np.exp(xb)
exc = 1.0 / (1.0 + np.exp(-xc))
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
189 7.1 Bayesian Zero-Inflated Models
X = np.transpose(x1)
X = sm.add_constant(X)
# Fit
stan_code = """
data{
int N;
int Kb;
int Kc;
matrix[N, Kb] Xb;
matrix[N, Kc] Xc;
int Y[N];
}
parameters{
vector[Kc] beta;
vector[Kb] gamma;
}
transformed parameters{
vector[N] mu;
vector[N] Pi;
mu = exp(Xc * beta);
for (i in 1:N) Pi[i] = inv_logit(Xb[i] * gamma);
}
model{
real LL[N];
for (i in 1:N) {
if (Y[i] == 0) {
LL[i] = log_sum_exp(bernoulli_lpmf(1|Pi[i]),
bernoulli_lpmf(0|Pi[i]) + poisson_lpmf(Y[i]|mu[i]));
}else{
LL[i] = bernoulli_lpmf(0|Pi[i] + poisson_lpmf(Y[i]|mu[i]);
}
target += LL;
}
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=mydata, iter=5000, chains=3,
warmup=4000, n_jobs=3)
# Output
nlines = 9 # number of lines in screen output
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
190 GLMs Part III – Zero-Inflated and Hurdle Models
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] 1.01 1.2e-3 0.03 0.95 0.99 1.01 1.03 1.06 576.0 1.0
beta[1] 2.0 1.6e-3 0.04 1.93 1.97 2.0 2.02 2.07 570.0 1.0
gamma[0] 2.06 4.3e-3 0.11 1.85 1.99 2.06 2.13 2.27 619.0 1.0
gamma[1] -5.15 8.3e-3 0.21 -5.55 -5.29 -5.14 -5.0 -4.75 623.0 1.0
Notice that our model is written in two parts, so that there is a probability Pi of drawing
a zero and a probability 1 - Pi of drawing from a Poisson distribution with mean mu. The
expression log_sum_exp(l1, l2) is a more arithmetically stable version of log(exp(l1)
+ exp(l2)) (see the manual Team Stan, 2016).
set.seed(141)
nobs <- 1000
x1 <- runif(nobs)
x2 <- rbinom(nobs, size=1, 0.6)
xb <- 1 + 2.0*x1 + 1.5*x2
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
191 7.1 Bayesian Zero-Inflated Models
NB ZINB
200
150
Frequency
100
50
0 10 20 30 40 50
t
x
Figure 7.2 For comparison, random variables drawn from negative binomial (lighter) and zero-inflated negative binomial
(darker) distributions.
. . . in R using JAGS
The code for creating a zero-inflated negative binomial model is found below. We
use the direct parameterization for the negative binomial dispersion parameter, but
provide the code for converting the model to indirect parameterization below the model
output.
Kc <- ncol(Xc)
Kb <- ncol(Xb)
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
192 GLMs Part III – Zero-Inflated and Hurdle Models
Xc = Xc, # covariates
Kc = Kc, # number of betas
Xb = Xb, # covariates
Kb = Kb, # number of gammas
N = nrow(zinbdata))
sink ("ZINB.txt")
cat("
model{
# Priors...
for(i in 1: kc...
model{
# Priors - count and binary components
for (i in 1:Kc) { beta[i] ˜ dnorm(0, 0.0001)}
for (i in 1:Kb) { gamma[i] ˜ dnorm(0, 0.0001)}
alpha ˜ dunif(0.001, 5)
# Likelihood
for (i in 1:N) {
W[i] ˜ dbern(1 - Pi[i])
Y[i] ˜ dnegbin(p[i], 1/alpha)
p[i] <- 1/(1 + alpha * W[i] * mu[i])
# Y[i] ~ dnegbin(p[i],alpha) indirect
# p[i] <- alpha/(alpha + mueff[i]) indirect
mueff[i] <- W[i] * mu[i]
log(mu[i]) <- inprod(beta[], Xc[i,])
logit(Pi[i]) <- inprod(gamma[], Xb[i,])
}
}
",fill = TRUE)
sink()
W <- zinbdata$zinby
W[zinbdata$zinby > 0] <- 1
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
193 7.1 Bayesian Zero-Inflated Models
Another run using the above lines results in the output below. Note that the only
statistically substantial change is the inversion of the alpha parameter statistic.
Code 7.5 Bayesian zero-inflated negative binomial model in Python using Stan.
==================================================
import numpy as np
import pystan
import statsmodels.api as sm
from rpy2.robjects import r, FloatVector
from scipy.stats import uniform, bernoulli
# load R package
r(’require(VGAM)’)
# get R functions
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
194 GLMs Part III – Zero-Inflated and Hurdle Models
3.5
2.0
3.0
1.5
Sampled values
2.5
alpha beta[1] beta[2] beta[3]
2.5
1.5
2.0
2.0
1.0
1.5
1.5
1.0
0.5
t
MCMC iterations
Figure 7.3 MCMC chains for the regression parameters of the zero-inflated negative binomial model.
1.5 2.0 2.5 0.5 1.0 1.5 1.0 1.5 2.0 2.5 1.2 1.6 2.0
t
Posterior distribution
Figure 7.4 Histograms of the MCMC iterations for each parameter in the zero-inflated negative binomial model. The thick line at
the base of each histogram represents the 95% credible interval.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
195 7.1 Bayesian Zero-Inflated Models
zinbinomR = r[’rzinegbin’]
res = zinbinomR(n=N, munb=FloatVector(mu1), size=1.0/alpha,
pstr0=FloatVector(mu2))
# Data
np.random.seed(141) # set seed to replicate example
nobs = 7500 # number of obs in model
x1 = uniform.rvs(size=nobs)
x2 = bernoulli.rvs(0.6, size=nobs)
xb = 1.0 + 2.0 * x1 + 1.5 * x2 # linear predictor
xc = 2.0 - 5.0 * x1 + 3.0 * x2
exb = np.exp(xb)
exc = 1 / (1 + np.exp(-xc))
alpha = 2
# Create y as adjusted
zinby = gen_zinegbinom(nobs, exb, exc, alpha)
X = np.column_stack((x1,x2))
X = sm.add_constant(X)
# Fit
stan_code = """
data{
int N;
int Kb;
int Kc;
matrix[N, Kb] Xb;
matrix[N, Kc] Xc;
int Y[N];
}
parameters{
vector[Kc] beta;
vector[Kb] gamma;
real<lower=0> alpha;
}
transformed parameters{
vector[N] mu;
vector[N] Pi;
mu = exp(Xc * beta);
for (i in 1:N) Pi[i] = inv_logit(Xb[i] * gamma);
}
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
196 GLMs Part III – Zero-Inflated and Hurdle Models
model{
vector[N] LL;
for (i in 1:N) {
if (Y[i] == 0) {
LL[i] = log_sum_exp(bernoulli_lpmf(1|Pi[i]),
bernoulli_lpmf(0|Pi[i]) +
neg_binomial_2_lpmf(Y[i]|mu[i], 1/alpha));
} else {
LL[i] = bernoulli_lpmf(0|Pi[i]) +
neg_binomial_2_lpmf(Y[i]| mu[i], 1/alpha);
}
target += LL;
}
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=mydata, iter=7000, chains=3,
warmup=3500, n_jobs=3)
# Output
nlines = 12 # number of lines in screen output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] 1.15 3.6e-3 0.14 0.88 1.06 1.15 1.25 1.44 1595.0 1.0
beta[1] 1.8 4.7e-3 0.19 1.43 1.68 1.81 1.94 2.17 1651.0 1.0
beta[2] 1.44 2.3e-3 0.11 1.23 1.37 1.44 1.51 1.65 2138.0 1.0
gamma[0] 1.96 3.7e-3 0.16 1.64 1.86 1.96 2.07 2.27 1832.0 1.0
gamma[1] -4.83 7.5e-3 0.31 -5.46 -5.04 -4.83 -4.62 -4.25 1702.0 1.0
gamma[2] 2.89 4.4e-3 0.18 2.55 2.77 2.89 3.02 3.26 1706.0 1.0
alpha 1.9 3.8e-3 0.16 1.62 1.79 1.9 2.01 2.25 1756.0 1.0
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
197 7.2 Bayesian Hurdle Models
binary component can be modeled as being right censored at one count, but we shall not
discuss this alternative here. In the following we provide non-exhaustive examples of hur-
dle models for both discrete and continuous data in the hope that readers will be able to
adjust the templates discussed herein for their own needs.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
198 GLMs Part III – Zero-Inflated and Hurdle Models
. . . in R using JAGS
For the JAGS code below, the binary component is set up by creating a separate variable
with value 1 for all counts greater than zero, and 0 for zero counts. The probability of
the logistic component is based on this bifurcated data. The count component is a zero-
truncated Poisson model. First we generate the data:
# Sample size
nobs <- 1000
# Construct filter
pi <- 1/(1+exp((xc)))
bern <- rbinom(nobs,size =1, prob=1-pi)
0
x1
x2
x3 x
tFigure 7.5 Illustration of Poisson–logit hurdle-distributed data. (A black and white version of this figure will appear in some
formats. For the color version, please refer to the plate section.)
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
199 7.2 Bayesian Hurdle Models
sink("HPL.txt")
cat("
model{
# Priors beta and gamma
for (i in 1:Kc) {beta[i] ˜ dnorm(0, 0.0001)}
for (i in 1:Kb) {gamma[i] ˜ dnorm(0, 0.0001)}
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
200 GLMs Part III – Zero-Inflated and Hurdle Models
n.burnin = 4000,
n.iter = 6000)
print(ZAP, intervals=c(0.025, 0.975), digits=3)
==================================================
mu.vect sd.vect 2.5% 97.5% Rhat n.eff
beta[1] 0.737 0.032 0.674 0.800 1.005 820
beta[2] 1.503 0.016 1.472 1.534 1.005 670
gamma[1] -2.764 0.212 -3.190 -2.354 1.004 650
gamma[2] 5.116 0.298 4.550 5.728 1.004 610
deviance 20004080.026 2.741 20004076.529 20004086.654 1.000 1
The beta parameters are for the Poisson component, with the intercept on the upper line.
The logistic components are the lower gamma parameters.
It might be instructive to look at the results of a maximum likelihood Poisson–logit hur-
dle model on the same data as used for the above Bayesian model. This should inform
us about the variability in parameter estimates based on the synthetic data. The code
follows:
library(MASS)
library(pscl)
link = "logit",
data = pdata)
summary(hlpoi)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.68905 0.03426 20.11 <2e-16 ***
x1 1.52845 0.01756 87.03 <2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.5993 0.1890 -8.463 <2e-16 ***
x1 4.2387 0.3327 12.739 <2e-16 ***
Log-likelihood: -1680 on 4 Df
The fact that we used only 750 observations in the model adds variability to the synthetic
data as well as to the JAGS model, which is based on MCMC sampling. Increasing the
sample size will increase the likelihood that the posterior means that we specified in the
synthetic data will be reflected in the JAGS model output.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
201 7.2 Bayesian Hurdle Models
import statsmodels.api as sm
return np.array(ztp)
# Data
np.random.seed(141) # set seed to replicate example
nobs = 750 # number of obs in model
x1 = uniform.rvs(size=nobs)
X = np.transpose(x1)
X = sm.add_constant(X)
# Fit
stan_code = """
data{
int<lower=0> N;
int<lower=0> Kb;
int<lower=0> Kc;
matrix[N, Kb] Xb;
matrix[N, Kc] Xc;
int<lower=0> Y[N];
}
parameters{
vector[Kc] beta;
vector[Kb] gamma;
real<lower=0, upper=5.0> r;
}
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
202 GLMs Part III – Zero-Inflated and Hurdle Models
transformed parameters{
vector[N] mu;
vector[N] Pi;
mu = exp(Xc * beta);
for (i in 1:N) Pi[i] = inv_logit(Xb[i] * gamma);
}
model{
for (i in 1:N) {
(Y[i] == 0) ˜ bernoulli(1-Pi[i]);
if (Y[i] > 0) Y[i] ˜ poisson(mu[i]) T[1,];
}
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=mydata, iter=7000, chains=3,
warmup=4000, n_jobs=3)
# Output
nlines = 10 # number of lines in screen output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] 1.01 1.0e-3 0.03 0.94 0.99 1.01 1.03 1.07 913.0 1.0
beta[1] 4.0 1.3e-3 0.04 3.92 3.97 4.0 4.02 4.07 906.0 1.0
gamma[0] -1.07 5.3e-3 0.16 -1.39 -1.17 -1.06 -0.95 -0.77 904.0 1.01
gamma[1] 3.66 0.01 0.33 3.04 3.43 3.64 3.87 4.32 897.0 1.01
r 2.32 0.05 1.44 0.09 1.05 2.26 3.55 4.82 884.0 1.0
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
203 7.2 Bayesian Hurdle Models
Code 7.9 Zero-altered negative binomial (ZANB) or NB hurdle model in R using JAGS.
==================================================
require(R2jags)
Xc <- model.matrix(~ 1 + x1, data = pdata)
Xb <- model.matrix(~ 1 + x1, data = pdata)
Kc <- ncol(Xc)
Kb <- ncol(Xb)
sink("NBH.txt")
cat("
model{
# Priors beta and gamma
for (i in 1:Kc) {beta[i] ˜ dnorm(0, 0.0001)}
for (i in 1:Kb) {gamma[i] ˜ dnorm(0, 0.0001)}
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
204 GLMs Part III – Zero-Inflated and Hurdle Models
Code 7.10 Zero-altered negative binomial (ZANB) or NB hurdle model in Python using
Stan.
==================================================
# Data
np.random.seed(141) # set seed to replicate example
nobs= 750 # number of obs in model
X = np.transpose(x1)
X = sm.add_constant(X)
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
205 7.2 Bayesian Hurdle Models
mydata[’Xc’] = X
mydata[’Kb’] = X.shape[1] # number of coefficients
mydata[’Kc’] = X.shape[1]
stan_code = """
data{
int<lower=0> N;
int<lower=0> Kb;
int<lower=0> Kc;
matrix[N, Kb] Xb;
matrix[N, Kc] Xc;
int<lower=0> Y[N];
}
parameters{
vector[Kc] beta;
vector[Kb] gamma;
real<lower=0, upper=5.0> alpha;
}
transformed parameters{
vector[N] mu;
vector[N] Pi;
vector[N] temp;
vector[N] u;
mu = exp(Xc * beta);
temp = Xb * gamma;
for (i in 1:N) {
Pi[i] = inv_logit(temp[i]);
u[i] = 1.0/(1.0 + alpha * mu[i]);
}
}
model{
vector[N] LogTrunNB;
vector[N] z;
vector[N] l1;
vector[N] l2;
vector[N] ll;
for (i in 1:Kc){
beta[i] ~ normal(0, 100);
gamma[i] ~ normal(0, 100);
}
for (i in 1:N) {
LogTrunNB[i] = (1.0/alpha) * log(u[i]) + Y[i] * log(1 - u[i]) +
lgamma(Y[i] + 1.0/alpha) - lgamma(1.0/alpha) -
lgamma(Y[i] + 1) - log(1 - pow(u[i],1.0/alpha));
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
206 GLMs Part III – Zero-Inflated and Hurdle Models
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=mydata, iter=6000, chains=3,
warmup=4000, n_jobs=3)
# Output
nlines = 10 # number of lines in screen output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] 0.82 6.9e-4 0.04 0.75 0.8 0.82 0.84 0.89 2697.0 1.0
beta[1] 1.47 3.5e-4 0.02 1.43 1.46 1.47 1.48 1.51 2664.0 1.0
gamma[0] -1.74 3.4e-3 0.19 -2.13 -1.87 -1.74 -1.61 -1.38 3246.0 1.0
gamma[1] 4.52 6.1e-3 0.36 3.86 4.27 4.51 4.74 5.24 3374.0 1.0
alpha 1.5e-3 1.8e-5 1.2e-3 5.9e-5 5.4e-4 1.2e-3 2.2e-3 4.7e-3 5006.0 1.0
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
207 7.2 Bayesian Hurdle Models
Poisson hurdle model. Note that the gamma posterior means are for the binary component
whereas the betas are reserved for the main distribution of interest, in this case the log-
gamma estimates.
In the code below a synthetic model with predictors x1 and x2 is assumed. The response
term is gy. These are stored in an R data frame called gdata. It is easy to amend the code
for use with real data.
set.seed(33559)
# Sample size
nobs <- 750
# Construct filter
xb <- -2 + 1.5*x1
pi <- 1/(1 + exp(-(xb)))
bern <- rbinom(nobs,size=1, prob=pi)
load.module(’glm’)
sink("ZAGGLM.txt")
cat("
model{
# Priors for both beta and gamma components
for (i in 1:Kc) {beta[i] ˜ dnorm(0, 0.0001)}
for (i in 1:Kb) {gamma[i] ˜ dnorm(0, 0.0001)}
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
208 GLMs Part III – Zero-Inflated and Hurdle Models
# gamma log-likelihood
lg1[i] <- - loggam(r) + r * log(r / mu[i])
lg2[i] <- (r - 1) * log(Y[i]) - (Y[i] * r) / mu[i]
LG[i] <- lg1[i] + lg2[i]
# MCMC sampling
ZAG <- jags(data = model.data,
inits = inits,
parameters = params,
model = "ZAGGLM.txt",
n.thin = 1,
n.chains = 3,
n.burnin = 2500,
n.iter = 5000)
# Model results
print(ZAG, intervals = c(0.025, 0.975), digits=3)
==================================================
mu.vect sd.vect 2.5% 97.5% Rhat n.eff
beta[1] -0.991 0.024 -1.035 -0.939 1.069 35
beta[2] 0.739 0.009 0.720 0.756 1.074 33
gamma[1] -1.958 0.172 -2.295 -1.640 1.007 330
gamma[2] 1.462 0.096 1.282 1.652 1.008 290
phi 0.068 0.003 0.062 0.073 1.003 890
deviance 20001890.234 3.184 20001886.012 20001897.815 1.000 1
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
209 7.2 Bayesian Hurdle Models
# Data
np.random.seed(33559) # set seed to replicate example
nobs = 1000 # number of obs in model
# Construct filter
xb = -2 + 1.5 * x1
pi = 1 / (1 + np.exp(-xb))
bern = bernoulli.rvs(1-pi)
X = np.transpose(x1)
X = sm.add_constant(X)
# Fit
mydata = {} # build data dictionary
mydata[’Y’] = gy # response variable
mydata[’N’] = nobs # sample size
mydata[’Xb’] = X # predictors
mydata[’Xc’] = X
mydata[’Kb’] = X.shape[1] # number of coefficients
mydata[’Kc’] = X.shape[1]
stan_code = """
data{
int N;
int Kb;
int Kc;
matrix[N, Kb] Xb;
matrix[N, Kc] Xc;
real<lower=0> Y[N];
}
parameters{
vector[Kc] beta;
vector[Kb] gamma;
real<lower=0> phi;
}
model{
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
210 GLMs Part III – Zero-Inflated and Hurdle Models
0
x1
x2
x3 x
tFigure 7.6 Illustration of lognormal–logit hurdle-distributed data. (A black and white version of this figure will appear in some
formats. For the color version, please refer to the plate section.)
vector[N] mu;
vector[N] Pi;
mu = exp(Xc * beta);
for (i in 1:N) Pi[i] = inv_logit(Xb[i] * gamma);
for (i in 1:N) {
(Y[i] == 0) ˜ bernoulli(Pi[i]);
if (Y[i] > 0) Y[i] ˜ gamma(mu[i], phi) T[0,];
}
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=mydata, iter=6000, chains=3,
warmup=4000, n_jobs=3)
# Output
print (fit)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] -0.97 1.9e-3 0.08 -1.12 -1.02 -0.97 -0.92 -0.82 1536.0 1.0
beta[1] 0.79 1.1e-3 0.04 0.71 0.77 0.8 0.82 0.88 1527.0 1.0
gamma[0] -1.9 4.6e-3 0.17 -2.24 -2.02 -1.9 -1.79 -1.57 1374.0 1.0
gamma[1] 1.5 2.6e-3 0.1 1.31 1.43 1.5 1.56 1.69 1381.0 1.0
phi 0.07 1.7e-4 6.4e-3 0.06 0.07 0.07 0.08 0.08 1459.0 1.0
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
211 7.2 Bayesian Hurdle Models
# Construct filter
xb <- -3 + 4.5*x1
pi <- 1/(1+exp(-(xb)))
bern <- rbinom(nobs,size=1, prob=pi)
load.module(’glm’)
sink("ZALN.txt")
cat("
model{
# Priors for both beta and gamma components
for (i in 1:Kc) {beta[i] ˜ dnorm(0, 0.0001)}
for (i in 1:Kb) {gamma[i] ˜ dnorm(0, 0.0001)}
# LN log-likelihood
ln1[i] <- -(log(Y[i]) + log(sigmaLN) + log(sqrt(2 * sigmaLN)))
ln2[i] <- -0.5 * pow((log(Y[i]) - mu[i]),2)/(sigmaLN * sigmaLN)
LN[i] <- ln1[i] + ln2[i]
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
212 GLMs Part III – Zero-Inflated and Hurdle Models
# MCMC sampling
ZALN <- jags(data = JAGS.data,
inits = inits,
parameters = params,
model = "ZALN.txt",
n.thin = 1,
n.chains = 3,
n.burnin = 2500,
n.iter = 5000)
# Model results
print(ZALN, intervals = c(0.025, 0.975), digits=3)
==================================================
mu.vect sd.vect 2.5% 97.5% Rhat n.eff
beta[1] 0.703 0.041 0.621 0.782 1.003 1700
beta[2] 1.238 0.025 1.190 1.287 1.002 1800
gamma[1] -3.063 0.235 -3.534 -2.615 1.008 300
gamma[2] 4.626 0.299 4.054 5.225 1.007 320
deviance 20004905.556 3.168 20004901.358 20004913.309 1.000 1
The values of the posteriors for both the count and the Bernoulli components are close to
what we specified in the synthetic data. The values are nearly identical to the maximum
likelihood estimation of the same data.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
213 7.2 Bayesian Hurdle Models
import statsmodels.api as sm
# Data
np.random.seed(33559) # set seed to replicate example
nobs = 2000 # number of obs in model
X = np.transpose(x1)
X = sm.add_constant(X)
# Fit
stan_code = """
data{
int<lower=0> N;
int<lower=0> Kb;
int<lower=0> Kc;
matrix[N, Kb] Xb;
matrix[N, Kc] Xc;
real<lower=0> Y[N];
}
parameters{
vector[Kc] beta;
vector[Kb] gamma;
real<lower=0> sigmaLN;
}
model{
vector[N] mu;
vector[N] Pi;
mu = exp(Xc * beta);
for (i in 1:N) Pi[i] = inv_logit(Xb[i] * gamma);
for (i in 1:N) {
(Y[i] == 0) ~ bernoulli(Pi[i]);
if (Y[i] > 0) Y[i] ~ lognormal(mu[i], sigmaLN);
}
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
214 GLMs Part III – Zero-Inflated and Hurdle Models
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=mydata, iter=7000, chains=3,
warmup=4000, n_jobs=3)
# Output
print(fit)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] 0.59 2.0e-4 9.2e-3 0.57 0.58 0.59 0.6 0.61 2036.0 1.0
beta[1] 1.26 2.1e-4 9.3e-3 1.24 1.25 1.26 1.26 1.28 2045.0 1.0
gamma[0] -3.09 4.1e-3 0.18 -3.44 -3.2 -3.09 -2.97 -2.74 1903.0 1.0
gamma[1] 4.55 5.1e-3 0.22 4.12 4.4 4.54 4.7 5.0 1905.0 1.0
sigmaLN 0.42 2.6e-4 0.01 0.39 0.41 0.42 0.43 0.44 2237.0 1.0
Further Reading
Cameron, E. (2011). “On the estimation of confidence intervals for binomial pop-
ulation proportions in astronomy: the simplicity and superiority of the Bayesian
approach.” Publ. Astronom. Soc. Australia 28, 128–139. DOI: 10.1071/AS10046.
arXiv:1012.0566 [astro-ph.IM].
de Souza, R. S., E. Cameron, M. Killedar, J. M. Hilbe, R. Vilalta, U. Maio, V. Biffi et
al. (2015). “The overlooked potential of generalized linear models in astronomy, I:
Binomial regression.” Astron. Comput. 12, 21–32. DOI: http://dx.doi.org/10.1016/
j.ascom.2015.04.002.
Elliott, J., R. S. de Souza, A. Krone-Martins, E. Cameron, E. O. Ishida, and J. M.
Hilbe (2015). “The overlooked potential of generalized linear models in astronomy,
II: gamma regression and photometric redshifts.” Astron. Comput. 10, 61–72. DOI:
10.1016/j.ascom.2015.01.002. arXiv: 1409.7699 [astro-ph.IM].
Hardin, J. W. and J. M. Hilbe (2012). Generalized Linear Models and Extensions, Third
Edition. Taylor & Francis.
Hilbe, J. M. (2011). Negative Binomial Regression, Second Edition. Cambridge University
Press.
Hilbe, J .M. (2014). Modeling Count Data. Cambridge University Press.
Hilbe, J. M. (2015). Practical Guide to Logistic Regression. Taylor & Francis.
McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and
Stan. Chapman & Hall/CRC Texts in Statistical Science. CRC Press.
Smithson, M. and E. C. Merkle (2013). Generalized Linear Models for Categorical and
Continuous Limited Dependent Variables. Chapman & Hall/CRC Statistics in the Social
and Behavioral Sciences. Taylor & Francis.
Zuur, A. F., J. M. Hilbe, and E. N. Ieno (2013). A Beginner’s Guide to GLM and GLMM
with R: A Frequentist and Bayesian Perspective for Ecologists. Highland Statistics.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 16:57:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.008
8 Hierarchical GLMMs
The Bayesian models we have discussed thus far in the book have been based on a like-
lihood, and the mixture of a likelihood and prior distribution. The product of a model
likelihood, or log-likelihood, and a prior is called a posterior distribution. One of the key
assumptions of a likelihood distribution is that each component observation in the distribu-
tion is independent of the other observations. This assumption goes back to the probability
distribution from which a likelihood is derived. The terms or observations described by
a probability distribution are assumed to be independent. This criterion is essential to
creating and interpreting the statistical models we addressed in the last chapter.
When there is correlation or time series autocorrelation in the data caused by clustered,
nested, or panel structured data, statisticians must make adjustments to the model in order
to avoid bias in the interpretation of the parameters, especially the standard deviations. In
maximum likelihood estimation this is a foremost problem when developing models, and
an entire area of statistics is devoted to dealing with excess correlation in data.
In many cases the data being modeled is correlated by its structure, or is said to be
structurally correlated. For instance, suppose we have data that is collected by individual
observations over time or data that belongs to different sources or clusters, e.g., data in
which the observations are nested into levels and so is hierarchically structured. Example
data is provided in the tables. Table 8.1 gives an example of longitudinal data. Note that
time periods (Period) are nested within each observation (id). Table 8.2 gives an example
of hierarchical, grouped, or clustered data. This type of data is also referred to as cross
sectional data. It is the data structure most often used for random intercept data. We pool
the data if the grouping variable, grp, is ignored. However, since it may be the case that
values within groups are more highly correlated than are the observations when the data
is pooled, it may be necessary to adjust the model for the grouping effect. If there is more
correlation within groups, the data is likely to be overdispersed. A random intercept model
adjusts the correlation effect by having separate intercepts for each group in the data. We
will describe how this works in what follows.
Let us first give an example from an imagined relationship between a student’s
grade point average (GPA) and test score. It may help to clarify the basic relation-
ships involved with modeling this type of data structure by using a random intercept
or random-intercept–random-slopes (abbreviated to random intercept–slopes) model. Be
215
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
216 Hierarchical GLMMs
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
217 8.1 Overview of Bayesian Hierarchical Models/GLMMs
where β0 and β1 are parameters and ε is an error term (see below). What may con-
cern us, though, is that the GPA results could differ depending on where the applicant
did their undergraduate work. Students taking courses at one university may differ from
students at another university. Moreover, the type of courses and expectations of what
must be learned to achieve a given grade may differ from university to university. In
other words, GPA results within a university may be more correlated than results between
universities.
There are a number of ways in which statisticians have adjusted for this panel effect.
The most basic is referred to as a random intercept model. Recall the error term ε in
Equation 8.1. It represents the error that exists between the true GRE physics scores
obtained and the score values predicted on the basis of the model. Since the GPA results
between students are assumed to be independent of one another, the errors should be
normally distributed as N(0, σ 2 ).
The random intercept model conceptually divides the data into separate models for each
university that is part of the study. The coefficients are assumed to remain the same, but
the intercepts vary between universities. If the mean GPA results within each university are
nearly the same as the overall mean GPA results, then the universities will differ little in
their results. If they differ considerably, however, we know that there is a university effect.
It is important to have a sufficient number of groups, or universities, in the model
to be able to have a meaningful variance statistic to distinguish between groups; this is
standard statistical practice for this class of models. For frequency-based models most
simulation studies require at least 10 groups with a minimum of 20 observations in
each group. The advantage of using a Bayesian random intercept hierarchical model,
though, is that fewer groups may be sufficient for a well-fitted model (see the astronom-
ical examples in Sections 10.2 and 10.3). It depends on the fit of each parameter in the
model.
The general formula for a random intercept model for a single predictor can be given as
Here β0 represents the overall mean of y across all groups; β1 Xij represents the vector
of coefficients in the model acting on the matrix of the model data. For frequentist-based
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
218 Hierarchical GLMMs
models, the coefficients are considered as fixed. In Bayesian methodology they are ran-
dom; each predictor parameter is estimated using MCMC sampling. The quantities ξj are
the random intercepts, which are the same within each group j in the model. Finally, as
mentioned earlier, εij are the residual errors, indicating the difference between the pre-
dicted and actual model values for each observation in the model. The fit is evaluated by
comparing the mean value of each intercept with the pooled mean. The predicted values for
each intercept can be obtained in the usual way. That is, there are as many mean predicted
values as intercepts. The predictor values differ only by the differences in the transformed
intercept values. The inverse link function of the model transforms the various intercepts
to the fitted values. Of course, the parameters and data common to all groups are added to
the calculations.
Random or varying intercept models are also known as partially pooled models. The
reason for this is that the model provides a more accurate estimate of the component group
or cluster means than if the data were divided into the component models and each mean
were estimated separately, or if the data were pooled across groups with the clustering
effect ignored.
A second model, called a random coefficients or random slopes model, allows the coef-
ficients to vary across groups. For our example, the coefficient of GPA can vary across
schools. Usually, though, if we wish the coefficients to vary then we vary them in con-
junction with the intercepts. Only rarely does a researcher attempt to design a model
with random slopes only. We therefore will consider a model with both random slopes,
or coefficients, and intercepts.
The general formula for the combined random intercept and slopes model can be
expressed as
yij = β0 + β1 Xij + ξ0j + ξ1 Xij + εij , (8.3)
where an extra term, ξ1j Xij , has been added to the random intercept model (Equation 8.2).
This term represents the random coefficient, providing us with the difference between each
group’s slope and β1 Xij .
At times statisticians construct the equation for the random slopes (and intercept) model
in such a way that the fixed and random components of the model are combined:
here the first two terms on the right-hand side of the equation are the random intercepts
and the second two terms are the random slope components.
Again, there are a large variety of random effect models. In astronomy it is likely that the
grouping variables, which represent a second level when the model is thought of as hierar-
chical, take the form of clusters. A second-level group can also be longitudinal. However,
in longitudinal models the time periods are regarded as the level-1 variables and the indi-
viduals as the level-2 variables. Thus one may track changes in luminosity for a class of
stars over time. The individual stars, j, are the grouping variables; the changes in mea-
surement, i, within individual stars are the level-1 observations. Care must be taken when
assigning level-1 and level-2 status with respect to cross sectional nested or clustered data
and longitudinal data.
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
219 8.2 Bayesian Gaussian or Normal GLMMs
We should note as well that the symbolization of this class of models varies with indi-
vidual authors. There is no truly standard symbolization, although the above formulas
are commonly used. This is particularly the case for Bayesian hierarchical models, which
incorporate models varying by intercept as well as models varying by both parameter mean
and intercept. Moreover, when we focus on Bayesian models, the coefficients are no longer
slopes; they are posterior distribution means. They are also already random, and therefore
differ from frequentist-based models where the coefficients are assumed to be fixed. We
shall observe how these models are constructed for Gaussian, logistic, Poisson, and nega-
tive binomial hierarchical models. For the Poisson distribution we shall provide examples
for both a random intercept model and a combined random-intercept–random-slopes (or
coefficients) model. The logic used for the Poisson random intercept–slopes model will be
the same as for other distributions.
The reader can compare the code for creating a basic Bayesian model with the code we
show for both random intercept and random intercept–slope models. The same logic that
we use to develop random intercept models, for instance, from the code for basic Bayesian
models can be applied to random intercept or random intercept–slope models that we do
not address in this chapter. This is the case for beta and beta–binomial random intercept
models as well as zero-inflated and three-parameter NB-P models.
σgroups
2
φ= . (8.5)
σgroups
2 + σpooled
2
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
220 Hierarchical GLMMs
Slightly smaller numbers for both can still produce meaningful variance statistics, but we
advise not reducing the numbers too much. It is also important to remember that the data is
assumed to be a sample from a greater population of data, i.e., the model data is not itself
the population data.
For the Bayesian models we are discussing here, recommendations based on simula-
tion studies state that there should be at least five groups with 20 or more observations
in each. Bayesian models, like frequentist models, still use random intercept models to
address problems with overdispersion caused by the clustering of data into groups or
panels.
In Code 8.1, observe that the line that creates the number of groups and the number of
observations in each starts with Groups. The following line puts the normally distributed
values in each group, with mean zero and standard deviation 0.5. The random effect is
therefore signified as the variable a. This code is common for all the synthetic random
intercept models we develop in this chapter. Other distributions could be used for the ran-
dom intercepts, but generally a good reason must be given to do so. The line beginning
with y specifies the family distribution of the model being created. Here it is a normal or
Gaussian model – hence the use of the pseudo-random number generator rnorm.
Groups <- rep(1:20, each = 225) # 20 groups, each with 225 observations
a <- rnorm(NGroups, mean = 0, sd = 0.5)
print(a,2)
[1] 0.579 -0.115 -0.125 0.169 -0.500 -1.429 -1.171 -0.205 0.193
0.041 -0.917 -0.353 -1.197 1.044 1.084 -0.085 -0.886 -0.352
-1.398 0.350
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
221 8.2 Bayesian Gaussian or Normal GLMMs
which comes with the software describes its options and limitations. We primarily employ
the default options with MCMCglmm. These are 13 000 samples, or iterations of the MCMC
algorithm, with 3000 burn-in samples and a thinning rate of 10. That is, every tenth
sample is used to create the posterior distribution of each parameter in the model. Ten
thousand samples are actually used to develop the posterior. These are small num-
bers for most real-life analyses but are generally fine for synthetic data. However, for
this normal model we used the burnin option to specify 10 000 burn-in samples and
the nitt option to have 20 000 samples for the posterior. We could have used thin
to change the default thin rate value, but there is no appearance of a problem with
autocorrelation. When there is, changing the thin rate can frequently ameliorate the
problem.
The most important post-estimation statistics for the MCMCglmm function are
plot(model), summary(model), and autocorr(model$VCV). We show summary results
below, but not plot or autocorr results. The plot option provides trace graphs and his-
tograms for each parameter. The mean of the posterior distribution is the default parameter
value, but the median or mode may also be used. These are interpreted in the same way
as we interpreted the models in Chapter 5 and are vital in assessing the worth of a param-
eter. For space reasons we do not display them, neither do we display the autocorrelation
statistics.
The random intercept normal or Gaussian model can be estimated using the code below.
The verbose=FALSE option specifies that an iteration log of partial results is not displayed
to the screen. If this option is not specified, pages of partial results can appear on your
screen.
> library(MCMCglmm)
> summary(bnglmm)
DIC: 19001.4
G-structure: ˜Groups
R-structure: ˜units
Location effects: y ˜ x1 + x2
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
222 Hierarchical GLMMs
The G-structure results relate to the variance of the random effects or intercepts of
the model. The variance of Groups has posterior mean value 0.6054 and 95% credible
interval 0.2396 to 1.041. This is the estimated mean variance for Groups (intercepts). It is
a parameter for this class of Bayesian models.
The R structure relates to the variance of the model residuals, which has posterior
mean 3.976. The eff.samp statistic is the effective sample size, which is a measure of
the autocorrelation within the sampling distribution of the parameter. Ideally it should
be close to the MCMC sample size or alternatively it should be approximately 1000
or more. The pMCMC statistics tests whether the parameter is significantly different from
zero.
Intercept, x1, and x2 are fixed effects with mean posterior statistics 0.7136, 0.2026,
and −0.6913 respectively. These are close to the values specified in Code 8.1. The error
involved has its source in the random code used to create the data and in the randomness
of the sampling to develop the mean parameter.
We next turn to using JAGS from within R to calculate a Bayesian random intercept
normal or Gaussian model. The advantage of using JAGS is, of course, that it is not limited
to the models built into MCMCglmm. For instance, the Bayesian GLMM negative binomial is
not an option with MCMCglmm, but it is not difficult to code it using JAGS once we know the
standard Bayesian negative binomial.
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
223 8.2 Bayesian Gaussian or Normal GLMMs
# Fit
sink("lmm.txt")
cat("
model {
# Diffuse normal priors for regression parameters
beta ˜ dmnorm(b0[], B0[,])
# Likelihood
for (i in 1:N) {
Y[i] ˜ dnorm(mu[i], tau.eps)
mu[i] <- eta[i]
eta[i] <- inprod(beta[], X[i,]) + a[re[i]]
}
}
",fill = TRUE)
sink()
# Output
print(NORM0, intervals=c(0.025, 0.975), digits=3)
==================================================
Inference for Bugs model at "lmm.txt", fit using jags,
3 chains, each with 10000 iterations (first 6000 discarded), n.thin = 10
n.sims = 1200 iterations saved
mu.vect sd.vect 2.5% 97.5% Rhat n.eff
a[1] 0.855 0.217 0.436 1.266 1.000 1200
a[2] 0.346 0.219 -0.083 0.791 1.001 1100
a[3] 0.028 0.221 -0.401 0.462 1.000 1200
a[4] 0.397 0.220 -0.007 0.836 1.002 890
a[5] -0.209 0.213 -0.632 0.199 1.002 1200
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
224 Hierarchical GLMMs
There are 20 “a” statistics, one for each intercept in the model. An analyst should look
for consistency of the slopes. Also note that the estimated beta values, which are the means
of the posterior distributions for the model intercept and predictors, are consistent with
what we estimated using the MCMCglmm software. Figure 8.1 illustrates this, showing the
posteriors for each “a” (the vertical lines) and their fiducial values (the crosses).
The DIC statistic is comparative, like the AIC or BIC statistic in maximum likelihood
estimation. The lower the DIC between two models on the same data, the better the fit.
There are a number of post-estimation fit tests that can be employed to evaluate the model.
We do not address these tests here. However, n.eff and Rhat statistics are provided in the
displayed results of almost all Bayesian software. We should address them at this point.
The effective sample size is n.eff, the same statistic as the MCMCglmm eff.samp statis-
tic. When the n.eff statistic is considerably lower than the number of samples used to
form the posterior distribution following burn-in, this is evidence that the chains used in
estimation are inefficient. This does not mean that the statistics based on the posterior
are mistaken; it simply means that there may be more efficient ways to arrive at accept-
able results. For most models, a better-fitted parameter should have an n.eff of 1000 or
more.
Rhat is the Gelman–Rubin convergence statistic, which is also symbolized simply as
R. Values of the statistic above 1.0 indicate that there is a problem with convergence, in
particular that a chain has not yet converged. Usually the solution is to raise the number
of samples used to determine the posterior of the parameter in question. However, this
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
225 8.2 Bayesian Gaussian or Normal GLMMs
a[1] x
a[2] x
a[3] x
a[4] x
a[5] x
a[6] x
a[7] x
a[8] x
a[9] x
a[10] x
a[11] x
a[12] x
a[13] x
a[14] x
a[15] x
a[16] x
a[17] x
a[18] x
a[19] x
a[20] x
t
–1.5 –1.0 –0.5 0.0 0.5 1.0 1.5
Figure 8.1 Results from the random intercept model. The horizontal lines represent the posterior distributions for each intercept
a and the crosses represent their fiducial value.
solution should not be held as an absolute: values above 1.0 may be a genuine cause for
concern, and posterior statistics should be only tentatively accepted, especially if other
Rhat statistics in the model are also over 1.0.
Note in the JAGS code the tau.plot and tau.eps functions are defined in terms of the
inverse of sigma.plot squared and the inverse of sigma.eps squared, where sigma.plot
is the standard deviation of the mean intercepts and its square is the variance. Like-
wise, sigma.eps is the standard deviation of the pooled mean intercepts, and its square
is the variance. The functions tau.plot and tau.eps are the precision statistics for each
parameter mean.
There may be times when an analyst simply wants to create data with a specified number
of groups or clusters, without considering any random effects, or he or she may want to use
an entirely different type of model on the data. The example shown in Code 8.3 provides
data for the same 20 groups of 225 observations in each group.
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
226 Hierarchical GLMMs
x1 <- runif(N)
x2 <- runif(N)
Groups <- rep(1:20, each = 225) # 20 groups, each with 225 observations
mu <- 1 + 0.2 * x1 - 0.75 * x2
y <- rnorm(N, mean=mu, sd=2)
ndata <- data.frame(y = y, x1 = x1, x2 = x2, Groups = Groups)
==================================================
# Data
np.random.seed(1656) # set seed to replicate example
N = 4500 # number of obs in model
NGroups = 20
x1 = uniform.rvs(size=N)
x2 = uniform.rvs(size=N)
Groups = np.array([225 * [i] for i in range(20)]).flatten()
X = sm.add_constant(np.column_stack((x1,x2)))
K = X.shape[1]
re = Groups
Nre = NGroups
model_data = {}
model_data[’Y’] = y
model_data[’X’] = X
model_data[’K’] = K
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
227 8.2 Bayesian Gaussian or Normal GLMMs
model_data[’N’] = N
model_data[’NGroups’] = NGroups
model_data[’re’] = re
model_data[’b0’] = np.repeat(0, K)
model_data[’B0’] = np.diag(np.repeat(100, K))
model_data[’a0’] = np.repeat(0, Nre)
model_data[’A0’] = np.diag(np.repeat(1, Nre))
# Fit
stan_code = """
data{
int<lower=0> N;
int<lower=0> K;
int<lower=0> NGroups;
matrix[N, K] X;
real Y[N];
int re[N];
vector[K] b0;
matrix[K, K] B0;
vector[NGroups] a0;
matrix[NGroups, NGroups] A0;
}
parameters{
vector[K] beta;
vector[NGroups] a;
real<lower=0> sigma_plot;
real<lower=0> sigma_eps;
}
transformed parameters{
vector[N] eta;
vector[N] mu;
eta = X * beta;
for (i in 1:N){
mu[i] = eta[i] + a[re[i] + 1];
}
}
model{
sigma_plot ˜ cauchy(0, 25);
sigma_eps ˜ cauchy(0, 25);
Y ˜ normal(mu, sigma_eps);
}
"""
# Output
nlines = 30 # number of lines in screen output
output = str(fit).split(’\n’)
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
228 Hierarchical GLMMs
We have omitted part of the output in order to save space, but the reader can check for
the extract shown above that the results are consistent with those obtained with JAGS.
The data used for the random intercept binary logistic model are similar to those we used
for the normal model. In fact, only two lines of code need to be changed to switch to
synthetic model data. These are the lines beginning with mu and y. For the normal model mu
and eta are identical, but that is not the case for the other models we discuss. We therefore
have to add another line, transforming the linear predictor eta to mu. For the binary logistic
model we do this using the inverse logit function, μ = 1/(1 + exp(−η)).
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
229 8.3 Bayesian Binary Logistic GLMMs
> table(y)
y
0 1
1604 2396
> head(logitr)
y x1 x2 Groups RE
1 1 0.5210797 0.1304233 1 -0.4939037
2 1 0.2336428 0.1469366 1 -0.4939037
3 1 0.2688640 0.8611638 1 -0.4939037
4 0 0.0931903 0.8900367 1 -0.4939037
5 1 0.4125358 0.7587841 1 -0.4939037
6 0 0.4206252 0.6262765 1 -0.4939037
DIC: 3420.422
G-structure: ˜Groups
post.mean l-95% CI u-95% CI eff.samp
Groups 1.184 0.2851 2.489 22.48
R-structure: ˜units
post.mean l-95% CI u-95% CI eff.samp
units 8.827 3.383 17.12 6.252
Location effects: y ˜ x1 + x2
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
230 Hierarchical GLMMs
We give a caveat on using this particular model while replying on MCMCglmm outputs.
There is typically a great deal of variation in the results, which is not found when modeling
the other families we consider in this chapter. In general, though, the more sampling and
more groups with larger within-group observations, the more stable the results from run
to run. JAGS or Stan models, however, are usually quite close to what we specify for the
intercept and x1,x2 coefficient (mean parameter) values.
Code 8.7 Bayesian random intercept binary logistic model in Python using pymc3.
==================================================
import numpy as np
import pymc3 as pm
x1 = uniform.rvs(size=N)
x2 = uniform.rvs(size=N)
Groups = np.array([200 * [i] for i in range(20)]).flatten()
# Define likelihood
y = pm.Normal(’y’, mu=1.0/(1.0 + np.exp(-eta)), sd=sigma, observed=y)
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
231 8.3 Bayesian Binary Logistic GLMMs
# Fit
start = pm.find_MAP() # Find starting value by optimization
step = pm.NUTS(state=start) # Initiate sampling
trace = pm.sample(7000, step, start=start)
==================================================
beta1:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
1.140 0.160 0.004 [0.824, 1.450]
Posterior quantiles:
2.5 25 50 75 97.5
|-----------------|==============|==============|-----------------|
0.831 1.032 1.137 1.245 1.461
beta2:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
0.223 0.121 0.001 [-0.015, 0.459]
Posterior quantiles:
2.5 25 50 75 97.5
|----------------|==============|==============|------------------|
beta3:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
-0.861 0.126 0.001 [-1.125, -0.629]
Posterior quantiles:
2.5 25 50 75 97.5
|-----------------|==============|==============|-----------------|
-1.112 -0.945 -0.860 -0.776 -0.612
a_param:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
-0.343 0.183 0.004 [-0.690, 0.024]
-0.295 0.183 0.004 [-0.671, 0.042]
-0.192 0.187 0.004 [-0.552, 0.181]
-0.341 0.184 0.004 [-0.710, 0.014]
...
0.109 0.198 0.004 [-0.266, 0.510]
0.632 0.224 0.004 [0.201, 1.088]
0.859 0.249 0.005 [0.398, 1.377]
0.197 0.199 0.004 [-0.172, 0.601]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
-0.698 -0.464 -0.343 -0.220 0.018
-0.656 -0.418 -0.292 -0.170 0.062
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
232 Hierarchical GLMMs
sigma:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
0.451 0.005 0.000 [0.441, 0.461]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
0.441 0.447 0.451 0.455 0.461
sigma_a:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
0.561 0.112 0.007 [0.353, 0.777]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
0.376 0.485 0.549 0.619 0.833
Code 8.8 Bayesian random intercept binary logistic model in R using JAGS.
==================================================
library(R2jags)
sink("GLMM.txt")
cat("
model {
# Diffuse normal priors for regression parameters
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
233 8.3 Bayesian Binary Logistic GLMMs
# Likelihood
for (i in 1:N) {
Y[i] ˜ dbern(p[i])
logit(p[i]) <- max(-20, min(20, eta[i]))
eta[i] <- inprod(beta[], X[i,]) + a[re[i]]
}
}",fill = TRUE)
sink()
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
234 Hierarchical GLMMs
The reader can compare the model’s results with its original values for a by typing >
print(a) in the R console. It may be instructive for the reader to know why we used num
and denom in the code:
Here sigma.re is the mean of the intercept standard deviations. The square of sigma.re
is the mean variance of the intercepts. As a parameter, it requires a prior. Gelman (2006),
Marley and Wand (2010), and Zuur et al. (2013) recommend that a half Cauchy(25) prior
be used for the standard deviation parameter, σGroups . The half Cauchy, as well as the half
Cauchy(25), is not a built-in prior in JAGS. Neither is it a built-in prior in OpenBUGS,
Stata, or other Bayesian packages. One way to define the half Cauchy(25) prior is as the
absolute value of a variable from a Normal(0, 625) distribution divided by a variable with
a Normal(0, 1) distribution. That is,
X1 ∼ Normal(0, 625)
σGroups ∼ half Cauchy(25) = . (8.6)
X2 ∼ Normal(0, 1)
The distribution of num is Normal(0, 0.0016), where 0.0016 is the precision and 625 is the
variance (1/0.0016). The denominator is the standard normal distribution. The absolute
value of num/denom is σ . The inverse of σ 2 is τ . The remainder of the code should be
clear.
8.3.5 Bayesian Random Intercept Binary Logistic Model in Python using Stan
This implementation has a lot in common with the one presented in Code 8.4. Modifica-
tions as follows need to be made in the # Data section,
from scipy.stats import bernoulli
and within the Stan model, where it is necessary to change the response variable declaration
to int and rewrite the likelihood using a Bernoulli distribution. Given that the Bernoulli
distribution uses only one parameter, it is also necessary to suppress the scale parameters.
In what follows we eliminate both sigma_plot and sigma_eps, and include the param-
eter tau_re in order to make the comparison with Code 8.8 easier. In the same line
of thought, the parameter mu is given the traditional nomenclature p in the context of a
Bernoulli distribution.
Notice that it is not necessary to define the logit transformation explicitly since there is
a built-in bernoulli_logit distribution in Stan:
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
235 8.4 Bayesian Binomial Logistic GLMMs
data{
int<lower=0> Y[N];
}
parameters{
# remove parameters sigma_plot and sigma_eps from Code 8.4 and add tau_re
real<lower=0> tau_re;
}
transformed parameters{
vector[N] p;
for (i in 1:N){
p[i] = eta[i] + a[re[i]+1];
}
}
model{
# remove priors for sigma_plot and sigma_eps from Code 8.4
# add prior for tau_re
tau_re ˜ cauchy(0, 25);
# rewrite prior for a
a ˜ multi_normal (a0, tau_re * A0);
Y ˜ bernoulli_logit(p);
}
As discussed in Chapter 5, binomial logistic models are grouped models. The format of
the response term is numerator/denominator, where the numerator is the number of suc-
cesses (for which y = 1) with respect to its associated denominator. The denominator
consists of separate predictors having the same values. If the predictors x1, x2, and x3
have values 1, 0, and 5 respectively, this is a unique denominator, m. If there are 10 obser-
vations in the data having that same profile of values, and four of the observations also
have a response term y with values of 1 then six observations will have zero for y. In
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
236 Hierarchical GLMMs
for
If we use the R head function, the structure of the data may be more easily observed:
> head(logitr)
y m x1 x2 Groups
1 6 45 1 1 1
2 11 54 1 0 2
3 9 39 1 1 3
4 13 47 1 0 4
5 17 29 1 1 5
6 21 44 1 0 6
The random intercept variable Groups simply identifies each cluster of data; m specifies
the number of observations in the respective group. Group 1 has 45 observations, Group 2
has 54 and Group 3 has 39. R does not have a Bayesian random intercept logistic model at
present, so we shall advance directly to JAGS.
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
237 8.4 Bayesian Binomial Logistic GLMMs
re <- length(unique(logitr$Groups))
Nre <- length(unique(Groups))
sink("GLMM.txt")
cat("
model{
# Diffuse normal priors for regression parameters
beta ˜ dmnorm(b0[], B0[,])
# Likelihood function
for (i in 1:N){
Y[i] ˜ dbin(p[i], m[i])
logit(p[i]) <- eta[i]
eta[i] <- inprod(beta[], X[i,]) + a[re[i]]
}
}",fill = TRUE)
sink()
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
238 Hierarchical GLMMs
inits = inits,
parameters = params,
model.file = "GLMM.txt",
n.thin = 10,
n.chains = 3,
n.burnin = 4000,
n.iter = 5000)
We did not use synthetic or simulated data with specified parameter values for this
model. We can determine whether the beta values are appropriate, though, by running
a maximum likelihood binomial logistic model and comparing parameter values.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0612 0.1381 -7.683 1.56e-14 ***
x1 0.2365 0.1554 1.522 0.1281
x2 -0.3271 0.1554 -2.105 0.0353 *
Given that the GLM parameter estimates are fairly close to the beta parameter values of
the Bayesian model, we can be confident that the code is correct. Remember that the GLM
model does not adjust for any grouping effect, nor is it estimated as a Bayesian model. Of
course, if we had informative priors to place on any of the Bayesian parameters then the
result would differ more. We shall discuss priors later.
8.4.3 Bayesian Random Intercept Binomial Logistic Model in Python using Stan
The Stan model is shown below. As in Subsection 8.3.5 we can avoid the explicit logit
transformation by using the Stan built-in binomial_logit distribution.
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
239 8.4 Bayesian Binomial Logistic GLMMs
Code 8.11 Random intercept binomial logistic model in Python using Stan.
=========================================================
import numpy as np
import statsmodels.api as sm
import pystan
y = [6,11,9,13,17,21,8,10,15,19,7,12,8,5,13,17,5,12,9,10]
m = [45,54,39,47,29,44,36,57,62,55,66,48,49,39,28,35,39,43,50,36]
x1 = [1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0]
x2 = [1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0]
Groups = range(len(y))
X = sm.add_constant(np.column_stack((x1,x2)))
K = X.shape[1]
model_data = {}
model_data[’Y’] = y # response
model_data[’X’] = X # covariates
model_data[’K’] = K # num. betas
model_data[’m’] = m # binomial denominator
model_data[’N’] = len(y) # sample size
model_data[’re’] = Groups # random effects
model_data[’b0’] = np.repeat(0, K)
model_data[’B0’] = np.diag(np.repeat(100, K))
model_data[’a0’] = np.repeat(0, len(y))
model_data[’A0’] = np.diag(np.repeat(1, len(y)))
# Fit
stan_code = """
data{
int<lower=0> N;
int<lower=0> K;
matrix[N, K] X;
int<lower=0> Y[N];
int re[N];
int m[N];
vector[K] b0;
matrix[K, K] B0;
vector[N] a0;
matrix[N, N] A0;
}
parameters{
vector[K] beta;
vector[N] a;
real<lower=0> sigma;
}
transformed parameters{
vector[N] eta;
vector[N] p;
eta = X * beta;
for (i in 1:N){
p[i] = eta[i] + a[re[i]+1];
}
}
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
240 Hierarchical GLMMs
model{
sigma ˜ cauchy(0, 25);
Y ˜ binomial_logit(m, p);
}
"""
# Output
nlines = 29 # number of lines in screen output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
=========================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] -1.11 0.02 0.33 -1.79 -1.34 -1.11 -0.87 -0.51 300.0 0.99
beta[1] 0.28 0.02 0.38 -0.45 0.03 0.3 0.51 1.03 300.0 0.99
beta[2] -0.32 0.02 0.38 -1.11 -0.55 -0.32 -0.05 0.39 286.0 1.0
a[0] -0.59 0.03 0.48 -1.57 -0.89 -0.54 -0.29 0.29 261.0 1.0
a[1] -0.44 0.02 0.4 -1.38 -0.68 -0.45 -0.19 0.29 300.0 1.0
a[2] -0.07 0.03 0.47 -0.99 -0.36 -0.07 0.2 0.83 300.0 1.0
...
a[17] 0.08 0.02 0.4 -0.73 -0.2 0.1 0.38 0.82 300.0 1.01
a[18] -0.1 0.03 0.44 -0.9 -0.39 -0.07 0.18 0.78 300.0 0.99
a[19] 0.1 0.02 0.41 -0.76 -0.16 0.09 0.38 0.95 300.0 1.0
sigma 0.54 0.02 0.3 0.16 0.33 0.48 0.66 1.37 269.0 1.0
As discussed in Chapter 5, the Poisson and negative binomial models are foremost in mod-
eling count data. In fact, if one has discrete count data to model then a Poisson model
should be estimated first. After that the model should be evaluated to determine whether
it is extra-dispersed, i.e. over or underdispersed. This can be evaluated easily by checking
the Poisson dispersion statistic, which is calculated as the Pearson χ 2 statistic divided by
the residual number of degrees of freedom. If the resulting dispersion statistic is greater
than 1, the model is likely to be overdispersed; if under 1, it is likely to be underdis-
persed. The P__disp function in the COUNT package in CRAN can quickly provide the
appropriate statistics.
If the Poisson model is overdispersed, the analyst usually employs a negative binomial
model on the data. However, another tactic is first to attempt to determine the likely source
of overdispersion. For the binomial and count models, when the data is clustered or is in
longitudinal form, the data is nearly always overdispersed. If the data within groups is more
highly correlated than is the data between groups, this gives rise to overdispersion. When
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
241 8.5 Bayesian Poisson GLMMs
this is the case the most appropriate tactic is first to model the data using a random intercept
Poisson model. If we find that there is variability in coefficients or parameters as well as
within groups, it may be necessary to employ a combined random-intercept–random-slopes
model. We discuss both these models in this section.
Of course, if there are other sources of excess variation in the data that cannot be
accounted for by using a random intercept model, or a combined random intercept–slopes
model, then a negative binomial model should be tried. In addition, if the data is not
grouped but is still overdispersed then using a negative binomial model is wise. How-
ever, it is always vital to attempt to identify the source of extra-dispersion before selecting
a count model to use subsequent to Poisson modeling.
If the data to be modeled is underdispersed, modeling the data as a Bayesian ran-
dom intercept or combined random intercept–slopes might be able to adjust for the
underdispersion. A negative binomial model cannot be used with underdispersed data. The
best alternative in this case is a Bayesian generalized Poisson model or, better, a Bayesian
hierarchical generalized Poisson model.
We begin this section with providing code and an example of a Bayesian Poisson random
intercept model.
8.5.1 Random Intercept Poisson Data
Unlike the synthetic normal and binary logistic models created earlier in this chapter, here
we only have 10 groups of 200 observations each. However, we retain the same parameter
values for the intercept and predictor parameter means as before.
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
242 Hierarchical GLMMs
> library(MCMCglmm)
> bpglmm <- MCMCglmm(y $\newtilde$ x1 + x2, random= ~Groups,
family="poisson", data=poir, verbose=FALSE,
burnin=10000, nitt=20000)
> summary(bpglmm)
DIC: 6980.156
G-structure: ˜Groups
R-structure: ˜units
Location effects: y ˜ x1 + x2
The intercept and predictor posterior means are fairly close to the values we specified
in the code to create the data. For a Bayesian model the disparity may be due to either the
randomness of the data or the randomness of the sampling used in the modeling process.
# Data
np.random.seed(1656) # set seed to replicate example
N = 2000 # number of obs in model
NGroups = 10
x1 = uniform.rvs(size=N)
x2 = uniform.rvs(size=N)
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
243 8.5 Bayesian Poisson GLMMs
y = poisson.rvs(mu, size=N)
# Define likelihood
y = pm.Poisson(’y’, mu=np.exp(eta), observed=y)
# Fit
start = pm.find_MAP() # Find starting value by optimization
step = pm.NUTS(state=start) # Initiate sampling
trace = pm.sample(20000, step, start=start)
Posterior quantiles:
2.5 25 50 75 97.5
|------------------|==============|==============|-----------------|
0.605 0.847 0.965 1.083 1.339
beta2:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
0.214 0.052 0.001 [0.111, 0.316]
Posterior quantiles:
2.5 25 50 75 97.5
|-----------------|==============|==============|------------------|
0.111 0.179 0.214 0.249 0.317
beta3:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
-0.743 0.054 0.001 [-0.846, -0.636]
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
244 Hierarchical GLMMs
Posterior quantiles:
2.5 25 50 75 97.5
|-----------------|==============|==============|------------------|
-0.848 -0.780 -0.743 -0.707 -0.638
a_param:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
0.374 0.188 0.005 [-0.011, 0.725]
0.347 0.189 0.005 [-0.020, 0.722]
...
0.123 0.189 0.005 [-0.262, 0.479]
-0.446 0.193 0.005 [-0.841, -0.078]
Posterior quantiles:
2.5 25 50 75 97.5
|------------------|==============|==============|------------------|
-0.003 0.256 0.375 0.495 0.737
-0.032 0.230 0.348 0.466 0.712
...
-0.254 0.004 0.124 0.242 0.488
-0.839 -0.567 -0.444 -0.322 -0.076
sigma_a:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
0.569 0.170 0.002 [0.304, 0.905]
Posterior quantiles:
2.5 25 50 75 97.5
|------------------|==============|==============|------------------|
0.339 0.453 0.537 0.648 0.994
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
245 8.5 Bayesian Poisson GLMMs
A0 = diag(Nre))
sink("GLMM.txt")
cat("
model {
# Diffuse normal priors for regression parameters
beta ˜ dmnorm(b0[], B0[,])
# Likelihood
for (i in 1:N) {
Y[i] ˜ dpois(mu[i])
log(mu[i])<- eta[i]
eta[i] <- inprod(beta[], X[i,]) + a[re[i]]
}
}
",fill = TRUE)
sink()
# Identify parameters
params <- c("beta", "a", "sigma.re", "tau.re")
# Run MCMC
PRI <- jags(data = model.data,
inits = inits,
parameters = params,
model.file = "GLMM.txt",
n.thin = 10,
n.chains = 3,
n.burnin = 4000,
n.iter = 5000)
print(PRI, intervals=c(0.025, 0.975), digits=3)
==================================================
mu.vect sd.vect 2.5% 97.5% Rhat n.eff
a[1] 1.086 0.240 0.609 1.579 1.002 300
a[2] -0.593 0.246 -1.081 -0.124 1.008 170
a[3] -0.535 0.247 -1.047 -0.077 1.005 220
a[4] 0.054 0.242 -0.418 0.509 1.009 200
a[5] -0.670 0.245 -1.175 -0.182 1.004 300
a[6] -0.528 0.246 -1.004 -0.060 1.006 210
a[7] 0.211 0.243 -0.294 0.685 1.000 300
a[8] -0.043 0.243 -0.564 0.442 1.004 270
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
246 Hierarchical GLMMs
The results of the intercept and predictor mean posteriors are nearly identical. The square
of the standard deviation, sigma.re, is close to the variance of the mean intercepts as posted
in the MCMCglmm output.
Code 8.15 Bayesian random intercept Poisson model in Python using Stan.
==================================================
import numpy as np
import pystan
import statsmodels.api as sm
# Data
np.random.seed(1656) # set seed to replicate example
N = 2000 # number of obs in model
NGroups = 10
x1 = uniform.rvs(size=N)
x2 = uniform.rvs(size=N)
y = poisson.rvs(mu)
X = sm.add_constant(np.column_stack((x1,x2)))
K = X.shape[1]
Nre = NGroups
model_data = {}
model_data[’Y’] = y
model_data[’X’] = X
model_data[’K’] = K
model_data[’N’] = N
model_data[’NGroups’] = NGroups
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
247 8.5 Bayesian Poisson GLMMs
model_data[’re’] = Groups
model_data[’b0’] = np.repeat(0, K)
model_data[’B0’] = np.diag(np.repeat(100, K))
model_data[’a0’] = np.repeat(0, Nre)
model_data[’A0’] = np.diag(np.repeat(1, Nre))
# Fit
stan_code = """
data{
int<lower=0> N;
int<lower=0> K;
int<lower=0> NGroups;
matrix[N, K] X;
int Y[N];
int re[N];
vector[K] b0;
matrix[K, K] B0;
vector[NGroups] a0;
matrix[NGroups, NGroups] A0;
}
parameters{
vector[K] beta;
vector[NGroups] a;
real<lower=0, upper=10> sigma_re;
}
transformed parameters{
vector[N] eta;
vector[N] mu;
eta = X * beta;
for (i in 1:N){
mu[i] = exp(eta[i] + a[re[i]+1]);
}
}
model{
sigma_re ˜ cauchy(0, 25);
beta ˜ multi_normal(b0, B0);
a ˜ multi_normal(a0, sigma_re * A0);
Y ˜ poisson(mu);
}
"""
# Output
nlines = 19 # number of lines in screen output
output = str(fit).split(’\n’)
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
248 Hierarchical GLMMs
beta[1] 0.22 3.1e-3 0.05 0.12 0.18 0.22 0.25 0.32 247 1.0
beta[2] -0.74 3.0e-3 0.05 -0.86 -0.78 -0.74 -0.71 -0.65 300 1.0
a[0] 0.37 0.01 0.22 -0.11 0.23 0.38 0.52 0.77 274 1.0
a[1] 0.35 0.01 0.21 -0.13 0.22 0.36 0.49 0.74 268 1.0
a[2] 0.15 0.01 0.22 -0.31 0.02 0.16 0.29 0.58 270 1.0
a[3] -1.12 0.01 0.22 -1.62 -1.24 -1.1 -0.99 -0.67 239 1.0
a[4] 0.16 0.01 0.22 -0.32 0.03 0.16 0.29 0.54 287 1.0
a[5] -0.27 0.01 0.22 -0.75 -0.41 -0.25 -0.12 0.18 278 1.0
a[6] 0.19 0.01 0.22 -0.3 0.05 0.21 0.35 0.59 270 1.0
a[7] 0.47 0.01 0.21 -2.5e-3 0.35 0.48 0.61 0.89 278 1.0
a[8] 0.12 0.01 0.22 -0.38 -9.9e-3 0.13 0.27 0.55 268 1.0
a[9] -0.45 0.01 0.22 -0.93 -0.59 -0.44 -0.3 -0.05 274 1.0
sigma_re 0.4 0.01 0.25 0.15 0.24 0.33 0.48 0.99 300 1.0
We now have data in 10 groups with 500 observations in each. The random intercept and
slope are defined on the eta line. The specified parameter predictor means for the model
are 2 for the intercept, 4 for x1, and −7 for x2. The random intercept is Groups as with the
previous models, with the random slopes as x1.
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
249 8.5 Bayesian Poisson GLMMs
sink("RICGLMM.txt")
cat("
model{
#Priors
beta ˜ dmnorm(b0[], B0[,])
a ˜ dmnorm(a0[], tau.ri * A0[,])
b ˜ dmnorm(a0[], tau.rs * A0[,])
tau.ri ˜ dgamma( 0.01, 0.01 )
tau.rs ˜ dgamma( 0.01, 0.01 )
sigma.ri <- pow(tau.ri,-0.5)
sigma.rs <- pow(tau.rs,-0.5)
# Likelihood
for (i in 1:N){
Y[i] ˜ dpois(mu[i])
log(mu[i])<- eta[i]
eta[i] <- inprod(beta[], X[i,]) + a[re[i]] + b[re[i]] * X[i,2]
}
}
",fill = TRUE)
sink()
# Initial values
inits <- function () {
list(
beta = rnorm(K, 0, 0.01),
tau = 1,
a = rnorm(NGroups, 0, 0.1),
b = rnorm(NGroups, 0, 0.1)
) }
# Identify parameters
params <- c("beta", "sigma.ri", "sigma.rs","a","b")
# Run MCMC
PRIRS <- jags(data = model.data,
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
250 Hierarchical GLMMs
inits = inits,
parameters = params,
model.file = "RICGLMM.txt",
n.thin = 10,
n.chains = 3,
n.burnin = 3000,
n.iter = 4000)
As defined earlier, the “a” statistics are the model random intercepts. Each intercept is a
separate parameter for which posterior distributions are calculated. The “b” statistics are
the random slopes on x1, each of which is a posterior. Recall that the synthetic data was
specified to have an intercept parameter of 2, with x1 = 4 and x2 = -7. The quantities
sigma.ri are the square roots of the random intercepts, whereas sigma.rs are the standard
deviations of the random slopes on x1. The model posterior means closely approximate
these values. The mean intercept parameter value is 0.128 and the mean random slopes
parameter (for x1) is 0.457.
The equivalent model in Python using Stan is shown below.
Code 8.18 Random-intercept–random-slopes Poisson model in Python using Stan.
==================================================
import numpy as np
import pystan
import statsmodels.api as sm
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
251 8.5 Bayesian Poisson GLMMs
# Data
np.random.seed(1656) # set seed to replicate example
N = 5000 # number of obs in model
NGroups = 10
x1 = uniform.rvs(size=N)
x2 = np.array([0 if item <= 0.5 else 1 for item in x1])
y = poisson.rvs(mu)
X = sm.add_constant(np.column_stack((x1,x2)))
K = X.shape[1]
model_data = {}
model_data[’Y’] = y
model_data[’X’] = X
model_data[’K’] = K
model_data[’N’] = N
model_data[’NGroups’] = NGroups
model_data[’re’] = Groups
model_data[’b0’] = np.repeat(0, K)
model_data[’B0’] = np.diag(np.repeat(100, K))
model_data[’a0’] = np.repeat(0, NGroups)
model_data[’A0’] = np.diag(np.repeat(1, NGroups))
# Fit
stan_code = """
data{
int<lower=0> N;
int<lower=0> K;
int<lower=0> NGroups;
matrix[N, K] X;
int Y[N];
int re[N];
vector[K] b0;
matrix[K, K] B0;
vector[NGroups] a0;
matrix[NGroups, NGroups] A0;
}
parameters{
vector[K] beta;
vector[NGroups] a;
vector[NGroups] b;
real<lower=0> sigma_ri;
real<lower=0> sigma_rs;
}
transformed parameters{
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
252 Hierarchical GLMMs
vector[N] eta;
vector[N] mu;
eta = X * beta;
for (i in 1:N){
mu[i] = exp(eta[i] + a[re[i]+1] + b[re[i] + 1] * X[i,2]);
}
}
model{
sigma_ri ˜ gamma(0.01, 0.01);
sigma_rs ˜ gamma(0.01, 0.01);
Y ˜ poisson(mu);
}
"""
# Output
nlines = 30 # number of lines in screen output
output = str(fit).split(’\n’)
The negative binomial model is nearly always used to model overdispersed Poisson count
data. The data are Poisson when the mean and variance of the model counts are equal. If the
variance is greater than the mean, the data is said to be overdispersed; if the variance is less
than the mean, the data is underdispersed. We may test for this relationship by checking
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
253 8.6 Bayesian Negative Binomial GLMMs
the Pearson dispersion statistic, which is defined as the value of the model’s Pearson χ 2
statistic divided by the residual number of degrees of freedom. If the dispersion statistic is
greater than 1, the data is likely to be overdispersed; if the statistic is less than 1, the data
is likely to be underdispersed.
As we saw in the previous chapter, however, there are a variety of reasons why count data
can be overdispersed: the counts in the model may be structurally missing a zero or there
may be far more zero counts than allowed by the Poisson distribution for a given mean.
There are a number of other causes of overdispersion, but we do not need to detail them
here. However, if you are interested in the ways in which data can be overdispersed and
how to ameliorate the problem then see Hilbe (2011) and Hilbe (2014). If a specific remedy
for overdispersion in our model data is not known then the negative binomial model should
be used, but it needs to be evaluated to determine whether it fits the data properly. There
are alternatives to the negative binomal model, though, when dealing with overdispersion,
e.g., the generalized Poisson, Poisson inverse Gaussian, and NB-P models. But the negative
binomial model usually turns out to be the best model to deal with overdispersion.
Probably the foremost reason for count data being Poisson overdispersed is that the data
exhibits a clustering effect. That is, many count data situations entail that the data is nested
in some manner or is longitudinal in nature. This generally gives rise to overdispersion.
We have given examples of clustered data throughout this chapter. But, like the Poisson
model, the negative binomial model can itself be overdispersed – or negative binomial
overdispersed – in the sense that there is more correlation in the data than is theoretically
expected for a negative binomial with a given mean and dispersion parameter. If the neg-
ative binomial Pearson dispersion is greater than 1 then the negative binomial model may
be said to be overdispersed or correlated.
The same logic as used in selecting a maximum likelihood model can be used in select-
ing the appropriate Bayesian model. This logic has been implicit in our discussion of the
various Bayesian models addressed in this text. The reason is that the likelihood that is cen-
tral to the maximum likelihood model is also central to understanding the model data before
mixing it with a prior distribution; the structure of the data is reflected in the likelihood.
Therefore, when there still exists extra correlation in a hierarchial or random intercept Pois-
son model, which itself is designed to adjust for overdispersion resulting from a clustering
variable or variables, then a hierarchical random intercept negative binomial model should
be used to model the data.
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
254 Hierarchical GLMMs
x1 <- runif(N)
x2 <- runif(N)
----------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.70860 0.07061 -10.04 <2e-16 ***
----------------------------------------------------
No. of observations in the fit: 8000
Degrees of Freedom for the fit: 5
Residual Deg. of Freedom: 1995
at cycle: 1
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
255 8.6 Bayesian Negative Binomial GLMMs
Code 8.21 Bayesian random intercept negative binomial mixed model in Python using
pymc3.
==================================================
import numpy as np
import pymc3 as pm
import statsmodels.api as sm
# Data
np.random.seed(1656) # set seed to replicate example
N = 2000 # number of obs in model
NGroups = 10
x1 = uniform.rvs(size=N)
x2 = uniform.rvs(size=N)
y = nbinom.rvs(mu, 0.5)
# Define likelihood
y = pm.NegativeBinomial(’y’, mu=np.exp(eta), alpha=alpha, observed=y)
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
256 Hierarchical GLMMs
# Fit
start = pm.find_MAP() # Find starting value by optimization
step = pm.NUTS(state=start) # Initiate sampling
trace = pm.sample(7000, step, start=start)
Posterior quantiles:
2.5 25 50 75 97.5
|----------------|==============|==============|------------------|
0.693 0.939 1.041 1.164 1.409
beta2:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
0.217 0.072 0.001 [0.078, 0.356]
Posterior quantiles:
2.5 25 50 75 97.5
|-----------------|==============|==============|-----------------|
0.075 0.168 0.217 0.267 0.355
beta3:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
-0.824 0.075 0.001 [-0.973, -0.684]
Posterior quantiles:
2.5 25 50 75 97.5
|-----------------|==============|==============|-----------------|
-0.970 -0.873 -0.825 -0.773 -0.679
a_param:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
0.342 0.180 0.005 [-0.002, 0.715]
0.235 0.182 0.005 [-0.133, 0.589]
...
sigma_a:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
0.507 0.148 0.003 [0.286, 0.811]
Posterior quantiles:
2.5 25 50 75 97.5
|------------------|==============|==============|-----------------|
0.304 0.406 0.478 0.575 0.872
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
257 8.6 Bayesian Negative Binomial GLMMs
# Likelihood
for (i in 1:N) {
Y[i] ˜ dnegbin(p[i], 1/alpha)
p[i] <- 1 /( 1 + alpha * mu[i])
log(mu[i]) <- eta[i]
eta[i] <- inprod(beta[], X[i,]) + a[re[i]]
}
}
",fill = TRUE)
sink()
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
258 Hierarchical GLMMs
# Identify parameters
params <- c("beta", "a", "sigma.re", "tau.re", "alpha")
The betas are similar to the gamlss results, as is the value for alpha. The gamlss neg-
ative binomial dispersion parameter is given in log form as sigma intercept. The value
displayed is −0.70860. Exponentiating the value produces the actual dispersion statistic,
0.492 333. This compares nicely with the above Bayesian model result, 0.493.
Recall that the gamlss function, which can be downloaded from CRAN, parameter-
izes the negative binomial dispersion parameter alpha in a direct manner. That is, the
more variability there is in the data, the greater the value of alpha. There is a direct
relationship between alpha and mu. If alpha is zero, or close to zero, the model is
Poisson.
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
259 8.6 Bayesian Negative Binomial GLMMs
The glm function of R uses an indirect parameterization, calling the dispersion param-
eter theta, where α = 1/θ . Some statisticians who use R prefer to view the dispersion
parameter as 1/θ , which is in fact the same as alpha. Note that all commercial statistical
software use a direct parameterization for negative binomial models. You will find, how-
ever, many new books on the market that use R in examples follow R’s glm in inverting
the dispersion parameter. Care must be taken, when interpreting negative binomial models,
regarding these differing software conventions, and even when using different functions
within the same package. For those readers who prefer to use an indirect parameterization,
simply substitute the lines of code marked # in the Likelihood section in place of the
similar lines above.
Also note that in the above Bayesian output, sigma.re, which is the mean standard
deviation of the model random intercepts, has the value 0.499. This is very close to the
value of z in the gamlss model output above, which is in variance form: squaring sigma.re
results in the z statistic from the gamlss output.
8.6.5 Bayesian Random Intercept Negative Binomial Model in Python using Stan
We show below the same model in Python using Stan. The synthetic data used here
was generated using Code 8.21. The user should be aware of the difference between
parameterizations built in scipy and Stan, which need to be taken into account in the
likelihood definition.
Code 8.23 Bayesian random intercept negative binomial in Python using Stan.
==================================================
import pystan
X = sm.add_constant(np.column_stack((x1,x2)))
K = X.shape[1]
model_data = {}
model_data[’Y’] = y
model_data[’X’] = X
model_data[’K’] = K
model_data[’N’] = N
model_data[’NGroups’] = NGroups
model_data[’re’] = Groups
model_data[’b0’] = np.repeat(0, K)
model_data[’B0’] = np.diag(np.repeat(100, K))
model_data[’a0’] = np.repeat(0, NGroups)
model_data[’A0’] = np.diag(np.repeat(1, NGroups))
# Fit
stan_code = """
data{
int<lower=0> N;
int<lower=0> K;
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
260 Hierarchical GLMMs
int<lower=0> NGroups;
matrix[N, K] X;
int Y[N];
int re[N];
vector[K] b0;
matrix[K, K] B0;
vector[NGroups] a0;
matrix[NGroups, NGroups] A0;
}
parameters{
vector[K] beta;
vector[NGroups] a;
real<lower=0> sigma_re;
real<lower=0> alpha;
}
transformed parameters{
vector[N] eta;
vector[N] mu;
eta = X * beta;
for (i in 1:N){
mu[i] = exp(eta[i] + a[re[i]+1]);
}
}
model{
sigma_re ˜ cauchy(0, 25);
alpha ˜ cauchy(0, 25);
# Output
nlines = 20 # number of lines in screen output
output = str(fit).split(’\n’)
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
261 8.6 Bayesian Negative Binomial GLMMs
Further Reading
Downloaded from https:/www.cambridge.org/core. Boston University Theology Library, on 29 May 2017 at 13:02:40, subject to the Cambridge Core terms of use,
available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.009
9 Model Selection
Model selection is of foremost concern in both frequentist and Bayesian modeling. Selec-
tion is nearly always based on comparative tests, i.e. two or more models are evaluated and
one is determined as having the better fit. This started out as a way to determine whether
a particular model with predictors x1, x2, and x3 was superior to the same model using
only predictors x1 and x2. The goal was to determine whether x3 significantly contributed
to the model. For frequentist-based models, the likelihood ratio test and deviance test were
standard ways to determine which model gave the better fit in comparison with alternative
models. The likelihood ratio test is currently a very popular method to test predictors as to
the best model.
The likelihood ratio test and deviance test require that the models being compared are
nested, i.e. one model is nested within a model with more predictors. Broadly speak-
ing, these tests are not appropriate for comparing non-nested models. It was not until the
development of information criteria tests that non-nested models could be compared in a
meaningful manner. The problem with information criteria tests, though, is that there is no
clear way to determine whether one model gives a significantly better fit than another. This
problem still exists for information criteria tests.
In the frequentist tradition, in which the majority of models use some type of maximum
likelihood algorithm for the estimation of parameters, the most used tests are the AIC and
BIC tests and a host of variations. The AIC test is an acronym for the Akaike information
criterion, named after Hirotugu Akaike (1927–2009), the Japanese statistician who first
developed the test in 1973. Not published until 1974, the statistic has remained the most
used test for model selection. Another information test called the Bayesian information
criterion test (BIC) was developed by Gideon E. Schwarz (1933–2007) in 1978; this test is
commonly referred to as the Schwarz criterion.
The AIC test is given as
AIC = −2L + 2k = −2(L − k) (9.1)
where L is the log-likelihood of the model and k is the number of predictors and param-
eters, including the model intercept. If the model predictors and intercept are regarded
as parameters then k is simply the number of parameters in the model. This constitutes
262
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:35:47, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.010
263 9.1 Information Criteria Tests for Model Selection
an adjustment to the likelihood for additional model parameters, which biases the statis-
tic. The more parameters or predictors in a model, the better fitted the model becomes.
So, the above adjustment is used to balance the likelihood for models with more param-
eters. The model with a lower AIC statistic is regarded as the better fitted model. The
AIC statistic has been amended when comparing models with different numbers of obser-
vations n by dividing the statistic by n. Both these forms of AIC are found in statistical
outputs.
The Bayesian information criterion, which does not concern Bayesian models within
the frequentist modeling framework, was originally based on the model deviance function
rather than on the likelihood. But the test has been converted to log-likelihood form as
The penalty term is the number of parameters times the log of the number of model obser-
vations. This is important, especially when comparing models with different numbers of
observations. Again, as with the AIC statistic, a model is considered a better fit if its BIC
statistic is lower than that from another model.
For both the AIC and BIC statistics it is assumed that the models being compared come
from the same family of probability distributions and that there are significantly more
observations in the model than parameters. It is also assumed that the data are independent,
i.e., that they are not correlated or clustered. For model comparisons between clustered or
panel-level models, no commonly accepted statistic is available (see the book Hardin and
Hilbe, 2012). This is the case for the AIC and BIC statistics as well. A number of alterna-
tives have been developed in the literature to account for the shortcomings of using AIC
and BIC.
We should also define the deviance statistic, since it has been used for goodness-of-fit
assessment:
Deviance = −2(L (y, y) − L (μ, y)). (9.3)
Thus the deviance is minus twice the difference between the “saturated” log-likelihood
and the model log-likelihood. The “saturated” likelihood corresponds to an overparame-
terized model, which represents no more than an interpolation of the data. It is obtained
by substituting a y for every μ in the log-likelihood formula for the model. Lower values
indicate a smaller difference between the observed and predicted model values.
Remember that in a maximum likelihood model the parameters that are being estimated
are fixed. They are unknown parameters of the probability distribution giving rise to the
data being modeled. The model is in fact an attempt to determine the most unbiased
estimate of the unknown, but fixed, parameters on the basis of the given data. Bayesian
modeling, however, views parameters in an entirely different way. They are random distri-
butions and are not at all fixed. In Bayesian modeling we attempt to determine the posterior
distribution for each parameter in the model. The goal is to calculate the mean (or median
or mode, depending on the type of distribution involved), standard deviation, and credi-
ble intervals of each posterior distribution in the model. This typically requires engaging
in an MCMC type of sampling to determine the appropriate posterior distributions given
that they are all products of the likelihood and associated prior distributions. This means
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:35:47, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.010
264 Model Selection
that the deviance and information criteria for Bayesian models must be based on posterior
distributions instead of fixed parameters. We address this concern and interpretation in the
following subsections.
following the brace closing the calculation of the log-likelihood for each observation in the
model. Of course, we have assumed here that LL[i] is the symbol for the log-likelihood
being used. The code for summing the individual log-likelihoods is calculated directly
following the braces defining the log-likelihood module. We can see this better by showing
the following JAGS snippet:
model{
# Priors
# Likelihood
for (I in 1:N) {
# any preparatory code
We refer to the JAGS Code 5.22 for the binary probit model in Section 5.3.2 for how
these statistics appear in a complete model code. The Bayesian deviance statistic is deter-
mined by multiplying the model log-likelihood by −2. Observe the output for the binary
probit model mentioned above: the log-likelihood value is displayed using the code we
have just explained. Multiply that value, which is displayed in the output as -814.954, by
−2 and the result is 1629.908, which is the value displayed in the output for the deviance.
Note that this varies from the frequentist definition of deviance. The value is important
since it can be used as a rough fit statistic for nested models, with lower values indicating
a better fit. The deviance is also used in the creation of the pD and deviance information
criterion (DIC) statistic, which is analogous to the AIC statistic for maximum likelihood
models:
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:35:47, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.010
265 9.1 Information Criteria Tests for Model Selection
Deviance = -2 * loglikelihood
The deviance can be calculated within the JAGS code. One simply adds the following line
directly under the log-likelihood function in the model module,
Then you need to add Dev to the params line. We can amend the binary probit model we
have been referring to in such a way that the deviance is calculated using the following
JAGS code:
model{
# Priors
beta ˜ dmnorm(b0[], B0[,])
# Likelihood
for (i in 1:N){
Y[i] ˜ dbern(p[i])
probit(p[i]) <- max(-20, min(20, eta[i]))
eta[i] <- inprod(beta[], X[i,])
LLi[i] <- Y[i] * log(p[i]) +
(1 - Y[i]) * log(1 - p[i])
dev[i] <- -2 * LLi[i]
}
LogL <- sum(LLi[1:N])
Dev <- sum(dev[1:N])
AIC <- -2 * LogL + 2 * K
BIC <- -2 * LogL + LogN * K
}
Notice that we have retained the AIC and BIC statistics that were calculated in the
probit code. These are frequentist versions of AIC and BIC, but with diffuse priors they are
satisfactory.
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:35:47, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.010
266 Model Selection
does 2K for the AIC statistic and as does log(N) K for the BIC statistic. Recall that, for a
given model, adding extra parameters to the model generally results in a better model fit.
To adjust for this effect, statisticians have designed various penalty terms to add on to the
log-likelihood when it is used as a comparative fit statistic. We have provided you with
the two most well used penalties, which comprise the standard AIC and BIC statistics. The
DIC statistic, or deviance information criterion, employs the pD statistic as its penalty term.
The pD statistic evaluates the number of Bayesian parameters that are actually used in the
model. Bayesian models, and in particular hierarchical models, can have a large number of
parameters, which may be adjusted to reflect their respective contribution to the model.
The pD statistic can be regarded as a measure of model complexity and is typically
calculated as the posterior mean deviance minus the deviance of the posterior means. This
is not a simple statistic to calculate, but there is a shortcut: it may also be determined by
taking one-half the deviance variance. In JAGS language, this appears as
pD <- var(Dev)/2
In Code 5.22, the variance of the deviance is not provided in the output but the standard
deviation is given as 2.376. The standard deviation results in the variance. Half that then is
the pD statistic:
> (2.376^2)/2
[1] 2.822688
One simple approach for variable selection is to define an indicator function Ij that sets a
probability for each variable to enter the model:
θ = θ0 + θ1 x1 + · · · + θP xP ,
θj = Ij × βj , (9.4)
Ij ∈ {0, 1}.
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:35:47, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.010
267 9.2 Model Selection with Indicator Functions
y ∼ Normal(μi , σ 2 )
μi = θ0 + θ1 x1 + · · · + θP xP
θj = Ij βj
Ij ∼ Bernoulli(π )
βj ∼ Normal(0, τ −1 ) (9.5)
τ ∼ Uniform(0, 10)
σ ∼ Gamma(10−3 , 10−3 )
i = 1, . . . , N
j = 1, . . . , P
The addition of the extra parameter τ in βj helps to avoid the poor mixing of chains when
I0 = 0. This happens because, since the priors for I0 and β are independent, any value
produced by the MCMC algorithm will have no effect in the likelihood when I0 = 0. The
probability π can be adjusted depending on the level of desired shrinkage. For instance, if
we set π = 0.5, we are expecting on average 50% of the variables to remain in the model;
if we set π = 0.1, we expect 10%; and so forth. An alternative is to add an informative beta
prior on π itself, π ∼ Beta(a, b). In the following we show how to implement the method
in JAGS given in Code 9.3.
First we generate some synthetic multivariate data with an average correlation of 0.6
between the predictors:
# Data
set.seed(1056)
nobs <- 500 # number of samples
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:35:47, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.010
268 Model Selection
# Covariance matrix
d <- length(beta)
Sigma <- toeplitz(c(1, rep(rho, d - 1)))
Mu <- c(rep(0,d))
# Multivariate sampling
M <- mvrnorm(nobs, mu = Mu, Sigma = Sigma )
xb <- M %*% beta
# Dependent variable
y <- rnorm(nobs, xb, sd = 2)
==================================================
To visualize the correlation matrix for our data, we can use the package corrplot:
require(corrplot)
corrplot(cor(M), method="number",type="upper")
with the output shown in Figure 9.1. As we expected, the average correlation between all
variables is around 0.6 (see the article de Souza and Ciardi, 2015, for other examples of
the visualization of high-dimensional data).
Now we will fit the data with a normal model to check how well we can recover the
coefficients.
Code 9.2 Normal model applied to the multivariate synthetic data from Code 9.1.
==================================================
require(R2jags)
# Fit
NORM <-" model{
# Diffuse normal priors for predictors
for (i in 1:K) { beta[i] ~ dnorm(0, 0.0001) }
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:35:47, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.010
269 9.2 Model Selection with Indicator Functions
10
11
12
13
14
15
1
9
1
1 1 0.63 0.58 0.61 0.59 0.6 0.59 0.62 0.57 0.6 0.58 0.6 0.65 0.58 0.58
2 1 0.6 0.64 0.58 0.64 0.63 0.62 0.63 0.64 0.62 0.65 0.59 0.59 0.59 0.8
3 1 0.6 0.61 0.6 0.6 0.64 0.6 0.59 0.61 0.59 0.61 0.6 0.59
0.6
4 1 0.6 0.63 0.63 0.67 0.62 0.62 0.65 0.61 0.64 0.6 0.64
5 1 0.58 0.59 0.64 0.62 0.63 0.64 0.6 0.65 0.58 0.58 0.4
14 1 0.63 −0.8
15 1
t
−1
Figure 9.1 Correlation matrix for the synthetic multivariate normal data set. (A black and white version of this figure will appear
in some formats. For the color version, please refer to the plate section.)
# Likelihood function
for (i in 1:N){
Y[i]~dnorm(mu[i],tau)
mu[i] <- eta[i]
eta[i] <- inprod(beta[], X[i,])
}
}"
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:35:47, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.010
270 Model Selection
# MCMC
NORM_fit <- jags(data = jags_data,
inits = inits,
parameters = params,
model = textConnection(NORM),
n.chains = 3,
n.iter = 5000,
n.thin = 1,
n.burnin = 2500)
==================================================
In order to visualize how close the coefficients are to the original values, we show a
caterpillar plot in Figure 9.2, which can be achieved with the following code:
require(mcmcplots)
caterplot(NORM_fit,"beta",denstrip = T,
greek = T,style = "plain", horizontal = F,
reorder = F,cex.labels = 1.25,col="gray35")
caterpoints(beta, horizontal=F, pch="x", col="cyan")
β1 x
β2 x
β3 x
β4 x
β5 x
β6 x
β7 x
β8 x
β9 x
β10 x
β11 x
β12 x
β13 x
β14 x
β15 x
t
−2 0 2 4 6
Figure 9.2 Visualization of results from the normal model fit to synthetic multivariate data from Code 9.1. The gray regions
correspond to the posteriors of each parameter and the crosses to their true values.
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:35:47, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.010
271 9.2 Model Selection with Indicator Functions
Note that, although the normal model recovers the coefficients well, it does not set any of
them to exactly zero, so all the coefficients remain in the model. Next we show how to
implement the variable selection model in JAGS, placing an informative prior of 0.2 in the
probability for a given predictor to stay in the model:
Code 9.3 K and M model applied to multivariate synthetic data from Code 9.1.
==================================================
NORM_Bin <-" model{
# Diffuse normal priors for predictors
tauBeta <- pow(sdBeta,-2);
sdBeta ~ dgamma(0.01,0.01)
PInd <- 0.2
for (i in 1:K){
Ind[i] ˜ dbern(PInd)
betaT[i] ˜ dnorm(0,tauBeta)
beta[i] <- Ind[i]*betaT[i]
}
# Uniform prior for standard deviation
tau <- pow(sigma, -2) # precision
sigma ~ dgamma(0.01,0.01) # standard deviation
# Likelihood function
for (i in 1:N){
Y[i]˜dnorm(mu[i],tau)
mu[i] <- eta[i]
eta[i] <- inprod(beta[], X[i,])
}
}"
==================================================
Figure 9.3 shows the estimated variables for the K and M model. Note that now the
model actually set the coefficients to zero, so we can find a sparse solution for it.
The equivalent code in Python using Stan is displayed below. Notice that, given the dif-
ferent parameterizations for the Gaussian distribution built in JAGS and Stan, the exponents
for tau and beta have opposite signs.
# Data
np.random.seed(1056)
nobs = 500
nvar = 15
rho = 0.6
p = bernoulli.rvs(0.2, size=nvar)
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:35:47, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.010
272 Model Selection
β1 x
β2 x
β3 x
β4 x
β5 x
β6 x
β7 x
β8 x
β9 x
β10 x
β11 x
β12 x
β13 x
β14 x
β15 x
t
−2 0 2 4 6
Figure 9.3 Visualization of results from the K and M model fit to synthetic multivariate data from Code 9.1. The gray regions
correspond to the posteriors of each parameter and the crosses to their true values.
# Covariance matrix
d = beta.shape[0]
Sigma = toeplitz(np.insert(np.repeat(rho, d-1), 0, 1))
# Fit
mydata = {}
mydata[’X’] = M - 1.0
mydata[’K’] = mydata[’X’].shape[1]
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:35:47, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.010
273 9.2 Model Selection with Indicator Functions
mydata[’Y’] = y
mydata[’N’] = nobs
mydata[’Ind’] = bernoulli.rvs(0.2, size = nvar)
stan_model = ’’’
data{
int<lower=0> N;
int<lower=0> K;
matrix[N, K] X;
vector[N] Y;
int Ind[K];
}
parameters{
vector[K] beta;
real<lower=0> sigma;
real<lower=0> sdBeta;
}
transformed parameters{
vector[N] mu;
real<lower=0> tau;
real<lower=0> tauBeta;
mu = X * beta;
tau = pow(sigma, 2);
tauBeta = pow(sdBeta, 2);
}
model{
for (i in 1:K){
if (Ind[i] > 0) beta[i] ˜ normal(0, tauBeta);
}
sigma ˜ gamma(0.01, 0.01);
Y ˜ normal(mu, tau);
}
’’’
# Output
nlines = 21 # number of lines in screen output
output = str(fit).split(’\n’)
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:35:47, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.010
274 Model Selection
beta[5] -0.32 6.5e-3 0.26 -0.84 -0.5 -0.32 -0.15 0.17 1593 1.0
beta[6] -0.59 6.7e-3 0.25 -1.06 -0.77 -0.6 -0.43 -0.07 1434 1.0
beta[7] -0.28 6.9e-3 0.25 -0.75 -0.44 -0.28 -0.11 0.21 1264 1.0
beta[8] -1.29 8.1e-3 0.27 -1.83 -1.46 -1.29 -1.12 -0.77 1084 1.0
beta[9] 0.28 6.6e-3 0.25 -0.21 0.12 0.28 0.44 0.78 1415 1.0
beta[10] -2.41 6.7e-3 0.25 -2.9 -2.57 -2.4 -2.24 -1.93 1361 1.0
beta[11] -0.2 8.6e-3 0.26 -0.69 -0.37 -0.2 -0.02 0.3 882 1.0
beta[12] -0.12 6.1e-3 0.24 -0.59 -0.28 -0.12 0.05 0.36 1578 1.0
beta[13]-3.2e-3 6.8e-3 0.25 -0.49 -0.17 3.5e-4 0.16 0.48 1319 1.0
beta[14] -0.26 7.5e-3 0.25 -0.78 -0.43 -0.25 -0.08 0.24 1154 1.0
sigma 1.92 8.0e-4 0.03 1.86 1.89 1.91 1.94 1.98 1374 1.0
The least absolute shrinkage and selection operator (LASSO) is an alternative approach to
the use of indicators in a model. The prior shrinks the coefficients βj towards zero unless
there is strong evidence for them to remain in the model.
The original LASSO regression was proposed by Tibshirani (1996) to automatically
select a relevant subset of predictors in a regression problem by shrinking some coefficients
towards zero (see also Uemura et al., 2015, for a recent application of LASSO for modeling
type Ia supernovae light curves). For a typical linear regression problem,
yi = β0 + β1 x1 + · · · + βp xp + ε, (9.6)
where ε denotes a Gaussian noise. LASSO estimates the linear regression coefficients β =
β0 + β1 x1 + · · · + βp xp by imposing a L1 -norm penalty in the form
⎧ ⎛ ⎞2 ⎫
⎪
⎨ ⎪
⎬
N p p
argmin ⎝yi − βj xij ⎠ + κ |βj | , (9.7)
β ⎪ ⎩ i=1 ⎪
⎭
j=1 j=1
where κ ≥ 0 is a constant that controls the level of sparseness in the solution. The number
of zero coefficients thereby increases as κ increases. Tibshirani also noted that the LASSO
estimate has a Bayesian counterpart when the β coefficients have a double-exponential
prior (i.e., a Laplace prior) distribution:
1 |x|
f (x, τ ) = exp − , (9.8)
2τ τ
where τ = 1/κ is the scale. The idea was further developed in what is known as Bayesian
LASSO (see e.g., de Souza et al., 2015b; Park et al., 2008). The role of the Laplace
prior is to assign more weight to regions either near to zero or in the distribution tails
than would a normal prior. The implementation in JAGS is very straightforward; the
researcher just needs to replace the normal priors in the β coefficients in the Laplace
prior by
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:35:47, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.010
275 9.3 Bayesian LASSO
...
transformed parameters{
...
tauBeta = pow(sdBeta, -1);
...
}
model{
...
for (i in 1:K) beta[i] ˜ double_exponential(0, tauBeta);
...
}
...
We refer the reader to O’Hara and Sillanpää (2009) for a practical review of Bayesian
variable selection with code examples.
Further Reading
Gelman, A., J. Carlin, H. Stern, D. Dunson, A. Vehtari, and D. Rubin (2013). Bayesian
Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor
& Francis.
Gelman, A., J. Hwang, and A. Vehtari (2014). “Understanding predictive informa-
tion criteria for Bayesian models.” Statist. Comput. 24(6), 997–1016. DOI:
10.1007/s11222-013-9416-2.
O’Hara, R. B. and M. J. Sillanpää (2009). “A review of Bayesian variable selection
methods: what, how and which.” Bayesian Anal. 4(1), 85–117. DOI: 10.1214/09-ba403.
Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. Van der Linde (2002). “Bayesian
measures of model complexity and fit.” J. Royal Statist. Soc.:Series B (Stastistical
Methodology) 64(4), 583–639. DOI: 10.1111/1467-9868.00353.
Vehtari, A. and J. Ojanen (2012). “A survey of Bayesian predictive methods
for model assessment, selection and comparison.” Statist. Surv. 6, 142–228. DOI:
10.1214/12-SS102.
Downloaded from https:/www.cambridge.org/core. Columbia University Libraries, on 21 Jun 2017 at 10:35:47, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.010
10 Astronomical Applications
This chapter presents a series of astronomical applications using some of the models pre-
sented earlier in the book. Each section concerns a specific type of astronomical data
situation and the associated statistical model. The examples cover a large variety of top-
ics, from solar (sunspots) to extragalactic (type Ia supernova) data, and the accompanying
codes were designed to be easily modified in order to include extra complexity or alter-
native data sets. Our goal is not only to demonstrate how the models presented earlier
can impact the classical approach to astronomical problems but also to provide resources
which will enable both young and experienced researchers to apply such models in their
daily analysis.
Following the same philosophy as in previous chapters, we provide codes in R/JAGS
and Python/Stan for almost all the case studies. The exceptions are examples using type Ia
supernova data for cosmological parameter estimation and approximate Bayesian compu-
tation (ABC). In the former we take advantage of the Stan ordinary differential equation
solver, which, at the time of writing, is not fully functional within PyStan.1 Thus, we take
the opportunity to show one example of how Stan can also be easily called from within R.
The ABC approach requires completely different ingredients and, consequently, different
software. Here we have used the cosmoabc2 Python package in order to demonstrate how
the main algorithm works in a simple toy model. We also point the reader to the main
steps towards using ABC for cosmological parameter inference from galaxy cluster num-
ber counts. This is considered an advanced topic and is presented as a glimpse of the
potential of Bayesian analysis beyond the exercises presented in previous chapters. Many
models discussed in this book represent a step forward from the types of models generally
used by the astrophysical community. In the future we expect that Bayesian methods will
be the predominant statistical approach to the analysis of astrophysical data.
Accessing Data
For all the examples presented in this chapter we will use publicly available astronomical
data sets. These have been formatted to allow easy integration with our R and Python
codes. All the code snippets in this chapter contain a path_to_data variable, which has
the format path_to_data = "∼ <some path>", where the symbol ∼ should be substituted
for the complete path to our GitHub repository.3 In doing so, given an internet connection
1 This should not be an issue if you are using a PyStan version higher than 2.9.0.
2 Developed by the COsmostatistics INitiative (COIN).
3 https://raw.githubusercontent.com/astrobayes/BMAD/master
276
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
277 10.1 Normal Model, Black Hole Mass, and Bulge Velocity Dispersion
the data will be read on the fly from the online source. This format was chosen to avoid
long paths within the code snippets.
10.1 Normal Model, Black Hole Mass, and Bulge Velocity Dispersion
originally known as the Faber–Jackson law for black holes. Here M is the mass of the
Sun.5 Subsequently, Ferrarese and Merritt (2000) and Gebhardt et al. (2000) reported
a correlation tighter than had previously been expected but with significantly different
slopes, which started a vivid debate (see Harris et al. 2013, hereafter H2013, and refer-
ences therein). Merritt and Ferrarese (2001) showed that such discrepancies were due to
the different techniques used to derive M• . Since then an updated relation has allowed the
determination of central black hole masses in distant galaxies, where σe is easily measured
(Peterson, 2008).
In this example we follow H2013, modeling the M• –σe relation as
M• σ
log = α + β log , (10.2)
M σ0
where σ0 is a reference value, usually chosen to be 200 km s−1 (Tremaine et al., 2002);
α, β are the linear predictor coefficients to be determined. Notice that in order to make
the notation lighter we have suppressed the subscript e from the velocity dispersion in
Equation 10.2.
4 Note that this example is also addressed in the book by Andreon and Weaver (2015), who considered a different
data set.
5 In astronomy the solar mass is a unit of measurement, which corresponds to approximately 2 × 1030 kg.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
278 Astronomical Applications
10.1.1 Data
The data set used in this example was presented by H2013 (see also Harris et al., 2014).6
This is a compilation of literature data from a variety of sources obtained with the Hubble
Space Telescope as well as with a wide range of other ground-based facilities. The original
data set was composed of data from 422 galaxies, 46 of which have available measurements
of M• and σ .
Mi ∼ Normal(Mitrue , εM;i
2
)
Mitrue ∼ Normal(μi , ε2 )
μi = α + βσi
σi ∼ Normal(σitrue , εσ2 ;i )
σitrue ∼ Normal(0, 103 ) (10.3)
α ∼ Normal(0, 10 ) 3
β ∼ Normal(0, 103 )
ε2 ∼ Gamma(10−3 , 10−3 )
i = 1, . . . , N
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
279 10.1 Normal Model, Black Hole Mass, and Bulge Velocity Dispersion
Code 10.1 Normal linear model in R using JAGS for accessing the relationship between
central black hole mass and bulge velocity dispersion.
==================================================
require(R2jags)
# Data
path_to_data = "˜/data/Section_10p1/M_sigma.csv"
# Read data
MS<-read.csv(path_to_data,header = T)
# Identify variables
N <- nrow(MS) # number of data points
obsx <- MS$obsx # log black hole mass
errx <- MS$errx # error on log black hole mass
obsy <- MS$obsy # log velocity dispersion
erry <- MS$erry # error on log velocity dispersion
# Fit
NORM_errors <- "model{
for (i in 1:N){
obsx[i] ˜ dnorm(x[i], pow(errx[i], -2))
obsy[i] ˜ dnorm(y[i], pow(erry[i], -2)) # likelihood function
y[i] ˜ dnorm(mu[i], tau)
mu[i] <- alpha + beta*x[i] # linear predictor
}
}"
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
280 Astronomical Applications
# Identify parameters
params0 <- c("alpha","beta", "epsilon")
# Fit
NORM_fit <- jags(data = MS_data,
inits = inits,
parameters = params0,
model = textConnection(NORM_errors),
n.chains = 3,
n.iter = 50000,
n.thin = 10,
n.burnin = 30000
)
# Output
print(NORM_fit,justify = "left", digits=2)
==================================================
mu.vect sd.vect 2.5% 97.5% Rhat n.eff
alpha 8.35 0.05 8.24 8.46 1 3000
beta 4.44 0.33 3.80 5.06 1 3000
epsilon 0.27 0.06 0.17 0.39 1 3000
deviance -225.99 13.84 -250.85 -196.55 1 3000
The above results are consistent with the values found by H2013. The present authors used
the method presented in Tremaine et al. (2002), which searched for α and β parameters
that minimize
N
[yi − α − β(xi − x)]2
χ2 = 2 + ε 2 ) + β 2 (σ 2 + ε 2 )
. (10.4)
i=1
(σy;i y x;i x
From Table 2 of H2013 the reported parameter values are α = 8.412 ± 0.067 and β =
4.610±0.403. The general trend is recovered, despite the distinct assumptions in modeling
and error treatment (see H2013, Section 4.1).
An indication of how the final model corresponds to the original data is shown in
Figure 10.1, where the dashed line represents the mean fitted result and the shaded
regions denote 50% (darker) and 95% (lighter) prediction intervals. Posterior marginal
distributions for the intercept α, slope β, and intrinsic scatter ε are shown in Figure 10.2.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
281 10.1 Normal Model, Black Hole Mass, and Bulge Velocity Dispersion
11
10
log (M M )
8
5
−0.4 −0.2 0.0 0.2 0.4
log (σ/σ0)
tFigure 10.1 Supermassive black hole mass as a function of bulge velocity dispersion described by a Gaussian model with errors in
measurements. The dashed line represents the mean and the shaded areas represent the 50% (darker) and 95%
(lighter) prediction intervals. The dots and associated error bars denote the observed values and measurement errors
respectively. (A black and white version of this figure will appear in some formats. For the color version, please refer to
the plate section.)
defined in the parameters block with the domain of the prior distributions declared in the
model block (see Team Stan, 2016, Section 3.2).
In what follows we have employed a more compact notation than that presented in
Code 10.1. If you want to save information about the linear predictor mu you can define
it in the transformed parameters block (Section 2.5).
Code 10.2 Normal linear model, in Python using Stan, for assessing the relationship
between central black hole mass and bulge velocity dispersion.
==================================================
import numpy as np
import pandas as pd
import pystan
path_to_data = ˜/data/Section_10p1/M_sigma.csv
# Read data
data_frame = dict(pd.read_csv(path_to_data))
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
282 Astronomical Applications
epsilon
200
150
100
50
Frequency
0
200
200
150
150
100
100
50
50
0
8.1 8.2 8.3 8.4 8.5 8.6 3.5 4.0 4.5 5.0 5.5
t
Posterior distribution
Figure 10.2 Posterior distributions for the intercept (alpha), slope (beta), and intrinsic scatter (epsilon). The horizontal
thick lines mark the 95% credible intervals.
data[’obsx’] = np.array(data_frame[’obsx’])
data[’errx’] = np.array(data_frame[’errx’])
data[’obsy’] = np.array(data_frame[’obsy’])
data[’erry’] = np.array(data_frame[’erry’])
data[’N’] = len(data[’obsx’])
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
283 10.2 Gaussian Mixed Models, Type Ia Supernovae, and Hubble Residuals
model{
# likelihood and priors
alpha ˜ normal(0, 1000);
beta ˜ normal(0, 1000);
epsilon ˜ gamma(0.001, 0.001);
for (i in 1:N){
x[i] ˜ normal(0, 1000);
y[i] ˜ normal(0, 1000);
}
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=data, iter=15000, chains=3,
warmup=5000, thin=10, n_jobs=3)
==================================================
# Output
nlines = 8 # number of lines in screen output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
alpha 8.35 1.8e-3 0.06 8.24 8.31 8.35 8.39 8.46 995.0 1.0
beta 4.45 0.01 0.34 3.77 4.22 4.45 4.68 5.1 977.0 1.0
epsilon 0.29 1.9e-3 0.06 0.18 0.25 0.28 0.32 0.4 948.0 1.0
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
284 Astronomical Applications
(stretch).7 This data treatment successfully accounts for most SNe Ia light-curve vari-
ability. However, a 1σ variability of ≈ 0.1 magnitude still remains, which trans-
lates into a 5% uncertainty in distance. In this context, understanding which host
galaxy characteristics correlate with Hubble residuals (the difference between the dis-
tance modulus after standardization and that predicted from the best-fit cosmological
model) can significantly improve the cosmological parameter constraints from SNe
Ia. Moreover, different classification methods (spectroscopic or photometric) can intro-
duce biases in the host galaxy–SNe relationship, which should also be taken into
account.
Wolf et al. (2016) approached this problem by using a mixed sample of photo-
metrically classified and spectroscopically confirmed SNe Ia to study the relationship
between host galaxy and SNe properties, focusing primarily on correlations with Hub-
ble residuals. As a demonstration of a normal model with varying intercepts and
slopes, we perform a similar analysis in order to probe the correlation between HRs
and host galaxy mass for the spectroscopic (Spec-IA) and photometric (Phot-IA)
samples.
10.2.1 Data
The data used in our example was presented by Wolf et al. (2016). It consists of
N = 345 SNe Ia from the Sloan Digital Sky Survey – Supernova Survey (SDSS-SN,
Sako et al. 2014), from which 176 SNe were photometrically classified and 169 were
spectroscopically confirmed as Ia. Host galaxy characteristics were derived mainly on
the basis of spectra from the SDSS-III Baryon Oscillation Spectroscopic Survey (BOSS,
Eisenstein et al. 2011).
In what follows we will use only the subset of this sample comprising the host galaxy
mass and Hubble residuals and their respective measurement errors.
7 This correction is based on empirical knowledge, since there is no consensus model determining which physical
elements (environmental effects, different progenitor systems, etc; see e.g. Hillebrandt and Niemeyer, 2000 and
Maoz et al., 2014) are responsible for such variations.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
285 10.2 Gaussian Mixed Models, Type Ia Supernovae, and Hubble Residuals
where we have used a non-informative prior for σ and a common normal hyperprior i.e., a
prior associated with the parameters of a prior distribution for the parameters β, connected
through μ0 (the mean) and σ0 (the standard deviation).
Code 10.3 Gaussian linear mixed model, in R using JAGS, for modeling the relationship
between type Ia supernovae host galaxy mass and Hubble residuals.
==================================================
library(R2jags)
# Data
path_to_data = "˜/data/Section_10p2/HR.csv"
dat <- read.csv(path_to_data, header = T)
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
286 Astronomical Applications
errx1 = errx1,
erry = erry,
K = 2,
N = nobs,
type = type)
# Fit
NORM_errors <-" model{
tau0˜dunif(1e-1,5)
mu0˜dnorm(0,1)
# Run MCMC
NORM <- jags(
data = jags_data,
inits = inits,
parameters = params0,
model = textConnection(NORM_errors),
n.chains = 3,
n.iter = 40000,
n.thin = 1,
n.burnin = 15000)
# Output
print(NORM,justify = "left", digits=3)
==================================================
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
287 10.2 Gaussian Mixed Models, Type Ia Supernovae, and Hubble Residuals
Table 10.1 For comparison, our results (in R using JAGS) and those reported by Wolf et al.
(2016) for the correlation between the Hubble residuals and the host galaxy mass.
Spec-Ia Photo-Ia
Wolf et al. (2016) 0.287 ± 0.188 1.042 ± 0.270
Intercept GLMM – no hyperpriors 0.275 ± 0.200 0.939 ± 0.260
GLMM – with hyperpriors 0.245 ± 0.179 0.894 ± 0.224
Wolf et al. (2016) −0.028 ± 0.018 −0.101 ± 0.026
Slope GLMM – no hyperpriors −0.027 ± 0.019 −0.091 ± 0.025
GLMM – with hyperpriors −0.024 ± 0.017 −0.087 ± 0.021
where {beta[1,1], beta[2,1]} are the intercept and slope for the photometric sample and
{beta[1,2], beta[2,2]} the same two quantities for the spectroscopic sample.
Table 10.1 gives our results along with those reported by Wolf et al. (2016). In order to
illustrate the effect of adding the hyperpriors, we also show in this table results obtained
without taking them into account8 (this means considering β ∼ Normal(0, 103 ) as a prior
for all βs). From this table it is possible to recognize two main characteristics: the signif-
icant difference between slope and intercept values for the two different subsamples and
the influence of the common hyperprior on the posterior means (although within the same
subsample all results agree within 1σ credible intervals).
The differences between the results from the photo and the spec subsamples show that
they have distinct behaviors, in other words, that we are actually dealing with separate
populations. Nevertheless, using the hyperpriors (and considering that, although recog-
nizable differences exist, these objects still have a lot in common) allows us to constrain
further the parameters for both populations (in order to achieve smaller scatter in both the
spectroscopic and photometric cases).
The same results are illustrated in Figure 10.3. The upper panel shows the prediction
intervals in the parameter space formed by the Hubble residuals as a function of host galaxy
mass for the two separate samples, and the lower panel highlights the difference between
slope and intercept values. In both panels we can clearly see the different trends followed
by each sample.
Wolf et al. (2016), encouraging further analysis into this result, point out, among other
reasons, the necessity to have a better understanding of type Ia SNe progenitor systems and
8 The corresponding code is not displayed, to avoid repetition, but it is available in the online material (see the
introductory statements at the start of this chapter).
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
288 Astronomical Applications
Phot-Ia Spec-Ia
1.0
0.5
μ SN – μz
0.0
−0.5
−1.0
8 9 10 11 12 13
log (M M )
1.5
Phot-Ia
JAGS posterior intercept
1.0
0.5
Spec-Ia
0.0
t
JAGS posterior slope
Figure 10.3 Upper panel: Hubble residuals (HR = μSN − μz ) as a function of the host galaxy mass (log(M/M )) for the PM
sample from Wolf et al. (2016). The shaded areas represent 50% (darker) and 95% (lighter) prediction intervals for
the spectroscopic (lower) and photometric (upper) samples. Lower panel: Contour intervals showing the 68% (darker)
and 95% (lighter) credible intervals of the Spec-Ia and Phot-Ia JAGS posterior distributions for the HR–mass
relation. (A black and white version of this figure will appear in some formats. For the color version, please refer to the
plate section.)
a more reliable photometric classification pipeline for type Ia SNe (Ishida and de Souza,
2013; Kessler et al., 2010).
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
289 10.2 Gaussian Mixed Models, Type Ia Supernovae, and Hubble Residuals
defines its domain. Thus, we will follow the guidelines from Stan (2016), adopting a
half-normal prior for the shared hyperparameter sig0 and using a weak informative prior
over beta.9 These choices were made to illustrate important points the reader should
consider when choosing priors; however, as will be made clear shortly, in this exam-
ple they have no effect on the results. This may not be the case for more complicated
scenarios.
In order to facilitate comparison, we designed the beta matrix in such a way that the
sequence of output parameter values is the same as that appearing in Code 10.3.
Code 10.4 Gaussian linear mixed model, in Python using Stan, for modeling the relation-
ship between type Ia supernovae host galaxy mass and Hubble residuals.
======================================================
import numpy as np
import pandas as pd
import pystan
# Data
path_to_data = ’˜/data/Section_10p2/HR.csv’
data_frame = dict(pd.read_csv(path_to_data))
# Fit
stan_code="""
data{
int<lower=0> N; # number of data points
int<lower=0> K; # number of distinct populations
int<lower=0> L; # number of coefficients
vector[N] obsx; # obs host galaxy mass
vector<lower=0>[N] errx; # errors in host mass measurements
vector[N] obsy; # obs Hubble Residual
vector<lower=0>[N] erry; # errors in Hubble Residual measurements
vector[N] type; # flag for spec/photo sample
}
parameters{
matrix[K,L] beta; # linear predictor coefficients
real<lower=0> sigma; # scatter around true black hole mass
vector[N] x; # true host galaxy mass
vector[N] y; # true Hubble Residuals
real<lower=0> sig0; # scatter for shared hyperprior on beta
9 An order of magnitude higher than our expectations for the parameter values (Stan, 2016).
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
290 Astronomical Applications
for (i in 1:N) {
if (type[i] == type[1]) mu[i] <- beta[1,1] + beta[2,1] * x[i];
else mu[i] = beta[1,2] + beta[2,2] * x[i];
}
}
model{
# Shared hyperprior
mu0 ˜ normal(0, 1);
sig0 ˜ normal(0, 5);
for (i in 1:K){
for (j in 1:L) beta[i,j] ~ normal(mu0, sig0);
}
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=data, iter=40000, chains=3,
warmup=15000, thin=1, n_jobs=3)
# Output
nlines = 10 # number of lines in screen output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
======================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0,0] 0.83 2.1e-3 0.27 0.27 0.65 0.83 1.01 1.34 15974 1.0
beta[1,0] -0.08 2.0e-4 0.03 -0.13 -0.1 -0.08 -0.06 -0.03 15975 1.0
beta[0,1] 0.24 1.2e-3 0.18 -0.11 0.12 0.24 0.36 0.6 22547 1.0
beta[1,1] -0.02 1.2e-4 0.02 -0.06 -0.03 -0.02 -0.01 0.01 22507 1.0
sigma 0.12 7.6e-5 9.0e-3 0.1 0.11 0.12 0.13 0.14 14034 1.0
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
291 10.3 Multivariate Normal Mixed Model and Early-Type Contact Binaries
tFigure 10.4 Artist’s impression of VFTS 352, the most massive and earliest spectral type genuinely-contact binary system known
to date (Almeida et al., 2015). Image credits: ESO /L. Calçada. (A black and white version of this figure will appear in
some formats. For the color version, please refer to the plate section.)
chapters: multivariate normal (Section 4.2) and mixed models (Chapter 8). As a case study
we will investigate the period–luminosity–color (PLC) relation in early-type contact and
near-contact binary stars.
The star types O, B, and A are hot and massive objects traditionally called “early type”.10
It is believed that most such stars are born as part of a binary system, being close enough
to have a mass exchange interaction with their companion (Figure 10.4 shows the most
massive genuinely contact binary system known to date, reported by Almeida et al. (2015)).
This interaction will have significant impact in the posterior development of both stars
and, given their high frequency of occurrence, play an important role in stellar population
evolution.
A PLC relation is known to exist in close binary stars owing to an observed correla-
tion between the radius of the orbit and the mass of the stars forming the pair (Machida
et al., 2008). Our goal is to determine whether subpopulations of genuinely-contact and
close-contact systems follow the same underlying relationship (Pawlak, 2016). The math-
ematical formulation of the PLC relation (e.g. Pawlak, 2016; Rucinski, 2004) is usually
stated as
10 This nomenclature refers to an old model of stellar evolution prior to the discovery of nuclear fusion as a
source of stellar energy. Although the model is now discredited, the nomenclature is still used; see the book
Jain (2016).
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
292 Astronomical Applications
where MV is the absolute V-band magnitude, P denotes the period and (V − I)0 is the color
between the filters V and I.
10.3.1 Data
The data set we shall use is composed of a sample of N = 64 eclipsing binaries classified
as near-contact (NC, stars which are close but not in thermal contact), and genuinely-
contact (GC, where the two stars are in thermal equilibrium). This sample was built
by Pawlak (2016)11 from phase III of the Optical Gravitational Lensing Experiment
(OGLE-III) catalog of eclipsing binaries in the Large Magellanic Cloud (Graczyk et al.,
2011).
1 if GC
ti =
2 if NC
βkj ∼ Normal(μ0 , σ02 ) (10.7)
μ0 ∼ Normal(0, 103 )
σ0 ∼ Gamma(0.001, 0.001)
i = 1, . . . , N
k = 1, . . . , K
j ∈ [1, 2]
Notice that, for each data point i, the term ti is used as an index of linear predic-
tor coefficients, distinguishing between each class of the binary system. This means
that the set of β parameters to be used will be dependent on the classification of each
data point. It is also important to emphasize that, by employing a hierarchical Bayesian
11 The data table is given in the paper.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
293 10.3 Multivariate Normal Mixed Model and Early-Type Contact Binaries
model for the intercepts and slopes, we allow the model to borrow strength across pop-
ulations. That is, while the model acknowledges that each separate population (NC or
GC) has its own parameters, it allows the whole population to contribute to the infer-
ence of individual parameters (represented by the slopes and intercepts). This happens
via their joint influence on the posterior estimates of the unknown hyperparameters
μ0 and σ0 .
Although we have applied the concept for a simple model (only two classes of object),
the approach can be easily expanded for a more complex situation with dozens or hun-
dreds of classes. It reflects the hypothesis that, although there might be differences among
classes, they do share a global similarity (i.e., they are all close binary systems).
Code 10.5 Multivariate normal model in R using JAGS for accessing the relationship
between period, luminosity, and color in early-type contact binaries.
==================================================
library(R2jags)
# Data
#Read data
PLC <- read.csv("˜/data/Section_10p3/PLC.csv", header = T)
# Fit
NORM <-"model{
# Shared hyperprior
tau0 ˜ dgamma(0.001,0.001)
mu0 ˜ dnorm(0,1e-3)
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
294 Astronomical Applications
# Identify parameters
params <- c("beta", "sigma")
# Fit
jagsfit <- jags(data = jags_data,
inits = inits,
parameters = params,
model = textConnection(NORM),
n.chains = 3,
n.iter = 5000,
n.thin = 1,
n.burnin = 2500)
## Output
print(jagsfit,justify = "left", digits=2)
==================================================
mu.vect sd.vect 2.5% 97.5% Rhat n.eff
beta[1,1] -1.01 0.27 -1.55 -0.49 1 7500
beta[2,1] -3.29 0.95 -5.15 -1.37 1 7500
beta[3,1] 7.22 1.28 4.61 9.68 1 7500
beta[1,2] -0.41 0.15 -0.71 -0.11 1 4200
beta[2,2] -3.19 0.58 -4.33 -2.00 1 7500
beta[3,2] 8.50 0.82 6.87 10.08 1 4100
sigma[1] 0.62 0.09 0.47 0.82 1 920
sigma[2] 0.43 0.05 0.34 0.55 1 5200
deviance 91.82 4.38 85.47 102.28 1 7500
According to the notation adopted in Code 10.5, the first column of the beta matrix holds
coefficients for genuinely-contact (GC) objects while the second column stores coefficients
for the near-contact (NC) subsample.
From these we see that, beyond an overall shift in magnitude (due to incompatible poste-
riors for beta[1,1] and beta[2,1]), the two populations present a very similar dependence
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
295 10.3 Multivariate Normal Mixed Model and Early-Type Contact Binaries
Table 10.2 For comparison, the PLC relation coefficients reported by Pawlak (2016) and our results using
GLMM.
Pawlak 2016
β1 β2 β3 σ
GC −0.97 ± 0.26 −3.47 ± 0.87 7.57 ± 1.21 0.55
NC −0.40 ± 0.15 −3.41 ± 0.60 8.56 ± 0.80 0.40
GLMM
β1 β2 β3 σ
GC −1.01 ± 0.27 −3.29 ± 0.95 7.22 ± 1.28 0.62 ± 0.09
NC −0.41 ± 0.15 −3.19 ± 0.58 8.50 ± 0.82 0.43 ± 0.05
on period (very close posteriors for beta[2,1] and beta[2,2]) and only marginally agree
in their dependence on color (overlaping posteriors for beta[3,1] and beta[3,2]). More-
over, comparing the numbers in the sd.vect column we see that the genuinely-contact
systems present a larger scatter than the near-contact systems. Figure 10.5 illustrates the
positioning of the two mean models in PLC space.
The results presented above are consistent with the least-squares fit employed by Pawlak
(2016), and are presented in Table 10.2. This similarity is expected given the quality of the
data set (a subset of 64 systems having well-covered light curves and low photometric
noise) and the lack of informative priors in our analysis.
−1
MV
−2
−3
−4
−0.2
I) 0
0 −
−0.3 −0.2 −0.1 0.2
(V
0
0.1
log(P)
tFigure 10.5 Period–luminosity–color relation obtained for the sample of 64 near early-type binary systems from the Large
Magellanic Cloud (Pawlak, 2016). The sample is color-coded for genuinely-contact (upper) and near-contact (lower)
binaries. (A black and white version of this figure will appear in some formats. For the color version, please refer to the
plate section.)
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
296 Astronomical Applications
Code 10.6 Multivariate Gaussian mixed model in Python, using Stan, for accessing the
relationship between luminosity, period, and color in early-type contact binaries.
==================================================
import numpy as np
import pandas as pd
import pystan
import statsmodels.api as sm
# Data
path_to_data = ’~data/Section_10p3/PLC.csv’
# Read data
data_frame = dict(pd.read_csv(path_to_data))
# Fit
# Stan Multivariate Gaussian
stan_code="""
data{
int<lower=0> nobs; # number of data points
int<lower=1> M; # number of linear predicor coefficients
int<lower=1> K; # number of distinct populations
vector[nobs] x1; # obs log period
vector[nobs] x2; # obs color V-I
vector[nobs] y; # obs luminosity
int type[nobs]; # system type (near or genuine contact)
}
parameters{
matrix[M,K] beta; # linear predictor coefficients
real<lower=0> sigma[K]; # scatter around linear predictor
real mu0;
real sigma0;
}
transformed parameters{
vector[nobs] mu; # linear predictor
for (i in 1:nobs) {
if (type[i] == type[1])
mu[i] = beta[1,2] + beta[2,2] * x1[i] + beta[3,2] * x2[i];
else mu[i] = beta[1,1] + beta[2,1] * x1[i] + beta[3,1] * x2[i];
}
}
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
297 10.4 Lognormal Distribution and the Initial Mass Function
model{
# priors and likelihood
mu0 ˜ normal(0, 100);
sigma0 ˜ gamma(0.001, 0.001);
for (i in 1:K) {
sigma[i] ˜ gamma(0.001, 0.001);
for (j in 1:M) beta[j,i] ˜ normal(mu0,sigma0);
}
for (i in 1:nobs){
if (type[i] == type[1]) y[i] ˜ normal(mu[i], sigma[2]);
else y[i] ˜ normal(mu[i], sigma[1]);
}
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=data, iter=5000, chains=3,
warmup=2500, thin=1, n_jobs=3)
# Output
nlines = 13 # number of lines in screen output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0,0] -1.01 6.5e-3 0.27 -1.55 -1.19 -1.01 -0.83 -0.48 1781.0 1.0
beta[1,0] -3.31 0.02 0.98 -5.25 -3.95 -3.31 -2.67 -1.34 2082.0 1.0
beta[2,0] 7.21 0.03 1.28 4.63 6.39 7.23 8.08 9.64 1975.0 1.0
beta[0,1] -0.42 3.8e-3 0.16 -0.74 -0.52 -0.42 -0.31 -0.1 1836.0 1.0
beta[1,1] -3.2 0.01 0.58 -4.37 -3.58 -3.2 -2.82 -2.08 2156.0 1.0
beta[2,1] 8.46 0.02 0.83 6.83 7.92 8.47 9.03 10.07 1828.0 1.0
sigma[0] 0.62 2.1e-3 0.09 0.47 0.55 0.6 0.67 0.82 1990.0 1.0
sigma[1] 0.42 1.2e-3 0.05 0.34 0.39 0.42 0.46 0.55 2138.0 1.0
In this section we show how to use a given probability function to fit a distribution. This
means that we are not aiming to build a regression model. Instead, we wish to characterize
the underlying probability distribution driving the behavior of a measured quantity. This
is exactly the case for the long-standing problem of determining the stellar initial mass
function (IMF). We will show a Bayesian approach to the strategy, presented by Zaninetti
(2013), by fitting a lognormal distribution (Section 5.2.1) to stellar masses of the star cluster
NGC 6611 (Oliveira et al., 2005).
The IMF determines the probability distribution function for the mass at which a star
enters the main sequence. It regulates the relative abundance of massive versus low-
mass stars for each stellar generation and influences most observable properties of stellar
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
298 Astronomical Applications
populations and galaxies (Bastian et al., 2010). It is also a required input for semi-analytical
models of galaxy evolution (Fontanot, 2014). Given such a crucial role in shaping the stel-
lar population, the ultimate goal of any theory of star formation is to predict the stellar
IMF from first principles. Currently, there are a few empirical alternatives for describing
the IMF, mostly based on power law functions for different ranges of masses (Bastian et al.,
2010).
The IMF is usually considered to be universal, assuming the shape of a Salpeter power
law, dN ∝ M−2.35 dM (Salpeter, 1955), for stars with masses M > 0.5M (Kroupa,
2001), where dN is the number of stars per bin of stellar mass dM . Another com-
mon option to fit the IMF is to use a lognormal distribution in order to cover the mass
function down to the brown dwarf regime (M ≤ 0.1M , Chabrier, 2003). The goal
of this section is to demonstrate how to fit a lognormal distribution (Section 5.2.1) to
a data set of stellar mass measurements. In this special case there is no explanatory
variable.
10.4.1 Data
We will use the photometric observations of stellar masses from NGC 6611, the young
and massive cluster that ionizes the Eagle Nebula (Oliveira et al., 2005). The data are
from 208 stars for which mass measurements are available.12 We chose this particular
data set to allow a simple comparison with the analysis performed by Zaninetti (2013),
who made a comprehensive comparison of different probability distributions for fitting the
IMF.
Mi ∼ LogNormal(μ, σ 2 )
μ ∼ Normal(0, 103 ) (10.8)
−3 −3
σ ∼ Gamma(10
2
, 10 )
i = 1, . . . , N
where we have applied a non-informative Gaussian (gamma) prior for the location (scale)
parameters.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
299 10.4 Lognormal Distribution and the Initial Mass Function
Code 10.7 Lognormal model in R using JAGS to describe the initial mass function (IMF).
==================================================
library(R2jags)
# Data
path_to_data = "˜/data/Section_10p4/NGC6611.csv"
# Read data
IMF <- read.table(path_to_data,header = T)
N <- nrow(IMF)
x <- IMF$Mass
# Fit
LNORM <-" model{
# Uniform prior for standard deviation
tau <- pow(sigma, -2) # precision
sigma ˜ dunif(0, 100) # standard deviation
mu ˜ dnorm(0,1e-3)
# Likelihood function
for (i in 1:N){
x[i] ˜ dlnorm(mu,tau)
}
}"
# Identify parameters
params <- c("mu", "sigma")
# Run mcmc
LN <- jags(
data = jags_data,
parameters = params,
model = textConnection(LNORM),
n.chains = 3,
n.iter = 5000,
n.thin = 1,
n.burnin = 2500)
# Output
print(LN, justify = "left", intervals=c(0.025,0.975), digits=2)
==================================================
mu.vect sd.vect 2.5% 97.5% Rhat n.eff
mu -1.26 0.07 -1.39 -1.12 1 6600
sigma 1.03 0.05 0.94 1.14 1 1100
deviance 81.20 1.96 79.27 86.49 1 1100
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
300 Astronomical Applications
Density
0
0.0 0.5 1.0 1.5
t
M/M
Figure 10.6 Histogram of mass distribution for the NGC 6611 cluster data (208 stars) with a superposed lognormal distribution.
The dashed line shows the final mean model and the shaded areas correspond to 50% (darker) and 95% (lighter)
credible intervals.
The fitted values can be compared to those from Zaninetti (2013, Table 2), who found
μ = −1.258 and σ = 1.029. Figure 10.6 shows the lognormal distribution fitted to the
data set together with the 50% and 95% credible intervals around the mean. The full
code to reproduce this plot is given below. Note that the code makes use of the function
jagsresults, which makes it easier to extract the information from the Markov chain. This
function is found in the package jagstools and is available on GitHub. It can be installed
with the following lines:
require(devtools)
install_github("johnbaums/jagstools")
# Extract results
mx <- jagsresults(x=LN, params=c(’mu’))
sigmax <- jagsresults(x=LN, params=c(’sigma’))
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
301 10.4 Lognormal Distribution and the Initial Mass Function
ggplot(gdata,aes(x=xx))+
geom_histogram(data=IMF,aes(x=Mass,y = ..density..),
colour="red",fill="gray99",size=1,binwidth = 0.075,
linetype="dashed")+
geom_ribbon(aes(x=xx,ymin=lwr1, ymax=upr1,y=NULL),
alpha=0.45, fill=c("#00526D"),show.legend=FALSE) +
geom_ribbon(aes(x=xx,ymin=lwr2, ymax=upr2,y=NULL),
alpha=0.35, fill = c("#00A3DB"),show.legend=FALSE) +
geom_line(aes(x=xx,y=mean),colour="gray25",
linetype="dashed",size=0.75,
show.legend=FALSE)+
ylab("Density")+
xlab(expression(M/M[’\u0298’]))+
theme_bw() +
theme(legend.background = element_rect(fill = "white"),
legend.key = element_rect(fill = "white",color = "white"),
plot.background = element_rect(fill = "white"),
legend.position = "top",
axis.title.y = element_text(vjust = 0.1,margin = margin(0,10,0,0)),
axis.title.x = element_text(vjust = -0.25),
text = element_text(size = 25))
=========================================================
Code 10.9 Lognormal model in Python using Stan to describe the initial mass function
(IMF).
==================================================
import numpy as np
import pandas as pd
import pylab as plt
import pystan
import statsmodels.api as sm
# Data
path_to_data = ’˜/data/Section_10p4/NGC6611.csv’
# Read data
data_frame = dict(pd.read_csv(path_to_data))
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
302 Astronomical Applications
# Fit
# Stan model
stan_code="""
data{
int<lower=0> nobs; # number of data points
vector[nobs] X; # stellar mass
}
parameters{
real mu; # mean
real<lower=0> sigma; # scatter
}
model{
# priors and likelihood
sigma ˜ normal(0, 100);
mu ˜ normal(0, 100);
X ˜ lognormal(mu, sigma);
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=data, iter=5000, chains=3,
warmup=2500, thin=1, n_jobs=3)
# Output
print(fit)
==================================================
# plot chains and posteriors
fit.traceplot()
plt.show()
10.5 Beta Model and the Baryon Content of Low Mass Galaxies
Our next example deals with situations where the response variable can take any real value
between 0 and 1 (i.e., a fraction). We discussed the beta model in Section 5.2.4 within
a simulated framework. Here we will show how it can be applied directly to model the
baryonic gas fraction as a function of galaxy mass.
In trying to compare theoretical predictions of galaxy formation and evolution with
observations, one is faced with a difficult situation: how to identify which galaxy prop-
erties are driven by internal effects and which are a consequence of interaction with other
galaxies in the cluster? Such environmental effects are specially important for low mass
galaxies, which are more prone to have their gravitational potential perturbed by very
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
303 10.5 Beta Model and the Baryon Content of Low Mass Galaxies
massive neighbors. In this scenario, one possible strategy is to select a sample of isolated
galaxies, which are less perturbed by effects such as tidal forces and ram pressure and are
better laboratories to study correlations between different galaxy properties without too
much environmental interference. This approach was adopted by Bradford et al. (2015),
who studied the dependence of the baryon fraction in atomic gas, fgas , on other galaxy
properties. In this example we will focus in the correlation between fgas and stellar mass
(Bradford et al., 2015, Figure 4, left-hand panel).
10.5.1 Data
We will use the data set compiled by Bradford et al. (2015), which contains N = 1715
galaxies from the NASA-Sloan Atlas13 catalog (Blanton et al., 2011). These are considered
to be low mass galaxies (M ≤ 109.5 M ) in isolation14 (the projected distance to the
nearest neighbor ≥ 1.5 Mpc). From this we used two columns: the stellar mass, M =
log(M /M ), and the baryon fraction in atomic gas, fgas , defined as
Mgas
fgas = , (10.9)
Mgas + M
where Mgas is the atomic-gas mass and M is the stellar mass.
fgas;i ∼ Beta(αi , βi )
αi = θ pi
βi = θ (1 − pi )
pi
log = ηi (10.10)
1 − pi
ηi = β1 + β2 Mi
βj ∼ Normal(0, 103 )
θ ∼ Gamma(0.001, 0.001)
i = 1, . . . , N
j = 1, . . . , K
where α and β are the shape parameters for the beta distribution.
13 http://www.nsatlas.org
14 Complete data available at http://www.astro.yale.edu/jdbradford/data/hilmd/table_1_bradford_2015.fits.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
304 Astronomical Applications
Code 10.10 Beta model in R using JAGS, for accessing the relationship between the baryon
fraction in atomic gas and galaxy stellar mass.
==================================================
require(R2jags)
# Data
path_to_data = "../data/Section_10p5/f_gas.csv"
# Read data
Fgas0 <-read.csv(path_to_data,header=T)
# Estimate F_gas
Fgas0$fgas <- Fgas0$M_HI/(Fgas0$M_HI+Fgas0$M_STAR)
y <- Fgas0$fgas
x <- log(Fgas0$M_STAR,10)
X <- model.matrix(˜ 1 + x)
K <- ncol(X)
# Likelihood function
for (i in 1:N){
Y[i] ~ dbeta(a[i],b[i])
a[i] <- theta * pi[i]
b[i] <- theta * (1-pi[i])
logit(pi[i]) <- eta[i]
eta[i] <- inprod(beta[],X[i,])
}
}"
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
305 10.5 Beta Model and the Baryon Content of Low Mass Galaxies
# Identify parameters
params <- c("beta","theta")
# Output
print(Beta_fit,intervals=c(0.025, 0.975),justify = "left", digits=2)
==================================================
mu.vect sd.vect 2.5% 97.5% Rhat n.eff
beta[1] 9.29 0.13 9.00 9.54 1.04 100
beta[2] -0.98 0.01 -1.00 -0.95 1.03 110
theta 11.71 0.38 10.97 12.45 1.00 2200
deviance -2331.34 2.18 -2333.77 -2325.69 1.01 340
Figure 10.7 shows the relation between the fraction of atomic gas and the galaxy stellar
mass; the dashed line represents the mean relationship and the shaded regions denote the
50% (darker) and 95% (lighter) prediction intervals. From the coefficients, we can estimate
1.00
0.75
fgas 0.50
0.25
0.00
7 8 9 10 11
t
log (M*/M )
Figure 10.7 Baryon fraction in atomic gas as a function of stellar mass. The dashed line shows the mean posterior fraction and the
shaded regions denote the 50% (darker) and 95% (lighter) prediction intervals. The dots are the data points for
isolated low mass galaxies shown in Bradford et al. (2015, Figure 4, left-hand panel). (A black and white version of this
figure will appear in some formats. For the color version, please refer to the plate section.)
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
306 Astronomical Applications
that a one-dex15 variation in M decreases the average atomic fraction of the galaxy by a
factor 1 − exp(−0.98) ≈ 62.5%.
Code 10.11 Beta model in Python using Stan, for accessing the relationship between the
fraction of atomic gas and the galaxy stellar mass.
==================================================
import numpy as np
import pandas as pd
import pystan
import statsmodels.api as sm
# Data
path_to_data = ’˜/data/Section_10p5/f_gas.csv’
# Read data
data_frame = dict(pd.read_csv(path_to_data))
# Fit
# Stan model
stan_code="""
data{
int<lower=0> nobs; # number of data points
int<lower=0> K; # number of coefficients
matrix[nobs, K] X; # stellar mass
real<lower=0, upper=1> Y[nobs]; # atomic gas fraction
}
parameters{
vector[K] beta; # linear predictor coefficients
real<lower=0> theta;
}
model{
vector[nobs] pi;
real a[nobs];
real b[nobs];
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
307 10.6 Bernoulli Model and the Fraction of Red Spirals
for (i in 1:nobs){
pi[i] = inv_logit(X[i] * beta);
a[i] = theta * pi[i];
b[i] = theta * (1 - pi[i]);
}
Y ˜ beta(a, b);
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=data, iter=7500, chains=3,
warmup=5000, thin=1, n_jobs=3)
# Output
print(fit)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] 9.24 5.5e-3 0.18 8.9 9.13 9.24 9.36 9.61 1061.0 1.0
beta[1] -0.42 2.4e-4 7.9e-3 -0.44 -0.43 -0.42 -0.42 -0.41 1068.0 1.0
theta 11.68 0.01 0.4 10.91 11.42 11.68 11.92 12.58 1297.0 1.0
We now turn to the treatment of binary data. We presented an application of the Bernoulli
models using synthetic data in Section 5.3. There we considered a system where the
response variable could take two states, success (1) and failure (0), and used a Bernoulli
distribution (one particular case of the binomial distribution) to construct our statistical
model. In that context, our goal was to determine the parameter of the Bernoulli distri-
bution, p, which represents the chances of getting a success (1) in each realization of the
response variable.
Here we will apply this technique to the practical astronomical case of modeling the
fraction of red spirals as a function of the bulge size in a given galaxy population. Since
the advent of large photometric surveys such as the Sloan Digital Sky Survey (SDSS),16
astronomers have been using a well-known correlation between galaxy color and morphol-
ogy to infer the morphological types of galaxies. According to this relation, spiral galaxies
are statistically bluer, and disk-dominated, and hold more star formation than their ellip-
tical, bulge-dominated, counterparts (Mignoli et al., 2009). However significant attention
has also been drawn to individuals or groups of galaxies violating this relation (see e.g.
Cortese and Hughes, 2009; Mahajan and Raychaudhury, 2009). The Galaxy Zoo project,17
a combined effort involving professional and citizen scientists, enabled for the first time
16 www.sdss.org/
17 www.galaxyzoo.org/
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
308 Astronomical Applications
10.6.1 Data
The Galaxy Zoo clean catalog (Lintott et al., 2008), containing around 900 000 galaxies
morphologically classified, was assembled thanks to the work of more than 160 000 volun-
teers, who visually inspected SDSS images through an online tool. Details on how multiple
classifications are converted into the probability that a given galaxy is a spiral, pspiral , are
given in Bamford et al. (2009); Lintott et al. (2008).
The subsample compiled by Masters et al. (2010) is presented in two separate tables:
one holding 294 red, passive, spirals18 and the other holding 5139 blue, active, spiral galax-
ies.19 We merged these two tables and constructed a single catalog of 5433 lines (galaxies)
and two columns (type and fracdeV, the fraction of the best-fit light profile from the de
Vaucouleurs fit, which is considered a proxy for the bulge size).
18 http://data.galaxyzoo.org/data/redspirals/RedSpiralsA1.txt
19 http://data.galaxyzoo.org/data/redspirals/BlueSpiralsA2.txt
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
309 10.6 Bernoulli Model and the Fraction of Red Spirals
and T = 0 otherwise, a failure). The bulge size will act as our explanatory variable x ≡
fracdeV and η = β1 + β2 x as the linear predictor, whose intercept and slope are given by
β1 and β2 , respectively. The Bernoulli parameter p is interpreted as the probability that a
galaxy is a red spiral (the probability of success), given its bulge size. The Bernoulli model
for this problem can be expressed as follows:
Ti ∼ Bernoulli(pi )
pi
log = ηi
1 − pi
ηi = β1 + β2 xi (10.11)
βj ∼ Normal(0, 103 )
i = 1, . . . , N
j = 1, . . . , K
where N = 5433 is the number of data points and K = 2 the number of linear predictor
coefficients.
Code 10.12 Bernoulli model in R using JAGS, for accessing the relationship between bulge
size and the fraction of red spirals.
==================================================
require(R2jags)
# Data
path_to_data = ’˜/data/Section_10p6/Red_spirals.csv’
# Read data
Red <- read.csv(path_to_data,header=T)
# Fit
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
310 Astronomical Applications
# Likelihood function
for (i in 1:N){
Y[i] ˜ dbern(p[i])
logit(p[i]) <- eta[i]
eta[i] <- inprod(beta[], X[i,])
}
}"
# Identify parameters
params <- c("beta")
# Fit
LOGIT_fit <- jags(data = logit_data,
inits = inits,
parameters = params,
model = textConnection(LOGIT),
n.thin = 1,
n.chains = 3,
n.burnin = 3000,
n.iter = 6000)
# Output
print(LOGIT_fit,intervals=c(0.025, 0.975),justify = "left", digits=2)
==================================================
mu.vect sd.vect 2.5% 97.5% Rhat n.eff
beta[1] -4.89 0.23 -5.23 -4.56 1 2100
beta[2] 8.11 0.59 7.12 9.06 1 2200
deviance 1911.15 44.15 1907.07 1915.78 1 9000
Figure 10.8 illustrates how the probability that a galaxy is a red spiral depends on
fracdeV (or bulge size) according to our model. The dashed line represents the mean pos-
terior, and the shaded regions denote the 50% (darker) and 95% (lighter) credible intervals.
In order to facilitate comparison with Figure 2 of Masters et al. (2010), we also show
the binned data corresponding to the fraction of red spirals, Nred /Ntot , in bin bulge-size
each (dots). The model fits the data quite well with a simple linear relation (see Equa-
tion 10.12). In order to reproduce this figure with his or her own plot tool, the reader just
needs to monitor the parameter p, using
params <- c("p")
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
311 10.6 Bernoulli Model and the Fraction of Red Spirals
0.4
0.3
pred 0.2
0.1
0.0
t
Fracdev
Figure 10.8 Probability that a given galaxy is red spiral, pred , as a function of fracdeV (or bulge size). The dashed line
represents the posterior mean probability that a galaxy is a red spiral, while the shaded areas illustrate the 50%
(darker) and 95% (lighter) credible intervals. The data points, with error bars, represent the fraction of red spirals for
each bulge-size bin as presented in Figure 2 of Masters et al. (2010).
and then extract the values with the function jagsresults from the package jagstools:
px <- jagsresults(x=LOGIT_fit, params=c(’p’))
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
312 Astronomical Applications
Code 10.13 Bernoulli model in Python using Stan, for assessing the relationship between
bulge size and the fraction of red spirals.
==================================================
import numpy as np
import pandas as pd
import pystan
import statsmodels.api as sm
# Data
path_to_data = ’˜/data/Section_10p6/Red_spirals.csv’
# Read data
data_frame = dict(pd.read_csv(path_to_data))
x = np.array(data_frame[’fracdeV’])
# Fit
# Stan model
stan_code="""
data{
int<lower=0> nobs; # number of data points
int<lower=0> K; # number of coefficients
matrix[nobs, K] X; # bulge size
int Y[nobs]; # galaxy type: 1, red; 0, blue
}
parameters{
vector[K] beta; # linear predictor coefficients
}
model{
# priors and likelihood
for (i in 1:K) beta[i] ~ normal(0, 100);
Y ~ bernoulli_logit(X * beta);
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=data, iter=6000, chains=3,
warmup=3000, thin=1, n_jobs=3)
# Output
print(fit)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] -4.92 4.6e-3 0.16 -5.25 -5.02 -4.92 -4.81 -4.61 1232.0 1.0
beta[1] 8.18 0.01 0.46 7.29 7.87 8.17 8.49 9.12 1233.0 1.0
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
313 10.7 Count Models, Globular Cluster Population, and Host Galaxy Brightness
We approach now a recurrent issue in astronomical studies: the necessity to model discrete
(count) data. Astronomers often find that they need to establish relationships involving at
least one discrete variable, where standard approaches designed to deal with continuous
variables cannot operate (de Souza et al., 2015b). In such situations a common approach
is to analyze pairs of measured quantities (x, y) in loglog scale and then apply the standard
normal model with a Gaussian error distribution (these are the underlying assumptions
behind a χ 2 minimization). Such a transformation is not necessary (see Chapter 5), since
the GLM framework allows one to treat the data at its original scale. Moreover, a log trans-
formation makes it impossible to deal with zeros as observations and imposes an arbitrary
shift on the entire data set. These transformations are known to have a poor performance
and to lead to bias in parameter estimation (O’Hara and Kotze, 2010).
However, merely stating that one should use a statistical model able to handle count data
is far from being a final answer to the above problem. A reader who has gone through Chap-
ter 5 might be aware that there are several distributions which can be used to model count
data; choosing between them can pose an additional challenge. Following the arguments
presented by de Souza et al. (2015b), we present three distinct models with increasing lev-
els of complexity. This allows us to compare results from Poisson, negative binomial and
three-parameter negative binomial models in order to guide the reader through a model
selection exercise. We will use, as a case study, measurements of globular cluster (GC)
population size NGC and host galaxy visual magnitude MV .
Globular clusters are spherical groups of stars found mainly in the halos of galaxies
(Figure 10.9). They are gravitationally bounded, significantly more dense than the open
clusters populating the disk and pervasive in nearby massive galaxies (Kruijssen, 2014).
They are also among the oldest known stellar systems, which makes them important pieces
in the galaxy evolution puzzle. Being one of the first structures to form in a galaxy, it is
reasonable to expect that the GC population is correlated somehow with global host galaxy
properties. This hypothesis was confirmed by a number of previous studies, which also
found the Milky Way to be an important outlier (Burkert and Tremaine, 2010; Harris and
Harris, 2011; Harris et al., 2013, 2014; Rhode, 2012; Snyder et al, 2011). Our galaxy holds
a significantly larger number of GCs than expected given the mass of its central black hole.
More recently, de Souza et al. (2015b) showed that the use of a proper statistical model
results in prediction intervals which enclose the Milky Way in a natural way, demonstrating
the paramount role played by statistical models in astronomical data modeling.
10.7.1 Data
We use the same catalog as that described in Section 10.1.1, which was presented in Har-
ris et al. (2013, hereafter H2013). This is a compilation of literature data, from a variety
of sources, obtained with the Hubble Space Telescope as well as a wide range of other
ground-based facilities (see H2013 for further details and references). The original data set
holds NGC and visual magnitude measurements for 422 galaxies.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
314 Astronomical Applications
tFigure 10.9 Globular cluster NGC 6388. Image credits: ESO, F. Ferraro (University of Bologna). (A black and white version of this
figure will appear in some formats. For the color version, please refer to the plate section.)
In order to focus on the differences resulting solely from the choice of statistical model,
the examples shown below will not take into account errors in measurements. A detailed
explanation of how those can be handled, including the corresponding JAGS models, can
be found in de Souza et al. (2015b).
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
315 10.7 Count Models, Globular Cluster Population, and Host Galaxy Brightness
NGC;i ∼ Poisson(μi )
μi = β1 + β2 MV;i
βj ∼ Normal(0, 103 ) (10.12)
i = 1, . . . , N
j = 1, . . . , K
Code 10.14 Poisson model, in R using JAGS, for modeling the relation between globular
clusters population and host galaxy visual magnitude.
==================================================
require(R2jags)
require(jagstools)
# Data
path_to_data = "˜/data/Section_10p7/GCs.csv"
# Read data
GC_dat = read.csv(file=path_to_data,header = T,dec=".")
N = N,
K = K)
# Fit
model.pois <- "model{
# Diffuse normal priors betas
for (i in 1:K) { beta[i] ~ dnorm(0, 1e-5)}
for (i in 1:N){
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
316 Astronomical Applications
# Likelihood
eta[i] <- inprod(beta[], X[i,])
mu[i] <- exp(eta[i])
Y[i] ~ dpois(mu[i])
# Discrepancy
expY[i] <- mu[i] # mean
varY[i] <- mu[i] # variance
PRes[i] <- ((Y[i] - expY[i])/sqrt(varY[i]))^2
}
Dispersion <- sum(PRes)/(N-2)
}"
# Identify parameters
params <- c("beta","Dispersion")
# Start JAGS
pois_fit <- jags(data = JAGS_data ,
inits = inits,
parameters = params,
model = textConnection(model.pois),
n.thin = 1,
n.chains = 3,
n.burnin = 3500,
n.iter = 7000)
# Output
print(pois_fit , intervals=c(0.025, 0.975), digits=3)
==================================================
mu.vect sd.vect 2.5% 97.5% Rhat n.eff
Dispersion 1080.231 0.197 1079.888 1080.590 1.010 10000
beta[1] -11.910 0.035 -11.967 -11.855 1.021 9100
beta[2] -0.918 0.002 -0.920 -0.915 1.020 9400
deviance 497168.011 898.313 495662.548 498682.310 1.008 10000
A visual representation of the posterior distribution for the fitted coefficients and the
dispersion statistics (Figure 10.10) can be accessed with the following command in R:
require(lattice)
source("../CH-Figures.R")
out <- pois_fit$BUGSoutput
MyBUGSHist(out,c("Dispersion",uNames("beta",K)))
The large dispersion statistic value (> 103 ) indicates that this model is not a good
description of the underlying behavior of our data. This is confirmed by confronting the
original data with the model mean and corresponding prediction intervals. Figure 10.11
clearly shows that the model does not describe a significant fraction of the data points and
consequently it has a very limited prediction capability.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
317 10.7 Count Models, Globular Cluster Population, and Host Galaxy Brightness
Dispersion
600
400
200 0
Frequencies
3000
2000
2000
1000
1000
0
t
Posterior distribution
Figure 10.10 Posterior distributions over the intercept (beta1), slope (beta2), and dispersion parameter resulting from the
Poisson model shown in Code 10.14. The thick horizontal lines show the 95% credible intervals.
NGC;i ∼ NegBinomial(pi , θ )
θ
pi =
θ + μi
μi = exp(ηi )
ηi = β1 + β2 MV;i (10.13)
βj ∼ Normal(0, 10 ) 3
θ ∼ Gamma(10−3 , 10−3 )
i = 1, . . . , N
j = 1, . . . , K
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
318 Astronomical Applications
E Irr S S0
105
104
103
NGC
102
101
t
MV
Figure 10.11 Globular cluster population and host galaxy visual magnitude data (points) superimposed on the results from Poisson
regression. The dashed line shows the model mean and the shaded regions mark the 50% (darker) and 95% (lighter)
prediction intervals. (A black and white version of this figure will appear in some formats. For the color version, please
refer to the plate section.)
where we denote as θ the extra scatter parameter and where we have assigned non-
informative priors for βj and θ .
Code 10.15 Negative binomial model, in R using JAGS, for modeling the relationship
between globular cluster population and host galaxy visual magnitude.
==================================================
# Fit
model.NB <- "model{
# Diffuse normal priors betas
for (i in 1:K) { beta[i] ˜ dnorm(0, 1e-5)}
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
319 10.7 Count Models, Globular Cluster Population, and Host Galaxy Brightness
for (i in 1:N){
eta[i] <- inprod(beta[], X[i,])
mu[i] <- exp(eta[i])
p[i] <- theta/(theta+mu[i])
Y[i] ~ dnegbin(p[i],theta)
# Discrepancy
expY[i] <- mu[i] # mean
varY[i] <- mu[i] + pow(mu[i],2)/theta # variance
PRes[i] <- ((Y[i] - expY[i])/sqrt(varY[i]))^2
}
Dispersion <- sum(PRes)/(N-3)
}"
# Identify parameters
params <- c("beta","theta","Dispersion")
# Start JAGS
NB_fit <- jags(data = JAGS_data ,
inits = inits,
parameters = params,
model = textConnection(model.NB),
n.thin = 1,
n.chains = 3,
n.burnin = 3500,
n.iter = 7000)
# Output
# Plot posteriors
MyBUGSHist(out,c("Dispersion",uNames("beta",K),"theta"))
The model is slightly overdispersed (Dispersion ≈ 2.0, see the top left panel in Figure
10.12), which could be caused either by a missing covariate or by the lack of complexity in
the model. However, Figure 10.13 shows that it provides a quite good fit to the data, with
the 95% prediction intervals enclosing most of the data variance. Moreover, comparing its
DIC statistic (DIC = 5194.3) with that obtained with the Poisson model (DIC = 900648.4,
from Code 10.14) we notice that the negative binomial model is significantly preferable.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
320 Astronomical Applications
Dispersion theta
600
600
400
400
200
200
0
0
Frequencies
1.5 2.0 2.5 3.0 0.9 1.0 1.1 1.2 1.3 1.4
beta[1] beta[2]
100 200 300 400 500
t
Posterior distribution
Figure 10.12 Posterior distributions over the intercept (beta1), slope (beta2), scatter (theta), and dispersion parameter
resulting from the negative binomial model shown in Code 10.15. The horizontal thick lines show the 95% credible
intervals.
Q
NGC;i ∼ NegBinomial(pi , θ μi )
Q
θ μi
pi = Q
θ μi + μi
μi = exp(ηi )
ηi = β1 + β2 MV;i (10.14)
βj ∼ Normal(0, 103 )
θ ∼ Gamma(10−3 , 10−3 )
Q ∼ Uniform(0, 3)
i = 1, . . . , N
j = 1, . . . , K
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
321 10.7 Count Models, Globular Cluster Population, and Host Galaxy Brightness
105 E Irr S S0
104
103
NGC
102
101
t
MV
Figure 10.13 Globular cluster population and host galaxy visual magnitude data (points) superimposed on the results from
negative binomial regression. The dashed line shows the model mean. The shaded regions mark the 50% (darker) and
95% (lighter) prediction intervals. (A black and white version of this figure will appear in some formats. For the color
version, please refer to the plate section.)
where we have assigned non-informative priors for βj and θ and a uniform prior over the
additional dispersion parameter Q.
Code 10.16 NB-P model in R using JAGS, for modeling the relationship between globular
cluster population and host galaxy visual magnitude.
==================================================
# Fit
model.NBP <- "model{
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
322 Astronomical Applications
# Likelihood
for (i in 1:N){
eta[i] <- inprod(beta[], X[i,])
mu[i] <- exp(eta[i])
theta_eff[i] <- theta*(mu[i]^Q)
p[i] <- theta_eff[i]/(theta_eff[i]+mu[i])
Y[i] ~ dnegbin(p[i],theta_eff[i])
# Discrepancy
expY[i] <- mu[i] # mean
varY[i] <- mu[i] + pow(mu[i],2-Q)/theta #variance
PRes[i] <- ((Y[i] - expY[i])/sqrt(varY[i]))^2
}
Dispersion <- sum(PRes)/(N-4)#
}"
# Identify parameters
params <- c("Q","beta","theta","Dispersion")
# Start JAGS
NBP_fit <- jags(data = JAGS_data,
inits = inits,
parameters = params,
model = textConnection(model.NBP),
n.thin = 1,
n.chains = 3,
n.burnin = 5000,
n.iter = 20000)
# Output
# Plot posteriors
MyBUGSHist(out,c("Dispersion",uNames("beta",K),"theta"))
# Screen output
print(NBP_fit, intervals=c(0.025, 0.975), digits=3)
==================================================
mu.vect sd.vect 2.5% 97.5% Rhat n.eff
Dispersion 1.929 0.209 1.545 2.363 1.001 13000
Q 0.018 0.015 0.001 0.057 1.002 2200
beta[1] -11.822 0.333 -12.447 -11.145 1.003 1200
beta[2] -0.884 0.017 -0.916 -0.850 1.003 1200
theta 0.996 0.106 0.770 1.187 1.002 1800
deviance 5193.090 3.012 5189.255 5200.795 1.004 1300
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
323 10.7 Count Models, Globular Cluster Population, and Host Galaxy Brightness
Q theta
3000
1000 2000 3000 4000
2000
1000
0
0
Frequencies
0.00 0.02 0.04 0.06 0.08 0.10 0.6 0.8 1.0 1.2 1.4
beta[1] beta[2] Dispersion
3000
2000
1000
0
0
–12.5 –12.0 –11.5 –11.0 –10.5 –0.92 –0.88 –0.84 1.5 2.0 2.5 3.0
t
Posterior distribution
Figure 10.14 Posterior distributions over the intercept (beta1), slope (beta2), and dispersion (theta,Q) parameters resulting
from the three-parameter negative binomial model (NB-P) shown in Code 10.16. The horizontal thick lines show the
95% credible intervals.
Comparing the mean Dispersion parameters for the NB-P (1.929) and negative bino-
mial (1.928) models we notice no significant improvement, which is visually confirmed by
comparing Figures 10.12 and 10.14. Notice that the Q parameter value is consistent with
zero, which indicates that the extra complexity is not being employed in this more detailed
description of the data. Finally, the dispersion statistic of the NB-P model (DIC = 5197.6)
is slightly higher than that for the NB (DIC = 5194.3), which also indicates that the NB
model is more suited to describe our data.
After comparing three possible statistical models we conclude that, among them, the
negative binomial model provides the best fit to the globular-cluster-population versus
galaxy-visual-magnitude data. We have also shown that, although in this model there is still
some remaining overdispersion, using a more complex statistical model does not improve
the final results. This issue might be solved by addressing other important points such as
the measurement errors and/or the existence of subpopulations within the data set.
A more detailed analysis of the possible sources and methods to deal with this data set
is beyond our scope here, but we invite readers interested in a deeper analysis to check the
discussion in de Souza et al. (2015b).
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
324 Astronomical Applications
in JAGS and Stan. The results for the model and dispersion parameters are almost identical
to those obtained with the R counterpart of this code.
Code 10.17 Negative binomial model in Python using Stan, for modeling the relationship
between globular cluster population and host galaxy visual magnitude.
==================================================
import numpy as np
import pandas as pd
import pystan
import statsmodels.api as sm
# Data
path_to_data = ’˜/data/Section_10p7/GCs.csv’
data_frame = dict(pd.read_csv(path_to_data))
# Fit
stan_code="""
data{
int<lower=0> N; # number of data points
int<lower=1> K; # number of linear predictor coefficients
matrix[N,K] X; # galaxy visual magnitude
int Y[N]; # size of globular cluster population
}
parameters{
vector[K] beta; # linear predictor coefficients
real<lower=0> theta;
}
model{
vector[N] mu; # linear predictor
mu = exp(X * beta);
theta ˜ gamma(0.001, 0.001);
for (i in 1:N){
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
325 10.8 Bernoulli Mixed Model, AGNs, and Cluster Environment
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=data, iter=10000, chains=3,
warmup=5000, thin=1, n_jobs=3)
# Output
nlines = 9 # number of lines in screen output
output = str(fit).split(’\n’)
for item in output[:nlines]:
print(item)
==================================================
mean se_mean sd 2.5% ... 97.5% n_eff Rhat
beta[0] -11.73 6.8e-3 0.33 -12.38 ... -11.07 2349.0 1.0
beta[1] -0.88 3.4e-4 0.02 -0.91 ... -0.85 2339.0 1.0
theta 1.1 1.4e-3 0.07 0.96 ... 1.25 2650.0 1.0
dispersion 1.92 3.4e-3 0.21 1.54 ... 2.35 3684.0 1.0
Our next example describes a mixed model with a binary response variable following a
Bernoulli distribution. This is the binomial model equivalent to the normal examples shown
in Sections 10.2 and 10.3. As a case study, we reproduce the study from de Souza et al.
(2016) exploring the roles of morphology and environment in the occurrence of AGN in
elliptical and spiral galaxies.
Active galactic nuclei (AGN) are powered by the accretion of gas into a supermassive
black hole located at the center of their host galaxy (e.g. Lynden-Bell, 1969; Orban de
Xivry et al., 2011). The AGN feedback interacts with the gas of the host via radiation
pressure, winds, and jets, hence helping to shape the final mass of the stellar components
(Fabian, 2012). Environmental effects can also turn on or off the AGN activity. Instabilities
originating from galaxy mergers, and from interactions between the galaxy and the cluster
potential, could drive gas towards the galaxy center, powering the AGN. Recently, Pimbblet
et al. ( 2013) found a strong relation between AGN activity and the cluster-centric distance
r/r200 ,20 with significant decrease in AGN fraction towards the cluster center. The interplay
between these competing processes results in a very intricate relation between the AGN
activity, the galaxy properties, and the local environment, which requires careful statistical
analysis.
20 The quantity r
200 is the radius inside which the mean density is 200 times the critical density of the Universe
at the cluster redshift.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
326 Astronomical Applications
Using a hierarchical Bayesian model, we show that it is possible to handle such intricate
data scenarios on their natural scale and, at the same time, to take into account the different
morphological types in a unified framework (without splitting the data into independent
samples).
10.8.1 Data
The data set is composed of a subsample of N = 1744 galaxies within galaxy clus-
ters from the Sloan Digital Sky Survey seventh (SDSS-DR7, Abazajian et al., 2009) and
12th (SDSS-DR12, Alam et al., 2015) data release data bases within a redshift range
0.015 < z < 0.1. The sample is divided into two groups, ellipticals and spirals, rely-
ing on the visual classification scheme from the Galaxy Zoo Project21 (Lintott et al.,
2008).
This sample was classified according to the diagram introduced in the article Baldwin,
Phillips, and Terlevich (1981, hereafter BPT), as shown in Figure 10.15. It com-
prises galaxies with emission lines Hβ, [OIII], Hα, and [NII] and signal to noise ratio
1.5
Seyfert
1
0.5
log [O III]/Hβ
LINER
0 Star-forming
–0.5
–1 Composite
–1.5
−1 −0.5 0 0.5
t
log [N II]/Hα
Figure 10.15 Illustrative plot of the BPT plane. The vertical axis represents the ratio [OIII]/Hβ, while the horizontal axis represents
the ratio [NII]/Hα. The solid curve is due to Kauffmann et al. (2003): galaxies above the curve are designated AGN
and those below are regular star-forming galaxies. The dashed line represents the Kewley et al. (2001) curve; galaxies
between the Kauffmann et al. and Kewley et al. curves are defined as composites; weaker AGN whose hosts are also
star-forming galaxies. The dotted line is the Schawinski et al. (2007) curve, which separates low-ionization
nuclear-emission-line (LINER) and Seyfert objects.
21 Data from this project was also used in Section 10.6.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
327 10.8 Bernoulli Mixed Model, AGNs, and Cluster Environment
S/N > 1.5. In order to build a sample with the lowest possible degree of contami-
nation due to wrong or dubious classifications, we selected only the star-forming and
Seyfert galaxies (see de Souza et al., 2016, Section 2.1). The galaxies hosting Seyfert
AGN were compared with a control sample of inactive galaxies by matching each
Seyfert and non-Seyfert galaxy pair against their colors, star formation rates, and stellar
masses.
yi ∼ Bernoulli(pi )
logit(pi ) = ηi
ηi = xTi,k βk,j
⎛
r
⎞
1 (log M200 )1 r200 1
⎜ ⎟
xTi,k = ⎝ ... ..
.
..
. ⎠
r
1 (log M200 )N r200 N
τ ∼ Gamma(10−3 , 10−3 )
σ 2 = 1/τ
j = elliptical, spiral
k = 1, . . . , 3
i = 1, . . . , N
It reads as follows: for each galaxy in the data set, composed of N objects, its probability
of hosting a Seyfert AGN is described by a Bernoulli distribution whose probability of
success, p ≡ fSeyfert , relates to r/r200 and log M200 through a logit link function (to ensure
that the probabilities will fall between 0 and 1) and the linear predictor
η = β1,j + β2,j log M200 + β3,j r/r200 , (10.16)
where j is an index representing whether a galaxy is elliptical or spiral. We assume
non-informative priors for the coefficients β1 , β2 , β3 , i.e., normal priors with mean μ
and standard deviation σ for which we assign shared hyperpriors μ ∼ Normal(0, 103 )
and 1/σ 2 ∼ Gamma(10−3 , 10−3 ).22 By employing a hierarchical Bayesian model for
the varying coefficients βj , we allow the model to borrow strength across galaxy types.
22 The inverse gamma prior accounts for the fact that the variance is always positive.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
328 Astronomical Applications
This happens via the joint influence of these coefficients on the posterior estimates of the
unknown hyperparameters μ and σ 2 .
Code 10.18 Bernoulli logit model, in R using JAGS, for accessing the relationship between
Seyfert AGN activity and galactocentric distance.
==================================================
library(R2jags)
# Data
data<-read.csv("˜/data/Section_10p8/Seyfert.csv",header=T)
# Fit
jags_model<-"model{
# Shared hyperpriors for beta
tau ˜ dgamma(1e-3,1e-3) # precision
mu ˜ dnorm(0,1e-3) # mean
# Likelihood
for(i in 1:N){
Y[i] ˜ dbern(pi[i])
logit(pi[i]) <- eta[i]
eta[i] <- beta[1,gal[i]]*X[i,1]+
beta[2,gal[i]]*X[i,2]+
beta[3,gal[i]]*X[i,3]
}
}"
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
329 10.8 Bernoulli Mixed Model, AGNs, and Cluster Environment
# Run mcmc
jags_fit <- jags(data= jags_data,
inits = inits,
parameters = params,
model.file = textConnection(jags_model),
n.chains = 3,
n.thin = 10,
n.iter = 5*10^4,
n.burnin = 2*10^4
)
# Output
print(jags_fit,intervals=c(0.025, 0.975),
digits=3)
==================================================
To visualize how the model fits the data, we display in Figure 10.16 the predicted
probabilities fSeyfert as a function of r/r200 for halos with an average mass log M200 ≈
1014 M . We present the binned data, for illustrative purposes, and the fitted model
and uncertainty. The shaded areas represent the 50% and 95% probability intervals.
It should be noted that the fitting was performed without making use of any data
binning.
The coefficients for the logit model, whose posterior distribution is displayed in Fig-
ure 10.17, represent the log of the odds ratio for Seyfert activity; thus one unit variation
in r/r200 towards the cluster outskirts for an elliptical galaxy residing in a cluster with an
average mass log M200 = 14 produces on average a change of 0.197 in the log of the odds
ratio of Seyfert activity or, in other words, it is 21.7% more likely to be a Seyfert galaxy as
we move farther from the center. Unlike elliptical galaxies, however, spirals are virtually
unaffected by their position inside the cluster or by the mass of their host, all the fitted
coefficients being consistent with zero.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
330 Astronomical Applications
1.00
0.75
fSeyfert
0.50
0.25
0.00
0 2 4 6
r/r200
1.00
0.75
fSeyfert
0.50
0.25
0.00
0 2 4 6
t
r/r200
Figure 10.16 Two-dimensional representation of the six-dimensional parameter space describing the dependence of Seyfert AGN
activity as a function of r/r200 and log M200 , for clusters with an average log M200 ≈ 1014 M : upper panel, spirals;
lower panel, elliptical galaxies. In each panel the lines (dashed or dotted) represent the posterior mean probability of
Seyfert AGN activity for each value of r/r200 , while the shaded areas depict the 50% and 95% credible intervals. The
data points with error bars represent the data when binned, for purely illustrative purposes. (A black and white
version of this figure will appear in some formats. For the color version, please refer to the plate section.)
results from R/JAGS, we use the dot_product function so the configurations of matrices
are compatible. This step is not necessary unless you are performing such comparisons.
Code 10.19 Bernoulli logit model, in Python using Stan, for assessing the relationship
between Seyfert AGN activity and galactocentric distance.
==================================================
import numpy as np
import pandas as pd
import pystan
import statsmodels.api as sm
# Data
path_to_data = ’˜/data/Section_10p8/Seyfert.csv’
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
331 10.8 Bernoulli Mixed Model, AGNs, and Cluster Environment
[ , ]
beta[1,1] beta[2,1]
[ , ] [ , ]
beta[3,1]
500
100 200 300 400 500 600
500
400
400
300
300
200
200
100
100
Frequencies
0
0
–0.2 0.0 0.2 0.4 –0.4 -0.2 0.0 0.0 0.2 0.4 0.6 0.8
beta[1,2]
[ , ] beta[2,2]
[ , ] beta[3,2]
[ , ]
100 200 300 400 500 600
500
500
400
400
300
300
200
200
100
100
0
–0.2 –0.1 0.0 0.1 0.2 –0.2 –0.1 0.0 0.1 0 –0.2 –0.1 0.0 0.1 0.2
t
Posterior distribution
Figure 10.17 Computed posterior for the β coefficients of our model. From left to right: intercept, log M200 , and r/r200 for elliptical
( j = 1, upper panels) and spiral ( j = 2, lower panels) galaxies respectively.
# Read data
data_frame = dict(pd.read_csv(path_to_data))
x1 = data_frame[’logM200’]
x2 = data_frame[’r_r200’]
data = {}
data[’Y’] = data_frame[’bpt’]
data[’X’] = sm.add_constant(np.column_stack((x1,x2)))
data[’K’] = data[’X’].shape[1]
data[’N’] = data[’X’].shape[0]
data[’gal’] = [0 if item == data_frame[’zoo’][0] else 1
for item in data_frame[’zoo’]]
data[’P’] = 2
# Fit
# Stan model
stan_code="""
data{
int<lower=0> N; # number of data points
int<lower=0> K; # number of coefficients
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
332 Astronomical Applications
for (i in 1:N) {
if (gal[i] == gal[1]) pi[i] = dot_product(col(beta,1),X[i]);
else pi[i] = dot_product(col(beta,2), X[i]);
}
# shared hyperpriors
sigma ~ gamma(0.001, 0.001);
mu ~ normal(0, 100);
Y ˜ bernoulli_logit(pi);
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=data, iter=60000, chains=3,
warmup=30000, thin=10, n_jobs=3)
# Output
print(fit)
==================================================
mean se_mean sd 2.5% ... ... ... 97.5% n_eff Rhat
beta[0,0] 0.04 2.2e-3 0.09 -0.11 ... ... ... 0.23 1543.0 1.0
beta[1,0] -0.15 2.9e-3 0.1 -0.36 ... ... ... 0.01 1115.0 1.0
beta[2,0] 0.17 5.0e-3 0.12 -0.05 ... ... ... 0.42 595.0 1.01
beta[0,1] 9.4e-4 9.4e-4 0.05 -0.1 ... ... ... 0.1 2889.0 1.0
beta[1,1] -0.02 9.3e-4 0.05 -0.12 ... ... ... 0.08 2908.0 1.0
beta[2,1] 3.9e-3 1.0e-3 0.05 -0.1 ... ... ... 0.11 2745.0 1.0
sigma 0.14 5.7e-3 0.09 0.02 ... ... ... 0.37 265.0 1.0
mu 8.0e-3 8.2e-4 0.08 -0.14 ... ... ... 0.17 9000.0 1.0
We explore now the more subtle problem of considering two separate components in
the construction of our statistical model. The so-called hurdle models form a class of
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
333 10.9 Lognormal–Logit Hurdle Model and the Halo–Stellar-Mass Relation
10.9.1 Data
The data set used in this work was retrieved from a cosmological hydro-simulation based
on Biffi and Maio (2013) (see also de Souza et al., 2014; Maio et al., 2010, 2011). The
simulations have snapshots in the redshift range 9 z 19 for a cubic volume of comov-
ing side ∼0.7 Mpc, sampled with 2 × 3203 particles per gas and dark-matter species. The
simulation output considered in this work comprises N = 1680 halos in the whole red-
shift range, with about 200 objects at z = 9. The masses of the halos are in the range
105 M Mdm 108 M , with corresponding stellar masses 0 < M 104 M .
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
334 Astronomical Applications
modeled by another distribution that does not allow zeros, for which we choose the log-
normal. In other words: the Bernoulli process is causing the presence or absence of stars in
the halo; once stars are present, their abundance is described by a lognormal distribution.
In this context the stellar mass represents our response variable M , and the dark-matter
halo is our explanatory variable Mdm . Given that there are now two different underlying
statistical distributions, we will also have two distinct linear predictors (each one bearing
K = 2 coefficients). In the model in Equation 10.17 below, {β1 , β2 } and {γ1 , γ2 } are the
linear predictor coefficients for the Bernoulli and lognormal distributions, respectively. The
complete model can be expressed as follows:
Bernoulli(pi ) if M;i = 0
M;i ∼
LogNormal(μi , σ 2 ) otherwise
logit(pi ) = γ1 + γ2 Mdm;i
μi = β1 + β2 Mdm;i (10.17)
βj ∼ Normal(0, 103 )
γj ∼ Normal(0, 103 )
σ ∼ Gamma(0.001, 0.001)
i = 1, . . . , N
j = 1, . . . , K
Code 10.20 Lognormal–logit hurdle model, in R using JAGS, for assessing the relationship
between dark-halo mass and stellar mass.
==================================================
require(R2jags)
# Data
dataB <- read.csv("˜/data/Section_10p9/MstarZSFR.csv",header = T)
hurdle <- data.frame(x =log(dataB$Mdm,10), y = asinh(1e10*dataB$Mstar))
Kc <- ncol(Xc)
Kb <- ncol(Xb)
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
335 10.9 Lognormal–Logit Hurdle Model and the Halo–Stellar-Mass Relation
Xc = Xc, # covariates
Xb = Xb, # covariates
Kc = Kc, # number of betas
Kb = Kb, # number of gammas
N = nrow(hurdle), # sample size
Zeros = rep(0, nrow(hurdle)))
# Fit
load.module(’glm’)
sink("ZAPGLM.txt")
cat("
model{
# 1A. Priors beta and gamma
for (i in 1:Kc) {beta[i] ˜ dnorm(0, 0.0001)}
for (i in 1:Kb) {gamma[i] ˜ dnorm(0, 0.0001)}
# Identify parameters
params <- c("beta", "gamma", " sigmaLN")
# Run MCMC
H1 <- jags(data = JAGS.data,
inits = inits,
parameters = params,
model = "ZAPGLM.txt",
n.thin = 1,
n.chains = 3,
n.burnin = 5000,
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
336 Astronomical Applications
n.iter = 15000)
# Output
print(H1,intervals=c(0.025, 0.975), digits=3)
==================================================
mu.vect sd.vect 2.5% 97.5% Rhat n.eff
beta[1] -2.1750e+00 0.439 -3.0590e+00 -1.3080e+00 1.047 60
beta[2] 5.8900e-01 0.063 4.6500e-01 7.1600e-01 1.038 66
gamma[1] -5.3748e+01 4.065 -6.2322e+01 -4.6319e+01 1.001 32000
gamma[2] 7.9080e+00 0.610 6.7900e+00 9.1930e+00 1.001 26000
sigmaLN 2.2100e-01 0.012 2.0000e-01 2.4600e-01 1.001 7900
deviance 3.3600e+13 3.222 3.3600e+13 3.3600e+13 1.000 1
Figure 10.18 shows the posterior distributions for our model. The resulting fit for
the relationship between M and Mdm , and the data, can be seen in Figure 10.19. The
shaded areas represent the 50% (darker) and 95% (lighter) credible intervals around
the mean, represented by the dashed line. To illustrate the different parts of the pro-
cess, we also show an illustrative set of layers in Figure 10.20. From bottom to top
this figure shows the original dataset, the logistic part of the fit, that provides the prob-
ability of stellar presence given a certain value of Mdm ; the fitted lognormal model of
the positive continuous part, and finally the full model is displayed again in the upper
panel.
gamma[1] gamma[2]
1000 2000 3000 4000
1000 2000 3000 4000
Frequencies
0
0
t
Posterior distribution
Figure 10.18 Computed posteriors for the β and γ coefficients of our lognormal–logit hurdle model. The horizontal thick line at
the bottom of each histogram defines the 95% credible interval.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
337 10.9 Lognormal–Logit Hurdle Model and the Halo–Stellar-Mass Relation
105
104
103
M* (Msunh–1)
102
10
2.5
106 107
t
Mdm (Msunh–1)
Figure 10.19 Fitted values for the dependence between stellar mass M and dark-matter halo mass Mdm resulting from the
lognormal–logit hurdle model. The dashed line represents the posterior mean stellar mass for each value of Mdm ,
while the shaded areas depict the 50% (darker) and 95% (lighter) credible intervals around the mean; Msun is the
mass of the Sun and h is the dimensionless Hubble parameter. (A black and white version of this figure will appear in
some formats. For the color version, please refer to the plate section.)
Code 10.21 Lognormal–logit hurdle model, in Python using Stan, for assessing the
relationship between dark halo mass and stellar mass.
==================================================
import numpy as np
import pandas as pd
import pystan
import statsmodels.api as sm
# Data
path_to_data = ’˜/data/Section_10p9/MstarZSFR.csv’
# Read data
data_frame = dict(pd.read_csv(path_to_data))
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
338 Astronomical Applications
105
104
1) 103
h
n
su
2
107
(M
10
M*
h1 )
10 (M sun
2.5 106 M dm
0
105
104
103
1)
h
n
su
102 107
(M
M*
h1 )
10
dm
(M sun
2.5 106 M
0
1.00
0.75
0.50
P*
107
h1 )
0.25 M sun
106
M dm (
0.00
104
103
1)
h
102
n
su
107
(M
*
M
10 h 1 )
(M sun
2.5 106
M dm
0
t
Figure 10.20 Illustration of the different layers in the lognormal–logit hurdle model. From bottom to top: the original data; the
Bernoulli fit representing the presence or absence of stars in the halo; the lognormal fit of the positive part (i.e.,
ignoring the zeros in M ); and the full model. (A black and white version of this figure will appear in some formats. For
the color version, please refer to the plate section.)
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
339 10.9 Lognormal–Logit Hurdle Model and the Halo–Stellar-Mass Relation
data = {}
data[’Y’] = y
data[’Xc’] = sm.add_constant(x.transpose())
data[’Xb’] = sm.add_constant(x.transpose())
data[’Kc’] = data[’Xc’].shape[1]
data[’Kb’] = data[’Xb’].shape[1]
data[’N’] = data[’Xc’].shape[0]
# Fit
# Stan model
stan_code="""
data{
int<lower=0> N; # number of data points
int<lower=0> Kc; # number of coefficients
int<lower=0> Kb;
matrix[N,Kb] Xb; # dark matter halo mass
matrix[N,Kc] Xc;
real<lower=0> Y[N]; # stellar mass
}
parameters{
vector[Kc] beta;
vector[Kb] gamma;
real<lower=0> sigmaLN;
}
model{
vector[N] mu;
vector[N] Pi;
mu = Xc * beta;
for (i in 1:N) Pi[i] = inv_logit(Xb[i] * gamma);
for (i in 1:N) {
(Y[i] == 0) ~ bernoulli(Pi[i]);
if (Y[i] > 0) Y[i] ~ lognormal(mu[i], sigmaLN);
}
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=data, iter=15000, chains=3,
warmup=5000, thin=1, n_jobs=3)
# Output
print(fit)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[0] -2.1 7.3e-3 0.53 -3.15 -2.46 -2.11 -1.75 -1.07 5268.0 1.0
beta[1] 0.5 1.1e-3 0.08 0.43 0.53 0.58 0.63 0.73 5270.0 1.0
gamma[0] 54.61 0.06 4.35 46.65 51.63 54.45 57.36 63.66 4972.0 1.0
gamma[1] -8.04 9.2e-3 0.65 -9.39 -8.45 -8.01 -7.59 -6.84 4973.0 1.0
sigmaLN 0.27 2.3e-4 0.02 0.24 0.26 0.27 0.28 0.31 5943.0 1.0
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
340 Astronomical Applications
We now take one step further from the classical generalized linear models (GLMs) pre-
sented previously. The goal of this section is to show how the framework of GLMs can be
used to deal with a special data type, time series. A time series is characterized by a set of
measurements of the same event or experiment taken sequentially over time. This is funda-
mentally different from the situation in previous examples, where the order of the observed
points was irrelevant. In a time series, time itself drives causality and builds structure within
the data. The state of the system at a given time imposes constraints on its possible states
in the near future and has a non-negligible influence on the states further ahead.
In astronomy, this situation is common. The whole transient sky is studied by means of
light curves (measurement of brightness over time), which are essentially examples of time
series. Despite the fact that frequently, a non-homogeneous time separation between con-
secutive observations imposes some extra complexity, this does not invalidate the potential
application of such models in an astronomical context.23 Moreover, examples such as the
one shown here below fulfill all the basic requirements for time series analysis and illustrate
the potential of GLMs to deal with this type of data.
In what follows we will use the time series framework to model the evolution of the num-
ber of sunspots over time. The appearance or disappearance of black regions in the Sun’s
surface (sunspots) has been reported since ancient times (Vaquero, 2007). Their emergence
is connected with solar magnetic activity and is considered to be a precursor of more drastic
events such as solar flares and coronal mass ejections. Careful observation of the number
of sunspots over time has revealed the existence of a solar cycle of approximately 11 years,
which is correlated with extreme emissions of ultraviolet and X-ray radiation. Such bursts
can pose significant concerns for astronauts living in space, airline travelers in polar routes,
and engineers trying to optimize the lifetime of artificial satellites (Hathaway, 2015).
Here we show how the classic autoregressive (AR) model (see e.g. the book, Lunn Jack-
son, Best et al., 2012, for examples of Bayesian time series), which typically assumes the
residuals to be normally distributed, can be modified to fit a count time series (the number
of sunspots is a discrete response variable). Considering the number of sunspots in the yth
year, Nspots;y , the standard normal AR model states that Nspots;y depends linearly on its own
value in previous years and on a stochastic term, σ . This can be represented as:
23 In standard time series analysis, an equally spaced time interval is required between two consecutive observa-
tions. This restriction can be circumvented by taking into account the existence of missing data (see the book
Pole, West, and Harrison, Chapter 1).
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
341 10.10 Count Time Series and Sunspot Data
where μy can be identified as the linear predictor for year y, having {φ1 , φ2 } as its coef-
ficients, and p represents the order of the AR(p) process. The latter quantifies how the
influence of previous measurements on the current state of the system fades with time. A
low p means that the system only “remembers” recent events and a high p indicates that
even remote times can influence current and future system developments.
Here we will consider a simple case, the AR(1) model,
applied to the annual sunspot-number data set, but see Chattopadhyay and Chattopadhyay
(2012) for an application of an AR(3) model to the monthly sunspot-number time series.
10.10.1 Data
We will use the mean annual data for the International Sunspot number,24 under the respon-
sibility of the Royal Observatory in Brussels25 since 1980 (Feehrer, 2000). This is consid-
ered the “official” number of sunspots by the International Astronomical Union (IAU).
The Sunspot Index Data Center (SIDC), hosted by the Royal Observatory in Brussels, pro-
vides daily, monthly, and annual numbers, as well as predictions of sunspot activities. The
observations are gathered by a team of amateur and professional astronomers working at
40 individual stations.
Code 10.22 Normal autoregressive model AR(1) for accessing the evolution of the number
of sunspots through the years.
==================================================
require(R2jags)
require(jagstools)
# Data
# Read data
sunspot <- read.csv("˜/data/Section_10p10/sunspot.csv",header = T, sep=",")
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
342 Astronomical Applications
y <- round(sunspot[,2])
t <- seq(1700,2015,1)
N <- length(y)
# Fit
# Likelihood function
for (t in 2:N) {
Y[t] ˜ dnorm(mu[t],tau)
mu[t] <- phi[1] + phi[2] * Y[t-1]
}
# Prediction
for (t in 1:N){
Yx[t]~dnorm(mu[t],tau)
}
}"
# Identify parameters
# Include Yx only if you intend to generate plots
params <- c("sd", "phi", "Yx")
# Run mcmc
jagsfit <- jags(data = sun_data,
inits = inits,
parameters = params,
model = textConnection(AR1_NORM),
n.thin = 1,
n.chains = 3,
n.burnin = 3000,
n.iter = 5000)
# Output
print(jagsfit,intervals = c(0.025, 0.975),justify = "left", digits=2)
==================================================
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
343 10.10 Count Time Series and Sunspot Data
300
200
Sunspots
100
t
Year
Figure 10.21 Normal AR(1) model fit for sunspot data. The solid line represents the posterior mean and the shaded areas are the
50% (darker) and 95% (lighter) credible intervals; the dots are the observed data. The dashed horizontal line
represents a barrier which should not be crossed, since the number of sunspots cannot be negative.
In order to visualize how the model fits the data, the reader can ask JAGS to monitor the
parameter Yx and then extract the values with the function jagsresults. The fitted model
is shown in Figure 10.21 and the code to reproduce the figure is given below.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
344 Astronomical Applications
legend.position="top",
axis.title.y = element_text(vjust = 0.1,margin=margin(0,10,0,0)),
axis.title.x = element_text(vjust = -0.25),
text = element_text(size = 25,family="serif"))+
geom_hline(aes(yintercept=0),linetype="dashed",colour="gray45",size=1.25)
The model fits the data reasonably well (Figure 10.21), given its simplicity, but it allows
negative values of sunspot number (the shaded areas do not respect the barrier represented
by the dashed line); this is expected given that a model with normal residuals does not
impose any restrictions in this regard. Now we show how to change the classical AR model
to allow other distributions, in the hope this can serve as a template for the reader to try
more complex models on his or her own. Let us try a negative binomial AR(1) model,
which can be expressed as follows:
Nspots;y ∼ NB(py , θ )
py = θ/(θ + μy ) (10.20)
log(μy ) = φ1 + φ2 Nspots;y−1
Code 10.24 Negative binomial model (AR1) for assessing the evolution of the number of
sunspots through the years.
==================================================
# Fit
AR1_NB<-"model{
for(i in 1:2){
phi[i]˜dnorm(0,1e-2)
}
theta~dgamma(0.001,0.001)
mu[1] <- Y[1]
# Likelihood function
for (t in 2:N) {
Y[t] ˜ dnegbin(p[t],theta)
p[t] <- theta/(theta+mu[t])
log(mu[t]) <- phi[1] + phi[2]*Y[t-1]
}
for (t in 1:N){
Yx[t] ˜ dnegbin(px[t],theta)
px[t] <- theta/(theta+mu[t])
}
}"
# Identify parameters
# Include Yx only if interested in prediction
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
345 10.10 Count Time Series and Sunspot Data
Figure 10.22 illustrates the results for the NB AR(1) model. Note that the model now
respects the lower limit of zero counts, but the 95% credible intervals overestimate the
counts. It is important to emphasize that there are several other models that should be tested
in order to decide which one suits this particular data best. We did not intend to provide a
comprehensive overview of time series in the section but instead to provide the reader with
1000
Sunspots
500
0
1700 1800 1900 2000
t
Year
Figure 10.22 Negative binomial AR(1) model fit for sunspot data. The solid line represents the posterior mean and the shaded areas
are the 50% and 95% credible intervals; the dots are the observed data.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
346 Astronomical Applications
Code 10.25 Negative binomial model (AR1) in Python using Stan, for assessing the
evolution of the number of sunspots through the years.
==================================================
import numpy as np
import pandas as pd
import pystan
import statsmodels.api as sm
# Data
path_to_data = "˜/data/Section_10p10/sunspot.csv"
# Read data
data_frame = dict(pd.read_csv(path_to_data))
# Fit
# Stan model
stan_code="""
data{
int<lower=0> nobs; # number of data points
int<lower=0> K; # number of coefficients
int Y[nobs]; # nuber of sunspots
}
parameters{
vector[K] phi; # linear predictor coefficients
real<lower=0> theta; # noise parameter
}
model{
vector[nobs] mu;
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
347 10.11 Gaussian Model, ODEs, and Type Ia Supernova Cosmology
Y ˜ neg_binomial_2(mu, theta);
}
"""
# Run mcmc
fit = pystan.stan(model_code=stan_code, data=data, iter=7500, chains=3,
warmup=5000, thin=1, n_jobs=3)
# Output
print(fit)
==================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
phi[0] 3.33 1.6e-3 0.06 3.22 3.29 3.33 3.38 3.45 1510.0 1.0
phi[1] 0.01 1.5e-5 6.2e-4 9.4e-3 0.01 0.01 0.01 0.01 1699.0 nan
theta 2.57 5.8e-3 0.21 2.16 2.42 2.56 2.71 3.01 1386.0 1.0
The output results are almost identical to those found using R and JAGS. The reader
might notice, however, that one Rhat value is reported to be nan. This is an arithmetic
instability which appears when calculating Rhat for a parameter value very close to zero,
as is the case for phi[1], and is not connected with convergence failure.26
Further Reading
Hathaway, D. H. (2015). “The solar cycle.” Living Rev. Solar Phy. 12. DOI: 10.1007/
lrsp-2015-4. arXiv:1502.07020 [astro-ph.SR].
Harrison, J., A. Pole, M. West (1994). Applied Bayesian Forecasting and Time Series
Analysis. Springer.
Type Ia supernovae (SNe Ia) are extremely bright transient events which can be used as
standardizable candles for distance measurements in cosmological scales. In the late 1990s
they provided the first evidence for the current accelerated expansion of the Universe (Perl-
mutter et al., 1999; Riess et al., 1998) and consequently the existence of dark energy. Since
then they have been central to every large-scale astronomical survey aiming to shed light
on the dark-energy mystery.
In the last few years, a considerable effort has been applied by the astronomical commu-
nity in an attempt to popularize Bayesian methods for cosmological parameter inference,
especially when dealing with type Ia supernovae data (e.g. Andreon, 2011; Ma et al., 2016;
Mandel et al., 2011; Rubin et al., 2015; Shariff et al., 2015). Thus, we will not refrain from
26 https://groups.google.com/forum/#!topic/stan-users/hn4W_p8j3fs
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
348 Astronomical Applications
tackling this problem and showing how Stan can be a powerful tool to deal with such
complex model.
At maximum brightness the observed magnitude of an SN Ia can be connected to its
distance modulus μ through the expression
where mobs is the observed magnitude, M is the magnitude, and x1 and c are the stretch and
color corrections derived from the SALT2 standardization (Guy et al., 2007), respectively.
To take into account the effect of the host stellar mass M on M and β, we use the correction
proposed by Conley et al. (2011):
M if M < 1010 M
M= (10.22)
M + M otherwise
Considering a flat Universe, k = 0, containing dark energy and dark matter, the
cosmological connection can be expressed as
dL
μ = 5 log10 , (10.23)
10pc
! z
c dz
dL (z) = (1 + z) , (10.24)
H0 0 E(z)
"
E(z) = m (1 + z)3 + (1 − m )(1 + z)3(1+w) , (10.25)
where dL is the luminosity distance, c the speed of light, H0 the Hubble constant, m the
dark-energy density, and w the dark-energy equation of state parameter.
In what follows we will begin with a simplified version of this problem and, subse-
quently, guide the reader through implementations of further complexity.
10.11.1 Data
We used data provided by Betoule et al. (2014), known as the joint light-curve analysis
(JLA) sample.27 This is a compilation of data from different surveys which contains 740
high quality spectroscopically confirmed SNe Ia up to redshift z ∼ 1.0.
Our statistical model will thus have one response variable (the observer magnitude, mobs )
and four explanatory variables (the redshift z, the stretch x1 , the color c, and the host galaxy
mass Mhost ).
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
349 10.11 Gaussian Model, ODEs, and Type Ia Supernova Cosmology
mobs;i ∼ Normal(ηi , ε)
ηi = 25 + 5 log10 (dli (H0 , w, m )) + M(M) − αx1;i + βci
∼ Gamma(10−3 , 10−3 )
M ∼ Normal(−20, 5) (10.26)
α ∼ Normal(0, 1)
β ∼ Normal(0, 10)
M ∼ Normal(0, 1)
m ∼ Uniform(0, 1)
H0 ∼ Normal(70, 5)
i = 1, . . . , N
where dL is given by Equation 10.24 and M by Equation 10.22. We use conservative priors
over the model parameters. These are not completely non-informative but they do allow
a large range of values to be searched without putting a significant probability on non-
physical values.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
350 Astronomical Applications
Our goal with this example is to provide a clean environment where the ODE solver role
is highlighted. A more complex model is presented subsequently.
Code 10.26 Bayesian normal model for cosmological parameter inference from type Ia
supernova data in R using Stan.
=========================================================
library(rstan)
# Preparation
# Set initial conditions
z0 = 0 # initial redshift
E0 = 0 # integral(1/E) at z0
# physical constants
c = 3e5 # speed of light
H0 = 70 # Hubble constant
# Data
# Read data
data <- read.csv("˜/data/Section_10p11/jla_lcparams.txt",header=T)
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
351 10.11 Gaussian Model, ODEs, and Type Ia Supernova Cosmology
dEdz[1] = 1.0/sqrt(params[1]*(1+z)^3
+(1-params[1])*(1+z)^(3*(1+params[2])));
return dEdz;
}
}
data {
int<lower=1> nobs; // number of data points
real E0[1]; // integral(1/H) at z=0
real z0; // initial redshift, 0
real c; // speed of light
real H0; // Hubble parameter
vector[nobs] obs_mag; // observed magnitude at B max
real x1[nobs]; // stretch
real color[nobs]; // color
real redshift[nobs]; // redshift
real hmass[nobs]; // host mass
}
transformed data {
real x_r[0]; // required by ODE (empty)
int x_i[0];
}
parameters{
real<lower=0, upper=1> om; // dark matter energy density
real alpha; // stretch coefficient
real beta; // color coefficient
real Mint; // intrinsic magnitude
real deltaM; // shift due to host galaxy mass
real<lower=0> sigint; // magnitude dispersion
real<lower=-2, upper=0> w; // dark matter equation of state
parameter
}
transformed parameters{
real DC[nobs,1]; // co-moving distance
real pars[2]; // ODE input = (om, w)
vector[nobs] mag; // apparent magnitude
real dl[nobs]; // luminosity distance
real DH; // Hubble distance = c/H0
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
352 Astronomical Applications
pars[1] = om;
pars[2] = w;
DH = (c/H0);
# Integral of 1/E(z)
DC = integrate_ode_rk45(Ez, E0, z0, redshift, pars, x_r, x_i);
for (i in 1:nobs) {
dl[i] = DH * (1 + redshift[i]) * DC[i, 1];
if (hmass[i] < 10) mag[i] = 25 + 5 * log10(dl[i])
+ Mint - alpha * x1[i] + beta
* color[i];
else
mag[i] = 25 + 5 * log10(dl[i])
+ Mint + deltaM - alpha * x1[i] + beta * color[i];
}
}
model {
# Priors and likelihood
sigint ˜ gamma(0.001, 0.001);
Mint ˜ normal(-20, 5.);
beta ˜ normal(0, 10);
alpha ˜ normal(0, 1);
deltaM ˜ normal(0, 1);
# Run MCMC
fit <- stan(model_code = stan_model,
data = stan_data,
seed = 42,
chains = 3,
iter = 15000,
cores = 3,
warmup = 7500
)
# Output
# Summary on screen
print(fit,pars=c("om", "Mint","alpha","beta","deltaM", "sigint"),
intervals=c(0.025, 0.975), digits=3)
=========================================================
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
om 0.232 0.001 0.091 0.036 0.172 0.243 0.300 0.380 7423 1
Mint -19.059 0.000 0.017 -19.094 -19.071 -19.059 -19.048 19.027 8483 1
w -0.845 0.002 0.180 -1.237 -0.960 -0.829 -0.708 -0.556 7457 1
alpha 0.119 0.000 0.006 0.106 0.114 0.119 0.123 0.131 16443 1
beta 2.432 0.001 0.071 2.292 2.384 2.432 2.480 2.572 16062 1
deltaM -0.031 0.000 0.013 -0.055 -0.039 -0.031 -0.022 -0.006 11938 1
sigint 0.159 0.000 0.004 0.151 0.156 0.159 0.162 0.168 16456 1
The results are consistent with those reported by Ma et al. (2016, Section 4), who applied
Bayesian graphs to the same data. A visual representation of posteriors over m and w is
shown in Figure 10.23, left-hand panel.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
353 10.11 Gaussian Model, ODEs, and Type Ia Supernova Cosmology
t
Figure 10.23 Joint posterior distributions over the dark-matter energy density w and equation of state parameter w obtained
from a Bayesian Gaussian model applied to the JLA sample. Left: the results without taking into account errors in
measurements. Right: the results taking into account measurement errors in color, stretch, and observed magnitude.
(A black and white version of this figure will appear in some formats. For the color version, please refer to the plate
section.)
Small modifications are necessary to implement these effects. In Code 10.26 we must
read the measured errors and include them in the list of inputs:
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
354 Astronomical Applications
data{
vector[nobs] obs_magerr;
real colorerr[nobs];
real x1err[nobs];
parameters{
real truemag[nobs];
real truec[nobs];
real truex1[nobs];
}
transformed parameters{
for (i in 1:nobs) {
dl[i] = DH * (1 + redshift[i]) * DC[i, 1];
if (hmass[i] < 10) mag[i] = 25 + 5 * log10(dl[i])
+ Mint - alpha * truex1[i] + beta *
truec[i];
else mag[i] = 25 + 5 * log10(dl[i])
+ Mint + deltaM - alpha * truex1[i] + beta *
truec[i];
}
}
model{
truec ˜ normal(0,2);
color ˜ normal(truec, colorerr);
truex1 ˜ normal(0,5);
x1 ˜ normal(truex1, x1err);
truemag ˜ normal(mag, sigint);
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
355 10.12 Approximate Bayesian Computation
Mint -19.051 0.000 0.016 -19.084 ... ... ... -19.020 6694 1.000
w -0.877 0.003 0.185 -1.283 ... ... ... -0.572 5340 1.000
alpha 0.125 0.000 0.006 0.113 ... ... ... 0.137 6426 1.002
beta 2.569 0.001 0.067 2.440 ... ... ... 2.704 2672 1.002
deltaM -0.043 0.000 0.012 -0.067 ... ... ... -0.019 8113 1.001
sigint 0.016 0.001 0.008 0.005 ... ... ... 0.034 39 1.069
Figure 10.23 shows joint posterior distributions over m and w. Comparing the two
panels of this figure, we see the effect on the posterior shape of adding the errors, especially
the tighter constraints for the 1σ levels (dark blue). The numerical results also corroborate
with our expectations: including the errors in measurement increases m , and decreases w
and also has non-negligible effects on α and β.
Given the code snippets provided above, small modifications can be added to include
further effects such as the presence of other uncertainties and the systematic effects
described in Betoule et al. (2014, Section 5.5). However, it is important to be aware of
your computational capabilities, given that the memory usage can escalate very fast.29
Contemporary examples of similar exercises can be found in the literature pointed
below.
Further Reading
Andreon, S. and B. Weaver (2015). Bayesian Methods for the Physical Sciences:Learning
from Examples in Astronomy and Physics. Springer Series in Astrostatistics. Springer.
Ma, C., P.-S. Corasaniti, and B. A. Bassett (2016). “Application of Bayesian graphs to SN
Ia data analysis and compression.” ArXiv e-prints. arXiv:1603.08519.
Mandel, K. S., G. Narayan, and R. P. Kirshner (2011). “Type Ia supernova light curve
inference: hierarchical models in the optical and near-infrared.” Astrophys. J. 731, 120.
DOI : 10.1088/0004-637X/731/2/120. arXiv:1011.5910.
Rubin, D. et al. (2015). “UNITY: Confronting supernova cosmology’s statistical and sys-
tematic uncertainties in a unified Bayesian framework.” Astrophys. J. 813, 137. DOI:
10.1088/0004-637X/813/2/137. arXiv: 1507.01602.
Shariff, H. et al. (2015). “BAHAMAS: new SNIa analysis reveals inconsistencies with
standard cosmology.” ArXiv e-prints. arXiv: 1510.05954.
All the examples tackled up to this point require the definition and evaluation of a likeli-
hood function. However, it is not unlikely that in real-world situations one is faced with
problems whose likelihood is not well known or is too complex for the computational
resources at hand. In astronomy, such situations may appear in the form of a selec-
tion bias in stellar studies (Janson et al., 2014; Sana et al., 2012), data quality in time
series from AGN X-ray emissions (Shimizu and Mushotzky, 2013; Uttley et al., 2002),
29 Team Stan (2016, Section 4.3) provides a few tips which might help in optimizing more complex models.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
356 Astronomical Applications
or stellar coronae emission in the UV (Kashyap et al., 2002), to cite just a few. In this
last astronomical example we present an alternative algorithm which enables parameter
inference without the need to explicitly define a likelihood function.30
Approximate Bayesian computation (ABC) uses our ability to perform quick and realis-
tic simulations as a tool for parameter inference (a technique known as forward simulation
inference). The main idea consists in comparing the “distance” between a large number of
simulated data sets and the observed data set and keeping a record only of the parameter
values whose simulation satisfies a certain distance threshold.
Despite being proposed more than 30 years ago (Rubin, 1984) ABC techniques have
only recently appeared in astronomical analysis (e.g. Akeret et al., 2015; Cameron and
Pettitt, 2012; Ishida et al., 2015; Killedar et al., 2015; Lin and Kilbinger, 2015; Robin
et al., 2014; Schafer and Freeman, 2012; Weyant et al., 2013). Here we present an evo-
lution of the original ABC algorithm called Population Monte Carlo ABC (PMC-ABC,
Beaumont et al., 2009). It uses an initial set of random parameter values {θ} drawn from
the priors (called a particle system), which evolves through incremental approximations
and ultimately converges to the true posterior distribution. We give below details about
the necessary ingredients, the complete ABC algorithm (Algorithm 1) and instructions for
implementation using the Python package CosmoABC.31
The main ingredients necessary to implement ABC are: (i) a simulator, or forward
model, (ii) prior probability distributions p(θθ ) over the input parameters θ , and (iii) a dis-
tance function, ρ(DO , DS ), where DO and DS denote the observed and simulated catalogs
(data sets for θ ) , respectively. We begin by describing a simple toy model so that the reader
can gain a better intuition about how the algorithm and code can be used. Subsequently we
briefly describe how the same code can be applied to the determination of cosmological
parameters from measurements of galaxy cluster number counts.
Suppose we have a catalog (data set) of P observations DO = {x1 , . . . , xP }, which are
considered to be realizations of a random variable X following a Gaussian distribution,
X ∼ Normal(μ, σ 2 ). Our goal is to determine the credible intervals over the model param-
eters θ = {μ, σ } on the basis of DO and p(θθ ). The ABC algorithm can be summarized in
three main steps.
10.12.1 Distance
The definition of an appropriate distance, or summary statistic, is paramount when
designing an ABC algorithm. For illustrative purposes it can be pictured as a type of
dimensionality-reduction function whose purpose is to summarize the difference between
two catalogs in one number. The summary statistic needs to be null for identical catalogs,
30 See Ishida et al. (2015).
31 https://pypi.python.org/pypi/CosmoABC
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
357 10.12 Approximate Bayesian Computation
θ it−1 with covariance matrix built from St−1 and calculated at θ j . It is important to note that
the distribution, in this case the Gaussian PDF, must be chosen according to the data set
characteristics. For our synthetic model the Gaussian works well, but it should be replaced
if the parameter space has special restrictions (e.g. discrete values).
In the construction of the subsequent particle systems St>0 we use the weights in Equa-
tion 10.28 to performed an importance sampling, as shown in Algorithm 1. The process is
repeated until convergence, which, in our case, happens when the number of particle draws
needed to construct a particle system is larger than a given threshold.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
358 Astronomical Applications
Assuming that our model is true, and the behavior of X can be entirely described by a
Gaussian distribution, all the information contained in a catalog can be reduced to its mean
and standard deviation. Consequently, we choose a combination of these quantities as our
summary statistics:
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
359 10.12 Approximate Bayesian Computation
$ %
D̄O − D¯S σD O − σDS
ρ = abs + abs , (10.29)
D̄O σD O
where D̄O is the mean of all measurements in catalog DO and σD O is its standard deviation.
10.12.4 CosmoABC
The algorithm, toy model, and distance function described above are implemented in the
Python package32 CosmoABC.33
As we highlighted before, the first step in any ABC analysis is to make sure your distance
definition behaves satisfactorily in ideal situations. Thus, we advise you to start with a
synthetic “observed” data set so you can assess the efficiency of your summary statistic.
From the folder ∼cosmoabc/examples/ copy the files toy_model.input and
toy_model_functions.py to a new and empty directory.34 You might be interested in tak-
ing a look inside the toy_model_functions.py file to understand how the priors, distance,
and simulator are inserted into CosmoABC.
In order to start using the package in a test exercise, make sure that the keyword
path_to_obs is set to None. This guarantees that you will be using a synthetic “observed”
catalog. In order to visualize the behavior of your distance function, run in the command
line
32 We describe briefly here how to run the examples; a more detailed description on how to add customized
distances, simulators, and priors is given in Ishida et al. (2015, hereafter I2015) and in the code documentation
– http://CosmoABC.readthedocs.io/en/latest/.
33 https://pypi.python.org/pypi/CosmoABC
34 CosmoABC generates a lot of output files.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
360 Astronomical Applications
You will be asked to enter the name of the output file and the number of particles to be
drawn. The plot shown in Figure 10.24 will be generated and the following information
will be shown on the screen
The first line of this output indicates that the distance definition behaves as expected for
identical catalogs, the second shows a random set of parameter values drawn from the
prior (so you can check whether they fall within the expected boundaries) and the third
line shows the calculated distance from the original to the new catalog. The code will then
ask you to input the number of particles to be drawn for the visual output and Figure 10.24
will be generated. We emphasize that this is only one possible strategy to evaluate the
distance function, and it will not work in more complex scenarios. There is an extensive
literature which might help building intuition on the subject for more complex cases (e.g.
Aeschbacher et al., 2012; Burr and Skurikhin, 2013).
The complete ABC algorithm can be run by typing
$ run_ABC.py -i toy_model.input -f toy_model_functions.py
1.2
1.0
0.8
distance1
0.6
0.4
0.2
0.0
−0.2
−3 −2 −1 0 1 2 3 4 5
mean
1.2
1.0
0.8
distance1
0.6
0.4
0.2
0.0
−0.2
−1 0 1 2 3 4 5 6
t
std
Figure 10.24 Distance function ρ from Equation 10.29 as a function of the parameters mean (upper panel) and std (lower panel).
(A black and white version of this figure will appear in some formats. For the color version, please refer to the plate
section.)
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
361 10.12 Approximate Bayesian Computation
The code will output one file for each particle system and a results.pdf file with their
graphical representation. Figure 10.25 shows a few steps. At t = 0 (top left) we have an
almost homogeneous density across the parameter space. As the system evolves (t = 2, top
right, and t = 12, bottom left) the profiles converge to the fiducial parameter values (mean
= 2 and std = 1, bottom right).
In order to include your own simulator, distance function, and/or priors, customize the
functions in the file toy_model_functions.py and change the corresponding keywords in
the toy_model.input file. Remember that you can also use the latter to change the prior
parameters, the size of the particle system, and the convergence thresholds.
Further Reading
Cameron, E. and A. N. Pettitt (2012). “Approximate Bayesian computation for astro-
nomical model analysis: a case study in galaxy demographics and morphological
35 www.nongnu.org/numcosmo/
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
4 4
3 3
std
std
2 2
1 1
−2 −1 0 1 2 3 4 −2 −1 0 1 2 3 4
mean mean
0.18 0.25 0.6
0.24 0.5
0.20
0.20 0.4
density
0.14
density
density
density
0.15
0.16 0.3
0.10 0.2
0.10
0.12 0.05 0.1
0.06 0.08 0.00 0.0
−2 −1 0 1 2 3 4 1 2 3 4 5 −2 −1 0 1 2 3 4 1 2 3 4 5
mean std mean std
4 4
3 3
std
std
2 2
1 1
−2 −1 0 1 2 3 4 −2 −1 0 1 2 3 4
mean mean
10
1.2 3.0 8 12
density
density
density
density
0.8 2.0 6 8
4
0.4 1.0 4
2
0.0 0.0 0 0
t
−2 −1 0 1 2 3 4 1 2 3 4 5 −2 −1 0 1 2 3 4 1 2 3 4 5
mean std mean std
Figure 10.25 Evolution of the particle systems at different stages t of the ABC iterations for the toy model and distance function described in Section 10.12.1. At each stage
t the upper panel shows the density of particles in the two dimensional parameter space of our toy model (mean, std) and the lower left- and right-hand
panels show the density profiles over the same parameters. (A black and white version of this figure will appear in some formats. For the color version, please
refer to the plate section.)
363 10.13 Remarks on Applications
The examples discussed in this chapter represent only a glimpse of the potential
to be unraveled in the intersection between astronomy and statistics. We hope the
resources provided in this volume encourage researchers to join the challenge of break-
ing down the cultural barriers which so often have prevented such interdisciplinary
endeavors.
It is important to highlight that, beyond the many advantages astronomers can gain by
collaborating with statisticians, the challenge of dealing with astronomical data is also
a fertile ground for statistical research. As an observational science, astronomy has to
deal with data situations that are usually absent in other areas of research: the presence
of systematic errors in measurements, missing data, outliers, selection bias, censoring, the
existence of foreground–background effects and so forth.
Such challenges are already becoming overwhelmingly heavy, a weight astronomers
alone cannot carry and statisticians rarely have the opportunity to observe. As a final
remark, we would like to emphasize that the astronomical examples in this volume were
made as accessible as possible, so that statisticians who have never worked with astronomy
before might also contemplate its wonders and complexity.
We very much hope to see, in the coming decades, the recognition of astronomical data
as a major driver for the development of new statistical research.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 19:49:11, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.011
11 The Future of Astrostatistics
Astrostatistics has only recently become a fully fledged scientific discipline. With the cre-
ation of the International Astrostatistics Association, the Astroinformatics & Astrostatistics
Portal (ASAIP), and the IAU Commission on Astroinformatics and Astrostatistics, the
discipline has mushroomed in interest and visibility in less than a decade.
With respect to the future, though, we believe that the above three organizations will
collaborate on how best to provide astronomers with tutorials and other support for learning
about the most up-to-date statistical methods appropriate for analyzing astrophysical data.
But it is also vital to incorporate trained statisticians into astronomical studies. Even though
some astrophysicists will become experts in statistical modeling, we cannot expect most
astronomers to gain this expertise. Access to statisticians who are competent to engage in
serious astrophysical research will be needed.
The future of astrostatistics will be greatly enhanced by the promotion of degree pro-
grams in astrostatistics at major universities throughout the world. At this writing there
are no MS or PhD programs in astrostatistics at any university. Degree programs in
astrostatistics can be developed with the dual efforts of departments of statistics and
astronomy–astrophysics. There are several universities that are close to developing such
a degree, and we fully expect that PhD programs in astrostatistics will be common in 20
years from now. They would provide all the training in astrophysics now given in graduate
programs but would also add courses and training at the MS level or above in statistical
analysis and in modeling in particular.
We expect that Bayesian methods will be the predominant statistical approach to the
analysis of astrophysical data in the future. As computing speed and memory become
greater, it is likely that new statistical methods will be developed to take advantage of
the new technology. We believe that these enhancements will remain in the Bayesian tra-
dition, but modeling will become much more efficient and reliable. We expect to see more
non-parametric modeling taking place, as well as advances in spatial statistics – both two-
and three-dimensional methods.
Finally, astronomy is a data-driven science, currently being flooded by an unprecedented
amount of data, a trend expected to increase considerably in the next decade. Hence, it
is imperative to develop new paradigms of data exploration and statistical analysis. This
cross-disciplinary approach is the key to guiding astronomy on its continuous mission
to seek the next great discovery, observing unexplored regions of the cosmos and both
witnessing and understanding things no human being has dreamed before.
364
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.012
365 Further Reading
Further Reading
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:37:37, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.012
Appendix A Bayesian Modeling using INLA
> source("http://www.math.ntnu.no/inla/givemeINLA.R")
> # Or, if installed:
library(INLA)
> names(inla.models()$likelihood)
Now create the negative binomial synthetic data used in Chapter 6. We can request that the
DIC statistic be displayed using the control.compute option:
366
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:03:56, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.013
367 Appendix A Bayesian Modeling using INLA
library(MASS)
set.seed(141)
nobs <- 2500
x1 <- rbinom(nobs,size = 1, prob = 0.6)
x2 <- runif(nobs)
xb <- 1 + 2.0*x1 - 1.5*x2
a <- 3.3
theta <- 0.303 # 1/a
exb <- exp(xb)
nby <- rnegbin(n = nobs, mu = exb, theta = theta)
negbml <-data.frame(nby, x1, x2)
summary(NB)
Time used:
Pre-processing Running inla Post-processing Total
0.4185 1.1192 0.1232 1.6609
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) 0.9865 0.0902 0.8111 0.9859 1.1650 0.9847 0
x1 2.0404 0.0808 1.8814 2.0405 2.1986 2.0408 0
x2 -1.6137 0.1374 -1.8837 -1.6137 -1.3442 -1.6136 0
Model hyperparameters:
mean sd 0.025quant
size for the nbinomial observations (overdispersion) 0.2956 0.0099 0.2767
The parameter values are close to the values we set in the synthetic data, including the
dispersion statistic of 3.3. In addition, the time taken for producing the Bayesian posteriors
and related statistics is displayed as a total of 1.66 seconds. The inla function itself took
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:03:56, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.013
368 Appendix A Bayesian Modeling Using INLA
only 1.12 seconds. This compares with some 5 minutes for executing the JAGS function
on the same data. Generalized additive models and spatial analysis models may be called
by extending the above code. The INLA reference manual provides directions.
Unfortunately, giving informative priors to parameters is not as simple as in JAGS or
Stan. Also, some inla functions do not have the same capabilities as the models we give in
Chapters 6 and 7. For instance, the generalized Poisson dispersion, delta, is parameterized
in such a way that negative values cannot be estimated or displayed for it. Therefore, the
dispersion parameter displayed for underdispersed data is incorrect. We explain why this
is the case in Section 6.3. In addition, the inla functions for the Bayesian zero-inflated
Poisson and negative binomial models display only the binary component intercept value.
The count component is displayed properly but the binary component is deficient. It is
promised in future versions that the zero-inflated model limitation will be fixed, but no
mention has been made of correcting the Bayesian generalized Poisson.
Finally, researchers using the inla function are limited to models which are pre-
programmed into the software. There are quite a few such models, but statistically
advanced astronomers will generally want more control over the models they develop.
Nevertheless, INLA provides researchers with the framework for extending Bayesian mod-
eling to spatial, spatial-temporal, GAM, and other types of analysis. Built-in inla functions
such as the generalized extreme value (gev) distribution can be of particular importance
to astronomers. Bayesian survival or failure analysis functions are also available. How-
ever, astrostatisticians should be aware of the limitations of the inla function prior to
committing their research data to it.
Further Reading
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 17:03:56, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.013
Appendix B Count Models with Offsets
Offsets for count models are used to adjust the counts for their being collected or generated
over different periods of time or from different areas. More counts of some astrophysical
event may occur in larger areas than in smaller areas, or over longer periods than over
shorter periods.
An offset is added to the linear predictor of a model and is constrained to have a
coefficient equal to unity. It is not a parameter to be estimated but is given in the data
as an adjustment factor. In the synthetic data below, which generates a Poisson model
with an offset, we can see how the offset enters the model. Remember that, since the
Poisson model has a log link, the offset must be logged when added to the linear
predictor.
When count models are run with offsets, it is typical to refer to them as rate models,
meaning the rate at which counts are occurring in different areas or over different periods
of time. For the models in this text, offsets can be employed for the Poisson, negative
binomial, generalized Poisson, and NB-P models.
Code B.1 Data for Poisson with offset.
==================================================
library(MASS)
x1 <- runif(5000)
x2 <- runif(5000)
m <- rep(1:5, each=1000, times=1)*100 # creates offset as defined
logm <- log(m) # log the offset
xb <- 2 + .75*x1 -1.25*x2 + logm # linear predictor w offset
exb <- exp(xb)
py <- rpois(5000, exb)
pdata <- data.frame(py, x1, x2, m)
==================================================
The offset, m, is the same for each group of counts in the data – five groups of 1000 each.
> table(m)
m
100 200 300 400 500
1000 1000 1000 1000 1000
369
Downloaded from https://www.cambridge.org/core. University of Toronto, on 05 Aug 2017 at 20:31:12, subject to the Cambridge Core terms of use, available at
https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.014
370 Appendix B Count Models with Offsets
K = K,
m = pdata$m) # list offset
sink("PRATE.txt")
cat("
model{
# Diffuse normal priors betas
for (i in 1:K) { beta[i] ˜ dnorm(0, 0.0001)}
# Likelihood
for (i in 1:N){
Y[i] ˜ dpois(mu[i])
log(mu[i]) <- inprod(beta[], X[i,]) + log(m[i]) # offset added
}
}
",fill = TRUE)
sink()
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.0001451 0.0008601 2325.6 <2e-16
x1 0.7498218 0.0011367 659.7 <2e-16
x2 -1.2491636 0.0011594 -1077.4 <2e-16
Downloaded from https://www.cambridge.org/core. University of Toronto, on 05 Aug 2017 at 20:31:12, subject to the Cambridge Core terms of use, available at
https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.014
371 Appendix B Count Models with Offsets
Frequently, count data is structured in grouped format. In the ratep data below, the
counts y are adjusted for being generated or counted from areas m of different sizes, or
being recorded over different periods of time, also m. This is generic code that can be used
for spatial or temporal adjustment.
We next show specific data that could have come from a table of information. Consider
the table below with variables x1, x2, and x3. We have y counts from m observations, either
area sizes or time periods; y is the count variable and m the offset.
1 x3 0
x2 1 2 3 1 2 3
------------------------------------------------------------
1 6/45 9/39 17/29 11/54 13/47 21/44
x1
0 8/36 15/62 7/66 10/57 19/55 12/48
To put the information from the table into a form suitable for modeling, we may create the
following variables and values:
Grouped data
==================================================
y <- c(6,11,9,13,17,21,8,10,15,19,7,12)
m <- c(45,54,39,47,29,44,36,57,62,55,66,48)
x1 <- c(1,1,1,1,1,1,0,0,0,0,0,0)
x2 <- c(1,1,0,0,1,1,0,0,1,1,0,0)
x3 <- c(1,0,1,0,1,0,1,0,1,0,1,0)
ratep <-data.frame(y,m,x1,x2,x3)
==================================================
With a minor amendment to the above JAGS code we can model the tabular data as
follows.
sink("PRATE.txt")
cat("
model{
# Diffuse normal priors betas
for (i in 1:K) { beta[i] ˜ dnorm(0, 0.0001)}
Downloaded from https://www.cambridge.org/core. University of Toronto, on 05 Aug 2017 at 20:31:12, subject to the Cambridge Core terms of use, available at
https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.014
372 Appendix B Count Models with Offsets
# Likelihood
for (i in 1:N){
Y[i] ˜ dpois(mu[i])
log(mu[i]) <- inprod(beta[], X[i,]) + log(m[i])
}
}
",fill = TRUE)
sink()
Coefficients:
The results are close to the Bayesian model with diffuse priors. Remember that repeated
sampling will produce slightly different results, but they will generally be quite close to
what is displayed above.
Downloaded from https://www.cambridge.org/core. University of Toronto, on 05 Aug 2017 at 20:31:12, subject to the Cambridge Core terms of use, available at
https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.014
373 Appendix B Count Models with Offsets
The indirect dispersion parameter is 20.77. To calculate the direct dispersion we invert
this value. The directly parameterized Bayesian negative binomial model dispersion is
therefore
Downloaded from https://www.cambridge.org/core. University of Toronto, on 05 Aug 2017 at 20:31:12, subject to the Cambridge Core terms of use, available at
https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.014
374 Appendix B Count Models with Offsets
sink("NBOFF.txt")
cat("
model{
# Priors for betas
for (i in 1:K) { beta[i] ˜ dnorm(0, 0.01)}
# Prior for alpha
alpha ˜ dgamma(0.01, 0.01)
# Likelihood function
for (i in 1:N){
Y[i] ˜ dnegbin(p[i], 1/alpha)
p[i] <- 1.0/(1.0 + alpha*mu[i])
log(mu[i]) <- inprod(beta[], X[i,])+log(m[i])
}
}
",fill = TRUE)
sink()
The posterior means are all as expected given the synthetic data and the fact that we
have used diffuse priors. Note that the dispersion parameter is estimated as 0.049. This
is the value we expected as well. We may convert back to the indirect dispersion value by
changing the term θ to α and amending the lines
Downloaded from https://www.cambridge.org/core. University of Toronto, on 05 Aug 2017 at 20:31:12, subject to the Cambridge Core terms of use, available at
https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.014
375 Appendix B Count Models with Offsets
to
Y[i] ˜ dnegbin(p[i], theta)
p[i] <- theta / (theta + mu[i])
This produces the code needed for the indirect parameterization. The code in full is
provided below:
Code B.6 Indirect parameterization.
==================================================
require(R2jags)
X <- model.matrix(˜ x1 + x2 )
K <- ncol(X)
sink("NBOFF.txt")
cat("
model{
# Priors for betas
for (i in 1:K) { beta[i] ˜ dnorm(0, 0.01)}
# Prior for theta
theta ˜ dgamma(0.01, 0.01)
# Likelihood function
for (i in 1:N){
Y[i] ˜ dnegbin(p[i], theta)
p[i] <- theta / (theta + mu[i])
log(mu[i]) <- inprod(beta[], X[i,])+log(m[i])
}
}
",fill = TRUE)
sink()
Downloaded from https://www.cambridge.org/core. University of Toronto, on 05 Aug 2017 at 20:31:12, subject to the Cambridge Core terms of use, available at
https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.014
376 Appendix B Count Models with Offsets
The models fit the synthetic data properly. The same logic as employed for the JAGS
Bayesian Poisson and negative binomial models can be used for the generalized Poisson,
NB-P, and similar count models.
Finally, the above negative binomial models work with the grouped data, ratep. Amend-
ing the indirect negative binomial code for use with the ratep data, the results are given
below. They are close to the values produced using R’s glm.nb function.
Downloaded from https://www.cambridge.org/core. University of Toronto, on 05 Aug 2017 at 20:31:12, subject to the Cambridge Core terms of use, available at
https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.014
Appendix C Predicted Values, Residuals, and
Diagnostics
We did not address the matter of predicted or fitted values at length in the text. We assume
that the reader is familiar with these basic statistical procedures. There are several hints
that may be given for calculating these statistics using JAGS. Fitted values and diagnostics
may be calculated within the main JAGS model code, or they may be calculated follow-
ing posterior parameter estimation. We prefer to follow the recommendation of Zuur et al.
(2013) and calculate model diagnostics within the model code. However, the actual diag-
nostics are displayed after the model estimation. The code below comes from Code 6.4 in
the text. The data are assumed to be generated from Code 6.2. The pois data from Code
6.2 and predictors x1 and x2 are supplied to the X matrix at the beginning of the JAGS code
below. New diagnostic code is provided in the modules labeled “Model mean, variance,
Pearson residuals”, “Simulated Poisson statistics”, and “Poisson log-likelihood”. The lines
with objects L, AIC, PS, and PSsim are also added as diagnostic code. In addition, since
we wish to use the calculated statistics – which are in fact the means of statistics obtained
from simulated posterior values – to evaluate the model, they must be saved in the params
object. Therefore, instead of only saving “beta”, which gives the posterior means of the
predictors (these are analogous to frequentist coefficients or slopes), we add fitted, residual,
and other statistics. The names and explanations of the new diagnostic values are provided
in the code.
Code C.1 JAGS Bayesian Poisson model with diagnostic code.
==================================================
require(R2jags)
X <- model.matrix(˜ x1 + x2, data = pois)
K <- ncol(X)
model.data <- list(Y = pois$py,
X = X,
K = K,
N = nrow(pois))
sink("Poi.txt")
cat("
model{
for (i in 1:K) {beta[i] ˜ dnorm(0, 0.0001)}
for (i in 1:N) {
Y[i] ˜ dpois(mu[i])
log(mu[i]) <- inprod(beta[], X[i,])
}
377
Downloaded from https://www.cambridge.org/core. University of Toronto, on 05 Aug 2017 at 20:29:48, subject to the Cambridge Core terms of use, available at
https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.015
378 Appendix C Predicted Values, Residuals, and Diagnostics
# Poisson log-likelihood
for (i in 1:N) {
ll[i] <- Y[i] * log(mu[i]) - mu[i] - loggam(Y[i] +1)
}
L <- sum(ll[1:N]) # log-likelihood
AIC <- -2 * sum(ll[1:N]) + 2 * K # AIC statistic
PS <- sum(D[1:N]) # Model Pearson statistic
PSsim <- sum(DNew[1:N]) # Simulated Pearson statistic
}",fill = TRUE)
sink()
params <- c("beta", "ExpY", "E", "PS", "PSsim", "YNew", "L", "AIC")
We next load the source file we provide on the book’s web site. It is based in part on
code from Zuur et al. (2013). The file contains several functions that allow us to develop
nicer-looking output as well as trace and distribution plots for each parameter specified:
If we ran the print() code at the end of the above model code, the residuals and fitted
value for each observation in the model would be displayed. In order to avoid this, and to
obtain only the basic information, we can use the code below.
Downloaded from https://www.cambridge.org/core. University of Toronto, on 05 Aug 2017 at 20:29:48, subject to the Cambridge Core terms of use, available at
https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.015
379 Appendix C Predicted Values, Residuals, and Diagnostics
The following code provides a test of the specification of the model as Poisson. The test
can be used for any model. If the model fits well we expect that the summary parameters of
the model will be similar to the summary data from the simulated Poisson model. Ideally,
the mean number of times that the summary values of the model are greater or less than
the summary values of the simulated data should be about equal. Therefore, we look for a
value of about 0.5, which is at times referred to as a Bayesian model p-value. Values close
to 0 or 1 indicated a poorly specified model; i.e., the model is not truly Poisson.
> out <- POI$BUGSoutput
> mean(out$sims.list$PS > out$sims.list$PSsim)
[1] 0.9561667
The following code provides a Bayesian dispersion statistic for the Poisson model. Val-
ues over 1.0 indicate likely overdisperson. Values under 1.0 indicate likely underdispersion.
> E <- out$mean$E # average iterations per observation
> N <- nrow(pois) # observations in model
> p <- K # model betas
> sum(E^2)/(N-p) # Bayesian dispersion statistic
[1] 1.444726
The data appears to be mis-specified as a Poisson model. Our Bayesian p-value is far
too high and the dispersion statistic is considerably greater than 1.0. The two statistics are
consistent. We can obtain values for the log-likelihood and AIC statistic, assigning them
to another value using the code below. Note that the values are identical to the displayed
table output above.
> mean(out$sims.list$L)
[1] -1657.978
> mean(out$sims.list$AIC)
[1] 3321.957
Finally, trace plots and distribution curves (histograms) of each posterior parameter can
be obtained using the code below. The graphics are not displayed here.
> vars <- c("beta[1]", "beta[2]", "beta[3]")
> MyBUGSChains(POI$BUGSoutput, vars)
> MyBUGSHist(POI$BUGSoutput, vars)
Other statistics and plots can be developed from the diagnostic code calculated in the Pois-
son code above. Plots of Pearson residuals versus the fitted values are particularly useful
when checking for a fit. Several more advanced figures have been displayed in the text for
specific models. The code for these figures can be obtained from the book’s web site.
Downloaded from https://www.cambridge.org/core. University of Toronto, on 05 Aug 2017 at 20:29:48, subject to the Cambridge Core terms of use, available at
https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.015
References
Books
Andreon, S. and B. Weaver (2015). Bayesian Methods for the Physical Sciences: Learning
from Examples in Astronomy and Physics. Springer Series in Astrostatistics. Springer.
Chattopadhyay, A. K. and T. Chattopadhyay (2014). Statistical Methods for Astronomical
Data Analysis. Springer Series in Astrostatistics. Springer.
Cowles, M. K. (2013). Applied Bayesian Statistics: With R and OpenBUGS Examples.
Springer Texts in Statistics. Springer.
Dodelson, S. (2003). Modern Cosmology. Academic Press.
Feigelson, E. D. and G. J. Babu (2012a). Modern Statistical Methods for Astronomy: With
R Applications. Cambridge University Press.
Feigelson, E. D. and G. J. Babu (2012b). Statistical Challenges in Modern Astronomy V.
Lecture Notes in Statistics. Springer.
Finch, W. H., J. E. Bolin, and K. Kelley (2014). Multilevel Modeling Using R. Chapman &
Hall/CRC Statistics in the Social and Behavioral Sciences. Taylor & Francis.
Gamerman, D. and H. F. Lopes (2006). Markov Chain Monte Carlo: Stochastic Simula-
tion for Bayesian Inference, Second Edition. Chapman & Hall/CRC Texts in Statistical
Science. Taylor & Francis.
Gelman, A., J. Carlin, H. Stern, D. Dunson, A. Vehtari, and D. Rubin (2013). Bayesian
Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor
& Francis.
Hardin, J. W. and J. M. Hilbe (2012). Generalized Linear Models and Extensions, Third
Edition. Taylor & Francis.
Hilbe, J. M. (2011). Negative Binomial Regression, Second Edition. Cambridge University
Press.
Hilbe, J. M. (2014). Modeling Count Data. Cambridge University Press.
Hilbe, J. M. (2015). Practical Guide to Logistic Regression. Taylor & Francis.
Hilbe, J. M. and A. P. Robinson (2013). Methods of Statistical Model Estimation. EBL-
Schweitzer. CRC Press.
Ivezić, Z., A. J. Connolly, J. T. Vanderplas, and A. Gray (2014). Statistics, Data Mining,
and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of
Survey Data. EBSCO ebook academic collection. Princeton University Press.
Jain, P. (2016). An Introduction to Astronomy and Astrophysics. CRC Press.
Korner-Nievergelt, F. et al. (2015). Bayesian Data Analysis in Ecology Using Linear
Models with R, BUGS, and Stan. Elsevier Science.
380
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:43:19, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.016
381 References
Articles
Abazajian, K. N. et al. (2009). “The seventh data release of the Sloan Digital Sky
Survey.” Astrophys. J. Suppl. 182, 543–558. DOI: 10.1088/0067-0049/182/2/543.
arXiv:0812.0649.
Aeschbacher, S. et al. (2012). “A novel approach for choosing summary statis-
tics in approximate Bayesian computation.” Genetics 192(3), 1027–1047. DOI:
10.1534/genetics.112.143164.
Akaike, H. (1974). “A new look at the statistical model identification.” IEEE Trans.
Automatic Control 19(6), 716–723.
Akeret, J. et al. (2015). “Approximate Bayesian computation for forward
modeling in cosmology.” J. Cosmology Astroparticle Phys. 8, 043. DOI:
10.1088/1475-7516/2015/08/043. arXiv:1504.07245.
Alam, S. et al. (2015). “The eleventh and twelfth data releases of the Sloan Dig-
ital Sky Survey: final data from SDSS-III.” Astrophys. J. Suppl. 219, 12. DOI:
10.1088/0067-0049/219/1/12. arXiv: 1501.00963 [astro-ph.IM].
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:43:19, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.016
382 References
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:43:19, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.016
383 References
Chabrier, G. (2003). “Galactic stellar and substellar initial mass function.” Publ. Astronom.
Soc. Pacific 115, 763–795. DOI: 10.1086/376392. eprint:arXiv:astro-ph/0304382.
Chattopadhyay, G. and S. Chattopadhyay (2012). “Monthly sunspot number time series
analysis and its modeling through autoregressive artificial neural network.” Europ.
Physical J. Plus 127, 43. DOI: 10.1140/epjp/i2012-12043-9. arXiv: 1204.3991
[physics.gen-ph].
Conley, A. et al. (2011). “Supernova constraints and systematic uncertainties from the
first three years of the Supernova Legacy Survey.” Astrophys. J. Suppl. 192, 1. DOI:
10.1088/0067-0049/192/1/1. arXiv: 1104.1443[astro-ph.CO].
de Souza, R. S. et al. (2014). “Robust PCA and MIC statistics of baryons in early mini-
haloes.” Mon. Not. Roy. Astronom. Soc. 440, 240–248. DOI: 10.1093/mnras/stu274.
arXiv: 1308.6009[astro-ph.co].
de Souza, R. S. et al. (2015a). “The overlooked potential of generalized linear mod-
els in astronomy – I: Binomial regression.” Astron. Comput. 12, 21–32. ISSN:
2213-1337. DOI: http://dx.doi.org/10.1016/j.ascom.2015.04.002. URL: www.
sciencedirect.com/science/article/pii/S2213133715000360.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:43:19, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.016
384 References
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:43:19, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.016
385 References
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:43:19, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.016
386 References
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:43:19, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.016
387 References
Lin, C.-A. and M. Kilbinger (2015). “A new model to predict weak-lensing peak counts II.
Parameter constraint strategies.” arXiv: 1506.01076.
Lintott, C. J. et al. (2008). “Galaxy Zoo: morphologies derived from visual inspection
of galaxies from the Sloan Digital Sky Survey.” Mon. Not. Roy. Astronom. Soc. 389,
1179–1189. DOI: 10.1111/j.1365-2966.2008.13689.x. arXiv: 0804.4483.
Lynden-Bell, D. (1969). “Galactic nuclei as collapsed old quasars.” Nature 223, 690–694.
DOI : 10.1038/223690a0.
Ma, C. et al. (2016). “Application of Bayesian graphs to SN Ia data analysis and
compression.” Mon. Not. Roy. Atronom. Soc. (preprint). arXiv: 1603.08519.
Macciò, A. V. et al. (2007). “Concentration, spin and shape of dark matter haloes: scatter
and the dependence on mass and environment.” Mon. Not. Roy. Astronom. Soc. 378,
55–71. DOI: 10.1111/j.1365-2966.2007.11720.x. eprint: arXiv: astro-ph/0608157.
Machida, M. N. et al. (2008). “Formation scenario for wide and close binary systems.”
Astrophys. J. 677, 327–347. DOI: 10.1086/529133. arXiv: 0709.2739.
Mahajan, S. and S. Raychaudhury (2009). “Red star forming and blue passive
galaxies in clusters.” Mon. Not. Roy. Astronom. Soc. 400, 687–698. DOI: 10.111
1/j.1365-2966.2009.15512.x. arXiv: 0908.2434.
Maio, U. et al. (2010). “The transition from population III to population II-I star
formation.” Mon. Not. Roy. Astronom. Soc. 407, 1003–1015. DOI: 10.1111/j.
1365-2966.2010.17003.x. arXiv: 1003.4992 [astro-ph.CO].
Maio, U. et al. (2011). “The interplay between chemical and mechanical feedback from
the first generation of stars.” Mon. Not. Roy. Astronom. Soc. 414, 1145–1157. DOI:
10.1111/j.1365- 2966.2011.18455.x. arXiv: 1011.3999[astro-ph.CO].
Mandel, K. S. et al. (2011). “Type Ia supernova light curve inference: hierar-
chical models in the optical and near-infrared.” Astrophys. J. 731, 120. DOI:
10.1088/0004-637X/731/2/120. arXiv: 1011.5910.
Maoz, D. et al. (2014). “Observational clues to the progenitors of type Ia supernovae.” Ann.
Rev. Astron. Astrophys. 52(1), 107–170. DOI: 10.1146/annurev-astro-082812-141031.
Marley, J. and M. Wand (2010). “Non-standard semiparametric regression via BRugs.” J.
Statist. Software 37(1), 1–30. DOI: 10.18637/jss.v037.i05.
Masters, K. L. et al. (2010). “Galaxy Zoo: passive red spirals.” Mon. Not. Roy. Astronom.
Soc. 405, 783–799. DOI: 10.1111/j.1365-2966.2010.16503.x. arXiv: 0910.4113.
McCullagh, P. (2002). “What is a statistical model?” Ann. Statist. 30(5), 1225–1310. DOI:
10.1214/aos/1035844977.
Merritt, D. (2000). “Black holes and galaxy evolution.” Dynamics of Galaxies: from
the Early Universe to the Present, eds. F. Combes, G. A. Mamon, and V. Charman-
daris Vol. 197. Astronomical Society of the Pacific Conference Series, p. 221. eprint:
astro-ph/9910546.
Merritt, D. and L. Ferrarese (2001). “Black hole demographics from the M–σ rela-
tion.” Mon. Not. Roy. Astronom. Soc. 320, L30–L34. DOI: 10.1046/j.1365-871
1.2001.04165.x. eprint: astro-ph/0009076.
Metropolis, N. and S. Ulam (1949). “The Monte Carlo method.” J. Amer. Statist. Assoc.
44(247), 335–341. www.jstor.org/stable/2280232.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:43:19, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.016
388 References
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:43:19, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.016
389 References
Robin, A. C. et al. (2014). “Constraining the thick disc formation scenario of the Milky
Way.” Astron. Astrophys. 569. arXiv: 1406.5384.
Rubin, D. B. (1984). “Bayesianly justifiable and relevant frequency calculations for the
applied statistician.” Ann. Statist. 12(4), 1151–1172. www.jstor.org/stable/2240995.
Rubin, D. et al. (2015). “UNITY: Confronting supernova cosmology’s statistical and sys-
tematic uncertainties in a unified Bayesian framework.” Astrophys. J. 813, 137. DOI:
10.1088/0004-637X/813/2/137. arXiv: 1507.01602.
Rucinski, S. M. (2004). “Contact binary stars of theW UMa-type as distance trac-
ers.” New Astron. Rev. 48, 703–709. DOI: 10.1016/j.newar.2004.03.005. eprint:
astro-ph/0311085.
Rue, H. et al. (2009). “Approximate Bayesian inference for latent Gaussian models by
using integrated nested Laplace approximations.” J. Royal Statist. Soc. Series B 71(2),
319–392. DOI: 10.1111/j.1467-9868.2008.00700.x.
Sako, M. et al. (2014). “The data release of the Sloan Digital Sky Survey – II Supernova
Survey.” arXiv: 1401.3317 [astro-ph.CO].
Salpeter, E. E. (1955). “The luminosity function and stellar evolution.” Astrophys. J. 121,
161. DOI: 10.1086/145971.
Sana, H. et al. (2012). “Binary interaction dominates the evolution of massive stars.”
Science 337, 444. DOI: 10.1126/science.1223344. arXiv:1207.6397 [astro-ph.SR].
Schafer, C. M. and P. E. Freeman (2012). “Likelihood-free inference in cosmology: poten-
tial for the estimation of luminosity.” Statistical Challenges in Modern Astronomy V,
eds. E. D. Feigelson and B. G. Jogesh, pp. 3–19. Springer.
Schawinski, K. et al. (2007). “Observational evidence for AGN feedback in early-
type galaxies.” Mon. Not. Roy. Astronom. Soc. 382, 1415–1431. DOI: 10.1111/j.
1365-2966.2007.12487.x. arXiv: 0709.3015.
Schwarz, G. (1978). “Estimating the dimension of a model.” Ann. Statist. 6(2), 461–464.
Shariff, H. et al. (2015). “BAHAMAS: new SNIa analysis reveals inconsistencies with
standard cosmology.” arXiv: 1510.05954.
Shimizu, T. T. and R. F. Mushotzky (2013). “The first hard X-ray power spec-
tral density functions of active galactic nucleus.” Astrophys. J. 770, 60. DOI:
10.1088/0004-637X/770/1/60. arXiv: 1304.7002 [astro-ph.HE].
Snyder, G. F. et al. (2011). “Relation between globular clusters and supermassive black
holes in ellipticals as a manifestation of the black hole fundamental plane.” Astrophys.
J. 728, L24. DOI: 10.1088/2041-8205/728/1/L24. arXiv: 1101.1299 [astro-ph.CO].
Somerville, R. S. et al. (2008). “A semi-analytic model for the co-evolution of galaxies,
black holes and active galactic nuclei.” Mon. Not. Roy. Astronom. Soc. 391, 481–506.
DOI : 10.1111/j.1365-2966.2008.13805.x. arXiv: 0808.1227.
Spiegelhalter, D. J. et al. (2002). “Bayesian measures of model complexity and fit.”
J. Royal Statist. Soc., Series B 64(4), 583–639. DOI: 10.1111/1467-9868.00353.
Stan (2016). “Prior choice recommendations.” https://github.com/stan-dev/
stan/wiki/Prior-Choice-Recommendations (visited on 06/27/2016).
Sunyaev, R. A. and Y. B. Zeldovich (1972). “The observations of relic radiation as a test
of the nature of X-ray radiation from the clusters of galaxies.” Comm. Astrophys. Space
Phys. 4, 173.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:43:19, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.016
390 References
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:43:19, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.016
40
0
y
−40
−10 −5 0 5 10
x
tFigure 4.5 Visualization of different elements of a Bayesian normal linear model that includes errors in variables. The dots and
corresponding error bars represent the data. The dashed line and surrounding shaded bands show the mean, 50%
(darker), and 95% (lighter) credible intervals when the errors are ignored. The dotted line and surrounding shaded
bands show the mean, 50% (darker), and 95% (lighter) credible intervals obtained when the errors are taken into
account (note that these shaded bands are very narrow). The solid line (i.e., the line with the steepest slope) shows
the fiducial model used to generate the data.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:32:14, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.017
1.00
0.75
y 0.50
0.25
0.00
tFigure 5.18 Visualization of the synthetic data from the Bayesian logistic model. The dashed and dotted lines and respective
shaded areas show the fitted and 95% probability intervals. The dots in the upper horizontal border correspond to
observed successes for each binary predictor and those in the lower horizontal border correspond to observed
failures. The dots with error bars denote the fraction of successes, in bins of 0.05.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:32:14, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.017
Poisson–logit hurdle PDF
0
x1
x2
x3 x
0
x1
x2
x3
x
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:32:14, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.017
10
11
12
13
14
15
1
9
1
1 1 0.63 0.58 0.61 0.59 0.6 0.59 0.62 0.57 0.6 0.58 0.6 0.65 0.58 0.58
2 1 0.6 0.64 0.58 0.64 0.63 0.62 0.63 0.64 0.62 0.65 0.59 0.59 0.59 0.8
3 1 0.6 0.61 0.6 0.6 0.64 0.6 0.59 0.61 0.59 0.61 0.6 0.59
0.6
4 1 0.6 0.63 0.63 0.67 0.62 0.62 0.65 0.61 0.64 0.6 0.64
5 1 0.58 0.59 0.64 0.62 0.63 0.64 0.6 0.65 0.58 0.58 0.4
14 1 0.63 −0.8
15
−1
tFigure 9.1 Correlation matrix for the synthetic multivariate normal data set.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:32:14, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.017
11
10
9
log (M M )
5
−0.4 −0.2 0.0 0.2 0.4
log (σ/σ0)
tFigure 10.1 Supermassive black hole mass as a function of bulge velocity dispersion described by a Gaussian model with errors
in measurements. The dashed line represents the mean and the shaded areas represent the 50% (darker) and 95%
(lighter) prediction intervals. The dots and associated error bars denote the observed values and measurement
errors respectively.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:32:14, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.017
Phot-Ia Spec-Ia
1.0
0.5
μ SN − μ z
0.0
−0.5
−1.0
8 9 10 11 12 13
log(M M )
1.5
I
Phot-Ia
JAGS posterior intercept
1.0
0.5
Spec-Ia
-
0.0
tFigure 10.3 Upper panel: Hubble residuals (HR = μSN − μz ) as a function of the host galaxy mass (log(M/M )) for the
PM sample from Wolf et al. (2016). The shaded areas represent 50% (darker) and 95% (lighter) prediction intervals
for the spectroscopic (lower) and photometric (upper) samples. Lower panel: Contour intervals showing the 68%
(darker) and 95% (lighter) credible intervals of the Spec-Ia and Phot-Ia JAGS posterior distributions for the HRmass
relation.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:32:14, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.017
tFigure 10.4 Artist’s impression of VFTS 352, the most massive and earliest spectral type genuinely-contact binary system
known to date (Almeida et al., 2015). Image credits: ESO /L. Calçada.
−1
MV
−2
−3
−4
−0.2
0
−0.3 I)0
−0.2 −0.1 −
0 0.2 (V
0.1
log(P)
tFigure 10.5 Period–luminosity–color relation obtained for the sample of 64 near early-type binary systems from the Large
Magellanic Cloud (Pawlak, 2016). The sample is color-coded for genuinely-contact (upper) and near-contact
(lower) binaries.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:32:14, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.017
1.00
0.75
fgas 0.50
0.25
0.00
7 8 9 10 11
log (M* /M )
tFigure 10.7 Baryon fraction in atomic gas as a function of stellar mass. The dashed line shows the mean posterior fraction and
the shaded regions denote the 50% (darker) and 95% (lighter) prediction intervals. The dots are the data points for
isolated low mass galaxies shown in Bradford et al. (2015, Figure 4, left-hand panel).
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:32:14, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.017
tFigure 10.9 Globular cluster NGC 6388. Image credits: ESO, F. Ferraro (University of Bologna).
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:32:14, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.017
E Irr S S0
105
104
103
NGC
102
101
t
Figure 10.11 Globular cluster population and host galaxy visual magnitude data (points) superimposed on the results from
Poisson regression. The dashed line shows the model mean and the shaded regions mark the 50% (darker) and
95% (lighter) prediction intervals.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:32:14, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.017
E Irr S S0
105
104
103
NGC
102
101
t
Figure 10.13 Globular cluster population and host galaxy visual magnitude data (points) superimposed on the results from
negative binomial regression. The dashed line shows the model mean. The shaded regions mark the 50% (darker)
and 95% (lighter) prediction intervals.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:32:14, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.017
1.00
0.75
fSeyfert
0.50
0.25
0.00
0 2 4 6
r/r200
1.00
0.75
fSeyfert
0.50
0.25
0.00
0 2 4 6
r/r200
t
Figure 10.16 Two-dimensional representation of the six-dimensional parameter space describing the dependence of Seyfert
AGN activity as a function of r/r200 and log M200 , for clusters with an average log M200 ≈ 1014 M : upper panel,
spirals; lower panel, elliptical galaxies. In each panel the lines (dashed or dotted) represent the posterior mean
probability of Seyfert AGN activity for each value of r/r200 , while the shaded areas depict the 50% and 95%
credible intervals. The data points with error bars represent the data when binned, for purely illustrative purposes.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:32:14, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.017
105
104
103
M* (Msunh–1)
102
10
2.5
106 107
Mdm (Msunh–1)
t
Figure 10.19 Fitted values for the dependence between stellar mass M and dark-matter halo mass Mdm resulting from the
lognormal–logit hurdle model. The dashed line represents the posterior mean stellar mass for each value of Mdm ,
while the shaded areas depict the 50% (darker) and 95% (lighter) credible intervals around the mean; Msun is the
mass of the Sun and h is the dimensionless Hubble parameter.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:32:14, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.017
105
104
–1 ) 103
102
h
n 107
su
(M
*
10
M
2.5 106 h–1 )
0 (M sun
M dm
105
104
103
–1 )
102
h
n
107
su
(M
*
10
M
2.5 h–1 )
106 (M sun
0 M dm
1.00
0.75
0.50
107
P*
0.25
h–1 )
106 (M sun
0.00 M dm
104
103
–1 )
102
h
107
n
su
(M
10
*
M
h–1 )
2.5 106 (M sun
0 M dm
t
Figure 10.20 Illustration of the different layers in the lognormal–logit hurdle model. From bottom to top: the original data; the
Bernoulli fit representing the presence or absence of stars in the halo; the lognormal fit of the positive part (i.e.,
ignoring the zeros in M ); and the full model.
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:32:14, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.017
t
Figure 10.23 Joint posterior distributions over the dark-matter energy density w and equation of state parameter w obtained
from a Bayesian Gaussian model applied to the JLA sample. Left: the results without taking into account errors in
measurements. Right: the results taking into account measurement errors in color, stretch, and observed
magnitude.
1.2
1.0
0.8
distance1
0.6
0.4
0.2
0.0
−0.2
−3 −2 −1 0 1 2 3 4 5
mean
1.2
1.0
0.8
distance1
0.6
0.4
0.2
0.0
−0.2
−1 0 1 2 3 4 5 6
std
t
Figure 10.24 Distance function ρ from Equation 10.29 as a function of the parameters mean (upper panel) and std (lower
panel).
Downloaded from https://www.cambridge.org/core. Columbia University Libraries, on 05 Aug 2017 at 20:32:14, subject to the Cambridge Core terms of use, available
at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781316459515.017
Particle System t = 0 Particle System t = 2
5 5
4 4
3 3
std
std
2 2
1 1
−2 −1 0 1 2 3 4 −2 −1 0 1 2 3 4
mean mean
0.18 0.25 0.6
0.24 0.5
0.20
0.20 0.4
density
0.14
density
density
density
0.15
0.16 0.3
0.10 0.2
0.10
0.12 0.05 0.1
0.06 0.08 0.00 0.0
−2 −1 0 1 2 3 4 1 2 3 4 5 −2 −1 0 1 2 3 4 1 2 3 4 5
mean std mean std
4 4
3 3
std
std
2 2
1 1
−2 −1 0 1 2 3 4 −2 −1 0 1 2 3 4
mean mean
10
1.2 3.0 8 12
density
density
density
density
0.8 2.0 6 8
4
0.4 1.0 4
2
0.0 0.0 0 0
−2 −1 0 1 2 3 4 1 2 3 4 5 −2 −1 0 1 2 3 4 1 2 3 4 5
t
mean std mean std
Figure 10.25 Evolution of the particle systems at different stages t of the ABC iterations for the toy model and distance function described in Section 10.12.1. At each stage
t the upper panel shows the density of particles in the two dimensional parameter space of our toy model (mean, std) and the lower left- and right-hand
panels show the density profiles over the same parameters.