
POINTS OF SIGNIFICANCE

Bayesian statistics

Today’s predictions are tomorrow’s priors.

One of the goals of statistics is to make inferences about population parameters from a limited set of observations. Last month, we showed how Bayes’ theorem is used to update probability estimates as more data are collected1. We used the example of identifying a coin as fair or biased based on the outcome of one or more tosses. This month, we introduce Bayesian inference by treating the degree of bias as a population parameter and using toss outcomes to model it as a distribution to make probabilistic statements about its likely values.

How are Bayesian and frequentist inference different? Consider a coin that yields heads with a probability of p. Both the Bayesian and the frequentist consider p to be a fixed but unknown constant and compute the probability of a given set of tosses (for example, k heads, Hk) based on this value (for example, P(Hk | p) = p^k), which is called the likelihood. The frequentist calculates the probability of different data generated by the model, P(data | model), assuming a probabilistic model with known and fixed parameters (for example, the coin is fair, P(Hk) = 0.5^k). The observed data are assessed in light of other data generated by the same model.

In contrast, the Bayesian uses probability to quantify uncertainty and can make more precise probability statements about the state of the system by calculating P(model | data), a quantity that is meaningless in frequentist statistics. The Bayesian uses the same likelihood as the frequentist, but also assumes a probabilistic model (prior distribution) for possible values of p based on previous experience. After observing the data, the prior is updated to the posterior, which is used for inference. The data are considered fixed and possible models are assessed on the basis of the posterior.

Let’s extend our coin example from last month to incorporate inference and illustrate the differences in frequentist and Bayesian approaches to it. Recall that we had two coins: coin C was fair, P(H | C) = p0 = 0.5, and coin Cb was biased toward heads, P(H | Cb) = pb = 0.75. A coin was selected at random with equal probability and tossed. We used Bayes’ theorem to compute the probability that the biased coin was selected given that a head was observed; we found P(Cb | H) = 0.6. We also saw how we could refine our guess by updating this probability with the outcome of another toss: seeing a second head gave us P(Cb | H2) = 0.69.

In this example, the parameter p is discrete and has two possible values: fair (p0 = 0.5) and biased (pb = 0.75). The prior probability of each before tossing is equal, P(p0) = P(pb) = 0.5, and the data-generating process has the likelihood P(Hk | p) = p^k. If we observe a head, Bayes’ theorem gives the posterior probabilities as P(p0 | H) = p0/(p0 + pb) = 0.4 and P(pb | H) = pb/(p0 + pb) = 0.6. Here all the probabilities are known and the frequentist and Bayesian agree on the approach and the results of the computation.
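This two-coin update is simple enough to verify numerically. The following minimal Python sketch reproduces the posterior probabilities quoted above; the variable names and helper function are our illustration, not part of the original calculation.

# Discrete Bayesian update for the two-coin example: equal prior over
# a fair coin (p0 = 0.5) and a head-biased coin (pb = 0.75).
p = {"fair": 0.5, "biased": 0.75}      # P(H | coin)
prior = {"fair": 0.5, "biased": 0.5}   # P(coin) before any toss

def update(prior, heads):
    """Posterior P(coin | a run of `heads` heads)."""
    unnorm = {c: prior[c] * p[c] ** heads for c in prior}
    total = sum(unnorm.values())       # marginal likelihood P(data)
    return {c: v / total for c, v in unnorm.items()}

print(update(prior, heads=1))   # fair ~0.4, biased ~0.6  -> P(Cb | H)  = 0.6
print(update(prior, heads=2))   # fair ~0.31, biased ~0.69 -> P(Cb | H2) = 0.69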
In a more realistic inference scenario, nothing is known about the coin and p could be any value in the interval [0,1]. What can be inferred about p after a coin toss produces H3 (where HkTn–k denotes the outcome of n tosses that produced k heads and n–k tails)? The frequentist and the Bayesian agree on the data generation model, P(H3 | p) = p^3, but they will use different methods to encode experience from other coins and the observed outcomes.

In part, this compatibility arises because, for the frequentist, only the data have a probability distribution. The frequentist may test whether the coin is fair using the null hypothesis, H0: p = p0 = 0.5. In this case, H3 and T3 are the most extreme outcomes, each with probability 0.125. The P value is therefore P(H3 | p0) + P(T3 | p0) = 0.25. At the nominal level of α = 0.05, the frequentist fails to reject H0 and accepts that p = 0.5. The frequentist might estimate p using the sample percentage of heads or compute a 95% confidence interval for p, 0.29 < p ≤ 1. The interval depends on the outcome, but 95% of the intervals will include the true value of p.
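These frequentist numbers can be reproduced with an exact binomial test, for example with SciPy; a sketch assuming SciPy ≥ 1.7, where binomtest and its Clopper–Pearson interval are available.

from scipy.stats import binomtest

# Exact two-sided test of H0: p = 0.5 after observing H3 (3 heads in 3 tosses).
result = binomtest(k=3, n=3, p=0.5, alternative="two-sided")
print(result.pvalue)                      # 0.25 = P(H3 | p0) + P(T3 | p0)

# 95% Clopper-Pearson confidence interval for p; for H3 it is roughly (0.29, 1].
ci = result.proportion_ci(confidence_level=0.95)
print(ci.low, ci.high)                    # ~0.29, 1.0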
The frequentist approach can only tell us the probability of obtaining our data under the assumption that the null hypothesis is the true data-generating distribution. Because it considers p to be fixed, it does not recognize the legitimacy of questions like “What is the probability that the coin is biased towards heads?” The coin either is or is not biased toward heads. For the frequentist, probabilistic questions about p make sense only when selecting a coin by a known randomization mechanism from a population of coins.

By contrast, the Bayesian, while agreeing that p has a fixed true value for the coin, quantifies uncertainty about the true value as a probability distribution on the possible values, called the prior distribution. For example, if she knows nothing about the coin, she could use a uniform distribution on [0,1] that captures her assessment that any value of p is equally likely (Fig. 1a). If she thinks that the coin is most likely to be close to fair, she can pick a bell-shaped prior distribution (Fig. 1a). These distributions can be imagined as the histogram of the values of p from a large population of coins from which the current coin was selected at random. However, in the Bayesian model, the investigator chooses the prior based on her knowledge about the coin at hand, not some imaginary set of coins.

Given the toss outcome of H3, the Bayesian applies Bayes’ theorem to combine the prior, P(p), with the likelihood of observing the data, P(H3 | p), to obtain the posterior P(p | H3) = P(H3 | p) × P(p)/P(H3) (Fig. 1b). This is analogous to P(A | B) = P(B | A) × P(A)/P(B), except now A is the model parameter, B is the observed data and, because p is continuous, P(∙) is interpreted as a probability density. The term corresponding to the denominator P(B), the marginal likelihood P(H3), becomes the normalizing constant so that the total probability (area under the curve) is 1. As long as this is finite, it is often left out and the numerator is used to express the shape of the density. That is why it is commonly said that the posterior distribution is proportional to the prior times the likelihood.
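The “prior times likelihood, then normalize” recipe can be checked numerically on a grid of p values for the uniform prior and the H3 outcome; a minimal NumPy sketch (the grid resolution is arbitrary and the code is ours, for illustration only).

import numpy as np

# Grid approximation of P(p | H3), proportional to P(H3 | p) x P(p), uniform prior.
grid = np.linspace(0, 1, 10001)
dp = grid[1] - grid[0]
prior = np.ones_like(grid)                 # uniform prior, P(p) = 1
likelihood = grid ** 3                     # P(H3 | p) = p^3
unnorm = likelihood * prior
posterior = unnorm / (unnorm.sum() * dp)   # normalize so the area is 1

print(posterior.sum() * dp)                # 1.0: total probability
print(posterior[5000])                     # density at p = 0.5, ~0.5 = 4 * 0.5^3 (Fig. 1b)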
Figure 1 | Prior probability distributions represent knowledge about the coin before it is tossed. (a) Three different prior distributions of p, the probability of heads (uniform, symmetric and head-weighted). (b) Toss outcomes are combined with the prior to create the posterior distribution used to make inferences about the coin. The likelihood is the probability of observing a given toss outcome, which is p^3 for a toss of H3. The gray area corresponds to the probability that the coin is biased toward heads, P(p > 0.5 | H3). The error bar is the 95% credible interval (CI) for p. The dotted line is the posterior mean, E(p). The posterior is shown normalized to 4p^3 to make its area 1.


Suppose the Bayesian knows little about the coin and uses the uniform prior, P(p) = 1. The relationship between posterior and likelihood is then simplified to P(p | H3) = P(H3 | p) = p^3 (Fig. 1b). The Bayesian uses the posterior distribution for inference, choosing the posterior mean (p = 0.8), median (p = 0.84) or the value of p for which the posterior is maximum (p = 1, the mode) as a point estimate of p.

The Bayesian can also calculate the 95% credible region, the smallest interval over which we find 95% of the area under the posterior, which here is [0.47,1] (Fig. 1b). Like the frequentist, the Bayesian cannot conclude that the coin is not biased, because p = 0.5 falls within the credible interval. Unlike the frequentist, they can make statements about the probability that the coin is biased toward heads (94%) using the area under the posterior distribution for p > 0.5 (Fig. 1b). The probability that the coin is biased toward tails is P(p < 0.5 | H3) = 0.06. Thus, given the choice of prior, the toss outcome H3 overwhelmingly supports the hypothesis of head bias, which is 0.94/0.06 ≈ 16 times more likely than tail bias. This ratio of posterior probabilities is called the Bayes factor, and its magnitude can be associated with a degree of confidence2. By contrast, the frequentist would test H0: p ≤ 0.5 versus HA: p > 0.5 using the P value based on a one-tailed test at the boundary (p0 = 0.5), obtain P = 0.125 and not reject the null hypothesis. Conversely, the Bayesian cannot test the hypothesis that the coin is fair because, in using the uniform prior, statements about p are limited to intervals and cannot be made for single values of p (which always have zero prior and posterior probabilities).
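These point estimates, the credible interval and the tail probabilities all follow from the p^3-shaped posterior, which (anticipating the beta distributions discussed below) is a beta(4,1) density. A sketch using SciPy; the function calls are standard SciPy, but the script itself is our illustration.

from scipy.stats import beta

# Posterior for H3 under a uniform prior: density 4p^3 on [0,1], i.e. beta(4, 1).
post = beta(4, 1)

print(post.mean())            # 0.8   - posterior mean
print(post.median())          # ~0.84 - posterior median
# The density increases on [0,1], so the smallest 95% interval is [ppf(0.05), 1].
print(post.ppf(0.05), 1.0)    # ~0.47, 1
print(post.sf(0.5))           # ~0.94 = P(p > 0.5 | H3), probability of head bias
print(post.sf(0.5) / post.cdf(0.5))   # ratio of posterior probabilities (Bayes factor)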
Suppose now that we suspect the coin to be head-biased and want a head-weighted prior (Fig. 1a). What would be a justifiable shape? It turns out that if we consider the general case of n tosses with outcome HkTn–k, we arrive at a tidy solution. With a uniform prior, this outcome has a posterior probability proportional to p^k(1 – p)^(n–k). The shape and interpretation of the prior are motivated by considering n′ more tosses that produce k′ heads, Hk′Tn′–k′. The combined toss outcome is Hk+k′T(n+n′)–(k+k′), which, with a uniform prior, has a posterior probability proportional to p^(k+k′)(1 – p)^((n+n′)–(k+k′)). Another way to think about this posterior is to treat the first set of tosses as the prior, p^k(1 – p)^(n–k), and the second set as the likelihood, p^(k′)(1 – p)^(n′–k′). In fact, if we extrapolate this pattern back to 0 tosses (with outcome H0T0), the original uniform prior is exactly the distribution that corresponds to it: p^0(1 – p)^0 = 1. This iterative updating by adding powers treats the prior as a statement about the coin based on the outcomes of previous tosses.

Let’s look at how different shapes of priors might arise from this line of reasoning. Suppose we suspect that the coin is biased with p = 0.75. In a large number of tosses we expect to see 75% heads. If we are uncertain about this, we might let this imaginary outcome be H3T1 and set the prior proportional to p^3(1 – p)^1 (Fig. 2a). If our suspicion is stronger, we might use H15T5 and set the prior proportional to p^15(1 – p)^5. In either case, the posterior distribution is obtained simply by adding the number of observed heads and tails to the exponents of p and (1 – p), respectively. If our toss outcome is H3T1, the respective posteriors are proportional to p^6(1 – p)^2 and p^18(1 – p)^6.

As we collect data, the impact of the prior is diminished and the posterior is shaped more like the likelihood. For example, if we use a prior that corresponds to H3T1, suggesting that the coin is head-biased, and then collect data that indicate otherwise, seeing tosses of H1T3, H5T15 or H25T75 (75% tails), our original misjudgment about the coin is quickly mitigated (Fig. 2b).

Figure 2 | Effect of the choice of prior and the amount of data collected on the posterior. All curves are beta(a,b) distributions labeled by their equivalent toss outcome, Ha–1Tb–1. (a) Posteriors for a toss outcome of H3T1 using weakly (H3T1) and strongly (H15T5) head-weighted priors. (b) The effect of a head-weighted prior, H3T1, diminishes with more tosses (4, 20, 100) indicative of a tail-weighted coin (75% tails).
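A short sketch of this “add the counts to the exponents” bookkeeping, using the posterior mean to show how the head-weighted H3T1 prior is overwhelmed by the tail-heavy outcomes of Fig. 2b; the loop and the mean formula below are our illustration.

# Prior proportional to p^3 (1 - p)^1, i.e. an imaginary previous outcome of H3T1.
prior_h, prior_t = 3, 1

for h, t in [(1, 3), (5, 15), (25, 75)]:      # 75%-tails outcomes, as in Fig. 2b
    a, b = prior_h + h, prior_t + t           # add observed counts to the exponents
    mean = (a + 1) / (a + b + 2)              # mean of the density p^a (1 - p)^b on [0,1]
    print(f"H{h}T{t}: posterior ~ p^{a} (1 - p)^{b}, posterior mean = {mean:.2f}")

# Output drifts from 0.50 toward 0.25, the fraction of heads in the data.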
In general, a distribution on p in [0,1] proportional to p^(a–1)(1 – p)^(b–1) is called a beta(a,b) distribution. The parameters a and b must be positive, but they do not need to be whole numbers. When a ≥ 1 and b ≥ 1, (a + b – 2) acts like a generalized number of coin tosses and controls the tightness of the distribution around its mode (the location of the maximum of the density), and (a – 1) acts like the number of heads and controls the location of the mode.

All of the curves in Figure 2 are beta distributions. Priors corresponding to a previous toss outcome of HkTn–k are beta distributions with a = k + 1 and b = n – k + 1. For example, the prior for H15T5 has the shape of beta(16,6). For a prior of beta(a,b), a toss outcome of HkTn–k will have a posterior of beta(a + k, b + n – k). For example, the posterior for a toss outcome of H3T1 using an H15T5 prior is beta(19,7). In general, when the posterior comes from the same family of distributions as the prior, with an update formula for the parameters, we say that the prior is conjugate to the distribution generating the data. Conjugate priors are convenient when they are available for data-generating models because the posterior is readily computed. The beta distributions are conjugate priors for binary outcomes such as H or T and come in a wide variety of shapes: flat, skewed, bell-shaped or U-shaped. For a prior on the interval [0,1], it is usually possible to pick values of (a,b) that give a suitable prior for the head probability in coin tosses (or the success probability in independent binary trials).
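The conjugate update rule is easy to exercise directly; a sketch using SciPy’s beta distribution for the beta(16,6) prior and H3T1 outcome mentioned above (note that interval() returns a central credible interval, not the smallest one).

from scipy.stats import beta

# Conjugate update: a beta(a, b) prior and a toss outcome HkTn-k give a
# beta(a + k, b + n - k) posterior.
a, b = 16, 6                 # strongly head-weighted prior, equivalent to H15T5
k, n = 3, 4                  # observed toss outcome H3T1

posterior = beta(a + k, b + n - k)     # beta(19, 7), as in the text
print(posterior.mean())                # ~0.73, posterior mean of p
print(posterior.interval(0.95))        # central 95% credible interval for p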
Frequentist inference assumes that the data-generating mechanism is fixed and that only the data have a probabilistic component. Inference about the model is therefore indirect, quantifying the agreement between the observed data and the data generated by a putative model (for example, the null hypothesis). Bayesian inference quantifies the uncertainty about the data-generating mechanism by the prior distribution and updates it with the observed data to obtain the posterior distribution. Inference about the model is therefore obtained directly as a probability statement based on the posterior. Although the inferential philosophies are quite different, advances in statistical modeling, computing and theory have led many statisticians to keep both sets of methodologies in their data analysis toolkits.

ACKNOWLEDGMENTS
The authors gratefully acknowledge M. Lavine for contributions to the manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Jorge López Puga, Martin Krzywinski & Naomi Altman

Corrected after print 24 September 2015.

1. Puga, J.L., Krzywinski, M. & Altman, N. Nat. Methods 12, 277–278 (2015).
2. Kass, R.E. & Raftery, A.E. J. Am. Stat. Assoc. 90, 791 (1995).

Jorge López Puga is a Professor of Research Methodology at UCAM Universidad Católica de Murcia. Martin Krzywinski is a staff scientist at Canada’s Michael Smith Genome Sciences Centre. Naomi Altman is a Professor of Statistics at The Pennsylvania State University.


CORRIGENDA

Corrigendum: Bayesian statistics


Jorge López Puga, Martin Krzywinski & Naomi Altman
Nat. Methods 12, 377–378 (2015); published online 29 April 2015; corrected after print 24 September 2015.

In the version of this article initially published, the curves (in red) showing the likelihood distribution in Figure 2 were incorrectly drawn
in some panels. The error has been corrected in the HTML and PDF versions of the article.
