Bayes’ theorem

Figure 2 | Graphical interpretation of Bayes’ theorem and its application to iterative estimation of probabilities. (a) Relationship between conditional probabilities given by Bayes’ theorem, relating the probability of the hypothesis that the coin is biased, P(Cb), to its probability once the data have been observed, P(Cb|H). (b) The probability of the identity of the chosen coin can be inferred from the toss outcome. Observing a head increases the chance that the coin is biased from P(Cb) = 0.5 to 0.6, and further to 0.69 if a second head is observed.

Observing, gathering knowledge and making predictions are the foundations of the scientific process. The accuracy of our predictions depends on the quality of our present knowledge and the accuracy of our observations. Weather forecasts are a familiar example—the more we know about how weather works, the better we can use current observations and seasonal records to predict whether it will rain tomorrow, and any disagreement between prediction and observation can be used to refine the weather model. Bayesian statistics embodies this cycle of applying previous theoretical and empirical knowledge to formulate hypotheses, rank them on the basis of observed data and update prior probability estimates and hypotheses using observed data1. This will be the first of a series of columns about Bayesian statistics. This month, we’ll introduce the topic using one of its key concepts—Bayes’ theorem—and expand to include topics such as Bayesian inference and networks in future columns.

Bayesian statistics is often contrasted with classical (frequentist) statistics, which assumes that observed phenomena are generated by an unknown but fixed process. Importantly, classical statistics assumes that population parameters are unknown constants, given that complete and exact knowledge about the sample space is not available2. For estimation of population characteristics, the concept of probability is used to describe the outcomes of measurements.

In contrast, Bayesian statistics assumes that population parameters, though unknown, are quantifiable random variables and that our uncertainty about them can be described by probability distributions. We make subjective probability statements, or ‘priors’, about these parameters based on our experience and reasoning about the population. Probability is understood from this perspective as a degree of belief about the values of the parameter under study. Once we collect data, we combine them with the prior to create a distribution called the ‘posterior’ that represents our updated information about the parameters, as a probability assessment about the possible values of the parameter. Given that experience, knowledge and reasoning processes vary among individuals, so do their priors—making specification of the prior one of the most controversial topics in Bayesian statistics. However, the influence of the prior is usually diminished as we gather knowledge and make observations.

At the core of Bayesian statistics is Bayes’ theorem, which describes the outcome probabilities of related (dependent) events using the concept of conditional probability. To illustrate these concepts, we’ll start with independent events—tossing one of two fair coins, C and C′. The toss outcome probability does not depend on the choice of coin—the probability of heads is always the same, P(H) = 0.5 (Fig. 1). The joint probability of choosing a given coin (e.g., C) and toss outcome (e.g., H) is simply the product of their individual probabilities, P(C, H) = P(C) × P(H). But if we were to replace one of the coins with a biased coin, Cb, that yields heads 75% of the time, the choice of coin would affect the toss outcome probability, making the events dependent. We express this using conditional probabilities: P(H|C) = 0.5 and P(H|Cb) = 0.75, where “|” means “given” or “conditional upon” (Fig. 1).

If P(H|Cb) is the probability of observing heads given the biased coin, how can we calculate P(Cb|H), the probability that the coin is biased having observed heads? These two conditional probabilities are generally not the same—failing to distinguish them is known
as the prosecutor’s fallacy. P(H|Cb) is a property of the biased coin and, unlike P(Cb|H), is unaffected by the chance of the coin being biased.

Figure 1 | Marginal, joint and conditional probabilities for independent and dependent events. Probabilities are shown by plots3, where columns correspond to coins and stacked bars within a column to coin toss outcomes, and are given by the ratio of the blue area to the area of the red outline. The choice of one of two fair coins (C, Cʹ) and the outcome of a toss are independent events. For independent events, marginal and conditional probabilities are the same, and joint probabilities are calculated using the product of probabilities. If one of the coins, Cb, is biased (yields heads (H) 75% of the time), the events are dependent, and joint probability is calculated using conditional probabilities.

We can relate these conditional probabilities by first writing the joint probability of selecting Cb and observing H: P(Cb, H) = P(Cb|H) × P(H) (Fig. 1). The fact that this joint probability is symmetric, P(Cb|H) × P(H) = P(H|Cb) × P(Cb), leads us to Bayes’ theorem, which is a rearrangement of this equality: P(Cb|H) = P(H|Cb) × P(Cb)/P(H) (Fig. 2a). P(Cb) is our guess of the coin being biased before data are collected (the prior), and P(Cb|H) is our guess once we have observed heads (the posterior).

If both coins are equally likely to be picked, P(Cb) = P(C) = 0.5. We also know that P(H|Cb) = 0.75, which is a property of the biased coin. To apply Bayes’ theorem, we need to calculate P(H), which is the probability of all the ways of observing heads—picking the fair coin and observing heads, and picking the biased coin and observing heads. This is P(H) = P(H|C) × P(C) + P(H|Cb) × P(Cb) = 0.5 × 0.5 + 0.75 × 0.5 = 0.625. By substituting these values into Bayes’ theorem, we can compute the probability that the coin is biased given the observed head: P(Cb|H) = 0.75 × 0.5/0.625 = 0.6.
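These coin calculations are easy to verify numerically. The short Python sketch below is illustrative only and not part of the original column (the `update` helper is ours); it applies Bayes’ theorem once and then reuses each posterior as the next prior:

```python
# Iterative application of Bayes' theorem to the biased-coin example.
# P(H|C) = 0.5 for the fair coin; P(H|Cb) = 0.75 for the biased coin.

def update(prior_b, p_h_fair=0.5, p_h_biased=0.75):
    """Return the posterior P(Cb|H) after observing one head."""
    p_h = p_h_fair * (1 - prior_b) + p_h_biased * prior_b  # total probability P(H)
    return p_h_biased * prior_b / p_h                      # Bayes' theorem

p = update(0.5)      # one head: P(Cb|H) = 0.6
p = update(p)        # second head; the posterior becomes the new prior

# Four heads in a row, starting from P(Cb) = 0.5:
p4 = 0.5
for _ in range(4):
    p4 = update(p4)
print(round(p, 2), round(p4, 2))   # 0.69 0.84
```

Each call to `update` implements P(Cb|H) = P(H|Cb) × P(Cb)/P(H), with P(H) computed by the total-probability sum over both coins.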
Bayes’ theorem is particularly useful when probabilities must be computed iteratively—when we need to update probabilities step by step as we gain evidence. For example, if we toss the coin a second time, we can update our prediction that the coin is biased. On the second toss we no longer use P(Cb) = 0.5, because the first toss suggested that the biased coin is more likely to have been picked. The posterior from the first toss becomes the new prior, P(Cb) = 0.6. If the second toss yields heads, we compute P(H) = 0.5 × 0.4 + 0.75 × 0.6 = 0.65 and apply Bayes’ theorem again to find P(Cb|HH) = 0.75 × 0.6/0.65 = 0.69 (Fig. 2b). We can continue tossing to further refine our guess—each time we observe a head, the posterior probability that the coin is biased increases. For example, if we see four heads in a row, there is an 84% posterior probability that the coin is biased (see Supplementary Table 1).

We have computed the probability that the coin is biased given that we observed two heads. Up to this point we have not performed any statistical inference, because all the probabilities have been specified. Both Bayesians and frequentists agree that P(Cb|HH) = 0.69 and P(HH|C) = 0.25. Statistical inference arises when there is an unknown, such as P(H|Cb). The difference between frequentist and Bayesian inference will be discussed more fully in the next column.

Let’s extend the simple coin example to include multiple event outcomes. Suppose a patient has one of three diseases (X, Y, Z) whose prevalence is 0.6, 0.3 or 0.1, respectively—X is relatively common, whereas Z is rare. We have access to a diagnostic test that measures the presence of protein markers (A, B). Both markers can be present, and the probabilities of observing a given marker for each disease are known and independent of each other in each disease state (Fig. 3a). We can ask: if we see marker A, can we predict the state of the patient? Also, how do our predictions change if we subsequently assay for B?

Let’s first calculate the probability that the patient has disease X given that marker A was observed: P(X|A) = P(A|X) × P(X)/P(A). We know the prior probability for X, which is the prevalence P(X) = 0.6, and the probability of observing A given X, P(A|X) = 0.2 (Fig. 3a). To apply Bayes’ theorem we need to calculate P(A), which is the total probability of observing A regardless of the state of the patient. To find P(A) we sum over the product of the probability of each disease and that of finding A in that disease, which accounts for all the ways in which A can be observed: P(A) = 0.6 × 0.2 + 0.3 × 0.9 + 0.1 × 0.2 = 0.41 (Fig. 3b). Bayes’ theorem gives us P(X|A) = 0.2 × 0.6/0.41 = 0.29. Because marker A is most likely in disease Y, the same calculation raises the chance of Y, P(Y|A) = 0.9 × 0.3/0.41 = 0.66, and lowers that of Z, P(Z|A) = 0.2 × 0.1/0.41 = 0.05. Our predictions also depend on the priors—for example, if instead we supposed that all three diseases are equally likely, P(X) = P(Y) = P(Z) = 1/3, observing B would lead us to believe that the chances of Z are 69%.

Having observed A, we could refine our predictions by testing for B. As with the coin example, we use the posterior probability of the disease after observing A as the new prior. The posterior probabilities for diseases X, Y and Z given that A and B are both present are 0.25, 0.56 and 0.19, respectively, making Y the most likely. If the assay for B is negative, the calculations are identical but use complementary probabilities (e.g., P(not B|X) = 1 – P(B|X)) and find 0.31, 0.69 and 0.01 as the probabilities for X, Y and Z. Observing A but not B greatly decreases the chances of disease Z, from 19% to 1%. Figure 3c traces the change in posterior probabilities for each disease with each possible outcome as we assay both markers in turn. If we find neither A nor B, there is a 92% probability that the patient has disease X—the marker profile with the highest probability for predicting X. The most specific profile for Y is A+B– (69%) and for Z is A–B+ (41%).

When event outcomes map naturally onto conditional probabilities, Bayes’ theorem provides an intuitive method of reasoning and convenient computation. It allows us to combine prior knowledge with observations to make predictions about the phenomenon under study. In Bayesian inference, all unknowns in a system are modeled by probability distributions that are updated using Bayes’ theorem as evidence accumulates. We will examine Bayesian inference and compare it with frequentist inference in our next discussion.

Note: Any Supplementary Information and Source Data files are available in the online version of the paper (doi:10.1038/nmeth.3335).

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Jorge López Puga, Martin Krzywinski & Naomi Altman

1. Eddy, S.R. Nat. Biotechnol. 22, 1177–1178 (2004).
2. Krzywinski, M. & Altman, N. Nat. Methods 10, 809–810 (2013).
3. Oldford, R.W. & Cherry, W.H. Picturing probability: the poverty of Venn diagrams, the richness of eikosograms. http://sas.uwaterloo.ca/~rwoldfor/papers/venn/eikosograms/paperpdf.pdf (University of Waterloo, 2006).

Jorge López Puga is a Professor of Research Methodology at Universidad Católica de Murcia (UCAM). Martin Krzywinski is a staff scientist at Canada’s Michael Smith Genome Sciences Centre. Naomi Altman is a Professor of Statistics at The Pennsylvania State University.
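The disease-marker calculations described in the column can be sketched in a few lines of Python. This is an illustration, not part of the original article: the P(A|disease) values come from the text, but the P(B|disease) values are assumptions, back-calculated here so that the sketch reproduces the quoted posteriors, standing in for Fig. 3a, which is not reproduced in this excerpt.

```python
# Posteriors over diseases X, Y and Z after assaying markers A and B.
priors = {'X': 0.6, 'Y': 0.3, 'Z': 0.1}   # disease prevalence (the priors)
p_A = {'X': 0.2, 'Y': 0.9, 'Z': 0.2}      # P(A|disease), from the text
p_B = {'X': 0.2, 'Y': 0.2, 'Z': 0.9}      # P(B|disease), assumed values

def posterior(prior, likelihood, observed=True):
    """Apply Bayes' theorem for one marker assay over all diseases."""
    # Use the complementary probabilities when the assay is negative.
    like = {d: likelihood[d] if observed else 1 - likelihood[d] for d in prior}
    total = sum(prior[d] * like[d] for d in prior)   # total probability of outcome
    return {d: prior[d] * like[d] / total for d in prior}

after_A = posterior(priors, p_A)                         # X: 0.29, Y: 0.66, Z: 0.05
after_AB = posterior(after_A, p_B)                       # X: 0.25, Y: 0.56, Z: 0.19
after_A_not_B = posterior(after_A, p_B, observed=False)  # X: 0.31, Y: 0.69, Z: 0.01
```

Chaining `posterior` calls implements the update rule from the column: the posterior after the A assay becomes the prior for the B assay.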