Probabilistic Modelling Primer
Michael Gutmann
Institute for Adaptive and Neural Computation
School of Informatics, University of Edinburgh
[email protected]
January 8, 2020
Abstract
We give a brief introduction to probability and probabilistic modelling. The document is a
refresher; it is assumed that the reader has some prior knowledge about the topic.
Contents
1 Probability
  1.1 Probability space
  1.2 Conditional probability
  1.3 Bayes’ rule
  1.4 Examples
2 Random variables
  2.1 Definition
  2.2 Distribution of random variables
  2.3 Discrete and continuous random variables
  2.4 Conditional distributions and Bayes’ rule
  2.5 Examples
3 Models
  3.1 Probabilistic models
  3.2 Statistical models
  3.3 Bayesian models
  3.4 Examples
1 Probability
1.1 Probability space
Random or uncertain phenomena can be mathematically described using probability theory
where a fundamental quantity is the probability space. A probability space consists of three
elements: the sample space Ω, the event space F, and the probability (measure) P.
1. Ω is the set of all possible elementary outcomes of the phenomenon of interest. In the
literature, the random phenomenon of interest is usually considered to be some kind of
“experiment” whose outcome is uncertain. Ω is then the collection of all possible elementary,
i.e. finest-grain and distinguishable outcomes of the experiment.
2. F is the collection of all events (subsets of Ω) whose probability to occur one might want
to compute.
3. The probability P measures the plausibility of each event E ∈ F, assigning to it a number
between zero (most implausible/improbable) and one (most plausible/probable).
Probabilities are thus non-negative,
P(E) ≥ 0 ∀E ∈ F (1)
and normalised, P(Ω) = 1, which means that the maximal probability for an event to occur is one.
For the definition of the probability space to be consistent, the event space F needs to satisfy
some basic conditions: If we can compute the probability for an event E to occur, we should also
be able to compute the probability for the event E not to occur. That is:
If E ∈ F then Ē = Ω \ E ∈ F (2)
This property is called closure under complements. Further, if we can compute the probability
for E1 , E2 , . . . individually to occur, we should also be able to compute the probability that any
of the events occurs. That is:

If E1, E2, . . . ∈ F then ∪i Ei ∈ F (3)

This property is called closure under countable unions. The third condition is that
Ω ∈ F, (4)

that is, the sample space itself is an event. Finally, the probability measure P is countably additive: for pairwise disjoint events E1, E2, . . . ∈ F,

P(∪i Ei) = Σi P(Ei). (5)
Note that on the left hand side, we have the probability of the event ∪i Ei , while on the right
hand side, we have the (possibly infinite) sum over all the probabilities of the individual events
Ei . The equation says that the probability for ∪i Ei can be computed by summing up all P(Ei ).
For two events A, B ∈ F with A ⊆ B, it follows from the above that we must have P(A) ≤
P(B). This is because for A ⊆ B we can express B as B = A ∪ (B \ A) where A and B \ A
are disjoint, so that P(B) = P(A) + P(B \ A). Since P(B \ A) is non-negative, we must have
P(A) ≤ P(B). This result is known as monotonicity of probability. A further consequence is that
P(A ∩ B) ≤ P(A) and P(A ∩ B) ≤ P(B) for any A, B ∈ F.
where we used Equation (9). The equation

P(A ∩ B) = Σ_{i=1}^k P(A ∩ Bi) (16)

holds for events B1, . . . , Bk that partition B, from where the so-called Bayes’ rule follows (for P(A) > 0):

P(B|A) = P(A|B) P(B) / P(A) (19)
On the left hand side, the conditioning event is A while on the right hand side the conditioning
event is B. Bayes’ rule shows how to move from P(A|B) to P(B|A) and vice versa, that is to
“revert the conditioning”.
For a partition B1 , . . . , Bk of Ω, we obtain with the law of total probability a more general
version of Bayes’ rule,
P(Bi|A) = P(Bi ∩ A) / P(A) (20)
        = P(A|Bi) P(Bi) / P(A) (21)
        = P(A|Bi) P(Bi) / Σ_{j=1}^k P(A|Bj) P(Bj). (22)
It can be seen that the posterior probability P(Bi |A) of Bi , after learning about event A, is
larger than the prior probability P(Bi ) (prior, or before observing A) if P(A|Bi ) is larger than
the weighted average of all P(A|Bj ).
Bayes’ rule has many important applications. A prototypical application is as follows: Suppose that we observe an event A whose occurrence might be due to a number of mutually exclusive
causes B1 , . . . , Bk . If it is known for each cause Bi how probable it is to observe A, that is, if
the P(A|Bi) are known, Bayes’ rule enables us to compute the “reverse” conditional probability
that Bi has indeed caused A, that is, P(Bi |A).
1.4 Examples
1.4.1 Coin tossing
A coin toss is perhaps the simplest example of a random experiment: The outcome space is
Ω = {H, T }, where H means that the outcome of the coin toss was heads while T means that
the outcome was tails; the event space is F = {{H}, {T }, Ω, ∅}, and the probability (measure)
is defined as
P(E) =  θ       if E = {H}
        1 − θ   if E = {T}
        1       if E = Ω        (23)
        0       if E = ∅,
where θ ∈ [0, 1]. This probability space can be used to describe any random phenomenon with a binary outcome: for example, whether the spin of a magnet is up or down, whether a bit is 0 or 1, or whether some statement holds or not.
Meaning          Nature   Measurement   Probability
True positive    Y        Y             θ1
False negative   Y        N             θ2
False positive   N        Y             θ3
True negative    N        N             θ4
Table 1: A noisy measurement whether nature is in a certain state or not. The probabilities
sum to one, θ1 + θ2 + θ3 + θ4 = 1. Only three of the four θi thus need to be specified to define
the probability measure.
Assume, for example, that both the specificity and the sensitivity of a test for a medical condition are 0.95. If the prior probability that a patient suffers from the medical condition is low, e.g. P(BY) = 0.001 (one in a thousand), the posterior probability for the patient to have the condition given that the test was positive is only P(BY |EY ) ≈ 0.019.
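These numbers can be verified with a short computation; writing sensitivity = P(EY |BY ), specificity = P(EN |BN ), and prior = P(BY ), Bayes’ rule in the form of Equation (22) gives:

```python
def posterior_positive(prior, sensitivity, specificity):
    """P(B_Y | E_Y): probability of the condition given a positive test,
    via Bayes' rule with the two-event partition {B_Y, B_N}."""
    # P(E_Y) = P(E_Y|B_Y) P(B_Y) + P(E_Y|B_N) P(B_N)
    p_positive = sensitivity * prior + (1.0 - specificity) * (1.0 - prior)
    # P(B_Y|E_Y) = P(E_Y|B_Y) P(B_Y) / P(E_Y)
    return sensitivity * prior / p_positive

print(round(posterior_positive(prior=0.001, sensitivity=0.95,
                               specificity=0.95), 3))  # 0.019
```

A sensitivity and specificity of 0.95 thus still yield a posterior below two percent, because false positives from the large fraction of healthy patients dominate the positive test results.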
2 Random variables
Random variables are the main tools to model uncertain, or as the name suggests, random
quantities.
2.1 Definition
Given a probability space (Ω, F, P), a random variable x is a real-valued function defined on
Ω: x : Ω → Ωx ⊆ R. It can be thought of as an outcome of a probing measurement made
on a random phenomenon described by the probability space (Ω, F, P). While x is a known
(deterministic) function from one space to another, its uncertainty or randomness derives from the uncertainty or randomness about its inputs: if the inputs ω ∈ Ω are not observed and are selected randomly according to P, the corresponding outputs x(ω) ∈ Ωx do indeed appear random even
though x is a known function. It thus makes sense to talk about the probability for the occurrence
of events like x ≤ α, that is, of x ∈ (−∞, α].
Let Fx be an event space defined for Ωx . The probability that x takes a value in an event
E ∈ Fx can be determined by computing the probability of the set of all ω ∈ Ω which are mapped
to E,
P(x ∈ E) = P({ω ∈ Ω : x(ω) ∈ E}) (34)
For this equation to make sense, the set {ω ∈ Ω : x(ω) ∈ E} must be an element of F. This puts
a (mild) constraint on x. We assume that this condition is fulfilled for all mappings x and event
spaces Fx which follow.
The above definition can be extended to vector-valued functions x = (x1, . . . , xn): Ω → Ωx ⊆ R^n, where Equation (34) becomes

P(x ∈ E) = P({ω ∈ Ω : x(ω) ∈ E}), (35)

where E is an element of the event space Fx. Such vector-valued functions are sometimes called
random vectors. They are essentially collections of random variables which are defined on the
same underlying probability space (Ω, F, P). In what follows we often do not verbally distinguish
between random vectors and random variables, that is x may be called a random variable.
Knowing Fx allows one to compute the probability of any event E ∈ Fx . Importantly, this can
be done without having to go back to the original probability space (Ω, F, P) that was used to
define x and its distribution. Ultimately, this results in a new “stand-alone” probability space.
By the normalisation condition for probabilities, we obtain the condition that px must sum to
one,
Σ_{k=1}^∞ px(αk) = 1. (40)
A random variable x is continuous if, for any event E ∈ Fx , the probability P(x ∈ E) can be
computed by an integral over a non-negative function px defined on Ωx ,
P(x ∈ E) = ∫_E px(α) dα. (41)
The function is called the probability density function (pdf) of x. We use the same symbol for
both the probability mass function and the probability density function. By the normalisation
condition for probabilities, a pdf must integrate to one,
∫_{Ωx} px(α) dα = 1. (42)
The pdf can be determined from the cdf Fx by taking partial derivatives,
px(α) = ∂^n Fx(α1, . . . , αn) / (∂α1 · · · ∂αn). (43)
In what follows, we may call px the pdf of x whether it is a continuous or discrete random
variable.
It is often the case that notation is simplified and the pdf or pmf of x is denoted by p(x). In this convention, x takes both the role of the random variable and of the values it may take. The context often makes it clear which is meant; if not, the convention can lead to considerable confusion, in which case it is better to resort to the more verbose notation px for the pdf or pmf, and px(α) for the value of px at α.
There are also mixed-type random variables which cannot be classified as either discrete or continuous. The probability that x ∈ E is then computed by a combination of summation and integration.
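As a numerical illustration (not part of the original text), the normalisation conditions (40) and (42) can be checked for a Bernoulli pmf and for the Beta(2, 2) pdf p(x) = 6x(1 − x) on Ωx = [0, 1], with the integral approximated by a midpoint Riemann sum:

```python
# Bernoulli pmf with theta = 0.3 (hypothetical value): Equation (40)
theta = 0.3
pmf = {1: theta, -1: 1.0 - theta}
print(abs(sum(pmf.values()) - 1.0) < 1e-12)  # True

# Beta(2, 2) pdf p(x) = 6 x (1 - x) on [0, 1]: Equation (42),
# with the integral approximated by a midpoint Riemann sum
n = 100_000
dx = 1.0 / n
integral = sum(6.0 * x * (1.0 - x) * dx
               for x in ((k + 0.5) * dx for k in range(n)))
print(abs(integral - 1.0) < 1e-6)  # True
```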
2.4 Conditional distributions and Bayes’ rule
For discrete random variables, the conditional pmf px2|x1 is defined as

px2|x1(α2|α1) = px(α1, α2) / px1(α1) (44)

for all α1 where the marginal pmf px1(α1) = Σ_{α2} px(α1, α2) > 0. The conditional is left undefined for α1 where px1(α1) = 0. This definition is in line with the definition of conditional probability in Section 1.2. We also have a corresponding product rule,

px(α1, α2) = px2|x1(α2|α1) px1(α1). (45)

As in Section 1.2, the product rule is valid for all α1, α2 even though px2|x1 is left undefined for those α1 where px1(α1) = 0.
For continuous random variables, the conditional pdf px2 |x1 is defined as
px2|x1(α2|α1) = px(α1, α2) / px1(α1) (46)

for all α1 where the marginal pdf px1(α1) = ∫ px(α1, α2) dα2 > 0. For α1 where px1(α1) = 0, the conditional pdf is left unspecified. In fact, we are free to define it as we wish as long as it is a valid pdf. As before, we obtain the product rule

px(α1, α2) = px2|x1(α2|α1) px1(α1). (47)
Since px (α1 , α2 ) ≥ 0, it follows that px1 (α1 ) = 0 implies that px (α1 , α2 ) = 0 for all α2 . Hence,
as before, the product rule is valid for all (α1 , α2 ) and the non-uniqueness of px2 |x1 for those α1
where px1 (α1 ) = 0 is irrelevant.
In simplified notation, the above equations become

p(x2|x1) = p(x1, x2) / p(x1) (48)
p(x1, x2) = p(x2|x1) p(x1), (49)

and the sum rule for marginalisation reads

p(x1) = Σ_{x2 ∈ Ω2} p(x1, x2), (50)

where Ω2 denotes the sample space of x2. It is thus possible to use the product rule to define the joint distribution of (x1, x2) by separately specifying the marginal distribution of x1 and the conditional distribution of x2 given x1.
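This construction can be sketched in code: a marginal p(x1) and a conditional p(x2|x1) (the numbers below are hypothetical, chosen for illustration) define the joint, and drawing x1 first and then x2 given x1 samples from it:

```python
import random

random.seed(0)

# marginal of x1 and conditional of x2 given x1 (hypothetical numbers)
p_x1 = {0: 0.4, 1: 0.6}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1},
                 1: {0: 0.2, 1: 0.8}}

def sample_joint():
    """Draw (x1, x2) from p(x1, x2) = p(x2|x1) p(x1), first x1 then x2."""
    x1 = random.choices([0, 1], weights=[p_x1[0], p_x1[1]])[0]
    x2 = random.choices([0, 1], weights=[p_x2_given_x1[x1][0],
                                         p_x2_given_x1[x1][1]])[0]
    return x1, x2

# the product rule gives p(x1 = 1, x2 = 1) = 0.8 * 0.6 = 0.48;
# the empirical frequency should agree
samples = [sample_joint() for _ in range(100_000)]
freq = sum(1 for s in samples if s == (1, 1)) / len(samples)
print(abs(freq - 0.48) < 0.01)  # True
```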
Finally, as in Section 1, the product rule yields Bayes’ rule for random variables,

p(x1|x2) = p(x2|x1) p(x1) / p(x2). (51)
2.5 Examples
2.5.1 Bernoulli random variable
Let (Ω, F, P) be the probability space from the coin tossing example (Section 1.4.1) and x the mapping from Ω to {−1, 1} where “heads” is assigned to 1 and “tails” to −1. The mapping is
deterministic but since we do not see the coin flips, the output appears random: P(x = 1) =
p(1) = P(H) = θ and P(x = −1) = p(−1) = P(T ) = 1 − θ.
This is not the only way to construct binary random variables. There are other mappings x′ and probability spaces (Ω′, F′, P′) which result in the same probability distribution. For example, let Ω′ = [0, 1], F′ the event space containing all intervals of the form [a, b), 0 ≤ a ≤ b ≤ 1, and P′([a, b)) = b − a. Then, the mapping x′ with ω ↦ 1 if ω ≤ θ and ω ↦ −1 if ω > θ has P(x′ = 1) = P′({ω : 0 ≤ ω < θ}) = θ as well.
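The second construction lends itself directly to simulation; a minimal sketch, assuming θ = 0.3:

```python
import random

random.seed(1)
theta = 0.3  # hypothetical success probability

def x_prime(omega):
    """The mapping x' from the example: omega in [0, 1] -> {-1, 1}."""
    return 1 if omega <= theta else -1

# omega ~ P' is uniform on [0, 1]; the induced distribution of x' is Bernoulli
samples = [x_prime(random.random()) for _ in range(100_000)]
freq_one = samples.count(1) / len(samples)
print(abs(freq_one - theta) < 0.01)  # True
```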
where the first sum goes over 2α terms and the second over 2β terms. Figure 2(a) shows a
scatter plot of some random values that (u, v) take and Figure 2(b) shows a histogram of the
corresponding values of b. The distribution of b is called the “beta distribution”. It takes different
shapes for different values of α and β because u and v behave differently even though the mapping
(u, v) 7→ b(u, v) stays the same. If a random variable x has a beta distribution, it is denoted by
x ∼ Beta(α, β). Its pdf is given by
p(x|α, β) = Γ(α + β)/(Γ(α)Γ(β)) x^{α−1} (1 − x)^{β−1}, x ∈ [0, 1]. (54)
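Samples of b can be generated accordingly; a sketch using gamma-distributed u and v (which subsumes the sums-of-squares construction referred to above), whose ratio u/(u + v) follows a Beta(α, β) distribution:

```python
import random

random.seed(2)
alpha, beta = 2.0, 5.0  # hypothetical shape parameters

def sample_b():
    """b = u/(u + v) with u ~ Gamma(alpha, 1) and v ~ Gamma(beta, 1);
    the ratio is Beta(alpha, beta) distributed."""
    u = random.gammavariate(alpha, 1.0)
    v = random.gammavariate(beta, 1.0)
    return u / (u + v)

samples = [sample_b() for _ in range(100_000)]
mean = sum(samples) / len(samples)
# the Beta(alpha, beta) mean is alpha / (alpha + beta)
print(abs(mean - alpha / (alpha + beta)) < 0.01)  # True
```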
3 Models
We here explain the (fine) difference between probabilistic, statistical, and Bayesian models.
They are often conflated, and people may use “statistical model” or “probabilistic model” to refer to any one of them.
Figure 1: A plot of the function b(u, v) = u/(u+v) which can be used to generate beta-distributed
random variables.
Figure 2: (a) Scatter plot of sampled values of (u, v). (b) Histogram of the corresponding values of b = u/(u + v).
For example, the Bernoulli random variable from Section 2.5.1 with success probability 1/2 is
a probabilistic model of coin tosses. A probabilistic model for a wide range of random phenomena
is the Gaussian random variable with probability density function
p(x) = 1/√(2π) exp(−x²/2). (56)
More generally, a Gaussian random variable with known mean µ0 and known variance σ02 is also
a probabilistic model. The corresponding density is
p(x) = 1/√(2πσ0²) exp(−(x − µ0)²/(2σ0²)). (57)
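For concreteness, the density in Equation (57) can be evaluated directly; a minimal sketch:

```python
import math

def gaussian_pdf(x, mu=0.0, var=1.0):
    """Gaussian density with mean mu and variance var, as in Equation (57)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# the standard Gaussian density at its mode equals 1 / sqrt(2 pi)
print(round(gaussian_pdf(0.0), 4))  # 0.3989
```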
3.4 Examples
3.4.1 Statistical model for binary data
Assume you have developed a new procedure which you think is faster or in other ways better
than existing ones. The procedure could, for example, be a new classification algorithm, a new
kind of measurement protocol, or a new treatment of an ailment. However, you also noticed
that sometimes, the new procedure performs worse than the existing ones. The situation can
be modelled using a binary random variable x where x = 1 means that the new procedure is
performing better, while x = 0 means that it is performing worse. The probability that the
procedure is a success is here the parameter θ, and the statistical model is specified by p(x|θ) = θ^x (1 − θ)^{1−x} for x ∈ {0, 1} and θ ∈ [0, 1].
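As a sketch, the Bernoulli pmf p(x|θ) = θ^x (1 − θ)^{1−x} can be coded together with the maximum-likelihood estimate of θ, which, as a standard result, is the fraction of successes; the data below are made up:

```python
def bernoulli_pmf(x, theta):
    """p(x | theta) = theta**x * (1 - theta)**(1 - x) for x in {0, 1}."""
    return theta ** x * (1.0 - theta) ** (1 - x)

def mle(data):
    """Maximum-likelihood estimate of theta: the fraction of successes."""
    return sum(data) / len(data)

outcomes = [1, 1, 0, 1, 1, 0, 1, 1]  # made-up trial results
theta_hat = mle(outcomes)
print(theta_hat)  # 0.75
```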
which one can again turn into a Bayesian or probabilistic model by attaching a (prior) probability
distribution to α and β. The parameters α and β are sometimes called hyperparameters, the
assumed prior a hyperprior, and the resulting model a hierarchical Bayesian model.
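Ancestral sampling from such a hierarchical model can be sketched as follows; the uniform hyperprior on α and β is a hypothetical choice made for illustration, and θ ∼ Beta(α, β) is generated with the gamma-ratio construction from Section 2.5:

```python
import random

random.seed(3)

def sample_hierarchical():
    """One ancestral sample: hyperprior -> prior -> observation."""
    # hyperprior on the hyperparameters (hypothetical uniform choice)
    alpha = random.uniform(0.5, 5.0)
    beta = random.uniform(0.5, 5.0)
    # prior: theta ~ Beta(alpha, beta), via the gamma-ratio construction
    u = random.gammavariate(alpha, 1.0)
    v = random.gammavariate(beta, 1.0)
    theta = u / (u + v)
    # likelihood: x ~ Bernoulli(theta)
    x = 1 if random.random() < theta else 0
    return alpha, beta, theta, x

alpha, beta, theta, x = sample_hierarchical()
print(0.0 <= theta <= 1.0 and x in (0, 1))  # True
```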