
Introduction to Probabilistic Modelling

Michael Gutmann
Institute for Adaptive and Neural Computation
School of Informatics, University of Edinburgh
[email protected]
January 8, 2020

Abstract
We give a brief introduction to probability and probabilistic modelling. The document is a
refresher; it is assumed that the reader has some prior knowledge about the topic.

Contents
1 Probability
  1.1 Probability space
  1.2 Conditional probability
  1.3 Bayes’ rule
  1.4 Examples

2 Random variables
  2.1 Definition
  2.2 Distribution of random variables
  2.3 Discrete and continuous random variables
  2.4 Conditional distributions and Bayes’ rule
  2.5 Examples

3 Models
  3.1 Probabilistic models
  3.2 Statistical models
  3.3 Bayesian models
  3.4 Examples

1 Probability
1.1 Probability space
Random or uncertain phenomena can be mathematically described using probability theory
where a fundamental quantity is the probability space. A probability space consists of three
elements: the sample space Ω, the event space F, and the probability (measure) P.
1. Ω is the set of all possible elementary outcomes of the phenomenon of interest. In the
literature, the random phenomenon of interest is usually considered to be some kind of
“experiment” whose outcome is uncertain. Ω is then the collection of all possible elementary,
i.e. finest-grain and distinguishable outcomes of the experiment.
2. F is the collection of all events (subsets of Ω) whose probability of occurring one might
want to compute.
3. The probability P measures the plausibility of each event E ∈ F, assigning to it a number
between zero (most implausible/improbable) and one (most plausible/probable).
Probabilities are thus non-negative,

P(E) ≥ 0 ∀E ∈ F (1)

and normalised, which means that the maximal probability for an event to occur is one.
For the definition of the probability space to be consistent, the event space F needs to satisfy
some basic conditions: If we can compute the probability for an event E to occur, we should also
be able to compute the probability for the event E not to occur. That is:

If E ∈ F then Ē = Ω \ E ∈ F (2)

This property is called closure under complements. Further, if we can compute the probability
for E1 , E2 , . . . individually to occur, we should also be able to compute the probability that any
of the events occurs. That is:

If E1 ∈ F, E2 ∈ F, . . . then (∪i Ei ) ∈ F. (3)

This property is called closure under countable unions. The third condition is that

Ω ∈ F, (4)

which is a consequence of the above because Ω = E ∪ (Ω \ E).


Since Ω is the set of all possible outcomes, we can be certain that the event Ω occurs and the
normalisation condition is
P(Ω) = 1. (5)
This normalisation condition has a number of important consequences when we learn parameters
of a model.
Finally, in order to avoid inconsistencies, the assignment of probabilities to events needs
to satisfy an additivity condition: For every countable collection of pairwise disjoint events
E1, E2, . . .,

P(∪i Ei) = Σ_i P(Ei).     (6)

Note that on the left hand side, we have the probability of the event ∪i Ei , while on the right
hand side, we have the (possibly infinite) sum over all the probabilities of the individual events
Ei . The equation says that the probability for ∪i Ei can be computed by summing up all P(Ei ).
For two events A, B ∈ F with A ⊆ B, it follows from the above that we must have P(A) ≤
P(B). This is because for A ⊆ B we can express B as B = A ∪ (B \ A) where A and B \ A
are disjoint, so that P(B) = P(A) + P(B \ A). Since P(B \ A) is non-negative, we must have
P(A) ≤ P(B). This result is known as monotonicity of probability. A further consequence is that
P(A ∩ B) ≤ P(A) and P(A ∩ B) ≤ P(B) for any A, B ∈ F.

1.2 Conditional probability


After observing some event, we may want to update the probabilities that we assign to the events
in F accordingly. The updated probability of event A after we learn that event B has occurred
is the conditional probability of A given B. It is denoted by P(A|B). If P(B) > 0, we compute
this probability as
P(A|B) = P(A ∩ B) / P(B).     (7)
The conditional probability is not defined when P(B) = 0. The conditional probability of A
given B is thus the joint probability of A and B, P(A ∩ B), re-normalised by the probability of
B. We can think of the conditional probability P(·|B) as defining a new probability P′(·) that may
only take non-zero values on subsets of B and assigns probability one to B.
Since P(A ∩ B) = P(B ∩ A), we have for P(A) > 0
P(B|A) = P(A ∩ B) / P(A),     (8)
so that P(A|B) and P(B|A) differ only by the normalisation in the denominator.
From the definition of conditional probability, it follows that
P(A ∩ B) = P(A|B)P(B)   if P(B) > 0     (9)
         = P(B|A)P(A)   if P(A) > 0,    (10)
which means that we can use conditional probabilities to assign joint probabilities. Equations
(9) and (10) are called the “product rule”.
We now show that the restrictions P(B) > 0 and P(A) > 0 are actually not needed in the
formulation of the product rule in Equations (9) and (10). This is because if, for example,
P(B) = 0 then we also must have P(A ∩ B) ≤ P(B) = 0 so that the product rule holds for all
events A, B even if P(A|B) is left undefined for events B with P(B) = 0.
Suppose that B can be partitioned into events B1 , . . . , Bk ,
B = ∪_{i=1}^k Bi,   Bi ∩ Bj = ∅ if i ≠ j,     (11)

with P(Bi) > 0. Then

P(A ∩ B) = P(A ∩ (∪_{i=1}^k Bi))     (12)
         = P(∪_{i=1}^k (A ∩ Bi))     (13)
         = Σ_{i=1}^k P(A ∩ Bi)       (14)
         = Σ_{i=1}^k P(A|Bi)P(Bi),   (15)

where we used Equation (9). The equation
P(A ∩ B) = Σ_{i=1}^k P(A ∩ Bi)     (16)

is known as the sum rule. Using Ω for B, we obtain


P(A) = Σ_{i=1}^k P(A|Bi)P(Bi),     (17)

which is called the law of total probability.
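The law of total probability is easy to check numerically. The sketch below is our own illustration: the partition into three events and all probabilities are invented values, chosen only so the arithmetic is visible.

```python
# Invented partition of Omega into B1, B2, B3 with prior probabilities
# P(Bi), and conditional probabilities P(A|Bi) for some event A.
P_B = [0.5, 0.3, 0.2]            # the P(Bi); they must sum to one
P_A_given_B = [0.9, 0.4, 0.1]    # the P(A|Bi)

# Law of total probability, Equation (17): P(A) = sum_i P(A|Bi) P(Bi)
P_A = sum(pa * pb for pa, pb in zip(P_A_given_B, P_B))
```

Here P(A) = 0.9·0.5 + 0.4·0.3 + 0.1·0.2 = 0.59, a weighted average of the conditionals with the P(Bi) as weights.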

1.3 Bayes’ rule


From Equation (9) and (10), we have

P(A|B)P(B) = P(B|A)P(A) (18)

from which the so-called Bayes’ rule follows (for P(A) > 0):

P(B|A) = P(A|B) P(B) / P(A).     (19)
On the left hand side, the conditioning event is A while on the right hand side the conditioning
event is B. Bayes’ rule shows how to move from P(A|B) to P(B|A) and vice versa, that is to
“revert the conditioning”.
For a partition B1 , . . . , Bk of Ω, we obtain with the law of total probability a more general
version of Bayes’ rule,
P(Bi|A) = P(Bi ∩ A) / P(A)     (20)
        = P(A|Bi)P(Bi) / P(A)     (21)
        = P(A|Bi)P(Bi) / Σ_{j=1}^k P(A|Bj)P(Bj).     (22)

It can be seen that the posterior probability P(Bi |A) of Bi , after learning about event A, is
larger than the prior probability P(Bi ) (prior, or before observing A) if P(A|Bi ) is larger than
the weighted average of all P(A|Bj ).
Bayes’ rule has many important applications. A prototypical application is as follows: Sup-
pose that we observe an event A whose occurrence might be due to a number of mutually exclusive
causes B1 , . . . , Bk . If it is known for each cause Bi how probable it is to observe A, that is, if
the P (A|Bi ) are known, Bayes’ rule enables us to compute the “reverse” conditional probability
that Bi has indeed caused A, that is, P(Bi |A).

1.4 Examples
1.4.1 Coin tossing
A coin toss is perhaps the simplest example of a random experiment: The outcome space is
Ω = {H, T }, where H means that the outcome of the coin toss was heads while T means that

the outcome was tails; the event space is F = {{H}, {T }, Ω, ∅}, and the probability (measure)
is defined as

    P(E) = θ        if E = {H},
           1 − θ    if E = {T},
           1        if E = Ω,        (23)
           0        if E = ∅,
where θ ∈ [0, 1]. This probability space can be used to describe any random phenomenon with
a binary outcome: For example, whether the spin of a magnet is up or down, whether a bit is 0
or 1, or whether some statement holds or not.
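The coin-toss probability space can be written out explicitly in code. The following Python sketch is our own illustration (θ = 0.3 is an arbitrary choice): it enumerates the event space F and implements the measure P from Equation (23).

```python
theta = 0.3  # illustrative value of the heads probability, theta in [0, 1]

# Sample space Omega = {H, T} and event space F = {{H}, {T}, Omega, emptyset}
omega = frozenset({'H', 'T'})
events = [frozenset({'H'}), frozenset({'T'}), omega, frozenset()]

def P(E):
    """Probability measure from Equation (23)."""
    if E == frozenset({'H'}):
        return theta
    if E == frozenset({'T'}):
        return 1 - theta
    if E == omega:
        return 1.0
    return 0.0  # the empty event

# The axioms can be checked directly: P(Omega) = 1, and additivity gives
# P({H}) + P({T}) = P(Omega) since {H} and {T} are disjoint.
checks = (P(omega), P(frozenset({'H'})) + P(frozenset({'T'})))
```

Closure under complements is visible in the listing of F: the complement of every event in F is again in F.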

1.4.2 Noisy measurements


Another simple example is the noisy measurement of some state of nature. We can think that
nature may be randomly in a certain state or not (“Y” for “yes, it is in the state”, and “N” for
“no, it isn’t”), and that we measure its state using some imperfect apparatus or test, returning
Y or N . The outcome space is Ω = {Y Y, Y N, N Y, N N } where Y Y means that nature is truly
in the state considered and the outcome of the test is correct (true positive) while Y N would
mean that the test is false (nature is in the state, “Y”, while the test does not indicate so, “N”;
this is called a false negative). Similarly, N N corresponds to a true negative while N Y is a false
positive. As event space we can take the set of all subsets of Ω. We can assign probabilities to all
events by specifying the probabilities of the elementary outcomes ω ∈ Ω, that is, by using four
non-negative numbers θ1, θ2, θ3, θ4 that sum to one, as in Table 1.
We can compute the conditional probabilities for the test to return a certain result given the
state of nature. We denote the event that nature is in the state of interest by BY ,
BY = {Y Y, Y N }, (24)
while BN is the event that nature is not in the considered state
BN = {N Y, N N }. (25)
The probabilities of the two events are P(BY ) = θ1 + θ2 and P(BN ) = θ3 + θ4 = 1 − θ1 − θ2 .
Similarly, the event that the test returns Y corresponds to the set EY ,
EY = {Y Y, N Y }, (26)
while the set EN is the event that the test returns N ,
EN = {Y N, N N }. (27)
We thus have {Y Y } = BY ∩ EY and {N N } = BN ∩ EN and
P(EY|BY) = P(EY ∩ BY) / P(BY),    P(EN|BN) = P(EN ∩ BN) / P(BN),     (28)

or

P(EY|BY) = θ1 / (θ1 + θ2),    P(EN|BN) = θ4 / (θ3 + θ4).     (29)
The probability P(EY |BY ) is called the sensitivity of the test while the probability P(EN |BN )
is called the specificity. Since conditional probabilities also have to sum to one, the probability
P(EY |BN ) that the test says wrongly “Y ” equals 1 − P(EN |BN ) and is called type 1 error. The
probability P(EN |BY ) that the test says wrongly “N ” equals 1 − P(EY |BY ) and is called type 2
error.

Meaning          Nature   Measurement   Probability

True positive      Y          Y             θ1
False negative     Y          N             θ2
False positive     N          Y             θ3
True negative      N          N             θ4

Table 1: A noisy measurement of whether nature is in a certain state or not. The probabilities
sum to one, θ1 + θ2 + θ3 + θ4 = 1. Only three of the four θi thus need to be specified to define
the probability measure.

1.4.3 Bayes’ rule


Measurements generally do not mirror reality perfectly, so that the sensitivity and specificity of
a test are not equal to one. Bayes’ rule allows us to compute the probability that nature is in a
certain state given the test result. For example, assume that the test returns Y. The probability
that nature is indeed in the specific state is

P(EY |BY )P(BY )


P(BY |EY ) = (30)
P(EY |BY )P(BY ) + P(EY |BN )P(BN )
sensitivity · P(BY )
≡ (31)
sensitivity · P(BY ) + type 1 error · P(BN )

or in terms of the specificity,

P(EY |BY )P(BY )


P(BY |EY ) = (32)
P(EY |BY )P(BY ) + (1 − P(EN |BN ))P(BN )
sensitivity · P(BY )
≡ (33)
sensitivity · P(BY ) + (1 − specificity) · P(BN )

Assume, for example, that both the specificity and the sensitivity of a test for a medical condition
are 0.95. If the prior probability that a patient suffers from the medical condition is low, e.g.
P(BY) = 0.001 (one in a thousand), the posterior probability for the patient to have the
condition given that the test was positive equals only P(BY|EY) ≈ 0.019.

2 Random variables
Random variables are the main tools to model uncertain, or as the name suggests, random
quantities.

2.1 Definition
Given a probability space (Ω, F, P), a random variable x is a real-valued function defined on
Ω: x : Ω → Ωx ⊆ R. It can be thought of as an outcome of a probing measurement made
on a random phenomenon described by the probability space (Ω, F, P). While x is a known
(deterministic) function from one space to another, its uncertainty or randomness derives from
the uncertainty or randomness about its inputs: if the inputs ω ∈ Ω are not observed and are
selected randomly according to P, the corresponding outputs x(ω) ∈ Ωx do indeed appear random even

though x is a known function. It thus makes sense to talk about the probability for the occurrence
of events like x ≤ α, that is, of x ∈ (−∞, α].
Let Fx be an event space defined for Ωx . The probability that x takes a value in an event
E ∈ Fx can be determined by computing the probability of the set of all ω ∈ Ω which are mapped
to E,
P(x ∈ E) = P({ω ∈ Ω : x(ω) ∈ E}) (34)
For this equation to make sense, the set {ω ∈ Ω : x(ω) ∈ E} must be an element of F. This puts
a (mild) constraint on x. We assume that this condition is fulfilled for all mappings x and event
spaces Fx which follow.1
The above definition can be extended to vector valued functions x = (x1 , . . . , xn ): Ω → Ωx ⊆
Rn where Equation (34) becomes

P(x ∈ E) = P({ω ∈ Ω : x(ω) ∈ E}),     (35)

where E is an element of the event space Fx . Such vector valued functions are sometimes called
random vectors. They are essentially collections of random variables which are defined on the
same underlying probability space (Ω, F, P). In what follows we often do not verbally distinguish
between random vectors and random variables, that is x may be called a random variable.

2.2 Distribution of random variables


Computing the probabilities in Equation (35) for all events E ∈ Fx describes the (probabilistic)
behaviour of x, that is, its distribution, completely. Because of the properties of event spaces
and probabilities, which we reviewed in Section 1, it turns out that the computations can be
restricted to events of the form {x : x1 ≤ α1 , . . . , xn ≤ αn }. The corresponding probabilities
define the cumulative distribution function (cdf) Fx of x,

Fx (α) = P ({x : x1 ≤ α1 , . . . , xn ≤ αn }) (36)


= P ({ω ∈ Ω : x1 (ω) ≤ α1 , . . . , xn (ω) ≤ αn }) . (37)

Knowing Fx allows one to compute the probability of any event E ∈ Fx . Importantly, this can
be done without having to go back to the original probability space (Ω, F, P) that was used to
define x and its distribution. Ultimately, this results in a new “stand-alone” probability space.

2.3 Discrete and continuous random variables


A random variable x is called discrete if Ωx contains only countably many elements, that is, if x
takes at most countably many different values α1, α2, . . .. The distribution of x is then completely
described by knowing the probabilities for each of the events x = αk . These probabilities define
the probability mass function (pmf) px ,

px (α) = P(x = α). (38)

The probability for x ∈ E, for some E ∈ Fx , can then be computed as


P(x ∈ E) = Σ_{α∈E} px(α).     (39)

1 Courses on probability and measure theory give more details.

By the normalisation condition for probabilities, we obtain the condition that px must sum to
one,

Σ_k px(αk) = 1.     (40)

A random variable x is continuous if, for any event E ∈ Fx , the probability P(x ∈ E) can be
computed by an integral over a non-negative function px defined on Ωx ,
P(x ∈ E) = ∫_E px(α) dα.     (41)

The function is called the probability density function (pdf) of x. We use the same symbol for
both the probability mass function and the probability density function. By the normalisation
condition for probabilities, a pdf must integrate to one,
∫_{Ωx} px(α) dα = 1.     (42)

The pdf can be determined from the cdf Fx by taking partial derivatives,

px(α) = ∂^n Fx(α1, . . . , αn) / (∂α1 · · · ∂αn).     (43)
In what follows, we may call px the pdf of x whether it is a continuous or discrete random
variable.
It is often the case that notation is simplified and the pdf or pmf of x is denoted by p(x). In
this convention, x takes both the role of the random variable and the values it may take. The
context often makes it clear which is meant but, if not, the convention can lead to considerable
confusion; one is then better off resorting to the more verbose notation px to denote the pdf or
pmf, and px(α) to denote the value of px at α.
There are also random variables of mixed type which cannot be classified as either discrete
or continuous. The probability that x ∈ E is then computed by a combination of summation
and integration.

2.4 Conditional distributions and Bayes’ rule


Knowing the cumulative distribution function Fx in (36), that is the probabilities for events of the
form {x : x1 ≤ α1 , . . . , xn ≤ αn }, defines the distribution of the random variables x completely.
Typically, we have some information about some of the random variables x = (x1 , . . . , xn ), or
we learn about them over time. We thus want to be able to adjust the probabilities of the
events {x : x1 ≤ α1 , . . . , xn ≤ αn } in light of new evidence. This is the purpose of conditional
distributions.
If new information about x comes in the form of general events E ∈ Fx , updating of the
distribution of x is best done by working with the cumulative distribution function Fx and the
rules for conditioning of probabilities reviewed in Section 1.2.
If we observe the values of a subset of the random variables, it is generally easier to work
with the pmf or pdf px . Let x = (x1 , x2 ). For discrete random variables, the conditional pmf
px2|x1 is defined as

px2|x1(α2|α1) = px(α1, α2) / px1(α1)     (44)

for all α1 where the marginal pmf px1(α1) = Σ_{α2} px(α1, α2) > 0. The conditional is left
undefined for α1 where px1(α1) = 0. This definition is in line with the definition of conditional
probability in Section 1.2. We also have a corresponding product rule

px (α1 , α2 ) = px2 |x1 (α2 |α1 )px1 (α1 ). (45)

As in Section 1.2, the product rule is valid for all α1 , α2 even though px2 |x1 is left undefined for
those α1 where px1 (α1 ) = 0.
For continuous random variables, the conditional pdf px2 |x1 is defined as

px2|x1(α2|α1) = px(α1, α2) / px1(α1)     (46)

for all α1 where the marginal pdf px1(α1) = ∫ px(α1, α2) dα2 > 0. For α1 where px1(α1) = 0,
the conditional pdf is left unspecified. In fact, we are free to define it as we wish as long as it is
a valid pdf. As before, we obtain the product rule

px (α1 , α2 ) = px2 |x1 (α2 |α1 )px1 (α1 ). (47)

Since px (α1 , α2 ) ≥ 0, it follows that px1 (α1 ) = 0 implies that px (α1 , α2 ) = 0 for all α2 . Hence,
as before, the product rule is valid for all (α1 , α2 ) and the non-uniqueness of px2 |x1 for those α1
where px1 (α1 ) = 0 is irrelevant.
In simplified notation, the above equations become

p(x2|x1) = p(x1, x2) / p(x1)     (48)
p(x1, x2) = p(x2|x1)p(x1).     (49)

The equations remain valid for random variables of mixed type.


The conditional pdf (pmf) is a valid pdf (pmf) for x2. In particular, it satisfies the normalisation
condition for all values of x1,

Σ_{α2∈Ω2} p(α2|x1) = 1,    ∫_{Ω2} p(α2|x1) dα2 = 1,     (50)

where Ω2 denotes the sample space of x2 . It is thus possible to use the product rule to define
the joint distribution of (x1 , x2 ) by separately specifying the marginal distribution of x1 and the
conditional distribution of x2 given x1 .
Finally, as in Section 1, the product rule yields

p(x1|x2) = p(x2|x1)p(x1) / p(x2),     (51)

which is Bayes’ rule for random variables.

2.5 Examples
2.5.1 Bernoulli random variable
Let (Ω, F, P) be the probability space from the coin tossing example (Section 1.4.1) and x the
mapping from Ω to {−1, 1} where “heads” is assigned to 1 and “tails” to −1. The mapping is

deterministic but since we do not see the coin flips, the output appears random: P(x = 1) =
p(1) = P({H}) = θ and P(x = −1) = p(−1) = P({T}) = 1 − θ.
This is not the only way to construct binary random variables. There are other mappings x′
and probability spaces (Ω′, F′, P′) which result in the same probability distribution. For example,
let Ω′ = [0, 1], F′ the event space containing all intervals of the form [a, b), 0 ≤ a ≤ b ≤ 1, and
P′([a, b)) = b − a. Then, the mapping x′ with ω ↦ 1 if ω ≤ θ and ω ↦ −1 if ω > θ has
P(x′ = 1) = P′({ω : 0 ≤ ω ≤ θ}) = θ as well.
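The second construction is easy to simulate: draw ω uniformly from [0, 1] and push it through x′. A minimal Python sketch (θ = 0.4 is an arbitrary illustrative value):

```python
import random

random.seed(1)
theta = 0.4   # illustrative success probability
n = 100_000

def x_prime(omega):
    # The mapping from the text: omega <= theta -> 1, otherwise -> -1
    return 1 if omega <= theta else -1

# Draw omega uniformly from [0, 1] and push it through x'
samples = [x_prime(random.random()) for _ in range(n)]
freq_one = samples.count(1) / n   # empirical estimate of P(x' = 1)
```

The empirical frequency of x′ = 1 approaches θ, in agreement with P(x′ = 1) = θ; only the distribution of x′, not the underlying probability space, is visible from the samples.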

2.5.2 Beta random variable


Figure 1 shows a plot of the function b(u, v),
b(u, v) = u / (u + v),     (52)
for u > 0 and v > 0. It can be seen that b ∈ (0, 1). While the mapping is deterministic, if we
let u and v take random values, b will take random values as well. Let u and v be the sum of
squared independent Gaussian (standard normal) random variables,

u = Σ_{i=1}^{2α} Zi²,    v = Σ_{i=1}^{2β} Z′i²,     (53)

where the first sum goes over 2α terms and the second over 2β terms. Figure 2(a) shows a
scatter plot of some random values that (u, v) take and Figure 2(b) shows a histogram of the
corresponding values of b. The distribution of b is called the “beta distribution”. It takes different
shapes for different values of α and β because u and v behave differently even though the mapping
(u, v) 7→ b(u, v) stays the same. If a random variable x has a beta distribution, it is denoted by
x ∼ Beta(α, β). Its pdf is given by

p(x|α, β) = Γ(α + β)/(Γ(α)Γ(β)) x^(α−1) (1 − x)^(β−1),     (54)

for x ∈ (0, 1), and where Γ(.) is the gamma function,


Γ(t) = ∫_0^∞ y^(t−1) exp(−y) dy.     (55)

The term Γ(α + β)/(Γ(α)Γ(β)) is needed to normalise p(x|α, β).
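The role of the normalising constant can be verified numerically. The sketch below (our own check, with the illustrative choice α = 2, β = 5) evaluates the constant with Python's `math.gamma` and integrates the density over (0, 1) by the midpoint rule:

```python
import math

alpha, beta_ = 2.0, 5.0   # illustrative parameters
const = math.gamma(alpha + beta_) / (math.gamma(alpha) * math.gamma(beta_))

def beta_pdf(x):
    # Equation (54): Gamma(a+b)/(Gamma(a)Gamma(b)) x^(a-1) (1-x)^(b-1)
    return const * x ** (alpha - 1) * (1 - x) ** (beta_ - 1)

# Midpoint-rule integration over (0, 1); should come out close to one
n = 10_000
integral = sum(beta_pdf((i + 0.5) / n) for i in range(n)) / n
```

For these parameters the constant equals Γ(7)/(Γ(2)Γ(5)) = 720/24 = 30, and the numerical integral is one up to discretisation error.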

3 Models
We here explain the (fine) difference between probabilistic, statistical, and Bayesian models.
They are often conflated, and people may use “statistical model” or “probabilistic model” to
refer to any one of them.

3.1 Probabilistic models


A probabilistic, or probability model of some random phenomenon formally corresponds to a
probability space (Ω, F, P). Most often, one works with random variables and the probability
density (mass) function that corresponds to P. Furthermore, the sample space and the event
space are also often not explicitly indicated but implied by the context.

[Surface plot omitted; axes u, v and b ∈ (0, 1).]

Figure 1: A plot of the function b(u, v) = u/(u + v) which can be used to generate beta-distributed
random variables.

[Plots omitted: (a) scatter plot of u and v; (b) histogram of b = u/(u + v).]

Figure 2: If u is obtained by squaring and summing 2α standard normal random variables,
and v in the same way by squaring and summing other 2β standard normal random variables,
b(u, v) = u/(u + v) follows a beta distribution with parameters (α, β).
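The construction behind Figure 2 can be recreated numerically. The sketch below (plain Python with illustrative parameters α = 3, β = 2) draws u and v as sums of squared standard normals and checks that b = u/(u + v) has the mean α/(α + β) of a Beta(α, β) random variable:

```python
import random

random.seed(2)
alpha, beta_ = 3, 2    # illustrative parameters
n = 50_000

def draw_b():
    # u: sum of 2*alpha squared standard normals; v: likewise with 2*beta_
    u = sum(random.gauss(0, 1) ** 2 for _ in range(2 * alpha))
    v = sum(random.gauss(0, 1) ** 2 for _ in range(2 * beta_))
    return u / (u + v)

samples = [draw_b() for _ in range(n)]
mean_b = sum(samples) / n
# A Beta(alpha, beta) random variable has mean alpha / (alpha + beta)
expected_mean = alpha / (alpha + beta_)
```

Plotting a histogram of `samples` would reproduce Figure 2(b); changing α and β changes the shape of the histogram even though the mapping (u, v) ↦ b stays the same.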

For example, the Bernoulli random variable from Section 2.5.1 with success probability 1/2 is
a probabilistic model of coin tosses. A probabilistic model for a wide range of random phenomena
is the Gaussian random variable with probability density function

p(x) = 1/√(2π) exp(−x²/2).     (56)
More generally, a Gaussian random variable with known mean µ0 and known variance σ0² is also
a probabilistic model. The corresponding density is

p(x) = 1/√(2πσ0²) exp(−(x − µ0)²/(2σ0²)).     (57)

3.2 Statistical models


A statistical model is a collection, or set, of probability measures defined on the same sample
space Ω. In other words, a statistical model is a set of random variables that are defined on the
same domain.
A parametric statistical model is a set of random variables xθ parametrised by θ ∈ Θ ⊆ Rd .
This means that for each value of the parameters θ, xθ is a (possibly vector valued) random
variable as defined in Section 2 with probability density (mass) function p(.|θ).2 Furthermore,
it is common to drop the “parametric” and just refer to statistical models instead of parametric
statistical models.
For example, the collection of Bernoulli random variables parametrised by the success prob-
ability θ is a statistical model of coin tosses. The set of Gaussian random variables parametrised
by the mean µ and, possibly, the variance σ² is also a statistical model. The set (or family) of
probability density functions p(.|θ), with θ = (µ, σ²), is

p(x|θ) = 1/√(2πσ²) exp(−(x − µ)²/(2σ²)).     (58)
For a probabilistic model the mean and variance are fixed, but for the statistical model, they
are free parameters. We typically use data to pick a member of the family {p(.|θ)}θ∈Θ by
determining a suitable value for θ. We then say that we learn the parameters, or equivalently,
that we estimate the (statistical) model. The outcome of the learning or estimation process is a
probabilistic model.
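As a small illustration of this estimation step, the sketch below (our own example; θ = 0.7 plays the role of the unknown parameter) fits the Bernoulli model by maximum likelihood, whose estimate for this model is the sample mean (a standard result):

```python
import random

random.seed(3)
true_theta = 0.7          # unknown in practice; used here only to simulate data
data = [1 if random.random() < true_theta else 0 for _ in range(10_000)]

# For p(x|theta) = theta^x (1 - theta)^(1-x), the maximum likelihood
# estimate of theta is the sample mean. Learning theta picks one member
# of the family {p(.|theta)}, i.e. a probabilistic model.
theta_hat = sum(data) / len(data)
```

Before seeing the data we have a statistical model (the whole family indexed by θ); after estimation we are left with the single probabilistic model p(.|θ̂).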

3.3 Bayesian models


A Bayesian model is obtained by combining a statistical model with a (prior) probability dis-
tribution for the parameters θ. Each member of the family {p(.|θ)}θ∈Θ is considered to be a
conditional pdf (or pmf), as implied by the notation p(.|θ). Together with the prior pdf for θ, the
family of conditional pdfs thus defines a single joint pdf p(.|θ)pθ . Assuming that the conditional
pdfs p(.|θ) are defined on Ωx , a Bayesian model thus formally corresponds to a probabilistic
model defined on Ωx × Θ.
For statistical models, we used xθ to denote the random variable that corresponds to the pdf
p(.|θ) for a given value of θ. When θ is considered a random variable, we often use the notation
x|θ instead. We thus say that x|θ has the conditional pdf p(.|θ). Moreover, we associate
the random variables (x, θ) with the joint pdf p(.|θ)pθ , which is often written more simply as
p(x, θ) = p(x|θ)p(θ).
2 The notation p(.; θ) is often used instead of p(.|θ), in particular in a non-Bayesian context.

3.4 Examples
3.4.1 Statistical model for binary data
Assume you have developed a new procedure which you think is faster or in other ways better
than existing ones. The procedure could, for example, be a new classification algorithm, a new
kind of measurement protocol, or a new treatment of an ailment. However, you also noticed
that sometimes, the new procedure performs worse than the existing ones. The situation can
be modelled using a binary random variable x where x = 1 means that the new procedure is
performing better, while x = 0 means that it is performing worse. The probability that the
procedure is a success is here the parameter θ, and the statistical model is specified by p(x|θ),

p(x|θ) = θ^x (1 − θ)^(1−x)     (59)

with x ∈ {0, 1} and θ ∈ [0, 1].

3.4.2 Statistical model for proportions


You decide to prepare a number of tests and compute the proportion f ∈ [0, 1] of successes to
assess the efficacy of the new procedure. A popular model for the unknown proportion f is a
beta random variable parametrised by α and β, see Section 2.5.2. The family of pdfs is
p(f|α, β) = 1/Z(α, β) f^(α−1) (1 − f)^(β−1),     (60)
where
Z(α, β) = Γ(α)Γ(β) / Γ(α + β)     (61)
is called the partition function. It ensures that p(f |α, β) integrates to one for all values of the
parameters α and β.

3.4.3 Bayesian model for binary data


We can turn the statistical model in (59) into a Bayesian model by attaching a probability
distribution to θ. A common choice is to use a beta-distribution, i.e. we assume that
p(θ) = p(θ|α0, β0) = 1/Z(α0, β0) θ^(α0−1) (1 − θ)^(β0−1),     (62)
where α0 and β0 are assumed fixed. The joint distribution p(x, θ) of (x, θ) is thus
p(x, θ) = 1/Z(α0, β0) θ^x (1 − θ)^(1−x) θ^(α0−1) (1 − θ)^(β0−1)     (63)
        = 1/Z(α0, β0) θ^x θ^(α0−1) (1 − θ)^(1−x) (1 − θ)^(β0−1)     (64)
        = 1/Z(α0, β0) θ^(x+α0−1) (1 − θ)^(β0−x),     (65)
and it is defined on {0, 1} × [0, 1].
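Viewed as a function of θ for fixed x, Equation (65) is proportional to θ^(x+α0−1)(1 − θ)^(β0−x), so the posterior p(θ|x) is again a beta distribution, Beta(α0 + x, β0 + 1 − x): the beta prior is conjugate to the Bernoulli likelihood. The sketch below (our own check, with the illustrative values α0 = 2, β0 = 3, x = 1) verifies the posterior mean numerically:

```python
# Unnormalised posterior p(theta|x) ∝ theta^(x+a0-1) (1-theta)^(b0-x)
alpha0, beta0 = 2.0, 3.0   # illustrative hyperparameter values
x = 1                      # observed outcome

n = 50_000
grid = [(i + 0.5) / n for i in range(n)]
unnorm = [t ** (x + alpha0 - 1) * (1 - t) ** (beta0 - x) for t in grid]

# Posterior mean by midpoint-rule integration over theta in (0, 1)
Z = sum(unnorm)
post_mean = sum(t * w for t, w in zip(grid, unnorm)) / Z

# Analytic mean of the conjugate posterior Beta(alpha0 + x, beta0 + 1 - x),
# whose mean is (alpha0 + x) / (alpha0 + x + beta0 + 1 - x)
analytic_mean = (alpha0 + x) / (alpha0 + beta0 + 1)
```

The numerical posterior mean agrees with the analytic value, here (2 + 1)/(2 + 3 + 1) = 0.5; observing x = 1 shifts the prior mean α0/(α0 + β0) = 0.4 towards one.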
We here assumed that α0 and β0 are known and fixed. If they are unknown and we consider
them free parameters, we can formulate a statistical model
p(x, θ|α, β) = 1/Z(α, β) θ^(x+α−1) (1 − θ)^(β−x),     (66)

which one can again turn into a Bayesian or probabilistic model by attaching a (prior) probability
distribution to α and β. The parameters α and β are sometimes called hyperparameters, the
assumed prior a hyperprior, and the resulting model a hierarchical Bayesian model.
