Probabilistic Modelling Primer
Michael Gutmann
Institute for Adaptive and Neural Computation
School of Informatics, University of Edinburgh
[email protected]
January 8, 2020
Abstract
We give a brief introduction to probability and probabilistic modelling. The document is a
refresher; it is assumed that the reader has some prior knowledge about the topic.
Contents
1 Probability
  1.1 Probability space
  1.2 Conditional probability
  1.3 Bayes’ rule
  1.4 Examples
2 Random variables
  2.1 Definition
  2.2 Distribution of random variables
  2.3 Discrete and continuous random variables
  2.4 Conditional distributions and Bayes’ rule
  2.5 Examples
3 Models
  3.1 Probabilistic models
  3.2 Statistical models
  3.3 Bayesian models
  3.4 Examples
1 Probability
1.1 Probability space
Random or uncertain phenomena can be mathematically described using probability theory
where a fundamental quantity is the probability space. A probability space consists of three
elements: the sample space Ω, the event space F, and the probability (measure) P.
1. Ω is the set of all possible elementary outcomes of the phenomenon of interest. In the
literature, the random phenomenon of interest is usually considered to be some kind of
“experiment” whose outcome is uncertain. Ω is then the collection of all possible elementary,
i.e. finest-grain and distinguishable outcomes of the experiment.
2. F is the collection of all events (subsets of Ω) whose probability to occur one might want
to compute.
3. The probability P measures the plausibility of each event E ∈ F, assigning to it a number
between zero (most implausible/improbable) and one (most plausible/probable).
Probabilities are thus non-negative,
P(E) ≥ 0 ∀E ∈ F (1)
and normalised, P(Ω) = 1, which means that the maximal probability for an event to occur is one.
For the definition of the probability space to be consistent, the event space F needs to satisfy
some basic conditions: If we can compute the probability for an event E to occur, we should also
be able to compute the probability for the event E not to occur. That is:
If E ∈ F then Ē = Ω \ E ∈ F (2)
This property is called closure under complements. Further, if we can compute the probability
for E1 , E2 , . . . individually to occur, we should also be able to compute the probability that any
of the events occurs. That is:

If E1, E2, . . . ∈ F then ∪i Ei ∈ F (3)

This property is called closure under countable unions. The third condition is that
Ω ∈ F, (4)

that is, the sample space itself is an event. Finally, the probability measure P is countably additive: for pairwise disjoint events E1, E2, . . . ∈ F,

P(∪i Ei) = Σi P(Ei). (5)
Note that on the left hand side, we have the probability of the event ∪i Ei , while on the right
hand side, we have the (possibly infinite) sum over all the probabilities of the individual events
Ei . The equation says that the probability for ∪i Ei can be computed by summing up all P(Ei ).
For two events A, B ∈ F with A ⊆ B, it follows from the above that we must have P(A) ≤
P(B). This is because for A ⊆ B we can express B as B = A ∪ (B \ A) where A and B \ A
are disjoint, so that P(B) = P(A) + P(B \ A). Since P(B \ A) is non-negative, we must have
P(A) ≤ P(B). This result is known as monotonicity of probability. A further consequence is that
P(A ∩ B) ≤ P(A) and P(A ∩ B) ≤ P(B) for any A, B ∈ F.
where we used Equation (9). The equation

P(A ∩ B) = Σ_{i=1}^k P(A ∩ Bi) (16)

holds for events B1, . . . , Bk that partition B, from where the so-called Bayes’ rule follows (for P(A) > 0):

P(B|A) = P(A|B) P(B) / P(A) (19)
On the left hand side, the conditioning event is A while on the right hand side the conditioning
event is B. Bayes’ rule shows how to move from P(A|B) to P(B|A) and vice versa, that is to
“revert the conditioning”.
For a partition B1 , . . . , Bk of Ω, we obtain with the law of total probability a more general
version of Bayes’ rule,
P(Bi|A) = P(Bi ∩ A) / P(A) (20)
        = P(A|Bi) P(Bi) / P(A) (21)
        = P(A|Bi) P(Bi) / Σ_{j=1}^k P(A|Bj) P(Bj). (22)
It can be seen that the posterior probability P(Bi |A) of Bi , after learning about event A, is
larger than the prior probability P(Bi ) (prior, or before observing A) if P(A|Bi ) is larger than
the weighted average of all P(A|Bj ).
Bayes’ rule has many important applications. A prototypical application is as follows: Suppose that we observe an event A whose occurrence might be due to a number of mutually exclusive
causes B1 , . . . , Bk . If it is known for each cause Bi how probable it is to observe A, that is, if
the P(A|Bi) are known, Bayes’ rule enables us to compute the “reverse” conditional probability
that Bi has indeed caused A, that is, P(Bi |A).
1.4 Examples
1.4.1 Coin tossing
A coin toss is perhaps the simplest example of a random experiment: The outcome space is
Ω = {H, T }, where H means that the outcome of the coin toss was heads while T means that
the outcome was tails; the event space is F = {{H}, {T }, Ω, ∅}, and the probability (measure)
is defined as
P(E) =  θ       if E = {H}
        1 − θ   if E = {T}
        1       if E = Ω        (23)
        0       if E = ∅,
where θ ∈ [0, 1]. This probability space can be used to describe any random phenomenon with a binary outcome: for example, whether the spin of a magnet is up or down, whether a bit is 0 or 1, or whether some statement holds or not.
Meaning          Nature   Measurement   Probability
True positive    Y        Y             θ1
False negative   Y        N             θ2
False positive   N        Y             θ3
True negative    N        N             θ4
Table 1: A noisy measurement whether nature is in a certain state or not. The probabilities
sum to one, θ1 + θ2 + θ3 + θ4 = 1. Only three of the four θi thus need to be specified to define
the probability measure.
Assume, for example, that both the specificity and the sensitivity of a test for a medical condition are 0.95. If the prior probability that a patient suffers from the medical condition is low, e.g. P(BY) = 0.001 (one in a thousand), the posterior probability for the patient to have the condition given that the test was positive is only P(BY |EY ) ≈ 0.019.
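These numbers can be verified with a short computation; writing sensitivity = P(EY |BY ), specificity = P(EN |BN ), and prior = P(BY ), Bayes’ rule in the form of Equation (22) gives:

```python
def posterior_positive(prior, sensitivity, specificity):
    """P(B_Y | E_Y): probability of the condition given a positive test,
    via Bayes' rule with the two-event partition {B_Y, B_N}."""
    # P(E_Y) = P(E_Y|B_Y) P(B_Y) + P(E_Y|B_N) P(B_N)
    p_positive = sensitivity * prior + (1.0 - specificity) * (1.0 - prior)
    # P(B_Y|E_Y) = P(E_Y|B_Y) P(B_Y) / P(E_Y)
    return sensitivity * prior / p_positive

print(round(posterior_positive(prior=0.001, sensitivity=0.95,
                               specificity=0.95), 3))  # 0.019
```

A sensitivity and specificity of 0.95 thus still yield a posterior below two percent, because false positives from the large fraction of healthy patients dominate the positive test results.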
2 Random variables
Random variables are the main tools to model uncertain, or as the name suggests, random
quantities.
2.1 Definition
Given a probability space (Ω, F, P), a random variable x is a real-valued function defined on
Ω: x : Ω → Ωx ⊆ R. It can be thought of as an outcome of a probing measurement made
on a random phenomenon described by the probability space (Ω, F, P). While x is a known
(deterministic) function from one space to another, its uncertainty or randomness derives from the uncertainty or randomness about its inputs: if the inputs ω ∈ Ω are not observed and are selected randomly according to P, the corresponding outputs x(ω) ∈ Ωx do indeed appear random even
though x is a known function. It thus makes sense to talk about the probability for the occurrence
of events like x ≤ α, that is, of x ∈ (−∞, α].
Let Fx be an event space defined for Ωx . The probability that x takes a value in an event
E ∈ Fx can be determined by computing the probability of the set of all ω ∈ Ω which are mapped
to E,
P(x ∈ E) = P({ω ∈ Ω : x(ω) ∈ E}) (34)
For this equation to make sense, the set {ω ∈ Ω : x(ω) ∈ E} must be an element of F. This puts
a (mild) constraint on x. We assume that this condition is fulfilled for all mappings x and event
spaces Fx which follow.
The above definition can be extended to vector-valued functions x = (x1, . . . , xn): Ω → Ωx ⊆ R^n, where Equation (34) becomes

P(x ∈ E) = P({ω ∈ Ω : x(ω) ∈ E}), (35)

where E is an element of the event space Fx. Such vector-valued functions are sometimes called
random vectors. They are essentially collections of random variables which are defined on the
same underlying probability space (Ω, F, P). In what follows we often do not verbally distinguish
between random vectors and random variables, that is x may be called a random variable.
Knowing Fx allows one to compute the probability of any event E ∈ Fx . Importantly, this can
be done without having to go back to the original probability space (Ω, F, P) that was used to
define x and its distribution. Ultimately, this results in a new “stand-alone” probability space.
By the normalisation condition for probabilities, we obtain the condition that px must sum to
one,
Σ_{k=1}^∞ px(αk) = 1. (40)
A random variable x is continuous if, for any event E ∈ Fx , the probability P(x ∈ E) can be
computed by an integral over a non-negative function px defined on Ωx ,
P(x ∈ E) = ∫_E px(α) dα. (41)
The function is called the probability density function (pdf) of x. We use the same symbol for
both the probability mass function and the probability density function. By the normalisation
condition for probabilities, a pdf must integrate to one,
∫_{Ωx} px(α) dα = 1. (42)
The pdf can be determined from the cdf Fx by taking partial derivatives,
px(α) = ∂^n Fx(α1, . . . , αn) / (∂α1 · · · ∂αn). (43)
In what follows, we may call px the pdf of x whether it is a continuous or discrete random
variable.
It is often the case that notation is simplified and the pdf or pmf of x is denoted by p(x). In this convention, x takes both the role of the random variable and of the values it may take. The context often makes it clear which is meant; if not, the convention can lead to considerable confusion, in which case it is better to resort to the more verbose notation px for the pdf or pmf, and px(α) for the value of px at α.
There are also mixed-type random variables which cannot be classified as either discrete or continuous. The probability that x ∈ E is then computed by a combination of summation and integration.
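As a numerical illustration (not part of the original text), the normalisation conditions (40) and (42) can be checked for a Bernoulli pmf and for the Beta(2, 2) pdf p(x) = 6x(1 − x) on Ωx = [0, 1], with the integral approximated by a midpoint Riemann sum:

```python
# Bernoulli pmf with theta = 0.3 (hypothetical value): Equation (40)
theta = 0.3
pmf = {1: theta, -1: 1.0 - theta}
print(abs(sum(pmf.values()) - 1.0) < 1e-12)  # True

# Beta(2, 2) pdf p(x) = 6 x (1 - x) on [0, 1]: Equation (42),
# with the integral approximated by a midpoint Riemann sum
n = 100_000
dx = 1.0 / n
integral = sum(6.0 * x * (1.0 - x) * dx
               for x in ((k + 0.5) * dx for k in range(n)))
print(abs(integral - 1.0) < 1e-6)  # True
```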
2.4 Conditional distributions and Bayes’ rule
For discrete random variables, the conditional pmf px2|x1 is defined as

px2|x1(α2|α1) = px(α1, α2) / px1(α1) (44)

for all α1 where the marginal pmf px1(α1) = Σ_{α2} px(α1, α2) > 0. The conditional is left undefined for α1 where px1(α1) = 0. This definition is in line with the definition of conditional probability in Section 1.2. We also have a corresponding product rule,

px(α1, α2) = px2|x1(α2|α1) px1(α1). (45)

As in Section 1.2, the product rule is valid for all α1, α2 even though px2|x1 is left undefined for those α1 where px1(α1) = 0.
For continuous random variables, the conditional pdf px2 |x1 is defined as
px2|x1(α2|α1) = px(α1, α2) / px1(α1) (46)

for all α1 where the marginal pdf px1(α1) = ∫ px(α1, α2) dα2 > 0. For α1 where px1(α1) = 0, the conditional pdf is left unspecified. In fact, we are free to define it as we wish as long as it is a valid pdf. As before, we obtain the product rule

px(α1, α2) = px2|x1(α2|α1) px1(α1). (47)
Since px (α1 , α2 ) ≥ 0, it follows that px1 (α1 ) = 0 implies that px (α1 , α2 ) = 0 for all α2 . Hence,
as before, the product rule is valid for all (α1 , α2 ) and the non-uniqueness of px2 |x1 for those α1
where px1 (α1 ) = 0 is irrelevant.
In simplified notation, the above equations become

p(x2|x1) = p(x1, x2) / p(x1) (48)
p(x1, x2) = p(x2|x1) p(x1), (49)

and the sum rule for marginalisation reads

p(x1) = Σ_{x2 ∈ Ω2} p(x1, x2), (50)

where Ω2 denotes the sample space of x2. It is thus possible to use the product rule to define the joint distribution of (x1, x2) by separately specifying the marginal distribution of x1 and the conditional distribution of x2 given x1.
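This construction can be sketched in code: a marginal p(x1) and a conditional p(x2|x1) (the numbers below are hypothetical, chosen for illustration) define the joint, and drawing x1 first and then x2 given x1 samples from it:

```python
import random

random.seed(0)

# marginal of x1 and conditional of x2 given x1 (hypothetical numbers)
p_x1 = {0: 0.4, 1: 0.6}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1},
                 1: {0: 0.2, 1: 0.8}}

def sample_joint():
    """Draw (x1, x2) from p(x1, x2) = p(x2|x1) p(x1), first x1 then x2."""
    x1 = random.choices([0, 1], weights=[p_x1[0], p_x1[1]])[0]
    x2 = random.choices([0, 1], weights=[p_x2_given_x1[x1][0],
                                         p_x2_given_x1[x1][1]])[0]
    return x1, x2

# the product rule gives p(x1 = 1, x2 = 1) = 0.8 * 0.6 = 0.48;
# the empirical frequency should agree
samples = [sample_joint() for _ in range(100_000)]
freq = sum(1 for s in samples if s == (1, 1)) / len(samples)
print(abs(freq - 0.48) < 0.01)  # True
```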
Finally, as in Section 1, the product rule yields Bayes’ rule for random variables,

p(x1|x2) = p(x2|x1) p(x1) / p(x2). (51)
2.5 Examples
2.5.1 Bernoulli random variable
Let (Ω, F, P) be the probability space from the coin tossing example (Section 1.4.1) and x the mapping from Ω to {−1, 1} where “heads” is assigned to 1 and “tails” to −1. The mapping is
deterministic but since we do not see the coin flips, the output appears random: P(x = 1) =
p(1) = P(H) = θ and P(x = −1) = p(−1) = P(T ) = 1 − θ.
This is not the only way to construct binary random variables. There are other mappings x′ and probability spaces (Ω′, F′, P′) which result in the same probability distribution. For example, let Ω′ = [0, 1], F′ the event space containing all intervals of the form [a, b), 0 ≤ a ≤ b ≤ 1, and P′([a, b)) = b − a. Then, the mapping x′ with ω ↦ 1 if ω ≤ θ and ω ↦ −1 if ω > θ has P(x′ = 1) = P′({ω : 0 ≤ ω < θ}) = θ as well.
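The second construction lends itself directly to simulation; a minimal sketch, assuming θ = 0.3:

```python
import random

random.seed(1)
theta = 0.3  # hypothetical success probability

def x_prime(omega):
    """The mapping x' from the example: omega in [0, 1] -> {-1, 1}."""
    return 1 if omega <= theta else -1

# omega ~ P' is uniform on [0, 1]; the induced distribution of x' is Bernoulli
samples = [x_prime(random.random()) for _ in range(100_000)]
freq_one = samples.count(1) / len(samples)
print(abs(freq_one - theta) < 0.01)  # True
```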
where the first sum goes over 2α terms and the second over 2β terms. Figure 2(a) shows a
scatter plot of some random values that (u, v) take and Figure 2(b) shows a histogram of the
corresponding values of b. The distribution of b is called the “beta distribution”. It takes different
shapes for different values of α and β because u and v behave differently even though the mapping
(u, v) 7→ b(u, v) stays the same. If a random variable x has a beta distribution, it is denoted by
x ∼ Beta(α, β). Its pdf is given by
p(x|α, β) = Γ(α + β)/(Γ(α)Γ(β)) x^{α−1} (1 − x)^{β−1}, x ∈ [0, 1]. (54)
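Samples of b can be generated accordingly; a sketch using gamma-distributed u and v (which subsumes the sums-of-squares construction referred to above), whose ratio u/(u + v) follows a Beta(α, β) distribution:

```python
import random

random.seed(2)
alpha, beta = 2.0, 5.0  # hypothetical shape parameters

def sample_b():
    """b = u/(u + v) with u ~ Gamma(alpha, 1) and v ~ Gamma(beta, 1);
    the ratio is Beta(alpha, beta) distributed."""
    u = random.gammavariate(alpha, 1.0)
    v = random.gammavariate(beta, 1.0)
    return u / (u + v)

samples = [sample_b() for _ in range(100_000)]
mean = sum(samples) / len(samples)
# the Beta(alpha, beta) mean is alpha / (alpha + beta)
print(abs(mean - alpha / (alpha + beta)) < 0.01)  # True
```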
3 Models
We here explain the (fine) difference between probabilistic, statistical, and Bayesian models.
They are often conflated, and people may use “statistical model” or “probabilistic model” to refer to any one of them.
Figure 1: A plot of the function b(u, v) = u/(u+v) which can be used to generate beta-distributed
random variables.
Figure 2: (a) Scatter plot of sampled values of (u, v). (b) Histogram of the corresponding values of b = u/(u + v).
For example, the Bernoulli random variable from Section 2.5.1 with success probability 1/2 is
a probabilistic model of coin tosses. A probabilistic model for a wide range of random phenomena
is the Gaussian random variable with probability density function
p(x) = 1/√(2π) exp(−x²/2). (56)
More generally, a Gaussian random variable with known mean µ0 and known variance σ02 is also
a probabilistic model. The corresponding density is
p(x) = 1/√(2πσ0²) exp(−(x − µ0)²/(2σ0²)). (57)
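For concreteness, the density in Equation (57) can be evaluated directly; a minimal sketch:

```python
import math

def gaussian_pdf(x, mu=0.0, var=1.0):
    """Gaussian density with mean mu and variance var, as in Equation (57)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# the standard Gaussian density at its mode equals 1 / sqrt(2 pi)
print(round(gaussian_pdf(0.0), 4))  # 0.3989
```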
3.4 Examples
3.4.1 Statistical model for binary data
Assume you have developed a new procedure which you think is faster or in other ways better
than existing ones. The procedure could, for example, be a new classification algorithm, a new
kind of measurement protocol, or a new treatment of an ailment. However, you also noticed
that sometimes, the new procedure performs worse than the existing ones. The situation can
be modelled using a binary random variable x where x = 1 means that the new procedure is
performing better, while x = 0 means that it is performing worse. The probability that the
procedure is a success is here the parameter θ, and the statistical model is specified by p(x|θ) = θ^x (1 − θ)^{1−x} for x ∈ {0, 1} and θ ∈ [0, 1].
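As a sketch, the Bernoulli pmf p(x|θ) = θ^x (1 − θ)^{1−x} can be coded together with the maximum-likelihood estimate of θ, which, as a standard result, is the fraction of successes; the data below are made up:

```python
def bernoulli_pmf(x, theta):
    """p(x | theta) = theta**x * (1 - theta)**(1 - x) for x in {0, 1}."""
    return theta ** x * (1.0 - theta) ** (1 - x)

def mle(data):
    """Maximum-likelihood estimate of theta: the fraction of successes."""
    return sum(data) / len(data)

outcomes = [1, 1, 0, 1, 1, 0, 1, 1]  # made-up trial results
theta_hat = mle(outcomes)
print(theta_hat)  # 0.75
```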
which one can again turn into a Bayesian or probabilistic model by attaching a (prior) probability
distribution to α and β. The parameters α and β are sometimes called hyperparameters, the
assumed prior a hyperprior, and the resulting model a hierarchical Bayesian model.
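Ancestral sampling from such a hierarchical model can be sketched as follows; the uniform hyperprior on α and β is a hypothetical choice made for illustration, and θ ∼ Beta(α, β) is generated with the gamma-ratio construction from Section 2.5:

```python
import random

random.seed(3)

def sample_hierarchical():
    """One ancestral sample: hyperprior -> prior -> observation."""
    # hyperprior on the hyperparameters (hypothetical uniform choice)
    alpha = random.uniform(0.5, 5.0)
    beta = random.uniform(0.5, 5.0)
    # prior: theta ~ Beta(alpha, beta), via the gamma-ratio construction
    u = random.gammavariate(alpha, 1.0)
    v = random.gammavariate(beta, 1.0)
    theta = u / (u + v)
    # likelihood: x ~ Bernoulli(theta)
    x = 1 if random.random() < theta else 0
    return alpha, beta, theta, x

alpha, beta, theta, x = sample_hierarchical()
print(0.0 <= theta <= 1.0 and x in (0, 1))  # True
```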