L1 Prob
Himanshu Yadav
2024-05-17
Contents
1 Set theory
1.1 Binary operations on sets
1.2 The algebra of sets
2 Probability theory
2.1 Foundations
2.2 Probability mass function and probability density function
3 Random variables
3.1 Discrete random variables
3.2 The expected value and variance of a random variable
3.3 Continuous random variables
3.4 Some important probability distributions
4 Conditional probability and Bayes’ theorem
4.1 Conditional probability
4.2 Independent events
4.3 Total probability
4.4 Bayes’ theorem
5 Using Bayes’ theorem for statistical inference
1 Set theory
A set is a collection of objects or elements. Suppose that set A consists of two numbers 0 and 1. We
can denote this set as follows:
A = {0, 1}
From the above, we can also say that 0 is a member of set A:
0∈A
Similarly, 1 ∈ A.
The number 2 is not a member:
2 ∉ A
Let us say S is a set of natural numbers between 2 and 8. We can write:
S = {2, 3, 4, 5, 6, 7, 8}
Also, we can describe the set S as follows. The vertical bar, |, is read “such that.”
S = { x | x ∈ N+ and 2 ≤ x ≤ 8}
If A = {1, 2, 3}, B = {1, 3, 2}, and C = {3, 1, 2, 1}, we can write A = B = C, since the order of elements and repetition of elements do not matter.
5. A set A is a subset of another set B, denoted by A ⊆ B, if all members of A are also in B, i.e., for all a ∈ A, a ∈ B.
A is a proper subset of B, denoted by A ⊂ B, if all members of A are also in B, but A ≠ B.
6. The power set of a set A, denoted by P ( A), is the set of all possible subsets of A.
7. Two sets A and B are called disjoint sets if A and B have no element in common.
A ∩ B = ∅, where ∅ represents the empty set (a set with no elements in it).
The complement of a set A, denoted by Ā, is the set of all elements that are not in A:
Ā = { x | x ∉ A }
2. A ∪ B = B ∪ A
A∩B = B∩A
3. A ∪ ( B ∪ C ) = ( A ∪ B) ∪ C
A ∩ ( B ∩ C ) = ( A ∩ B) ∩ C
4. A ∩ ( B ∪ C ) = ( A ∩ B) ∪ ( A ∩ C )
A ∪ ( B ∩ C ) = ( A ∪ B) ∩ ( A ∪ C )
5. A ∪ ∅ = A
A∩∅ = ∅
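The laws above can be checked mechanically with Python's built-in set type. The following is a minimal sketch; the sets A, B, and C are arbitrary examples chosen only for illustration.

# Arbitrary example sets, chosen only to illustrate the laws above
A = {0, 1, 2}
B = {1, 3}
C = {2, 3, 4}

# Commutative laws
assert A | B == B | A
assert A & B == B & A

# Associative laws
assert A | (B | C) == (A | B) | C
assert A & (B & C) == (A & B) & C

# Distributive laws
assert A & (B | C) == (A & B) | (A & C)
assert A | (B & C) == (A | B) & (A | C)

# Identity laws with the empty set
assert A | set() == A
assert A & set() == set()

print("All set-algebra identities hold for these example sets.")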
2 Probability theory
Suppose you run an experiment where the participants have to decide whether a given sentence is grammatically correct or not, and they are forced to select either yes or no. The recorded responses you have are “yes” and “no.”
What is the set of all possible outcomes from the experiment?
Ω = {“yes”, “no”}
The set of all possible outcomes from the experiment is called the sample space of the experiment.
What is the power set of the sample space Ω?
F = {∅,{“yes”},{“no”},{“yes”,“no”}}
∅ : there is no outcome
{“yes”} : the outcome is “yes”
{“no”} : the outcome is “no”
{“yes”,“no”} : the outcome is either “yes” or “no”
The above are all the different collections of possible results. These collections are called events.
For example, {“no”} is the event that the participant answers “no”; {“yes”,“no”} is the event that the
participant answers either “yes” or “no”. The power set F is called the event space.
Probability is a way of assigning every event a real value between 0 and 1 based on some require-
ments. What are those requirements? How do we assign a probability value to every event?
2.1 Foundations
We can assign a probability value P( E) to an event E based on the following three axioms.
1. First axiom:
The probability of an event E is a non-negative real number
P( E) ∈ R, P( E) ≥ 0 where E ∈ F
It follows that P( E) is always finite.
2. Second axiom:
The probability that at least one of the elementary events in the entire sample space will occur is 1.
P(Ω) = 1
3. Third axiom:
Any countable sequence of disjoint sets (also called mutually exclusive events) E1, E2, E3, . . . satisfies the following:
P(E1 ∪ E2 ∪ E3 ∪ . . .) = P(E1) + P(E2) + P(E3) + . . .
P(∪_{i=1}^{∞} Ei) = ∑_{i=1}^{∞} P(Ei)
Let’s see what we can deduce from the above three axioms about our grammaticality judgment
example.
Suppose that E1 and E2 are two mutually exclusive events in the sample space Ω, and the empty set ∅ is also an event in the same sample space. Since E1 and ∅ are disjoint and E1 ∪ ∅ = E1, according to the third axiom,
P(E1 ∪ ∅) = P(E1) + P(∅) = P(E1) (3)
P(∅) = 0 (4)
Now, the second axiom, together with the third, implies that the probabilities assigned to the elementary events in Ω must sum to 1:
P({“yes”}) + P({“no”}) = P(Ω) = 1
2.2 Probability mass function and probability density function
For a discrete sample space, suppose a function f assigns a probability value f(x) to every outcome x ∈ Ω; f is called the probability mass function (PMF). So,
∑_{x∈Ω} f(x) = 1 (10)
and the probability of an event E is the sum of the probabilities of the outcomes in E:
P(E) = ∑_{x∈E} f(x) (11)
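As a minimal illustration in Python (the value 0.7 for “yes” is an arbitrary assumption, not something given in the text), the sketch below defines a PMF over Ω = {“yes”, “no”}, checks that the probabilities sum to 1, and computes P(E) for every event in the event space F using equation (11).

from itertools import chain, combinations

# Sample space and an assumed PMF (0.7 / 0.3 are arbitrary illustration values)
omega = ["yes", "no"]
f = {"yes": 0.7, "no": 0.3}
assert abs(sum(f.values()) - 1.0) < 1e-12   # probabilities over the sample space sum to 1

def powerset(xs):
    """All subsets of xs, i.e., the event space F."""
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

def prob(event):
    """P(E) = sum of f(x) over the outcomes x in E (equation 11)."""
    return sum(f[x] for x in event)

for event in powerset(omega):
    print(set(event) or "∅", prob(event))
# prints: ∅ 0, {'yes'} 0.7, {'no'} 0.3, {'yes', 'no'} 1.0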
3 Random variables
A random variable X is a function that maps the outcomes in a sample space Ω (say {“yes”,“no”}) to another (real-valued) space Ωx (e.g., {0, 1}, where 1 corresponds to “yes” and 0 corresponds to “no”).
We can write a random variable X as
X : Ω → Ωx
such that
X (ω ) ∈ Ω x where ω ∈ Ω
For example, in a single-coin toss experiment, the sample space is Ω = { H, T }. We can define a
random variable X which is a function that counts the number of heads in an outcome ω that belongs
to Ω.
X: No. of heads in ω where ω ∈ Ω
X(ω) = { 1 if ω = H; 0 if ω ≠ H },  where ω ∈ Ω
Similarly, we can also define a random variable Y that counts the number of tails in an outcome ω.
The probabilities are always assigned to the values of the random variable. We will see in the next part why random variables are so useful for assigning probability values.
In the case of a continuous sample space, the measurable space is often the same as the sample space of the experiment. Suppose an experiment (or any generative process) produces outcomes that belong to a sample space Ω = { x | x ∈ R+ and 2 ≤ x ≤ 5}. These outcomes are values that are
coming from what is known as a (continuous) random variable. Suppose that in an experimental
trial the outcome is 2.5; we will write X = 2.5, where X is a random variable associated with the
experiment. A random variable is written with capital letters (X or Y, etc.), and the outcomes are
written in lower case (x or y, etc.). In Bayesian statistics, where parameters are also random variables,
it is common to use Greek letters like α, β, etc., to represent random variables (i.e., the capital letter
convention generally only applies to letters of the English alphabet).
Let us see how it is more convenient to assign probabilities to the values of the random variable
rather than assigning probabilities to the sample space Ω.
Consider an experiment where a coin is tossed three times. What will be the sample space of the
experiment?
There are a total of eight possible outcomes.
Ω = { HHH, HHT, HTH, HTT, THH, THT, TTH, TTT }
It is difficult to directly assign probabilities to this sample space, because we will need a probabil-
ity mass function that has probabilities defined for all 8 outcomes.
Now consider a different idea. What if we ask: how many heads appear (in a trial / experiment)
when a coin is tossed three times? We can define a random variable X.
X: No. of heads in the outcome ω where ω ∈ Ω.
If we represent the outcome of each toss as ωi, we can write ω = (ω1, ω2, ω3) ∈ Ω. The random variable X is given by
X(ω) = ∑_{i=1}^{3} φ(ωi),  where φ(ωi) = { 1 if ωi = H; 0 if ωi ≠ H }
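The mapping from the eight outcomes to the values of X can be made explicit with a short Python sketch; assuming a fair coin (so that each outcome has probability 1/8), it also gives the PMF of X by counting.

from itertools import product
from collections import Counter

# Sample space for three tosses: all 8 triples of H/T
omega = list(product("HT", repeat=3))

def X(outcome):
    """Random variable X: number of heads in the outcome."""
    return sum(1 for toss in outcome if toss == "H")

# Assuming a fair coin, every outcome has probability 1/8
counts = Counter(X(w) for w in omega)
pmf = {k: counts[k] / len(omega) for k in sorted(counts)}
print(pmf)   # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}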
• Random variables can be discrete. For example, in our coin tossing example, the random variable
X takes a countable list of values (i.e., 0, 1, 2 and 3).
• Random variables can be continuous: they can take any numerical value in an interval or collection
of intervals. For example, in an experiment where we record response or reading time, a random
variable X associated with the experiment can take any positive real number value.
• A random variable is associated with a function, called probability mass function (PMF) for dis-
crete random variables, and probability density function (PDF) for continuous random variables.
• The PMF assigns probabilities to the values of a discrete random variable. The PDF assigns prob-
abilities to particular intervals (ranges) of values of a continuous random variable. The PDF does
not assign a probability to a point value, but rather a density.
So, for any experiment or any generative process, you can define a sample space Ω, a random variable X that maps its sample space to another space Ωx, an event space F which is the power set of Ωx, and a function P that maps the event space F to a set of probability values.
The sample space Ω, the event space F, and the function P from the event space F to a set of prob-
abilities together make the formal model of an underlying generative process, denoted by (Ω, F, P).
3.1 Discrete random variables
The PMF of a discrete random variable X assigns a probability f(xi) to each value xi that the variable can take:
P(X = xi) = f(xi)
under the requirement
∑_{i=1}^{n} f(xi) = 1
An important example: The binomial random variable
Suppose that in an experiment the trials are independent, and in each trial one of two possible outcomes can occur, with probabilities p and 1 − p. If p remains constant throughout the experiment, each one of these trials is called a Bernoulli trial. Bernoulli trials can represent generative processes where each outcome is strictly binary, such as heads/tails, on/off, up/down, etc. The pair of outcomes can be coded as 1 (success) and 0 (failure), with the corresponding probabilities as in Table 1.

Table 1: A random variable in which two outcomes are possible: success or failure. The outcome success is assigned the number 1, and failure the number 0, and a probability is assigned to each number.

xi      0       1
f(xi)   1 − p   p

A further distribution can arise from Bernoulli trials. Consider an experiment containing n independent Bernoulli trials. Suppose there were k successes in n trials. We can define a new random variable X which takes the number of successes (out of the total number of trials) as its values.
The probability distribution of the random variable X that represents the number of successes in n Bernoulli trials is given by
P(X = k) = f(k, n, p) = (n! / (k!(n − k)!)) p^k (1 − p)^(n − k) (15)
The expression n! / (k!(n − k)!) is written as C(n, k) (read “n choose k”) in mathematics, leading to the above PMF being commonly written as:
P(X = k) = f(k, n, p) = C(n, k) p^k (1 − p)^(n − k) (16)
The above distribution is called the binomial distribution, and the random variable is called the
binomial random variable.
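A minimal Python sketch of the PMF in equation (16), using math.comb for the binomial coefficient; n = 3 and p = 0.5 are chosen only to match the fair-coin example above.

from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# For three fair-coin tosses this reproduces the PMF obtained by enumeration
print([binomial_pmf(k, n=3, p=0.5) for k in range(4)])   # [0.125, 0.375, 0.375, 0.125]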
3.2 The expected value and variance of a random variable
The expected value of X is the arithmetic mean of a large number of independently drawn values of the variable X. For a discrete random variable X with PMF f, E(X) = ∑_i xi f(xi).
The expected value satisfies the following relationships
1. E(cX ) = cE( X )
2. E( X + Y ) = E( X ) + E(Y )
The variance of a random variable X is defined as Var(X) = E[(X − E(X))²]. Expanding the square,
Var(X) = E[X² + E(X)² − 2X E(X)]
Equivalently:
Var(X) = E(X²) + E(E(X)²) − E(2X E(X))
Var(X) = E(X²) + E(X)² − 2E(X) E(X)
Var(X) = E(X²) − E(X)²
E( X ) is often written as µ.
The standard deviation of a random variable X is given by
σX = √Var(X)
The variance satisfies the following relationships
1. Var(cX) = c² Var(X)
2. Var(X + c) = Var(X)
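These identities can be checked numerically from a PMF. The sketch below reuses the binomial PMF from the previous example (n = 10 and p = 0.3 are assumed illustration values) and computes E(X) = ∑ x f(x) and Var(X) = E(X²) − E(X)², which match the known binomial results np and np(1 − p).

from math import comb

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3                                  # assumed example values
f = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

mu = sum(k * f[k] for k in f)                   # E(X)   = sum_k k * f(k)
ex2 = sum(k**2 * f[k] for k in f)               # E(X^2) = sum_k k^2 * f(k)
var = ex2 - mu**2                               # Var(X) = E(X^2) - E(X)^2

print(mu, var)   # ≈ 3.0 and ≈ 2.1, i.e., n*p and n*p*(1 - p)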
3.3 Continuous random variables
Suppose the average height of a population is 6 feet, and the number of people with height greater than 6 feet is almost the same as the number with height less than 6 feet.
Consider an experiment where you randomly pick an individual from the population and record
their height. Let us say we define a random variable X that takes the recorded height as its value.
The variable X is a continuous random variable with specific properties. For example, it is sym-
metrically distributed around its mean, i.e., P( X < E( X )) ≈ P( X > E( X )).
The distribution of variable X in this example can be characterized by a normal distribution with
the following probability density function:
f(x) = (1 / (σ√(2π))) e^(−(x − µ)² / (2σ²))
such that:
• ∫_{−∞}^{∞} f(x) dx = 1
• ∫_{−∞}^{∞} x f(x) dx = µ
• ∫_{−∞}^{∞} (x − µ)² f(x) dx = σ²
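These properties can be checked numerically with a simple Riemann sum over a wide grid; the values µ = 6 and σ = 0.5 below are assumed only to echo the height example.

from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """The normal density given above."""
    return 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu) ** 2 / (2 * sigma ** 2))

mu, sigma = 6.0, 0.5                       # assumed example values (height in feet)
dx = 0.001
grid = [i * dx for i in range(12001)]      # x from 0 to 12, wide enough for this mu and sigma

area = sum(normal_pdf(x, mu, sigma) * dx for x in grid)                    # ≈ 1
mean = sum(x * normal_pdf(x, mu, sigma) * dx for x in grid)                # ≈ mu
var  = sum((x - mu) ** 2 * normal_pdf(x, mu, sigma) * dx for x in grid)    # ≈ sigma^2
print(area, mean, var)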
3.4 Some important probability distributions
Some frequently used probability distributions are listed below.
1. Discrete. Binomial PMF: f(k; n, p) = (n! / (k!(n − k)!)) p^k (1 − p)^(n − k)
2. Discrete. Poisson PMF: f(k; λ) = λ^k e^(−λ) / k!, where λ > 0
3. Continuous. Normal PDF: f(x; µ, σ) = (1 / (σ√(2π))) e^(−(x − µ)² / (2σ²))
4. Continuous. Beta PDF: f(x; α, β) = ((α + β − 1)! / ((α − 1)!(β − 1)!)) x^(α−1) (1 − x)^(β−1), where α, β > 0
5. Continuous. Gamma PDF: f(x; α, β) = (β^α / (α − 1)!) x^(α−1) e^(−βx), where α, β > 0
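In practice these PMFs and PDFs are usually evaluated with a library rather than coded by hand. A minimal sketch with scipy.stats (the parameter values are arbitrary; note that scipy parameterizes the gamma distribution with a shape a = α and a scale equal to 1/β):

from scipy import stats

# Arbitrary parameter values, purely for illustration
print(stats.binom.pmf(k=2, n=10, p=0.3))           # binomial PMF
print(stats.poisson.pmf(k=2, mu=1.5))              # Poisson PMF (lambda = 1.5)
print(stats.norm.pdf(x=0.5, loc=0.0, scale=1.0))   # normal PDF (mu = 0, sigma = 1)
print(stats.beta.pdf(x=0.5, a=2.0, b=3.0))         # beta PDF (alpha = 2, beta = 3)
print(stats.gamma.pdf(x=0.5, a=2.0, scale=1/3.0))  # gamma PDF (alpha = 2, rate beta = 3)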
4 Conditional probability and Bayes’ theorem
Let us look at some useful results and properties that emerge from the three axioms of probability.
4.1 Conditional probability
The conditional probability of an event A given that an event B has occurred is defined as
P(A|B) = P(A ∩ B) / P(B),  given that P(B) ≠ 0
Let us verify the above relationship using an example.
Suppose you toss two fair coins simultaneously. The sample space would be Ω = { HH, HT, TH, TT }.
Consider two events A and B.
A : both the coins show heads
B : at least one coin shows heads
What is the probability that A has occurred given that B has occurred?
It will be equal to the probability of A when B is treated as the sample space, with A ⊆ B, where
B = { HH, HT, TH }
A = { HH }
Given that the coins are fair,
P(HH) = P(HT) = P(TH)
Considering B as the sample space, from the second and the third axioms we can deduce that
P(HH) + P(HT) + P(TH) = 1
So,
P(HH) = P(HT) = P(TH) = 1/3
Hence,
P({HH}|B) = 1/3
P(A|B) = 1/3
What is the probability of the event A ∩ B in the sample space Ω?
A ∩ B = { HH }
For the sample space Ω, P(HH) + P(HT) + P(TH) + P(TT) = 1
so, P(HH) = P(HT) = P(TH) = P(TT) = 1/4, which implies that P(A ∩ B) = 1/4
and,
P(B) = P({HH, HT, TH}) = P(HH) + P(HT) + P(TH) = 3/4
Finally,
P(A ∩ B) / P(B) = (1/4) / (3/4) = 1/3 = P(A|B)
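The same computation can be reproduced by counting equally likely outcomes; a minimal Python sketch for the two fair coins:

from itertools import product

# The four equally likely outcomes for two fair coins
omega = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return len(event) / len(omega)

A = [w for w in omega if w == ("H", "H")]       # both coins show heads
B = [w for w in omega if "H" in w]              # at least one coin shows heads
A_and_B = [w for w in A if w in B]

print(prob(A_and_B) / prob(B))                  # P(A|B) = (1/4) / (3/4) = 1/3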
4.2 Independent events
Two events A and B are independent if the occurrence of B does not change the probability of A, i.e.,
P(A|B) = P(A ∩ B) / P(B) = P(A)
which gives
P(A ∩ B) = P(B) P(A)
The above result implies that two events A and B are independent if and only if the probability of the joint occurrence of A and B is equal to the product of their probabilities.
The term P(A ∩ B) gives the probability that both events A and B occur; it is called the joint probability and is also represented by P(A, B).
Generally, n events E1 , E2 , . . . , En are independent if and only if P( E1 , E2 , E3 , . . . En ) = P( E1 ) P( E2 ) P( E3 ) . . . P( En ).
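For example, in the two-coin experiment the events “the first coin shows heads” and “the second coin shows heads” are independent, which the following continuation of the counting sketch confirms:

from itertools import product

omega = list(product("HT", repeat=2))

def prob(event):
    return len(event) / len(omega)

first_heads  = [w for w in omega if w[0] == "H"]
second_heads = [w for w in omega if w[1] == "H"]
both_heads   = [w for w in omega if w[0] == "H" and w[1] == "H"]

# Independence: P(A ∩ B) equals P(A) * P(B)
print(prob(both_heads), prob(first_heads) * prob(second_heads))   # 0.25 0.25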
4.3 Total probability
Suppose A1, A2, . . . , An are mutually exclusive events that together cover the entire sample space, i.e., ∪_{i=1}^{n} Ai = Ω. For any event B, the events B ∩ A1, B ∩ A2, . . . are also mutually exclusive, so by the third axiom,
P((B ∩ A1) ∪ (B ∩ A2) ∪ . . .) = P(B ∩ A1) + P(B ∩ A2) + . . .
From set theory (the distributive law) we know that (B ∩ A1) ∪ (B ∩ A2) ∪ (B ∩ A3) ∪ . . . = B ∩ (A1 ∪ A2 ∪ A3 ∪ . . .).
P(B ∩ (∪_{i=1}^{n} Ai)) = ∑_{i=1}^{n} P(B ∩ Ai)
Since ∪_{i=1}^{n} Ai = Ω and B ∩ Ω = B,
P(B) = ∑_{i=1}^{n} P(B ∩ Ai)
This result is known as the law of total probability.
4.4 Bayes’ theorem
From the definition of conditional probability,
P(B ∩ A1) = P(B|A1) P(A1) = P(A1|B) P(B)
Similarly:
P ( B ∩ A2 ) = P ( B | A2 ) P ( A2 ) = P ( A2 | B ) P ( B )
From the above equations we can derive the following:
P(A1|B) = P(B|A1) P(A1) / P(B)
And, from the law of total probability we know that,
P ( B ) = P ( B | A1 ) P ( A1 ) + P ( B | A2 ) P ( A2 )
Hence,
P(A1|B) = P(B|A1) P(A1) / P(B) = P(B|A1) P(A1) / [P(B|A1) P(A1) + P(B|A2) P(A2)]
The above equation is Bayes’ rule.
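A small numeric sketch of Bayes’ rule for two mutually exclusive events A1 and A2 that cover the sample space; all probability values below are arbitrary illustration values, not taken from the text.

# Prior probabilities of the two mutually exclusive events (arbitrary values)
p_A1, p_A2 = 0.4, 0.6                 # must sum to 1

# Conditional probabilities of B given each event (arbitrary values)
p_B_given_A1 = 0.9
p_B_given_A2 = 0.2

# Denominator from the law of total probability
p_B = p_B_given_A1 * p_A1 + p_B_given_A2 * p_A2

# Bayes' rule
p_A1_given_B = p_B_given_A1 * p_A1 / p_B
print(p_A1_given_B)                   # 0.75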
5 Using Bayes’ theorem for statistical inference
Let us talk about the variables that assign values to the outcomes of an underlying generative process (a random event).
Suppose that an outcome x observed in an experiment is assumed to come from a normal distribu-
tion, such that
f(x; µ, σ²) = (1 / (σ√(2π))) e^(−(x − µ)² / (2σ²))
where f ( x ) is the probability density function; f ( x ) assigns the probability density value to the
outcome x conditional on the parameters mean µ and variance σ2 of the normal distribution. The
probability density of x conditional on µ and σ2 can be written as,
p(x|µ, σ²) = (1 / (σ√(2π))) e^(−(x − µ)² / (2σ²))
The goal of statistical inference is to figure out what value(s) of µ and σ² have generated the observed outcome x.
We know the probability density of obtaining x given µ and σ². Can we calculate the probability density of (a range of) values of µ and σ² conditional on the observed outcome x?
p(µ, σ2 | x ) =?
Using Bayes’ theorem,
p(µ, σ²|x) = p(x|µ, σ²) · p(µ, σ²) / ∫∫ p(x|µ, σ²) · p(µ, σ²) dµ dσ²
More generally, suppose the observed outcome x is assumed to be a value of the random variable X
whose probability density function is f ( x; θ ); f ( x; θ ) assigns a probability density value to x condi-
tional on a parameter θ. The probability density of x given the parameter θ is given by p( x |θ ).
Our goal is to infer what value(s) of the parameter θ has generated the given (observed) datapoint x.
p(θ|x) = p(x|θ) · p(θ) / ∫ p(x|θ) · p(θ) dθ (18)
The term p( x |θ ) is called the likelihood function, p(θ ) is called the prior distribution of θ, and
p(θ | x ) is called the posterior distribution of θ.
Note: When f ( x; θ ) is seen as a function of x, it is called a probability density function; and when
f ( x; θ ) is seen as a function of θ, it is called a likelihood function, also denoted by L(θ | x ).
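To make equation (18) concrete, here is a minimal grid-approximation sketch in Python. All numbers are assumed for illustration: a single observation x is taken to come from a Normal(θ, σ²) likelihood with σ known, the prior p(θ) is a Normal(0, 3²) density, and the integral in the denominator is approximated by a sum over a grid of θ values.

from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu) ** 2 / (2 * sigma ** 2))

x_obs = 2.5        # observed datapoint (assumed)
sigma = 1.0        # known standard deviation of the likelihood (assumed)

# Grid of candidate values for the unknown parameter theta (the mean)
dtheta = 0.01
grid = [-5 + i * dtheta for i in range(1501)]            # theta from -5 to 10

prior = [normal_pdf(t, 0.0, 3.0) for t in grid]          # p(theta): Normal(0, 3^2), assumed
likelihood = [normal_pdf(x_obs, t, sigma) for t in grid] # p(x | theta), read as a function of theta

unnormalized = [l * p for l, p in zip(likelihood, prior)]
evidence = sum(u * dtheta for u in unnormalized)         # approximates the integral in (18)
posterior = [u / evidence for u in unnormalized]         # p(theta | x) evaluated on the grid

# Posterior mean of theta: pulled from the observation 2.5 toward the prior mean 0
print(sum(t * q * dtheta for t, q in zip(grid, posterior)))   # ≈ 2.25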