
CGS698C, Module 1: Sets, probability, and random variables

Himanshu Yadav
2024-05-17

Contents

1 Set theory
1.1 Binary operations on sets
1.2 The algebra of sets
2 Probability theory
2.1 Foundations
2.2 Probability mass function and probability density function
3 Random variables
3.1 Discrete random variables
3.2 The expected value and variance of a random variable
3.3 Continuous random variables
3.4 Some important probability distributions
4 Conditional probability and Bayes’ theorem
4.1 Conditional probability
4.2 Independent events
4.3 Total probability
4.4 Bayes’ theorem
5 Using Bayes’ theorem for statistical inference
bayesian models & data analysis 2

1 Set theory

A set is a collection of objects or elements. Suppose that set A consists of two numbers 0 and 1. We
can denote this set as follows:

A = {0, 1}
From the above, we can also say that 0 is a member of set A:

0∈A
Similarly, 1 ∈ A.
The number 2 is not a member:

2 ∉ A
Let us say S is a set of natural numbers between 2 and 8. We can write:

S = {2, 3, 4, 5, 6, 7, 8}
Also, we can describe the set S as follows. The vertical bar, |, is read “such that.”

S = { x | x ∈ N+ and 2 ≤ x ≤ 8}
If A = {1, 2, 3}, B = {1, 3, 2}, and C = {3, 1, 2, 1}, we can write A = B = C: a set is unchanged by reordering or repeating its elements.

1.1 Binary operations on sets


1. The union of two sets A and B is denoted by A ∪ B
A ∪ B is the set of all objects that are a member of A, or B, or both.

2. The intersection of two sets A and B is denoted by A ∩ B


A ∩ B is the set of all objects that are members of both A and B.

3. Set difference for the sets B and A is denoted by B\ A


B\ A is the set of all members of B that are not members of A.
B\ A = { x | x ∈ B and x ∉ A }

4. The Cartesian product of A and B, denoted by A × B


A × B is the set whose members are all possible ordered pairs ( a, b), such that a ∈ A, and b ∈ B.

5. A set A is a subset of another set B, denoted by A ⊆ B if all members of A are also in B, i.e., for all
a ∈ A, a ∈ B.
A is a proper subset of B, denoted by A ⊂ B if all members of A are also in B, but A ̸= B.

6. The power set of a set A, denoted by P ( A), is the set of all possible subsets of A.

7. Two sets A and B are called disjoint sets if A and B have no element in common.
A ∩ B = ∅, where ∅ represents the empty set (a set with no elements in it).

8. The complement of a set A is denoted by Ā or Ac


Ā is the set of all those elements which belong to the universal set U but do not belong to A.

Ā = { x | x ∈ U and x ∉ A }

1.2 The algebra of sets


1. A ∪ A = A
A∩A = A

2. A ∪ B = B ∪ A
A∩B = B∩A

3. A ∪ ( B ∪ C ) = ( A ∪ B) ∪ C
A ∩ ( B ∩ C ) = ( A ∩ B) ∩ C

4. A ∩ ( B ∪ C ) = ( A ∩ B) ∪ ( A ∩ C )
A ∪ ( B ∩ C ) = ( A ∪ B) ∩ ( A ∪ C )

5. A ∪ ∅ = A
A∩∅ = ∅
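These laws can be checked mechanically. As an illustrative sketch (Python is not part of the notes; the sets A, B, C are arbitrary examples):

```python
# Checking the set-algebra laws above with Python's built-in set type.
A = {1, 2, 3}
B = {2, 3, 4}
C = {3, 4, 5}

assert A | A == A and A & A == A                 # 1. idempotence
assert A | B == B | A and A & B == B & A         # 2. commutativity
assert A | (B | C) == (A | B) | C                # 3. associativity
assert A & (B | C) == (A & B) | (A & C)          # 4. distributivity
assert A | set() == A and A & set() == set()     # 5. identity laws with the empty set
```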

2 Probability theory

Suppose you run an experiment where the participants have to decide whether a given sentence is
grammatically correct or not. And, the participants are forced to select either yes or no. The recorded
responses you have are “yes” and “no.”
What is the set of all possible outcomes from the experiment?
Ω = {“yes”, “no”}
The set of all possible outcomes from the experiment is called the sample space of the experiment.
What is the power set of the sample space Ω?
F = {∅,{“yes”},{“no”},{“yes”,“no”}}
∅ : there is no outcome
{“yes”} : the outcome is “yes”
{“no”} : the outcome is “no”
{“yes”,“no”} : the outcome is either “yes” or “no”
The above are the all different collections of possible results. These collections are called events.
For example, {“no”} is an event that the participant answers “no”; {“yes”,“no”} is the event that the
participant answers either “yes” or “no”. The power set F is called the event space.
Probability is a way of assigning every event a real value between 0 and 1 based on some require-
ments. What are those requirements? How do we assign a probability value to every event?

2.1 Foundations
We can assign a probability value P( E) to an event E based on the following three axioms.

1. First axiom:
The probability of an event E is a non-negative real number
P( E) ∈ R, P( E) ≥ 0 where E ∈ F
It follows that P( E) is always finite.

2. Second axiom:
The probability that at least one of the elementary events in the entire sample space will occur is 1.
P(Ω) = 1

3. Third axiom:
Any countable sequence of disjoint sets (also called mutually exclusive events) E_1, E_2, . . . satisfies the following:
P(E_1 ∪ E_2 ∪ E_3 ∪ . . .) = P(E_1) + P(E_2) + P(E_3) + . . .
P(∪_{i=1}^∞ E_i) = ∑_{i=1}^∞ P(E_i)

Let’s see what we can deduce from the above three axioms about our grammaticality judgment
example.
Suppose that E_1 and E_2 are two mutually exclusive events in the sample space Ω, and the empty
set ∅ is also an event in the same sample space. According to the third axiom,

P(E_1 ∪ E_2 ∪ ∅ ∪ ∅ ∪ ∅ ∪ . . .) = P(E_1) + P(E_2) + P(∅) + P(∅) + P(∅) + . . .   (1)

From set theory you know that E ∪ ∅ = E, so the left-hand side reduces to E_1 ∪ E_2:



P(E_1 ∪ E_2) = P(E_1) + P(E_2) + ∑_{i=3}^∞ P(∅)   (2)
From the first axiom we know that P(∅) ≥ 0, P( E1 ) ≥ 0, P( E2 ) ≥ 0 and P( E1 ∪ E2 ) is finite. Hence,

P(∅) = 0 (3)

Let us go back to our grammaticality judgment experiment.


Sample space: Ω = {“yes”, “no”}
Event space: F = {∅,{“yes”},{“no”},{“yes”,“no”}}
We just verified that

P(∅) = 0 (4)
Now, the second axiom implies that:

P({”yes”} ∪ {”no”}) = 1 (5)


Finally, the third axiom implies that,

P(∅ ∪ {”yes”} ∪ {”no”}) = P(∅) + P({”yes”}) + P({”no”}) (6)

P({”yes”} ∪ {”no”}) = P(∅) + P({”yes”}) + P({”no”}) (7)



1 = 0 + P({”yes”}) + P({”no”}) (8)

So,

P({”yes”}) + P({”no”}) = 1 (9)


The above equation implies that the sum of the probabilities of all elementary events in the sample
space Ω is equal to 1. More generally, for x ∈ Ω:

∑_{x ∈ Ω} f(x) = 1   (10)

where f ( x ) ∈ [0, 1].


f is a function that assigns a probability value to each elementary event x.
Also, for any event E ∈ F:

P(E) = ∑_{x ∈ E} f(x)   (11)
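Equations 10 and 11 can be sketched in a few lines of Python for the grammaticality-judgment example; the probability 0.7 assigned to “yes” is purely hypothetical:

```python
# A hypothetical PMF f over the sample space {"yes", "no"}.
f = {"yes": 0.7, "no": 0.3}

def P(event):
    """Probability of an event (a set of outcomes): sum f(x) over x in the event."""
    return sum(f[x] for x in event)

assert abs(sum(f.values()) - 1.0) < 1e-9   # Equation 10: the f(x) values sum to 1
assert P({"no"}) == 0.3                    # an elementary event
assert abs(P({"yes", "no"}) - 1.0) < 1e-9  # P(Omega) = 1
assert P(set()) == 0.0                     # P(empty event) = 0
```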

2.2 Probability mass function and probability density function


The function f(x) in Equation 11 maps a discrete outcome x in the sample space Ω to a probability
value; it is called a probability mass function.
Now, consider another experiment. You record the reading times for each participant: how much
time (in milliseconds) does it take to read a sentence?
What is the sample space now?
Ω = R+
This sample space is not a finite or countable set now. It is a continuous sample space.
How do we assign probabilities to an outcome x such that x ∈ Ω?
∫_0^∞ f(x) dx = 1   (12)

Suppose an event E exists such that E ⊂ R+


P(X ∈ E) = ∫_{x ∈ E} f(x) dx   (13)
The function f(x) maps the values (outcomes) in the continuous sample space Ω to a continuous
probability space, such that

P(X ≤ x) = ∫_0^x f(t) dt   (14)
The function f ( x ) is called a probability density function.
In the above equation, there is a variable we have not defined yet, the variable X. The value of X
depends on the outcomes in the continuous sample space; it is called a continuous random variable.
More generally, for any experiment, you can define a random variable X whose values depend on the
outcome of the experiment. We will talk about random variables in the next section.
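The continuous case can be illustrated numerically. The sketch below uses an exponential density f(x) = λe^(−λx) on R+ purely as a stand-in for a reading-time distribution (λ = 1 and the grid settings are arbitrary choices, not from the notes):

```python
import math

lam = 1.0
f = lambda x: lam * math.exp(-lam * x)       # a density on R+, used only for illustration

# Midpoint-rule approximation of the integrals in Equations 12 and 14.
n, dx = 60000, 0.0005                        # grid on [0, 30]; the tail beyond 30 is negligible
xs = [(i + 0.5) * dx for i in range(n)]

total = sum(f(x) * dx for x in xs)           # Equation 12: the density integrates to 1
cdf2 = sum(f(x) * dx for x in xs if x < 2)   # Equation 14: P(X <= 2)

assert abs(total - 1.0) < 1e-6
assert abs(cdf2 - (1 - math.exp(-2))) < 1e-3  # closed form for the exponential
```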

3 Random variables

A random variable X is a function that maps the outcomes in a sample space Ω (say {“yes”,“no”}) to
another (real-valued) space Ω_x (e.g., {0, 1}, where 1 corresponds to “yes” and 0 corresponds to “no”).
We can write a random variable X as
X : Ω → Ωx
such that
X (ω ) ∈ Ω x where ω ∈ Ω
For example, in a single-coin toss experiment, the sample space is Ω = { H, T }. We can define a
random variable X which is a function that counts the number of heads in an outcome ω that belongs
to Ω.
X: No. of heads in ω where ω ∈ Ω
X(ω) = 1 if ω = H, 0 if ω ≠ H,   where ω ∈ Ω
Similarly, we can also define a random variable Y that counts the number of tails in an outcome ω.
The probabilities are always assigned to the values of the random variable. We will see in the next
part why random variables are so useful for assigning probability values.
In the case of continuous sample space, the measurable space is often the same as the sample
space of the experiment. Suppose an experiment (or any generative process) produces outcomes that
belong to a sample space Ω = { x | x ∈ R+ and 2 ≤ x ≤ 5}. These outcomes are values that are
coming from what is known as a (continuous) random variable. Suppose that in an experimental
trial the outcome is 2.5; we will write X = 2.5, where X is a random variable associated with the
experiment. A random variable is written with capital letters (X or Y, etc.), and the outcomes are
written in lower case (x or y, etc.). In Bayesian statistics, where parameters are also random variables,
it is common to use Greek letters like α, β, etc., to represent random variables (i.e., the capital letter
convention generally only applies to letters of the English alphabet).
Let us see how it is more convenient to assign probabilites to the values of the random variable
rather than assigning probabilities to the sample space Ω.
Consider an experiment where a coin is tossed three times. What will be the sample space of the
experiment?
There are a total of eight possible outcomes.
Ω = { HHH, HHT, HTH, HTT, THH, THT, TTH, TTT }
It is difficult to directly assign probabilities to this sample space, because we will need a probabil-
ity mass function that has probabilities defined for all 8 outcomes.
Now consider a different idea. What if we ask: how many heads appear (in a trial / experiment)
when a coin is tossed three times? We can define a random variable X.
X: No. of heads in the outcome ω where ω ∈ Ω.
If we represent outcome of each toss as ωi , we can write
ω = (ω1 , ω2 , ω3 ) ∈ Ω. The random variable X is given by
X(ω) = ∑_{i=1}^3 φ(ω_i),   where φ(ω_i) = 1 if ω_i = H, 0 if ω_i ≠ H

The above random variable yields the following values:



X(HHH) = 3, X(THH)=2, and so on.


Hence, X(ω) ∈ {0, 1, 2, 3}
So, the random variable X takes “number of heads in three coin-tosses”, i.e., {0, 1, 2, 3}, as its val-
ues.
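The mapping from outcomes to values of X can be written out directly. A small sketch (the coin-toss encoding follows the notes; the Python representation is illustrative):

```python
from itertools import product

# All eight outcomes of three coin tosses, and X = number of heads in each.
omega = ["".join(t) for t in product("HT", repeat=3)]
X = {w: w.count("H") for w in omega}

assert len(omega) == 8
assert X["HHH"] == 3 and X["THH"] == 2
assert set(X.values()) == {0, 1, 2, 3}   # the values the random variable X can take
```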
An experiment can be associated with more than one random variable. For example, consider
another idea: how many tails appear when a coin is tossed three times?
You can define another random variable Y : Ω → Ω_y which takes “the number of tails in three
coin-tosses” as its values. It would map the sample space Ω to another space Ω_y , such that Ω_y =
{0, 1, 2, 3}

• Random variables can be discrete. For example, in our coin tossing example, the random variable
X takes a countable list of values (i.e., 0, 1, 2 and 3).

• Random variables can be continuous: they can take any numerical value in an interval or collection
of intervals. For example, in an experiment where we record response or reading time, a random
variable X associated with the experiment can take any positive real number value.

• A random variable is associated with a function, called probability mass function (PMF) for dis-
crete random variables, and probability density function (PDF) for continuous random variables.

• The PMF assigns probabilities to the values of a discrete random variable. The PDF assigns prob-
abilities to particular intervals (ranges) of values of a continuous random variable. The PDF does
not assign a probability to a point value, but rather a density.

So, for any experiment or any generative process, you can define a sample space Ω, a random
variable X that maps its sample space to another space Ω_x , an event space F which is a power set of
Ω_x , and a function P that maps the event space F to a set of probability values.
The sample space Ω, the event space F, and the function P from the event space F to a set of prob-
abilities together make the formal model of an underlying generative process, denoted by (Ω, F, P).

3.1 Discrete random variables


Suppose a discrete random variable X takes the values x_1, x_2, x_3, . . . , x_n. What is the probability that
the random variable X takes the value x_i, where i = 1, . . . , n? The probabilities can be assigned by the
probability mass function f, such that

P(X = x_i) = f(x_i)

under the requirement

∑_{i=1}^n f(x_i) = 1
An important example: The binomial random variable
Suppose that in an experiment, the trials are independent. And, in each trial, one of the two pos-
sible outcomes can occur with probabilities p and 1 − p. If p remains constant throughout the exper-
iment, each one of these trials is called a Bernoulli trial. Bernoulli trials can represent the generative
processes where each outcome is strictly binary, such as heads/tails, on/off, up/down, etc. The pair

of possible outcomes is usually represented by success/failure where p is the probability of success


and 1 − p is the probability of failure.
The sample space is S = {success, failure}. For a single Bernoulli trial, let us define a random
variable X such that success is assigned a real number value 1 and failure is assigned 0.

x_i :      0       1
f(x_i) :  1 − p    p

Table 1: A random variable in which two outcomes are possible: success or failure. The outcome
success is assigned the number 1, and failure the number 0, and a probability is assigned to each
number.

A further distribution can arise from Bernoulli trials. Consider an experiment containing n independent
Bernoulli trials. Suppose there were k successes in n trials. We can define a new random variable X
which takes the number of successes (out of the total number of trials) as its values.
The probability distribution of the random variable X that represents the number of successes in n
Bernoulli trials is given by

P(X = k) = f(k, n, p) = [n! / (k!(n − k)!)] p^k (1 − p)^(n−k)   (15)

The expression n!/(k!(n − k)!) is written as (n choose k) in mathematics, leading to the above PMF
being commonly written as:

P(X = k) = f(k, n, p) = (n choose k) p^k (1 − p)^(n−k)   (16)
The above distribution is called the binomial distribution, and the random variable is called the
binomial random variable.
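A minimal sketch of the binomial PMF in Equation 16, using the standard library’s binomial coefficient:

```python
import math

def binom_pmf(k, n, p):
    """P(X = k): probability of k successes in n independent Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# The PMF sums to 1 over k = 0..n, as every PMF must.
n, p = 3, 0.5
assert abs(sum(binom_pmf(k, n, p) for k in range(n + 1)) - 1.0) < 1e-12

# For a fair coin tossed three times, P(X = 2) = 3/8.
assert abs(binom_pmf(2, 3, 0.5) - 3 / 8) < 1e-12
```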

k :            0           1                 2                              . . .   n
f(k, n, p) :  (1 − p)^n   np(1 − p)^(n−1)   [n(n − 1)/2] p^2 (1 − p)^(n−2)   . . .   p^n

Table 2: The probability mass function when we carry out n independent Bernoulli trials.
3.2 The expected value and variance of a random variable
The expected value (also called expectation, mean, or first moment) of a random variable X is the
weighted average of its possible values. Suppose a discrete random variable X can take values
x_1, x_2, x_3, . . . , x_n, with probabilities f(x_1), f(x_2), f(x_3), . . ., where f represents the probability
mass function. The expected value of X is given by

E(X) = ∑_{i=1}^n x_i f(x_i)

The expected value of X is the arithmetic mean of a large number of independently drawn values of
the variable X.
The expected value satisfies the following relationships

1. E(cX ) = cE( X )

2. E( X + Y ) = E( X ) + E(Y )

3. E( XY ) = E( X ) E(Y ) (if X and Y are independent)



The variance of a random variable X is given by:

Var ( X ) = E[( X − E( X ))2 ]


This can be rewritten as:

Var ( X ) = E[ X 2 + E( X )2 − 2XE( X )]
Equivalently:

Var ( X ) = E( X 2 ) + E( E( X )2 ) − E(2XE( X ))

Var ( X ) = E( X 2 ) + E( X )2 − 2E( X ) E( X )

Var ( X ) = E( X 2 ) − E( X )2

E( X ) is often written as µ.
The standard deviation of a random variable X is given by
σ_X = √Var(X)
The variance satisfies the following relationships

1. Var (cX ) = c2 Var ( X )

2. Var ( X + c) = Var ( X )

3. Var ( X + Y ) = Var ( X ) + Var (Y ) (if X and Y are independent)
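The identity Var(X) = E(X²) − E(X)² can be checked directly on a PMF. A sketch using the three-toss “number of heads” variable from the previous section, with a fair coin:

```python
# PMF of X = number of heads in three fair coin tosses.
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

E  = sum(x * p for x, p in pmf.items())       # E(X), the weighted average
E2 = sum(x**2 * p for x, p in pmf.items())    # E(X^2)
var = E2 - E**2                               # Var(X) = E(X^2) - E(X)^2

assert abs(E - 1.5) < 1e-12     # matches np = 3 * 0.5 for a binomial variable
assert abs(var - 0.75) < 1e-12  # matches np(1 - p) = 3 * 0.5 * 0.5
```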

3.3 Continuous random variables


Consider another experiment where you record the decision times on a grammaticality judgment
task: how much time (in milliseconds) does it take to decide whether the sentence is grammatical or
not?
Suppose we define a random variable X which takes decision times as its values.
The variable X cannot take its values from a countable list; X is a continuous random variable and
can take any value in the continuous space X ≥ 0.
A continuous random variable takes any specific value with probability zero, so we cannot assign
probabilities like P(X = x_i).
We can however assign probability to an interval of values of X. For example, we can determine
probability of obtaining a value between x1 and x2 , i.e., P( x1 ≤ X ≤ x2 ) or, equivalently, P( x1 < X <
x2 ).
A continuous random variable X is associated with a probability density function f ( x ), which
assigns probabilities over an interval of values of X in the following way:
(a) ∫_{−∞}^∞ f(x) dx = 1, where f(x) ≥ 0 for −∞ < x < ∞

(b) For any x_1, x_2 such that −∞ < x_1 < x_2 < ∞:

P(x_1 ≤ X ≤ x_2) = P(x_1 < X < x_2) = ∫_{x_1}^{x_2} f(x) dx

We can also define a cumulative distribution function F(x) such that


F(x) = P(X ≤ x) = ∫_{−∞}^x f(t) dt
Expected value and variance of a continuous random variable
The expected value or the mean of a continuous random variable X is given by

μ = E(X) = ∫_{−∞}^∞ x f(x) dx

The variance is given by

σ² = Var(X) = E[(X − μ)²] = ∫_{−∞}^∞ (x − μ)² f(x) dx
The normal distribution
Think about the distribution of heights in a population.

Suppose the average height of the population is 6 feet, and the number of people taller than 6 feet is
almost the same as the number shorter than 6 feet.
Consider an experiment where you randomly pick an individual from the population and record
their height. Let us say we define a random variable X that takes the recorded height as its value.
The variable X is a continuous random variable with specific properties. For example, it is sym-
metrically distributed around its mean, i.e., P( X < E( X )) ≈ P( X > E( X )).
The distribution of variable X in this example can be characterized by a normal distribution with
the following probability density function:

f(x) = [1 / (σ√(2π))] e^(−(x−μ)² / (2σ²))
such that:

• ∫_{−∞}^∞ f(x) dx = 1

• ∫_{−∞}^∞ x f(x) dx = μ

• ∫_{−∞}^∞ (x − μ)² f(x) dx = σ²
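These three integrals can be verified numerically for a standard normal (μ = 0, σ = 1); the grid width and integration bounds below are arbitrary numerical choices:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 0.0, 1.0
n, dx = 16000, 0.001                             # midpoint grid on [-8, 8]
xs = [-8 + (i + 0.5) * dx for i in range(n)]

total = sum(normal_pdf(x, mu, sigma) * dx for x in xs)
mean  = sum(x * normal_pdf(x, mu, sigma) * dx for x in xs)
var   = sum((x - mu)**2 * normal_pdf(x, mu, sigma) * dx for x in xs)

assert abs(total - 1.0) < 1e-6      # the density integrates to 1
assert abs(mean - mu) < 1e-6        # its mean is mu
assert abs(var - sigma**2) < 1e-4   # its variance is sigma^2
```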

3.4 Some important probability distributions

   Type of            Name of             Probability density function (PDF) or
   random variable    the distribution    probability mass function (PMF)

1  Discrete           Binomial            PMF: f(k; n, p) = [n!/(k!(n − k)!)] p^k (1 − p)^(n−k)
2  Discrete           Poisson             PMF: f(k; λ) = λ^k e^(−λ) / k!, where λ > 0
3  Continuous         Normal              PDF: f(x; μ, σ) = [1/(σ√(2π))] e^(−(x−μ)²/(2σ²))
4  Continuous         Beta                PDF: f(x; α, β) = [(α + β − 1)!/((α − 1)!(β − 1)!)] x^(α−1) (1 − x)^(β−1), where α, β > 0
5  Continuous         Gamma               PDF: f(x; α, β) = [β^α/(α − 1)!] x^(α−1) e^(−βx), where α, β > 0

4 Conditional probability and Bayes’ theorem

Let us look at some useful results and properties that emerge from the three axioms of probability.

4.1 Conditional probability


The probability of occurrence of an event A given that another event B has already occurred is called
the conditional probability of A given B, and it is denoted by P( A| B).

P( A ∩ B)
P( A| B) = given that P( B) ̸= 0
P( B)
Let us verify the above relationship using an example.
Suppose you toss two fair coins simultaneously. The sample space would be Ω = { HH, HT, TH, TT }.
Consider two events A and B.
A : both the coins show heads
B : at least one coin shows heads
What is the probability of occurrence of A given that B has occurred?
It will be equal to the probability of A when B is treated as the sample space, A ⊆ B, where
B = { HH, HT, TH }
A = { HH }
Since the coins are fair,
P(HH) = P(HT) = P(TH)
Treating B as the sample space, from the second and the third axioms we can deduce that
P(HH) + P(HT) + P(TH) = 1
So,
P(HH) = P(HT) = P(TH) = 1/3
Hence,

P({HH} | B) = 1/3

P(A | B) = 1/3
What is the probability of an event A ∩ B in the sample space Ω?
A ∩ B = { HH }
For the sample space Ω, P(HH) + P(HT) + P(TH) + P(TT) = 1
so, P(HH) = P(HT) = P(TH) = P(TT) = 1/4, which implies that P(A ∩ B) = 1/4
and,
P({HH, HT, TH}) = P(HH) + P(HT) + P(TH) = 3/4
hence, P(B) = 3/4
Finally,

P(A ∩ B) / P(B) = (1/4) / (3/4) = 1/3 = P(A | B)
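The same verification can be done by enumerating the sample space. A sketch, with fair coins modelled as equally likely outcomes:

```python
from itertools import product

omega = list(product("HT", repeat=2))        # {HH, HT, TH, TT}
A = {o for o in omega if o == ("H", "H")}    # both coins show heads
B = {o for o in omega if "H" in o}           # at least one coin shows heads

P = lambda E: len(E) / len(omega)            # fair coins: equally likely outcomes

assert abs(P(A & B) / P(B) - 1 / 3) < 1e-12  # P(A|B) = P(A ∩ B) / P(B) = 1/3
```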

4.2 Independent events


Two events A and B are said to be independent if the occurrence of one does not affect the (probabil-
ity or odds of) occurrence of the other.
The above statement implies that P( A| B) = P( A) (and also, P( B| A) = P( B)). The following
relationship is satisfied

P(A | B) = P(A ∩ B) / P(B) = P(A)

P(A ∩ B) = P(A) P(B)
The above result implies that two events A and B are independent if and only if the probability of
their joint occurrence is equal to the product of their probabilities.
The term P(A ∩ B) gives the probability that both events A and B occur; it is called the joint
probability and is also represented by P(A, B).
Generally, n events E_1, E_2, . . . , E_n are independent if and only if P(E_1, E_2, . . . , E_n) = P(E_1) P(E_2) . . . P(E_n).
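Independence can be checked on the two-coin example: the events “the first coin shows heads” and “the second coin shows heads” factorize as required (these two events are chosen here for illustration):

```python
from itertools import product

omega = list(product("HT", repeat=2))
A = {o for o in omega if o[0] == "H"}   # first coin shows heads
B = {o for o in omega if o[1] == "H"}   # second coin shows heads
P = lambda E: len(E) / len(omega)       # fair coins: equally likely outcomes

assert P(A & B) == P(A) * P(B)          # 1/4 = 1/2 * 1/2: A and B are independent
```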

4.3 Total probability


Suppose n mutually exclusive events A_1, A_2, A_3, . . . , A_n occur in an event space F, such that
A_i ∩ A_j = ∅ for i ≠ j, and ∪_{i=1}^n A_i = Ω
For another event B in F,
the events B ∩ A_1, B ∩ A_2, . . . are mutually exclusive. So,

P((B ∩ A_1) ∪ (B ∩ A_2) ∪ . . .) = P(B ∩ A_1) + P(B ∩ A_2) + . . .

From set theory we know that (B ∩ A_1) ∪ (B ∩ A_2) ∪ (B ∩ A_3) ∪ . . . = B ∩ (A_1 ∪ A_2 ∪ A_3 ∪ . . .).

P(B ∩ (∪_{i=1}^n A_i)) = ∑_{i=1}^n P(B ∩ A_i)

P(B) = ∑_{i=1}^n P(B ∩ A_i)

We know that P(B ∩ A_i) = P(B | A_i) P(A_i). Hence,

P(B) = ∑_{i=1}^n P(B | A_i) P(A_i)   (17)

The above relationship is called the law of total probability.
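A numerical sketch of Equation 17 with a two-event partition; the numbers 0.6, 0.4, 0.9, 0.5 are invented for illustration:

```python
# A1, A2 partition the sample space; B is some event of interest.
P_A = {"A1": 0.6, "A2": 0.4}          # P(A1), P(A2): probabilities of the partition
P_B_given_A = {"A1": 0.9, "A2": 0.5}  # P(B|A1), P(B|A2)

# Law of total probability: P(B) = sum_i P(B|Ai) P(Ai).
P_B = sum(P_B_given_A[a] * P_A[a] for a in P_A)

assert abs(P_B - 0.74) < 1e-12   # 0.9*0.6 + 0.5*0.4
```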

4.4 Bayes’ theorem


Suppose two mutually exclusive and exhaustive events A_1 and A_2 occur in an event space F such
that
A_1 ∪ A_2 = Ω and A_1 ∩ A_2 = ∅
For an event B in F we can say that:

P ( B ∩ A1 ) = P ( B | A1 ) P ( A1 ) = P ( A1 | B ) P ( B )

Similarly:

P ( B ∩ A2 ) = P ( B | A2 ) P ( A2 ) = P ( A2 | B ) P ( B )
From the above equations we can derive the following:

P(A_1 | B) = P(B | A_1) P(A_1) / P(B)
And, from the law of total probability we know that,

P ( B ) = P ( B | A1 ) P ( A1 ) + P ( B | A2 ) P ( A2 )

Hence,

P(A_1 | B) = P(B | A_1) P(A_1) / P(B) = P(B | A_1) P(A_1) / [P(B | A_1) P(A_1) + P(B | A_2) P(A_2)]
The above equation is Bayes’ rule.
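Bayes’ rule in this two-event form is a few lines of arithmetic; the numbers below are hypothetical:

```python
P_A1, P_A2 = 0.6, 0.4        # priors: P(A1) + P(A2) = 1
P_B_A1, P_B_A2 = 0.9, 0.5    # likelihoods: P(B|A1), P(B|A2)

P_B = P_B_A1 * P_A1 + P_B_A2 * P_A2   # denominator, via the law of total probability
P_A1_B = P_B_A1 * P_A1 / P_B          # Bayes' rule: P(A1|B)
P_A2_B = P_B_A2 * P_A2 / P_B          # Bayes' rule: P(A2|B)

assert abs(P_A1_B - 0.54 / 0.74) < 1e-12
assert abs(P_A1_B + P_A2_B - 1.0) < 1e-12   # posteriors over the partition sum to 1
```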
Let us talk about the variables that assign values to the outcomes of an underlying generative
process (random event).

5 Using Bayes’ theorem for statistical inference

Suppose that an outcome x observed in an experiment is assumed to come from a normal distribution,
such that

f(x; μ, σ²) = [1 / (σ√(2π))] e^(−(x−μ)² / (2σ²))

where f(x) is the probability density function; f(x) assigns the probability density value to the
outcome x conditional on the parameters mean μ and variance σ² of the normal distribution. The
probability density of x conditional on μ and σ² can be written as

p(x | μ, σ²) = [1 / (σ√(2π))] e^(−(x−μ)² / (2σ²))
The goal of statistical inference is to figure out what value(s) of μ and σ² have generated the observed
outcome x.
We know the probability density of obtaining x given μ and σ²; can we calculate the probability
density of (a range of) values of μ and σ² conditional on the observed outcome x?
p(μ, σ² | x) = ?
Using Bayes’ theorem,

p(μ, σ² | x) = p(x | μ, σ²) · p(μ, σ²) / ∫∫ p(x | μ, σ²) · p(μ, σ²) dμ dσ²

More generally, suppose the observed outcome x is assumed to be a value of the random variable X
whose probability density function is f ( x; θ ); f ( x; θ ) assigns a probability density value to x condi-
tional on a parameter θ. The probability density of x given the parameter θ is given by p( x |θ ).
Our goal is to infer what value(s) of the parameter θ have generated the given (observed) datapoint x.

p(θ | x) = p(x | θ) · p(θ) / ∫ p(x | θ) · p(θ) dθ   (18)
The term p( x |θ ) is called the likelihood function, p(θ ) is called the prior distribution of θ, and
p(θ | x ) is called the posterior distribution of θ.

Note: When f ( x; θ ) is seen as a function of x, it is called a probability density function; and when
f ( x; θ ) is seen as a function of θ, it is called a likelihood function, also denoted by L(θ | x ).
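Equation 18 can be approximated on a grid. The sketch below infers the mean μ of a normal with known σ from a single datapoint; the grid bounds, the flat prior, and the values x = 2.0, σ = 1.0 are all hypothetical choices:

```python
import math

x, sigma = 2.0, 1.0   # one observed datapoint, known standard deviation

def likelihood(mu):
    """p(x | mu, sigma): normal density of the datapoint given the mean."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

d_mu = 0.001
grid = [-6 + i * d_mu for i in range(16001)]   # candidate values of mu in [-6, 10]
prior = lambda mu: 1 / 16                      # flat prior density over the grid

# Discretized Equation 18: normalize likelihood * prior by the evidence integral.
unnorm = [likelihood(mu) * prior(mu) for mu in grid]
evidence = sum(u * d_mu for u in unnorm)       # the integral in the denominator
posterior = [u / evidence for u in unnorm]

assert abs(sum(p * d_mu for p in posterior) - 1.0) < 1e-9   # a proper density
peak = grid[max(range(len(grid)), key=lambda i: posterior[i])]
assert abs(peak - x) < 1e-6   # with a flat prior, the posterior peaks at x
```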
