
Statistical Inference III

(Introduction to Bayesian Inference)

Mohammad Samsul Alam


Assistant Professor of Applied Statistics
Institute of Statistical Research and Training (ISRT)
University of Dhaka

https://www.isrt.ac.bd/people/msalam



Introduction I

The three main paradigms of statistical inference are Frequentist, Fisherian and Bayesian.

The frequentist approach is based on the idea of repeated sampling (the sampling distribution), whereas the Fisherian approach is based on the likelihood function and the Bayesian approach is based on the Bayes theorem.

However, in the frequentist and Fisherian approaches the parameter of interest θ is assumed to be a fixed but unknown quantity, whereas in the Bayesian approach θ is treated as a random variable.

In Bayesian inference, the assumption that θ is random means that θ is a realization of a random variable which takes values in Θ, the parameter space, according to some probability mechanism.



Introduction II

In the Bayesian approach, any inferential problem is dealt with using the Bayes theorem, which relies on the concept of conditional probability.

Treating θ as a random variable, Bayesian inference tries to find a probability distribution for θ by exploiting the observed data.

This probability distribution is then used to draw conclusions about θ, in the form of estimation or hypothesis testing.



Conditional Probability I

If we know that one event has occurred, does that affect the probability that another event has occurred? To answer this, we need to look at conditional probability.

Suppose we are told that the event A has occurred. Everything outside of A is no longer possible, so we only have to consider outcomes inside the event A.

The given condition implies that the universe U reduces to A; that is, under the given condition, U = A.

Therefore, the only part of the event B that is now relevant is the part that is also in A, that is, B ∩ A.

Given that the event A has occurred, the total probability in the reduced universe must equal 1.



Conditional Probability II

The probability of B given A is the unconditional probability of that part of B that is also in A, multiplied by the scale factor 1/P(A).

This gives the conditional probability of event B given event A:

P(B|A) = P(A ∩ B) / P(A).

Thus the conditional probability P(B|A) is proportional to the joint probability P(A ∩ B), but has been rescaled so that the probability of the reduced universe equals 1.
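A quick numerical check of this definition, given here as a sketch in Python with events chosen purely for illustration (a fair die, A = "even outcome", B = "outcome greater than 3"):

from fractions import Fraction

U = {1, 2, 3, 4, 5, 6}        # universe of equally likely outcomes
A = {2, 4, 6}                 # conditioning event: even outcome
B = {4, 5, 6}                 # event of interest: outcome greater than 3

def P(E):
    # probability of an event under equally likely outcomes
    return Fraction(len(E), len(U))

print(P(A & B) / P(A))        # P(B|A) = (2/6) / (3/6) = 2/3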



Conditional Probability III

[Figure: Venn diagram of the reduced universe A, with the region A ∩ B shaded.]



Bayes Theorem I

Let B1, B2, ..., Bm be events that partition the sample space Ω (i.e. Ω = B1 ∪ B2 ∪ ... ∪ Bm and Bi ∩ Bj = ∅ when i ≠ j), and let A be an event on the space Ω for which P(A) > 0.

Moreover, the event A can be written as A = (B1 ∩ A) ∪ (B2 ∩ A) ∪ ... ∪ (Bm ∩ A).

In this setting, suppose the conditional probabilities P(A|B1), P(A|B2), ..., P(A|Bm) and the marginal probabilities P(Bi), for all i, are known to us.



Bayes Theorem II

Then, from the definition of conditional probability, we can write

P(Bi|A) = P(Bi ∩ A) / P(A) = P(A|Bi) P(Bi) / P(A).

From the definition of the event A we can write

P(A) = P(∪i (Bi ∩ A)) = Σ_{i=1}^{m} P(Bi ∩ A) = Σ_{i=1}^{m} P(A|Bi) P(Bi).



Bayes Theorem III

[Figure: the sample space Ω partitioned into the events B1, B2, ..., Bm.]

Replacing this quantity, we have

P(Bi|A) = P(A|Bi) P(Bi) / Σ_{j=1}^{m} P(A|Bj) P(Bj).    (1)

The result in equation (1) is known as Bayes theorem.
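As a small numerical sketch of equation (1) in Python (the events and all probabilities below are made up for illustration): three machines B1, B2, B3 partition production, and A is the event that a randomly chosen item is defective.

prior = {"B1": 0.5, "B2": 0.3, "B3": 0.2}     # P(Bi): share of production from each machine
lik = {"B1": 0.01, "B2": 0.02, "B3": 0.05}    # P(A|Bi): defect rate of each machine

p_a = sum(lik[b] * prior[b] for b in prior)   # P(A) by the law of total probability
posterior = {b: lik[b] * prior[b] / p_a for b in prior}   # P(Bi|A) from equation (1)
print(p_a, posterior)                         # e.g. P(B3|A) is about 0.476 even though P(B3) = 0.2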



Link Between Bayes Theorem and Bayesian Inference I

Suppose that, for our inferential problem, there is an observable quantity y and an unobservable quantity θ, where the probability model for y depends on θ.

Moreover, assume that the unobserved quantity θ can take values in Θ.

Further, we assume that the observed quantity is a realization of the variable of interest Y, whose probability model is defined as f(y|θ).

In addition, we have a belief regarding θ, which we express as a probability model P(θ). Note that this is our belief regarding θ prior to having any knowledge of the observable quantity Y.



Link Between Bayes Theorem and Bayesian Inference II
Now, after observing the quantity Y , we can update our prior
belief P (θ) regarding θ using the Bayes theorem stated in
equation (1) as

P(θ|y) = f(y|θ) P(θ) / ∫_Θ f(y|θ) P(θ) dθ,    (2)

where P(θ|y) is called our belief about θ after observing y.


Formally, in Bayesian inference, the quantities used in equation (2) have the following names:

f(y|θ) is the likelihood function,
P(θ) is the prior distribution,
P(θ|y) is the posterior distribution.

In short, Bayesian inference updates our prior belief through the data that we observe.
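A minimal numerical sketch of equation (2) in Python, for the case where θ can take only a few values so that the integral in the denominator becomes a sum (the candidate values, the prior weights and the Bernoulli data below are all made up for illustration):

candidates = [0.2, 0.5, 0.8]     # possible values of theta
prior = [1/3, 1/3, 1/3]          # prior belief P(theta)
y = [1, 1, 0, 1]                 # observed data, assumed Bernoulli(theta)

def f(data, theta):
    # sampling model f(y|theta): independent Bernoulli trials
    p = 1.0
    for yi in data:
        p *= theta**yi * (1 - theta)**(1 - yi)
    return p

numer = [f(y, t) * p for t, p in zip(candidates, prior)]
denom = sum(numer)                       # plays the role of the integral in (2)
posterior = [v / denom for v in numer]   # P(theta|y)
print(dict(zip(candidates, posterior)))  # most of the posterior mass moves to theta = 0.8 and 0.5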
Link Between Bayes Theorem and Bayesian Inference III

Finally, in Bayesian inference, any (probabilistic) conclusion regarding the parameter θ is based on the posterior distribution P(θ|y).

Therefore, Bayesian inference can be summarized as follows:

Assume a model f(y|θ) for the observable phenomenon (data) Y.
Specify the prior belief P(θ) regarding θ before observing Y.
Update the prior belief to the posterior distribution P(θ|y) using the Bayes theorem.



Conditional Independence I

Suppose Y1, ..., Yn are random variables and that θ is a parameter describing the conditions under which the random variables are generated. The variables Y1, ..., Yn are conditionally independent given θ if, for every collection of n sets {A1, ..., An},

P(Y1 ∈ A1, ..., Yn ∈ An | θ) = P(Y1 ∈ A1 | θ) × ... × P(Yn ∈ An | θ).

Conditional independence ensures that

P(Yi ∈ Ai | θ, Yj ∈ Aj) = P(Yi ∈ Ai | θ),

that is, Yj gives no additional information about Yi beyond that contained in knowing θ.



Conditional Independence II

In general, under conditional independence the joint density is given by

P(y1, ..., yn | θ) = P_{Y1}(y1|θ) × ... × P_{Yn}(yn|θ) = Π_{i=1}^{n} P_{Yi}(yi|θ).

However, if Y1, ..., Yn are generated in similar ways from a common process, the marginal densities are all equal to some common density, giving

P(y1, ..., yn | θ) = Π_{i=1}^{n} P(yi|θ).



Exchangeability I

Exchangeable
Let P(y1, y2, ..., yn) be the joint density of Y1, Y2, ..., Yn. If P(y1, y2, ..., yn) = P(yπ1, yπ2, ..., yπn) for all permutations π of {1, 2, ..., n}, then Y1, Y2, ..., Yn are exchangeable.

If θ ∼ P(θ) and Y1, Y2, ..., Yn are conditionally i.i.d. given θ, then marginally (unconditionally on θ), Y1, Y2, ..., Yn are exchangeable.



Exchangeability II

Suppose Y1, Y2, ..., Yn are conditionally i.i.d. given some unknown parameter θ. Then for any permutation π of {1, 2, ..., n} and any set of values (y1, y2, ..., yn) ∈ Y^n,

P(y1, y2, ..., yn) = ∫ P(y1, y2, ..., yn | θ) P(θ) dθ
                   = ∫ { Π_{i=1}^{n} P(yi|θ) } P(θ) dθ
                   = ∫ { Π_{i=1}^{n} P(yπi|θ) } P(θ) dθ
                   = P(yπ1, yπ2, ..., yπn).
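The following Python sketch checks this numerically for one particular model (a Beta(2, 3) prior with conditionally i.i.d. Bernoulli(θ) observations, both chosen only for illustration): the marginal joint probability is the same for every reordering of the data.

from math import lgamma, exp
from itertools import permutations

def log_beta(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal(y, a=2.0, b=3.0):
    # integral of prod_i theta^yi (1 - theta)^(1 - yi) times the Beta(a, b) density
    # = B(a + sum(y), b + n - sum(y)) / B(a, b), which depends on y only through sum(y)
    n, s = len(y), sum(y)
    return exp(log_beta(a + s, b + n - s) - log_beta(a, b))

y = (1, 0, 0, 1, 1)
print({marginal(p) for p in permutations(y)})   # a single value: the ordering does not matter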



Exchangeability III
de Finetti’s Theorem
Let Yi ∈ Y for all i ∈ {1, 2, ...}. Suppose that, for any n, our belief model for Y1, Y2, ..., Yn is exchangeable:

P(y1, y2, ..., yn) = P(yπ1, yπ2, ..., yπn)

for all permutations π of {1, 2, ..., n}. Then our model can be written as

P(y1, y2, ..., yn) = ∫ { Π_{i=1}^{n} P(yi|θ) } P(θ) dθ

for some parameter θ, some prior distribution on θ, and some sampling model P(y|θ). The prior and the sampling model depend on the form of the belief model P(y1, y2, ..., yn).



Bayesian Inference I

Let the model for the data be fY(y; θ) and let our prior belief regarding the parameter θ be P(θ).

Further assume that Y1, Y2, ..., Yn is a random sample whose elements are conditionally independent given θ. Then the model for the data, the likelihood function, can be written as

L(θ|y) = Π_{i=1}^{n} P(yi|θ),

where yi is the realized value of Yi.
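As a small sketch of evaluating this product in Python (a Poisson(θ) data model and made-up counts are assumed purely for illustration):

from math import exp, factorial

def poisson_pmf(y, theta):
    # P(y|theta) for the Poisson model
    return exp(-theta) * theta**y / factorial(y)

def likelihood(theta, data):
    # L(theta|y) = product over i of P(yi|theta), by conditional independence
    L = 1.0
    for yi in data:
        L *= poisson_pmf(yi, theta)
    return L

data = [3, 1, 4, 2, 2]
for theta in (1.0, 2.0, 3.0):
    print(theta, likelihood(theta, data))   # largest near theta = mean(data) = 2.4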



Bayesian Inference II
The posterior distribution of θ can then be computed using the
Bayes theorem as,

P(θ|y) = L(θ|y) P(θ) / P(y),    (3)

where P(y) is the marginal probability of observing the data y.

Note that, for an observed sample y = {y1, y2, ..., yn}, the quantity P(y) in the denominator of equation (3) is a constant. Therefore, we can write

P(θ|y) ∝ L(θ|y) P(θ)    (4)

Posterior ∝ Likelihood × Prior    (5)
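A sketch in Python of turning the kernel in (4) into the posterior in (3) by rescaling on a grid (a Normal(θ, 1) data model with a Normal(5, 2²) prior and made-up data are assumed only for illustration):

import numpy as np

y = np.array([4.8, 5.6, 5.1, 4.3, 5.9])                # observed data
theta = np.linspace(0.0, 10.0, 1001)                   # grid over the parameter space
prior = np.exp(-0.5 * (theta - 5.0)**2 / 4.0)          # Normal(5, 2^2) prior, up to a constant
loglik = -0.5 * ((y[None, :] - theta[:, None])**2).sum(axis=1)   # Normal(theta, 1) log-likelihood, up to a constant
kernel = np.exp(loglik - loglik.max()) * prior         # likelihood x prior: the posterior kernel
dtheta = theta[1] - theta[0]
posterior = kernel / (kernel.sum() * dtheta)           # rescaled so it integrates to 1
print(theta[np.argmax(posterior)])                     # posterior mode, close to the sample mean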



Bayesian Inference III
As a result, we can say that the product of the likelihood and the prior is sufficient to capture the kernel of the posterior distribution.

Therefore, in Bayesian inference, we need to assume a data model and specify a prior distribution in order to draw inference.

The prior can be either noninformative or informative.

A prior that assigns equal probability to each element of the parameter space is called a noninformative prior. With such a prior we give equal preference to every value in the parameter space, because we have no information that would favour some values over others.



Bayesian Inference IV
On the other hand, with an informative prior we have specific information regarding θ before the data collection, and therefore assign different probabilities to different values in the parameter space.

There is another class of priors called conjugate priors. Conjugate priors are defined with respect to a data model (likelihood): for a given data model, the prior for which the posterior distribution has the same form as the prior is called the conjugate prior for that data model.
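For instance (a standard textbook case, sketched here in Python with made-up numbers): for a Binomial data model with k successes in n trials, a Beta(a, b) prior is conjugate, and the posterior is again a Beta distribution.

a, b = 2.0, 2.0                        # Beta(a, b) prior for theta
n, k = 20, 13                          # data: k successes in n Bernoulli trials
a_post, b_post = a + k, b + n - k      # conjugacy: posterior is Beta(a + k, b + n - k)
print("posterior: Beta(%g, %g)" % (a_post, b_post))
print("posterior mean:", a_post / (a_post + b_post))   # (2 + 13) / (4 + 20) = 0.625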
Note that, although we cannot assume an improper distribution (a distribution whose total probability does not sum or integrate to 1) as the data model, for the prior both proper and improper distributions can be assumed.



Bayesian Inference V

Whatever the prior distribution is (proper or improper), the posterior distribution used for inference must be a proper distribution; when an improper prior is used, this has to be checked, since the posterior is proper only if the product of the likelihood and the prior integrates to a finite value.

There is another kind of prior known as the Jeffreys prior. This kind of prior is used so that the prior remains noninformative when a reparameterization is made.
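As a standard worked example (not derived in these slides): for the Bernoulli model, the Fisher information is I(θ) = E[(∂ log f(Y|θ)/∂θ)²] = 1/{θ(1 − θ)}, so the Jeffreys prior is P(θ) ∝ √I(θ) = θ^(−1/2) (1 − θ)^(−1/2), which is the Beta(1/2, 1/2) distribution; because it is defined through √I(θ), it transforms consistently under any reparameterization of θ.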



Summary of the Posterior Distribution I

The posterior probability distribution contains all the current information about the parameter θ.

One way to summarize the posterior distribution is a graphical presentation: a plot of the entire posterior density.

For many practical purposes, however, numerical summaries are useful. One such summary of the posterior distribution is the posterior mean, which is defined as

E(θ|y) = ∫_Θ θ P(θ|y) dθ.



Summary of the Posterior Distribution II
Another useful summary is the posterior variance, which is defined as

V(θ|y) = E[{θ − E(θ|y)}² | y]
       = ∫_Θ {θ − E(θ|y)}² P(θ|y) dθ
       = ∫_Θ θ² P(θ|y) dθ − 2 E(θ|y) ∫_Θ θ P(θ|y) dθ + {E(θ|y)}² ∫_Θ P(θ|y) dθ
       = E(θ²|y) − 2{E(θ|y)}² + {E(θ|y)}²
       = E(θ²|y) − {E(θ|y)}².

Note that, if the posterior distribution is discrete, the ∫ sign is replaced by Σ.
Summary of the Posterior Distribution III

Like the posterior mean and variance, posterior quantiles are also useful in summarizing the posterior distribution.
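A sketch in Python of computing these summaries numerically on a grid (the Beta(15, 9) posterior from the conjugate-prior sketch above is reused purely for illustration):

import numpy as np

theta = np.linspace(0.0, 1.0, 10001)
kernel = theta**14 * (1 - theta)**8        # Beta(15, 9) kernel: theta^(a-1) (1-theta)^(b-1)
d = theta[1] - theta[0]
post = kernel / (kernel.sum() * d)         # normalized posterior density on the grid

post_mean = np.sum(theta * post) * d                     # E(theta|y), about 15/24 = 0.625
post_var = np.sum(theta**2 * post) * d - post_mean**2    # E(theta^2|y) - {E(theta|y)}^2
cdf = np.cumsum(post) * d                                # posterior cdf on the grid
q_lo, q_med, q_hi = np.interp([0.025, 0.5, 0.975], cdf, theta)   # posterior quantiles
print(post_mean, post_var, (q_lo, q_med, q_hi))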

