
Bayesian Statistics

Prior

Shaobo Jin
Department of Mathematics


Prior Distribution

The main difference between a frequentist model and a Bayesian model
is that, in the Bayesian model, the parameter of the data-generating
distribution is random and follows a known distribution (the prior
distribution). The parameters of a prior distribution are called the
hyperparameters.
1 A subjective prior incorporates our prior knowledge.
2 An objective prior fulfills some desired (theoretical) properties.

It is in general very difficult to specify an exact prior distribution,
and most critiques of Bayesian methods concern the specification of the
prior distribution.


Subjective Prior: Expert Advice

Example
Suppose that we are interested in the effectiveness θ ∈ [0, 1] of a
vaccine.
An expert expects an 80% decrease in the number of disease cases
among the group of vaccinated people compared to the non-vaccinated
group.
Suppose that we would like to use a Beta(a, b) prior.
The hyperparameters can be set such that the expectation of the
beta distribution, a/(a + b), is close to 80%.
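One way to translate the expert's statement into hyperparameters is sketched below; the prior strength a + b = 10 is an arbitrary illustrative choice, not part of the lecture.

```python
from scipy import stats

# Hypothetical illustration: match the expert's 80% expectation.
# The prior "sample size" a + b controls how concentrated the prior is;
# the value 10 below is an arbitrary choice, not from the lecture.
prior_mean = 0.80
prior_strength = 10                   # a + b
a = prior_mean * prior_strength       # 8.0
b = prior_strength - a                # 2.0

prior = stats.beta(a, b)
print(prior.mean())                   # 0.8
print(prior.interval(0.95))           # central 95% prior interval for theta
```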


Subjective Prior: Previous Experiences

Example
Suppose that we want to predict the number of cups of coffee sold
during the midsummer celebration.
Suppose that the sales records from previous years show that the
number ranges between 600 and 800 cups.
We can choose a prior distribution that places the majority of its
mass within, or close to, this range.
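For instance (an illustrative choice, not from the lecture), a Normal(700, 50²) prior places about 95% of its mass in the 600-800 range:

```python
from scipy import stats

# Hypothetical illustration: a Normal(700, 50^2) prior puts most of its mass
# in the historical 600-800 range (the values are assumptions for the sketch).
prior = stats.norm(loc=700, scale=50)
mass_in_range = prior.cdf(800) - prior.cdf(600)
print(mass_in_range)   # about 0.95
```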


Mixture Prior Distribution: Example

Example
Suppose that we are interested in the temperature θ at the midsummer
celebration.
One expert guesses that the temperature is around 22°C, and
another expert guesses 10°C.
One option is to specify the prior for the temperature as the mixture

w N(22, σ1²) + (1 − w) N(10, σ2²).


Conjugate Prior

Definition
Let F be a family of probability distributions on Θ. If π(·) ∈ F and
π(· | x) ∈ F for every x, then the family of distributions F is
conjugate. A prior distribution that is an element of a conjugate
family is called a conjugate prior.

The main benefit of a conjugate prior is tractability: we only
need to update the hyperparameters, without changing the family of
distributions. This makes Bayesian computation much easier, as the
sketch below illustrates.
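A minimal sketch of such a hyperparameter update, assuming Bernoulli data with a Beta prior (the numbers are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Sketch of conjugacy in action: with a Beta(a0, b0) prior and Bernoulli
# data, the posterior stays in the Beta family; only the hyperparameters
# change (the values below are assumptions for illustration).
a0, b0 = 2.0, 2.0
theta_true = 0.7
x = rng.binomial(1, theta_true, size=50)

a_post = a0 + x.sum()                  # a0 + number of successes
b_post = b0 + len(x) - x.sum()         # b0 + number of failures

posterior = stats.beta(a_post, b_post)
print(posterior.mean(), posterior.interval(0.95))
```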


Conjugate Prior: Example

Example
1 Suppose that we have an iid sample Xi | θ ∼ Bernoulli(θ). Show
  that θ ∼ Beta(a0, b0) is conjugate.
2 Suppose that we have an iid sample Xi | µ ∼ N(µ, σ²), i = 1, ..., n,
  where σ² is known. Show that µ ∼ N(µ0, σ0²) is conjugate.
3 Suppose that we have an iid sample Xi | µ, σ² ∼ N(µ, σ²),
  i = 1, ..., n. Show that µ | σ² ∼ N(µ0, σ²/λ0) and
  σ² ∼ InvGamma(a0, b0) form a conjugate prior, where

  π(σ²) = b0^{a0} / Γ(a0) · (σ²)^{−(a0+1)} exp{ −b0 / σ² }.


Find Conjugate Prior

The likelihood f(x | θ) entirely determines the class of conjugate priors.

Example
Find the conjugate prior.
1 Suppose that we have an iid sample Xi | θ ∼ Poisson(θ) (a sketch for
  this case follows below).
2 Suppose that we have an iid sample
  Xi | θ ∼ Multinomial(m, θ1, ..., θk).
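A sketch of the Poisson case, assuming the Gamma(a0, b0) prior in its shape-rate parameterization:

```latex
% Sketch: conjugate prior for an iid Poisson(theta) sample.
% The Gamma(a_0, b_0) prior (shape-rate parameterization) is assumed.
\begin{align*}
f(x_{1:n} \mid \theta)
  &= \prod_{i=1}^{n} \frac{\theta^{x_i} e^{-\theta}}{x_i!}
   \;\propto\; \theta^{\sum_i x_i} e^{-n\theta}, \\
\pi(\theta)
  &\propto \theta^{a_0 - 1} e^{-b_0 \theta}, \\
\pi(\theta \mid x_{1:n})
  &\propto \theta^{a_0 + \sum_i x_i - 1} e^{-(b_0 + n)\theta},
\end{align*}
% i.e., the posterior is Gamma(a_0 + sum_i x_i, b_0 + n), so the Gamma
% family is conjugate for the Poisson likelihood.
```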


Exponential Family
Definition
A class of probability distributions P = {Pθ : θ ∈ Θ} is called an
exponential family if there exist a number k ∈ N, real-valued functions
A, ζ1, ..., ζk on Θ, real-valued statistics T1, ..., Tk, and a function h on
the sample space X such that

f(x | θ) = A(θ) exp{ ∑_{j=1}^{k} ζj(θ) Tj(x) } h(x),

where A(θ) > 0 depends only on θ and h(x) ≥ 0 depends only on x.

We often collect the real-valued functions and statistics into vectors
ζ(θ) = (ζ1(θ), ..., ζk(θ))^T and T(x) = (T1(x), ..., Tk(x))^T.

Exponential Family: Example

Example (Normal distribution)
Normal distribution with θ = (µ, σ²):

f(x | µ, σ²) = 1/√(2πσ²) · exp{ −µ²/(2σ²) } · exp{ −x²/(2σ²) + xµ/σ² }.

Example (Binomial distribution)
Binomial distribution:

P(X = x | θ) = (n choose x) exp{ x log(θ) + (n − x) log(1 − θ) }.


Exponential Family: Counterexample

1 Exponential distribution:
  f(x | θ) = θ exp{−θx}, x ≥ 0
           = θ exp{−θx} 1(x ≥ 0),
  where 1(·) is the indicator function.

2 Shifted exponential distribution with θ = (λ, µ):
  f(x | θ) = λ exp{−λ(x − µ)}, x ≥ µ
           = λ exp{λµ} exp{−λx} 1(x ≥ µ).


Natural Parameter
We can parameterize the probability function as

f(x | ζ) = C(ζ) exp{ ∑_{j=1}^{k} ζj Tj(x) } h(x),

where ζ is called the natural parameter.

Example (Binomial distribution)
For θ ∈ (0, 1),

f(x | θ) = (n choose x) (1 − θ)^n exp{ x log( θ/(1 − θ) ) }.

Define ζ = log( θ/(1 − θ) ) ∈ R. Then,

f(x | ζ) = ( 1 − exp(ζ)/(1 + exp(ζ)) )^n (n choose x) exp{xζ}.

Conjugate Prior for Exponential Family

Theorem
Suppose that

f(x | ζ) = exp{ ∑_{j=1}^{k} ζj Tj(x) + log C(ζ) } h(x).

Then the conjugate family for ζ is given by

π(ζ) = K(µ0, λ0) exp{ ζ^T µ0 + λ0 log C(ζ) },

where µ0 and λ0 are hyperparameters. The posterior satisfies

π(ζ | x) ∝ exp{ ζ^T [µ0 + T(x)] + (λ0 + 1) log C(ζ) }.





Conjugate Prior: Example

Example
Use the exponential family representation for the following examples.
1 Let X1, ..., Xn be an iid sample from N(θ, σ²), where σ² is known.
  Show that θ ∼ N(µ0, σ0²) is conjugate.
2 Let X1, ..., Xn be an iid sample from Bernoulli(θ). Show that
  θ ∼ Beta(a, b) is conjugate.
3 Suppose that {Yi}_{i=1}^{n} are independent observations such that

  P(Yi = 1 | Xi = xi) = exp( yi xi^T θ ) / ( 1 + exp( xi^T θ ) ).

  Find the conjugate prior for θ.


No Prior Information

When no prior information is available, we still need to specify a prior
in order to use Bayesian modelling.

Definition
The Laplace prior is a prior π(θ) that is constant over Θ.

Disambiguation: the name Laplace prior is also often used for a prior
under which θ follows a Laplace distribution,

π(θ) = 1/(2b0) exp{ −|θ − a0| / b0 }, −∞ < θ < ∞.

The prior in the above definition is instead often referred to as the
flat prior or the uniform prior, among other names.


Uniform Prior as Non-Informative Prior


Intuitively speaking, a constant π(θ) means that we treat all θ equally.
The posterior then depends only on the likelihood.

For a distribution P with density p, its entropy is

S(P) = −E[ log p(X) ].

The entropy is often called the Shannon entropy if the random variable
is discrete and the differential entropy if the random variable is
continuous.

Example
Find the entropy of the following distributions.
1 X ∼ N(0, σ²).
2 X is uniform on the finite discrete set {1, 2, ..., n}.
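The closed forms are (1/2) log(2πeσ²) and log n; a quick numerical check (not part of the slides):

```python
import numpy as np
from scipy import stats

# Sketch checking the two entropy examples numerically (natural log).
sigma = 2.0
normal_entropy = stats.norm(0, sigma).entropy()        # differential entropy
print(normal_entropy, 0.5 * np.log(2 * np.pi * np.e * sigma**2))  # both ~2.112

n = 6
p = np.full(n, 1 / n)
uniform_entropy = -(p * np.log(p)).sum()               # Shannon entropy
print(uniform_entropy, np.log(n))                      # both ~1.792
```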


Uniform Distribution Maximizes Entropy

The entropy of a random variable measures its uncertainty.

If a random variable puts the majority of its probability mass on one
value, then the uncertainty is small.
If the possible values of a random variable are roughly equally likely,
then the uncertainty is large.

Example
1 Suppose that X is a discrete random variable with a finite sample
  space {1, 2, ..., n}. Show that the discrete uniform distribution
  maximizes the Shannon entropy.
2 Suppose that X is a continuous random variable with a closed
  sample space [a, b]. Show that the continuous uniform distribution
  maximizes the differential entropy.


Improper Prior

The uniform prior is proportional to the density of a probability measure
if the parameter space Θ is bounded.

However, in many cases the prior is not a probability measure. Instead it
yields

∫_Θ π(θ) dθ = ∞.

Such a prior is said to be an improper prior.

The uniform prior is an improper prior if Θ is not bounded.
But as long as the posterior distribution is well defined, Bayesian
methods still apply.


Improper Posterior
One risk of using an improper prior is that the posterior can be undefined.

Example
Let X ∼ Binomial(n, θ) and π(θ) ∝ 1/(θ(1 − θ)).
The posterior satisfies

π(θ | x) ∝ θ^x (1 − θ)^{n−x} · 1/(θ(1 − θ)) = θ^{x−1} (1 − θ)^{n−x−1},

which is not integrable over (0, 1) when x = 0 or x = n, so the posterior
is not defined in those cases.

In order to have a well-defined posterior, we need

∫ f(x | θ) π(θ) dθ < ∞.

But this may not be an easy condition to check.
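A small numerical illustration, with assumed values n = 5 and x = 0, of why the normalizing constant fails to exist:

```python
import numpy as np
from scipy import integrate

# Sketch (assumed values n = 5, x = 0): the unnormalized posterior
# theta^(x-1) (1-theta)^(n-x-1) behaves like 1/theta near 0, so its
# integral grows without bound as the lower limit shrinks.
n, x = 5, 0
kernel = lambda t: t**(x - 1) * (1 - t)**(n - x - 1)

for eps in (1e-3, 1e-6, 1e-9):
    val, _ = integrate.quad(kernel, eps, 1.0)
    print(eps, val)   # roughly -log(eps) + const, i.e. diverging
```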



Marginalization Paradox
Since an improper prior is not a probability density, the posterior, even
if it exists, may not follow the rules of probability. One example is the
marginalization paradox.

Consider a model f(x | α, β) and a prior π(α, β). Suppose that
the marginal posterior π(α | x) satisfies

π(α | x) = π(α | z(x))

for some function z(x).

Suppose that f(z | α, β) = f(z | α), that is, the density of z does not
depend on β.
If π(α, β) is a proper prior, we can recover π(α | x) from f(z | α)
and some π(α) as π(α | x) ∝ f(z | α) π(α).
However, if π(α, β) is not a proper prior, it can happen that
f(z | α) π(α) is not proportional to π(α | x) for any π(α).


Marginalization Paradox: Example

Example
Let X1, ..., Xn be independent exponential random variables. The first
m have mean η^{−1} and the rest have mean (cη)^{−1}, where c ≠ 1 is a
known constant and m ∈ {1, ..., n − 1}.

We consider the improper prior π(η) = 1, such that
π(η, m) = π(η) π(m) = π(m).
The marginal posterior distribution satisfies

π(m | x) ∝ c^{n−m} π(m) / ( ∑_{i=1}^{m} zi + c ∑_{i=m+1}^{n} zi )^{n+1},

where zi = xi / x1. Hence, the marginal posterior depends only on
z = (z2, ..., zn), since z1 = 1.


Marginalization Paradox: Example

Example

π(m | x) ∝ c^{n−m} π(m) / ( ∑_{i=1}^{m} zi + c ∑_{i=m+1}^{n} zi )^{n+1}.

The density of z is

f(z | η, m) = c^{n−m} Γ(n) / ( ∑_{i=1}^{m} zi + c ∑_{i=m+1}^{n} zi )^{n} ≡ f(z | m),

which depends only on m, not on η.

However, it is not possible to find a π*(m) such that

π(m | x) ∝ f(z | m) π*(m).


Invariance?

Another issue with the uniform prior is that it is not invariant under
reparametrization.

Suppose that we choose the uniform prior for θ ∈ Θ.
Now we reparameterize to η = η(θ), which is one-to-one, such that
θ = h(η). Then,

πη(η) = πθ(h(η)) | det( ∂h(η) / ∂η^T ) |,

which is in general not constant.

A constant prior on θ does not always yield a constant prior on η(θ),
even when η is a strictly monotone transformation.


Invariance: Example

Example (Binomial distribution)
Suppose that X | θ ∼ Binomial(n, θ). We have no information
regarding θ, hence we let θ ∼ Uniform(0, 1).

Consider the odds ζ = θ/(1 − θ), so that θ = h(ζ) = ζ/(1 + ζ).
By change of variables,

πζ(ζ) = πθ(h(ζ)) | d h(ζ)/dζ | = | d/dζ [ ζ/(1 + ζ) ] | = 1/(1 + ζ)².

Further, θ ∼ Uniform(0, 1) is the same as θ ∼ Beta(1, 1). That
prior is conjugate, but the resulting prior for ζ is not.


Invariance Under Monotone Transformation


Suppose that a procedure for finding a prior yields the prior density πθ(θ)
for θ.

Let h be a smooth and monotone transformation. By the change of
variables η = η(θ) and θ = h(η), the density of η induced from
πθ(θ) is

πθ(h(η)) | dh(η)/dη |.

If we use the same procedure for finding a prior as we used for θ, it
should yield the prior density πη(η) for η.
Invariance means that these two densities should be the same:

πθ(h(η)) | dh(η)/dη | = πη(η).


Motivation: Use of Fisher Information


Suppose that P and Q are two probability measures with densities p
and q, respectively.

The Kullback-Leibler divergence is

KL(P, Q) = ∫ log( p(x)/q(x) ) p(x) dx.

We consider the symmetrized divergence

KL(Pθ, Pθ′) + KL(Pθ′, Pθ).

For nearby parameter values this divergence is approximately a quadratic
form in the Fisher information, and it does not depend on the
parametrization: if we change the parametrization such that θ = h(η),
the distance between the distributions is left unchanged,

KL(Pθ, Pθ′) + KL(Pθ′, Pθ) = KL(Pη, Pη′) + KL(Pη′, Pη).


Jeffreys Prior

Definition
Consider a statistical model f(x | θ) with Fisher information matrix
I(θ). The Jeffreys prior is

π(θ) ∝ [ det( I(θ) ) ]^{1/2}.

The Jeffreys prior is invariant to reparametrization under smooth
monotone transformations, because we can show that

πθ(h(η)) | dh(η)/dη | = πη(η).


Jeffreys Prior: Example

Example
Find the Jeffreys prior for θ.
1 Suppose that X | θ ∼ Binomial(n, θ). Show also that the Jeffreys
  prior is invariant to the transformation η = θ/(1 − θ). (A sketch of
  the Fisher information calculation follows below.)
2 Suppose that Xi | θ ∼ N(θ, 1), i = 1, ..., n.
3 Suppose that Xi | θ belongs to a location family with density
  f(xi − θ), where f(x) is a density function.
4 Suppose that Xi | θ belongs to a scale family with density
  θ^{−1} f(θ^{−1} xi), where f(x) is a density function and θ ∈ R+.
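A brief symbolic check of the first example (assuming SymPy is available); it recovers the familiar Beta(1/2, 1/2) form of the Jeffreys prior for the binomial model:

```python
import sympy as sp

# Sketch of example 1: Fisher information and Jeffreys prior for the
# Binomial(n, theta) model, computed symbolically.
theta, n, x = sp.symbols('theta n x', positive=True)

loglik = x * sp.log(theta) + (n - x) * sp.log(1 - theta)   # up to a constant
neg_hess = -sp.diff(loglik, theta, 2)                      # -d^2/dtheta^2 log f
fisher = sp.simplify(neg_hess.subs(x, n * theta))          # use E[X] = n*theta
print(fisher)                # n/(theta*(1 - theta)), possibly in equivalent form

jeffreys = sp.simplify(sp.sqrt(fisher))
print(jeffreys)              # proportional to theta^(-1/2) (1 - theta)^(-1/2),
                             # i.e. a Beta(1/2, 1/2) prior
```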



Jeffreys Prior is Non-Informative

The Jeffreys prior is derived in order to achieve invariance. It turns out
that it is also non-informative.

Under the Jeffreys prior, the posterior can be approximated by

π(θ | x) ∝ π(θ) f(x | θ)
         ≈ [ det( I(θ) ) ]^{1/2} exp{ −(1/2) (θ − θ̂)^T I(θ) (θ − θ̂) },

that is, θ | x ≈ N(θ̂, I^{−1}(θ)).

The frequentist approach yields θ̂ − θ ≈ N(0, I^{−1}(θ)).

Inference using the Jeffreys prior therefore coincides approximately with
inference based on the likelihood function.


Independent Jeffreys Prior

Example
Suppose that Xi | µ, σ² ∼ N(µ, σ²), i = 1, ..., n. Find the Jeffreys prior
for θ = (µ, σ²).

When we have multiple parameters, it is also common to use the
independent Jeffreys prior.
Obtain the Jeffreys prior for each parameter separately, fixing
the others.
Multiply the single-parameter Jeffreys priors together.

Example
Suppose that Xi | µ, σ² ∼ N(µ, σ²), i = 1, ..., n. Find the independent
Jeffreys prior for θ = (µ, σ²).


A Cautionary Note

The Jeffreys prior does not necessarily perform satisfactorily for all
inferential purposes.

Example
Suppose that we observe one observation X | θ ∼ Np(θ, I).
The Jeffreys prior is the uniform prior, and the posterior is
θ | x ∼ Np(x, I).
Suppose that we are interested in the parameter η = θ^T θ. The
posterior distribution of η is noncentral χ² with p degrees of
freedom. The posterior expected value is x^T x + p.
If we consider a quadratic loss, the risk of the alternative estimator
x^T x − p is no greater than that of x^T x + p for all θ.
This means that, whatever the true θ, we can find an estimator that is
better than the one based on the Jeffreys prior.


Reference Prior
Consider the Kullback-Leibler divergence

KL( π(θ | x), π(θ) ) = ∫ π(θ | x) log( π(θ | x) / π(θ) ) dθ ≥ 0.

A large KL means that a lot of information has come from the data.
The expected KL under the marginal of x is then

E[ KL( π(θ | x), π(θ) ) ]
  = ∫ m(x) [ ∫ π(θ | x) log( π(θ | x) / π(θ) ) dθ ] dx
  = ∫∫ f(x, θ) log( π(θ | x) / π(θ) ) dθ dx
  = ∫∫ f(x, θ) log( f(x, θ) / ( π(θ) m(x) ) ) dθ dx,

where m(x) is the marginal density of x.



Mutual Information
In probability theory, the mutual information of two random variables
X and Y is defined as

MI(X, Y) = ∫∫ f(x, y) log( f(x, y) / ( f(x) f(y) ) ) dx dy ≥ 0.

It is a measure that quantifies how much information is in f(x, y)
relative to f(x) f(y).

The expected KL on the previous slide,

E[ KL( π(θ | x), π(θ) ) ] = ∫∫ f(x, θ) log( f(x, θ) / ( π(θ) m(x) ) ) dθ dx,

is the mutual information of X and θ.

The reference prior aims to maximize this mutual information between the
data and the parameter.

Reference Prior and Entropy

Result
Let p(x) be the density of a distribution P, and let

S(P) = −E[ log p(X) ]

denote its entropy. Then,

MI(X, θ) = S( π(θ) ) − ∫ m(x) S( π(θ | x) ) dx.

Thus, a reference prior that generates a large mutual information
corresponds to a prior with large entropy and a posterior with low
expected entropy.

Reference Prior: Example


Example
Suppose that X | θ ∼ Binomial (n, θ) and we consider the class of
conjugate priors θ ∼ Beta (a, b). Find the expected KL.
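A Monte Carlo sketch of the expected KL for this example; n = 10 and the (a, b) values are assumptions for illustration, and the closed-form KL between two Beta distributions is used for the inner divergence:

```python
import numpy as np
from scipy import stats
from scipy.special import betaln, digamma

rng = np.random.default_rng(0)

def kl_beta(a1, b1, a0, b0):
    """KL divergence KL( Beta(a1, b1) || Beta(a0, b0) )."""
    return (betaln(a0, b0) - betaln(a1, b1)
            + (a1 - a0) * digamma(a1)
            + (b1 - b0) * digamma(b1)
            + (a0 - a1 + b0 - b1) * digamma(a1 + b1))

def expected_kl(a, b, n=10, n_sim=100_000):
    """Monte Carlo estimate of E[ KL(posterior, prior) ] for Binomial(n, theta)
    data with a Beta(a, b) prior; the expectation is over the marginal of x."""
    theta = rng.beta(a, b, size=n_sim)
    x = rng.binomial(n, theta)
    return kl_beta(a + x, b + n - x, a, b).mean()

# Illustrative hyperparameter values, mirroring the grid in the figure below.
for a, b in [(0.5, 0.5), (1.0, 1.0), (1.5, 1.5)]:
    print(a, b, expected_kl(a, b))
```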
[Figure: heat map of the expected KL ("ExpKL") as a function of the hyperparameters a and b, each ranging over roughly (0, 1.5].]


Explicit Form of Reference Prior


Suppose that we can replicate the experiment independently k times.
Each time we observe a data set of sample size n. Denote all
realizations by x = ( x^(1), ..., x^(k) ).

Let π*(θ) be a continuous and strictly positive function such that the
posterior π*(θ | x) is proper and asymptotically consistent. For any
interior point θ0 of Θ, define

pk(θ) = exp{ ∫ f(x | θ) log π*(θ | x) dx },
p(θ) = lim_{k→∞} pk(θ) / pk(θ0).

Suppose pk(θ) is continuous for all k. Then, under some extra
assumptions on the ratio pk(θ)/pk(θ0) and on p(θ), p(θ) is a reference
prior.

Approximate Reference Prior


Finding the reference prior is not an easy task, since the integrals can be
difficult to evaluate.

Algorithm 1: Approximate reference prior
1 Choose an arbitrary continuous and positive function π*(θ), e.g.,
  π*(θ) = 1 ;
2 for any θ of interest, including a θ0, do
3   for j from 1 to m do
4     Simulate independently x_j^(1), ..., x_j^(k) from f(x | θ) ;
5     Compute the integral c_j = ∫_Θ [ ∏_{i=1}^{k} f(x_j^(i) | θ) ] π*(θ) dθ
      analytically or approximate it numerically ;
6     Evaluate r_j(θ) = log{ [ ∏_{i=1}^{k} f(x_j^(i) | θ) ] π*(θ) / c_j } ;
7   end
8   Compute pk(θ) = exp{ m^{−1} ∑_{j=1}^{m} r_j(θ) } ;
9   Let π(θ) ∝ pk(θ) / pk(θ0) ;
10 end
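A rough Python sketch of Algorithm 1 for the binomial model with π*(θ) = 1 (N, k, m and the grids are illustrative assumptions); it should approximately recover the Jeffreys prior, consistent with the next slide:

```python
import numpy as np
from scipy import stats, integrate

rng = np.random.default_rng(1)

# Sketch of Algorithm 1 for X | theta ~ Binomial(N, theta), pi*(theta) = 1.
N, k, m = 10, 25, 100
theta_grid = np.linspace(0.05, 0.95, 19)
theta0 = 0.5
t_int = np.linspace(1e-4, 1 - 1e-4, 1000)     # integration grid over theta

def log_pk(theta):
    r = np.empty(m)
    for j in range(m):
        x = rng.binomial(N, theta, size=k)                  # x_j^(1), ..., x_j^(k)
        loglik = stats.binom.logpmf(x[:, None], N, t_int).sum(axis=0)
        c_j = integrate.trapezoid(np.exp(loglik), t_int)    # c_j = int prod f * pi*
        r[j] = stats.binom.logpmf(x, N, theta).sum() - np.log(c_j)
    return r.mean()                                         # log p_k(theta)

log_ref = np.array([log_pk(t) for t in theta_grid]) - log_pk(theta0)
ref_prior = np.exp(log_ref)                                 # pi(theta) up to scale
print(np.round(ref_prior, 3))   # roughly proportional to 1/sqrt(theta*(1-theta))
```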

Reference Prior and Jeffreys Prior: Example

Example
Suppose that X | θ ∼ Binomial(n, θ). Approximate the reference prior.

In fact, if the distribution of the MLE, √n (θ̂ − θ), can be approximated
by N(0, I^{−1}(θ)), and the posterior distribution of √n (θ − θ̂) can be
approximated by N(0, I^{−1}(θ)), then the reference prior and the joint
Jeffreys prior are asymptotically equivalent.


Reference Prior in the Presence of a Nuisance Parameter

Suppose that x | θ ∼ f(x | θ1, θ2) and θ = (θ1, θ2), where θ1 is the
parameter of interest and θ2 is the nuisance parameter. The reference
prior is obtained as follows.

First, treat θ1 as fixed. Use the Jeffreys prior associated with
f(x | θ1, θ2), viewed as a model for θ2 only, as π(θ2 | θ1).
Then, derive the marginal distribution

f(x | θ1) = ∫ f(x | θ1, θ2) π(θ2 | θ1) dθ2.

Finally, compute the Jeffreys prior π(θ1) associated with f(x | θ1),
and combine the two as π(θ1, θ2) = π(θ2 | θ1) π(θ1).


Neyman-Scott Problem: Example

Consider the Neyman-Scott problem, where Xij | θ ∼ N(µi, σ²),
i = 1, ..., n and j = 1, 2. We are interested in σ, and the µi's are
nuisance parameters.

The usual Jeffreys prior is π(θ) ∝ σ^{−n−1}. The posterior mean of
σ² is

E[σ² | x] = ∑_{i=1}^{n} (xi1 − xi2)² / (4n − 4) →P σ²/2 ≠ σ²,

where →P denotes convergence in probability.

The reference prior is π(θ) ∝ σ^{−1}. The posterior mean of σ² is

E[σ² | x] = ∑_{i=1}^{n} (xi1 − xi2)² / (2n − 4) →P σ².


Berger-Bernardo Method

The idea of deriving the prior by conditioning on a subset of the
parameters can be applied to a general setting with more than two groups
of parameters. The resulting method is the Berger-Bernardo method.

Suppose the p × 1 vector θ is partitioned into m groups, denoted by
θ1, ..., θm. The reference prior is obtained in a similar manner, as

π(θ) ∝ π(θm | θ1, ..., θm−1) π(θm−1 | θ1, ..., θm−2) · · · π(θ2 | θ1) π(θ1).


Berger-Bernardo Method: Algorithm


Algorithm 2: Berger-Bernardo method
1 Initiate some πm(θm | θ1, ..., θm−1), e.g., the Jeffreys prior ;
2 for j in m − 1, m − 2, ..., 1 do
3   Obtain the marginal distribution

    f(x | θ1, ..., θj) = ∫ f(x | θ) πj+1(θj+1, ..., θm | θ1, ..., θj) d(θj+1, ..., θm).

4   Determine the reference prior hj(θj | θ1, ..., θj−1) related to the
    model f(x | θ1, ..., θj), where θ1, ..., θj−1 are treated as fixed ;
5   Compute πj(θj, ..., θm | θ1, ..., θj−1) by

    πj(θj, ..., θm | θ1, ..., θj−1) ∝ πj+1(θj+1, ..., θm | θ1, ..., θj) hj(θj | θ1, ..., θj−1).

6 end
7 Obtain the reference prior π(θ) = π1(θ1, ..., θm) ;

Berger-Bernardo Method: Example

Example
Consider X | θ ∼ Multinomial(n, θ1, ..., θ4). The likelihood is

f(x | θ1, θ2, θ3) = n! / (x1! x2! x3! x4!) · θ1^{x1} θ2^{x2} θ3^{x3} θ4^{x4},

where θ4 = 1 − ∑_{i=1}^{3} θi. Find the reference prior of θ = (θ1, θ2, θ3),
where m = 3.


Influence of Prior

The assessment of the influence of the prior is called sensitivity analysis.

In general, the prior can have a big impact for small sample sizes,
but it becomes less important as the sample size increases. Most
priors will then lead to similar inference, essentially equivalent to
inference based only on the likelihood.

Example
Suppose that we have an iid sample Xi | θ ∼ Bernoulli(θ). The
conjugate prior θ ∼ Beta(a0, b0) yields the posterior

Beta( a0 + ∑_{i=1}^{n} xi , b0 + n − ∑_{i=1}^{n} xi ).
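A small sensitivity sketch (the priors and the value of θ are illustrative assumptions, chosen to mirror the figures on the following slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Sketch of a sensitivity analysis: compare the posterior under a few priors
# for a small and a large Bernoulli sample.
theta_true = 0.6
priors = {"Beta(1, 1)": (1, 1), "Beta(20, 20)": (20, 20)}

for n in (10, 1000):
    x_sum = rng.binomial(n, theta_true)        # sum of the n Bernoulli draws
    for name, (a0, b0) in priors.items():
        post = stats.beta(a0 + x_sum, b0 + n - x_sum)
        print(n, name, round(post.mean(), 3), np.round(post.interval(0.95), 3))
# For n = 10 the posteriors differ noticeably; for n = 1000 they nearly agree.
```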


Small Sample Size


[Figure: prior densities (left) and the corresponding posterior densities (right) for θ with a small sample, under the priors Beta(1, 1), Beta(20, 20), and 0.25 Beta(10, 30) + 0.75 Beta(30, 10).]


Large Sample Size


[Figure: prior densities (left) and the corresponding posterior densities (right) for θ with a large sample, under the priors Beta(1, 1), Beta(20, 20), and 0.25 Beta(10, 30) + 0.75 Beta(30, 10).]


Hierarchical Prior Distribution

We can apply a hierarchical prior, that is, a prior on the prior.

Suppose that π1(θ | λ) is a conjugate prior for f(x | θ), where λ is
the hyperparameter.
Instead of specifying the value of λ, we let

λ ∼ π2(λ),   θ | λ ∼ π1(θ | λ),   x | θ ∼ f(x | θ).

For example, if X | z ∼ N(µ, zσ²) and z follows an inverse gamma
distribution, then X marginally follows a t distribution.
A t-distribution prior with low degrees of freedom (e.g., 3) is a
popular choice.
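A quick simulation sketch of the normal-inverse-gamma mixture fact above (the values of ν, µ, and σ are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Sketch: mixing a normal over an inverse-gamma scale gives a t distribution.
# With z ~ InvGamma(nu/2, nu/2) and X | z ~ N(mu, z * sigma^2), X follows a
# t distribution with nu degrees of freedom (location mu, scale sigma).
nu, mu, sigma = 3, 0.0, 1.0
z = stats.invgamma(a=nu / 2, scale=nu / 2).rvs(size=200_000, random_state=rng)
x = rng.normal(mu, np.sqrt(z) * sigma)

# Compare simulated quantiles with the t(nu) quantiles.
qs = [0.05, 0.25, 0.5, 0.75, 0.95]
print(np.quantile(x, qs))
print(stats.t(df=nu, loc=mu, scale=sigma).ppf(qs))
```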


Different Priors in Practice

We have introduced different ways of constructing the prior, e.g., the
conjugate prior, the uniform prior, the Jeffreys prior, and the reference
prior.

Depending on how much information the priors contain, we can roughly
partition priors into the following groups according to their level of
informativeness relative to the likelihood:
1 noninformative flat prior,
2 super-vague but proper prior, e.g., a prior with a massive variance
  such as 1,000,000,
3 very weakly informative prior, e.g., a prior with a sizable variance
  such as 10,
4 weakly informative prior, e.g., a prior with variance 1,
5 informative prior.

The first two groups are generally not recommended.



Prior Predictive Check

The prior predictive check is a way to assess whether your prior is
appropriate.

Algorithm 3: Prior predictive check
1 for j in 1, 2, ..., m do
2   Simulate θsim ∼ π(θ) ;
3   Simulate xsim ∼ f(x | θsim) of sample size n ;
4 end
5 Visualize each data set, or investigate summary statistics, to
  judge whether the simulated data are plausible and thereby avoid very
  poor priors.
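A minimal sketch of such a check for a Poisson model; the log-normal prior on the rate and the values of n, m, and the prior standard deviations are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Minimal sketch of a prior predictive check for a Poisson model with a
# log-normal prior on the rate: lambda = exp(theta), theta ~ N(0, prior_sd^2).
n, m = 100, 9
for prior_sd in (5.0, 1.0):
    maxima = []
    for _ in range(m):
        theta_sim = rng.normal(0.0, prior_sd)            # draw from the prior
        x_sim = rng.poisson(np.exp(theta_sim), size=n)   # simulated data set
        maxima.append(int(x_sim.max()))
    print(f"prior sd = {prior_sd}: max count per replication = {maxima}")
# A very vague prior (sd = 5) produces data sets with wildly different and
# often implausibly large counts; the tighter prior gives more plausible data.
```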


Prior 1
[Figure: prior predictive check under Prior 1 — histograms of nine simulated Poisson data sets (Rep1-Rep9). The simulated counts are implausibly large and vary wildly across replications, ranging from roughly 3,600 to over 200,000.]


Prior 2
[Figure: prior predictive check under Prior 2 — histograms of nine simulated Poisson data sets (Rep1-Rep9). The simulated counts are small, mostly between 0 and 10.]
