Lecture 2 - 4 Prior
Prior
Shaobo Jin
Department of Mathematics
Prior Distribution
Example
Suppose that we are interested in the effectiveness θ ∈ [0, 1] of a vaccine.
An expert expects an 80% decrease in the number of disease cases in the vaccinated group compared to the non-vaccinated group.
Suppose that we would like to use a Beta (a, b) prior.
The hyperparameters can be set such that the expectation of the beta distribution, a/(a + b), is close to 80%.
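As a minimal sketch (assuming SciPy is available; fixing the concentration a + b = 10 is an arbitrary illustrative choice), one can check the implied prior:

```python
from scipy import stats

# Pick hyperparameters with prior mean a / (a + b) = 0.8 by fixing the
# concentration a + b = 10 (an arbitrary choice controlling informativeness).
concentration = 10
a = 0.8 * concentration  # a = 8
b = concentration - a    # b = 2

prior = stats.beta(a, b)
print(prior.mean())           # 0.8
print(prior.interval(0.95))   # central 95% prior interval
```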
Example
Suppose that we want to predict the number of sold cups of coffee during the midsommar celebration.
Suppose that the sales records from previous years show that the
number ranges between 600 and 800 cups.
We can choose a prior distribution such that the majority of the mass lies within or close to this range.
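For instance, a sketch assuming a normal prior centered at 700 with standard deviation 50 (both illustrative choices, not from the lecture):

```python
from scipy import stats

# A N(700, 50^2) prior puts about 95% of its mass in [600, 800],
# matching the historical range of 600-800 cups.
prior = stats.norm(loc=700, scale=50)
print(prior.cdf(800) - prior.cdf(600))  # approximately 0.954
```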
Example
Suppose that we are interested in the temperature θ at the midsommar
celebration.
One expert guesses that the temperature is around 22°C, and another expert guesses 10°C.
One example is to specify the prior for the temperature as
w N(22, σ1²) + (1 − w) N(10, σ2²).
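A sketch of sampling from such a mixture prior (the weight w = 0.5 and the standard deviations are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw from the mixture prior w * N(22, s1^2) + (1 - w) * N(10, s2^2).
w, s1, s2 = 0.5, 3.0, 3.0
n_draws = 10_000
pick_first = rng.random(n_draws) < w
draws = np.where(pick_first,
                 rng.normal(22.0, s1, n_draws),
                 rng.normal(10.0, s2, n_draws))
print(draws.mean())  # close to w * 22 + (1 - w) * 10 = 16
```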
Conjugate Prior
Definition
Let F be a family of probability distributions on Θ. If π (·) ∈ F and
π (· | x) ∈ F for every x, then the family of distributions F is
conjugate. The prior distribution that is an element in a conjugate
family is called a conjugate prior.
Example
1 Suppose that we have an iid sample Xi | θ ∼ Bernoulli(θ). Show that θ ∼ Beta(a0, b0) is conjugate (a worked sketch follows this example).
2 Suppose that we have an iid sample Xi | µ ∼ N(µ, σ²), i = 1, ..., n, where µ is known. Show that the inverse gamma prior
π(σ²) = (b0^{a0} / Γ(a0)) (σ²)^{−(a0+1)} exp{−b0/σ²}
is conjugate.
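For item 1, a short worked sketch: with Σ xi successes among n trials,
π(θ | x) ∝ θ^{Σxi} (1 − θ)^{n−Σxi} · θ^{a0−1} (1 − θ)^{b0−1} = θ^{a0+Σxi−1} (1 − θ)^{b0+n−Σxi−1},
which is the kernel of Beta(a0 + Σxi, b0 + n − Σxi), so the posterior remains in the Beta family.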
Exponential Family
Definition
A class of probability distributions P = {Pθ : θ ∈ Θ} is called an
exponential family if there exist a number k ∈ N, real-valued functions A, ζ1, ..., ζk on Θ, real-valued statistics T1, ..., Tk, and a function h on the sample space X such that
f(x | θ) = A(θ) exp{ Σ_{j=1}^{k} ζj(θ) Tj(x) } h(x),
For example, the normal density can be factored as
f(x | µ, σ²) = (1/√(2πσ²)) exp{−µ²/(2σ²)} exp{−x²/(2σ²) + xµ/σ²}.
1 Exponential distribution:
f (x | θ) = θ exp {−θx} , x≥0
= θ exp {−θx} 1 (x ≥ 0) ,
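Matching the exponential distribution with the definition: A(θ) = θ, ζ1(θ) = −θ, T1(x) = x, and h(x) = 1(x ≥ 0), so the exponential distribution is an exponential family with k = 1.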
Natural Parameter
We can parameterize the probability function as
f(x | ζ) = C(ζ) exp{ Σ_{j=1}^{k} ζj Tj(x) } h(x),
Theorem
Suppose that
f(x | ζ) = exp{ Σ_{j=1}^{k} ζj Tj(x) + log C(ζ) } h(x).
Example
Write the following models in exponential family form.
1 Let X1, ..., Xn be an iid sample from N(θ, σ²), where σ² is known. Show that the model belongs to an exponential family.
2 Logistic regression, with
P(Yi = yi | Xi = xi) = exp(yi xiᵀθ) / (1 + exp(xiᵀθ)), yi ∈ {0, 1}
(a worked sketch follows this list).
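For item 2, the likelihood can be rewritten as
f(yi | xi, θ) = exp{ yi xiᵀθ − log(1 + exp(xiᵀθ)) },
which is of exponential family form with statistic T(yi) = yi, natural parameter ζ = xiᵀθ, log C(ζ) = −log(1 + e^ζ), and h(yi) = 1.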
No Prior Information
The prior in the above definition is often referred to as the flat prior, uniform prior, among others.
The entropy is often called the Shannon entropy if the random variable is discrete and the differential entropy if the random variable is continuous.
Example
Find the entropy of the following distributions.
1 X ∼ N(0, σ²).
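A worked sketch for item 1: with p(x) the N(0, σ²) density,
S(P) = −E[log p(X)] = E[ (1/2) log(2πσ²) + X²/(2σ²) ] = (1/2) log(2πσ²) + 1/2 = (1/2) log(2πeσ²).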
Improper Prior
Improper Posterior
One risk of using an improper prior is that the posterior can be undefined.
Example
Let X ∼ Binomial(n, θ) and π(θ) ∝ 1/(θ(1 − θ)).
The posterior satisfies
π(θ | x) ∝ θ^x (1 − θ)^{n−x} · 1/(θ(1 − θ))
= θ^{x−1} (1 − θ)^{n−x−1},
which is not integrable near 0 when x = 0, nor near 1 when x = n, so the posterior is improper in those cases.
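A minimal numerical illustration of the divergence (the values n = 10 and x = 0 are assumptions chosen for illustration):

```python
from scipy.integrate import quad

# For x = 0 the posterior kernel theta^(x-1) * (1-theta)^(n-x-1)
# behaves like 1/theta near 0, so its integral over (eps, 1) grows
# without bound (roughly like log(1/eps)) as eps -> 0.
n, x = 10, 0
kernel = lambda t: t**(x - 1) * (1.0 - t)**(n - x - 1)
for eps in (1e-2, 1e-4, 1e-6, 1e-8):
    value, _ = quad(kernel, eps, 1.0, limit=200)
    print(eps, value)
```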
Marginalization Paradox
Since the improper prior is not a probability density, the posterior, even if it exists, may not follow the rules of probability. One example is the marginalization paradox.
Consider a model f(x | α, β) and a prior π(α, β). Suppose that the marginal posterior π(α | x) satisfies
π(α | x) = π(α | z(x))
for some statistic z(x).
Example
Let X1, ..., Xn be independent exponential random variables. The first m have mean η⁻¹ and the rest have mean (cη)⁻¹, where c ≠ 1 is a known constant and m ∈ {1, ..., n − 1}.
We consider the improper prior π(η) = 1, such that
π(η, m) = π(η) π(m) = π(m).
The marginal posterior distribution satisfies
π(m | x) ∝ c^{n−m} π(m) / ( Σ_{i=1}^{m} zi + c Σ_{i=m+1}^{n} zi )^{n+1},
Example
π(m | x) ∝ c^{n−m} π(m) / ( Σ_{i=1}^{m} zi + c Σ_{i=m+1}^{n} zi )^{n+1}.
The density of z is
f(z | η, m) = c^{n−m} Γ(n) / ( Σ_{i=1}^{m} zi + c Σ_{i=m+1}^{n} zi )^{n} ≡ f(z | m),
which does not depend on η. Yet combining f(z | m) with any prior on m alone yields the exponent n rather than n + 1, so the marginal posterior above cannot be recovered from z alone: this is the marginalization paradox.
Invariance?
Invariance: Example
πη(η) = πθ(h(η)) |dh(η)/dη| .
Jeffreys Prior
Definition
Consider a statistical model f(x | θ) with Fisher information matrix I(θ). The Jeffreys prior is
π(θ) ∝ (det I(θ))^{1/2}.
Example
Find the Jeffreys prior for θ.
1 Suppose that X | θ ∼ Binomial(n, θ). Show also that the Jeffreys prior is invariant to the transformation η = θ/(1 − θ) (a worked sketch for this item follows the list).
2 Suppose that Xi | θ ∼ N(θ, 1), i = 1, ..., n.
3 Suppose that Xi | θ belongs to a location family with density f(xi − θ), where f(x) is a density function.
4 Suppose that Xi | θ belongs to a scale family with density θ⁻¹ f(θ⁻¹ xi), where f(x) is a density function and θ ∈ R⁺.
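For item 1: the log-likelihood is x log θ + (n − x) log(1 − θ) up to an additive constant, so
I(θ) = E[ X/θ² + (n − X)/(1 − θ)² ] = n/θ + n/(1 − θ) = n / (θ(1 − θ)),
and the Jeffreys prior is π(θ) ∝ θ^{−1/2} (1 − θ)^{−1/2}, the kernel of a Beta(1/2, 1/2) distribution.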
Example
Suppose that Xi | µ, σ² ∼ N(µ, σ²), i = 1, ..., n. Find the Jeffreys prior for θ = (µ, σ²).
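A sketch of the answer: the Fisher information of a single observation is I(µ, σ²) = diag(1/σ², 1/(2σ⁴)), so det I(µ, σ²) ∝ (σ²)⁻³ and the Jeffreys prior is π(µ, σ²) ∝ (σ²)^{−3/2}.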
A Cautious Note
The Jeffreys prior does not necessarily perform satisfactorily for all inferential purposes.
Example
Suppose that we observe one observation X | θ ∼ Np (θ, I).
The Jeffreys prior is the uniform prior, and the posterior is θ | x ∼ Np(x, I).
Suppose that we are interested in the parameter η = θᵀθ. The posterior distribution of η is noncentral χ² with p degrees of freedom and noncentrality parameter xᵀx. The posterior expected value is xᵀx + p.
If we consider a quadratic loss, the risk of another estimator, xᵀx − p, is no greater than the risk of xᵀx + p for all θ.
This means that for any θ, we can always find an estimator that is better than the estimator based on the Jeffreys prior.
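A quick Monte Carlo check of the posterior mean (the dimension p = 5 and the simulated x are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo check that the posterior mean of eta = theta' theta equals
# x'x + p when theta | x ~ N_p(x, I).
p = 5
x = rng.normal(size=p)
theta = rng.normal(loc=x, scale=1.0, size=(1_000_000, p))  # posterior draws
eta = (theta**2).sum(axis=1)
print(eta.mean())    # approximately x'x + p
print(x @ x + p)
```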
Reference Prior
Consider the Kullback-Leibler divergence
KL(π(θ | x), π(θ)) = ∫ π(θ | x) log[ π(θ | x) / π(θ) ] dθ ≥ 0.
A large KL means that a lot of information has come from the data.
The expected KL under the marginal of x is then
E[KL(π(θ | x), π(θ))] = ∫ m(x) ∫ π(θ | x) log[ π(θ | x) / π(θ) ] dθ dx
= ∫∫ f(x, θ) log[ π(θ | x) / π(θ) ] dθ dx
= ∫∫ f(x, θ) log[ f(x, θ) / (π(θ) m(x)) ] dθ dx,
Mutual Information
In probability theory, the mutual information of two random variables X and Y is defined as
MI(X, Y) = ∫∫ f(x, y) log[ f(x, y) / (f(x) f(y)) ] dx dy ≥ 0.
Result
Let p(x) be the density of a distribution P. Then,
MI(X, θ) = S(π(θ)) − ∫ m(x) S(π(θ | x)) dx,
where
S(P) = −E[log p(X)]
is the entropy.
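As an illustration, a Monte Carlo sketch of MI(X, θ) for X | θ ∼ Binomial(n, θ) with a Beta(a, b) prior, using π(θ | x)/π(θ) = f(x | θ)/m(x) with the Beta-Binomial marginal m(x); the function name and the (a, b) values tried are illustrative assumptions:

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln, betaln

# Monte Carlo estimate of MI(X, theta) = E[KL(posterior, prior)] for
# X | theta ~ Binomial(n, theta) and theta ~ Beta(a, b), using
# pi(theta | x) / pi(theta) = f(x | theta) / m(x), where m(x) is the
# Beta-Binomial marginal. The maximizer over (a, b) should be close to
# the Jeffreys prior Beta(1/2, 1/2) for moderate n.
def mutual_information(a, b, n=20, draws=200_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.beta(a, b, size=draws)
    x = rng.binomial(n, theta)
    log_lik = stats.binom.logpmf(x, n, theta)
    log_choose = gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)
    log_marg = log_choose + betaln(a + x, b + n - x) - betaln(a, b)
    return np.mean(log_lik - log_marg)

for a, b in [(0.5, 0.5), (1.0, 1.0), (2.0, 2.0)]:
    print(a, b, mutual_information(a, b))
```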
[Figure: heat map of the expected KL (ExpKL) as a function of the hyperparameters a and b.]
p(θ) = lim_{k→∞} pk(θ) / pk(θ0),
Example
Suppose that X | θ ∼ Binomial (n, θ). Approximate the reference prior.
In fact, if the distribution of the MLE, √n(θ̂ − θ), can be approximated by N(0, I⁻¹(θ)), and the posterior distribution of √n(θ − θ̂) can be approximated by N(0, I⁻¹(θ̂)), then the reference prior is approximately the Jeffreys prior. Here →P means convergence in probability.
The reference prior is π(θ) ∝ σ⁻¹. The posterior mean of σ² is
E(σ² | x) = Σ_{i=1}^{n} (xi1 − xi2)² / (2n − 4) →P σ².
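A simulation sketch of this consistency (assuming a Neyman-Scott-type setup with paired observations xij ∼ N(µi, σ²), j = 1, 2, consistent with the formula above; σ² = 2 is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Check that sum_i (x_i1 - x_i2)^2 / (2n - 4) approaches sigma^2:
# each difference x_i1 - x_i2 ~ N(0, 2 sigma^2), regardless of mu_i.
sigma2 = 2.0
for n in (50, 500, 5000):
    mu = rng.normal(size=n)  # pair-specific nuisance means
    x = rng.normal(loc=mu[:, None], scale=np.sqrt(sigma2), size=(n, 2))
    estimate = ((x[:, 0] - x[:, 1])**2).sum() / (2 * n - 4)
    print(n, estimate)  # approaches sigma^2 = 2
```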
Berger-Bernardo Method
[Algorithm listing truncated in these notes; its final step obtains the reference prior π(θ) = π1(θ1, ..., θm).]
Example
Consider X | θ ∼ Multinomial(n, θ1, ..., θ4). The likelihood is
f(x | θ1, θ2, θ3) = ( n! / (x1! x2! x3! x4!) ) θ1^{x1} θ2^{x2} θ3^{x3} θ4^{x4},
where θ4 = 1 − θ1 − θ2 − θ3 and x4 = n − x1 − x2 − x3.
Influence of Prior
[Figure: density curves over θ ∈ [0, 1] in several panels, illustrating how different priors influence the posterior; y-axis: density value.]
follows a t distribution.
A t distribution prior with low degrees of freedom (e.g., 3) is a
popular choice.
Prior 1
[Figure: histograms of draws across three replications (Rep1, Rep2, Rep3) under Prior 1; the horizontal axes span large values, roughly 3,600 to 103,500.]
Prior 2
[Figure: histograms of draws across three replications (Rep1, Rep2, Rep3) under Prior 2; the horizontal axes span small values, roughly 0 to 3.]