Lecture 17: Prior Modeling
Email: [email protected]
URL: https://www.zabaras.com/
Robust Priors
Subjectivity does not imply being unscientific: one can use scientific information to guide the specification of priors. Common tools for doing so include:
- Conjugate priors
- Exponential families
- Maximum entropy priors
Conjugate priors are analytically tractable: finding the posterior reduces to updating the corresponding parameters of the prior.
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int_0^1 f(x \mid \theta)\,\pi(\theta)\,d\theta} = \mathcal{B}e(a + n_H,\; b + n - n_H) = \mathcal{B}e(a + n_H,\; b + n_T),$$
where $n_H$ and $n_T = n - n_H$ are the numbers of heads and tails in $n$ coin tosses and $\mathcal{B}e(a, b)$ is the prior on $\theta$.
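A minimal numerical sketch of this update (my own illustration; the function name and the toss data are hypothetical, not from the slides):

import numpy as np

def beta_bernoulli_update(a, b, tosses):
    """Conjugate update of a Beta(a, b) prior on the heads probability theta,
    given Bernoulli outcomes (1 = heads, 0 = tails)."""
    tosses = np.asarray(tosses)
    n_heads = int(tosses.sum())
    n_tails = int(tosses.size - n_heads)
    return a + n_heads, b + n_tails          # posterior is Be(a + n_H, b + n_T)

a_post, b_post = beta_bernoulli_update(2.0, 2.0, [1, 0, 1, 1, 0, 1])
print(a_post, b_post, "posterior mean:", a_post / (a_post + b_post))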
Conjugate priors provide a first approximation to the adequate prior distribution which should
be followed by a robustness analysis.
Example: a normal likelihood with unknown mean $\theta$ and unit variance. In the exponential-family notation introduced below, note that:
$$f(x \mid \theta) = h(x)\, \exp\!\big(\theta x - \psi(\theta)\big), \qquad R(\theta) = \theta, \quad T(x) = x, \quad \psi(\theta) = \frac{\theta^2}{2}, \quad h(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$
Likelihood (exponential family):
$$f(x \mid \theta) = h(x)\, \exp\!\big(R(\theta)\, T(x) - \psi(\theta)\big)$$
Conjugate prior, with hyperparameters $(\mu, \lambda)$:
$$\pi(\theta \mid \mu, \lambda) \propto \exp\!\big(\mu\, R(\theta) - \lambda\, \psi(\theta)\big), \qquad \lambda > 0$$
Posterior:
$$\pi(\theta \mid x) \propto \exp\!\big((\mu + T(x))\, R(\theta) - (\lambda + 1)\, \psi(\theta)\big), \quad \text{i.e.} \quad \pi(\theta \mid x) = \pi\big(\theta \mid \mu + T(x),\ \lambda + 1\big).$$
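As a quick worked check (not in the original slides), the Bernoulli likelihood fits this exponential-family form, and the general update recovers the beta-Bernoulli result above:
$$f(x \mid \theta) = \theta^{x}(1-\theta)^{1-x} = \exp\!\Big(x\,\log\tfrac{\theta}{1-\theta} + \log(1-\theta)\Big), \qquad R(\theta) = \log\tfrac{\theta}{1-\theta},\quad T(x) = x,\quad \psi(\theta) = -\log(1-\theta),\quad h(x) = 1,$$
$$\pi(\theta \mid \mu, \lambda) \propto \exp\!\big(\mu\,R(\theta) - \lambda\,\psi(\theta)\big) = \theta^{\mu}(1-\theta)^{\lambda-\mu} \;\propto\; \mathcal{B}e(\mu+1,\ \lambda-\mu+1),$$
and each observation updates $(\mu, \lambda) \to (\mu + x,\ \lambda + 1)$: a head increments the first beta parameter, a tail the second.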
Conjugate prior (for the normal likelihood $x \mid \theta \sim N(\theta, \sigma^2)$ with known $\sigma^2$):
$$\theta \sim N(\mu_0, \sigma_0^2)$$
Posterior:
$$\theta \mid x_1 \sim N(\mu_1, \sigma_1^2), \qquad \mu_1 = \frac{\sigma^2 \mu_0 + \sigma_0^2 x_1}{\sigma^2 + \sigma_0^2}, \qquad \frac{1}{\sigma_1^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}.$$
The posterior mean $\mu_1$ is a weighted average of the observation $x_1$ and the prior mean $\mu_0$.
Posterior predictive (its mean is $\mu_1$):
$$\pi(x \mid x_1) = \int f(x \mid \theta)\, \pi(\theta \mid x_1)\, d\theta \;\propto\; \int e^{-\frac{(x-\theta)^2}{2\sigma^2}}\, e^{-\frac{(\theta-\mu_1)^2}{2\sigma_1^2}}\, d\theta \;\sim\; N(\mu_1,\ \sigma^2 + \sigma_1^2).$$
With $n$ observations $x_{1:n}$ and sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$:
$$\theta \mid x_{1:n} \sim N(\mu_n, \sigma_n^2), \qquad \mu_n = \frac{\sigma^2 \mu_0 + n\sigma_0^2\, \bar{x}}{\sigma^2 + n\,\sigma_0^2}, \qquad \frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}.$$
One can think of the prior as $n_0$ virtual observations, with $n_0 = \sigma^2/\sigma_0^2$, so that
$$\mu_n = \frac{\sum_{i=1}^{n} x_i + n_0\, \mu_0}{n + n_0}, \qquad \sigma_n^2 = \frac{\sigma^2}{n + n_0}.$$
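A minimal sketch of this update in its virtual-observation form (my own code, not from the slides; the function name and numbers are illustrative only):

import numpy as np

def normal_known_var_posterior(x, sigma2, mu0, sigma0_2):
    """Posterior N(mu_n, sigma_n^2) for the mean of N(theta, sigma2) data,
    with a conjugate N(mu0, sigma0_2) prior on theta."""
    x = np.asarray(x, dtype=float)
    n = x.size
    n0 = sigma2 / sigma0_2                     # prior acts like n0 virtual observations
    mu_n = (x.sum() + n0 * mu0) / (n + n0)     # weighted average of data and prior mean
    sigma_n2 = sigma2 / (n + n0)               # posterior variance
    return mu_n, sigma_n2

mu_n, sigma_n2 = normal_known_var_posterior([1.2, 0.8, 1.1, 0.9, 1.0],
                                            sigma2=1.0, mu0=0.0, sigma0_2=10.0)
# The posterior predictive for a new x is then N(mu_n, sigma2 + sigma_n2).
print(mu_n, sigma_n2)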
Standard conjugate families (update shown for a single observation x):

Likelihood f(x|θ)               Conjugate prior π(θ)       Posterior π(θ|x)
Normal N(θ, σ²)                 Normal N(μ, τ²)            N( (σ²μ + τ²x)/(σ² + τ²), σ²τ²/(σ² + τ²) )
Poisson P(θ)                    Gamma G(α, β)              G(α + x, β + 1)
Gamma G(ν, θ)                   Gamma G(α, β)              G(α + ν, β + x)
Binomial B(n, θ)                Beta Be(α, β)              Be(α + x, β + n − x)
Negative Binomial Neg(m, θ)     Beta Be(α, β)              Be(α + m, β + x)
Multinomial Mₖ(θ₁, …, θₖ)       Dirichlet D(α₁, …, αₖ)     D(α₁ + x₁, …, αₖ + xₖ)
Normal N(μ, 1/θ)                Gamma Ga(α, β)             Ga(α + 1/2, β + (μ − x)²/2)
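To illustrate one row of the table with code (my own example, not from the slides): the table's Poisson entry is for a single observation, while this helper accumulates n observations, each adding xᵢ to the shape and 1 to the rate.

import numpy as np

def poisson_gamma_posterior(x, alpha, beta):
    """Gamma posterior for a Poisson rate theta under a Gamma(alpha, beta) prior."""
    x = np.asarray(x)
    return alpha + x.sum(), beta + x.size

alpha_n, beta_n = poisson_gamma_posterior([3, 1, 4, 2], alpha=2.0, beta=1.0)
print("posterior:", (alpha_n, beta_n), "posterior mean:", alpha_n / beta_n)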
Conjugate priors simplify the computation, but are often not robust, and not flexible enough to
encode our prior knowledge.
A mixture of conjugate priors is also conjugate and can approximate any kind of prior. Thus
such priors provide a good compromise between computational convenience and flexibility.
Example: to model coin tosses, we can take a prior which is a mixture of two beta distributions.
If 𝜃 comes from the first distribution, the coin is fair, but if it comes from the second, it is
biased towards heads.
Take the prior to be a mixture of $K$ conjugate components, $\pi(\theta) = \sum_{i=1}^{K} w_i\, \pi_i(\theta)$. Multiplying by the likelihood $f(\mathcal{D}\mid\theta)$ and normalizing, we obtain
$$\pi(\theta \mid \mathcal{D}) = \frac{\sum_{i=1}^{K} w_i\, \pi_i(\theta)\, f(\mathcal{D}\mid\theta)}{\sum_{j=1}^{K} w_j \int \pi_j(\theta)\, f(\mathcal{D}\mid\theta)\, d\theta},$$
or, using
$$w_i' \equiv \frac{w_i \int \pi_i(\theta)\, f(\mathcal{D}\mid\theta)\, d\theta}{\sum_{j=1}^{K} w_j \int \pi_j(\theta)\, f(\mathcal{D}\mid\theta)\, d\theta}, \qquad \pi_i(\theta \mid \mathcal{D}) = \frac{\pi_i(\theta)\, f(\mathcal{D}\mid\theta)}{\int \pi_i(\theta)\, f(\mathcal{D}\mid\theta)\, d\theta},$$
$$\pi(\theta \mid \mathcal{D}) = \sum_{i=1}^{K} w_i'\, \pi_i(\theta \mid \mathcal{D}) = \sum_{i=1}^{K} P(Z = i \mid \mathcal{D})\, \pi(\theta \mid \mathcal{D}, Z = i).$$
With a latent indicator $Z$ such that $p(Z = i) = w_i$ and $p(\mathcal{D} \mid Z = i) = \int \pi_i(\theta)\, f(\mathcal{D}\mid\theta)\, d\theta$, the updated weights are the posterior component probabilities,
$$w_i' = P(Z = i \mid \mathcal{D}) = \frac{p(Z = i)\, p(\mathcal{D}\mid Z = i)}{\sum_{j=1}^{K} p(Z = j)\, p(\mathcal{D}\mid Z = j)}.$$
One can approximate arbitrarily closely any prior distribution by a mixture of conjugate distributions (Brown, 1986).
Mixture of Conjugate Priors
As an example, suppose we use the mixture prior
$$p(\theta) = 0.5\, \mathrm{Beta}(\theta \mid a_1, b_1) + 0.5\, \mathrm{Beta}(\theta \mid a_2, b_2)$$
with $a_1 = b_1 = 20$, $a_2 = 30$, $b_2 = 10$, and we observe $N_1$ heads and $N_0$ tails. The posterior becomes
$$p(\theta \mid \mathcal{D}) = p(Z = 1 \mid \mathcal{D})\, \mathrm{Beta}(\theta \mid a_1 + N_1,\ b_1 + N_0) + p(Z = 2 \mid \mathcal{D})\, \mathrm{Beta}(\theta \mid a_2 + N_1,\ b_2 + N_0).$$
The posterior mixing weights are given by
$$p(Z = k \mid \mathcal{D}) = \frac{p(Z = k)\, p(\mathcal{D}\mid Z = k)}{\sum_{k'} p(Z = k')\, p(\mathcal{D}\mid Z = k')} = \frac{p(Z = k)\, p(\mathcal{D}\mid Z = k)}{p(\mathcal{D})}.$$
Since each component is conjugate, we can easily compute the marginal likelihoods $p(\mathcal{D} \mid Z = k)$ (Brown et al. 1993); for the first component,
$$p(\mathcal{D} \mid Z = 1) = \binom{N}{N_1}\, \frac{B(a_1 + N_1,\ b_1 + N_0)}{B(a_1,\ b_1)},$$
where $B(\cdot,\cdot)$ is the Beta function and $N = N_1 + N_0$. The posterior $p(\theta \mid \mathcal{D})$ is then the mixture of Beta distributions shown in the figure.

[Figure: mixture of Beta distributions.]
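A small numerical sketch of this two-component example (my own illustration with arbitrary counts, not the MATLAB code referenced later); the binomial coefficient is omitted since it is common to both components and cancels in the weights:

import numpy as np
from scipy.special import betaln

w = np.array([0.5, 0.5])            # prior mixture weights
a = np.array([20.0, 30.0])          # a1, a2
b = np.array([20.0, 10.0])          # b1, b2
N1, N0 = 20, 10                     # observed heads and tails (illustrative)

# log p(D | Z = k) = log B(a_k + N1, b_k + N0) - log B(a_k, b_k)
log_ml = betaln(a + N1, b + N0) - betaln(a, b)
log_w = np.log(w) + log_ml
w_post = np.exp(log_w - np.logaddexp.reduce(log_w))   # p(Z = k | D)

print("posterior mixing weights:", w_post)
print("posterior components: Beta(", a + N1, ",", b + N0, ")")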
Cons: approximation by mixtures, while feasible, is very tedious and thus not used in practice.
Assume a distribution $p(x \mid \lambda)$ governed by a parameter $\lambda$, and a prior $p(\lambda) = \text{const}$. For example, if $\lambda$ is a discrete variable with $K$ states, this simply amounts to setting the prior probability of each state to $1/K$.
In the case of continuous $\lambda$ there are two difficulties with this approach. First, if the domain of $\lambda$ is unbounded, this prior distribution cannot be correctly normalized (an improper prior); the second difficulty, related to how densities transform under reparameterization, is discussed below.
Improper priors can often be used provided the corresponding posterior distribution is proper.
For example, if we put a uniform prior distribution over the mean of a Gaussian, then the
posterior distribution for the mean, once we have observed at least one data point, will be
proper.
An uninformative prior would be the uniform distribution, Beta(1, 1). In this case, the posterior mean and the MLE are:
$$\mathbb{E}[\theta \mid \mathcal{D}] = \frac{N_1 + 1}{N_1 + N_0 + 2}, \qquad \hat{\theta}_{MLE} = \frac{N_1}{N_1 + N_0}.$$
One could argue that the prior wasn’t completely uninformative after all.
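For instance (illustrative counts of my own, not from the slides), with $N_1 = 3$ heads and $N_0 = 1$ tail,
$$\mathbb{E}[\theta \mid \mathcal{D}] = \frac{3+1}{3+1+2} = \frac{4}{6} \approx 0.67, \qquad \hat{\theta}_{MLE} = \frac{3}{4} = 0.75,$$
so the uniform prior still pulls the estimate toward $1/2$.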
$$\pi(\theta) \propto 1, \qquad \mathrm{Beta}_{\alpha,\beta}(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1}(1 - x)^{\beta - 1}.$$
Noninformative Priors
One can argue that the most non-informative prior is the Haldane prior, $\mathcal{B}e(0, 0)$, i.e. $\pi(\theta) \propto \theta^{-1}(1-\theta)^{-1}$.
Note that the Haldane prior is an improper prior, meaning it does not integrate to 1. However,
as long as we see at least one head and at least one tail, the posterior will be proper.
If a function $h(\lambda)$ is constant, and we change variables to $\lambda = \eta^2$, then $\hat{h}(\eta) = h(\eta^2)$ will also be constant. However, if we choose the density $p_\lambda(\lambda)$ to be constant, then the density of $\eta$ will be given by
$$p_\eta(\eta) = p_\lambda(\lambda)\left|\frac{d\lambda}{d\eta}\right| = p_\lambda(\eta^2)\, 2\eta,$$
and so the density over $\eta$ will not be constant.
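A quick Monte Carlo check of this point (my own sketch, not part of the lecture): draw λ uniformly on [0, 1], set η = √λ, and the histogram of η follows 2η rather than a flat density.

import numpy as np

rng = np.random.default_rng(0)
lam = rng.uniform(0.0, 1.0, size=200_000)    # flat density on lambda
eta = np.sqrt(lam)                           # change of variables lambda = eta^2

hist, edges = np.histogram(eta, bins=20, range=(0.0, 1.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - 2.0 * centers)))  # close to 0: density of eta is 2*eta, not constant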
This issue does not arise when we use maximum likelihood, because the likelihood function
𝑝(𝑥|𝜆) is a simple function of 𝜆 and so we are free to use any convenient parameterization.
If, however, we are to choose a prior distribution that is constant, we must take care to use an
appropriate representation for the parameters.
If the density has the form $p(x \mid \mu) = f(x - \mu)$, then $\mu$ is a location parameter. We would like to find a prior that satisfies translational invariance, i.e. a density independent of the choice of origin.
Requiring that the prior assign the same mass to a region $A$ and to its shifted version $A - c$,
$$\int_A p(\mu)\, d\mu = \int_{A - c} p(t)\, dt = \int_A p(\mu - c)\, d\mu \qquad (t = \mu - c).$$
A translation invariance requirement is thus that the prior distribution should satisfy
$$p(\mu) = p(\mu - c) \ \text{ for every } c \quad \Longrightarrow \quad p(\mu) = \text{constant (improper prior)}.$$
This flat prior is improper, but the resulting posterior is proper once we have seen $N \ge 1$ data points, since $\int f(x - \mu)\, d\mu < \infty$ will satisfy this; one data point is enough to fix the location. An example of a location parameter is the mean $\mu$ of a Gaussian. The noninformative prior is obtained from the conjugate prior $N(\mu \mid \mu_0, \sigma_0^2)$ in the limit $\sigma_0^2 \to \infty$.
Scale Invariant Prior
Scale invariance: if the density is of the form
$$p(x \mid \sigma) = \frac{1}{\sigma}\, f\!\left(\frac{x}{\sigma}\right),$$
then $f(\cdot)$ is scale invariant and $\sigma$ is the scale parameter. Under a rescaling of the data, $\bar{x} = c\,x$, the density keeps the same form,
$$p(\bar{x} \mid \bar{\sigma}) = \frac{1}{\bar{\sigma}}\, f\!\left(\frac{\bar{x}}{\bar{\sigma}}\right), \qquad \text{where } \bar{\sigma} = c\,\sigma.$$
We would like to find a prior that satisfies this scale invariance – a density independent of the
scaling used.
$$p(\sigma) = \frac{1}{c}\, p\!\left(\frac{\sigma}{c}\right) \ \text{ for every } c \quad \Longrightarrow \quad p(\sigma) \propto \frac{1}{\sigma} \ \text{(improper prior)}, \qquad p(\ln\sigma) = \text{const}.$$
We can approximate this with $p(\lambda) = \mathrm{Gamma}(\lambda \mid 0, 0)$ on the precision $\lambda = 1/\sigma^2$. This improper prior leads to a proper posterior if we observe $N \ge 2$ data points (we need at least two data points to estimate a variance).
The conjugate prior for $\lambda$ is $\mathrm{Gamma}(\lambda \mid a_0, b_0)$. The noninformative prior is obtained from the Gamma with $a_0 = b_0 = 0$. Recall the general form of the posterior:
$$p(\lambda \mid \mathbf{X}, \mu) \propto \prod_{n=1}^{N} f(x_n \mid \mu, \lambda)\ \mathrm{Gamma}(\lambda \mid a_0, b_0) \;\propto\; \lambda^{N/2 + a_0 - 1}\, \exp\!\left(-b_0\,\lambda - \frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right).$$
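A small sketch of this Gamma update for the precision with known mean (my own code, not from the slides; with a₀ = b₀ = 0 it reproduces the noninformative limit):

import numpy as np

def gaussian_precision_posterior(x, mu, a0=0.0, b0=0.0):
    """Gamma(a_N, b_N) posterior for the precision lambda of N(mu, 1/lambda),
    with known mean mu and a Gamma(a0, b0) prior on lambda."""
    x = np.asarray(x, dtype=float)
    a_N = a0 + x.size / 2.0
    b_N = b0 + 0.5 * np.sum((x - mu) ** 2)
    return a_N, b_N

a_N, b_N = gaussian_precision_posterior([0.9, 1.3, 0.7, 1.1], mu=1.0)
print("posterior mean precision:", a_N / b_N)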
Given a likelihood $f(x \mid \theta)$, Jeffreys noninformative prior distributions are based on the Fisher information:
$$\pi(\theta) \propto \big|I(\theta)\big|^{1/2},$$
where $|I(\theta)|$ is the determinant of the Fisher information matrix.
Any rule for defining the prior distribution on $\theta$ should lead to an equivalent result when using a transformed parameterization $\theta = g(\eta)$:
$$\pi_\eta(\eta) = \pi_\theta(\theta)\left|\frac{d\theta}{d\eta}\right| = \pi_\theta\big(g(\eta)\big)\left|\frac{dg(\eta)}{d\eta}\right|.$$
The Fisher information transforms in exactly this way:
$$I(\eta) = \mathbb{E}_{X\mid\eta}\!\left[-\frac{\partial^2 \log f(X\mid\eta)}{\partial \eta^2}\right] = \mathbb{E}_{X\mid\theta}\!\left[-\frac{\partial^2 \log f(X\mid\theta)}{\partial \theta^2}\right]\left(\frac{d\theta}{d\eta}\right)^2 = I(\theta)\left(\frac{d\theta}{d\eta}\right)^2,$$
so that $I(\eta)^{1/2} = I(\theta)^{1/2}\,|d\theta/d\eta|$, i.e. the Jeffreys prior is invariant under reparameterization.
Likelihood:
$$x_i \mid \theta \sim N(\theta, \sigma^2) \ (\text{known } \sigma^2), \qquad \text{i.e.} \quad f(x_{1:n} \mid \theta) \propto \exp\!\left(-\frac{\sum_{i=1}^{n}(x_i - \theta)^2}{2\sigma^2}\right).$$
Then
$$I(\theta) = \mathbb{E}_{X\mid\theta}\!\left[-\frac{\partial^2 \log f(x_{1:n} \mid \theta)}{\partial \theta^2}\right] = \frac{n}{\sigma^2} \quad \Longrightarrow \quad \pi(\theta) \propto \sqrt{I(\theta)} \propto 1.$$
Note that $d(\log\sigma)/d\sigma = 1/\sigma$, so the scale prior $\pi(\sigma) \propto 1/\sigma$ is equivalent to a flat prior on $\log\sigma$.
For a multinoulli random variable with $K$ states, one can show that the Jeffreys prior is
$$\pi(\boldsymbol{\theta}) \propto \mathrm{Dir}\!\left(\tfrac{1}{2}, \ldots, \tfrac{1}{2}\right).$$
Note that this is not any of the expected answers:
$$\pi(\boldsymbol{\theta}) \propto \mathrm{Dir}\!\left(\tfrac{1}{K}, \ldots, \tfrac{1}{K}\right) \quad \text{or} \quad \pi(\boldsymbol{\theta}) \propto \mathrm{Dir}(1, \ldots, 1).$$
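A quick numerical sanity check for the K = 2 (Bernoulli) case (my own, not from the slides): the Fisher information is I(θ) = 1/(θ(1 − θ)), so I(θ)^(1/2) is proportional to the Beta(1/2, 1/2) density.

import numpy as np
from scipy.stats import beta

theta = np.linspace(0.01, 0.99, 99)
jeffreys = 1.0 / np.sqrt(theta * (1.0 - theta))   # sqrt of the Fisher information
beta_half = beta.pdf(theta, 0.5, 0.5)             # Beta(1/2, 1/2) density

ratio = jeffreys / beta_half                      # constant (equal to pi) up to rounding
print(ratio.min(), ratio.max())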
Pros and Cons of Jeffreys Priors
It can lead to incoherencies; e.g. the Jeffreys prior for Gaussian data with $\theta = (\mu, \sigma)$ unknown is $\pi(\theta) \propto \sigma^{-2}$. Indeed, using
$$\ln f(x \mid \theta) = -\ln\!\big((2\pi)^{1/2}\big) - \ln\sigma - \frac{(x-\mu)^2}{2\sigma^2},$$
$$I(\theta) = \mathbb{E}_{X\mid\theta}\!\begin{pmatrix} \dfrac{1}{\sigma^2} & \dfrac{2(x-\mu)}{\sigma^3} \\[8pt] \dfrac{2(x-\mu)}{\sigma^3} & \dfrac{3(x-\mu)^2}{\sigma^4} - \dfrac{1}{\sigma^2} \end{pmatrix} = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\[4pt] 0 & \dfrac{2}{\sigma^2} \end{pmatrix} \quad \Longrightarrow \quad \pi(\theta) \propto \big|I(\theta)\big|^{1/2} \propto \sigma^{-2}.$$
However, if $\mu$ and $\sigma$ are assumed a priori independent, each with its own invariant prior (using the results derived earlier), then $\pi(\theta) \propto \sigma^{-1}$.
It does NOT satisfy the likelihood principle: the Fisher information can differ for two experiments that provide proportional likelihoods. As an example, consider the Binomial and Negative Binomial distributions.
C. P. Robert, The Bayesian Choice, Springer, 2nd edition, chapter 3 (full text available)
[Figure: posterior densities under a flat prior and under a conjugate N(0, 0.1σ̄²) prior; see the MATLAB implementation.]

J.-M. Marin & C. P. Robert, The Bayesian Core, Springer, 2nd edition, chapter 2 (full text available)