
Prior Modeling

Prof. Nicholas Zabaras

Email: [email protected]
URL: https://www.zabaras.com/

September 18, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


References
• C. P. Robert, The Bayesian Choice: From Decision-Theoretic Motivations to
Computational Implementation, Springer-Verlag, NY, 2001 (online resource)
• A. Gelman, J. B. Carlin, H. S. Stern and D. B. Rubin, Bayesian Data Analysis,
Chapman and Hall/CRC Press, 2nd Edition, 2003.
• J.-M. Marin and C. P. Robert, The Bayesian Core, Springer-Verlag, 2007 (online
resource)
• D. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, Oxford University
Press, 2006.
• B. Vidakovic, Bayesian Statistics for Engineering, Online Course at Georgia Tech.
• Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 5.



Contents
• Prior modeling, conjugate priors, exponential families

• Mixture of conjugate priors, non-informative priors

• Translation and scale invariance

• Jeffreys' non-informative prior

• Robust priors



Goals
• The goals of today's lecture are:

  - Understand the importance and use of exponential-family conjugate priors

  - Learn about scale and translation invariant priors

  - Learn about Jeffreys' non-informative priors and robust priors

  - Understand how to use a mixture of conjugate priors



Selection of Prior Distribution
• Once the prior distribution is selected, Bayesian inference can be performed
almost mechanically.

• A critical point of Bayesian statistics is the choice of the prior.

• Seldom is there enough "subjective information" to lead to an 'exact'
determination of the prior distribution.

• Selection of the prior involves subjectivity.

• Subjectivity does not imply being unscientific – one can use scientific
information to guide the specification of priors.

• We will review some of the work on uninformative and robust priors.



Informative Priors
• The prior is a tool summarizing the available information on a phenomenon of
interest, as well as the uncertainty associated with this information.

• Informative priors convey specific and definite information about the parameters θ
associated with the random phenomenon.

• Pre-existing evidence that has already been taken into account forms part of an
informative prior. This information can be based on historical data, insight or
personal beliefs.

• Typical subgroups of informative priors:

  - Conjugate priors
  - Exponential families
  - Maximum entropy priors



Conjugate Priors
• Consider a class of probability distributions P. If, for every prior π(θ) ∈ P and every
likelihood f(x|θ) in a family F, the posterior distribution π(θ|x) belongs to P, then the class P is
conjugate for F.

• Conjugate priors are analytically tractable. Finding the posterior reduces to updating the
corresponding parameters of the prior.

• Consider a coin flipping example:

  - Let θ be the probability that the coin lands heads
  - Prior: θ ~ Be(a, b)
  - Data: the coin is flipped n times and n_H of the flips are heads (binomial likelihood)
  - Posterior:

        π(θ | x) = f(x|θ) π(θ) / ∫₀¹ f(x|θ) π(θ) dθ ∝ θ^(a + n_H − 1) (1 − θ)^(b + n − n_H − 1),
        i.e.  θ | x ~ Be(a + n_H, b + n − n_H)

• Conjugate priors provide a first approximation to the adequate prior distribution, which should
be followed by a robustness analysis.
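
• As a concrete illustration of the conjugate update above, here is a short Python sketch (the
prior hyperparameters a, b and the data counts are made up for illustration); it performs the
Beta-binomial update and checks the closed-form posterior against a brute-force grid computation
of likelihood × prior.

    import numpy as np
    from scipy.stats import beta, binom

    a, b = 2.0, 2.0        # hypothetical Beta(a, b) prior hyperparameters
    n, n_H = 10, 7         # hypothetical data: 7 heads in 10 flips

    # Conjugate update: Be(a, b) -> Be(a + n_H, b + n - n_H)
    posterior = beta(a + n_H, b + n - n_H)

    # Brute-force check: posterior ∝ likelihood × prior, normalized on a grid
    theta = np.linspace(1e-4, 1 - 1e-4, 2000)
    unnorm = binom.pmf(n_H, n, theta) * beta.pdf(theta, a, b)
    grid_post = unnorm / np.trapz(unnorm, theta)

    print(np.max(np.abs(grid_post - posterior.pdf(theta))))  # ≈ 0: the two agree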



Exponential Family
• Conjugate prior distributions are usually associated with the exponential family, a class of
probability distributions sharing the form specified below.

• Suppose x are observations from the exponential family

        f(x|θ) = C(θ) h(x) exp{ R(θ) · T(x) } = h(x) exp{ R(θ) · T(x) − Ψ(θ) }

  We call this an exponential family; T(x) are the sufficient statistics.

• When θ ∈ ℝᵏ, x ∈ ℝᵏ and f(x|θ) = h(x) exp{ θ · x − Ψ(θ) }, the family is called a
natural family of dimension k.



Exponential Family: Example
• Consider the likelihood function

        f(x|θ) = (1/√(2π)) exp{ −(x − θ)²/2 }

• This is a normal distribution (unknown mean, unit variance). Writing it in the form
f(x|θ) = h(x) exp{ R(θ) · T(x) − Ψ(θ) }, note that:

        R(θ) = θ ;  T(x) = x ;  Ψ(θ) = θ²/2 ;  h(x) = (1/√(2π)) exp(−x²/2)

• Consider the normal distribution (unknown mean, unknown variance)

        f(x|θ) = (1/√(2πσ²)) exp{ −(x − μ)²/(2σ²) }

  Defining θ = (μ, σ), we can then see that, in the form f(x|θ) = C(θ) h(x) exp{ R(θ) · T(x) },

        R(θ) = (μ/σ², −1/σ²)ᵀ ;  T(x) = (x, x²/2)ᵀ ;  C(θ) = (1/σ) exp{ −μ²/(2σ²) } ;  h(x) = 1/√(2π)

  or equivalently, in the form f(x|θ) = h(x) exp{ R(θ) · T(x) − Ψ(θ) },

        Ψ(θ) = μ²/(2σ²) + log σ.
Exponential Family
• Conjugate prior distributions for exponential families:

  - Likelihood:

        f(x|θ) = h(x) exp{ R(θ) · T(x) − Ψ(θ) }

  - Conjugate prior (μ and λ are hyperparameters):

        π(θ | μ, λ) ∝ exp{ R(θ) · μ − λ Ψ(θ) },   λ > 0

  - Posterior:

        π(θ | x) ∝ exp{ R(θ) · (μ + T(x)) − (λ + 1) Ψ(θ) },
        i.e.  π(θ | x) = π(θ | μ + T(x), λ + 1)



Exponential Family: Example
• Normal distribution (unknown mean, known variance)

  - Likelihood: x₁ | θ ~ N(θ, σ²), σ² known, x₁ ∈ ℝ

        f(x₁|θ) = (1/(√(2π) σ)) exp{ −(x₁ − θ)²/(2σ²) }

  - Conjugate prior:

        θ ~ N(μ₀, σ₀²)

  - Posterior:

        θ | x₁ ~ N(μ₁, σ₁²),   1/σ₁² = 1/σ₀² + 1/σ²,   μ₁ = (μ₀/σ₀² + x₁/σ²) / (1/σ₀² + 1/σ²)

    The posterior mean is a weighted average of the observation x₁ and the prior mean.

  - Posterior predictive:

        π(x̃ | x₁) = ∫ π(x̃|θ) π(θ|x₁) dθ ∝ ∫ exp{ −(x̃ − θ)²/(2σ²) } exp{ −(θ − μ₁)²/(2σ₁²) } dθ,
        i.e.  x̃ | x₁ ~ N(μ₁, σ² + σ₁²)

Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004
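
• A small numerical sketch of this Normal-Normal update (the values of μ₀, σ₀, σ and x₁ below are
made up for illustration): it computes the posterior and posterior-predictive parameters from the
formulas above and checks the predictive density at one point by direct numerical integration.

    import numpy as np
    from scipy.stats import norm

    mu0, sig0 = 0.0, 2.0     # hypothetical prior N(mu0, sig0^2)
    sig, x1   = 1.0, 1.5     # known noise std and a single observation

    # Posterior N(mu1, sig1^2): precisions add, means combine precision-weighted
    sig1_sq = 1.0 / (1.0 / sig0**2 + 1.0 / sig**2)
    mu1     = sig1_sq * (mu0 / sig0**2 + x1 / sig**2)

    # Posterior predictive N(mu1, sig^2 + sig1^2)
    pred = norm(mu1, np.sqrt(sig**2 + sig1_sq))

    # Check the predictive density at one point by integrating over theta
    x_new = 2.0
    theta = np.linspace(mu1 - 10, mu1 + 10, 20001)
    integrand = norm.pdf(x_new, theta, sig) * norm.pdf(theta, mu1, np.sqrt(sig1_sq))
    print(np.trapz(integrand, theta), pred.pdf(x_new))   # the two values agree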



Gaussian With Multiple Observations - Unknown Mean
• Assume we have observations Xᵢ | θ ~ N(θ, σ²), i = 1, …, n, and prior θ ~ N(μ₀, σ₀²).

• The posterior is then:

        θ | x₁, x₂, …, xₙ ~ N(μₙ, σₙ²),

        1/σₙ² = 1/σ₀² + n/σ²   ⇒   σₙ² = σ² σ₀² / (n σ₀² + σ²)

        μₙ = σₙ² ( μ₀/σ₀² + (1/σ²) Σ_{i=1}^n x_i ) = ( σ² μ₀ + σ₀² Σ_{i=1}^n x_i ) / (n σ₀² + σ²)

• One can think of the prior as n₀ virtual observations with n₀ = σ²/σ₀², so that

        σₙ² = σ² / (n + n₀),   μₙ = ( Σ_{i=1}^n x_i + n₀ μ₀ ) / (n + n₀)

Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004
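
• A quick sketch (made-up μ₀, σ₀, σ and simulated data) that computes this posterior in two
equivalent ways, via the precision form and via the n₀ "virtual observations" form, to confirm
that they coincide.

    import numpy as np

    rng = np.random.default_rng(0)
    mu0, sig0, sig = 0.0, 3.0, 1.0            # hypothetical prior and known noise std
    x = rng.normal(2.0, sig, size=25)         # simulated data with true mean 2.0
    n = x.size

    # Precision form
    sig_n_sq = 1.0 / (1.0 / sig0**2 + n / sig**2)
    mu_n     = sig_n_sq * (mu0 / sig0**2 + x.sum() / sig**2)

    # Equivalent "virtual observations" form with n0 = sig^2 / sig0^2
    n0 = sig**2 / sig0**2
    mu_n_alt     = (x.sum() + n0 * mu0) / (n + n0)
    sig_n_sq_alt = sig**2 / (n + n0)

    print(mu_n, mu_n_alt)          # identical
    print(sig_n_sq, sig_n_sq_alt)  # identical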



Standard Exponential Family of Distributions
  f(x|θ)                          π(θ)                       π(θ|x)

  Normal  N(θ, σ²)                Normal  N(μ, τ²)           N( (σ²μ + τ²x)/(σ² + τ²), σ²τ²/(σ² + τ²) )
  Poisson  P(θ)                   Gamma  G(α, β)             G(α + x, β + 1)
  Gamma  G(ν, θ)                  Gamma  G(α, β)             G(α + ν, β + x)
  Binomial  B(n, θ)               Beta  Be(α, β)             Be(α + x, β + n − x)
  Negative Binomial  Neg(m, θ)    Beta  Be(α, β)             Be(α + m, β + x)
  Multinomial  M_k(θ₁, …, θ_k)    Dirichlet  D(α₁, …, α_k)   D(α₁ + x₁, …, α_k + x_k)
  Normal  N(μ, 1/θ)               Gamma  Ga(α, β)            G(α + 0.5, β + (μ − x)²/2)
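
• As a quick check of one row of the table, the following sketch (made-up α, β and observation x,
with the Gamma written in the rate parameterization) compares the conjugate Poisson-Gamma update
with a grid computation of likelihood × prior.

    import numpy as np
    from scipy.stats import gamma, poisson

    alpha, beta_ = 2.0, 1.0   # hypothetical Gamma(alpha, beta) prior, beta = rate
    x = 4                     # hypothetical Poisson observation

    # Conjugate update from the table: G(alpha + x, beta + 1)
    post = gamma(alpha + x, scale=1.0 / (beta_ + 1.0))

    # Grid check: posterior ∝ likelihood × prior
    theta = np.linspace(1e-6, 30, 20000)
    unnorm = poisson.pmf(x, theta) * gamma.pdf(theta, alpha, scale=1.0 / beta_)
    grid_post = unnorm / np.trapz(unnorm, theta)
    print(np.max(np.abs(grid_post - post.pdf(theta))))   # ≈ 0: the table row checks out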



Mixture of Conjugate Priors
• Robust priors are useful, but can be computationally expensive to use.

• Conjugate priors simplify the computation, but are often not robust and not flexible enough to
encode our prior knowledge.

• A mixture of conjugate priors is also conjugate and can approximate any kind of prior. Such
priors thus provide a good compromise between computational convenience and flexibility.

• Example: to model coin tosses, we can take a prior which is a mixture of two Beta distributions:

        p(θ) = 0.5 Beta(θ | 20, 20) + 0.5 Beta(θ | 30, 10)

• If θ comes from the first distribution, the coin is fair, but if it comes from the second, it is
biased towards heads.



Mixture of Conjugate Priors
• If we have a prior distribution which is a mixture of distributions conjugate to a given
likelihood, then the posterior is available in closed form and is again a mixture of conjugate
distributions. That is, with

        π(θ) = Σ_{i=1}^K w_i π_i(θ) = Σ_{i=1}^K P(Z = i) π(θ | Z = i)

  we obtain

        π(θ | D) = Σ_{i=1}^K w_i π_i(θ) f(D|θ) / [ Σ_{j=1}^K w_j ∫ π_j(θ) f(D|θ) dθ ]

  or, using w'_i ≡ w_i ∫ π_i(θ) f(D|θ) dθ / Σ_{j=1}^K w_j ∫ π_j(θ) f(D|θ) dθ,

        π(θ | D) = Σ_{i=1}^K w'_i π_i(θ) f(D|θ) / ∫ π_i(θ) f(D|θ) dθ
                 = Σ_{i=1}^K w'_i π_i(θ | D) = Σ_{i=1}^K P(Z = i | D) π(θ | D, Z = i)

  where

        P(Z = i | D) = p(Z = i) p(D | Z = i) / Σ_j p(Z = j) p(D | Z = j)
                     = w'_i = w_i ∫ π_i(θ) f(D|θ) dθ / Σ_{j=1}^K w_j ∫ π_j(θ) f(D|θ) dθ,   Σ_i w'_i = 1.

• One can approximate arbitrarily closely any prior distribution by a mixture of conjugate
distributions (Brown, 1986).
Mixture of Conjugate Priors
• As an example, suppose we use the mixture prior

        p(θ) = 0.5 Beta(θ | a₁, b₁) + 0.5 Beta(θ | a₂, b₂),
        a₁ = b₁ = 20, a₂ = 30, b₂ = 10,  and we observe N₁ heads and N₀ tails.

• The posterior becomes

        p(θ | D) = p(Z = 1 | D) Beta(θ | a₁ + N₁, b₁ + N₀) + p(Z = 2 | D) Beta(θ | a₂ + N₁, b₂ + N₀)

• The posterior mixing weights are given as:

        p(Z = k | D) = p(Z = k) p(D | Z = k) / Σ_{k'} p(Z = k') p(D | Z = k') = p(Z = k) p(D | Z = k) / p(D)

• If N₁ = 20 heads and N₀ = 10 tails, then, using

        p(D | Z = 1) = (N choose N₁) B(a₁ + N₁, b₁ + N₀) / B(a₁, b₁)

  the posterior finally becomes

        p(θ | D) = 0.346 Beta(θ | 40, 30) + 0.654 Beta(θ | 50, 20)

  [Figure: prior and posterior mixtures of Beta densities; mixBetaDemo from Kevin Murphy's PMTK]
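
• The posterior mixing weights above can be reproduced with a few lines of Python; the sketch
below works with log-Beta functions (scipy.special.betaln) for numerical stability, and the common
binomial coefficient cancels in the normalization.

    import numpy as np
    from scipy.special import betaln

    # Mixture prior 0.5 Beta(20,20) + 0.5 Beta(30,10); data: N1 = 20 heads, N0 = 10 tails
    priors  = [(20.0, 20.0), (30.0, 10.0)]
    weights = np.array([0.5, 0.5])
    N1, N0  = 20, 10

    # log p(D | Z=k), dropping the binomial coefficient common to both components
    log_evid = np.array([betaln(a + N1, b + N0) - betaln(a, b) for a, b in priors])

    log_w  = np.log(weights) + log_evid
    post_w = np.exp(log_w - log_w.max())
    post_w /= post_w.sum()
    print(post_w)   # approximately [0.346, 0.654]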
Mixture of Conjugate Priors
• Dirichlet-multinomial models are widely used in biosequence analysis. Consider the sequence
logo problem. [Figure: sequence logo; seqLogoDemo from Kevin Murphy's PMTK]

• Suppose we want to find locations which represent coding regions of the genome. Such
locations often have the same letter across all sequences (mostly all A's, or all T's, or all C's,
or all G's).

• We believe adjacent locations are conserved together. We let Z_t = 1 if location t is
conserved, and let Z_t = 0 otherwise. We add a dependence between adjacent Z_t variables
using a Markov chain.

• To define a likelihood model we use p(N_t, θ_t | Z_t), where N_t is the vector of (A, C, G, T)
counts for column t. We make this a multinomial distribution with parameter θ_t.

• Since each column has a different distribution, we will want to integrate out θ_t and thus
compute the marginal likelihood p(N_t | Z_t).



Mixture of Conjugate Priors
        p(N_t | Z_t) = ∫ p(N_t | θ_t) p(θ_t | Z_t) dθ_t

• But what prior should we use for θ_t?

• When Z_t = 0 we can use a uniform prior, p(θ | Z_t = 0) = Dir(1, 1, 1, 1).

• If the column is conserved, Z_t = 1, it could be a nearly pure column of A's, C's, G's, or T's.
Use a mixture of Dirichlet priors, each tilted towards the appropriate corner of the 4-d simplex,

        p(θ | Z_t = 1) = 1/4 Dir(θ | (10, 1, 1, 1)) + · · · + 1/4 Dir(θ | (1, 1, 1, 10))

• Since this is conjugate, we can easily compute p(N_t | Z_t) (Brown et al. 1993).
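
• A sketch of this marginal-likelihood computation for a single column (the count vector N_t below
is made up): it uses the standard Dirichlet-multinomial marginal likelihood and averages it over
the four tilted Dirichlet components for the conserved case.

    import numpy as np
    from scipy.special import gammaln

    def log_dir_mult(counts, alpha):
        # log p(counts | alpha), dropping the multinomial coefficient (common to all models)
        counts, alpha = np.asarray(counts, float), np.asarray(alpha, float)
        return (gammaln(alpha.sum()) - gammaln(alpha.sum() + counts.sum())
                + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

    N_t = np.array([18, 1, 0, 1])     # hypothetical (A, C, G, T) counts for one column

    # Z_t = 0: uniform Dir(1, 1, 1, 1) prior
    log_p0 = log_dir_mult(N_t, [1, 1, 1, 1])

    # Z_t = 1: mixture of four Dirichlets tilted towards each corner of the simplex
    tilted = [np.roll([10, 1, 1, 1], k) for k in range(4)]
    log_p1 = np.logaddexp.reduce([np.log(0.25) + log_dir_mult(N_t, a) for a in tilted])

    print(log_p0, log_p1)   # here the conserved model (Z_t = 1) has the larger marginal likelihood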



Summary: Conjugate Priors
• PROS:

  - Simple to handle; can be interpreted through imaginary observations.

  - Often considered as the least informative ones.

• CONS:

  - Not applicable to all likelihood functions.

  - Not flexible; cannot account for constraints, e.g. θ > 0.

  - Approximation by mixtures, while feasible, is tedious and thus rarely used
in practice.



Noninformative Priors
• The motivation for noninformative priors:

  - When prior information about the model is too vague or unreliable, it is
usually impossible to justify the choice of prior distributions on a subjective
basis.

  - "Objectivity" requirements force us to provide prior distributions with
as little subjective input as possible, in order to base inference on the
sampling model alone.

  - An intrinsic and acceptable notion of noninformative priors should satisfy
invariance under reparametrization.



Noninformative Priors
• Noninformative priors are intended to have as little influence on the posterior as possible,
i.e. 'letting the data speak for themselves'.

• Assume a distribution p(x|λ) governed by a parameter λ, and a prior p(λ) = const; e.g. if λ is
a discrete variable with K states, this simply amounts to setting the prior probability of each
state to 1/K.

• In the case of continuous λ there are two difficulties with this approach. If the domain of λ is
unbounded, this prior distribution cannot be correctly normalized (improper prior).

• Improper priors can often be used provided the corresponding posterior distribution is proper.

• For example, if we put a uniform prior distribution over the mean of a Gaussian, then the
posterior distribution for the mean, once we have observed at least one data point, will be
proper.



Noninformative Priors
• If we don't have strong beliefs about what θ should be, it is common to use an uninformative
prior, and to let the data speak for itself.

• Consider as an example a Bernoulli parameter, θ ∈ [0, 1].

• An uninformative prior would be the uniform distribution, Beta(1, 1). In this case, the posterior
mean and the MLE are:

        E[θ | D] = (N₁ + 1) / (N₁ + N₀ + 2),        θ_MLE = N₁ / (N₁ + N₀)

• One could argue that the prior wasn't completely uninformative after all.

  (Reminder:  Beta(x | α, β) = Γ(α + β)/(Γ(α)Γ(β)) x^(α−1) (1 − x)^(β−1))
Noninformative Priors
• One can argue that the most non-informative prior is

        lim_{c→0} Beta(c, c) = Beta(0, 0)

• This prior is a mixture of two equal point masses at 0 and 1.

• It is called the Haldane prior.

• Note that the Haldane prior is an improper prior, meaning it does not integrate to 1. However,
as long as we see at least one head and at least one tail, the posterior will be proper.

• We will see shortly that the right uninformative prior is Beta(1/2, 1/2).
Noninformative Priors
• A second difficulty arises from the transformation behavior of a probability density under a
nonlinear change of variables.

• If a function h(λ) is constant, and we change variables to λ = η², then ĥ(η) = h(η²) will also
be constant. However, if we choose the density p_λ(λ) to be constant, then the density of η
will be given by

        p_η(η) = p_λ(λ) |dλ/dη| = p_λ(η²) 2η ∝ η

  and so the density over η will not be constant.

• This issue does not arise when we use maximum likelihood, because the likelihood function
p(x|λ) is a simple function of λ and so we are free to use any convenient parameterization.

• If, however, we are to choose a prior distribution that is constant, we must take care to use an
appropriate representation for the parameters.



Translation Invariant Prior
• Translation invariant: consider a density of the form

        p(x | μ) = f(x − μ)

  then f(·) is translation invariant and μ is a location parameter.

• Note that if we shift x by a constant to give x̄ = x + c, then

        p(x̄ | μ̄) = f(x̄ − μ̄),  where μ̄ = μ + c

• Thus the form of the density remains the same.

• We would like to find a prior that satisfies this translational invariance – a density
independent of the origin.



Translation Invariant Prior
• We want a prior that assigns equal probability to the interval A ≤ μ ≤ B as to the interval
A − c ≤ μ ≤ B − c:

        ∫_A^B p(μ) dμ = ∫_{A−c}^{B−c} p(t) dt = ∫_A^B p(μ − c) dμ     (substituting t = μ − c)

• The translation invariance requirement is thus that the prior distribution should satisfy:

        p(μ) = p(μ − c) for every c ∈ ℝ   ⇒   p(μ) = constant (improper prior)

• This flat prior is improper, but the resulting posterior is proper assuming ∫ f(x − μ) dμ < ∞.
Having seen N ≥ 1 data points will satisfy this: one data point is enough to fix the location.

• An example of a location parameter is the mean μ of a Gaussian. The noninformative prior is
obtained from the conjugate prior N(μ | μ₀, σ₀²) with σ₀² → ∞.
Scale Invariant Prior
• Scale invariant: if the density is of the form

        p(x | σ) = (1/σ) f(x/σ)

  then f(·) is scale invariant and σ is the scale parameter.

• Note that if we change the scale by a constant to give x̄ = c x, then

        p(x̄ | σ̄) = (1/σ̄) f(x̄/σ̄),  where σ̄ = c σ

• Thus the form of the density remains the same.

• We would like to find a prior that satisfies this scale invariance – a density independent of the
scaling used.



Scale Invariant Prior
• We want a prior that assigns equal probability to the interval A ≤ σ ≤ B as to the interval
A/c ≤ σ ≤ B/c:

        ∫_A^B p(σ) dσ = ∫_{A/c}^{B/c} p(t) dt = ∫_A^B p(σ/c) (1/c) dσ     (substituting t = σ/c)

• The scale invariance requirement is thus that the prior distribution should satisfy:

        p(σ) = (1/c) p(σ/c) for every c > 0   ⇒   p(σ) ∝ 1/σ (improper prior)   ⇔   p(ln σ) = const

• We can approximate this with a Gamma(· | 0, 0) prior. This improper prior leads to a proper
posterior if we observe N ≥ 2 data points (we need at least 2 data points to estimate a variance).
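
• A small numerical check of the defining property (arbitrary interval and scale factor chosen for
illustration): under p(σ) ∝ 1/σ the unnormalized mass assigned to [A, B] equals that assigned to
[A/c, B/c], while a flat prior on σ fails this test.

    import numpy as np
    from scipy.integrate import quad

    prior = lambda s: 1.0 / s          # improper prior p(sigma) ∝ 1/sigma

    A, B, c = 0.5, 4.0, 3.0            # arbitrary interval and scale factor
    mass_1, _ = quad(prior, A, B)
    mass_2, _ = quad(prior, A / c, B / c)
    print(mass_1, mass_2)              # both equal log(B/A) ≈ 2.079

    # Contrast with a flat prior on sigma, which is not scale invariant
    flat = lambda s: 1.0
    print(quad(flat, A, B)[0], quad(flat, A / c, B / c)[0])   # 3.5 vs ~1.17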



Scale Invariant Prior
• An example of a scale parameter is the standard deviation σ of a Gaussian, after we account for
the location parameter:

        N(x | μ, σ²) ∝ (1/σ) exp{ −x̃²/(2σ²) },   x̃ = x − μ

• We can express this in terms of the precision λ = 1/σ² rather than σ itself.

• A distribution p(σ) ∝ 1/σ corresponds to a distribution over λ of the form p(λ) ∝ 1/λ.

• The conjugate prior for λ is Gamma(λ | a₀, b₀). The noninformative prior is obtained from the
Gamma with a₀ = b₀ = 0. Recall the general form of the posterior:

        p(λ | X, μ) ∝ Π_{n=1}^N f(x_n | μ, λ) Gamma(λ | a₀, b₀)
                    ∝ λ^(N/2 + a₀ − 1) exp{ −b₀ λ − (λ/2) Σ_{n=1}^N (x_n − μ)² }



Jeffreys' Noninformative Priors
• Jeffreys proposed a more intrinsic approach which avoids the need to take the invariance
structure into account explicitly.

• Given a likelihood f(x|θ), Jeffreys' noninformative prior distributions are based on the Fisher
information, given by

        I(θ) = E_{X|θ}[ (∂ log f(X|θ)/∂θ) (∂ log f(X|θ)/∂θ)ᵀ ] = −E_{X|θ}[ ∂² log f(X|θ)/∂θ² ]

  The corresponding prior distribution is

        π(θ) ∝ |I(θ)|^(1/2)     (|·| denotes the determinant of I)

  [Photo: Sir Harold Jeffreys (1891–1989)]



Jeffreys' Noninformative Priors
• Jeffreys' invariance principle:

  - Any rule for defining the prior distribution on θ should lead to an equivalent result when
using a transformed parameterization.

  - Let φ = h(θ), with h an invertible function whose inverse is θ = g(φ); then

        π̃(φ) = π(θ) |dθ/dφ| = π(g(φ)) |dg(φ)/dφ|

  - Jeffreys' noninformative priors π(θ) ∝ |I(θ)|^(1/2) satisfy this invariant reparameterization
requirement, since

        I(φ) = −E_{X|φ}[ ∂² log f(X|φ)/∂φ² ] = −E_{X|θ}[ ∂² log f(X|θ)/∂θ² ] (dθ/dφ)² = I(θ) (dθ/dφ)²

    so that |I(φ)|^(1/2) = |I(θ)|^(1/2) |dθ/dφ|, which is exactly the change-of-variables rule above.



Jeffreys' Noninformative Priors
• For example, consider normally distributed data with unknown mean.

• Likelihood:

        x_i | θ ~ N(θ, σ²)   (σ known),   i.e.   f(x_{1:n} | θ) ∝ exp{ −Σ_{i=1}^n (x_i − θ)² / (2σ²) }

• Then:

        −∂² log f(x_{1:n} | θ) / ∂θ² = n/σ²   ⇒   π(θ) ∝ 1


Jeffreys' Noninformative Priors
• Consider normally distributed data with unknown variance.

• Likelihood:

        X_i | θ ~ N(μ, θ)   (μ known),   i.e.   f(x_{1:n} | θ) ∝ θ^(−n/2) exp{ −Σ_{i=1}^n (x_i − μ)² / (2θ) }

• Then:

        ∂² log f(x_{1:n} | θ) / ∂θ² = n/(2θ²) − Σ_{i=1}^n (x_i − μ)² / θ³   ⇒

        I(θ) = −E_{X|θ}[ n/(2θ²) − Σ_{i=1}^n (x_i − μ)² / θ³ ] = −n/(2θ²) + nθ/θ³ = n/(2θ²)

• Jeffreys' prior:

        π(θ = σ²) ∝ 1/θ = 1/σ²     (favors small variance)

• Note that the corresponding prior on σ is π(σ) ∝ 1/σ, so that

        π(φ = log σ) = π(σ) |dσ/dφ| = (1/σ) σ = 1,

  i.e. the prior is flat on the log scale.



Jeffreys' Noninformative Priors
• Consider data following a binomial distribution (mean nθ).

• Likelihood:

        f(x | θ) = (n choose x) θ^x (1 − θ)^(n−x)

• Then:

        −∂² log f(x | θ) / ∂θ² = x/θ² + (n − x)/(1 − θ)²   ⇒
        I(θ) = E_{X|θ}[ x/θ² + (n − x)/(1 − θ)² ] = n/θ + n/(1 − θ) = n / (θ(1 − θ))

• The Jeffreys prior is:

        π(θ) ∝ [ θ(1 − θ) ]^(−1/2) ∝ Beta(θ; 1/2, 1/2)

• For a multinoulli random variable with K states, one can show that the Jeffreys prior is

        π(θ) = Dir(1/2, …, 1/2)

  Note that this is neither of the expected answers, π(θ) = Dir(1/K, …, 1/K) or π(θ) = Dir(1, …, 1).
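
• A numerical sketch that recovers the Jeffreys prior for the binomial model above: the Fisher
information is computed directly as an expectation over x and compared with n/(θ(1−θ)); its square
root is then proportional to the Beta(1/2, 1/2) density.

    import numpy as np
    from scipy.stats import binom, beta

    n = 10
    thetas = np.linspace(0.05, 0.95, 10)
    x = np.arange(n + 1)

    for th in thetas:
        # I(theta) = E[ x/theta^2 + (n - x)/(1 - theta)^2 ] under x ~ Binomial(n, theta)
        fisher = np.sum(binom.pmf(x, n, th) * (x / th**2 + (n - x) / (1 - th)**2))
        assert np.isclose(fisher, n / (th * (1 - th)))

    # Jeffreys prior ∝ sqrt(I(theta)) ∝ theta^(-1/2) (1-theta)^(-1/2), i.e. Beta(1/2, 1/2) up to a constant
    ratio = 1.0 / np.sqrt(thetas * (1 - thetas)) / beta.pdf(thetas, 0.5, 0.5)
    print(np.allclose(ratio, ratio[0]))   # True: the ratio is constant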
Pros and Cons of Jeffreys' Priors
• It can lead to incoherencies; e.g. the Jeffreys prior for Gaussian data with θ = (μ, σ) unknown
is π(θ) ∝ σ⁻². Indeed, using

        log f(x | θ) = log(1/(2π)^(1/2)) − log σ − (x − μ)²/(2σ²),

  the second derivatives of log f are

        ∂²/∂μ² = −1/σ²,   ∂²/∂μ∂σ = −2(x − μ)/σ³,   ∂²/∂σ² = 1/σ² − 3(x − μ)²/σ⁴,

  so that

        I(θ) = −E_{X|θ}[ ∂² log f / ∂θ ∂θᵀ ] = diag(1/σ², 2/σ²)   ⇒   π(θ) ∝ |I(θ)|^(1/2) ∝ σ⁻²

• However, if μ and σ are assumed a priori independent (using the results derived earlier for each
parameter separately), then π(θ) ∝ σ⁻¹.

• It is an automated procedure that, however, cannot incorporate any "physical" information.

• It does NOT satisfy the likelihood principle. The Fisher information can differ for two
experiments providing proportional likelihoods. For an example consider the binomial and
negative binomial distributions.

C. P. Robert, The Bayesian Choice, Springer, 2nd edition, chapter 3 (full text available)



Lack of Robustness of the Normal Prior
• Comparison of two posterior distributions corresponding to the flat prior and the conjugate prior
N(0, 0.1σ̄²) (where the variance σ̄² refers here to the empirical variance of the sample). We
use the data normaldata. This shows the lack of robustness of the normal prior.

• When the hyperparameters in the prior vary, both the range and the location of the posterior
change; they are not constrained by the data alone.

  [Figure: posterior densities under the flat prior and the N(0, 0.1σ̄²) conjugate prior for the
  normaldata sample; see the MatLab implementation. J.-M. Marin & C. P. Robert, The Bayesian
  Core, Springer, 2nd edition, chapter 2 (full text available)]


Robust Priors: Heavy Tails
• We want to make sure that the prior does not have an undue influence on the results.

• We can do this with robust priors, which typically have heavy tails; this avoids forcing things
to be too close to the prior mean.

• As an example, consider x ~ N(θ, 1). We observe x = 5 and we want to estimate θ. The
MLE is θ̂ = 5, which seems reasonable. The posterior mean under a uniform prior is also
E[θ | x = 5] = 5.

• Prior constraints: let the prior median be 0 and the prior quartiles be at −1 and 1, so
p(θ ≤ −1) = p(−1 < θ ≤ 0) = p(0 < θ ≤ 1) = p(1 < θ) = 0.25. Assume also that the prior is
smooth and unimodal.

• Use the Gaussian prior N(θ | 0, 2.19) (variance 2.19, i.e. standard deviation ≈ 1.48): it
satisfies the prior constraints, but the posterior mean is 3.43, which is not very satisfactory.

• Use the Cauchy prior T(θ | 0, 1, 1): this also satisfies the prior constraints of our example,
but this time we find that the posterior mean is about 4.6, which seems much more reasonable.
(robustPriorDemo from Kevin Murphy's PMTK)
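
• The sketch below reproduces these numbers by direct numerical quadrature; the Gaussian prior
variance is taken to be 2.19 (standard deviation ≈ 1.48), which is what the quartile constraint
implies and what yields the quoted posterior mean of 3.43.

    import numpy as np
    from scipy.stats import norm, cauchy
    from scipy.integrate import quad

    x = 5.0                                   # the single observation, x ~ N(theta, 1)

    def posterior_mean(prior_pdf):
        # Posterior mean under the N(theta, 1) likelihood by direct numerical integration
        num, _ = quad(lambda t: t * norm.pdf(x, t, 1.0) * prior_pdf(t), -30, 30)
        den, _ = quad(lambda t: norm.pdf(x, t, 1.0) * prior_pdf(t), -30, 30)
        return num / den

    gauss_prior  = lambda t: norm.pdf(t, 0.0, np.sqrt(2.19))   # quartiles at about +/- 1
    cauchy_prior = lambda t: cauchy.pdf(t, 0.0, 1.0)           # quartiles at exactly +/- 1

    print(posterior_mean(gauss_prior))    # ~3.43: pulled strongly towards the prior mean
    print(posterior_mean(cauchy_prior))   # ~4.6:  heavy tails let the data dominate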
