Lecture 17: Prior Modeling
Email: [email protected]
URL: https://www.zabaras.com/
Robust Priors
Subjectivity does not imply being unscientific: one can use scientific information to guide the specification of priors. Common tools for doing so include:
- Conjugate priors
- Exponential families
- Maximum entropy priors
Conjugate priors are analytically tractable: finding the posterior reduces to updating the corresponding parameters of the prior.
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int_0^1 f(x \mid \theta)\,\pi(\theta)\,d\theta} = \mathcal{B}e(a + n_H,\; b + n - n_H) = \mathcal{B}e(a + n_H,\; b + n_T),$$
where $n_H$ and $n_T = n - n_H$ are the numbers of heads and tails in $n$ coin tosses and $\mathcal{B}e(a, b)$ is the prior on $\theta$.
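A minimal numerical sketch of this update (my own illustration; the function name and the toss data are hypothetical, not from the slides):

import numpy as np

def beta_bernoulli_update(a, b, tosses):
    """Conjugate update of a Beta(a, b) prior on the heads probability theta,
    given Bernoulli outcomes (1 = heads, 0 = tails)."""
    tosses = np.asarray(tosses)
    n_heads = int(tosses.sum())
    n_tails = int(tosses.size - n_heads)
    return a + n_heads, b + n_tails          # posterior is Be(a + n_H, b + n_T)

a_post, b_post = beta_bernoulli_update(2.0, 2.0, [1, 0, 1, 1, 0, 1])
print(a_post, b_post, "posterior mean:", a_post / (a_post + b_post))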
Conjugate priors provide a first approximation to the adequate prior distribution which should
be followed by a robustness analysis.
Example: a normal likelihood with unknown mean $\theta$ and unit variance. In the exponential-family notation introduced below, note that:
$$f(x \mid \theta) = h(x)\, \exp\!\big(\theta x - \psi(\theta)\big), \qquad R(\theta) = \theta, \quad T(x) = x, \quad \psi(\theta) = \frac{\theta^2}{2}, \quad h(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$
Likelihood (exponential family):
$$f(x \mid \theta) = h(x)\, \exp\!\big(R(\theta)\, T(x) - \psi(\theta)\big)$$
Conjugate prior, with hyperparameters $(\mu, \lambda)$:
$$\pi(\theta \mid \mu, \lambda) \propto \exp\!\big(\mu\, R(\theta) - \lambda\, \psi(\theta)\big), \qquad \lambda > 0$$
Posterior:
$$\pi(\theta \mid x) \propto \exp\!\big((\mu + T(x))\, R(\theta) - (\lambda + 1)\, \psi(\theta)\big), \quad \text{i.e.} \quad \pi(\theta \mid x) = \pi\big(\theta \mid \mu + T(x),\ \lambda + 1\big).$$
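As a quick worked check (not in the original slides), the Bernoulli likelihood fits this exponential-family form, and the general update recovers the beta-Bernoulli result above:
$$f(x \mid \theta) = \theta^{x}(1-\theta)^{1-x} = \exp\!\Big(x\,\log\tfrac{\theta}{1-\theta} + \log(1-\theta)\Big), \qquad R(\theta) = \log\tfrac{\theta}{1-\theta},\quad T(x) = x,\quad \psi(\theta) = -\log(1-\theta),\quad h(x) = 1,$$
$$\pi(\theta \mid \mu, \lambda) \propto \exp\!\big(\mu\,R(\theta) - \lambda\,\psi(\theta)\big) = \theta^{\mu}(1-\theta)^{\lambda-\mu} \;\propto\; \mathcal{B}e(\mu+1,\ \lambda-\mu+1),$$
and each observation updates $(\mu, \lambda) \to (\mu + x,\ \lambda + 1)$: a head increments the first beta parameter, a tail the second.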
Conjugate prior (for the normal likelihood $x \mid \theta \sim N(\theta, \sigma^2)$ with known $\sigma^2$):
$$\theta \sim N(\mu_0, \sigma_0^2)$$
Posterior:
$$\theta \mid x_1 \sim N(\mu_1, \sigma_1^2), \qquad \mu_1 = \frac{\sigma^2 \mu_0 + \sigma_0^2 x_1}{\sigma^2 + \sigma_0^2}, \qquad \frac{1}{\sigma_1^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}.$$
The posterior mean $\mu_1$ is a weighted average of the observation $x_1$ and the prior mean $\mu_0$.
Posterior predictive (its mean is $\mu_1$):
$$\pi(x \mid x_1) = \int f(x \mid \theta)\, \pi(\theta \mid x_1)\, d\theta \;\propto\; \int e^{-\frac{(x-\theta)^2}{2\sigma^2}}\, e^{-\frac{(\theta-\mu_1)^2}{2\sigma_1^2}}\, d\theta \;\sim\; N(\mu_1,\ \sigma^2 + \sigma_1^2).$$
With $n$ observations $x_{1:n}$ and sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$:
$$\theta \mid x_{1:n} \sim N(\mu_n, \sigma_n^2), \qquad \mu_n = \frac{\sigma^2 \mu_0 + n\sigma_0^2\, \bar{x}}{\sigma^2 + n\,\sigma_0^2}, \qquad \frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}.$$
One can think of the prior as $n_0$ virtual observations, with $n_0 = \sigma^2/\sigma_0^2$, so that
$$\mu_n = \frac{\sum_{i=1}^{n} x_i + n_0\, \mu_0}{n + n_0}, \qquad \sigma_n^2 = \frac{\sigma^2}{n + n_0}.$$
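A minimal sketch of this update in its virtual-observation form (my own code, not from the slides; the function name and numbers are illustrative only):

import numpy as np

def normal_known_var_posterior(x, sigma2, mu0, sigma0_2):
    """Posterior N(mu_n, sigma_n^2) for the mean of N(theta, sigma2) data,
    with a conjugate N(mu0, sigma0_2) prior on theta."""
    x = np.asarray(x, dtype=float)
    n = x.size
    n0 = sigma2 / sigma0_2                     # prior acts like n0 virtual observations
    mu_n = (x.sum() + n0 * mu0) / (n + n0)     # weighted average of data and prior mean
    sigma_n2 = sigma2 / (n + n0)               # posterior variance
    return mu_n, sigma_n2

mu_n, sigma_n2 = normal_known_var_posterior([1.2, 0.8, 1.1, 0.9, 1.0],
                                            sigma2=1.0, mu0=0.0, sigma0_2=10.0)
# The posterior predictive for a new x is then N(mu_n, sigma2 + sigma_n2).
print(mu_n, sigma_n2)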
Standard conjugate families (update shown for a single observation x):

Likelihood f(x|θ)               Conjugate prior π(θ)       Posterior π(θ|x)
Normal N(θ, σ²)                 Normal N(μ, τ²)            N( (σ²μ + τ²x)/(σ² + τ²), σ²τ²/(σ² + τ²) )
Poisson P(θ)                    Gamma G(α, β)              G(α + x, β + 1)
Gamma G(ν, θ)                   Gamma G(α, β)              G(α + ν, β + x)
Binomial B(n, θ)                Beta Be(α, β)              Be(α + x, β + n − x)
Negative Binomial Neg(m, θ)     Beta Be(α, β)              Be(α + m, β + x)
Multinomial Mₖ(θ₁, …, θₖ)       Dirichlet D(α₁, …, αₖ)     D(α₁ + x₁, …, αₖ + xₖ)
Normal N(μ, 1/θ)                Gamma Ga(α, β)             Ga(α + 1/2, β + (μ − x)²/2)
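To illustrate one row of the table with code (my own example, not from the slides): the table's Poisson entry is for a single observation, while this helper accumulates n observations, each adding xᵢ to the shape and 1 to the rate.

import numpy as np

def poisson_gamma_posterior(x, alpha, beta):
    """Gamma posterior for a Poisson rate theta under a Gamma(alpha, beta) prior."""
    x = np.asarray(x)
    return alpha + x.sum(), beta + x.size

alpha_n, beta_n = poisson_gamma_posterior([3, 1, 4, 2], alpha=2.0, beta=1.0)
print("posterior:", (alpha_n, beta_n), "posterior mean:", alpha_n / beta_n)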
Conjugate priors simplify the computation, but are often not robust, and not flexible enough to
encode our prior knowledge.
A mixture of conjugate priors is also conjugate and can approximate any kind of prior. Thus
such priors provide a good compromise between computational convenience and flexibility.
Example: to model coin tosses, we can take a prior which is a mixture of two beta distributions.
If 𝜃 comes from the first distribution, the coin is fair, but if it comes from the second, it is
biased towards heads.
Take the prior to be a mixture of $K$ conjugate components, $\pi(\theta) = \sum_{i=1}^{K} w_i\, \pi_i(\theta)$. Multiplying by the likelihood $f(\mathcal{D}\mid\theta)$ and normalizing, we obtain
$$\pi(\theta \mid \mathcal{D}) = \frac{\sum_{i=1}^{K} w_i\, \pi_i(\theta)\, f(\mathcal{D}\mid\theta)}{\sum_{j=1}^{K} w_j \int \pi_j(\theta)\, f(\mathcal{D}\mid\theta)\, d\theta},$$
or, using
$$w_i' \equiv \frac{w_i \int \pi_i(\theta)\, f(\mathcal{D}\mid\theta)\, d\theta}{\sum_{j=1}^{K} w_j \int \pi_j(\theta)\, f(\mathcal{D}\mid\theta)\, d\theta}, \qquad \pi_i(\theta \mid \mathcal{D}) = \frac{\pi_i(\theta)\, f(\mathcal{D}\mid\theta)}{\int \pi_i(\theta)\, f(\mathcal{D}\mid\theta)\, d\theta},$$
$$\pi(\theta \mid \mathcal{D}) = \sum_{i=1}^{K} w_i'\, \pi_i(\theta \mid \mathcal{D}) = \sum_{i=1}^{K} P(Z = i \mid \mathcal{D})\, \pi(\theta \mid \mathcal{D}, Z = i).$$
With a latent indicator $Z$ such that $p(Z = i) = w_i$ and $p(\mathcal{D} \mid Z = i) = \int \pi_i(\theta)\, f(\mathcal{D}\mid\theta)\, d\theta$, the updated weights are the posterior component probabilities,
$$w_i' = P(Z = i \mid \mathcal{D}) = \frac{p(Z = i)\, p(\mathcal{D}\mid Z = i)}{\sum_{j=1}^{K} p(Z = j)\, p(\mathcal{D}\mid Z = j)}.$$
One can approximate arbitrarily closely any prior distribution by a mixture of conjugate distributions (Brown, 1986).
Mixture of Conjugate Priors
As an example, suppose we use the mixture prior
$$p(\theta) = 0.5\, \mathrm{Beta}(\theta \mid a_1, b_1) + 0.5\, \mathrm{Beta}(\theta \mid a_2, b_2)$$
with $a_1 = b_1 = 20$, $a_2 = 30$, $b_2 = 10$, and we observe $N_1$ heads and $N_0$ tails. The posterior becomes
$$p(\theta \mid \mathcal{D}) = p(Z = 1 \mid \mathcal{D})\, \mathrm{Beta}(\theta \mid a_1 + N_1,\ b_1 + N_0) + p(Z = 2 \mid \mathcal{D})\, \mathrm{Beta}(\theta \mid a_2 + N_1,\ b_2 + N_0).$$
The posterior mixing weights are given by
$$p(Z = k \mid \mathcal{D}) = \frac{p(Z = k)\, p(\mathcal{D}\mid Z = k)}{\sum_{k'} p(Z = k')\, p(\mathcal{D}\mid Z = k')} = \frac{p(Z = k)\, p(\mathcal{D}\mid Z = k)}{p(\mathcal{D})}.$$
Since each component is conjugate, we can easily compute the marginal likelihoods $p(\mathcal{D} \mid Z = k)$ (Brown et al. 1993); for the first component,
$$p(\mathcal{D} \mid Z = 1) = \binom{N}{N_1}\, \frac{B(a_1 + N_1,\ b_1 + N_0)}{B(a_1,\ b_1)},$$
where $B(\cdot,\cdot)$ is the Beta function and $N = N_1 + N_0$. The posterior $p(\theta \mid \mathcal{D})$ is then the mixture of Beta distributions shown in the figure.

[Figure: mixture of Beta distributions.]
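A small numerical sketch of this two-component example (my own illustration with arbitrary counts, not the MATLAB code referenced later); the binomial coefficient is omitted since it is common to both components and cancels in the weights:

import numpy as np
from scipy.special import betaln

w = np.array([0.5, 0.5])            # prior mixture weights
a = np.array([20.0, 30.0])          # a1, a2
b = np.array([20.0, 10.0])          # b1, b2
N1, N0 = 20, 10                     # observed heads and tails (illustrative)

# log p(D | Z = k) = log B(a_k + N1, b_k + N0) - log B(a_k, b_k)
log_ml = betaln(a + N1, b + N0) - betaln(a, b)
log_w = np.log(w) + log_ml
w_post = np.exp(log_w - np.logaddexp.reduce(log_w))   # p(Z = k | D)

print("posterior mixing weights:", w_post)
print("posterior components: Beta(", a + N1, ",", b + N0, ")")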
Cons: approximation by mixtures, while feasible, is very tedious and thus not used in practice.
Assume a distribution $p(x \mid \lambda)$ governed by a parameter $\lambda$, and a prior $p(\lambda) = \text{const}$. For example, if $\lambda$ is a discrete variable with $K$ states, this simply amounts to setting the prior probability of each state to $1/K$.
In the case of continuous $\lambda$ there are two difficulties with this approach. First, if the domain of $\lambda$ is unbounded, this prior distribution cannot be correctly normalized (an improper prior); the second difficulty, related to how densities transform under reparameterization, is discussed below.
Improper priors can often be used provided the corresponding posterior distribution is proper.
For example, if we put a uniform prior distribution over the mean of a Gaussian, then the
posterior distribution for the mean, once we have observed at least one data point, will be
proper.
An uninformative prior would be the uniform distribution, Beta(1, 1). In this case, the posterior mean and the MLE are:
$$\mathbb{E}[\theta \mid \mathcal{D}] = \frac{N_1 + 1}{N_1 + N_0 + 2}, \qquad \hat{\theta}_{MLE} = \frac{N_1}{N_1 + N_0}.$$
One could argue that the prior wasn’t completely uninformative after all.
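For instance (illustrative counts of my own, not from the slides), with $N_1 = 3$ heads and $N_0 = 1$ tail,
$$\mathbb{E}[\theta \mid \mathcal{D}] = \frac{3+1}{3+1+2} = \frac{4}{6} \approx 0.67, \qquad \hat{\theta}_{MLE} = \frac{3}{4} = 0.75,$$
so the uniform prior still pulls the estimate toward $1/2$.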
$$\pi(\theta) \propto 1, \qquad \mathrm{Beta}_{\alpha,\beta}(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1}(1 - x)^{\beta - 1}.$$
Noninformative Priors
One can argue that the most non-informative prior is the Haldane prior, $\mathcal{B}e(0, 0)$, i.e. $\pi(\theta) \propto \theta^{-1}(1-\theta)^{-1}$.
Note that the Haldane prior is an improper prior, meaning it does not integrate to 1. However,
as long as we see at least one head and at least one tail, the posterior will be proper.
If a function $h(\lambda)$ is constant, and we change variables to $\lambda = \eta^2$, then $\hat{h}(\eta) = h(\eta^2)$ will also be constant. However, if we choose the density $p_\lambda(\lambda)$ to be constant, then the density of $\eta$ will be given by
$$p_\eta(\eta) = p_\lambda(\lambda)\left|\frac{d\lambda}{d\eta}\right| = p_\lambda(\eta^2)\, 2\eta,$$
and so the density over $\eta$ will not be constant.
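A quick Monte Carlo check of this point (my own sketch, not part of the lecture): draw λ uniformly on [0, 1], set η = √λ, and the histogram of η follows 2η rather than a flat density.

import numpy as np

rng = np.random.default_rng(0)
lam = rng.uniform(0.0, 1.0, size=200_000)    # flat density on lambda
eta = np.sqrt(lam)                           # change of variables lambda = eta^2

hist, edges = np.histogram(eta, bins=20, range=(0.0, 1.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - 2.0 * centers)))  # close to 0: density of eta is 2*eta, not constant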
This issue does not arise when we use maximum likelihood, because the likelihood function
𝑝(𝑥|𝜆) is a simple function of 𝜆 and so we are free to use any convenient parameterization.
If, however, we are to choose a prior distribution that is constant, we must take care to use an
appropriate representation for the parameters.
If the density has the form $p(x \mid \mu) = f(x - \mu)$, then $\mu$ is a location parameter. We would like to find a prior that satisfies translational invariance, i.e. a density independent of the choice of origin.
Requiring that the prior assign the same mass to a region $A$ and to its shifted version $A - c$,
$$\int_A p(\mu)\, d\mu = \int_{A - c} p(t)\, dt = \int_A p(\mu - c)\, d\mu \qquad (t = \mu - c).$$
A translation invariance requirement is thus that the prior distribution should satisfy
$$p(\mu) = p(\mu - c) \ \text{ for every } c \quad \Longrightarrow \quad p(\mu) = \text{constant (improper prior)}.$$
This flat prior is improper, but the resulting posterior is proper once we have seen $N \ge 1$ data points, since $\int f(x - \mu)\, d\mu < \infty$ will satisfy this; one data point is enough to fix the location. An example of a location parameter is the mean $\mu$ of a Gaussian. The noninformative prior is obtained from the conjugate prior $N(\mu \mid \mu_0, \sigma_0^2)$ in the limit $\sigma_0^2 \to \infty$.
Scale Invariant Prior
Scale invariance: if the density is of the form
$$p(x \mid \sigma) = \frac{1}{\sigma}\, f\!\left(\frac{x}{\sigma}\right),$$
then $f(\cdot)$ is scale invariant and $\sigma$ is the scale parameter. Under a rescaling of the data, $\bar{x} = c\,x$, the density keeps the same form,
$$p(\bar{x} \mid \bar{\sigma}) = \frac{1}{\bar{\sigma}}\, f\!\left(\frac{\bar{x}}{\bar{\sigma}}\right), \qquad \text{where } \bar{\sigma} = c\,\sigma.$$
We would like to find a prior that satisfies this scale invariance – a density independent of the
scaling used.
$$p(\sigma) = \frac{1}{c}\, p\!\left(\frac{\sigma}{c}\right) \ \text{ for every } c \quad \Longrightarrow \quad p(\sigma) \propto \frac{1}{\sigma} \ \text{(improper prior)}, \qquad p(\ln\sigma) = \text{const}.$$
We can approximate this with $p(\lambda) = \mathrm{Gamma}(\lambda \mid 0, 0)$ on the precision $\lambda = 1/\sigma^2$. This improper prior leads to a proper posterior if we observe $N \ge 2$ data points (we need at least two data points to estimate a variance).
The conjugate prior for $\lambda$ is $\mathrm{Gamma}(\lambda \mid a_0, b_0)$. The noninformative prior is obtained from the Gamma with $a_0 = b_0 = 0$. Recall the general form of the posterior:
$$p(\lambda \mid \mathbf{X}, \mu) \propto \prod_{n=1}^{N} f(x_n \mid \mu, \lambda)\ \mathrm{Gamma}(\lambda \mid a_0, b_0) \;\propto\; \lambda^{N/2 + a_0 - 1}\, \exp\!\left(-b_0\,\lambda - \frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right).$$
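A small sketch of this Gamma update for the precision with known mean (my own code, not from the slides; with a₀ = b₀ = 0 it reproduces the noninformative limit):

import numpy as np

def gaussian_precision_posterior(x, mu, a0=0.0, b0=0.0):
    """Gamma(a_N, b_N) posterior for the precision lambda of N(mu, 1/lambda),
    with known mean mu and a Gamma(a0, b0) prior on lambda."""
    x = np.asarray(x, dtype=float)
    a_N = a0 + x.size / 2.0
    b_N = b0 + 0.5 * np.sum((x - mu) ** 2)
    return a_N, b_N

a_N, b_N = gaussian_precision_posterior([0.9, 1.3, 0.7, 1.1], mu=1.0)
print("posterior mean precision:", a_N / b_N)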
Given a likelihood $f(x \mid \theta)$, Jeffreys noninformative prior distributions are based on the Fisher information:
$$\pi(\theta) \propto \big|I(\theta)\big|^{1/2},$$
where $|I(\theta)|$ is the determinant of the Fisher information matrix.
Any rule for defining the prior distribution on $\theta$ should lead to an equivalent result when using a transformed parameterization $\theta = g(\eta)$:
$$\pi_\eta(\eta) = \pi_\theta(\theta)\left|\frac{d\theta}{d\eta}\right| = \pi_\theta\big(g(\eta)\big)\left|\frac{dg(\eta)}{d\eta}\right|.$$
The Fisher information transforms in exactly this way:
$$I(\eta) = \mathbb{E}_{X\mid\eta}\!\left[-\frac{\partial^2 \log f(X\mid\eta)}{\partial \eta^2}\right] = \mathbb{E}_{X\mid\theta}\!\left[-\frac{\partial^2 \log f(X\mid\theta)}{\partial \theta^2}\right]\left(\frac{d\theta}{d\eta}\right)^2 = I(\theta)\left(\frac{d\theta}{d\eta}\right)^2,$$
so that $I(\eta)^{1/2} = I(\theta)^{1/2}\,|d\theta/d\eta|$, i.e. the Jeffreys prior is invariant under reparameterization.
Likelihood:
$$x_i \mid \theta \sim N(\theta, \sigma^2) \ (\text{known } \sigma^2), \qquad \text{i.e.} \quad f(x_{1:n} \mid \theta) \propto \exp\!\left(-\frac{\sum_{i=1}^{n}(x_i - \theta)^2}{2\sigma^2}\right).$$
Then
$$I(\theta) = \mathbb{E}_{X\mid\theta}\!\left[-\frac{\partial^2 \log f(x_{1:n} \mid \theta)}{\partial \theta^2}\right] = \frac{n}{\sigma^2} \quad \Longrightarrow \quad \pi(\theta) \propto \sqrt{I(\theta)} \propto 1.$$
Note that $d(\log\sigma)/d\sigma = 1/\sigma$, so the scale prior $\pi(\sigma) \propto 1/\sigma$ is equivalent to a flat prior on $\log\sigma$.
For a multinoulli random variable with $K$ states, one can show that the Jeffreys prior is
$$\pi(\boldsymbol{\theta}) \propto \mathrm{Dir}\!\left(\tfrac{1}{2}, \ldots, \tfrac{1}{2}\right).$$
Note that this is not any of the expected answers:
$$\pi(\boldsymbol{\theta}) \propto \mathrm{Dir}\!\left(\tfrac{1}{K}, \ldots, \tfrac{1}{K}\right) \quad \text{or} \quad \pi(\boldsymbol{\theta}) \propto \mathrm{Dir}(1, \ldots, 1).$$
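A quick numerical sanity check for the K = 2 (Bernoulli) case (my own, not from the slides): the Fisher information is I(θ) = 1/(θ(1 − θ)), so I(θ)^(1/2) is proportional to the Beta(1/2, 1/2) density.

import numpy as np
from scipy.stats import beta

theta = np.linspace(0.01, 0.99, 99)
jeffreys = 1.0 / np.sqrt(theta * (1.0 - theta))   # sqrt of the Fisher information
beta_half = beta.pdf(theta, 0.5, 0.5)             # Beta(1/2, 1/2) density

ratio = jeffreys / beta_half                      # constant (equal to pi) up to rounding
print(ratio.min(), ratio.max())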
Pros and Cons of Jeffreys Priors
It can lead to incoherencies; e.g. the Jeffreys prior for Gaussian data with $\theta = (\mu, \sigma)$ unknown is $\pi(\theta) \propto \sigma^{-2}$. Indeed, using
$$\ln f(x \mid \theta) = -\ln\!\big((2\pi)^{1/2}\big) - \ln\sigma - \frac{(x-\mu)^2}{2\sigma^2},$$
$$I(\theta) = \mathbb{E}_{X\mid\theta}\!\begin{pmatrix} \dfrac{1}{\sigma^2} & \dfrac{2(x-\mu)}{\sigma^3} \\[8pt] \dfrac{2(x-\mu)}{\sigma^3} & \dfrac{3(x-\mu)^2}{\sigma^4} - \dfrac{1}{\sigma^2} \end{pmatrix} = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\[4pt] 0 & \dfrac{2}{\sigma^2} \end{pmatrix} \quad \Longrightarrow \quad \pi(\theta) \propto \big|I(\theta)\big|^{1/2} \propto \sigma^{-2}.$$
However, if $\mu$ and $\sigma$ are assumed a priori independent, each with its own invariant prior (using the results derived earlier), then $\pi(\theta) \propto \sigma^{-1}$.
It does NOT satisfy the likelihood principle: the Fisher information can differ for two experiments that provide proportional likelihoods. As an example, consider the Binomial and Negative Binomial distributions.
C. P. Robert, The Bayesian Choice, Springer, 2nd edition, chapter 3 (full text available)
[Figure: posterior densities under a flat prior and under a conjugate N(0, 0.1σ̄²) prior; see the MATLAB implementation.]

J.-M. Marin & C. P. Robert, The Bayesian Core, Springer, 2nd edition, chapter 2 (full text available)