
STA 114: Statistics

Notes 12. The Jeffreys Prior

Uniform priors and invariance


Recall that in his female birth rate analysis, Laplace used a uniform prior on the birth rate
p ∈ [0, 1]. His justification was one of “ignorance” or “lack of information”. He pretended
that he had no (prior) reason to consider one value of p = p1 more likely than another value
p = p2 (both values coming from the range [0, 1]). A uniform pdf is consistent with such a
consideration. But there is a logical flaw.
Consider the log-odds of a female birth, $\eta = \log\frac{p}{1-p}$. By the same logic, Laplace should not prefer any value $\eta = \eta_1$ over any other value $\eta = \eta_2$. So a prior plausibility score $\xi_\eta(\eta)$ on $\eta$ should satisfy $\xi_\eta(\eta_1)/\xi_\eta(\eta_2) = 1$ for all $\eta_1, \eta_2$ (there is a technical difficulty in turning $\xi_\eta(\eta)$ into a pdf, but we ignore that for the moment). But if $p$ is assigned the uniform prior pdf $\xi_p(p) = 1$, $p \in [0, 1]$, then it induces the following prior pdf on $\eta$ (by the change of variable $p = 1/(1 + \exp(-\eta))$):
$$\tilde{\xi}_\eta(\eta) = \xi_p\!\left(\frac{1}{1+\exp(-\eta)}\right)\frac{\exp(\eta)}{(1+\exp(\eta))^2} = \frac{\exp(\eta)}{(1+\exp(\eta))^2},$$
and hence $\tilde{\xi}_\eta(\eta_1)/\tilde{\xi}_\eta(\eta_2) \neq 1$ unless $\eta_1 = \eta_2$.


That is, the logic of no-preference on $p$ leads to an (induced) prior pdf on $\eta$ that does not conform with the logic of no-preference applied directly to $\eta$, even though $\eta$ is a monotone transform of $p$. This was held as a major criticism against Bayesian inference in the early 20th century by, among others, the most eminent statistician of them all, R. A. Fisher. This all but killed the development of Bayesian statistics until H. Jeffreys revived the topic in the mid-20th century.
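To see the non-uniformity of the induced prior concretely, here is a minimal numerical sketch (not part of the original notes; it assumes Python with numpy available). It samples $p$ from the uniform prior, transforms to the log-odds $\eta$, and checks the empirical distribution against the logistic CDF $1/(1+\exp(-t))$, whose density is exactly $\exp(\eta)/(1+\exp(\eta))^2$:

import numpy as np

rng = np.random.default_rng(0)

# Draw p from the uniform prior on [0, 1] and transform to the log-odds eta.
p = rng.uniform(size=1_000_000)
eta = np.log(p / (1 - p))

# Under the induced prior, P(eta <= t) = 1 / (1 + exp(-t)) (the logistic CDF),
# so the induced density exp(eta) / (1 + exp(eta))^2 is clearly not flat in eta.
for t in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    empirical = np.mean(eta <= t)
    analytic = 1 / (1 + np.exp(-t))
    print(f"t = {t:+.1f}   empirical P(eta <= t) = {empirical:.4f}   logistic CDF = {analytic:.4f}")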

Invariance under monotone transformation


Note that the premise of this discussion and debate is the case when there is not much
prior information about the parameter. The question is, is there a prior pdf (for a given
model) that would be universally accepted as a non-informative prior? Laplace’s proposal
was to use the uniform distribution. When the parameter space is discrete and finite, this
choice is indeed non-informative and even survives the scrutiny of monotone transformations
mentioned above. But when the parameter space is a continuum and one is seeking a
prior pdf, uniform distributions are not universally accepted; the lack of invariance under monotone transformations is one big criticism.
Jeffreys proposed that an acceptable “non-informative prior finding principle” should be
invariant under monotone transformations of the parameter. Let the statistical model be
X ∼ f (x|θ), θ ∈ Θ. And suppose the principle under consideration produces the prior ξθ(θ) for θ. Now suppose we look at a re-parametrization η = h(θ), given by a smooth
monotone transformation h. The reparametrized model is X ∼ g(x|η), η ∈ E, where g(x|η) =
f (x|h−1 (η)) and E = h(Θ) = {h(θ) : θ ∈ Θ}. Suppose the principle, when applied to the
re-parametrized model, produces a prior pdf ξη (η) on η.
But one could also derive a prior pdf $\tilde{\xi}_\eta(\eta)$ by starting from the prior pdf $\xi_\theta(\theta)$ on $\theta$ and using the transformation $\eta = h(\theta)$. This pdf is given by $\tilde{\xi}_\eta(\eta) = \xi_\theta(h^{-1}(\eta))/|h'(h^{-1}(\eta))|$.
Jeffreys' demand of invariance is the same as saying that the two pdfs $\xi_\eta(\eta)$ [found by applying the principle directly on $\eta$] and $\tilde{\xi}_\eta(\eta)$ [found by applying the principle to $\theta$ and then deriving the corresponding pdf on $\eta$] should be the same. A little algebra shows that
$$\begin{aligned}
&\xi_\eta(\eta) = \tilde{\xi}_\eta(\eta), \quad \text{for all } \eta \in E\\
\iff\ & \xi_\eta(h(\theta)) = \tilde{\xi}_\eta(h(\theta)), \quad \text{for all } \theta \in \Theta\\
\iff\ & \xi_\eta(h(\theta)) = \frac{\xi_\theta(\theta)}{|h'(\theta)|}, \quad \text{for all } \theta \in \Theta.
\end{aligned}$$
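For completeness, the "little algebra" behind the last equivalence is just the substitution $\eta = h(\theta)$ in the change-of-variable formula above (a short step spelled out here, not displayed in the original):
$$\tilde{\xi}_\eta(h(\theta)) = \frac{\xi_\theta\big(h^{-1}(h(\theta))\big)}{\big|h'\big(h^{-1}(h(\theta))\big)\big|} = \frac{\xi_\theta(\theta)}{|h'(\theta)|}, \qquad \theta \in \Theta,$$
so demanding $\xi_\eta(h(\theta)) = \tilde{\xi}_\eta(h(\theta))$ for all $\theta$ gives the last line of the display.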

The Jeffreys priors


In addition to making the demand of invariance, Jeffreys also described how to construct such
a prior. The construction is based on the Fisher information function of a model. Consider
a model X ∼ f (x|θ), where θ ∈ Θ is scalar and θ 7→ log f (x|θ) is twice differentiable in θ for
every x. The Fisher information of the model at any θ is defined to be:
$$I^F(\theta) = E_{[X|\theta]}\left\{\frac{\partial}{\partial\theta}\log f(X|\theta)\right\}^2 = E_{[X|\theta]}\{\dot{\ell}_X(\theta)\}^2.$$
Under some regularity conditions (look back at our approximate normality of MLE result),
this equals
$$I^F(\theta) = -E_{[X|\theta]}\frac{\partial^2}{\partial\theta^2}\log f(X|\theta) = -E_{[X|\theta]}\,\ddot{\ell}_X(\theta).$$
If data $X$ has $n$ components $(X_1, \cdots, X_n)$ and the model is $X_i \overset{\text{IID}}{\sim} g(x_i|\theta)$, then $f(x|\theta) = \prod_{i=1}^n g(x_i|\theta)$ and so
$$I^F(\theta) = -\sum_{i=1}^n E_{[X_i|\theta]}\frac{\partial^2}{\partial\theta^2}\log g(X_i|\theta) = nI_1^F(\theta),$$

where I1F (θ) is the single observation Fisher information of Xi ∼ g(xi |θ) at θ.
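As a concrete check (a sketch that is not part of the notes; it assumes Python with numpy), take a single observation from the Bernoulli(p) model behind Laplace's birth-rate analysis, so that log f(x|p) = x log p + (1 − x) log(1 − p). Both expressions for the Fisher information equal 1/(p(1 − p)), and n IID observations carry n times that information:

import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 0.3, 20, 400_000

# Single Bernoulli(p) observation: log f(x|p) = x log p + (1 - x) log(1 - p).
x = rng.binomial(1, p, size=reps)
score = x / p - (1 - x) / (1 - p)              # d/dp log f(X|p)
second = -x / p**2 - (1 - x) / (1 - p) ** 2    # d^2/dp^2 log f(X|p)
print("E[score^2]           :", np.mean(score**2))
print("-E[second derivative]:", -np.mean(second))
print("closed form 1/(p(1-p)):", 1 / (p * (1 - p)))

# For n IID observations the joint score is the sum of the individual scores,
# so its second moment is n times the single-observation Fisher information.
X = rng.binomial(1, p, size=(reps, n))
joint_score = (X / p - (1 - X) / (1 - p)).sum(axis=1)
print("E[joint score^2]:", np.mean(joint_score**2), "  n/(p(1-p)) =", n / (p * (1 - p)))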
The Jeffreys proposal of a non-informative prior pdf for the model X ∼ f (x|θ) is
$$\xi^J(\theta) = \text{const.} \times \sqrt{I^F(\theta)}.$$
If $\int_\Theta \sqrt{I^F(\theta)}\,d\theta$ is a finite number, then the constant is taken to be one over this number, so that $\xi^J(\theta)$ defines a pdf over $\Theta$. If this integral is infinite, the constant is left unspecified, and the corresponding function $\xi^J(\theta)$ is called an "improper" prior pdf of $\theta \in \Theta$. An improper prior pdf is accepted so long as it produces a proper posterior pdf for every possible observation $X = x$. That is,
$$\xi^J(\theta|x) = \frac{f(x|\theta)\,\xi^J(\theta)}{\int_\Theta f(x|\theta')\,\xi^J(\theta')\,d\theta'}$$

must be a pdf on $\Theta$, which means the integral $\int f(x|\theta)\xi^J(\theta)\,d\theta$ must be finite. Even though an improper pdf is not really a pdf, it still expresses relative plausibility scores through the well-defined ratios $\xi^J(\theta_1)/\xi^J(\theta_2)$ (the arbitrary constant cancels from numerator and denominator, so its exact value does not matter).
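For illustration (again a sketch not in the notes, assuming Python with numpy and scipy): in the single-observation Bernoulli(p) model the Fisher information is 1/(p(1 − p)), so the Jeffreys prior is proportional to p^{−1/2}(1 − p)^{−1/2}. Here the normalizing integral is finite (it equals π), so ξ^J is proper and is in fact the Beta(1/2, 1/2) pdf:

import numpy as np
from scipy import stats, integrate

def fisher_info(p):
    """Fisher information of one Bernoulli(p) observation: E[(d/dp log f(X|p))^2]."""
    return p * (1 / p) ** 2 + (1 - p) * (1 / (1 - p)) ** 2   # = 1 / (p (1 - p))

def unnormalized_jeffreys(p):
    """Jeffreys recipe: square root of the Fisher information."""
    return np.sqrt(fisher_info(p))

# The normalizing constant is the integral of sqrt(I(p)) over (0, 1); it equals pi,
# so the normalized Jeffreys prior coincides with the Beta(1/2, 1/2) density.
const, _ = integrate.quad(unnormalized_jeffreys, 0, 1)
print("normalizing integral:", const, "  pi:", np.pi)

for p in [0.1, 0.3, 0.5, 0.9]:
    print(p, unnormalized_jeffreys(p) / const, stats.beta(0.5, 0.5).pdf(p))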
Below we show that the principle behind the construction of the Jeffreys prior is invariant to smooth, monotone transformations of the parameter. Here we briefly comment on why it is "non-informative". It turns out that the Jeffreys prior is indeed the uniform prior over the
parameter space Θ, but not under the Euclidean geometry (pdfs depend on the geometry,
as they give limits of probability of a set over the volume of the set, and volume calculation
depends on geometry). The geometry that one needs to consider stems from defining a
distance between θ1 , θ2 ∈ Θ in terms of the distance between the two pdfs f (x|θ1 ) and f (x|θ2 ).
An advantage of this definition of distance is that it remains invariant to reparametrization
under monotone transformation.

The Jeffreys prior is invariant under monotone transformation


Consider a model X ∼ f (x|θ), θ ∈ Θ and its reparametrized version X ∼ g(x|η), η ∈ E,
where η = h(θ) with h a differentiable, monotone transformation (θ is assumed scalar). To
distinguish between the two models, we let IθF (θ) and IηF (η) denote the two Fisher information
functions. Then,
$$\begin{aligned}
I_\theta^F(\theta) &= \int \left\{\frac{\partial}{\partial\theta}\log f(x|\theta)\right\}^2 f(x|\theta)\,dx\\
&= \int \left\{\frac{\partial}{\partial\theta}\log g(x|h(\theta))\right\}^2 g(x|h(\theta))\,dx\\
&= \int \left\{h'(\theta)\,\frac{\partial}{\partial\eta}\log g(x|\eta)\Big|_{\eta=h(\theta)}\right\}^2 g(x|h(\theta))\,dx \qquad \text{[chain rule of differentiation]}\\
&= \{h'(\theta)\}^2\, I_\eta^F(h(\theta)).
\end{aligned}$$


And therefore,
$$\xi_\eta^J(h(\theta)) = \text{const.} \times \sqrt{I_\eta^F(h(\theta))} = \text{const.} \times \frac{\sqrt{I_\theta^F(\theta)}}{|h'(\theta)|} = \frac{\xi_\theta^J(\theta)}{|h'(\theta)|},$$
as demanded by Jeffreys.
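A quick numerical check of this invariance (a sketch not in the notes; assumes Python with numpy), using the Bernoulli model and the log-odds reparametrization η = log(p/(1 − p)): applying the Jeffreys recipe directly in η gives the same unnormalized prior as transforming the Jeffreys prior in p by the change-of-variable formula.

import numpy as np

def fisher_info_p(p):
    """Single-observation Bernoulli Fisher information in the p parametrization."""
    return 1 / (p * (1 - p))

def fisher_info_eta(eta):
    """Single-observation Fisher information in the log-odds parametrization;
    it works out to p(1 - p) = exp(eta) / (1 + exp(eta))^2 at p = 1/(1 + exp(-eta))."""
    p = 1 / (1 + np.exp(-eta))
    return p * (1 - p)

for eta in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    p = 1 / (1 + np.exp(-eta))
    dp_deta = p * (1 - p)                              # = 1 / |h'(p)| for h(p) = log(p/(1-p))
    direct = np.sqrt(fisher_info_eta(eta))             # Jeffreys recipe applied directly to eta
    transformed = np.sqrt(fisher_info_p(p)) * dp_deta  # Jeffreys prior on p, change of variable
    print(f"eta = {eta:+.1f}   direct = {direct:.6f}   via change of variable = {transformed:.6f}")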
Example (Normal model). Consider data $X = (X_1, \cdots, X_n)$, modeled as $X_i \overset{\text{IID}}{\sim} \text{Normal}(\mu, \sigma^2)$ with $\sigma^2$ assumed known, and $\mu \in (-\infty, \infty)$. The Fisher information in $\mu$ of a single observation is given by
$$I_1^F(\mu) = -E_{[X_1|\mu]}\frac{\partial^2}{\partial\mu^2}\left\{-\frac{(X_1-\mu)^2}{2\sigma^2}\right\} = \frac{1}{\sigma^2},$$
and hence the Fisher information at $\mu$ of the model for $X$ is $I^F(\mu) = nI_1^F(\mu) = n/\sigma^2$. Therefore the Jeffreys prior for $\mu$ is
$$\xi^J(\mu) = \text{const.} \times \sqrt{n/\sigma^2} = \text{const}, \qquad -\infty < \mu < \infty.$$

This is a "flat" prior over the parameter space $(-\infty, \infty)$. Unfortunately, this does not lead to a pdf for any value of the constant, as $\int_{-\infty}^{\infty} d\mu = \infty$. So this is an improper prior. The posterior associated with the Jeffreys prior is
$$\xi^J(\mu|x) = \frac{\exp\left\{-\frac{(\bar{x}-\mu)^2}{2\sigma^2/n}\right\}}{\int_{-\infty}^{\infty}\exp\left\{-\frac{(\bar{x}-\mu')^2}{2\sigma^2/n}\right\}d\mu'} = \text{Normal}(\bar{x}, \sigma^2/n),$$

which is a proper pdf. Thus the Jeffreys prior is an “acceptable one” in this case.
It is an interesting fact that summaries of $\xi^J(\mu|x)$ numerically match summaries from classical inference. For example, the posterior mean and median are $\bar{x}$, which happens to be $\hat\mu_{\text{MLE}}(x)$. Also, a $100(1-\alpha)\%$ central posterior credible interval is $\bar{x} \mp \sigma z(\alpha)/\sqrt{n}$, which matches the $100(1-\alpha)\%$ confidence interval for $\mu$.
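A small numerical sketch of the calculation above (not from the notes; assumes Python with numpy and scipy): normalizing the likelihood times the flat prior over a fine grid of µ values reproduces the Normal(x̄, σ²/n) posterior.

import numpy as np
from scipy import stats, integrate

rng = np.random.default_rng(2)
sigma, n = 2.0, 25
x = rng.normal(loc=1.3, scale=sigma, size=n)     # simulated data; sigma is treated as known
xbar = x.mean()

# Posterior on a grid under the flat (improper) prior xi^J(mu) = const:
# proportional to the likelihood in mu, normalized numerically over the grid.
mu_grid = np.linspace(xbar - 6 * sigma / np.sqrt(n), xbar + 6 * sigma / np.sqrt(n), 2001)
log_lik = np.array([stats.norm(m, sigma).logpdf(x).sum() for m in mu_grid])
post = np.exp(log_lik - log_lik.max())
post /= integrate.trapezoid(post, mu_grid)

# Compare with the closed-form Normal(xbar, sigma^2 / n) density.
closed_form = stats.norm(xbar, sigma / np.sqrt(n)).pdf(mu_grid)
print("max abs difference on the grid:", np.abs(post - closed_form).max())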

Multiparameter model and the Jeffreys prior


When the model is indexed by multiple parameters, we need some extension of our definitions
of the Fisher information and the Jeffreys prior. For simplicity we only consider a two-
parameter model X ∼ f (x|θ1 , θ2 ). Then Fisher information is defined as
$$I^F(\theta_1,\theta_2) = \begin{pmatrix}
E_{[X|\theta_1,\theta_2]}\{-\frac{\partial^2}{\partial\theta_1^2}\log f(X|\theta_1,\theta_2)\} & E_{[X|\theta_1,\theta_2]}\{-\frac{\partial^2}{\partial\theta_1\partial\theta_2}\log f(X|\theta_1,\theta_2)\}\\[4pt]
E_{[X|\theta_1,\theta_2]}\{-\frac{\partial^2}{\partial\theta_2\partial\theta_1}\log f(X|\theta_1,\theta_2)\} & E_{[X|\theta_1,\theta_2]}\{-\frac{\partial^2}{\partial\theta_2^2}\log f(X|\theta_1,\theta_2)\}
\end{pmatrix}.$$

Next, Jeffreys' prior is defined as
$$\xi^J(\theta_1, \theta_2) = \text{const} \times \sqrt{\det[I^F(\theta_1, \theta_2)]},$$
where $\det[A]$ denotes the determinant of a matrix $A$. For a $2\times 2$ matrix $A = \begin{pmatrix} a & b \\ c & d\end{pmatrix}$ the determinant is $\det[A] = ad - bc$.
Example (Normal model with unknown $\mu$ and $\sigma^2$). For the normal model $X_1, \cdots, X_n \overset{\text{IID}}{\sim} \text{Normal}(\mu, \sigma^2)$, $\mu \in (-\infty, \infty)$, $\sigma^2 \in (0, \infty)$,
$$\log f(x|\mu, \sigma^2) = \text{const} - \frac{n}{2}\log\sigma^2 - \frac{(n-1)s_x^2 + n(\bar{x}-\mu)^2}{2\sigma^2}.$$
The second derivatives are
$$\begin{aligned}
\frac{\partial^2}{\partial\mu^2}\log f(x|\mu,\sigma^2) &= -\frac{n}{\sigma^2}\\
\frac{\partial^2}{\partial\mu\,\partial(\sigma^2)}\log f(x|\mu,\sigma^2) &= -\frac{n(\bar{x}-\mu)}{\sigma^4}\\
\frac{\partial^2}{\partial(\sigma^2)\,\partial\mu}\log f(x|\mu,\sigma^2) &= \text{same as above}\\
\frac{\partial^2}{\partial(\sigma^2)^2}\log f(x|\mu,\sigma^2) &= \frac{n}{2\sigma^4} - \frac{(n-1)s_x^2 + n(\bar{x}-\mu)^2}{\sigma^6}.
\end{aligned}$$

So
$$I^F(\mu, \sigma^2) = \begin{pmatrix}
E_{[X|\mu,\sigma^2]}\frac{n}{\sigma^2} & E_{[X|\mu,\sigma^2]}\frac{n(\bar{X}-\mu)}{\sigma^4}\\[4pt]
E_{[X|\mu,\sigma^2]}\frac{n(\bar{X}-\mu)}{\sigma^4} & E_{[X|\mu,\sigma^2]}\left\{-\frac{n}{2\sigma^4} + \frac{(n-1)s_X^2 + n(\bar{X}-\mu)^2}{\sigma^6}\right\}
\end{pmatrix}
= \begin{pmatrix} \frac{n}{\sigma^2} & 0\\[2pt] 0 & \frac{n}{2\sigma^4}\end{pmatrix}.$$

To derive the last equality in the above we have used
$$E_{[X|\mu,\sigma^2]}\bar{X} = \mu, \qquad E_{[X|\mu,\sigma^2]}(\bar{X}-\mu)^2 = \sigma^2/n, \qquad E_{[X|\mu,\sigma^2]}\sum_{i=1}^n (X_i - \bar{X})^2 = (n-1)\sigma^2,$$
which follow from the facts that (i) $\bar{X} \sim \text{Normal}(\mu, \sigma^2/n)$, which has mean $\mu$ and variance $\sigma^2/n$, and (ii) $\frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \bar{X})^2 \sim \chi^2_{n-1}$, which has mean $n-1$.

Finally, we get the Jeffreys prior
$$\xi^J(\mu, \sigma^2) = \text{const} \times \sqrt{\det[I^F(\mu, \sigma^2)]} = \text{const} \times \sqrt{n^2/(2\sigma^6)} = \text{const}\left(\frac{1}{\sigma^2}\right)^{3/2}.$$
The corresponding posterior pdf is
$$\begin{aligned}
\xi^J(\mu, \sigma^2|x) &= \text{const} \times (\sigma^2)^{-3/2} \times (\sigma^2)^{-n/2}\exp\left\{-\frac{(n-1)s_x^2 + n(\bar{x}-\mu)^2}{2\sigma^2}\right\}\\
&= \text{const} \times (\sigma^2)^{-(n+3)/2}\exp\left\{-\frac{(n-1)s_x^2 + n(\bar{x}-\mu)^2}{2\sigma^2}\right\},
\end{aligned}$$
which matches the $\mathrm{N}\chi^{-2}\!\left(\bar{x},\, n,\, n,\, \frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2\right)$.
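As a sanity check on this calculation (a sketch that is not part of the notes; assumes Python with numpy), one can simulate many datasets, plug them into the second-derivative formulas above, and confirm by Monte Carlo that the expectations give the diagonal matrix diag(n/σ², n/(2σ⁴)), whose determinant produces the (1/σ²)^{3/2} prior:

import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, n, reps = 0.5, 2.0, 10, 200_000
sigma = np.sqrt(sigma2)

# Simulate many datasets and average the negative second derivatives of the
# log-likelihood (the closed forms derived above) to approximate I^F(mu, sigma^2).
X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)
ss = ((X - xbar[:, None]) ** 2).sum(axis=1)                  # (n - 1) s_x^2 for each dataset

d2_mu_mu = -n / sigma2 * np.ones(reps)
d2_mu_sig = -n * (xbar - mu) / sigma2**2
d2_sig_sig = n / (2 * sigma2**2) - (ss + n * (xbar - mu) ** 2) / sigma2**3

I_hat = np.array([[-d2_mu_mu.mean(), -d2_mu_sig.mean()],
                  [-d2_mu_sig.mean(), -d2_sig_sig.mean()]])
I_exact = np.array([[n / sigma2, 0.0],
                    [0.0, n / (2 * sigma2**2)]])
print("Monte Carlo estimate of I^F:\n", I_hat)
print("Claimed I^F = diag(n/sigma^2, n/(2 sigma^4)):\n", I_exact)
print("sqrt(det I^F):", np.sqrt(np.linalg.det(I_exact)),
      "  n/(sqrt(2) sigma^3):", n / (np.sqrt(2) * sigma2**1.5))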

The reference prior for the two-parameter normal model


There are other formalizations of “low-informativeness” than the concept of uniform dis-
tribution over the parameter space. One such formalization leads to what is known as
the reference prior. These priors, too, satisfy various invariance principles like the Jeffreys
prior, and the two are often equal. We can't discuss reference priors in detail here, but note the reference prior for the two-parameter normal model $X_1, \cdots, X_n \overset{\text{IID}}{\sim} \text{Normal}(\mu, \sigma^2)$, $\mu \in (-\infty, \infty)$, $\sigma^2 \in (0, \infty)$. This is given by
$$\xi^R(\mu, \sigma^2) = \text{const.} \times \frac{1}{\sigma^2},$$
and is a very popular "default" choice of a non-informative prior for this model. The posterior pdf associated with this prior is $\xi^R(\mu, \sigma^2|x) = \mathrm{N}\chi^{-2}(\bar{x}, n, n-1, s_x^2)$.
An interesting property of this posterior pdf is that it produces summaries of $\mu$ that are the same as our ML summaries (the same holds for the posterior of the Jeffreys prior for the single-parameter normal model with known $\sigma^2$; the Jeffreys and the reference priors are the same in that case). In particular, a $100(1-\alpha)\%$ central credible interval of $\xi_1^R(\mu|x)$ is precisely the ML $100(1-\alpha)\%$-CI: $\bar{x} \mp s_x z_{n-1}(\alpha)/\sqrt{n}$.
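The following sketch (not from the notes; assumes Python with numpy and scipy, and reads z_{n−1}(α) as the upper-α/2 quantile of the t_{n−1} distribution) samples from the reference-prior posterior via the standard decomposition σ²|x ∼ (n−1)s²_x/χ²_{n−1} and µ|σ², x ∼ Normal(x̄, σ²/n), and confirms numerically that the 95% central credible interval for µ coincides with the classical t-based interval:

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(loc=10.0, scale=3.0, size=12)       # some data with both mu and sigma^2 unknown
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

# Sample (mu, sigma^2) from the posterior under xi^R(mu, sigma^2) = const / sigma^2:
# sigma^2 | x ~ (n-1) s^2 / chisq_{n-1},  mu | sigma^2, x ~ Normal(xbar, sigma^2 / n).
draws = 1_000_000
sigma2 = (n - 1) * s**2 / rng.chisquare(n - 1, size=draws)
mu = rng.normal(xbar, np.sqrt(sigma2 / n))

cred = np.quantile(mu, [0.025, 0.975])             # 95% central posterior credible interval
tcrit = stats.t.ppf(0.975, df=n - 1)
ci = (xbar - tcrit * s / np.sqrt(n), xbar + tcrit * s / np.sqrt(n))   # classical 95% CI
print("95% credible interval:", cred)
print("95% t-based CI:      ", ci)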
