Bayesian Week 4 Lecture Notes
Week 4
Priors considered “uninformative” for a parameter, θ, may not be considered uninformative for transformations of the parameter, ψ = g(θ). For example, consider a Uniform(0,20) prior for a parameter θ. The problem with such uniform priors is that the induced prior for simple transformations of the parameter, e.g., φ = √θ, will not be uniform. Given θ ∼ Uniform(0,20), the distribution for φ is found by the change of variable theorem¹:

\[
\pi(\phi) = \frac{1}{20}\left|\frac{d\phi^{2}}{d\phi}\right| I(0 < \phi < \sqrt{20})
= \frac{1}{20}\, 2\phi\, I(0 < \phi < \sqrt{20})
= \frac{\phi}{10}\, I(0 < \phi < \sqrt{20})
\]

where I(0 < φ < √20) is an indicator function that equals 1 when 0 < φ < √20 and 0 otherwise. This induced prior clearly is informative, as it linearly increases from 0 to √20.
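This can be checked by simulation. The following R sketch (an illustration added here; the simulation size is arbitrary) draws θ from a Uniform(0, 20), transforms to φ = √θ, and overlays the derived density φ/10 on a histogram of the draws.

# Monte Carlo check of the induced prior for phi = sqrt(theta), theta ~ Uniform(0, 20)
set.seed(1)
theta <- runif(1e6, 0, 20)
phi   <- sqrt(theta)
hist(phi, breaks = 60, freq = FALSE, xlab = "phi",
     main = "Induced prior for phi = sqrt(theta)")
# Overlay the derived density pi(phi) = phi/10 on (0, sqrt(20))
curve(x / 10, from = 0, to = sqrt(20), add = TRUE, lwd = 2, col = "red")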
Jeffrey’s prior is an example of an objective prior which can be seen as a remedy to the problem just discussed of induced priors resulting from transformations. It is a prior that is invariant to strictly monotonic (1-1 or bijective) transformations of the parameter, say φ = g(θ), where g is strictly monotonic.
Jeffrey’s prior is proportional to the square root of Fisher’s Information, I(θ|y):

\[
\pi_{JP}(\theta) \propto \sqrt{I(\theta \mid y)} \tag{4.1}
\]
¹ The change of variable theorem is a procedure for determining the pdf of a (continuous) random variable Y that is a strictly monotonic (1:1) transformation of another (continuous) random variable X, i.e., Y = g(X). The pdf for Y is

\[
p_Y(y) = p_X\!\left(g^{-1}(y)\right) \left|\frac{d g^{-1}(y)}{dy}\right|
\]

See Section 4.5 for more details.
where

\[
I(\theta \mid y) = E\!\left[\left(\frac{d \log f(y \mid \theta)}{d\theta}\right)^{2}\right] \tag{4.2}
\]
Note that under certain regularity conditions (e.g., that the differentiation operation can be moved inside the integral), Fisher Information can be calculated from the second derivative of the log likelihood:

\[
I(\theta \mid y) = -E\!\left[\frac{d^{2} \log f(y \mid \theta)}{d\theta^{2}}\right] \tag{4.3}
\]

which is often much easier to calculate than Eq’n 4.2. For more discussion of Fisher’s Information see Section 4.6.
Remark. Let f(y|θ) denote a probability density or mass function for a random variable y where θ is a scalar. Let φ = g(θ) where g is a strictly monotonic (1:1 or bijective) transformation. If we specify a Jeffrey’s prior for θ, namely, π_JP(θ) ∝ √I(θ|y), then the induced prior on φ, π(φ), is proportional to √I(φ|y). In other words, a 1:1 transformation of a parameter that has a Jeffrey’s prior yields a Jeffrey’s prior for the transformed parameter.
Proof. This proof uses the chain rule, dy/dx = (dy/dz)(dz/dx), and the change of variable theorem. Write the Fisher information for θ as follows:

\[
I(\theta \mid y) = E\!\left[\left(\frac{d \log f(y \mid \theta)}{d\theta}\right)^{2}\right]
= E\!\left[\left(\frac{d \log f(y \mid \phi)}{d\phi}\,\frac{d\phi}{d\theta}\right)^{2}\right]
= E\!\left[\left(\frac{d \log f(y \mid \phi)}{d\phi}\right)^{2}\right]\left(\frac{d\phi}{d\theta}\right)^{2}
= I(\phi \mid y)\left(\frac{d\phi}{d\theta}\right)^{2}
\]

Then, by the change of variable theorem,

\[
\pi(\phi) = \pi_{JP}(\theta)\left|\frac{d\theta}{d\phi}\right|
\propto \sqrt{I(\phi \mid y)}\,\left|\frac{d\phi}{d\theta}\right|\left|\frac{d\theta}{d\phi}\right|
= \sqrt{I(\phi \mid y)}
\]
Suppose that the prevalence of Potato Virus Y in a population of aphids is an unknown parameter θ. A
random sample of n aphids is taken (using a trap) and the number of aphids with the virus is x. Assuming
independence between the aphids and that they all have the same probability of having the virus, x ∼
Binomial(n, θ). The Fisher information is

\[
I(\theta) = -E\!\left[\frac{d^{2}\left(\log\binom{n}{x} + x\log(\theta) + (n-x)\log(1-\theta)\right)}{d\theta^{2}}\right]
= -E\!\left[-\frac{x}{\theta^{2}} - \frac{n-x}{(1-\theta)^{2}}\right]
\]
\[
= \frac{E(x)}{\theta^{2}} + \frac{n - E(x)}{(1-\theta)^{2}}
= \frac{n\theta}{\theta^{2}} + \frac{n - n\theta}{(1-\theta)^{2}}
= \frac{n}{\theta(1-\theta)}
\]

Thus the Jeffrey’s prior is

\[
\pi_{JP}(\theta) \propto \sqrt{\frac{1}{\theta(1-\theta)}} = \theta^{-1/2}(1-\theta)^{-1/2}
\]

which is the kernel of a Beta(1/2, 1/2) distribution.
Aside. Note that the mle for θ is θ̂ = x/n and the variance of θ̂ is θ(1 − θ)/n, which equals I(θ)^{-1}.
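As a quick numerical check (an illustration added here; the values of n and θ are arbitrary), the expected negative second derivative of the Binomial log likelihood, computed by summing over all possible values of x, matches n/(θ(1 − θ)), and the Beta(1/2, 1/2) kernel can be plotted with dbeta:

# Numerical check of the Binomial Fisher information I(theta) = n / (theta * (1 - theta))
n     <- 25
theta <- 0.3
x     <- 0:n
second.deriv <- -x / theta^2 - (n - x) / (1 - theta)^2    # d^2 log f / d theta^2
I.numeric    <- -sum(second.deriv * dbinom(x, n, theta))  # -E[second derivative]
I.closed     <- n / (theta * (1 - theta))
c(I.numeric, I.closed)                                    # the two values agree

# Jeffrey's prior for a Binomial proportion: the Beta(1/2, 1/2) kernel
curve(dbeta(x, 0.5, 0.5), from = 0.001, to = 0.999,
      xlab = "theta", ylab = "density")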
As an illustration of this invariance, suppose x ∼ Exponential(λ), i.e., f(x|λ) = λe^{−λx}, for which the Jeffrey’s prior is π(λ) ∝ 1/λ, and consider the transformation θ = g(λ) = √λ, so that g^{−1}(θ) = θ² = λ. Given π(λ) ∝ 1/λ, the induced prior for θ is

\[
\pi_{\theta}(\theta) = \pi_{\lambda}\!\left(g^{-1}(\theta)\right)\left|\frac{d g^{-1}(\theta)}{d\theta}\right|
= \frac{1}{\theta^{2}}\left|\frac{d\theta^{2}}{d\theta}\right|
= \frac{1}{\theta^{2}}\,2\theta
= \frac{2}{\theta}
\]

Checking that this is indeed the Jeffrey’s prior for θ (note that log f(x|θ) = 2 log(θ) − θ²x):

\[
I(\theta) = -E\!\left[\frac{d^{2}\left(2\log(\theta) - \theta^{2}x\right)}{d\theta^{2}}\right]
= \frac{2}{\theta^{2}} + 2E(x)
= \frac{2}{\theta^{2}} + \frac{2}{\theta^{2}}
= \frac{4}{\theta^{2}}
\]

Thus the Jeffrey’s prior is π_JP(θ) ∝ √I(θ) = 2/θ, matching the induced prior.
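A Monte Carlo check of this Fisher information (an illustration added here; the value θ = 1.7 is arbitrary): for x ∼ Exponential with rate θ², the mean squared score should be close to 4/θ².

# Monte Carlo check that I(theta) = 4 / theta^2 for x ~ Exponential(rate = theta^2)
set.seed(2)
theta <- 1.7
x     <- rexp(1e6, rate = theta^2)
score <- 2 / theta - 2 * theta * x   # d/dtheta of log f(x | theta) = 2*log(theta) - theta^2 * x
c(monte.carlo = mean(score^2), closed.form = 4 / theta^2)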
The sampling distribution for x1, . . . , xn is Normal(µ, σ²) where σ² is known. It can be shown that the Jeffrey’s prior for µ when σ² is known is flat, π_JP(µ) ∝ 1, and that the resulting posterior is

\[
\pi(\mu \mid x) \propto \exp\!\left(-\frac{(\mu - \bar{x})^{2}}{2\sigma^{2}/n}\right) \times 1
\]

namely the kernel for a Normal(x̄, σ²/n).
The sampling distribution for x1, . . . , xn is Normal(µ, σ²) where µ is known. The Jeffrey’s prior for σ² (given known µ) can be shown to be the following:

\[
\pi_{JP}(\sigma^{2}) \propto \frac{1}{\sigma^{2}}, \qquad 0 < \sigma^{2} < \infty \tag{4.4}
\]

which is an improper prior because the integral of 1/σ² over (0, ∞) is not finite. The posterior distribution for σ² is

\[
p(\sigma^{2} \mid x) \propto \left(\sigma^{2}\right)^{-\frac{n}{2}} \exp\!\left(-\frac{z^{2}}{2\sigma^{2}}\right) \frac{1}{\sigma^{2}}
= \left(\sigma^{2}\right)^{-\left(\frac{n}{2}+1\right)} \exp\!\left(-\frac{z^{2}}{2\sigma^{2}}\right) \tag{4.5}
\]

where z² = Σ_{i=1}^{n} (x_i − µ)². This is the kernel for a Γ^{−1}(n/2, z²/2), i.e., an inverse Gamma, so long as n/2 > 0 (obviously so) and z² > 0, which simply means that not every x_i equals µ.
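As an illustration (the data below are simulated and purely hypothetical), the posterior can be sampled by drawing Gamma(n/2, rate = z²/2) variates and inverting them:

# Posterior for sigma^2 under Jeffrey's prior 1/sigma^2, with mu known
set.seed(3)
mu.known <- 10
x        <- rnorm(30, mean = mu.known, sd = 2)    # hypothetical data
n        <- length(x)
z2       <- sum((x - mu.known)^2)
# sigma^2 | x ~ Inverse-Gamma(n/2, z2/2): draw Gamma(n/2, rate = z2/2) and invert
sigma2.post <- 1 / rgamma(1e5, shape = n / 2, rate = z2 / 2)
quantile(sigma2.post, c(0.025, 0.5, 0.975))        # posterior summaries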
To calculate Jeffrey’s prior for a multivariate parameter vector, θ, with p parameters, one calculates the score function, which is now the gradient of the log likelihood:

\[
S(\theta) = \begin{pmatrix}
\frac{d}{d\theta_1} \ln(f(x)) \\
\vdots \\
\frac{d}{d\theta_p} \ln(f(x))
\end{pmatrix} \tag{4.6}
\]

and then calculates the Hessian of the log likelihood, namely, the matrix of the partial derivatives of the score function:

\[
H(\theta) = \begin{pmatrix}
\frac{d}{d\theta_1} S(\theta)[1] & \frac{d}{d\theta_1} S(\theta)[2] & \cdots & \frac{d}{d\theta_1} S(\theta)[p] \\
\frac{d}{d\theta_2} S(\theta)[1] & \frac{d}{d\theta_2} S(\theta)[2] & \cdots & \frac{d}{d\theta_2} S(\theta)[p] \\
\vdots & \vdots & \ddots & \vdots \\
\frac{d}{d\theta_p} S(\theta)[1] & \frac{d}{d\theta_p} S(\theta)[2] & \cdots & \frac{d}{d\theta_p} S(\theta)[p]
\end{pmatrix}
= \begin{pmatrix}
\frac{d^{2} \ln(f(x))}{d\theta_1^{2}} & \frac{d^{2} \ln(f(x))}{d\theta_1 d\theta_2} & \cdots & \frac{d^{2} \ln(f(x))}{d\theta_1 d\theta_p} \\
\frac{d^{2} \ln(f(x))}{d\theta_2 d\theta_1} & \frac{d^{2} \ln(f(x))}{d\theta_2^{2}} & \cdots & \frac{d^{2} \ln(f(x))}{d\theta_2 d\theta_p} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{d^{2} \ln(f(x))}{d\theta_p d\theta_1} & \frac{d^{2} \ln(f(x))}{d\theta_p d\theta_2} & \cdots & \frac{d^{2} \ln(f(x))}{d\theta_p^{2}}
\end{pmatrix} \tag{4.7}
\]

The Fisher information is the negative of the expected value of the Hessian:

\[
I(\theta \mid x) = -E\!\left[H(\theta)\right] \tag{4.8}
\]

Finally, the Jeffrey’s prior for the vector of parameters is proportional to the square root of the determinant of the Fisher information matrix:

\[
\pi_{JP}(\theta) \propto \sqrt{\det\!\left(I(\theta \mid x)\right)} \tag{4.9}
\]
To make the differentiation a little less awkward, the normal distribution is parameterized with θ = σ². Given y1, . . . , yn iid Normal(µ, θ), the joint pdf can be written:

\[
f(y; \mu, \theta) = (2\pi\theta)^{-n/2}\, e^{-\frac{1}{2\theta}\sum_{i=1}^{n}(y_i - \mu)^{2}} \tag{4.10}
\]

Writing l for twice the log likelihood (the constant factor does not affect the prior, which is defined only up to proportionality), the second derivatives are:

\[
\frac{d^{2}l}{d\mu^{2}} = -2\theta^{-1} n \tag{4.14}
\]
\[
\frac{d^{2}l}{d\mu\, d\theta} = \frac{d^{2}l}{d\theta\, d\mu} = -2\theta^{-2}\, n(\bar{y} - \mu) \tag{4.15}
\]
\[
\frac{d^{2}l}{d\theta^{2}} = n\theta^{-2} - 2\theta^{-3} \sum_{i=1}^{n}(y_i - \mu)^{2} \tag{4.16}
\]
Taking expectations, using E[ȳ] = µ and E[Σ(y_i − µ)²] = nθ, gives the Fisher information matrix

\[
I(\mu, \sigma^{2}) = \begin{pmatrix} \frac{2n}{\sigma^{2}} & 0 \\ 0 & \frac{n}{\sigma^{4}} \end{pmatrix} \tag{4.17}
\]

Then the Jeffrey’s prior for (µ, σ²) is

\[
\pi_{JP}(\mu, \sigma^{2}) \propto \sqrt{\left|I(\mu, \sigma^{2})\right|}
= \sqrt{\frac{2n}{\sigma^{2}} \cdot \frac{n}{\sigma^{4}}}
= \sqrt{\frac{2n^{2}}{\sigma^{6}}}
\propto \frac{1}{(\sigma^{2})^{3/2}} \tag{4.18}
\]
The posterior can be shown to be the product of an inverse Chi-square (Gamma) for σ² and a normal for µ (conditional on σ²). For example, for a sample of n = 10 observations with mean 3.5283 and sample variance 0.0122209, a 95% credible interval for µ can be computed in R from the quantiles of a t distribution with 9 degrees of freedom:

qt(c(0.025, 0.975), df = 9) * sqrt(0.0122209 / 10) + 3.5283
[1] 3.449219 3.607381
As mentioned in LN 2, the Jeffrey’s prior is one of several objective priors, namely procedures for selecting
priors which will yield the same prior for anyone who uses the procedure. Reich and Ghosh (2019) discuss
four other objective priors that are at least worth knowing about and I recommend reading the approximately
two pages of discussion. Admittedly understanding the general concepts behind the procedures and being
able to implement the procedures can have quite different degrees of difficulty, with the latter generally more
difficult than the former. Here we will just examine one other type of objective prior, the Reference Prior,
denoted πRP (θ).
4.2.1 KL divergence
Before introducing Reference Priors, the notion of Kullback-Leibler (KL) divergence is introduced. KL
divergence is a measure of the difference between two pmfs or two pdfs. A KL divergence value of 0 means
that the two distributions are identical.
For the continuous case let f and g denote two pdfs. The KL divergence is defined “conditional” on one of the two distributions, here denoted KL(f, g) or KL(g, f), where KL(f, g) ≠ KL(g, f) except when f and g are identical (almost everywhere). More exactly, KL(f, g) is the expected value of log(f(x)/g(x)) assuming that f is “true” (more accurately stated, the expectation is with respect to f(x)), and KL(g, f) is the reverse:

\[
KL(f, g) = \int \log\!\left(\frac{f(x)}{g(x)}\right) f(x)\, dx
= \int \log(f(x))\, f(x)\, dx - \int \log(g(x))\, f(x)\, dx
= E_{f}[\log(f(X))] - E_{f}[\log(g(X))] \tag{4.19}
\]

and

\[
KL(g, f) = \int \log\!\left(\frac{g(x)}{f(x)}\right) g(x)\, dx
= E_{g}[\log(g(X))] - E_{g}[\log(f(X))] \tag{4.20}
\]

Notes: (1) if g(x) = f(x), then log(f(x)/g(x)) = log(1) = 0; and (2) KL(f, g) ≥ 0.
KL divergence examples. As a simple example consider a discrete valued random variable with values 0, 1, or 2. Let f(x) be the Binomial(2, p = 0.2) pmf with probabilities 0.64, 0.32, and 0.04 for X = 0, 1, and 2, respectively. Let g(x) be the discrete uniform where g(0) = g(1) = g(2) = 1/3. Then

\[
KL(f, g) = \sum_{x=0}^{2} f(x) \log\!\left(\frac{f(x)}{g(x)}\right)
= 0.64 \log(0.64/0.33) + 0.32 \log(0.32/0.33) + 0.04 \log(0.04/0.33) = 0.3196145
\]

\[
KL(g, f) = \sum_{x=0}^{2} g(x) \log\!\left(\frac{g(x)}{f(x)}\right)
= 0.33 \log(0.33/0.64) + 0.33 \log(0.33/0.32) + 0.33 \log(0.33/0.04) = 0.5029201
\]

For another example let g(x) be the Binomial(2, p = 0.25) pmf. Then KL(f, g) = 0.01400421 and KL(g, f) = 0.01476399; thus the KL divergence measures are both close to 0 (and close to each other).
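These values are easy to reproduce; the short R function below (an illustration added here) computes the KL divergence between two pmfs defined on the same support.

# KL divergence between two pmfs defined on the same support
kl <- function(f, g) sum(f * log(f / g))

f  <- dbinom(0:2, size = 2, prob = 0.20)      # 0.64, 0.32, 0.04
g1 <- rep(1/3, 3)                             # discrete uniform on {0, 1, 2}
g2 <- dbinom(0:2, size = 2, prob = 0.25)
c(KL.f.g1 = kl(f, g1), KL.g1.f = kl(g1, f))   # 0.3196145  0.5029201
c(KL.f.g2 = kl(f, g2), KL.g2.f = kl(g2, f))   # 0.01400421 0.01476399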
Given data y, the KL divergence between the prior and posterior (with respect to the posterior) is

\[
KL(p(\theta \mid y), \pi(\theta)) = \int p(\theta \mid y) \log\!\left(\frac{p(\theta \mid y)}{\pi(\theta)}\right) d\theta \tag{4.21}
\]

The key idea of an RP is that the KL divergence between the posterior and the prior should be as large as possible, thus implying that the data are dominating the prior. However, this measure is conditional on the data, y, which does not help for determining a prior. Thus, to remove the conditioning on the data, the data are integrated out, and πRP(θ) is the probability distribution (pmf or pdf) that maximizes the following:

\[
E_{y}\!\left[KL(p(\theta \mid y), \pi(\theta))\right] = \int KL(p(\theta \mid y), \pi(\theta))\, m(y)\, dy \tag{4.22}
\]

where m(y) is the prior predictive (marginal) distribution of y.
Often the scientist or subject-matter specialist will have a definite opinion as to what the range of parameter
values should be. For example, an experienced heart surgeon who has done 1000s of coronary by-pass
surgeries on a variety of patients will have a definite opinion on post-surgery survival probability.
How does one translate that prior knowledge into a prior probability distribution?
• First, think about whether elicitation is even feasible: are there too many parameters, or is the underlying sampling model fairly complex, e.g., a hierarchical model? For example, suppose the sampling model is a multiple regression with 4 covariates, thus five regression parameters, β0, β1, β2, β3, and β4, plus the variance parameter. How might the expert’s knowledge be translated into prior distributions for these parameters? The expert might have an opinion about the signs, positive or negative, of each coefficient, and perhaps the relative importance of each covariate; e.g., working with standardized covariates, the effects of x1 and x2 are thought to be positive but the effect of x1 may be twice as large as that of x2.
• Single parameter case. This is generally the most feasible situation. If the expert is not familiar with
probability distributions, the statistician may need to work with the expert to arrive at a prior and
can help by asking questions about the parameter without using statistics jargon.
For example, instead of asking for the median, ask “For what value of the parameter do you think that
it’s equally likely that values are either below or above it?”
“What do you think the range of values might be?”
“What do you think the relative variation around an average value might be? For example, if your best guess for θ is 15, is your uncertainty within ±10% of that value (±1.5), or ±20% (±3.0)?” Thus one potentially obtains a measure of the coefficient of variation, CV = σ/µ.
Given a mean value, and a range, standard deviation, or CV, and assuming a particular standard prob-
ability distribution might suffice, rough estimates of hyperparameters for the prior might be calculated.
This is the “moment matching” idea that we’ve examined previously.
For example, with the above example of the surgeon, the surgeon was thinking that θ would be 0.7 on average. Further questioning about the surgeon’s uncertainty led to a determination that a CV of 0.1 would be appropriate. Using a Beta distribution for the prior, the mean is α/(α + β) and the variance is αβ/[(α + β)²(α + β + 1)]; some algebra yields a Beta(29.3, 12.6) prior (see the R sketch following this list).
• Discrete histogram priors. Another simple way to elicit priors is to partition parameter values into non-overlapping bins and have the expert provide relative weights for each bin. For example, θ is grouped into three bins, [0,10), [10,25), [25,30], and the expert gives relative weights of 0.2, 0.5, and 0.3. A proper histogram (pdf) is constructed using the result that bin area = height × width, where bin area corresponds to probability or weight. For the bin [0,10), area = 0.2 and width = 10, thus height equals area/width = 0.2/10 = 0.02. For [10,25) the height is 0.5/15 = 0.033, and for [25,30] the height is 0.3/5 = 0.06.
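The Beta(29.3, 12.6) prior above comes from moment matching; the following R sketch (an illustration added here; the function name is arbitrary) solves for α and β from an elicited mean and CV.

# Moment matching for a Beta prior from an elicited mean and coefficient of variation
beta.from.mean.cv <- function(m, cv) {
  v <- (cv * m)^2                # variance implied by the CV
  s <- m * (1 - m) / v - 1       # alpha + beta
  c(alpha = m * s, beta = (1 - m) * s)
}
beta.from.mean.cv(0.7, 0.1)      # approximately alpha = 29.3, beta = 12.6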
References
• “The elicitation of prior distributions”, Chaloner, 1996, in Bayesian Biostatistics, eds., Berry and
Stangl.
• Uncertain Judgements: Eliciting Experts’ Probabilities, O’Hagan, et al. 2006. This book is available
online as a pdf via the University Library.
Using the term sensitivity analysis loosely here, we mean an examination of the effects of different priors on
the posterior. For example, the comparison of the posterior distribution for the probability of survival after
by-pass surgery for the surgeon’s prior and the medical student’s prior is a sensitivity analysis.
Another loosely put phrase: if the posterior distributions for different priors look much “the same”, e.g., have similar means and variances, then one might say that the results are robust to the priors.
With large enough samples, unless the prior is particularly concentrated over a narrow range of possible
values, sometimes called a pig-headed prior, the posterior will look much the same for a wide range of priors
as the data (the likelihood) are dominating the prior.
Problematic issues
• Practical issue: if there are 100s of parameters, then it is tedious, at the least, to carry out a sensitivity analysis for all the priors.
• Generalized linear models, where the data come from an exponential family distribution, say F, with parameter θ and covariate(s) x, and the “link” function g(θ, x) is modeled linearly:

\[
y \mid \theta, x \sim F(\theta, x), \qquad g(\theta, x) = \beta_0 + \beta_1 x
\]

Apparently uninformative priors in the link function may induce quite informative priors at a lower level. For example, consider a logistic regression for the number of patients surviving heart bypass surgery where the probabilities differ with age:

\[
y_i \mid \mathrm{Age}_i \sim \mathrm{Bernoulli}(\theta(\mathrm{Age}_i)),
\qquad \theta(\mathrm{Age}) = \frac{\exp(\beta_0 + \beta_1 \mathrm{Age})}{1 + \exp(\beta_0 + \beta_1 \mathrm{Age})},
\]
\[
\text{equivalently } \ln\!\left(\frac{\theta(\mathrm{Age})}{1 - \theta(\mathrm{Age})}\right) = \beta_0 + \beta_1 \mathrm{Age}
\]

A seemingly innocuous prior for both β0 and β1 is Normal(µ = 0, σ² = 5²). Suppose that the ages range from 40 to 70. Figure 4.1 shows the results of simulating from the priors for β0 and β1 on the induced priors for survival for four different ages. Note how the probabilities are massed near 0 and 1. Such simulation exercises can be quite valuable for detecting such effects; an R sketch of such a simulation follows this list.
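A minimal R sketch of this kind of prior simulation (an illustration added here; the exact code behind Figure 4.1 is not reproduced in these notes):

# Induced prior on survival probability theta(Age) given Normal(0, 5^2) priors
# for beta0 and beta1 in the logistic model logit(theta) = beta0 + beta1 * Age
set.seed(4)
n.sim <- 1e5
beta0 <- rnorm(n.sim, 0, 5)
beta1 <- rnorm(n.sim, 0, 5)
par(mfrow = c(2, 2))
for (age in c(40, 50, 60, 70)) {
  theta <- plogis(beta0 + beta1 * age)     # inverse logit
  hist(theta, breaks = 50, freq = FALSE,
       main = paste("Age =", age), xlab = "Survival")
}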
Figure 4.1: Induced prior for survival probability (θ) as a function of age given Normal(0,5) priors for logit
transformation, ln (θ/(1 − θ)) = β0 + β1 Age.
4.5 Supplement A: Change of Variable Theorem
The Problem
The Solution
• I_Y(condition) is an indicator function which takes on one of two values: 1 when the condition is met (is True) and 0 when it is not met (is False).
• This is an “intuitive” result: Y ∼ Uniform(0, 60).
Example 2. X ∼ Uniform(16, 64), thus the pdf of X is 1/(64 − 16) = (1/48) I_X(16 < X < 64). Define Y = g(X) = √X, noting that the support for Y is (4, 8). Then g^{-1}(Y) = Y², and

\[
f_Y(y) = \frac{1}{48}\left|\frac{d y^{2}}{dy}\right|
= \frac{1}{48}\,2y
= \frac{y}{24}\, I_Y(4 < y < 8)
\]
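A quick simulation check of Example 2 (an illustration added here):

# Check that Y = sqrt(X), with X ~ Uniform(16, 64), has density y/24 on (4, 8)
set.seed(5)
y <- sqrt(runif(1e6, 16, 64))
hist(y, breaks = 60, freq = FALSE, xlab = "y", main = "pdf of Y = sqrt(X)")
curve(x / 24, from = 4, to = 8, add = TRUE, lwd = 2, col = "red")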
Skeleton of Proof
The essence of the change of variables theorem is that, for x = g^{-1}(y), Pr(y ≤ Y ≤ y + dy) should equal Pr(g^{-1}(y) ≤ X ≤ g^{-1}(y) + dg^{-1}(y)) ≡ Pr(x ≤ X ≤ x + dx). This will happen if the following relationship between the areas under the two pdfs holds:

\[
f_Y(y)\,|dy| = f_X\!\left(g^{-1}(y)\right)\left|d g^{-1}(y)\right|
\]

Note that f_Y(y) dy is approximately Pr(y < Y < y + dy). Think of the area of a rectangle, Area = Height × Width, where Area is probability, Height is the pdf evaluated at y, and Width is dy. Dividing both sides by |dy| gives

\[
f_Y(y) = f_X\!\left(g^{-1}(y)\right)\left|\frac{d g^{-1}(y)}{dy}\right|
\]
Biological Allometry Example
Male fiddler crabs (Uca pugnax) possess an enlarged major claw for fighting or threatening other
males. In addition, males with larger claws attract more female mates.
The sex appeal (claw size) of a particular species of fiddler crab (Figure 4.2) is determined by the following allometric equation:

\[
M_c = 0.036\, M_b^{1.356}
\]
where Mc is the mass of the major claw and Mb is the body mass of the crab minus the mass of
the claw.
Suppose that Mb is on average 2000 mg with a CV of 0.20. Assuming a Gamma distribution for Mb , then
Mb ∼ Gamma(25, 0.0125).
Figure 4.2: Fiddler Crab. Image from Southeastern Regional Taxonomic Center (SERTC), South Carolina
Department of Natural Resources.
What is the pdf for Mc? To reduce notation momentarily, let X = Mb, Y = Mc, a = 0.036, b = 1.356, α = 25, and β = 0.0125. Then

\[
Y = g(X) = aX^{b} \qquad \text{and} \qquad X = g^{-1}(Y) = \left(\frac{Y}{a}\right)^{\frac{1}{b}}
\]

and

\[
\frac{d g^{-1}(y)}{dy} = \frac{1}{b}\, a^{-1/b}\, y^{\frac{1-b}{b}}
\]

The pdf for Y (Mc) is

\[
f_Y(y) = f_X\!\left(g^{-1}(y)\right)\left|\frac{d g^{-1}(y)}{dy}\right|
= \frac{\beta^{\alpha}}{\Gamma(\alpha)}\left[\left(\frac{y}{a}\right)^{\frac{1}{b}}\right]^{\alpha-1}
e^{-\beta\left(\frac{y}{a}\right)^{\frac{1}{b}}}
\times \frac{1}{b}\, a^{-1/b}\, y^{\frac{1-b}{b}}
\]
The accuracy of the derivation was examined by simulating body mass from a Gamma(25,0.0125) and then
transforming using the allometric equation (the R code is shown below). The empirical and theoretical pdfs
are plotted in Figure 4.3, and the two are quite similar.
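A minimal sketch of such a simulation (the original code is not reproduced in these notes; the constants follow the allometric equation and Gamma distribution given above):

# Simulate body mass, transform to claw mass, and compare the empirical
# density with the derived pdf for Y = a * X^b, X ~ Gamma(alpha, rate = beta)
set.seed(6)
a <- 0.036; b <- 1.356
alpha <- 25; beta <- 0.0125
Mb <- rgamma(1e5, shape = alpha, rate = beta)   # body mass (mg)
Mc <- a * Mb^b                                  # claw mass via the allometric equation
# Derived (theoretical) pdf for claw mass
fY <- function(y) dgamma((y / a)^(1 / b), shape = alpha, rate = beta) *
                  (1 / b) * a^(-1 / b) * y^((1 - b) / b)
hist(Mc, breaks = 80, freq = FALSE, xlab = "Claw mass", main = "Claw mass pdf")
curve(fY, from = min(Mc), to = max(Mc), add = TRUE, lwd = 2, col = "red")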
Figure 4.3: Theoretical and empirical pdf for crab claw mass.
4.6 Supplement B: Fisher Information
Note that under certain regularity conditions², Fisher Information can be calculated from the second derivative of the log likelihood:

\[
I(\theta \mid x) = -E\!\left[\frac{d^{2} \log f(x \mid \theta)}{d\theta^{2}}\right] \tag{4.26}
\]
Remarks.
1. Given n iid random variables x1, . . . , xn from the same distribution with parameter θ, the Fisher information for θ equals nI₁(θ|x), where I₁(θ|x) denotes the information for a single observation:

\[
I(\theta \mid x_1, \ldots, x_n)
= -E\!\left[\frac{d^{2} \log f(x_1, \ldots, x_n \mid \theta)}{d\theta^{2}}\right]
= -E\!\left[\frac{d^{2} \sum_{i=1}^{n} \log f(x_i \mid \theta)}{d\theta^{2}}\right]
= -\sum_{i=1}^{n} E\!\left[\frac{d^{2} \log f(x_i \mid \theta)}{d\theta^{2}}\right]
= n I_{1}(\theta \mid x) \tag{4.27}
\]
2. Inverse of I(θ) as lower bound on variance of θ̂. Under the previously mentioned regularity conditions,
the inverse of Fisher information is the lower bound on the variance of an unbiased estimator of a
parameter. In other words, given a probability distribution with parameter θ which satisfies certain
regularity conditions, if θ̂ is unbiased for θ, then
\[
V(\hat{\theta}) \ge I(\theta)^{-1}
\]
3. Maximum likelihood estimators. In the particular case of maximum likelihood estimates (mles), the inverse of I(θ|x) evaluated at the mle, θ̂, is often used as an estimate of the variance of θ̂:

\[
\widehat{Var}(\hat{\theta}) = I(\hat{\theta})^{-1}
\]
Example. If y ∼ Binomial(n, p), then the mle for p is p̂ = y/n. The variance of p̂ is

\[
Var[\hat{p}] = Var\!\left[\frac{y}{n}\right] = \frac{1}{n^{2}}\, Var[y] = \frac{1}{n^{2}}\, np(1-p) = \frac{p(1-p)}{n}
\]

It can be shown that the Fisher information for p is

\[
I(p) = \frac{n}{p(1-p)}
\]

Observe that I(p)^{-1} = p(1 − p)/n, which is the variance of p̂. (A small simulation check of this relationship appears after these remarks.)
4. Observed Fisher Information is the Fisher information without the integration, i.e., without taking the expectation of the (negative) second derivative of the log likelihood:

\[
J(\theta) = -\frac{d^{2} \log(f(x \mid \theta))}{d\theta^{2}} \tag{4.28}
\]

And instead of a single second derivative of log(f(x|θ)), for a vector of q parameters Θ there is a matrix of second derivatives, namely the Hessian:

\[
\nabla \nabla^{T} \log(f(x \mid \Theta)) =
\begin{pmatrix}
\frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{1}^{2}} & \frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{1} d\theta_{2}} & \cdots & \frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{1} d\theta_{q}} \\
\frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{2} d\theta_{1}} & \frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{2}^{2}} & \cdots & \frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{2} d\theta_{q}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{q} d\theta_{1}} & \frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{q} d\theta_{2}} & \cdots & \frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{q}^{2}}
\end{pmatrix} \tag{4.30}
\]
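As a quick simulation check of the relationship between Var(p̂) and I(p)^{-1} in Remark 3 (an illustration added here; the values of n and p are arbitrary):

# Simulation check that Var(p.hat) is close to 1 / I(p) = p * (1 - p) / n
set.seed(7)
n <- 50
p <- 0.3
p.hat <- rbinom(1e5, size = n, prob = p) / n
c(simulated = var(p.hat), theoretical = p * (1 - p) / n)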