Bayesian Week 4 Lecture Notes
Week 4
Priors considered “uninformative” for a parameter, θ, may not be considered uninformative for transformations of the parameter, ψ = g(θ). For example, consider a Uniform(0,20) prior for a parameter θ. The problem with such uniform priors is that the induced prior for simple transformations of the parameter, e.g., φ = √θ, will not be uniform. Given θ ∼ Uniform(0,20), the distribution for φ is found by the change of variable theorem¹:

\[
\pi(\phi) = \frac{1}{20}\left|\frac{d\phi^{2}}{d\phi}\right| I(0 < \phi < \sqrt{20})
= \frac{1}{20}\, 2\phi\, I(0 < \phi < \sqrt{20})
= \frac{\phi}{10}\, I(0 < \phi < \sqrt{20})
\]

where I(0 < φ < √20) is an indicator function that equals 1 when 0 < φ < √20 and 0 otherwise. This induced prior clearly is informative, as it linearly increases from 0 to √20.
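This can be checked by simulation. The following R sketch (an illustration added here; the simulation size is arbitrary) draws θ from a Uniform(0, 20), transforms to φ = √θ, and overlays the derived density φ/10 on a histogram of the draws.

# Monte Carlo check of the induced prior for phi = sqrt(theta), theta ~ Uniform(0, 20)
set.seed(1)
theta <- runif(1e6, 0, 20)
phi   <- sqrt(theta)
hist(phi, breaks = 60, freq = FALSE, xlab = "phi",
     main = "Induced prior for phi = sqrt(theta)")
# Overlay the derived density pi(phi) = phi/10 on (0, sqrt(20))
curve(x / 10, from = 0, to = sqrt(20), add = TRUE, lwd = 2, col = "red")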
Jeffrey’s prior is an example of an objective prior which can be seen as a remedy to the problem just discussed of induced priors resulting from transformations. It is a prior that is invariant to strictly monotonic (1-1 or bijective) transformations of the parameter, say φ = g(θ), where g is strictly monotonic.
Jeffrey’s prior is proportional to the square root of Fisher’s Information, I(θ|y):

\[
\pi_{JP}(\theta) \propto \sqrt{I(\theta \mid y)} \tag{4.1}
\]
¹ The change of variable theorem is a procedure for determining the pdf of a (continuous) random variable Y that is a strictly monotonic (1:1) transformation of another (continuous) random variable X, i.e., Y = g(X). The pdf for Y is

\[
p_Y(y) = p_X\!\left(g^{-1}(y)\right) \left|\frac{d g^{-1}(y)}{dy}\right|
\]

See Section 4.5 for more details.
where

\[
I(\theta \mid y) = E\!\left[\left(\frac{d \log f(y \mid \theta)}{d\theta}\right)^{2}\right] \tag{4.2}
\]
Note that under certain regularity conditions (e.g., that the differentiation operation can be moved inside the integral), Fisher Information can be calculated from the second derivative of the log likelihood:

\[
I(\theta \mid y) = -E\!\left[\frac{d^{2} \log f(y \mid \theta)}{d\theta^{2}}\right] \tag{4.3}
\]

which is often much easier to calculate than Eq’n 4.2. For more discussion of Fisher’s Information see Section 4.6.
Remark. Let f(y|θ) denote a probability density or mass function for a random variable y where θ is a scalar. Let φ = g(θ) where g is a strictly monotonic (1:1 or bijective) transformation. If we specify a Jeffrey’s prior for θ, namely, π_JP(θ) ∝ √I(θ|y), then the induced prior on φ, π(φ), is proportional to √I(φ|y). In other words, a 1:1 transformation of a parameter that has a Jeffrey’s prior yields a Jeffrey’s prior for the transformed parameter.
Proof. This proof uses the chain rule, dy/dx = (dy/dz)(dz/dx), and the change of variable theorem. Write the Fisher information for θ as follows:

\[
I(\theta \mid y) = E\!\left[\left(\frac{d \log f(y \mid \theta)}{d\theta}\right)^{2}\right]
= E\!\left[\left(\frac{d \log f(y \mid \phi)}{d\phi}\,\frac{d\phi}{d\theta}\right)^{2}\right]
= E\!\left[\left(\frac{d \log f(y \mid \phi)}{d\phi}\right)^{2}\right]\left(\frac{d\phi}{d\theta}\right)^{2}
= I(\phi \mid y)\left(\frac{d\phi}{d\theta}\right)^{2}
\]

Then, by the change of variable theorem,

\[
\pi(\phi) = \pi_{JP}(\theta)\left|\frac{d\theta}{d\phi}\right|
\propto \sqrt{I(\phi \mid y)}\,\left|\frac{d\phi}{d\theta}\right|\left|\frac{d\theta}{d\phi}\right|
= \sqrt{I(\phi \mid y)}
\]
Suppose that the prevalence of Potato Virus Y in a population of aphids is an unknown parameter θ. A
random sample of n aphids is taken (using a trap) and the number of aphids with the virus is x. Assuming
independence between the aphids and that they all have the same probability of having the virus, x ∼
Binomial(n, θ). The Fisher information is

\[
I(\theta) = -E\!\left[\frac{d^{2}\left(\log\binom{n}{x} + x\log(\theta) + (n-x)\log(1-\theta)\right)}{d\theta^{2}}\right]
= -E\!\left[-\frac{x}{\theta^{2}} - \frac{n-x}{(1-\theta)^{2}}\right]
\]
\[
= \frac{E(x)}{\theta^{2}} + \frac{n - E(x)}{(1-\theta)^{2}}
= \frac{n\theta}{\theta^{2}} + \frac{n - n\theta}{(1-\theta)^{2}}
= \frac{n}{\theta(1-\theta)}
\]

Thus the Jeffrey’s prior is

\[
\pi_{JP}(\theta) \propto \sqrt{\frac{1}{\theta(1-\theta)}} = \theta^{-1/2}(1-\theta)^{-1/2}
\]

which is the kernel of a Beta(1/2, 1/2) distribution.
Aside. Note that the mle for θ is θ̂ = x/n and the variance of θ̂ is θ(1 − θ)/n, which equals I(θ)^{-1}.
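As a quick numerical check (an illustration added here; the values of n and θ are arbitrary), the expected negative second derivative of the Binomial log likelihood, computed by summing over all possible values of x, matches n/(θ(1 − θ)), and the Beta(1/2, 1/2) kernel can be plotted with dbeta:

# Numerical check of the Binomial Fisher information I(theta) = n / (theta * (1 - theta))
n     <- 25
theta <- 0.3
x     <- 0:n
second.deriv <- -x / theta^2 - (n - x) / (1 - theta)^2    # d^2 log f / d theta^2
I.numeric    <- -sum(second.deriv * dbinom(x, n, theta))  # -E[second derivative]
I.closed     <- n / (theta * (1 - theta))
c(I.numeric, I.closed)                                    # the two values agree

# Jeffrey's prior for a Binomial proportion: the Beta(1/2, 1/2) kernel
curve(dbeta(x, 0.5, 0.5), from = 0.001, to = 0.999,
      xlab = "theta", ylab = "density")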
As an illustration of this invariance, suppose x ∼ Exponential(λ), i.e., f(x|λ) = λe^{−λx}, for which the Jeffrey’s prior is π(λ) ∝ 1/λ, and consider the transformation θ = g(λ) = √λ, so that g^{−1}(θ) = θ² = λ. Given π(λ) ∝ 1/λ, the induced prior for θ is

\[
\pi_{\theta}(\theta) = \pi_{\lambda}\!\left(g^{-1}(\theta)\right)\left|\frac{d g^{-1}(\theta)}{d\theta}\right|
= \frac{1}{\theta^{2}}\left|\frac{d\theta^{2}}{d\theta}\right|
= \frac{1}{\theta^{2}}\,2\theta
= \frac{2}{\theta}
\]

Checking that this is indeed the Jeffrey’s prior for θ (note that log f(x|θ) = 2 log(θ) − θ²x):

\[
I(\theta) = -E\!\left[\frac{d^{2}\left(2\log(\theta) - \theta^{2}x\right)}{d\theta^{2}}\right]
= \frac{2}{\theta^{2}} + 2E(x)
= \frac{2}{\theta^{2}} + \frac{2}{\theta^{2}}
= \frac{4}{\theta^{2}}
\]

Thus the Jeffrey’s prior is π_JP(θ) ∝ √I(θ) = 2/θ, matching the induced prior.
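A Monte Carlo check of this Fisher information (an illustration added here; the value θ = 1.7 is arbitrary): for x ∼ Exponential with rate θ², the mean squared score should be close to 4/θ².

# Monte Carlo check that I(theta) = 4 / theta^2 for x ~ Exponential(rate = theta^2)
set.seed(2)
theta <- 1.7
x     <- rexp(1e6, rate = theta^2)
score <- 2 / theta - 2 * theta * x   # d/dtheta of log f(x | theta) = 2*log(theta) - theta^2 * x
c(monte.carlo = mean(score^2), closed.form = 4 / theta^2)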
The sampling distribution for x1, . . . , xn is Normal(µ, σ²) where σ² is known. It can be shown that the Jeffrey’s prior for µ when σ² is known is flat, π_JP(µ) ∝ 1, and that the resulting posterior is

\[
\pi(\mu \mid x) \propto \exp\!\left(-\frac{(\mu - \bar{x})^{2}}{2\sigma^{2}/n}\right) \times 1
\]

namely the kernel for a Normal(x̄, σ²/n).
The sampling distribution for x1, . . . , xn is Normal(µ, σ²) where µ is known. The Jeffrey’s prior for σ² (given known µ) can be shown to be the following:

\[
\pi_{JP}(\sigma^{2}) \propto \frac{1}{\sigma^{2}}, \qquad 0 < \sigma^{2} < \infty \tag{4.4}
\]

which is an improper prior because the integral of 1/σ² over (0, ∞) is not finite. The posterior distribution for σ² is

\[
p(\sigma^{2} \mid x) \propto \left(\sigma^{2}\right)^{-\frac{n}{2}} \exp\!\left(-\frac{z^{2}}{2\sigma^{2}}\right) \frac{1}{\sigma^{2}}
= \left(\sigma^{2}\right)^{-\left(\frac{n}{2}+1\right)} \exp\!\left(-\frac{z^{2}}{2\sigma^{2}}\right) \tag{4.5}
\]

where z² = Σ_{i=1}^{n} (x_i − µ)². This is the kernel for a Γ^{−1}(n/2, z²/2), i.e., an inverse Gamma, so long as n/2 > 0 (obviously so) and z² > 0, which simply means that not every x_i equals µ.
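As an illustration (the data below are simulated and purely hypothetical), the posterior can be sampled by drawing Gamma(n/2, rate = z²/2) variates and inverting them:

# Posterior for sigma^2 under Jeffrey's prior 1/sigma^2, with mu known
set.seed(3)
mu.known <- 10
x        <- rnorm(30, mean = mu.known, sd = 2)    # hypothetical data
n        <- length(x)
z2       <- sum((x - mu.known)^2)
# sigma^2 | x ~ Inverse-Gamma(n/2, z2/2): draw Gamma(n/2, rate = z2/2) and invert
sigma2.post <- 1 / rgamma(1e5, shape = n / 2, rate = z2 / 2)
quantile(sigma2.post, c(0.025, 0.5, 0.975))        # posterior summaries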
To calculate Jeffrey’s prior for a multivariate parameter vector, θ, with p parameters, one calculates the score function, which is now the gradient of the log likelihood:

\[
S(\theta) = \begin{pmatrix}
\frac{d}{d\theta_1} \ln(f(x)) \\
\vdots \\
\frac{d}{d\theta_p} \ln(f(x))
\end{pmatrix} \tag{4.6}
\]

and then calculates the Hessian of the log likelihood, namely, the matrix of the partial derivatives of the score function:

\[
H(\theta) = \begin{pmatrix}
\frac{d}{d\theta_1} S(\theta)[1] & \frac{d}{d\theta_1} S(\theta)[2] & \cdots & \frac{d}{d\theta_1} S(\theta)[p] \\
\frac{d}{d\theta_2} S(\theta)[1] & \frac{d}{d\theta_2} S(\theta)[2] & \cdots & \frac{d}{d\theta_2} S(\theta)[p] \\
\vdots & \vdots & \ddots & \vdots \\
\frac{d}{d\theta_p} S(\theta)[1] & \frac{d}{d\theta_p} S(\theta)[2] & \cdots & \frac{d}{d\theta_p} S(\theta)[p]
\end{pmatrix}
= \begin{pmatrix}
\frac{d^{2} \ln(f(x))}{d\theta_1^{2}} & \frac{d^{2} \ln(f(x))}{d\theta_1 d\theta_2} & \cdots & \frac{d^{2} \ln(f(x))}{d\theta_1 d\theta_p} \\
\frac{d^{2} \ln(f(x))}{d\theta_2 d\theta_1} & \frac{d^{2} \ln(f(x))}{d\theta_2^{2}} & \cdots & \frac{d^{2} \ln(f(x))}{d\theta_2 d\theta_p} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{d^{2} \ln(f(x))}{d\theta_p d\theta_1} & \frac{d^{2} \ln(f(x))}{d\theta_p d\theta_2} & \cdots & \frac{d^{2} \ln(f(x))}{d\theta_p^{2}}
\end{pmatrix} \tag{4.7}
\]

The Fisher information is the negative of the expected value of the Hessian:

\[
I(\theta \mid x) = -E\!\left[H(\theta)\right] \tag{4.8}
\]

Finally, the Jeffrey’s prior for the vector of parameters is proportional to the square root of the determinant of the Fisher information matrix:

\[
\pi_{JP}(\theta) \propto \sqrt{\det\!\left(I(\theta \mid x)\right)} \tag{4.9}
\]
To make the differentiation a little less awkward, the normal distribution is parameterized with θ = σ². Given y1, . . . , yn iid Normal(µ, θ), the joint pdf can be written:

\[
f(y; \mu, \theta) = (2\pi\theta)^{-n/2}\, e^{-\frac{1}{2\theta}\sum_{i=1}^{n}(y_i - \mu)^{2}} \tag{4.10}
\]

Writing l for twice the log likelihood (the constant factor does not affect the prior, which is defined only up to proportionality), the second derivatives are:

\[
\frac{d^{2}l}{d\mu^{2}} = -2\theta^{-1} n \tag{4.14}
\]
\[
\frac{d^{2}l}{d\mu\, d\theta} = \frac{d^{2}l}{d\theta\, d\mu} = -2\theta^{-2}\, n(\bar{y} - \mu) \tag{4.15}
\]
\[
\frac{d^{2}l}{d\theta^{2}} = n\theta^{-2} - 2\theta^{-3} \sum_{i=1}^{n}(y_i - \mu)^{2} \tag{4.16}
\]
Taking expectations, using E[ȳ] = µ and E[Σ(y_i − µ)²] = nθ, gives the Fisher information matrix

\[
I(\mu, \sigma^{2}) = \begin{pmatrix} \frac{2n}{\sigma^{2}} & 0 \\ 0 & \frac{n}{\sigma^{4}} \end{pmatrix} \tag{4.17}
\]

Then the Jeffrey’s prior for (µ, σ²) is

\[
\pi_{JP}(\mu, \sigma^{2}) \propto \sqrt{\left|I(\mu, \sigma^{2})\right|}
= \sqrt{\frac{2n}{\sigma^{2}} \cdot \frac{n}{\sigma^{4}}}
= \sqrt{\frac{2n^{2}}{\sigma^{6}}}
\propto \frac{1}{(\sigma^{2})^{3/2}} \tag{4.18}
\]
The posterior can be shown to be the product of an inverse Chi-square (Gamma) for σ² and a normal for µ (conditional on σ²). For example, for a sample of n = 10 observations with mean 3.5283 and sample variance 0.0122209, a 95% credible interval for µ can be computed in R from the quantiles of a t distribution with 9 degrees of freedom:

qt(c(0.025, 0.975), df = 9) * sqrt(0.0122209 / 10) + 3.5283
[1] 3.449219 3.607381
As mentioned in LN 2, the Jeffrey’s prior is one of several objective priors, namely procedures for selecting
priors which will yield the same prior for anyone who uses the procedure. Reich and Ghosh (2019) discuss
four other objective priors that are at least worth knowing about and I recommend reading the approximately
two pages of discussion. Admittedly understanding the general concepts behind the procedures and being
able to implement the procedures can have quite different degrees of difficulty, with the latter generally more
difficult than the former. Here we will just examine one other type of objective prior, the Reference Prior,
denoted πRP (θ).
4.2.1 KL divergence
Before introducing Reference Priors, the notion of Kullback-Leibler (KL) divergence is introduced. KL
divergence is a measure of the difference between two pmfs or two pdfs. A KL divergence value of 0 means
that the two distributions are identical.
For the continuous case let f and g denote two pdfs. The KL divergence is defined “conditional” on one of the two distributions, here denoted KL(f, g) or KL(g, f), where KL(f, g) ≠ KL(g, f) except when f and g are identical (almost everywhere). More exactly, KL(f, g) is the expected value of log(f(x)/g(x)) assuming that f is “true” (more accurately stated, the expectation is with respect to f(x)), and KL(g, f) is the reverse:

\[
KL(f, g) = \int \log\!\left(\frac{f(x)}{g(x)}\right) f(x)\, dx
= \int \log(f(x))\, f(x)\, dx - \int \log(g(x))\, f(x)\, dx
= E_{f}[\log(f(X))] - E_{f}[\log(g(X))] \tag{4.19}
\]

and

\[
KL(g, f) = \int \log\!\left(\frac{g(x)}{f(x)}\right) g(x)\, dx
= E_{g}[\log(g(X))] - E_{g}[\log(f(X))] \tag{4.20}
\]

Notes: (1) if g(x) = f(x), then log(f(x)/g(x)) = log(1) = 0; and (2) KL(f, g) ≥ 0.
KL divergence examples. As a simple example consider a discrete valued random variable with values 0, 1, or 2. Let f(x) be the Binomial(2, p = 0.2) pmf with probabilities 0.64, 0.32, and 0.04 for X = 0, 1, and 2, respectively. Let g(x) be the discrete uniform where g(0) = g(1) = g(2) = 1/3. Then

\[
KL(f, g) = \sum_{x=0}^{2} f(x) \log\!\left(\frac{f(x)}{g(x)}\right)
= 0.64 \log(0.64/0.33) + 0.32 \log(0.32/0.33) + 0.04 \log(0.04/0.33) = 0.3196145
\]

\[
KL(g, f) = \sum_{x=0}^{2} g(x) \log\!\left(\frac{g(x)}{f(x)}\right)
= 0.33 \log(0.33/0.64) + 0.33 \log(0.33/0.32) + 0.33 \log(0.33/0.04) = 0.5029201
\]

For another example let g(x) be the Binomial(2, p = 0.25) pmf. Then KL(f, g) = 0.01400421 and KL(g, f) = 0.01476399; thus the KL divergence measures are both close to 0 (and close to each other).
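These values are easy to reproduce; the short R function below (an illustration added here) computes the KL divergence between two pmfs defined on the same support.

# KL divergence between two pmfs defined on the same support
kl <- function(f, g) sum(f * log(f / g))

f  <- dbinom(0:2, size = 2, prob = 0.20)      # 0.64, 0.32, 0.04
g1 <- rep(1/3, 3)                             # discrete uniform on {0, 1, 2}
g2 <- dbinom(0:2, size = 2, prob = 0.25)
c(KL.f.g1 = kl(f, g1), KL.g1.f = kl(g1, f))   # 0.3196145  0.5029201
c(KL.f.g2 = kl(f, g2), KL.g2.f = kl(g2, f))   # 0.01400421 0.01476399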
Given data y, the KL divergence between the prior and posterior (with respect to the posterior) is

\[
KL(p(\theta \mid y), \pi(\theta)) = \int p(\theta \mid y) \log\!\left(\frac{p(\theta \mid y)}{\pi(\theta)}\right) d\theta \tag{4.21}
\]

The key idea of an RP is that the KL divergence between the posterior and the prior should be as large as possible, thus implying that the data are dominating the prior. However, this measure is conditional on the data, y, which does not help for determining a prior. Thus, to remove the conditioning on the data, the data are integrated out, and πRP(θ) is the probability distribution (pmf or pdf) that maximizes the following:

\[
E_{y}\!\left[KL(p(\theta \mid y), \pi(\theta))\right] = \int KL(p(\theta \mid y), \pi(\theta))\, m(y)\, dy \tag{4.22}
\]

where m(y) is the prior predictive (marginal) distribution of y.
Often the scientist or subject-matter specialist will have a definite opinion as to what the range of parameter
values should be. For example, an experienced heart surgeon who has done 1000s of coronary by-pass
surgeries on a variety of patients will have a definite opinion on post-surgery survival probability.
How does one translate that prior knowledge into a prior probability distribution?
• First, think about whether elicitation is even feasible: are there too many parameters, or is the underlying sampling model fairly complex, e.g., a hierarchical model? For example, suppose the sampling model is a multiple regression with 4 covariates, thus five regression parameters, β0, β1, β2, β3, and β4, plus the variance parameter. How might the expert’s knowledge be translated into prior distributions for these parameters? The expert might have an opinion about the signs, positive or negative, of each coefficient, and perhaps the relative importance of each covariate; e.g., working with standardized covariates, the effects of x1 and x2 are thought to be positive but the effect of x1 may be twice as large as that of x2.
• Single parameter case. This is generally the most feasible situation. If the expert is not familiar with
probability distributions, the statistician may need to work with the expert to arrive at a prior and
can help by asking questions about the parameter without using statistics jargon.
For example, instead of asking for the median, ask “For what value of the parameter do you think that
it’s equally likely that values are either below or above it?”
“What do you think the range of values might be?”
“What do you think the relative variation around an average value might be? For example, if your best guess for θ is 15, is your uncertainty within ±10% of that value (±1.5), or ±20% (±3.0)?” Thus one potentially obtains a measure of the coefficient of variation, CV = σ/µ.
Given a mean value, and a range, standard deviation, or CV, and assuming a particular standard prob-
ability distribution might suffice, rough estimates of hyperparameters for the prior might be calculated.
This is the “moment matching” idea that we’ve examined previously.
For example, with the above example of the surgeon, the surgeon was thinking that θ would be 0.7 on average. Further questioning about the surgeon’s uncertainty led to a determination that a CV of 0.1 would be appropriate. Using a Beta distribution for the prior, the mean is α/(α + β) and the variance is αβ/[(α + β)²(α + β + 1)]; some algebra yields a Beta(29.3, 12.6) prior (see the R sketch following this list).
• Discrete histogram priors. Another simple way to elicit priors is to partition parameter values into non-overlapping bins and have the expert provide relative weights for each bin. For example, θ is grouped into three bins, [0,10), [10,25), [25,30], and the expert gives relative weights of 0.2, 0.5, and 0.3. A proper histogram (pdf) is constructed using the result that bin area = height × width, where bin area corresponds to probability or weight. For the bin [0,10), area = 0.2 and width = 10, thus height equals area/width = 0.2/10 = 0.02. For [10,25) the height is 0.5/15 = 0.033, and for [25,30] the height is 0.3/5 = 0.06.
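The Beta(29.3, 12.6) prior above comes from moment matching; the following R sketch (an illustration added here; the function name is arbitrary) solves for α and β from an elicited mean and CV.

# Moment matching for a Beta prior from an elicited mean and coefficient of variation
beta.from.mean.cv <- function(m, cv) {
  v <- (cv * m)^2                # variance implied by the CV
  s <- m * (1 - m) / v - 1       # alpha + beta
  c(alpha = m * s, beta = (1 - m) * s)
}
beta.from.mean.cv(0.7, 0.1)      # approximately alpha = 29.3, beta = 12.6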
References
• “The elicitation of prior distributions”, Chaloner, 1996, in Bayesian Biostatistics, eds., Berry and
Stangl.
• Uncertain Judgements: Eliciting Experts’ Probabilities, O’Hagan, et al. 2006. This book is available
online as a pdf via the University Library.
Using the term sensitivity analysis loosely here, we mean an examination of the effects of different priors on
the posterior. For example, the comparison of the posterior distribution for the probability of survival after
by-pass surgery for the surgeon’s prior and the medical student’s prior is a sensitivity analysis.
Another loosely put phrase: if the posterior distributions for different priors look much “the same”, e.g., have similar means and variances, then one might say that the results are robust to the priors.
With large enough samples, unless the prior is particularly concentrated over a narrow range of possible
values, sometimes called a pig-headed prior, the posterior will look much the same for a wide range of priors
as the data (the likelihood) are dominating the prior.
Problematic issues
• Practical issue: if there are 100s of parameters, then it is tedious, at the least, to carry out a sensitivity analysis for all the priors.
• Generalized linear models, where the data come from an exponential family distribution, say F, with parameter θ and covariate(s) x, and the “link” function g(θ, x) is modeled linearly:

\[
y \mid \theta, x \sim F(\theta, x), \qquad g(\theta, x) = \beta_0 + \beta_1 x
\]

Apparently uninformative priors in the link function may induce quite informative priors at a lower level. For example, consider a logistic regression for the number of patients surviving heart bypass surgery where the probabilities differ with age:

\[
y_i \mid \mathrm{Age}_i \sim \mathrm{Bernoulli}(\theta(\mathrm{Age}_i)),
\qquad \theta(\mathrm{Age}) = \frac{\exp(\beta_0 + \beta_1 \mathrm{Age})}{1 + \exp(\beta_0 + \beta_1 \mathrm{Age})},
\]
\[
\text{equivalently } \ln\!\left(\frac{\theta(\mathrm{Age})}{1 - \theta(\mathrm{Age})}\right) = \beta_0 + \beta_1 \mathrm{Age}
\]

A seemingly innocuous prior for both β0 and β1 is Normal(µ = 0, σ² = 5²). Suppose that the ages range from 40 to 70. Figure 4.1 shows the results of simulating from the priors for β0 and β1 on the induced priors for survival for four different ages. Note how the probabilities are massed near 0 and 1. Such simulation exercises can be quite valuable for detecting such effects; an R sketch of such a simulation follows this list.
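A minimal R sketch of this kind of prior simulation (an illustration added here; the exact code behind Figure 4.1 is not reproduced in these notes):

# Induced prior on survival probability theta(Age) given Normal(0, 5^2) priors
# for beta0 and beta1 in the logistic model logit(theta) = beta0 + beta1 * Age
set.seed(4)
n.sim <- 1e5
beta0 <- rnorm(n.sim, 0, 5)
beta1 <- rnorm(n.sim, 0, 5)
par(mfrow = c(2, 2))
for (age in c(40, 50, 60, 70)) {
  theta <- plogis(beta0 + beta1 * age)     # inverse logit
  hist(theta, breaks = 50, freq = FALSE,
       main = paste("Age =", age), xlab = "Survival")
}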
Figure 4.1: Induced prior for survival probability (θ) as a function of age given Normal(0,5) priors for logit
transformation, ln (θ/(1 − θ)) = β0 + β1 Age.
4.5 Supplement A: Change of Variable Theorem
The Problem
The Solution
• I_Y(condition) is an indicator function which takes on one of two values: 1 when the condition is met (is True) and 0 when it is not met (is False).
• This is an “intuitive” result: Y ∼ Uniform(0, 60).
Example 2. X ∼ Uniform(16, 64), thus the pdf of X is 1/(64 − 16) = (1/48) I_X(16 < X < 64). Define Y = g(X) = √X, noting that the support for Y is (4, 8). Then g^{-1}(Y) = Y², and

\[
f_Y(y) = \frac{1}{48}\left|\frac{d y^{2}}{dy}\right|
= \frac{1}{48}\,2y
= \frac{y}{24}\, I_Y(4 < y < 8)
\]
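A quick simulation check of Example 2 (an illustration added here):

# Check that Y = sqrt(X), with X ~ Uniform(16, 64), has density y/24 on (4, 8)
set.seed(5)
y <- sqrt(runif(1e6, 16, 64))
hist(y, breaks = 60, freq = FALSE, xlab = "y", main = "pdf of Y = sqrt(X)")
curve(x / 24, from = 4, to = 8, add = TRUE, lwd = 2, col = "red")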
Skeleton of Proof
The essence of the change of variables theorem is that, for x = g^{-1}(y), Pr(y ≤ Y ≤ y + dy) should equal Pr(g^{-1}(y) ≤ X ≤ g^{-1}(y) + dg^{-1}(y)) ≡ Pr(x ≤ X ≤ x + dx). This will happen if the following relationship between the areas under the two pdfs holds:

\[
f_Y(y)\,|dy| = f_X\!\left(g^{-1}(y)\right)\left|d g^{-1}(y)\right|
\]

Note that f_Y(y) dy is approximately Pr(y < Y < y + dy). Think of the area of a rectangle, Area = Height × Width, where Area is probability, Height is the pdf evaluated at y, and Width is dy. Dividing both sides by |dy| gives

\[
f_Y(y) = f_X\!\left(g^{-1}(y)\right)\left|\frac{d g^{-1}(y)}{dy}\right|
\]
Biological Allometry Example
Male fiddler crabs (Uca pugnax) possess an enlarged major claw for fighting or threatening other
males. In addition, males with larger claws attract more female mates.
The sex appeal (claw size) of a particular species of fiddler crab (Figure 4.2) is determined by the following allometric equation:

\[
M_c = 0.036\, M_b^{1.356}
\]
where Mc is the mass of the major claw and Mb is the body mass of the crab minus the mass of
the claw.
Suppose that Mb is on average 2000 mg with a CV of 0.20. Assuming a Gamma distribution for Mb , then
Mb ∼ Gamma(25, 0.0125).
Figure 4.2: Fiddler Crab. Image from Southeastern Regional Taxonomic Center (SERTC), South Carolina
Department of Natural Resources.
What is the pdf for Mc? To reduce notation momentarily, let X = Mb, Y = Mc, a = 0.036, b = 1.356, α = 25, and β = 0.0125. Then

\[
Y = g(X) = aX^{b} \qquad \text{and} \qquad X = g^{-1}(Y) = \left(\frac{Y}{a}\right)^{\frac{1}{b}}
\]

and

\[
\frac{d g^{-1}(y)}{dy} = \frac{1}{b}\, a^{-1/b}\, y^{\frac{1-b}{b}}
\]

The pdf for Y (Mc) is

\[
f_Y(y) = f_X\!\left(g^{-1}(y)\right)\left|\frac{d g^{-1}(y)}{dy}\right|
= \frac{\beta^{\alpha}}{\Gamma(\alpha)}\left[\left(\frac{y}{a}\right)^{\frac{1}{b}}\right]^{\alpha-1}
e^{-\beta\left(\frac{y}{a}\right)^{\frac{1}{b}}}
\times \frac{1}{b}\, a^{-1/b}\, y^{\frac{1-b}{b}}
\]
The accuracy of the derivation was examined by simulating body mass from a Gamma(25,0.0125) and then
transforming using the allometric equation (the R code is shown below). The empirical and theoretical pdfs
are plotted in Figure 4.3, and the two are quite similar.
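A minimal sketch of such a simulation (the original code is not reproduced in these notes; the constants follow the allometric equation and Gamma distribution given above):

# Simulate body mass, transform to claw mass, and compare the empirical
# density with the derived pdf for Y = a * X^b, X ~ Gamma(alpha, rate = beta)
set.seed(6)
a <- 0.036; b <- 1.356
alpha <- 25; beta <- 0.0125
Mb <- rgamma(1e5, shape = alpha, rate = beta)   # body mass (mg)
Mc <- a * Mb^b                                  # claw mass via the allometric equation
# Derived (theoretical) pdf for claw mass
fY <- function(y) dgamma((y / a)^(1 / b), shape = alpha, rate = beta) *
                  (1 / b) * a^(-1 / b) * y^((1 - b) / b)
hist(Mc, breaks = 80, freq = FALSE, xlab = "Claw mass", main = "Claw mass pdf")
curve(fY, from = min(Mc), to = max(Mc), add = TRUE, lwd = 2, col = "red")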
Figure 4.3: Theoretical and empirical pdf for crab claw mass.
4.6 Supplement B: Fisher Information
Note that under certain regularity conditions², Fisher Information can be calculated from the second derivative of the log likelihood:

\[
I(\theta \mid x) = -E\!\left[\frac{d^{2} \log f(x \mid \theta)}{d\theta^{2}}\right] \tag{4.26}
\]
Remarks.
1. Given n iid random variables x1, . . . , xn from the same distribution with parameter θ, the Fisher information for θ equals nI₁(θ|x), where I₁(θ|x) denotes the information for a single observation:

\[
I(\theta \mid x_1, \ldots, x_n)
= -E\!\left[\frac{d^{2} \log f(x_1, \ldots, x_n \mid \theta)}{d\theta^{2}}\right]
= -E\!\left[\frac{d^{2} \sum_{i=1}^{n} \log f(x_i \mid \theta)}{d\theta^{2}}\right]
= -\sum_{i=1}^{n} E\!\left[\frac{d^{2} \log f(x_i \mid \theta)}{d\theta^{2}}\right]
= n I_{1}(\theta \mid x) \tag{4.27}
\]
2. Inverse of I(θ) as lower bound on variance of θ̂. Under the previously mentioned regularity conditions,
the inverse of Fisher information is the lower bound on the variance of an unbiased estimator of a
parameter. In other words, given a probability distribution with parameter θ which satisfies certain
regularity conditions, if θ̂ is unbiased for θ, then
\[
V(\hat{\theta}) \ge I(\theta)^{-1}
\]
3. Maximum likelihood estimators. In the particular case of maximum likelihood estimates (mles), the inverse of I(θ|x) evaluated at the mle, θ̂, is often used as an estimate of the variance of θ̂:

\[
\widehat{Var}(\hat{\theta}) = I(\hat{\theta})^{-1}
\]
Example. If y ∼ Binomial(n, p), then the mle for p is p̂ = y/n. The variance of p̂ is

\[
Var[\hat{p}] = Var\!\left[\frac{y}{n}\right] = \frac{1}{n^{2}}\, Var[y] = \frac{1}{n^{2}}\, np(1-p) = \frac{p(1-p)}{n}
\]

It can be shown that the Fisher information for p is

\[
I(p) = \frac{n}{p(1-p)}
\]

Observe that I(p)^{-1} = p(1 − p)/n, which is the variance of p̂. (A small simulation check of this relationship appears after these remarks.)
4. Observed Fisher Information is the Fisher information without the integration, i.e., without taking the expectation of the (negative) second derivative of the log likelihood:

\[
J(\theta) = -\frac{d^{2} \log(f(x \mid \theta))}{d\theta^{2}} \tag{4.28}
\]

And instead of a single second derivative of log(f(x|θ)), for a vector of q parameters Θ there is a matrix of second derivatives, namely the Hessian:

\[
\nabla \nabla^{T} \log(f(x \mid \Theta)) =
\begin{pmatrix}
\frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{1}^{2}} & \frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{1} d\theta_{2}} & \cdots & \frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{1} d\theta_{q}} \\
\frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{2} d\theta_{1}} & \frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{2}^{2}} & \cdots & \frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{2} d\theta_{q}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{q} d\theta_{1}} & \frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{q} d\theta_{2}} & \cdots & \frac{d^{2} \log(f(x \mid \Theta))}{d\theta_{q}^{2}}
\end{pmatrix} \tag{4.30}
\]
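As a quick simulation check of the relationship between Var(p̂) and I(p)^{-1} in Remark 3 (an illustration added here; the values of n and p are arbitrary):

# Simulation check that Var(p.hat) is close to 1 / I(p) = p * (1 - p) / n
set.seed(7)
n <- 50
p <- 0.3
p.hat <- rbinom(1e5, size = n, prob = p) / n
c(simulated = var(p.hat), theoretical = p * (1 - p) / n)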